Journal of Big Data


Featured Collections on Computationally Intensive Problems in General Math and Engineering

This two-part special issue covers computationally intensive problems in engineering and focuses on mathematical mechanisms of interest for emerging problems such as Partial Difference Equations, Tensor Calculus, Mathematical Logic, and Algorithmic Enhancements based on Artificial Intelligence. Applications of the research highlighted in the collection include, but are not limited to: Earthquake Engineering, Spatial Data Analysis, Geo Computation, Geophysics, Genomics and Simulations for Nature Based Construction, and Aerospace Engineering. Featured lead articles are co-authored by three esteemed Nobel laureates: Jean-Marie Lehn, Konstantin Novoselov, and Dan Shechtman.

Open Special Issues

Advancements on Automated Data Platform Management, Orchestration, and Optimization Submission Deadline: 30 September 2024 

Emergent architectures and technologies for big data management and analysis Submission Deadline: 1 October 2024 

View our collection of open and closed special issues

Most recent articles

Optimization-based convolutional neural model for the classification of white blood cells

Authors: Tulasi Gayatri Devi and Nagamma Patil

Advanced RIME architecture for global optimization and feature selection

Authors: Ruba Abu Khurma, Malik Braik, Abdullah Alzaqebah, Krishna Gopal Dhal, Robertas Damaševičius and Bilal Abu-Salih

Feature reduction for hepatocellular carcinoma prediction using machine learning algorithms

Authors: Ghada Mostafa, Hamdi Mahmoud, Tarek Abd El-Hafeez and Mohamed E. ElAraby

Data oversampling and imbalanced datasets: an investigation of performance for machine learning and feature engineering

Authors: Muhammad Mujahid, EROL Kına, Furqan Rustam, Monica Gracia Villar, Eduardo Silva Alvarado, Isabel De La Torre Diez and Imran Ashraf

Advancing machine learning with OCR2SEQ: an innovative approach to multi-modal data augmentation

Authors: Michael Lowe, Joseph D. Prusa, Joffrey L. Leevy and Taghi M. Khoshgoftaar

Most accessed articles

A survey on Image Data Augmentation for Deep Learning

Authors: Connor Shorten and Taghi M. Khoshgoftaar

Big data in healthcare: management, analysis and future prospects

Authors: Sabyasachi Dash, Sushil Kumar Shakyawar, Mohit Sharma and Sandeep Kaushik

Review of deep learning: concepts, CNN architectures, challenges, applications, future directions

Authors: Laith Alzubaidi, Jinglan Zhang, Amjad J. Humaidi, Ayad Al-Dujaili, Ye Duan, Omran Al-Shamma, J. Santamaría, Mohammed A. Fadhel, Muthana Al-Amidie and Laith Farhan

Deep learning applications and challenges in big data analytics

Authors: Maryam M Najafabadi, Flavio Villanustre, Taghi M Khoshgoftaar, Naeem Seliya, Randall Wald and Edin Muharemagic

Short-term stock market price trend prediction using a comprehensive deep learning system

Authors: Jingyi Shen and M. Omair Shafiq




Annual Journal Metrics

2022 Citation Impact: 8.1 (2-year Impact Factor); 5.095 (SNIP, Source Normalized Impact per Paper); 2.714 (SJR, SCImago Journal Rank)

2023 Speed: 56 days from submission to first editorial decision (median); 205 days from submission to acceptance (median)

2023 Usage: 2,559,548 downloads; 280 Altmetric mentions

  • ISSN: 2196-1115 (electronic)

METHODS article

Scientific data management in the age of big data: an approach supporting a resilience index development effort.

Linda C. Harwell, D. N. Vivian, M. D. McLaughlin and S. F. Hafner

  • 1 National Health and Environmental Effects Research Laboratory, Gulf Ecology Division, Office of Research and Development, U.S. Environmental Protection Agency, Gulf Breeze, FL, United States
  • 2 Student Services Contractor, Oak Ridge Associated Universities, Oak Ridge, TN, United States
  • 3 Student Services Contractor, University of West Florida, Pensacola, FL, United States

The increased availability of publicly available data is, in many ways, changing our approach to conducting research. Not only are cloud-based information resources providing supplementary data to bolster traditional scientific activities (e.g., field studies, laboratory experiments), they also serve as the foundation for secondary data research projects such as indicator development. Indicators and indices are a convenient way to synthesize disparate information to address complex scientific questions that are difficult to measure directly (e.g., resilience, sustainability, well-being). In the current literature, there is no shortage of indicator or index examples derived from secondary data with a growing number that are scientifically focused. However, little information is provided describing the management approaches and best practices used to govern the data underpinnings supporting these efforts. From acquisition to storage and maintenance, secondary data research products rely on the availability of relevant, high-quality data, repeatable data handling methods and a multi-faceted data flow process to promote and sustain research transparency and integrity. The U.S. Environmental Protection Agency recently published a report describing the development of a climate resilience screening index which used over one million data points to calculate the final index. The pool of data was derived exclusively from secondary sources such as the U.S. Census Bureau, Bureau of Labor Statistics, Postal Service, Housing and Urban Development, Forestry Services and others. Available data were presented in various forms including portable document format (PDF), delimited ASCII and proprietary format (e.g., Microsoft Excel, ESRI ArcGIS). The strategy employed for managing these data in an indicator research and development effort represented a blend of business practices, information science, and the scientific method. This paper describes the approach, highlighting key points unique for managing the data assets of a small-scale research project in an era of “big data.”

Introduction

The current literature shows that there is growing support from the scientific community for using secondary or “found” data in both theoretical and applied research ( Niemeijer and de Groot, 2008 ; Hampton et al., 2013 ; Davis-Kean et al., 2015 ). The “big data” environment has proven to be fertile ground for nurturing innovation in indicator research and development. Easily accessible secondary data has given rise to new big data technologies that can potentially increase the production of robust and reproducible indicator products ( Madin et al., 2007 ; Mooney and Winstanley, 2007 ; Demchenko et al., 2013 ; Jha et al., 2015 ). The concept of big data has been described in many ways. However, no single statement serves as the de facto definition. De Mauro et al. (2015) proposes an ontologically derived definition based on an analysis of existing big data definitions. The authors suggest that “Big Data represents the Information assets characterized by such a High Volume, Velocity, and Variety to require specific Technology and Analytical Methods for its transformation into Value.” This description seems aptly relevant as it emphasizes the enormity of the public access landscape as well as the tools needed to work with big data effectively.

The “information highway” moves over 35 terabits of data per minute (roughly 1.1 billion double-sided print pages of information every 60 s). New and upgraded submarine fiber optic routes have increased data transfer capacity by 32% annually for the last 5 years to support the growing digital load ( Submarine Telecoms, 2017 , p. 17). In no small measure, the research community has contributed to the proliferation of big data. Many funding organizations now require that data generated through publicly-funded research be made openly available if legally and ethically possible. In the United States (U.S.), all federal agencies investing in research must support increased access to published research and resulting scientific data ( Holdren, 2013, February 22 ). This continuous inflow of freely accessible research products offers some broad reaching benefits not the least of which is simply increasing research visibility ( Piwowar et al., 2007 ). For indicator research and development, big data are playing an essential role in filling long-standing data gaps in quantifying complex, multi-dimensional concepts such as sustainability, resilience, and well-being measures ( Smith et al., 2013 ; Cutter et al., 2014 ; OECD, 2017 ; Buck et al., 2018 ; Summers et al., 2018 ; Wendling et al., 2018 ; Helliwell et al., 2019 ).

The wealth of accessible information can be both rewarding and challenging for science, especially in finding ways to manage it. Scientific data management (SDM) has historically been a challenge for research. A two-part commentary, “ How to Manage Data Badly Part 1 and 2” ( Hale, 1999 , 2000 ), highlighted existing issues surrounding the management of research data in the field of ecology. Although the publication described the lack of SDM in the context of a single science discipline, the message resonated universally as few people could disagree with the observations regarding the poor state of SDM practices 20 years ago. Since then, data and information sciences have taken center stage as organizations seek to build more robust and efficient ways to collect, process, manage and curate big data ( Gray et al., 2005 ; Sansone et al., 2018 ). New technologies and expert solutions are emerging to assist both private and public sectors in managing big data ( Pilat and Fukasaku, 2007 ; Cox and Pinfield, 2014 ; Simms et al., 2016 ; Borycz and Carroll, 2018 ).

“Big science” research (i.e., high throughput, long-term or high value) is often provided with enough resources to support the technology and expertise needed to implement well-designed SDM and curation frameworks ( Crowston and Qin, 2011 ; Berman and Cerf, 2013 ). On the other hand, “small science” projects (i.e., small team, short-term or exploratory research) often lack adequate SDM funding even though small-scale research can collectively generate more data than its “big science” counterparts ( Crowston and Qin, 2011 ). Individual researchers often bear the responsibility for managing the data assets in smaller-scale science, yet many do not have practical data management experience or access to relevant personnel to process, document, and, eventually, curate big data-driven research adequately ( Lynch, 2008 ; Borgman, 2012 ). As research funding ebbs and flows, smaller-scale efforts are increasingly turning to big data to support research. Without sufficient SDM support, big data collection and processing activities alone can quickly overwhelm a project, making it difficult to curate reproducible science ( Lowndes et al., 2017 ). With a growing universe of open research and the ease with which the data may be acquired, it seems imperative that research institutions invest in building the capacity for all research efforts to plan and execute robust SDM, regardless of the size or perceived value ( Everyone Needs a Data-Management Plan, 2018 ).

There is a growing demand for science-based indicators ( Nardo et al., 2005 ) and indicator research is well-suited for big data. By design, indicators and indices (summarized indicators) are intended for a public audience. With the advent of the open access initiatives, SDM planning guidelines and tools are abundant, yet many of these resources lack the details and a common set of standards to be meaningful ( Dietrich et al., 2012 ). Research data and the processes to manage them are iterative and “mature” over time as the research progresses ( Crowston and Qin, 2011 ; Digital Curation Center, http://www.dcc.ac.uk/ ). For large-scale or high-volume research efforts, highly automated and detailed SDM policies may be most appropriate, but for smaller research activities, a more straightforward infrastructure that can evolve as the data mature may be the most beneficial ( Link et al., 2017 ).

In 2017, the U.S. Environmental Protection Agency (EPA) published the conceptual framework and demonstration of the Climate Resilience Screening Index (CRSI) ( Summers, J. K. et al., 2017 ; Summers, K. et al., 2017 ; Summers et al., 2018 ). EPA researchers were tasked with developing and demonstrating a composite index that could characterize the resilience of the U.S. in the context of potential natural hazard exposures—in a 12-month time frame and using existing resources. The CRSI framework is hierarchical ( Figure 1 ). The overall index is informed by five domain sub-indices that are described by twenty indicators, which in turn comprise 117 metrics. To be most useful, CRSI needed to be applicable to different geographical, population, and temporal scales using the same cultivated data set. A diverse ecosystem of secondary data representing 120 unique data values was collected for 3135 U.S. counties over the 2000–2015 time period to quantify the metrics.


Figure 1 . The CRSI conceptual framework ( Summers, K. et al., 2017 ). Lines extending left and right of the domain-labeled boxes depict a theoretical range of socio-economic and ecological recoverability factors that may influence the overall CRSI measure. Black arrows relate to indicators; colored, diamond-ended lines are assigned to domains highlighted by the same color.

The development of composite indices to describe complex ideas is not new. The Better Life Index (BLI) ( OECD, 2017 ), Environmental Performance Index (EPI) ( Wendling et al., 2018 ), Human Development Index (HDI) ( United Nations Development Programme, 2018 ), and Ocean Health Index (OHI) ( Halpern et al., 2012 ) are a few notable examples. A composite index is a communication tool that uses a collection of individual metrics or indicators to translate data into information that describes a multi-dimensional concept ( Nardo et al., 2005 ). A common trait shared across the example indices and CRSI is the use and synthesis of economic, social, and ecological secondary data. BLI, EPI, HDI, and OHI offer reference materials, tools, and data in readily accessible formats (i.e., websites and web services) to help others reproduce the featured indices. All four indicator research efforts are exemplary cases of transparent and reproducible research in the end-stage, or mature, phase of the full SDM cycle. The CRSI research, on the other hand, is still “young” in the data maturation continuum, and many of the SDM systems are still evolving. Project researchers rather than data professionals are responsible for planning and implementing SDM. Most CRSI team members lack practical SDM experience. The researchers are generally familiar with the premise of SDM but not the common vernacular or the specific considerations associated with secondary resources. As at many research institutions, SDM planning and open access research are not new subjects at the U.S. EPA, although the details vary widely from one research project to another.

The perceived apathy toward indicator research SDM and curation appears to be a recurring theme. Early stages in big data SDM in particular are prone to be hectic and disorganized since processes have yet to stabilize ( Crowston and Qin, 2011 ). What is lacking in the current SDM literature is a portrait of SDM life before all the data decisions have been made, while SDM processes are still in flux. This paper describes the CRSI SDM approach, which offers an inside look at SDM from the “small-science” perspective. Highlighted are key strategies that have proven helpful for managing the big data assets of CRSI and for addressing potential challenges that can impede successful research outcomes.

The CRSI SDM Concept

SDM in the CRSI effort is an inclusive process where all researchers are expected to participate in data collection, assessment, processing, and storage. The SDM infrastructure is adapted from past practices described in Hale et al. (2003), which emphasize a culture of “data sharing.” Additional cues from Zook et al. (2017) helped inform CRSI SDM requirements for capturing the copyright information ( Carroll, 2015 ), data provenance ( Carlson and Anderson, 2007 ), and data ethics ( Floridi and Taddeo, 2016 ; Vayena and Tasioulas, 2016 ) that are especially important to address when data are made publicly accessible. U.S. EPA SDM guidelines recommend that a suite of 10 topics be addressed for thorough data asset management planning ( Table 1 ). Since principal investigators lead and provide oversight in research projects, it seems natural that improving SDM outcomes begins with education and hands-on experience for researchers. The CRSI SDM is a relatively simple framework that embraces “better data management through partnerships” concepts ( Hale et al., 2003 ), adapted for a small, co-located team. At its core, the CRSI SDM environment is as much a training platform as it is an assemblage of data management practices. The objective of this “learn as you go” SDM ethos is to adequately execute research asset management while increasing the SDM knowledge and capabilities of the research personnel. Governance of data collection, processing, and curation is integrated into the science conversation, so the language of research curation becomes as natural to the researchers as the science. The SDM of the CRSI effort represents a collaborative process in which all researchers have ownership.


Table 1 . Elements addressed in the scientific data management (SDM) plan for the Climate Resilience Screening Index research.

Data Collection

Every member of the team participated in the literature, secondary data, and metadata collection. A literature review was conducted to describe the state of resilience indicator science, to provide the rationale for the development of the index, and to identify existing resilience indicator efforts that could inform the research. Publications related to any resilience indicator or index concepts, including hazard exposures, natural disasters, infrastructure, quality of life, and governance, were considered as potential sources of contextual data for CRSI. Based on the completed literature review, each researcher searched the internet for publicly available data to identify and collect candidate secondary data relevant for quantifying CRSI indicators. Supplementary information, such as licensing documents, disclaimers, data catalogs, and users' guides, was also collected along with the secondary data.

Data Acceptance

Data collection is, of course, at the core of indicator development. Exploring big data can result in many secondary data resources, some representing alternative choices for the same data. Procedural guidelines were developed to help minimize bias and improve selection relevancy during the literature and secondary data collection process. To the extent possible, these criteria served as the first-level evaluation for determining the potential suitability of secondary data for use in CRSI calculations. If a set of data appeared relevant but did not meet every criterion, then a team consensus informed the final determination on acceptability. The following ( Table 2 ) briefly describes each criterion.


Table 2 . Data acceptance criteria used to identify and select secondary data.
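In practice, this first-level screen behaves like a checklist applied to every candidate data set before it reaches the team for a consensus call. The short Python sketch below illustrates the idea only; the criterion names are hypothetical stand-ins, since the Table 2 criteria are not reproduced here.

# Illustrative first-level acceptance screen for candidate secondary data sets.
# The criterion names are hypothetical; a failed check triggers team review, not
# automatic rejection.
CANDIDATES = [
    {"name": "County unemployment rates",
     "authoritative_provider": True, "covers_2000_2015": True,
     "county_level_or_finer": True, "license_allows_reuse": True,
     "documentation_available": True},
    {"name": "Volunteer-collected storm reports",
     "authoritative_provider": False, "covers_2000_2015": False,
     "county_level_or_finer": True, "license_allows_reuse": True,
     "documentation_available": False},
]
CRITERIA = ["authoritative_provider", "covers_2000_2015", "county_level_or_finer",
            "license_allows_reuse", "documentation_available"]

for candidate in CANDIDATES:
    failed = [c for c in CRITERIA if not candidate[c]]
    if failed:
        print(f"REVIEW  {candidate['name']} (failed: {', '.join(failed)})")
    else:
        print(f"ACCEPT  {candidate['name']}")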

Assessing CRSI Data Quality and Suitability

There is a persistent assumption that data retrieved from a credible source are automatically suitable for a research effort ( Boyd and Crawford, 2012 ). Cai and Zhu (2015) provide thoughtful insight regarding the challenges of examining the quality and suitability of big data. While reviewing data for quality can be straightforward, the suitability of the data for the research is more subjective and requires a way to conceptualize the data in the context of intended use. Random subsets of data were manually reviewed for quality and errors, but a 100% assessment is nearly impossible with extensive sets of data. Descriptive statistics were most helpful for assessing the quality and suitability of the secondary data for CRSI. A full complement of summaries was generated for each component of the CRSI framework, including the metrics. Histograms and other visualizations assisted researchers with examining the data for anomalies and use-case weaknesses.
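As a rough illustration of this screening pass, the Python sketch below produces the kind of descriptive summaries and histograms described above. The file and column layout is assumed; the project's actual summaries were produced with a mix of tools (see the next section).

# Hedged sketch of the descriptive-statistics pass used to screen secondary data.
# "processed_metrics.csv" and its columns are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

metrics = pd.read_csv("processed_metrics.csv")

# Full complement of summaries (count, mean, std, quartiles) for every metric column.
print(metrics.describe())

# Histograms help reviewers spot anomalies such as impossible values or truncation.
metrics.hist(bins=30, figsize=(10, 8))
plt.tight_layout()
plt.savefig("metric_histograms.png")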

Tools for Literature and Data Acquisition/Processing

Publish or Perish software ( Harzing, 2007 ) was used to assist with identifying literature for review. Clearly defined keywords and phrases were used to search well-established literature repositories (e.g., Scopus, Web of Science, JSTOR). Responsibilities for conducting the literature review were distributed across the research team. Each publication was evaluated for relevance to the CRSI research. Electronic publication files were downloaded and maintained in a literature repository. Manual literature searches were conducted to help fill any literature gaps resulting from the software-driven prioritization.

For many, collected literature simply contributes to the reference list in publications. In SDM, however, the decision to include or exclude a published work from the research is itself data. To that end, researchers provided a summary associated with each review, using a template as an outline. The outline captured information that could be used to drive queries to produce literature-related statistics or reporting. Citations, along with review summaries, were eventually uploaded to a Microsoft (MS) Access (2016) database.

There is a movement that is rapidly spreading within the research community—the use of open-source tools for processing big data (e.g., R-Project, https://www.r-project.org/ ; Python, https://www.python.org/ ; Apache Spark, https://spark.apache.org/ ). Unfortunately, the skill sets available for processing CRSI data ranged from practically non-existent to programming in multiple languages. Each researcher used their tool of choice for processing data. While this decision lacked robust technical standardization, it offered a timely solution for completing data collection and processing by helping to distribute the data processing load. Allowing each researcher to work with the tool most familiar to them also helped reduce data processing errors. SAS, R-Project, SPSS, MS Excel, ESRI ArcGIS, and Python were the dominant software packages used for processing the data. A suite of secondary data was assigned to specific individuals based on their level of data handling experience. Each researcher was responsible for formatting, standardizing and harmonizing their selection of secondary data as well as documenting the processing methods.
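A typical formatting and standardizing step might resemble the Python sketch below. The file names, column names, and the min-max scaling choice are illustrative assumptions rather than the documented CRSI procedure; the point is only that each assigned data set was reshaped onto a common county key and a comparable scale before downstream use.

# Hedged sketch of standardizing one secondary data set onto a county FIPS key.
import pandas as pd

raw = pd.read_csv("provider_download.csv")          # raw file as delivered (hypothetical)

# Harmonize the join key: 5-digit county FIPS stored as zero-padded text.
raw["fips"] = raw["county_fips"].astype(str).str.zfill(5)

# Standardize the measure to a 0-1 range so metrics are comparable across sources.
value = raw["measure_value"].astype(float)
raw["measure_scaled"] = (value - value.min()) / (value.max() - value.min())

raw[["fips", "year", "measure_scaled"]].to_csv("provider_measure_standardized.csv", index=False)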

Organization and Storage of CRSI Data Resources

Research data and other materials were physically stored on a centralized network server housed within the U.S. EPA. Hierarchically-nested subdirectories or folders contained all information consisting of raw data, processed data, final research results, and supplementary information. The physical storage structures that comprised the framework mirrored the different components of the CRSI research. This arrangement offered a convenient way to compartmentalize the various stages of the research data assets. Additionally, associating file structure features with components of the research made it easier for researchers to locate specific pieces of information. Figure 2 shows the CRSI data storage layout.


Figure 2 . Illustration of the CRSI file structure layout. Each block represents a separate subdirectory or folder. All elements organized under the “Data” block form the primary data construct.

CRSI Data Construct

Central to the file storage structure was the CRSI data construct. The data construct is a remnant of past practices that has worked well across different research efforts. Data assets were partitioned relative to their processed status. The directory naming conventions were consistent with past and concurrent research activities, helping to maintain data organization consistency. The data construct also made it convenient to manage access permissions and enforce data policies, e.g., use constraints, sensitive data access, and original data preservation. Apart from raw geospatial data (Section Geospatial Data), the CRSI data construct was used for the handling of raw, processed, and production (research results) data. As depicted in Figure 2 , the D1 directory warehoused the raw secondary data in the form provided by the source along with pertinent documentation (e.g., metadata, data dictionaries, users' guides). Once all secondary data were collected and vetted, the original downloaded files were held sequestered while a copy operated as the functional data platform for the remaining phases of data processing. The D2 directory housed processed data (e.g., standardized) that were accessed repeatedly for CRSI data quality assessments and analyses. Data quality assessment results and software code files related to data processing or qualifying were maintained in the D2 directory as well. The D3 structure held the CRSI results in comma-delimited (*.csv) format. Files produced in software-specific form (e.g., *.sas7bdat, *.xlsx) were maintained as an additional layer of data recoverability. Information housed in the D3-level structure consisted of demonstration results, model inputs, and map products.
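A minimal scaffold of this construct can be generated with a few lines of Python. The D1/D2/D3 partitioning follows the description above and Figure 2; the subfolder names are assumptions added for illustration.

# Illustrative scaffold of the D1/D2/D3 data construct; subfolder names are assumed.
from pathlib import Path

ROOT = Path("CRSI/Data")
LAYOUT = {
    "D1_raw": ["downloads", "documentation"],            # sequestered originals plus metadata
    "D2_processed": ["standardized", "qa_results", "code"],
    "D3_results": ["csv", "software_specific"],          # *.csv plus *.sas7bdat / *.xlsx copies
}

for level, subfolders in LAYOUT.items():
    for sub in subfolders:
        path = ROOT / level / sub
        path.mkdir(parents=True, exist_ok=True)
        print("created", path)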

Geospatial Data

Geospatial processing was used to derive natural environment and natural hazard values based on the Multi-Resolution Land Characteristics (MRLC) Consortium's National Land Cover Data Set ( Homer et al., 2015 ), both with and without additional secondary data overlays. Secondary data collected for geoprocessing were archived in their original form. Base maps and data downloads were migrated to a file geodatabase construct for geospatial processing where secondary data were rendered as feature classes. A file-based geodatabase was used for managing and querying the collection of CRSI-related spatial data. A file geodatabase organizes data physically in a directory or folder structure rather than in a single personal database file such as those used with MS Access. Individual data files are accessed directly using geospatial software such as ESRI ArcGIS (Version 10.5), the application used for CRSI. For this effort, the use of a file geodatabase served multiple purposes:

• Eliminated the constraints of individual file sizes that are associated with other GIS conventions (shapefiles).

• Allowed for the use of a standardized coordinate system to ensure all imported data would be uniformly projected, without further intervention.

• Kept related data together and organized during processing.

Values generated from geospatial processing were treated as “found data” and folded into the D1 portion of the data construct. Any further standardization or normalization treatment of these data followed the same protocols as all other sources of secondary data.
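The geospatial work itself was performed in ESRI ArcGIS (Version 10.5) against the file geodatabase. Purely as an open-source analogue of the same idea, the sketch below uses geopandas and rasterstats to tabulate NLCD land-cover pixels per county and derive a simple county-level value; the paths, the "NAME" column, and the developed-land fraction are hypothetical.

# Open-source analogue (not the project's ArcGIS workflow) of deriving a
# county-level land-cover value from the NLCD raster.
import geopandas as gpd
from rasterstats import zonal_stats

counties = gpd.read_file("county_boundaries.shp")

# Count land-cover pixels per county; each dict maps NLCD class code -> pixel count.
cover_counts = zonal_stats("county_boundaries.shp", "nlcd_land_cover.tif", categorical=True)

# Example "found data" value: fraction of developed land (NLCD classes 21-24).
developed_classes = {21, 22, 23, 24}
for name, counts in zip(counties["NAME"], cover_counts):
    total = sum(counts.values())
    developed = sum(v for k, v in counts.items() if k in developed_classes)
    print(name, round(developed / total, 3) if total else None)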

CRSI Data Security and Data Operations Continuity

Existing enterprise-wide information security protocols served as the primary access and data security defense for CRSI. However, these measures could not safeguard data from inadvertent deletions, modifications, or misplacements caused by well-intentioned “insiders” (team researchers)—particularly in the early stages of the research when processes are chaotic, and data are most vulnerable. More specific data security steps were taken to safeguard the CRSI research assets internally. A menu-driven access portal was developed in the MS Access database to serve as a conduit between the research team and CRSI data. Querying capability that mapped demonstration results (D3-level data) to relevant D2 and D1 data and supplementary information was developed. A series of reference tables linked data records stored in the database to data resources only available outside the database (e.g., raw secondary data), including information about data origin and evolution (data provenance). Pre-defined queries driven by interactive menus maintained within the database provided a way for the research team to navigate CRSI research assets while minimizing potential data mishaps. In addition, a bibliographic index of literature was created to act as an electronic card catalog for the literature repository. Indexed references for both accepted and rejected publications could be queried to return the summary information created during the literature review. Additionally, secondary-data sources were linked to relevant publications so researchers could cross-reference materials from either a data point or an article.

The inclusive SDM environment inherently served as a continuity-of-operations mechanism. Other practices fostered knowledge exchange, including SDM discussions during team briefings and planning sessions as well as SDM-specific peer-to-peer training. The SDM plan, its implementation, routine research communication, and team interactions collectively created a sustainable knowledge management paradigm.

Example Outcomes from Highlighted SDM Processes

This section offers some “results” associated with the CRSI data environment. Example CRSI data characteristics and quality assessments are presented. Additionally, the database design is briefly described.

Characteristics of the Reviewed Literature and Secondary-Data

Literature summaries showed that 369 publications met at least one keyword or key phrase criterion. Approximately 20% of the literature reviewed had a direct bearing on the development of the CRSI framework. Another 4% of CRSI references indirectly informed the conceptualization of CRSI while 76% lacked vital factors of interest or were duplicative.

Over 1.3 million secondary data values retrieved from thirty-seven unique data providers ( Table 3 ) served as the basis for constructing CRSI. These data comprised annual collections of available information from 2000 to 2015 for 3135 U.S. counties. A complement of 383,713 averaged secondary-data measures, derived by averaging the values for each data set across all available years, supported the final CRSI calculations. These data represented a range of science disciplines (e.g., meteorology, geology, economics, geography, social science, ecology). Information documenting the intent, scope, quality, and refresh frequency was captured for each secondary data set, as well as attribution and copyright requirements.


Table 3 . List of secondary data sources used in the CRSI indicator development research.
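The reduction from annual values to one averaged measure per county and data set can be pictured with the short pandas sketch below; the long-format input layout (county FIPS, metric name, year, value) is an assumption, not the project's actual file format.

# Hedged sketch of collapsing annual (2000-2015) values to per-county averages.
import pandas as pd

annual = pd.read_csv("annual_measures.csv")            # columns: fips, metric, year, value

averaged = (
    annual[annual["year"].between(2000, 2015)]
    .groupby(["fips", "metric"], as_index=False)["value"]
    .mean()
)
averaged.to_csv("averaged_measures.csv", index=False)
print(len(averaged), "averaged county-by-metric measures")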

Geospatially-derived secondary data were not available for eight boroughs in Alaska, nor could these data be imputed with any reasonable level of confidence. Natural environment metrics (e.g., land types, soil productivity, coastal condition, natural hazards) were translated from ecologically relevant spatial scales (e.g., 12-digit hydrologic unit codes, ecoregions) to county-level boundaries. Metrics associated with natural hazard and toxic exposures were population-normalized and then modeled for the pertinent value if needed. Nearly one hundred percent (99.7%) of counties were represented in the CRSI metric inventory.

CRSI Results: Index, Domains, Indicators, and Metrics

The CRSI demonstration results were produced at four hierarchically-related aggregation levels ( Figures 3A,B )—metrics, indicators, domains, and indices—which collectively represent 448,305 individual results ( Figure 4 ). Metrics were derived directly from processed secondary data and were the most abundant. The summary of county-level metrics quantified indicators, indicators were summarized to domains, and domains informed the equation for the final CRSI values.


Figure 3. (A) The CRSI-domain-indicator data tree that served to inform the organization of different aggregates of indicator values. (B) A continuation of the CRSI data tree depicting the relationship between one indicator and associated metrics (processed data).


Figure 4 . A diagram illustrating the relationship of CRSI components and their contribution to the overall quantity of demonstration results.
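The roll-up from metrics to the final index can be illustrated with the sketch below, which simply averages at each level of the hierarchy. The actual CRSI aggregation equations (weights and transformations) are documented in Summers, K. et al. (2017); the column layout here is assumed.

# Illustrative metric -> indicator -> domain -> index roll-up using plain means.
import pandas as pd

# Hypothetical long-format table: fips, domain, indicator, metric, value.
metrics = pd.read_csv("averaged_metrics_labeled.csv")

indicators = metrics.groupby(["fips", "domain", "indicator"], as_index=False)["value"].mean()
domains = indicators.groupby(["fips", "domain"], as_index=False)["value"].mean()
crsi = domains.groupby("fips", as_index=False)["value"].mean().rename(columns={"value": "crsi"})

crsi.to_csv("crsi_by_county.csv", index=False)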

Data Quality Assessments

Statistical summaries, cumulative distribution functions (CDFs), and histograms were created for the final CRSI values and each metric, indicator, and domain component to aid in the data quality assessments. If the descriptive statistics or data visualizations presented an unexpected value or data pattern, each step of the data handling process was reviewed to determine whether a data processing error had occurred. Corrective actions were taken on detected errors; if no error was detected, the value remained. A series of CDFs is offered to demonstrate the value of this data quality assessment exercise. Figure 5A shows the distribution pattern for one set of metric-level data found with a “suspected” error and the distribution of these same metrics after the error was corrected. Figures 5B–D show the relative influence of this single metric across the full spectrum of derived CRSI components, both before and after error correction.


Figure 5 . Cumulative distribution function (CDF) analyses were performed for each suite of metrics, indicators, domains and CRSI values. Graphs were used to identify possible processing errors and to understand how errors influence the different aggregates of results: (A) the stair-step pattern of the “Before Correction” CDF suggests that a problem existed in the suite of Community Rating System metrics while the “After Correction” CDF shows the more expected distribution pattern; (B) demonstrates the level of influence a single metric can exert on an indicator; (C) illustrates the difficulty in identifying the metric error at the domain-level of CRSI calculations; and (D) shows that the metric-level error is virtually undetectable in the final index (CRSI) values.
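An empirical CDF comparison of a metric before and after an error correction, similar in spirit to Figure 5A, can be produced with a few lines of Python; the two input files and the metric are hypothetical.

# Sketch of the before/after empirical CDF comparison used in the quality checks.
import numpy as np
import matplotlib.pyplot as plt

def ecdf(values):
    """Return sorted values and their cumulative proportions."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

before = np.loadtxt("metric_before_correction.txt")   # hypothetical input files
after = np.loadtxt("metric_after_correction.txt")

for label, data in (("Before Correction", before), ("After Correction", after)):
    x, y = ecdf(data)
    plt.step(x, y, where="post", label=label)          # stair-step patterns flag problems

plt.xlabel("Metric value")
plt.ylabel("Cumulative proportion")
plt.legend()
plt.savefig("metric_cdf_comparison.png")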

Histograms of CRSI values initially presented a right-skewed distribution pattern ( Figure 6A ). Several boroughs in the state of Alaska were the primary driver. After results and processing steps were verified, each record was qualified in the D3-level CRSI data set. When extreme outliers were removed, CRSI results appeared better distributed, aligning more with expectations ( Figure 6B ). Qualified results were kept in the final set of CRSI results.


Figure 6 . (A) The histogram shows a severely right-skewed distribution of calculated CRSI scores. A review of the results found that the pattern was due to CRSI values for 12 of 22 boroughs in Alaska falling far outside the 3rd quartile range, rather than to any specific data processing error. (B) After the 12 CRSI results were qualified as outliers and removed, the histogram reflects the expected distribution pattern. Publications using the final CRSI measures report results both with and without these qualified outliers.
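One way to qualify such extreme values programmatically is an upper fence based on the interquartile range, as sketched below. The 1.5 * IQR fence and file names are assumptions; the paper only states that the Alaska values fell far outside the 3rd quartile range, and qualified records were flagged rather than deleted.

# Hedged sketch of flagging (not deleting) extreme CRSI values as qualified outliers.
import pandas as pd

crsi = pd.read_csv("crsi_by_county.csv")               # hypothetical D3-level results file

q1, q3 = crsi["crsi"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)                     # assumed outlier rule, not the documented one

crsi["qualified_outlier"] = crsi["crsi"] > upper_fence
crsi.to_csv("crsi_by_county_qualified.csv", index=False)
print(int(crsi["qualified_outlier"].sum()), "records qualified as outliers")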

CRSI Data Warehouse

The CRSI database was constructed in MS Access (2016) and designed to serve as a data warehouse. Leveraging features and functions available in MS Access, menus, forms, and reports were created to assist researchers in navigating the CRSI data warehouse. A switchboard (i.e., a menu system) operated as the primary user interface. Forms provided interactive filtering capabilities to customize the information displayed from the various data tables held within the warehouse. Pre-defined queries joined relevant information from across the CRSI data management framework, and pre-defined report formats presented the query results. Filtering functions were also available in reports to refine the information selected for print. The general flow of data and information to and from the CRSI data warehouse is presented in Figure 7 .


Figure 7 . A visual representation depicting the general flow of CRSI data using the CRSI data warehouse as the avenue for the CRSI research team to access results, literature, and secondary data information.

The size limitation associated with MS Access databases (2 GB; Microsoft support https://support.office.com ) proved problematic for housing secondary data but accommodated all of the results (D3). A set of relational tables was created to link CRSI metrics with the original data download files, relevant literature, and supplementary material. Results could be displayed graphically and downloaded so team members could reuse the data without compromising the resources that support the research. Figure 8 provides a detailed illustration of the CRSI data warehouse framework.


Figure 8 . The CRSI data warehouse framework depicting the flow of data and information; access controls; outputs generated; and research asset monitoring and management loop.
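The warehouse itself was implemented in MS Access. The SQLite sketch below (one of the database engines mentioned later as a candidate for evolving the CRSI SDM) only illustrates how relational tables and a pre-defined query can trace a result back to its source data and supporting literature; table and column names are assumptions.

# Illustrative relational linking of results to sources and literature (not the
# actual MS Access schema).
import sqlite3

con = sqlite3.connect("crsi_warehouse_demo.sqlite")
con.executescript("""
CREATE TABLE IF NOT EXISTS source (
    source_id INTEGER PRIMARY KEY, provider TEXT, download_path TEXT, license_notes TEXT
);
CREATE TABLE IF NOT EXISTS literature (
    lit_id INTEGER PRIMARY KEY, citation TEXT, review_summary TEXT, accepted INTEGER
);
CREATE TABLE IF NOT EXISTS metric_result (
    fips TEXT, metric TEXT, value REAL,
    source_id INTEGER REFERENCES source(source_id),
    lit_id INTEGER REFERENCES literature(lit_id)
);
""")

# A "pre-defined query": trace a result back to its data provider and supporting paper.
rows = con.execute("""
    SELECT m.fips, m.metric, m.value, s.provider, l.citation
    FROM metric_result m
    LEFT JOIN source s ON s.source_id = m.source_id
    LEFT JOIN literature l ON l.lit_id = m.lit_id
    WHERE m.metric = ?
""", ("community_rating_system",)).fetchall()
print(rows)
con.close()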

Big data have ushered in the promise of new research possibilities. In indicator research and development, big data has most assuredly found a home. This wealth of publicly accessible information has helped advance indicator research. Big data helps small research efforts like CRSI flourish and prove relevant on the global stage. However, broader discussions regarding best research data management and sharing practices are needed ( Borgman, 2012 ). The apparent lack of consistent SDM standards and the impact this has on research reproducibility is driving the development of new technologies for managing enterprise-wide research assets. Methods and technology continue to evolve potentially offering more scalable data management solutions for research efforts of all sizes ( Davidson et al., 2014 ; Zook et al., 2017 ; Peng et al., 2018 ). Given the SDM inequities between “big science” and “small-science,” even these newer approaches may remain beyond the grasp of small-scale research ( Borycz and Carroll, 2018 ).

The SDM strategies described in this paper may be self-evident, but an abundance of literature seems to suggest that Hale's (1999 , 2000 ) observations regarding the poor state of SDM persist even after two decades of data technology and knowledge advancements. The scientific community runs the risk of losing access to valuable research assets over time if SDM continues to lag in smaller-scale research ( Crowston and Qin, 2011 ). The CRSI SDM illustration suggests that “small-science” does not necessarily equate to “small data.” On the contrary, big data assures us that vast amounts of data are available with just a mouse-click, even if the SDM infrastructure to manage them does not exist.

The CRSI SDM approach demonstrates one potential model for managing big data needs in a small-scale research setting. The CRSI SDM framework is easy to understand and offers ample opportunity to increase a research team's SDM capacity when data expertise is limited or unavailable. Big data management can be messy. Lowndes et al. (2017) describe how the OHI data processing methods for calculating that index were transitioned from a plodding, inefficient process to a cost-effective and highly functional workflow that better supported research reproducibility and accessibility. Open-source tools such as freely available software packages (e.g., R-Project, Python), collaboration and workflow platforms (e.g., GitHub, Project Jupyter), and database engines (e.g., SQLite, MongoDB) are a few tool-kits that may be considered for evolving the CRSI SDM. Each enhancement would represent progress in the SDM life-cycle and a step toward best SDM practices. The CRSI SDM approach could serve as a starting point for small-scale indicator research projects to successfully leverage big data resources.

The current release of CRSI and domain sub-index measures is available for 3135 counties in Portable Document Format (PDF) as Appendix B in Summers, K. et al. (2017) . An updated suite of CRSI results is currently being reviewed. The next release of CRSI data will be made available as a downloadable file through the Data.gov portal ( https://www.data.gov/ ) when the review is complete.

Author Contributions

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

The views expressed in this manuscript are those of the authors and do not necessarily represent the views or policies of the U.S. Environmental Protection Agency. Any mention of trade names, products, or services does not imply an endorsement by the U.S. Government or the U.S. Environmental Protection Agency. The EPA does not endorse any commercial products, services, or enterprises.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Berman, F., and Cerf, V. (2013). Who will pay for public access to research data? Science 341, 616–617. doi: 10.1126/science.1241625


Borgman, C. L. (2012). The conundrum of sharing research data. J. Am. Soc. Inform. Sci. Technol. 63, 1059–1078. doi: 10.1002/asi.22634


Borycz, J., and Carroll, B. (2018). Managing digital research objects in an expanding science ecosystem: 2017 conference summary. Data Sci. J. 17:16. doi: 10.5334/dsj-2018-016

Boyd, D., and Crawford, K. (2012). Critical questions for big data: provocations for a cultural, technological, and scholarly phenomenon. Inform. Commun. Soc . 15, 662–679. doi: 10.1080/1369118X.2012.678878

Buck, K. D., Summers, J. K., Smith, L. M., and Harwell, L. C. (2018). Application of the human well-being index to sensitive population divisions: a children's well-being index development. Child Indicators Res. 11, 1249–1280. doi: 10.1007/s12187-017-9469-4

Cai, L., and Zhu, Y. (2015). The challenges of data quality and data quality assessment in the big data era. Data Sci. J. 14:2. doi: 10.5334/dsj-2015-002

Carlson, S., and Anderson, B. (2007). What are data? The many kinds of data and their implications for data re-use. J. Comp. Mediated Commun. 12, 635–651. doi: 10.1111/j.1083-6101.2007.00342.x

Carroll, M. W. (2015). Sharing research data and intellectual property law: a primer. PLoS Biol. 13:e1002235. doi: 10.1371/journal.pbio.1002235

Cox, A. M., and Pinfield, S. (2014). Research data management and libraries: current activities and future priorities. J. Librarianship Inform. Sci. 46, 299–316. doi: 10.1177/0961000613492542

Crowston, K., and Qin, J. (2011). A capability maturity model for scientific data management: evidence from the literature. Proc. Am. Soc. Inform. Sci. Technol. 48, 1–9. doi: 10.1002/meet.2011.14504801036

Cutter, S. L., Ash, K. D., and Emrich, C. T. (2014). The geographies of community disaster resilience. Global Environ. Change 29, 65–77. doi: 10.1016/j.gloenvcha.2014.08.005

Davidson, J., Jones, S., Molloy, L., and Kejser, U. B. (2014). Emerging good practice in managing research data and research information within UK Universities. Proc. Comp. Sci. 33, 215–222. doi: 10.1016/j.procs.2014.06.035

Davis-Kean, P. E., Jager, J., and Maslowsky, J. (2015). Answering developmental questions using secondary data. Child Dev. Perspect. 9, 256–261. doi: 10.1111/cdep.12151

De Mauro, A., Greco, M., and Grimaldi, M. (2015). “What is big data? A consensual definition and a review of key research topics,” in AIP Conference Proceedings Vol. 1644 (Madrid), 97–104.


Demchenko, Y., Grosso, P., De Laat, C., and Membrey, P. (2013). “Addressing big data issues in scientific data infrastructure,” in Collaboration Technologies and Systems (CTS), 2013 International Conference on . IEEE, 48–55.

Dietrich, D., Adamus, T., Miner, A., and Steinhart, G. (2012). De-mystifying the data management requirements of research funders. Issues Sci. Technol. Librarianship 70. doi: 10.5062/F44M92G2

Everyone Needs a Data-Management Plan (2018). Nature 555:286. [Editorial]. Available online at: https://www.nature.com/articles/d41586-018-03065-z (accessed July 10, 2018).

Floridi, L., and Taddeo, M. (2016). What is data ethics? Phil. Trans. R. Soc. A 374:20160360. doi: 10.1098/rsta.2016.0360

Gray, J., Liu, D. T., Nieto-Santisteban, M., Szalay, A., DeWitt, D. J., and Heber, G. (2005). Scientific data management in the coming decade. Acm Sigmod Record 34, 34–41. doi: 10.1145/1107499.1107503

Hale, S. S. (1999). How to manage data badly (part 1). Bull. Ecol. Soc. Am. 80, 265–268.

Hale, S. S. (2000). How to manage data badly (part 2). Bull. Ecol. Soc. Am. 81, 101–103. doi: 10.1890/0012-9623(2000)086[0101:C]2.0.CO;2


Hale, S. S., Miglarese, A. H., Bradley, M. P., Belton, T. J., Cooper, L. D., Frame, M. T., et al. (2003). “Managing troubled data: coastal data partnerships smooth data integration,” in Coastal Monitoring through Partnerships (Dordrecht: Springer), 133–148.

Halpern, B. S., Longo, C., Hardy, D., McLeod, K. L., Samhouri, J. F., Katona, S. K., et al. (2012). An index to assess the health and benefits of the global ocean. Nature 488:615. doi: 10.1038/nature11397

Hampton, S. E., Strasser, C. A., Tewksbury, J. J., Gram, W. K., Budden, A. E., Batcheller, A. L., et al. (2013). Big data and the future of ecology. Front. Ecol. Environ. 11, 156–162. doi: 10.1890/120103

Harzing, A. W. (2007). Publish or Perish . Available online at: http://www.harzing.com/pop.htm (accessed August 22, 2018)

Helliwell, J., Layard, R., and Sachs, J. (2019). World Happiness Report 2019. New York, NY: Sustainable Development Solutions Network. Available online at: http://worldhappiness.report/ed/2019/

Holdren, J. P. (2013, February 22). Increasing Access to the Results of Federally Funded Scientific Research . Washington, DC: Executive Office of the President, Office of Science and Technology Policy. Available online at: https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf .

Homer, C. G., Dewitz, J. A., Yang, L., Jin, S., Danielson, P., Xian, G., et al. (2015). Completion of the 2011 National Land Cover Database for the conterminous United States-Representing a decade of land cover change information. Photogr. Eng. Remote Sensing 81, 345–354.

Jenkins, C. N., Van Houtan, K. S., Pimm, S. L., and Sexton, J. O. (2015). U.S. protected lands mismatch biodiversity priorities. Proc Natl Acad Sci USA . 112, 5081–5086. doi: 10.1073/pnas.1418034112

Jha, M., Jha, S., and O'Brien, L. (2015). “Integrating big data solutions into enterprize architecture: constructing the entire information landscape,” in The International Conference on Big Data, Internet of Things, and Zero-Size Intelligence BIZ2015 (Kuala Lumpur), 8–10.

Link, G. J., Lumbard, K., Conboy, K., Feldman, M., Feller, J., George, J., et al. (2017). Contemporary issues of open data in information systems research: considerations and recommendations. Commun. Assoc. Inform. Syst. 41:25. doi: 10.17705/1CAIS.04125

Lowndes, J. S. S., Best, B. D., Scarborough, C., Afflerbach, J. C., Frazier, M. R., O'Hara, C. C., et al. (2017). Our path to better science in less time using open data science tools. Nat Ecol Evol. 1:0160. doi: 10.1038/s41559-017-0160

Lynch, C. (2008). Big data: how do your data grow? Nature 455:28. doi: 10.1038/455028a

Madin, J., Bowers, S., Schildhauer, M., Krivov, S., Pennington, D., and Villa, F. (2007). An ontology for describing and synthesizing ecological observation data. Ecol. Inform. 2, 279–296. doi: 10.1016/j.ecoinf.2007.05.004

Mooney, P., and Winstanley, A. C. (2007). “Improving environmental research data management,” in EnviroInfo 2007. Paper presented at the 21st International Conference for Environmental Protection Part 1, Warsaw, Poland, 12-14 September , eds O. Hryniewicz, J. Studzinski, and M. Romaniuk (Aachen: Shaker Verlag), 473–477.

Nardo, M., Saisana, M., Saltelli, A., Tarantola, S., Hoffman, A., and Giovannini, E. (2005). Handbook on Constructing Composite Indicators: Methodology and User Guide , OECD Statistics Working Papers, OECD Publishing, Paris.

Niemeijer, D., and de Groot, R. S. (2008). A conceptual framework for selecting environmental indicator sets. Ecol. Indicators 8, 14–25. doi: 10.1016/j.ecolind.2006.11.012

OECD (2017). How's Life? 2017: Measuring Well-being. Paris: OECD Publishing.

Peng, G., Privette, J. L., Tilmes, C., Bristol, S., Maycock, T., Bates, J. J., et al. (2018). A conceptual enterprise framework for managing scientific data stewardship. Data Sci. J. 17:15. doi: 10.5334/dsj-2018-015

Pilat, D., and Fukasaku, Y. (2007). OECD principles and guidelines for access to research data from public funding. Data Sci. J. 6, OD4–OD11. doi: 10.2481/dsj.6.OD4

Piwowar, H. A., Day, R. S., and Fridsma, D. B. (2007). Sharing detailed research data is associated with increased citation rate. PLoS ONE 2:e308. doi: 10.1371/journal.pone.0000308

Sansone, S.-A., Cruse, P., and Thorley, M. (2018). High-quality science requires high-quality open data infrastructure. Sci. Data 5:180027. doi: 10.1038/sdata.2017.27

Simms, S., Strong, M., Jones, S., and Ribeiro, M. (2016). The future of data management planning: tools, policies, and players. Int. J. Digital Curation 11, 208–217. doi: 10.2218/ijdc.v11i1.413

Smith, L. M., Case, J. L., Smith, H. M., Harwell, L. C., and Summers, J. K. (2013). Relating ecosystem services to domains of human well-being: foundation for a US index. Ecol. Indicators 28, 79–90. doi: 10.1016/j.ecolind.2012.02.032

Submarine Telecoms (2017). Industry Report, 6th Edition. Issuu. Available online at: https://issuu.com/subtelforum/docs/stfindustryreportissue6final (accessed October 15, 2017).

Summers, J. K., Harwell, L. C., Smith, L. M., and Buck, K. D. (2018). Measuring community resilience to natural hazards: the natural hazard resilience screening index (NaHRSI)—development and application to the United States. GeoHealth 2, 372–394. doi: 10.1029/2018GH000160

Summers, J. K., Smith, L. M., Harwell, L. C., and Buck, K. D. (2017). Conceptualizing holistic community resilience to climate events: foundation for a climate resilience screening index. GeoHealth , 1, 151–164. doi: 10.1002/2016GH000047

Summers, K., Harwell, L., Buck, K., Smith, L., Vivian, D., Bousquin, J., et al. (2017). Development of a Climate Resilience Screening Index (CRSI): An Assessment of Resilience to Acute Meteorological Events and Selected Natural Hazards. Washington, DC: U.S. Environmental Protection Agency.

United Nations Development Programme (2018). Human development indices and indicators: 2018 Statistical update . Available online at: http://hdr.undp.org/en/content/human-development-indices-indicators-2018-statistical-update

Vayena, E., and Tasioulas, J. (2016). The dynamics of big data and human rights: the case of scientific research. Phil. Trans. R. Soc. A , 374:20160129. doi: 10.1098/rsta.2016.0129

Wendling, Z. A., Emerson, J. W., Esty, D. C., Levy, M. A., de Sherbinin, A., et al. (2018). 2018 Environmental Performance Index. New Haven, CT: Yale Center for Environmental Law & Policy. Available online at: https://epi.yale.edu/

Zook, M., Barocas, S., Crawford, K., Keller, E., Gangadharan, S. P., Goodman, A., et al. (2017). Ten simple rules for responsible big data research. PLoS Comput. Biol. 13:e1005399. doi: 10.1371/journal.pcbi.1005399

Keywords: resilience, indicators, data management, framework, curation

Citation: Harwell LC, Vivian DN, McLaughlin MD and Hafner SF (2019) Scientific Data Management in the Age of Big Data: An Approach Supporting a Resilience Index Development Effort. Front. Environ. Sci. 7:72. doi: 10.3389/fenvs.2019.00072

Received: 08 November 2018; Accepted: 14 May 2019; Published: 04 June 2019.


Copyright © 2019 Harwell, Vivian, McLaughlin and Hafner. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Linda C. Harwell, harwell.linda@epa.gov

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.


Notes to Scientific Research and Big Data

1. When a data collection can or should be regarded as “big data”, and the significance of this particular label for research, is discussed at length in Leonelli (2016), Kitchin and McArdle (2016) and Aronova, van Oertzen, and Sepkoski (2017).

2. This understanding of scientific knowledge is also embedded within publishing practices. As exemplified by the use of impact factors, scientific excellence is evaluated on the strength of authorship of articles, thus placing the production of scientific claims at the pinnacle of knowledge creation. Researchers whose activities focus away from writing theoretical statements—such as data curators or software developers—are often viewed as technicians with a lower status. The emergence of big data is challenging these habits and perceptions, for instance through the rise of Open Science practices, but it is no wonder that within this landscape, philosophers have focused their attention on models and theories as central outputs of research, leaving data behind.

Copyright © 2020 by Sabina Leonelli <s.leonelli@exeter.ac.uk>



The dynamics of big data and human rights: the case of scientific research

Affiliations.

  • 1 Health Ethics and Policy Lab, Epidemiology, Biostatistics and Prevention Institute, University of Zurich, 8001 Zurich, Switzerland [email protected].
  • 2 Yeoh Tiong Lay Centre for Politics, Philosophy, and Law, The Dickson Poon School of Law, King's College London, London WC2R 2LS, UK.
  • PMID: 28336802
  • PMCID: PMC5124070
  • DOI: 10.1098/rsta.2016.0129

In this paper, we address the complex relationship between big data and human rights. Because this is a vast terrain, we restrict our focus in two main ways. First, we concentrate on big data applications in scientific research, mostly health-related research. And, second, we concentrate on two human rights: the familiar right to privacy and the less well-known right to science. Our contention is that human rights interact in potentially complex ways with big data, not only constraining it, but also enabling it in various ways; and that such rights are dynamic in character, rather than fixed once and for all, changing in their implications over time in line with changes in the context we inhabit, and also as they interact among themselves in jointly responding to the opportunities and risks thrown up by a changing world. Understanding this dynamic interaction of human rights is crucial for formulating an ethic tailored to the realities (the new capabilities and risks) of the rapidly evolving digital environment. This article is part of the themed issue 'The ethical impact of data science'.

Keywords: big data; data ethics; human right to privacy; human right to science.

© 2016 The Author(s).



Mines Researchers Receive NSF Funding to Harness Big Data of Geologic Processes

Everyone knows that lava is made of melted rock. It makes sense that different types of rocks are produced from different types of lava; for example, the lava in Hawaii is different from the lava common in a Cascade volcano like Mount St. Helens. The rocks and minerals that lava, or magma, becomes also depend on the temperature and pressure conditions during cooling.

Geochemists are interested in understanding how minerals are formed as molten rock cools. This knowledge helps them better understand Earth's geologic processes, from the creation of critical minerals and elements like lithium to the way plate tectonics can build mountains and cause earthquakes.

Experimental geochemists around the world melt rocks and minerals inside special laboratory furnaces that recreate the environment deep inside the earth. They then look at the minerals that form when various mixtures of material are cooled at different temperatures and pressures.

The large amount of data from such experiments is a challenge to compile for analysis. A team led by Gokce K. Ustunisik, Ph.D., associate professor of geology and geological engineering at South Dakota Mines, helped build a system to compare results from thousands of experiments conducted around the world. The new study, supported by a five-year National Science Foundation (NSF) grant totaling nearly $470,000, will help align the data from various sources to build a big-picture understanding.

This award is one of four of Ustunisik's NSF-funded research projects, which together total nearly $750,000. The work also ties into the university's new Ph.D. program in data science and engineering.

A challenge when compiling data from various experiments is that data produced by different labs is not always collected and presented in the professional literature in the same way. 

“The way data was collected in multiple experiments can differ greatly. It's possible to develop bias in predictive models if you don't consider the boundary conditions of experimental data,” says Ustunisik, the principal investigator on this research.

Roger Nielsen, Ph.D., a co-principal investigator on this research, a research scientist at Mines and an emeritus professor at the College of Earth, Ocean, and Atmospheric Sciences at Oregon State University, uses this analogy to describe the work.

“If your experiment is driving on Interstate-90 east across South Dakota, the data you collect along the first part of the journey tells you you're going straight and flat and there are no big corners. If you do a second experiment on I-94 across North Dakota, the data shows you the same thing, straight, flat, no corners,” says Nielsen.  “A model you might produce based on this data from these two experiments would predict a straight and flat road, and this model would work great, until you hit the Missouri River, and you end up over a cliff in the water. In the two different experiments, you'd go into the water in two different places. The two models going the same direction in two different places would line up for a time but then have different results,” says Nielsen.

Nielsen says the work needed on these experimental geologic datasets includes improved understanding of limitations, gaps, and anomalies such as the Missouri River in the analogy above. With this new understanding, researchers can then examine when different experiments have data that correlates and diverges. They can use this improved understanding to build better models.

“Whatever we do with experiments, our goal is always to simulate what was observed in nature, what we are doing with this is to try and understand the boundary conditions in different experiments so we can warn the modelers about this bias,” Ustunisik adds.

With this new broader understanding, geologists hope to build a more unifying theory or model for the inner workings of the entire Earth. Nielsen says another analogy is to consider the system that makes up the Earth like the system that makes an automobile.

“Twenty-five years ago, when I began this work, we were trying to determine what all the parts in the car were, the tires, the fenders, the nuts and bolts and bearings that hold the pistons inside the motor. Today we are trying to better understand how these parts fit together and make the car run.”

A model of the entire Earth's system would be valuable in helping geoscientists predict natural disasters, like earthquakes and volcanic eruptions. This new data analysis research at Mines brings geologists one step closer to this goal.


Reproducibility and Scientific Integrity of Big Data Research in Urban Public Health and Digital Epidemiology: A Call to Action

Ana Cecilia Quiroga Gutierrez

1 Department of Health Sciences and Medicine, University of Lucerne, 6002 Luzern, Switzerland

Daniel J. Lindegger

2 Institute of Global Health, University of Geneva, 1211 Geneva, Switzerland

Ala Taji Heravi

3 CLEAR Methods Center, Department of Clinical Research, Division of Clinical Epidemiology, University Hospital Basel and University of Basel, 4031 Basel, Switzerland

Thomas Stojanov

4 Department of Orthopaedic Surgery and Traumatology, University Hospital of Basel, 4031 Basel, Switzerland

Martin Sykora

5 School of Business and Economics, Centre for Information Management, Loughborough University, Loughborough LE11 3TU, UK

Suzanne Elayan

Stephen J. Mooney

6 Department of Epidemiology, University of Washington, Seattle, WA 98195, USA

John A. Naslund

7 Department of Global Health and Social Medicine, Harvard Medical School, Boston, MA 02115, USA

Marta Fadda

8 Institute of Public Health, Università Della Svizzera Italiana, 6900 Lugano, Switzerland

Oliver Gruebner

9 Epidemiology, Biostatistics and Prevention Institute, University of Zurich, 8001 Zurich, Switzerland

10 Department of Geography, University of Zurich, 8057 Zurich, Switzerland

Associated Data

Not applicable.

The emergence of big data science presents a unique opportunity to improve public-health research practices. Because working with big data is inherently complex, big data research must be clear and transparent to avoid reproducibility issues and positively impact population health. Timely implementation of solution-focused approaches is critical as new data sources and methods take root in public-health research, including urban public health and digital epidemiology. This commentary highlights methodological and analytic approaches that can reduce research waste and improve the reproducibility and replicability of big data research in public health. The recommendations described in this commentary, including a focus on practices, publication norms, and education, are neither exhaustive nor unique to big data, but, nonetheless, implementing them can broadly improve public-health research. Clearly defined and openly shared guidelines will not only improve the quality of current research practices but also initiate change at multiple levels: the individual level, the institutional level, and the international level.

1. Introduction

Research comprises “creative and systematic work undertaken in order to increase the stock of knowledge” [ 1 , 2 ]. Research waste, or research whose results offer no social benefit [ 3 ], was characterized in a landmark series of papers in the Lancet in 2014 [ 4 , 5 ]. The underlying drivers of research waste range from methodological weaknesses in specific studies to systemic shortcomings within the broader research ecosystem, notably including a reward system that incentivises quantity over quality and the exploration of new hypotheses over the confirmation of old ones [ 4 , 5 , 6 , 7 , 8 ].

Published research that cannot be reproduced is wasteful due to doubts about its quality and reliability. Lack of reproducibility is a concern in all scientific research, and it is especially significant in the field of public health, where research aims to improve treatment practices and policies that have widespread implications. In this commentary, we highlight the urgency of improving norms for reproducibility and scientific integrity in urban public health and digital epidemiology and discuss potential approaches. We first discuss some examples of big data sources and their uses in urban public health, digital epidemiology, and other fields, and consider the limitations with the use of big data. We then provide an overview of relevant solutions to address the key challenges to reproducibility and scientific integrity. Finally, we consider some of their expected outcomes, challenges, and implications.

Unreliable research findings also represent a serious challenge in public-health research. While the peer-review process is designed to ensure the quality and integrity of scientific publications, the implementation of peer review varies between journals and disciplines and does not guarantee that the data used are properly collected or employed. As a result, reproducibility remains a challenge. This is also true in the context of the emerging field of big data science. This is largely driven by the characteristics of big data, such as their volume, variety, and velocity, as well as the novelty and excitement surrounding new data science methods, lack of established reporting standards, and a nascent field that continues to change rapidly in parallel to the development of new technological and analytic innovations. Recent reports have uncovered that most research is not reproducible, with findings casting doubt on the scientific integrity of much of the current research landscape [ 6 , 9 , 10 , 11 , 12 ]. At the bottom of this reproducibility crisis lies growing pressure to publish not only novel, but more importantly, statistically significant results at an accelerated pace [ 13 , 14 ], increasing the use of low standards of evidence and disregarding pragmatic metrics, such as clinical or practical significance [ 15 ]. Consequently, the credibility of scientific findings is decreasing, potentially leading to cynicism or reputational damage to the research community [ 16 , 17 ]. Addressing the reproducibility crisis is not only one step towards restoring the public’s trust in scientific research, but also a necessary foundation for future research, as well as guiding evidence-based public-health initiatives and policies [ 18 ], facilitating translation and implementation of research findings [ 19 , 20 ], and accelerating scientific discovery [ 21 ].

While failure to fully document the scientific steps taken in a research project is a fundamental challenge across all research, big data research is additionally burdened by the technical and computational complexities of handling and analysing large datasets. The challenge of ensuring computational capacity, including memory and processing power, to handle the data, as well as statistical and subject matter expertise accounting for data heterogeneity, can lead to reproducibility issues at a more pragmatic level. For example, large datasets derived from social media platforms require data analysis infrastructure, software, and technical skills, which are not always accessible to every research team [ 22 , 23 ]. Likewise, studies involving big data create new methodological challenges for researchers as the complexity for analysis and reporting increases [ 24 ]. This complexity not only requires sophisticated statistical skills but also new guidelines that define how data should be processed, shared, and communicated to guarantee reproducibility and maintain scientific integrity, while protecting private and sensitive information. Some of these challenges lie beyond the abilities and limitations of individual researchers and even institutions, requiring cultural and systematic changes to improve not only the reproducibility but also transparency and quality of big data research in public health.

Importantly, through concerted efforts and collaboration across disciplines, there are opportunities to systematically identify and address this reproducibility crisis and to specifically apply these approaches to big data research in public health. Below, we discuss methodological and analytical approaches to address the previously discussed issues, reduce waste, and improve the reproducibility and replicability of big data research in public health.

Specifically, we focus on approaches to improve reproducibility, which is distinct from replicability. While both are important with regard to research ethics, replicability is about “obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data”, whereas reproducibility refers to “obtaining consistent results using the same input data, computational steps, methods and code, and conditions of analysis” [ 25 ]. Though we mention “reproducibility” throughout this commentary, some of the arguments presented may apply to replicability as well. This is particularly true when it comes to transparency when reporting sampling, data collection, aggregation, inference methods, and study context; these affect both replication and reproduction [ 26 ].

2. Big Data Sources and Uses in Urban Public Health and Digital Epidemiology

Big data, as well as relevant methods and analytical approaches, have gained increasing popularity in recent years. This is reflected in the growing number of publications and research studies that have implemented big data methods across a variety of fields and sectors, such as manufacturing [ 27 ], supply-chain management [ 28 ], sports [ 29 ], education [ 30 ], and public health [ 31 ].

Public health, including urban health and epidemiological research, is a field where studies increasingly rely on big data methods, such as in the relatively new field of digital epidemiology [ 32 ]. The use of big data in public-health research is often characterized by the ‘3Vs’: variety in types of data as well as purposes; volume, or amount of data; and velocity, referring to the speed at which the data are generated [ 33 ]. Because very large datasets can make almost any difference statistically significant, while systematic biases are unaffected by data scale, big data studies are at particular risk of producing inaccurate results [ 34 , 35 , 36 , 37 ].
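This point can be made concrete with a small simulation. The sketch below uses entirely synthetic data and an invented systematic bias of 0.02 units: with a modest sample the bias is indistinguishable from noise, while with a very large sample the same artefactual bias produces a vanishingly small p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Suppose a sensor systematically over-reads by 0.02 units (a purely
# artefactual bias) while the true population mean is 0.
bias = 0.02
for n in (1_000, 10_000_000):
    sample = rng.normal(loc=bias, scale=1.0, size=n)
    t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
    print(f"n = {n:>10,}   p-value = {p_value:.3g}")

# At n = 1,000 the bias is usually indistinguishable from noise; at
# n = 10,000,000 the very same bias is "highly significant", even though
# it reflects a measurement artefact rather than a real effect.
```

Statistical significance at scale therefore says little about whether the underlying measurement process is trustworthy.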

Big data sources that are used or could be potentially used in fields, such as urban public health and digital epidemiology, can be divided into two main categories. First, those that are collected or generated with health as a main focus and, second, those that are generated out of this scope but that can be associated with or impact public health ( Figure 1 ) [ 32 ].

An external file that holds a picture, illustration, etc.
Object name is ijerph-20-01473-g001.jpg

Data sources used in urban public health and digital epidemiology research can broadly be organized along a continuum of health orientation of the process that generated them.

Data sources generated within the context of public health include large datasets captured within health systems or government health services at the population level, such as the case of Electronic Health Records (EHRs), Electronic Medical Records (EMRs), or personal health records (PHRs) [ 38 ]. Other examples include pharmacy and insurance records, omics data, as well as data collected by sensors and devices that are part of the internet of things (IoT) and are used for health purposes, ranging from smart continuous glucose monitors (CGMs) [ 39 ] to activity and sleep trackers.

In contrast, big data sources generated outside the public-health scope are virtually unlimited and ever-growing, covering almost all domains of society. As a result, we will focus on a selected, non-exhaustive set of examples to illustrate the diverse sources of big data that are used or could potentially be used in urban public health and digital epidemiology. Notably, social media have become an important source of big data used for research in different fields, including digital epidemiology. Twitter data have proven to be useful for collecting public-health information, for example, to measure mental health in different patient subgroups [ 40 ]. Examples of big data collected on Twitter that can be used in the context of public-health research are the Harvard CGA Geotweet Archive [ 41 ] or the University of Zurich Social Media Mental Health Surveillance project with their Geotweet Repository for the wider European Region [ 42 ]. Other initiatives, such as the SoBigData Research Infrastructure (RI), aim to foster reproducible and ethical research through the creation of a ‘Social Mining & Big Data Ecosystem’, allowing for the comparison, re-use, and integration of big data, methods, and services into research [ 22 ].

Cities increasingly use technological solutions, including IoT and multiple sensors, to monitor the urban environment, transitioning into Smart Cities with the objective of improving citizens’ quality of life [ 43 , 44 ]. Data stemming from Smart City applications have been used, for example, to predict air quality [ 45 ], analyse transportation to improve road safety [ 46 ], and have the potential to inform urban planning and policy design to build healthier and more sustainable cities [ 47 ].

Data mining techniques also allow for large datasets to be used in the context of urban public health and digital epidemiology. For example, a project using administrative data and data mining techniques in El Salvador identified anomalous spatiotemporal patterns of sexual violence and informed ways in which such analysis can be conducted in real time to allow for local law enforcement agencies and policy makers to respond appropriately [ 48 , 49 ]. Other large-dataset sources, such as transaction data [ 50 ], have been used to investigate the effect of sugar taxes [ 51 ] or labelling [ 52 ] on the consumption of healthy or unhealthy beverages and food products, which can eventually help model their potential impact on health outcomes.

3. Approaches to Improving Reproducibility and Scientific Integrity

Big data science has brought on new challenges, to which the scientific community needs to adapt by applying adequate ethical, methodological, and technological frameworks to cope with the increasing amount of data produced [ 53 ]. As a result, the timely adoption of approaches to address reproducibility and scientific integrity issues is imperative to ensure quality research and outcomes. A timely adoption is relevant not only for the scientific community but also for the general public that can potentially benefit from knowledge and advancements resulting from the use of big data research. This is particularly important in the context of urban public health and digital epidemiology, as the use of big data in these fields can help answer highly relevant and pressing descriptive (what is happening), predictive (what could happen), and prescriptive (what should be done) research questions [ 54 ]. A brief summary of the main points discussed in this section can be found in Figure 2. We divide our proposed solutions in this commentary into three main domains: (1) good research practice, (2) scientific communication and publication, and (3) education.

An external file that holds a picture, illustration, etc.
Object name is ijerph-20-01473-g002.jpg

Approaches that address good research practice, scientific communication, and education are important to improve reproducibility and scientific integrity.

3.1. Good Research Practice

Practices, such as pre-registration of protocols, predefining research questions and hypotheses, publicly sharing data analysis plans, and communicating through reporting guidelines, can improve the quality and reliability of research and results [ 55 , 56 ]. For experimental studies, clear and complete reporting and documentation are essential to allow for reproduction. Observational studies can also be registered on well-established registries, such as on clinicaltrials.gov. Importantly, pre-registration does not preclude publishing exploratory results; rather, it encourages such endeavours to be explicitly described as exploratory, with defined hypotheses and expected outcomes, which is appropriate [ 35 , 37 ].

Lack of data access is another key challenge to reproducibility. Adoption of open-science practices, including sharing of data and code, represents a partial solution to this issue [ 57 , 58 ], acknowledging that not all data can be shared openly owing to privacy concerns. Similarly, transparent descriptions of data collection and analytic methods are necessary for reproduction [ 59 ]. For example, in the analysis of human mobility, which has applications in a wide range of fields, including public health and digital epidemiology [ 60 , 61 ], the inference of ‘meaningful’ locations [ 62 ] from mobility data has been approached with a multitude of methods, some of which lack sufficient documentation. Whereas a research project using an undocumented method to identify subject homes cannot be reproduced, a project using Chen and Poorthuis's [ 63 ] R package ‘homelocator’, which is open source and freely available, could be.
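To illustrate what a fully documented, reproducible inference rule can look like, here is a minimal Python sketch. It is not the homelocator algorithm; the column names, the night-time window, and the "most frequent night-time cell" rule are assumptions made purely for illustration.

```python
import pandas as pd

def infer_home(visits: pd.DataFrame, night_start: int = 22, night_end: int = 6) -> pd.Series:
    """Assign each user the grid cell they visit most often at night.

    Assumes `visits` has columns user_id, cell_id and timestamp; the night
    window and the counting rule are stated explicitly so the analysis can
    be re-run exactly.
    """
    hours = pd.to_datetime(visits["timestamp"]).dt.hour
    night = visits[(hours >= night_start) | (hours < night_end)]
    # Most frequent night-time cell per user.
    return night.groupby("user_id")["cell_id"].agg(lambda cells: cells.value_counts().idxmax())

# Synthetic example records, for illustration only.
visits = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "cell_id": ["A", "A", "B", "C", "C"],
    "timestamp": ["2023-01-01 23:10", "2023-01-02 01:30", "2023-01-02 14:00",
                  "2023-01-03 02:00", "2023-01-03 23:45"],
})
print(infer_home(visits))
```

The point is not the specific rule but that every choice (columns used, night window, aggregation) is written down, so another team can obtain the same home assignments from the same input data.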

Likewise, a case could be made to collaboratively share big data within research networks and IT infrastructures. An example of a project tackling this issue in the context of public health is currently being developed by the Swiss Learning Health System (SLHS), focusing on the design and implementation of a metadata repository with the goal of developing Integrated Health Information Systems (HISs) in the Swiss context [ 64 , 65 ]. The implementation of such repositories and data-management systems allows for retrieval of and access to information; nevertheless, as information systems develop, new challenges arise, particularly when it comes to infrastructure as well as legal and ethical issues, such as data privacy. Solutions are currently in development; it is likely that decentralised data architectures based on blockchain will play an important role in integrated care and health information models [ 66 ]. We briefly expand on this topic in the Anticipated Challenges section below.

The adoption of appropriate big data handling techniques and analytical methods is also important to ensure the findability, accessibility, interoperability, and reusability (FAIR) [ 67 ] of both data and research outcomes [ 68 ]. Such characteristics allow for different stakeholders to use and reuse data and research outcomes for further research, replication, or even implementation purposes.
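As a sketch of what machine-actionable dataset metadata supporting these characteristics might look like, the record below uses field names loosely inspired by common dataset-description vocabularies (for example, schema.org's Dataset type); every identifier and value is invented for illustration.

```python
import json

# Illustrative metadata record for a shared dataset; all values are invented.
dataset_metadata = {
    "name": "Aggregated geotagged social-media mental-health indicators (example)",
    "identifier": "https://doi.org/10.xxxx/example-dataset",        # persistent ID -> findable
    "license": "CC-BY-4.0",                                          # clear terms -> reusable
    "distribution": {
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/data/indicators.csv",     # resolvable -> accessible
    },
    "variableMeasured": ["week", "region_code", "sentiment_score"],  # documented schema -> interoperable
    "isBasedOn": "Public posts, aggregated to region-week level; processing steps documented alongside the data",
}

with open("dataset_metadata.json", "w") as fh:
    json.dump(dataset_metadata, fh, indent=2)
```

Publishing such a record alongside the data lets both humans and machines discover, interpret, and reuse the dataset without contacting the original authors.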

Complete and standardised reporting of aspects discussed in this section, for instance, in Reproducibility Network Groups, allows for meta-research and meta-analyses, the detection and minimization of publication bias, and the evaluation of the adherence of researchers to guidelines focused on ensuring scientific integrity. The use of checklists by individual researchers, research groups, departments, or even institutions can motivate the implementation of good research practices as well as clear and transparent reporting, ultimately improving research integrity [ 69 ]. Such checklists can serve as training tools for younger researchers, as well as offer practice guidelines to ensure quality research.

Senior researchers and research institutions are vital when it comes to tackling these challenges as well. The adoption of principles for research conduct, such as the Hong Kong principles, can help minimise the use of questionable research practices [ 70 ]. These principles are to: (1) assess responsible research practices; (2) value complete reporting; (3) reward the practice of open science; (4) acknowledge a broad range of research activities; and (5) recognise essential other tasks, such as peer review and mentoring [ 71 ]. The promotion of these principles by mentors and institutions is a cornerstone of good research practices for younger researchers.

3.2. Scientific Communication

Scientific communication, not only between researchers but also between institutions, should be promoted. Recently, requirements for researchers to make data public or open source have grown popular among journals and major funding agencies in the US, Europe, and globally; this is an important catalyst for open science and addressing issues such as reproducibility [ 72 ].

Likewise, publication and sharing of protocols, data, code, analysis, and tools are important. This not only facilitates reproducibility but also promotes openness and transparency [ 73 ]. For example, the Journal of Memory and Language adopted a mandatory data-sharing policy in 2019. An evaluation of this policy found that data sharing increased by more than 50% and that the strongest predictor of reproducibility was the sharing of analysis code, which increased the probability of reproducibility by 40% [ 57 ]. Such practices are also fostered by the creation and use of infrastructure, such as the aforementioned SoBigData, and by reproducibility network groups, such as the Swiss Reproducibility Network, a peer-led group that aims to improve both replicability and reproducibility [ 74 ], strengthen communication and collaboration, and encourage the use of rigorous research practices.

When publishing or communicating their work, researchers should also keep in mind that transparency about whether studies are exploratory (hypothesis forming) or confirmatory (hypothesis testing) is important for distinguishing the generation of new hypotheses from the testing of existing ones [ 75 ]; this is particularly important for informing future research. Journal reviewers and referees should also motivate researchers to report this accurately.

Similarly, when publishing results, the quality, impact, and relevance of a publication should be valued more than scores, such as the impact factor, to avoid “publishing for numbers” [ 76 ]. This would, of course, require a shift in the priorities and views shared within the research community and may be a challenging change to effect.

Academic editors can also play an important role by avoiding practices, such as ‘cherry-picking’ publications, either because of statistical significance of results or notoriety of the authors. Instead, practical significance, topic relevance, and replication studies should be important factors to consider, as well as valuing the reporting of negative results. It is important to acknowledge, though, that scientific publication structures face an important number of challenges that hinder the implementation of these practices. Some of these points are mentioned in the Challenges section that follows.

3.3. Education

Academic institutions have the responsibility to educate researchers in an integral way, covering not only the correct implementation of methodological approaches and appropriate reporting but also how to conduct research in an ethical way.

First, competence and capacity building should be addressed explicitly through courses, workshops, and competence-building programs aimed at developing technical skills, good research practices, and adequate application of methods and analytical tools. Other activities such as journal clubs can allow researchers to exchange and become familiar with different methodologies, stay up to date with current knowledge and ongoing research, and develop critical thinking skills [ 77 , 78 ], while fostering a mindset for continuous growth and improvement.

Second, by incorporating practice-based education, particularly with research groups that already adhere to best practices, such as the Hong Kong principles, institutions can foster norms valuing reproducibility implicitly as an aspect of researcher education.

4. Expected Outcomes

Ideally, successful implementation of the approaches proposed in Figure 2, together with methodological and analytical tools such as the standardised protocols suggested by Simera et al. [ 55 ] and the EQUATOR Network reporting guidelines [ 79 ], can lead to a cultural shift in the research community. This, in turn, can enhance transparency and the quality of public-health research using big data by fostering interdisciplinary programs and worldwide cooperation among different health-related stakeholders, such as researchers, policy makers, clinicians, providers, and the public. Improving research quality can lead to greater value and reliability while decreasing research waste, thus improving the cost–value ratio and trust between stakeholders [ 80 , 81 ] and, as previously stated, facilitating translation and implementation of research findings [ 18 ].

Just as replicability is fundamental in engineering for creating functioning and reliable products or systems, it is also necessary for modelling and simulation in the fields of urban public health and digital epidemiology [ 82 ]. Simulation approaches built upon reproducible research allow for the construction of accurate prediction models with important implications for healthcare [ 83 ] and public health [ 84 ]. In the same way, reproduction and replication of results are essential for model validation [ 85 , 86 , 87 ].

The importance of reducing research waste and ensuring the value of health-related research is reflected in the existence of initiatives, such as the AllTrials Campaign, EQUATOR (enhancing the quality and transparency of health research), and EVBRES (evidence-based research), which promote protocol registration, full methods, and result reporting, and new studies that build on an existing evidence base [ 79 , 88 , 89 , 90 ].

Changes in editorial policies and practices can improve critical reflection on research quality by the authors. Having researchers, editors, and reviewers use guidelines [ 91 ], such as ARRIVE [ 92 ] in the case of pre-clinical animal studies or STROBE [ 93 ] for observational studies in epidemiology, can significantly improve reporting and transparency. For example, an observational cohort study analysing the effects of a change in the editorial policy of Nature, which introduced a checklist for manuscript preparation, demonstrated that the reporting of risk of bias improved substantially as a consequence [ 94 ].

A valuable outcome of adopting open science approaches that could result in improved communication, shared infrastructure, open data, and collaboration between researchers and even institutions is the implementation of competitions, challenges, or even ‘hackathons’. These events are already common among other disciplines, such as computer science, the digital tech sector, and social media research, and are becoming increasingly popular in areas related to public health. Some examples include the Big Data Hackathon San Diego, where the theme for 2022 was ‘Tackling Real-world Challenges in Healthcare’ [ 95 ], and the Yale CBIT Healthcare Hackathon of 2021, which aimed to build solutions to challenges faced in healthcare [ 96 ]. In addition to tackling issues in innovative ways, hackathons and other similar open initiatives invite the public to learn about and engage with science [ 97 ] and can be powerful tools for engaging diverse stakeholders and training beyond the classroom [ 98 ].

5. Anticipated Challenges

While the implementation of the approaches discussed ( Figure 2 ) will ideally translate to a significant reduction in research waste and improvement in scientific research through standardization and transparency, there are also substantial challenges to consider ( Figure 3 ).

An external file that holds a picture, illustration, etc.
Object name is ijerph-20-01473-g003.jpg

Examples of challenges to expect when implementing approaches aimed at improving reproducibility.

First, not all researchers have adequate resources or opportunities to take advantage of new data that can be used to prevent, monitor, and improve population health. Early career researchers in low-resource settings may be at a particular disadvantage. Among these researchers, barriers to accessing and adequately using big data may not only be financial, when funding is not available, but also technical, when the knowledge and tools required are not available.

Similarly, events and activities among young researchers can facilitate technical development, networking, and knowledge acquisition, ultimately improving research quality and outcomes. Those who live in environments with limited resources, who are physically isolated, or have limited mobility may not have access to these opportunities. It might be possible to overcome some of these limitations with accessible digital solutions.

Much-needed shifts in the research and publishing culture are held back by structures that currently enable Questionable Research Practices (QRPs), such as cherry picking (presenting favourable evidence or results while hiding unfavourable ones), p-hacking (misusing data through relentless analysis in order to obtain statistically significant results), and HARKing (Hypothesizing After the Results are Known), among others [ 59 , 99 , 100 ]. To overcome these challenges embedded in modern-day research, it is necessary to educate researchers about the scope of misconduct, create structures to prevent it from happening, and scrutinize cases in which these instances may be apparent to determine the actual motive [ 101 ].
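The impact of one such practice is easy to demonstrate with a simulation. The sketch below uses entirely synthetic null data (no true effect anywhere) and mimics p-hacking by testing many arbitrary post hoc subgroups and keeping only the smallest p-value; the subgroup rule and sample sizes are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def smallest_p_after_hacking(n=200, n_subgroups=20):
    """Test many arbitrary subgroups of pure-noise data and keep the best p-value."""
    outcome = rng.normal(size=n)                       # no true effect anywhere
    exposed = rng.integers(0, 2, size=n).astype(bool)  # arbitrary 'exposure'
    best_p = 1.0
    for _ in range(n_subgroups):
        subgroup = rng.random(n) < 0.5                 # arbitrary post hoc split
        a, b = outcome[exposed & subgroup], outcome[~exposed & subgroup]
        if len(a) > 2 and len(b) > 2:
            best_p = min(best_p, stats.ttest_ind(a, b).pvalue)
    return best_p

share_significant = np.mean([smallest_p_after_hacking() < 0.05 for _ in range(500)])
print(f"'Significant' findings under a true null: {share_significant:.0%}")
# The share is far above the nominal 5%, which is why selective subgroup
# analysis undermines reproducibility.
```

Pre-registered hypotheses and full reporting of all analyses performed are the standard remedies for exactly this inflation of false positives.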

Conventional data storing and handling strategies are not sufficient when working with big data, as these often impose additional monetary and computational costs. Some solutions are available to tackle these issues, such as cloud computing and platforms that allow end users to access shared resources over the internet [ 102 ]; Data Lakes, consisting of centralized repositories that allow for data storage and analysis [ 103 ]; and Data Mesh, a platform architecture that distributes data among several nodes [ 104 ]. Unfortunately, these solutions are not always easily accessible. Additionally, use of these platforms has given rise to important debates concerning issues, such as data governance and security [ 105 ].

The use of big data, and especially the use of personal and health information, raises privacy issues. The availability of personal and health information that results from the digital transformation represents a constant challenge when it comes to drawing a line between public and private, sensitive and non-sensitive information, and adherence to ethical research practices [ 106 ].

Ethical concerns are not limited to privacy; while big data entails the use of increasingly complex analytical methods that require expertise in order to deal with noise and uncertainty, there are several additional factors that may affect the accuracy of research results [ 107 ]. For example, when using machine learning approaches to analyse big data, methods should be cautiously chosen to avoid issues, such as undesired data-driven variable selection, algorithmic biases, and overfitting the analytic models [ 108 ]. Complexity increases the need for collaboration, which makes “team science” and other collaborative problem-solving events (such as Hackathons) increasingly popular. This leads to new requirements to adequately value and acknowledge contributorship [ 109 ].

Because statistical methods are becoming increasingly complex and the quantity of data ever greater, the number of scientific publications is also increasing, making it challenging for already-flawed peer-review systems to keep up and provide high-quality reviews of more and more complex research. Currently, peer review is mainly expected to: (i) assure the quality and accuracy of research, (ii) establish a hierarchy of published work, (iii) provide fair and equal opportunities, and (iv) assure fraud-free research [ 110 ]; however, it is not certain whether current peer-review procedures achieve or are capable of delivering on these expectations. Some solutions have been proposed to address these issues, such as the automation of peer-review processes [ 111 ] and the implementation of open review guidelines [ 112 , 113 , 114 ].

6. Conclusions

Big data research presents a unique opportunity for a cultural shift in the way public-health research is conducted today. At the same time, big data use will only benefit the field if the data are used adequately and the appropriate measures are taken so that their full potential can be harnessed. The inherent complexity of working with large data quantities requires a clear and transparent framework at multiple levels, ranging from the protocols and methods used by individual scientists to institutions' guiding principles and their research and publishing practices.

The solutions summarized in this commentary are aimed at enhancing results, reproducibility, and scientific integrity; however, we acknowledge that these solutions are not exhaustive and there may be many other promising approaches to improve the integrity of big data research as it applies to public health. The solutions described in this commentary are in line with “a manifesto for reproducible science” published in Nature Human Behaviour [ 101 ]. Importantly, reproducibility is only of value if the findings are expected to have an important impact on science, health, and society. Reproducibility of results is highly relevant for funding agencies and governments, who often recognize the importance of research projects with well-structured study designs, defined data-processing steps, and transparent analysis plans (e.g., statistical analysis plans) [ 115 , 116 ]. For imaging data, such as radiologic images, analysis pipelines have been shown to be suitable for structuring the analysis pathway [ 117 ]. This is specifically important for big data analysis, where interdisciplinarity and collaboration become increasingly important. The development and use of statistical and reporting guidelines support researchers in making their projects more reproducible [ 118 ].
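As a minimal sketch of the idea behind such pipelines (independent of any particular imaging toolkit; the step names and toy data are invented), each processing step is named and logged so that the exact analysis pathway can be re-run and audited.

```python
import json
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    name: str
    apply: Callable

def run_pipeline(data, steps: List[Step], log_path: str = "pipeline_log.json"):
    """Apply the documented steps in order and record them for reproducibility."""
    executed = []
    for step in steps:
        data = step.apply(data)
        executed.append(step.name)
    with open(log_path, "w") as fh:
        json.dump({"steps": executed}, fh, indent=2)
    return data

# Toy steps standing in for domain-specific processing (e.g., image preprocessing).
steps = [
    Step("drop_missing", lambda xs: [x for x in xs if x is not None]),
    Step("min_max_normalise", lambda xs: [(x - min(xs)) / (max(xs) - min(xs)) for x in xs]),
]
print(run_pipeline([3, None, 1, 2], steps))   # -> [1.0, 0.0, 0.5]
```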

Transparency in all the study-design steps (i.e., from hypothesis generation to availability of collected data and code) is specifically relevant for public health and epidemiological research in order to encourage funding agencies, the public, and other researchers and relevant stakeholders to trust research results [ 119 ]. Similarly, as globalization and digitalization increase the diffusion of infectious diseases [ 120 ] and behavioural risks [ 121 ], research practices that foster reproducible results are imperative to implement and diffuse interventions more swiftly.

We believe that the recommendations outlined in this commentary are not unique to big data and that the entire research community could benefit from these approaches [ 122 , 123 , 124 , 125 , 126 , 127 ]. However, what has been detailed here is specifically pertinent to big data, as an increase in the volume and complexity of the data produced requires more structured and consistent data handling to avoid research waste. With clearly defined and openly shared guidelines, we may strengthen the quality of current research and initiate a shift at multiple levels: the individual level, the institutional level, and the international level. Some challenges are to be expected, particularly when it comes to finding the right incentives for these changes to stick, but we are confident that, with the right effort, we can put scientific integrity back at the forefront of researchers' minds and, ultimately, strengthen the trust of the population in public-health research, specifically public-health research leveraging big data for urban public health and digital epidemiology.

The timely implementation of these solutions is highly relevant, not only to ensure the quality of research and scientific output, but also to potentially allow for the use of data sources that originated without public health in mind, spanning various fields that are relevant to urban public health and digital epidemiology. As outlined in this commentary, such data can originate from multiple sources, such as social media, mobile technologies, urban sensors, and GIS, to mention a few. As such data sources grow and become more readily available, it is important for researchers and the scientific community to be prepared to use these valuable and diverse data sources in innovative ways to advance research and practice. This would allow for the expanded use of big data to inform evidence-based decision making to positively impact public health.

Acknowledgments

We are grateful for the support of the Swiss School of Public Health (SSPH+) and, in particular, to all lecturers and participants of the class of 2021 Big Data in Public Health course. We would also like to extend our gratitude to the reviewers for their excellent feedback and suggestions.

Funding Statement

Swiss School of Public Health (SSPH+) to O.G. This commentary is an outcome of an SSPH+ PhD course on Big Data in Public Health (website: https://ssphplus.ch/en/graduate-campus/en/graduate-campus/course-program/ (accessed on 10 October 2022)).

Author Contributions

Conceptualization, A.C.Q.G., D.J.L., A.T.H., T.S. and O.G.; writing—original draft preparation, A.C.Q.G., D.J.L., A.T.H. and T.S.; writing—review and editing, A.C.Q.G., D.J.L., M.S., S.E., S.J.M., J.A.N., M.F. and O.G.; supervision, O.G. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Informed Consent Statement, Data Availability Statement, Conflicts of Interest

The authors declare no conflict of interest.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.


6th Annual Neuro Open Science in Action Symposium 2024


An event organized by the Tanenbaum Open Science Institute, held in person at The Neuro and livestreamed online.

Registration coming soon

Livestream link coming soon

Open Science Throughout the Research Lifecycle 

This year's Symposium will highlight how Open Science works through various stages of the research lifecycle, focusing on areas where it is not yet widely practiced, such as data acquisition in laboratories. Interactive sessions will cover open resources enabling better study design, initiatives to increase diversity in research data, open-source hardware for data acquisition, and collaborative approaches to catalyze big open data analysis.

Ed Lein, Senior Investigator at the Allen Institute for Brain Science, will kick off the event with the keynote lecture, providing an overview of the open education tools developed by the Allen Institute, which are invaluable for enhancing neuroscience education and strengthening experimental design. 

Open Science Prize Ceremony

The day will conclude with the 2024 Neuro-Irv and Helga Cooper Foundation Open Science Prizes Ceremony. The winners of this premier OS competition will accept their awards and present their work. Following the ceremony, symposium attendees are invited to celebrate and network over cocktails.

Program (all times EST)

Sessions run from 9:00 a.m. to approximately 4:15 p.m., including a Trainee Poster Session over the midday break and the afternoon presentation of the Canadian Trainee Prize, the International Trainee Prize, and the Main International Prize.

Jeanne Timmins Amphitheatre, The Neuro (The Montreal Neurological Institute and Hospital)

The Montreal Neurological Institute and Hospital is at 3801 University Street, north of Pine Avenue West, on the McGill University campus opposite the former Royal Victoria Hospital.

Montreal is served by highway Routes 10, 15, 20 and 40, and by Greyhound Bus, ViaRail and the P-E-Trudeau airport. In the city, bus and metro service is provided by the Société de transport de Montréal (STM).

Wheelchair access

A wheelchair accessible entrance is on University Street north of the main entrance. Another wheelchair accessible entrance is in the loading area behind the building: to enter the loading area, turn into the driveway south of the main entrance. Please note, there is no parking in the loading area.

Parking near The Neuro is sometimes difficult. There are parking meters on University Street and a parking lot north of the main entrance. To enter the lot, turn right into the driveway toward Molson Stadium.

Information about parking fees

There is a taxi stand on University Street across from the main entrance. You may call a cab from the free taxi phone in the main lobby near the Security Desk.

Access by Public Transportation  (STM website)

There are four bus stops within walking distance:

  • Bus 144 stops at Pine Avenue and University Street
  • Bus 356 stops at Sherbrooke Street and University Street (Nightbus)
  • Bus 107 stops at Pine Avenue and Docteur Penfield
  • Bus 24 stops at Sherbrooke Street and University Street

Take the Metro Green Line to the McGill station. Walk north on University Street and cross Pine Avenue. The main entrance is on the right, past the flags.

Organizing Committee

Gabriel Pelletier, Open Science Data Manager, Tanenbaum Open Science Institute (TOSI)

Leah Lefort, TOSI Coordinator

Annabel Seyller, Chief of Staff, The Neuro and CEO, TOSI

Thomas Durcan, Associate Professor, The Neuro and Chair, TOSI Prize Committee

Luisa Pimentel, Open Science Community Officer, Tanenbaum Open Science Institute (TOSI)

Debbie Rashcovsky, Events Lead, The Neuro


  • Open access
  • Published: 19 June 2024

Detecting hallucinations in large language models using semantic entropy

  • Sebastian Farquhar (ORCID: orcid.org/0000-0002-9185-6415),
  • Jannik Kossen,
  • Lorenz Kuhn &
  • Yarin Gal (ORCID: orcid.org/0000-0002-2733-2078)

Nature volume 630, pages 625–630 (2024)

74k Accesses

1 Citations

1479 Altmetric

Metrics details

  • Computer science
  • Information technology

Large language model (LLM) systems, such as ChatGPT 1 or Gemini 2 , can show impressive reasoning and question-answering capabilities but often ‘hallucinate’ false outputs and unsubstantiated answers 3 , 4 . Answering unreliably or without the necessary information prevents adoption in diverse fields, with problems including fabrication of legal precedents 5 or untrue facts in news articles 6 and even posing a risk to human life in medical domains such as radiology 7 . Encouraging truthfulness through supervision or reinforcement has been only partially successful 8 . Researchers need a general method for detecting hallucinations in LLMs that works even with new and unseen questions to which humans might not know the answer. Here we develop new methods grounded in statistics, proposing entropy-based uncertainty estimators for LLMs to detect a subset of hallucinations—confabulations—which are arbitrary and incorrect generations. Our method addresses the fact that one idea can be expressed in many ways by computing uncertainty at the level of meaning rather than specific sequences of words. Our method works across datasets and tasks without a priori knowledge of the task, requires no task-specific data and robustly generalizes to new tasks not seen before. By detecting when a prompt is likely to produce a confabulation, our method helps users understand when they must take extra care with LLMs and opens up new possibilities for using LLMs that are otherwise prevented by their unreliability.


‘Hallucinations’ are a critical problem 9 for natural language generation systems using large language models (LLMs), such as ChatGPT 1 or Gemini 2 , because users cannot trust that any given output is correct.

Hallucinations are often defined as LLMs generating “content that is nonsensical or unfaithful to the provided source content” 9 , 10 , 11 but they have come to include a vast array of failures of faithfulness and factuality. We focus on a subset of hallucinations which we call ‘confabulations’ 12 for which LLMs fluently make claims that are both wrong and arbitrary—by which we mean that the answer is sensitive to irrelevant details such as random seed. For example, when asked a medical question “What is the target of Sotorasib?” an LLM confabulates by sometimes answering KRASG12 ‘C’ (correct) and other times KRASG12 ‘D’ (incorrect) despite identical instructions. We distinguish this from cases in which a similar ‘symptom’ is caused by the following different mechanisms: when LLMs are consistently wrong as a result of being trained on erroneous data such as common misconceptions 13 ; when the LLM ‘lies’ in pursuit of a reward 14 ; or systematic failures of reasoning or generalization. We believe that combining these distinct mechanisms in the broad category hallucination is unhelpful. Our method makes progress on a portion of the problem of providing scalable oversight 15 by detecting confabulations that people might otherwise find plausible. However, it does not guarantee factuality because it does not help when LLM outputs are systematically bad. Nevertheless, we significantly improve question-answering accuracy for state-of-the-art LLMs, revealing that confabulations are a great source of error at present.

We show how to detect confabulations by developing a quantitative measure of when an input is likely to cause an LLM to generate arbitrary and ungrounded answers. Detecting confabulations allows systems built on LLMs to avoid answering questions likely to cause confabulations, to make users aware of the unreliability of answers to a question or to supplement the LLM with more grounded search or retrieval. This is essential for the critical emerging field of free-form generation in which naive approaches, suited to closed vocabulary and multiple choice, fail. Past work on uncertainty for LLMs has focused on simpler settings, such as classifiers 16 , 17 and regressors 18 , 19 , whereas the most exciting applications of LLMs relate to free-form generations.

The term hallucination in the context of machine learning originally comes from filling in ungrounded details, either as a deliberate strategy 20 or as a reliability problem 4 . The appropriateness of the metaphor has been questioned as promoting undue anthropomorphism 21 . Although we agree that metaphor must be used carefully with LLMs 22 , the widespread adoption of the term hallucination reflects the fact that it points to an important phenomenon. This work represents a step towards making that phenomenon more precise.

To detect confabulations, we use probabilistic tools to define and then measure the ‘semantic’ entropy of the generations of an LLM—an entropy that is computed over meanings of sentences. High entropy corresponds to high uncertainty 23 , 24 , 25 —so semantic entropy is one way to estimate semantic uncertainties. Semantic uncertainty, the broader category of measures we introduce, could be operationalized with other measures of uncertainty, such as mutual information, instead. Entropy in free-form generation is normally hard to measure because answers might mean the same thing (be semantically equivalent) despite being expressed differently (being syntactically or lexically distinct). This causes naive estimates of entropy or other lexical variation scores 26 to be misleadingly high when the same correct answer might be written in many ways without changing its meaning.

By contrast, our semantic entropy moves towards estimating the entropy of the distribution of meanings of free-form answers to questions, insofar as that is possible, rather than the distribution over the ‘tokens’ (words or word-pieces) which LLMs natively represent. This can be seen as a kind of semantic consistency check 27 for random seed variation. An overview of our approach is provided in Fig. 1 and a worked example in Supplementary Table 1 .

Fig. 1

a , Naive entropy-based uncertainty measures variation in the exact answers, treating ‘Paris’, ‘It’s Paris’ and ‘France’s capital Paris’ as different. But this is unsuitable for language tasks for which sometimes different answers mean the same things. Our semantic entropy clusters answers which share meanings before computing the entropy. A low semantic entropy shows that the LLM is confident about the meaning. b , Semantic entropy can also detect confabulations in longer passages. We automatically decompose a long generated answer into factoids. For each factoid, an LLM generates questions to which that factoid might have been the answer. The original LLM then samples  M possible answers to these questions. Finally, we compute the semantic entropy over the answers to each specific question, including the original factoid. Confabulations are indicated by high average semantic entropy for questions associated with that factoid. Here, semantic entropy classifies Fact 1 as probably not a confabulation because generations often mean the same thing, despite very different wordings, which a naive entropy would have missed.

Intuitively, our method works by sampling several possible answers to each question and clustering them algorithmically into answers that have similar meanings, which we determine on the basis of whether answers in the same cluster entail each other bidirectionally 28 . That is, if sentence A entails that sentence B is true and vice versa, then we consider them to be in the same semantic cluster. We measure entailment using both general-purpose LLMs and natural language inference (NLI) tools developed specifically for detecting entailment for which we show direct evaluations in Supplementary Tables 2 and 3 and Supplementary Fig. 1 . Textual entailment has previously been shown to correlate with faithfulness 10 in the context of factual consistency 29 as well as being used to measure factuality in abstractive summarization 30 , especially when applied at the right granularity 31 .

Semantic entropy detects confabulations in free-form text generation across a range of language models and domains, without previous domain knowledge. Our evaluations cover question answering in trivia knowledge (TriviaQA 32 ), general knowledge (SQuAD 1.1; ref. 33 ), life sciences (BioASQ 34 ) and open-domain natural questions (NQ-Open 35 ) derived from actual queries to Google Search 36 . In addition, semantic entropy detects confabulations in mathematical word problems (SVAMP 37 ) and in a biography-generation dataset, FactualBio, accompanying this paper.

Our results for TriviaQA, SQuAD, BioASQ, NQ-Open and SVAMP are all evaluated context-free and involve sentence-length answers (96 ± 70 characters, mean ± s.d.) and use LLaMA 2 Chat (7B, 13B and 70B parameters) 38 , Falcon Instruct (7B and 40B) 39 and Mistral Instruct (7B) 40 . In the Supplementary Information , we further consider short-phrase-length answers. Results for FactualBio (442 ± 122 characters) use GPT-4 (ref. 1 ). At the time of writing, GPT-4 (ref. 1 ) did not expose output probabilities 41 or hidden states, although it does now. As a result, we propose a discrete approximation of our estimator for semantic entropy which allows us to run experiments without access to output probabilities, which we use for all GPT-4 results in this paper and which performs similarly well.

Our confabulation detection with semantic entropy is more robust to user inputs from previously unseen domains than methods which aim to ‘learn’ how to detect confabulations from a set of example demonstrations. Our method is unsupervised, meaning that we do not need labelled examples of confabulations. By contrast, supervised methods detect confabulations by learning patterns behind examples of confabulations, assuming that future questions preserve these patterns. But this assumption is often untrue in new situations or with confabulations that human overseers are unable to identify (compare Fig. 17 of ref. 24 ). As a strong supervised baseline, we compare to an embedding regression method inspired by ref. 24 which trains a logistic regression classifier to predict whether the model correctly answered a question on the basis of the final ‘embedding’ (hidden state) of the LLM. We also use the P (True) method 24 which looks at the probability with which an LLM predicts that the next token is ‘True’ when few-shot prompted to compare a main answer with ‘brainstormed’ alternatives.

Confabulations contribute substantially to incorrect answers given by language models. We show that semantic entropy can be used to predict many incorrect model answers and to improve question-answering accuracy by refusing to answer those questions the model is uncertain about. Corresponding to these two uses, we evaluate two main metrics. First, the widely used area under the receiver operating characteristic (AUROC) curve for the binary event that a given answer is incorrect. This measure captures both precision and recall and ranges from 0 to 1, with 1 representing a perfect classifier and 0.5 representing an un-informative classifier. We also show a new measure, the area under the ‘rejection accuracy’ curve (AURAC). This studies the case in which the confabulation detection score is used to refuse to answer the questions judged most likely to cause confabulations. Rejection accuracy is the accuracy of the answers of the model on the remaining questions and the area under this curve is a summary statistic over many thresholds (representative threshold accuracies are provided in Supplementary Material ). The AURAC captures the accuracy improvement which users would experience if semantic entropy was used to filter out questions causing the highest entropy.

Detecting confabulations in QA and math

In Fig. 2 , we show that both semantic entropy and its discrete approximation outperform our best baselines for sentence-length generations. These results are averaged across datasets and provide the actual scores on the held-out evaluation dataset. We report the raw average score across held-out evaluation datasets without standard error because the distributional characteristics are more a property of the models and datasets selected than the method. Consistency of relative results across different datasets is a stronger indicator of variation in this case.

Fig. 2

Semantic entropy outperforms leading baselines and naive entropy. AUROC (scored on the y -axes) measures how well methods predict LLM mistakes, which correlate with confabulations. AURAC (likewise scored on the y -axes) measures the performance improvement of a system that refuses to answer questions which are judged likely to cause confabulations. Results are an average over five datasets, with individual metrics provided in the Supplementary Information .

Semantic entropy greatly outperforms the naive estimation of uncertainty using entropy: computing the entropy of the length-normalized joint probability of the token sequences. Naive entropy estimation ignores the fact that token probabilities also express the uncertainty of the model over phrasings that do not change the meaning of an output.

Our methods also outperform the supervised embedding regression method both in- and out-of-distribution. In pale-yellow bars we show that embedding regression performance deteriorates when its training data do not match the deployment distribution—which mirrors the common real-world case in which there is a distribution shift between training and deployment 42 —the plotted value is the average metric for embedding regression trained on one of the four ‘off-distribution’ datasets for that evaluation. This is critical because reliable uncertainty is most important when the data distribution shifts. Semantic entropy also outperforms P (True) which is supervised ‘in-context’; that is, it is adapted to the deployment task with a few training examples provided in the LLM prompt itself. The discrete variant of semantic entropy performs similarly to our standard estimator, despite not requiring exact output probabilities.

Averaged across the 30 combinations of tasks and models we study, semantic entropy achieves the best AUROC value of 0.790 whereas naive entropy (0.691), P (True) (0.698) and the embedding regression baseline (0.687) lag behind it. Semantic entropy performs well consistently, with stable performance (between 0.78 and 0.81 AUROC) across the different model families (LLaMA, Falcon and Mistral) and scales (from 7B to 70B parameters) which we study (we report summary statistics for each dataset and model as before). Although semantic entropy outperforms the baselines across all model sizes, P (True) seems to improve with model size, suggesting that it might become more competitive for very capable honest models in settings that the model understands well (which are, however, not the most important cases to have good uncertainty). We use ten generations to compute entropy, selected using analysis in Supplementary Fig. 2 . Further results for short-phrase generations are described in Supplementary Figs. 7 – 10 .

The results in Fig. 2 offer a lower bound on the effectiveness of semantic entropy at detecting confabulations. These evaluations determine whether semantic entropy and baseline methods can detect when the answers of the model are incorrect (which we validate against human correctness evaluations in Supplementary Table 4 ). In addition to errors from confabulations (arbitrary incorrectness), this also includes other types of mistakes for which semantic entropy is not suited, such as consistent errors learned from the training data. The fact that methods such as embedding regression are able to spot other kinds of errors, not just confabulations, but still are outperformed by semantic entropy, suggests that confabulations are a principal category of errors for actual generations.

Examples of questions and answers from TriviaQA, SQuAD and BioASQ, for LLaMA 2 Chat 70B, are shown in Table 1 . These illustrate how only semantic entropy detects when the meaning is constant but the form varies (the first row of the table) whereas semantic entropy and naive entropy both correctly predict the presence of confabulations when the form and meaning vary together (second row) and predict the absence of confabulations when the form and meaning are both constant across several resampled generations (third row). In the final row, we give an example in which semantic entropy is erroneously high as a result of overly sensitive semantic clustering relative to the reference answer. Our clustering method distinguishes the answers which provide a precise date from those which only provide a year. For some contexts that would have been correct but in this context the distinction between the specific day and the year is probably irrelevant. This highlights the importance of context and judgement in clustering, especially in subtle cases, as well as the shortcomings of evaluating against fixed reference answers which do not capture the open-ended flexibility of conversational deployments of LLMs.

Detecting confabulations in biographies

Semantic entropy is most natural for sentences that express a single proposition but the idea of semantic equivalence is trickier to apply to longer passages which express many propositions which might only agree partially 43 . Nevertheless, we can use semantic entropy to detect confabulations in longer generations, such as entire paragraphs of text. To show this, we develop a dataset of biographical generations from GPT-4 (v.0613) for 21 individuals notable enough to have their own Wikipedia page but without extensive online biographies. From each biography generated by GPT-4, we automatically extract propositional factual claims about the individual (150 factual claims in total), which we manually label as true or false.

Applying semantic entropy to this problem is challenging. Naively, one might simply regenerate each sentence (conditioned on the text so far) and then compute semantic entropy over these regenerations. However, the resampled sentences often target different aspects of the biography: for example, one time describing family and the next time profession. This is analogous to the original problem semantic entropy was designed to resolve: the model is uncertain about the right ordering of facts, not about the facts themselves. To address this, we break down the entire paragraph into factual claims and reconstruct questions which might have been answered by those claims. Only then do we apply semantic entropy (Fig. 1 ) by generating three new answers to each question (selected with analysis in Supplementary Figs. 3 and 4 ) and computing the semantic entropy over those generations plus the original factual claim. We aggregate these by averaging the semantic entropy over all the questions to get an uncertainty score for each proposition, which we use to detect confabulations. Unaggregated results are shown in Supplementary Figs. 5 and 6 .

As GPT-4 did not allow access to the probability of the generation at the time of writing, we use a discrete variant of semantic entropy which makes the further approximation that we can infer a discrete empirical distribution over semantic meaning clusters from only the generations ( Methods ). This allows us to compute semantic entropy using only the black-box outputs of an LLM. However, we were unable to compute the naive entropy baseline, the standard semantic entropy estimator or the embedding regression baseline for GPT-4 without output probabilities and embeddings.

In Fig. 3 we show that the discrete variant of semantic entropy effectively detects confabulations on this dataset. Its AUROC and AURAC are higher than either a simple ‘self-check’ baseline—which just asks the LLM whether the factoid is likely to be true—or a variant of P (True) which has been adapted to work for the paragraph-length setting. Discrete semantic entropy has better rejection accuracy performance until 20% of the questions have been rejected at which point P (True) has a narrow edge. This indicates that the questions predicted to cause confabulations are indeed more likely to be wrong.

Fig. 3

The discrete variant of our semantic entropy estimator outperforms baselines both when measured by AUROC and AURAC metrics (scored on the y -axis). The AUROC and AURAC are substantially higher than for both baselines. At above 80% of questions being answered, semantic entropy has the highest accuracy. Only when the top 20% of answers judged most likely to be confabulations are rejected does the answer accuracy on the remainder for the P (True) baseline exceed semantic entropy.

Our probabilistic approach, accounting for semantic equivalence, detects an important class of hallucinations: those that are caused by a lack of LLM knowledge. These are a substantial portion of the failures at present and will continue even as models grow in capabilities because situations and cases that humans cannot reliably supervise will persist. Confabulations are a particularly noteworthy failure mode for question answering but appear in other domains too. Semantic entropy needs no previous domain knowledge and we expect that algorithmic adaptations to other problems will allow similar advances in, for example, abstractive summarization. In addition, extensions to alternative input variations such as rephrasing or counterfactual scenarios would allow a similar method to act as a form of cross-examination 44 for scalable oversight through debate 45 .

The success of semantic entropy at detecting errors suggests that LLMs are even better at “knowing what they don’t know” than was argued by ref. 24 —they just don’t know they know what they don’t know. Our method explicitly does not directly address situations in which LLMs are confidently wrong because they have been trained with objectives that systematically produce dangerous behaviour, cause systematic reasoning errors or are systematically misleading the user. We believe that these represent different underlying mechanisms—despite similar ‘symptoms’—and need to be handled separately.

One exciting aspect of our approach is the way it makes use of classical probabilistic machine learning methods and adapts them to the unique properties of modern LLMs and free-form language generation. We hope to inspire a fruitful exchange of well-studied methods and emerging new problems by highlighting the importance of meaning when addressing language-based machine learning problems.

Semantic entropy as a strategy for overcoming confabulation builds on probabilistic tools for uncertainty estimation. It can be applied directly to any LLM or similar foundation model without requiring any modifications to the architecture. Our ‘discrete’ variant of semantic uncertainty can be applied even when the predicted probabilities for the generations are not available, for example, because access to the internals of the model is limited.

In this section we introduce background on probabilistic methods and uncertainty in machine learning, discuss how it applies to language models and then discuss our contribution, semantic entropy, in detail.

Uncertainty and machine learning

We aim to detect confabulations in LLMs, using the principle that the model will be uncertain about generations for which its output is going to be arbitrary.

One measure of uncertainty is the predictive entropy of the output distribution, which measures the information one has about the output given the input 25 . The predictive entropy (PE) for an input sentence x is the conditional entropy ( H ) of the output random variable Y with realization y given x ,

\({\rm{PE}}({\boldsymbol{x}})=H(Y| {\boldsymbol{x}})=-{\sum }_{y}P(y| {\boldsymbol{x}})\log P(y| {\boldsymbol{x}}).\)

A low predictive entropy indicates an output distribution which is heavily concentrated whereas a high predictive entropy indicates that many possible outputs are similarly likely.

Aleatoric and epistemic uncertainty

We do not distinguish between aleatoric and epistemic uncertainty in our analysis. Researchers sometimes separate aleatoric uncertainty (uncertainty in the underlying data distribution) from epistemic uncertainty (caused by having only limited information) 46 . Further advances in uncertainty estimation which separate these kinds of uncertainty would enhance the potential for our semantic uncertainty approach by allowing extensions beyond entropy.

Joint probabilities of sequences of tokens

Generative LLMs produce strings of text by selecting tokens in sequence. Each token is a wordpiece that often represents three or four characters (though especially common sequences and important words such as numbers typically get their own token). To compute entropies, we need access to the probabilities the LLM assigns to the generated sequence of tokens. The probability of the entire sequence, s , conditioned on the context, x , is the product of the conditional probabilities of new tokens given past tokens, whose resulting log-probability is \(\log P({\bf{s}}| {\boldsymbol{x}})={\sum }_{i}\log P({s}_{i}| {{\bf{s}}}_{ < i},{\boldsymbol{x}})\) , where s i is the i th output token and s < i denotes the set of previous tokens.

Length normalization

When comparing the log-probabilities of generated sequences, we use ‘length normalization’, that is, we use an arithmetic mean log-probability, \(\frac{1}{N}{\sum }_{i=1}^{N}\log P({s}_{i}| {{\bf{s}}}_{ < i},{\boldsymbol{x}})\), instead of the sum. In expectation, longer sequences have lower joint likelihoods because of the conditional independence of the token probabilities 47 . The joint likelihood of a sequence of length N shrinks exponentially in N . Its negative log-probability therefore grows linearly in N , so longer sentences tend to contribute more to entropy. We therefore interpret length-normalizing the log-probabilities when estimating the entropy as asserting that the expected uncertainty of generations is independent of sentence length. Length normalization has some empirical success 48 , including in our own preliminary experiments, but little theoretical justification in the literature.
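Both quantities are simple to compute once the per-token log-probabilities of a generation are available. A minimal sketch (the helper names and the numbers in the example are illustrative, not part of the paper's released code):

```python
import numpy as np

def sequence_log_prob(token_log_probs):
    """Joint log-probability of a generated sequence:
    log P(s | x) = sum_i log P(s_i | s_<i, x)."""
    return float(np.sum(token_log_probs))

def length_normalized_log_prob(token_log_probs):
    """Arithmetic mean of the per-token log-probabilities, used instead of the
    raw sum so that longer answers are not penalized simply for having more tokens."""
    return float(np.mean(token_log_probs))

# Example: per-token log-probs for a four-token answer (illustrative numbers).
lp = [-0.2, -0.9, -0.1, -0.4]
print(sequence_log_prob(lp))           # -1.6
print(length_normalized_log_prob(lp))  # -0.4
```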

Principles of semantic uncertainty

If we naively calculate the predictive entropy directly from the probabilities of the generated sequence of tokens, we conflate the uncertainty of the model over the meaning of its answer with the uncertainty over the exact tokens used to express that meaning. For example, even if the model is confident in the meaning of a generation, there are still usually many different ways for phrasing that generation without changing its meaning. For the purposes of detecting confabulations, the uncertainty of the LLM over meanings is more important than the uncertainty over the exact tokens used to express those meanings.

Our semantic uncertainty method therefore seeks to estimate only the uncertainty the LLM has over the meaning of its generation, not the choice of words. To do this, we introduce an algorithm that clusters model generations by meaning and subsequently calculates semantic uncertainty. At a high level this involves three steps:

Generation: sample output sequences of tokens from the predictive distribution of a LLM given a context x .

Clustering: cluster sequences by their meaning using our clustering algorithm based on bidirectional entailment.

Entropy estimation: estimate semantic entropy by summing probabilities of sequences that share a meaning following equation ( 2 ) and compute their entropy.

Generating a set of answers from the model

Given some context x as input to the LLM, we sample M sequences, { s (1) , …,  s ( M ) } and record their token probabilities, { P ( s (1) ∣ x ), …,  P ( s ( M ) ∣ x )}. We sample all our generations from a single model, varying only the random seed used for sampling from the token probabilities. We do not observe the method to be particularly sensitive to details of the sampling scheme. In our implementation, we sample at temperature 1 using nucleus sampling ( P  = 0.9) (ref. 49 ) and top- K sampling ( K  = 50) (ref. 50 ). We also sample a single generation at low temperature (0.1) as an estimate of the ‘best generation’ of the model to the context, which we use to assess the accuracy of the model. (A lower sampling temperature increases the probability of sampling the most likely tokens).
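A minimal sketch of this sampling set-up using the Hugging Face transformers API; the model checkpoint, prompt and token budget are placeholders, and only the stated generation settings (temperature 1, nucleus sampling p = 0.9, top-k = 50, plus one low-temperature ‘best’ answer) come from the text:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.1"  # placeholder; any of the studied chat models works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

prompt = ("Answer the following question in a single brief but complete sentence. "
          "Question: What is the capital of France? Answer:")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# M high-temperature samples used for the semantic entropy estimate
# (temperature 1, nucleus sampling p = 0.9, top-k = 50).
samples = model.generate(**inputs, do_sample=True, temperature=1.0, top_p=0.9, top_k=50,
                         num_return_sequences=10, max_new_tokens=64)

# One low-temperature generation treated as the model's 'best' answer for accuracy scoring.
best = model.generate(**inputs, do_sample=True, temperature=0.1, max_new_tokens=64)

prompt_len = inputs["input_ids"].shape[1]
answers = tokenizer.batch_decode(samples[:, prompt_len:], skip_special_tokens=True)
best_answer = tokenizer.decode(best[0, prompt_len:], skip_special_tokens=True)
# Token log-probabilities (needed for the non-discrete estimator) can be recovered by
# generating with return_dict_in_generate=True, output_scores=True and calling
# model.compute_transition_scores(...).
```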

Clustering by semantic equivalence

To estimate semantic entropy we need to cluster generated outputs from the model into groups of outputs that mean the same thing as each other.

This can be described using ‘semantic equivalence’ which is the relation that holds between two sentences when they mean the same thing. We can formalize semantic equivalence mathematically. Let the space of tokens in a language be \({\mathcal{T}}\) . The space of all possible sequences of tokens of length N is then \({{\mathcal{S}}}_{N}\equiv {{\mathcal{T}}}^{N}\) . Note that N can be made arbitrarily large to accommodate whatever size of sentence one can imagine and one of the tokens can be a ‘padding’ token which occurs with certainty for each token after the end-of-sequence token. For some sentence \({\bf{s}}\in {{\mathcal{S}}}_{N}\) , composed of a sequence of tokens, \({s}_{i}\in {\mathcal{T}}\) , there is an associated meaning. Theories of meaning are contested 51 . However, for specific models and deployment contexts many considerations can be set aside. Care should be taken comparing very different models and contexts.

Let us introduce a semantic equivalence relation, E (  ⋅  ,  ⋅  ), which holds for any two sentences that mean the same thing—we will operationalize this presently. Recall that an equivalence relation is any reflexive, symmetric and transitive relation and that any equivalence relation on a set corresponds to a set of equivalence classes. Each semantic equivalence class captures outputs that can be considered to express the same meaning. That is, for the space of semantic equivalence classes \({\mathcal{C}}\) the sentences in the set \(c\in {\mathcal{C}}\) can be regarded in many settings as expressing a similar meaning such that \(\forall {\bf{s}},{{\bf{s}}}^{{\prime} }\in c:E({\bf{s}},{{\bf{s}}}^{{\prime} })\) . So we can build up these classes of semantically equivalent sentences by checking if new sentences share a meaning with any sentences we have already clustered and, if so, adding them into that class.

We operationalize E (  ⋅  ,  ⋅  ) using the idea of bidirectional entailment, which has a long history in linguistics 52 and natural language processing 28 , 53 , 54 . A sequence, s , means the same thing as a second sequence, s ′, only if the sequences entail (that is, logically imply) each other. For example, ‘The capital of France is Paris’ entails ‘Paris is the capital of France’ and vice versa because they mean the same thing. (See later for a discussion of soft equivalence and cases in which bidirectional entailment does not guarantee equivalent meanings).

Importantly, we require that the sequences mean the same thing with respect to the context—key meaning is sometimes contained in the context. For example, ‘Paris’ does not entail ‘The capital of France is Paris’ because ‘Paris’ is not a declarative sentence without context. But in the context of the question ‘What is the capital of France?’, the one-word answer does entail the longer answer.

Detecting entailment has been the object of study of a great deal of research in NLI 55 . We rely on language models to predict entailment, such as DeBERTa-Large-MNLI 56 , which has been trained to predict entailment, or general-purpose LLMs such as GPT-3.5 (ref. 57 ), which can predict entailment given suitable prompts.

We then cluster sentences according to whether they bidirectionally entail each other using the algorithm presented in Extended Data Fig. 1 . Note that, to check if a sequence should be added to an existing cluster, it is sufficient to check if the sequence bidirectionally entails any of the existing sequences in that cluster (we arbitrarily pick the first one), given the transitivity of semantic equivalence. If a sequence does not share meaning with any existing cluster, we assign it its own cluster.
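A minimal sketch of this greedy clustering loop; `entails` stands in for any of the entailment predictors discussed below (a prompted LLM or an NLI model) and its exact interface is an assumption of the sketch rather than a fixed API from the paper:

```python
def cluster_by_meaning(question, answers, entails):
    """Greedy bidirectional-entailment clustering: each cluster is a list of
    answers judged to share a meaning in the context of the question."""
    clusters = []
    for answer in answers:
        placed = False
        for cluster in clusters:
            rep = cluster[0]  # compare against the first member only (transitivity)
            if entails(question, answer, rep) and entails(question, rep, answer):
                cluster.append(answer)
                placed = True
                break
        if not placed:
            clusters.append([answer])  # no shared meaning found: start a new cluster
    return clusters

# Toy usage with a trivial "entailment" check (case-insensitive exact match);
# a real system would plug in the DeBERTa-MNLI or GPT-3.5 predictors discussed below.
print(cluster_by_meaning("What is the capital of France?", ["Paris", "paris", "Lyon"],
                         lambda q, a, b: a.lower() == b.lower()))
# [['Paris', 'paris'], ['Lyon']]
```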

Computing the semantic entropy

Having determined the classes of generated sequences that mean the same thing, we can estimate the likelihood that a sequence generated by the LLM belongs to a given class by computing the sum of the probabilities of all the possible sequences of tokens which can be considered to express the same meaning as

\(P(c| {\boldsymbol{x}})={\sum }_{{\bf{s}}\in c}P({\bf{s}}| {\boldsymbol{x}})={\sum }_{{\bf{s}}\in c}{\prod }_{i}P({s}_{i}| {{\bf{s}}}_{ < i},{\boldsymbol{x}}).\)

Formally, this treats the output as a random variable whose event-space is the space of all possible meaning-classes, C , a sub- σ -algebra of the standard event-space S . We can then estimate the semantic entropy (SE) as the entropy over the meaning-distribution,

\({\rm{SE}}({\boldsymbol{x}})=-{\sum }_{c}P(c| {\boldsymbol{x}})\log P(c| {\boldsymbol{x}}).\)

There is a complication which prevents direct computation: we do not have access to every possible meaning-class c . Instead, we can only sample c from the sequence-generating distribution induced by the model. To handle this, we estimate the expectation in equation ( 3 ) using a Rao–Blackwellized Monte Carlo integration over the semantic equivalence classes C ,

\({\rm{SE}}({\boldsymbol{x}})\approx -{\sum }_{i=1}^{| C| }P({C}_{i}| {\boldsymbol{x}})\log P({C}_{i}| {\boldsymbol{x}}),\)

where \(P({C}_{i}| {\boldsymbol{x}})=\frac{P({c}_{i}| {\boldsymbol{x}})}{{\sum }_{c}P(c| {\boldsymbol{x}})}\) estimates a categorical distribution over the cluster meanings, that is, ∑ i P ( C i ∣ x ) = 1. Without this normalization step cluster ‘probabilities’ could exceed one because of length normalization, resulting in degeneracies. Equation ( 5 ) is the estimator giving our main method that we refer to as semantic entropy throughout the text.

For scenarios in which the sequence probabilities are not available, we propose a variant of semantic entropy which we call ‘discrete’ semantic entropy. Discrete semantic entropy approximates P ( C i ∣ x ) directly from the number of generations in each cluster, disregarding the token probabilities. That is, we approximate P ( C i ∣ x ) as \(\frac{1}{M}{\sum }_{m=1}^{M}{I}_{{c}^{(m)}={C}_{i}}\) , the proportion of all the sampled answers which belong to that cluster. Effectively, this just assumes that each output that was actually generated was equally probable—estimating the underlying distribution as the categorical empirical distribution. In the limit of large M , the estimator converges to equation ( 5 ) by the law of large numbers. We find that discrete semantic entropy results in similar performance empirically.
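A minimal sketch of both estimators, assuming the per-sequence length-normalized log-probabilities have already been grouped by the clustering step; entropies are in nats and all numbers in the example are illustrative:

```python
import numpy as np

def log_sum_exp(log_probs):
    """Numerically stable log of a sum of probabilities given in log space."""
    m = np.max(log_probs)
    return m + np.log(np.sum(np.exp(np.asarray(log_probs) - m)))

def semantic_entropy(clustered_log_probs):
    """Rao-Blackwellized estimate: sum (length-normalized) sequence probabilities
    within each cluster, normalize across clusters so the cluster probabilities
    form a categorical distribution, and return its entropy."""
    cluster_log_p = np.array([log_sum_exp(lps) for lps in clustered_log_probs])
    p = np.exp(cluster_log_p - log_sum_exp(cluster_log_p))  # sum_i P(C_i | x) = 1
    return float(-np.sum(p * np.log(p)))

def discrete_semantic_entropy(cluster_sizes):
    """Discrete variant: P(C_i | x) is approximated by the fraction of sampled
    answers falling in cluster i, ignoring token probabilities."""
    p = np.asarray(cluster_sizes, dtype=float)
    p = p / p.sum()
    return float(-np.sum(p * np.log(p)))

# Toy example: three meaning clusters with illustrative log-probabilities / counts.
print(semantic_entropy([[-0.4, -0.6], [-1.5], [-2.2, -2.0]]))
print(discrete_semantic_entropy([6, 3, 1]))  # ~0.898 nats
```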

We provide a worked example of the computation of semantic entropy in Supplementary Note  1 .

Semantic entropy is designed to detect confabulations, that is, model outputs with arbitrary meaning. In our experiments, we use semantic uncertainty to predict model accuracy, demonstrating that confabulations make up a notable fraction of model mistakes. We further show that semantic uncertainty can be used to improve model accuracy by refusing to answer questions when semantic uncertainty is high. Last, semantic uncertainty can be used to give users a way to know when model generations are probably unreliable.

We use the datasets BioASQ 34 , SQuAD 33 , TriviaQA 32 , SVAMP 37 and NQ-Open 35 . BioASQ is a life-sciences question-answering dataset based on the annual challenge of the same name. The specific dataset we use is based on the QA dataset from Task B of the 2023 BioASQ challenge (11B). SQuAD is a reading comprehension dataset whose context passages are drawn from Wikipedia and for which the answers to questions can be found in these passages. We use SQuAD 1.1 which excludes the unanswerable questions added in v.2.0 that are deliberately constructed to induce mistakes so they do not in practice cause confabulations to occur. TriviaQA is a trivia question-answering dataset. SVAMP is a word-problem maths dataset containing elementary-school mathematical reasoning tasks. NQ-Open is a dataset of realistic questions aggregated from Google Search which have been chosen to be answerable without reference to a source text. For each dataset, we use 400 train examples and 400 test examples randomly sampled from the original larger dataset. Note that only some of the methods require training, for example semantic entropy does not use the training data. If the datasets themselves are already split into train and test (or validation) samples, we sample our examples from within the corresponding split.

All these datasets are free-form, rather than multiple choice, because this better captures the opportunities created by LLMs to produce free-form sentences as answers. We refer to this default scenario as our ‘sentence-length’ experiments. In Supplementary Note  7 , we also present results for confabulation detection in a ‘short-phrase’ scenario, in which we constrain model answers on these datasets to be as concise as possible.

To make the problems more difficult and induce confabulations, we do not provide the context passages for any of the datasets. When the context passages are provided, the accuracy rate is too high for these datasets for the latest generations of models to meaningfully study confabulations.

For sentence-length generations we use: Falcon 39 Instruct (7B and 40B), LLaMA 2 Chat 38 (7B, 13B and 70B) and Mistral 40 Instruct (7B).

In addition to reporting results for semantic entropy, discrete semantic entropy and naive entropy, we consider two strong baselines.

Embedding regression is a supervised baseline inspired by the P (IK) method 24 . In that paper, the authors fine-tune their proprietary LLM on a dataset of questions to predict whether the model would have been correct. This requires access to a dataset of ground-truth answers to the questions. Rather than fine-tuning the entire LLM in this way, we simply take the final hidden units and train a logistic regression classifier to make the same prediction. By contrast to their method, this is much simpler because it does not require fine-tuning the entire language model, as well as being more reproducible because the solution to the logistic regression optimization problem is not as seed-dependent as the fine-tuning procedure. As expected, this supervised approach performs well in-distribution but fails when the distribution of questions is different from that on which the classifier is trained.

The second baseline we consider is the P (True) method 24 , in which the model first samples M answers (identically to our semantic entropy approach) and then is prompted with the list of all answers generated followed by the highest probability answer and a question whether this answer is “(a) True” or “(b) False”. The confidence score is then taken to be the probability with which the LLM responds with ‘a’ to the multiple-choice question. The performance of this method is boosted with a few-shot prompt, in which up to 20 examples from the training set are randomly chosen, filled in as above, but then provided with the actual ground truth of whether the proposed answer was true or false. In this way, the method can be considered as supervised ‘in-context’ because it makes use of some ground-truth training labels but can be used without retraining the model. Because of context-size constraints, this method cannot fit a full 20 few-shot examples in the context when input questions are long or large numbers of generations are used. As a result, we sometimes have to reduce the number of few-shot examples to suit the context size and we note this in the  Supplementary Material .

Entailment estimator

Any NLI classification system could be used for our bidirectional entailment clustering algorithm. We consider two different kinds of entailment detector.

One option is to use an instruction-tuned LLM such as LLaMA 2, GPT-3.5 (Turbo 1106) or GPT-4 to predict entailment between generations. We use the following prompt:

We are evaluating answers to the question {question} Here are two possible answers: Possible Answer 1: {text1} Possible Answer 2: {text2} Does Possible Answer 1 semantically entail Possible Answer 2? Respond with entailment, contradiction, or neutral.

Alternatively, we consider using a language model trained for entailment prediction, specifically the DeBERTa-large model 56 fine-tuned on the NLI dataset MNLI 58 . This builds on past work towards paraphrase identification based on embedding similarity 59 , 60 and BERT-style models 61 , 62 . We template more simply, checking if DeBERTa predicts entailment between the concatenation of the question and one answer and the concatenation of the question and another answer. Note that DeBERTa-large is a relatively lightweight model with only 1.5B parameters which is much less powerful than most of the LLMs under study.
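A minimal sketch of such a DeBERTa-based check using the Hugging Face microsoft/deberta-large-mnli checkpoint; the question-plus-answer concatenation mirrors the templating described above, and reading off the top predicted label is an assumption of the sketch:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nli_name = "microsoft/deberta-large-mnli"
nli_tok = AutoTokenizer.from_pretrained(nli_name)
nli_model = AutoModelForSequenceClassification.from_pretrained(nli_name)

def entails(question, answer_a, answer_b):
    """True if 'question + answer_a' is predicted to entail 'question + answer_b'."""
    premise = f"{question} {answer_a}"
    hypothesis = f"{question} {answer_b}"
    inputs = nli_tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nli_model(**inputs).logits
    # Read the label name from the model config (contradiction / neutral / entailment).
    label = nli_model.config.id2label[int(logits.argmax(dim=-1))]
    return label.upper().startswith("ENTAIL")
```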

In Supplementary Note 2 , we carefully evaluate the benefits and drawbacks of these methods for entailment prediction. We settle on using GPT-3.5 with the above prompt, as its entailment predictions agree well with human raters and lead to good confabulation detection performance.

In Supplementary Note  3 , we provide a discussion of the computational cost and choosing the number of generations for reliable clustering.

Prompting templates

We use a simple generation template for all sentence-length answer datasets:

Answer the following question in a single brief but complete sentence. Question: {question} Answer:

Metrics and accuracy measurements

We use three main metrics to evaluate our method: AUROC, rejection accuracy and AURAC. Each of these is grounded in an automated factuality estimation measurement relative to the reference answers provided by the datasets that we use.

AUROC, rejection accuracy and AURAC

First, we use the AUROC curve, which measures the reliability of a classifier accounting for both precision and recall. The AUROC can be interpreted as the probability that a randomly chosen correct answer has been assigned a higher confidence score than a randomly chosen incorrect answer. For a perfect classifier, this is 1.

Second, we compute the ‘rejection accuracy at X %’, which is the question-answering accuracy of the model on the most-confident X % of the inputs as identified by the respective uncertainty method. If an uncertainty method works well, predictions on the confident subset should be more accurate than predictions on the excluded subset and the rejection accuracy should increase as we reject more inputs.

To summarize this statistic, we compute the AURAC—the total area enclosed by the accuracies at all cut-off percentages X %. This should increase towards 1 as a given uncertainty method becomes more accurate and better at detecting likely-inaccurate responses, but it is more sensitive to the overall accuracy of the model than the AUROC metric.
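A minimal sketch of the three metrics, assuming per-question binary correctness labels and confabulation scores (higher meaning more likely to confabulate); the cut-off grid used to approximate the area is an assumption of the sketch:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc_for_errors(correct, scores):
    """AUROC for the binary event 'answer is incorrect', using the
    confabulation score as the classifier score."""
    incorrect = 1 - np.asarray(correct)
    return roc_auc_score(incorrect, scores)

def rejection_accuracy(correct, scores, keep_fraction):
    """Accuracy on the keep_fraction of questions with the lowest scores,
    i.e. after refusing to answer the most uncertain ones."""
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(scores)  # most confident (lowest score) first
    n_keep = max(1, int(round(keep_fraction * len(correct))))
    return correct[order[:n_keep]].mean()

def aurac(correct, scores, grid=np.linspace(0.05, 1.0, 20)):
    """Approximate area under the rejection-accuracy curve by averaging over cut-offs."""
    return float(np.mean([rejection_accuracy(correct, scores, f) for f in grid]))

# Toy example (illustrative numbers only).
correct = [1, 1, 0, 1, 0, 1, 1, 0]
scores = [0.2, 0.1, 0.9, 0.3, 0.7, 0.2, 0.4, 0.8]
print(auroc_for_errors(correct, scores), aurac(correct, scores))
```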

In Supplementary Note  5 , we provide the unaggregated rejection accuracies for sentence-length generations.

Assessing accuracy

For the short-phrase-length generation setting presented in Supplementary Note  7 , we simply assess the accuracy of the generations by checking if the F1 score of the commonly used SQuAD metric exceeds 0.5. There are limitations to such simple scoring rules 63 but this method is widely used in practice and its error is comparatively small on these standard datasets.
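A simplified sketch of this token-overlap F1 check; note that the official SQuAD script additionally normalizes articles and punctuation before comparing:

```python
from collections import Counter

def squad_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer (simplified)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def is_correct_short_phrase(prediction, reference, threshold=0.5):
    """An answer counts as correct when its F1 against the reference exceeds 0.5."""
    return squad_f1(prediction, reference) > threshold

print(squad_f1("the Eiffel Tower", "Eiffel Tower"))  # 0.8
```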

For our default scenario, the longer sentence-length generations, this measure fails, as the overlap between the short reference answer and our long model answer is invariably too small. For sentence-length generations, we therefore automatically determine whether an answer to the question is correct or incorrect by using GPT-4 to compare the given answer to the reference answer. We use the template:

We are assessing the quality of answers to the following question: {question} The expected answer is: {reference answer} The proposed answer is: {predicted answer} Within the context of the question, does the proposed answer mean the same as the expected answer? Respond only with yes or no.

We make a small modification for datasets with several reference answers: line two becomes “The following are expected answers to this question:” and the final line asks “does the proposed answer mean the same as any of the expected answers?”.

In Supplementary Note 6 , we check the quality of our automated ground-truth evaluations against human judgement by hand. We find that GPT-4 gives the best results for determining model accuracy and thus use it in all our sentence-length experiments.

In this section we describe the application of semantic entropy to confabulation detection in longer model generations, specifically paragraph-length biographies.

We introduce a biography-generation dataset—FactualBio—available alongside this paper. FactualBio is a collection of biographies of individuals who are notable enough to have Wikipedia pages but not notable enough to have large amounts of detailed coverage, generated by GPT-4 (v.0613). To generate the dataset, we randomly sampled 21 individuals from the WikiBio dataset 64 . For each biography, we generated a list of factual claims contained in each biography using GPT-4, with 150 total factual claims (the total number is only coincidentally a round number). For each of these factual claims, we manually determined whether the claim was correct or incorrect. Out of 150 claims, 45 were incorrect. As before, we apply confabulation detection to detect incorrect model predictions, even though there may be model errors which are not confabulations.

Prompting and generation

Given a paragraph-length piece of LLM-generated text, we apply the following sequence of steps:

Automatically decompose the paragraph into specific factual claims using an LLM (not necessarily the same as the original).

For each factual claim, use an LLM to automatically construct Q questions which might have produced that claim.

For each question, prompt the original LLM to generate M answers.

For each question, compute the semantic entropy of the answers, including the original factual claim.

Average the semantic entropies over the questions to arrive at a score for the original factual claim.

We pursue this slightly indirect way of generating answers because we find that simply resampling each sentence creates variation unrelated to the uncertainty of the model about the factual claim, such as differences in paragraph structure.
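A minimal sketch of the per-factoid procedure, reusing the clustering and discrete-entropy helpers sketched earlier; `llm` stands in for any text-in/text-out model call, and the prompts here are paraphrased rather than the exact templates reproduced below:

```python
def score_factoid(llm, entails, text_so_far, factoid, n_questions=6, n_answers=3):
    # 1. Ask an LLM for questions that could have produced this factual claim.
    q_prompt = (f"Following this text: {text_so_far}\n"
                f"You see the sentence: {factoid}\n"
                f"List {n_questions} short questions that this sentence answers, one per line.")
    questions = [q.strip() for q in llm(q_prompt).splitlines() if q.strip()][:n_questions]
    if not questions:
        return float("nan")  # nothing to score

    entropies = []
    for question in questions:
        # 2. Regenerate several short answers to each question with the original LLM.
        answers = [llm(f"Answer as briefly as possible: {question}") for _ in range(n_answers)]
        answers.append(factoid)  # always include the original claim
        # 3. Cluster by meaning and compute the discrete semantic entropy.
        clusters = cluster_by_meaning(question, answers, entails)
        entropies.append(discrete_semantic_entropy([len(c) for c in clusters]))

    # 4. Average over questions: a high score flags a likely confabulated claim.
    return sum(entropies) / len(entropies)
```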

We decompose the paragraph into factual claims using the following prompt:

Please list the specific factual propositions included in the answer above. Be complete and do not leave any factual claims out. Provide each claim as a separate sentence in a separate bullet point.

We found that we agreed with the decompositions in all cases in the dataset.

We then generate six questions for each of the facts from the decomposition. We generate these questions by prompting the model twice with the following:

Following this text: {text so far} You see the sentence: {proposition} Generate a list of three questions, that might have generated the sentence in the context of the preceding original text, as well as their answers. Please do not use specific facts that appear in the follow-up sentence when formulating the question. Make the questions and answers diverse. Avoid yes-no questions. The answers should not be a full sentence and as short as possible, e.g. only a name, place, or thing. Use the format “1. {question} – {answer}”.

These questions are not necessarily well-targeted and the difficulty of this step is the main source of errors in the procedure. We generate three questions with each prompt, as this encourages diversity of the questions, each question targeting a different aspect of the fact. However, we observed that the generated questions will sometimes miss obvious aspects of the fact. Executing the above prompt twice (for a total of six questions) can improve coverage. We also ask for brief answers because the current version of GPT-4 tends to give long, convoluted and highly hedged answers unless explicitly told not to.

Then, for each question, we generate three new answers using the following prompt:

We are writing an answer to the question “{user question}”. So far we have written: {text so far} The next sentence should be the answer to the following question: {question} Please answer this question. Do not answer in a full sentence. Answer with as few words as possible, e.g. only a name, place, or thing.

We then compute the semantic entropy over these answers plus the original factual claim. Including the original fact ensures that the estimator remains grounded in the original claim and helps detect situations in which the question has been interpreted completely differently from the original context. We make a small modification to handle the fact that GPT-4 generations often include refusals to answer questions. These refusals were not something we commonly observe in our experiments with LLaMA 2, Falcon or Mistral models. If more than half of the answers include one of the strings ‘not available’, ‘not provided’, ‘unknown’ or ‘unclear’ then we treat the semantic uncertainty as maximal.
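A minimal sketch of this refusal heuristic; the marker strings come from the text, while the value used for ‘maximal’ uncertainty is left to the caller:

```python
REFUSAL_MARKERS = ("not available", "not provided", "unknown", "unclear")

def adjusted_entropy(answers, entropy, max_entropy):
    """Return max_entropy when more than half of the answers look like refusals."""
    refusals = sum(any(m in a.lower() for m in REFUSAL_MARKERS) for a in answers)
    return max_entropy if refusals > len(answers) / 2 else entropy
```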

We then average the semantic entropies for each question corresponding to the factual claim to get an entropy for this factual claim.

Despite the extra assumptions and complexity, we find that this method greatly outperforms the baselines.

To compute semantic entailment between the original claim and regenerated answers, we rely on the DeBERTa entailment prediction model as we find empirically that DeBERTa predictions result in higher train-set AUROC than other methods. Because DeBERTa has slightly lower recall than GPT-3.5/4, we use a modified set-up for which we say the answers mean the same as each other if at least one of them entails the other and neither is seen to contradict the other—a kind of ‘non-defeating’ bidirectional entailment check rather than true bidirectional entailment. The good performance of DeBERTa in this scenario is not surprising as both factual claims and regenerated answers are relatively short. We refer to Supplementary Notes 2 and 3 for ablations and experiments regarding our choice of entailment estimator for paragraph-length generations.
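A minimal sketch of this relaxed equivalence check; `nli` stands in for a DeBERTa-MNLI call returning one of the strings 'entailment', 'neutral' or 'contradiction':

```python
def same_meaning(a, b, nli):
    """Relaxed ('non-defeating') check: two short texts are treated as equivalent
    if at least one direction is entailment and neither direction is contradiction."""
    forward, backward = nli(a, b), nli(b, a)
    if "contradiction" in (forward, backward):
        return False
    return "entailment" in (forward, backward)
```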

We implement two baselines. First, we implement a variant of the P (True) method, which is adapted to the new setting. For each factoid, we generate a question with answers in the same way as for semantic entropy. We then use the following prompt:

Question: {question} Here are some brainstormed ideas: {list of regenerated answers} Possible answer: {original answer} Is the possible answer true? Respond with “yes” or “no”.

As we cannot access the probabilities GPT-4 assigns to predicting ‘yes’ and ‘no’ as the next token, we approximate this using Monte Carlo samples. Concretely, we execute the above prompt ten times (at temperature 1) and then take the fraction of answers which was ‘yes’ as our unbiased Monte Carlo estimate of the token probability GPT-4 assigns to ‘yes’.
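A minimal sketch of this Monte Carlo approximation; `llm` stands in for a GPT-4 chat call sampled at temperature 1, and the prompt follows the template above:

```python
def p_true_monte_carlo(llm, question, brainstormed, proposed, n_samples=10):
    """Approximate the probability of 'yes' by the fraction of 'yes' responses
    across repeated samples of the same prompt."""
    prompt = (f"Question: {question}\n"
              f"Here are some brainstormed ideas: {', '.join(brainstormed)}\n"
              f"Possible answer: {proposed}\n"
              'Is the possible answer true? Respond with "yes" or "no".')
    votes = [llm(prompt).strip().lower().startswith("yes") for _ in range(n_samples)]
    return sum(votes) / n_samples
```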

As a second, simpler, baseline we check if the model thinks the answer is true. We simply ask:

Following this text: {text so far} You see this statement: {proposition} Is it likely that the statement is true? Respond with ‘yes’ or ‘no’.

It is interesting that this method ought to perform very well if we think that the model has good ‘self-knowledge’ (that is, if “models mostly know what they don’t know” 24 ) but in fact semantic entropy is much better at detecting confabulations.

Data availability

The data used for the short-phrase and sentence-length generations are publicly available and the released code details how to access it. We release a public version of the FactualBio dataset as part of the code base for reproducing the paragraph-length experiments.

Code availability

We release all code used to produce the main experiments. The code for short-phrase and sentence-length experiments can be found at github.com/jlko/semantic_uncertainty and https://doi.org/10.5281/zenodo.10964366 (ref. 65 ). The code for paragraph-length experiments can be found at github.com/jlko/long_hallucinations and https://doi.org/10.5281/zenodo.10964366 (ref. 65 ).

OpenAI. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).

Gemini Team, Google. Gemini: a family of highly capable multimodal models. Preprint at https://arxiv.org/abs/2312.11805 (2023).

Xiao, Y. & Wang, W. Y. On hallucination and predictive uncertainty in conditional language generation. In Proc. 16th Conference of the European Chapter of the Association for Computational Linguistics 2734–2744 (Association for Computational Linguistics, 2021).

Rohrbach, A., Hendricks, L. A., Burns, K., Darrell, T. & Saenko, K. Object hallucination in image captioning. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing (eds Riloff, E., Chiang, D., Hockenmaier, J. & Tsujii, J.) 4035–4045 (Association for Computational Linguistics, 2018).

Weiser, B. Lawyer who used ChatGPT faces penalty for made up citations. The New York Times (8 Jun 2023).

Opdahl, A. L. et al. Trustworthy journalism through AI. Data Knowl. Eng . 146 , 102182 (2023).

Shen, Y. et al. ChatGPT and other large language models are double-edged swords. Radiology 307 , e230163 (2023).


Schulman, J. Reinforcement learning from human feedback: progress and challenges. Presented at the Berkeley EECS Colloquium. YouTube www.youtube.com/watch?v=hhiLw5Q_UFg (2023).

Ji, Z. et al. Survey of hallucination in natural language generation. ACM Comput. Surv. 55 , 248 (2023).

Maynez, J., Narayan, S., Bohnet, B. & McDonald, R. On faithfulness and factuality in abstractive summarization. In Proc. 58th Annual Meeting of the Association for Computational Linguistics (eds Jurafsky, D., Chai, J., Schluter, N. & Tetreault, J.) 1906–1919 (Association for Computational Linguistics, 2020).

Filippova, K. Controlled hallucinations: learning to generate faithfully from noisy data. In Findings of the Association for Computational Linguistics: EMNLP 2020 (eds Webber, B., Cohn, T., He, Y. & Liu, Y.) 864–870 (Association for Computational Linguistics, 2020).

Berrios, G. Confabulations: a conceptual history. J. Hist. Neurosci. 7 , 225–241 (1998).


Lin, S., Hilton, J. & Evans, O. Teaching models to express their uncertainty in words. Transact. Mach. Learn. Res. (2022).

Evans, O. et al. Truthful AI: developing and governing AI that does not lie. Preprint at https://arxiv.org/abs/2110.06674 (2021).

Amodei, D. et al. Concrete problems in AI safety. Preprint at https://arxiv.org/abs/1606.06565 (2016).

Jiang, Z., Araki, J., Ding, H. & Neubig, G. How can we know when language models know? On the calibration of language models for question answering. Transact. Assoc. Comput. Linguist. 9 , 962–977 (2021).


Desai, S. & Durrett, G. Calibration of pre-trained transformers. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Webber, B., Cohn, T., He, Y. & Liu, Y.) 295–302 (Association for Computational Linguistics, 2020).

Glushkova, T., Zerva, C., Rei, R. & Martins, A. F. Uncertainty-aware machine translation evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2021 (eds Moens, M-F., Huang, X., Specia, L. & Yih, S.) 3920–3938 (Association for Computational Linguistics, 2021).

Wang, Y., Beck, D., Baldwin, T. & Verspoor, K. Uncertainty estimation and reduction of pre-trained models for text regression. Transact. Assoc. Comput. Linguist. 10 , 680–696 (2022).

Baker, S. & Kanade, T. Hallucinating faces. In Proc. Fourth IEEE International Conference on Automatic Face and Gesture Recognition . 83–88 (IEEE, Catalogue no PR00580, 2002).

Eliot, L. AI ethics lucidly questioning this whole hallucinating AI popularized trend that has got to stop. Forbes Magazine (24 August 2022).

Shanahan, M. Talking about large language models. Commun. Assoc. Comp. Machinery 67 , 68–79 (2024).

MacKay, D. J. C. Information-based objective functions for active data selection. Neural Comput. 4 , 590–604 (1992).

Kadavath, S. et al. Language models (mostly) know what they know. Preprint at https://arxiv.org/abs/2207.05221 (2022).

Lindley, D. V. On a measure of the information provided by an experiment. Ann. Math. Stat. 27 , 986–1005 (1956).


Xiao, T. Z., Gomez, A. N. & Gal, Y. Wat zei je? Detecting out-of-distribution translations with variational transformers. In Workshop on Bayesian Deep Learning at the Conference on Neural Information Processing Systems (NeurIPS, Vancouver, 2019).

Christiano, P., Cotra, A. & Xu, M. Eliciting Latent Knowledge (Alignment Research Center, 2021); https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit .

Negri, M., Bentivogli, L., Mehdad, Y., Giampiccolo, D. & Marchetti, A. Divide and conquer: crowdsourcing the creation of cross-lingual textual entailment corpora. In Proc. 2011 Conference on Empirical Methods in Natural Language Processing 670–679 (Association for Computational Linguistics, 2011).

Honovich, O. et al. TRUE: Re-evaluating factual consistency evaluation. In Proc. Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering 161–175 (Association for Computational Linguistics, 2022).

Falke, T., Ribeiro, L. F. R., Utama, P. A., Dagan, I. & Gurevych, I. Ranking generated summaries by correctness: an interesting but challenging application for natural language inference. In Proc. 57th Annual Meeting of the Association for Computational Linguistics 2214–2220 (Association for Computational Linguistics, 2019).

Laban, P., Schnabel, T., Bennett, P. N. & Hearst, M. A. SummaC: re-visiting NLI-based models for inconsistency detection in summarization. Trans. Assoc. Comput. Linguist. 10 , 163–177 (2022).

Joshi, M., Choi, E., Weld, D. S. & Zettlemoyer, L. TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. In Proc. 55th Annual Meeting of the Association for Computational Linguistics 1601–1611 (Association for Computational Linguistics, 2017).

Rajpurkar, P., Zhang, J., Lopyrev, K. & Liang, P. SQuAD: 100,000+ questions for machine compression of text. In Proc. 2016 Conference on Empirical Methods in Natural Language Processing (eds Su, J., Duh, K. & Carreras, X.) 2383–2392 (Association for Computational Linguistics, 2016).

Tsatsaronis, G. et al. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics 16 , 138 (2015).

Article   PubMed   PubMed Central   Google Scholar  

Lee, K., Chang, M.-W. & Toutanova, K. Latent retrieval for weakly supervised open domain question answering. In Proc. 57th Annual Meeting of the Association for Computational Linguistics 6086–6096 (Association for Computational Linguistics, 2019).

Kwiatkowski, T. et al. Natural questions: a benchmark for question answering research. Transact. Assoc. Comput. Linguist. 7 , 452–466 (2019).

Patel, A., Bhattamishra, S. & Goyal, N. Are NLP models really able to solve simple math word problems? In Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds Toutanova, K. et al.) 2080–2094 (Assoc. Comp. Linguistics, 2021).

Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at https://arxiv.org/abs/2307.09288 (2023).

Penedo, G. et al. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. In Proc. 36th Conference on Neural Information Processing Systems (eds Oh, A. et al.) 79155–79172 (Curran Associates, 2023)

Jiang, A. Q. et al. Mistral 7B. Preprint at https://arxiv.org/abs/2310.06825 (2023).

Manakul, P., Liusie, A. & Gales, M. J. F. SelfCheckGPT: Zero-Resource Black-Box hallucination detection for generative large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023 (eds Bouamor, H., Pino, J. & Bali, K.) 9004–9017 (Assoc. Comp. Linguistics, 2023).

Mukhoti, J., Kirsch, A., van Amersfoort, J., Torr, P. H. & Gal, Y. Deep deterministic uncertainty: a new simple baseline. In IEEE/CVF Conference on Computer Vision and Pattern Recognition 24384–24394 (Computer Vision Foundation, 2023).

Schuster, T., Chen, S., Buthpitiya, S., Fabrikant, A. & Metzler, D. Stretching sentence-pair NLI models to reason over long documents and clusters. In Findings of the Association for Computational Linguistics: EMNLP 2022 (eds Goldberg, Y. et al.) 394–412 (Association for Computational Linguistics, 2022).

Barnes, B. & Christiano, P. Progress on AI Safety via Debate. AI Alignment Forum www.alignmentforum.org/posts/Br4xDbYu4Frwrb64a/writeup-progress-on-ai-safety-via-debate-1 (2020).

Irving, G., Christiano, P. & Amodei, D. AI safety via debate. Preprint at https://arxiv.org/abs/1805.00899 (2018).

Der Kiureghian, A. & Ditlevsen, O. Aleatory or epistemic? Does it matter? Struct. Saf. 31 , 105–112 (2009).

Malinin, A. & Gales, M. Uncertainty estimation in autoregressive structured prediction. In Proceedings of the International Conference on Learning Representations https://openreview.net/forum?id=jN5y-zb5Q7m (2021).

Murray, K. & Chiang, D. Correcting length bias in neural machine translation. In Proc. Third Conference on Machine Translation (eds Bojar, O. et al.) 212–223 (Assoc. Comp. Linguistics, 2018).

Holtzman, A., Buys, J., Du, L., Forbes, M. & Choi, Y. The curious case of neural text degeneration. In Proceedings of the International Conference on Learning Representations https://openreview.net/forum?id=rygGQyrFvH (2020).

Fan, A., Lewis, M. & Dauphin, Y. Hierarchical neural story generation. In Proc. 56th Annual Meeting of the Association for Computational Linguistics (eds Gurevych, I. & Miyao, Y.) 889–898 (Association for Computational Linguistics, 2018).

Speaks, J. in The Stanford Encyclopedia of Philosophy (ed. Zalta, E. N.) (Metaphysics Research Lab, Stanford Univ., 2021).

Culicover, P. W. Paraphrase generation and information retrieval from stored text. Mech. Transl. Comput. Linguist. 11 , 78–88 (1968).

Google Scholar  

Padó, S., Cer, D., Galley, M., Jurafsky, D. & Manning, C. D. Measuring machine translation quality as semantic equivalence: a metric based on entailment features. Mach. Transl. 23 , 181–193 (2009).

Androutsopoulos, I. & Malakasiotis, P. A survey of paraphrasing and textual entailment methods. J. Artif. Intell. Res. 38 , 135–187 (2010).

MacCartney, B. Natural Language Inference (Stanford Univ., 2009).

He, P., Liu, X., Gao, J. & Chen, W. Deberta: decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations https://openreview.net/forum?id=XPZIaotutsD (2021).

Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33 , 1877–1901 (2020).

Williams, A., Nangia, N. & Bowman, S. R. A broad-coverage challenge corpus for sentence understanding through inference. In Proc. 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds Walker, M. et al.) 1112–1122 (Assoc. Comp. Linguistics, 2018).

Yu, L., Hermann, K. M., Blunsom, P. & Pulman, S. Deep learning for answer sentence selection. Preprint at https://arxiv.org/abs/1412.1632 (2014).

Socher, R., Huang, E., Pennin, J., Manning, C. D. & Ng, A. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Proceedings of the 24th Conference on Neural Information Processing Systems (eds Shawe-Taylor, J. et al.) (2011)

He, R., Ravula, A., Kanagal, B. & Ainslie, J. Realformer: Transformer likes residual attention. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (eds Zhong, C., et al.) 929–943 (Assoc. Comp. Linguistics, 2021).

Tay, Y. et al. Charformer: fast character transformers via gradient-based subword tokenization. In Proceedings of the International Conference on Learning Representations https://openreview.net/forum?id=JtBRnrlOEFN (2022).

Kane, H., Kocyigit, Y., Abdalla, A., Ajanoh, P. & Coulibali, M. Towards neural similarity evaluators. In Workshop on Document Intelligence at the 32nd conference on Neural Information Processing (2019).

Lebret, R., Grangier, D. & Auli, M. Neural text generation from structured data with application to the biography domain. In Proc. 2016 Conference on Empirical Methods in Natural Language Processing (eds Su, J. et al.) 1203–1213 (Association for Computational Linguistics, 2016).

Kossen, J., jlko/semantic_uncertainty: Initial release v.1.0.0. Zenodo https://doi.org/10.5281/zenodo.10964366 (2024).

Download references

Acknowledgements

We thank G. Irving, K. Perlin, J. Richens, L. Rimell and M. Turpin for their comments or discussion related to this work. We thank K. Handa for his help with the human evaluation of our automated accuracy assessment. We thank F. Bickford Smith and L. Melo for their code review. Y.G. is supported by a Turing AI Fellowship funded by the UK government’s Office for AI, through UK Research and Innovation (grant reference EP/V030302/1), and delivered by the Alan Turing Institute.

Author information

These authors contributed equally: Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn

Authors and Affiliations

OATML, Department of Computer Science, University of Oxford, Oxford, UK

Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn & Yarin Gal


Contributions

S.F. led the work from conception to completion and proposed using bidirectional entailment to cluster generations as a way of computing entropy in LLMs. He wrote the main text, most of the Methods and Supplementary Information and prepared most of the figures. J.K. improved the mathematical formalization of semantic entropy; led the extension of semantic entropy to sentence- and paragraph-length generations; wrote the code for, and carried out, all the experiments and evaluations; wrote much of the Methods and Supplementary Information and prepared drafts of many figures; and gave critical feedback on the main text. L.K. developed the initial mathematical formalization of semantic entropy; wrote code for, and carried out, the initial experiments around semantic entropy and its variants which demonstrated the promise of the idea and helped narrow down possible research avenues to explore; and gave critical feedback on the main text. Y.G. ideated the project, proposing the idea to differentiate semantic and syntactic diversity as a tool for detecting hallucinations, provided high-level guidance on the research and gave critical feedback on the main text; he runs the research laboratory in which the work was carried out.

Corresponding author

Correspondence to Sebastian Farquhar.

Ethics declarations

Competing interests

S.F. is currently employed by Google DeepMind and L.K. by OpenAI. For both, this paper was written under their University of Oxford affiliation. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature thanks Mirella Lapata and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Algorithm outline for bidirectional entailment clustering.

Given a set of outputs generated in response to a context, the bidirectional entailment clustering algorithm returns a set of sets of outputs that have been classified as sharing a meaning.
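
For a concrete picture of this clustering step, the following is a minimal sketch, assuming a greedy pass in which each sampled answer is compared against one representative of every existing cluster. The `entails` callable is a hypothetical stand-in for whatever entailment judge is used (for example, an NLI classifier or an LLM prompt) and is not the released code's API.

```python
# Hedged sketch of bidirectional entailment clustering (illustrative only).
from typing import Callable, List


def cluster_by_bidirectional_entailment(
    context: str,
    answers: List[str],
    entails: Callable[[str, str, str], bool],
) -> List[List[str]]:
    """Group sampled answers into sets that share a meaning.

    Two answers are placed in the same cluster when each entails the other
    given the context (bidirectional entailment). Each new answer is compared
    against one representative of every existing cluster; if none matches,
    it starts a new cluster.
    """
    clusters: List[List[str]] = []
    for answer in answers:
        placed = False
        for cluster in clusters:
            representative = cluster[0]
            if entails(context, answer, representative) and entails(
                context, representative, answer
            ):
                cluster.append(answer)
                placed = True
                break
        if not placed:
            clusters.append([answer])  # start a new meaning cluster
    return clusters
```

With the clusters in hand, a discrete estimate of semantic entropy can be formed from the fraction of sampled answers falling in each cluster, as sketched after the Supplementary information description below.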

Supplementary information

Supplementary Information

Supplementary Notes 1–7, Figs. 1–10, Tables 1–4 and references. Includes, worked example for semantic entropy calculation, discussion of limitations and computational cost of entailment clustering, ablation of entailment prediction and clustering methods, discussion of automated accuracy assessment, unaggregated results for sentence-length generations and further results for short-phrase generations.
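
As a rough indication of the quantity that worked example computes, here is a sketch of the discrete semantic-entropy estimator, assuming cluster probabilities are estimated from the fraction of sampled answers assigned to each meaning cluster (the notation is illustrative rather than a verbatim reproduction of the paper's):

```latex
% Discrete semantic entropy for input x over meaning clusters C, with
% p(c | x) estimated as the fraction of sampled answers assigned to cluster c.
\mathrm{SE}(x) \approx -\sum_{c \in C} \hat{p}(c \mid x)\,\log \hat{p}(c \mid x),
\qquad
\hat{p}(c \mid x) = \frac{|c|}{\sum_{c' \in C} |c'|}.
```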

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Farquhar, S., Kossen, J., Kuhn, L. et al. Detecting hallucinations in large language models using semantic entropy. Nature 630, 625–630 (2024). https://doi.org/10.1038/s41586-024-07421-0


Received: 17 July 2023

Accepted: 12 April 2024

Published: 19 June 2024

Issue Date: 20 June 2024

DOI: https://doi.org/10.1038/s41586-024-07421-0

