Logistic Regression: A Basic Approach

  • Conference paper
  • First Online: 31 May 2023


  • Naman Kaur &
  • Himanshu

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 623)

Included in the following conference series:

  • International Conference on Information and Communication Technology for Competitive Strategies


In data mining, and in binary data classification in particular, logistic regression is among the most widely used procedures. Although it is most often applied with a dichotomous dependent variable, logistic regression can also handle multiple dependent variables. The goal of this research is to provide an overview of the logistic regression model: why logistic regression is needed in addition to linear regression, the similarities and differences between the two, how independent variables are selected and how many should be included, and the primary assumptions underlying the model. The aim is an overview of the most significant components of logistic regression for data modeling, including how logistic regression behaves when the data are irregular or the outcome of interest is a rare occurrence.



Author information

Authors and Affiliations

Department of Mathematics, Chandigarh University, Sahibzada Ajit Nagar, 140413, India

Naman Kaur &  Himanshu

Department of Computer Science & Engineering, Chandigarh University, Mohali, India


Corresponding author

Correspondence to Naman Kaur.

Editor information

Editors and Affiliations

Global Knowledge Research Foundation, Ahmedabad, Gujarat, India

Amit Joshi

Nottingham Trent University, Nottingham, UK

Mufti Mahmud

University of Peradeniya, Kandy, Sri Lanka

Roshan G. Ragel


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Kaur, N., Himanshu (2023). Logistic Regression: A Basic Approach. In: Joshi, A., Mahmud, M., Ragel, R.G. (eds) Information and Communication Technology for Competitive Strategies (ICTCS 2022). ICTCS 2022. Lecture Notes in Networks and Systems, vol 623. Springer, Singapore. https://doi.org/10.1007/978-981-19-9638-2_41


DOI: https://doi.org/10.1007/978-981-19-9638-2_41

Published: 31 May 2023

Publisher Name: Springer, Singapore

Print ISBN: 978-981-19-9637-5

Online ISBN: 978-981-19-9638-2




Logistic regression: A simple primer

Pal, Ankita

Mahamana Pandit Madan Mohan Malaviya Cancer Center, and Homi Bhabha Cancer Hospital, Tata Memorial Center, Varanasi, Uttar Pradesh, India

Address for correspondence: Ankita Pal, Mahamana Pandit Madan Mohan Malaviya Cancer Centre, Varanasi, Uttar Pradesh, India. E-mail: [email protected]

Received July 17, 2021

Received in revised form August 31, 2021

Accepted September 18, 2021

This is an open-access article distributed under the terms of the Creative Commons Attribution-Noncommercial-Share Alike 4.0 Unported, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Logistic regression is used to obtain the odds ratio in the presence of more than one explanatory variable. This procedure is quite similar to multiple linear regression, with the only exception that the response variable is binomial. The result is the impact of each variable on the odds ratio of the observed event of interest. The main advantage of performing logistic regression is to avoid the effects of confounders by analyzing the association of all the variables together. In this article, we explain how to perform a logistic regression using practical examples. After defining the technique, the assumptions that need to be checked are explained, along with the process of checking them using the R software.

INTRODUCTION

Most of our understanding of biological effects and their determinants is gained through statistical analysis. Clinical studies that evaluate the relative contributions of various factors to a binary outcome, such as death or disease, are the most common way of gaining this understanding. In this article, we aim to provide a brief and simplified outline of performing a logistic regression, which would be sufficient to permit clinicians who are unfamiliar with regression methodology to understand and interpret the results.[1]

LOGISTIC REGRESSION

Multivariate logistic regression is the statistical technique used when we wish to estimate the probability of a dichotomous outcome, such as the presence or absence of disease, or of death. The probability of the outcome is referred to as the dependent variable, and the various factors that influence it are the independent variables, sometimes termed risk factors.

The probability of an outcome is expressed as a proportion or a percentage. For instance, suppose there were 600 patients with cancer, of whom 30 died. The proportion of deaths is 30/600, or 0.05 or 5%. In general, the results of logistic regression are presented in terms of the odds rather than the probability of the outcome. There is a direct relationship between probabilities and odds: the odds of the occurrence are the probability of the outcome occurring divided by the probability of the outcome not occurring. In this example, the odds of death were obtained by dividing 0.05, the proportion of deaths, by 0.95, the proportion of survivors, giving odds of 1:19. The probability of death can be obtained from the odds simply by dividing the odds by 1 plus the odds, or (1/19)/(1 + 1/19) = 0.05.[2]
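
The short R sketch below simply reproduces this arithmetic with the numbers from the example above.

```r
deaths   <- 30
patients <- 600
p    <- deaths / patients   # proportion of deaths: 0.05
odds <- p / (1 - p)         # odds of death: 0.0526..., i.e. 1:19
odds / (1 + odds)           # back from odds to probability: 0.05
```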

Logistic regression uses the past experience of a group of patients to estimate the odds of an outcome by mathematically modeling or simulating that experience and describing it by means of a regression equation. Symbolically, a logistic regression equation is given as,

ℓ = log[p / (1 − p)] = β0 + β1x1 + β2x2

where

  • x1 and x2 are the two predictor variables
  • Y is a binary (Bernoulli) response variable, with p = P(Y = 1)
  • ℓ is the log-odds
  • β0, β1, and β2 are the parameters of the model.

A key feature in modeling a clinical experience is the selection of the independent variables that influence the result. The method for calculating the regression coefficients takes into consideration all the possible combinations of the independent variables. It then maximizes the likelihood that, for any given individual with a specific combination of independent variables, the predicted chance of the result is as close as possible to the actual, observed outcome, and likewise for all other individuals possessing the same combination of independent variables.[2]

The general form of the logistic regression equation is similar to that of multivariate linear regression; however, the logarithm of the odds of the outcome, termed the logit or log-odds, is used as the dependent variable. The regression coefficients are also expressed as natural logarithms.
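
As an illustration of the logit scale, the base-R sketch below converts a probability to log-odds and back, and exponentiates a made-up coefficient to recover the corresponding odds ratio.

```r
p <- 0.05
qlogis(p)           # log-odds (logit) of the probability: about -2.94
plogis(qlogis(p))   # back to the probability: 0.05

b1 <- 0.40          # a made-up coefficient on the log-odds scale
exp(b1)             # the corresponding odds ratio: about 1.49
```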

LOGISTIC REGRESSION DIAGNOSTICS

Many assumptions need to be checked before performing a Logistic Regression analysis. The assumptions are listed below along with a guide on how to check them with the help of R.

Dependent variable

The first assumption is that binary logistic regression requires the dependent variable to be binary, while in the case of ordinal logistic regression, the dependent variable needs to be ordinal.

Independent observations

In order to perform a logistic regression, the observations need to be independent of each other. In other words, the observations should not come from repeated measurements or matched data.

Large sample size

In general, logistic regression typically requires a large sample size. A common guideline is that one needs a minimum of 10 cases with the least frequent outcome for each independent variable in the model. For example, if there are 5 independent variables and the expected probability of the least frequent outcome is 0.10, then a minimum sample size of 500 (10 * 5 / 0.10) will be required.[3]
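
That guideline reduces to a one-line calculation; the values below are the ones from the example.

```r
k <- 5        # number of independent variables in the model
p <- 0.10     # expected probability of the least frequent outcome
10 * k / p    # suggested minimum sample size: 500
```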

Linearity assumption

The linear relationship between the continuous predictor variables and the logit of the outcome is checked. This can be done by visually inspecting the scatter plot between each predictor and the logit values.

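A minimal sketch of this check is shown below. It assumes a fitted model object called model and a data frame dat holding the continuous predictors (with no missing values); one panel is drawn per predictor.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

probs <- predict(model, type = "response")     # fitted probabilities
dat_num <- select(dat, where(is.numeric))      # keep the continuous predictors
dat_num$logit <- log(probs / (1 - probs))      # logit of the fitted values

dat_num %>%
  pivot_longer(-logit, names_to = "predictor", values_to = "value") %>%
  ggplot(aes(x = logit, y = value)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess") +
  facet_wrap(~ predictor, scales = "free_y")
```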

If the scatter plot shows non-linearity, other methods are needed to build the model, such as including second- or third-power terms, fractional polynomials, or spline functions.

Influential values

Influential values are extreme individual data points that can alter the quality of the logistic regression model. The most extreme values in the data can be examined by visualizing the Cook’s distance values. Here we label the top 3 largest values.

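One simple way to produce such a plot, assuming the fitted model object is called model, is the built-in diagnostic plot in base R:

```r
# Cook's distance for each observation, labelling the 3 most extreme points
plot(model, which = 4, id.n = 3)
```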

A point to be noted is that not all outliers are influential observations. To check whether the data contain potential influential observations, the standardized residual error can be inspected. Data points with an absolute standardized residual above 3 represent possible outliers and may deserve closer attention.

The following R code computes the standardized residuals (.std.resid) and the Cook’s distance (.cooksd) using the R function augment() [broom package].

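A sketch of that computation with broom and dplyr, again assuming the fitted model object is called model:

```r
library(broom)
library(dplyr)

model_diag <- augment(model) %>%
  mutate(index = row_number())

# The 3 observations with the largest Cook's distance
model_diag %>% arrange(desc(.cooksd)) %>% slice(1:3)

# Possible outliers: absolute standardized residual above 3
model_diag %>% filter(abs(.std.resid) > 3)
```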

When outliers are present in a continuous predictor, the potential solutions include:

  • Removing the concerned records
  • Transforming the data into a log scale
  • Using non-parametric methods

Multicollinearity

Multicollinearity corresponds to a situation in which the data contain highly correlated predictor variables. Multicollinearity is an important issue in regression analysis and should be fixed by removing the concerned variables. It can be assessed using the R function vif()[car package], which computes the variance inflation factors:

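For example, assuming the fitted model object is called model:

```r
library(car)
vif(model)   # one variance inflation factor per predictor
```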

As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.

LOGISTIC REGRESSION IN R

The general mathematical equation for logistic regression that is used in R software is,

y = 1 / (1 + e^−(a + bx))

  • y is the response variable
  • x is the predictor variable
  • a and b are the coefficients which are numeric constants.

The function used to create the regression model is the glm() function.

The basic syntax of the glm() function from the stats package, as used for logistic regression, is,

glm(formula, family, data, ...)

The description of the parameters mentioned in the above function is,

formula: an object of class “formula” (or one that can be coerced to that class); a symbolic description of the model to be fitted.

family: a description of the error distribution and link function to be used in the model. For glm this can be a character string naming a family function or the result of a call to a family function.

data: an optional data frame, list or environment containing the variables in the model.

Real-life example

Consider a real-life dataset, the Cleveland Heart Disease dataset, which contains information about patients who do or do not have heart disease, along with many medical indicators. The database contains 76 attributes capturing the medical history of the patients, including patients of Hungarian and Swiss origin. The dataset is available online at: https://archive.ics.uci.edu/ml/datasets/heart+Disease.

The aim is to predict if a person has heart disease or not based on attributes such as blood pressure, heart rate, and others. Here, the dependent/response variable is target (whether the patient has heart disease or not) which is a binary variable, as it only takes the values 0 (= No) or 1 (= Yes). All the other variables are independent/predictor variables that will be used for predicting the response variable.

Therefore, a Logistic Regression Model is built in R with the help of the following R code.

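A sketch of the model-fitting code is given below. The file name and the preprocessing are illustrative; the only assumptions carried over from the text are that the binary outcome column is called target and that all remaining columns are used as predictors.

```r
# Read a local copy of the heart disease data (file name is illustrative)
heart <- read.csv("heart.csv")
heart$target <- factor(heart$target)   # 0 = no heart disease, 1 = heart disease

# Model target using every other available feature
model <- glm(target ~ ., data = heart, family = binomial())
summary(model)
```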

To understand the above code, it can be broken down into parts and explained.

glm is the generalized linear model we will be using.

target ~ . means that we want to model target as a function of (~) every other available feature (denoted by the dot).

family = binomial() is used because we are predicting a binary outcome. On running the above code, the result obtained is as below.

(Output: the coefficient table from summary(model), including estimates, standard errors, z values, and Pr(>|z|) values.)

From the above output, it can be seen that many of the variables are not significant, as judged by their P values (denoted Pr(>|z|)). Hence, the least significant variables are removed one by one, applying the glm function each time and checking for the best model. The best logistic regression model obtained in this way is then used for predicting the response variable.

CONCLUSIONS

No one knows better than a doctor how multiple factors can combine to produce patient outcomes. Logistic regression analysis is a powerful tool for assessing the relative importance of factors that determine outcome. It is increasingly used in clinical medicine to develop diagnostic algorithms and evaluate prognosis. Yet, this tool is both imperfect and subject to misuse. An article by Shahian et al .[ 4 ] describes the deficiencies of the method as currently employed in the production of “report cards.” A basic understanding of logistic regression analysis is the first step to appreciating both the usefulness and the limitations of the technique.

Financial support and sponsorship

Conflicts of interest.

There are no conflicts of interest.


Keywords: diagnostics; logistic regression; odds ratio; R; regression analysis

  • Open access
  • Published: 28 December 2018

A logistic regression investigation of the relationship between the Learning Assistant model and failure rates in introductory STEM courses

  • Jessica L. Alzen (ORCID: orcid.org/0000-0002-1706-2975),
  • Laurie S. Langdon &
  • Valerie K. Otero

International Journal of STEM Education, volume 5, Article number: 56 (2018)


Large introductory STEM courses historically have high failure rates, and failing such courses often leads students to change majors or even drop out of college. Instructional innovations such as the Learning Assistant model can influence this trend by changing institutional norms. In collaboration with faculty who teach large-enrollment introductory STEM courses, undergraduate learning assistants (LAs) use research-based instructional strategies designed to encourage active student engagement and elicit student thinking. These instructional innovations help students master the types of skills necessary for college success such as critical thinking and defending ideas. In this study, we use logistic regression with pre-existing institutional data to investigate the relationship between exposure to LA support in large introductory STEM courses and general failure rates in these same and other introductory courses at University of Colorado Boulder.

Our results indicate that exposure to LA support in any STEM gateway course is associated with a 63% reduction in odds of failure for males and a 55% reduction in odds of failure for females in subsequent STEM gateway courses.

Conclusions

The LA program appears related to lower course failure rates in introductory STEM courses, but each department involved in this study implements the LA program in different ways. We hypothesize that these differences may influence student experiences in ways that are not apparent in the current analysis, but more work is necessary to support this hypothesis. Despite this potential limitation, we see that the LA program is consistently associated with lower failure rates in introductory STEM courses. These results extend the research base regarding the relationship between the LA program and positive student outcomes.

Science, technology, engineering, and mathematics (STEM) departments at institutes of higher education historically offer introductory courses that can serve up to 1000 students per semester. Introductory courses of this size, often referred to as “gateway courses,” are cost-effective due to the number of students able to receive instruction in each semester, but they often lend themselves to lecture as the primary method of instruction. Thus, there are few opportunities for substantive interaction between the instructor and students or among students (Matz et al., 2017; Talbot, Hartley, Marzetta, & Wee, 2015). Further, these courses typically have high failure rates (Webb, Stade, & Grover, 2014) and lead many students who begin as STEM majors to either switch majors or drop out of college without a degree (Crisp, Nora, & Taggart, 2009). In efforts to address these issues, STEM departments across the nation now implement active engagement strategies in their classes such as peer instruction and interactive student response systems (i.e., clicker questions) during large lecture meetings (Caldwell, 2007; Chan & Bauer, 2015; Mitchell, Ippolito, & Lewis, 2012; Wilson & Varma-Nelson, 2016). In addition to these classroom-specific active engagement strategies, there are also programs designed to guide larger instructional innovations at the institutional level, such as the Learning Assistant (LA) model.

The LA model was established at University of Colorado Boulder in 2001. The program represents an effort to change institutional values and practices through a low-stakes, bottom-up system of course assistance. The program supports faculty to facilitate increased learner-centered instruction in ways that are most valued by the individual faculty member. A key component of the LA model is undergraduate learning assistants (LAs). LAs are undergraduate students who, through guidance, encourage active engagement in classes. LAs facilitate discussions, help students manage course material, offer study tips, and motivate students. LAs also benefit as they develop content mastery, teaching, and leadership skills. LAs get a monthly stipend for working 10 h per week, and they also receive training in teaching and learning theories by enrolling in a math and science education seminar taught by discipline-based education researchers. In addition, LAs meet with faculty members once a week to develop deeper understanding of the content, share insights about how students are learning, and prepare for future class meetings (Otero, 2015 ).

LAs are not peer tutors and typically do not work one-on-one with students. They do not provide direct answers to questions or systematically work out problems with students. Instead, LAs facilitate discussion about conceptual problems among groups of students and they focus on eliciting student thinking and helping students make connections between concepts. This is typically done both in the larger lecture section of the course as well as smaller meetings after the weekly lectures, often referred to as recitation. LAs guide students in learning specific content, but also in developing and defending ideas—important skills for higher-order learning in general. The model for training LAs and the design of the LA program at large are aimed at making a difference in the ways students think and learn in college overall and not just in specific courses. That is, we expect exposure to the program to influence student success in college generally.

Prior research indicates a positive relationship between exposure to LAs and course learning outcomes in STEM courses (Pollock, 2009 ; Talbot et al., 2015 ). Other research suggests that modifying instruction to be more learner-centered helps to address high failure rates (Cracolice & Deming, 2001 ; Close, Mailloux-Huberdeau, Close, & Donnelly, 2018 ; Webb et al., 2014 ). This study seeks to further understand the relationship between the LA program and probability of student success. Specifically, we answer the following research question: How do failure rates in STEM gateway courses compare for students who do and do not receive LA support in any STEM gateway course? We investigate this question because, as a model for institutional change, we expect that LAs help students develop skills and dispositions necessary for success in college such as higher-order thinking skills, navigating course content, articulating and defending ideas, and feelings of self-efficacy. Since skills such as these extend beyond a single course, we investigate the extent to which students exposed to the LA program have lower failure rates in STEM gateway courses generally than students who are not exposed to the program.

Literature review

The LA model is not itself a research-based instructional strategy. Instead, it is a model of social and structural organization that induces and supports the adoption of existing (or creation of new) research-based instructional strategies that require an increased teacher-student ratio. The LA program is, at its core, a faculty development program. However, it does not push specific reforms or try to change faculty directly. Instead, the opt-in program offers resources and structures that lead to changes in values and practices among faculty, departments, students, and the institution (Close et al., 2018; Sewell, 1992). Faculty members write proposals to receive LAs (these proposals must involve course innovation using active engagement and student collaboration), students apply to be LAs, and departments match funding for their faculty’s requests for LAs. Thus, the LA program has become a valued part of the campus community.

The body of research that documents the relationship between student outcomes and the LA program is growing. Pollock ( 2006 ) provided evidence regarding the relationship between instructional innovation including LAs and course outcomes in introductory physics courses at University of Colorado Boulder by comparing three different introductory physics course models (outlined in Table  1 ).

Pollock provides two sources of evidence related to student outcomes regarding the relative effectiveness of these three course models. First, he discussed average normalized learning gains on the force and motion concept evaluation (FMCE; Thornton & Sokoloff, 1998 ) generally. The FMCE is a concept inventory commonly used in undergraduate physics education to provide information about student learning on the topics of force and motion. Normalized learning gains are calculated by finding the difference in average post-test and pre-test in a class and dividing that value by the difference between 100 and the average pre-test score. It is conceptualized as the amount the students learned divided by the amount they could have learned (Hake, 1998 ).
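
As a concrete illustration (not code from the studies cited), the normalized gain can be written as a one-line R function, with scores expressed as percentages:

```r
normalized_gain <- function(pre, post) (post - pre) / (100 - pre)

normalized_gain(pre = 40, post = 70)   # 0.5: half of the possible gain was achieved
```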

Prior research suggests that traditional instructional strategies yield an average normalized learning gain of about 15% and research-based instructional methods such as active engagement and collaborative learning yield on average about 63% average normalized learning gains (Thornton, Kuhl, Cummings, & Marx, 2009 ). The approach using the University of Washington Tutorials with LAs saw a normalized learning gain of 66% on the FMCE from pre-test to post-test. Average learning gains for the approach using Knight’s ( 2004 ) workbooks with TAs were about 59%, and average normalized learning gains for the traditional approach were about 45%. The average normalized learning gains for all three methods in Pollock’s study are much higher than what the literature would expect from traditional instruction, but the course model including LAs is aligned with what is expected from research-based instructional strategies. Second, Pollock further investigated the impact of the different course implementations on higher and lower achieving students on FMCE scores. To do this, he considered students with high pre-test scores (those with pre-test scores > 50%) and students with low pre-test scores (those with pre-test scores < 15%). For both groups of students, the course implementation that included recitation facilitated by trained TAs and LAs had the highest normalized learning gains as measured by the FMCE.

In a similar study at Florida International University, Goertzen et al. ( 2011 ) investigated the influence of instructional innovations through the LA program in introductory physics. As opposed to the University of Washington Tutorials in the Pollock ( 2006 ) study, the research-based curriculum materials used by Florida International University were Open Source Tutorials (Elby, Scherr, Goertzen, & Conlin, 2008 ) developed at University of Maryland, College Park. Goertzen et al. ( 2011 ) used the Force Concept Inventory (FCI; Hestenes, Wells, & Swackhamer, 1992 ) as the outcome of interest in their study. Despite the different curriculum from the Pollock ( 2006 ) context, Goertzen et al. found that those students exposed to the LA-supported courses had a 0.24 increase in mean raw gain in scores from pre-test to post-test while students in classes that did not include instructional innovations only saw raw gains of 0.16.

In an attempt to understand the broader relationship between the LA program and student outcomes, White et al. (2016) investigated the impacts of the LA model on student learning in physics across institutions. In their study, White et al. used paired pre-/post-tests from four concept inventories (FCI, FMCE, Brief Electricity and Magnetism Assessment [BEMA; Ding, Chabay, Sherwood, & Beichner, 2006], and Conceptual Survey of Electricity and Magnetism [CSEM]) at 17 different institutions. Researchers used data contributed to the Learning Assistant Alliance through their online assessment tool, Learning About STEM Student Outcomes (LASSO). This platform allows institutions to administer several common concept inventories, with data securely stored on a central database to make investigation across institutions possible (Learning Assistant Alliance, 2018). In order to identify differences in learning gains for students who did and did not receive LA support, White et al. tested differences in course mean effect sizes between the two groups using a two-sample t test. Across all of the concept inventories, White et al. found average Cohen’s d effect sizes 1.4 times higher for LA-supported courses compared to courses that did not receive LA support.

The research about the LA model shows that students exposed to the model tend to have better outcomes than those in more traditional lecture-based learning environments. However, due to the design of the program and the goals of the LA model, there is a reason to expect that there are implications for more long-term outcomes. LAs are trained to help students develop skills such as developing and defending ideas, making connections between concepts, and solving conceptual problems. Prior research suggests that skills such as these develop higher-order thinking for students. Martin et al. ( 2007 ) compared learning outcomes and innovative problem-solving for biomedical engineering students in inquiry-based, active engagement and traditional lecture biotransport courses. They found that both groups reached similar learning gains but that the active engagement group showed greater improvement in innovative thinking abilities. In a similar study, Jensen and Lawson ( 2011 ) investigated achievement and reasoning gains for students in either inquiry-based, active engagement or lecture-based, didactic instruction in undergraduate biology. Results indicated that students in active engagement environments outperformed students in didactic environments on more cognitively demanding items, while the groups performed equally well on items requiring low levels of cognition. In addition, students in active engagement groups showed greater ability to transfer reasoning among contexts.

This research suggests that active engagement such as what is facilitated with the LA model may do more than help students gain knowledge in a particular discipline in a particular course. Over and above, active engagement helps learners grow in reasoning and transfer abilities generally. This increase in higher-order thinking may help students to develop skills that extend beyond the immediate course. However, there is only one study focused on the LA model that investigates long-term outcomes related to the program. Pollock ( 2009 ) investigated the potential long-term relationship between exposure to the LA program and conceptual understanding in physics. In this line of inquiry, Pollock compared BEMA assessment scores for those upper-division physics majors who did and did not receive LA support in their introductory Physics II course, the course in which electricity and magnetism is first covered. Pollock’s results indicate that those students who received LA support in Physics II had higher BEMA scores following upper-division physics courses than those students who did not receive LA support in Physics II. This research provides some evidence to the long-term relationship between exposure to the LA program and conceptual learning. In the current study, we continue this line of inquiry by investigating the relationship between receiving LA support in a gateway course and the potential relationship to course failure in subsequent gateway courses. This study also contributes to the literature on the LA program as no prior research attempts to examine the relationship between taking LA-supported courses and student outcomes while controlling for variables that may confound this relationship. This study thus represents an extension of the previous work regarding the LA model in terms of both the methodology and the outcome of interest.

Data for this study come from administrative records at University of Colorado Boulder. We focus on 16 cohorts of students who entered the university as full-time freshmen for the first time each fall semester from 2001 to 2016 and took Physics I/II, General Chemistry I/II, Calculus I/II (Math department), and/or Calculus I/II for Engineers (Applied Math department). The dataset includes information for 32,071 unique students, 23,074 of whom took at least one of the above courses with LA support. Student-level data includes information such as race/ethnicity, gender, first-generation status, and whether a student ever received financial aid. Additional variables include number of credits upon enrollment, high school grade point average (GPA), and admissions test scores. We translate SAT total scores to ACT Composite Scores using a concordance table provided by the College Board to have a common admissions test score for all students (College Board, 2016 ). We exclude students with no admissions test scores (about 6% of the sample). We also have data on the instructor of record for each course. The outcome of interest in this study is failing an introductory STEM course. We define failing as receiving either a D or an F or withdrawing from the course altogether after the university drop date (i.e., “DFW”).

An important consideration in creating the data set for this study is the timing of receiving LA support relative to taking any STEM gateway course. The data begin with all students who took at least one of the courses included in this study. We keep all students who took all of their STEM gateway courses either with or without LA support. We also include all students who received LA support in the very first STEM gateway course they took, regardless of whether they had LA support in subsequent STEM gateway courses. We would exclude any student who took a STEM gateway course without LA support and then took another STEM gateway course in a subsequent semester with LA support.

This data limitation ensures that exposure to the LA program happened before or at the same time as the opportunity to fail any STEM gateway course. If it were the case that a student failed a STEM gateway course without LA support, say, in their first year and then took LA-supported courses in the second year, this student would be indicated as an LA student in the data, but the courses taken during the first year would not have been affected by the LA program. Students with experiences such as this would misrepresent the relationship between being exposed to the LA program and probability of course failure. Conveniently, there were not any students with this experience in the current dataset. In other words, for every student in our study who took more than one of the courses of interest, their first experience with any of the STEM gateway courses under consideration included LA support if there was ever exposure to the LA program. Although we did not have to exclude any students from our study for timing reasons, other institutions carrying out similar studies should carefully consider such cases when finalizing their data for analysis.

We provide Fig.  1 as a way for readers to gain a better understanding of the adoption of the LA program in each of the departments in this study. This figure also gives information regarding the number of students exposed to LAs or not in each department, course, and term in our study.

Figure 1. Course enrollment over time by LA exposure

Ideally, we would design a controlled experiment to estimate the causal effect of LA exposure on the probability of failing introductory STEM courses. To do this, we would need two groups of students: first, those who were exposed to LA support in a STEM gateway course, and second, a comparable group, on average, that significantly differed only in that they were not exposed to LA support in any STEM gateway course. However, many institutions do not begin their LA programs with such studies in mind, so the available data do not come from a controlled experiment. Instead, we must rely on historical institutional data that was not gathered for this type of study. Thus, this study not only contributes to the body of literature regarding the relationship between LA exposure and student outcomes, but it also serves as a model for other institutions with LA programs that would like to use historical institutional data for similar investigations.

Selection bias

The ways students are assigned to receive LA support in each of the departments represented in this study are not random, and the ways LAs are used in each department are not identical. These characteristics of pre-existing institutional data manifest themselves as issues related to selection bias within a study. For example, in the chemistry department, LA support was only offered in the “on semester” sections of chemistry from 2008 to 2013. “On semester” indicates General Chemistry I in the fall and General Chemistry II in the spring. Thus, there were few opportunities for those students who took the sequence in the “off semester,” or General Chemistry I in the spring and General Chemistry II in the fall to receive LA support in these courses during the span of time covered in this analysis. The most typical reasons why students take classes in the “off semester” are that they simply prioritize other courses more in the fall semester, so there is insufficient space to take General Chemistry I; they do not feel prepared for General Chemistry I in the fall and take a more introductory chemistry class first; or they fail General Chemistry I the first time in the fall and re-take General Chemistry I in the spring. This method of assignment to receiving LA support may overstate the relationship between receiving LA support and course failure in this department. That is, it might be the case that those students who received LA support were those who were more likely to pass introductory chemistry to begin with. Our analysis includes prior achievement variables (described below) to attempt to address these selection bias issues.

In chemistry, LAs attend the weekly lecture meetings and assist small groups of students during activities such as answering clicker questions. Instructors present questions designed to elicit student levels of conceptual understanding. The questions are presented to the students; they discuss the questions in groups and then respond using individual clickers based on their selection from one of several multiple-choice options. LAs help students think about and answer these questions in the large lecture meetings. In addition, every student enrolled in General Chemistry I and II is also enrolled in a recitation section. Recitations are smaller group meetings of approximately 20 students. In these recitation sections, LAs work with graduate TAs to facilitate small group activities related to the weekly lecture material. The materials for these recitation sections are created by the lead instructor for the course and are designed to help students investigate common areas of confusion related to the weekly material.

In the physics and math departments, the introductory courses went from no LA support in any section in any semester to all sections in all semesters receiving LA support. This historical issue affects selection bias in a different way than the off-semester chemistry sequence. One interpretation of decreased course failure rates could be that LA support caused the difference. However, we could not rule out the possibility that failure rates decreased due to other factors that also changed over time. It could be that the university implemented other student supports in addition to the LA model at the same time or that the types of students who enrolled in STEM courses changed. There is no way to determine conclusively which of these (or other) factors may have caused changes in failure rates. Thus, causal estimates of the effect of LA support on failure rates would be threatened by any historic changes that occurred. We have no way of knowing if we might over or underestimate the relationship between LA exposure and course failure rates due to the ways students were exposed (or not) to the LA program in these departments. In order to address this issue, we control for student cohort. This adjustment, described below, attempts to account for differences that might exist among cohorts of students that might be related to probability of failing a course.

The use of LAs in the math department only occurs during weekly recitation meetings. During this weekly meeting, students work in small groups to complete carefully constructed activities designed to enhance conceptual understanding of the materials covered during the weekly lecture. An anomaly in the math department is that though Calculus I/II are considered gateway courses, the math department at this institution is committed to keeping course enrollment under 40. This means that LA support is tied to smaller class sizes in this department. However, since this condition is constant across the timeframe in our study, it does not influence selection bias.

Similar to the math department, the physics department only uses LAs in the weekly recitation meeting. An additional anomaly in physics is that, not incidentally, the switch to the LA model happened concurrently with the adoption of the University of Washington Tutorials in introductory physics (McDermott & Shaffer, 2002 ). LAs facilitate small group work with the materials in the University of Washington Tutorials during recitation meetings. In other words, it is not possible to separate the effects of the content presentation in the Tutorials from the LAs facilitating the learning of the content in this department. Thus, data from this department might overestimate the relationship between receiving LA support and course failure. However, it should be noted that the University of Washington Tutorials require a low student-teacher ratio, and proper implementation of this curriculum is not possible without the undergraduate LAs helping to make that ratio possible.

Finally, every student in every section of Calculus I and II in the applied math department had the opportunity to be exposed to LA support. This is because LAs are not used in lecture or required recitation meetings, but instead facilitate an additional weekly one-unit course, called workgroup, that is open to all students. Thus, students who sign up for workgroup not only gain exposure to LA support, but they also gain an additional 90 min of time each week formally engaging in calculus material. It is not possible to know if lower failure rates might be due to the additional time on task generally, or exposure to LAs during that time specifically. This might cause us to overestimate the relationship between LA support and course failure. Additionally, those students who are expected to struggle in calculus (based on placement scores on the Assessment and LEarning in Knowledge Spaces [ALEKS] assessment) or are not confident in their own math abilities are more strongly encouraged to sign up for the weekly meeting by their instructors and advisors. Thus, those students who sign up for LA support might be more likely to fail calculus. This might lead us to underestimate the relationship between LA exposure and course failure. Similar to the chemistry department, we use prior achievement variables (described below) to address this issue to the best of our abilities.

We mention one final assumption about the LA model before describing our methods of statistical adjustment. Our data span 32 semesters of 8 courses (see Fig.  1 ). Although it is surely the case that the LA model adapted and changed in some ways over the course of this time, we make the assumption that the program was relatively stable within department throughout the time period represented in this study.

Statistical adjustment

Although we do not have a controlled experiment that warrants causal claims, we desire to estimate a causal effect. The current study includes a control group, but it is not ideal because of the potential selection bias in each department described above. However, this study is warranted because it takes advantage of historical data. Our analytic approach is to control for some sources of selection bias. Specifically, we use R to control for standardized high school GPA, standardized admissions test scores, and standardized credits at entry to try and account for issues related to prior aptitude. This helps to address the selection bias issues in the chemistry and applied math departments. Additionally, we control for student cohort to account for some of the historical bias in the physics and math departments. We also control for instructor and course as well as gender (coded 1 = female; 0 = male), race/ethnicity (coded 1 = nonwhite; 0 = white), first-generation status (coded 1 = first-generation college student; 0 = not first-generation college student), and financial aid status (coded 1 = received financial aid ever; 0 = never received financial aid) to disentangle other factors that might bias our results in any department. Finally, we consider possible interaction effects between exposure to LA support and various student characteristics. Table  2 presents the successive model specifications explored in this study. Model 1 controls only for student characteristics. Model 2 adds course, cohort, and instructor factor variables. Model 3 adds an interaction between exposure to the LA program and gender to the model 2 specification.
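
As a rough sketch, the three specifications correspond to R model formulas like the ones below; the variable names (dfw for the D/F/withdraw indicator, la for LA exposure, and so on) are illustrative rather than the study's actual column names.

```r
# Model 1: student characteristics only
m1 <- glm(dfw ~ la + female + nonwhite + firstgen + finaid +
                hs_gpa_z + act_z + credits_z,
          data = stem, family = binomial)

# Model 2: add course, cohort, and instructor factor variables
m2 <- update(m1, . ~ . + factor(course) + factor(cohort) + factor(instructor))

# Model 3: add the LA-by-gender interaction
m3 <- update(m2, . ~ . + la:female)
```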

The control variables in Table  2 help to account for the selection bias described above as well as other unobserved bias in our samples, but we are limited by the availability of observed covariates. Thus, the results presented here lie somewhere between “true” causal effects and correlations. We know that our results tell us more than simple correlations, but we also know that we are surely missing key control variables that are typically not collected by institutes of higher education such as a measure of student self-efficacy, social and emotional health, or family support. Thus, we anticipate weak model fit, and the results presented here are not direct causal effects. Instead, they provide information about the partial association between course failure and LA support.

We begin our analysis by providing raw counts of failure rates for the students who did and did not receive LA support in STEM gateway courses. Next, we describe the differences between those students who did and did not receive LA support with respect to available covariates. If it is the case that we see large differences in our covariates between the group of students who did and did not receive LA support, we expect that controlling for those factors in the regression analysis will affect our results in meaningful ways. Thus, we close with estimating logistic regression models to disentangle some of the relationship between LA-support and course failure. The variable of most interest in this analysis is the indicator for exposure to the LA program. A student received a “1” for this variable if they were exposed to the LA program either concurrently or prior to taking STEM gateway courses, and a 0 if they took any classes in the study but never had any LA support in those classes.

Table 3 includes raw pass and failure rates across all courses. Students are counted every time they enrolled in one of the courses included in our study. We see that those students who were exposed to the LA program in at least one STEM gateway course had 6% lower failure rates in concurrent or subsequent STEM gateway courses. We also provide the unadjusted odds ratios for ease of comparison with the logistic regression results. The odds ratio represents the odds that course failure will occur given exposure to the LA program, compared to the odds of course failure occurring without LA exposure. An odds ratio equal to 1.0 indicates that the odds of failure are the same for both groups. An odds ratio less than 1.0 indicates that exposure to LA support is associated with a lower chance of failing, while an odds ratio greater than 1.0 indicates that exposure to LA support is associated with a higher chance of failing. Thus, the odds ratio of 0.65 in Table 3 indicates a lower chance of failure with LA exposure compared to no LA exposure.
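
For readers less familiar with odds ratios, the sketch below computes an unadjusted odds ratio from a 2 × 2 table of counts; the numbers are made up for illustration and are not the values behind Table 3.

```r
fail_la <- 120; pass_la <- 880   # hypothetical counts, students with LA exposure
fail_no <- 200; pass_no <- 800   # hypothetical counts, students without LA exposure

odds_la <- fail_la / pass_la     # odds of failure with LA exposure
odds_no <- fail_no / pass_no     # odds of failure without LA exposure
odds_la / odds_no                # unadjusted odds ratio, about 0.55
```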

Although the raw data indicate that students exposed to LA support have lower course failure rates, these differences could be due, at least in part, to factors outside of LA support. To explore this possibility, we next examine demographic and academic achievement differences between the groups. In Table 4, we present the mean values for all of our predictor variables for students who did and did not receive LA support. The top panel presents all of the binary variables, so averages indicate the percentage of students who identify with the respective characteristics. The bottom panel shows the average for the continuous variables. The p values are for the comparisons of means from a t test across the two groups for each variable. Table 4 indicates that students exposed to the LA program were more likely to be male, nonwhite, non-first-generation students who did not receive financial aid. They also had more credits at entry, higher high school GPAs, and higher admissions test scores. These higher prior achievement variables might lead us to think that students exposed to LA support are more likely to pass STEM gateway courses. If this is true, then the relationship between LA exposure and failure in Table 3 may overestimate the actual relationship between exposure to LAs and probability of course failure. Thus, we next use logistic regression to control for potentially confounding variables and investigate any resulting change in the odds ratio.

R calculates logistic regression estimates in logits, but these estimates are often expressed as odds ratios. We present abbreviated logit estimates in the Appendix and abbreviated odds ratio estimates in Table 5. Estimates for all factor variables (i.e., course, cohort, and instructor) are suppressed in these tables for ease of presentation. In order to make the transformation from logits to odds ratios, the logit estimates were exponentiated to calculate the odds ratios presented in Table 5. For example, the logit estimate for exposure to LA in model 1 from the Appendix converts to the odds ratio estimate in Table 5 by finding exp(−1.41) = 0.24.
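
In R, this conversion is usually applied to the whole coefficient vector at once; assuming the model 3 fit is stored in an object called m3 (an illustrative name):

```r
exp(coef(m3))              # logit estimates converted to odds ratios
exp(confint.default(m3))   # Wald confidence intervals on the odds ratio scale
```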

We start off by discussing the results for model 3 as it is the full model for this analysis. Discussion of models 1 and 2 is saved for the discussion of model fit below. The results in model 3 provide information about what we can expect, on average, across all courses and instructors in the sample. We include confidence intervals with the odds ratios. Confidence intervals that include 1.0 suggest results that are not statistically significant (Long, 1997). The odds ratio estimate in Table 5 for model 3 is 0.367 for LA exposure, with a confidence interval of (0.337, 0.400). Since the odds ratio is less than 1.0, LA exposure is associated with a lower probability of failing, on average, and the relationship is statistically significant because the confidence interval does not include 1.0. Compared to the odds ratio in Table 3 (0.65), these results indicate that covariate adjustment has a large impact on this odds ratio. Failure to adjust for possible confounding variables leads to an understatement of the “effect” of exposure to the LA program on course failure.

Our results show that LA exposure is associated with lower odds of failing STEM gateway courses. We also see that the interaction between exposure to the LA program and gender is statistically significant. The odds ratio of 0.37 for exposure to LA support in Table 5 is for male students. In order to find the relationship for female students, we must exponentiate the sum of the logit estimates for exposure to the LA program, female, and the interaction between the two variables (i.e., exp[−1.002 − 0.092 + 0.297] = 0.45; see the Appendix). This means that the LA program actually lowers the odds of failing for male students slightly more than for female students. Recall that Table 3 illustrated that the raw odds ratio for failure when exposed to LA support was 0.65. Our results show that, after controlling for possible confounding variables, the association between LA support and lower odds of course failure is stronger for both male (0.37) and female (0.45) students than the unadjusted comparison suggests.

Discussion and limitations

Throughout this paper, we have been upfront about the limitations of the current analysis. Secondary analysis of institutional data for longstanding programs is complex and difficult. In this penultimate section, we mention a few other limitations to the study as well as identify some ideas for future research that could potentially bolster the results found here or identify where this analysis may have gone astray.

First, and most closely related to the results presented above is model fit. The McFadden pseudo R-squared (Verbeek, 2008 ) values for the three models are 0.0708, 0.1793, and 0.1797 respectively. These values indicate two things: (1) that the data do not fit any of the models well and (2) that the addition of the interaction term does little to improve model fit. This is also seen in the comparison of AIC and log likelihood values in Table  5 . We spend significant time on the front end of this paper describing why these data are not ideal for understanding the relationship between exposure to the LA program and probability of failing, so we do not spend additional time here discussing this lack of goodness-of-fit. Instead, we acknowledge this as a limitation of the current analysis and reiterate the desire to conduct a similar type analysis to what is presented here with data more likely to fit the model. Such situations would include institutions that have the ability to compare, for example, large samples of students with and without LA exposure within the same semester, course, and instructor. Another way to improve such data would be to include a way to control for student confidence and feelings of self-efficacy. For example, the descriptions of selection bias above indicate that students in Applied Math might systematically be students who differ in terms of self-confidence. Data that could control for such factors would better facilitate understanding of the relationship between exposure to LA support and course failure. Alternatively, it may be more appropriate to consider the nested structure of the data (i.e., students nested within courses nested within departments) in a context with data better suited for such analysis. Hierarchical linear modeling might even be appropriate for a within-department study if it would be reasonable to consider students nested within classes if there was sufficient sample size at the instructor level.
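
For reference, McFadden's pseudo R-squared is computed from the log likelihoods of the fitted model and the corresponding intercept-only model; a sketch, again assuming the model 3 fit is called m3:

```r
null_model <- update(m3, . ~ 1)                               # intercept-only model
1 - as.numeric(logLik(m3)) / as.numeric(logLik(null_model))   # McFadden pseudo R-squared
```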

Second, in addition to a measure of student self-efficacy, there are other variables that might be interesting to investigate such as transfer, out-of-state, or international student status; if students live on-campus; and a better measure of socioeconomic status than receiving financial aid. These are other important student characteristics that might uncover differential relationships between the LA program and particular types of students. Such analysis is important because persistence and retention in gateway courses—particularly for students from traditionally marginalized groups—are an important concern for institutions generally and STEM departments specifically. If we are to maintain and even build diversity in these departments, it is crucial we have solid and clear work in these areas.

Third, although this study controls for course- and instructor-level factors, there are surely complications introduced into this study due to the differential way the LA program is implemented in each department. A more careful study within department is another interesting and valuable approach to understanding the influence of the LA program but one that this data is not well-suited for. Again, there is a need for data which includes students exposed to the LA program and not exposed within the same term, course, and instructor to better disentangle the relationship. Due to the nature of the way the LA program was taken up at University of Colorado Boulder, we do not have the appropriate data for such an analysis.

Finally, an interesting consideration is the choice of outcome variable made in this analysis. Course failure rates are particularly important in gateway courses because failing such a course can lead students to switch majors or drop out of college. We do see a relationship between the LA model and lower failure rates in the current analysis. However, other approaches to course outcomes include course grades, pass rates, average GPA in other courses, and average grade anomaly (Freeman et al., 2014 ; Haak et al., 2011 ; Matz et al., 2017 ; Webb, Stade, & Grover, 2014 ). Investigations similar to what is presented here with other course outcomes are also of interest. For example, course grades would provide more nuanced information regarding how the LA model influences student outcomes. A measure such as Matz et al.’s ( 2017 ) average GPA in other courses could provide more information about how the LA program impacts courses other than the ones in which the LA exposure occurred. In either of these situations, it would be interesting to see if the LA program would continue to appear to have a greater impact for male students than for female students. In short, there are a wide variety of student outcomes that have yet to be fully investigated with data from the LA model, and more nuanced information would be a valuable contribution to the research literature.

In this study, we attempt to disentangle the relationship between LA support and course failure in introductory STEM courses. Our results indicate that failure to control for confounding variables underestimates the relationship between exposure to the LA program and course failure. The results here extend the prior literature regarding the LA model by providing evidence to suggest that exposure to the program improves student outcomes in subsequent as well as current courses. Programs such as the LA model, which facilitate instructional innovations in which students are more likely to be successful, increase student retention.

Preliminary qualitative work suggests potential hypotheses for the relationship between LA support and student success. Observations of student-LA interactions indicate that LAs develop safe yet vulnerable environments necessary for learning. Undergraduates are more comfortable revealing their thinking to LAs than to TAs and instructors and are therefore better able to receive input about their ideas. Researchers find that LAs exhibit pedagogical skills introduced in the pedagogy course and course experience that promote deep understanding of relevant content as well as critical thinking and questioning needed in higher education (Top, Schoonraad, & Otero, 2018 ). Also, through their interactions with LAs, faculty seem to be learning how to embrace the diversity of student identities and structure educational experiences accordingly. Finally, institutional norms are changing as more courses adopt new ways of teaching students. For example, the applied math department provides additional time on task because of the LA program. Although we do not know if it is the additional time on task, the presence of LAs, or a combination of both that drives the relationship between LA exposure and lower course failure rates, both the additional time and LA exposure occur because of the LA program generally.

Further work is necessary to more fully understand the relationship between the LA program and student success. Although we controlled for several student-level variables, we surely missed key variables that contribute to these relationships. Despite this limitation, the regression analysis represents an improvement over unadjusted comparisons. We used the available institutional data to control for variables related to the selection bias present in each department’s method of assigning students to receive LA support. More research is needed to identify whether the themes emerging in the present study are apparent at other institutions. Additional research with data better suited to isolate potential causal effects is also needed to bolster the results presented here. Despite the limitations discussed here, the current findings are encouraging for further development and implementation of the LA program in STEM gateway courses. Identifying relationships between models for change and lower course failure rates is helpful for informing future decisions regarding those models.

For more information about joining LASSO and resources available to support LA programs, visit https://www.learningassistantalliance.org/

Abbreviations

BEMA: Brief Electricity and Magnetism Assessment

CSEM: Conceptual Survey of Electricity and Magnetism

FCI: Force Concept Inventory

FMCE: Force and Motion Conceptual Evaluation

LA model: Learning Assistant model

LAs: Learning assistants

PLTL: Peer-led team learning

STEM: Science, technology, engineering, and mathematics

Caldwell, J. E. (2007). Clickers in the large classroom: current research and best-practice tips. CBE-Life Sci Educ, 6 (1), 9–20.

Chan, J. Y., & Bauer, C. F. (2015). Effect of peer-led team learning (PLTL) on student achievement, attitude, and self-concept in college general chemistry in randomized and quasi experimental designs. J Res Sci Teach, 52 (3), 319–346.

Close, E. W., Mailloux-Huberdeau, J. M., Close, H. G., & Donnelly, D. (2018). Characterization of time scale for detecting impacts of reforms in an undergraduate physics program. In L. Ding, A. Traxler, & Y. Cao (Eds.), AIP Conference Proceedings: 2017 Physics Education Research Conference .

College Board. (2016). Concordance tables. Retrieved from https://collegereadiness.collegeboard.org/pdf/higher-ed-brief-sat-concordance.pdf

Cracolice, M. S., & Deming, J. C. (2001). Peer-led team learning. Sci Teach, 68 (1), 20.

Crisp, G., Nora, A., & Taggart, A. (2009). Student characteristics, pre-college, college, and environmental factors as predictors of majoring in and earning a STEM degree: an analysis of students attending a Hispanic serving institution. Am Educ Res J, 46 (4), 924–942 Retrieved from http://www.jstor.org/stable/40284742 .

Ding, L., Chabay, R., Sherwood, B., & Beichner, R. (2006). Evaluating an electricity and magnetism assessment tool: brief electricity and magnetism assessment. Physical Rev Special Topics Physics Educ Res, 2 (1), 010105.

Elby, A., Scherr, R. E., Goertzen, R. M., & Conlin, L. (2008). Open-source tutorials in physics sense making. Retrieved from http://umdperg.pbworks.com/w/page/10511238/Tutorials%20from%20the%20UMd%20PERG

Freeman, S., Eddy, S. L., McDonough, M., Smith, M. K., Okoroafor, N., Jordt, H., & Wenderoth, M. P. (2014). Active learning increases student performance in science, engineering, and mathematics. Proc Nat Acad Sci, 111 (23), 8410–8415.

Goertzen, R. M., Brewe, E., Kramer, L. H., Wells, L., & Jones, D. (2011). Moving toward change: institutionalizing reform through implementation of the Learning Assistant model and Open Source Tutorials. Physical Rev Special Topics Physics Education Research, 7 (2), 020105.

Haak, D. C., HilleRisLambers, J., Pitre, E., & Freeman, S. (2011). Increased structure and active learning reduce the achievement gap in introductory biology. Science, 332 (6034), 1213–1216.

Hake, R. R. (1998). Interactive-engagement versus traditional methods: a six-thousand-student survey of mechanics test data for introductory physics courses. Am J Physics, 66 (1), 64–74.

Hestenes, D., Wells, M., & Swackhamer, G. (1992). Force concept inventory. Physics Teach, 30 (3), 141–158.

Jensen, J. L., & Lawson, A. (2011). Effects of collaborative group composition and inquiry instruction on reasoning gains and achievement in undergraduate biology. CBE-Life Sci Educ, 10 (1), 64–73.

Knight, R. (2004). Physics for scientists and engineers: A strategic approach. Upper Saddle River, NJ: Pearson/Addison Wesley.

Learning Assistant Alliance. (2018). About LASSO. Retrieved from https://www.learningassistantalliance.org/modules/public/lasso.php

Long, J. S. (1997). Advanced quantitative techniques in the social sciences series, Vol. 7. Regression models for categorical and limited dependent variables. Thousand Oaks, CA, US.

Martin, T., Rivale, S. D., & Diller, K. R. (2007). Comparison of student learning in challenge-based and traditional instruction in biomedical engineering. Annals of Biomedical Engineering, 35 (8), 1312–1323.

Matz, R. L., Koester, B. P., Fiorini, S., Grom, G., Shepard, L., Stangor, C. G., et al. (2017). Patterns of gendered performance differences in large introductory courses at five research universities. AERA Open, 3 (4), 2332858417743754.

McDermott, L. C., and Shaffer, P. S. (2002). Tutorials in introductory physics. Upper Saddle Ridge, New Jersey: Prentice Hall.

Mitchell, Y. D., Ippolito, J., & Lewis, S. E. (2012). Evaluating peer-led team learning across the two semester general chemistry sequence. Chemistry Education Research and Practice, 13 (3), 378–383.

Otero, V. K. (2015). Effective practices in preservice teacher education. In C. Sandifer & E. Brewe (Eds.), Recruiting and educating future physics teachers: case studies and effective practices (pp. 107–127). College Park: American Physical Society.

Pollock, S. J. (2006). Transferring transformations: Learning gains, student attitudes, and the impacts of multiple instructors in large lecture courses. In P. Heron, L. McCullough, & J. Marx (Eds.), Proceedings of 2005 Physics Education Research Conference (pp. 141–144). Salt Lake City, Utah.

Pollock, S. J. (2009). Longitudinal study of student conceptual understanding in electricity and magnetism. Physical Review Special Topics-Physics Education Research, 5 (2), 1–8.

Talbot, R. M., Hartley, L. M., Marzetta, K., & Wee, B. S. (2015). Transforming undergraduate science education with learning assistants: student satisfaction in large-enrollment courses. J College Sci Teach, 44 (5), 24–30.

Thornton, R. K., & Sokoloff, D. R. (1998). Assessing student learning of Newton’s laws: the force and motion conceptual evaluation and the evaluation of active learning laboratory and lecture curricula. Am J Physics, 66 (4), 338–352.

Thornton, R. K., Kuhl, D., Cummings, K., & Marx, J. (2009). Comparing the force and motion conceptual evaluation and the force concept inventory. Physical review special topics-Physics education research, 5(1), 010105.

Top, L., Schoonraad, S., & Otero, V. (2018). Development of pedagogical knowledge among learning assistants. Int J STEM Educ, 5 (1). https://doi.org/10.1186/s40594-017-0097-9 .

Verbeek, M. (2008). A guide to modern econometrics . West Sussex: Wiley.

Webb, D. C., Stade, E., & Grover, R. (2014). Rousing students’ minds in postsecondary mathematics: the undergraduate learning assistant model. J Math Educ Teach College, 5 (2).

White, J. S. S., Van Dusen, B., & Roualdes, E. A. (2016). The impacts of learning assistants on student learning of physics. arXiv preprint arXiv, 1607.07469 . Retrieved from https://arxiv.org/ftp/arxiv/papers/1607/1607.07469.pdf .

Wilson, S. B., & Varma-Nelson, P. (2016). Small groups, significant impact: a review of peer-led team learning research with implications for STEM education researchers and faculty. J Chem Educ, 93 (10), 1686–1702.

Sewell, W. H. (1992). A theory of structure: duality, agency, and transformation. American Journal of Sociology, 98 (1), 1–29.

Acknowledgements

There is no funding for this study.

Availability of data and materials

The datasets generated and/or analyzed during the current study are available in the LAs and Subsequent Course Failure repository, https://github.com/jalzen/LAs-and-Subsequent-Course-Failure .

Author information

Authors and affiliations

University of Colorado Boulder, 249 UCB, Boulder, CO, 80309, USA

Jessica L. Alzen, Laurie S. Langdon & Valerie K. Otero

Contributions

JLA managed the data collection and analysis. All authors participated in writing, revising, and approving the final manuscript.

Corresponding author

Correspondence to Jessica L. Alzen .

Ethics declarations

Ethics approval and consent to participate.

The IRB at University of Colorado Boulder (FWA 00003492) determined that this study did not involve human subjects research. The approval letter specifically stated the following:

The IRB determined that the proposed activity is not research involving human subjects as defined by DHHS and/or FDA regulations. IRB review and approval by this organization is not required. This determination applies only to the activities described in the IRB submission and does not apply should any changes be made. If changes are made and there are questions about whether these activities are research involving human subjects in which the organization is engaged, please submit a new request to the IRB for a determination.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article.

Alzen, J.L., Langdon, L.S. & Otero, V.K. A logistic regression investigation of the relationship between the Learning Assistant model and failure rates in introductory STEM courses. IJ STEM Ed 5 , 56 (2018). https://doi.org/10.1186/s40594-018-0152-1

Received : 29 August 2018

Accepted : 10 December 2018

Published : 28 December 2018

DOI : https://doi.org/10.1186/s40594-018-0152-1

Keywords

  • Learning assistant
  • Underrepresented students


Linear and logistic regression models: when to use and how to interpret them?

Horacio Matias Castro

1 . Methods in Epidemiologic, Clinical, and Operations Research-MECOR-program, American Thoracic Society/Asociación Latinoamericana del Tórax, Montevideo, Uruguay.

2 . Pulmonary Medicine Department, Hospital Italiano de Buenos Aires, Buenos Aires, Argentina.

Juliana Carvalho Ferreira

3 . Divisão de Pneumologia, Instituto do Coração, Hospital das Clínicas, Faculdade de Medicina, Universidade de São Paulo, São Paulo (SP) Brasil.

PRACTICAL SCENARIO

A secondary analysis 1 of a study designated “Integrating Palliative and Critical Care,” a cluster randomized trial, was conducted to explore differences in receipt of elements of palliative care among patients who died in the ICU with interstitial lung disease (ILD) or COPD in comparison with those who died of cancer. The authors used two methods of multiple regression analysis: linear regression to estimate the impact of COPD and ILD, in comparison with that of cancer, on the length of ICU stay, and logistic regression to evaluate the effects of COPD and ILD on the presence or absence of elements of palliative care. All regression models were adjusted for confounders (age, sex, minority status, education level, among others) of the association between the patient diagnosis and palliative care outcomes.

INTRODUCTION

Linear and logistic regressions are widely used statistical methods to assess the association between variables in medical research. These methods estimate if there is an association between the independent variable (also called predictor, exposure, or risk factor) and the dependent variable (outcome). 2

The association between two variables is evaluated with simple regression analysis. However, in many clinical scenarios, more than one independent variable may be associated with the outcome, and there may be a need to control for confounding variables. When two or more independent variables are included, multiple regression analysis is used. Multiple regression analysis evaluates the independent effect of each variable on the outcome, adjusting for the effect of the other variables included in the same regression model.

WHEN TO USE LINEAR OR LOGISTIC REGRESSION?

The determinant of the type of regression analysis to be used is the nature of the outcome variable. Linear regression is used for continuous outcome variables (e.g., days of hospitalization or FEV1), and logistic regression is used for categorical outcome variables, such as death. Independent variables can be continuous, categorical, or a mix of both.

In our example, the authors wanted to know if there was a relationship between cancer, COPD, and ILD (baseline disease; the independent variables) with two different outcomes. One outcome was continuous (length of ICU stay) and the other one was categorical (presence or absence of elements of palliative care). Therefore, two models were built: a linear model to examine the association between baseline disease (chronic pulmonary disease or cancer) and length of ICU stay, and a logistic regression analysis to examine the association between the baseline disease and being in receipt of elements of palliative care.
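As an illustration of this distinction, the following sketch (Python with statsmodels) specifies one linear and one logistic model on made-up data loosely mirroring the scenario; the variable names (icu_days, pain_assessed, diagnosis, age, sex) are hypothetical and are not taken from the study.

```python
# Sketch: a linear model for a continuous outcome and a logistic model for a
# binary outcome, with "cancer" as the reference diagnosis. Data are invented.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "diagnosis": rng.choice(["cancer", "copd", "ild"], n),
    "age": rng.normal(70, 10, n),
    "sex": rng.choice(["m", "f"], n),
})
df["icu_days"] = 5 + 2.5 * (df["diagnosis"] == "ild") + rng.normal(0, 3, n)
df["pain_assessed"] = rng.binomial(1, 0.6, n)

# Continuous outcome (length of ICU stay) -> linear regression (OLS).
linear = smf.ols(
    "icu_days ~ C(diagnosis, Treatment(reference='cancer')) + age + sex", data=df
).fit()

# Binary outcome (element of palliative care present or absent) -> logistic regression.
logistic = smf.logit(
    "pain_assessed ~ C(diagnosis, Treatment(reference='cancer')) + age + sex", data=df
).fit(disp=False)

print(linear.params)               # beta coefficients (change in ICU days)
print(linear.conf_int())           # 95% CIs for the beta coefficients
print(np.exp(logistic.params))     # odds ratios
print(np.exp(logistic.conf_int())) # 95% CIs for the odds ratios
```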

HOW TO INTERPRET RESULTS OF REGRESSION ANALYSIS?

Regression models are performed within statistical packages, and the output results include several parameters, which can be complex to interpret. Clinicians who are learning the basics of regression models should focus on the key parameters presented in Chart 1 .

Chart 1. Key parameters of linear and logistic regression models.

Direction and strength of the association between the independent variable and the dependent variable (outcome)
  • Linear regression: beta coefficient, which describes the expected average change in the outcome variable for each one-unit change in a continuous independent variable, or the average change in the outcome variable for one category of a categorical independent variable compared with a reference category.
  • Logistic regression: OR. For a continuous independent variable, the OR is interpreted as the change in the odds of the outcome occurring for every one-unit increase in the independent variable. For a categorical independent variable, the OR is interpreted as the increase or decrease in odds between two categories (e.g., men vs. women). OR = 1: no association; OR > 1: positive association or risk factor; and OR < 1: negative association or protective factor.

Example (for a continuous independent variable)
  • Linear regression: the expected increase in FEV1 for each centimeter increase in height.
  • Logistic regression: the expected increase in the odds of death for each increase of one year of age among patients with sepsis.

Example (for a categorical independent variable)
  • Linear regression: the expected increase in FEV1 for men compared with women with the same height and age.
  • Logistic regression: the expected increase in the odds of death for men compared with women among COVID-19 patients.

Precision of the estimate
  • Linear regression: the 95% CI of the beta coefficient.
  • Logistic regression: the 95% CI of the OR.

Statistical significance
  • Linear regression: the p value (significant when < 0.05).
  • Logistic regression: the p value (significant when < 0.05).

In our example, the baseline disease (COPD, ILD, or cancer, the last being the reference category) is the independent variable, and length of ICU stay and receipt of palliative care elements are the outcomes of interest. In addition, the regression models also included other independent variables considered as potential confounders, such as age, sex, and minority status. In the linear regression model, the length of ICU stay for patients with ILD was longer than for those with cancer (β = 2.75; 95% CI, 0.52-4.98; p = 0.016), which means that, on average, having ILD increased the length of ICU stay by 2.75 days when compared with the length of ICU stay among cancer patients. In the logistic regression model, the authors found that patients with ILD, when compared with cancer patients, were less likely to have any documentation of their pain assessment in the last 24 h of life (OR = 0.43; 95% CI, 0.19-0.97; p = 0.042), which means that having ILD decreased the odds of documentation of pain assessment by more than half.

  • Linear and logistic regressions are important statistical methods for testing relationships between variables and quantifying the direction and strength of the association.
  • Linear regression is used with continuous outcomes, and logistic regression is used with categorical outcomes.
  • These procedures require expertise in regression model building and typically require the assistance of a biostatistician.

An Introduction to Logistic Regression Analysis and Reporting

Related Papers

Veterinary World

How to cite this article: Selim AM, Elhaig MM, Moawed SA, El-Nahas E (2018) Modeling the potential risk factors of bovine viral diarrhea prevalence in Egypt using univariable and multivariable logistic regression analyses, Veterinary World, 11(3): 259-267. Abstract Aim: The present cross-sectional study was conducted to determine the seroprevalence and potential risk factors associated with Bovine viral diarrhea virus (BVDV) disease in cattle and buffaloes in Egypt, to model the potential risk factors associated with the disease using logistic regression (LR) models, and to fit the best predictive model for the current data. Materials and Methods: A total of 740 blood samples were collected within November 2012-March 2013 from animals aged between 6 months and 3 years. The potential risk factors studied were species, age, sex, and herd location. All serum samples were examined with indirect ELIZA test for antibody detection. Data were analyzed with different statistical approaches such as Chi-square test, odds ratios (OR), univariable, and multivariable LR models. Results: Results revealed a non-significant association between being seropositive with BVDV and all risk factors, except for species of animal. Seroprevalence percentages were 40% and 23% for cattle and buffaloes, respectively. OR for all categories were close to one with the highest OR for cattle relative to buffaloes, which was 2.237. Likelihood ratio tests showed a significant drop of the −2LL from univariable LR to multivariable LR models. Conclusion: There was an evidence of high seroprevalence of BVDV among cattle as compared with buffaloes with the possibility of infection in different age groups of animals. In addition, multivariable LR model was proved to provide more information for association and prediction purposes relative to univariable LR models and Chi-square tests if we have more than one predictor.

Andres Sandoval-Hernandez

Factors and conditions that promote academic resilience: A cross-country perspective

Objectives or purposes. The main objective of this paper is to identify factors and conditions that could help socially disadvantaged students in different countries to become academically resilient. To do that, four specific objectives have been set: i) to conceptualize and quantitatively operationalize the notion of academic resilience; ii) to estimate the proportion of resilient students across different school systems; iii) to identify the factors and conditions more consistently associated with a high likelihood of academic resilience; and iv) to evaluate the existence of cross-country patterns in the findings listed above.

Perspective(s) or theoretical framework. Students from low socio-economic status (SES) families live and study in different contexts, and therefore have specific and different educational needs than their more socially advantaged peers. Although it is well documented that students from low SES families tend to perform worse at school, several studies have shown that in most countries there is a group of students who are academically successful despite their challenging backgrounds. These students are called resilient. There is a fairly large body of empirical research on resilience in education; however, the discussion of the concept has very often lacked a sound theoretical basis. This paper proposes Bronfenbrenner's Ecological Systems Theory as a framework to elaborate a theoretical concept and to explain the processes related to academic resilience. Bronfenbrenner suggests that human development processes (e.g. resilience) can be explained in terms of the relationships between individuals and their environment. In resemblance to the hierarchical structure of educational data, under this view the environment consists of different dimensions, or levels, that make up an individual's context.

Methods, techniques or modes of inquiry. In order to address our second objective, we operationalized academic resilience based upon the two sine qua non characteristics of a resilient student: a challenging social background and academic success. We used principal component analysis (PCA) to summarize six variables into an SES index (e.g. parents' level of education, parents' occupational status, subjective family financial status, home possessions). We then categorized students coming from "challenging backgrounds" as those who score at or below the 20th percentile of the SES index across each group of countries. We defined "academically successful" students as those who, while controlling for SES, achieved a reading score at or above the 20th percentile in their country. Finally, in order to identify the factors and conditions more consistently associated with a high likelihood of academic resilience, we used different specifications of hierarchical logistic regression models. While controlling for student and socio-demographic characteristics, these models evaluate the association of a theoretically relevant set of variables with the likelihood of academic resilience.

Data sources or evidence. The data used for the analyses stem from the Progress in International Reading Literacy Study (PIRLS) 2006. PIRLS assessed students' reading literacy in a target population of 4th graders in 40 participant countries. Apart from reading literacy scores, PIRLS collects extensive information from the pupils, their teachers and head teachers, and their parents, to explore home, school and national influences on student achievement.

Results and/or conclusions/points of view. Preliminary results suggest that: i) the proportion of resilient students varies considerably across education systems; ii) while generally providing a good fit to the empirical data, of the four dimensions proposed by Bronfenbrenner's model (i.e. personal, family, school and community), the first seemed to be the most important in predicting a high likelihood of academic resilience, specifically through students' reading self-concept and positive attitudes towards reading; to a lesser degree, school and family resources were also among the variables more consistently associated with resilience across countries; iii) descriptive analyses reveal a faint pattern indicating that variables identified as important in the previous analyses show a stronger association with resilience in less-developed and more unequal countries. Interactions among dimensions, possible policy implications and suggested further research are discussed in the full paper.

Educational importance of this study. Resilience is important because delivering quality education to all students is a major goal for every education system. As preliminary results suggest, education can indeed play a catalytic role in ensuring that children's futures are not pre-determined by their SES. Understanding the processes involved in academic resilience could provide conceptual and theoretical tools for breaking the intergenerational cycle of poor academic achievement, poor job prospects and poverty. Moreover, an international comparative study of this kind can contribute to establishing a basis for the development of effective policies and practices for promoting resilience across countries.

Connection to the themes of the congress. As is implicit in the theoretical model and in the analytical approach described above, the implications of this work are closely related to the interplay between policy, research and practice. The information used to fit the statistical models was collected from students, teachers, head teachers, parents and national coordinators; consequently, the discussion and recommendations drawn from the analysis call for coordinated actions by all of them.

Desalegne Mesa

ABSTRACT. Background: The term malnutrition generally refers both to undernutrition and overnutrition, but in this study the term is used to refer solely to a deficiency of nutrition. Nutritional status is the result of complex interactions between food consumption and the overall status of health and health care practices. In Ethiopia, 44% of under-five children are stunted while 29% are underweight (EDHS, 2011). Although studies show good progress in the declining proportion of malnourished children in the recent past, there are still problems to be addressed. Objective: Generally, the focus of the study is to identify and examine the correlates of malnutrition of under-five children in the SNNPR regional state. Method: Based on the nature of the response variable, the statistical method employed in this study is the ordinal logistic regression model. In other words, the response variable, nutritional status of children under five years of age, is ordinal in addition to having multiple levels. Therefore, the author used the ordinal logistic regression method for the analysis of the data. Results: Results from descriptive statistics show that 37.4% of children in the SNNPR state are severely malnourished while 24.2% are moderately malnourished. The findings of the study show that size of the child at birth, use of vitamin A during the six months prior to the survey, prenatal treatment of mothers with iron tablets/syrup, education level of the mother/father/partner, mother's age at first birth, preceding birth interval, and sex and age of the child have statistically significant effects on the nutritional status of children under five years of age. Conclusion: Based on the findings of the study, use of micronutrients for children and mothers, prevention of early marriage for females, stretched birth spacing, giving due attention to children aged below 11 months, and education for mothers are recommended in order to tackle the problems related to malnutrition of under-five children. Key words: Nutritional status, ordinal logistic regression, proportional odds model.

Educational Review

This study reviews the international literature of empirical educational research to examine the application of logistic regression. The aim is to examine common practices in the reporting and interpretation of logistic regression results, and to discuss the implications for educational research. A review of 130 studies suggests that: (a) the majority of studies report statistical significance and the sign of predictors but do not interpret the magnitude of the relationship in terms of probabilities; (b) the odds ratio is the most commonly reported effect size, and it tends to be incorrectly interpreted as relative risk, which leads to significant exaggeration of the association magnitude and misleading conclusions; and (c) marginal effects and predicted probabilities are reported by only 10.7% of reviewed studies, and the specification of independent variables’ values is frequently missing. It is suggested that marginal effects and predicted probabilities be reported more frequently to fully utilise the information provided by logistic regression results.
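Because the review above recommends reporting marginal effects and predicted probabilities alongside odds ratios, the short sketch below shows one way these quantities can be obtained from a fitted logistic regression in Python with statsmodels; the data and variable names are invented for illustration.

```python
# Sketch: odds ratios vs. average marginal effects and predicted probabilities
# from a logistic regression. Data and variable names are invented.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 1000
df = pd.DataFrame({"ses": rng.normal(0, 1, n), "female": rng.integers(0, 2, n)})
p = 1 / (1 + np.exp(-(-0.5 + 0.8 * df["ses"] + 0.3 * df["female"])))
df["passed"] = rng.binomial(1, p)

fit = smf.logit("passed ~ ses + female", data=df).fit(disp=False)

print(np.exp(fit.params))                        # odds ratios
print(fit.get_margeff(at="overall").summary())   # average marginal effects (probability scale)

# Predicted probability at specified covariate values (here: mean SES, female = 1).
new = pd.DataFrame({"ses": [df["ses"].mean()], "female": [1]})
print(fit.predict(new))
```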

Sheryl L Hendriks

The lack of a “gold standard” to determine and predict household food insecurity is well documented. While a considerable volume of research continues to explore universally applicable measurement approaches, robust statistical techniques have not been applied in food security monitoring and early warning systems, especially in countries where food insecurity is chronic. This study explored the application of various Ordinal Logistic Regression techniques in the analysis of national data from South Sudan. Five Link Functions of the Ordinal Regression model were tested. Of these techniques, the Probit Model was found to be the most efficient for predicting food security using ordered categorical outcomes (Food Consumption Scores). The study presents the first rigorous analysis of national food security levels in post conflict South Sudan and shows the power of the model in identifying significant predictors of food insecurity, surveillance, monitoring and early warning.


Research Article

Food consumption score and predictors among pregnant women attending antenatal care services in health centers of Addis Ababa, Ethiopia: Using ordinal logistic regression model

Affiliation Department of Human Nutrition, Institute of Public Health, University of Gondar, Gondar, Ethiopia


  • Jerusalem Ketema Belay, 
  • Solomon Mekonnen Abebe, 
  • Lemlem Daniel Baffa, 
  • Berhanu Mengistu

  • Published: June 26, 2024
  • https://doi.org/10.1371/journal.pone.0306169

Poor maternal nutrition during pregnancy creates a stressful environment that can lead to long-term effects on tissue development. Understanding the food consumption score can help prevent problems associated with poor dietary intake among pregnant mothers. In Ethiopia, the reported proportion of pregnant women with an acceptable food consumption score ranges from 54% to 81.5%, which is far below the World Food Program (WFP) recommendation. Thus, this study aimed to assess the food consumption score and associated factors among pregnant women attending antenatal care services in health centers of Addis Ababa, Ethiopia.

This study used an institution-based cross-sectional design. Overall, 999 pregnant women were selected for this study. A multistage sampling technique followed by systematic random sampling was used to include pregnant women coming for antenatal care services in the selected health centers of Addis Ababa from June 07 to July 08, 2022. We used an interviewer-administered questionnaire built in the Kobo Toolbox. The food consumption score (FCS) was assessed after collecting data on the frequency of consumption of eight food groups over the previous seven days, which were weighted according to their relative nutritional value. STATA 14 was used to analyse the data. Ordinal logistic regression was used to identify independent predictors of the food consumption score. Variables with a p value < 0.25 in the bivariable ordinal logistic regression were considered for the final model. Crude and adjusted odds ratios were used to assess the strength of the association. In the final model, a p value < 0.05 at a 95% confidence interval was used to declare statistical significance.

Of the total of 949 pregnant women, a little over half (51.20%; 95% CI: 48.00%-54.40%) had an acceptable food consumption score, while just over two fifths (42.60%; 95% CI: 39.40%-45.70%) and a small proportion (6.2%; 95% CI: 4.84%-7.94%) of the study participants had borderline and poor food consumption scores, respectively. No meal skipping (AOR = 1.37, 95% CI: 1.03–1.81), being able to read and write (AOR = 3.99, 95% CI: 1.33–11.96), poorest wealth status (AOR = 0.52, 95% CI: 0.34–0.78), and a positive attitude towards consumption of a diversified diet (AOR = 1.52, 95% CI: 1.17–1.98) were independent predictors of an acceptable food consumption score.

In this study, a considerably low level of acceptable food consumption score among the study participants was observed. Besides, not skipping meals, better educational status, higher wealth status, and a positive attitude towards consumption of a diversified diet were associated with an acceptable food consumption score. Therefore, nutritional education addressing important dietary modifications should be intensified, targeting vulnerable groups.

Citation: Belay JK, Abebe SM, Baffa LD, Mengistu B (2024) Food consumption score and predictors among pregnant women attending antenatal care services in health centers of Addis Ababa, Ethiopia: Using ordinal logistic regression model. PLoS ONE 19(6): e0306169. https://doi.org/10.1371/journal.pone.0306169

Editor: Girma Beressa, Madda Walabu University, ETHIOPIA

Received: February 8, 2023; Accepted: June 12, 2024; Published: June 26, 2024

Copyright: © 2024 Belay et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: all relevant data are within the manuscript and its Supporting Information files.

Funding: The authors received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Abbreviations: FCS, Food Consumption Score; WFP, World Food Program; ANC, Antenatal Care; COR, Crude Odds Ratio; SD, Standard Deviation; AOR, Adjusted Odds Ratio; VIF, Variance Inflation Factor

A woman’s nutritional requirements vary during pregnancy as she is now feeding both her unborn child and herself. Although prenatal nutrition has an impact on how a pregnancy develops, there is never a wrong moment to start eating healthily. Therefore, it is imperative to have sound nutrition during the period of gestation for both the mother and her growing foetus [ 1 – 3 ].

However, poor maternal nutrition during pregnancy, whether due to decreased intake or poor quality, results in a range of problems [ 4 ]. It affects the general growth and development of the offspring. These changes can have a significant impact on the overall health and production performance of the offspring [ 5 , 6 ]. Along with its negative impacts on the offspring’s nutritional quality, it also produces a stressful environment that may have long-term or permanent repercussions on tissue development, as seen in the emergence of chronic non-communicable diseases later in life [ 7 – 9 ]. Understanding the food consumption score (FCS) of a pregnant woman will help to prevent the issues linked to poor dietary intake during the period of gestation [ 10 ].

Nutritional needs during pregnancy can be satisfied by eating foods from a variety of food groups including fruits, vegetables, dairy products, carbohydrates, fats, and vitamins [ 11 ]. However, poor dietary diversity and FCS have been reported during pregnancy. For example, in Bangladesh the proportion of pregnant women with an acceptable FCS was found to be 58%, and studies in Ethiopia have revealed similar figures among pregnant women: 81.5% in East Gojam Zone [ 12 ] and 54% in rural Eastern Ethiopia [ 13 ], which were far below the World Food Program (WFP) recommendation (90%) [ 1 ].

A number of studies have shown the following as independent predictors of having an acceptable FCS during pregnancy: religion [ 12 ], residence [ 12 ], maternal educational status [ 14 ], educational status of the father [ 10 ], wealth status [ 13 , 14 ], attitude [ 13 ], antenatal care (ANC) visit [ 13 ], skipping meal [ 15 ] and consumption of animal source food [ 13 ].

In recent years, the introduction of western lifestyles in the big cities of Ethiopia like Addis Ababa has brought a drastic change in the food consumption pattern of pregnant women [ 16 ], which runs counter to the unrelenting efforts outlined in different policies and programmes enacted by the government [ 17 , 18 ]. Socio-cultural factors such as women’s education and employment, food preference, recent epidemics like COVID-19, and cultural practices have also been reported as driving forces for this change [ 19 , 20 ]. Cognizant of this, findings from this study can be used to provide an evidence-based decision to determine factors that influence the FCS of pregnant women [ 21 ].

Even though there are a handful of studies that focused on FCS among pregnant women, our study employed a different method, ordinal logistic regression, to better understand predictors of FCS among pregnant women [ 22 ]. Thus, this study aimed to assess the food consumption pattern and associated factors among pregnant women attending ANC services in health centers of Addis Ababa, Ethiopia. The goal of this study is to improve the dietary practice of pregnant women, thereby preventing the long-term ramifications of malnutrition.

Study area, design and period

The study was conducted in Addis Ababa, the capital city of Ethiopia and among the fastest-growing cities in Africa. It was estimated that 5,228,000 people resided in the ten sub-cities of Addis Ababa during the study period [ 23 ]. The city has a sub-tropical highland climate and is populated by people from the different regions of Ethiopia. The magnitude of food insecurity among productive safety net program beneficiaries of the city was 77.10% [ 24 ]. There were six publicly owned general hospitals, one hundred two (102) health centers, eleven privately owned hospitals, and 882 clinics in the city. Using a cross-sectional design, pregnant mothers who came for ANC follow-up from June 07 to July 08, 2022 at the selected health centers were approached to participate in this study. In these health centers, there were 2,478 mothers who came for ANC services.

Sample size determination and sampling technique

The sample size was estimated for each specific objective, and the largest estimate was taken for this study. For the first specific objective, assuming a 54.46% proportion of acceptable FCS from a previous study [ 13 ], a 5% margin of error, a Z value of 1.96 at a 95% confidence interval (CI), a design effect of 1.5, and a 10% non-response rate, the sample size was estimated to be 629. However, the largest sample size was obtained for the second specific objective. Accordingly, Epi Info version 7.2.2 was used to estimate the sample size by considering the following assumptions: a crude odds ratio of 1.6 for having an acceptable FCS among pregnant women who had a positive attitude towards consumption of a diversified diet, taken from a previous study [ 13 ], 80% power, a 95% CI, and a design effect of 1.5. Therefore, 999 was the final sample size after adding a 10% non-response rate.
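For illustration only, the first-objective calculation described above can be reproduced approximately with the standard single-proportion formula n = Z^2 p(1-p)/d^2, inflated by the design effect and the non-response rate; the sketch below uses the assumptions stated in the text and is not the Epi Info procedure itself.

```python
# Sketch of the first-objective sample size calculation (single proportion).
# Assumptions taken from the text: p = 0.5446, d = 0.05, Z = 1.96,
# design effect = 1.5, non-response = 10%.
import math

p, d, z = 0.5446, 0.05, 1.96
deff, non_response = 1.5, 0.10

n_simple = (z ** 2) * p * (1 - p) / d ** 2        # simple random sampling
n_design = n_simple * deff                        # inflate for the multistage design
n_final = math.ceil(n_design * (1 + non_response))
print(round(n_simple, 1), round(n_design, 1), n_final)   # ~381.1, ~571.7, 629
```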

Pregnant women coming for antenatal care services at the selected health centers were included, whereas pregnant women who were seriously ill during the data collection period were excluded. A multistage sampling technique followed by a systematic random sampling technique was employed to select the study participants. Out of the ten sub-cities in Addis Ababa, four sub-cities (30%) were selected randomly: Nifas Silk Lafto, Kolfe Keraniyo, Lideta, and Akaki Kality. In the selected sub-cities, there were 28 health centers. First, nine health centers (one from Nifas Silk Lafto, two from Kolfe Keraniyo, three from Lideta, and three from Akaki Kality) were selected randomly using a lottery method. Then, the required sample size was allocated proportionally to the selected health centers, and every third (k ≈ 2478/999) pregnant woman coming for ANC follow-up was selected.

Data collection tools and measurement

Data were collected using a pretested, interviewer-administered questionnaire that comprised socio-demographic data, dietary habits, attitude towards consumption of a diversified diet, obstetric history, and the food consumption score (FCS). The questionnaire was first prepared in English and then translated into Amharic (the local language). We used the Kobo Toolbox to collect the data. Nine BSc nurses and four public health officers served as data collectors and supervisors, respectively. The questionnaire was pretested on 5% of the final estimated sample size in Arada sub-city. After the pretest, the question that assessed participants’ residence was excluded, as all the study participants were urban residents. On the food frequency questionnaire, necessary modifications were made by including foods that were not previously included.

The outcome variable of this study was food consumption score (FCS), information on foods which were consumed in the last seven days prior to the data collection time was gathered. Food consumption score (FCS) is a composite variable that is constructed based on the following criteria: food frequency, diet diversity and relative nutritional value of each food item [ 1 ]. Food consumption score (FCS) was computed after asking the study participants about the frequency and consumption of eight food groups over the period of seven days prior to the data collection period. In the questionnaire, there were 70 food items which were commonly consumed in the study area. The Cronbach’s alpha value (internal consistency) was observed to be 0.82.

Then, the consumption frequencies were summed and multiplied by the standardized food group weights. Finally, the score was categorized into three categories: poor food consumption score (FCS) (0–21), borderline FCS (21.5–35), and acceptable FCS (>35) [ 1 , 25 ]. Wealth status was determined using principal component analysis of 15 items, and it was later categorized into five categories (poorest to richest) [ 23 ]. The attitude of the study participants towards the consumption of a diversified diet was measured using four Likert-scale items, with responses ranging from strongly disagree to strongly agree; respondents scoring above the median were considered to have a positive attitude. The internal consistency of the questionnaire was checked using Cronbach’s alpha (0.78). The trimesters were defined as first trimester (less than 14 weeks), second trimester (14–27 complete weeks), and third trimester (28 complete weeks until delivery). Finally, the birth interval was categorized as a recommended birth interval when the interpregnancy interval was more than 24 months; otherwise it was categorized as a not recommended birth interval [ 26 ].
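As a sketch of the scoring step just described, the snippet below computes an FCS from seven-day food-group frequencies using the generic WFP food-group weights and the cut-offs cited above; the example frequencies are invented for illustration, and the weights are the standard WFP values rather than anything reported by this study.

```python
# Sketch: computing a food consumption score (FCS) from 7-day food-group
# frequencies, using the standard WFP weights and the cut-offs cited above.
WFP_WEIGHTS = {
    "staples": 2.0, "pulses": 3.0, "vegetables": 1.0, "fruits": 1.0,
    "meat_fish_eggs": 4.0, "milk": 4.0, "sugar": 0.5, "oil": 0.5,
}

def food_consumption_score(days_consumed):
    """days_consumed: dict of food group -> days eaten in the past 7 days."""
    score = 0.0
    for group, weight in WFP_WEIGHTS.items():
        days = min(days_consumed.get(group, 0), 7)   # cap each group at 7 days
        score += days * weight
    return score

def fcs_category(score):
    if score <= 21:
        return "poor"
    elif score <= 35:
        return "borderline"
    return "acceptable"

# Invented example: 7*2 + 3*3 + 4 + 1 + 2*4 + 1*4 + 5*0.5 + 7*0.5 = 46.0 -> acceptable
example = {"staples": 7, "pulses": 3, "vegetables": 4, "fruits": 1,
           "meat_fish_eggs": 2, "milk": 1, "sugar": 5, "oil": 7}
s = food_consumption_score(example)
print(s, fcs_category(s))
```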

Data analysis

The data collected using the Kobo Toolbox were exported to STATA 14 for analysis. Descriptive data were reported as frequencies, percentages, and means (±SD) and presented in tables. Ordinal logistic regression was used to identify predictors of FCS. Multicollinearity was checked using the variance inflation factor (VIF < 10). The Brant test of the parallel regression assumption (p value = 0.66) confirmed the proportional odds assumption. After checking the assumptions of ordinal logistic regression, CORs and AORs with 95% CIs were used to ascertain predictors of the outcome variable in the bivariable (p value < 0.25) and multivariable ordinal logistic regression, respectively. Finally, a p value < 0.05 was used to determine the level of significance in the final model. The final model was reached after checking the adequacy of the data using the Hosmer-Lemeshow test.
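For readers unfamiliar with proportional odds models, a minimal sketch of this kind of analysis is shown below, here in Python with statsmodels' OrderedModel rather than STATA, and on made-up data with illustrative variable names; the Brant test itself is not part of statsmodels, so only model fitting and AOR extraction are illustrated.

```python
# Sketch: proportional odds (ordinal logistic) regression with statsmodels.
# `fcs_cat` is an ordered outcome (poor < borderline < acceptable); the
# predictors and data are invented, not the study's dataset.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(2)
n = 400
df = pd.DataFrame({
    "skips_meal": rng.integers(0, 2, n),
    "can_read_write": rng.integers(0, 2, n),
    "poorest_household": rng.integers(0, 2, n),
})
# Latent score: higher values push respondents toward the "acceptable" category.
latent = (0.9 * df["can_read_write"] - 0.7 * df["poorest_household"]
          - 0.3 * df["skips_meal"] + rng.logistic(size=n))
df["fcs_cat"] = pd.cut(latent, bins=[-np.inf, -1.0, 0.5, np.inf],
                       labels=["poor", "borderline", "acceptable"])  # ordered categories

model = OrderedModel(df["fcs_cat"],
                     df[["skips_meal", "can_read_write", "poorest_household"]],
                     distr="logit")
res = model.fit(method="bfgs", disp=False)

# The first three parameters are the slope coefficients; exponentiating them
# gives adjusted odds ratios. The remaining parameters are the cut points.
print(np.exp(res.params[:3]))
print(res.summary())
```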

Ethics approval and consent to participate

The study was conducted according to the guidelines of the 1964 Declaration of Helsinki and subsequent amendments. Ethical clearance was obtained from the University of Gondar Institutional Review Board of the Institute of Public Health (Ref. No IPH/2119/2014). A permission letter was obtained from the Addis Ababa Health Office. Written informed consent was obtained from all study participants. Study participants who were unable to read and write signed by fingerprint in the presence of two literate witnesses. Data collectors strictly followed COVID-19 prevention protocols. Confidentiality of the study participants was ensured; no personal identifiers were used, and the Kobo account was password protected so that only authorized users were able to access the data.

Sociodemographic characteristics of the study participants

In this study, 949 pregnant women consented to participate during the study period, yielding a 95% response rate. The vast majority of the study participants (96.80%) were married. The mean (±standard deviation (SD)) age of the study participants was 27.16 (±4.46) years, and about two fifths (39.10%) of the study participants were in the age range 25–29 years. Regarding educational status, half (50.10%) of the study participants had completed primary education. More than two fifths (43.90%) of the study participants were housewives. Almost a quarter (24.50%) of the participants were from poor households. More than half (57.60%) of the pregnant women had a positive attitude towards consumption of a variety of foods ( Table 1 ).


Maternal characteristics

As to the maternal characteristics of the study participants, more than half (57.6%) were multigravida, almost two thirds (61.5%) were in the second trimester of pregnancy, more than two thirds (67.02%) had at least one ANC visit, and 69.6% had received nutritional counselling when they came for an ANC visit ( Table 2 ).


Dietary habits of the study participants

Of the study participants, less than half (45.3%) ate three times a day, whereas over half (56.5%) regularly ate snacks. Nearly three fifths (59.2%) skipped meals, with the most common reasons being fatigue or preoccupation with work (19.6%), not wanting to gain weight (19.6%), and other causes (31.3%) such as loss of appetite, vomiting, and discomfort. Likewise, nearly one third (31.1%) reported a history of food taboos. Lastly, more than a quarter (26.7%) of study participants reported having a history of food cravings ( Table 3 ).


Food consumption pattern

In this study, practically all of the study participants had consumed common staples, and nearly three quarters (73.2%) of the participants had consumed animal-source food, such as meat ( Table 4 ).


Food consumption score

This study revealed that a little over half (51.20%; 95% CI: 48.00%-54.40%) of the participants had an acceptable food consumption score, more than two fifths (42.60%; 95% CI: 39.45%-45.74%) had a borderline food consumption score, and a small proportion (6.2%; 95% CI: 4.84%-7.94%) had a poor food consumption score ( Table 5 ).


Factors associated with food consumption score

Ordinal logistic regression was used to identify factors associated with the food consumption score. The following variables, which were significant in the bivariable analysis (p value < 0.25), were fitted in the final model: age, husband's educational status, husband's occupation, maternal education, attitude, wealth status, family size, meal skipping, food avoidance, food craving, taking supplements, stillbirth, ANC visit, and nutrition counselling during ANC follow-up. However, only meal skipping, maternal education, attitude, and wealth status were found to be independent predictors of the food consumption score.

The odds of having an acceptable food consumption score (relative to borderline and poor food consumption scores) among study participants who could read and write were 3.99 times higher than among study participants who were unable to read and write [AOR = 3.99, 95% CI: 1.33–11.96]. The odds of having an acceptable food consumption score (relative to borderline and poor) were 49% lower among study participants who came from the poorest households when compared with participants who came from the richest households [AOR = 0.52, 95% CI: 0.24–0.78]. The odds of having an acceptable food consumption score (relative to borderline and poor) were 1.36 times higher among study participants who did not skip meals compared with participants who skipped meals [AOR = 1.36, 95% CI: 1.03–1.81]. Finally, study participants with a positive attitude towards consumption of a diversified diet had 52% higher odds of having an acceptable food consumption score (relative to borderline and poor) [AOR = 1.52, 95% CI: 1.17–1.98] ( Table 6 ).
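To make the link between a regression coefficient and a reported AOR concrete, the short sketch below back-transforms a log-odds coefficient and its standard error into an odds ratio with a 95% CI; the coefficient and standard error are hypothetical values chosen only so that the output roughly matches the attitude estimate quoted above (AOR = 1.52; 95% CI: 1.17–1.98), not numbers taken from Table 6.

```python
# Sketch: converting a log-odds coefficient (b) and its standard error (se)
# into an odds ratio with a 95% CI. The numbers are illustrative only.
import math

b, se = 0.42, 0.135          # hypothetical coefficient and standard error
z = 1.96                     # critical value for a 95% CI
odds_ratio = math.exp(b)
ci_low = math.exp(b - z * se)
ci_high = math.exp(b + z * se)
print(f"OR = {odds_ratio:.2f} (95% CI: {ci_low:.2f}-{ci_high:.2f})")
# An OR above 1 indicates higher odds of an acceptable score relative to the
# combined lower (borderline and poor) categories.
```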


This study sought to examine the FCS and associated factors among pregnant women who were having ANC follow-up in health centers of Addis Ababa, Ethiopia. The results of this study showed that a little over half (51.20%; 95% CI: 48.00%-54.30%) of the study participants had an acceptable FCS, and a small proportion had a poor FCS (6.20%).

The proportion of acceptable FCS in our study was far below the WFP recommendation [ 1 ]. Furthermore, the percentage of acceptable FCS was comparatively lower than in studies from Bangladesh (58%) [ 27 ], Nigeria (80.3%) [ 28 ] and pocket studies from Ethiopia (81.5% and 54.6% at Shegaw Motta and Eastern Ethiopia, respectively) [ 12 , 13 ]. The study period could explain the lower rate of acceptable FCS; for example, the study at Shegaw Motta was conducted in the main harvest season, while our study was conducted in the fasting season, when consumption of animal-source food decreases [ 29 ]. Methodologically, the larger sample size in the current study and differences in outcome ascertainment might also explain the lower rate of acceptable FCS. In Ethiopia, pregnant women avoid foods for cultural and religious reasons, and this might explain the discrepancy between the current study and the study from Nigeria, where religion and culture have less influence over food choice [ 30 ].

As to the factors associated with FCS, our study showed that maternal educational status (able to read and write), not being in the poorest wealth category, a positive attitude towards dietary diversity, and meal skipping were independent predictors of FCS. Mothers who were able to read and write had higher odds of having an acceptable FCS than mothers who were unable to read and write, emphasizing the importance of nutritional education during pregnancy. This is supported by similar studies conducted in Nigeria [ 30 ], Ghana [ 31 ], and elsewhere in Ethiopia [ 32 ]. It is evident that increasing the level of literacy is crucial to mitigate the problem, even in the poorest households [ 33 ]. Besides, mothers who are able to read and write have better access to nutritional information from the internet, brochures, newspapers and magazines [ 34 – 36 ]. In affluent settings, where the toll of non-communicable disease is spiralling, enhancing the level of literacy also plays a pivotal role in appropriate food selection and consumption [ 37 ].

Being in the poorest wealth category decreased the odds of having an acceptable FCS by 49% when compared with mothers from the richest wealth category. This was also observed in a previous study conducted in Bishoftu, Oromia [ 10 ]. Pregnant mothers from the poorest households have limited economic access to procure a diversified diet. On top of this, different studies have pinpointed that being in the lowest wealth category is associated with decreased consumption of animal-source food [ 38 ], which in turn results in a lower FCS. Mothers who did not skip meals also had higher odds of having an acceptable FCS compared with their counterparts. A similar finding was observed in a study in Eastern Ethiopia [ 15 ]. During gestation, meal patterning is highly important, since pregnant women who sustain prolonged periods without food by skipping meals or snacks may be inducing physiologic stress in their pregnancy [ 39 ]. Even though occasionally skipping a meal is not harmful, regularly skipping meals for different reasons is not advisable for a better pregnancy outcome [ 40 , 41 ]. Moreover, different studies have shown that skipping meals during this period is associated with decreased dietary quality [ 15 ].

The study also revealed that participants who had a positive attitude towards consumption of a diversified diet had increased odds of having an acceptable FCS compared with their counterparts. A similar finding was observed in a study conducted in Eastern Ethiopia [ 13 ]. Different studies support the observation that pregnant women with a more positive attitude have a better practice of consuming a diversified diet [ 42 ]. Women with positive feelings towards a diversified diet are also motivated to consume foods from different food groups [ 43 , 44 ].

It should be mentioned that the present study provides stronger evidence on dietary quality and its predictors among pregnant women through the use of ordinal logistic regression [ 22 ]. However, the methodological limitations of the study cannot go unnoticed. Despite the use of probes such as photographs to aid the memory of the study participants, the problem of recall bias cannot be ignored, which might overestimate or underestimate the result. On top of that, the cross-sectional nature of the study limits detection of causal associations between the outcome and predictor variables. Even though the FCS is a validated tool to assess calorie intake, it has not been validated to measure the adequacy of macronutrients and micronutrients. The use of a four-item Likert questionnaire is another limitation of the current study; future studies should use a questionnaire with a sufficient number of items.

Implication of the study

The findings of this study can be used to inform public health policies and programmes that strive to bring about better pregnancy outcomes by promoting a balanced diet among vulnerable groups of the population. Therefore, to meet the WFP recommendation of 90% acceptable FCS, interventions need to give due attention to mothers with lower educational status and lower socio-economic status. The implications of this study are linked to the importance of nutritional education that targets a positive attitude towards consumption of a diversified diet. Moreover, the findings imply the importance of providing a diversified diet in deterring the sequelae of malnutrition.

The findings of this study showed that only half of the study participants had an acceptable FCS, which is far below the WFP recommendation. Besides, being able to read and write, not skipping meals, a positive attitude towards consumption of a variety of foods, and not being from the poorest household were significantly associated with having an acceptable FCS relative to borderline and poor FCS. Therefore, it is important to give special attention to pregnant mothers with low socioeconomic status and to mothers who skip meals in order to enhance their food consumption score and improve their nutritional intake.

Future research is encouraged to investigate nutrient adequacy among pregnant women. Finally, future studies triangulated with qualitative research investigating behavioural factors, such as food taboos and norms, that influence FCS among pregnant women are also encouraged.

Supporting information

S1 File. FCS PLOS ONE.

https://doi.org/10.1371/journal.pone.0306169.s001

Acknowledgments

The authors of this article are grateful to the study participants, without whom this work would not have been possible.

  • 1. WFP. Food consumption analysis: calculation and use of the food consumption score in food security analysis. WFP, Rome, Italy, 2008.
  • 26. World Health Organization. Report of a WHO technical consultation on birth spacing: Geneva, Switzerland, 13–15 June 2005. World Health Organization, 2007.
  • 27. Ahmed E, Jahan I, Islam MA. Food security status and food consumption among urban and rural pregnant women of Jashore district in Bangladesh.


  • Open access
  • Published: 19 June 2024

Detecting hallucinations in large language models using semantic entropy

  • Sebastian Farquhar   ORCID: orcid.org/0000-0002-9185-6415 1   na1 ,
  • Jannik Kossen 1   na1 ,
  • Lorenz Kuhn 1   na1 &
  • Yarin Gal   ORCID: orcid.org/0000-0002-2733-2078 1  

Nature volume  630 ,  pages 625–630 ( 2024 ) Cite this article

70k Accesses

1 Citation

1458 Altmetric


  • Computer science
  • Information technology

Large language model (LLM) systems, such as ChatGPT 1 or Gemini 2 , can show impressive reasoning and question-answering capabilities but often ‘hallucinate’ false outputs and unsubstantiated answers 3 , 4 . Answering unreliably or without the necessary information prevents adoption in diverse fields, with problems including fabrication of legal precedents 5 or untrue facts in news articles 6 and even posing a risk to human life in medical domains such as radiology 7 . Encouraging truthfulness through supervision or reinforcement has been only partially successful 8 . Researchers need a general method for detecting hallucinations in LLMs that works even with new and unseen questions to which humans might not know the answer. Here we develop new methods grounded in statistics, proposing entropy-based uncertainty estimators for LLMs to detect a subset of hallucinations—confabulations—which are arbitrary and incorrect generations. Our method addresses the fact that one idea can be expressed in many ways by computing uncertainty at the level of meaning rather than specific sequences of words. Our method works across datasets and tasks without a priori knowledge of the task, requires no task-specific data and robustly generalizes to new tasks not seen before. By detecting when a prompt is likely to produce a confabulation, our method helps users understand when they must take extra care with LLMs and opens up new possibilities for using LLMs that are otherwise prevented by their unreliability.


‘Hallucinations’ are a critical problem 9 for natural language generation systems using large language models (LLMs), such as ChatGPT 1 or Gemini 2 , because users cannot trust that any given output is correct.

Hallucinations are often defined as LLMs generating “content that is nonsensical or unfaithful to the provided source content” 9 , 10 , 11 but they have come to include a vast array of failures of faithfulness and factuality. We focus on a subset of hallucinations which we call ‘confabulations’ 12 for which LLMs fluently make claims that are both wrong and arbitrary—by which we mean that the answer is sensitive to irrelevant details such as random seed. For example, when asked a medical question “What is the target of Sotorasib?” an LLM confabulates by sometimes answering KRASG12 ‘C’ (correct) and other times KRASG12 ‘D’ (incorrect) despite identical instructions. We distinguish this from cases in which a similar ‘symptom’ is caused by the following different mechanisms: when LLMs are consistently wrong as a result of being trained on erroneous data such as common misconceptions 13 ; when the LLM ‘lies’ in pursuit of a reward 14 ; or systematic failures of reasoning or generalization. We believe that combining these distinct mechanisms in the broad category hallucination is unhelpful. Our method makes progress on a portion of the problem of providing scalable oversight 15 by detecting confabulations that people might otherwise find plausible. However, it does not guarantee factuality because it does not help when LLM outputs are systematically bad. Nevertheless, we significantly improve question-answering accuracy for state-of-the-art LLMs, revealing that confabulations are a great source of error at present.

We show how to detect confabulations by developing a quantitative measure of when an input is likely to cause an LLM to generate arbitrary and ungrounded answers. Detecting confabulations allows systems built on LLMs to avoid answering questions likely to cause confabulations, to make users aware of the unreliability of answers to a question or to supplement the LLM with more grounded search or retrieval. This is essential for the critical emerging field of free-form generation in which naive approaches, suited to closed vocabulary and multiple choice, fail. Past work on uncertainty for LLMs has focused on simpler settings, such as classifiers 16 , 17 and regressors 18 , 19 , whereas the most exciting applications of LLMs relate to free-form generations.

The term hallucination in the context of machine learning originally comes from filling in ungrounded details, either as a deliberate strategy 20 or as a reliability problem 4 . The appropriateness of the metaphor has been questioned as promoting undue anthropomorphism 21 . Although we agree that metaphor must be used carefully with LLMs 22 , the widespread adoption of the term hallucination reflects the fact that it points to an important phenomenon. This work represents a step towards making that phenomenon more precise.

To detect confabulations, we use probabilistic tools to define and then measure the ‘semantic’ entropy of the generations of an LLM—an entropy that is computed over meanings of sentences. High entropy corresponds to high uncertainty 23 , 24 , 25 —so semantic entropy is one way to estimate semantic uncertainties. Semantic uncertainty, the broader category of measures we introduce, could be operationalized with other measures of uncertainty, such as mutual information, instead. Entropy in free-form generation is normally hard to measure because answers might mean the same thing (be semantically equivalent) despite being expressed differently (being syntactically or lexically distinct). This causes naive estimates of entropy or other lexical variation scores 26 to be misleadingly high when the same correct answer might be written in many ways without changing its meaning.

By contrast, our semantic entropy moves towards estimating the entropy of the distribution of meanings of free-form answers to questions, insofar as that is possible, rather than the distribution over the ‘tokens’ (words or word-pieces) which LLMs natively represent. This can be seen as a kind of semantic consistency check 27 for random seed variation. An overview of our approach is provided in Fig. 1 and a worked example in Supplementary Table 1 .

Fig. 1

a , Naive entropy-based uncertainty measures variation in the exact answers, treating ‘Paris’, ‘It’s Paris’ and ‘France’s capital Paris’ as different. But this is unsuitable for language tasks for which sometimes different answers mean the same things. Our semantic entropy clusters answers which share meanings before computing the entropy. A low semantic entropy shows that the LLM is confident about the meaning. b , Semantic entropy can also detect confabulations in longer passages. We automatically decompose a long generated answer into factoids. For each factoid, an LLM generates questions to which that factoid might have been the answer. The original LLM then samples  M possible answers to these questions. Finally, we compute the semantic entropy over the answers to each specific question, including the original factoid. Confabulations are indicated by high average semantic entropy for questions associated with that factoid. Here, semantic entropy classifies Fact 1 as probably not a confabulation because generations often mean the same thing, despite very different wordings, which a naive entropy would have missed.

Intuitively, our method works by sampling several possible answers to each question and clustering them algorithmically into answers that have similar meanings, which we determine on the basis of whether answers in the same cluster entail each other bidirectionally 28 . That is, if sentence A entails that sentence B is true and vice versa, then we consider them to be in the same semantic cluster. We measure entailment using both general-purpose LLMs and natural language inference (NLI) tools developed specifically for detecting entailment for which we show direct evaluations in Supplementary Tables 2 and 3 and Supplementary Fig. 1 . Textual entailment has previously been shown to correlate with faithfulness 10 in the context of factual consistency 29 as well as being used to measure factuality in abstractive summarization 30 , especially when applied at the right granularity 31 .

Semantic entropy detects confabulations in free-form text generation across a range of language models and domains, without previous domain knowledge. Our evaluations cover question answering in trivia knowledge (TriviaQA 32 ), general knowledge (SQuAD 1.1; ref. 33 ), life sciences (BioASQ 34 ) and open-domain natural questions (NQ-Open 35 ) derived from actual queries to Google Search 36 . In addition, semantic entropy detects confabulations in mathematical word problems (SVAMP 37 ) and in a biography-generation dataset, FactualBio, accompanying this paper.

Our results for TriviaQA, SQuAD, BioASQ, NQ-Open and SVAMP are all evaluated context-free and involve sentence-length answers (96 ± 70 characters, mean ± s.d.) and use LLaMA 2 Chat (7B, 13B and 70B parameters) 38 , Falcon Instruct (7B and 40B) 39 and Mistral Instruct (7B) 40 . In the Supplementary Information , we further consider short-phrase-length answers. Results for FactualBio (442 ± 122 characters) use GPT-4 (ref. 1 ). At the time of writing, GPT-4 (ref. 1 ) did not expose output probabilities 41 or hidden states, although it does now. As a result, we propose a discrete approximation of our estimator for semantic entropy which allows us to run experiments without access to output probabilities, which we use for all GPT-4 results in this paper and which performs similarly well.

Our confabulation detection with semantic entropy is more robust to user inputs from previously unseen domains than methods which aim to ‘learn’ how to detect confabulations from a set of example demonstrations. Our method is unsupervised, meaning that we do not need labelled examples of confabulations. By contrast, supervised methods detect confabulations by learning patterns behind examples of confabulations, assuming that future questions preserve these patterns. But this assumption is often untrue in new situations or with confabulations that human overseers are unable to identify (compare Fig. 17 of ref. 24 ). As a strong supervised baseline, we compare to an embedding regression method inspired by ref. 24 which trains a logistic regression classifier to predict whether the model correctly answered a question on the basis of the final ‘embedding’ (hidden state) of the LLM. We also use the P (True) method 24 which looks at the probability with which an LLM predicts that the next token is ‘True’ when few-shot prompted to compare a main answer with ‘brainstormed’ alternatives.

Confabulations contribute substantially to incorrect answers given by language models. We show that semantic entropy can be used to predict many incorrect model answers and to improve question-answering accuracy by refusing to answer those questions the model is uncertain about. Corresponding to these two uses, we evaluate two main metrics. First, the widely used area under the receiver operating characteristic (AUROC) curve for the binary event that a given answer is incorrect. This measure captures both precision and recall and ranges from 0 to 1, with 1 representing a perfect classifier and 0.5 representing an un-informative classifier. We also show a new measure, the area under the ‘rejection accuracy’ curve (AURAC). This studies the case in which the confabulation detection score is used to refuse to answer the questions judged most likely to cause confabulations. Rejection accuracy is the accuracy of the answers of the model on the remaining questions and the area under this curve is a summary statistic over many thresholds (representative threshold accuracies are provided in Supplementary Material ). The AURAC captures the accuracy improvement which users would experience if semantic entropy was used to filter out questions causing the highest entropy.

Detecting confabulations in QA and math

In Fig. 2 , we show that both semantic entropy and its discrete approximation outperform our best baselines for sentence-length generations. These results are averaged across datasets and provide the actual scores on the held-out evaluation dataset. We report the raw average score across held-out evaluation datasets without standard error because the distributional characteristics are more a property of the models and datasets selected than the method. Consistency of relative results across different datasets is a stronger indicator of variation in this case.

Fig. 2

Semantic entropy outperforms leading baselines and naive entropy. AUROC (scored on the y -axes) measures how well methods predict LLM mistakes, which correlate with confabulations. AURAC (likewise scored on the y -axes) measures the performance improvement of a system that refuses to answer questions which are judged likely to cause confabulations. Results are an average over five datasets, with individual metrics provided in the Supplementary Information .

Semantic entropy greatly outperforms the naive estimation of uncertainty using entropy: computing the entropy of the length-normalized joint probability of the token sequences. Naive entropy estimation ignores the fact that token probabilities also express the uncertainty of the model over phrasings that do not change the meaning of an output.

Our methods also outperform the supervised embedding regression method both in- and out-of-distribution. In pale-yellow bars we show that embedding regression performance deteriorates when its training data do not match the deployment distribution—which mirrors the common real-world case in which there is a distribution shift between training and deployment 42 —the plotted value is the average metric for embedding regression trained on one of the four ‘off-distribution’ datasets for that evaluation. This is critical because reliable uncertainty is most important when the data distribution shifts. Semantic entropy also outperforms P (True) which is supervised ‘in-context’; that is, it is adapted to the deployment task with a few training examples provided in the LLM prompt itself. The discrete variant of semantic entropy performs similarly to our standard estimator, despite not requiring exact output probabilities.

Averaged across the 30 combinations of tasks and models we study, semantic entropy achieves the best AUROC value of 0.790 whereas naive entropy (0.691), P (True) (0.698) and the embedding regression baseline (0.687) lag behind it. Semantic entropy performs well consistently, with stable performance (between 0.78 and 0.81 AUROC) across the different model families (LLaMA, Falcon and Mistral) and scales (from 7B to 70B parameters) which we study (we report summary statistics for each dataset and model as before). Although semantic entropy outperforms the baselines across all model sizes, P (True) seems to improve with model size, suggesting that it might become more competitive for very capable honest models in settings that the model understands well (which are, however, not the most important cases to have good uncertainty). We use ten generations to compute entropy, selected using analysis in Supplementary Fig. 2 . Further results for short-phrase generations are described in Supplementary Figs. 7 – 10 .

The results in Fig. 2 offer a lower bound on the effectiveness of semantic entropy at detecting confabulations. These evaluations determine whether semantic entropy and baseline methods can detect when the answers of the model are incorrect (which we validate against human correctness evaluations in Supplementary Table 4 ). In addition to errors from confabulations (arbitrary incorrectness), this also includes other types of mistakes for which semantic entropy is not suited, such as consistent errors learned from the training data. The fact that methods such as embedding regression are able to spot other kinds of errors, not just confabulations, but still are outperformed by semantic entropy, suggests that confabulations are a principal category of errors for actual generations.

Examples of questions and answers from TriviaQA, SQuAD and BioASQ, for LLaMA 2 Chat 70B, are shown in Table 1 . These illustrate how only semantic entropy detects when the meaning is constant but the form varies (the first row of the table) whereas semantic entropy and naive entropy both correctly predict the presence of confabulations when the form and meaning vary together (second row) and predict the absence of confabulations when the form and meaning are both constant across several resampled generations (third row). In the final row, we give an example in which semantic entropy is erroneously high as a result of overly sensitive semantic clustering relative to the reference answer. Our clustering method distinguishes the answers which provide a precise date from those which only provide a year. For some contexts that would have been correct but in this context the distinction between the specific day and the year is probably irrelevant. This highlights the importance of context and judgement in clustering, especially in subtle cases, as well as the shortcomings of evaluating against fixed reference answers which do not capture the open-ended flexibility of conversational deployments of LLMs.

Detecting confabulations in biographies

Semantic entropy is most natural for sentences that express a single proposition but the idea of semantic equivalence is trickier to apply to longer passages which express many propositions which might only agree partially 43 . Nevertheless, we can use semantic entropy to detect confabulations in longer generations, such as entire paragraphs of text. To show this, we develop a dataset of biographical generations from GPT-4 (v.0613) for 21 individuals notable enough to have their own Wikipedia page but without extensive online biographies. From each biography generated by GPT-4, we automatically extract propositional factual claims about the individual (150 factual claims in total), which we manually label as true or false.

Applying semantic entropy to this problem is challenging. Naively, one might simply regenerate each sentence (conditioned on the text so far) and then compute semantic entropy over these regenerations. However, the resampled sentences often target different aspects of the biography: for example, one time describing family and the next time profession. This is analogous to the original problem semantic entropy was designed to resolve: the model is uncertain about the right ordering of facts, not about the facts themselves. To address this, we break down the entire paragraph into factual claims and reconstruct questions which might have been answered by those claims. Only then do we apply semantic entropy (Fig. 1 ) by generating three new answers to each question (selected with analysis in Supplementary Figs. 3 and 4 ) and computing the semantic entropy over those generations plus the original factual claim. We aggregate these by averaging the semantic entropy over all the questions to get an uncertainty score for each proposition, which we use to detect confabulations. Unaggregated results are shown in Supplementary Figs. 5 and 6 .

As GPT-4 did not allow access to the probability of the generation at the time of writing, we use a discrete variant of semantic entropy which makes the further approximation that we can infer a discrete empirical distribution over semantic meaning clusters from only the generations ( Methods ). This allows us to compute semantic entropy using only the black-box outputs of an LLM. However, we were unable to compute the naive entropy baseline, the standard semantic entropy estimator or the embedding regression baseline for GPT-4 without output probabilities and embeddings.

In Fig. 3 we show that the discrete variant of semantic entropy effectively detects confabulations on this dataset. Its AUROC and AURAC are higher than either a simple ‘self-check’ baseline—which just asks the LLM whether the factoid is likely to be true—or a variant of P (True) which has been adapted to work for the paragraph-length setting. Discrete semantic entropy has better rejection accuracy performance until 20% of the questions have been rejected at which point P (True) has a narrow edge. This indicates that the questions predicted to cause confabulations are indeed more likely to be wrong.

Fig. 3

The discrete variant of our semantic entropy estimator outperforms baselines both when measured by AUROC and AURAC metrics (scored on the y -axis). The AUROC and AURAC are substantially higher than for both baselines. At above 80% of questions being answered, semantic entropy has the highest accuracy. Only when the top 20% of answers judged most likely to be confabulations are rejected does the answer accuracy on the remainder for the P (True) baseline exceed semantic entropy.

Our probabilistic approach, accounting for semantic equivalence, detects an important class of hallucinations: those that are caused by a lack of LLM knowledge. These are a substantial portion of the failures at present and will continue even as models grow in capabilities because situations and cases that humans cannot reliably supervise will persist. Confabulations are a particularly noteworthy failure mode for question answering but appear in other domains too. Semantic entropy needs no previous domain knowledge and we expect that algorithmic adaptations to other problems will allow similar advances in, for example, abstractive summarization. In addition, extensions to alternative input variations such as rephrasing or counterfactual scenarios would allow a similar method to act as a form of cross-examination 44 for scalable oversight through debate 45 .

The success of semantic entropy at detecting errors suggests that LLMs are even better at “knowing what they don’t know” than was argued by ref. 24 —they just don’t know they know what they don’t know. Our method explicitly does not directly address situations in which LLMs are confidently wrong because they have been trained with objectives that systematically produce dangerous behaviour, cause systematic reasoning errors or are systematically misleading the user. We believe that these represent different underlying mechanisms—despite similar ‘symptoms’—and need to be handled separately.

One exciting aspect of our approach is the way it makes use of classical probabilistic machine learning methods and adapts them to the unique properties of modern LLMs and free-form language generation. We hope to inspire a fruitful exchange of well-studied methods and emerging new problems by highlighting the importance of meaning when addressing language-based machine learning problems.

Semantic entropy as a strategy for overcoming confabulation builds on probabilistic tools for uncertainty estimation. It can be applied directly to any LLM or similar foundation model without requiring any modifications to the architecture. Our ‘discrete’ variant of semantic uncertainty can be applied even when the predicted probabilities for the generations are not available, for example, because access to the internals of the model is limited.

In this section we introduce background on probabilistic methods and uncertainty in machine learning, discuss how it applies to language models and then discuss our contribution, semantic entropy, in detail.

Uncertainty and machine learning

We aim to detect confabulations in LLMs, using the principle that the model will be uncertain about generations for which its output is going to be arbitrary.

One measure of uncertainty is the predictive entropy of the output distribution, which measures the information one has about the output given the input 25 . The predictive entropy (PE) for an input sentence x is the conditional entropy ( H ) of the output random variable Y with realization y given x ,

\({\rm{PE}}({\boldsymbol{x}})=H(Y| {\boldsymbol{x}})=-{\sum }_{y}P(y| {\boldsymbol{x}})\log P(y| {\boldsymbol{x}}).\) (1)

A low predictive entropy indicates an output distribution which is heavily concentrated whereas a high predictive entropy indicates that many possible outputs are similarly likely.

Aleatoric and epistemic uncertainty

We do not distinguish between aleatoric and epistemic uncertainty in our analysis. Researchers sometimes separate aleatoric uncertainty (uncertainty in the underlying data distribution) from epistemic uncertainty (caused by having only limited information) 46 . Further advances in uncertainty estimation which separate these kinds of uncertainty would enhance the potential for our semantic uncertainty approach by allowing extensions beyond entropy.

Joint probabilities of sequences of tokens

Generative LLMs produce strings of text by selecting tokens in sequence. Each token is a wordpiece that often represents three or four characters (though especially common sequences and important words such as numbers typically get their own token). To compute entropies, we need access to the probabilities the LLM assigns to the generated sequence of tokens. The probability of the entire sequence, s , conditioned on the context, x , is the product of the conditional probabilities of new tokens given past tokens, whose resulting log-probability is \(\log P({\bf{s}}| {\boldsymbol{x}})={\sum }_{i}\log P({s}_{i}| {{\bf{s}}}_{ < i},{\boldsymbol{x}})\) , where s i is the i th output token and s < i denotes the set of previous tokens.

Length normalization

When comparing the log-probabilities of generated sequences, we use ‘length normalization’, that is, we use an arithmetic mean log-probability, \(\frac{1}{N}{\sum }_{i}^{N}\log P({s}_{i}| {{\bf{s}}}_{ < i},{\boldsymbol{x}})\) , instead of the sum. In expectation, longer sequences have lower joint likelihoods because of the conditional independence of the token probabilities 47 . The joint likelihood of a sequence of length N shrinks exponentially in N . Its negative log-probability therefore grows linearly in N , so longer sentences tend to contribute more to entropy. We therefore interpret length-normalizing the log-probabilities when estimating the entropy as asserting that the expected uncertainty of generations is independent of sentence length. Length normalization has some empirical success 48 , including in our own preliminary experiments, but little theoretical justification in the literature.
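As a minimal illustration, the sketch below computes raw and length-normalized log-probabilities for a generated sequence from its per-token log-probabilities; the numbers are made up.

```python
# Sketch: length-normalized sequence log-probability from per-token log-probabilities.
import numpy as np

def sequence_logprob(token_logprobs, length_normalize=True):
    """token_logprobs: log P(s_i | s_<i, x) for each token of one generated sequence."""
    total = float(np.sum(token_logprobs))
    if length_normalize:
        return total / len(token_logprobs)  # arithmetic mean log-probability
    return total                            # raw joint log-probability

short = [-0.2, -0.1, -0.3]                  # made-up per-token log-probabilities
longer = short + [-0.2, -0.2, -0.1, -0.3]
print(sequence_logprob(short, False), sequence_logprob(longer, False))  # joint: longer is lower
print(sequence_logprob(short), sequence_logprob(longer))                # normalized: comparable
```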

Principles of semantic uncertainty

If we naively calculate the predictive entropy directly from the probabilities of the generated sequence of tokens, we conflate the uncertainty of the model over the meaning of its answer with the uncertainty over the exact tokens used to express that meaning. For example, even if the model is confident in the meaning of a generation, there are still usually many different ways for phrasing that generation without changing its meaning. For the purposes of detecting confabulations, the uncertainty of the LLM over meanings is more important than the uncertainty over the exact tokens used to express those meanings.

Our semantic uncertainty method therefore seeks to estimate only the uncertainty the LLM has over the meaning of its generation, not the choice of words. To do this, we introduce an algorithm that clusters model generations by meaning and subsequently calculates semantic uncertainty. At a high level this involves three steps:

Generation: sample output sequences of tokens from the predictive distribution of a LLM given a context x .

Clustering: cluster sequences by their meaning using our clustering algorithm based on bidirectional entailment.

Entropy estimation: estimate semantic entropy by summing probabilities of sequences that share a meaning following equation ( 2 ) and compute their entropy.

Generating a set of answers from the model

Given some context x as input to the LLM, we sample M sequences, { s (1) , …,  s ( M ) } and record their token probabilities, { P ( s (1) ∣ x ), …,  P ( s ( M ) ∣ x )}. We sample all our generations from a single model, varying only the random seed used for sampling from the token probabilities. We do not observe the method to be particularly sensitive to details of the sampling scheme. In our implementation, we sample at temperature 1 using nucleus sampling ( P  = 0.9) (ref. 49 ) and top- K sampling ( K  = 50) (ref. 50 ). We also sample a single generation at low temperature (0.1) as an estimate of the ‘best generation’ of the model to the context, which we use to assess the accuracy of the model. (A lower sampling temperature increases the probability of sampling the most likely tokens).
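A minimal sketch of this sampling scheme with the Hugging Face transformers API is shown below; the checkpoint name and question are placeholders, and the sampling settings follow the values stated above.

```python
# Sketch: sample M answers for entropy estimation plus one low-temperature "best" answer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

prompt = ("Answer the following question in a single brief but complete sentence.\n"
          "Question: What is the capital of France?\nAnswer:")
inputs = tok(prompt, return_tensors="pt").to(model.device)

samples = model.generate(**inputs, do_sample=True, temperature=1.0, top_p=0.9, top_k=50,
                         num_return_sequences=10, max_new_tokens=50)
best = model.generate(**inputs, do_sample=True, temperature=0.1, max_new_tokens=50)

prompt_len = inputs["input_ids"].shape[1]
answers = [tok.decode(s[prompt_len:], skip_special_tokens=True) for s in samples]
# Per-token probabilities (needed for the non-discrete semantic entropy estimator) can be
# recovered by passing output_scores=True and return_dict_in_generate=True to generate().
```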

Clustering by semantic equivalence

To estimate semantic entropy we need to cluster generated outputs from the model into groups of outputs that mean the same thing as each other.

This can be described using ‘semantic equivalence’ which is the relation that holds between two sentences when they mean the same thing. We can formalize semantic equivalence mathematically. Let the space of tokens in a language be \({\mathcal{T}}\) . The space of all possible sequences of tokens of length N is then \({{\mathcal{S}}}_{N}\equiv {{\mathcal{T}}}^{N}\) . Note that N can be made arbitrarily large to accommodate whatever size of sentence one can imagine and one of the tokens can be a ‘padding’ token which occurs with certainty for each token after the end-of-sequence token. For some sentence \({\bf{s}}\in {{\mathcal{S}}}_{N}\) , composed of a sequence of tokens, \({s}_{i}\in {\mathcal{T}}\) , there is an associated meaning. Theories of meaning are contested 51 . However, for specific models and deployment contexts many considerations can be set aside. Care should be taken comparing very different models and contexts.

Let us introduce a semantic equivalence relation, E (  ⋅  ,  ⋅  ), which holds for any two sentences that mean the same thing—we will operationalize this presently. Recall that an equivalence relation is any reflexive, symmetric and transitive relation and that any equivalence relation on a set corresponds to a set of equivalence classes. Each semantic equivalence class captures outputs that can be considered to express the same meaning. That is, for the space of semantic equivalence classes \({\mathcal{C}}\) the sentences in the set \(c\in {\mathcal{C}}\) can be regarded in many settings as expressing a similar meaning such that \(\forall {\bf{s}},{{\bf{s}}}^{{\prime} }\in c:E({\bf{s}},{{\bf{s}}}^{{\prime} })\) . So we can build up these classes of semantically equivalent sentences by checking if new sentences share a meaning with any sentences we have already clustered and, if so, adding them into that class.

We operationalize E (  ⋅  ,  ⋅  ) using the idea of bidirectional entailment, which has a long history in linguistics 52 and natural language processing 28 , 53 , 54 . A sequence, s , means the same thing as a second sequence, s ′, only if the sequences entail (that is, logically imply) each other. For example, ‘The capital of France is Paris’ entails ‘Paris is the capital of France’ and vice versa because they mean the same thing. (See later for a discussion of soft equivalence and cases in which bidirectional entailment does not guarantee equivalent meanings).

Importantly, we require that the sequences mean the same thing with respect to the context—key meaning is sometimes contained in the context. For example, ‘Paris’ does not entail ‘The capital of France is Paris’ because ‘Paris’ is not a declarative sentence without context. But in the context of the question ‘What is the capital of France?’, the one-word answer does entail the longer answer.

Detecting entailment has been the object of study of a great deal of research in NLI 55 . We rely on language models to predict entailment, such as DeBERTa-Large-MNLI 56 , which has been trained to predict entailment, or general-purpose LLMs such as GPT-3.5 (ref. 57 ), which can predict entailment given suitable prompts.

We then cluster sentences according to whether they bidirectionally entail each other using the algorithm presented in Extended Data Fig. 1 . Note that, to check if a sequence should be added to an existing cluster, it is sufficient to check if the sequence bidirectionally entails any of the existing sequences in that cluster (we arbitrarily pick the first one), given the transitivity of semantic equivalence. If a sequence does not share meaning with any existing cluster, we assign it its own cluster.
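The clustering step can be sketched as follows; the entailment check is left abstract here and can be backed by an NLI model or a prompted LLM (see the entailment sketch further below).

```python
# Sketch: greedy clustering of answers by bidirectional entailment.
# `entails(question, a, b)` is assumed to return True when answer a entails answer b
# in the context of the question.
def cluster_by_meaning(question, answers, entails):
    clusters = []                      # each cluster is a list of semantically equivalent answers
    for ans in answers:
        placed = False
        for cluster in clusters:
            rep = cluster[0]           # one representative suffices, given transitivity
            if entails(question, ans, rep) and entails(question, rep, ans):
                cluster.append(ans)
                placed = True
                break
        if not placed:
            clusters.append([ans])     # start a new semantic class
    return clusters
```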

Computing the semantic entropy

Having determined the classes of generated sequences that mean the same thing, we can estimate the likelihood that a sequence generated by the LLM belongs to a given class by computing the sum of the probabilities of all the possible sequences of tokens which can be considered to express that same meaning:

\(P(c| {\boldsymbol{x}})={\sum }_{{\bf{s}}\in c}P({\bf{s}}| {\boldsymbol{x}})={\sum }_{{\bf{s}}\in c}{\prod }_{i}P({s}_{i}| {{\bf{s}}}_{ < i},{\boldsymbol{x}}).\) (2)

Formally, this treats the output as a random variable whose event-space is the space of all possible meaning-classes, C , a sub- σ -algebra of the standard event-space S . We can then estimate the semantic entropy (SE) as the entropy over the meaning-distribution,

\({\rm{SE}}({\boldsymbol{x}})=H(C| {\boldsymbol{x}})=-{\sum }_{c}P(c| {\boldsymbol{x}})\log P(c| {\boldsymbol{x}}).\) (3)

There is a complication which prevents direct computation: we do not have access to every possible meaning-class c . Instead, we can only sample c from the sequence-generating distribution induced by the model. To handle this, we estimate the expectation in equation ( 3 ) using a Rao–Blackwellized Monte Carlo integration over the semantic equivalence classes C ,

\({\rm{SE}}({\boldsymbol{x}})\approx -{\sum }_{i=1}^{| C| }P({C}_{i}| {\boldsymbol{x}})\log P({C}_{i}| {\boldsymbol{x}}),\) (5)

where \(P({C}_{i}| {\boldsymbol{x}})=\frac{P({c}_{i}| {\boldsymbol{x}})}{{\sum }_{c}P(c| {\boldsymbol{x}})}\) estimates a categorical distribution over the cluster meanings, that is, ∑ i P ( C i ∣ x ) = 1. Without this normalization step cluster ‘probabilities’ could exceed one because of length normalization, resulting in degeneracies. Equation ( 5 ) is the estimator giving our main method that we refer to as semantic entropy throughout the text.

For scenarios in which the sequence probabilities are not available, we propose a variant of semantic entropy which we call ‘discrete’ semantic entropy. Discrete semantic entropy approximates P ( C i ∣ x ) directly from the number of generations in each cluster, disregarding the token probabilities. That is, we approximate P ( C i ∣ x ) as \({\sum }_{1}^{M}\frac{{I}_{c={C}_{i}}}{M}\) , the proportion of all the sampled answers which belong to that cluster. Effectively, this just assumes that each output that was actually generated was equally probable—estimating the underlying distribution as the categorical empirical distribution. In the limit of M the estimator converges to equation ( 5 ) by the law of large numbers. We find that discrete semantic entropy results in similar performance empirically.
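A compact sketch of both estimators is given below, assuming the answers have already been clustered; `cluster_logprobs` (length-normalized sequence log-probabilities per cluster) and `cluster_sizes` are hypothetical inputs.

```python
# Sketch: semantic entropy (Rao-Blackwellized estimator) and its discrete variant.
import numpy as np
from scipy.special import logsumexp

def semantic_entropy(cluster_logprobs):
    """cluster_logprobs: list of arrays, one per cluster, of length-normalized log P(s | x)."""
    log_cluster = np.array([logsumexp(lp) for lp in cluster_logprobs])  # unnormalized log P(C_i | x)
    log_p = log_cluster - logsumexp(log_cluster)                        # normalize so probabilities sum to 1
    return float(-np.sum(np.exp(log_p) * log_p))

def discrete_semantic_entropy(cluster_sizes):
    """Approximates P(C_i | x) by the fraction of sampled answers falling in each cluster."""
    p = np.asarray(cluster_sizes, dtype=float)
    p /= p.sum()
    return float(-np.sum(p * np.log(p)))

# Example: 10 samples falling into clusters of sizes 7, 2 and 1.
print(discrete_semantic_entropy([7, 2, 1]))
```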

We provide a worked example of the computation of semantic entropy in Supplementary Note  1 .

Semantic entropy is designed to detect confabulations, that is, model outputs with arbitrary meaning. In our experiments, we use semantic uncertainty to predict model accuracy, demonstrating that confabulations make up a notable fraction of model mistakes. We further show that semantic uncertainty can be used to improve model accuracy by refusing to answer questions when semantic uncertainty is high. Last, semantic uncertainty can be used to give users a way to know when model generations are probably unreliable.

We use the datasets BioASQ 34 , SQuAD 33 , TriviaQA 32 , SVAMP 37 and NQ-Open 35 . BioASQ is a life-sciences question-answering dataset based on the annual challenge of the same name. The specific dataset we use is based on the QA dataset from Task B of the 2023 BioASQ challenge (11B). SQuAD is a reading comprehension dataset whose context passages are drawn from Wikipedia and for which the answers to questions can be found in these passages. We use SQuAD 1.1 which excludes the unanswerable questions added in v.2.0 that are deliberately constructed to induce mistakes so they do not in practice cause confabulations to occur. TriviaQA is a trivia question-answering dataset. SVAMP is a word-problem maths dataset containing elementary-school mathematical reasoning tasks. NQ-Open is a dataset of realistic questions aggregated from Google Search which have been chosen to be answerable without reference to a source text. For each dataset, we use 400 train examples and 400 test examples randomly sampled from the original larger dataset. Note that only some of the methods require training, for example semantic entropy does not use the training data. If the datasets themselves are already split into train and test (or validation) samples, we sample our examples from within the corresponding split.

All these datasets are free-form, rather than multiple choice, because this better captures the opportunities created by LLMs to produce free-form sentences as answers. We refer to this default scenario as our ‘sentence-length’ experiments. In Supplementary Note  7 , we also present results for confabulation detection in a ‘short-phrase’ scenario, in which we constrain model answers on these datasets to be as concise as possible.

To make the problems more difficult and induce confabulations, we do not provide the context passages for any of the datasets. When the context passages are provided, the accuracy rate is too high for these datasets for the latest generations of models to meaningfully study confabulations.

For sentence-length generations we use: Falcon 39 Instruct (7B and 40B), LLaMA 2 Chat 38 (7B, 13B and 70B) and Mistral 40 Instruct (7B).

In addition to reporting results for semantic entropy, discrete semantic entropy and naive entropy, we consider two strong baselines.

Embedding regression is a supervised baseline inspired by the P (IK) method 24 . In that paper, the authors fine-tune their proprietary LLM on a dataset of questions to predict whether the model would have been correct. This requires access to a dataset of ground-truth answers to the questions. Rather than fine-tuning the entire LLM in this way, we simply take the final hidden units and train a logistic regression classifier to make the same prediction. By contrast to their method, this is much simpler because it does not require fine-tuning the entire language model, as well as being more reproducible because the solution to the logistic regression optimization problem is not as seed-dependent as the fine-tuning procedure. As expected, this supervised approach performs well in-distribution but fails when the distribution of questions is different from that on which the classifier is trained.
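A minimal sketch of this baseline with scikit-learn is shown below; the hidden-state matrices are random placeholders standing in for the final hidden states of the LLM, and the labels stand in for the ground-truth correctness of its answers.

```python
# Sketch: embedding-regression baseline, a logistic regression on final hidden states.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
train_hidden = rng.normal(size=(400, 4096))       # placeholder for per-question hidden states
train_correct = rng.integers(0, 2, size=400)      # placeholder 0/1 labels: was the answer correct?
test_hidden = rng.normal(size=(400, 4096))
test_correct = rng.integers(0, 2, size=400)

clf = LogisticRegression(max_iter=1000).fit(train_hidden, train_correct)
scores = clf.predict_proba(test_hidden)[:, 1]     # predicted probability the answer is correct
print("AUROC for detecting mistakes:", roc_auc_score(1 - test_correct, 1 - scores))
```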

The second baseline we consider is the P (True) method 24 , in which the model first samples M answers (identically to our semantic entropy approach) and then is prompted with the list of all answers generated followed by the highest probability answer and a question whether this answer is “(a) True” or “(b) False”. The confidence score is then taken to be the probability with which the LLM responds with ‘a’ to the multiple-choice question. The performance of this method is boosted with a few-shot prompt, in which up to 20 examples from the training set are randomly chosen, filled in as above, but then provided with the actual ground truth of whether the proposed answer was true or false. In this way, the method can be considered as supervised ‘in-context’ because it makes use of some ground-truth training labels but can be used without retraining the model. Because of context-size constraints, this method cannot fit a full 20 few-shot examples in the context when input questions are long or large numbers of generations are used. As a result, we sometimes have to reduce the number of few-shot examples to suit the context size and we note this in the  Supplementary Material .
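The sketch below shows one way to read a P(True)-style score from next-token probabilities, reusing `model` and `tok` from the sampling sketch above; the prompt is a simplified zero-shot approximation of the method (the few-shot examples described above are omitted).

```python
# Sketch: simplified P(True)-style score from next-token probabilities (zero-shot variant).
import torch

def p_true(model, tok, question, brainstormed, best_answer):
    prompt = (f"Question: {question}\n"
              f"Here are some brainstormed ideas: {'; '.join(brainstormed)}\n"
              f"Possible answer: {best_answer}\n"
              "Is the possible answer:\n(a) True\n(b) False\n"
              "The possible answer is: (")
    ids = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_logits = model(**ids).logits[0, -1]
    probs = torch.softmax(next_logits, dim=-1)
    a_id = tok("a", add_special_tokens=False).input_ids[-1]
    b_id = tok("b", add_special_tokens=False).input_ids[-1]
    return (probs[a_id] / (probs[a_id] + probs[b_id])).item()
```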

Entailment estimator

Any NLI classification system could be used for our bidirectional entailment clustering algorithm. We consider two different kinds of entailment detector.

One option is to use an instruction-tuned LLM such as LLaMA 2, GPT-3.5 (Turbo 1106) or GPT-4 to predict entailment between generations. We use the following prompt:

We are evaluating answers to the question {question} Here are two possible answers: Possible Answer 1: {text1} Possible Answer 2: {text2} Does Possible Answer 1 semantically entail Possible Answer 2? Respond with entailment, contradiction, or neutral.

Alternatively, we consider using a language model trained for entailment prediction, specifically the DeBERTa-large model 56 fine-tuned on the NLI dataset MNLI 58 . This builds on past work towards paraphrase identification based on embedding similarity 59 , 60 and BERT-style models 61 , 62 . We template more simply, checking if DeBERTa predicts entailment between the concatenation of the question and one answer and the concatenation of the question and another answer. Note that DeBERTa-large is a relatively lightweight model with only 1.5B parameters which is much less powerful than most of the LLMs under study.
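A sketch of this DeBERTa-based entailment check is shown below; the checkpoint identifier is an assumption and any MNLI-style classifier exposing an entailment label could be substituted.

```python
# Sketch: bidirectional entailment via an off-the-shelf MNLI classifier.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nli_name = "microsoft/deberta-large-mnli"   # assumed checkpoint
nli_tok = AutoTokenizer.from_pretrained(nli_name)
nli_model = AutoModelForSequenceClassification.from_pretrained(nli_name)

def nli_label(question, answer_a, answer_b):
    # Premise and hypothesis are each the question concatenated with one answer.
    enc = nli_tok(f"{question} {answer_a}", f"{question} {answer_b}",
                  return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nli_model(**enc).logits[0]
    return nli_model.config.id2label[int(logits.argmax())].lower()

def entails(question, a, b):
    return nli_label(question, a, b) == "entailment"

def same_meaning(question, a, b):
    return entails(question, a, b) and entails(question, b, a)
```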

In Supplementary Note 2 , we carefully evaluate the benefits and drawbacks of these methods for entailment prediction. We settle on using GPT-3.5 with the above prompt, as its entailment predictions agree well with human raters and lead to good confabulation detection performance.

In Supplementary Note  3 , we provide a discussion of the computational cost and choosing the number of generations for reliable clustering.

Prompting templates

We use a simple generation template for all sentence-length answer datasets:

Answer the following question in a single brief but complete sentence. Question: {question} Answer:

Metrics and accuracy measurements

We use three main metrics to evaluate our method: AUROC, rejection accuracy and AURAC. Each of these is grounded in an automated factuality estimation measurement relative to the reference answers provided by the datasets that we use.

AUROC, rejection accuracy and AURAC

First, we use the AUROC curve, which measures the reliability of a classifier accounting for both precision and recall. The AUROC can be interpreted as the probability that a randomly chosen correct answer has been assigned a higher confidence score than a randomly chosen incorrect answer. For a perfect classifier, this is 1.

Second, we compute the ‘rejection accuracy at X %’, which is the question-answering accuracy of the model on the most-confident X % of the inputs as identified by the respective uncertainty method. If an uncertainty method works well, predictions on the confident subset should be more accurate than predictions on the excluded subset and the rejection accuracy should increase as we reject more inputs.

To summarize this statistic we compute the AURAC—the total area enclosed by the accuracies at all cut-off percentages X %. This should increase towards 1 as a given uncertainty method becomes more accurate and better at detecting likely-inaccurate responses, but it is more sensitive to the overall accuracy of the model than the AUROC metric.
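The sketch below computes both summary metrics from an uncertainty score and per-question correctness labels; averaging the rejection accuracies over evenly spaced cut-offs is a simple approximation of the area described above.

```python
# Sketch: AUROC for predicting mistakes and AURAC from an uncertainty score.
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc_and_aurac(uncertainty, correct, n_cutoffs=100):
    uncertainty = np.asarray(uncertainty, dtype=float)   # higher = more likely confabulating
    correct = np.asarray(correct, dtype=int)             # 1 = answer judged correct
    auroc = roc_auc_score(1 - correct, uncertainty)      # detect incorrect answers

    order = np.argsort(uncertainty)                      # most confident questions first
    accuracies = []
    for frac in np.linspace(0.01, 1.0, n_cutoffs):       # rejection accuracy at each X%
        kept = order[: max(1, int(frac * len(order)))]
        accuracies.append(correct[kept].mean())
    return auroc, float(np.mean(accuracies))             # mean approximates the area

print(auroc_and_aurac([0.1, 0.9, 0.4, 0.7], [1, 0, 1, 0]))
```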

In Supplementary Note  5 , we provide the unaggregated rejection accuracies for sentence-length generations.

Assessing accuracy

For the short-phrase-length generation setting presented in Supplementary Note  7 , we simply assess the accuracy of the generations by checking if the F1 score of the commonly used SQuAD metric exceeds 0.5. There are limitations to such simple scoring rules 63 but this method is widely used in practice and its error is comparatively small on these standard datasets.

For our default scenario, the longer sentence-length generations, this measure fails, as the overlap between the short reference answer and our long model answer is invariably too small. For sentence-length generations, we therefore automatically determine whether an answer to the question is correct or incorrect by using GPT-4 to compare the given answer to the reference answer. We use the template:

We are assessing the quality of answers to the following question: {question} The expected answer is: {reference answer} The proposed answer is: {predicted answer} Within the context of the question, does the proposed answer mean the same as the expected answer? Respond only with yes or no.

We make a small modification for datasets with several reference answers: line two becomes “The following are expected answers to this question:” and the final line asks “does the proposed answer mean the same as any of the expected answers?”.

In Supplementary Note 6 , we check the quality of our automated ground-truth evaluations against human judgement by hand. We find that GPT-4 gives the best results for determining model accuracy and thus use it in all our sentence-length experiments.

In this section we describe the application of semantic entropy to confabulation detection in longer model generations, specifically paragraph-length biographies.

We introduce a biography-generation dataset—FactualBio—available alongside this paper. FactualBio is a collection of biographies of individuals who are notable enough to have Wikipedia pages but not notable enough to have large amounts of detailed coverage, generated by GPT-4 (v.0613). To generate the dataset, we randomly sampled 21 individuals from the WikiBio dataset 64 . For each biography, we generated a list of factual claims contained in each biography using GPT-4, with 150 total factual claims (the total number is only coincidentally a round number). For each of these factual claims, we manually determined whether the claim was correct or incorrect. Out of 150 claims, 45 were incorrect. As before, we apply confabulation detection to detect incorrect model predictions, even though there may be model errors which are not confabulations.

Prompting and generation

Given a paragraph-length piece of LLM-generated text, we apply the following sequence of steps:

Automatically decompose the paragraph into specific factual claims using an LLM (not necessarily the same as the original).

For each factual claim, use an LLM to automatically construct Q questions which might have produced that claim.

For each question, prompt the original LLM to generate M answers.

For each question, compute the semantic entropy of the answers, including the original factual claim.

Average the semantic entropies over the questions to arrive at a score for the original factual claim.

We pursue this slightly indirect way of generating answers because we find that simply resampling each sentence creates variation unrelated to the uncertainty of the model about the factual claim, such as differences in paragraph structure.

We decompose the paragraph into factual claims using the following prompt:

Please list the specific factual propositions included in the answer above. Be complete and do not leave any factual claims out. Provide each claim as a separate sentence in a separate bullet point.

We found that we agreed with the decompositions in all cases in the dataset.

We then generate six questions for each of the facts from the decomposition. We generate these questions by prompting the model twice with the following:

Following this text: {text so far} You see the sentence: {proposition} Generate a list of three questions, that might have generated the sentence in the context of the preceding original text, as well as their answers. Please do not use specific facts that appear in the follow-up sentence when formulating the question. Make the questions and answers diverse. Avoid yes-no questions. The answers should not be a full sentence and as short as possible, e.g. only a name, place, or thing. Use the format “1. {question} – {answer}”.

These questions are not necessarily well-targeted and the difficulty of this step is the main source of errors in the procedure. We generate three questions with each prompt, as this encourages diversity of the questions, each question targeting a different aspect of the fact. However, we observed that the generated questions will sometimes miss obvious aspects of the fact. Executing the above prompt twice (for a total of six questions) can improve coverage. We also ask for brief answers because the current version of GPT-4 tends to give long, convoluted and highly hedged answers unless explicitly told not to.

Then, for each question, we generate three new answers using the following prompt:

We are writing an answer to the question “{user question}”. So far we have written: {text so far} The next sentence should be the answer to the following question: {question} Please answer this question. Do not answer in a full sentence. Answer with as few words as possible, e.g. only a name, place, or thing.

We then compute the semantic entropy over these answers plus the original factual claim. Including the original fact ensures that the estimator remains grounded in the original claim and helps detect situations in which the question has been interpreted completely differently from the original context. We make a small modification to handle the fact that GPT-4 generations often include refusals to answer questions. These refusals were not something we commonly observe in our experiments with LLaMA 2, Falcon or Mistral models. If more than half of the answers include one of the strings ‘not available’, ‘not provided’, ‘unknown’ or ‘unclear’ then we treat the semantic uncertainty as maximal.

We then average the semantic entropies for each question corresponding to the factual claim to get an entropy for this factual claim.

Despite the extra assumptions and complexity, we find that this method greatly outperforms the baselines.

To compute semantic entailment between the original claim and regenerated answers, we rely on the DeBERTa entailment prediction model, as we find empirically that DeBERTa predictions result in a higher train-set AUROC than other methods. Because DeBERTa has slightly lower recall than GPT-3.5/4, we use a modified set-up in which we say the answers mean the same as each other if at least one of them entails the other and neither is seen to contradict the other, a kind of ‘non-defeating’ bidirectional entailment check rather than true bidirectional entailment. The good performance of DeBERTa in this scenario is not surprising, as both factual claims and regenerated answers are relatively short. We refer to Supplementary Notes 2 and 3 for ablations and experiments regarding our choice of entailment estimator for paragraph-length generations.
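A sketch of this ‘non-defeating’ check using an off-the-shelf NLI model from Hugging Face Transformers; the exact checkpoint (`microsoft/deberta-large-mnli`) is an assumption for illustration, not necessarily the one used here.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Checkpoint choice is an assumption; any NLI-fine-tuned DeBERTa works the same way.
MODEL_NAME = "microsoft/deberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def nli_label(premise: str, hypothesis: str) -> str:
    """Return the NLI label (CONTRADICTION, NEUTRAL or ENTAILMENT) for an ordered pair."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[int(logits.argmax(dim=-1))].upper()

def same_meaning(a: str, b: str) -> bool:
    """'Non-defeating' bidirectional check: at least one direction entails,
    and neither direction is predicted to be a contradiction."""
    ab, ba = nli_label(a, b), nli_label(b, a)
    return "ENTAILMENT" in (ab, ba) and "CONTRADICTION" not in (ab, ba)
```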

We implement two baselines. First, we implement a variant of the P(True) method, adapted to the new setting. For each factoid, we generate a question and answers in the same way as for semantic entropy. We then use the following prompt:

Question: {question} Here are some brainstormed ideas: {list of regenerated answers} Possible answer: {original answer} Is the possible answer true? Respond with “yes” or “no”.

As we cannot access the probabilities GPT-4 assigns to predicting ‘yes’ and ‘no’ as the next token, we approximate this using Monte Carlo samples. Concretely, we execute the above prompt ten times (at temperature 1) and then take the fraction of answers which were ‘yes’ as our unbiased Monte Carlo estimate of the token probability GPT-4 assigns to ‘yes’.
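A sketch of this Monte Carlo approximation, where `sample_llm` is a hypothetical placeholder for a temperature-1 call to the model:

```python
def p_true_estimate(question: str, brainstormed: list[str], original: str,
                    sample_llm, n_samples: int = 10) -> float:
    """Monte Carlo estimate of P(True): the fraction of sampled 'yes' responses.

    `sample_llm` stands in for a temperature-1 completion call (e.g. to GPT-4);
    the prompt mirrors the one quoted above.
    """
    prompt = (
        f"Question: {question} "
        f"Here are some brainstormed ideas: {' '.join(brainstormed)} "
        f"Possible answer: {original} "
        'Is the possible answer true? Respond with "yes" or "no".'
    )
    votes = [sample_llm(prompt) for _ in range(n_samples)]
    return sum(v.strip().lower().startswith("yes") for v in votes) / n_samples
```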

As a second, simpler baseline, we check whether the model thinks the answer is true. We simply ask:

Following this text: {text so far} You see this statement: {proposition} Is it likely that the statement is true? Respond with ‘yes’ or ‘no’.

It is interesting that this method ought to perform very well if the model has good ‘self-knowledge’ (that is, if “models mostly know what they don’t know” (ref. 24)), but in fact semantic entropy is much better at detecting confabulations.

Data availability

The data used for the short-phrase and sentence-length generations are publicly available and the released code details how to access them. We release a public version of the FactualBio dataset as part of the code base for reproducing the paragraph-length experiments.

Code availability

We release all code used to produce the main experiments. The code for short-phrase and sentence-length experiments can be found at github.com/jlko/semantic_uncertainty and https://doi.org/10.5281/zenodo.10964366 (ref. 65 ). The code for paragraph-length experiments can be found at github.com/jlko/long_hallucinations and https://doi.org/10.5281/zenodo.10964366 (ref. 65 ).

References

GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).

Gemini: a family of highly capable multimodal models. Preprint at https://arxiv.org/abs/2312.11805 (2023).

Xiao, Y. & Wang, W. Y. On hallucination and predictive uncertainty in conditional language generation. In Proc. 16th Conference of the European Chapter of the Association for Computational Linguistics 2734–2744 (Association for Computational Linguistics, 2021).

Rohrbach, A., Hendricks, L. A., Burns, K., Darrell, T. & Saenko, K. Object hallucination in image captioning. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing (eds Riloff, E., Chiang, D., Hockenmaier, J. & Tsujii, J.) 4035–4045 (Association for Computational Linguistics, 2018).

Weiser, B. Lawyer who used ChatGPT faces penalty for made up citations. The New York Times (8 Jun 2023).

Opdahl, A. L. et al. Trustworthy journalism through AI. Data Knowl. Eng . 146 , 102182 (2023).

Shen, Y. et al. ChatGPT and other large language models are double-edged swords. Radiology 307 , e230163 (2023).

Schulman, J. Reinforcement learning from human feedback: progress and challenges. Presented at the Berkeley EECS Colloquium. YouTube www.youtube.com/watch?v=hhiLw5Q_UFg (2023).

Ji, Z. et al. Survey of hallucination in natural language generation. ACM Comput. Surv. 55 , 248 (2023).

Maynez, J., Narayan, S., Bohnet, B. & McDonald, R. On faithfulness and factuality in abstractive summarization. In Proc. 58th Annual Meeting of the Association for Computational Linguistics (eds Jurafsky, D., Chai, J., Schluter, N. & Tetreault, J.) 1906–1919 (Association for Computational Linguistics, 2020).

Filippova, K. Controlled hallucinations: learning to generate faithfully from noisy data. In Findings of the Association for Computational Linguistics: EMNLP 2020 (eds Webber, B., Cohn, T., He, Y. & Liu, Y.) 864–870 (Association for Computational Linguistics, 2020).

Berrios, G. Confabulations: a conceptual history. J. Hist. Neurosci. 7 , 225–241 (1998).

Lin, S., Hilton, J. & Evans, O. Teaching models to express their uncertainty in words. Transact. Mach. Learn. Res. (2022).

Evans, O. et al. Truthful AI: developing and governing AI that does not lie. Preprint at https://arxiv.org/abs/2110.06674 (2021).

Amodei, D. et al. Concrete problems in AI safety. Preprint at https://arxiv.org/abs/1606.06565 (2016).

Jiang, Z., Araki, J., Ding, H. & Neubig, G. How can we know when language models know? On the calibration of language models for question answering. Transact. Assoc. Comput. Linguist. 9 , 962–977 (2021).

Desai, S. & Durrett, G. Calibration of pre-trained transformers. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Webber, B., Cohn, T., He, Y. & Liu, Y.) 295–302 (Association for Computational Linguistics, 2020).

Glushkova, T., Zerva, C., Rei, R. & Martins, A. F. Uncertainty-aware machine translation evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2021 (eds Moens, M-F., Huang, X., Specia, L. & Yih, S.) 3920–3938 (Association for Computational Linguistics, 2021).

Wang, Y., Beck, D., Baldwin, T. & Verspoor, K. Uncertainty estimation and reduction of pre-trained models for text regression. Transact. Assoc. Comput. Linguist. 10 , 680–696 (2022).

Baker, S. & Kanade, T. Hallucinating faces. In Proc. Fourth IEEE International Conference on Automatic Face and Gesture Recognition . 83–88 (IEEE, Catalogue no PR00580, 2002).

Eliot, L. AI ethics lucidly questioning this whole hallucinating AI popularized trend that has got to stop. Forbes Magazine (24 August 2022).

Shanahan, M. Talking about large language models. Commun. Assoc. Comp. Machinery 67 , 68–79 (2024).

MacKay, D. J. C. Information-based objective functions for active data selection. Neural Comput. 4 , 590–604 (1992).

Kadavath, S. et al. Language models (mostly) know what they know. Preprint at https://arxiv.org/abs/2207.05221 (2022).

Lindley, D. V. On a measure of the information provided by an experiment. Ann. Math. Stat. 27 , 986–1005 (1956).

Xiao, T. Z., Gomez, A. N. & Gal, Y. Wat zei je? Detecting out-of-distribution translations with variational transformers. In Workshop on Bayesian Deep Learning at the Conference on Neural Information Processing Systems (NeurIPS, Vancouver, 2019).

Christiano, P., Cotra, A. & Xu, M. Eliciting Latent Knowledge (Alignment Research Center, 2021); https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit .

Negri, M., Bentivogli, L., Mehdad, Y., Giampiccolo, D. & Marchetti, A. Divide and conquer: crowdsourcing the creation of cross-lingual textual entailment corpora. In Proc. 2011 Conference on Empirical Methods in Natural Language Processing 670–679 (Association for Computational Linguistics, 2011).

Honovich, O. et al. TRUE: Re-evaluating factual consistency evaluation. In Proc. Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering 161–175 (Association for Computational Linguistics, 2022).

Falke, T., Ribeiro, L. F. R., Utama, P. A., Dagan, I. & Gurevych, I. Ranking generated summaries by correctness: an interesting but challenging application for natural language inference. In Proc. 57th Annual Meeting of the Association for Computational Linguistics 2214–2220 (Association for Computational Linguistics, 2019).

Laban, P., Schnabel, T., Bennett, P. N. & Hearst, M. A. SummaC: re-visiting NLI-based models for inconsistency detection in summarization. Trans. Assoc. Comput. Linguist. 10 , 163–177 (2022).

Joshi, M., Choi, E., Weld, D. S. & Zettlemoyer, L. TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. In Proc. 55th Annual Meeting of the Association for Computational Linguistics 1601–1611 (Association for Computational Linguistics. 2017).

Rajpurkar, P., Zhang, J., Lopyrev, K. & Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. In Proc. 2016 Conference on Empirical Methods in Natural Language Processing (eds Su, J., Duh, K. & Carreras, X.) 2383–2392 (Association for Computational Linguistics, 2016).

Tsatsaronis, G. et al. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics 16 , 138 (2015).

Lee, K., Chang, M.-W. & Toutanova, K. Latent retrieval for weakly supervised open domain question answering. In Proc. 57th Annual Meeting of the Association for Computational Linguistics 6086–6096 (Association for Computational Linguistics, 2019).

Kwiatkowski, T. et al. Natural questions: a benchmark for question answering research. Transact. Assoc. Comput. Linguist. 7 , 452–466 (2019).

Patel, A., Bhattamishra, S. & Goyal, N. Are NLP models really able to solve simple math word problems? In Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds Toutanova, K. et al.) 2080–2094 (Assoc. Comp. Linguistics, 2021).

Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at https://arxiv.org/abs/2307.09288 (2023).

Penedo, G. et al. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. In Proc. 36th Conference on Neural Information Processing Systems (eds Oh, A. et al.) 79155–79172 (Curran Associates, 2023)

Jiang, A. Q. et al. Mistral 7B. Preprint at https://arxiv.org/abs/2310.06825 (2023).

Manakul, P., Liusie, A. & Gales, M. J. F. SelfCheckGPT: Zero-Resource Black-Box hallucination detection for generative large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023 (eds Bouamor, H., Pino, J. & Bali, K.) 9004–9017 (Assoc. Comp. Linguistics, 2023).

Mukhoti, J., Kirsch, A., van Amersfoort, J., Torr, P. H. & Gal, Y. Deep deterministic uncertainty: a new simple baseline. In IEEE/CVF Conference on Computer Vision and Pattern Recognition 24384–24394 (Computer Vision Foundation, 2023).

Schuster, T., Chen, S., Buthpitiya, S., Fabrikant, A. & Metzler, D. Stretching sentence-pair NLI models to reason over long documents and clusters. In Findings of the Association for Computational Linguistics: EMNLP 2022 (eds Goldberg, Y. et al.) 394–412 (Association for Computational Linguistics, 2022).

Barnes, B. & Christiano, P. Progress on AI Safety via Debate. AI Alignment Forum www.alignmentforum.org/posts/Br4xDbYu4Frwrb64a/writeup-progress-on-ai-safety-via-debate-1 (2020).

Irving, G., Christiano, P. & Amodei, D. AI safety via debate. Preprint at https://arxiv.org/abs/1805.00899 (2018).

Der Kiureghian, A. & Ditlevsen, O. Aleatory or epistemic? Does it matter? Struct. Saf. 31 , 105–112 (2009).

Malinin, A. & Gales, M. Uncertainty estimation in autoregressive structured prediction. In Proceedings of the International Conference on Learning Representations https://openreview.net/forum?id=jN5y-zb5Q7m (2021).

Murray, K. & Chiang, D. Correcting length bias in neural machine translation. In Proc. Third Conference on Machine Translation (eds Bojar, O. et al.) 212–223 (Assoc. Comp. Linguistics, 2018).

Holtzman, A., Buys, J., Du, L., Forbes, M. & Choi, Y. The curious case of neural text degeneration. In Proceedings of the International Conference on Learning Representations https://openreview.net/forum?id=rygGQyrFvH (2020).

Fan, A., Lewis, M. & Dauphin, Y. Hierarchical neural story generation. In Proc. 56th Annual Meeting of the Association for Computational Linguistics (eds Gurevych, I. & Miyao, Y.) 889–898 (Association for Computational Linguistics, 2018).

Speaks, J. in The Stanford Encyclopedia of Philosophy (ed. Zalta, E. N.) (Metaphysics Research Lab, Stanford Univ., 2021).

Culicover, P. W. Paraphrase generation and information retrieval from stored text. Mech. Transl. Comput. Linguist. 11 , 78–88 (1968).

Padó, S., Cer, D., Galley, M., Jurafsky, D. & Manning, C. D. Measuring machine translation quality as semantic equivalence: a metric based on entailment features. Mach. Transl. 23 , 181–193 (2009).

Androutsopoulos, I. & Malakasiotis, P. A survey of paraphrasing and textual entailment methods. J. Artif. Intell. Res. 38 , 135–187 (2010).

MacCartney, B. Natural Language Inference (Stanford Univ., 2009).

He, P., Liu, X., Gao, J. & Chen, W. Deberta: decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations https://openreview.net/forum?id=XPZIaotutsD (2021).

Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33 , 1877–1901 (2020).

Williams, A., Nangia, N. & Bowman, S. R. A broad-coverage challenge corpus for sentence understanding through inference. In Proc. 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds Walker, M. et al.) 1112–1122 (Assoc. Comp. Linguistics, 2018).

Yu, L., Hermann, K. M., Blunsom, P. & Pulman, S. Deep learning for answer sentence selection. Preprint at https://arxiv.org/abs/1412.1632 (2014).

Socher, R., Huang, E., Pennin, J., Manning, C. D. & Ng, A. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Proceedings of the 24th Conference on Neural Information Processing Systems (eds Shawe-Taylor, J. et al.) (2011)

He, R., Ravula, A., Kanagal, B. & Ainslie, J. Realformer: Transformer likes residual attention. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (eds Zhong, C., et al.) 929–943 (Assoc. Comp. Linguistics, 2021).

Tay, Y. et al. Charformer: fast character transformers via gradient-based subword tokenization. In Proceedings of the International Conference on Learning Representations https://openreview.net/forum?id=JtBRnrlOEFN (2022).

Kane, H., Kocyigit, Y., Abdalla, A., Ajanoh, P. & Coulibali, M. Towards neural similarity evaluators. In Workshop on Document Intelligence at the 32nd conference on Neural Information Processing (2019).

Lebret, R., Grangier, D. & Auli, M. Neural text generation from structured data with application to the biography domain. In Proc. 2016 Conference on Empirical Methods in Natural Language Processing (eds Su, J. et al.) 1203–1213 (Association for Computational Linguistics, 2016).

Kossen, J., jlko/semantic_uncertainty: Initial release v.1.0.0. Zenodo https://doi.org/10.5281/zenodo.10964366 (2024).

Acknowledgements

We thank G. Irving, K. Perlin, J. Richens, L. Rimell and M. Turpin for their comments or discussion related to this work. We thank K. Handa for his help with the human evaluation of our automated accuracy assessment. We thank F. Bickford Smith and L. Melo for their code review. Y.G. is supported by a Turing AI Fellowship funded by the UK government’s Office for AI, through UK Research and Innovation (grant reference EP/V030302/1), and delivered by the Alan Turing Institute.

Author information

These authors contributed equally: Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn

Authors and Affiliations

OATML, Department of Computer Science, University of Oxford, Oxford, UK

Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn & Yarin Gal

Contributions

S.F. led the work from conception to completion and proposed using bidirectional entailment to cluster generations as a way of computing entropy in LLMs. He wrote the main text, most of the Methods and Supplementary Information and prepared most of the figures. J.K. improved the mathematical formalization of semantic entropy; led the extension of semantic entropy to sentence- and paragraph-length generations; wrote the code for, and carried out, all the experiments and evaluations; wrote much of the Methods and Supplementary Information and prepared drafts of many figures; and gave critical feedback on the main text. L.K. developed the initial mathematical formalization of semantic entropy; wrote code for, and carried out, the initial experiments around semantic entropy and its variants which demonstrated the promise of the idea and helped narrow down possible research avenues to explore; and gave critical feedback on the main text. Y.G. ideated the project, proposing the idea to differentiate semantic and syntactic diversity as a tool for detecting hallucinations, provided high-level guidance on the research and gave critical feedback on the main text; he runs the research laboratory in which the work was carried out.

Corresponding author

Correspondence to Sebastian Farquhar.

Ethics declarations

Competing interests.

S.F. is currently employed by Google DeepMind and L.K. by OpenAI. For both, this paper was written under their University of Oxford affiliation. The remaining authors declare no competing interests.

Peer review

Peer review information.

Nature thanks Mirella Lapata and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1: Algorithm outline for bidirectional entailment clustering.

Given a set of outputs in response to a context, the bidirectional entailment answer returns a set of sets of outputs which have been classified as sharing a meaning.
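As an illustrative reconstruction (not the released code), such a clustering can be implemented greedily, comparing each output against one representative per existing cluster with a pairwise `same_meaning` predicate like the entailment check sketched in the Methods above:

```python
def cluster_by_meaning(outputs: list[str], same_meaning) -> list[list[str]]:
    """Greedy bidirectional-entailment clustering.

    Returns a set of sets (here, a list of lists) of outputs judged to share a
    meaning. `same_meaning` is a pairwise check such as the 'non-defeating'
    bidirectional entailment predicate defined earlier; this is a sketch, not
    the paper's released implementation.
    """
    clusters: list[list[str]] = []
    for text in outputs:
        for cluster in clusters:
            if same_meaning(cluster[0], text):   # compare against a representative
                cluster.append(text)
                break
        else:
            clusters.append([text])              # no match: start a new cluster
    return clusters
```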

Supplementary information

Supplementary information.

Supplementary Notes 1–7, Figs. 1–10, Tables 1–4 and references. Includes a worked example of the semantic entropy calculation, a discussion of the limitations and computational cost of entailment clustering, ablations of entailment prediction and clustering methods, a discussion of the automated accuracy assessment, unaggregated results for sentence-length generations and further results for short-phrase generations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Cite this article.

Farquhar, S., Kossen, J., Kuhn, L. et al. Detecting hallucinations in large language models using semantic entropy. Nature 630 , 625–630 (2024). https://doi.org/10.1038/s41586-024-07421-0

Download citation

Received : 17 July 2023

Accepted : 12 April 2024

Published : 19 June 2024

Issue Date : 20 June 2024

DOI : https://doi.org/10.1038/s41586-024-07421-0

