EHR data extracts were mapped and loaded into an OHDSI Observational Medical Outcomes Partnership database [ 40 ]. Medication, observation, and procedure data extracts were requested and loaded into the database only for patients who would not be disqualified by other algorithm criteria. For patients selected to be in the validation sample, these data along with the clinical data for allergies, immunizations, and clinical notes were pulled from the EHR’s FHIR API endpoints, patient by patient, using a custom Python script to loop through the patients in the sample. The data were loaded into a Health Level 7 API (HAPI) FHIR server. We only pulled FHIR data for cases not initially disqualified by the vaccination and diagnosis filters to avoid unnecessary large data transfers and storage. The algorithm flagged potential AESIs that met the specified criteria. Samples of these cases were sent to physicians for validation.
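As a rough illustration of the patient-by-patient pull described above, the sketch below builds one FHIR search URL per patient and resource type. The base URL, resource list, and function name are hypothetical assumptions for illustration, not the study's actual script.

```python
# Illustrative sketch of a per-patient FHIR pull loop; the resource list,
# base URL, and helper name are hypothetical, not the study's actual code.
FHIR_RESOURCES = ["AllergyIntolerance", "Immunization", "DocumentReference"]

def build_fhir_queries(base_url, patient_ids, resources=FHIR_RESOURCES):
    """Yield one FHIR search URL per (patient, resource type) pair."""
    for pid in patient_ids:
        for resource in resources:
            yield pid, f"{base_url}/{resource}?patient={pid}"

# In the real pipeline, each URL would be fetched (eg, with requests.get)
# and the returned FHIR Bundle loaded into the HAPI FHIR server.
```

Looping per patient in this way keeps each transfer small, consistent with the study's goal of avoiding unnecessarily large data transfers for disqualified cases.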
Once the algorithm identified cases, a random sample was drawn for each AESI for clinician adjudication. We used stratified sampling to ensure cases during pre– and post–COVID-19 EUA periods were represented ( Figure 2 ). This was due to concerns regarding potential confounding introduced by the COVID-19 vaccines, when attention to possible AESIs or medical charting of AESIs may have shifted. Where possible for each AESI, 100 cases were sampled from the pre–COVID-19 EUA period and 35 from the post–COVID-19 EUA period. If there were <100 or <35 cases during these periods, respectively, the sample would contain all cases the algorithm selected. Febrile seizure was the exception, as we believe the COVID-19 vaccine EUA should not affect the algorithm’s performance because febrile seizure AEs are usually associated with pediatric populations, and the COVID-19 vaccine was not approved for these populations during the study period [ 27 ].
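The sampling rule above (up to 100 pre-EUA and 35 post-EUA cases per AESI, taking all available cases when a stratum has fewer) can be sketched as follows; the function name and seed handling are illustrative assumptions, not the study's actual implementation.

```python
import random

def stratified_sample(pre_cases, post_cases, n_pre=100, n_post=35, seed=0):
    """Draw up to n_pre pre-EUA and n_post post-EUA cases; if a stratum has
    fewer cases than its target, include all of them (per the study's rule)."""
    rng = random.Random(seed)
    pre = list(pre_cases) if len(pre_cases) <= n_pre else rng.sample(pre_cases, n_pre)
    post = list(post_cases) if len(post_cases) <= n_post else rng.sample(post_cases, n_post)
    return pre + post
```

For example, with only 75 pre-EUA cases available (as for GBS), the sample would contain all 75 plus 35 sampled post-EUA cases.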
Case counts sampled in each period were based on the incidence of diagnosis code occurrence within each period, as well as the length of the period covered. In addition, we added negative controls selected randomly from every encounter in the period to establish a baseline comparison for the case validation process. We included negative controls as a quality control step to reduce the chance of quality issues with the data and to review the methods our clinicians were following, not for the purpose of making inferences about the phenotypes’ performance for non-AE cases (eg, through metrics such as sensitivity, negative predictive value, or an overall performance metric). This study did not focus on the algorithmic identification of undetected AEs or AEs that were not coded properly; its focus was to determine the phenotypes’ PPV. Given the expense of clinicians’ time for validations and the rarity of the AESIs, there would be minimal benefit to this study in having a negative control sample large enough to draw strong inferences. Furthermore, negative case controls would not further validate the utility of the phenotypes as tools for identifying probable AESIs through distributed surveillance. We added 20 negative controls from the pre–COVID-19 EUA period and 7 from the post–COVID-19 EUA period. Physicians were blinded to which cases were controls and which were not.
The sample of cases used to validate the algorithm was loaded into a chart review tool for clinician review. This allowed the clinicians to sort through the clinical information for a case and record the determination. Each case was assigned to 2 clinicians for review. The clinical validation used a patient’s full clinical history, which included EHR data, including all clinical notes for each case. The full EHR data used for clinician review included data unused by the detection algorithm described in the Computable Phenotype Development section, including different types of data (eg, allergies and clinical notes) and data filtered out (eg, admitting diagnosis and encounters with different care settings).
For each case, the clinician evaluated whether the clinical data evidence met the specified case definition criteria. Relevant patient data for the case window were available and presented to the clinicians in an easy-to-use, browser-based tool with a custom user interface. In the tool, clinicians were able to group items by type, search across all items and text, and request additional chart data to expand the window and access any available historical patient data, if desired.
All suspected AEs were validated using published case definitions [ 21 - 23 , 25 - 27 ] according to the levels of diagnostic certainty: level 1 (definite), level 2 (probable), and level 3 (possible). If a case did not meet one of the levels in the case definition, it was assigned as level 4 (doubtful) or level 5 (ruled out). “Ruled out” is distinct from “doubtful” in that “ruled out” cases have definitive evidence disqualifying them from being a correct diagnosis. If a case was determined to be “definite” or “probable,” it was considered a positive case of the AESI.
In the event of a disagreement between a positive and negative clinical review, a third clinician made a final determination by reviewing the case EHR data. If the clinicians found the structured or unstructured EHR data was insufficient, they marked this in their review by creating a level 3 (possible, insufficient evidence) designation, where an AESI could have occurred, but where there was not enough documentation to fulfill the requirements of the case definition.
PPV of algorithms.
Each algorithm’s PPV was the proportion of positive AEs the algorithm identified that were confirmed by clinical adjudication. PPVs were calculated for each AESI overall, as well as stratified by pre– and post–COVID-19 EUA periods and care setting (inpatient, emergency department, or outpatient). Sensitivity analyses were performed to evaluate the impact of medication use, different case definitions, and levels of evidence. PPVs were calculated in 2 different ways for each AESI algorithm. The first PPV calculated removed all possible cases with insufficient evidence from the denominator (cases labeled “definite” and “probable”/total cases minus any labeled “possible, insufficient evidence” by clinicians). PPV was then calculated with the cases with insufficient evidence added back into the denominator (cases labeled “definite” and “probable”/total cases). Reporting both PPV calculations can help with understanding the performance for different algorithm uses. Algorithm performance should ideally be compared with past literature of detection algorithms for the same AESI.
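A minimal sketch of the 2 PPV calculations described above, assuming the counts of clinician-confirmed cases, total sampled cases, and insufficient-evidence cases have already been tallied (the function and variable names are ours, not the study's code):

```python
def ppv_both(n_confirmed, n_total, n_insufficient):
    """Return (PPV excluding insufficient-evidence cases, PPV over all cases).

    n_confirmed: cases adjudicated "definite" or "probable"
    n_total: all algorithm-selected cases in the validation sample
    n_insufficient: cases labeled "possible, insufficient evidence"
    """
    ppv_sufficient = n_confirmed / (n_total - n_insufficient)
    ppv_all = n_confirmed / n_total
    return ppv_sufficient, ppv_all
```

Using the myocarditis/pericarditis counts reported later (86 confirmed of 135 sampled, 32 with insufficient evidence), this reproduces the reported 83.5% and 63.7%.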
Because PPV is a binomial proportion, we calculated CIs for the PPV using the Agresti-Coull interval [ 41 ], which is the recommended method for estimating accurate CIs for binomial proportions such as PPV [ 42 ].
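For reference, the Agresti-Coull interval adds z²/2 pseudo-successes and z² pseudo-trials before forming a Wald-style interval; a minimal sketch (our own helper, not a library call):

```python
import math

def agresti_coull_interval(successes, n, z=1.96):
    """Agresti-Coull CI for a binomial proportion (z=1.96 for a 95% CI)."""
    n_tilde = n + z**2                      # adjusted trial count
    p_tilde = (successes + z**2 / 2) / n_tilde  # adjusted proportion
    half_width = z * math.sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return max(0.0, p_tilde - half_width), min(1.0, p_tilde + half_width)
```

Library implementations (eg, statsmodels' `proportion_confint` with `method="agresti_coull"`) give equivalent results.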
Interrater reliability was used to measure the extent to which 2 physicians agreed in their AESI assessment. It was calculated using Cohen κ between the first 2 reviewers. Cohen κ measures the agreement between 2 raters classifying instances into mutually exclusive groups [ 43 ].
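Cohen κ compares the observed agreement between the 2 raters with the agreement expected by chance given each rater's label frequencies; a small self-contained sketch (the helper name is ours):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen kappa for two raters' categorical labels of the same cases."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed proportion of cases where the raters agree.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2
    return (p_observed - p_expected) / (1 - p_expected)
```

This matches standard implementations such as scikit-learn's `cohen_kappa_score` for binary positive/negative adjudications.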
After validation was completed, we conducted a stratification and sensitivity analysis. We selected 2 stratification variables that could reasonably impact the generalizability of the results. First, we stratified the data by pre- and post-EUA date to confirm that the algorithm behavior did not change for AESIs after the COVID-19 vaccine was approved and administered to a large portion of the population. Ideally, the algorithms would perform consistently across these eras, but there are multiple factors that could impact the performance over these time periods. We also stratified the data by the care setting of the AE diagnosis, given that care settings may be associated with varying EHR data elements (eg, emergency departments compared with inpatient settings). Algorithm performance was computed using PPV within each stratum.
We also completed a post hoc sensitivity analysis where we investigated whether the algorithm could be improved, as measured by PPV, through small changes to it or by updating the process for evaluation. These changes were based on insights from clinicians or data analysts reviewing validation results, so results may not generalize to other data sets. However, we did attempt to limit our analysis to decisions that could have been feasibly made without postvalidation insights. The changes to the algorithms were either removing medications, observations, procedures, or diagnosis codes that are not specific enough to the AESI in question or adding logic to further filter out cases by requiring more supporting evidence ( Table 3 ).
The stratification and sensitivity analyses are meant as exploratory analyses to prompt additional research, but the subgroups often have sample sizes too small to yield CIs narrow enough for meaningful results.
We also completed a sensitivity analysis on the GBS algorithm to calculate the PPV if we relaxed some of the specific case definition evaluation criteria and accepted more general evidence when it was available. We found that 2 pieces of evidence required by the case definition were often missing in the chart review tool: the cerebrospinal fluid (CSF) white blood cell (WBC) count in cases of elevated CSF protein and consistent documentation of diminished or absent reflexes. In some of these cases, we saw evidence that a neurologist was consulted and felt there was strong suspicion of GBS despite the missing documentation for these tests. This could be explained by 2 mechanisms.
First, and most likely, this could be due to data loss during the delivery or translation of EHR data to our chart review tool. Because we did not have direct access to the data, our process for obtaining, translating to different common data models or standards, and presenting the data to clinicians using the chart review tool could cause the data for these tests to be incorrectly mapped.
AESI | Data type | Sensitivity analysis | Reasoning |
Myocarditis/pericarditis | Medication | Removal of NSAIDs from our list of qualifying medication supporting evidence | NSAIDs are medications that can be used to treat many different conditions besides myocarditis and pericarditis. |
Myocarditis/pericarditis | Diagnostic code | Stratification by diagnostic code (myocarditis vs pericarditis) | Diagnostic criteria differ for these related conditions and may lead to different performance. |
GBS | Medication | Removal of gabapentin from our list of qualifying medication supporting evidence | Gabapentin was originally used as supporting evidence of a GBS episode due to its use for nerve pain associated with GBS events [ ]. However, it is also used for a variety of other conditions with neuropathic pain and is not specific to GBS. |
GBS | Case definition | Update case definition criteria to allow for a case to be validated as positive if there is a missing documentation for absent or diminished reflexes in the weak limbs, CSF WBC count with neurology consult, or clinical note indicating evidence of the test result of GBS more generally | Documentation required for definite or probable GBS as defined by the case definition diagnosis was often missing from our data set due to failure to capture in EHR or failure to translate to our data set and can be supplemented by an expert’s judgment (eg, a neurologist). |
Febrile seizure | Medication | Addition of medications used to treat fever | The original febrile seizure algorithm did not filter out cases without suggested evidence, but we believed adding suggested evidence could improve PPV . |
Febrile seizure | Observation | Addition of observation of clinician describing the symptoms of seizure activity | The original febrile seizure algorithm did not filter out cases without suggested evidence, but we believed adding suggested evidence could improve PPV. |
TTS | Diagnostic code | Stratification by most prevalent diagnostic code I81 versus all other codes | Diagnostic criteria differ for these related conditions and may lead to different performance. |
a NSAID: nonsteroidal anti-inflammatory drug.
c CSF: cerebrospinal fluid.
d WBC: white blood cell.
e EHR: electronic health record.
f PPV: positive predictive value.
g TTS: thrombosis with thrombocytopenia syndrome.
Second, case definition requirements for GBS are extremely strict, and physicians in this study believed that some of these might have represented valid GBS cases while not meeting every requirement. For example, several of the cases with missing CSF WBC count did mention cytoalbuminologic dissociation (or similar); in the presence of such a clinical statement, we might infer that CSF WBC count was performed and acceptable to meet the case definition criteria despite a missing test result.
Furthermore, in cases where a neurologist felt strongly that GBS was a likely diagnosis, along with other supporting evidence, it may be acceptable to rely on documented progressive and significant muscle weakness, especially with conflicting reflex findings. In these instances, we placed more weight on the clinician review (which may account for any unforeseen difficulties in data processing and the strictness of the case definition), not relying solely on the available (nonmissing) data types of the algorithm for assigning case diagnostic certainty.
Figure 2 illustrates the identification of the study populations and validation sample. From the study population of 20.7 million medical encounters for 2,666,974 patients over the study period, the algorithm selected 1195 (0.04%) cases of myocarditis/pericarditis, 550 (0.02%) of anaphylaxis, 123 (0.005%) of GBS, 626 (0.02%) of febrile seizure, and 395 (0.01%) of TTS. Of these patient cases (n=2,666,974), a stratified random sample of 135 (0.01%) cases each was selected from the myocarditis/pericarditis, anaphylaxis, and TTS populations. All 75 pre-EUA cases of GBS and a random sample of 35 post-EUA cases were selected to be validated. A random selection of 100 cases from the pre-EUA period was sampled to validate febrile seizure. An additional 27 negative control cases were sampled for each algorithm from the roughly 20.7 million medical encounters not selected by the algorithm in our study period. Of these, 20 were sampled from the period before the COVID-19 vaccine EUA, and the remaining 7 came from the period after the EUA.
Table 4 presents algorithm performance measured by PPV for each of the 5 AESIs using cases that had sufficient evidence and all cases (ie, including cases unable to be confirmed as positive by clinicians due to insufficient evidence). Counts for the number of cases included in each PPV calculation can be found in Table S3 in Multimedia Appendix 1 [ 21 - 23 , 25 - 27 ].
Overall PPVs, when removing all cases with insufficient evidence, were highest for anaphylaxis (93.3%, 95% CI 86.4%-97%) and febrile seizure (89%, 95% CI 80%-94.4%), followed by myocarditis/pericarditis (83.5%, 95% CI 74.9%-89.6%) and TTS at unusual sites (70.2%, 95% CI 61.4%-77.6%). The lowest was for GBS (47.2%, 95% CI 35.8%-58.9%). All negative control cases across the 5 phenotypes were correctly classified by the algorithms.
The PPV results from the chart reviews of the validation sample for each AESI are reported for all cases as well as for only those cases with sufficient evidence for chart reviewers to make a clear determination. The frequencies and percentages for insufficient evidence are presented with the stratification results in Table 5 . The interrater reliability scores for clinician chart reviews all showed substantial agreement between the clinicians ( Table 6 ). Interrater reliability, measured by Cohen κ, suggests substantial reliability when the value is >0.61, with many similar texts recommending a higher threshold of 0.80 [ 43 ].
AESI and metric | Detected cases, PPV % (95% CI)
Myocarditis/pericarditis |
Cases with sufficient evidence only | 83.5 (74.9-89.6)
All cases | 63.7 (55.2-71.4)
Anaphylaxis |
Cases with sufficient evidence only | 93.3 (86.4-97)
All cases | 72.6 (64.4-79.5)
GBS |
Cases with sufficient evidence only | 47.2 (35.8-58.9)
All cases | 30.9 (22.9-40.3)
TTS at unusual sites |
Cases with sufficient evidence only | 70.2 (61.4-77.6)
All cases | 64.4 (55.9-72.1)
Febrile seizure |
Cases with sufficient evidence only | 89 (80-94.4)
All cases | 89 (80-94.4)
AESI and metric | Detected cases | Pre-EUA period | Post-EUA period | Inpatient | Outpatient | Emergency department
Myocarditis/pericarditis (n=135)
Total cases, n | 135 | 100 | 35 | 91 | 26 | 18
Total TP cases, n (PPV %; 95% CI) | 86 (63.7; 55.2-71.4) | 68 (68.0; 58.1-76.5) | 18 (51; 35-68) | 72 (79; 69-86) | 10 (38; 21-59) | 4 (22; 7-48)
Total cases with sufficient evidence, n (PPV % for TP cases with sufficient evidence; 95% CI) | 103 (83.5; 74.9-89.6) | 79 (86; 76-92) | 24 (75; 53-89) | 79 (91; 82-96) | 16 (63; 36-84) | 8 (50; 15-85)
Anaphylaxis (n=135)
Total cases, n | 135 | 100 | 35 | 27 | — | 108
Total TP cases, n (PPV %; 95% CI) | 98 (72.6; 64.4-79.5) | 70 (70; 60.2-78.3) | 28 (80; 63-90.9) | 17 (63; 42.9-79.7) | — | 81 (75; 65.8-82.4)
Total cases with sufficient evidence, n (PPV %; 95% CI) | 105 (93.3; 86.4-97) | 74 (94.6; 86.2-98.4) | 31 (90.3; 73.4-98) | 19 (89.5; 65.6-99.7) | — | 86 (94.2; 86.6-97.9)
GBS (n=110)
Total cases, n | 110 | 65 | 45 | 110 | — | —
Total TP cases, n (PPV %; 95% CI) | 34 (30.9; 22.9-40.3) | 24 (40; 25.9-49.5) | 20 (44; 30.4-59.4) | 34 (30.9; 22.8-40.3) | — | —
Total cases with sufficient evidence, n (PPV %; 95% CI) | 72 (47.2; 35.8-58.9) | 52 (46.2; 32.9-60) | 20 (50; 28.1-71.9) | 72 (47.2; 35.8-58.9) | — | —
TTS at unusual sites (n=135)
Total cases, n | 135 | 100 | 35 | 133 | 1 | 1
Total TP cases, n (PPV %; 95% CI) | 87 (64.4; 55.9-72.1) | 64 (64; 54-72.9) | 23 (66; 48.2-80) | 86 (64.7; 56.1-72.4) | 1 (100; 0-100) | 0 (0; 0-100)
Total cases with sufficient evidence, n (PPV %; 95% CI) | 124 (70.2; 61.4-77.6) | 91 (70.3; 60-78.9) | 33 (70; 51.6-83.5) | 122 (70.5; 61.7-78) | 1 (100; 0-100) | 1 (100; 0-100)
Febrile seizure (n=100)
Total cases, n | 100 | 100 | — | 1 | — | 99
Total TP cases, n (PPV %; 95% CI) | 73 (73; 63.3-80.9) | 73 (73; 63.3-80.9) | — | 0 (0; 0-100) | — | 73 (74; 64.1-81.6)
Total cases with sufficient evidence, n (PPV %; 95% CI) | 83 (88; 78.8-93.6) | 83 (88; 78.8-93.6) | — | 1 (0; 0-100) | — | 82 (89; 80-94.4)
b EUA: emergency use authorization.
c TP: true positive.
d PPV: positive predictive value.
e Not applicable.
f GBS: Guillain-Barré syndrome.
AESI | Total cases validated, n | Interrater reliability |
Myocarditis/pericarditis | 162 | 0.814 |
Anaphylaxis | 162 | 0.770 |
GBS | 137 | 0.832 |
TTS at unusual sites | 162 | 0.851 |
Febrile seizure | 120 | 0.965 |
To evaluate consistency across pre- and post-EUA periods and care settings, we reported true positive and PPV results for each stratum ( Table 5 ).
None of the algorithms showed notable differences between the pre- and post-EUA periods, as all 95% CIs overlapped to some degree. However, there were differences between the 2 periods’ PPVs that could prove significant with a larger validation sample. The PPV for myocarditis/pericarditis fell from 68% in the pre-EUA period to 51.4% in the post-EUA period, while anaphylaxis showed the opposite pattern, with a 70% PPV in the pre-EUA period that increased to 80% in the post-EUA period.
We also reported stratified results by care setting ( Table 5 ). For myocarditis/pericarditis, the PPV of cases with an inpatient care setting (79.1%, 95% CI 69.4%-86.4%) was notably higher than that from the outpatient (38.5%, 95% CI 21.2%-58.8%) or emergency department (22.2%, 95% CI 6.7%-47.9%) care settings.
Anaphylaxis did not have a large difference across care settings, as the 95% CIs overlapped between the 2 care settings. However, they did show better performance with cases in an emergency department (PPV 75%, 95% CI 65.8%-82.4%) care setting over cases with an inpatient care setting (PPV 63%, 95% CI 42.9%-79.7%). The other AESI algorithms filtered for only 1 care setting or had a vast majority of cases in 1 care setting.
Medication and observation algorithm changes.
We analyzed whether changes to medication code lists for the myocarditis/pericarditis and GBS algorithms could improve performance. For the myocarditis/pericarditis algorithm, removing nonsteroidal anti-inflammatory drugs from the medication code lists produced no change in PPV, which remained 83.5% ( Table 7 ), but PPVs were higher for cases selected with the pericarditis ICD-10 codes than with the myocarditis codes.
For the GBS algorithm, when cases were removed where gabapentin (used for post-GBS pain management) was the only supporting evidence, the PPV increased to 38.4% (95% CI 28.6%-49.2%) from 30.9% (95% CI 22.9%-40.3%; Table 8 ).
Our initial febrile seizure algorithm did not use any supporting evidence to filter out possible false positives since we believed we could get adequate PPV without it.
For our sensitivity analysis, we tested requiring supporting evidence in the condition period, such as the presence of medications for reducing fever such as acetaminophen, observation evidence when the patient’s chief complaint was related to fever or seizure, or the presence of both. When filtered to only cases with either medication or observation evidence, febrile seizure PPV increased significantly to 93.3% (95% CI 84.7%-97.6%) from the original algorithm PPV of 73% (95% CI 63.3%-80.9%), with no overlap in 95% CIs and a P value of <.001 ( Table 9 ). When the algorithm required both medication and observation evidence, it performed even better (PPV 96.9%, 95% CI 88.5%-99.9%).
AESI and sensitivity analysis | Total TP cases, n | Selected cases, n (change, n) | PPV, % (95% CI; change) | Selected cases with sufficient evidence, n (change, n) | PPV, % (95% CI; change)
Removal of NSAIDs | 86 | 135 (0) | 63.7 (55.2-71.4; 0) | 103 (0) | 83.5 (74.9-89.6; 0)
Pericarditis diagnosis | 59 | 82 (–53) | 72 (61.1-80.8; +8.3) | 67 (–36) | 88.1 (77.6-94.3; +4.6)
Myocarditis diagnosis | 27 | 53 (–82) | 50.9 (37.4-64.3; –12.8) | 36 (–67) | 75 (57.9-87.1; –8.5)
b TP: true positive.
c PPV: positive predictive value.
d Values in parentheses reflect the change due to the modified algorithm features.
e NSAID: nonsteroidal anti-inflammatory drug.
f All International Classification of Diseases, Tenth Revision, Clinical Diagnosis codes that the algorithm used were broken into 2 groups: myocarditis (I40.0 infective myocarditis, I40.1 isolated myocarditis, I40.8 other acute myocarditis, I40.9 acute myocarditis, unspecified, and I51.4 Viral myocarditis) and pericarditis (B33.22 viral pericarditis, B33.23 acute rheumatic pericarditis, I30.0 acute nonspecific idiopathic pericarditis, I30.1 infective pericarditis, I30.8 other forms of acute pericarditis, I30.9 acute pericarditis, unspecified, I32 pericarditis in diseases classified elsewhere, and I41 meningococcal pericarditis).
AESI and sensitivity analysis | Total TP cases, n | Selected cases, n (change, n) | PPV, % (95% CI; change) | Selected cases with sufficient evidence, n (change, n) | PPV, % (95% CI; change)
Removal of gabapentin | 33 | 86 (–24) | 38.4 (28.6-49.2; +7.5) | 53 (–19) | 62.3 (48.3-74.5; +15)
Adjusted case definition | 49 | 110 (0) | 44.5 (35.4-54; +13.6) | 72 (0) | 68.1 (56.3-78; +20.8) |
Adjusted case definition+removal of gabapentin | 49 | 86 (–26) | 57.1 (46.2-67.4; +26.2) | 68 (–4) | 72.1 (60-81.6; +24.8) |
AESI and sensitivity analysis | Total TP cases, n | Selected cases, n (change, n) | PPV, % (95% CI; change) | Selected cases with sufficient evidence, n (change, n) | PPV, % (95% CI; change)
Cases with either medication or observation | 70 | 75 (–25) | 93.3 (84.7-97.6; +20.3) | 73 (–10) | 95.9 (87.9-99.2; +7.9) |
Cases with both medication and observation evidence | 63 | 65 (–35) | 96.9 (88.5-99.9; +23.9) | 63 (–20) | 100 (92.8-100; +12) |
c Values in parentheses reflect the change due to the modified algorithm features.
We also analyzed if changing diagnostic codes that were used to identify the AESI might lead to higher performance for the myocarditis/pericarditis and TTS algorithms.
For myocarditis/pericarditis, we found that an algorithm only looking for the myocarditis code (PPV 50.9%, 95% CI 37.4%-64.3%) underperformed an algorithm with just pericarditis codes (PPV 72%, 95% CI 61.1%-80.8%; Table 7 ). For TTS, we found that the main ICD-10 code I81 for “portal vein thrombosis” (73.5%, 95% CI 64%-81.3%) outperformed all other codes in our code list, including G08 (intracranial and intraspinal phlebitis and thrombophlebitis), I82.0 (Budd-Chiari syndrome), I82.3 (embolism and thrombosis of renal vein), and I82.890 (acute embolism and thrombosis of other specified veins), with a PPV of 36.4% (95% CI 21.3%-54.4%; Table 10 ).
AESI and sensitivity analysis | Total TP cases, n | Selected cases, n (change, n) | PPV, % (95% CI; change) | Selected cases with sufficient evidence, n (change, n) | PPV, % (95% CI; change)
I81 | 75 | 102 (–33) | 73.5 (64-81.3; +9.1) | 96 (–28) | 78.1 (68.6-85.4; +8) |
All other TTS codes | 12 | 33 (–102) | 36.4 (21.3-54.4; –28) | 28 (–96) | 42.9 (25.4-62.1; –27.3) |
e ICD: International Classification of Diseases.
f All other TTS ICD codes include G08, I82.0, I82.3, and I82.890.
Finally, we analyzed whether a small update to our case definition criteria for the GBS algorithm, described in the Stratification Analysis and Sensitivity Analysis section, would improve its reported performance ( Table 8 ). When we applied both changes, the case definition update and the removal of gabapentin discussed in the Medication and Observation Algorithm Changes section, the algorithm achieved a PPV of 57.1% (95% CI 46.2%-67.4%).
The results of this study show that for 4 out of 5 AESIs, we can build an interoperable computable phenotype with comparable or increased performance to algorithms in the existing literature. These algorithms are developed using a rules-based approach to facilitate their application and increase the generalizability of performance across EHR databases. For the phenotypes with poorer performance, the issues were often that the case definition required documentation of a test that was lost in our data pipeline, or was not completed, or was not recorded by the treating physician or nurse. While these cases are marked as false positives based on our methodology, they may be true AEs that are lacking the documentation to meet the case definition. Some small updates to the algorithms or the case definition evaluation method could be made to potentially improve the algorithms’ performances, but a more important next step would be to validate our algorithms on other data partners to ensure generalizability of the original algorithms and any updates. Given the need for active AE surveillance, this study is still an important first step toward building an algorithm that can be distributed and implemented on health provider EHR databases and can accurately detect AEs.
The PPV results of the phenotypes, negative control groups, and stratification and sensitivity analysis are discussed in more detail in the following sections. Note our negative control groups and many of the stratification and sensitivity analyses have sample sizes too small to draw strong conclusions as illustrated by the width of the 95% CIs for those results. These were exploratory analyses completed as a supplement to the main findings of the study around the PPV of the algorithms.
The myocarditis/pericarditis algorithm showed strong PPV performance using cases with sufficient evidence. The literature appears to lack good comparison studies against which to evaluate this algorithm’s performance. A meta-analysis from 2013 reviewed myocarditis/pericarditis algorithm studies and found that none of them evaluated their algorithm by calculating PPV [ 45 ].
When myocarditis/pericarditis was segmented via care settings, algorithm performance was highest for inpatient settings, with a PPV of 79.1%. This can be attributed to the availability of supporting clinical data needed for accurate case detection in such settings. Given that inpatient testing is necessary to meet the criteria of the case definition, the algorithm performance matches clinical expectations and adds to its public health importance.
In emergency care settings, myocarditis/pericarditis is often diagnosed for patients with a history of inpatient visits to one or more other health systems. This increases the probability of these patients having additional documentation necessary to meet the case definition. This highlights the role of health information exchanges in supporting public health use cases, improving AE reporting, and enhancing postmarket surveillance.
Myocarditis/pericarditis had a notable difference in PPVs for pre- and post-EUA date. The post-EUA date strata of the sample had a higher percentage of cases coming from the emergency department, which had few cases before EUA. This could be explained by patients being diagnosed during previous inpatient stays in other health systems and a lower threshold to provide a preliminary diagnosis with limited information. This category had a lower PPV on average for myocarditis/pericarditis, likely due to less documentation in an emergency care setting than in an inpatient care setting. This highlights the need for further validation of the algorithm in these settings for an effective public health benefit and to gain confidence that our algorithm is fit for purpose. Because the aim of the algorithms is postvaccination AESI detection in support of public health safety surveillance, any potential degradation in performance in the post-EUA period is a concern. If performance decrease in the post-EUA period is driven by postvaccination myocarditis/pericarditis being more likely to have confounding physical findings that could affect how quickly and in which care setting it gets diagnosed, the PPV from this study may not be applicable to a postvaccination version of the phenotype. There is a small overlap in the 2 periods’ PPV 95% CI, and a 2-sample proportion test returns a P value of .08. This suggests that the difference could also be due to statistical noise. However, given the importance of the post-EUA period to the algorithm’s future task and the size of the difference, we suggest validating additional cases in the post-EUA period to confirm whether the algorithm is actually less effective.
In cases with sufficient evidence, our anaphylaxis algorithm performed strongly with a PPV score of 93.3% (95% CI 86.4%-97%). This shows a possible slight improvement over previous anaphylaxis research, although both results were within the 95% CI [ 33 , 34 ]. When stratified by care setting, the algorithm performed better in emergency department care settings. This can be explained due to the anaphylaxis symptoms and treatment being more likely to be well-documented in this setting. Availability of additional evidence increases the PPV of the algorithm. Since anaphylaxis cases related to vaccination are more likely to culminate in visits to the emergency department, the better performance of the algorithm would provide a better public health benefit.
Overall, the performance of the algorithm was moderate compared with that seen in literature. With no obvious avenues for improvement available, no additional sensitivity analyses were applied.
Our initial GBS algorithm showed weak performance for GBS with a PPV of 47.2% (95% CI 35.8%-58.9%). Given existing research on GBS validations, this result is not surprising, since our result is comparable with a study result showing GBS algorithm validation PPV of 29% (95% CI 24%-34%) [ 35 ]. We hoped that our algorithm would improve on this study’s results, allowing us to meet the “moderate” performance threshold defined in the Methods section, given that we added additional logic to require suggested evidence and filter out historical diagnoses. However, we believe that the algorithm’s performance could be improved based on the sensitivity analysis results.
Performance increased when the GBS case definition interpretation was adjusted to allow more general written clinical notes or neurology consult evidence to substitute for specific documented test results. Laboratory results lack standardization, which creates challenges such as inconsistent data. The observed improvement in the GBS phenotype highlights the need for further standardization to achieve a greater public health benefit.
Furthermore, the performance of the GBS algorithm improved when nonspecific medications such as gabapentin were excluded, increasing its public health benefit. Gabapentin is often used to treat generalized neuropathic pain in a variety of conditions other than GBS, including diabetes, and can confound the results.
With both the case definition and medication adjustments, the PPV rose closer to the moderate performance threshold and above that of the cited historical study [ 35 ]. Because these changes were informed post hoc by the cases in the validation study, they might be overfitted to this validation sample and may not generalize. They should be tested in other EHR systems.
The GBS algorithm performed slightly better in the post-EUA period, but each period's performance was well within the other's 95% CI. The GBS algorithm applies only to the inpatient care setting; therefore, no care setting stratification analysis was performed.
Our febrile seizure algorithm performed strongly, with a PPV of 89% among cases with sufficient evidence. This compares favorably with existing febrile seizure algorithm validation research [ 32 ], in which a validation study on the FDA Sentinel database showed a PPV of 70% (95% CI 64%-76%). Our sensitivity analysis suggests that even better performance could be achieved by adding filters to select cases with supporting medication and observation evidence, which are well documented in EHRs; for cases that met either or both criteria, the PPV increased. The stronger performance provides greater public health benefit and further supports the use of EHRs in the detection of AEs. Because these changes to the algorithm were made after the validation was completed, they may overstate the performance gains that would be seen in a new EHR setting, but they offer avenues for a future validation study. Future research can test whether stronger performance is achievable with these filters and can focus on reviewing the algorithm's application to AEs following pediatric vaccinations.
The TTS algorithm showed moderate performance, with a PPV of 70.2%, similar to a separate FDA TTS validation study that estimated performance at 76.1% (95% CI 67.2%-83.2%) [ 29 ]. TTS performance was consistent across the pre- and post-EUA periods, and there were not enough cases in the outpatient and emergency department care settings for any defensible findings on diagnosis care setting stratification. Our sensitivity analysis revealed that when the AESI was diagnosed with the ICD-10 code I81 (portal vein thrombosis), the algorithm's PPV was substantially higher than for all other ICD codes (73.5%, 95% CI 64%-81.3%, compared with 36.4%, 95% CI 21.3%-54.4%). If an increase in specificity is desired at the cost of some sensitivity, the TTS algorithm could be limited to the higher-performing I81 diagnosis code.
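The 95% CIs reported for these stratified PPVs can be computed with a standard binomial interval. A sketch using the Wilson score method follows; the study does not state which interval method was used, and the counts in the usage line are illustrative:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score 95% CI for a binomial proportion (eg, a PPV).
    Behaves better than the Wald interval for small n or extreme proportions."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    halfwidth = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - halfwidth, center + halfwidth

# Illustrative: 33 validated cases out of 47 sampled (PPV ≈ 70.2%)
lower, upper = wilson_ci(33, 47)  # roughly (0.56, 0.81)
```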
There are several limitations to this study. First, it evaluates general AESIs rather than postvaccination AESIs specifically, since the algorithms do not require evidence of vaccine administration as a criterion. While this was necessary because of the rarity of postvaccination AESIs in our data, it is possible that the algorithms perform worse at detecting postvaccination AESIs specifically, since these often present slightly differently when occurring after vaccine administration. For example, the major presenting symptoms appeared to resolve faster in cases of myocarditis after COVID-19 vaccination than in typical viral cases of myocarditis [ 9 ]. To guard against this, we included both pre– and post–COVID-19 EUA data in the hope that post-EUA cases would include some postvaccination AESIs. However, we did not have enough post-EUA cases to build a sample with sufficient statistical power to provide definitive evidence on this topic. A related limitation is the generally small sample size for all stratification, sensitivity, and negative control analyses. We emphasize that these analyses are exploratory, and readers should not draw strong conclusions from them given their small sample sizes and wide CIs. Future research could address these concerns by identifying a data source with enough postvaccination AESI cases to complete a comparably large validation study.
An additional limitation is that this study measures only the algorithms' PPVs rather than other metrics that could give a more holistic picture of performance, such as sensitivity and specificity. These metrics would estimate how many of the total positive cases are identified and how well the algorithm excludes cases without the AESIs. However, we believe this limitation is necessary for the following reasons: (1) the main purpose of this study was to assess the PPV of the phenotypes because it answers the most relevant public health question, namely whether the algorithms will generate a high-quality set of detected AE cases for public health surveillance, and (2) properly estimating sensitivity and specificity would require a much higher cost and more extensive data sharing because of the validation sample size needed for a negative control group. To calculate PPV, one only needs a sample of the cases selected by the algorithm. To estimate sensitivity and specificity, however, it would also be necessary to validate an extremely large negative control sample, since the AESI conditions the algorithms try to detect are often rare events. We would expect it to be rarer still for these conditions to occur without being recorded in the types of structured data elements the phenotypes use. In fact, the lack of structured data elements in some negative control cases led a clinician to ask the research team whether something was wrong because their case had no relevant charted events to review. A much larger validation study would also expose clinicians to a larger set of patient data for cases with a low likelihood of having an AE. Our approach limits interaction with protected health information until the algorithms' PPVs support continued research with broader samples and methodologies.
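The distinction drawn here comes down to which confusion-matrix counts each metric requires; a minimal illustration with hypothetical counts (PPV needs only the algorithm-flagged cases, while sensitivity and specificity also need the unflagged ones):

```python
def ppv(tp, fp):
    """Positive predictive value: needs only cases the algorithm flagged
    (true positives and false positives), ie, a sample of selected cases."""
    return tp / (tp + fp)

def sensitivity(tp, fn):
    """Sensitivity: additionally needs false negatives, ie, true AESI cases
    the algorithm missed, which requires reviewing unflagged encounters."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Specificity: needs true negatives, which for rare conditions means
    validating a very large negative control sample."""
    return tn / (tn + fp)

ppv(tp=70, fp=30)  # → 0.7, computable from flagged cases alone
```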
Another limitation is that, although the algorithms were designed to be simple to deploy, they are still time-consuming to apply to different EHR systems. Although a hallmark of this approach is its interoperability, the algorithm logic must still be applied to the EHR's common data model or extracted and translated into another common data model, as was done for this study. Interoperable codes should be available for all patients, given the regulatory requirement to provide patient data in the interoperable FHIR standard. However, given the recency of this requirement, such codes might not be available in all systems and may require some code translation on the health organization's side, especially for population-level analyses. In addition, because the interoperable codes are only available through a FHIR API, obtaining them for the algorithm adds another data pull and integration with the EHR system.
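As an illustration of this extra FHIR integration step, the sketch below extracts coded diagnoses from a FHIR R4 searchset Bundle such as one returned by an EHR's Condition search endpoint. The resource shapes follow the FHIR R4 specification; this is not the study's actual pull script:

```python
def extract_condition_codes(bundle):
    """Pull (patient reference, code, code system) tuples from a FHIR R4
    searchset Bundle of Condition resources (parsed JSON as a dict)."""
    rows = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") != "Condition":
            continue  # skip OperationOutcome or included resources
        patient = resource.get("subject", {}).get("reference", "")
        for coding in resource.get("code", {}).get("coding", []):
            rows.append((patient, coding.get("code"), coding.get("system")))
    return rows
```

In practice, each page of the Bundle would be fetched over HTTPS and the `link` with `relation: "next"` followed until exhausted.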
In the future, the evolving landscape of health IT may facilitate the public health use cases of detecting and reporting postvaccination AESIs in a safe and secure manner that protects patient privacy. This could be achieved by EHRs supporting secure querying of patient cohorts with probable postvaccination AESIs using clinical query language [ 46 ] or other interoperable query language. Reducing the burden of automatic detection of postvaccination AESIs would help public health organizations improve AE surveillance with minimal additional burden to health care organizations and providers.
A final limitation of this study is that the algorithms were applied at only 1 site. Going forward, algorithm performance should be validated at other sites to ensure generalizability. Although the algorithms were developed without prior input from the data, the study is still limited to 1 health care organization, and the method could have different operating characteristics (PPV, sensitivity, etc) at a second location.
Future research can be performed to improve algorithm accuracy and, as stated previously, would require additional partner EHR data systems. To create a better-performing algorithm, machine learning techniques could be used to identify specific patterns in the data instead of relying on rules-based methods that incorporate published case definition criteria and clinical subject matter expertise. Given enough data, machine learning approaches generally outperform rules-based approaches across domains, and some prior research suggests that this holds in the medical domain as well [ 47 ].
However, machine learning methods may not generalize across EHR systems because the data patterns a model learns can be specific to an individual health care organization. Building a large data set that combines multisite data is extremely difficult and costly because of infrastructure, regulatory, privacy, and data standardization concerns. Federated learning could be explored to alleviate this problem: it allows multiple sites to collaboratively train a global model without directly sharing data and has previously been used to train machine learning algorithms at EHR sites [ 48 ].
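The aggregation step at the heart of this approach can be sketched in a few lines. This is a simplified federated averaging (FedAvg) over plain weight vectors, where each site trains locally and shares only its parameters and sample count, not patient data; it is not a production implementation:

```python
def federated_average(site_weights, site_sizes):
    """FedAvg aggregation: a weighted average of per-site model parameters,
    weighted by each site's local sample count. Only parameters and counts
    leave each site; raw patient records never do."""
    total = sum(site_sizes)
    n_params = len(site_weights[0])
    return [
        sum(weights[i] * n for weights, n in zip(site_weights, site_sizes)) / total
        for i in range(n_params)
    ]

# Two hypothetical sites: one trained on 1 sample, one on 3
federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])  # → [2.5, 3.5]
```

In a full round, the server would broadcast the averaged parameters back to the sites, which train locally again before the next aggregation.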
In summary, this study presents strong initial evidence that simple, interoperable, rules-based phenotypes can detect AESIs in a new data source and that the phenotypes outperform the PPVs of historical validation studies for these conditions. The study validates 5 different AESIs to show that this approach can work for a broad range of AESIs, while also highlighting where the approach might be less successful. For example, the GBS algorithm was built using ICD-10 codes that previous validation studies have shown are not accurate predictors of a GBS case meeting case definition criteria; accordingly, our GBS algorithm performed poorly. The validation sample sizes for all AESIs allowed adequate precision to evaluate algorithm PPV against historical studies.
An active surveillance system can enhance vaccine safety and aid in the development and use of safer vaccines and recommendations to minimize AE risks after vaccination [ 49 ]. The algorithms were developed using a method that should be applicable to, and should generalize to, new EHR databases, but more research is needed to confirm this. If the methodology can be successfully used to detect postvaccination AESI cases across EHR databases, these algorithms could be deployed widely to inform FDA decision-making, promote public safety, and improve public confidence. Going forward, further research and investigation are needed to enhance algorithm performance and integrate the algorithms across health care organizations for active surveillance in the interest of public health.
Development of the manuscript benefitted from significant engagement with the Food and Drug Administration (FDA) Center for Biologics Evaluation and Research (CBER) team members and their partners. The authors thank them for their contributions. Additional feedback on the manuscript was provided by IBM Consulting (Stella Muthuri and Brian Goodness), Accenture Consulting (Shayan Hobbi), and Korrin Bishop (writing and editing). This research was funded through the FDA CBER Biologics Effectiveness and Safety Initiative. Several coauthors hold commercial affiliations with Accenture, IBM Consulting, and MedStar Health Research Institute. Accenture (PSH); IBM Consulting (AAH, JP, JB, AS, EM, LDJ, and MD); and MedStar Health Research Institute (AZH and JB) provided support in the form of salaries for authors but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
The data sets generated and analyzed during this study are not publicly available; they were made available only to the Food and Drug Administration for the purpose of evaluating algorithms for adverse events of special interest outcomes. Inquiries or questions regarding the data should be directed to the corresponding author.
Authors AAH, JP, JB, AS, EM, LDJ, and MD are or were employed by IBM while participating in the study. PSH is employed by Gevity Consulting, Inc, a part of Accenture. Authors AZH and HJB are employed by MedStar Health Research Institute, and AZH holds an appointment with Georgetown University School of Medicine. These authors have delivered clinical and epidemiology consulting engagement for public and private sector partners. These affiliations did not impact the study design, data collection and analysis, decision to publish, or preparation of the manuscript and do not alter our adherence to JMIR policies on sharing data and materials. The opinions expressed are those of the authors and do not necessarily represent the opinions of their respective organizations. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Search terms and code lists for 5 developed phenotypes and detailed case definitions.
AE: adverse event
AESI: adverse event of special interest
API: application programming interface
BEST: Biologics Effectiveness and Safety Initiative
CBER: Center for Biologics Evaluation and Research
CDC: Centers for Disease Control and Prevention
CSF: cerebrospinal fluid
EHR: electronic health record
EUA: emergency use authorization
FDA: Food and Drug Administration
FHIR: Fast Healthcare Interoperability Resources
GBS: Guillain-Barré syndrome
HAPI: Health Level 7 application programming interface
ICD-10-CM: International Classification of Diseases, 10th Revision, Clinical Modification
OHDSI: Observational Health Data Sciences and Informatics
PPV: positive predictive value
RWD: real-world data
TTS: thrombosis with thrombocytopenia syndrome
USCDI: United States Core Data for Interoperability
VAERS: Vaccine Adverse Event Reporting System
WBC: white blood cell
Edited by A Mavragani, T Sanchez; submitted 09.06.23; peer-reviewed by B Ru, AS Bhagavathula; comments to author 20.01.24; revised version received 24.02.24; accepted 26.05.24; published 15.07.24.
©Ashley A Holdefer, Jeno Pizarro, Patrick Saunders-Hastings, Jeffrey Beers, Arianna Sang, Aaron Zachary Hettinger, Joseph Blumenthal, Erik Martinez, Lance Daniel Jones, Matthew Deady, Hussein Ezzeldin, Steven A Anderson. Originally published in JMIR Public Health and Surveillance (https://publichealth.jmir.org), 15.07.2024.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on https://publichealth.jmir.org, as well as this copyright and license information must be included.
Published on 27.6.2024 in Vol 12 (2024)
Authors of this article:
1 Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education/Beijing), National Drug Clinical Trial Center, Peking University Cancer Hospital & Institute, , Beijing, , China
2 Yidu Tech Inc, , Beijing, , China
3 Pfizer (China) Research & Development Co, , Shanghai, , China
4 State Key Laboratory of Holistic Integrative Management of Gastrointestinal Cancers, Beijing Key Laboratory of Carcinogenesis and Translational Research, National Drug Clinical Trial Center, Peking University Cancer Hospital & Institute, , Beijing, , China
Min Jiang, MS
Background: The traditional clinical trial data collection process requires a clinical research coordinator who is authorized by the investigators to read from the hospital’s electronic medical record. Using electronic source data opens a new path to extract patients’ data from electronic health records (EHRs) and transfer them directly to an electronic data capture (EDC) system; this method is often referred to as eSource. eSource technology in a clinical trial data flow can improve data quality without compromising timeliness. At the same time, improved data collection efficiency reduces clinical trial costs.
Objective: This study aims to explore how to extract clinical trial–related data from hospital EHR systems, transform the data into the format required by the EDC system, and transfer them into the sponsor's environment, as well as to evaluate the transferred data sets to validate the availability, completeness, and accuracy of building an eSource data flow.
Methods: A prospective clinical trial study registered on the Drug Clinical Trial Registration and Information Disclosure Platform was selected, and the following data modules were extracted from the structured data of 4 case report forms: demographics, vital signs, local laboratory data, and concomitant medications. The extracted data was mapped and transformed, deidentified, and transferred to the sponsor’s environment. Data validation was performed based on availability, completeness, and accuracy.
Results: In a secure and controlled data environment, clinical trial data was successfully transferred from a hospital EHR to the sponsor’s environment with 100% transcriptional accuracy, but the availability and completeness of the data could be improved.
Conclusions: Data availability was low due to some required fields in the EDC system not being available directly in the EHR. Some data is also still in an unstructured or paper-based format. The top-level design of the eSource technology and the construction of hospital electronic data standards should help lay a foundation for a full electronic data flow from EHRs to EDC systems in the future.
Source data are the original records from clinical trials or all information recorded on certified copies, including clinical findings, observations, and records of other relevant activities necessary for the reconstruction and evaluation of the trial [ 1 ]. Electronic source data are data initially recorded in an electronic format (electronic source data or eSource) [ 2 , 3 ].
The traditional clinical trial data collection process requires a clinical research coordinator (CRC) who is authorized by the investigators to read the hospital's electronic medical record and other clinical trial–related data from the hospital information system and then manually enter the patient's data into the electronic data capture (EDC) system. After data entry, the clinical research associate visits the site to perform source data verification and source data review. The drawbacks of collecting data by manual transcription are that data quality and timeliness cannot be guaranteed and that it wastes human and material resources. Using electronic source data opens a new path to extract patients' data from electronic health records (EHRs) and transfer them directly to EDC systems (this method is often referred to as eSource) [ 4 ]. eSource technology in a clinical trial data flow can improve data quality without compromising timeliness [ 5 ]. At the same time, improved data collection efficiency reduces clinical trial costs [ 6 ].
eSource can be divided into two levels. The first level is to enable the hospital information system to obtain complete data sets; the second level is to allow direct data transfer to EDC systems based on the clinical trial patients’ electronic data in hospitals to avoid the electronic data being transcribed manually again, which is the core purpose of eSource [ 7 ]. This project will explore the use of eSource technology to extract clinical trial data from EHRs, send it to the sponsor data environment, and discuss the issues and challenges occurring in its application process.
This study was approved by the Ethics Committee and Human Genetic Resource Administration of China (2020YW135). During the ethical review process, the most significant challenges were patients' informed consent, privacy protection, and data security. The B7461024 Informed Consent Form (Version 4) states that "interested parties may use subjects' personal information to improve the quality, design, and safety of this and other studies," and "Is my personal information likely to be used in other studies? Your coded information may be used to advance scientific research and public health in other projects conducted in future." This project is an exploration of using electronic source data technology instead of traditional manual transcription in the process of transferring data from hospital EHRs to EDC systems, which will improve the data quality of clinical trials and streamline the data flow in the future. Therefore, this project is within the scope of the informed consent form for study B7461024, which was approved by the ethics committee after clarification.
This project was conducted from December 15, 2020, to November 19, 2021, which was before China’s personal information protection law and data security law were introduced. The data for this project were obtained from an ongoing phase 2, multicenter, open-label, dual-cohort study to evaluate the efficacy and safety of Lorlatinib (pf-06463922) monotherapy in anaplastic lymphoma kinase (ALK) inhibitor–treated locally advanced or metastatic ALK-positive non–small cell lung cancer patients in China (B7461024), registered by the sponsor on the Drug Clinical Trials Registration and Disclosure Platform (CTR20181867). The data extraction involved 4 case report form (CRF) data modules: demographics, concomitant medication, local lab, and vital signs, which were collected in the following ways:
All information was collected from 6 patients in a total of 29 fields ( Textbox 1 ).
Demographics
Concomitant medication
Vital signs
The study chosen in our project used the traditional manual data entry method to transcribe patients’ CRF data into the EDC system. This project proposes testing the acquisition of data directly from the hospital EHR, deidentification of the patients’ electronic data on the hospital medical data intelligence platform, mapping and transforming the data based on the sponsor’s EDC data standard, and transferring the data into the sponsor’s environment. The data was transferred from the hospital to the sponsor’s data environment and compared to data that was captured by traditional manual entry methods to verify the availability, completeness, and accuracy of the eSource technology.
In the network environment of this project, the technology provider accessed the hospital network through a virtual private network (VPN) and a bastion host, and processed the data of this project as a private cloud, thus ensuring the security of the hospital data.
The hospital information system involved in this project has reached the national standards of "Level 3 Equivalence," "Electronic Medical Record Level 5," and "Interoperability Level 4." The medical data intelligence platform in this project is deployed in a hospital intranet, isolated from external networks. Integrated data from different information systems, including the hospital information system, the laboratory information management system (LIMS), the picture archiving and communication system, etc, were deidentified on the platform and transferred through a VPN to a third-party private cloud platform for translation and data format conversion after authorization by the hospital.
The scope of data collection in this project was limited to patients who signed Informed Consent Form (Version 4) for study B7461024. The structured data of 4 CRF data modules (demographics, concomitant medications, local lab, and vital signs) were extracted from the source data in hospital systems, and data processing was completed.
In this project, three layers of deidentification were performed on the electronic source data to ensure data security. The first layer of deidentification was performed before the certified copy of the data was loaded to the hospital's medical data intelligence platform. The second layer followed the Health Insurance Portability and Accountability Act (HIPAA) by deidentifying 18 data fields at the system level. A third layer of deidentification was performed, as required by the project design, when mapping and transforming the third-party databases for the clinical trial data collected for this study (demographics, concomitant medications, laboratory tests, and vital signs).
Collected data did not contain any sensitive information with personal identifiers of the patients, and all deidentification processes were conducted in the internal environment of the hospital. In addition to complying with the relevant laws and regulations, we followed the requirements of Good Clinical Practice regarding patient privacy and confidentiality, and further complied with the requirements of HIPAA to deidentify the 18 basic data fields. Data fields outside the scope of HIPAA will be deidentified and processed in accordance with the TransCelerate guidelines published in April 2015 to ensure the security of patients’ personal information and to eliminate the possibility of patient information leakage [ 8 ].
The general rules for the third layer of deidentification were as follows:
In addition, all data flows keep audit trails throughout and are available for audit.
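As an illustration only (the project's actual deidentification rules are not reproduced here), direct identifiers are commonly replaced with salted one-way hashes so that records remain linkable within a study without exposing the original values:

```python
import hashlib

def pseudonymize_id(patient_id, salt):
    """Replace a direct identifier with a salted one-way hash.
    The salt is kept inside the hospital environment, so the output
    cannot be reversed or re-linked by downstream data recipients."""
    digest = hashlib.sha256((salt + patient_id).encode("utf-8")).hexdigest()
    return digest[:16]  # truncated for readability in exported data sets

# Hypothetical usage: the same (id, salt) pair always maps to the same token
token = pseudonymize_id("MRN-000123", salt="study-local-secret")
```

A salted hash is deterministic per study, which is what allows the same patient's demographics, laboratory, and medication rows to stay joined after deidentification.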
After three layers of deidentification, the data were transferred from the hospital to a third-party private cloud platform through a VPN, where translation from Chinese to English and data format conversion were performed. The entire transfer process covered only the data collected for the clinical trial in this study. Standardization is a crucial task during the data preparation phase: it consolidates data from different systems and structures into a consistent, comprehensible, and operable format. First, a thorough examination of the data from the various systems is necessary to understand each system's data structure, format, and meaning. The second step is establishing a data dictionary that clearly defines the meaning, format, and possible values of each data element. Next, a data standard must be selected to ensure consistency and comparability; in this study, we adopted the Health Level 7 (HL7) standard. Data cleansing and transformation are then needed to meet the standard's requirements, including handling missing data, resolving mismatched data formats, and performing data type conversions. Extract, transform, and load tools were used to integrate data from the different systems, and data security was ensured throughout the integration process, including encrypting sensitive information and strictly managing data permissions. Professional staff then performed data verification and validation on the translated data. The data from the hospital's medical data intelligence platform were converted from JSON format to XML and Excel formats, and the processed data were transferred back to the hospital via the VPN to a designated location for final adjudication before loading into the sponsor's environment.
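The JSON-to-XML conversion step described above can be sketched for a single flat record using the Python standard library (field names are illustrative; the project's actual schemas are not shown):

```python
import xml.etree.ElementTree as ET

def json_record_to_xml(record, root_tag="record"):
    """Convert one flat JSON object (already parsed into a dict and
    already deidentified) into an XML string, one child element per field."""
    root = ET.Element(root_tag)
    for key, value in record.items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")

json_record_to_xml({"sex": "F", "age": 64})
# → '<record><sex>F</sex><age>64</age></record>'
```

Real ETL would also validate each field against the data dictionary before conversion; this sketch shows only the format change.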
After the hospital received the processed data, it was then pushed by the hospital to the sponsor’s secure and controlled environment ( Figure 1 ). All data deidentification processes were conducted in the hospital’s environment, and none of the data obtained by the sponsor can be traced back to patients’ personal information to ensure their privacy and information security.
The data quality of this project was assessed using industry data quality assessment rules [ 9 ], which are shown in Table 1 .
Data validation methods | Dimension | Method description | Cases |
Data availability verification | Field dimension | The ratio of the total number of data fields in the clinical trial CRF^a available in the hospital EHR^b to the total number of data fields required in the electronic CRF: EHR^c/CRF^d × 100% | Based on the electronic CRF, 6 data fields in the demographics module need to be captured, and 3 of them have records in the EHR. Data availability: 3/6 × 100% = 50%
Data availability verification | Field dimension | The ratio of the total number of data fields in the clinical trial CRF (eSource) that can be transmitted electronically from the hospital's EHR to the total number of data fields required in the electronic CRF: eSource^e/CRF^d × 100% | Based on the electronic CRF, 6 data fields in the demographics module need to be captured, and 2 data fields can be captured by the eSource method. Data availability: 2/6 × 100% = 33.33%
Data completeness verification | Numerical dimension | The ratio of the total number of nonnull data values (eSourceV^f) captured (processed and sent to the sponsor) via the eSource method to the total number of data values requested on the electronic CRF: eSourceV^f/CRF^d × 100% | Based on the clinical trial design, 38 concomitant medication pages need to be collected: 7 pages were collected via eSource, and 2 fields were entered per page. Data completeness: 7 × 2/(2 × 38) × 100% = 18.42%
Data accuracy verification | Numerical dimension | Matching of data field values in the hospital's EHR with the data field values captured by eSource (data fields processed and sent to the sponsor) | 2 fields of demographics were successfully transmitted through eSource, with 4 data points in each. After comparison with the data in the electronic data capture system, there were no mismatches for any data point. Data accuracy: 8/(2 × 4) × 100% = 100%
a CRF: case report form.
b EHR: electronic health record.
c Total number of data fields in the hospital’s EHR.
d Total number of data fields requested in the electronic CRF.
e Total number of data fields captured (processed and sent to the sponsor) through the eSource method.
f Total number of nonempty data fields captured (processed and sent to the sponsor) through the eSource method.
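The three verification formulas in Table 1 reduce to simple ratios; a sketch (function names are ours, not from the study), reproducing the worked examples from the table:

```python
def availability(fields_in_ehr, fields_in_crf):
    """CRF-EHR availability: percentage of required CRF fields
    that are present in the hospital EHR."""
    return fields_in_ehr / fields_in_crf * 100

def completeness(nonnull_esource_values, required_values):
    """Completeness: percentage of requested CRF values that arrived
    nonnull through the eSource transfer."""
    return nonnull_esource_values / required_values * 100

def accuracy(matching_values, transferred_values):
    """Accuracy: percentage of transferred values that match the
    values entered directly into the EDC system."""
    return matching_values / transferred_values * 100

availability(3, 6)        # → 50.0 (demographics example from Table 1)
completeness(7 * 2, 2 * 38)  # → ≈18.42 (concomitant medication example)
accuracy(8, 2 * 4)        # → 100.0
```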
In this project, we collected patients' demographics, vital signs, local laboratory data, and concomitant medication data from EHRs, successfully pushed the data directly to the designated sponsor environment, and evaluated the data quality from 3 perspectives: availability, completeness, and accuracy ( Table 2 ).
CRF domain | CRF-EHR data availability, n/N (%) | CRF-eSource data availability, n/N (%) | Data completeness (preliminary findings), n/N (%) | Data accuracy (preliminary findings), n/N (%)
Definition | Study CRF data elements available in hospital EHR | Study CRF data elements available in hospital EHR and able to be electronically transferred through eSource technology | Study CRF data elements available and entered into hospital EHR and transferred through eSource technology | Study CRF data elements available and entered into hospital EHR and transferred through eSource technology with expected result (eg, matches what was entered directly in form)
Demographics | 3/6 (50.00) | 2/6 (33.33) | 12/12 (100.00) | 12/12 (100.00)
Vital signs | 10/10 (100.00) | 9/10 (90.00) | 24/1812 (1.32) | 20/20 (100.00)
Blood biochemical tests | 6/10 (60.00) | 5/10 (50.00) | 12,968/13,540 (95.78) | 7767/7767 (100.00)
Urine sample tests | 6/9 (66.67) | 5/9 (55.56) | 15/40 (37.56) | 15/15 (100.00)
Concomitant medication | 10/10 (100.00) | 9/10 (90.00) | 14/76 (18.42) | 6/6 (100.00)
c Checks were made with the relevant clinical research associates (CRAs) regarding the original data collection and CRF completion methods. Vital signs were recorded on paper tracking forms provided by the sponsor as the original data source, and these data may not have been transcribed into the hospital information system (HIS) by the researcher; therefore, data from many visits are not available in the HIS.
d A total of 2708 blood biochemistry tests were involved.
e Adverse event and concomitant medication (ConMed) data were recorded on tracking forms (a paper source), and these data may not have been transcribed into the HIS. As confirmed by the CRA, approximately 80% of ConMed sources were paper based.
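Read as simple proportions, the three quality metrics in Table 2 can be sketched as follows. This is an illustration only: the exact numerators and denominators (footnotes a–f) vary per CRF domain, and the field names used here are invented.

```python
# Sketch of the three quality metrics, read as simple proportions from the
# definitions in Table 2. Field names in the examples are illustrative.

def availability(crf_fields, ehr_fields):
    """Share of requested CRF fields that exist in the EHR (or eSource feed)."""
    return sum(1 for f in crf_fields if f in ehr_fields) / len(crf_fields)

def completeness(nonempty_points, captured_points):
    """Share of captured data points that are non-empty."""
    return nonempty_points / captured_points

def accuracy(matching_points, transferred_points):
    """Share of transferred data points matching the EDC-entered values."""
    return matching_points / transferred_points
```

For example, the concomitant medication row of Table 2 corresponds to `completeness(14, 76)`, about 18.42%.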
Although EHRs have been widely used, the degree of structure of EHR data varies substantially among different data modules. In EHRs, demographics, vital signs, local laboratory data, and concomitant medications are more structured than patient history or progress notes, which often contain unstructured text [ 10 ]. Therefore, we selected these 4 well-structured data modules for exploration in this project.
For demographics data, among the 6 required fields (subject ID, date of birth, sex, ethnicity, race, and age), subject ID (the subject’s code number in the trial, not the patient’s identifier in the EHR system), ethnicity, and race were not available in the EHR, so the EHR-CRF availability score was 50%. Because this was an exploratory project, the date of birth field was also deidentified under our deidentification rule and thus could not be collected, so the eSource-CRF availability score was 33%. In the future, the availability score could approach 100% through bidirectional design of the EHR and CRF, provided compliance is obtained for industrial-level applications.
The low EHR-CRF availability score for local laboratory data is due to required fields lacking in the hospital system: “Lab ID” and “Not Done” do not exist in the laboratory information management system (LIMS), and the “Clinically Significant” field requires an investigator to manually interpret the laboratory results, so it cannot be transcribed directly. The eSource-CRF availability score was further decreased because “Laboratory Name and Address” is not an independent structured field in the EHR. The completeness score of urine sample test data was only 37.56% because, during the actual clinical trial, and especially amid the COVID-19 pandemic, patients completed study-related laboratory tests at other sites; those results were collected via paper-based reports, so the complete data sets could not be extracted from the site’s system.
To improve data availability in future applications, clinical trial–specific fields need to be added to EHR designs for data that require an investigator’s interpretation, such as “Clinically Significant,” and the data transfer and mapping processes that determine the scope of data collection also need to be optimized. Under these two conditions, the completeness score could be improved to over 90%.
The availability and accuracy of vital signs data were ideal. However, since not all vital signs were recorded by an electronic system during the actual study visits, many vital signs data were collected in “patient diary” and other paper-based documents, resulting in a serious limitation in data completeness. With the development of more clinical trial–related electronic hardware and enhancements in product intelligence, more vital signs data will be collected directly by electronic systems, and the completeness of vital signs data transferred from EHR to EDC will be greatly improved.
In the concomitant medication module, there was a good score for availability and accuracy because the standardization and structuring of prescriptions are well done in this hospital system. However, the patient’s medication use period during hospitalization is recorded in unstructured text, so the data could not be captured for this study, resulting in a low completeness score of 18.42% for concomitant medication.
In summary, the accuracy score of eSource data in this study was high (100% for all fields). A study by Memorial Sloan Kettering Cancer Center and Yale University confirmed that automatic transcription reduced the error rate from 6.7% (with manual transcription) to 0% [ 10 ]. However, data availability and completeness have not yet reached a good level. Data availability varies widely across studies, ranging from 13.4% in the Retrieving EHR Useful Data for Secondary Exploitation (REUSE) project [ 11 ] to 75% in the STARBRITE proof-of-concept study [ 12 ], mainly depending on the coverage and structure of the EHR.
National drug regulatory agencies (eg, the US Food and Drug Administration [FDA], European Medicines Agency, Medicines and Healthcare products Regulatory Agency, and Pharmaceuticals and Medical Devices Agency) have developed guidelines to support the application of eSource to clinical trials [ 3 , 13 - 15 ]. The new Good Clinical Practice guideline issued by the Center for Drug Evaluation in 2020 encourages investigators to use electronic medical records for source data documentation in clinical trials [ 1 ]. Despite this, we still encountered challenges, including ethical review and data security, during this study’s implementation. At the time of this study, there were no existing regulatory policies or national guidance on eSource in China; without precedents to follow, the project team decided to apply the quality-control requirements for clinical trials to this study. The team provided explanations for inapplicable documents and communicated several times with the relevant institutional departments to secure their approval, ultimately becoming the first eSource technology study approved by the Ethics Committee and the Human Genetic Resource Administration of China.
In the absence of regulatory guidelines, our eSource study, the first in an international multicenter clinical trial in China, navigated challenges in data deidentification by adopting the HIPAA and TransCelerate guidelines [ 8 ]. We secured approval under the “China International Cooperative Scientific Research Approval for Human Genetic Resources” after answering the reviewers’ queries, achieving unprecedented recognition. For transferring data from the hospital to the sponsor’s environment, we prioritized security and obtained the necessary approvals, and iterative revisions ensured a robust data flow design. Challenges in mapping the hospital EHR to EDC standards highlighted the need for a scalable mechanism; this study pioneers eSource technology integration in China and emphasizes the importance of seamless data mapping.
Several challenges can arise when executing data standardization. First, data definitions may be inconsistent: because source systems are developed independently, even identical concepts can be interpreted differently. Establishing a unified data dictionary is crucial to ensure consensus on the definition of each data element. Second, different systems may use distinct data formats, such as text encodings; preintegration format conversion is required, and extract, transform, and load tools or scripts can help standardize these formats. Third, when integrating data from multiple systems, data present in one system may be missing from another; the standardization process must decide how to handle missing data, for example, through interpolation or default values. Finally, data from different systems may contain errors, duplicates, or other inaccuracies; data cleansing, involving deduplication, error correction, and logical validation, is necessary to address these quality issues.
Different systems may generate data based on diverse business rules and hospital use scenarios. In data standardization, unifying these rules requires collaboration with domain experts to ensure consistency.
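The standardization steps described above (a unified data dictionary, format conversion, defaults for missing values, and deduplication) can be sketched minimally as follows. The field names and code mappings are invented for illustration; real hospital systems need far richer dictionaries and rules agreed with domain experts.

```python
# Minimal sketch of data standardization: dictionary lookup, format
# conversion, default values, and deduplication. All names illustrative.

DATA_DICTIONARY = {
    # unify the codes that different source systems use for one concept
    "sex": {"M": "Male", "F": "Female", "Male": "Male", "Female": "Female"},
}

def standardize(records):
    seen, out = set(), []
    for rec in records:
        clean = {
            # format conversion: force patient_id to a trimmed string
            "patient_id": str(rec["patient_id"]).strip(),
            # dictionary lookup with a default for missing/unknown values
            "sex": DATA_DICTIONARY["sex"].get(rec.get("sex"), "Unknown"),
        }
        key = (clean["patient_id"], clean["sex"])
        if key in seen:  # cleansing: drop exact duplicates
            continue
        seen.add(key)
        out.append(clean)
    return out
```

After standardization, two records that differ only in source-system encoding (eg, `"M"` vs `"Male"`) collapse into one.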
Internationally, multiple research studies and publications have been released on regulations, guidelines, and validation of eSource. The FDA provided guidance on the use of electronic source data in clinical trials in 2013, aiming to address barriers to capturing electronic source data for clinical trials, including the lack of interoperability between EHRs and EDC systems. The European-wide Electronic Health Records for Clinical Research (EHR4CR) project was launched in 2011 to explore technical options for the direct capture of EHR data within 35 institutions, and the project was completed in 2016 [ 16 ]. The second phase of the project connected EHRs to EDC systems [ 17 ] and aimed to realize the interoperability of EHRs and EDC systems. The US experience focuses more on improving and standardizing existing EHRs to make them more uniform.
In Europe, the experience focuses on breaking down the technical barrier of interoperability between EHRs and EDC systems. In China, the current industry trends focus on the governance of existing EHR data in the hospital and the building of clinical data repository platforms [ 7 ]. Clinical data repository platforms focus on data integration and cleaning between EHRs and other systems in hospital environments and on unstructured data normalization and standardization by natural language processing and other AI technology [ 18 ]. At the national level, China is also actively promoting the digitization of medical big data and is committed to the formation of regional health care databases [ 19 ], which lays the foundation for the future implementation of eSource in China [ 20 ].
This study evaluates the practical application value of eSource in terms of availability, completeness, and accuracy. To improve availability, the structure of the CRF needs to be designed according to the information in the EHR data at the design stage of clinical trials. Even so, since EHRs are designed for physicians to conduct daily health care activities, certain fields in clinical trials (eg, judgment of normal or abnormal laboratory values and judgment of the relatedness of adverse events and concomitant medications) are still not available, and clinical trial–specific fields need to be added to EHR designs for data that require investigators’ interpretation. Completeness could be improved by the development of hospital digitalization that ensures patients’ data are collected electronically rather than on paper. Additionally, 2708 blood test records were successfully collected from only 6 patients via eSource in this study, which indicates that laboratory tests often contain large amounts of highly structured data that are suitable for eSource. EHR-EDC end-to-end automatic data extraction by eSource is well suited to laboratory examinations and can significantly improve the efficiency and accuracy of data extraction as well as reduce redundant manual transcription and labor costs. Processing unstructured or even paper-based data in eSource remains a major challenge; machine learning tools (eg, natural language processing) for automatic structuring can be explored in the future. The goal is to have common data standards and better top-level design to facilitate data integrity, interoperability, data security, and patient privacy protection in eSource applications. During deidentification, we processed certain data with a specific logic to protect privacy; the accuracy assessment was performed during the deidentification step to ensure that the data remained sufficiently accurate while meeting privacy requirements.
Reversible methods need to be used when performing deidentification as well as providing controlled access mechanisms to the data so that the raw data can be accessed when needed. It is worth noting that different regions and industries may have different privacy regulations and compliance requirements. When deidentifying, you need to ensure that you are compliant with the relevant regulations and understand the limitations of data use. This may require working closely with a legal team.
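One common way to make deidentification reversible, as suggested above, is keyed pseudonymization with a separately guarded re-identification table. The sketch below is illustrative only: real deployments additionally need key management, access control, and audit logging, and all names here are assumptions rather than the study’s actual implementation.

```python
import hashlib
import hmac
import secrets

# Sketch of reversible pseudonymization: a keyed token replaces the direct
# identifier, and a separately guarded lookup table supports controlled
# re-identification by authorized users.

SECRET_KEY = secrets.token_bytes(32)  # held only by the data custodian
_reidentification_table = {}          # stored under controlled access

def pseudonymize(patient_id: str) -> str:
    """Replace an identifier with a deterministic keyed token."""
    token = hmac.new(SECRET_KEY, patient_id.encode(),
                     hashlib.sha256).hexdigest()[:16]
    _reidentification_table[token] = patient_id
    return token

def reidentify(token: str) -> str:
    """Authorized reversal via the guarded lookup table."""
    return _reidentification_table[token]
```

Because the token is an HMAC of the identifier, the same patient always maps to the same pseudonym under a given key, which preserves linkability across data sets without exposing the raw identifier.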
In the future, we can consider adding performance analysis, including an assessment of data import performance. This involves evaluating the speed and efficiency of data import to ensure it is completed within a reasonable timeframe. Additionally, analyzing data query performance is crucial in practical applications to ensure that the imported data meets the expected query performance in the application. For long-term applications involving a larger size of patients, it is advisable to consider adding analyses related to maintainability and cost-effectiveness. This includes implementing detailed logging and monitoring mechanisms to promptly identify and address potential issues. Furthermore, for the imported data, establishing a version control mechanism is essential for tracing and tracking changes in the data. Simultaneously, for overall resource use, evaluating the resources required during the data import process ensures completion within a cost-effective framework. It is also important to consider the value of imported data for clinical trial operations and related decision-making, providing a comparative analysis between cost and value.
This research was supported by the Capital's Funds for Health Improvement and Research (grant No. CFH2022-2Z-2153), and the Beijing Municipal Science & Technology Commission (grant No. Z211100003521008).
None declared.
ALK | anaplastic lymphoma kinase
CRC | clinical research coordinator
CRF | case report form
EDC | electronic data capture
EHR | electronic health record
EHR4CR | Electronic Health Records for Clinical Research
FDA | Food and Drug Administration
HIPAA | Health Insurance Portability and Accountability Act
HL7 | Health Level 7
LIMS | laboratory information management system
REUSE | Retrieving EHR Useful Data for Secondary Exploitation
VPN | virtual private network
Edited by Christian Lovis; submitted 19.09.23; peer-reviewed by Hareesh Veldandi, Yujie Su; final revised version received 20.12.23; accepted 18.04.24; published 27.06.24.
© Yannan Yuan, Yun Mei, Shuhua Zhao, Shenglong Dai, Xiaohong Liu, Xiaojing Sun, Zhiying Fu, Liheng Zhou, Jie Ai, Liheng Ma, Min Jiang. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 27.6.2024.
This is an open-access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/ , as well as this copyright and license information must be included.
BMC Medicine, volume 22, article number 288 (2024)
Ethnicity is known to be an important correlate of health outcomes, particularly during the COVID-19 pandemic, where some ethnic groups were shown to be at higher risk of infection and adverse outcomes. The recording of patients’ ethnic groups in primary care can support research and efforts to achieve equity in service provision and outcomes; however, the coding of ethnicity is known to present complex challenges. We therefore set out to describe ethnicity coding in detail with a view to supporting the use of this data in a wide range of settings, as part of wider efforts to robustly describe and define methods of using administrative data.
We describe the completeness and consistency of primary care ethnicity recording in the OpenSAFELY-TPP database, containing linked primary care and hospital records in > 25 million patients in England. We also compared the ethnic breakdown in OpenSAFELY-TPP with that of the 2021 UK census.
78.2% of patients registered in OpenSAFELY-TPP on 1 January 2022 had their ethnicity recorded in primary care records, rising to 92.5% when supplemented with hospital data. The completeness of ethnicity recording was higher for women than for men. The rate of primary care ethnicity recording ranged from 77% in the South East of England to 82.2% in the West Midlands. Ethnicity recording rates were higher in patients with chronic or other serious health conditions. For each of the five broad ethnicity groups, primary care recorded ethnicity was within 2.9 percentage points of the population rate as recorded in the 2021 Census for England as a whole. For patients with multiple ethnicity records, 98.7% of the latest recorded ethnicities matched the most frequently coded ethnicity. Patients whose latest recorded ethnicity was categorised as Mixed were most likely to have a discordant ethnicity recording (32.2%).
Primary care ethnicity data in OpenSAFELY is present for over three quarters of all patients, and combined with data from other sources can achieve a high level of completeness. The overall distribution of ethnicities across all English OpenSAFELY-TPP practices was similar to the 2021 Census, with some regional variation. This report identifies the best available codelist for use in OpenSAFELY and similar electronic health record data.
Ethnicity is known to be an important determinant of health inequalities, particularly during the COVID-19 outbreak, where a complex interplay of social and biological factors resulted in increased exposure, reduced protection and increased severity of illness in particular ethnic groups [ 1 , 2 ]. The UK has a diverse ethnic population (the 2021 Office for National Statistics (ONS) Census estimated 9.6% Asian, 4.2% Black, 3.0% Mixed, 81.0% White, 2.2% Other [ 3 ]), which can make health research conducted in the UK generalisable to other countries. Complete and consistent recording of patients’ ethnic group in primary care can support efforts to achieve equity in service provision and reduce bias in research [ 4 , 5 ]. Ethnicity recording for new patients registering with general practice across the UK has improved following Quality and Outcomes Framework (QOF) financial incentivisation between 2006/07 and 2011/12 [ 6 , 7 ]. As a result, ethnicity is now being captured for the majority of the population in routine electronic healthcare records and is comparable to the general population [ 6 ]. The uptake and utilisation of healthcare services still varies across ethnic groups, and the recently established NHS Race and Health Observatory has led calls for a dedicated drive by NHS England and NHS Digital to emphasise the importance of collecting and reporting ethnicity data [ 8 ].
OpenSAFELY is a secure health analytics platform created by our team on behalf of NHS England. OpenSAFELY provides a secure software interface allowing analysis of pseudonymised primary care patient records from England in near real-time within highly secure data environments.
In primary care data, patient ethnicity is recorded via clinical codes, similar to how any other clinical condition or event is recorded. In OpenSAFELY-TPP, both Clinical Terms Version 3 (CTV3 (Read)) codes and Systematised Nomenclature of Medicine Clinical Terms (SNOMED CT) codes are used. SNOMED CT is an NHS standard, widely used across England.
Ethnicity is also recorded in secondary care, when patients attend emergency care, inpatient or outpatient services, independently of ethnicity in the primary care record. This is available via NHS England’s Secondary Uses Service (SUS) [ 9 ]. It is common practice in OpenSAFELY to supplement primary care ethnicity, where missing, with ethnicity data from SUS [ 10 , 11 ]. Throughout this paper, we refer to ethnicity rather than race as recommended by the ONS: ‘The word “race” places people into categories based on physical characteristics, whilst ethnicity is self-defined and includes aspects such as culture, heritage, religion and identity’. However, we recognise that the distinction between and use of these terms may differ in different settings.
In this paper, we study the completeness, consistency and representativeness of routinely collected ethnicity data in primary care.
Retrospective cohort study across 25 million patients registered with English general practices in OpenSAFELY-TPP.
This study uses data from the OpenSAFELY-TPP database, covering around 40% of the English population. The database includes primary care records of patients in practices using the TPP SystmOne patient information system and is linked to other NHS data sources, including in-patient hospital records from NHS England’s Secondary Use Service (SUS), where ethnicity is also recorded independently of ethnicity in the primary care record.
All data were linked, stored and analysed securely within the OpenSAFELY platform https://opensafely.org/ . Data include pseudonymised information such as coded diagnoses, medications and physiological parameters; no free-text data are included. All code is shared openly for review and re-use under the MIT open licence (opensafely/ethnicity-short-data-report at notebook). Detailed pseudonymised patient data are potentially re-identifiable and therefore not shared.
Patients were included in the study if they were registered at an English general practice using TPP on 1 January 2022.
In primary care data, there is no categorical ‘ethnicity’ variable to record this information. Rather, ethnicity is recorded using clinical codes—entered by a clinician or administrator with a location and date—like any other clinical or administrative event, with specific codes relating to each ethnic group [ 12 , 13 , 14 ]. This means ethnicity can be recorded by the practice in multiple, potentially conflicting, ways over time.
We created a new codelist, SNOMED:2022 [ 13 ], by identifying relevant ethnicity SNOMED CT codes and ensuring completeness by comparing the codelist to the following: another OpenSAFELY created codelist (CTV3:2020) [ 13 ], a combined ethnicity codelist from SARS-CoV2 COVID19 Vaccination Uptake Reporting Codes published by Primary Care Information Services (PRIMIS) [ 12 , 15 ] and a codelist from General Practice Extraction Service (GPES) Data for Pandemic Planning and Research (GDPPR) [ 16 ]. Codes which relate to religion rather than ethnicity (e.g. ‘Muslim—ethnic category 2001 census’) and codes which do not specify a specific ethnicity (e.g. ‘Ethnic group not recorded’) were excluded. In total, 258 relevant ethnicity codes were identified. We then created a codelist categorisation based on the 2001 UK Census categories, which are the NHS standard for ethnicity [ 17 ], and cross referenced it against the CTV3, PRIMIS and GDPPR codelists. The ‘Gypsy or Irish Traveller’ and ‘Arab’ groups were not specifically listed in 2001; however, we categorised them as ‘White’ and ‘Other’, respectively, as per the 2011 Census grouping [ 18 ].
The codelist categorisation consists of two ethnicity groupings based on the 2001 census (Table 1 ): all analyses used the 5-group categorisation unless otherwise stated.
If a SNOMED:2022 ethnicity code appeared in the primary care record on multiple dates, the latest entry was used unless otherwise stated.
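The selection logic described above — map each coded record to a census group via the codelist, then keep the latest entry — can be sketched as follows. The codes and mapping below are placeholders, not the study’s actual SNOMED:2022 codelist (which contains 258 real SNOMED CT codes).

```python
from datetime import date

# Placeholder codes only; the real SNOMED:2022 codelist maps 258
# SNOMED CT codes onto the 2001 census ethnicity groups.
CODE_TO_GROUP = {
    "CODE-WHITE-BRITISH": "White",
    "CODE-INDIAN": "Asian",
}

def latest_ethnicity(records):
    """records: (date, code) tuples from the primary care record.
    Returns the group of the latest codelist match, or None if absent."""
    coded = [(d, CODE_TO_GROUP[c]) for d, c in records if c in CODE_TO_GROUP]
    if not coded:
        return None
    return max(coded, key=lambda r: r[0])[1]
```

Records whose codes are not in the codelist (eg, religion codes, which the study excluded) simply never contribute to the result.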
In OpenSAFELY, the function ethnicity_from_sus combines SUS ethnicity data from admitted patient care statistics (APCS), emergency care (EC) and outpatient attendance (OPA) and selects the most frequently used ethnicity code for each patient. In hospital records from SUS, recorded ethnicity is categorised as one of the 16 categories on the 2001 UK census. This accords with the 16-level grouping described above.
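The described behaviour of `ethnicity_from_sus` — pool codes from the three SUS feeds and keep the most frequent — can be sketched as below. This is not the OpenSAFELY function itself (its API differs), and tie-breaking here is arbitrary.

```python
from collections import Counter

def most_frequent_sus_ethnicity(apcs, ec, opa):
    """Pool ethnicity codes from the three SUS feeds (admitted patient
    care, emergency care, outpatient attendance) and return the most
    frequently recorded one, or None if nothing is recorded."""
    pooled = [code for code in (*apcs, *ec, *opa) if code is not None]
    if not pooled:
        return None
    return Counter(pooled).most_common(1)[0][0]
```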
We looked at the completeness of ethnicity coding in the whole population and across each of the following demographic and clinical subgroups:
Patient age was calculated as of 1 January 2022 and grouped into 5-year bands, to match the ONS age bands.
We used categories ‘male’ and ‘female’, matching the ONS recorded categories; patients with any other/unknown sex were excluded.
Overall deprivation was measured by the 2019 Index of Multiple Deprivation (IMD) [ 19 ] derived from the patient’s postcode at lower super output area level. IMD was divided into quintiles, with 1 representing the most deprived areas and 5 the least deprived. Where a patient’s postcode cannot be determined, the IMD is recorded as unknown.
Region was defined as the Nomenclature of Territorial Units for Statistics (NUTS 1) region derived from the patient’s practice postcode.
As the rate of ethnicity recording would be expected to be lower in patients with fewer clinical interactions, and therefore fewer opportunities for ethnicity to be recorded, completeness was also compared in the clinical subgroups of dementia, diabetes, hypertension and learning disability which are more likely to require additional clinical interactions. Clinical subgroups were defined as the presence or absence of relevant SNOMED CT codes in the GP records for dementia [ 20 ], diabetes [ 21 ], hypertension [ 22 ] and learning disabilities [ 23 ] as of 1 January 2022.
Completeness and distribution of ethnicity recording.
The proportion of patients with either (i) primary care ethnicity recorded (that is, the presence of any code in the SNOMED:2022 codelist in the patient record) or (ii) primary care ethnicity supplemented, where missing, with ethnicity data from secondary care [ 24 ] was calculated. Completeness was reported overall and within clinical and demographic subgroups.
Amongst those patients where ethnicity was recorded, the proportion of patients within each of the 5 groups was calculated, within each clinical and demographic subgroup. We also calculated the distribution of complete ethnicity recording across practices with at least 1000 registered patients.
Discrepancies may arise due to errors whilst entering the data or if a patient self-reports a different ethnic group from their previously recorded ethnic group. We calculated the proportion of patients with any ethnicity recorded which did not match their ‘latest’ recorded grouped ethnicity for each of the five ethnic groups.
We also calculated the proportion of patients whose latest recorded ethnicity did not match their most frequently recorded ethnicity for each of the five ethnic groups.
We calculated the proportion of patients whose latest recorded ethnicity in primary care matched their ethnicity as recorded in secondary care for each of the five ethnic groups, where both primary and secondary care are recorded.
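Each of the concordance measures above amounts to a per-group agreement rate between two recorded values (latest vs any earlier record, latest vs most frequent, or primary vs secondary care). A sketch, with invented toy pairs rather than real patient records:

```python
def concordance_by_group(pairs):
    """pairs: (latest_group, comparison_group) per patient. Returns, for
    each latest-recorded group, the proportion whose two records agree."""
    totals, matches = {}, {}
    for latest, other in pairs:
        totals[latest] = totals.get(latest, 0) + 1
        if latest == other:
            matches[latest] = matches.get(latest, 0) + 1
    return {group: matches.get(group, 0) / n for group, n in totals.items()}
```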
The UK Census collects individual and household-level demographic data every 10 years for the whole UK population. Data on ethnicity were obtained from the 2021 UK Census for England. The most recent census across the UK was undertaken on 27 March 2021. Ethnic breakdowns for the population of England were obtained via NOMIS [ 25 ].
The ethnic breakdown of the census population was compared with our OpenSAFELY-TPP population and the relative difference was calculated using the ONS value as the baseline proportion and OpenSAFELY as the comparator. In the 2021 UK Census, the Chinese ethnic group was included in the Asian ethnic group, whereas in the 2001 census, it was included in the Other ethnic group [ 26 ]. In order to provide a suitable comparison with primary care data, we regrouped the 2021 census data as per the 2001 groups. As an additional analysis, we also compared the primary care data with the census data using the 2021 census categories.
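The relative difference appears to be computed with the ONS census value as the baseline, which can be written as:

```python
def relative_difference(study_pct, census_pct):
    """Relative difference (%) with the ONS census value as baseline."""
    return (study_pct - census_pct) / census_pct * 100
```

Note that recomputing from the rounded percentages reported later in the text gives slightly different values than the paper’s RD figures, which are presumably calculated on unrounded proportions (eg, Black: (3.0 − 4.2)/4.2 × 100 ≈ −28.6 vs the reported −29.4).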
19,618,135 of the 25,102,210 patients (78.2%) registered in OpenSAFELY-TPP on 1 January 2022 had a recorded ethnicity, rising to 92.5% when supplemented with secondary care data (Fig. 1 , Additional file 1: Table S1).
Bar plot showing proportion of registered TPP population with a recorded ethnicity by clinical and demographic subgroups, based on primary care records (solid bars) and when supplemented with secondary care data (pale bars)
Primary care ethnicity recording completeness was lowest for patients aged over 80 years (80.1%) and under 30, whereas ethnicity recording was highest in those over 80 when supplemented with secondary care data (97.1%). Women had a higher proportion of recorded ethnicities than men (79.8% and 76.5% respectively, 94% and 91.1% when supplemented with secondary care data). The completeness of primary care ethnicity recording ranged from 77% in the South East of England to 82.2% in the West Midlands. IMD was within 1.2 percentage points for known values (77.7% in the least deprived group 5 to 78.9% in group 3) and was lowest for the unknown group (71.6%). Primary care ethnicity recording was at least 4 percentage points higher in all of the clinical subgroups compared to the general population.
Using ethnicity recorded in primary care only, 6.8% of the population were recorded as Asian, 2.3% Black, 1.5% Mixed, 65.6% White and 1.9% Other, and ethnicity was not recorded for 21.8%. When supplementing with hospital-recorded ethnicity data, corresponding percentages were 7.8% Asian, 2.6% Black, 1.9% Mixed, 77.9% White, 2.3% Other and 7.5% not recorded, representing a percentage point increase ranging from 0.3% in the Black group to 12.3% in the White group.
Older patients tended to have a higher rate of recorded White ethnicity (e.g. 76.3% in the 80 + group vs 50.0% in the 0–19 group), whereas younger patients had a higher rate of recording for Asian, Black, Mixed and Other groups. The higher proportion of women with recorded ethnicity was reversed in the Asian group where men (7.0% and 8.0% with secondary care data) had a higher proportion of recording than women (6.6% and 7.6% with secondary care data). The proportion of ethnicity reporting was lower for patients with dementia, hypertension or learning disabilities in every ethnic group other than White (Fig. 2 /Additional file 1: Table S2). The breakdown by 16 group ethnicity is shown in Additional file 1: Table S3. There was considerable variation in the completeness of ethnicity recording across practices with at least 1000 registered patients (Fig. 3 ).
Boxplot showing the 5th, 25th, 50th, 75th and 95th percentiles of completeness of ethnicity recording across practices with at least 1000 registered patients
3.1% (611,260) of the 19,618,135 patients with a recorded ethnicity had at least one ethnicity record that was discordant with the latest recorded ethnicity (Table 3 ). Patients whose latest recorded ethnicity was categorised as Mixed were most likely to have a discordant ethnicity recording (32.2%, 118,560), of whom 17.0% (62,565) also had a recorded ethnicity of White. 5.7% (33,205) of the 583,770 patients with the latest recorded ethnicity of Black also had a recorded ethnicity of White (Table 2 ).
Overall, for 19,364,120 (98.7%) of patients, the latest recorded ethnicity in primary care matched the most frequently recorded ethnicity in primary care (Table 3 ). 16,390,425 (99.5%) of patients with the latest recorded ethnicity ‘White’ had a matching most frequently recorded ethnicity. Other was the least concordant group: just 81.6% (399,440) of patients with the latest recorded ethnicity ‘Other’ had a matching most frequently recorded ethnicity. 0.9% (5450) of patients with the latest ethnicity ‘Black’ had the most frequently recorded ethnicity ‘White’ (Additional file 1: Table S4).
Of the 19.6 million total patients with a primary care ethnicity record, 12.9 million (66.0%) also had a secondary care ethnicity record. The proportion of patients with no secondary care coded ethnicity ranged from 31.9% in the White group to 58.6% in the Other group (Additional file 1: Table S5). SNOMED:2022 and secondary care coded ethnicity matched for 93.5% of patients with both coded ethnicities, ranging from 34.8% in the Mixed group to 96.9% in the White group (Fig. 4 , Additional file 1: Table S6).
Sankey plot comparing the categorisation of ethnicity in primary care and secondary care
The proportion of patients in each ethnicity group based on primary care records as of January 2022 was within 2.9 percentage points of the 2021 Census estimate (amended to the 2001 grouping) for the same ethnicity group across England as a whole (Asian: 8.7% primary care, 8.8% Census, relative difference (RD) −1.5%; Black: 3.0%, 4.2%, RD −29.4%; Mixed: 1.9%, 3.0%, RD −36.5%; White: 84.0%, 81.0%, RD 3.6%; Other: 2.5%, 2.9%, RD −15.1%). When supplemented with secondary care data, the largest difference increased to 3.2 percentage points (Fig. 5, Additional file 1: Table S7). In primary care records, the White population was underrepresented in all regions other than the North West (7.1 percentage points higher than Census estimates), South East (2.8) and South West (0.6) and was most severely underestimated in the West Midlands (−12.5). The Asian population was overrepresented in all regions other than the North West (−3.6) and South East (−1.6) (Fig. 6, Additional file 1: Table S8). We also compared the primary care data to the 2021 Census estimates using 2021 rather than 2001 ethnicity groups (Additional file 1: Figs. S1 and S2 and Additional file 1: Table S9).
Bar plot showing the proportion of 2021 Census and primary care populations per ethnicity grouped into 5 groups (excluding those without a recorded ethnicity (21.8% in primary care and 7.5% when supplemented with ethnicity data from secondary care)). Data labels indicate the percentage point difference between 2021 Census and TPP populations
Bar plot showing the proportion of 2021 Census and TPP populations in each ethnicity group by region (excluding those without a recorded ethnicity (21.8% in primary care and 7.5% supplemented with ethnicity data from secondary care)). Data labels indicate percentage point difference between 2021 Census and TPP populations
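The census comparison above uses two metrics: the percentage-point difference and the relative difference (RD). A minimal sketch of both, using the rounded national figures quoted in the text (the published RDs are computed from unrounded proportions, so the recomputed values differ slightly, e.g. Black: −28.6 here vs −29.4 reported):

```python
# Rounded national proportions quoted in the text (primary care vs 2021 Census,
# 2001 grouping); values are percentages.
primary = {"Asian": 8.7, "Black": 3.0, "Mixed": 1.9, "White": 84.0, "Other": 2.5}
census = {"Asian": 8.8, "Black": 4.2, "Mixed": 3.0, "White": 81.0, "Other": 2.9}

for group in primary:
    pp = primary[group] - census[group]  # percentage-point difference
    rd = 100 * (primary[group] - census[group]) / census[group]  # relative difference, %
    print(f"{group}: {pp:+.1f} pp, RD {rd:+.1f}%")
```

The distinction matters for small groups: White differs by 3.0 percentage points but only 3.7% in relative terms, whereas Mixed differs by just 1.1 percentage points yet more than a third in relative terms.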
This study reported ethnicity recording quality in around 25 million patients registered with a general practice in England and available for analysis in the OpenSAFELY-TPP database. Over three quarters of all patients had at least one ethnicity record in primary care data. When supplemented with hospital records, ethnicity recording was 92.5% complete, which is consistent with previously reported England-wide primary care data sources [ 27 , 28 ]. For 98.7% of patients, the latest and most frequently recorded ethnicity matched. As the latest recorded ethnicity is computationally more efficient to derive within OpenSAFELY, we recommend its use. The reported concordance of primary and secondary care records of 93.5% is consistent with those previously reported [ 29 ]. Despite regional variations, the overall ethnicity breakdown across all English OpenSAFELY-TPP practices was similar to the 2021 Census; however, larger relative differences were observed, in particular for the Mixed and Black groups. Therefore, relative to the size of certain ethnic groups, discrepant ethnicity recording practices may be a concern.
This study provides a breakdown of primary care coding in OpenSAFELY-TPP by key clinical and demographic characteristics. The key strengths of this study are the use of large Electronic Health Record (EHR) datasets representing roughly 40% of the population of England registered with a GP, which enabled us to assess the quality of ethnicity data against a variety of important clinical characteristics.
Practices may utilise differing strategies for collecting ethnicity information from patients. Typically, ethnicity is self-reported by the patient at registration or during a consultation [ 30 ], but it may not always be self-reported and may instead reflect an assumption made by the person entering the data. OpenSAFELY-TPP was missing ethnicity for 21.8% of patients, and the missingness of ethnicity data in EHRs may not be random [ 6 ].
This study focussed on the 5-group ethnicity categorisation of the SNOMED:2022 codelist. However, there can be important variations in clinical care within these broad categories, as seen in COVID-19 vaccine uptake [ 31 , 32 ]. More detailed categorisations, alternative coding systems and codelists have been further explored in the OpenSAFELY-TPP Ethnicity short data report.
It is common for OpenSAFELY-TPP studies to supplement the primary care recorded ethnicity, where missing, with ethnicity data from secondary care [ 10 , 11 , 33 ]. The representativeness of the CTV3:2020 coded ethnicity supplemented with SUS data has been reported previously [ 33 ]. However, secondary care data is only available for people attending hospital within the time period that data were available (currently April 2019 onwards in OpenSAFELY). The population who still have no ethnicity record after supplementation are likely very different to the wider population, for example having a much lower chance of having been admitted to hospital, or interacting with healthcare services generally.
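The supplementation practice described above amounts to a coalesce: take the primary care ethnicity where one is recorded, otherwise fall back to the secondary care (SUS) record. A minimal sketch with invented field names (not the OpenSAFELY schema):

```python
# Illustrative patient rows; None marks a missing ethnicity record.
patients = [
    {"id": 1, "primary": "White", "secondary": "White"},
    {"id": 2, "primary": None,    "secondary": "Black"},  # filled from SUS
    {"id": 3, "primary": "Asian", "secondary": None},
    {"id": 4, "primary": None,    "secondary": None},     # still missing
]

# Coalesce: primary care value wins; secondary care only fills gaps.
for p in patients:
    p["combined"] = p["primary"] if p["primary"] is not None else p["secondary"]

print([p["combined"] for p in patients])  # ['White', 'Black', 'Asian', None]
```

Patient 4 illustrates the residual-missingness point made above: anyone still unrecorded after supplementation had, by construction, no hospital-coded ethnicity either, and so is likely unrepresentative of the wider population.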
This study represents a snapshot of ethnicity recording as of 1 January 2022 and does not provide insights into temporal trends in ethnicity recording. Trends in ethnicity recording over time are difficult to investigate due to loss of the record date during the transfer of clinical records when patients register with a new practice (Additional file 1: Fig. S3). Therefore, we are unable to assess the impact of QOF financial incentives being rescinded in 2011/12.
The most up-to-date formal estimates of England’s population by ethnic group currently available are from the 2021 Census. The accuracy of the 2021 Census ethnicity estimates may vary by region. The 2021 Census response rate varied between regions, ranging from 95% in London to 98% in the South East, South West and East of England [ 34 ]. The 2021 Census used multiple imputation to account for missing ethnicity [ 35 ]; the percentage of eligible persons who had an ethnicity value imputed or edited also varied between regions, being highest in London (2.0%) and lowest in the North East (1.0%) [ 34 ].
There are limitations in comparing the GP-registered population with the census population as differences naturally arise. For example, patients registered with a GP may have left the country some years ago and hence not be counted in the census; certain populations are less likely to be registered with a GP (such as Gypsy, Roma and Traveller communities [ 36 ] and migrants [ 37 , 38 ]); not everyone responds to the census but some may be registered with a GP; and regional differences occur, for example due to students moving to cities during term-time. We looked at the GP-registered population in January 2022, whereas the census was taken in March 2021; therefore, some small changes in population also may have occurred during this time.
Over 20 studies have been conducted using the OpenSAFELY framework, so it is important to understand the issues with using ethnicity data in OpenSAFELY. Whilst ethnicity data has been shown to be more complete for the CTV3:2020 codelist than the SNOMED:2022 codelist [ 13 ], the CTV3:2020 codelist included codes such as ‘XaJSe: Muslim—ethnic category 2001 census’, which relate to religion rather than ethnicity and were therefore excluded from the SNOMED:2022 codelist. The common practice of supplementing CTV3:2020 coded ethnicity with either secondary care data or the PRIMIS codelists could lead to inconsistent classification, as both secondary care data and the PRIMIS codelists follow the 2001 census categories.
Recording ethnicity is not straightforward. Indeed, despite often being used as a key variable to describe health, the idea of ‘ethnicity’ has been disputed [ 39 ]. Ethnicity is a complex mixture of social constructs, genetic make-up and cultural identity [ 40 ]. Self-identified ethnicity is not a fixed concept, and evolving socio-cultural trends could contribute to changes in a person’s self-identified ethnic group, particularly for those with mixed heritage [ 41 ]. It is therefore perhaps not surprising to see lower levels of concordance between latest ethnicity and most common ethnicity in those with latest ethnicity coded as ‘Mixed’. It is not clear to what extent this would explain all the discordance we identified or whether other factors such as data error are involved. Our findings agree with previous literature from both the US and UK [ 5 , 41 ], which suggests that the consistency of ethnicity information tends to be highest for White populations and lowest for Mixed or Other racial/ethnic groups [ 42 ].
The 2001 census categories are the NHS standard for ethnicity [ 17 ], but we have not been able to find any explanation for the continued use of the 2001 census categories as the standard.
Due to the significant differences experienced by ethnic groups in terms of health outcomes, accurate ethnicity coding to the most granular code possible is crucial. Although we have focussed on codelist categorisations based on the 2001 census categories, ethnicity can be extracted for each of the component codes (Additional file 1: Table S8), so researchers have the option to use custom categorisations as required.
We believe that the SNOMED:2022 codelist and codelist categorisation provides a more consistent representation of ethnicity as defined by the 2001 census categories than the CTV3:2020 codelist and should be the preferred codelist and categorisation for primary care ethnicity.
This paper is principally intended to inform the interpretation of the numerous current and future analyses completed and published using OpenSAFELY-TPP and similar UK electronic healthcare databases. The practice of supplementing primary care ethnicity with secondary care ethnicity from SUS can, depending on the study design, introduce bias and should be used with caution. For example, patients who have more clinical interactions are more likely to have a recorded ethnicity, and therefore patients with a recorded ethnicity in secondary care data may tend to be sicker than the general population. Ethnicity recording has been found to be more complete for patients who died in hospital compared with those discharged [ 5 ].
This report describes the completeness and consistency of primary care ethnicity in OpenSAFELY-TPP and suggests the adoption of the SNOMED:2022 codelist and codelist categorisation as the best standard method.
Access to the underlying identifiable and potentially re-identifiable pseudonymised electronic health record data is tightly governed by various legislative and regulatory frameworks, and restricted by best practice. The data in OpenSAFELY is drawn from General Practice data across England where TPP is the Data Processor. TPP developers (CB, JC, JP, FH and SH) initiate an automated process to create pseudonymised records in the core OpenSAFELY database, which are copies of key structured data tables in the identifiable records. These are linked onto key external data resources that have also been pseudonymised via SHA-512 one-way hashing of NHS numbers using a shared salt. Bennett Institute for Applied Data Science developers and PIs (BG, CEM, SB, AJW, KW, WJH, HJC, DE, PI, SD, GH, BBC, RMS, ID, KB, EJW and CTR) holding contracts with NHS England have access to the OpenSAFELY pseudonymised data tables as needed to develop the OpenSAFELY tools. These tools in turn enable researchers with OpenSAFELY Data Access Agreements to write and execute code for data management and data analysis without direct access to the underlying raw pseudonymised patient data and to review the outputs of this code. All code for the full data management pipeline—from raw data to completed results for this analysis—and for the OpenSAFELY platform as a whole is available for review at github.com/OpenSAFELY.
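The linkage step described above relies on deterministic pseudonymisation: the same NHS number hashed with the same shared salt always yields the same key, so pseudonymised datasets can be joined on the hashed value without exposing the identifier. An illustrative sketch only: the salt, the concatenation scheme and the example number are all invented here, and the production pipeline is TPP’s, not this code.

```python
import hashlib

def pseudonymise(identifier: str, salt: str) -> str:
    """One-way SHA-512 hash of a salted identifier (illustrative scheme)."""
    return hashlib.sha512((salt + identifier).encode("utf-8")).hexdigest()

# The same identifier and salt always produce the same key, which is what
# makes linkage across separately pseudonymised datasets possible.
key_a = pseudonymise("9434765919", "shared-salt")  # invented example number
key_b = pseudonymise("9434765919", "shared-salt")
assert key_a == key_b
print(len(key_a))  # SHA-512 hex digest is 128 characters
```

Because the hash is one-way, holding the key alone does not reveal the NHS number; the shared salt must be protected, since anyone holding it could hash candidate numbers and test for matches.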
Abbreviations.
APCS: Admitted patient care statistics
CTV3: Clinical Terms Version 3
EC: Emergency care
EHR: Electronic health record
GDPPR: General Practice Extraction Service Data for Pandemic Planning and Research
GP: General practitioner
GPES: General Practice Extraction Service
IMD: Index of Multiple Deprivation
NUTS: Nomenclature of Territorial Units for Statistics
ONS: Office for National Statistics
OPA: Outpatient attendance
PRIMIS: Primary Care Information Services
QOF: Quality and Outcomes Framework
SNOMED CT: Systematised Nomenclature of Medicine Clinical Terms
SUS: Secondary Uses Service
Irizar P, Pan D, Kapadia D, Bécares L, Sze S, Taylor H, et al. Ethnic inequalities in COVID-19 infection, hospitalisation, intensive care admission, and death: a global systematic review and meta-analysis of over 200 million study participants. EClinicalMedicine. 2023;57:101877.
Mathur R, Rentsch CT, Morton CE, Hulme WJ, Schultze A, MacKenna B, et al. Ethnic differences in SARS-CoV-2 infection and COVID-19-related hospitalisation, intensive care unit admission, and death in 17 million adults in England: an observational cohort study using the OpenSAFELY platform. Lancet. 2021;397(10286):1711–24.
Garlick S. Ethnic group, England and Wales - Office for National Statistics. Office for National Statistics; 2022. Available from: https://www.ons.gov.uk/peoplepopulationandcommunity/culturalidentity/ethnicity/bulletins/ethnicgroupenglandandwales/census2021 . Cited 2023 May 24.
Knox S, Bhopal RS, Thomson CS, Millard A, Fraser A, Gruer L, et al. The challenge of using routinely collected data to compare hospital admission rates by ethnic group: a demonstration project in Scotland. J Public Health. 2020;42(4):748–55.
Scobie S, Spencer J, Raleigh V. Ethnicity coding in English health service datasets. Available from: https://www.nuffieldtrust.org.uk/files/2021-06/1622731816_nuffield-trust-ethnicity-coding-web.pdf . Cited 2023 Feb 12.
Mathur R, Bhaskaran K, Chaturvedi N, Leon DA, vanStaa T, Grundy E, et al. Completeness and usability of ethnicity data in UK-based primary care and hospital databases. J Public Health. 2014;36(4):684–92.
Contract changes 2011/12. Available from: https://web.archive.org/web/20110504084616/http://www.nhsemployers.org/PayAndContracts/GeneralMedicalServicesContract/GMSContractChanges/Pages/Contract-changes-2011-12.aspx . Cited 2023 May 24.
Kapadia, Zhang, Salway, Nazroo, Booth. Ethnic inequalities in healthcare: a rapid evidence review. NHS Race and Health Observatory; 2022. Available from: https://www.nhsrho.org/research/ethnic-inequalities-in-healthcare-a-rapid-evidence-review-3/ . Cited 2024 Jun 27.
NHS Digital. Secondary Uses Service (SUS). Available from: https://digital.nhs.uk/services/secondary-uses-service-sus . Cited 2023 May 16.
Fisher L, Hopcroft LEM, Rodgers S, Barrett J, Oliver K, Avery AJ, et al. Changes in English medication safety indicators throughout the COVID-19 pandemic: a federated analysis of 57 million patients’ primary care records in situ using OpenSAFELY. BMJ Med. 2023;2(1):e000392. https://doi.org/10.1136/bmjmed-2022-000392 .
Nab L, Parker EP, Andrews CD, Hulme WJ, Fisher L, Morley J, et al. Changes in COVID-19-related mortality across key demographic and clinical subgroups in England from 2020 to 2022: a retrospective cohort study using the OpenSAFELY platform. Lancet Public Health. 2023;8(5):e364–77.
OpenCodelists: Ethnicity codes. Available from: https://www.opencodelists.org/codelist/primis-covid19-vacc-uptake/eth2001/v1/ . Cited 2022 Sep 13.
OpenCodelists: ethnicity (SNOMED). Available from: https://www.opencodelists.org/codelist/opensafely/ethnicity-snomed-0removed/2e641f61/ . Cited 2022 Sep 13.
OpenCodelists: Ethnicity. Available from: https://www.opencodelists.org/codelist/opensafely/ethnicity/2020-04-27/ . Cited 2022 Sep 13.
PRIMIS develops the national Covid-19 Vaccination Uptake Reporting Specification. Available from: https://www.nottingham.ac.uk/primis/about/news/newslisting/primis-develops-the-national-covid-19-vaccination-uptake-reporting-specification.aspx . Cited 2022 Aug 19.
NHS Digital. General Practice Extraction Service (GPES) Data for pandemic planning and research: a guide for analysts and users of the data. Available from: https://digital.nhs.uk/coronavirus/gpes-data-for-pandemic-planning-and-research/guide-for-analysts-and-users-of-the-data . Cited 2022 Aug 19.
Ethnic Category. Available from: https://www.datadictionary.nhs.uk/data_elements/ethnic_category.html?hl=ethnic . Cited 2022 Aug 22.
Gypsy, Roma and Irish Traveller ethnicity summary. Available from: https://web.archive.org/web/20220213182343/https://www.ethnicity-facts-figures.service.gov.uk/summaries/gypsy-roma-irish-traveller . Cited 2023 Jun 6.
McLennan D, Noble S, Noble M, Plunkett E, Wright G, Gutacker N. The English Indices of Deprivation 2019 : technical report. 2019. Available from: https://dera.ioe.ac.uk/id/eprint/34259 . Cited 2022 Aug 4.
OpenCodelists: Dementia (SNOMED). Available from: https://www.opencodelists.org/codelist/opensafely/dementia-snomed/2020-04-22/ . Cited 2022 Sep 13.
OpenCodelists: Diabetes (SNOMED). Available from: https://www.opencodelists.org/codelist/opensafely/diabetes-snomed/2020-04-15/ . Cited 2022 Sep 13.
OpenCodelists: Hypertension (SNOMED). Available from: https://www.opencodelists.org/codelist/opensafely/hypertension-snomed/2020-04-28/ . Cited 2022 Sep 13.
OpenCodelists: Wider learning disability. Available from: https://www.opencodelists.org/codelist/primis-covid19-vacc-uptake/learndis/v1/ . Cited 2022 Sep 13.
Variable reference. Available from: https://docs.opensafely.org/study-def-variables/ . Cited 2022 Nov 18.
Mortality statistics - underlying cause, sex and age - Nomis - Official Labour Market Statistics. Available from: https://www.nomisweb.co.uk/datasets/mortsa . Cited 2022 Jan 28.
List of ethnic groups. Available from: https://www.ethnicity-facts-figures.service.gov.uk/style-guide/ethnic-groups . Cited 2023 Apr 17.
Wood A, Denholm R, Hollings S, Cooper J, Ip S, Walker V, et al. Linked electronic health records for research on a nationwide cohort of more than 54 million people in England: data resource. BMJ. 2021;7(373):n826.
Pineda-Moncusí M, Allery F, Delmestri A, Bolton T, Nolan J, Thygesen JH, et al. Ethnicity data resource in population-wide health records: completeness, coverage and granularity of diversity. Sci Data. 2024;11(1):221.
Shiekh SI, Harley M, Ghosh RE, Ashworth M, Myles P, Booth HP, et al. Completeness, agreement, and representativeness of ethnicity recording in the United Kingdom’s Clinical Practice Research Datalink (CPRD) and linked Hospital Episode Statistics (HES). Popul Health Metr. 2023;21(1):3.
Hull SA, Mathur R, Badrick E, Robson J, Boomla K. Recording ethnicity in primary care: assessing the methods and impact. Br J Gen Pract. 2011;61(586):e290–4.
Watkinson RE, Williams R, Gillibrand S, Sanders C, Sutton M. Ethnic inequalities in COVID-19 vaccine uptake and comparison to seasonal influenza vaccine uptake in greater Manchester, UK: a cohort study. PLoS Med. 2022;19(3):e1003932.
Curtis HJ, Inglesby P, Morton CE, MacKenna B, Green A, Hulme W, et al. Trends and clinical characteristics of COVID-19 vaccine recipients: a federated analysis of 57.9 million patients’ primary care records in situ using OpenSAFELY. Br J Gen Pract. 2022;72(714):e51–62.
Andrews C, Schultze A, Curtis H, Hulme W, Tazare J, Evans S, et al. OpenSAFELY: representativeness of electronic health record platform OpenSAFELY-TPP data compared to the population of England. Wellcome Open Res. 2022;18(7):191.
Measures showing the quality of Census 2021 estimates. Available from: https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/methodologies/measuresshowingthequalityofcensus2021estimates . Cited 2023 Feb 16.
Wardman L, Aldrich S, Rogers S. Census item edit and imputation process. 2011. Available from: http://www.ons.gov.uk/ons/guide-method/census/2011/census-data/2011-census-userguide/quality-and-methods/quality/quality-measures/response-and-imputation-rates/item-edit-andimputation-process.pdf . Cited 2022 Feb 3.
Tackling inequalities faced by Gypsy, roma and traveller communities. Available from: https://publications.parliament.uk/pa/cm201719/cmselect/cmwomeq/360/full-report.html . Cited 2023 May 25.
Kang C, Tomkow L, Farrington R. Access to primary health care for asylum seekers and refugees: a qualitative study of service user experiences in the UK. Br J Gen Pract. 2019;69(685):e537–45.
Knights F, Carter J, Deal A, Crawshaw AF, Hayward SE, Jones L, et al. Impact of COVID-19 on migrants’ access to primary care and implications for vaccine roll-out: a national qualitative study. Br J Gen Pract. 2021;71(709):e583–95.
Bradby H. Ethnicity: not a black and white issue. A research note. Sociol Health Illn. 1995;17(3):405–17.
Lee C. “Race” and “ethnicity” in biomedical research: how do scientists construct and explain differences in health? Soc Sci Med. 2009;68(6):1183–90.
Saunders CL, Abel GA, El Turabi A, Ahmed F, Lyratzopoulos G. Accuracy of routinely recorded ethnic group information compared with self-reported ethnicity: evidence from the English Cancer Patient Experience survey. BMJ Open. 2013;3(6):e002882.
Arday SL, Arday DR, Monroe S, Zhang J. HCFA’s racial and ethnic data: current accuracy and recent improvements. Health Care Financ Rev. 2000;21(4):107–16.
Aitken M, Tully MP, Porteous C, Denegri S, Cunningham-Burley S, Banner N, et al. Consensus statement on public involvement and engagement with data intensive health research. Int J Popul Data Sci. 2019;4(1):586.
NHS Digital. BETA – Data Security Standards - NHS Digital. Available from: https://digital.nhs.uk/about-nhs-digital/our-work/nhs-digital-data-and-technology-standards/framework/beta---data-security-standards . Cited 2020 Apr 30.
NHS Digital. Data Security and Protection Toolkit - NHS Digital. Available from: https://digital.nhs.uk/data-and-information/looking-after-information/data-security-and-information-governance/data-security-and-protection-toolkit . Cited 2020 Apr 30.
NHS Digital. ISB1523: Anonymisation Standard for Publishing Health and Social Care Data. Available from: https://digital.nhs.uk/data-and-information/information-standards/information-standards-and-data-collections-including-extractions/publications-and-notifications/standards-and-collections/isb1523-anonymisation-standard-for-publishing-health-and-social-care-data . Cited 2023 Jul 20.
Secretary of State for Health and Social Care - UK Government. Coronavirus (COVID-19): notification to organisations to share information. 2020. Available from: https://web.archive.org/web/20200421171727/https://www.gov.uk/government/publications/coronavirus-covid-19-notification-of-data-controllers-to-shareinformation . Cited 2022 Nov 3.
We are very grateful for all the support received from the TPP Technical Operations team throughout this work and for generous assistance from the information governance and database teams at NHS England and the NHS England Transformation Directorate.
BG is guarantor.
Data management was performed using Python 3.8, with analysis carried out using Python and R. Code for data management and analysis, as well as codelists are archived online https://github.com/opensafely/ethnicity-short-data-report/ .
This analysis relies on the use of large volumes of patient data. Ensuring patient, professional and public trust is therefore of critical importance. Maintaining trust requires being transparent about the way OpenSAFELY works, and ensuring patient and public voices are represented in the design and use of the platform. Between February and July 2022, we ran a 6-month pilot of Patient and Public Involvement and Engagement activity designed to be aligned with the principles set out in the Consensus Statement on Public Involvement and Engagement with Data-Intensive Health Research [ 43 ]. Our engagement focused on the broader OpenSAFELY platform and comprised three sets of activities: explain and engage, involve and iterate and participate and promote. To explain and engage, we have developed a public website at opensafely.org that provides a detailed description of the OpenSAFELY platform in language suitable for a lay audience and are co-developing an accompanying explainer video. To involve and iterate, we have created the OpenSAFELY ‘Digital Critical Friends’ Group, comprising approximately 12 members representative in terms of ethnicity, gender and educational background; this group has met every 2 weeks to engage with and review the OpenSAFELY website, governance process, principles for researchers and FAQs. To participate and promote, we are conducting a systematic review of the key enablers of public trust in data-intensive research and have participated in the stakeholder group overseeing NHS England’s ‘data stewardship public dialogue’.
The OpenSAFELY platform is principally funded by grants from:
NHS England [2023–2025];
The Wellcome Trust (222097/Z/20/Z) [2020–2024];
MRC (MR/V015737/1) [2020–2021].
Additional contributions to OpenSAFELY have been funded by grants from:
MRC via the National Core Study programme, Longitudinal Health and Wellbeing strand (MC_PC_20030, MC_PC_20059) [2020–2022] and the Data and Connectivity strand (MC_PC_20029, MC_PC_20058) [2020–2022];
NIHR and MRC via the CONVALESCENCE programme (COV-LT-0009, MC_PC_20051) [2021–2024];
NHS England via the Primary Care Medicines Analytics Unit [2021–2024].
The views expressed are those of the authors and not necessarily those of the NIHR, NHS England, UK Health Security Agency (UKHSA), the Department of Health and Social Care or other funders. Funders had no role in the study design; the collection, analysis and interpretation of data; the writing of the report; or the decision to submit the article for publication.
Authors and affiliations.
Nuffield Department of Primary Care Health Sciences, Bennett Institute for Applied Data Science, University of Oxford, Oxford, OX2 6GG, UK
Colm D. Andrews, Jon Massey, Robin Park, Helen J. Curtis, Lisa Hopcroft, Amir Mehrkar, Seb Bacon, George Hickman, Rebecca Smith, David Evans, Tom Ward, Simon Davy, Peter Inglesby, Iain Dillingham, Steven Maude, Thomas O’Dwyer, Ben F. C. Butler-Cole, Lucy Bridges, Ben Goldacre, Brian MacKenna, Alex J. Walker & William J. Hulme
London School of Hygiene and Tropical Medicine, Keppel Street, London, WC1E 7HT, UK
Rohini Mathur & Laurie A. Tomlinson
TPP, TPP House, 129 Low Lane, Horsforth, Leeds, LS18 5PX, UK
Chris Bates, John Parry, Frank Hester, Sam Harper & Jonathan Cockburn
Wolfson Institute of Population Health, Queen Mary University of London, London, E1 2AT, UK
Rohini Mathur
Contributions.
Conceptualisation: CDA, BM, RP, RM, JM and WJH. Data curation: CDA, RP, RM and JM. Formal analysis: CDA, RP, RM, JM and WJH. Funding acquisition: BG. Methodology: CDA, BM, RP, RM, JM and WJH. Project administration: CDA, RP, RM and JM. Resources: CDA, RM, JM, RP, HJC, LH, LAT and BG. Software: CDA, RM, JM, RP, HJC, LH, AM, SB, GH, RS, DE, TW, SD, PI, ID, SM, TO’D, BFCBC, LB, CB, JP, FH, SH, JC, BG, BM, AJW and WJH. Supervision: AJW, LAT and WJH. Validation: CDA, BM, RP, RM, JM and WJH. Visualisation: CDA, RP, BM, BG, AJW and WJH. Writing—original draft: CDA. Writing—review and editing: CDA, AJW, BM, HJC and WJH.
Colm D Andrews: @colmresearcher.
Correspondence to Colm D. Andrews .
Ethics approval and consent to participate.
NHS England is the data controller; TPP is the data processor; and the researchers on OpenSAFELY are acting with the approval of NHS England. This implementation of OpenSAFELY is hosted within the TPP environment which is accredited to the ISO 27001 information security standard and is NHS IG Toolkit compliant [ 44 , 45 ]; patient data has been pseudonymised for analysis and linkage using industry standard cryptographic hashing techniques; all pseudonymised datasets transmitted for linkage onto OpenSAFELY are encrypted; access to the platform is via a virtual private network (VPN) connection, restricted to a small group of researchers; the researchers hold contracts with NHS England and only access the platform to initiate database queries and statistical models; all database activity is logged; only aggregate statistical outputs leave the platform environment following best practice for anonymisation of results such as statistical disclosure control for low cell counts [ 46 ]. The OpenSAFELY research platform adheres to the obligations of the UK General Data Protection Regulation (GDPR) and the Data Protection Act 2018. In March 2020, the Secretary of State for Health and Social Care used powers under the UK Health Service (Control of Patient Information) Regulations 2002 (COPI) to require organisations to process confidential patient information for the purposes of protecting public health, providing healthcare services to the public and monitoring and managing the COVID-19 outbreak and incidents of exposure; this sets aside the requirement for patient consent [ 47 ]. Taken together, these provide the legal bases to link patient datasets on the OpenSAFELY platform. GP practices, from which the primary care data are obtained, are required to share relevant health information to support the public health response to the pandemic and have been informed of the OpenSAFELY analytics platform.
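The statistical disclosure control for low cell counts mentioned above, together with the rounding of counts to the nearest 5 used throughout the supplementary tables, can be illustrated in a few lines. The suppression threshold below is an assumption for illustration, not the platform’s actual rule.

```python
def disclosure_safe(count, suppress_below=6):
    """Suppress small cells entirely, then round the rest to the nearest 5."""
    if count < suppress_below:
        return None  # small cell: withheld rather than published
    return 5 * round(count / 5)  # rounding to the nearest 5

print([disclosure_safe(c) for c in [3, 17, 123, 16390425]])
# → [None, 15, 125, 16390425]
```

Rounding to the nearest 5 perturbs each published count by at most 2, which is negligible for the multi-million patient counts reported here while reducing the re-identification risk carried by exact small numbers.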
This study was approved by the Health Research Authority (REC reference 20/LO/0651) and by the LSHTM Ethics Board (reference 21863).
Not applicable.
All authors declare the following: BG has received research funding from the Bennett Foundation, the Laura and John Arnold Foundation, the NHS National Institute for Health Research (NIHR), the NIHR School of Primary Care Research, NHS England, the NIHR Oxford Biomedical Research Centre, the Mohn-Westlake Foundation, NIHR Applied Research Collaboration Oxford and Thames Valley, the Wellcome Trust, the Good Thinking Foundation, Health Data Research UK, the Health Foundation, the World Health Organisation, UKRI MRC, Asthma UK, the British Lung Foundation, and the Longitudinal Health and Wellbeing strand of the National Core Studies programme; he is a Non-Executive Director at NHS Digital; he also receives personal income from speaking and writing for lay audiences on the misuse of science. BMK is also employed by NHS England working on medicines policy and clinical lead for primary care medicines data.
Publisher's note.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
12916_2024_3499_moesm1_esm.pdf.
Additional file 1: Fig. S1. Bar plot showing the proportion of 2021 Census and TPP populations (amended to 2021 grouping) per ethnicity grouped into 5 groups (excluding those without a recorded ethnicity). Annotated with percentage point difference between 2021 Census and TPP populations. Fig. S2. Bar plot showing the proportion of 2021 Census and TPP populations (amended to 2021 grouping) per ethnicity grouped into 5 groups per NUTS-1 region (excluding those without a recorded ethnicity). Annotated with percentage point difference between 2021 Census and TPP populations. Fig. S3. Recording of ethnicity over time for latest and first recorded ethnicity. Unknown dates of recording may be stored as ‘1900-01-01’. Table S1. Count of patients with a recorded ethnicity in OpenSAFELY-TPP (proportion of registered TPP population) by clinical and demographic subgroups. All counts are rounded to the nearest 5. Table S2. Count of patients with a recorded ethnicity in OpenSAFELY-TPP by ethnicity group (proportion of registered TPP population) and clinical and demographic subgroups. All counts are rounded to the nearest 5. Table S3. Count of patients with a recorded ethnicity in OpenSAFELY-TPP by ethnicity group (proportion of registered TPP population) and clinical and demographic subgroups. All counts are rounded to the nearest 5. Table S4. Count of patients’ most frequently recorded ethnicity (proportion of latest ethnicity). Table S6. Count of patients with a recorded ethnicity in secondary care by ethnicity group excluding Unknown ethnicities (proportion of primary care population). All counts are rounded to the nearest 5. Table S7. Count of patients with a recorded ethnicity in OpenSAFELY-TPP by ethnicity group (proportion of registered TPP population) and 2021 ONS Census counts [amended to 2001 grouping] (proportion of 2021 ONS Census population). All counts are rounded to the nearest 5. Table S8. Count of patients with a recorded ethnicity in OpenSAFELY-TPP [amended to the 2021 ethnicity grouping] (proportion of registered TPP population) and 2021 ONS Census counts (proportion of 2021 ONS Census population). All counts are rounded to the nearest 5. Table S9. Count of individual ethnicity code use.
The OpenSAFELY Collaborative., Andrews, C.D., Mathur, R. et al. Consistency, completeness and external validity of ethnicity recording in NHS primary care records: a cohort study in 25 million patients’ records at source using OpenSAFELY. BMC Med 22 , 288 (2024). https://doi.org/10.1186/s12916-024-03499-5
The Electronic Health Records (EHR) project was a massive undertaking, so a demonstrator was to be built for a single county. The demonstrator would serve as a test bed for proving the concept, and its operation would yield ideas for the eventual full-scale implementation. Technically, the EHR would enable a General Practitioner (GP) to view a patient's full record and help doctors delivering emergency treatment to quickly determine allergies and prior disease history; until a demonstrator was built and deployed, however, no real evidence of these benefits would emerge. Infosys was engaged to build the demonstrator, show what was actually possible, and suggest improvements.
Complexities were overwhelming, and ownership of the patient record was unclear
Patient care services were affected by rising costs
A growing proportion of the ageing population had long-term conditions
The data would come from disparate legacy systems
To help doctors keep abreast of the latest developments, Infosys developed a Knowledge Management (KM) portal subsystem
Infosys used a Web Services model so that the EHR could consume information from the source systems and deliver accurate results
Infosys integrated the disparate legacy systems
Infosys relied on its Global Delivery Model to demonstrate the benefits of the EHR. As a first step, Infosys analyzed the existing high-level user requirement documents and helped prepare the requirements defining the EHR demonstrator solution.
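The integration challenge above — presenting one consolidated patient view from disparate legacy systems — can be sketched in a few lines. This is a minimal illustration only: the system names, field names, and records below are hypothetical and not taken from the actual project; the real demonstrator would sit behind web-service endpoints rather than in-process adapters.

```python
# Hypothetical sketch: normalizing patient records from two disparate
# legacy systems into a single unified EHR view. All names and data
# are illustrative.

def from_gp_system(raw):
    # The (hypothetical) legacy GP system stores allergies as a
    # semicolon-separated string and uses its own key names.
    return {
        "patient_id": raw["nhs_no"],
        "allergies": [a.strip() for a in raw["allergy_str"].split(";") if a.strip()],
        "history": list(raw.get("conditions", [])),
    }

def from_hospital_system(raw):
    # The (hypothetical) hospital system already uses lists,
    # but with different key names.
    return {
        "patient_id": raw["patientIdentifier"],
        "allergies": list(raw.get("knownAllergies", [])),
        "history": list(raw.get("diagnosisHistory", [])),
    }

def merge_records(records):
    # Combine the normalized records for one patient, de-duplicating
    # entries so an emergency clinician sees one consolidated view.
    merged = {"patient_id": records[0]["patient_id"], "allergies": [], "history": []}
    for rec in records:
        for key in ("allergies", "history"):
            for item in rec[key]:
                if item not in merged[key]:
                    merged[key].append(item)
    return merged

gp = from_gp_system({"nhs_no": "123", "allergy_str": "penicillin; latex",
                     "conditions": ["asthma"]})
hosp = from_hospital_system({"patientIdentifier": "123",
                             "knownAllergies": ["penicillin"],
                             "diagnosisHistory": ["asthma", "type 2 diabetes"]})
unified = merge_records([gp, hosp])
print(unified["allergies"])  # → ['penicillin', 'latex']
```

The per-system adapter functions isolate each legacy format behind a common shape, which is the same design idea a Web Services integration layer applies at the network boundary.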