Improving Assessment of EHR Data Quality By Explicitly Accounting For Underlying Biases
- Susan Jenkins, PhD

- Feb 9
- 2 min read
Electronic Health Records (EHRs) have become a cornerstone of modern health research and evaluation. Their promise lies in the vast, real-time, and longitudinal data they provide that can be used to improve care, track outcomes, and inform policy. But as the use of EHR data for research grows, so does the need for rigorous data quality assessment.
The Limits of Traditional Data Quality Checks
Most data quality frameworks for EHRs focus on technical dimensions: completeness, correctness, concordance, plausibility, and currency. These are essential; missing or inaccurate data can undermine any analysis. But these checks alone are not enough. They do not address the social and systemic forces that shape which data get recorded, how they are recorded, and whose stories are included in or excluded from the data.
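To make the traditional dimensions concrete, here is a minimal sketch of two of them, completeness and plausibility, run against a hypothetical list of patient records (the field names, values, and the 0-120 age range are illustrative assumptions, not a standard):

```python
# Hypothetical EHR extract; field names and values are illustrative.
records = [
    {"patient_id": 1, "age": 34,   "race": "Black"},
    {"patient_id": 2, "age": None, "race": "White"},   # missing age
    {"patient_id": 3, "age": 212,  "race": None},      # implausible age
    {"patient_id": 4, "age": 67,   "race": "Asian"},
]

def completeness(records, field):
    """Share of records with a non-missing value for `field`."""
    present = sum(1 for r in records if r.get(field) is not None)
    return present / len(records)

def plausibility(records, field, lo, hi):
    """Share of non-missing values for `field` inside a plausible range."""
    values = [r[field] for r in records if r.get(field) is not None]
    in_range = sum(1 for v in values if lo <= v <= hi)
    return in_range / len(values)

print(f"age completeness: {completeness(records, 'age'):.2f}")        # 3 of 4
print(f"age plausibility: {plausibility(records, 'age', 0, 120):.2f}")  # 2 of 3
```

Checks like these are easy to automate, which is exactly why they dominate existing frameworks, and why the harder, bias-related questions below tend to go unasked.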
Bias Is Baked Into the Data
EHRs are not neutral repositories. They reflect the realities of healthcare access, provider practices, and institutional policies. For example, EHR systems are more likely to include data from patients who are older, White, more educated, or more likely to seek care. Historically underserved patients are often underrepresented or misclassified due to inconsistent or biased data collection practices. This selection bias means that even “complete” datasets may not be representative, and research findings may not generalize to the populations most in need.
Information Bias and Misclassification
Beyond who is included, there is the issue of how information is captured. Race and ethnicity, for example, may be recorded based on provider perception rather than patient self-identification, leading to misclassification. Records may also lose important nuance when race and ethnicity are captured only at the major category level rather than the detailed category level, or when data systems and analytic models are not designed to handle data from patients who select multiple race or ethnicity categories. Documentation practices vary widely, and implicit biases among healthcare providers can influence what gets recorded and how. These information biases can skew research findings, especially in studies of health disparities or evaluations of interventions intended to promote universal public health.
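One way to quantify this kind of misclassification, where a self-reported value is available for comparison, is a simple concordance check: of the records that have a self-report, how often does the EHR agree? A minimal sketch, with illustrative field names and values:

```python
# Hypothetical records pairing the EHR-recorded race with patient
# self-report; field names and values are illustrative assumptions.
records = [
    {"ehr_race": "White", "self_reported_race": "White"},
    {"ehr_race": "White", "self_reported_race": "Hispanic"},  # misclassified
    {"ehr_race": "Black", "self_reported_race": "Black"},
    {"ehr_race": "Asian", "self_reported_race": None},        # no self-report
]

def concordance(records, ehr_field, ref_field):
    """Share of records with a reference value where the EHR value agrees."""
    pairs = [(r[ehr_field], r[ref_field])
             for r in records if r.get(ref_field) is not None]
    agree = sum(1 for ehr, ref in pairs if ehr == ref)
    return agree / len(pairs)

print(f"race concordance: {concordance(records, 'ehr_race', 'self_reported_race'):.2f}")
```

A concordance rate well below 1.0 on a sensitive field is a signal that analyses stratified by that field may be distorted, even if the field looks "complete."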
The Consequences of Ignoring Bias
If data quality assessments focus only on technical metrics and ignore bias, they risk giving a false sense of confidence in the data. Studies may miss or misrepresent disparities, interventions may be less effective for marginalized groups, and policies based on biased data may inadvertently reinforce inequities. In short, data quality without bias assessment is not true quality.
Toward a More Complete Approach
A complete assessment of EHR data quality must include:
- Evaluation of representativeness: Who is in the data, and who is missing?
- Analysis of documentation practices: How might provider or system-level biases affect what is recorded?
- Attention to misclassification: Are key variables like race, ethnicity, and social needs accurately captured?
- Transparency about limitations: Are potential biases and their implications clearly reported?
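The first item, representativeness, can be screened for with a simple ratio: each group's share in the EHR data divided by its share in an external benchmark such as census estimates for the catchment area. A minimal sketch, where all counts, shares, group labels, and the 0.8 flag threshold are illustrative assumptions:

```python
# Hypothetical EHR group counts and external benchmark shares
# (e.g., census estimates); all numbers here are illustrative.
ehr_counts = {"White": 6200, "Black": 1100, "Hispanic": 900,
              "Asian": 500, "Other/Unknown": 1300}
benchmark_shares = {"White": 0.58, "Black": 0.13, "Hispanic": 0.19,
                    "Asian": 0.06, "Other/Unknown": 0.04}

def representation_ratios(counts, benchmark):
    """Ratio of each group's share in the data to its benchmark share.
    Values well below 1.0 flag under-representation."""
    total = sum(counts.values())
    return {g: (counts[g] / total) / benchmark[g] for g in counts}

ratios = representation_ratios(ehr_counts, benchmark_shares)
for group, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    flag = "  <- under-represented" if ratio < 0.8 else ""
    print(f"{group:15s} {ratio:.2f}{flag}")
```

A ratio far above 1.0 for an "Other/Unknown" category is itself a finding: it often reflects the inconsistent collection practices described above rather than a real population group.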
By integrating bias assessment into data quality frameworks, researchers and evaluators can produce findings that are not only technically sound but also equitable and actionable.
