The Importance of High Quality Data for Artificial Intelligence Reliability

Garbage in = Garbage Out = High Risk

Lindsay Westervelt

Artificial Intelligence (AI) is a hot topic right now in medical practice. Though there are many reasons why AI solutions have become so popular, one of the biggest reasons is that AI has the potential to reduce clinical burnout and fatigue by improving Clinical Decision Support in electronic health record (EHR) systems. To understand how this can be achieved, we must first understand what AI is and how it works.

AI is “the science of training machines to perform human tasks” by creating predictive logic so that the machine “acts” accordingly. To train machines, developers use Machine Learning, which is the process of entering data into an algorithm and teaching a machine how to learn and provide answers based on that data. According to SAS Insights, Machine Learning has a continuous life cycle wherein developers ask questions, collect data, test their algorithms, collect feedback, and improve the algorithm.

For Healthcare, AI and Machine Learning algorithms must rely on copious amounts of granular, high-quality data. That's why we've asked J P Systems' nationally recognized Clinical Data Quality expert, Sandi Mitchell, to weigh in on how data quality affects AI algorithms and what healthcare organizations and developers can do to maximize AI-supported outcomes.

Sandi Mitchell is a Healthcare Data Quality Analyst SME at J P Systems. Ms. Mitchell worked in pharmacy for years, becoming an expert in clinical data. She now leads J P Systems’ Clinical Data Quality team, which works tirelessly to identify clinical data problems and root causes to help eliminate missing, miscoded, and misplaced data, all of which pose risks to patient safety. Among her other experiences and accolades, Ms. Mitchell has worked at Johns Hopkins, and she taught pharmacy students for seven years.

Question: What constitutes data quality and how does it affect healthcare?

Sandi Mitchell: Data quality work involves analyzing clinical data using a variety of tools. In our work, we assess data formats, completeness, adherence to prescribed standards, required value sets, and message constraints. We also review healthcare domains, which include not only demographics and payor values, but all of the clinical domains (for example, lab tests, results, or allergies).
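As a rough illustration of the kinds of checks described above (completeness, format, and value sets), here is a minimal Python sketch. The field names, the required-field list, and the value set are assumptions made up for this example; they do not come from any particular standard or from J P Systems' tooling.

```python
from datetime import date

# Hypothetical required fields and value set, for illustration only.
REQUIRED_FIELDS = ("patient_id", "birth_date", "sex")
ALLOWED_SEX_VALUES = {"F", "M", "UNK"}


def find_defects(record: dict) -> list[str]:
    """Return a list of readable defects found in one record (string fields assumed)."""
    defects = []
    # Completeness: every required field must be present and non-empty.
    for name in REQUIRED_FIELDS:
        if not str(record.get(name) or "").strip():
            defects.append(f"missing: {name}")
    # Format: birth_date is expected as an ISO date such as 1980-01-02.
    birth_date = record.get("birth_date")
    if birth_date:
        try:
            date.fromisoformat(birth_date)
        except ValueError:
            defects.append("bad format: birth_date")
    # Value set: sex must be one of the allowed values.
    sex = record.get("sex")
    if sex and sex not in ALLOWED_SEX_VALUES:
        defects.append("not in value set: sex")
    return defects
```

A record with an empty `patient_id`, a birth date written as `01/02/1980`, and a sex value of `female` would come back with three defects under these assumed rules, one for each kind of check.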

The goal of this work is to improve the quality of medical data exchanged between health networks and their community partners. Medical providers need complete patient data so they can make critical decisions about patient care. However, 50-70% of exchanged patient records aren't usable due to missing, miscoded, or misplaced data. This increases clinicians’ burden if they don't trust the data, which is often the case when internal and external data are mixed within systems that don't require high quality clinical data.

Ultimately, poor quality clinical data exacts a cost on everyone. Physicians become frustrated or struggle with increased burnout because they have to repeat tests or spend valuable time searching for data. The clinical environment suffers when physicians schedule multiple appointments for one patient to repeat tests and deny other patients those time slots. Patients don't receive proper care, but they do receive multiple bills that their insurance might not cover. Finally, researchers and developers have to create workarounds, define and implement data transformations, or hire more people to figure out data problems.

Q: How could data quality affect Machine Learning?

Mitchell: There are multiple types of Machine Learning, but all of them require inputting data into an algorithm to create AI logic. To support Machine Learning and AI efforts in medical care, developers work closely with clinicians to refine the process by testing the algorithm with clinical data.

Today, clinical data is stored in EHRs and can be exchanged using HL7 Consolidated Clinical Document Architecture (C-CDA) documents. A C-CDA is a type of clinical document that contains a structured set of clinical notes about a given patient and an area of concern. C-CDAs are broken up into clinically logical groups and domains, such as demographics, allergies, medications, documents, problem lists, and discharge information. C-CDAs are organized in a way that reflects clinician workflows and EHR vendors' system architecture.

To properly structure data and system architecture, an organization must abide by rules, which are called data standards. These standards are used to provide guidance, explain data elements and groupings, and improve data exchange between various healthcare providers. Previously, these rules fell under the Office of the National Coordinator (ONC) Common Core; now they are moving to the United States Core Data for Interoperability (USCDI) standard set and HL7 FHIR® resources. As noted on the official USCDI webpage, USCDI is a standardized set of health data classes and constituent data elements for nationwide, interoperable health information exchange.

Under the current HL7 C-CDA standard, a receiving EHR system has no ability to reject exchanged clinical data, which means that patient data can be incorrect and unusable. Under FHIR®, an HL7 data standard, EHR systems will need to have all FHIR® resources in place so that a receiving system can accept exchanged data. Otherwise, the receiving system will reject the sender's data and return it with an error message. This could affect Machine Learning because AI subsets like Natural Language Processing (NLP) rely on data to parse text for patterns and irregularities.
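The accept-or-reject behavior described above can be sketched in a few lines of Python. This is not the FHIR® API or real FHIR® validation; the `receive` function and the list of required elements are hypothetical and only show the idea of a receiving system returning an error message instead of silently storing incomplete data.

```python
# Hypothetical required elements for an incoming allergy entry.
REQUIRED_ELEMENTS = ("patient_id", "substance_code", "code_system")


def receive(entry: dict) -> dict:
    """Accept a complete entry, or reject it with an error message for the sender."""
    missing = [name for name in REQUIRED_ELEMENTS if not entry.get(name)]
    if missing:
        return {
            "status": "rejected",
            "error": "missing required elements: " + ", ".join(missing),
        }
    return {"status": "accepted", "error": None}
```

A permissive receiver would store the entry either way. The strict version pushes the defect back to the sender, where the root cause can be fixed.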

There's a direct relationship between the quality of clinical data and the functionality of AI tools. If an AI doesn't have the proper clinical data or format to return a correct answer, it can't learn well, and it can't find new patterns. For example, say there are 10 allergies in a given data set. A person would acknowledge that there are 10 allergies; however, an AI engine might only return 2 allergies if the data set didn't have structured, standardized data.
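The allergy example can be made concrete with a small Python sketch. It assumes a simplified engine that only recognizes entries carrying a code; the entries and the `EX-` codes are invented placeholders, not real terminology codes.

```python
# Ten allergy entries for one hypothetical patient. Only two carry a code;
# the other eight were entered as free text.
allergy_entries = [
    {"text": "penicillin", "code": "EX-0001"},
    {"text": "peanut", "code": "EX-0002"},
    {"text": "sulfa drugs"},
    {"text": "latex"},
    {"text": "shellfish"},
    {"text": "aspirin"},
    {"text": "codeine"},
    {"text": "eggs"},
    {"text": "bee stings"},
    {"text": "contrast dye"},
]


def coded_allergies(entries: list[dict]) -> list[dict]:
    """Return only the entries a code-based engine could recognize."""
    return [entry for entry in entries if entry.get("code")]


print(len(allergy_entries))                   # what a person reading the chart sees
print(len(coded_allergies(allergy_entries)))  # what the code-based engine sees
```

Here the person reading the chart counts 10 allergies, while the code-based filter returns 2. The other eight are not wrong, but they are invisible to any logic that depends on standardized codes.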

AI has a lot of functions, and none of them can be done with partial data.

Q: What would it take for AI to function as needed in healthcare?

Mitchell: When people think about interoperability, they're usually only thinking as far as patient matching or harmony with value sets. However, there is more to it than that. EHR systems have been around since the 1970s and didn't start exchanging data until the late 1990s or early 2000s. That means we have 25 to 30 years of clinical data that is old, irrelevant, or formatted incorrectly, and it’s challenging to migrate historic data into current standards.

We encourage healthcare organizations to start with baby steps; don't just jump into AI and turn it on. First, discuss the need for clinical data quality thresholds, defined clinical data defects, and established taxonomies.
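One way to picture a clinical data quality threshold is as a simple gate on a batch of records before it is handed to an AI tool. The Python sketch below assumes each record has already been given a defect count by some earlier check; the function names and the 0.95 default are illustrative assumptions, not recommended values.

```python
def usable_share(defects_per_record: list[int]) -> float:
    """Fraction of records in a batch that have zero defects."""
    if not defects_per_record:
        return 0.0
    usable = sum(1 for count in defects_per_record if count == 0)
    return usable / len(defects_per_record)


def meets_threshold(defects_per_record: list[int], threshold: float = 0.95) -> bool:
    """True if the batch is clean enough to pass the (hypothetical) threshold."""
    return usable_share(defects_per_record) >= threshold
```

For instance, a batch of 10 records in which 6 have at least one defect has a usable share of 0.4 and would not pass. Where to set the threshold, and what counts as a defect, are the kinds of decisions an organization would need to agree on first.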

We still have a lot more work to do, and everyone has a part to play. We need to not just check the boxes on the Cures Act and data standards; we need to make sure we're doing all we can to exchange high quality medical data effectively. This includes auditing all types of data coming in from external partners. AI cannot reach its potential with low quality data. Period.