We live in an era of unprecedented data abundance and aggregation. The sheer variety of new information available on the Internet, in databases, and from other sources has changed the way we conduct business, undertake research, and communicate. Most of the changes are positive. Yet, increased reliance upon networked data has also introduced new challenges. One serious problem we need to address is that of “dirty data”—missing or inaccurate information that resides in (and, indeed, frequently results from) the abundance and aggregation of data in our lives today.
Dirty data can have several pernicious effects. In particular, it:
These concerns are particularly important in the medical field, where data problems represent the dark side of the tremendous potential offered by the adoption of health IT systems. In a “networked” medical setting, dirty data not only introduces economic inefficiencies; it may also cost lives. In addition, the lack of a data quality culture may be a core deterrent for many users in adopting and using health IT today.
As regional and affinity-based information exchange networks around the country develop and implement strategies and architectures to link and share patients’ data, they will have to address the issue of dirty data. Inaccurate patient data may be harmful if not mitigated from the outset, especially when it affects the data fields used to establish individual patient identity through a Record Locator Service1 (RLS). Dirty patient data can, for instance, undermine the matching capabilities of an RLS or produce an unacceptable level of false negatives, in which records belonging to the same patient fail to be linked. This document considers the growing need to develop a “data quality culture” at the network level and lists possible issues and options to consider.
By some estimates, the problem of dirty data in industry has reached epidemic proportions.2 The problem is equally prevalent and potentially even more alarming in health care.3
In a medical setting, dirty data has several consequences:
First and foremost, it can lead to medical errors, which can kill or cause long-term damage to the health of patients. A widely noted 2000 Institute of Medicine report4 estimates, for example, that between 44,000 and 98,000 lives are lost every year due to medical errors in hospitals alone, and that such errors result in an additional $17 to $29 billion in annual healthcare costs. Although not all these errors can be attributed to inaccurate data, a number of studies5 have shown a link between poor quality data (in databases) and medical errors and subsequent poor quality of care. Further, in a “networked” health care setting, the challenge of data accuracy becomes even more critical because a health professional uses the accessible information immediately, especially in the case of an acute illness or emergency intervention, without any built-in step or opportunity to review its accuracy.6
Conversely, improving data quality can increase the quality of care by initiating a positive chain reaction: better data at admission can validate the patient’s need for services and, if those services are then provided, may lead to better outcomes. A study on child mental health services, for instance, showed that 58 percent of the patients had improved outcomes after a data quality improvement project was instituted.7
Poor data quality can also reduce the accuracy of insurance bills. A study analyzing Medicare data found that 2.7 percent of the nearly 11.9 million records in the database, approximately 321,300 records, contained coding errors.8 Such errors can affect insurance reimbursement for the clinician, the patient, or both, and correcting them takes additional time. The study also identified the immediate benefits of addressing the errors: the top 10 coding errors accounted for 70 percent of the total. Focusing on those 10 coding errors would therefore address a high percentage of the problem quickly, saving time and money.
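The arithmetic and the "top 10" observation above can be reproduced with a short calculation. The following Python sketch is illustrative only: the function name `top_n_share` and the sample error codes are assumptions made for this example and do not come from the Medicare study.

```python
from collections import Counter


def top_n_share(error_codes, n=10):
    """Return the fraction of all errors covered by the n most frequent error types."""
    counts = Counter(error_codes)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(n))
    return top / total


# Figures quoted in the text: 2.7 percent of roughly 11.9 million records.
records_with_errors = round(0.027 * 11_900_000)

# Hypothetical error log (made-up codes, not real Medicare data).
sample_errors = ["E01"] * 5 + ["E02"] * 3 + ["E03"] + ["E04"]

if __name__ == "__main__":
    print(records_with_errors)
    print(top_n_share(sample_errors, n=2))
```

Run against a real error log, a calculation of this kind would show whether a small number of error types dominates, which is the situation in which targeted correction pays off most.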
Dirty data can also have serious consequences for patient privacy, especially in a networked environment. A single, originally isolated error in a data set can be magnified (and thus pose a more serious privacy risk) as it is “propagated” into various other data sets, systems, and warehouses, while the potential to redress the error decreases at each step.9 Moreover, a networked and aggregated data environment undermines the “privacy by obscurity” paradigm that was often the sole privacy protection available in an off-line world.
While poor quality data can erode privacy, strong privacy protections can enhance the quality of data and subsequent health care, for example, by increasing trust and therefore increasing the amount of data that patients are willing to share with medical providers.10 “Data accuracy” is therefore one of the nine principles underpinning “The Markle Connecting for Health Architecture for Privacy in a Networked Health Information Environment.”11
Despite the severity of the problem, the risks posed by dirty data often go unrecognized; in many ways, the problem of inaccurate data remains a low priority for companies and organizations.12 It is critical to understand the problem and to develop strategies for minimizing data inaccuracies and the potential harm they cause.
Data quality is broadly defined as “the totality of features and characteristics of a data set that bear on its ability to satisfy the needs that result from the intended use of the data.”13 Data accuracy is one of the “foundational features” that contribute to data quality14 (along with other attributes such as timeliness, relevancy, representation, and accessibility15). In addition, data quality has two essential components: content (i.e., the information must be accurate), and form (i.e., the data must be stored and presented in a manner that makes it usable). These definitions are important to keep in mind when considering ways to minimize data inaccuracies, as they illustrate why the task of fixing dirty data requires more than merely providing “right” information.
Equally important when developing a strategy to increase data quality is identification of the underlying causes of “dirty data.” Two broad categories of errors can be distinguished: systematic and random. Among the sources of systematic errors are: programming mistakes; bad definitions for data types or models; violations of rules established for data collection; poorly defined rules; and poor training. Random errors can be caused by: keying errors; data transcription problems; illegible handwriting; hardware failure (e.g., breakdown or corruption); and mistakes or deliberately misleading statements on the part of patients (or others) providing primary data. This list is not exhaustive; it gives a few examples of the types of errors that may occur. It is worth noting that according to the Data Warehousing Institute, 76 percent of all errors, across sectors and settings, result from “data entry.” This suggests the critical role played by human error; many of the strategies proposed below, therefore, focus on reducing the likelihood of human error.
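Because data entry accounts for so large a share of errors, one common safeguard is to check each record at the moment it is keyed in. The Python sketch below is a minimal illustration under stated assumptions: the field names (`family_name`, `given_name`, `date_of_birth`) and the rules themselves are hypothetical and are not drawn from any RLS specification.

```python
from datetime import date

# Hypothetical required fields; a real system would define its own.
REQUIRED_FIELDS = ("family_name", "given_name", "date_of_birth")


def validate_patient_entry(entry, today=None):
    """Return a list of problems found in one keyed-in record (empty if none)."""
    today = today or date.today()
    problems = []

    # Flag required fields that are absent or blank.
    for field in REQUIRED_FIELDS:
        value = entry.get(field)
        if value is None or not str(value).strip():
            problems.append(f"missing {field}")

    # Check that the date of birth parses as an ISO date and is not in the future.
    dob = entry.get("date_of_birth")
    if dob is not None and str(dob).strip():
        try:
            parsed = date.fromisoformat(str(dob).strip())
        except ValueError:
            problems.append("date_of_birth is not a valid ISO date")
        else:
            if parsed > today:
                problems.append("date_of_birth is in the future")

    return problems
```

A check of this kind catches only errors of form, such as an impossible date. It cannot tell whether a plausible value is actually correct, which is why human review and training remain part of the picture.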
To prevent data quality errors and limit their consequences, health care organizations should develop comprehensive strategies to establish a data quality culture. Ideally, such strategies should be developed from the outset and be embedded in the design of any networked health information exchange system.
Organizations can use a variety of tools and techniques to increase the cleanliness of data, both at the time of collection and during subsequent processing.
For the purposes of Markle Connecting for Health, data cleanliness efforts should be concentrated on those data elements required by the RLS. As the US moves towards widespread data standardization,16 data input quality control can improve the usability and quality of data outputs. It should be noted that the documentation of a clinician cannot, by law, be changed retroactively, as this constitutes a change to the documented medical record of an individual; adding corrected information is allowed.
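The rule that documentation may be added to but not altered can be mirrored in software by an append-only record. The Python sketch below is one possible illustration, not a statement of legal requirements: the class names `NoteEntry` and `ClinicalNote` and their fields are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class NoteEntry:
    """One immutable entry; `corrects` is the index of the entry it amends, if any."""
    text: str
    author: str
    recorded_at: datetime
    corrects: Optional[int] = None


class ClinicalNote:
    """Append-only note: entries can be added, never edited or deleted."""

    def __init__(self):
        self._entries = []

    def add_entry(self, text, author, corrects=None):
        # A correction must point at an entry that already exists.
        if corrects is not None and not 0 <= corrects < len(self._entries):
            raise IndexError("corrects must refer to an existing entry")
        entry = NoteEntry(text, author, datetime.now(timezone.utc), corrects)
        self._entries.append(entry)
        return len(self._entries) - 1

    def history(self):
        # Return an immutable snapshot so callers cannot modify the stored list.
        return tuple(self._entries)
```

In this design the original entry stays visible alongside its correction, so a reader can always see both what was first documented and what was later added.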
Where data cleansing techniques17 are applicable in health care (for example, detecting, but not resolving, a single patient with two records), they can be automated (e.g., in the form of software packages) or involve a human component (e.g., monitoring and training).
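Automated detection of a single patient with two records can be sketched as follows. This Python example is a deliberately simple illustration: it groups records by a normalized key and only flags candidates for human review. The field names and the exact-match-after-normalization rule are assumptions made for this sketch; production matching typically relies on more sophisticated probabilistic methods.

```python
import unicodedata
from collections import defaultdict


def normalize(value):
    """Lower-case a value and strip accents, punctuation, and whitespace."""
    text = unicodedata.normalize("NFKD", str(value))
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return "".join(ch for ch in text.lower() if ch.isalnum())


def candidate_duplicates(records):
    """Group record IDs that share the same normalized name and date of birth.

    Detection only: the groups are returned for review, nothing is merged.
    """
    groups = defaultdict(list)
    for record in records:
        key = (
            normalize(record["family_name"]),
            normalize(record["given_name"]),
            normalize(record["date_of_birth"]),
        )
        groups[key].append(record["record_id"])
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical sample records, invented for illustration.
sample_records = [
    {"record_id": "A1", "family_name": "O'Brien", "given_name": "José", "date_of_birth": "1970-01-05"},
    {"record_id": "B7", "family_name": "OBRIEN ", "given_name": "Jose", "date_of_birth": "1970-01-05"},
    {"record_id": "C3", "family_name": "Smith", "given_name": "Ann", "date_of_birth": "1982-11-30"},
]

if __name__ == "__main__":
    print(candidate_duplicates(sample_records))
```

Keeping detection separate from resolution leaves the decision about whether two records truly belong to the same patient to a person.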
Ultimately, a well-thought-out and comprehensive data quality program should include both automated and human strategies, such as:
Each of these strategies will incur certain costs, but they are likely to be less expensive than addressing errors resulting from a system designed without data quality features. The US health care system has a unique window of opportunity to establish such an internal data quality culture when considering how to adopt health IT systems in the near future.
These “organizational strategies” should be complemented by external strategies, especially redress mechanisms, which encourage identification and correction of errors. Redress mechanisms are frequently built into laws and regulations, which, among other things, allow consumers to access and correct errors in personal information.
In the United States, legal systems for redress date back at least to the Fair Credit Reporting Act of 1970. In addition, redress is built into the Privacy Act of 1974 and the Health Insurance Portability and Accountability Act of 1996.
Common redress strategies include:
Implementing a data quality culture, as suggested above, poses various challenges. Without specifying the operational procedures that may be unique to each network design and RLS implementation, the following set of questions will need to be addressed:
__________
Markle Connecting for Health thanks Stefaan Verhulst, Chief of Research, Markle Foundation, for drafting this paper.
©2006-2012, Markle Foundation
These works were originally published as part of the Markle Connecting for Health Common Framework: Resources for Implementing Private and Secure Health Information Exchange. They are made available free of charge, but subject to the terms of a License. You may make copies of these works; however, by copying or exercising any other rights to the works, you accept and agree to be bound by the terms of the License. All copies of these works must reproduce this copyright information and notice.