Updated: Mar 17
16 March 2021 | Repost from Grant Brodie, Arbutus
DATA QUALITY MATTERS - PART 1
While everyone has heard of the concept "Garbage in, Garbage out" (GIGO), for most of us this is just an accepted, but abstract, concept. Today we'll explore Data Quality (DQ) more closely, and show how to minimize GIGO.
In an ideal world, every system would have the highest levels of built-in DQ checks, but sadly this is almost never the case. While, there are a number of reasons for this, here are just a few:
With the explosion of data, awareness of the importance of DQ is increasing. What this often means is that the older the system is the lower the quality of the data checks. Legacy systems and environments are particularly susceptible. This is because not only was there a lower importance placed on DQ historically, but there might have been limited resources/computing power to implement appropriately comprehensive tests.
Even where you are dealing a modern system, it's hard to think of everything. Think of a simple field like a date. It's easy to think of problems such as very old dates (did they really become a customer in 1919?), dates in the future, or just plain invalid dates ("Unknown", or the ubiquitous " / / "). There are a variety of other situations that aren't hard to think of, but may be overlooked. What about sequencing, like it was shipped before it was ordered, and similar situations? What about date formatting (D/M/Y vs M/D/Y vs, Y/M/D), or just different ways of typing it (slashes vs. dots vs. dashes)? Date pickers are a modern approach to this issue, but may not be available for legacy environments. Harder issues may be weekends, statutory holidays and the like, when the business is closed.
Dates are but one example, but they do show the level that one can investigate DQ. Many data elements in your systems can have the same depth of DQ issues, and virtually every data element offers some opportunity for DQ improvement. It is also important to remember that your uses of your data change over time. What was relatively unimportant at the time a system was implemented can become crucial to your current analyses. When a system was implemented it may not have mattered that a transaction was booked on a day the business was closed, but now it's crucial to identifying a fraud.
Being aware of this phenomenon is one step towards the awareness of DQ's importance.
Finally, DQ can be expensive. Not only is there a cost associated with thinking about (and implementing) DQ, but there can be a significant cost attached to the resources required to enforce it. If a production process runs twice as slowly as a result of DQ tests, there could be significant implications for the business.
No business wants to waste money, and even DQ becomes a cost/benefit analysis. The problem with this is that the subsequent costs and benefits associated with future use of the data are unknown at the time the implementation decisions are being made.
Here's where GIGO rears its head. Bad data can easily affect the results you find, or the quality of the decisions you make, based on that data. As mentioned, it is very challenging to anticipate in advance exactly what errors will subsequently affect your analyses and decision making.
This is particularly true in the dynamic age of big data we find ourselves in today. Perhaps tomorrow's work is based on the day of the week, or the time of day. This data may not have been tested in the past, because this is a new analysis. As an analyst starts the work they might think to test this themselves (if they have the right tools), but if they just trust the data and it's wrong, the results could be compromised.
One Approach - Testing After The Fact
Given that enterprise data encompasses a variety of systems, built in different environments, at different times, and by different personnel, it might be a bit naïve to expect that every system has the appropriate level of base DQ. Instead, a powerful approach can be to test the data after the fact.
Testing after the fact can give you the benefit of hindsight. Rather than anticipating every potential use for every piece of data when a system is implemented, you can design and run new DQ tests now, and identify the data quality issues that affect your current analyses. Further, as your analysis needs change (and therefore your DQ expectations), you can verify these new requirements, without necessarily taking the costly step of updating the source software.
The response to discovered issues will vary, depending on severity. You may choose to ignore the problem, fix the data, or even update the source system to prevent the error from being accepted. Each type of error has a unique impact on your analyses, but awareness that an issue exists is the first step.
Follow our next blog (PART 2) on how to properly implement the Data Quality (DQ) program.