17 March 2021 | Repost from Grant Brodie, Arbutus
DATA QUALITY MATTERS - PART 2
A properly implemented DQ program would:
Proactively monitor data and report data quality issues (with appropriate frequency) for all enterprise applications in order to ensure critical data for management information purposes is always clean
Be easily updated to meet the changing needs of the organization
Achieve better information systems derived from improved data quality detection and reporting
Achieve better business decisions and outcomes from the use of trusted and accurate enterprise data
While it may not be your job to monitor data quality, it may be your job to report on the state of DQ. With the correct tools you can test virtually every data field, and ensure that data quality issues are identified and addressed in a timely manner. By ensuring that the data is clean, and is kept clean, you minimize GIGO and help ensure that everyone can trust your data and its accuracy.
Where's The Error?
When testing DQ you often have two major sources: a data repository, like a data warehouse (DW) or data mart, or the source data itself. Both are different, and both are important to check. The DW should obviously be 'clean', but being clean doesn't mean it's right. Errors in loading the repository (ETL) can easily mask source data errors or issues, and result in incorrect DW contents. Only by looking at the source data can you know your data quality.
A simple explanatory example might help. A DW table might contain a field sex, relating to customers, which is created from an external system. The logic to load the DW field might be as simple as if source:sex="M" then DW:gender="M" otherwise DW:gender="F". This guarantees apparently clean data in the DW, but mis-represents any errors as female, which could have serious implications on your use of the DW. Whether it was unknown at the time of entry, or simply a typo, you never know what's in the source system unless you look.
Looking at the source data, it may just be a single character field that is expected to contain "M" or "F". But the source system may not validate this field, so it may contain "m", "?", "U" or even a type like "L". Identifying the quality of the source system is fundamental to any DQ exercise.
Reading The Data
Source data is traditionally difficult to access, which is one of the reasons warehouses evolved. That is why Arbutus' unique capabilities are so important. Arbutus offers a suite of products that can access all of your source data, in place, regardless of the age of the system or the complexity of the data.
Arbutus technologies are not only compatible with any relational database, but more importantly, our native mainframe application allows access to any of the legacy data on your corporate mainframe as well. This includes VSAM, IMS, DB2, ADABAS and even VSAM and flat files with variable record lengths and multiple record types!
Once written, you can regularly schedule data quality testing jobs that perform a thorough check of your legacy data on a timely basis. They can report data anomalies and errors for review and correction; this will prevent them from propagating throughout your systems. Most importantly, all of this can be realized with no ETL programming. The process is all point-and-click and typically takes just minutes to implement.
Simply put, a data quality (DQ) error implies that the item doesn't match the explicit or implicit metadata definition. An example of an explicit definition might be that the data is numeric, whereas an implicit definition example is that the employee age field shouldn't be less than 16 or greater than 70. Either type of problem could have significant implications.
A commonly accepted IT framework for DQ errors is:
While accurate, you may find these categories to be a little abstract. Instead, you could consider the following data categories:
Invalid data doesn't match the defined field type definition. If the data is supposed to be numeric, but actually contains "UNKNOWN" then the data is invalid. In the case below, there is invalid data in DateField; there are blanks and nonsense dates that have been detected.
Improper data is technically valid, but doesn't match the expected business rules. The field may be blank, improperly formatted (like a phone number), non-unique, improperly sorted, contain characters that while technically valid are inappropriate for the field (like 12345 for an address), or may be too small or large (like ages)
Incomplete data is all technically valid, but violates some higher-level business rule. One example of incomplete data would be sequential invoice or cheque numbers. There may be an expectation that the company's invoice numbers form a continuous set, with no gaps.
Another common example of incomplete data involves the matching of key values to other tables. For example, every customer transaction should have a customer number that is also represented in the customer master file. Identifying errors of this type will not only catch missing customer master records, but also DQ issues in the transaction or master file keys themselves.
Inconsistent data involves data that is internally inconsistent. This might be as simple as a table that contains quantity, price and value fields, where the value is not equal to the quantity times price. Most other inconsistence examples involve matching data (other than keys) between tables. The same value might be represented in two or more tables, and be different in one. Or the total of the customer transactions should total the customer master YTD total, but doesn't.
JUST DO IT
Whatever categories you prefer, Arbutus allows you to fully test all these characteristics, and to automatically schedule their ongoing testing. With a proper process, over time, not only will you consistently achieve more trusted and accurate data, inevitably you will facilitate better overall information systems as result of your ongoing and timely findings.
One last area worth mention is that while some organizations have formalized metadata for all data elements, many have not. If you have formalized rules, then the testing can almost be a check-list. For the majority of organizations that don't have complete formalized metadata, DQ can involve an interactive process to "discover" the rules. Even if you have formalized rules, the discovery process can be very valuable, as very few instances of metadata are complete and current. When discovering rules, you just look at the data and notice issues, but there are organized approaches that often yield results quickly. These include:
Outliers: If you use commands like Classify, you can quickly identify the common values in the table. More importantly though, you can identify the least common values. These can be places where errors hide.
Sorting: Sorting is very easy in Arbutus, just a right-click away. Once done, you can look at the highest and lowest values for each column. Again, data at the extremes is more likely to be incorrect.
Formatting: Where data is expected to have a known format e.g., North American phone numbers as "(999) 999-9999", you can easily test the data with the Format function. Just add a column to your table that is "Format(fieldname)". You can then Classify on this field to find the outliers based on the format of each data element. As with the other techniques, the least common formatting is either wrong by definition, or more likely to harbor errors.
An important part of any discovery process is the feedback loop into the metadata. If you find a problem, then by definition it violates some business rule. You will probably want to add this to your automated testing, to catch it in the future. It is important to ensure that all such rules are formally articulated, so that as you enhance your DQ infrastructure you also start to learn about your data.