Data Quality can sometimes feel like the Holy Grail, with those of us who generate, process, or use data doing our best impersonations of Sean Connery from “Indiana Jones and the Last Crusade.” We dedicate so much time, treasure, blood, sweat, and tears… a lot of tears… trying to ensure the highest-quality data.

But what makes “quality data”?

High-quality data yields accurate insights, whereas poor-quality data can lead to misleading results. Whether you use off-the-shelf tools or your own DIY approach to assess the quality of a data set, several key characteristics should be examined to ensure the data is suitable for analysis and decision-making.

Here are Seven (7) Data Characteristics to Check:

[1] Accuracy: Accuracy refers to how closely your data reflects the true values or real-world phenomena it represents. Inaccurate data can result from measurement errors, data entry mistakes, or outdated sources. To assess accuracy, compare your data set against a trusted master data source.
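
If you work in Python, a minimal pandas sketch of this kind of check might look like the following (the tables, key, and column names here are hypothetical):

```python
import pandas as pd

# Hypothetical working data and a trusted master/reference table
working = pd.DataFrame({"customer_id": [1, 2, 3],
                        "country": ["US", "DE", "FR"]})
master = pd.DataFrame({"customer_id": [1, 2, 3],
                       "country": ["US", "DE", "ES"]})

# Join on the key and flag rows where the working value disagrees with the master
merged = working.merge(master, on="customer_id", suffixes=("_working", "_master"))
mismatches = merged[merged["country_working"] != merged["country_master"]]
print(mismatches)  # rows whose values do not match the trusted source
```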

[2] Completeness: A complete data set includes all necessary records and attributes. Missing values, incomplete records, or absent columns can compromise analysis. It’s essential to check for missing entries and evaluate their impact. For example, missing values in critical fields like dates or IDs can prevent correct analysis or modeling.
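
A quick completeness check could look something like this sketch (again with pandas, and with hypothetical column names):

```python
import pandas as pd

# Hypothetical data set with gaps in critical fields
df = pd.DataFrame({"order_id": [100, 101, None],
                   "order_date": ["2024-01-05", None, "2024-01-07"]})

# Count missing values per column, then flag rows missing any critical field
print(df.isna().sum())
critical = ["order_id", "order_date"]
incomplete_rows = df[df[critical].isna().any(axis=1)]
print(incomplete_rows)  # records that could break downstream analysis or modeling
```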

[3] Consistency: Consistency involves ensuring that data across different sources or within the same data set doesn’t conflict. Inconsistent data might show, for instance, a customer with two different birth dates or inconsistent naming conventions. Verifying that data values follow uniform formats and standards helps maintain consistency.
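
One simple way to surface such conflicts is to count distinct values per entity; a rough sketch, with hypothetical fields:

```python
import pandas as pd

# Hypothetical records where one customer appears with two different birth dates
df = pd.DataFrame({"customer_id": [1, 1, 2],
                   "birth_date": ["1980-03-01", "1981-03-01", "1975-07-20"]})

# Each customer should map to exactly one birth date; more than one signals a conflict
distinct_values = df.groupby("customer_id")["birth_date"].nunique()
print(distinct_values[distinct_values > 1])  # customer_ids with inconsistent values
```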

[4] Validity: Validity, sometimes called “conformance,” refers to whether the data conforms to expected formats, types, and ranges. For example, an email address field should contain only properly formatted email strings, and a date of birth field should not include future dates. Rules derived from your data set’s metadata can help you check values against business rules, allowed ranges, semantics, and so on.
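
A small sketch of a format and range validation (the email pattern and field names are illustrative assumptions, not a universal standard):

```python
import pandas as pd

# Hypothetical fields to validate against format and range rules
df = pd.DataFrame({"email": ["a@example.com", "not-an-email"],
                   "birth_date": pd.to_datetime(["1990-05-01", "2030-01-01"])})

# Format rule: email must look like an address; range rule: birth date cannot be in the future
valid_email = df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
valid_dob = df["birth_date"] <= pd.Timestamp.today()
print(df[~(valid_email & valid_dob)])  # rows violating either rule
```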

[5] Timeliness: Data should be up-to-date and relevant for the intended use. Outdated data may not reflect current realities, especially in fast-changing domains like finance or healthcare. Check the timestamps and collection dates to ensure your data set reflects the appropriate time period.
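
A sketch of a timeliness check against an assumed cutoff date (the window is something you would choose for your own use case):

```python
import pandas as pd

# Hypothetical records with collection timestamps
df = pd.DataFrame({"reading": [10.5, 11.2, 9.8],
                   "collected_at": pd.to_datetime(
                       ["2024-06-01", "2023-01-15", "2024-06-03"])})

# Flag records older than the window the analysis is supposed to cover
cutoff = pd.Timestamp("2024-01-01")
stale = df[df["collected_at"] < cutoff]
print(stale)  # records that may no longer reflect current reality
```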

[6] Uniqueness: Redundant or duplicate records can distort statistics and analytical models. Identifying and removing duplicates is vital, especially in data sets where each entry is expected to represent a unique event or entity.
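
A minimal sketch for spotting and removing duplicate keys (the IDs and columns are hypothetical):

```python
import pandas as pd

# Hypothetical data set where each transaction_id should appear exactly once
df = pd.DataFrame({"transaction_id": [500, 501, 501],
                   "amount": [9.99, 24.00, 24.00]})

# Identify duplicate keys, then drop them if each entry should be unique
dupes = df[df.duplicated(subset=["transaction_id"], keep=False)]
print(dupes)
deduped = df.drop_duplicates(subset=["transaction_id"], keep="first")
```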

[7] Integrity: Data integrity refers to the coherence and reliability of relationships within the data set, particularly in structured data with foreign keys or relational connections. Broken links, orphan records, or inconsistencies between tables suggest poor integrity.
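
A referential-integrity check can be as simple as an anti-join between a child table and its parent; a sketch with hypothetical tables:

```python
import pandas as pd

# Hypothetical parent/child tables linked by customer_id
customers = pd.DataFrame({"customer_id": [1, 2]})
orders = pd.DataFrame({"order_id": [10, 11, 12],
                       "customer_id": [1, 2, 99]})  # 99 has no parent row

# Anti-join: orders whose customer_id does not exist in the customers table
orphans = orders[~orders["customer_id"].isin(customers["customer_id"])]
print(orphans)  # orphan records indicate broken relationships and poor integrity
```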

The Takeaway.

By systematically evaluating these characteristics, you can identify flaws, make necessary corrections, and ensure that your data-driven decisions rest on a solid foundation.

What about you? How do you currently evaluate the quality of your data? Please comment – I’d love to hear your thoughts.

Also, please connect with DIH on LinkedIn.

Thanks,
Tom Myers