
High-Quality Data

According to the popular saying, "garbage in, garbage out," any ML model or BI reporting that uses low-quality data will generate incorrect results. In turn, these results lead to flawed business decisions.


By Mateusz Wujec

In today's data-driven world, the reliability of data is more important than ever. Machine learning, LLMs, and data-driven decision-making supported by Business Intelligence solutions all share one often-overlooked requirement: good data quality. None of them can deliver its full value if the data it relies on is inaccurate, partially missing, or improperly handled.

The Importance of Data Quality

As the saying goes, "garbage in, garbage out." Any ML model or BI report built on low-quality data will yield incorrect results, leading to poor business decisions. The Compliance Reporting team understands how much depends on the quality of the data we work with every day. We therefore decided to implement a set of data quality metrics and processes that allow us to understand the condition of our data, monitor it over time, and take the actions needed to improve it.

In this article, I will describe the past few months of our work, explain how our solutions operate, and share our future plans. All of our progress has been built on the APEX platform, so I will present the methodology and components we used. This may be of interest to other teams planning to implement similar solutions soon.

The Current State of Data Quality Monitoring

The current state of data quality monitoring in the Compliance Reporting team includes:

  • A system of alerts for data transmission delays
  • A system of alerts for data completeness
  • Pipelines for data quality checks

Alerts and Notifications for Data Delivery Delays

As we work with data from various units in different countries, where many processes are still manual, delays in data delivery are among the most common issues. Since stakeholders rely on the data products delivered by our team, we must stay informed about any delays in availability. To ensure this, we implemented a set of alerts on the Databricks platform that check the date of the last data transmission for the key datasets we import. Databricks alerts have proven to be a very useful tool for such solutions.

Databricks Alert System

The concept behind alerts is simple: an alert is essentially a SQL query that returns a single value used as an indicator. A configurable threshold is set, and whenever the result does not meet this threshold, the alert changes its state to TRIGGERED and sends notifications to everyone responsible for data delivery. Alerts can run on a schedule: daily, weekly, every 15 minutes, and so on.
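
As a minimal sketch, assuming a hypothetical ingestion_log table with dataset_name and transmission_date columns, the query behind such a transmission-delay alert could look like the one below; the threshold itself and the notification recipients live in the alert configuration, not in the SQL.

```sql
-- Indicator: days since the last transmission of one key dataset.
-- The alert switches to TRIGGERED when this value does not meet the configured threshold.
SELECT
  datediff(current_date(), MAX(transmission_date)) AS days_since_last_transmission
FROM ingestion_log
WHERE dataset_name = 'core_banking_transactions';
```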

For the Compliance Reporting team, these alerts are controlled by a Databricks workflow that runs after all data retrieval processes are complete. Source data alerts work the same way as transmission alerts, except that here we are interested in the business date, which should arrive daily, rather than the transmission date. An alert is issued whenever the difference between the processing day and the last business date exceeds the number of days agreed upon with stakeholders.
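
The business-date variant could follow the same pattern; source_data and business_date below are again hypothetical names, and the acceptable lag is whatever has been agreed with stakeholders.

```sql
-- Indicator: gap in days between the processing day and the latest delivered business date.
SELECT
  datediff(current_date(), MAX(business_date)) AS business_date_lag_days
FROM source_data;
```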

Identifying Data Quality Issues

The solutions mentioned above ensure that our team is always notified when data is delayed or its scope is unsatisfactory. However, such an alert system is not sufficient on its own: data can also arrive duplicated or incomplete. To address this, our team launched the Data Quality Checks project, which periodically scans the tables we maintain and generates metadata with details about the number of records loaded, the number of unique values, the dates of the last events, and delays, all broken down by Hub and Unit for better clarity.

This process is performed weekly, scanning a predefined list of tables and preparing statistical data in two tables that present all the relevant facts. This allows us to analyze historical data delivery, gain insights into problematic data areas, and identify units that deliver lower-quality data.
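
To give a rough idea of the kind of statement such a scan might run per table, here is a sketch; the scanned table, its key column, and the dq_table_metadata target are hypothetical stand-ins for the actual datasets and one of the two metadata tables mentioned above.

```sql
-- Weekly metadata snapshot for one scanned table, broken down by Hub and Unit.
INSERT INTO dq_table_metadata
SELECT
  current_date()                            AS snapshot_date,
  'core_banking_transactions'               AS table_name,
  hub,
  unit,
  COUNT(*)                                  AS records_loaded,
  COUNT(DISTINCT transaction_id)            AS unique_transactions,
  MAX(event_date)                           AS last_event_date,
  datediff(current_date(), MAX(event_date)) AS delay_days
FROM core_banking_transactions
GROUP BY hub, unit;
```

In practice the same pattern repeats for every table on the predefined list, which is one reason to package the logic as a reusable module, as described in the next section.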

Data Quality Checks – Process Details

Data Quality Checks is a Databricks workflow comprising two stages: one preparing metadata for Core Banking Systems data and one for Swift data. The code is packaged into a reusable Python module, along with a CI/CD process.

Holistic Data Quality Assessment

Our latest achievement in data quality is the ambitious task of assessing the current state of data quality holistically. As we transition to the Neuron platform and real-time data ingestion becomes a reality, the need for data completeness and accuracy grows. We are therefore creating a process that will analyze the entire data history and create a comprehensive view, enabling us to observe trends in data quality.
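
As one way to picture such a view, the weekly snapshots from the hypothetical dq_table_metadata table used in the earlier sketch could be rolled up into a per-Hub, per-Unit trend; this is only an illustration, not our actual process.

```sql
-- Data quality trend per week, Hub, and Unit, based on accumulated metadata snapshots.
SELECT
  date_trunc('WEEK', snapshot_date) AS week_start,
  hub,
  unit,
  SUM(records_loaded)               AS records_loaded,
  AVG(delay_days)                   AS avg_delay_days
FROM dq_table_metadata
GROUP BY date_trunc('WEEK', snapshot_date), hub, unit
ORDER BY week_start, hub, unit;
```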

Future Plans – Advanced Data Analysis

We are also working on metadata at the column level: value counts, null counts, averages, and deviations, all gathered for statistical testing. Currently, the process is in its early stages, with a defined data scope and some metrics already implemented. Additionally, we plan to identify anomalies in data values, perform statistical trend analysis, create groupings, and eventually introduce column-level rules for acceptable values.
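
As a rough illustration of those column-level metrics, a profile for a single numeric column might look like the query below; the table and column names are placeholders, not our actual schema.

```sql
-- Column-level profile for one (hypothetical) numeric column.
SELECT
  'transaction_amount'                                        AS column_name,
  COUNT(transaction_amount)                                   AS value_count,
  SUM(CASE WHEN transaction_amount IS NULL THEN 1 ELSE 0 END) AS null_count,
  AVG(transaction_amount)                                     AS avg_value,
  STDDEV(transaction_amount)                                  AS stddev_value
FROM core_banking_transactions;
```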

All of this will be presented in a report that enables benchmarking, helps identify problematic datasets, and serves as a starting point for discussions on improving data quality.

Conclusion

Data quality is a complex topic that many companies struggle with. It requires substantial business knowledge, commitment, and effort. However, proper care for data quality ensures the accuracy of reporting and machine learning models, which are becoming increasingly important today.