Data Ingestion on the APEX Platform

Data has become a key resource, and its efficient acquisition and processing form the foundation of modern business processes. In the context of APEX, data ingestion enables secure and scalable data flow, from acquisition to integration and transformation, in order to deliver valuable insights for the business. In the RBI Group, handling the flow of data from numerous sources is a considerable challenge, one that the Data Team is tackling. This is an interdisciplinary group of data engineers from Poland, Kosovo, and Austria, working together on the data ingestion project on the APEX platform.

  • By Tomasz Zatoń
  • Case study

This article highlights how APEX enables data integration, which can later serve as a key element in artificial intelligence and advanced analytics projects. In this way, data acquired by the Data Team becomes a valuable source for both operational users of the platform and analysts developing AI and machine learning algorithms. The Data Team, working within APEX, focuses on providing data to APEX users – from developers to machine learning model creators. Let’s explore how the team ensures smooth data flow and the challenges engineers face working at the intersection of three cultures and three locations.

Group Data Sources

Effective data management is not just a trend but a key element enabling companies to make faster and more precise decisions. In RBI, data ingestion has become the foundation for advanced AI and analytics projects. The central structure of Group Data Sources facilitates data management across the entire organization, reducing fragmentation and increasing consistency.

Group Data Sources (GDS) consists of data from various systems within the RBI group, centrally managed by the Data Team. Centralizing the data brings three benefits: a standardized structure, optimization (the same data is not loaded multiple times by different teams), and a single source of truth.

Data within GDS is sourced from RBI's source systems and loaded into databases and tables in Databricks. Thanks to Apache Spark, processing large datasets is highly efficient, which is crucial when managing hundreds of tables across multiple banks within the Group. Access can be granted to any APEX platform user upon obtaining prior approval and permission from the APEX Service Management team.

Currently, GDS includes approximately 1000 tables from seven systems across 11 banks in the group. The catalog is not closed: it is continually expanded with new sources based on the needs raised by project teams. Detailed information about the quantity and scope of data within GDS is kept up to date on a Power BI dashboard.

Self-service and Flexibility in Data Acquisition

In addition to using the ready-made data within GDS, users also have the option to independently acquire other data sources of their choice using the same tools. This self-service approach provides greater flexibility and responsiveness to changing analytical needs, minimizing the burden on IT teams.

From the outset, the data acquisition package has been developed not only for internal use by the Data Team but also as a standard data ingestion tool within RBI. Platform users can fully configure the processing workflows through configuration files and execute pre-built workflows in Databricks, which significantly streamlines the process of making changes.

Data Ingestion – Technical Process

The process of acquiring and processing data between the source and the target table is implemented as a Databricks workflow. The starting point for each ingestion process is a metadata file in a specific format. The metadata includes a list of tables and columns for a given source, as well as additional information such as file formats. Based on this, the target data structure is created.
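As a rough illustration of how a target structure could be derived from such metadata, the sketch below turns a list of tables and columns into a `CREATE TABLE` statement for a Delta table. The metadata format is not described in this article, so the `ColumnMeta` and `TableMeta` classes, the function name, and the example database and table names are all assumptions made for illustration, not the format used by the Data Team.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMeta:
    # Hypothetical column description: a name and a Spark SQL type.
    name: str
    data_type: str  # e.g. "STRING", "DATE", "DECIMAL(18,2)"


@dataclass(frozen=True)
class TableMeta:
    # Hypothetical table description taken from a metadata file.
    name: str
    file_format: str  # format of the source files, e.g. "csv"
    columns: tuple[ColumnMeta, ...]


def build_create_table_ddl(database: str, table: TableMeta) -> str:
    """Build a CREATE TABLE statement for a Delta table from table metadata."""
    cols = ",\n".join(f"  `{c.name}` {c.data_type}" for c in table.columns)
    return (
        f"CREATE TABLE IF NOT EXISTS `{database}`.`{table.name}` (\n"
        f"{cols}\n"
        f") USING DELTA"
    )


# Example metadata (invented values).
accounts = TableMeta(
    name="accounts",
    file_format="csv",
    columns=(
        ColumnMeta("account_id", "STRING"),
        ColumnMeta("opened_on", "DATE"),
        ColumnMeta("balance", "DECIMAL(18,2)"),
    ),
)

ddl = build_create_table_ddl("gds_demo", accounts)
print(ddl)
```

In a Databricks notebook, a statement like this could be passed to `spark.sql(ddl)`; here it is only built as a string so the example runs without a cluster.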

In addition to the metadata, the following must be defined:

  • The locations of the source and target data,
  • The configuration of the Databricks cluster on which the process will run.

It is also possible to specify other optional parameters, such as the execution schedule or a mailing list for notifications.
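The split between required and optional settings described above can be sketched as a small configuration object. This is only an illustration under assumptions: the class name `IngestionConfig`, its field names, and all example values are hypothetical and do not reflect the actual configuration file format of the package.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IngestionConfig:
    # Required settings (hypothetical field names).
    metadata_path: str
    source_location: str
    target_location: str
    cluster: dict
    # Optional settings.
    schedule: Optional[str] = None  # e.g. a cron expression
    notify_emails: list[str] = field(default_factory=list)

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the config is usable."""
        problems = []
        for name in ("metadata_path", "source_location", "target_location"):
            if not getattr(self, name):
                problems.append(f"{name} is required")
        if not self.cluster:
            problems.append("cluster is required")
        return problems


# A complete configuration with invented values.
complete = IngestionConfig(
    metadata_path="metadata/core_banking.json",
    source_location="s3://example-bucket/exports/core_banking/",
    target_location="gds_demo",
    cluster={"num_workers": 2},
    schedule="0 0 2 * * ?",
    notify_emails=["data-team@example.com"],
)

# An incomplete configuration: no target location and no cluster settings.
incomplete = IngestionConfig(
    metadata_path="metadata/core_banking.json",
    source_location="s3://example-bucket/exports/core_banking/",
    target_location="",
    cluster={},
)

print(complete.validate())
print(incomplete.validate())
```

Validating the configuration before a workflow is generated lets a missing required setting surface early, rather than as a failed run.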

Data Acquisition Sources

Data can be acquired from two types of sources:

  1. Files on Amazon S3, previously uploaded there by data providers (e.g., database exports),
  2. Direct connection and data loading from databases (on-premise or cloud).

In the coming months, the ability to acquire data from external APIs is also planned. Ultimately, the data is loaded into tables in Databricks databases (Delta Lake).
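The two source types listed above could be modeled as separate small classes that are translated into reader settings. The sketch below is an assumption about how such a dispatch might look, not the package's actual design: the class names, the `reader_settings` helper, and the example paths and URLs are hypothetical. The `jdbc` format and its `url` and `dbtable` options are standard Spark data source options.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class S3FileSource:
    # Files previously uploaded by a data provider.
    path: str
    file_format: str  # e.g. "parquet" or "csv"


@dataclass(frozen=True)
class DatabaseSource:
    # Direct connection to an on-premise or cloud database.
    jdbc_url: str
    table: str


Source = Union[S3FileSource, DatabaseSource]


def reader_settings(source: Source) -> dict:
    """Return the format and options a Spark DataFrameReader would need."""
    if isinstance(source, S3FileSource):
        return {"format": source.file_format, "options": {"path": source.path}}
    if isinstance(source, DatabaseSource):
        return {
            "format": "jdbc",
            "options": {"url": source.jdbc_url, "dbtable": source.table},
        }
    raise TypeError(f"Unsupported source type: {type(source).__name__}")


# Invented example sources.
file_settings = reader_settings(
    S3FileSource("s3://example-bucket/exports/accounts/", "parquet")
)
db_settings = reader_settings(
    DatabaseSource("jdbc:postgresql://db.example.internal:5432/core", "public.accounts")
)

print(file_settings)
print(db_settings)
```

With a Spark session available, such settings could be applied as `spark.read.format(s["format"]).options(**s["options"]).load()`. A third class for external APIs could be added later without changing the existing two.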

Each table has its own dedicated ingestion process (tables are loaded independently). Only one partition of source data (e.g., data from a single day) is processed at a time. If more new data is available, the workflow automatically re-queues itself until all data is loaded.
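The "one partition per run, re-queue while more data is waiting" behaviour can be sketched in plain Python. This is a simplified model under assumptions: the function names are hypothetical, partitions are represented as date strings, and the re-queue is simulated with a loop rather than by triggering a new Databricks run.

```python
from __future__ import annotations

from typing import Callable, Iterable


def pending_partitions(available: Iterable[str], loaded: set[str]) -> list[str]:
    """Partitions present at the source but not yet loaded, oldest first."""
    return sorted(set(available) - loaded)


def run_once(
    available: Iterable[str],
    loaded: set[str],
    load_partition: Callable[[str], None],
) -> bool:
    """Load at most one partition. Return True if another run should be queued."""
    pending = pending_partitions(available, loaded)
    if not pending:
        return False
    partition = pending[0]
    load_partition(partition)  # in a real workflow this would be a Spark job
    loaded.add(partition)
    return len(pending) > 1


# Simulation: one day is already loaded, two more are waiting at the source.
available = ["2024-01-01", "2024-01-02", "2024-01-03"]
loaded = {"2024-01-01"}
load_order: list[str] = []

runs = 1
while run_once(available, loaded, load_order.append):
    runs += 1  # stands in for the workflow re-queuing itself

print(load_order, runs)
```

Processing a single partition per run keeps each run small and makes a failure easy to retry, since only the failed partition has to be repeated.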

The structure of the workflow is uniform – for all tables, it consists of the same steps. The individual steps correspond to modules of the APEX-de-ingestion package, written in Python using Apache Spark. Workflows are automatically generated based on metadata. Deployment is done using the dbx library provided by Databricks. In the near future, migration to the new Databricks Asset Bundles tool is planned.
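To illustrate what generating uniform workflows from metadata might look like, the sketch below builds one workflow definition per table, each with the same chain of steps. The step names, the naming scheme, and the metadata shape are assumptions; the article does not list the actual modules of the package. The `task_key` and `depends_on` fields are loosely modeled on Databricks job definitions, but the resulting dictionaries are plain Python data, not ready-to-submit job specifications.

```python
from __future__ import annotations

# Hypothetical step names; the source only states that all tables share the same steps.
STEPS = ("read_source", "validate", "write_delta")


def build_workflow(source_system: str, table: str) -> dict:
    """Build one workflow definition whose tasks run strictly one after another."""
    tasks = []
    previous = None
    for step in STEPS:
        task = {"task_key": f"{table}__{step}", "depends_on": []}
        if previous is not None:
            task["depends_on"].append({"task_key": previous})
        tasks.append(task)
        previous = task["task_key"]
    return {"name": f"ingest__{source_system}__{table}", "tasks": tasks}


def build_all(metadata: dict[str, list[str]]) -> list[dict]:
    """Generate one independent workflow per table listed in the metadata."""
    return [
        build_workflow(system, table)
        for system, tables in metadata.items()
        for table in tables
    ]


# Invented metadata: one source system with two tables.
metadata = {"core_banking": ["accounts", "transactions"]}
workflows = build_all(metadata)

for wf in workflows:
    print(wf["name"], [t["task_key"] for t in wf["tasks"]])
```

Because every workflow comes from the same template, adding a table to the metadata is enough to obtain a new, independently scheduled ingestion process.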

As new data sources and additional features are added, the project is continuously developed. At the same time, the Data Team provides support for users utilizing Group Data Sources and the APEX-de-ingestion package.

Summary and Perspectives

In the RBI Group, data ingestion on the APEX platform is a key element supporting analytical processes and AI projects, enabling efficient data flow between numerous sources. The Data Team, composed of data engineers from Poland, Kosovo, and Austria, has implemented a solution based on Group Data Sources (GDS), which centralizes data from various RBI systems. This gives APEX users consistent, standardized access to data, which not only reduces redundancy but also enables scalable data resource management in an international environment. In this way, APEX supports advanced analytics and AI projects at RBI, increasing process efficiency and accelerating data access. The team is continuously developing the system by adding new features and data sources, ensuring the platform meets changing organizational requirements and challenges.