APEX for (Un)Initiated – A Guide to the Data Layer

APEX is a great example of collaboration between engineers from Poland, Kosovo, and Romania. In this article, I will describe the fundamentals of how the data layer functions within the platform and introduce some key terms related to APEX.

  • By Robert Marek

What is APEX?

APEX stands for The Analytics Platform EXperience. The platform is the successor to the RBI Group Advanced Analytics Data Lake and Workspace.

The APEX platform is distinguished by several key features that set it apart from traditional services provided by the Head Office (HO). First and foremost, it is characterized by a decentralized architecture, meaning it is distributed both logically and physically. In the context of tenancy, within each Network Unit, every team operates as a fully independent tenant, providing greater flexibility and autonomy in management. APEX is also a self-service platform, offering dedicated services that enable user management, data sharing with other tenants, and data importing and publishing.

Importantly, the platform is managed by the APEX team, which means that users do not need advanced technical knowledge of cloud infrastructure to effectively use its available functionalities. 

APEX in Raiffeisen Tech

APEX was developed to replace the previous data lake model used within Raiffeisen Tech, which relied on AWS along with processing via EMR and Airflow. This model required a very rigid approach to processing structures, making cooperation between Data Scientists and Data Engineers more difficult. The APEX platform aims to standardize the organization's approach to data, facilitating automation and scalability across the entire data science cycle (model training, testing, validation, etc.).

The initiative started in 2021, and the platform is continuously evolving. Some of the key milestones in its development include: 

  • Beginning of platform development – Q1 2022 
  • First penetration test – Q1/Q2 2022 
  • HO Onboarding – Q2 2022 
  • Full migration of HO to APEX – Q1 2023 
  • Adoption of the self-service platform for data access management – Q1/Q2 2023 
  • APEX release 2.0 (also known as K2) – Q2 2023, including: 
    • Enabling data consumption via JDBC/ODBC 
    • "Raw layer" replacing the previous "DL landing zone" 
    • Reworked logic for creating the databases that represent existing datasets on AWS 
    • The ability to share data between use cases

APEX is used by many project teams, including teams at RBI Head Office and Raiffeisen Tech Poland.

[Infographic]
Source: internal materials

All of these teams leverage the benefits of the APEX platform to develop their products. The integrated development environment, along with broad access to data engineering automation methods and model creation, accelerates the delivery of new product versions. Integration with the organization's internal data ecosystem, along with the Data Unit and Data Share systems, enables seamless collaboration between Data Scientists and Data Engineers. The ability to scale computing resources allows for the optimization of technical solutions in terms of performance/price ratio.

The APEX platform is tightly integrated with the AWS cloud ecosystem. It also integrates with external systems, such as Power BI, through standard data consumption methods like JDBC/ODBC or SFTP.

Key elements - Databricks

One of the key components of the APEX ecosystem is the Databricks platform, which users access through a web UI.

Within APEX, Databricks enables, among other things:

  • Big Data processing with Apache Spark
  • MLflow: managing the machine learning model lifecycle, from experimentation to production
  • Databricks SQL: executing SQL queries for business analysis and data visualization, including through JDBC connectors
  • Creating and managing Apache Spark clusters, allowing the selection of computing power based on needs
  • Creating scheduled tasks, with the code written in Jupyter Notebooks.
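
To illustrate the JDBC option mentioned in the list above, here is a minimal sketch that assembles a connection URL in the form documented for the Databricks JDBC driver with token authentication. The helper name `databricks_jdbc_url` is hypothetical, and the host, HTTP path, and token are placeholders rather than real APEX settings; the exact values for a given workspace have to come from that workspace's connection details.

```python
def databricks_jdbc_url(host: str, http_path: str, token: str) -> str:
    """Build a JDBC URL for a Databricks SQL endpoint (illustrative helper).

    AuthMech=3 with UID=token selects personal-access-token authentication
    in the Databricks JDBC driver. All argument values are assumptions
    supplied by the caller, not APEX defaults.
    """
    return (
        f"jdbc:databricks://{host}:443;"
        f"httpPath={http_path};"
        "AuthMech=3;UID=token;"
        f"PWD={token}"
    )


# Placeholder values only; in practice the token should come from a secret
# store or an environment variable, never from source code.
example_url = databricks_jdbc_url(
    host="example.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/abc123",
    token="<personal-access-token>",
)
print(example_url)
```

A BI tool or any JDBC client would take a URL of this shape together with the Databricks JDBC driver.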

Data Layer

Two concepts are central to the data layer: Data Unit and Data Share. Understanding them is crucial for understanding how data is shared between tenants.

Data Unit

A Data Unit, in simple terms, represents a data location. In the context of APEX, it refers to a location in the AWS cloud (S3), either within a given Network Unit account or in the Data Lake.

The primary types of Data Units are:

  • Source System – dedicated to source data and resources used in later processing. Files can be transferred to a given Source System via SFTP and a dedicated AWS role.
  • Data Product – dedicated to final/intermediate products. Each Data Product is associated with a database, accessible via Databricks. Direct data access is possible through an S3 Access Point, provided to the end user (allowing access via the standard S3 interface as well).
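
As a rough sketch of the direct access path described above, the snippet below reads one object through an S3 Access Point. It assumes the standard access point ARN format and relies on the fact that boto3 accepts an access point ARN in place of a bucket name. The function names, region, account ID, access point name, and object key are all hypothetical, not APEX values.

```python
def access_point_arn(region: str, account_id: str, name: str) -> str:
    """Return the ARN of an S3 Access Point (standard AWS ARN format)."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12-digit numbers")
    return f"arn:aws:s3:{region}:{account_id}:accesspoint/{name}"


def read_object(s3_client, arn: str, key: str) -> bytes:
    """Read one object through an access point.

    `s3_client` is expected to behave like boto3.client("s3"); the access
    point ARN is passed where a bucket name would normally go.
    """
    response = s3_client.get_object(Bucket=arn, Key=key)
    return response["Body"].read()


# Hypothetical identifiers for illustration only.
arn = access_point_arn("eu-central-1", "123456789012", "example-data-product")
print(arn)

# With AWS credentials configured, usage would look like:
#   import boto3
#   data = read_object(boto3.client("s3"), arn, "reports/2023/summary.csv")
```

Passing the client in as an argument keeps the helper easy to test without network access.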

Data Share

Data Share is a logical representation of data sharing for a given tenant. Simply creating a Data Unit does not grant data access to any tenants; a dedicated Data Share is required.

In simple terms:

  • Each Data Unit can be shared with multiple tenants using Data Shares.
  • A Data Share can grant ReadOnly or Read/Write permissions.
  • Additional sharing options (access via database, SFTP) are configured at the Data Share level.
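
The relationship between the two concepts can be sketched as a small data model. This is an illustrative simplification, not the APEX implementation: the class names, field names, tenant names, and the S3 location are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Permission(Enum):
    READ_ONLY = "ReadOnly"
    READ_WRITE = "Read/Write"


@dataclass(frozen=True)
class DataUnit:
    name: str
    kind: str         # "source_system" or "data_product"
    s3_location: str  # hypothetical S3 prefix


@dataclass(frozen=True)
class DataShare:
    data_unit: DataUnit
    tenant: str
    permission: Permission
    channels: frozenset = frozenset()  # e.g. {"database", "sftp"}


class ShareRegistry:
    """Keeps track of which tenant may access which Data Unit."""

    def __init__(self) -> None:
        self._shares: list[DataShare] = []

    def share(self, unit, tenant, permission, channels=()) -> DataShare:
        data_share = DataShare(unit, tenant, permission, frozenset(channels))
        self._shares.append(data_share)
        return data_share

    def can_read(self, tenant: str, unit: DataUnit) -> bool:
        # Any Data Share (ReadOnly or Read/Write) grants read access.
        return any(s.tenant == tenant and s.data_unit == unit
                   for s in self._shares)

    def can_write(self, tenant: str, unit: DataUnit) -> bool:
        return any(s.tenant == tenant and s.data_unit == unit
                   and s.permission is Permission.READ_WRITE
                   for s in self._shares)


# One Data Unit shared with two tenants under different permissions.
scores = DataUnit("scores", "data_product", "s3://example-bucket/scores/")
registry = ShareRegistry()
registry.share(scores, "team-a", Permission.READ_ONLY, channels={"database"})
registry.share(scores, "team-b", Permission.READ_WRITE)

print(registry.can_read("team-a", scores))   # True
print(registry.can_write("team-a", scores))  # False
print(registry.can_read("team-c", scores))   # False: no Data Share exists
```

The last line reflects the rule stated earlier: creating a Data Unit alone grants no access, so a tenant without a Data Share cannot read it.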

User Interface and Functionality

The current priority is the integration of Unity Catalog, which will provide many additional capabilities to platform users, such as:

  • Data virtualization layer
  • Enhanced data sharing
  • Data Governance
  • Data Lineage

In addition, the self-service layer is being actively developed, enabling the delegation of workspace management tasks to end users, without the need to contact the Head Office. Soon, the option to add a vector database to selected workspaces will also be made available.

Summary

Although this article provides only a general overview of the APEX data layer architecture, we hope it has brought some clarity to how the platform operates. With its decentralized architecture and flexible data management options, APEX is a modern and versatile solution for analytical needs within the RBI Group. More information about data acquisition and processing in APEX will be covered in the next article.