A remarkable journey from Data Lake on AWS to Databricks

Three of our APEX teammates took on a huge responsibility. They took the floor at this year's Data Science Summit at PGE Narodowy Stadium. It is the largest independent data science conference in the CEE region. Raiffeisen Tech Poland could not miss it.

10 January 2025 10:00
By Jacek Cieślak
Case study

A venue full of people waiting to hear what we have to say. All eyes on us. Several cameras aimed in our direction, capturing our every move, microphones weighing a ton in our hands. This is definitely not the world we live in every day. But a great opportunity came along, enabling us to get to know this world from the inside. There is no better way to discover the true nature of a tech conference than to give a speech at one. So, brace yourselves for the story of how we held on tight at the top of the technological wave... for a full 30 minutes!

About Data Science Summit 2024 conference

DSS 2024 was the very first conference of this scale in which we participated as the official representatives and speakers of Raiffeisen Tech Poland. Speaking of us, it's high time we introduced ourselves:

Jacek Cieślak – Service Manager in the APEX Service Managers team. I am responsible for supporting the APEX platform and its users, working closely with the DevOps team. Additionally, I coordinate testing of solutions before they are delivered to the platform.
Robert Marek – DevOps Engineer in the APEX DevOps team, works on the development of the APEX platform and the implementation of interesting solutions, such as Operational Database (Postgres) and Unity Catalog.
Mateusz Wujec – Data Engineer in the Compliance team, responsible for data quality and the automation of compliance-related processes. Recently he's been involved in the Quantexa platform implementation project.

Attending the Data Science Summit was a great opportunity to explore the latest trends in data management, machine learning, and digital transformation. We had a chance to exchange experiences with experts from various industries and learn about innovative solutions used by market leaders. We particularly appreciated the possibility of participating in workshop sessions, which allowed us to learn about the latest tools and technologies used in analytical processes and data management.

And how did we handle our presentation?

During Data Science Summit 2024, we gave a talk on ‘Navigating the Data Storm – Our Journey from Data Lake on AWS to Databricks’. We shared our experiences related to the transformation of data infrastructure. In our presentation, we discussed the challenges, strategies, and benefits of migrating from a traditional Data Lake on AWS to the Databricks platform. Below is a detailed summary of each of the three parts of the talk, with emphasis on the crucial aspects discussed.

Jacek Cieślak – Transitioning to Databricks

The first part of the presentation was mine. I discussed the transition from the traditional Data Lake on AWS to Databricks, emphasizing that we opted for Databricks because our existing internal solutions were not integrating seamlessly. The RBI group needed a unified platform capable of handling data ingestion and processing and of supporting machine learning and AI applications. Additionally, we chose to decentralize our Head Office-developed solutions, recognizing that both Head Office and Network Units should operate independently, own their processes and data, and maintain their own governance.

The benefits of adopting Databricks are:

Integrated data management: Databricks provides a comprehensive system to manage and control data. It uses a unified architecture, the Lakehouse data architecture, that combines the best features of data lakes and data warehouses (classical SQL databases), supports tracking of data origins and transformations (data lineage), is compatible with various data formats, and integrates easily with data storage and data processing tools.
Additional collaboration tools: interactive notebooks that can be shared with others, resource management done directly by the user, and scheduling and launching of tasks (jobs).
Seamless data discovery and access: the ability to discover and reuse data and data-driven resources, such as MLflow models. This can transform the way users work by making them faster and more independent. It also encourages collaboration and the seamless integration of available information within the group.
Integrated security layer: data encrypted both at rest and in transit, complex access controls, bring-your-own-key, auditing, and monitoring.
Scalability and integration: Databricks has simplified the integration of data from different sources and improved team collaboration by centralising data and analytical tools.
Industry innovation: Databricks is a leading innovator in the data and machine learning fields and a major contributor to widely used open-source toolkits such as Apache Spark, MLflow, and Delta Lake.

I also discussed the problems associated with the rapid growth of the platform and its integration with legacy or on-premises solutions. These usually create obstacles such as slower provisioning of and access to resources, complicated management of network connectivity and permission models across the group, and difficulties with data sharing between Head Office and NWB projects.

Finally, I referred to the Self-Service solution, which is currently under extensive development and will be delivered to users soon. But more on that in Robert's part.

Robert Marek – Self-Service Needs

During the second part of the presentation, the stage belonged to Robert Marek, who outlined the need for self-service in the context of autonomous data ingestion, data processing, machine learning, and access management. He discussed two alternative development paths for the area, weighing up the advantages and disadvantages of each.

Exploring alternative approaches:

The traditional approach: relies on centralised data processing, where data is delivered to users by specialised teams. Its advantage is better control over data quality, but it comes with longer implementation time and less flexibility.
The self-service approach: business teams gain greater autonomy to access data, data-derived information such as AI models, and build analytics solutions. Here, the main challenge is to ensure adequate control over data quality and security.

Robert also presented tools to support the self-service approach, including alerts for monitoring data delays to enable more efficient management of data. He also noted the importance of standardizing processes to avoid unnecessary delays and complexity.
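To make the idea of a data-delay alert more concrete, here is a minimal sketch in plain Python. It is not the tooling Robert presented: the TableFreshness record, the table names, and the thresholds are all hypothetical, and a real platform would read load timestamps from table metadata rather than from a hard-coded list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class TableFreshness:
    """Hypothetical record: when a table last received data and how late it may be."""
    name: str
    last_loaded_at: datetime
    max_delay: timedelta


def find_delayed_tables(tables, now=None):
    """Return the names of tables whose latest load is older than the allowed delay."""
    now = now or datetime.now(timezone.utc)
    return [t.name for t in tables if now - t.last_loaded_at > t.max_delay]


# Invented sample data: one table is fresh, one is overdue.
now = datetime(2024, 11, 20, 12, 0, tzinfo=timezone.utc)
tables = [
    TableFreshness("transactions", now - timedelta(hours=1), timedelta(hours=6)),
    TableFreshness("customers", now - timedelta(hours=30), timedelta(hours=24)),
]

delayed = find_delayed_tables(tables, now=now)
print(delayed)  # the list of table names that would trigger an alert
```

Passing the current time in as a parameter keeps the check deterministic, which makes it easy to test before wiring it to any notification channel.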

Lastly, he pointed out that the platform, although designed for seamless data access and user-friendly management, is inherently complex and must operate in accordance with the requirements of our RBI group. This necessitates meticulous planning and careful management.

Mateusz Wujec – Data Engineer’s Perspective

Last but not least came Mateusz Wujec, who discussed the challenges of data quality management and testing from the perspective of a data engineer. He emphasized the importance of regular data analysis, highlighting several key tools and processes:

Optimisation of data processing: Mateusz described the way migrating to Databricks has improved data management by using a layered data architecture (bronze, silver, and gold), each with a specific role in the data transformation process.
Data quality management: the implementation of a data quality monitoring system has allowed problems to be identified and resolved quickly. The use of shadow copies of data enabled a deeper quality analysis.
Shadow copies of tables: used to analyse data quality problems, enabling the identification of root causes and an assessment of their extent.
Data quality monitoring dashboards: with these tools, it is possible to monitor data quality on a regular basis and react quickly to problems as they arise.
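The layered idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the team's pipeline: the sample records, the validation rules, and the function names to_silver and to_gold are invented, and on Databricks the layers would normally be tables processed with Spark rather than in-memory lists. Rejected rows are kept aside with a reason so they can be analysed later, and the pass rate is the kind of simple metric a quality dashboard could track.

```python
from collections import defaultdict

# Bronze: raw records as received, including bad rows (invented sample data).
bronze = [
    {"customer_id": "C1", "amount": "100.50"},
    {"customer_id": "C2", "amount": "not-a-number"},
    {"customer_id": None, "amount": "20.00"},
    {"customer_id": "C1", "amount": "9.50"},
]


def to_silver(rows):
    """Split raw rows into cleaned records and rejected records with a reason."""
    clean, rejected = [], []
    for row in rows:
        if not row["customer_id"]:
            rejected.append({**row, "reason": "missing customer_id"})
            continue
        try:
            amount = float(row["amount"])
        except ValueError:
            rejected.append({**row, "reason": "invalid amount"})
            continue
        clean.append({"customer_id": row["customer_id"], "amount": amount})
    return clean, rejected


def to_gold(rows):
    """Aggregate cleaned rows into a total amount per customer."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer_id"]] += row["amount"]
    return dict(totals)


silver, rejected = to_silver(bronze)
gold = to_gold(silver)
pass_rate = len(silver) / len(bronze)  # share of raw rows that passed validation

print(gold)
print(pass_rate)
print([r["reason"] for r in rejected])
```

Keeping the rejected rows instead of silently dropping them is what makes root-cause analysis possible: the reasons can be counted per source system and raised with the data provider.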

Mateusz also indicated that the biggest challenge was managing the quality of data from external systems. He stressed that full quality improvement requires not only better tools but also collaboration with data providers.

One of the key issues raised by Mateusz was the testing of CI/CD processes. With tools such as Terraform and GitHub Actions, our infrastructure as code has been greatly improved. Nevertheless, testing streams in Databricks notebooks remains a challenge that the team continues to work on.

Phew! We've reached the end point!

No one will be surprised that the presentation caused us some stress. Despite some jitters, everything went according to plan. The presentation concluded with a Q&A session, which allowed us to explore the topics further. The discussion that ensued showed how popular the topic of data infrastructure transformation is. Together with other conference participants, we saw how important flexibility, collaboration, and process standardisation are in today's world of data management. Despite some difficulties, the migration to Databricks proved to be a key step in optimizing data processing. It has given our teams greater autonomy while ensuring a high level of data quality and security. It was a very challenging but also rewarding day!

Would you like to know more? Check out the video of our presentation

We shared not only the challenges we encountered during this transformation, but also the solutions that proved to be crucial to the success of the entire process. We described the entire journey through the data storm – from planning to implementation to the final result that revolutionised our approach to data processing.

If you are curious about migrating from a Data Lake on AWS to Databricks and want to know the details of our technological journey, we encourage you to watch the video available on our Tech Blog and on YouTube. We also invite you to explore the presentation slides in English, which we discussed during our talk.

Download presentation (PDF, 4 MB)

Jacek Cieślak

APEX Service Manager

jacek.cieslak@rbinternational.com.pl