How to organize a Data Lake. Data ingestion, reconciliation, transformation, and data validation.

Aleksei Kornev
Feb 6, 2023


DataLake structure

When you hear about Big Data, what do you imagine? A lot of data, or something else? I usually imagine a set of systematic processes that work with data. The organization of those processes is usually called a Data Lake. Yes, a Data Lake is based on some storage, but more importantly, a Data Lake is a set of well-orchestrated processes.

One way to organize a data lake is to logically split it into several layers. In the picture above you can see three layers: bronze, silver, and golden. Let's take a look at the differences between those layers:

Bronze layer

This layer is responsible for raw data that comes from any source. As you can see above, I deliberately group different types of sources. Between the source and the bronze layer there is an ingestion process. Ingestion can be done in multiple ways: CDC (the most loved nowadays), files, APIs, dump fetching, etc. The data consumed from the source is stored in this layer as is from a data perspective; from a processing perspective, it should be consumable by your processing framework.
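As an illustration, here is a minimal ingestion sketch in PySpark. The paths, table names, and source system are hypothetical, and the actual ingestion mechanism (CDC, file drop, API pull) may differ in your setup:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("bronze-ingestion").getOrCreate()

    # Read a raw export from the source system (hypothetical path and format).
    raw_orders = spark.read.json("s3://landing-zone/orders/2023-02-06/")

    # Store the data "as is": no cleaning, only technical metadata for traceability.
    (raw_orders
        .withColumn("ingest_date", F.current_date())
        .withColumn("source_system", F.lit("orders-service"))
        .write
        .mode("append")
        .partitionBy("ingest_date")
        .parquet("s3://datalake/bronze/orders/"))

The only thing the job adds is metadata (ingest date, source name); the payload itself stays untouched, so it can always be replayed into the next layer.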

Silver layer

This is the layer where simple transformations have been applied; in this layer you should not do any complex joins or aggregations. Try to keep the objects in this layer simple and plain. In the Java world there is the term POJO (plain old Java object), and it perfectly describes the data model of this layer.
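A minimal sketch of a bronze-to-silver job, continuing the hypothetical orders table from the previous example: only flat, per-record transformations (typing, renaming, deduplication), no joins or aggregations:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("silver-orders").getOrCreate()

    bronze_orders = spark.read.parquet("s3://datalake/bronze/orders/")

    # Simple, record-level transformations only: cast types, rename fields, drop duplicates.
    silver_orders = (bronze_orders
        .withColumn("order_ts", F.to_timestamp("order_ts"))
        .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
        .withColumnRenamed("cust_id", "customer_id")
        .dropDuplicates(["order_id"]))

    silver_orders.write.mode("overwrite").parquet("s3://datalake/silver/orders/")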

Golden layer

In this layer you have more complex objects that are built by joining/aggregating several objects from the silver layer. It usually keeps data mart objects and other complex aggregated objects. This layer is mostly used for exposing dashboards, building reports, exposing objects over an API, etc.
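A sketch of a golden-layer data mart built from the hypothetical silver tables above (orders joined with customers, then aggregated); the actual marts depend entirely on your reporting needs:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("golden-revenue-mart").getOrCreate()

    orders = spark.read.parquet("s3://datalake/silver/orders/")
    customers = spark.read.parquet("s3://datalake/silver/customers/")

    # Join and aggregate silver objects into a data mart ready for dashboards and reports.
    revenue_by_country = (orders
        .join(customers, "customer_id")
        .groupBy("country", F.to_date("order_ts").alias("order_date"))
        .agg(F.sum("amount").alias("revenue"),
             F.countDistinct("customer_id").alias("active_customers")))

    revenue_by_country.write.mode("overwrite").parquet("s3://datalake/golden/revenue_by_country/")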

Validation and reconciliation

One of the most important tasks for data engineers who manage a data lake is to keep it consistent across all layers. There are two processes that help keep the storage consistent: data validation and data reconciliation.

Validation is usually done between the data lake layers: bronze, silver, and golden. The main goal of validation is to find bugs in the transformation processes between layers. It is usually organized as a set of data quality checks. The checks themselves can be partition-based or random checks against the whole dataset; the choice of technique depends on the amount of data.
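As an illustration, here is a minimal partition-based check between the hypothetical bronze and silver orders tables; a real data quality suite (row-level rules, null and range checks, schema checks) would be much broader:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("validate-orders").getOrCreate()

    bronze = spark.read.parquet("s3://datalake/bronze/orders/")
    silver = spark.read.parquet("s3://datalake/silver/orders/")

    # Partition-based check for a single ingest date.
    partition = "2023-02-06"
    bronze_cnt = bronze.filter(F.col("ingest_date") == partition).count()
    silver_cnt = silver.filter(F.col("ingest_date") == partition).count()

    # Deduplication may reduce the count, but silver must never have more rows than bronze.
    assert silver_cnt <= bronze_cnt, f"silver has more rows than bronze for {partition}"

    # The number of distinct business keys must match exactly after deduplication.
    bronze_keys = bronze.filter(F.col("ingest_date") == partition).select("order_id").distinct().count()
    silver_keys = silver.filter(F.col("ingest_date") == partition).select("order_id").distinct().count()
    assert bronze_keys == silver_keys, f"order_id mismatch between bronze and silver for {partition}"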

Reconciliation is the process that runs between the data source and the storage. The goal is to make sure that the data in the data lake is consistent with the source data. It is usually run against the bronze layer and the source.
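A rough reconciliation sketch, assuming the source is a relational database reachable over JDBC (the connection details, credentials, and table names below are hypothetical placeholders):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("reconcile-orders").getOrCreate()

    # Count rows for one business day directly in the source system...
    source_cnt = (spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://orders-db:5432/shop")
        .option("dbtable", "(SELECT count(*) AS cnt FROM orders WHERE order_date = '2023-02-06') t")
        .option("user", "reconciliation")
        .option("password", "***")  # placeholder, take it from a secret store
        .load()
        .first()["cnt"])

    # ...and compare it with the bronze layer for the same day.
    bronze = spark.read.parquet("s3://datalake/bronze/orders/")
    bronze_cnt = bronze.filter(F.to_date("order_ts") == "2023-02-06").count()

    if source_cnt != bronze_cnt:
        raise ValueError(f"Reconciliation failed: source={source_cnt}, bronze={bronze_cnt}")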

Conclusion

The most important thing is to have a shared understanding of the Data Lake structure across the data team. Everybody on the team should understand that data consistency is the most important part of Data Lake management. Everything else is built on top of a consistent and stable Data Lake.

Written by Aleksei Kornev

Solution Architect Consultant DevOps/Microservices/Backend
