How to organize a Data Lake: data ingestion, reconciliation, transformation and data validation
When you hear about Big Data, what do you imagine? A lot of data, or something else? I usually imagine a set of systematic processes that work with data. The organisation of those processes is usually called a Data Lake. Yes, a Data Lake is based on some storage, but more importantly, a Data Lake is a set of well-orchestrated processes.
One way to organize a data lake is to logically split it into several layers. In the picture above you can see three layers: bronze, silver and golden. Let’s take a look at the differences between those layers:
Bronze layer
This layer is responsible for raw data coming from any source. As you can see above, I deliberately group the different types of sources. Between the sources and the bronze layer there is an ingestion process. Ingestion can be done in multiple ways: CDC (the most popular nowadays), files, APIs, dump fetching, etc. The data consumed from a source is stored in this layer as is from the data perspective; from the processing perspective, it should be stored in a form that your processing framework can consume.
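Here is a minimal sketch of how a file-based ingestion job into the bronze layer could look with PySpark. The "orders" source, the S3 paths and the column names are hypothetical examples, not part of any real setup:

```python
# A minimal sketch of a file-based ingestion job into the bronze layer using PySpark.
# The "orders" source, the S3 paths and the column names are hypothetical examples.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bronze_ingestion_orders").getOrCreate()

# Read the raw dump exactly as it arrives from the source system.
raw = spark.read.option("header", "true").csv("s3://landing/orders/2024-01-15/")

# Keep the data "as is", only adding technical metadata that helps later validation.
bronze = (raw
          .withColumn("_ingestion_date", F.current_date())
          .withColumn("_source_file", F.input_file_name()))

# Store it in a format that the processing framework can consume efficiently.
(bronze.write
       .mode("append")
       .partitionBy("_ingestion_date")
       .parquet("s3://lake/bronze/orders"))
```

The point is that nothing about the data itself changes here; only technical metadata is added and the storage format is made processing-friendly.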
Silver layer
This is the layer where you apply simple transformations; you should not do any complex joins or aggregations here. Try to keep the objects in this layer simple and flat. In the Java world there is the term POJO (plain old Java object), and it describes the data model of this layer perfectly.
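A minimal sketch of a bronze-to-silver job, assuming PySpark and the same hypothetical "orders" data set, could look like this:

```python
# A minimal sketch of a bronze-to-silver job, assuming PySpark.
# Column names and paths are hypothetical; only simple, record-level transformations here.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("silver_orders").getOrCreate()

bronze = spark.read.parquet("s3://lake/bronze/orders")

# Casting, renaming and deduplication: keep the objects flat, no joins or aggregations.
silver = (bronze
          .withColumn("order_id", F.col("order_id").cast("long"))
          .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
          .withColumn("order_date", F.to_date("order_ts"))
          .withColumnRenamed("cust_id", "customer_id")
          .dropDuplicates(["order_id"]))

(silver.write
       .mode("overwrite")
       .partitionBy("order_date")
       .parquet("s3://lake/silver/orders"))
```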
Golden layer
In this layer you have more complex objects built by joining and aggregating several objects from the silver layer. It usually holds data mart objects and other heavily aggregated objects. This layer is mostly used for powering dashboards, building reports, exposing objects over an API, etc.
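To illustrate, here is a minimal sketch of a silver-to-golden job in PySpark that builds a data-mart-style object. The "orders" and "customers" tables and their columns are hypothetical:

```python
# A minimal sketch of a silver-to-golden job, assuming PySpark.
# The tables ("orders", "customers") and their columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("golden_revenue_by_customer").getOrCreate()

orders = spark.read.parquet("s3://lake/silver/orders")
customers = spark.read.parquet("s3://lake/silver/customers")

# Joins and aggregations belong here, not in the silver layer.
revenue_by_customer = (orders
    .join(customers, "customer_id")
    .groupBy("customer_id",
             "country",
             F.date_trunc("month", "order_date").alias("order_month"))
    .agg(F.sum("amount").alias("revenue"),
         F.count("order_id").alias("orders_count")))

revenue_by_customer.write.mode("overwrite").parquet("s3://lake/golden/revenue_by_customer")
```

An object like this is ready to be pointed at directly by a dashboard or a reporting query, without repeating the joins every time.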
Validation and reconciliation
One of the most important tasks for data engineers who manage a data lake is to keep it consistent across all layers. There are two processes that help keep the storage consistent: data validation and data reconciliation.
Validation is usually done between the data lake layers: bronze, silver and gold. Its main goal is to find bugs in the transformation processes between layers. It is usually organized as a set of data quality checks. The checks themselves can be partition-based or random samples against the whole data set; the technique depends on the amount of data.
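A minimal sketch of a partition-based validation check between bronze and silver could look like this. It reuses the hypothetical "orders" tables from the earlier sketches and assumes daily ingestion, so that the bronze and silver partitions line up:

```python
# A minimal sketch of a partition-based validation check between bronze and silver,
# assuming PySpark and the hypothetical "orders" tables from the earlier sketches.
# It also assumes daily ingestion, so the bronze and silver partitions line up.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("validate_orders").getOrCreate()

partition = "2024-01-15"
bronze = spark.read.parquet("s3://lake/bronze/orders").where(F.col("_ingestion_date") == partition)
silver = spark.read.parquet("s3://lake/silver/orders").where(F.col("order_date") == partition)

# Check 1: the transformation must not silently drop records (deduplication aside).
bronze_keys = bronze.select("order_id").distinct().count()
silver_rows = silver.count()
assert bronze_keys == silver_rows, f"row count mismatch: {bronze_keys} vs {silver_rows}"

# Check 2: key columns must never be null after the transformation.
null_keys = silver.where(F.col("order_id").isNull() | F.col("customer_id").isNull()).count()
assert null_keys == 0, f"{null_keys} silver rows with null keys"
```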
Reconciliation is the process that runs between the data source and the storage. The goal is to make sure that the data in the data lake is consistent with the source data. It is usually run against the bronze layer and the source.
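As a sketch, a source-to-bronze reconciliation check could compare row counts and a control sum between the source database and the bronze layer. This assumes PySpark and a JDBC-accessible source; the connection details and table names are hypothetical:

```python
# A minimal sketch of a source-to-bronze reconciliation check, assuming PySpark and a
# JDBC-accessible source database; connection details and table names are hypothetical.
import os

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("reconcile_orders").getOrCreate()

# Row count and a control sum taken directly from the source system.
source = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://source-db:5432/shop")
          .option("dbtable", "public.orders")
          .option("user", "reader")
          .option("password", os.environ["SOURCE_DB_PASSWORD"])
          .load())

bronze = spark.read.parquet("s3://lake/bronze/orders")

src_stats = source.agg(F.count("*").alias("rows"),
                       F.sum("amount").alias("total_amount")).collect()[0]
lake_stats = bronze.agg(F.count("*").alias("rows"),
                        F.sum(F.col("amount").cast("decimal(18,2)")).alias("total_amount")).collect()[0]

# If the counts or the control sums diverge, the ingestion dropped or duplicated data.
assert src_stats["rows"] == lake_stats["rows"], f"row count mismatch: {src_stats} vs {lake_stats}"
assert src_stats["total_amount"] == lake_stats["total_amount"], "control sum mismatch"
```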
Conclusion
The most important thing is a shared understanding of the data lake structure across the data team. Everybody on the team should understand that data consistency is the most important part of data lake management. Everything else is built on top of a consistent and stable data lake.