Data Lake Architecture Best Practices
Author: Omid Vahdaty 8.7.2020
In this article, I share best practices for designing a new data lake architecture. First, a few terms:
- Ingestion Layer – A layer in your big data architecture designed to do one thing: ingest data via batch or streaming, i.e. move data from the source to the ingestion buckets in the architecture.
- Transformation Layer – A layer designed to transform and cleanse data (fix bugs in data, convert, filter, beautify, change format, repartition).
- Modeling Layer – A layer designed to model data: joins, aggregations, nightly jobs, machine learning. Not to be confused with the BI model.
- Presentation Layer – A layer used only for:
  - Presenting data (caching) to operational applications and/or BI systems.
  - Building the BI dimensions.
- Decoupling – Separating your compute technologies (Athena, EMR, Redshift) from your storage technologies (S3).
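One way to make the layers concrete is to give each one its own prefix in storage. The sketch below is only an illustration: the bucket name `my-data-lake`, the prefix names, the `dt=YYYY-MM-DD` partition scheme, and the helper `layer_path` are all assumptions, not part of any AWS API.

```python
from datetime import date
from enum import Enum


class Layer(Enum):
    """The four layers defined above, used here as storage prefixes."""
    INGESTION = "ingestion"
    TRANSFORMATION = "transformation"
    MODELING = "modeling"
    PRESENTATION = "presentation"


def layer_path(layer: Layer, table: str, day: date, bucket: str = "my-data-lake") -> str:
    """Build a date-partitioned S3 prefix for a table in a given layer.

    The bucket name and the dt=YYYY-MM-DD layout are hypothetical.
    """
    return f"s3://{bucket}/{layer.value}/{table}/dt={day.isoformat()}/"


print(layer_path(Layer.INGESTION, "clicks", date(2020, 7, 8)))
# prints: s3://my-data-lake/ingestion/clicks/dt=2020-07-08/
```

Because every compute engine only needs the path, the same data can be read by different tools without copying it.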
General DATA Architecture Guidelines:
- Decouple your compute and storage whenever possible. This lets you keep one copy of your data on external storage such as AWS S3 and use different compute solutions to analyze it. By decoupling, you can incorporate several technologies, each solving a different set of data challenges.
- Avoid 24x7 clusters by using transient clusters (in a cloud architecture).
- Another option is to use “pay as you go” services instead of clusters. Used correctly, these could be 50% cheaper than 24x7 clusters.
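To see why transient clusters can pay off, here is a rough back-of-the-envelope comparison. All numbers are made up for illustration (10 nodes, $0.50 per node-hour, one 3-hour job per night), and the sketch ignores cluster start-up time, storage, and data transfer costs.

```python
HOURS_PER_MONTH = 24 * 30  # simplification: a 30-day month


def always_on_cost(nodes: int, price_per_node_hour: float) -> float:
    """Monthly cost of a cluster that never shuts down."""
    return nodes * price_per_node_hour * HOURS_PER_MONTH


def transient_cost(nodes: int, price_per_node_hour: float,
                   hours_per_run: float, runs_per_month: int) -> float:
    """Monthly cost of a cluster that exists only while a job runs."""
    return nodes * price_per_node_hour * hours_per_run * runs_per_month


# Hypothetical workload: 10 nodes at $0.50 per node-hour, one 3-hour job per night.
always_on = always_on_cost(10, 0.50)
transient = transient_cost(10, 0.50, hours_per_run=3, runs_per_month=30)
print(f"always on: ${always_on:,.2f}, transient: ${transient:,.2f}")
# prints: always on: $3,600.00, transient: $450.00
```

The actual saving depends entirely on how many hours per day your workloads really need compute, so plug in your own numbers.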
Methodology - DATA Department
- Every BI, data science, data engineering, analysis, tracking, and machine learning task should go through an architecture review in the planning phase and a quality review at the end of the task.
- Data-level monitoring should be applied.
- SCRUM/Kanban methodologies should be applied.
- All development should be done in a dedicated DATA account, with admin access for the data engineers and limited access for the rest of the organization. This ensures “One Truth” of data.
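As a minimal sketch of what data-level monitoring could look like, the function below checks a batch of rows for a minimum row count and a maximum share of nulls per column. The function name `check_batch`, the thresholds, and the sample columns are assumptions for illustration; a real setup would run checks like these against the tables themselves.

```python
from typing import Dict, List, Optional


def check_batch(rows: List[Dict[str, Optional[str]]],
                required_columns: List[str],
                min_rows: int = 1,
                max_null_ratio: float = 0.1) -> List[str]:
    """Return a list of human-readable problems found in a batch of rows."""
    problems = []
    if len(rows) < min_rows:
        problems.append(f"row count {len(rows)} is below minimum {min_rows}")
    if not rows:
        return problems  # null ratios are undefined for an empty batch
    for column in required_columns:
        # A missing key counts as null, same as an explicit None.
        nulls = sum(1 for row in rows if row.get(column) is None)
        ratio = nulls / len(rows)
        if ratio > max_null_ratio:
            problems.append(
                f"column '{column}' is {ratio:.0%} null (limit {max_null_ratio:.0%})"
            )
    return problems


batch = [
    {"user_id": "1", "country": "US"},
    {"user_id": "2", "country": None},
    {"user_id": "3", "country": None},
    {"user_id": "4", "country": "DE"},
]
for problem in check_batch(batch, ["user_id", "country"]):
    print(problem)
# prints: column 'country' is 50% null (limit 10%)
```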
ETL & Data pipelines Guidelines
- Use the Ingestion Layer only to:
  - Ingest data, for either batch or streaming processes.
  - Remove PII and/or encrypt sensitive data at rest.
- Use the Transformation Layer only to:
  - Repartition data.
  - Reformat data.
  - Transform data via regex, CASE, CAST, etc.
  - Apply logical transformations, but only after the cleansing transformations have been applied to the data. Rule of thumb: if a table is not supposed to be used by the end user, it belongs in the transformation layer.
  - Avoid code duplication.
- Use the Modeling Layer for:
  - Aggregations, unions, joins, and even some FACT tables.
  - Joins and window functions, when they cannot be avoided:
    - Minimize the time frame per window function or join by applying parallelism, loops, and Airflow.
    - Break the work into small parts to ensure future scale.
- General guidelines for ETLs and data pipelines:
  - Views should be used mainly for transformation-type tasks such as casting, filtering, regex, and CASE, as the nature of the transformation may change over time.
  - Parallelism should be applied as much as possible.
  - A naming convention should be created (see Table Naming Conventions below).
  - If possible, avoid long WITH clauses (“with tables”) in an ETL, as they make it unreadable. Break it down into small logical ETLs.
  - Avoid code duplication in “with tables”; it is a recipe for disaster.
  - Avoid using too many databases. Each database must coincide with a layer in the architecture.
  - Use partitions in your tables.
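The advice above about small time frames, parallelism, and partitions can be sketched as follows: split a date range into one-day windows and process each window independently. This is only an illustration. `run_daily_job` is a placeholder that returns the (hypothetical) partition it would write, and a thread pool stands in for a real scheduler.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta
from typing import Iterator, List


def daily_windows(start: date, end: date) -> Iterator[date]:
    """Yield each day in the half-open range [start, end)."""
    day = start
    while day < end:
        yield day
        day += timedelta(days=1)


def run_daily_job(day: date) -> str:
    """Placeholder for one small ETL step limited to a single day's partition.

    A real job would run a query filtered on that partition; here we only
    return the name of the partition it would write.
    """
    return f"dt={day.isoformat()}"


def backfill(start: date, end: date, max_workers: int = 4) -> List[str]:
    """Run one job per day concurrently and collect the results in date order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order even though jobs run concurrently.
        return list(pool.map(run_daily_job, daily_windows(start, end)))


print(backfill(date(2020, 7, 1), date(2020, 7, 4)))
# prints: ['dt=2020-07-01', 'dt=2020-07-02', 'dt=2020-07-03']
```

In an orchestrator such as Airflow, each window would typically become its own task run rather than a thread, but the idea is the same: many small, independent, partition-sized jobs instead of one large one.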
Table Naming Conventions
- Uppercase vs. lowercase, for example Country vs. country: pick one and be consistent.
- Column names: prefix_suffix vs. prefixSuffix. Both are OK, just be consistent.
- Table names should contain:
  - FACT_ prefix for tables with raw data in the ingestion layer.
  - DIM_ prefix for dimension tables for the BI.
  - AGG_ prefix for aggregation tables in the transformation/modeling layers.
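A naming convention is easier to keep when it is checked automatically. The sketch below validates table names against the three prefixes listed above. The rule that the rest of the name must be lowercase snake_case is my assumption for the example; swap in whichever style your team picked.

```python
import re

# Prefixes from the list above. The snake_case body rule is an assumption.
ALLOWED_PREFIXES = ("FACT_", "DIM_", "AGG_")
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def is_valid_table_name(name: str) -> bool:
    """True if the name is a known prefix followed by a snake_case body."""
    for prefix in ALLOWED_PREFIXES:
        if name.startswith(prefix):
            return bool(SNAKE_CASE.match(name[len(prefix):]))
    return False


for candidate in ("FACT_page_views", "DIM_country", "agg_daily", "AGG_DailySales"):
    print(candidate, is_valid_table_name(candidate))
# prints:
# FACT_page_views True
# DIM_country True
# agg_daily False
# AGG_DailySales False
```

A check like this could run in code review or CI so that inconsistent names are caught before a table is created.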
Example of AWS Big Data Architecture