What is a Data Lake and how can it help you?
Authors: Pieter-Jan Serlet and Thomas D'Hauwe
In today's world of Data Management, a lot of terms are thrown at anyone who tries to get a deeper understanding. It's easy to dismiss the differences between Data Lakes, Data Warehouses, Databases, etc. with a "Just give us something that works!", but it's dangerous not to weigh the benefits & drawbacks of each solution.
This blogpost will focus on the added value of Data Lakes, which can complement your current data infrastructure. Should you be interested in examples & definitions of data swamps, data marts, etc., this article offers valuable insights.
What is a data lake?
James Dixon, CTO of Pentaho, was the first person to popularize the term 'Data Lake' back in 2010. He describes a Data Lake using an excellent analogy, in his blog “Pentaho, Hadoop, and Data Lakes”:
“If you think of a Datamart (subset of a Data Warehouse - Ed.) as a store of bottled water: cleansed, packaged, and structured for easy consumption; the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source, and various users of the lake can come in to examine, dive in, or take samples.”
Following this analogy, you can describe a Data Lake as a data store that is built for ingesting and processing raw data. The incoming raw data can be structured, semi-structured, or unstructured. It arrives from multiple sources and is not structured in advance according to a preferred model.
An enterprise Data Warehouse, on the other hand, contains data that is already structured. Data that is uploaded to a Data Warehouse first needs to be modified so that it fits the predefined structure. We'll explain this in more depth in the next section.
The key difference: ETL vs. ELT
The main difference between Data Warehouses & Data Lakes can be captured by 'ETL vs. ELT'. What do these acronyms mean?
To fill an enterprise Data Warehouse, we first have to Extract, Transform, and then Load the data (ETL). The bulk of the data preparation work therefore has to be done upfront. Why is this important? Between 50 and 80% of data scientists' time is spent 'wrangling data' (source: New York Times).
By contrast, Data Lakes store data in its raw form. No data is refused entry into the data lake, and data scientists don't have to spend an extraordinary amount of their (well-paid) hours as 'data janitors' before the data can be loaded. When a data scientist wants to use the Data Lake for a specific report or analysis, they transform only the part of the data they need. This way of working is called Extract, Load, and Transform (ELT). It saves a lot of time, because only the data for the specific analysis has to be transformed.
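To make the contrast concrete, here is a minimal Python sketch of both approaches. It is an illustration only, not a production pipeline: the sample events, the field names, and the `transform`, `etl`, and `elt` functions are all made up for this example. The point is simply to count how many records have to be transformed in each approach.

```python
# Hypothetical raw events from three sources; all field names are illustrative.
RAW_EVENTS = [
    {"source": "webshop", "payload": {"amount": "19.99", "currency": "EUR"}},
    {"source": "sensor", "payload": {"temp_c": "21.5"}},
    {"source": "webshop", "payload": {"amount": "5.00", "currency": "EUR"}},
    {"source": "crm", "payload": {"note": "called customer"}},
]


def transform(event):
    """Toy transformation: turn numeric strings in the payload into floats."""
    cleaned = {}
    for key, value in event["payload"].items():
        try:
            cleaned[key] = float(value)
        except ValueError:
            cleaned[key] = value  # leave non-numeric text untouched
    return {"source": event["source"], **cleaned}


def etl(events):
    """ETL: transform every record up front, then load the result."""
    warehouse = [transform(e) for e in events]
    return warehouse, len(events)  # (store, number of transformations)


def elt(events, wanted_source):
    """ELT: load everything raw, transform only what one analysis needs."""
    lake = list(events)  # loaded as-is, nothing is refused
    subset = [transform(e) for e in lake if e["source"] == wanted_source]
    return lake, subset, len(subset)


warehouse, etl_work = etl(RAW_EVENTS)
lake, webshop_rows, elt_work = elt(RAW_EVENTS, "webshop")
revenue = sum(row["amount"] for row in webshop_rows)
print(etl_work, elt_work, round(revenue, 2))
```

In this toy setup, the ETL path transforms all four records before anything is stored, while the ELT path keeps all four in raw form and transforms only the two webshop records needed for the revenue question.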
Because all data 'lives free' in the data lake, it's easier for users to go in and get creative with data. This stimulates managers and data scientists to explore the data in novel ways, to really think outside the box, and to come up with surprising insights that drive business (source: Scientific American). By contrast, the very regimented & structured approach of a data warehouse has its benefits, but it can also stymie creative data exploration.
Zoning your data lake
Since all types of data enter a data lake in their raw form, you end up with a mix of structured & unstructured data. Failing to properly zone your Data Lake can turn it into a Data Swamp. To avoid creating an unmanageable mess, your Data Governance should prescribe at least the following zones:
- Ingestion zone: This is the first zone that the data will enter and contains all data that your organization has access to.
- Raw data zone: This is the approved raw data zone. In this zone, you add metadata to your available data.
- Trusted zone: Consolidated and aggregated data.
- Refined zone: This is the enriched data that is ready for transformation.
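One way to picture these zones is as stages that a dataset is promoted through, one at a time. The Python sketch below is only an assumption about how such a rule could look: the `Zone`, `Dataset`, and `promote` names are hypothetical, and the rule that a dataset needs an 'owner' in its metadata before it may leave the raw zone is an invented example of a governance check, not a standard.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Zone(IntEnum):
    """The four zones from the list above, in promotion order."""
    INGESTION = 1
    RAW = 2
    TRUSTED = 3
    REFINED = 4


@dataclass
class Dataset:
    name: str
    zone: Zone = Zone.INGESTION  # every dataset starts in the ingestion zone
    metadata: dict = field(default_factory=dict)


def promote(dataset):
    """Move a dataset one zone forward, enforcing a toy governance rule."""
    if dataset.zone is Zone.REFINED:
        raise ValueError(f"{dataset.name} is already in the last zone")
    # Assumed rule: nothing leaves the raw zone without an owner in its metadata.
    if dataset.zone is Zone.RAW and "owner" not in dataset.metadata:
        raise ValueError(f"{dataset.name} needs an 'owner' before leaving the raw zone")
    dataset.zone = Zone(dataset.zone + 1)
    return dataset


clicks = Dataset("web_clicks")
promote(clicks)                         # ingestion -> raw
clicks.metadata["owner"] = "marketing"  # metadata is added in the raw zone
promote(clicks)                         # raw -> trusted
print(clicks.zone.name)
```

A dataset without the required metadata stays stuck in the raw zone, which is the kind of small, enforceable check that helps keep a lake from drifting into a swamp.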
Do not think it's easy to set up a data lake because ‘you can just ingest all data in its raw form’. Creating a well-functioning data lake requires a lot of thought and effort, including a well-maintained data catalog. Failure to implement data curation & governance guidelines could turn it into a Data Swamp in no time.
Added value of Data Lakes
Some organizations tend to treat different data sources as different data silos. Each data silo is controlled by one department and isolated from the rest of the organization. These silos make it difficult to combine data and gain insights from different sources.
A Data Warehouse is a great first step for managing and integrating data. This is nothing new: a lot of companies have implemented Data Warehouses with a highly structured Data Model designed for reporting & analysis.
However, each organization is gathering a staggering amount of unstructured data that may never be deemed fit for entry in the Data Warehouse. Examples are videos, pictures, sensor data from IoT-enabled devices, etc. Research (source: Western Digital & 451 Research) found that 63% of enterprises and service providers have 25 Petabytes of unstructured data floating around. That's 25,000 Terabytes! IDG even predicts that by 2020 up to 93% of all data will be unstructured.
By excluding these data types & sources from the party, you risk missing out on important pieces of the puzzle. Furthermore, a data lake offers a chance to go beyond descriptive analytics, into the exciting realms of predictive and prescriptive analytics. This is where the really surprising insights wait to be discovered, and they offer greater opportunities to leverage and monetize your data.
We'd like to keep it short and sweet so here goes: Keep up the good work with a well-organized Data Warehouse, but do think about the possibilities and added value of a complementary Data Lake!
How can we assist you?
- High-level overview of the current data sources with their main data entities and their data quality level
- Documentation of the current solution architecture, challenges, and needs (as-is)
- High-level requirements list based on future goals and plans (to be)
- Scoping the data lake architecture
- Assisting with implementation of a data lake
- Ensuring you're set up for future success by training admins/users and updating data governance
We would like to thank the people at Xplenty for allowing us to use their ETL vs ELT graph.
Written by Pieter-Jan Serlet, Data Engineer/Architect at LoQutus
Thomas D'Hauwe, Business Intelligence Consultant at LoQutus.
Edited by Dries Lamont, Marketing Manager at LoQutus.