Understanding Delta Lake
Data lakes and data warehouses have complementary benefits, which has made both a must-have in many organizations. A data warehouse is a central repository of structured data that can be analyzed to make more informed decisions. A data lake, by contrast, is a centralized repository designed to store, process, and secure large amounts of structured, semi-structured, and unstructured data.
As in the analogy below, a lake holds water along with impurities and material from all kinds of sources: rain, household runoff, fish, waste, plants, and so on. We take water from the lake, process it, and store it in a tank, which represents the data warehouse.
Data lakes support machine learning and have an open ecosystem that accommodates unstructured data. Data warehouses are great for BI and reporting but have limited support for machine learning.
The lakehouse brings together the best of both. A lakehouse unifies data, analytics, and AI workloads. To facilitate building a lakehouse, Databricks provides an open-source storage framework known as Delta Lake. Databricks is the data and AI company that introduced the first lakehouse. Let's understand this with our analogy of the lake and the water tank. A lake has its own importance: you can go fishing on a lake but not in a water tank. The water in the tank, on the other hand, has been processed and can be used for drinking, cooking, and other household needs. So we need both, each serving a different purpose. Right?
Now imagine a hypothetical case in which a single storage area, perhaps a special kind of tank, can be used for fishing and can also supply water for household use. We would no longer depend on different sources; a single source would suffice. This storage is analogous to a lakehouse, and the boundary that confines this new lake and makes it possible to build it is analogous to Delta Lake.
Let's look at some of the features of Delta Lake:
- It provides the ability to build and curate data lakes with the reliability, governance, and performance expected of a data warehouse, directly on the data lake.
- It supports ACID transactions, ensuring that every transaction either succeeds completely or fails completely.
- It utilizes caching and indexing to allow quick querying of data in the data lake.
- It provides control over who can access data in the data lake.
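To make the "succeeds completely or fails completely" idea more concrete, here is a minimal Python sketch of an all-or-nothing commit. It is a toy illustration only, not Delta Lake's implementation or API: the class `ToyTable`, the directory name `_toy_log`, and the `as_version` parameter are all hypothetical names invented for this example. It loosely mimics the general idea of a transaction log in which each commit is a numbered file, and it ignores crashes in the middle of a write, which a real system has to handle.

```python
import json
import os
import tempfile


class ToyTable:
    """Toy append-only table whose contents are defined by numbered commit files."""

    def __init__(self, root):
        # "_toy_log" is a made-up directory name for this sketch.
        self.log_dir = os.path.join(root, "_toy_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _path(self, version):
        return os.path.join(self.log_dir, f"{version:020d}.json")

    def _versions(self):
        names = os.listdir(self.log_dir)
        return sorted(int(n[:-5]) for n in names if n.endswith(".json"))

    def latest_version(self):
        versions = self._versions()
        return versions[-1] if versions else -1

    def commit(self, rows, as_version=None):
        """Publish all of `rows` as one new version, or raise and write nothing."""
        # Serialize the whole batch first: if any row is invalid,
        # the error is raised before a single byte reaches the log.
        payload = json.dumps(rows)
        if as_version is None:
            as_version = self.latest_version() + 1
        # Mode "x" fails with FileExistsError if the version already exists,
        # so two writers cannot both claim the same version number.
        with open(self._path(as_version), "x") as f:
            f.write(payload)
        return as_version

    def read(self):
        """Return the rows of every committed version, in commit order."""
        rows = []
        for version in self._versions():
            with open(self._path(version)) as f:
                rows.extend(json.load(f))
        return rows


table = ToyTable(tempfile.mkdtemp())
table.commit([{"id": 1}, {"id": 2}])  # becomes version 0

try:
    # The second row cannot be serialized, so the whole batch is rejected.
    table.commit([{"id": 3}, {"id": object()}])
except TypeError:
    pass

try:
    # A writer that tries to reuse version 0 loses the race and must retry.
    table.commit([{"id": 9}], as_version=0)
except FileExistsError:
    pass

print(table.read())  # [{'id': 1}, {'id': 2}]
```

Neither failed commit leaves anything behind: the row with `id` 3 is not written even though it was valid on its own, and the conflicting writer does not overwrite version 0. That is the property the ACID bullet above describes, shown here at a very small scale.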
In the image below you can see how a lakehouse combines the best of a data lake and a data warehouse.
As the diagram below shows, Delta Lake is the framework that supports building a lakehouse.
By using Delta Lake we can build a lakehouse over our data lake. In the figure below, the bottom layer of unstructured, semi-structured, structured, and streaming data represents the data lake.
The lakehouse platform provides governance and security and serves data to different workstreams. Without it, four separate architectures were needed to support workloads across data engineering, analytics, ML, and real-time applications. With a lakehouse, all of them share a common data foundation.
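Delta Lake is open source, and different companies have integrated it into their own platforms. On Azure, for example, services such as Azure Databricks can keep Delta Lake tables on Azure Data Lake Storage Gen2. Note that Azure Data Lake Storage Gen2 is a cloud storage service on which Delta Lake tables can be stored, not a modified version of Delta Lake.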