When we think of data storage, we usually think of data warehouses or lakes – architectures that work very well for some things and not at all for others. To address these limitations, a new architecture is being developed: the data lakehouse.
In this article, we’ll discuss what the data lakehouse architecture is, how it meets current data architecture challenges, and the basic requirements for implementing it.
What Is a Data Lakehouse?
A data lakehouse is a new architecture that borrows from both data warehouses, which primarily house structured data in proprietary formats, and data lakes, which use open formats to store all types of data, including semi-structured and unstructured data.
While both data warehouses and data lakes have their advantages, they come with significant drawbacks. Data warehouses only support SQL for data access and are more difficult to scale. On the plus side, they’re very reliable and offer better security, data governance, and performance. Data lakes use APIs as well as SQL, Python, and other programming languages for data access. They house a wide variety of data and are very scalable. In the debit column, we have the lakes’ lower reliability, comparatively poor data governance, and slower performance.
In contrast, a data lakehouse merges data lakes’ openness, API usage, and scalability with data warehouses’ reliability, governance, security, and performance. Lakehouses support both SQL and other programming languages and serve a number of important use cases, including SQL analytics, BI, and ML.
Essential data lakehouse features include:
- Support for concurrent data pipelines and multiple data environments (e.g. structured, semi-structured, unstructured, textual, and IoT data).
- Physical storage separate from compute, making the system more scalable.
- Built-in data governance and auditing.
- The ability to handle varied workloads, e.g. analytics, machine learning, BI, etc.
- Enabling streaming analytics.
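The separation of storage and compute in the list above can be illustrated with a toy, stdlib-only sketch (not a real lakehouse engine): data is persisted once in an open, line-delimited format, and any number of independent "compute" readers then operate on the same files.

```python
import json
import os
import tempfile

# Toy illustration: "storage" is a plain file in an open format;
# "compute" is any process that reads it, scaled independently.
records = [
    {"id": 1, "kind": "structured", "value": 10},
    {"id": 2, "kind": "semi-structured", "value": 25},
    {"id": 3, "kind": "structured", "value": 7},
]

# Storage layer: persist once, in line-delimited JSON.
path = os.path.join(tempfile.mkdtemp(), "events.jsonl")
with open(path, "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")

# Compute reader 1: an analytics-style aggregate.
def total_value(p):
    with open(p) as f:
        return sum(json.loads(line)["value"] for line in f)

# Compute reader 2: a filter-style query, fully independent of reader 1.
def structured_only(p):
    with open(p) as f:
        rows = [json.loads(line) for line in f]
    return [r for r in rows if r["kind"] == "structured"]
```

Because neither reader owns the data, adding a third workload (say, an ML feature extractor) means adding compute, not copying storage.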
Challenges Solved by the Data Lakehouse Architecture
Of course, many companies have both data warehouses and data lakes, thus getting the benefits of both architectures. What makes the data lakehouse revolutionary is that it offers everything mentioned above on one platform. This reduces the complexity of having two data systems and shuffling information between them. It also mitigates the performance issues involved with the extract-transform-load (ETL) process.
How to Build a Data Lakehouse
What does an organization need to take full advantage of the benefits of a data lakehouse architecture? It starts with low-cost, flexible storage (e.g. cloud storage) that can support concurrent read/write operations, warehouse-type data schemas, and dynamic schema changes.
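Dynamic schema changes can be sketched in a few lines of plain Python. This is an assumption-level illustration of the idea, not any particular lakehouse engine's behavior: when a new batch arrives with columns the table has never seen, the schema widens to accept them instead of rejecting the write.

```python
# Toy sketch of dynamic schema evolution (illustrative only).
def merge_schema(schema, batch):
    """Widen the known schema with any new columns seen in a batch."""
    merged = dict(schema)
    for row in batch:
        for col, value in row.items():
            # Only brand-new columns are added; existing ones are kept.
            merged.setdefault(col, type(value).__name__)
    return merged

schema = {"id": "int", "value": "int"}
batch = [{"id": 4, "value": 3, "region": "eu"}]  # "region" is a new column

schema = merge_schema(schema, batch)
```

Real engines additionally validate that existing columns keep compatible types; this sketch only shows the widening step.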
Next, security and governance come into play. The architecture needs to support data manipulation language (DML) operations through multiple languages, provide a full history of all changes (including metadata, versioning, and the ability to roll back changes), and enforce role-based access control. Obviously, this is also the time to ensure that all regulatory and industry compliance measures regarding data privacy and safety are implemented.
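The versioning and rollback requirement can be made concrete with a toy versioned table. This is a teaching sketch, not how production table formats store their history: every write produces a new numbered version, any historical version can be read back, and rollback is simply restoring an old snapshot as the newest version.

```python
import copy

# Toy versioned table (illustrative only): a full change history with
# readable historical versions and rollback.
class VersionedTable:
    def __init__(self):
        self.versions = [[]]  # version 0 is the empty table

    def write(self, rows):
        """Append rows, producing a new version; return its number."""
        snapshot = copy.deepcopy(self.versions[-1]) + list(rows)
        self.versions.append(snapshot)
        return len(self.versions) - 1

    def read(self, version=None):
        """Read the latest version, or any historical one."""
        if version is None:
            version = len(self.versions) - 1
        return self.versions[version]

    def rollback(self, version):
        # Rollback never deletes history: it writes an old snapshot
        # as the newest version, keeping the audit trail intact.
        self.versions.append(copy.deepcopy(self.versions[version]))

t = VersionedTable()
v1 = t.write([{"id": 1}])
t.write([{"id": 2}])
t.rollback(v1)  # latest read now matches version 1
```

Note the design choice: rollback appends rather than truncates, so auditors can still see that the rollback itself happened.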
Another key factor in the data lakehouse is openness, i.e. using open application programming interfaces (APIs) to share data and open file formats to store it in. Data access should be available in several languages, tools, and engines.
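What openness buys can be shown with a small stdlib sketch, using CSV as a stand-in for open formats like Parquet or ORC: because the byte layout is openly specified, a second, completely unrelated "engine" can parse the same data without any vendor driver.

```python
import csv
import io

# One writer produces data in an open format (CSV as a stand-in).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "value"])
writer.writeheader()
writer.writerows([{"id": "1", "value": "10"}, {"id": "2", "value": "20"}])

# Reader A: the csv module (one "engine").
buffer.seek(0)
rows = list(csv.DictReader(buffer))

# Reader B: naive string parsing (a different "engine" entirely),
# possible only because the format is openly specified.
lines = buffer.getvalue().strip().splitlines()
header = lines[0].split(",")
parsed = [dict(zip(header, line.split(","))) for line in lines[1:]]
```

With a proprietary warehouse format, reader B simply could not exist; every consumer would have to go through the vendor's access layer.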
Finally, there must be the ability to handle multiple data use cases, including analytics, Business Intelligence, and machine learning. While some of this is covered in the preceding paragraphs, it’s worth noting that ML systems require certain APIs and tools of their own (e.g. DataFrame, TensorFlow, R, PyTorch, etc.) as well as the ability to perform direct reads of large amounts of data.
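The "direct reads of large amounts of data" that ML workloads need can be sketched with a stdlib-only batching generator. This is an illustration of the access pattern, not a real training pipeline: the trainer streams fixed-size chunks straight from storage instead of pulling the whole dataset through a row-at-a-time driver.

```python
# Sketch of chunked direct reads for ML training (illustrative only).
def stream_batches(rows, batch_size):
    """Yield fixed-size batches so training never loads the full set."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch

# A stand-in for a large stored dataset (here, a lazy generator).
dataset = ({"x": i, "y": i % 2} for i in range(10))
batches = list(stream_batches(dataset, batch_size=4))
```

DataFrame libraries and frameworks like TensorFlow or PyTorch expose the same pattern through their dataset APIs; the point is that the storage layer must serve these bulk scans as well as SQL queries.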
In short, preparing for a data lakehouse means preparing for concurrent but diverse use cases, supported by a robust and flexible system.
The Case for the Data Lakehouse
Streamlining operations has been a watchword in business and technology for a long time now. With the emergence of the data lakehouse, organizations can do more than harness two very functional (but very different) data architectures. They can eliminate common error points in the ETL process, mitigate some of the weaknesses inherent to either system, and more effectively promote data movement and consumption. In short, they can streamline their data workload, which paves the way for even better results.