An Essential Overview of the Data Lakehouse

When we think of data storage, we usually think of data warehouses or data lakes – architectures that work very well for some workloads and poorly for others. To address these limitations, a new architecture has emerged: the data lakehouse.

In this article, we’ll discuss what the data lakehouse architecture is, how it meets current data architecture challenges, and the basic requirements for implementing it.

What Is a Data Lakehouse?

A data lakehouse is a new architecture that borrows from both data warehouses, which primarily house structured data in proprietary formats, and data lakes, which use open formats to store all types of data, including semi-structured and unstructured data.

While both data warehouses and data lakes have their advantages, they come with significant drawbacks. Data warehouses support only SQL for data access and are more difficult to scale. On the plus side, they’re very reliable and offer better security, data governance, and performance. Data lakes use APIs as well as SQL, Python, and other programming languages for data access. They house a wide variety of data and are very scalable. In the debit column, we have the lakes’ lower reliability, comparatively poor data governance, and slower performance.

In contrast, a data lakehouse merges data lakes’ openness, API access, and scalability with data warehouses’ reliability, governance, security, and performance. Lakehouses support both SQL and other programming languages and serve a number of important use cases, including SQL analytics, BI, and ML.

Essential data lakehouse features include:

  • Support for concurrent data pipelines and multiple data environments (e.g. structured, semi-structured, unstructured, textual, IoT, etc.).
  • Physical storage separate from compute, making the system more scalable.
  • Built-in data governance and auditing.
  • The ability to handle varied workloads, e.g. analytics, machine learning, BI, etc.
  • Support for streaming analytics (see the sketch after this list).
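
To make a few of these features concrete, the following is a minimal PySpark sketch that ingests semi-structured data, writes it to an open table format, and runs a streaming read over the same table. It assumes Delta Lake as the table format (Apache Iceberg or Apache Hudi would work similarly); the storage paths, schema, and column names are hypothetical.

    # A minimal sketch, assuming Delta Lake on Spark as the open table format.
    # Paths and column names are hypothetical.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("lakehouse-sketch")
        # Delta Lake's Spark extensions (assumes the delta-spark package is installed)
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaSparkSessionCatalog")
        .getOrCreate()
    )

    # Batch ingestion: semi-structured JSON events landed in cloud object storage
    raw_events = spark.read.json("s3://example-bucket/raw/events/")

    # Write to an open-format table; storage stays separate from the Spark compute
    raw_events.write.format("delta").mode("append").save("s3://example-bucket/lakehouse/events")

    # Streaming analytics over the same table: new rows are picked up as they arrive
    counts = (
        spark.readStream.format("delta")
        .load("s3://example-bucket/lakehouse/events")
        .groupBy("event_type")
        .count()
    )
    query = counts.writeStream.outputMode("complete").format("console").start()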

Challenges Solved by the Data Lakehouse Architecture

Of course, many companies have both data warehouses and data lakes, thus getting the benefits of both architectures. What makes the data lakehouse revolutionary is that it offers everything mentioned above on one platform. This reduces the complexity of having two data systems and shuffling information between them. It also mitigates the performance issues involved with the extract-transform-load (ETL) process.

How to Build a Data Lakehouse

What does an organization need to take full advantage of the benefits of a data lakehouse architecture? It starts with low-cost, flexible storage (e.g. cloud object storage) that can support concurrent read/write operations, warehouse-type data schemas, and dynamic schema changes.
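
As one illustration of dynamic schema changes, the sketch below appends a batch containing a new column and lets the table absorb it via schema evolution. It assumes the Delta Lake table from the previous sketch; the path and the added "channel" column are hypothetical.

    # A sketch of dynamic schema evolution, assuming the Delta Lake table format.
    # Reuses the Delta-configured SparkSession from the previous sketch.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    events = spark.read.format("delta").load("s3://example-bucket/lakehouse/events")

    # A new batch arrives with an extra column not present in the existing table
    new_batch = events.limit(10).withColumn("channel", F.lit("mobile"))

    # mergeSchema lets the table take on the new column without a manual migration
    (
        new_batch.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .save("s3://example-bucket/lakehouse/events")
    )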

Next, security and governance come into play. The architecture needs to support data manipulation language (DML) operations from multiple languages, provide a full history of all changes (including metadata, versioning, and the ability to roll back changes), and offer role-based access control. Obviously, this is also the time to ensure that all regulatory and industry compliance measures regarding data privacy and safety are implemented.
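
As a hedged sketch of what this can look like, the snippet below issues DML, inspects the change history, and rolls back a version using Delta Lake's SQL syntax from Python. The table name, columns, and version numbers are hypothetical, and role-based access control is typically enforced by the catalog or governance layer rather than by the table format itself.

    # A sketch of DML, change history, time travel, and rollback, assuming a
    # Delta Lake table registered as "events". Columns and versions are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # assumes a Delta-configured session

    # DML issued through SQL (equivalent operations exist in the Python and Scala APIs)
    spark.sql("UPDATE events SET status = 'archived' WHERE event_date < '2022-01-01'")

    # Full change history: operation, timestamp, user, and version for every commit
    spark.sql("DESCRIBE HISTORY events").show(truncate=False)

    # Time travel: query the table as it existed at an earlier version
    spark.sql("SELECT COUNT(*) FROM events VERSION AS OF 12").show()

    # Roll back to a previous version if the update turns out to be a mistake
    spark.sql("RESTORE TABLE events TO VERSION AS OF 11")

    # Role-based access control is usually granted through the governance layer,
    # e.g. a catalog that accepts statements along the lines of:
    #   GRANT SELECT ON TABLE events TO analysts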

Another key factor in the data lakehouse is openness, i.e. using open application programming interfaces (APIs) to share data and open file formats to store it. Data access should be available through several languages, tools, and engines.
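
For example, because the table's underlying data files use an open format such as Parquet, a completely different engine or a lightweight Python library can read them without going through the engine that wrote them. The sketch below is illustrative only: the path is hypothetical, and a transactional table is best read through that format's own open reader (such as the deltalake package for Delta tables) so the transaction log is respected.

    # A sketch of open access: the same data read entirely outside of Spark.
    # The path is hypothetical; assumes the data files are Parquet.
    import pyarrow.dataset as ds

    # Any Parquet-aware tool can scan the open data files directly ...
    table = ds.dataset("s3://example-bucket/lakehouse/events", format="parquet").to_table()
    frame = table.to_pandas()
    print(frame.head())

    # ... while table formats such as Delta Lake also ship lightweight open readers
    # that honor the transaction log, e.g. (assuming the `deltalake` package):
    # from deltalake import DeltaTable
    # frame = DeltaTable("s3://example-bucket/lakehouse/events").to_pandas()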

Finally, there must be the ability to handle multiple data use cases, including analytics, business intelligence, and machine learning. While some of this is covered in the preceding paragraphs, it’s worth noting that ML systems require certain APIs and tools of their own (e.g. DataFrame APIs, TensorFlow, R, PyTorch) as well as the ability to perform direct reads of large amounts of data.
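
As a sketch of such a direct read, the snippet below pulls selected columns straight from the open-format files into a PyTorch DataLoader, with no intermediate export step. The path, column names, and batch size are hypothetical.

    # A sketch of a direct read from open-format storage into an ML framework.
    # Path, column names, and batch size are hypothetical.
    import pyarrow.dataset as ds
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    feature_cols = ["f1", "f2", "f3"]
    label_col = "label"

    # Read only the needed columns straight from the lakehouse files
    table = ds.dataset(
        "s3://example-bucket/lakehouse/training", format="parquet"
    ).to_table(columns=feature_cols + [label_col])
    frame = table.to_pandas()

    features = torch.tensor(frame[feature_cols].to_numpy(), dtype=torch.float32)
    labels = torch.tensor(frame[label_col].to_numpy(), dtype=torch.float32)

    loader = DataLoader(TensorDataset(features, labels), batch_size=256, shuffle=True)
    for batch_features, batch_labels in loader:
        ...  # feed each batch to the model's training loop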

In short, preparing for a data lakehouse means preparing for concurrent but diverse use cases, supported by a robust and flexible system.

The Case for the Data Lakehouse

Streamlining operations has been a watchword in business and technology for a long time now. With the emergence of the data lakehouse, organizations can do more than harness two very functional (but very different) data architectures. They can eliminate common error points in the ETL process, mitigate some of the weaknesses inherent to either system, and more effectively promote data movement and consumption. In short, they can streamline their data workload, which paves the way for even better results.

About the Authors

Dr. Anil Kaul

Dr. Anil Kaul is Chief AI Officer at Infogain and CEO at Absolutdata – an Infogain company. He has over 22 years of experience in advanced analytics, market research, and management consulting. He is very passionate about analytics and leveraging technology to improve business decision-making. Prior to founding Absolutdata, Anil worked at McKinsey & Co. and Personify. In addition to speaking at industry conferences and top business schools, Anil has published articles in McKinsey Quarterly, Marketing Science, Journal of Marketing Research, and International Journal of Research. Anil holds a Ph.D. and a Master of Marketing degree, both from Cornell University.

Harshit Parikh

Harshit is the Vice President, Global Practice Lead at Infogain. A seasoned technology executive with nearly 20 years of experience leading large engineering teams, architecting complex technical solutions, and building and scaling geographically distributed teams to deliver them, Harshit knows how to deliver results in today's changing world of business. A self-described digital native, Harshit has spent his career building the technical foundations that enable true digital transformation. He has advised clients on a diverse range of initiatives, including digital marketing, technology strategy and roadmaps, enterprise solution architecture, CMS platforms, data platforms, commerce solutions, DevOps, and custom development. He has also led several global, technology-driven digital transformation initiatives for Fortune 500 clients.