
Lakehouse


A Lakehouse is a data management system based on low-cost and directly-accessible storage that also provides traditional analytical DBMS management and performance features such as ACID transactions, data versioning, auditing, indexing, caching, and query optimization.

Lakehouses thus combine the key benefits of data lakes and data warehouses: low-cost storage in an open format accessible by a variety of systems from the former, and powerful management and optimization features from the latter.

Architecture

Overview

  • Store data in a low-cost object store using a standard file format
  • Implement a transactional metadata layer on top of the object store that defines which objects are part of a table version (a minimal sketch of this design follows)
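
As a minimal sketch of this two-layer design, the snippet below uses PySpark with the open-source delta-spark package (both assumed installed); the two `config` lines are the standard way to register Delta Lake's metadata layer with Spark, and the S3 bucket path is purely illustrative.

```python
# Minimal sketch of the two-layer Lakehouse design: Parquet data files in a
# low-cost object store, plus Delta Lake's transactional metadata layer.
# Assumes pyspark and delta-spark are installed; the path is illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Data lands as ordinary Parquet files; the _delta_log directory written next
# to them records which files belong to each table version.
df.write.format("delta").save("s3a://my-bucket/events")
```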

Example Design

The system centers around a metadata layer, such as Delta Lake, that adds transactions, versioning, and auxiliary data structures over files in an open format; the resulting tables can be queried with diverse APIs and engines.


DataFrames, popularized by R and Pandas, give users a table abstraction with various transformation operators, most of which map to relational algebra.
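
For illustration, common DataFrame operators map onto relational algebra roughly as follows (PySpark syntax, reusing the `spark` session and illustrative table path from the sketch above):

```python
# Sketch: DataFrame operators and their relational-algebra counterparts.
events = spark.read.format("delta").load("s3a://my-bucket/events")

result = (
    events
    .select("id", "value")       # projection
    .filter("id > 1")            # selection
    .groupBy("value").count()    # grouping + aggregation
)
result.show()
```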

Metadata Layers

Data lake storage systems such as S3 or HDFS only provide a low-level object store or filesystem interface where even simple operations, such as updating a table that spans multiple files, are not atomic.

Organizations soon began designing richer data management layers over these systems:

  • Apache Hive ACID, the earliest, tracks which data files are part of a Hive table at a given table version using an OLTP DBMS and allows operations to update this set transactionally.
  • Delta Lake stores the information about which objects are part of a table in the data lake itself, as a transaction log in Parquet format, enabling it to scale to billions of objects per table (the versioning this enables is sketched below).
  • Apache Iceberg, which started at Netflix, uses a similar design and supports both Parquet and ORC storage.
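
Because such a log records every table version, clients can read past snapshots. Below is a brief sketch using Delta Lake's time-travel read option `versionAsOf` (a real option; the table path is again illustrative):

```python
# Sketch: reading an older table version resolved through the metadata log.
latest = spark.read.format("delta").load("s3a://my-bucket/events")

v0 = (
    spark.read.format("delta")
    .option("versionAsOf", 0)   # resolve the file set for table version 0
    .load("s3a://my-bucket/events")
)
```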

In addition, metadata layers are a natural place to implement data quality enforcement features. For example, Delta Lake implements schema enforcement to ensure that the data uploaded to a table matches its schema.
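
As a sketch of what enforcement looks like to a writer (table path illustrative), Delta Lake fails an append whose schema does not match the table's, unless schema evolution is explicitly requested:

```python
# Sketch: Delta Lake rejects an append whose schema does not match the table.
bad = spark.createDataFrame([(3, "c", True)], ["id", "value", "extra"])

try:
    bad.write.format("delta").mode("append").save("s3a://my-bucket/events")
except Exception as err:
    # Fails with a schema-mismatch AnalysisException unless schema evolution
    # is explicitly enabled (e.g. the mergeSchema write option).
    print("write rejected:", err)
```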

Finally, metadata layers are a natural place to implement governance features such as access control and audit logging.

SQL Performance

Techniques to implement SQL performance optimizations in a Lakehouse:

Caching

It is safe for a Lakehouse system to cache files from the cloud object store on faster storage devices, such as SSDs and RAM, on the processing nodes, because running transactions can easily determine when cached files are still valid to read.

Moreover, the cache can be in a transcoded format that is more efficient for the query engine to run on.
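
The underlying reason this works can be sketched in a few lines: data files referenced by the log are immutable (a commit adds or removes files, never rewrites them), so a cache keyed by file path can never serve stale bytes. The helper below is a hypothetical stand-in for a real S3/HDFS read:

```python
# Conceptual sketch: object-store data files are immutable, so a path-keyed
# cache is always consistent. fetch_from_object_store is a hypothetical stub.
cache: dict[str, bytes] = {}

def fetch_from_object_store(path: str) -> bytes:
    return b"parquet bytes for " + path.encode()  # illustrative stub

def read_file(path: str) -> bytes:
    if path not in cache:
        cache[path] = fetch_from_object_store(path)
    return cache[path]
```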

Auxiliary data

A Lakehouse can also maintain other data that helps optimize queries in auxiliary files.

In Delta Lake and Delta Engine, we maintain column min-max statistics for each data file in the table within the same Parquet file used to store the transaction log, which enables data-skipping optimizations when the base data is clustered by particular columns.
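
A small self-contained sketch of data skipping with such statistics (file names and ranges invented for illustration):

```python
# Sketch: prune files whose [min, max] range cannot satisfy a predicate on id.
files = {
    "part-000.parquet": {"id_min": 1,   "id_max": 100},
    "part-001.parquet": {"id_min": 101, "id_max": 200},
    "part-002.parquet": {"id_min": 201, "id_max": 300},
}

def files_to_scan(lo: int, hi: int) -> list[str]:
    # A file can be skipped when its range does not overlap [lo, hi].
    return [name for name, s in files.items()
            if s["id_max"] >= lo and s["id_min"] <= hi]

print(files_to_scan(150, 180))  # ['part-001.parquet']; the other two are skipped
```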

Data layout

Even when we fix a storage format such as Parquet, there are multiple layout decisions that can be optimized by the Lakehouse system.

The most obvious is record ordering: which records are clustered together and hence easiest to read together.
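
As a hedged sketch of one way a writer could cluster records so that the min/max pruning above becomes selective, using the standard Spark `repartitionByRange` and `sortWithinPartitions` APIs (output path illustrative):

```python
# Sketch: range-partition and sort by id before writing, so each output file
# covers a narrow id range and per-file min/max statistics prune effectively.
(
    df.repartitionByRange(4, "id")
      .sortWithinPartitions("id")
      .write.format("delta")
      .mode("overwrite")
      .save("s3a://my-bucket/events_clustered")
)
```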

Conclusion

We have argued that a unified data platform architecture that implements data warehousing functionality over open data lake file formats can provide competitive performance with today’s data warehouse systems and help address many of the challenges facing data warehouse users. Although constraining a data warehouse’s storage layer to open, directly-accessible files in a standard format might seem like a significant limitation at first, optimizations such as caching for hot data and data layout optimization for cold data can allow Lakehouse systems to achieve competitive performance. We believe that the industry is likely to converge towards Lakehouse designs given the vast amounts of data already in data lakes and the opportunity to greatly simplify enterprise data architectures.

Build Lakehouses with Delta Lake

Delta Lake is an open-source storage framework that enables building a Lakehouse architecture on top of existing object stores, with support for engines such as Spark, Flink, Trino, and Hive; see https://delta.io/.

References

  • Armbrust, Ghodsi, Xin, and Zaharia. “Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics.” CIDR 2021. https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf
  • Delta Lake. https://delta.io/