Delta Lake builds on Parquet, adding powerful features such as ACID transactions, versioning, and time travel.
Parquet Files
Parquet files store the actual data in a columnar format, which makes them efficient for reading specific columns and for compression.
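As a minimal Python sketch of that column-pruning benefit (using the pyarrow package; the file name and column names here are made up for illustration):

```python
import pyarrow.parquet as pq

# Only the requested columns are read from disk, which is where the
# columnar layout pays off compared with row-oriented formats.
# "sales.parquet", "region" and "amount" are placeholder names.
table = pq.read_table("sales.parquet", columns=["region", "amount"])
df = table.to_pandas()
print(df.head())
```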
Delta Parquet
Delta adds four key components to Parquet file storage (a minimal PySpark sketch follows the table):
Component | Functionality Added |
---|---|
_delta_log | JSON and checkpoint files that track all changes (add, remove, update). |
ACID Transactions | Changes are committed to the log before the Parquet files are modified. |
Schema Enforcement | Validates incoming data against the table schema before writing to Parquet. |
Time Travel | Uses the log to reconstruct previous versions of the data. |
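Here is a hedged PySpark sketch of those pieces, assuming the delta-spark package is installed; the path and toy data are placeholders. It writes two versions of a Delta table and then reads version 0 back via time travel:

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/demo_delta"  # placeholder location

# Version 0: the _delta_log folder is created alongside the Parquet files.
spark.createDataFrame([(1, "A"), (2, "B")], ["id", "value"]) \
    .write.format("delta").save(path)

# Version 1: an append is recorded as a new commit in the log.
spark.createDataFrame([(3, "C")], ["id", "value"]) \
    .write.format("delta").mode("append").save(path)

# Time travel: reconstruct the table as it looked at version 0 from the log.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```

After the first write you can look inside /tmp/demo_delta/_delta_log/ to see the JSON commit files the table above describes; a write with a mismatched schema would be rejected by schema enforcement.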
Process of using Delta Lake storage
Usage of Delta Parquet
- Databricks (which created Delta Parquet)
- Apache Spark
- MS Fabric
- Snowflake (via connector)
- Presto (via connector)
Report directly from Delta Parquet
- Power BI
- Tableau
- Jupyter Notebooks (man, do I like a good notebook!)

All of these can access Delta Parquet data indirectly via Spark or Databricks connectors.
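As a notebook-style illustration of the "via Spark" route, this sketch pulls the hypothetical Delta table from the earlier example into pandas for ad-hoc reporting (same delta-spark setup and placeholder path as before):

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("notebook-report")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Read the current version of the Delta table and hand it to pandas,
# which is the kind of handover a notebook report typically needs.
pdf = spark.read.format("delta").load("/tmp/demo_delta").toPandas()
print(pdf.describe())
```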
Snowflake Notes:
Delta Parquet/Delta Lake tables in MS Fabric (or Azure Synapse) can be consumed by Snowflake. There are various options, but Snowflake does not understand the Delta transaction log itself, so it relies on a manifest that exposes the table's current Parquet files. You create external storage in Snowflake, i.e. an external stage (similar to a shortcut in MS Fabric), and then define an external table over it. Delta UniForm goes a step further and lets Snowflake read the data as Apache Iceberg, the open table format Snowflake supports natively for Parquet with time travel capabilities.
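The Snowflake side might look roughly like the Python sketch below, using the snowflake-connector-python package. The account details, storage integration name and azure:// URL are all placeholders, and the exact external-table setup depends on which route you pick (manifest-based external table vs. Delta UniForm/Iceberg):

```python
import snowflake.connector

# Placeholder credentials and context.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_wh",
    database="my_db",
    schema="public",
)
cur = conn.cursor()

# External stage pointing at the lake storage
# (conceptually similar to a shortcut in MS Fabric).
cur.execute("""
    CREATE OR REPLACE STAGE delta_stage
      URL = 'azure://myaccount.blob.core.windows.net/lakehouse/sales_delta/'
      STORAGE_INTEGRATION = my_azure_integration
""")

# External table over the Parquet files exposed for that Delta table.
cur.execute("""
    CREATE OR REPLACE EXTERNAL TABLE sales_ext
      LOCATION = @delta_stage
      FILE_FORMAT = (TYPE = PARQUET)
      AUTO_REFRESH = FALSE
""")

# Refresh the external table metadata, then query it like any other table.
cur.execute("ALTER EXTERNAL TABLE sales_ext REFRESH")
cur.execute("SELECT COUNT(*) FROM sales_ext")
print(cur.fetchone())
```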
Tip: Apache Iceberg on Snowflake is similar to Delta Parquet on Databricks and MS Fabric.
Note: A Power BI semantic model can use Snowflake Parquet files, but refreshes are only aligned with the Parquet files themselves, and there is no ACID or time travel capability.