Delta and Parquet: Integer, GUID/UUID or SHA256 as ID?

Christian Henrik Reich
4 min read · May 6, 2024

Motivation

Last week I published the post Microsoft Fabric: Building Pseudo Identity Columns Without monotonically_increasing_id() in Spark. Although that post was about creating a pseudo-identity column with integers, as commonly seen in relational databases, I received some feedback regarding the use of hashes as ID columns. I will address that in this post.

In general, while a GUID business key coming from the OLTP system is acceptable, I try to avoid GUIDs and hashes as IDs when it comes to OLAP.

The reason for this is that using integers is more cost-effective. A long integer is 8 bytes, a GUID is 16 bytes, an MD5 hash is 16 bytes, and a SHA256 hash is 32 bytes.

Fact tables can easily reach billions of rows and often contain multiple dimension keys. For example, consider a fact table with 1 billion rows and 4 keys to dimensions: 4 long integer keys would use 32 GB of raw key data, 4 GUID or MD5 keys would use 64 GB, and 4 SHA256 keys would use 128 GB.
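To make the arithmetic explicit, here is a small back-of-the-envelope calculation in plain Python (not from the notebook), assuming the 1-billion-row fact table above:

ROWS = 1_000_000_000  # 1 billion fact rows (assumed for the example)
KEYS = 4              # dimension keys per row

key_sizes = {"long integer": 8, "GUID / MD5": 16, "SHA256": 32}  # bytes per key value

for name, size in key_sizes.items():
    total_gb = ROWS * KEYS * size / 10**9
    print(f"{name}: {total_gb:.0f} GB of raw key data")
# long integer: 32 GB, GUID / MD5: 64 GB, SHA256: 128 GB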

The sizes on disk do not scale quite as linearly when using Delta Lake and Parquet, due to encoding and compression. To provide better insight, I have created a fictitious example for measurement purposes.

Measuring the differences

I have created three star schemas, each with one fact table and three dimensions. The difference lies in their use of either integers, GUIDs, or SHA256 as keys. GUIDs and MD5 both use 16 bytes, so their results should be the same and therefore MD5 is not included.

The dimensions are not as wide as they normally would be. This helps keep the focus on the keys. There is no Slowly Changing Dimension Type 2 (SCD2) implemented.

The code for creating and measuring the tables can be found here: https://github.com/ChristianHenrikReich/BlogNoteBooks/blob/main/IntegersGuidAndHashes/StarBuilder.ipynb

The notebook is designed to run on Microsoft Fabric, but with minor adjustments, it can also run on Databricks. I have utilized an ordinary Starter Pool. If you run the test, you may not get the exact same numbers, but you should observe similar differences.
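As a rough illustration of how the three key variants can be produced in Spark, the snippet below derives an integer, a GUID and a SHA256 key for a dimension. It is a minimal sketch, not the notebook's code, and it assumes a DataFrame df with a business key column named bk:

from pyspark.sql import functions as F, Window

# Sketch only: `df` and its business key column `bk` are assumed names.
w = Window.orderBy("bk")

keyed = (
    df
    .withColumn("dim_id_int", F.row_number().over(w).cast("long"))         # pseudo-identity integer key
    .withColumn("dim_id_guid", F.expr("uuid()"))                           # GUID/UUID key (string in Spark)
    .withColumn("dim_id_sha256", F.sha2(F.col("bk").cast("string"), 256))  # SHA256 key (hex string in Spark)
)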



Table name      Row count     Columns
--------------  ------------  ------------------------------------------------------------------------
dim_customers   100,000       dim_id, dim_type, name, age, gender, city
dim_products    10,000        dim_id, dim_type, name, category, color, size
dim_locations   100           dim_id, dim_type, country, state, city
fact_sales      36.5 million  product_dim_id, customer_dim_id, location_dim_id, date, quantity, price

Storage

The example involves daily snapshots of all customers buying a product from a location, and we measure the sizes of the resulting Parquet files. Facts for a whole year are created, comprising 365 days with 100,000 customers each day (36.5 million rows).


Key type        fact_sales (36.5 million rows)  dim_customer (100,000 rows)  dim_products (10,000 rows)  dim_location (100 rows)
--------------  ------------------------------  ---------------------------  --------------------------  -----------------------
Long integers   135 MB                          1 MB                         106 KB                      4 KB
GUID            308 MB                          4 MB                         479 KB                      8 KB
SHA256          612 MB                          7 MB                         715 KB                      11 KB
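For reference, one simple way to read comparable size numbers for Delta tables is DESCRIBE DETAIL, which reports sizeInBytes. This is a sketch, not the notebook's measurement code, and the table names are assumed to follow the query in the next section:

# Sketch: reading Delta table sizes with DESCRIBE DETAIL (table names are assumptions).
for table in ["fact_sales_id", "dim_customer_id", "dim_product_id", "dim_location_id"]:
    size_bytes = spark.sql(f"DESCRIBE DETAIL {table}").first()["sizeInBytes"]
    print(f"{table}: {size_bytes / 1024**2:.1f} MB")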

Query

I have created a query that sums quantity and price, and executed it with dim_id being either an integer, GUID, or SHA256 key. Fortunately, Spark does not optimize away the joins even though no dimension columns are referenced, so the performance differences come mainly from the joins.

SELECT SUM(quantity), SUM(price) FROM fact_sales_id
INNER JOIN dim_customer_id ON customer_dim_id = dim_customer_id.dim_id
INNER JOIN dim_product_id ON product_dim_id = dim_product_id.dim_id
INNER JOIN dim_location_id ON location_dim_id = dim_location_id.dim_id
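A minimal way to time each run could look like the sketch below. It assumes the tables above are registered in the Spark catalog and uses collect() to force full execution:

import time

# Hypothetical timing wrapper around the query shown above.
query = """
SELECT SUM(quantity), SUM(price) FROM fact_sales_id
INNER JOIN dim_customer_id ON customer_dim_id = dim_customer_id.dim_id
INNER JOIN dim_product_id ON product_dim_id = dim_product_id.dim_id
INNER JOIN dim_location_id ON location_dim_id = dim_location_id.dim_id
"""

start = time.perf_counter()
spark.sql(query).collect()  # force the query to run, not just be planned
print(f"Elapsed: {time.perf_counter() - start:.2f} s")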

Results:


Key type        Run 1          Run 2          Run 3
--------------  -------------  -------------  -------------
Long integers   6 sec 392 ms   6 sec 388 ms   4 sec 795 ms
GUID            21 sec 212 ms  21 sec 106 ms  21 sec 76 ms
SHA256          16 sec 765 ms  16 sec 707 ms  16 sec 792 ms

Discussion

As the results show, there is a significant difference in both storage and performance costs. Using technologies like Delta tables, where multiple versions of a table are retained, can increase the storage difference even more. Keeping 2 versions of this fact table with SHA256 keys instead of integer keys would cost around 1 GB more.

The performance of our reporting tool is heavily influenced by the dataset's performance. While we strive to maximise data availability within the reporting environment, using the wrong data type for keys can overload the reporting tools with excessive relational information, leaving less space for the analytical data. Consequently, this diminishes the end-user experience, as even a 10-second delay can have a noticeable impact.

Each type of key has its own purpose, and the decision of which one to use depends on its scalability. While data solutions may function well initially, their true test comes with long-term usage, typically after a year or two in production.


Christian Henrik Reich

Renaissance man @ twoday Kapacity, Renaissance man @ Mugato.com. Focusing on data architecture, ML/AI and backend dev, cloud and on-premise.