Delta and Parquet: Integer, GUID/UUID or SHA256 as ID?

Christian Henrik Reich
4 min read · May 6, 2024

Motivation

Last week I published the post Microsoft Fabric: Building Pseudo Identity Columns Without monotonically_increasing_id() in Spark. Although that post was about creating a pseudo-identity column with integers, as commonly seen in relational databases, I received some feedback regarding the use of hashes as ID columns. I will address that in this post.

In general, while a GUID business key coming from the OLTP system is acceptable, I try to avoid GUIDs and hashes as IDs when it comes to OLAP.

The reason for this is that using integers is more cost-effective. A long integer is 8 bytes, a GUID is 16 bytes, an MD5 hash is 16 bytes, and a SHA256 hash is 32 bytes.

Fact tables can easily reach billions of rows and often contain multiple dimension keys. For example, consider a fact table with 1 billion rows and 4 keys to dimensions: 4 long integer keys would use 32 GB of raw key data, 4 GUID or MD5 keys would use 64 GB, and 4 SHA256 keys would use 128 GB.
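To make the arithmetic explicit, here is a small back-of-the-envelope calculation in plain Python (not from the notebook), assuming the 1-billion-row fact table above:

ROWS = 1_000_000_000  # 1 billion fact rows (assumed for the example)
KEYS = 4              # dimension keys per row

key_sizes = {"long integer": 8, "GUID / MD5": 16, "SHA256": 32}  # bytes per key value

for name, size in key_sizes.items():
    total_gb = ROWS * KEYS * size / 10**9
    print(f"{name}: {total_gb:.0f} GB of raw key data")
# long integer: 32 GB, GUID / MD5: 64 GB, SHA256: 128 GB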

The sizes on disk do not scale quite as linearly when using Delta Lake and Parquet, due to encoding and compression. To provide better insight, I have created a fictitious example for measurement purposes.

Measuring the differences

I have created three star schemas, each with one fact table and three dimensions. The difference lies in their use of either integers, GUIDs, or SHA256 as keys. GUIDs and MD5 both use 16 bytes, so their results should be the same and therefore MD5 is not included.

The dimensions are not as wide as they normally would be. This helps keep the focus on the keys. There is no Slowly Changing Dimension Type 2 (SCD2) implemented.

The code for creating and measuring the tables can be found here: https://github.com/ChristianHenrikReich/BlogNoteBooks/blob/main/IntegersGuidAndHashes/StarBuilder.ipynb

The notebook is designed to run on Microsoft Fabric, but with minor adjustments, it can also run on Databricks. I have utilized an ordinary Starter Pool. If you run the test, you may not get the exact same numbers, but you should observe similar differences.
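As a rough illustration of how the three key variants can be produced in Spark, the snippet below derives an integer, a GUID and a SHA256 key for a dimension. It is a minimal sketch, not the notebook's code, and it assumes a DataFrame df with a business key column named bk:

from pyspark.sql import functions as F, Window

# Sketch only: `df` and its business key column `bk` are assumed names.
w = Window.orderBy("bk")

keyed = (
    df
    .withColumn("dim_id_int", F.row_number().over(w).cast("long"))         # pseudo-identity integer key
    .withColumn("dim_id_guid", F.expr("uuid()"))                           # GUID/UUID key (string in Spark)
    .withColumn("dim_id_sha256", F.sha2(F.col("bk").cast("string"), 256))  # SHA256 key (hex string in Spark)
)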



Table name      Row count     Columns
--------------  ------------  ------------------------------------------------------------------------
dim_customers   100,000       dim_id, dim_type, name, age, gender, city
dim_products    10,000        dim_id, dim_type, name, category, color, size
dim_locations   100           dim_id, dim_type, country, state, city
fact_sales      36.5 million  product_dim_id, customer_dim_id, location_dim_id, date, quantity, price

Storage

The example involves daily snapshots of all customers buying a product from a location, and we measure the sizes of the resulting Parquet files. Facts for a whole year are created, comprising 365 days with 100,000 customers each day (36.5 million rows).


Key type        fact_sales (36.5 million rows)  dim_customer (100,000 rows)  dim_products (10,000 rows)  dim_location (100 rows)
--------------  ------------------------------  ---------------------------  --------------------------  -----------------------
Long integers   135 MB                          1 MB                         106 KB                      4 KB
GUID            308 MB                          4 MB                         479 KB                      8 KB
SHA256          612 MB                          7 MB                         715 KB                      11 KB
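For reference, one simple way to read comparable size numbers for Delta tables is DESCRIBE DETAIL, which reports sizeInBytes. This is a sketch, not the notebook's measurement code, and the table names are assumed to follow the query in the next section:

# Sketch: reading Delta table sizes with DESCRIBE DETAIL (table names are assumptions).
for table in ["fact_sales_id", "dim_customer_id", "dim_product_id", "dim_location_id"]:
    size_bytes = spark.sql(f"DESCRIBE DETAIL {table}").first()["sizeInBytes"]
    print(f"{table}: {size_bytes / 1024**2:.1f} MB")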

Query

I have created a query that sums quantity and price, and executed it with dim_id being either an integer, GUID, or SHA256 key. Fortunately, Spark does not optimize away the joins even though no dimension columns are referenced, so the performance differences come mainly from the joins.

SELECT SUM(quantity), SUM(price) FROM fact_sales_id
INNER JOIN dim_customer_id ON customer_dim_id = dim_customer_id.dim_id
INNER JOIN dim_product_id ON product_dim_id = dim_product_id.dim_id
INNER JOIN dim_location_id ON location_dim_id = dim_location_id.dim_id
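A minimal way to time each run could look like the sketch below. It assumes the tables above are registered in the Spark catalog and uses collect() to force full execution:

import time

# Hypothetical timing wrapper around the query shown above.
query = """
SELECT SUM(quantity), SUM(price) FROM fact_sales_id
INNER JOIN dim_customer_id ON customer_dim_id = dim_customer_id.dim_id
INNER JOIN dim_product_id ON product_dim_id = dim_product_id.dim_id
INNER JOIN dim_location_id ON location_dim_id = dim_location_id.dim_id
"""

start = time.perf_counter()
spark.sql(query).collect()  # force the query to run, not just be planned
print(f"Elapsed: {time.perf_counter() - start:.2f} s")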

Results:


Key type        Run 1          Run 2          Run 3
--------------  -------------  -------------  -------------
Long integers   6 sec 392 ms   6 sec 388 ms   4 sec 795 ms
GUID            21 sec 212 ms  21 sec 106 ms  21 sec 76 ms
SHA256          16 sec 765 ms  16 sec 707 ms  16 sec 792 ms

Discussion

As the results show, there is a significant difference in both storage and performance costs. Using technologies like Delta tables, where multiple versions of a table are retained, can increase the storage difference even more. Keeping 2 versions of this fact table with SHA256 keys instead of integer keys would cost around 1 GB more.

The performance of our reporting tool is heavily influenced by the dataset's performance. While we strive to maximise data availability within the reporting environment, using the wrong data type for keys can overload the reporting tools with excessive relational information, leaving less space for the analytical data. Consequently, this diminishes the end-user experience, as even a 10-second delay can have a noticeable impact.

Each type of key has its own purpose, and the decision of which one to use depends on its scalability. While data solutions may function well initially, their true test comes with long-term usage, typically after a year or two in production.


Christian Henrik Reich

Renaissance man @ twoday Kapacity, Renaissance man @ Mugato.com. Focusing on data architecture, ML/AI and backend dev, cloud and on-premise.