Microsoft Fabric and Databricks: The low-level challenge of enforcing primary keys, foreign keys, and uniqueness in columns

Christian Henrik Reich
5 min read · May 18, 2024

One of the strongest features of traditional relational databases, regarding data consistency and data quality, is the primary key and the guarantee of its uniqueness. This is also true for foreign keys and unique columns.

While it is possible to declare primary keys in both Microsoft Fabric Warehouse and Databricks with Unity Catalog, these are not enforced. In Microsoft Fabric, we even have to declare the primary key explicitly as NOT ENFORCED.

Example from Databricks
Example from Microsoft Fabric Warehouse. The statement works; the red squiggles are not correct.
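For reference, the declarations look roughly like this. This is a minimal sketch, assuming a Databricks notebook with Unity Catalog; the table and constraint names are hypothetical, and the Fabric statement is shown as a comment for comparison.

# Minimal sketch: an informational (non-enforced) primary key in Databricks
# with Unity Catalog. Table and constraint names are hypothetical.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.customers (
        customer_id BIGINT NOT NULL,
        name        STRING,
        CONSTRAINT customers_pk PRIMARY KEY (customer_id)
    )
""")

# The Microsoft Fabric Warehouse equivalent is T-SQL, where the constraint
# must be declared NOT ENFORCED explicitly:
#   ALTER TABLE dbo.customers
#   ADD CONSTRAINT customers_pk PRIMARY KEY NONCLUSTERED (customer_id) NOT ENFORCED;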

This means there is no guarantee that the primary key column holds unique values. The same is true for columns with unique constraints. While there is no perfect solution to this issue, understanding it might help when designing a solution.

Why does it work for older technologies?

To understand one of the reasons why it is not enforced, we start by looking at why it works in older technologies like SQL Server, Postgres, etc.

Many traditional databases can be viewed as in-memory databases. Often the full database doesn't fit in memory, so data is also stored on disk and is swapped in and out. The art of being a DBA is to arrange the hot data in memory and leave cold data on disk.

Tables are stored row by row on so-called data pages. Pages can be arranged in different ways; the most common are either contiguous, called a heap, or indexed in a B+ tree structure. A table doesn't have to be fully in memory to be searchable.

When searching a heap for a single value, like a primary key, we start at the first record on the first page and read through all pages until the last record on the last page. Search time grows linearly with the size of the table, which is suboptimal.

Heaps are a data structure that is slow for lookups but fast for writes.
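To make the scaling concrete, here is a minimal Python sketch of a heap lookup; the page layout is simplified and hypothetical, but the linear behaviour is the point.

# Minimal sketch of a heap lookup: pages are read in order until the key is
# found, so the worst case touches every record (linear in table size).
def heap_lookup(pages, key):
    # pages: a list of pages; each page is a list of (key, row) tuples
    for page in pages:
        for row_key, row in page:
            if row_key == key:
                return row
    return None  # key not present: every page was read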

When data is indexed in a B+ tree, as it is in many database systems, looking up a value can be done with a few jumps over 3–4 pages, no matter the size of the table. Trees are generally known for fast lookup of values. In the drawing, I have marked the path travelled to retrieve an indexed record through a search for a single indexed value, e.g. a primary key.

B+ trees have 1–2 levels of index pages. They usually have a depth of 3–4 levels, and they grow in width.

When creating a primary key or a unique column, a B+ index tree is created. When inserting into the B+ tree, if the value is already present, a key violation error is raised; otherwise, we insert. When performing a foreign key check, we search the tree as shown in the drawing.
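A minimal Python sketch of the idea, using a sorted list with binary search in place of a real B+ tree; the behaviour on duplicates is the part that matters.

# Minimal sketch of a unique index: not a real B+ tree, but a sorted structure
# with logarithmic search, which is what makes enforcement cheap.
import bisect

class UniqueIndex:
    def __init__(self):
        self.keys = []  # kept sorted, like the keys in index pages

    def insert(self, key):
        pos = bisect.bisect_left(self.keys, key)
        if pos < len(self.keys) and self.keys[pos] == key:
            raise ValueError(f"Key violation: {key!r} already exists")
        self.keys.insert(pos, key)

    def contains(self, key):  # e.g. the lookup behind a foreign key check
        pos = bisect.bisect_left(self.keys, key)
        return pos < len(self.keys) and self.keys[pos] == key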

As pages in memory are neither encoded nor compressed, and access to memory is extremely fast, single lookups can easily be done by the millions in seconds, depending on hardware and setup. There might be a setback if some information from a table is not loaded into memory, but it can be retrieved during processing.

Moving data from memory to the CPU is measured in nanoseconds.
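As rough back-of-the-envelope arithmetic (the latency figure below is an assumed ballpark, not a measurement):

# Rough arithmetic behind "millions of lookups in seconds".
memory_access_s = 100e-9  # ~100 ns per memory access (assumption)
print(f"{1 / memory_access_s:,.0f} lookups/second")  # ~10,000,000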

How does that compare to Microsoft Fabric or Databricks?

When separating compute from storage, data is kept on storage by default, not in memory (though caching can occur).

Using Spark as an example: when a Spark job starts, data on storage is partition pruned, file pruned, and pruned within the Parquet files while being read into Spark's memory, to increase performance. Missing data can't be retrieved during the job (unless you write logic for this). Moving data from storage to memory takes longer than moving data from memory. Data still has to move from memory to the CPU, but the extra step of moving from storage to memory is added.
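A minimal PySpark sketch of that pruning, with a hypothetical path and columns; the filters are what let Spark skip partitions, files, and row groups at read time.

# Minimal sketch of read-time pruning in Spark. Path and columns are
# hypothetical; the filters allow partition, file and row-group skipping.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

orders = (
    spark.read.parquet("/lake/sales/orders")    # assumed partitioned by order_date
    .filter(col("order_date") == "2024-05-18")  # partition pruning
    .filter(col("amount") > 100)                # row-group pruning via Parquet statistics
)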

Data stored as Parquet on storage is encoded and compressed, and must be decompressed and decoded before the values can be read. Within a file, data is stored column by column in a contiguous manner, with somewhat the same properties as a heap when searching for values.
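You can see the encoding and compression directly in the file metadata. A minimal sketch with pyarrow, assuming a Parquet file at a hypothetical path:

# Minimal sketch: inspecting a Parquet file's metadata with pyarrow. The file
# path is hypothetical. Reading values requires decompressing and decoding.
import pyarrow.parquet as pq

pf = pq.ParquetFile("/lake/sales/orders/part-00000.parquet")
chunk = pf.metadata.row_group(0).column(0)
print(chunk.compression)      # e.g. SNAPPY
print(chunk.encodings)        # e.g. ('PLAIN', 'RLE')
table = pf.read_row_group(0)  # decompresses and decodes the whole row group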

Storage-to-memory latency is usually around 1 ms, significantly slower than memory to CPU.
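Comparing the two paths with the same back-of-the-envelope numbers as above:

# Rough comparison of the two paths, using assumed ballpark latencies.
memory_access_s  = 100e-9  # ~100 ns, memory to CPU (assumption)
storage_access_s = 1e-3    # ~1 ms, storage to memory (from the text)
print(f"storage is ~{storage_access_s / memory_access_s:,.0f}x slower")  # ~10,000x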

When searching for one or a few records, the experience is acceptable. Now, let’s imagine we have enforcement of primary and foreign keys and want to insert a batch of 100,000 records into a table.

Spark can only process what is in memory during a job. So the table receiving the primary key inserts needs to be loaded into memory, plus the referenced foreign key tables. Because Spark does not load data into a search-optimised structure like a tree, the full table has to be checked for each record to be inserted, in this case, 100,000 times. The same applies to each foreign key check. Let's say the target table holds 1,000,000 records. That is a lot of single searches and table scans.
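In practice one would express the checks set-based, as joins, rather than record by record, but the cost remains: every check is a pass over full tables. A minimal PySpark sketch with hypothetical table and column names:

# Minimal sketch of hand-rolled key enforcement for a batch insert. Table and
# column names are hypothetical. Every check scans a full table.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count

spark = SparkSession.builder.getOrCreate()

batch     = spark.read.parquet("/staging/orders_batch")  # the 100,000 new records
target    = spark.table("sales.orders")                  # the 1,000,000-record target
customers = spark.table("sales.customers")               # referenced FK table

# Primary key: duplicates within the batch itself ...
dupes = batch.groupBy("order_id").agg(count("*").alias("n")).filter(col("n") > 1)

# ... and collisions with keys already in the target table.
collisions = batch.join(target.select("order_id"), "order_id", "inner")

# Foreign key: batch rows whose customer_id does not exist in customers.
orphans = batch.join(customers.select("customer_id"), "customer_id", "left_anti")

if dupes.isEmpty() and collisions.isEmpty() and orphans.isEmpty():
    batch.write.mode("append").saveAsTable("sales.orders")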

To sum up, loading all the tables introduces an initial latency. Decoding and decompression further contribute to this latency. Looking up values in a linear data structure scales poorly, giving growing latency and significantly worse performance than a B+ tree, especially as the size of the table increases.

So, one of the reasons (there might be more) there is no key and uniqueness enforcement in Microsoft Fabric and Databricks is that the same design that gives the flexibility in storage and compute, as well as the performance for aggregation and analytical processing, also gives the worst search times for single values in data. That is the compromise, and probably one of the reasons why we are still waiting.

Discussion

Many might not find this an issue. Many texts on data warehousing recommend not having constraints on fact tables, to increase load performance. Integrity can be ensured by policies around loading tables. I have also seen people conduct integrity testing, such as using a GROUP BY count. If there is a strong need for enforced keys, one could consider whether a traditional database might be a better solution.
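The GROUP BY test looks like this; a minimal sketch against a hypothetical table, where any returned row is a violated key:

# Minimal sketch of the GROUP BY integrity test. Table and column names are
# hypothetical; any row returned is a duplicated "primary key" value.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

duplicates = spark.sql("""
    SELECT order_id, COUNT(*) AS n
    FROM sales.orders
    GROUP BY order_id
    HAVING COUNT(*) > 1
""")
duplicates.show()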

Still, some processing engine optimisers can utilise this information, so it is not bad practice to define the keys.
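Databricks is one example: a primary key can be marked with RELY, telling the optimiser it may trust the (unenforced) constraint for query rewrites. A minimal sketch with hypothetical names; verify RELY support for your runtime version:

# Minimal sketch: marking an informational primary key with RELY in
# Databricks, so the optimiser may trust it. Names are hypothetical, and
# RELY support depends on the runtime version.
spark.sql("""
    ALTER TABLE sales.orders
    ADD CONSTRAINT orders_pk PRIMARY KEY (order_id) RELY
""")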


Christian Henrik Reich

Renaissance man @ twoday Kapacity, Renaissance man @ Mugato.com. Focusing on data architecture, ML/AI and backend dev, cloud and on-premise.