Microsoft Fabric: Avoid extra storage layers for temporary data with Spark caching and Fabric notebookutils’ runMultiple
Motivation
When working with data, we sometimes create a base dataset and then want to use it in transformations with other datasets. In a Lakehouse context, I’m often asked where to store this data; should it be in Silver or Gold, or should there be a fourth layer in between?
It’s important to remember that everything in computing has a cost. Storing temporary data on disk for reuse is often the most expensive option, whereas keeping it in memory right after creation is typically the least expensive. Writing data from Spark memory to disk, and then reading it back into Spark memory each time you need it, are both costly operations.
Furthermore, if you store temporary data in Delta Lake, you incur additional overhead from versioning and from maintenance tasks like vacuuming, which are generally unnecessary for purely temporary data.
Introduction to the solution
This solution may remind you of a general Spark optimization strategy I discussed in my previous post, Spark performance: Let cache() or persist() handle your temporary data when possible.
In this case, I’m targeting Microsoft environments, as I’m using the Notebookutils library to leverage some of the advantages Microsoft Fabric provides for managing Spark contexts. I covered this topic in my earlier post, Microsoft Fabric: Utilize Shared SparkSessions fully with mssparkutils.notebook.run and runMultiple, which was written before mssparkutils was renamed to notebookutils.
The solution
The solution leverages the fact that notebookutils.notebook.runMultiple shares the same Spark context, allowing a Spark temporary view to be shared across all notebooks when using runMultiple.
For this example, I created a star schema with fictitious customer data spanning an entire year, using my star builder. The setup includes a computationally expensive base view, which two other queries extend.
The base view
My star builder can also produce a star schema that uses SHA-256 hashes as keys, which is useful when you deliberately want slow queries. This is the BaseView notebook.
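A minimal sketch of what such a BaseView notebook could look like is shown below; the fact_sales and dim_customer tables, the column names, and the aggregation are illustrative assumptions rather than the exact output of my star builder.

```python
# BaseView notebook (sketch): build the expensive base view and cache it eagerly.
spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW base_view AS
    SELECT
        c.customer_key,
        c.customer_name,
        SUM(f.amount) AS total_amount,
        COUNT(*)      AS order_count
    FROM fact_sales AS f
    JOIN dim_customer AS c
        ON f.customer_key = c.customer_key  -- SHA-256 string keys make this join slow
    GROUP BY c.customer_key, c.customer_name
""")

# CACHE TABLE is eager by default, so the view is materialized in memory here
# rather than lazily on first use.
spark.sql("CACHE TABLE base_view")
```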
The cache statement triggers the query execution immediately instead of relying on lazy loading. From a cold start, this process took 43.7 seconds in this trial.
The processing_example_1 and processing_example_2 notebooks
Both notebooks are identical, so I’m only showing one of them here. Note that I don’t reference the BaseView notebook at all, and I don’t repeat the “CREATE OR REPLACE TEMPORARY VIEW base_view …” statement either: re-running it would disrupt the caching, because it replaces the query plan behind base_view.
However, since the processing_example notebooks run in the same Spark context, we can assume base_view already exists.
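A sketch of what processing_example_1 could look like under the same assumptions follows; it only reads from base_view and writes its own result, without ever recreating the view (the filter and the output table name are placeholders).

```python
# processing_example_1 notebook (sketch): reuse the cached base_view created by
# the BaseView notebook earlier in the same Spark session.
result = spark.sql("""
    SELECT
        customer_key,
        customer_name,
        total_amount,
        'example 1' AS example
    FROM base_view
    WHERE total_amount > 1000
""")

# Only the final result is written to the Lakehouse (as a Delta table);
# the intermediate base view never touches disk.
result.write.mode("overwrite").saveAsTable("example_1_result")
```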
The processing_example_2 notebook is identical to processing_example_1, except that it uses “example 2” where the other uses “example 1.”
The orchestration
When using notebookutils.notebook.runMultiple, we can introduce dependencies in the DAG to ensure that the base view is created before it’s used.
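A sketch of such an orchestration notebook is shown below, using the DAG structure that runMultiple accepts (activities with a name, a path, and optional dependencies); the timeout and concurrency values are placeholders. In Fabric notebooks, notebookutils is available without an explicit import.

```python
# Orchestration notebook (sketch): run all three notebooks in one shared
# Spark session. The dependencies ensure BaseView finishes before the
# processing notebooks start; those two can then run in parallel.
dag = {
    "activities": [
        {
            "name": "BaseView",
            "path": "BaseView",
            "timeoutPerCellInSeconds": 600,
        },
        {
            "name": "processing_example_1",
            "path": "processing_example_1",
            "timeoutPerCellInSeconds": 600,
            "dependencies": ["BaseView"],
        },
        {
            "name": "processing_example_2",
            "path": "processing_example_2",
            "timeoutPerCellInSeconds": 600,
            "dependencies": ["BaseView"],
        },
    ],
    "concurrency": 2,
}

notebookutils.notebook.runMultiple(dag)
```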
It’s worth mentioning that both runMultiple and Delta Lake introduce some overhead, adding to the overall runtime. Even so, we can see that BaseView takes the longest, while the two other notebooks that build on it are faster.
Before we conclude, let’s examine where the time was spent.
Times
For the BaseView notebook
The processing notebooks
Conclusion
Looking at the timings, BaseView takes about 38 seconds, while the two processing notebooks take a bit over 9 and 7 seconds, respectively. I also conducted some testing where I wrote the base view to a Delta table and then read it back. Writing took about 1 minute, and in a new session, reading took 38 seconds for each.
When all reading and writing occur in a single session, performance improves slightly, but it’s still about 10 seconds slower than the cached approach, as shown in the last figure below.
This slight improvement is likely because the process uses data that remains in Spark memory, bypassing the need to read from disk. Consequently, the on-disk data isn’t actually utilized, eliminating the need to write it at all. By avoiding unnecessary disk writes, we reduce the number of moving parts, layers, and notebooks to maintain and orchestrate, ultimately saving both cost and time.
As mentioned above, I’ve written a bit more on this topic, if you’re interested: