Microsoft Fabric: Utilize Shared SparkSessions fully with mssparkutils.notebook.run and runMultiple

Christian Henrik Reich
4 min read · Jun 6, 2024


Introduction

Microsoft Fabric, and also Azure Synapse, has a toolset called mssparkutils, with utility functions to help with common tasks. Two of these functions, run and runMultiple, are powerful for running Spark notebooks, either one at a time or several in parallel.
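
For reference, a minimal sketch of how the two calls can look in a Fabric notebook; the notebook names and the 300-second timeout are placeholders of my own:

# mssparkutils is pre-loaded in Fabric notebooks.
# run() executes a single notebook synchronously and returns its exit value.
result = mssparkutils.notebook.run("Child_notebook", 300)

# runMultiple() executes a list of notebooks in parallel
# (it also accepts a DAG definition for dependencies between them).
mssparkutils.notebook.runMultiple(["Child_notebook_1", "Child_notebook_2"])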

Each function runs the notebooks encapsulated, so, for example, Python code in one notebook doesn’t interfere with code in another notebook. Due to the nature of Spark, it is still possible for these notebooks to collide. If we are unaware of this, it can lead to problems, but when done right, we can have a smooth transfer of data between notebooks without ever initiating a save to a table or disk ourselves.

To show how to utilize shared SparkSessions in full, I have put together the following example.

The two notebooks

To start, I have created two notebooks and no Lakehouses in Microsoft Fabric. Besides a small Python snippet that prints the Spark app name, I am using SQL, as I think it emphasizes my point better. It works the same way in Python.

In the Define_view notebook, I define a temporary view, which selects a string constant saying, "Hello, I am from the notebook Define_view."
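
Roughly, Define_view might contain something like the following; the view name greetings is a placeholder of my own, and I use spark.sql here in place of the notebook's SQL cell:

# Define_view notebook (sketch)
# Print the name of the Spark application this notebook runs in.
print(spark.sparkContext.appName)

# Register a temporary view in the session; no Lakehouse is involved.
spark.sql("""
CREATE OR REPLACE TEMP VIEW greetings AS
SELECT 'Hello, I am from the notebook Define_view' AS message
""")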

In the Read_view notebook, I have a SELECT * FROM the temporary view defined in the Define_view notebook. When the two notebooks are run on their own, we can create the temporary view, but we can't read it. Also, the error message we get is misleading, and we will get back to that.
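
And Read_view, sketched the same way with the same placeholder view name:

# Read_view notebook (sketch)
print(spark.sparkContext.appName)

# Read the temporary view that Define_view registered.
spark.sql("SELECT * FROM greetings").show()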

The problem is shown by the AppName printout. We can see that they are two different Spark apps, meaning the notebooks are not in the same SparkSession. These two notebooks are running completely isolated from each other, sharing nothing, so the failure is expected.

Let’s add a top-level notebook with mssparkutils.notebook.run

When running the notebooks from a top-level notebook, we don’t get errors.
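
A minimal sketch of what that top-level notebook can look like; the 300-second timeout is again an arbitrary choice:

# Top-level notebook (sketch)
# Both child notebooks execute inside this notebook's SparkSession.
mssparkutils.notebook.run("Define_view", 300)
mssparkutils.notebook.run("Read_view", 300)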

Take note of the AppName when diving into Define_view and Read_view by pressing the blue links.

The view is created, and we are also able to read from it. We don’t even need a Lakehouse, despite what the error message told us before.

Why is it working?

The observant reader might have guessed it has something to do with the fact that we are running under the same AppName and SparkSession. But why is it working?

Spark SQL, which is a module built on top of Spark Core, has an optimizer called Catalyst. Spark SQL also has another component, used by Catalyst, called the Catalog. The Catalog is not Unity Catalog, but a metadata store for all tables and views known to the SparkSession. Known tables can be Hive tables, Unity Catalog tables, etc.

Image taken from Databricks homepage: https://www.databricks.com/glossary/catalyst-optimizer

When a query comes in, Spark SQL checks the columns and tables against the Catalog to resolve where to get the data. When the temporary view is created, it is registered in the Catalog. The Catalog doesn’t care whether it holds metadata from a Lakehouse or not; as long as the metadata can be found, all is good.
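
We can inspect this ourselves with the standard Spark catalog API; a small sketch, assuming it is run in the shared session after Define_view has registered its view:

# List everything the Catalog currently knows about in this SparkSession.
# A temporary view shows up with isTemporary set to True.
for table in spark.catalog.listTables():
    print(table.name, table.tableType, table.isTemporary)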

To sum up, it works because the top-level notebook with the mssparkutils.notebook.run statements provides the same SparkSession to the Define_view and Read_view notebooks. Define_view registers the temporary view in the Catalog associated with the SparkSession, and Read_view uses this information from the same Catalog to run its query.

Discussion

It feels like a special feature of Microsoft Fabric (side effect might be the better name), though it is pure Spark behavior. It can function as a sort of shared view, and we can use it to reference data across otherwise isolated notebooks. Yet it can also cause problems if we don’t pay attention to how it works.

Imagine having a top-level notebook running several notebooks. If more than one of these notebooks uses the same temporary view name, there can be name collisions, with all sorts of unintended effects.
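
One way to guard against this is to make the view names unique per notebook, or to check the Catalog before creating a view. A sketch, assuming Spark 3.3 or later for spark.catalog.tableExists; the view name is again a placeholder of my own:

# Guard against clobbering a view another notebook already registered
# in the shared SparkSession.
view_name = "greetings_define_view"  # notebook-specific name to avoid collisions

if spark.catalog.tableExists(view_name):
    raise ValueError(f"'{view_name}' is already registered in this SparkSession")

spark.sql(f"""
CREATE TEMP VIEW {view_name} AS
SELECT 'Hello, I am from the notebook Define_view' AS message
""")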


Written by Christian Henrik Reich

Renaissance man @ twoday Kapacity, Renaissance man @ Mugato.com. Focusing on data architecture, ML/AI and backend dev, cloud and on-premise.
