Microsoft Fabric: Diving into Lakehouse access from local machines and other remotes with Delta-RS

Christian Henrik Reich
6 min read · Sep 15, 2024


Introduction

Last week, I had the pleasure of a visit from Juste Stumbryte and Rasmus Bjerregaard Christensen, who were working on a case where consumers needed to access a Fabric Lakehouse from a local machine. This can be done with a few lines of Python, though there are some considerations regarding access rights, so we did some testing on this.

There can be many cases like this. Not every analyst or researcher in an organization needs Semantic Models or Power BI. Additionally, many work with smaller datasets, making Apache Spark a heavy and less favorable option compared to tools like Pandas, Polars, or DuckDB. To be honest, when not using Spark in Microsoft Fabric, working locally from Visual Studio Code or a similar environment feels like a better experience for me.

For this blog, I created some users and a dedicated workspace, re-testing what we did during the visit. Interestingly, my tests showed slightly different results this time, and for the better. I take this as a sign that Microsoft Fabric is evolving positively in this area.

TL;DR: There’s a ‘What did we learn?’ section further down.

Creating example data

I have a Microsoft Fabric notebook, create_fictive_customers, that creates a Delta table of 50,000 fictive customers. The notebook is hardcoded to 50,000 customers and works against the default Lakehouse.
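
For reference, below is a minimal sketch of what such a notebook could look like. This is my reconstruction, not the actual notebook; it assumes a default Lakehouse is attached and uses the spark session that Fabric notebooks provide.

from pyspark.sql import functions as F

NUM_CUSTOMERS = 50_000  # the notebook is hardcoded to this count

# Generate fictive customers with a running id, a name, and a signup date
df = (
    spark.range(NUM_CUSTOMERS)
    .withColumnRenamed("id", "customer_id")
    .withColumn("name", F.concat(F.lit("Customer "), F.col("customer_id").cast("string")))
    .withColumn("signup_date", F.current_date())
)

# Save as the Delta table Tables/customer in the default Lakehouse
df.write.format("delta").mode("overwrite").saveAsTable("customer")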

NB! I’m using the Gold layer. Since we are in a data-serving context, our consumers should only be able to read from the serving layer, which in this case is Gold.

I have a user named Fabric Explorer that I always use for testing Microsoft Fabric. Fabric Explorer does not have an Azure Subscription but is registered as a non-admin user in my Microsoft Entra ID tenant.

Accessing a Microsoft Fabric Lakehouse

Logging into Microsoft Fabric from a local machine

To access a Microsoft Fabric Lakehouse, we need to log into the Microsoft Entra ID tenant where Microsoft Fabric is located. For this, we need the Azure CLI (Command Line Interface). I have tested using the Visual Studio Code Azure plugin, but so far I have only had success with the Azure CLI.

It is not necessary for a Microsoft Fabric user to have an Azure subscription, so it makes sense to always assume they don’t have one when logging in to avoid login errors. From PowerShell or any other shell, type:

az login --allow-no-subscriptions

In some cases, when the user has access to multiple tenants, it might be necessary to specify which tenant to log into.

az login --allow-no-subscriptions --tenant <insert_tenant_guid_here>

Example of logging into Microsoft Entra ID. My user does not have an Azure subscription, so the subscription ID apparently becomes the tenant ID.
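
To verify which tenant the CLI is logged into, ‘az account show’ prints the active account; the tenantId field should match the tenant hosting Microsoft Fabric:

az account show --query tenantId --output tsv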

The testing script

The script is quite simple but has a few dependencies that must be installed:

pip install pandas deltalake azure-identity

I run the script from Visual Studio Code, but I assume that a locally installed Jupyter notebook or similar would also work. Services such as Google Colab won’t work, as they are remote services without access to the Azure credentials from the ‘az login’ command.
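
A note on credentials: DefaultAzureCredential, used in the script below, tries a chain of credential sources and falls back to the Azure CLI login. If you want to pin the script to the ‘az login’ session explicitly, azure-identity also offers AzureCliCredential, which could be swapped in as an alternative:

from azure.identity import AzureCliCredential

# Takes the token directly from the 'az login' session instead of the full credential chain
delta_token = AzureCliCredential().get_token("https://storage.azure.com/.default").token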

from deltalake import write_deltalake, DeltaTable
from azure.identity import DefaultAzureCredential

# Use the login credentials provided by 'az login' to get a token
# for the OneLake/Azure Storage endpoint
delta_token = DefaultAzureCredential().get_token("https://storage.azure.com/.default").token
storage_options = {"bearer_token": delta_token, "use_fabric_endpoint": "true"}

# Points to the Delta table. You must insert your own path here.
DELTA_TABLE_PATH: str = 'abfss://exploring_local_compute@onelake.dfs.fabric.microsoft.com/Gold.Lakehouse/Tables/customer'

# Open the Delta table and read it into a Pandas dataframe
delta_table = DeltaTable(DELTA_TABLE_PATH, storage_options=storage_options)
df = delta_table.to_pandas()

# Write the Pandas dataframe back as a new Delta table
write_deltalake(f"{DELTA_TABLE_PATH}_from_local", df, storage_options=storage_options)

print("Done")
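
Since the table is just a Delta table in OneLake, the same storage options work with other engines as well. As an example, Polars, which reads Delta through the same deltalake package under the hood, can read the table like this (assuming pip install polars; DELTA_TABLE_PATH and storage_options are reused from the script above):

import polars as pl

# Read the same Delta table into a Polars dataframe
df_pl = pl.read_delta(DELTA_TABLE_PATH, storage_options=storage_options)
print(df_pl.head())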

What did we learn?

Before diving into the learnings, a couple of things are worth noting.

One of the challenges with testing access is that permission changes in Microsoft Fabric can take effect very slowly, though sometimes they are applied instantly. This delay is less apparent when using the portal than when accessing data with Python from a local machine. The lag makes testing time-consuming and can lead to a wrong understanding of what is actually happening. Although the behavior might seem random, given time to take effect, the permissions are quite consistent.

Additionally, be aware that, currently, Workspace access grants access to all resources within the workspace.

There is also endpoint access, which allows for more granular control by granting direct access to Lakehouses without requiring membership of the workspace.

Here are the insights:

A minimum of F2 Fabric capacity is still required to access a Lakehouse

As working from a local machine doesn’t require compute resources in Microsoft Fabric, there is no issue with developers blocking each other’s sessions. The SKU size can be small if we only need to serve data.

Example of accessing a Lakehouse with no SKU running behind the workspace

Revoking or downgrading user access can have a significant lag

This can be problematic, as we typically want access restrictions to take effect instantly to secure our data. Microsoft Fabric does warn that changes can take up to 2 hours to take effect, but this delay must be taken into consideration when serving data this way.

During testing, it was possible for a user to read data even after their workspace and endpoint access had been removed. To verify this and rule out caching on the local machine, a new table was created after all access had been removed, and the user was still able to read the new table. Only after a while did the user lose access to the data.
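
To make this kind of testing repeatable, the check can be scripted. Below is a minimal sketch of such a test harness (my own, not part of the original script), which polls the table until access is denied. The table path is a placeholder to fill in:

import time
from datetime import datetime
from azure.identity import DefaultAzureCredential
from deltalake import DeltaTable

DELTA_TABLE_PATH = "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<lakehouse>/Tables/<table>"
credential = DefaultAzureCredential()

while True:
    # A fresh token each round, so an expired token is not mistaken for revoked access
    token = credential.get_token("https://storage.azure.com/.default").token
    storage_options = {"bearer_token": token, "use_fabric_endpoint": "true"}
    try:
        DeltaTable(DELTA_TABLE_PATH, storage_options=storage_options)
        print(f"{datetime.now().isoformat()} - still readable")
    except Exception as err:
        print(f"{datetime.now().isoformat()} - access denied: {err}")
        break
    time.sleep(60)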

Restarting the Fabric capacity in the Azure portal seems to reduce the lag.
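
For reference, suspending and resuming the capacity can also be done from the Azure CLI through the generic invoke-action command. This is a sketch based on the suspend and resume actions exposed by the Fabric capacity resource; the resource group and capacity name are placeholders:

az resource invoke-action --action suspend --resource-group <rg> --name <capacity_name> --resource-type "Microsoft.Fabric/capacities"
az resource invoke-action --action resume --resource-group <rg> --name <capacity_name> --resource-type "Microsoft.Fabric/capacities"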

Furthermore, users should ideally access resources through Microsoft Entra ID groups.

Access via Microsoft Entra ID groups doesn’t seem to work

Groups could be a way to secure instant denial of data access: if users are assigned to a group, and the group is then assigned to a workspace or to endpoints, removing a user from the group might immediately revoke their access to data.

However, despite a good effort, access to data via groups never worked as expected.

Creating a Delta table in a Lakehouse without permissions creates an artifact in the Lakehouse

Currently, Workspace access above ‘Viewer’ gives permission to write.

Granting any type of endpoint access gives access to read, but not to write.

A few weeks ago, this approach specifically didn’t seem to work with the testing code, but it works now. However, there can still be strange artifacts in the Lakehouse when attempting to write without permission.

It’s also worth noting that SQL endpoints and Workspace viewer permissions seem to experience a lag as well.

Wrapping up

Although it might sound a bit quirky, it actually works quite well once running.

Granting endpoint access without workspace access is, in my book, the right way. I’m still cautious about the problems around instantly denying access to data, and I hope group access will work soon. Still, over the few weeks since Juste and Rasmus visited, there have been changes in this area of Microsoft Fabric, so hopefully we’ll get there soon.

Written by Christian Henrik Reich

Renaissance man @ twoday Kapacity, Renaissance man @ Mugato.com. Focusing on data architecture, ML/AI and backend dev, cloud and on-premise.
