Microsoft Fabric: Accessing CDM data (from Dataverse) from Spark
Motivation
We needed to read and interpret CDM (Common Data Model) data with Spark. In our case, the data was exported from Dataverse by Power Apps to an Azure Data Lake Storage Gen2 account.
Microsoft provides a Spark CDM connector (https://github.com/Azure/spark-cdm-connector), and Fabric Shortcuts make it easy to connect to and access the files in the Azure Data Lake Storage Gen2 account.
Still, we can’t read the CDM data.
Accessing OneLake and its shortcuts requires a Microsoft Entra ID identity, whereas the Spark CDM connector uses Managed Identity by default, which only works on Databricks or Synapse. To add to the challenges, at the time of writing there seems to be an issue with using custom JARs in Microsoft Fabric, and the connector is not part of Microsoft Fabric Runtime 1.2.
Luckily, although it is not listed in the documentation (Default level packages for Java/Scala libraries), the Spark CDM connector is part of Microsoft Fabric Runtime 1.1.
The solution
To sum up, we are going to do the following:
- We do not create a shortcut. Instead, we create a SAS token for the Azure Data Lake where the CDM data is stored.
- We create a Fabric Environment with Fabric Runtime 1.1.
- We create a Fabric Notebook and attach it to the environment from the previous step.
- We read the data.
Creating the SAS
Navigate to the storage account in the Azure Portal, find Shared access signature, and generate a SAS token there.
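Before pasting a freshly generated token into a notebook, it can help to check what it actually grants. The sketch below parses the standard sp (permissions) and se (expiry) fields of a SAS token with the Python standard library. The function name describe_sas is hypothetical, the sample token is fake, and the idea that the connector needs read and list permissions is an assumption rather than something confirmed in this article.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs


def describe_sas(sas_token, now=None):
    """Return permission and expiry info parsed from a SAS token string."""
    now = now or datetime.now(timezone.utc)
    # Drop a leading "?" or "&" so the first key parses cleanly.
    fields = parse_qs(sas_token.lstrip("?&"))
    permissions = fields.get("sp", [""])[0]
    expiry_text = fields.get("se", [""])[0]
    expiry = None
    if expiry_text:
        # Expiry is an ISO 8601 UTC time such as 2030-01-01T00:00:00Z.
        expiry = datetime.fromisoformat(expiry_text.replace("Z", "+00:00"))
        if expiry.tzinfo is None:
            expiry = expiry.replace(tzinfo=timezone.utc)
    return {
        "can_read": "r" in permissions,   # assumed to be needed for reading files
        "can_list": "l" in permissions,   # assumed to be needed for listing folders
        "expired": expiry is not None and expiry <= now,
        "expiry": expiry,
    }


# Fake token, for illustration only.
sample = "sv=2022-11-02&ss=b&srt=sco&sp=rl&se=2030-01-01T00%3A00%3A00Z&sig=FAKE"
print(describe_sas(sample))
```

The check only reads the token text; it does not contact the storage account, so it cannot tell whether the signature itself is valid.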
Creating the Fabric Environment
Make sure the environment is set to Runtime 1.1, and remember to Save and Publish.
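Once a notebook is attached to the environment, a quick look at the Spark version can confirm which runtime the session is really using. This is a minimal sketch, assuming Runtime 1.1 is built on Spark 3.3 and Runtime 1.2 on Spark 3.4; the helper name fabric_runtime_for is hypothetical.

```python
def fabric_runtime_for(spark_version):
    """Map a Spark version string to the Fabric runtime assumed to ship it."""
    major_minor = ".".join(spark_version.split(".")[:2])
    # Assumption: Runtime 1.1 uses Spark 3.3 and Runtime 1.2 uses Spark 3.4.
    return {"3.3": "1.1", "3.4": "1.2"}.get(major_minor, "unknown")


# In a Fabric notebook the session object `spark` is predefined, so the
# check would be: print(fabric_runtime_for(spark.version))
print(fabric_runtime_for("3.3.1"))
```

If the result is not 1.1, the notebook is probably still attached to the workspace default rather than to the published environment.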
Create a Notebook in the Workspace and attach it to the Fabric environment
NB! When you insert the SAS token into the code, it must not start with &.
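Trimming the token in code avoids having to remember this rule by hand. The helper below is a small sketch with a hypothetical name, clean_sas_token; it removes surrounding whitespace and a leading &, and it also drops a leading ?, on the assumption that a token copied with that prefix needs the same treatment.

```python
def clean_sas_token(raw_token):
    """Strip whitespace and any leading '?' or '&' from a SAS token."""
    return raw_token.strip().lstrip("?&")


# Fake token, for illustration only.
print(clean_sas_token("&sv=2022-11-02&sp=rl&sig=FAKE"))
```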
Read the Data
Use the following code; this example is written in PySpark.
# Storage account endpoint and container that hold the exported CDM data
storageAccountName = "<azure_storage_name_where_cdm_data_is_stored>.dfs.core.windows.net"
container = "<container_name_where_cdm_data_is_stored>"

# Read the "account" entity described by model.json through the Spark CDM connector
df = (spark.read.format("com.microsoft.cdm")
    .option("storage", storageAccountName)
    .option("sasToken", "<insert_your_SAS_token_without_&>")
    .option("manifestPath", container + "/model.json")
    .option("entity", "account")
    .load())

display(df)
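A Dataverse export usually holds more than one entity, so the read above can be wrapped in a small helper that loads several entities with the same options. This is a sketch under assumptions: the function names cdm_options and read_entities are hypothetical, and "contact" is only an example of a second entity that may or may not exist in your export. The connector options themselves are the ones used in the code above.

```python
def cdm_options(storage_account, container, entity, sas_token):
    """Build the connector options for reading one CDM entity."""
    return {
        "storage": f"{storage_account}.dfs.core.windows.net",
        "sasToken": sas_token,
        "manifestPath": f"{container}/model.json",
        "entity": entity,
    }


def read_entities(spark, storage_account, container, entities, sas_token):
    """Load each named entity into its own DataFrame, keyed by entity name."""
    return {
        name: spark.read.format("com.microsoft.cdm")
        .options(**cdm_options(storage_account, container, name, sas_token))
        .load()
        for name in entities
    }


# In a Fabric notebook, where `spark` is predefined:
# frames = read_entities(spark, "<storage_name>", "<container>",
#                        ["account", "contact"], "<SAS_token>")
# display(frames["account"])
```

Keeping the option building separate from the Spark call makes it possible to inspect the values being sent to the connector without starting a read.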