Accessing Delta Lakes on Azure Data Lake with ordinary Python
The concept of Delta Lakes/Data Lake Houses is excellent. When implemented correctly, it lets us access the same data with various data engines. By avoiding data movement, we save both money and time. In short, the idea is to bring the compute to the data.
Typically, these compute engines are designed to handle massive amounts of data. However, it is not uncommon to find yourself processing very small files, and at times it seems excessive to spin up a cluster just to answer a simple, trivial question against a small dataset stored in a Delta Lake. Serverless services can mitigate this. In certain cases, if governance permits, the most convenient approach is for your workstation to read directly from the Delta Lake.
Lately it has caught my attention that Delta.io (the organisation behind Delta Lake) has released a Python library based on delta-rs, a Rust library for Delta Lakes. It can access Delta Lakes without the need for clusters; all it requires is a Python installation. In itself it is very easy to use, but I have only seen examples pointing to local storage. It makes the most sense to use if you can access a Data Lake House, and in my case those are usually on an Azure Data Lake.
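To give a feel for the API, here is a minimal sketch of the local-storage case (the path is just a placeholder):

from deltalake import DeltaTable
#Open a Delta table on the local filesystem and load it into a Pandas DataFrame
dt = DeltaTable("./data/my_delta_table")
df_local = dt.to_pandas()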
An example of accessing a Delta Lake on an Azure Data Lake
- We are going to read some data with Pandas from the Azure Data Lake. This is not the primary point of the example; it is just to get some data to save to Delta. I used the NYC Taxi dataset, which I happened to have on a data lake as parquet, but any dataset will work.
- We are going to save the data into a Delta Lake on Azure Data Lake.
- We are going to read the saved data from the Delta Lake on Azure Data Lake.
In this specific example, you’ll need an Azure Service Principal. Ideally, you should be able to authenticate with the default credentials from an ‘az login’, but unfortunately that part currently seems to have some bugs in the Delta Lake library. So, for now, the only way to access the data lake with the Delta library is by using an Azure Service Principal. But hey, the good news is that it works perfectly fine with Pandas!
First, the dependencies:
pip install deltalake
pip install pandas
pip install fsspec
pip install adlfs
fsspec and adlfs are soft requirements of Delta Lake and Pandas; they handle the connection to the Azure Data Lake.
import pandas as pd
import pyarrow as pa
from deltalake.writer import write_deltalake
from deltalake import DeltaTable
#We are configuring our storage options with the Azure Service Principal's credentials.
#Remember to assign the Storage Blob Data Contributor role on the data lake to the principal
storage_options = {'tenant_id': '<tenant_id>', 'client_id': '<client_id>', 'client_secret': '<client_secret>'}
#Unfortunately the Delta Lake lib has a bug; otherwise, by setting anon to False, we could have used the default credentials from a personal account and 'az login'.
#When this is fixed and you want to use it, remember the Storage Blob Data Contributor role.
#storage_options = {'anon': False}
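#As an alternative sketch (not from the original example), the service principal's credentials could be read
#from environment variables instead of being hard-coded. The variable names below are just an assumption:
#import os
#storage_options = {'tenant_id': os.environ['AZURE_TENANT_ID'],
#                   'client_id': os.environ['AZURE_CLIENT_ID'],
#                   'client_secret': os.environ['AZURE_CLIENT_SECRET']}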
#Getting some example data with Pandas. When using Pandas with Delta Lake, you may find that some of the original Pandas
#data types are not compatible with parquet. Setting dtype_backend tells Pandas to use Arrow types (same as parquet), solving the compatibility issues.
df = pd.read_parquet('abfss://<my_container>@<my_storage>.dfs.core.windows.net/SeattleSafety', dtype_backend='pyarrow', storage_options=storage_options)
#My dataset had some type-inference issues with these two columns, so it is a good example of how to change data types with Pandas and Arrow
df['year'] = df['year'].astype(pd.ArrowDtype(pa.int16()))
df['month'] = df['month'].astype(pd.ArrowDtype(pa.int16()))
#Writing the data to the Delta Lake
write_deltalake("abfss://<my_container>@<my_storage>.dfs.core.windows.net/SeattleSafety_python", df, mode="append", storage_options=storage_options)
#'Reading' from it. It is lazy-loaded, so we are not actually reading anything yet
df_delta = DeltaTable("abfss://<my_container>@<my_storage>.dfs.core.windows.net/SeattleSafety_python", storage_options=storage_options)
#We have different interfaces to our data. Calling one of these will start a load
df_pandas = df_delta.to_pandas()
arrow_ds = df_delta.to_pyarrow_dataset()
arrow_tbl = df_delta.to_pyarrow_table()
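The pyarrow dataset interface is handy when you only need a subset of the data. As a small sketch (the column names follow the example above), a column selection and a filter can be pushed down before everything is materialised in memory:

import pyarrow.compute as pc
#Only load the 'year' and 'month' columns for rows where year == 2020
arrow_ds = df_delta.to_pyarrow_dataset()
small_tbl = arrow_ds.to_table(columns=['year', 'month'], filter=pc.field('year') == 2020)
df_small = small_tbl.to_pandas()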