Lakehousing: Navigating AI/ML and other non-deterministic transformations when processing data

Christian Henrik Reich
4 min read · Dec 11, 2024


Introduction

One of the primary reasons to select a Lakehouse for a data platform is its ability to handle structured, semi-structured, and unstructured data. Yet, I mostly see solutions only handling structured data from relational sources, or semi-structured data such as JSON from APIs.

The textbook example of a Lakehouse builds on the Medallion Architecture, where the Silver layer has data aligned (and cleaned, etc.) so we can more easily query across different sources and types of sources.

While aligning structured and semi-structured data is manageable, including unstructured data like images, audio files, PDFs, and other formats has a history of being more demanding.

With the rise of AI, especially in the handling of images and sound files, we can now utilise unstructured data with much more ease.

How to align unstructured data with structured and semi-structured data

Using an AI model like OpenAI’s Whisper, we can transcribe text from an audio file. To build on this example, we can index the text by performing sentiment analysis or even entity analysis. All of this can be done with public APIs, such as those from Microsoft AI Services, or with locally installed models.
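
As a minimal sketch of that flow, assuming the open-source openai-whisper package and a Hugging Face sentiment pipeline are available (the file path and model size are placeholders), transcription and indexing can be just a few lines:

```python
# A minimal sketch: transcribe an audio file and attach a sentiment label,
# so the result can be aligned and indexed in the Silver layer.
import whisper                      # open-source openai-whisper package
from transformers import pipeline   # Hugging Face transformers

# Transcribe an audio file stored in the Bronze layer (path is a placeholder).
asr_model = whisper.load_model("base")
transcript = asr_model.transcribe("bronze/audio/meeting_2024-12-11.wav")["text"]

# Index the transcript with a sentiment score.
sentiment_model = pipeline("sentiment-analysis")
sentiment = sentiment_model(transcript[:512])[0]   # e.g. {'label': 'POSITIVE', 'score': 0.98}

print(transcript, sentiment["label"], sentiment["score"])
```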

For images, we can extract text, objects, actions, etc., and index them by classification.

Using LLMs to extract information from text files, in the same way as from audio and images, can also bring good value.

So, while we can’t align images, audio, and text files themselves, we can extract information or descriptions that we can use for alignment and indexing in the Silver layer.
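
A minimal sketch of that alignment idea, with purely illustrative field names: whatever the source file is, we end up with the same kind of description record, and it is this record that lands in the Silver layer.

```python
# A sketch of a uniform "description record" for unstructured sources.
# Field names are illustrative, not a fixed schema.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class SilverDescription:
    source_path: str        # pointer back to the raw file in Bronze
    media_type: str         # "audio", "image", "pdf", ...
    extracted_text: str     # transcript, caption, OCR output, ...
    labels: list[str]       # classifications, entities, detected objects
    extracted_at: datetime

record = SilverDescription(
    source_path="bronze/images/car_0001.jpg",
    media_type="image",
    extracted_text="A red station wagon parked in heavy rain.",
    labels=["car", "number_plate:AB12345", "weather:rain"],
    extracted_at=datetime.now(timezone.utc),
)
```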

While this section has focused on using AI for unstructured data processing, the same applies to structured and semi-structured data as well.

Non-deterministic data processing

In data engineering, we have long relied on our deterministic transformations. They have more or less been our backbone in securing data quality, as we knew we would get the same output every time we called a transformation with the same input.

Some AI/ML models have probability as an element, which gives them a degree of non-determinism by nature. Some models, e.g., linear regression, are deterministic, but model drift might require them to be retrained. A retrained model will probably give a different answer to the same question compared to earlier versions.
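
To make the retraining point concrete, here is a small sketch using scikit-learn's LinearRegression on synthetic, drifted data; both models are fully deterministic, yet they answer the same question differently.

```python
# A minimal sketch of why retraining changes answers; the data is synthetic
# and purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))

# Model v1: trained before drift.
y_v1 = 2.0 * X[:, 0] + rng.normal(0, 1, 200)
model_v1 = LinearRegression().fit(X, y_v1)

# Model v2: retrained after the relationship has drifted.
y_v2 = 2.5 * X[:, 0] + rng.normal(0, 1, 200)
model_v2 = LinearRegression().fit(X, y_v2)

same_question = np.array([[4.0]])
print(model_v1.predict(same_question), model_v2.predict(same_question))  # different answers
```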

A good example of non-determinism is to ask ChatGPT for a chocolate cake recipe, and then open another chat and ask the exact same question. It will give two different answers.

Why is non-determinism an issue?

A good principle to strive for in a Lakehouse is being able to rebuild the Silver and Gold layers from the Bronze layer. In that sense, the Lakehouse is deterministic as well. The reason for this principle is that we should be able to perform the same analytic query today, tomorrow, next week, next year, etc.

While rebuilding a full Lakehouse is not that common, reloading a table is something that happens to all of us more frequently. Imagine reloading a table due to a new column, and then the rest of the table is completely different from before the reload. This is an issue we can get from non-deterministic transformations.

Where to start?

The Bronze layer should be a data safe. It should be append-only, with no deletes, read-only for individuals, and generally protected. Data should be stored raw and unprocessed, but can still be Hive-partitioned.
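
A minimal sketch of such a landing step, assuming a local filesystem for brevity (in practice the target would be ADLS, S3, or similar): the file itself stays untouched, and only the Hive-style folder layout is added.

```python
# Land a raw file in Bronze under a Hive-style partition (ingest_date=...).
import shutil
from datetime import date
from pathlib import Path

def land_in_bronze(source_file: str, dataset: str) -> Path:
    # Keep the file raw and unprocessed; only the folder layout is added.
    target_dir = Path("bronze") / dataset / f"ingest_date={date.today().isoformat()}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(source_file).name
    shutil.copy2(source_file, target)
    return target

land_in_bronze("incoming/car_0001.jpg", dataset="car_images")
```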

When it comes to binary files (such as audio files, images, and many document formats) as well as textual formats like JSON and XML, there is not much benefit in embedding their entire content into a single column within Parquet files. They will not compress well, and it is not possible to draw meaningful statistics from them. There will not be much performance gain, if any.

It is said that a picture is worth more than 1,000 words. Imagine having pictures of cars; we can extract the number plates. Another day, we get a request for weather conditions, or something else we couldn't even imagine when we first received the pictures. Keeping them untouched means that we preserve all information. The same goes for all other data.

Transformations

As we would like to align data in the Silver layer, such transformations are placed in the processing from the Bronze to the Silver layer. Taking into account that AI/ML transformations are slow (and more costly than ordinary logic transformations), it makes sense to keep them to an incremental Bronze-to-Silver processing. Still, there can be good cases for Silver-to-Gold processing.
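
A minimal sketch of what incremental processing can mean here, with the watermark storage and the AI transform left as assumptions: only files that have not been processed before are sent through the slow, costly call.

```python
# Incremental Bronze-to-Silver processing: skip files already processed.
import json
from pathlib import Path

WATERMARK_FILE = Path("silver/_watermarks/car_images.json")

def load_processed() -> set[str]:
    return set(json.loads(WATERMARK_FILE.read_text())) if WATERMARK_FILE.exists() else set()

def save_processed(paths: set[str]) -> None:
    WATERMARK_FILE.parent.mkdir(parents=True, exist_ok=True)
    WATERMARK_FILE.write_text(json.dumps(sorted(paths)))

def process_incrementally(ai_transform) -> None:
    processed = load_processed()
    new_files = [p for p in Path("bronze/car_images").rglob("*.jpg") if str(p) not in processed]
    for path in new_files:
        ai_transform(path)          # the expensive AI/ML call, only run for new files
        processed.add(str(path))
    save_processed(processed)
```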

Always use versioned models, stored in model repositories so they can be retrieved later. It should be noted which part of the data in a table or dataset is derived from which model and model version. It can be tempting to store the model information in the destination table in the Silver layer, but this approach comes with a big risk of losing the information, for example, if a table is reloaded. A separate bookkeeping table, keeping track of the date spans for each model, is the better solution.
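
A minimal sketch of such a bookkeeping table, assuming a Spark/Delta environment; the table and column names are illustrative.

```python
# Record which model version produced which date span of a Silver table.
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

lineage_row = Row(
    silver_table="silver.audio_transcripts",
    model_name="openai-whisper",
    model_version="base-20240930",
    valid_from="2024-12-11",       # first ingest date processed with this version
    valid_to="9999-12-31",         # open-ended until the model is replaced
)

(spark.createDataFrame([lineage_row])
      .write.format("delta")
      .mode("append")
      .saveAsTable("silver.model_lineage"))
```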

This would also adhere to the Transparency, Accountability, and Reliability and Safety principles of Microsoft’s Responsible AI Principles and approach. https://www.microsoft.com/en-us/ai/principles-and-approach

Lastly, some models have options for reducing non-determinism. For example, setting the temperature to low or zero on an LLM might be necessary.
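
A minimal sketch with the official openai Python SDK (the model name is a placeholder): pinning the temperature to zero reduces, but does not fully remove, non-determinism.

```python
# Call an LLM with temperature 0 to make outputs as repeatable as possible.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Extract the car brand from: 'A red Volvo V70 in the rain.'"}],
    temperature=0,  # always pick the most likely continuation
)
print(response.choices[0].message.content)
```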

A closing note

While it has never been easier to write code and include AI in our products, the challenging part is ensuring reliability. It is also what makes AI interesting, as it is a rapidly evolving niche, with new best practices still to be uncovered and debated.

