
Lakehousing: A Schema-change strategy beyond Spark’s mergeSchema

7 min read · May 12, 2025


Motivation

Schema changes have long been a challenge in any data paradigm, especially when working with structured data. However, in a Lakehouse architecture, using Apache Spark or Python as your processing engine, managing evolving schemas can become much simpler.

Spark’s built-in mergeSchema option may seem helpful, but it has notable limitations:

  • Limited to expansion: It can only preserve the existing schema or add new columns. It cannot correct typos or consolidate columns that were misspelled, so you end up with duplicate fields full of NULLs.
  • Fragile with type changes: If a column’s data type evolves (e.g. from integer to string), mergeSchema fails the read rather than reconciling the types.
  • Lock-in: Relying on Spark’s mergeSchema ties your solution to Spark. If you later want to process your data with other engines or frameworks, that exact feature isn’t available; at best you’ll find something similar.

In this post, I’ll show an approach that avoids these pitfalls and builds a portable schema-evolution strategy.

Data versioning

Data should be treated like any other product. When its definition or schema changes, that change should be treated as a new version. Each version lives alongside its predecessors, stored and queryable together, allowing you to track and manage evolution over time.

To overcome schema-change challenges, we need to understand what it takes to migrate from one version to the next. That is what the rest of this post looks into.

The medallion architecture

When working with lakehouses, the Bronze–Silver–Gold layered architecture is the default, and I’m using it for context in this post.

By definition, Bronze keeps data as raw as possible, which is the key to versioning in a lakehouse. Schema changes are best fixed when reading from the Bronze layer, using schema-on-read.

Relying on Delta tables in Bronze can actually reintroduce the usual schema-change headaches of structured data, even with mergeSchema. Instead, we’ll use Hive-style partitioning, which, when done right, can be just as effective as Delta tables. (That comparison deserves a future post of its own.)
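To give a taste of why the folder layout matters, here is a minimal sketch of partition pruning at read time. It assumes the Files/customers layout and partition values used later in this post; with Hive-style folders such as year=2025/month=05/day=12, a filter on the partition columns means Spark only scans the matching folders.

df_day = (
    spark.read.parquet("Files/customers")
         .where("year = 2025 AND month = 5 AND day = 12")  # pruned to one folder
)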

The challenges in action

I have a script that generates data for my examples. There is one dataset with three schema versions, stored as Parquet for simplicity. The versions have the following traits:

  • Version 1:
      • Age is stored as a string
  • Version 2:
      • Age has changed to an integer
      • LastName is now misspelled as LstName
  • Version 3:
      • LstName is corrected back to LastName
      • A new LoyaltyScore column is added
import os
import pandas as pd
import numpy as np
import random
from datetime import datetime, timedelta

first_names = ['Alice', 'Bob', 'Charlie', 'Diana', 'Edward', 'Fiona', 'George', 'Hannah', 'Ian', 'Julia']
last_names = ['Smith', 'Johnson', 'Williams', 'Brown', 'Jones', 'Garcia', 'Miller', 'Davis', 'Rodriguez', 'Martinez']
countries = ['USA', 'Canada', 'UK', 'Germany', 'France', 'Spain', 'Italy', 'Australia', 'Brazil', 'India']

n = 5
versioned = False  # Set to True to write each version into its own folder (used in the solution section below)

def random_date(start, end):
    delta = end - start
    return start + timedelta(days=random.randrange(delta.days))

data_versions = {
    'v1': {
        'data': {
            'CustomerID': [f'CUST{str(i).zfill(4)}' for i in range(1, n+1)],
            'FirstName': random.choices(first_names, k=n),
            'LastName': random.choices(last_names, k=n),
            'Age': [str(x) for x in np.random.randint(18, 70, size=n)],  # V1: Age as string
            'Country': random.choices(countries, k=n),
            'PurchaseAmount': np.round(np.random.uniform(20.0, 500.0, size=n), 2).tolist(),
        },
        'partition': {'year': 2025, 'month': 5, 'day': 12}
    },
    'v2': {
        'data': {
            'CustomerID': [f'CUST{str(i).zfill(4)}' for i in range(1, n+1)],
            'FirstName': random.choices(first_names, k=n),
            'LstName': random.choices(last_names, k=n),  # V2: LastName misspelled
            'Age': np.random.randint(18, 70, size=n).tolist(),  # V2: Age as integer
            'Country': random.choices(countries, k=n),
            'PurchaseAmount': np.round(np.random.uniform(20.0, 500.0, size=n), 2).tolist(),
        },
        'partition': {'year': 2025, 'month': 5, 'day': 13}
    },
    'v3': {
        'data': {
            'CustomerID': [f'CUST{str(i).zfill(4)}' for i in range(1, n+1)],
            'FirstName': random.choices(first_names, k=n),
            'LastName': random.choices(last_names, k=n),  # V3: typo corrected
            'Age': np.random.randint(18, 70, size=n).tolist(),
            'Country': random.choices(countries, k=n),
            'PurchaseAmount': np.round(np.random.uniform(20.0, 500.0, size=n), 2).tolist(),
            'LoyaltyScore': np.random.randint(0, 1000, size=n).tolist(),  # V3: new column
        },
        'partition': {'year': 2025, 'month': 5, 'day': 14}
    },
}

base_path = '/lakehouse/default/Files/customers'

# Write each version into a Hive-style partition folder (optionally under a version folder)
for version, cfg in data_versions.items():
    df = pd.DataFrame(cfg['data'])

    p = cfg['partition']
    dir_path = os.path.join(
        base_path,
        f'{version}' if versioned else '',
        f"year={p['year']}",
        f"month={p['month']:02d}",
        f"day={p['day']:02d}"
    )
    os.makedirs(dir_path, exist_ok=True)

    out_file = os.path.join(dir_path, 'customers.parquet')
    df.to_parquet(out_file, engine='pyarrow', index=False)

    print(f"[{version}] wrote {len(df)} rows to {out_file}")

The script saves three files in a Hive-partitioned folder structure:
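Concretely, with versioned = False the layout under base_path looks like this:

/lakehouse/default/Files/customers/year=2025/month=05/day=12/customers.parquet
/lakehouse/default/Files/customers/year=2025/month=05/day=13/customers.parquet
/lakehouse/default/Files/customers/year=2025/month=05/day=14/customers.parquet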

Let us try to read them with mergeSchema:

df = spark.read.option("mergeSchema", "true").parquet("Files/customers")
display(df)

We have a problem with Age: Spark can’t merge a string column with an integer column. Let’s fix this quickly.

from pyspark.sql.types import IntegerType

df = spark.read.option("mergeSchema", "true").parquet("Files/customers/year=2025/month=05/day=12/")
df = df.withColumn("Age", df["Age"].cast(IntegerType()))
df2 = spark.read.option("mergeSchema", "true").parquet("Files/customers/year=2025/month=05/day=13/", "Files/customers/year=2025/month=05/day=14/")
display(df.union(df2))

Not only did we give up on our Hive partitioning, which should have given us pruning performance; we also still have an issue, because dataset version 3 has an extra column, LoyaltyScore, and union requires the same number of columns. unionByName can solve this:

from pyspark.sql.types import IntegerType

df = spark.read.option("mergeSchema", "true").parquet("Files/customers/year=2025/month=05/day=12/")
df = df.withColumn("Age", df["Age"].cast(IntegerType()))
df2 = spark.read.option("mergeSchema", "true").parquet("Files/customers/year=2025/month=05/day=13/", "Files/customers/year=2025/month=05/day=14/")
display(df.unionByName(df2, allowMissingColumns=True))

We have a dataset that includes both LastName and LstName, which must be handled in some way. We also need to decide how to handle the LoyaltyScore column. In typical schema-merge scenarios, NULL is automatically inserted for any rows that didn’t originally have those fields. But is NULL really the right placeholder?

  • NULL can carry a genuine meaning downstream (e.g. “value unknown” vs. “absent by design”).
  • A sentinel value like -1, “REDACTED”, or something similar might more clearly signal “unknown” or “not applicable.”

So there’s still work to do before this combined dataset is fully valid, and by merging all versions into one table, we’ve also given up the partition-pruning performance benefits we had at the Bronze layer.
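If sentinels and consolidation are the right call for your consumers, the cleanup could look roughly like this. It is a minimal sketch that reuses df and df2 from the previous snippet; the sentinel value -1 is purely illustrative.

from pyspark.sql.functions import coalesce, col

merged = df.unionByName(df2, allowMissingColumns=True)

cleaned = (
    merged
    # Fold the misspelled LstName back into LastName, then drop the duplicate field
    .withColumn("LastName", coalesce(col("LastName"), col("LstName")))
    .drop("LstName")
    # Use an explicit sentinel instead of NULL for rows that predate LoyaltyScore
    .fillna({"LoyaltyScore": -1})
)
display(cleaned)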

The solution in action

When you set versioned = True in the data-generation script, the output is written into version-specific folders. We can store an unlimited amount of data under each version (here we only write one file per version). We place the version folder above the Hive partition directories because the partitioning becomes part of the dataset at read time; this also gives us the flexibility to adjust partitioning independently of the version structure.

The same three files, saved under version folders:
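Concretely, with versioned = True the layout becomes:

/lakehouse/default/Files/customers/v1/year=2025/month=05/day=12/customers.parquet
/lakehouse/default/Files/customers/v2/year=2025/month=05/day=13/customers.parquet
/lakehouse/default/Files/customers/v3/year=2025/month=05/day=14/customers.parquet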

Next, we define our migrations in two steps:

  1. Version 1 → Version 2
      • Cast Age from string to integer.
      • Rename LastName → LstName (to match the V2 schema).
  2. Version 2 → Version 3
      • Rename LstName → LastName (correcting the typo).
      • Add a new LoyaltyScore column, initializing every row to 0.

Each migration is implemented as a transformation function, giving us a dataset with the latest schema and corrections. No mergeSchema is needed, and we are ready to process the data or just write it to a Delta table.

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import lit
from pyspark.sql import DataFrame


def migrate_v1_to_v2(df: DataFrame) -> DataFrame:
    df = df.withColumn("Age", df["Age"].cast(IntegerType()))
    df = df.withColumnRenamed('LastName', 'LstName')  # Deliberately rename to the misspelling to be compliant with V2 of the schema
    return df

def migrate_v2_to_v3(df: DataFrame) -> DataFrame:
    df = df.withColumnRenamed('LstName', 'LastName')
    df = df.withColumn('LoyaltyScore', lit(0))
    # Fix the column order so the positional union with the V3 files lines up
    df = df.select(['CustomerID', 'FirstName', 'LastName', 'Age', 'Country', 'PurchaseAmount', 'LoyaltyScore', 'year', 'month', 'day'])
    return df

def get_customers() -> DataFrame:
    df_v1 = spark.read.parquet("Files/customers/v1")
    df_v2 = migrate_v1_to_v2(df_v1).union(spark.read.parquet("Files/customers/v2"))
    df_v3 = migrate_v2_to_v3(df_v2).union(spark.read.parquet("Files/customers/v3"))
    return df_v3

customer_df = get_customers()
display(customer_df)

The Result:
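If the next step is persistence rather than further processing, writing the consolidated DataFrame to a Delta table is a one-liner. A minimal sketch, where the table name customers_latest is purely illustrative:

# Hypothetical target table; adjust the name and write mode to your lakehouse
customer_df.write.format("delta").mode("overwrite").saveAsTable("customers_latest")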

Not only is the result clean, but we also have the option to do the same with other frameworks:

import pandas as pd

base_path = '/lakehouse/default/Files/customers'

def migrate_v1_to_v2(df: pd.DataFrame) -> pd.DataFrame:
    df['Age'] = df['Age'].astype(int)
    df.rename(columns={'LastName': 'LstName'}, inplace=True)
    return df

def migrate_v2_to_v3(df: pd.DataFrame) -> pd.DataFrame:
    df.rename(columns={'LstName': 'LastName'}, inplace=True)
    df['LoyaltyScore'] = 0
    return df

def get_customers() -> pd.DataFrame:
    # The migrations modify the DataFrame in place, so the return value isn't needed here
    df_v1 = pd.read_parquet(f'{base_path}/v1')
    migrate_v1_to_v2(df_v1)

    df_v2_new = pd.read_parquet(f'{base_path}/v2')
    df_v2 = pd.concat([df_v1, df_v2_new], ignore_index=True)

    migrate_v2_to_v3(df_v2)

    df_v3_new = pd.read_parquet(f'{base_path}/v3')
    df_v3 = pd.concat([df_v2, df_v3_new], ignore_index=True)

    return df_v3

df = get_customers()
display(df)

Same result, and even the Hive partitioning is still in play.
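To show that the partition layout still pays off outside Spark, pandas can push a partition filter down to the folder level via the pyarrow engine. A minimal sketch, assuming the filters argument is forwarded to pyarrow (as it is in recent pandas versions):

# Only the matching year=/month=/day= folders under v3 are read
df_day = pd.read_parquet(
    f'{base_path}/v3',
    engine='pyarrow',
    filters=[('year', '=', 2025), ('month', '=', 5), ('day', '=', 14)],
)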

Conclusion

While Spark’s mergeSchema has its place, schema evolution is too complex to be solved by expansion alone. Maintaining evolving schemas is a critical aspect of data quality: unattended schema changes can break pipelines and make errors much harder to correct later.

Schema changes aren’t just the data team’s issue; they affect the entire organization and require coordinated effort between the people who produce the data and the data team.

In this post, we’ve demonstrated a simple example of schema evolution with versioned data. You can easily extend these patterns into a fully data-driven workflow. I'll leave that as an exercise for the reader! 😊
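One possible starting point for that exercise, sketched under the assumption that you reuse the Spark migration functions defined above: keep the migrations in an ordered registry and fold each version’s data through the remaining steps.

from pyspark.sql import DataFrame

# Hypothetical registry: version folder -> migration that lifts it to the next version
migrations = {
    'v1': migrate_v1_to_v2,
    'v2': migrate_v2_to_v3,
    'v3': None,  # already the latest schema
}

def get_customers_data_driven() -> DataFrame:
    df = None
    for version, migrate in migrations.items():
        batch = spark.read.parquet(f"Files/customers/{version}")
        df = batch if df is None else df.union(batch)
        if migrate is not None:
            df = migrate(df)
    return df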


Written by Christian Henrik Reich

Renaissance man @ twoday Kapacity, Renaissance man @ Mugato.com. Focusing on data architecture, ML/AI and backend dev, cloud and on-premise.
