Data Architecture: Data capture time and event time in the medallion architecture
Motivation
When discussing the medallion architecture, it is commonly understood that the Bronze layer contains raw data, the Silver layer holds aligned data, and the Gold layer contains curated data. However, there are additional factors to consider in how data is organized within these layers.
Laying out data correctly is the single most significant task for making a data project performant and scalable. No matter which system is used, data must eventually be moved from storage to memory. This is expensive, and Parquet files, column stores, and a proper data layout are all means of reducing that cost.
One of the key layout decisions is which type of time each layer of the medallion architecture is organized by.
Capture time vs event time
Almost any data has a timestamp, or should have one, indicating when it is valid. This could be a created time, modified time, or, in general, an event time. Capture time, on the other hand, refers to when data is observed, extracted, delivered, etc. — in short, when it is captured. These are usually not the same.
For example, consider a customer created in a CRM system. The time of creation is the event time. Later, perhaps at the end of the day, an ETL process extracts the record, and the time of that extraction is the capture time.
Another example is an IoT system. Here, the event time is when a metric is recorded, and the capture time is when the streaming technology delivers the record.
In some systems, events can also be delivered out of order, so they are not captured in the order in which they occurred.
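To make the distinction concrete, here is a minimal sketch in plain Python; the field names and values are made up for illustration:

```python
from datetime import datetime

# Hypothetical CRM records: event_time is when the customer was created,
# capture_time is when the ETL process picked the record up.
records = [
    {"customer_id": 1, "event_time": datetime(2024, 5, 1, 9, 15), "capture_time": datetime(2024, 5, 1, 23, 0)},
    {"customer_id": 2, "event_time": datetime(2024, 5, 1, 14, 30), "capture_time": datetime(2024, 5, 2, 23, 0)},
    {"customer_id": 3, "event_time": datetime(2024, 5, 1, 11, 45), "capture_time": datetime(2024, 5, 2, 23, 0)},
]

# Sorting by capture time does not reproduce the event order.
by_capture = sorted(records, key=lambda r: r["capture_time"])
by_event = sorted(records, key=lambda r: r["event_time"])
print([r["customer_id"] for r in by_capture])  # [1, 2, 3]
print([r["customer_id"] for r in by_event])    # [1, 3, 2]
```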
What are the challenges?
Unlike in traditional relational databases, primary key and uniqueness constraints are very expensive to enforce in a file-based medallion architecture (including Delta tables). I have described this in detail here: Microsoft Fabric and Databricks: The low-level challenge of enforcing primary keys and foreign keys and uniqueness in columns
Delegating the responsibility for event uniqueness to the delivering system is impractical when that system delivers directly to the Bronze layer. Running post-processing on a whole layer does not scale as data grows; ideally, data should only be processed when it moves from one layer to the next.
Additionally, we should only process the most recently arrived data, also known as incremental loading, to save both money and time.
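As a rough sketch of what incremental loading can look like, assuming a capture_date column and table names like silver.customers and bronze.customers (all illustrative, not a prescribed schema):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already provided in Databricks/Fabric notebooks

# Watermark: the latest capture date that has already been processed into Silver.
last_processed = (
    spark.read.table("silver.customers")
    .agg(F.max("capture_date").alias("wm"))
    .collect()[0]["wm"]
)

# Read only the Bronze data that arrived after the watermark.
new_rows = (
    spark.read.table("bronze.customers")
    .where(F.col("capture_date") > F.lit(last_processed))
)
```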
How to
Recalling the responsibilities of the Bronze, Silver, and Gold layers: the Bronze layer is for raw data. So, let's keep it raw but partition, cluster, or order it by the capture date. This way, the layer is not sensitive to duplicated or out-of-order events.
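A minimal sketch of this with PySpark and Delta, assuming a raw_df DataFrame read from some landing area and a bronze.customers table (both names are illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
raw_df = spark.read.json("/landing/customers/")  # illustrative source path

# Keep the data raw; only stamp it with the date it was captured
# and lay it out physically by that date.
(raw_df
    .withColumn("capture_date", F.current_date())
    .write
    .format("delta")
    .mode("append")
    .partitionBy("capture_date")
    .saveAsTable("bronze.customers"))
```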
Using event time in the Bronze layer would likely make the data sensitive to these issues. When moving data from the Bronze layer to the Silver layer, we can use the capture date to read only the latest data. During this step, we can deduplicate the data in memory and then merge it into Silver. The initial deduplication speeds up the insertion into Silver; however, the reduced data set may still contain duplicates of data already in Silver, which is why we use a merge.
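A sketch of this Bronze-to-Silver step, reusing the incremental new_rows DataFrame from the earlier sketch and assuming a business key customer_id and an event_time column (all illustrative):

```python
from delta.tables import DeltaTable
from pyspark.sql import Window, functions as F

# Deduplicate in memory: keep only the latest event per business key.
latest_per_key = Window.partitionBy("customer_id").orderBy(F.col("event_time").desc())
deduped = (
    new_rows
    .withColumn("rn", F.row_number().over(latest_per_key))
    .where(F.col("rn") == 1)
    .drop("rn")
)

# Merge into Silver: rows already present are updated, new rows are inserted,
# so duplicates of data already in Silver do not accumulate.
silver = DeltaTable.forName(spark, "silver.customers")
(silver.alias("s")
    .merge(deduped.alias("d"), "s.customer_id = d.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```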
How data is processed from the Silver layer to the Gold layer depends on business requirements. Still, organizing the data by event date, as in Silver, is likely part of the solution.
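As one possible example, a Gold table could aggregate the Silver data by event date; the gold.daily_new_customers name and the metric itself are just for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

daily_new_customers = (
    spark.read.table("silver.customers")
    .groupBy(F.to_date("event_time").alias("event_date"))
    .agg(F.count("*").alias("new_customers"))
)

# Lay the Gold table out by event date, mirroring the Silver layout.
(daily_new_customers.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("event_date")
    .saveAsTable("gold.daily_new_customers"))
```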
Summing up
In summary, by focusing on how data is stored by date within the medallion architecture, we can manage the complexity of the solution more effectively. Once data is stored with the correct dates, exploring patterns like the star schema can take the solution further.