Christian Henrik Reich
1 min read · Dec 10, 2024


Thanks :-)

TL;DR: You can build very complex data frames and cache them, but it is only on the second action that you benefit from the cache.

You have to see each Spark job as reading data from some storage into memory, processing it in memory, and then offloading it from memory to some storage. So when offloading to Delta tables, Spark empties its memory for the next Spark job. cache() prevents this offloading for the data frames it is called on, until the session ends or we call unpersist().
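A minimal PySpark sketch of that idea (the paths and names are just made up for illustration, not from the original question):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source path, only for illustration.
df = spark.read.format("delta").load("/data/events")

df.cache()   # mark the data frame to be kept in memory
df.count()   # first action: reads from storage and materialises the cache

# Later actions on df reuse the in-memory data instead of re-reading storage.
df.write.format("delta").mode("append").save("/data/events_copy")

df.unpersist()   # release the cached data before the session ends
```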

There are a few ways to interpret your question, so I'll try to answer them all :-)

Say we do a union between two data frames and write the result to a Delta table. We can have called cache() on one or both of the data frames, or on the union expression itself. The first time the write action is called, it starts the Spark jobs. The first thing those jobs have to do is get the data into memory from storage. After processing, the Spark jobs write to the Delta table. If cache() was called on either of the data frames or on the union expression, those stay in memory. The next time we run a write action, Spark identifies the cached data frames from the query plan and fetches the data that is already in memory, avoiding reading it from storage again. A sketch of this scenario follows below.
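Here is a hedged sketch of that union scenario, again with assumed paths and schemas (both sources must have the same columns for union to work):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two hypothetical input data frames with matching schemas.
df_a = spark.read.format("delta").load("/data/source_a")
df_b = spark.read.format("delta").load("/data/source_b")

# Cache the union expression itself.
combined = df_a.union(df_b).cache()

# First write action: reads both sources from storage, processes them,
# writes to the Delta table, and keeps the union in memory.
combined.write.format("delta").mode("overwrite").save("/data/target_first")

# Second write action: Spark recognises the cached plan and serves the
# union from memory instead of reading the sources again.
combined.write.format("delta").mode("append").save("/data/target_second")

combined.unpersist()
```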
