Azure Synapse Serverless Databases: A study of external tables, OPENROWSET and TempDB
Introduction
Due to some feedback after publishing Azure Synapse Serverless Databases: The caveats of external tables, I decided to document a study of externals table, OPENROWSET and TempDB. Why is this important? When designing any solution, it is best to know the pro and cons of the components used. A Synapse Serverless SQL pool is limited by a 400 Gb TempDB. This could lead to severe future problems, if not taken into account. 400 Gb is not an uncommon size now a days. What this post will show, is that TempDB is a used resource when querying data through external tables or OPENROWSET
TempDB and things to know about it
The SQL Server, which includes Synapse Serverless family, has a system database names TempDB. We only have partly control of use of TempDB e.g. when we create #-tables, use ORDER BY without indexes, uses DISTINCT etc. The SQL engine also uses TempDB for internal things, out of our control, like when rebuilding indexes, to bail on wrong memory allocation etc. In Synapse Serverless SQL pool, the TempDB is hidden. You can switch to by write ‘USE TempDB’ or by reference it directly in you queries like [tempdb].[sys].[database_files].
The study
To observe TempDB is used, we are monitoring the TempDB data files, when doing queries with external tables and OPENROWSET. This is the only metric I can find.
Setup for the study
In general, for SQL Server, and its kind, it is hard to reclaim disk space. When data is deleted in SQL Server, files keep their sizes for reuse and keeps space unallocated. DBCC SHRINKDATABASE is not a good solution in SQL Server, there is a lot of text on that, and it is not available in Synapse Serverless SQL pool. None of the common ways to reclaim space is available. When doing the querying, we’ll see TempDB grow, and it won’t shrink. This means, it is hard to read the result of multiple tests. The positive, is that by monitoring the TempDB data files, we can see peak loads and see if we are nearing the limit of 400Gb.
A solution to this, I have provisioned 2 Synapse workspaces, one for test of external table and one for test of OPENROWSET. Then we are working on untouched TempDB data files. Each is 16 Mb in size, and will grow for each test.
OPENROWSET can be used from the Master database, while external tables need to be created in a new database.
The procedure
The follow steps are the same for each Synapse workspace. Step 3. is the only difference.
- Get a large data set, in this study we used one from Kaggle: https://www.kaggle.com/datasets/odins0n/amex-parquet
- Upload the data to the Synapse default storage and navigate to the dataset.
- a) When testing external tables, create new database and create an external table to the dataset by right clicking on it and select ‘Create External Table…’.
b) When testing OPENROWSET, get the query by right click on the data set and select ‘SELECT TOP 100…’ - Take a screen shot of the TempDB data files.
- Do a query of 1.000.000 entities
- Take a screen shot of the TempDB data files, when sizes increase after a while.
- While 1st query runs, do 3 more, having 4 concurrent queries running.
- Take a screen shot of the TempDB data files, when sizes increase after a while.
- Stop all queries
- Rerun a single query in a new window, and see it doesn’t effect TempDB (meaning some allocated space is reused when possible)
External tables and TempDB
The untouched Synapse workspace’s TempDB data files:
The external table is names dbo.temptest
The query: SELECT TOP 1000000 * FROM [dbo].[temptest]
When 1 query running:
When 4 concurrent runs:
OPENROWSET and TempDB
The untouched Synapse workspace’s TempDB data files:
The query:
When 1 query running:
After 4 concurrent runs:
Wrapping it up
We can see both external tables and OPENROWSET makes use the limited resource TempDB. We did also see that, that more concurrent queries expand the TempDB, meaning that each query much have its own table in TempDB. Also meaning no data is reused. A bit unexplainable for me, is that external tables used a bit more TempDB under concurrent runs. Running step 10 (not documented), will show that when running a query solely afterward re-use the space in TempDB. The run time indicates it don’t re-uses data loaded from previous runs.
Maybe this behavior changes in the future, I can’t say. Again, not saying don’t use Synapse Serverless SQL pools, but as with every other tool in box be aware of its capabilities.