Azure Synapse Serverless Databases: The caveats of external tables
Introduction
With the introduction of Azure Synapse Analytics, Microsoft also introduced Serverless Databases. Serverless databases can’t store any data, but it can tap in various sources and read the data ad hoc from there. One way this can be done, is with the feature ‘external tables’. While it makes sense in an ad hoc context, I have seen people making it a hammer. I’ve seen people using Serverless Databases between PowerBI and data lakes, while it is not needed. Seen people using it for sources pointing to other sources in Synapse Pipelines. Making it an unnecessary part which can fail, and it will properly make such ETL less scalable. Lastly, Microsoft suggest a logical dataware house. While it can make sense in some situations, it must be done with some considerations. Relying on Azure Serverless as a foundation in an enterprise data solution, sounds risky when deep diving into the technology. It could leave a solution get stuck, deep in the solution life circle, and lead to an expensive restart or a complete replacement.
What is a Synapse Serverless Database
Not the be confused with the Azure product Serverless Azure SQL Database(that names makes no sense, since such database still needs to be a attached to a Azure SQL Server), a Synapse Serverless Database is part of what Microsoft call Serverless SQL Pools(which is in practical, a logical SQL Server instance, strange name again). The database itself is an Azure SQL Database variant, in that sense it has no tables. It can only do views, external tables, and a little stored proceduers.
When running following SQL:
SELECT @@version
SELECT SERVERPROPERTY(‘ProductVersion’) as[version],
SERVERPROPERTY(‘Edition’) as [edition],
SERVERPROPERTY(‘EngineEdition’) as [EngineEdition];
This means the current product edition is the same as Azure SQL Database. What makes a Synapse Serverless database different is the EngineEdition which is 11, Azure SQL Database use EngineEdition 5. The noticeable difference between these engines is the lack of tables, but also lack of DMVs and not being able the see an execution plan. But in the end, Serverless SQL Pools is part of the SQL Server family, and even when I refer it as SQL Server, we still talk about things which Synapse Serverless Database is a part of.
Looking into a database Cars (I have created for the occasion), we see something interesting. Running:
SELECT [name],[compatibility_level],[recovery_model_desc] FROM [sys].[databases] WHERE name = ‘Cars’
SELECT [file_id],[file_guid],[type],[type_desc],[data_space_id],[name],[physical_name],[size],[max_size],[growth]FROM [sys].[database_files]
The database itself is not configured for high performant data movement. The data must be loaded into the SQL engine, before it can processed. So the data must be moved fast. Recovery_model_desc, tells the database is in full recovery mode. This mode makes sure the database can be restored to last action if crashing and it enables replication. A traditional analytic database would run in so-called simple mode, which improves high volume data loading. Despite of the data disk, named Cars, has infinite growth (max_size = -1), there is only one disk. Not ideal for data loading either. What I see, is a database which is made for holding the external table setups, views definitions, security setup etc, and this setup can be easily restored, if the database corrupts or breaks down. So how does it work with data?
External Tables
The external tables work by meta data is defined in the SQL database and a pointer to data is defined. The data can be located on a data lake or another Azure SQL Database or something different. When data is needed, the SQL database reads data into the engine via the external tables. It is heavily implied in the documentation, that Polybase is used for this. Polybase is a technology introduced with SQL Server 2016. It is made to enable SQL Server to read data from non-SQL server sources. Citing from the external table documentation:
“In ad-hoc query scenarios, such as SELECT FROM EXTERNAL TABLE, PolyBase stores the rows that are retrieved from the external data source in a temporary table. After the query completes, PolyBase removes and deletes the temporary table. No permanent data is stored in SQL tables.” - Microsoft
So a temporary table need TempDB, we don’t see it when we connect to the pools, but it is there. When running:
SELECT [file_id],[file_guid],[type],[type_desc],[data_space_id],[name],[physical_name],[size]*8/1024 [max_size_mb],[max_size]*8/1024/1024 [max_size_gb],[growth]
FROM [tempdb].[sys].[database_files]
We see, not only do we have 4 disks do spread the load of loading data into the engine. It is 4 disks of 106 gigabytes, a bit more of 400 gigabytes, which matches what Microsoft defines as the limitation for what a Serverless SQL pool can hold.
I have made a study of this: Azure Synapse Serverless Databases: A study of external tables, OPENROWSET and TempDB
Notice the note about concurrency. Loading a SQL Server table, even in TempDB, is throttled by GAM, SGAM and FPS which administer the pages (memory) in SQL Server. There is a GAM, SGAM and FPS for each 4Gb in a SQL Server data file and having more files can ease the throttling. Secondly Polybase can only handle 32 concurrent queries, so the Synapse Serverless SQL might end up waiting on Polybase, even when the session is running.
“The maximum number of concurrent PolyBase queries is 32” — Microsoft
Would 8 TempDB disks be better, depends. It usually has something to do with the number of CPU cores available. One might guess, we only have 4 cores based on the number of TempDB disks.
Considerations when working with External Table
When utilising an external table with a predicate like
SELECT …. FROM <external_table> WHERE <field> = <value>
Is often okay, when it is a data lake source, because the predicate can be pushed down. Joining with other external tables, like a logical dataware house, could rise some performance issues. When an external table is loaded, it is not indexed. When joining 2 unindexed tables, the query engine has to traverse each row in both tables. When joining 2 tables of 100 rows each, no problem. When joining 2 tables of 10 million each, then it properly hurts.
No problem, I have all the time in the world
Or do you? You might never hit the 400 Gb constraint, but the 30 minutes query timeout is reality.
This can not be changed. When external table performance depends heavily on data sizes, then Synapse Serverless Database could end up in loads waiting on loads, hitting this boundary. E.g. ETL running concurrent loads, or a logical dataware house with many users.
Important note about temporary tables
This is a bit funny, when we have 400Gb TempDB, we only can spend 100mb on temporary tables. Using non-inlining CTE(the WITH statement), table variables, or just # could be affected. Personally, I find this constraint a bit concerning. I don’t why there is such a constraint, but I feel now it is there, it will bite me at some time.
Wrapping it up
I’m not saying don’t use external tables, it has it’s place. I’m saying, use it according to how it works. First of all, is it really needed? What is you gain, for introducing a step which might not be needed? Synapse Pipelines can read from 60+ more different sources, including data lakes. The same is true for PowerBI. Synapse Pipelines engine runs on spark, it is low-code and can scale out processing and read data and process from a lake much much faster.
If wanting to use external tables in a ETL context, make sure to have incremental loading in place from the beginning. You want to keep down, volume the external table has to load, and incremental loading is the only way to do this.
If wanting to use external tables in a logical dataware house context. Use it as a data mart, on data derived from other places. Try to keep data constant only for a year, 3 month or so. Only for few end-users.
For both cases avoid joins and anything which could rely on temporary tables.