Kruti_Mehta
This article is intended for audiences who are considering options to move their data into Azure and prefer T-SQL to query the data.
Overview
This is a follow-up blog to Azure SQL v/s Azure Synapse SQL (Dedicated Pool, Serverless Pool) - Part 1, Azure SQL v/s Azure Synapse SQL (Dedicated Pool, Serverless Pool) - Part 2, Azure SQL v/s Azure Synapse SQL (Dedicated Pool, Serverless Pool) - Part 3 & Azure SQL v/s Azure Synapse SQL (Dedicated Pool, Serverless Pool) - Part 4. In this article we cover the next key decision factor: Performance Tuning, including the differentiation and techniques between Serverless SQL Pool and Dedicated SQL Pool for Azure Synapse.
G) Performance Tuning
Azure Data Lake Storage Gen2 organizes all the objects and files within your storage account into a hierarchy of directories and nested subdirectories. Whether you choose serverless or dedicated, the best practices for data lake containers should be followed (if the data is being loaded and/or written in any file format).
A commonly used approach in batch processing is to place data into an "in" directory.
Then, once the data is processed, put the new data into an "out" directory for downstream processes to consume.
The level of granularity for the date structure is determined by the interval on which the data is uploaded or processed, such as hourly, daily, or even monthly. Sometimes file processing is unsuccessful due to data corruption or unexpected formats. In such cases, a directory structure might benefit from a /bad folder to move the files to for further inspection. For batch jobs, consider the following template structure (a sample serverless query over this layout follows the list):
- {Region}/{SubjectMatter(s)}/In/{yyyy}/{mm}/{dd}/{hh}/
- {Region}/{SubjectMatter(s)}/Out/{yyyy}/{mm}/{dd}/{hh}/
- {Region}/{SubjectMatter(s)}/Bad/{yyyy}/{mm}/{dd}/{hh}/
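To illustrate, a minimal serverless SQL pool query over this layout might look like the sketch below; the storage account, container, region, and subject-matter names are hypothetical placeholders, not part of the original guidance.

```sql
-- Query one day's Parquet files from the "In" folder, using the
-- {Region}/{SubjectMatter(s)}/In/{yyyy}/{mm}/{dd}/{hh}/ layout.
-- filepath(n) returns the value matched by the n-th wildcard in the BULK path.
SELECT TOP 100
    result.filepath(1) AS [yyyy],
    result.filepath(2) AS [mm],
    result.filepath(3) AS [dd],
    *
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/uk/sales/In/*/*/*/*/*.parquet',
        FORMAT = 'PARQUET'
    ) AS result
WHERE result.filepath(1) = '2023'
  AND result.filepath(2) = '01'
  AND result.filepath(3) = '15';
```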
You can always evaluate the workspace before proceeding with either model. It's good to go through the following sections, which describe the best practices and built-in capabilities of both engines that impact performance.
| Concepts | Serverless SQL Pool | Dedicated SQL Pool |
| --- | --- | --- |
| Storage | Has no local storage or ingestion capabilities, but can export SELECT query results to a set of files in your storage account using CETAS (see the CETAS sketch after this table). | Can store data in relational tables with columnar storage, as compared to traditional T-SQL tables which store data in row storage. PolyBase loads can be run using CTAS or INSERT INTO. You can also use CETAS to create external tables. |
| File Size | Keeping your files in the range between 100 MB and 10 GB is preferred. It's better to have equally sized files for a single OPENROWSET path or an external table LOCATION. | PolyBase can't load rows that have more than 1,000,000 bytes (1 MB) of data. It's preferred to break up files into 60 or more files to maximize the parallelism of your load. You can use any of the following for the initial load. |
| Data Formats Supported | Parquet, CSV, Delta, Spark, Dataverse. (Alternatively, create views for Azure Cosmos DB data formats such as JSON and BSON.) | Parquet, SQL via tables |
| Table Data Types Supported | Data Types for Serverless SQL Pool | Data Types for Dedicated SQL Pool |
| Concurrency | Maximum concurrency is not limited and depends on the query complexity and the amount of data scanned. One serverless SQL pool can concurrently handle 1,000 active sessions that are executing lightweight queries; the numbers will drop if the queries are more complex or scan a larger amount of data. | Workload management consists of three high-level concepts: Workload Classification, Workload Importance, and Workload Isolation. The performance capacity of a query is determined by the user's resource class. Resource classes are pre-determined resource limits in Synapse SQL pool that govern compute resources and concurrency for query execution. There's a trade-off between memory and concurrency. |
| Data Loading | Convert large CSV and JSON files to Parquet. Parquet is a columnar format; because it's compressed, its file sizes are smaller than CSV or JSON files that contain the same data. Serverless SQL pool skips the columns and rows that aren't needed in a query if you're reading Parquet files. | For the fastest loading speed, run only one load job at a time. When moving data into a data warehouse table, load data into a staging table first. Define the staging table as a heap and use round-robin for the distribution option (see the staging-load sketch after this table). |
| Indexes | Table indexes are not supported, as there is no local storage. | Dedicated SQL pool offers several indexing options including clustered columnstore indexes, clustered and nonclustered indexes, and a non-index option also known as heap. By default, dedicated SQL pool creates a clustered columnstore index when no index options are specified on a table. Best practices for Indexes |
| Partitioning | Partition your data by storing partitions in different folders or file names. | In most cases, table partitions are created on a date column. Partitioning is supported on all distribution types, including both hash- and round-robin-distributed tables. For optimal compression and performance of clustered columnstore tables, a minimum of 1 million rows per distribution and partition is needed. Before partitions are created, dedicated SQL pool already divides each table into 60 distributions. Best practices can be found here |
| Statistics | Serverless SQL pool does create statistics for columns using sampling, and in most cases the sampling percentage will be less than 100%. This flow is the same for every file format. For small tables with an estimated low cardinality (number of rows), automatic statistics creation is triggered with a sampling percentage of 100%. | Dedicated SQL pool can be configured to automatically detect and create statistics on columns. We recommend that you enable AUTO_CREATE_STATISTICS for your databases. One best practice is to update statistics on date columns each day as new dates are added. Implement statistics management at the end of the load. |
| Temporary Tables | Temporary tables in serverless SQL pool are supported, but their usage is limited. They can't be used in queries which target files. | Dedicated SQL pool has a few implementation limitations for temporary tables. |
| Caching | Caching results on the client side, by using Power BI import mode or Azure Analysis Services, is preferred. | Results can be cached within Dedicated SQL Pool. Follow these techniques for caching results. |
| Monitoring | Monitor queries in Synapse Serverless SQL Pool | Monitor workload in Synapse Dedicated SQL Pool |
| Labels | Serverless SQL pool doesn't support labelling queries (as of when this article was published). | Dedicated SQL pool supports a concept called query labels (see the label sketch after this table). |
| Evaluation Checklist | Evaluate serverless SQL pool design | Evaluate dedicated SQL pool design |
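The CETAS export path referenced in the Storage row could look like the following in serverless SQL pool. This is a minimal sketch: it assumes an external data source (MyDataLake) and a Parquet external file format (ParquetFormat) have already been created, and all object and folder names are placeholders.

```sql
-- CETAS: materialize a query result as Parquet files in the "Out" folder,
-- and register an external table over them in one statement.
CREATE EXTERNAL TABLE dbo.DailySalesOut
WITH (
    LOCATION    = 'uk/sales/Out/2023/01/15/', -- folder written under the data source
    DATA_SOURCE = MyDataLake,                 -- pre-created EXTERNAL DATA SOURCE
    FILE_FORMAT = ParquetFormat               -- pre-created EXTERNAL FILE FORMAT (PARQUET)
)
AS
SELECT CustomerId, SUM(Amount) AS TotalAmount
FROM OPENROWSET(
        BULK 'uk/sales/In/2023/01/15/*/*.parquet',
        DATA_SOURCE = 'MyDataLake',
        FORMAT = 'PARQUET'
    ) AS src
GROUP BY CustomerId;
```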
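Similarly, the staging-load pattern from the Data Loading row for dedicated SQL pool: land data in a heap, round-robin staging table, then CTAS into the final hash-distributed, date-partitioned clustered columnstore table. The external table, column names, and partition boundaries below are assumptions for illustration only.

```sql
-- Stage: heap + round-robin is the fastest landing zone for loads.
CREATE TABLE dbo.Stage_Sales
WITH ( HEAP, DISTRIBUTION = ROUND_ROBIN )
AS
SELECT * FROM ext.Sales;  -- hypothetical PolyBase external table over the lake

-- Final: hash-distributed, partitioned clustered columnstore table.
CREATE TABLE dbo.Fact_Sales
WITH (
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH (CustomerId),
    PARTITION ( SaleDate RANGE RIGHT FOR VALUES ('2023-01-01', '2023-02-01') )
)
AS
SELECT * FROM dbo.Stage_Sales;
```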
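And for the Labels row, a query label in dedicated SQL pool is attached with an OPTION clause and can then be located in the requests DMV; the label text and table name here are hypothetical.

```sql
-- Tag a query with a label...
SELECT COUNT(*) AS row_count
FROM dbo.Fact_Sales
OPTION ( LABEL = 'nightly-load: count check' );

-- ...then find it while monitoring.
SELECT request_id, status, total_elapsed_time, [label]
FROM sys.dm_pdw_exec_requests
WHERE [label] = 'nightly-load: count check';
```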
For both systems, the following are the recommendations for updating statistics (a short T-SQL sketch follows the table):
| Type | Recommendations |
| --- | --- |
| Frequency of stats updates | Conservative: daily, after loading or transforming your data |
| Sampling | With fewer than 1 billion rows, use default sampling (20 percent). With more than 1 billion rows, use sampling of 2 percent. |
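Applied in a dedicated SQL pool, that guidance could look like the following sketch; the statistics object, table, and column names are placeholders.

```sql
-- Create sampled statistics on a date column (default sampling).
CREATE STATISTICS stats_Fact_Sales_SaleDate
ON dbo.Fact_Sales (SaleDate);

-- Daily refresh after load; 2 percent sampling for very large tables.
UPDATE STATISTICS dbo.Fact_Sales (stats_Fact_Sales_SaleDate)
WITH SAMPLE 2 PERCENT;
```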
- You should validate collation support before loading data in both models; a quick metadata check is sketched after this list.
- Both models expose certain system views. It's best to refer to this documentation for catalog information.
- The ADF to Synapse Migration Tool (currently PowerShell scripts) enables you to migrate Azure Data Factory pipelines, datasets, linked services, integration runtimes and triggers to a Synapse Analytics Workspace.
- Laying out a clear project plan with operational readiness parameters is very important before shifting an entire data engine.
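As a quick check for the first two points, standard metadata functions and catalog views work in both models; a minimal sketch:

```sql
-- Confirm the database collation before loading data.
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS current_collation;

-- Catalog views expose object metadata in both engines, e.g. user tables.
SELECT name, type_desc
FROM sys.objects
WHERE type = 'U';
```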
For both engines (Azure SQL, Azure Synapse), Azure provides some form of alerting should there be a planned maintenance activity that might affect the service:
- Advance notifications for planned maintenance events in Azure SQL
- Maintenance schedule for planned maintenance events in Azure Synapse
- Hitchhiker's Guide to Data Lake
- POC : Data warehousing with dedicated SQL pool in Azure Synapse Analytics
- POC: Data lake exploration with serverless SQL pool in Azure Synapse Analytics
- Big data analytics with Apache Spark pool in Azure Synapse Analytics
- Big data analytics with Data Explorer pool in Azure Synapse Analytics
- Best practices for dedicated SQL pools in Azure Synapse Analytics
- Best practices for serverless SQL pool in Azure Synapse Analytics
- Pre-migration guide for Dedicated SQL Pool
- Distribution Advisor for Dedicated SQL Pool
- Azure Advisor for Dedicated SQL Pool
- Common Distribution methods for tables in Dedicated SQL pool
Don't forget to share a like if this helps!
Credit: Thanks to Ben Harding for the review and guidance.
FastTrack for Azure: Move to Azure efficiently with customized guidance from Azure engineering.