Azure SQL v/s Azure Synapse SQL (Dedicated Pool, Serverless Pool) - Part 5

Kruti_Mehta
This article is intended for audiences who are considering options to move their data into Azure and prefer T-SQL to query it.

Overview

This is a follow-up blog to Azure SQL v/s Azure Synapse SQL (Dedicated Pool, Serverless Pool) - Part 1, Azure SQL v/s Azure Synapse SQL (Dedicated Pool, Serverless Pool) - Part 2, Azure SQL v/s Azure Synapse SQL (Dedicated Pool, Serverless Pool) - Part 3 & Azure SQL v/s Azure Synapse SQL (Dedicated Pool, Serverless Pool) - Part 4. In this article we cover the next key decision factor: performance tuning differentiation and techniques between Serverless SQL Pool and Dedicated SQL Pool for Azure Synapse.



G) Performance Tuning

Azure Data Lake Storage Gen2 organizes all the objects and files within your storage account into a hierarchy of directories and nested subdirectories. Whether you choose serverless or dedicated, the best practices for data lake containers should be followed whenever data is loaded or written in any file format.



A commonly used approach in batch processing is to place data into an "in" directory.

Then, once the data is processed, put the new data into an "out" directory for downstream processes to consume.

The level of granularity for the date structure is determined by the interval on which the data is uploaded or processed, such as hourly, daily, or even monthly. Sometimes file processing is unsuccessful due to data corruption or unexpected formats. In such cases, a directory structure might benefit from a /bad folder to move the files to for further inspection. For batch jobs consider the following template structure:

  • {Region}/{SubjectMatter(s)}/In/{yyyy}/{mm}/{dd}/{hh}/
  • {Region}/{SubjectMatter(s)}/Out/{yyyy}/{mm}/{dd}/{hh}/
  • {Region}/{SubjectMatter(s)}/Bad/{yyyy}/{mm}/{dd}/{hh}/
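A directory layout like the one above can be queried directly from serverless SQL pool using wildcards. The following is a minimal sketch; the storage account, container, region, and subject-matter names are hypothetical placeholders:

```sql
-- Query all "In" files for one day across all hourly subfolders.
-- Account, container, and path segments below are illustrative only.
SELECT *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/EMEA/Sales/In/2023/01/15/*/*.parquet',
    FORMAT = 'PARQUET'
) AS [rows];
```

Wildcards in the BULK path let a single query span every hourly partition without listing each folder explicitly.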

You can always evaluate the workspace before proceeding with either model. It is good to go through the following sections, which describe the best practices and built-in capabilities of both engines that impact performance.

The following sections compare the two engines concept by concept.
Storage

  • Serverless SQL Pool: Has no local storage or ingestion capabilities, but it can export SELECT query results to a set of files in your storage account using CETAS.
  • Dedicated SQL Pool: Stores data in relational tables with columnar storage, as compared to traditional T-SQL tables which store data in row storage. PolyBase loads can be run using CTAS or INSERT INTO. You can also use CETAS to create external tables.
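As a sketch of the CETAS pattern in serverless SQL pool, the statement below exports query results to storage; the external data source, file format, and table names are hypothetical and must already exist:

```sql
-- Export aggregated results to Parquet files in the data lake.
-- MyDataLake and ParquetFormat are hypothetical, pre-created objects.
CREATE EXTERNAL TABLE dbo.SalesAggregated
WITH (
    LOCATION    = 'Out/sales_aggregated/',
    DATA_SOURCE = MyDataLake,       -- pre-created EXTERNAL DATA SOURCE
    FILE_FORMAT = ParquetFormat     -- pre-created EXTERNAL FILE FORMAT
)
AS
SELECT Region, SUM(Amount) AS TotalAmount
FROM OPENROWSET(
    BULK 'In/sales/*.parquet',
    DATA_SOURCE = 'MyDataLake',
    FORMAT = 'PARQUET'
) AS src
GROUP BY Region;
```

The exported files can then be consumed by downstream processes straight from the "Out" directory.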
File Size

  • Serverless SQL Pool: Keeping your files in the range between 100 MB and 10 GB is preferred. It's better to have equally sized files for a single OPENROWSET path or an external table LOCATION.
  • Dedicated SQL Pool: PolyBase can't load rows that have more than 1,000,000 bytes (1 MB) of data. It's preferred to break data up into 60 or more files to maximize the parallelism of your load. You can use any of the supported loading options for the initial load.

Data Format Supported

  • Serverless SQL Pool: Parquet, CSV, Delta, Spark, Dataverse (alternatively, create views for Azure Cosmos DB data formats such as JSON and BSON).
  • Dedicated SQL Pool: Parquet, and SQL via tables.
Table Data Types Supported

  • Serverless SQL Pool: Data Types for Serverless SQL Pool
  • Dedicated SQL Pool: Data Types for Dedicated SQL Pool
Concurrency

  • Serverless SQL Pool: Maximum concurrency is not limited and depends on the query complexity and the amount of data scanned. One serverless SQL pool can concurrently handle 1,000 active sessions that are executing lightweight queries; the number drops if the queries are more complex or scan a larger amount of data.
  • Dedicated SQL Pool: Workload management consists of three high-level concepts: workload classification, workload importance, and workload isolation. The performance capacity of a query is determined by the user's resource class. Resource classes are pre-determined resource limits in Synapse SQL pool that govern compute resources and concurrency for query execution. There's a trade-off between memory and concurrency.
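As an illustration of workload classification in dedicated SQL pool, the sketch below maps a login to a workload group with high importance; the classifier, group, and login names are hypothetical, and the workload group must already exist:

```sql
-- Route a hypothetical ETL login to a pre-created workload group
-- and give its queries high importance.
CREATE WORKLOAD CLASSIFIER LoadClassifier
WITH (
    WORKLOAD_GROUP = 'wgDataLoads',   -- hypothetical, pre-created group
    MEMBERNAME     = 'etl_user',      -- hypothetical login
    IMPORTANCE     = HIGH
);
```

Classifiers like this let load jobs take precedence over ad-hoc queries without changing application code.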
Data Loading

  • Serverless SQL Pool: Convert large CSV and JSON files to Parquet. Parquet is a columnar format; because it's compressed, its file sizes are smaller than CSV or JSON files that contain the same data. When reading Parquet files, serverless SQL pool skips the columns and rows that aren't needed in a query.
  • Dedicated SQL Pool: For the fastest loading speed, run only one load job at a time. When moving data into a data warehouse table, load data into a staging table first. Define the staging table as a heap and use round-robin for the distribution option.
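The staging-table pattern for dedicated SQL pool might look like the following sketch; the table names, columns, and storage path are hypothetical:

```sql
-- 1. Stage into a heap with round-robin distribution (fastest to load).
CREATE TABLE dbo.Sales_Staging
(
    SaleId  INT,
    Region  NVARCHAR(20),
    Amount  DECIMAL(18, 2)
)
WITH (HEAP, DISTRIBUTION = ROUND_ROBIN);

-- 2. Bulk-load the staged files (hypothetical path).
COPY INTO dbo.Sales_Staging
FROM 'https://mystorageaccount.blob.core.windows.net/mycontainer/In/2023/01/15/'
WITH (FILE_TYPE = 'PARQUET');

-- 3. Move the data into the final, optimized table.
INSERT INTO dbo.Sales
SELECT SaleId, Region, Amount
FROM dbo.Sales_Staging;
```

Loading into a heap avoids the columnstore-compression cost during ingestion and defers it to the final insert.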
Indexes

  • Serverless SQL Pool: Table indexes are not supported, as there is no local storage.
  • Dedicated SQL Pool: Offers several indexing options, including clustered columnstore indexes, clustered and nonclustered indexes, and a non-index option also known as heap. By default, dedicated SQL pool creates a clustered columnstore index when no index options are specified on a table. See the best practices for indexes.
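For example, the clustered columnstore default can also be stated explicitly in dedicated SQL pool; the table below is a hypothetical sketch:

```sql
-- Explicit clustered columnstore index (also the default when omitted).
CREATE TABLE dbo.FactSales
(
    SaleId  INT,
    Amount  DECIMAL(18, 2)
)
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH(SaleId));

-- For small staging or lookup tables, a heap is often the better choice:
-- WITH (HEAP, DISTRIBUTION = ROUND_ROBIN)
```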
Partitioning

  • Serverless SQL Pool: Partition your data by storing partitions in different folders or file names.
  • Dedicated SQL Pool: In most cases, table partitions are created on a date column. Partitioning is supported on all distribution types, including both hash and round-robin. For optimal compression and performance of clustered columnstore tables, a minimum of 1 million rows per distribution and partition is needed. Before partitions are created, dedicated SQL pool already divides each table into 60 distributions. Best practices can be found here.
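A monthly date partition on a dedicated SQL pool table can be sketched as follows, with hypothetical table and column names:

```sql
-- Partition a fact table by month on a date column.
-- Remember each partition is further split across 60 distributions,
-- so aim for at least 60 million rows per monthly partition.
CREATE TABLE dbo.FactSalesPartitioned
(
    SaleId   INT,
    SaleDate DATE,
    Amount   DECIMAL(18, 2)
)
WITH (
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH(SaleId),
    PARTITION (SaleDate RANGE RIGHT FOR VALUES
        ('2023-01-01', '2023-02-01', '2023-03-01'))
);
```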
Statistics

  • Serverless SQL Pool: Creates statistics for columns using sampling; in most cases the sampling percentage will be less than 100%. This flow is the same for every file format. For small tables with an estimated low cardinality (number of rows), automatic statistics creation is triggered with a sampling percentage of 100%.
  • Dedicated SQL Pool: Can be configured to automatically detect and create statistics on columns. We recommend that you enable AUTO_CREATE_STATISTICS for your databases. One best practice is to update statistics on date columns each day as new dates are added, and to implement statistics management at the end of each load.
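In dedicated SQL pool, the recommendation above can be sketched like this; the database, table, and statistics names are hypothetical:

```sql
-- Enable automatic statistics creation for the database.
ALTER DATABASE MyWarehouse SET AUTO_CREATE_STATISTICS ON;

-- Create a statistics object on a date column once...
CREATE STATISTICS stat_SaleDate ON dbo.FactSales (SaleDate);

-- ...then refresh it after each daily load as new dates arrive.
UPDATE STATISTICS dbo.FactSales (stat_SaleDate);
```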
Temporary Tables

  • Serverless SQL Pool: Temporary tables are supported, but their usage is limited. They can't be used in queries that target files.
  • Dedicated SQL Pool: Has a few implementation limitations for temporary tables: only session-scoped temporary tables are supported (global temporary tables aren't); views can't be created on temporary tables; and temporary tables can only be created with hash or round-robin distribution (replicated temporary table distribution isn't supported).
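A session-scoped temporary table in dedicated SQL pool can be created with CTAS, as in this hypothetical sketch:

```sql
-- Session-scoped temp table; only hash or round-robin distribution
-- is allowed, and it disappears when the session ends.
CREATE TABLE #StageSales
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS
SELECT SaleId, Amount
FROM dbo.FactSales
WHERE SaleDate >= '2023-01-01';
```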
Caching

  • Serverless SQL Pool: Caching results on the client side, by using Power BI import mode or Azure Analysis Services, is preferred.
  • Dedicated SQL Pool: Results can be cached within the pool. Follow these techniques for caching results.
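One of those techniques, result set caching, is enabled at the database level in dedicated SQL pool; the database name below is hypothetical:

```sql
-- Turn on result set caching (run this while connected to master).
ALTER DATABASE MyWarehouse SET RESULT_SET_CACHING ON;

-- Afterwards, cache hits can be checked per request via
-- the result_cache_hit column in sys.dm_pdw_exec_requests.
```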
Monitoring

  • Serverless SQL Pool: Monitor queries in Synapse Serverless SQL Pool
  • Dedicated SQL Pool: Monitor workload in Synapse Dedicated SQL Pool
Labels

  • Serverless SQL Pool: Doesn't support labelling queries (as of when this article was published).
  • Dedicated SQL Pool: Supports a concept called query labels.
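Query labels in dedicated SQL pool are attached with an OPTION clause and then looked up in the request DMV; the label text and table below are hypothetical:

```sql
-- Tag a query with a label...
SELECT Region, SUM(Amount) AS TotalAmount
FROM dbo.FactSales
GROUP BY Region
OPTION (LABEL = 'Daily sales rollup');

-- ...then find its execution record by that label.
SELECT request_id, [label], [status]
FROM sys.dm_pdw_exec_requests
WHERE [label] = 'Daily sales rollup';
```

Labels make it much easier to correlate a monitored request with the job that issued it.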
Evaluation Checklist

  • Serverless SQL Pool: Evaluate serverless SQL pool design
  • Dedicated SQL Pool: Evaluate dedicated SQL pool design



For both of these systems, the following are the recommendations for updating statistics:

  • Frequency of stats updates: Conservative: daily, after loading or transforming your data.
  • Sampling: With fewer than 1 billion rows, use default sampling (20 percent). With more than 1 billion rows, use sampling of 2 percent.
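The sampling recommendation translates directly into the CREATE STATISTICS syntax; the table and column names below are hypothetical:

```sql
-- Under ~1 billion rows: default sampling (20 percent) is fine.
CREATE STATISTICS stat_Region ON dbo.FactSales (Region);

-- Over ~1 billion rows: sample explicitly at 2 percent.
CREATE STATISTICS stat_Amount ON dbo.FactSales (Amount)
WITH SAMPLE 2 PERCENT;
```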

I would like to end this series with one of the most important factors for Azure SQL and Azure Synapse (be it serverless SQL or dedicated SQL): data security. It is often considered at the end, and we have seen the implications when it is ignored. The following articles should provide you detailed guidance for data security.


And for both these engines (Azure SQL, Azure Synapse), Azure provides some form of alerting should there be a planned maintenance activity that might affect the service.
References:
  1. Best practices for dedicated SQL pools in Azure Synapse Analytics
  2. Best practices for serverless SQL pool in Azure Synapse Analytics
  3. Pre-migration guide for Dedicated SQL Pool
  4. Distribution Advisor for Dedicated SQL Pool
  5. Azure Advisor for Dedicated SQL Pool
  6. Common Distribution methods for tables in Dedicated SQL pool

Don't forget to share a like if this helps!

Credit:
Thanks Ben Harding for review and guidance

FastTrack for Azure: Move to Azure efficiently with customized guidance from Azure engineering.
