Kruti_Mehta
This article is intended for audiences who are considering options to move their data into Azure and prefer T-SQL to query the data.
Overview
This is a follow-up blog to Azure SQL v/s Azure Synapse SQL (Dedicated Pool, Serverless Pool) - Part 1, Azure SQL v/s Azure Synapse SQL (Dedicated Pool, Serverless Pool) - Part 2, Azure SQL v/s Azure Synapse SQL (Dedicated Pool, Serverless Pool) - Part 3 & Azure SQL v/s Azure Synapse SQL (Dedicated Pool, Serverless Pool) - Part 4. In this article we cover the next key decision factor: Performance Tuning, including the differentiation and techniques between Serverless SQL Pool and Dedicated SQL Pool for Azure Synapse.
G) Performance Tuning
Azure Data Lake Storage Gen2 organizes all the objects and files within your storage account into a hierarchy of directories and nested subdirectories. Whether you choose serverless or dedicated, the best practices for data lake containers should be followed (if the data is being loaded and/or written in any file format).
A commonly used approach in batch processing is to place data into an "in" directory.
Then, once the data is processed, put the new data into an "out" directory for downstream processes to consume.
The level of granularity for the date structure is determined by the interval on which the data is uploaded or processed, such as hourly, daily, or even monthly. Sometimes file processing is unsuccessful due to data corruption or unexpected formats. In such cases, a directory structure might benefit from a /bad folder to move the files to for further inspection. For batch jobs, consider the following template structure (a sample serverless query over this layout follows the list):
- {Region}/{SubjectMatter(s)}/In/{yyyy}/{mm}/{dd}/{hh}/
- {Region}/{SubjectMatter(s)}/Out/{yyyy}/{mm}/{dd}/{hh}/
- {Region}/{SubjectMatter(s)}/Bad/{yyyy}/{mm}/{dd}/{hh}/
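To illustrate, a minimal serverless SQL pool query over this layout might look like the sketch below; the storage account, container, region, and subject-matter names are hypothetical placeholders, not part of the original guidance.

```sql
-- Query one day's Parquet files from the "In" folder, using the
-- {Region}/{SubjectMatter(s)}/In/{yyyy}/{mm}/{dd}/{hh}/ layout.
-- filepath(n) returns the value matched by the n-th wildcard in the BULK path.
SELECT TOP 100
    result.filepath(1) AS [yyyy],
    result.filepath(2) AS [mm],
    result.filepath(3) AS [dd],
    *
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/uk/sales/In/*/*/*/*/*.parquet',
        FORMAT = 'PARQUET'
    ) AS result
WHERE result.filepath(1) = '2023'
  AND result.filepath(2) = '01'
  AND result.filepath(3) = '15';
```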
You can always evaluate the workspace before proceeding with either model. It's good to go through the following sections, which describe the best practices and built-in capabilities of both engines that impact performance.
| Concepts | Serverless SQL Pool | Dedicated SQL Pool |
| --- | --- | --- |
| Storage | Has no local storage or ingestion capabilities, but can export SELECT query results to a set of files in your storage account using CETAS (see the CETAS sketch after this table). | Can store data in relational tables with columnar storage, as compared to traditional T-SQL tables which store data in row storage. PolyBase loads can be run using CTAS or INSERT INTO. You can also use CETAS to create external tables. |
| File Size | Keeping your files in the range between 100 MB and 10 GB is preferred. It's better to have equally sized files for a single OPENROWSET path or an external table LOCATION. | PolyBase can't load rows that have more than 1,000,000 bytes (1 MB) of data. It's preferred to break up files into 60 or more files to maximize the parallelism of your load. You can use any of the following for the initial load. |
| Data Formats Supported | Parquet, CSV, Delta, Spark, Dataverse. (Alternatively, create views for Azure Cosmos DB data formats such as JSON and BSON.) | Parquet, SQL via tables |
| Table Data Types Supported | Data Types for Serverless SQL Pool | Data Types for Dedicated SQL Pool |
| Concurrency | Maximum concurrency is not limited and depends on the query complexity and the amount of data scanned. One serverless SQL pool can concurrently handle 1,000 active sessions that are executing lightweight queries; the numbers will drop if the queries are more complex or scan a larger amount of data. | Workload management consists of three high-level concepts: Workload Classification, Workload Importance, and Workload Isolation. The performance capacity of a query is determined by the user's resource class. Resource classes are pre-determined resource limits in Synapse SQL pool that govern compute resources and concurrency for query execution. There's a trade-off between memory and concurrency. |
| Data Loading | Convert large CSV and JSON files to Parquet. Parquet is a columnar format; because it's compressed, its file sizes are smaller than CSV or JSON files that contain the same data. Serverless SQL pool skips the columns and rows that aren't needed in a query if you're reading Parquet files. | For the fastest loading speed, run only one load job at a time. When moving data into a data warehouse table, load data into a staging table first. Define the staging table as a heap and use round-robin for the distribution option (see the staging-load sketch after this table). |
| Indexes | Table indexes are not supported, as there is no local storage. | Dedicated SQL pool offers several indexing options including clustered columnstore indexes, clustered and nonclustered indexes, and a non-index option also known as heap. By default, dedicated SQL pool creates a clustered columnstore index when no index options are specified on a table. Best practices for Indexes |
| Partitioning | Partition your data by storing partitions in different folders or file names. | In most cases, table partitions are created on a date column. Partitioning is supported on all distribution types, including both hash- and round-robin-distributed tables. For optimal compression and performance of clustered columnstore tables, a minimum of 1 million rows per distribution and partition is needed. Before partitions are created, dedicated SQL pool already divides each table into 60 distributions. Best practices can be found here |
| Statistics | Serverless SQL pool does create statistics for columns using sampling, and in most cases the sampling percentage will be less than 100%. This flow is the same for every file format. For small tables with an estimated low cardinality (number of rows), automatic statistics creation is triggered with a sampling percentage of 100%. | Dedicated SQL pool can be configured to automatically detect and create statistics on columns. We recommend that you enable AUTO_CREATE_STATISTICS for your databases. One best practice is to update statistics on date columns each day as new dates are added. Implement statistics management at the end of the load. |
| Temporary Tables | Temporary tables in serverless SQL pool are supported, but their usage is limited. They can't be used in queries which target files. | Dedicated SQL pool has a few implementation limitations for temporary tables. |
| Caching | Caching results on the client side, by using Power BI import mode or Azure Analysis Services, is preferred. | Results can be cached within Dedicated SQL Pool. Follow these techniques for caching results. |
| Monitoring | Monitor queries in Synapse Serverless SQL Pool | Monitor workload in Synapse Dedicated SQL Pool |
| Labels | Serverless SQL pool doesn't support labelling queries (as of when this article was published). | Dedicated SQL pool supports a concept called query labels (see the label sketch after this table). |
| Evaluation Checklist | Evaluate serverless SQL pool design | Evaluate dedicated SQL pool design |
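The CETAS export path referenced in the Storage row could look like the following in serverless SQL pool. This is a minimal sketch: it assumes an external data source (MyDataLake) and a Parquet external file format (ParquetFormat) have already been created, and all object and folder names are placeholders.

```sql
-- CETAS: materialize a query result as Parquet files in the "Out" folder,
-- and register an external table over them in one statement.
CREATE EXTERNAL TABLE dbo.DailySalesOut
WITH (
    LOCATION    = 'uk/sales/Out/2023/01/15/', -- folder written under the data source
    DATA_SOURCE = MyDataLake,                 -- pre-created EXTERNAL DATA SOURCE
    FILE_FORMAT = ParquetFormat               -- pre-created EXTERNAL FILE FORMAT (PARQUET)
)
AS
SELECT CustomerId, SUM(Amount) AS TotalAmount
FROM OPENROWSET(
        BULK 'uk/sales/In/2023/01/15/*/*.parquet',
        DATA_SOURCE = 'MyDataLake',
        FORMAT = 'PARQUET'
    ) AS src
GROUP BY CustomerId;
```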
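Similarly, the staging-load pattern from the Data Loading row for dedicated SQL pool: land data in a heap, round-robin staging table, then CTAS into the final hash-distributed, date-partitioned clustered columnstore table. The external table, column names, and partition boundaries below are assumptions for illustration only.

```sql
-- Stage: heap + round-robin is the fastest landing zone for loads.
CREATE TABLE dbo.Stage_Sales
WITH ( HEAP, DISTRIBUTION = ROUND_ROBIN )
AS
SELECT * FROM ext.Sales;  -- hypothetical PolyBase external table over the lake

-- Final: hash-distributed, partitioned clustered columnstore table.
CREATE TABLE dbo.Fact_Sales
WITH (
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH (CustomerId),
    PARTITION ( SaleDate RANGE RIGHT FOR VALUES ('2023-01-01', '2023-02-01') )
)
AS
SELECT * FROM dbo.Stage_Sales;
```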
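And for the Labels row, a query label in dedicated SQL pool is attached with an OPTION clause and can then be located in the requests DMV; the label text and table name here are hypothetical.

```sql
-- Tag a query with a label...
SELECT COUNT(*) AS row_count
FROM dbo.Fact_Sales
OPTION ( LABEL = 'nightly-load: count check' );

-- ...then find it while monitoring.
SELECT request_id, status, total_elapsed_time, [label]
FROM sys.dm_pdw_exec_requests
WHERE [label] = 'nightly-load: count check';
```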
For both systems, the following are the recommendations for updating statistics (a short T-SQL sketch follows the table):
| Type | Recommendations |
| --- | --- |
| Frequency of stats updates | Conservative: daily, after loading or transforming your data |
| Sampling | With fewer than 1 billion rows, use default sampling (20 percent). With more than 1 billion rows, use sampling of 2 percent. |
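Applied in a dedicated SQL pool, that guidance could look like the following sketch; the statistics object, table, and column names are placeholders.

```sql
-- Create sampled statistics on a date column (default sampling).
CREATE STATISTICS stats_Fact_Sales_SaleDate
ON dbo.Fact_Sales (SaleDate);

-- Daily refresh after load; 2 percent sampling for very large tables.
UPDATE STATISTICS dbo.Fact_Sales (stats_Fact_Sales_SaleDate)
WITH SAMPLE 2 PERCENT;
```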
- You should validate collation support before loading data in both models; a quick metadata check is sketched after this list.
- Both models expose certain system views. It's best to refer to this documentation for catalog information.
- The ADF to Synapse Migration Tool (currently PowerShell scripts) enables you to migrate Azure Data Factory pipelines, datasets, linked services, integration runtimes and triggers to a Synapse Analytics Workspace.
- Laying out a clear project plan with operational readiness parameters is very important before shifting an entire data engine.
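As a quick check for the first two points, standard metadata functions and catalog views work in both models; a minimal sketch:

```sql
-- Confirm the database collation before loading data.
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS current_collation;

-- Catalog views expose object metadata in both engines, e.g. user tables.
SELECT name, type_desc
FROM sys.objects
WHERE type = 'U';
```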
For both engines (Azure SQL, Azure Synapse), Azure provides some form of alerting should there be a planned maintenance activity that might affect the service:
- Advance notifications for planned maintenance events in Azure SQL
- Maintenance schedule for planned maintenance events in Azure Synapse
- Hitchhiker's Guide to Data Lake
- POC : Data warehousing with dedicated SQL pool in Azure Synapse Analytics
- POC: Data lake exploration with serverless SQL pool in Azure Synapse Analytics
- Big data analytics with Apache Spark pool in Azure Synapse Analytics
- Big data analytics with Data Explorer pool in Azure Synapse Analytics
- Best practices for dedicated SQL pools in Azure Synapse Analytics
- Best practices for serverless SQL pool in Azure Synapse Analytics
- Pre-migration guide for Dedicated SQL Pool
- Distribution Advisor for Dedicated SQL Pool
- Azure Advisor for Dedicated SQL Pool
- Common Distribution methods for tables in Dedicated SQL pool
Don't forget to share a like if this helps!
Credit: Thanks to Ben Harding for the review and guidance.
FastTrack for Azure: Move to Azure efficiently with customized guidance from Azure engineering.