Kruti_Mehta
This article is intended for audiences who are considering options to move their data into Azure and prefer T-SQL to query the data.
Overview
When working with customers, we often run into scenarios where insufficient consideration of database options and an unclear understanding of use cases lead to unnecessary refactoring at both the database and application levels. This series helps you understand your use cases and make an efficient choice between Azure SQL and Azure Synapse SQL.
Basic Differences between Azure SQL and Azure Synapse SQL
Azure SQL
Azure SQL is a family of managed, secure, and intelligent products that use the SQL Server database engine in the Azure cloud.
With Azure SQL Database, you can create a highly available and high-performance data storage layer for the applications and solutions in Azure.
It is well suited for symmetric multiprocessing (SMP) systems running a transactional relational database engine.
It is a good fit for Extract, Transform, and Load (ETL) processes.
It enables you to process both relational data and nonrelational structures, such as graphs, JSON, spatial, and XML.
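For example, nonrelational structures such as JSON can be queried with plain T-SQL. A minimal sketch, assuming a hypothetical dbo.Customers table with a JSON column named Profile:

```sql
-- Minimal sketch: querying JSON stored in a relational column in Azure SQL.
-- The dbo.Customers table and Profile column are hypothetical.
SELECT
    c.CustomerId,
    JSON_VALUE(c.Profile, '$.city') AS City  -- extract a scalar value from the JSON
FROM dbo.Customers AS c
WHERE ISJSON(c.Profile) = 1;                 -- only rows holding valid JSON
```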
Azure Synapse SQL
Synapse SQL is a distributed query system for T-SQL that enables data warehousing and data virtualization scenarios and extends T-SQL to address streaming and machine learning scenarios.
Azure Synapse Analytics was designed to perform analytics on large volumes of data.
It is well suited for massively parallel processing (MPP) systems serving data warehouse and data lakehouse requirements.
It is a good fit for Extract, Load, and Transform (ELT) processes. [Note: it can also be used for ETL; there are no restrictions on usage.]
It can take advantage of built-in distributed query processing capabilities, eliminating the separate resources needed to transform the data before loading.
It can directly work with Parquet, CSV, TSV, and JSON files stored in the data lake.
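For instance, such files can be queried in place with OPENROWSET (supported in the serverless runtime described below). A minimal sketch, assuming a hypothetical storage account holding a folder of Parquet files:

```sql
-- Minimal sketch: aggregating Parquet files directly in the data lake.
-- The storage URL, path, and column names are hypothetical.
SELECT
    sales.SaleDate,
    SUM(sales.Amount) AS TotalAmount
FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/sales/*.parquet',
        FORMAT = 'PARQUET'
     ) AS sales
GROUP BY sales.SaleDate;
```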
Azure Synapse SQL is a distributed query system in Azure Synapse Analytics that offers two kinds of runtime environments:
- Serverless SQL pool: On-demand SQL query processing, primarily used to work with data in a data lake.
- Dedicated SQL pool: Enterprise-scale relational database instances used to host data warehouses, in which data is stored in relational tables with columnar storage.
Serverless SQL does not have local storage or ingestion capabilities. However, it does allow you to query the data as though you were working with SQL tables, using external tables.
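A minimal sketch of that pattern, assuming a hypothetical Parquet folder in an Azure Data Lake Storage account (the URL, paths, and columns are placeholders):

```sql
-- Minimal sketch: exposing data lake files as a queryable external table
-- in a serverless SQL pool. All names and paths below are hypothetical.
CREATE EXTERNAL DATA SOURCE MyDataLake
WITH ( LOCATION = 'https://mystorageaccount.dfs.core.windows.net/mycontainer' );

CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH ( FORMAT_TYPE = PARQUET );

CREATE EXTERNAL TABLE dbo.Sales
(
    SaleId   INT,
    SaleDate DATE,
    Amount   DECIMAL(18, 2)
)
WITH (
    LOCATION = '/sales/',          -- folder within the data source
    DATA_SOURCE = MyDataLake,
    FILE_FORMAT = ParquetFormat
);

-- From here on, the files behave like a SQL table.
SELECT TOP 10 * FROM dbo.Sales;
```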
When to Use Azure SQL v/s Azure Synapse SQL
From the above explanation, the first few decision-making factors should be clear:
1) Is your use case OLTP or Data Warehouse? [OLTP: Azure SQL | Data Warehouse, Data Lakehouse: Azure Synapse SQL]
2) Are the operations being performed on the data ETL or ELT? [ETL: Azure SQL | ELT: Azure Synapse SQL]
3) What is the format of the data you will be working with in the end? [SQL, JSON, spatial, and XML: Azure SQL | Parquet, CSV, TSV, and JSON: Azure Synapse SQL]
What happens when you map this incorrectly?
Not so efficient OLTP Mapping
Imagine you had an application with the following requirements, and you used Azure Synapse for it (a sketch of this workload pattern follows the list):
- High frequency reads and writes.
- Large numbers of singleton selects.
- High volumes of single row inserts.
- Row-by-row processing needs.
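A minimal sketch of such a workload, using a hypothetical dbo.Orders table; each statement touches a single row:

```sql
-- Minimal sketch of an OLTP-style workload (hypothetical dbo.Orders table):
-- high-frequency single-row inserts and point lookups by key. On an SMP
-- engine such as Azure SQL these are cheap; on Synapse SQL each statement
-- passes through the control node and is coordinated across compute nodes.
INSERT INTO dbo.Orders (OrderId, CustomerId, OrderDate, Amount)
VALUES (1001, 42, '2023-01-15', 99.90);

SELECT OrderId, Amount
FROM dbo.Orders
WHERE OrderId = 1001;  -- singleton select

UPDATE dbo.Orders
SET Amount = 109.90
WHERE OrderId = 1001;  -- singleton update
```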
Implications
- For every single-row update or select, all the compute nodes where the data is distributed are engaged. Azure Synapse uses a node-based architecture in which applications connect and issue T-SQL to a control node. That control node is the single point of entry in Synapse SQL.
- In this system, queries are split into smaller queries and executed on the compute nodes. (Did you really need a simple select or update to be split into sub-queries? Another question is the frequency of these singleton write/update/select queries.)
- If this were just a rare, one-time occurrence, then perhaps the engine would still be a reasonable choice; the result can be loaded into another data store, e.g. a dedicated SQL pool, an Azure SQL database, or the data lake. But if not, it has performance and cost implications which need re-evaluation.
Not so efficient Data Warehouse Mapping
Likewise, now consider a scenario where you only had data visualization or data aggregation requirements and you used Azure SQL for them. [Not Hyperscale, which was specifically designed for distributed transaction processing, separating the compute and storage layers.]
Hyperscale separates the query processing engine, where the semantics of various data engines diverge, from the components that provide long-term storage and durability for the data.
- The initially supported storage limit is 100 TB as of this article's publication.
- Hyperscale databases have one primary compute node where the read-write workload and transactions are processed.
- Up to four high-availability secondary compute nodes can be added on demand. They act as hot standby nodes for failover purposes and can serve as read-only compute nodes to offload read workloads when desired.
- Named replicas are secondary compute nodes designed to enable various additional OLTP read scale-out scenarios and to better support Hybrid Transactional and Analytical Processing (HTAP) workloads.
Consider the requirements of your application to be something like the below:
- The data is coming in Parquet, CSV, and JSON formats.
- File sizes range between 100 MB and 10 GB every day.
- A large number of bulk operations.
- Selects being performed on high volumes of data.
- Eventually the data can exceed 4 TB within a very short interval (in 1-2 years, or even less).
Implications
- You are first processing each of these files into a SQL-compatible format for loading purposes.
- After that, you run computations via an additional compute layer.
- Or you are leveraging the built-in processors of the database engine, which share the same memory and operate under a single OS.
- The same database/tables are accessed with a simple set of transactions to generate aggregated results over substantially high volumes of data (ranging into TBs and PBs).
- Data changes with high velocity, variety, volume, and veracity.
- Eventually the aggregated content has to be repopulated at very short intervals (contrast this with the ELT sketch below).
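By contrast, in Azure Synapse the files could be loaded as-is and transformed inside the engine. A minimal ELT sketch for a dedicated SQL pool, assuming a hypothetical staging.RawSales table already loaded from the lake:

```sql
-- Minimal ELT sketch (hypothetical names): transform after loading, using
-- CREATE TABLE AS SELECT (CTAS) so the distributed query engine performs
-- the aggregation instead of a separate compute layer.
CREATE TABLE dbo.DailySales
WITH
(
    DISTRIBUTION = ROUND_ROBIN,    -- spread the result across distributions
    CLUSTERED COLUMNSTORE INDEX    -- columnar storage for analytic reads
)
AS
SELECT SaleDate, SUM(Amount) AS TotalAmount
FROM staging.RawSales
GROUP BY SaleDate;
```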
From the above scenarios, it should be clear that even a good engine, if mapped to the wrong use case, has a negative impact. In the next articles, we will dive deeper into feature differences, business continuity options, concurrency, and several other factors for each of these engines.
- Azure SQL v/s Azure Synapse SQL (Dedicated Pool, Serverless Pool) - Part 2 (Talks about key decision factors for Azure SQL like T-SQL differences, Purchasing Options, High Availability, and provides guidance on migration through the SQL Migration repo)
- Azure SQL v/s Azure Synapse SQL (Dedicated Pool, Serverless Pool) - Part 3 (Talks about Files/Files Group, T-SQL differences, Data Loading, Purchasing Models and Workload specifics for Dedicated and Serverless Pool in Azure Synapse Analytics)
- Azure SQL v/s Azure Synapse SQL (Dedicated Pool, Serverless Pool) - Part 4 (Talks about Back-up Restore & High availability for Dedicated and Serverless pool in Azure Synapse Analytics)
- Azure SQL v/s Azure Synapse SQL (Dedicated Pool, Serverless Pool) - Part 5 (Talks about Performance Tuning options in Dedicated and Serverless pool for Azure Synapse Analytics)
Good Reads:
- Transitioning from SMP to MPP, the why and the how
- Type of Data
- Operation on Data
- Split of Data - Transactional, Historical
- Split of Data - Analytical
- Split of Data - Delta Concepts
Don't forget to share a like if this helps.
Credit: Thanks to Ben Harding for review and guidance.
FastTrack for Azure: Move to Azure efficiently with customized guidance from Azure engineering.