Posted September 12, 2022

Build a Logical Enterprise Data Warehouse using Azure Data Lake Storage Gen2 and Synapse Serverless SQL Pools

Overview

A common pattern in Modern Data Warehouse architecture is to land source data in its existing format in a data lake, then transform and load it into an Enterprise Data Warehouse (EDW) for reporting and analytics. In Microsoft Azure, this translates to loading data into Azure Data Lake Storage Gen2 (ADLS), transforming it with Synapse or Azure Data Factory pipelines, storing it in an Azure Synapse Dedicated SQL pool database or another Azure relational data store, and building reports over that data in Power BI, Excel, or other reporting tools. Typically, the EDW data is stored in a star schema, the optimal design for many reporting and analytical tools like Power BI. This is a sound practice for enterprise reporting requirements, but it has some pitfalls:

- Traditional transformation and load of data to the final EDW data store can be time consuming.
- Ingress and storage costs increase when transforming and loading from the landing zone to the final data store for reporting.
- Deep knowledge of the data sources is required to provide business insights and an optimal data warehouse design.

But:

- What if your data was transformed and ready for reporting as soon as it landed in the cloud?
- What if you could eliminate that time-consuming ETL process?
- How about not incurring additional costs for moving, transforming, and storing data in another location?
- What if you suspect you will eventually need a traditional EDW but would like to start with an Exploratory Data Analysis project in the cloud?
- How about doing all of this in T-SQL?

Then Serverless SQL pools in Azure Synapse Analytics may be the answer for you! In this post, we'll walk through creating a logical data warehouse over your ADLS data using a Serverless SQL database.
Prerequisites

- A Synapse Analytics workspace: use an existing workspace or create a new Synapse Analytics workspace.
- Data in an Azure Data Lake Storage Gen2 account: use your own data and follow the steps below as a guideline, or download the parquet files used in my demo and upload them to your default Synapse ADLS storage account using Azure Storage Explorer or the Azure portal. (In real-world applications, this is typically done via a Synapse pipeline rather than manually.)

Steps to create a logical EDW via Synapse SQL Serverless

1. Open Synapse Studio.

2. Create a SQL Serverless database in the Azure Synapse Analytics workspace Data hub.

I created two schemas, Bronze and Silver. The views in the Bronze schema represent the ADLS data in its existing format. The views in the Silver schema contain queries over the Bronze views that denormalize the data into facts and dimensions.

3. In the Data hub, go to your SQL Serverless database, click the ellipses to the right of the database name, choose New SQL script, then Empty script. In the empty script, type CREATE SCHEMA Bronze, then click Run.

4. Repeat step 3 to create a Silver schema.

5. Create SQL views in your Bronze schema over your ADLS files to define column names for easy querying in T-SQL. A quick way to do this is to navigate to your ADLS account in the Data hub and generate a New SQL script over the folder (or file) for each entity. You will be prompted for the format (Text, Parquet, or Delta), and a script will be created for you. Change the script to remove the TOP 100 from the SELECT statement and add CREATE VIEW, then click Run to create the view. Do this for all of the ADLS files that will be used in your logical data warehouse. Here are the views I created in my Bronze schema.

6. Create SQL views in the Silver schema to define the transformations for the fact and dimension tables of your star schema. Below are the views I created in the Silver schema.
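The edited Bronze view script ends up looking something like the sketch below. The storage account name and folder path here are placeholders for illustration; substitute your own ADLS URL. Serverless SQL pools read the files with OPENROWSET at query time:

```sql
-- Sketch of a Bronze view over parquet files in ADLS.
-- 'mydatalake' and the 'data/StockItems/' path are hypothetical --
-- use the URL of your own storage account and folder.
-- The auto-generated script selects TOP 100 from OPENROWSET; removing
-- the TOP 100 and wrapping it in CREATE VIEW persists it as a view.
CREATE VIEW Bronze.StockItems
AS
SELECT *
FROM OPENROWSET(
        BULK 'https://mydatalake.dfs.core.windows.net/data/StockItems/*.parquet',
        FORMAT = 'PARQUET'
     ) AS [result];
```

Each Bronze view created this way is just a definition over the files; no data is copied out of the lake.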
You can create your own views or download the code I created for the views below. Here is an example of the StockItems dimension view, denormalized for the star schema:

    CREATE VIEW Silver.StockItems AS
    SELECT
        SI_A.StockItemID,
        SI_A.StockItemName,
        S.SupplierID,
        S.SupplierName,
        SC.SupplierCategoryID,
        SC.SupplierCategoryName,
        C.ColorID,
        SI_A.UnitPackageID AS [Selling Package],
        SP_A.PackageTypeName,
        SI_A.OuterPackageID AS [Buying Package],
        SI_A.Brand,
        SI_A.Size,
        SI_A.LeadTimeDays,
        SI_A.QuantityPerOuter,
        SI_A.IsChillerStock,
        SI_A.Barcode,
        SI_A.TaxRate,
        SI_A.UnitPrice,
        SI_A.RecommendedRetailPrice,
        SI_A.TypicalWeightPerUnit,
        SI_A.MarketingComments,
        SI_A.InternalComments,
        SI_A.Photo,
        SI_A.CustomFields,
        SI_A.Tags,
        SI_A.SearchDetails,
        SI_A.LastEditedBy,
        SI_A.ValidFrom,
        SI_A.ValidTo
    FROM Bronze.StockItems AS SI_A
    INNER JOIN Bronze.Suppliers S ON SI_A.SupplierID = S.SupplierID
    INNER JOIN Bronze.Colors C ON SI_A.ColorID = C.ColorID
    INNER JOIN Bronze.PackageTypes SP_A ON SP_A.PackageTypeID = SI_A.UnitPackageID
    INNER JOIN Bronze.SupplierCategories SC ON S.SupplierCategoryID = SC.SupplierCategoryID;

Connect to the Serverless SQL endpoint to query your database in your tool of choice, such as Power BI or SQL Server Management Studio. Start by navigating to your Synapse Analytics workspace in the Azure portal and copying the Serverless SQL endpoint.

Connect to the endpoint in SQL Server Management Studio by pasting the Serverless SQL endpoint as the Server name. Here are the results of a query over one of the views.

Connect to the Serverless SQL endpoint in Power BI Desktop, then load and transform the data just like any other Power BI data source.

Summary

Keeping data in ADLS Gen2 without transforming and loading it into another data platform gives users quick access to the data they need for reporting and analytics, without the cost and delays of heavy transformations and data movement.
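Once connected, the Silver views behave like ordinary tables. As a quick sketch, here is one way to query the Silver.StockItems view shown above (using only columns that view defines):

```sql
-- Query the logical warehouse through the Serverless SQL endpoint.
-- The view reads the underlying parquet files in ADLS at query time.
SELECT
    SupplierCategoryName,
    COUNT(*)       AS StockItemCount,
    AVG(UnitPrice) AS AvgUnitPrice
FROM Silver.StockItems
GROUP BY SupplierCategoryName
ORDER BY StockItemCount DESC;
```

You pay only for the data scanned by queries like this; no compute is provisioned in advance.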
After creating your logical EDW in the Synapse SQL Serverless database, you can evaluate whether Synapse SQL Serverless meets your goals for query performance. If more performance is needed, you can leverage your Synapse SQL Serverless views to move data to another Azure data platform, such as an Azure Synapse Analytics Dedicated SQL pool, for further query optimization and performance features.
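One way to start down that path is CETAS (CREATE EXTERNAL TABLE AS SELECT), which a serverless pool can use to materialize a Silver view's output back to ADLS as parquet; a downstream platform such as a dedicated pool can then load the curated files. A minimal sketch, where the data source location, format name, and table name are all hypothetical:

```sql
-- Assumes a 'curated' container exists in the (hypothetical) mydatalake account.
CREATE EXTERNAL DATA SOURCE CuratedZone
WITH (LOCATION = 'https://mydatalake.dfs.core.windows.net/curated');

CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

-- Materialize the transformed view as parquet files in the curated zone.
CREATE EXTERNAL TABLE dbo.StockItemsCurated
WITH (
    LOCATION = 'stockitems/',
    DATA_SOURCE = CuratedZone,
    FILE_FORMAT = ParquetFormat
)
AS
SELECT * FROM Silver.StockItems;
```

This keeps the transformation logic in the views you already wrote, so the same definitions serve both the logical warehouse and any later physical one.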