Microsoft Fabric Metadata Driven Pipelines with Mirrored Databases

jehayes · Aug 27, 2024

Microsoft Fabric's database mirroring feature is a game changer for organizations using Azure SQL DB, Azure Cosmos DB or Snowflake in their cloud environment! Mirroring offers near real-time replication with just a few clicks AND for at no cost!

Features:

No data movement cost when mirroring
No storage cost for mirrored tables
No consumption of Fabric Capacity Units
Mirror all tables in your source database or just a few, with the capability to add more tables as your Fabric analytics environment grows
Source data continuously replicated with no data pipelines to configure
Data landed in delta tables in One Lake which are optimized by default
SQL endpoint and default semantic model automatically created

What can you do?

Run near real-time queries against the SQL endpoint with no impact to your source system since the data is replicated
Share the mirrored database across your Fabric tenant
Run cross database queries from within the mirrored database SQL endpoint or from One Lake when the mirrored database is shared
Create SQL views over mirrored data, which can include joins or unions with data from other mirrored databases, warehouses or lakehouse SQL endpoints
Configure row-level security and object level security against the mirrored data
Create Direct Lake near real-time Power BI reports against the default semantic model or against a new model
Copy mirrored data into a lakehouse and used in Spark notebooks or Data Science workloads
And… build faster, simpler metadata driven pipelines! Which is the focus of this article!

Metadata Driven Pipelines with Microsoft Fabric Mirrored Databases

Metadata-driven pipelines in Microsoft Fabric enable you to streamline data ingestion and transformations with minimal coding, lower maintenance, and enhanced scalability. And when your source is Azure SQL DB, Azure Cosmos, or Snowflake, this becomes even easier!

Architecture Overview

With Mirroring, the data is replicated and readily available in OneLake. This eliminates the pipeline which brings data into Fabric. After mirroring is configured:
1. A SQL endpoint is automatically created, allowing you and your users to query the data in near real time without impacting source data performance
2. A default semantic model is also automatically created, allowing you and your users to create near real time Power BI reports. However, best practice is to create a new semantic model, which provides more features and options than the default semantic model
A Fabric Data Warehouse is created to contain:
1. The fact and dimension tables (star schema) which simplifies and optimizes the analytical data model for SQL querying and semantic models
2. SQL views and stored procedures used in data transformations
3. The metadata table which holds the information on how to transform load each fact or dimension table
The Fabric Data Pipelines are created and scheduled to perform:
1. A Lookup activity on the metadata table to get the information on how to load each fact or dimension table in the Fabric Data Warehouse
2. Copy Data activities for full loads with the source being a SQL view over the mirrored tables and the destination a table in the Fabric Data Warehouse
3. Stored Procedure activities for incremental loads to merge the latest data from the mirrored data source into the Data Warehouse destination table
A default semantic model is automatically created over the Fabric Data Warehouse. But create a new one per best practices
Build Power BI reports for analytical reporting

Why 2 semantic models? Fabric Data Warehouse vs Mirrored Database SQL Endpoint for semantic model reporting

For this architecture, I considered using:

Just the mirrored tables
SQL views over the mirrored tables
A new Fabric Data Warehouse with data loaded from the mirrored tables

Since SQL views over mirrored tables always resort to Direct Query rather than Direct Lake, I decided against building a semantic model over SQL views. Instead, I created a semantic model over the mirrored tables and a separate Fabric Data Warehouse and semantic mode over it.

The semantic model over the mirrored tables:

Is used only by users who understand the source data schema
Allows for near-real time access for data to answer ad-hoc questions that are not analytical in nature such as the order shipment status of a particular order number or current stock availability of a specific product at a certain location
Incur no data storage cost but could be offset by consumption of Capacity Units if reports/queries are complex
Can leverage Power BI Direct Lake connection for faster report performance if the report is not too complex; if the query is too complex, it will resort to Direct Query
Could include other lakehouses, mirrored databases, or data warehouse tables in the SQL endpoint and thus the semantic model but Power BI reports using these tables will always resort to direct query rather than direct lake connect
Can include complex relationships between tables and unexpected results may be returned if the semantic model and/or reports are not configured correctly

The semantic model over the Fabric Data Warehouse:

Requires scheduled data refreshes but will be relatively fast since the source data is already in Fabric
Is best for more analytical questions such as "What was our sales revenue by month by customer location?" or "What is our days on hand for products in a particular shipping warehouse?"
Allows for user friendly data warehouse table and column names rather than using the potentially cryptic mirrored database table and column names
Eliminates snowflake schema, allowing for better performing reports and delivery of consistent results without having to understand complex relationships and filtering rules
Is more likely to leverage Direct Lake connection in Power BI reports since model is simpler
Allows other data sources to be loaded into the same warehouse, eliminating cross database joins in reports that automatically resort to direct query
Leverages a simpler star schema model resulting in faster reports with less consumption of capacity units

This solution addresses two key use cases: providing near real-time responses to queries about specific data transactions and statuses, and delivering rapid analytics over large datasets. Continue reading to learn how to implement this architecture in your own environment.

Solution details

Below are detailed steps to build the metadata pipeline. The data source is the Wide World Importers SQL database, which you can download here. Then follow the instructions to import into an Azure SQL DB.

Configure database mirroring

From the Synapse Data Warehouse experience, choose the Mirrored Azure SQL DB Option:

Then choose the tables to mirror:

After the mirroring has started, the main canvas will say “Mirrored Azure SQL Database is running’; Click on Monitor replication to see the number of rows replicated and the last completion time:

At this point, both a SQL analytics endpoint and default semantic model are created (1a). But I created a new semantic model(1b) and set up the table relationships:

2. Create a Fabric Data Warehouse

Create the fact tables or any other tables that will be incrementally loaded (2a). You can also manually create the dimension tables or tables are full loaded OR you can specify to auto-create the tables in the pipeline Copy Data activity, as I do later.

Create the views over the mirrored database tables and stored procedures (2b):

Create and load the metadata table (2c) with information on how to load each fact or dimension table:

3. Create the data pipelines to load the fact and dimension tables

Below is the orchestrator pipeline:

Set variable – set pipelinestarttime to the current date/time. This is logged in the metadata driven pipeline table for each table

Lookup – get the table load attributes from metadata table for each table to load

For each table to load

Invoke the pipeline, passing in the current row object and the date/time the orchestrator pipeline started:

Load warehouse table pipeline

Set variable pipeline start time for tracking the time of each table load

If activity – check if full or incremental load

If full load

Use Copy Data Activity to load the Data Warehouse table, set the pipeline end time variable and update the metadata table with load information

Copy data activity

Source settings reference the view over the mirrored database:

Destination settings reference data warehouse table:

Note that the data warehouse table will be dropped and re-created each time

Set the pipeline end time variable

Run Script to update the pipeline run details for this table

If not a full load, then run the incremental load activities

Lookup activity calls a stored procedure to insert or update new or changed records into the destination table. The value for the StartDate parameter is the latest date of the previous load of this table. The value for the EndDate parameter is usually a null value and only set if there is a need to load or reload a subset of data.

The stored procedure performs an insert or update, depending upon whether or not the key value exists in the destination. Only the records from the source that have changed since the last table load are selected. This reduces the number of updates performed.

The stored procedure returns how many rows were inserted or updated, along with the latest transaction data of the data loaded, which is needed for the next incremental load.

Set the pipeline end time

Script activity updates the table load details:

4. Build a new semantic model in the Fabric/Power BI service

Create relationships between the tables, DAX calculations, dimension hierarchies, display formats – anything you need for your analytics

Note that all tables have Direct Lake connectivity as noted by the dashed, blue line. Direct Lake has the performance of Import semantic models without the overhead of refreshing the data.

5. Create reports

Create reports from your semantic model

Continue on building out more reports and dashboards, setting up security, scheduling data warehouse refresh (which will now be super fast since the source data is already in Fabric), creating apps, adding more data sources – whatever it takes to get the analytics your organization needs into Fabric!

Mirroring - Microsoft Fabric | Microsoft Learn

What is data warehousing in Microsoft Fabric? - Microsoft Fabric | Microsoft Learn

Data Factory in Microsoft Fabric documentation - Microsoft Fabric | Microsoft Learn

Work with semantic models in Microsoft Fabric - Training | Microsoft Learn

Create reports in the Power BI - Microsoft Fabric | Microsoft Learn

Dimensional modeling in Microsoft Fabric Warehouse - Microsoft Fabric | Microsoft Learn

If your source database is not supported for mirroring (yet!), check out these other articles I wrote:

Metadata Driven Pipelines for Microsoft Fabric - Microsoft Community Hub

Metadata Driven Pipelines for Microsoft Fabric – Part 2, Data Warehouse Style - Microsoft Community Hub

Continue reading...

Microsoft Fabric Metadata Driven Pipelines with Mirrored Databases

jehayes

Metadata Driven Pipelines with Microsoft Fabric Mirrored Databases​

Architecture Overview​

Why 2 semantic models? Fabric Data Warehouse vs Mirrored Database SQL Endpoint for semantic model reporting​

Solution details​

Metadata Driven Pipelines with Microsoft Fabric Mirrored Databases

Architecture Overview

Why 2 semantic models? Fabric Data Warehouse vs Mirrored Database SQL Endpoint for semantic model reporting

Solution details