
Developing a Comprehensive Analytics Platform for Daily Pandemic Updates within Fabric


In this tutorial, we will create a complete solution using Microsoft Fabric to monitor news related to the mpox virus on a daily basis. This solution is valuable for healthcare professionals because it enables a quick response to a changing situation by sending alerts when significant negative news is identified.

 

What is Microsoft Fabric?

 

Microsoft Fabric is a comprehensive data management solution that leverages artificial intelligence to integrate, organize, analyze, and share data across a unified and multi-cloud data lake. It simplifies analytics by providing a single platform that combines various services such as Power BI, Azure Synapse Analytics, Azure Data Factory, and more into a seamless SaaS (Software as a Service) foundation. With Microsoft Fabric, you can centralize data storage with OneLake, integrate AI capabilities, and transform raw data into actionable insights for business users.

 

 

 

[ATTACH type=full" alt="PascalBurume08_5-1724231782509.png]63812[/ATTACH]

 

[HEADING=1]

1. Exploring the Features of Microsoft Fabric Components
[/HEADING]

 

 

Each component of Microsoft Fabric offers unique features that enhance data handling:

 

  • Data Factory: Supports pipelines and data flows for seamless data integration and transformation.
  • Synapse Data Engineering: Includes advanced features like Lakehouse databases and shortcuts for accessing data from external sources without duplication.
  • Synapse Data Warehouse: Simplifies data modeling and warehouse creation with integrated pipelines for direct data loading.
  • Power BI: Features like Copilot and Git integration streamline report creation and collaborative development.
  • OneLake: Centralized storage that supports various databases, ensuring data consistency across different components.

  • AI - Copilot: An AI-enhanced toolset tailored to support data professionals in their workflow. It includes Copilot for Data Science and Data Engineering, Copilot for Data Factory, and Copilot for Power BI.
  • Purview: Microsoft Purview is a unified data governance service that helps you manage and govern your data estate.

[HEADING=1]

2. Architecture of Microsoft Fabric
[/HEADING]

 

 

 

The architecture of Microsoft Fabric is centered around its components and their serverless compute capabilities. Serverless compute allows users to run applications on-demand without managing infrastructure. For example:

 

  • Synapse Data Warehouse: Uses T-SQL for querying tables with serverless compute.
  • Synapse Data Engineering: Leverages a Spark engine for notebook-based data transformations, eliminating the wait time typically seen in Azure Synapse Analytics.
  • Synapse Real-Time Analytics: Uses KQL for querying streaming data stored in Kusto DB.

 

The unified storage solution, OneLake, ensures that all data is stored in a single, cohesive location, making data management and retrieval more efficient.

 

 

 

[ATTACH type=full" alt="PascalBurume08_2-1724231651839.png]63813[/ATTACH]

 

 

[HEADING=1]

3. Project Architecture Overview

[/HEADING]

 

 

[ATTACH type=full" alt="Architecture.png]63814[/ATTACH]

 

Let’s break down the architecture of this project:

 

  1. Data Ingestion:
    • Tool: Data Factory (built into Microsoft Fabric).
    • Process: We'll configure the Bing Search API in Azure to ingest the latest news about the mpox virus into the Microsoft Fabric workspace. The data will be stored in OneLake as a JSON file.

  2. Data Storage:

    • Tool: OneLake.
    • Process: The ingested raw JSON data will be stored in the Lakehouse database within OneLake. This unified storage solution ensures that all data is easily accessible for subsequent processing.

  3. Data Transformation:

    • Tool: Synapse Data Engineering.
    • Process: The raw JSON file will be transformed into a structured Delta table using Spark notebooks within Synapse Data Engineering. This step includes implementing incremental loads to ensure that only new data about the mpox virus is processed.

  4. Sentiment Analysis:

    • Tool: Synapse Data Science.
    • Process: We'll use a pre-trained text analytics model to perform sentiment analysis on the news about the mpox virus. The results, indicating whether news articles are positive, mixed, negative, or neutral, will be stored as a Delta table in the Lakehouse database.

  5. Data Reporting:

    • Tool: Power BI.
    • Process: The final step involves creating a news dashboard in Power BI using the sentiment-analyzed data. This dashboard will provide insight into the latest mpox news trends and sentiments.

  6. Alerting System:

    • Tool: Data Activator.
    • Process: We'll set up alerts based on the sentiment analysis. For example, an alert can be triggered when a news article about the mpox virus with a negative sentiment is detected. Alerts will be configured to send notifications via email.

  7. Orchestration:

    • Tool: Data Factory Pipelines.
    • Process: All tasks, from data ingestion to reporting, will be orchestrated using pipelines in Data Factory. This ensures that the entire workflow is automated and connected, allowing for efficient data processing and analysis.

[HEADING=1]

4. The Agenda for the Project

[/HEADING]

 

Here's what you'll learn in this tutorial:

 

  1. Environment Setup: Creating and configuring the necessary resources in Microsoft Fabric.
  2. Data Ingestion: Using Data Factory to ingest Bing News data into OneLake.
  3. Data Transformation: Converting the raw JSON data into structured Delta tables with incremental loads using Synapse Data Engineering.
  4. Sentiment Analysis: Performing sentiment analysis on the news data using Synapse Data Science.
  5. Data Reporting: Building a Power BI dashboard to visualize the analyzed news data about the mpox virus.
  6. Orchestration: Creating pipelines in Data Factory to automate the end-to-end process.
  7. Alerting: Setting up alerts in Data Activator to monitor and respond to critical data events.

[HEADING=1]

5. Setting Up Your Environment for an End-to-End Azure Data Engineering Project with Microsoft Fabric
[/HEADING]

 

 

 

In this section, we'll guide you through the environment setup required for the project. Let's get started!

 

Accessing the Azure Portal

 

To begin, open your browser and navigate to https://portal.azure.com. This will take you to the Azure portal, where you'll perform all the necessary configurations for this project.

 

Creating a Resource Group

 

The first task is to create a dedicated Resource Group for this project. Resource groups help organize and manage your Azure resources efficiently.

 

 

1. Navigate to Resource Groups: On the Azure portal, locate the Resource Groups option at the top of the page and click on it. This will show all the resource groups currently in your subscription.

 

 

[ATTACH type=full" alt="PascalBurume08_6-1724232458237.png]63815[/ATTACH]

 

 

2. Create a New Resource Group:

  • Click the Create button.
  • Subscription: Choose your subscription.
  • Resource Group Name: Enter a meaningful name, such as rg-bing-data-analytics, to represent the project.
  • Region: Select the region closest to your current location (e.g., "Central US").
  • Tags: Optionally, add tags to help identify resources later.

     

 

3. Review and Create: After entering the details, click Review and Create. Once validated, click Create to set up the resource group.

 

[ATTACH type=full" alt="PascalBurume08_7-1724232458241.png]63816[/ATTACH]

 

 

 

 

 

[HEADING=2]

5.1 Setting Up the Bing Search API
[/HEADING]

 

 

 

Now that we have a resource group, the next step is to create the Bing Search API, which will serve as the data source for our project.

 

  1. Create Bing Search API:
    • Inside your resource group, click Create.
    • In the marketplace search box, type "Bing" and hit Enter.
    • Select Bing Search v7 from the results and click Create.

  2. Configure the API:

  • Subscription: Ensure your subscription is selected.
  • Resource Group: Confirm the resource group you just created is selected.
  • Resource Name: Enter a name like Bing-news-api.
  • Region: The only available option is "Global".
  • Pricing Tier: Select F1 (Free Tier), which supports up to 1,000 API calls per month.

  3. Accept Terms and Create:

  • Ensure you accept the terms and conditions by checking the required box.
  • Review the settings and click Create.

 

Once the API is created, you'll be redirected to the resource page, where you can find essential details like the API key and endpoint. These will be crucial for connecting to the API later in the project.
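
If you want to sanity-check the key and endpoint before wiring them into Data Factory, a minimal Python sketch like the one below can help. The endpoint path and the placeholder key are assumptions to replace with the values shown on your own resource page; the header name and query parameters are the same ones we will configure later in the pipeline.

import requests

# Assumed News Search endpoint; confirm it against the "Keys and Endpoints"
# page of your Bing Search resource in the Azure portal.
endpoint = "https://api.bing.microsoft.com/v7.0/news/search"
api_key = "<your-bing-api-key>"  # placeholder, do not hard-code real keys

headers = {"Ocp-Apim-Subscription-Key": api_key}
params = {"q": "mpox", "count": 100, "freshness": "Day"}

response = requests.get(endpoint, headers=headers, params=params)
response.raise_for_status()

articles = response.json().get("value", [])
print(f"Retrieved {len(articles)} news articles")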

 

 

 

[ATTACH type=full" alt="PascalBurume08_8-1724232458249.png]63817[/ATTACH]

 

 

 

[HEADING=2]

5.2 Setting Up Microsoft Fabric in Power BI

[/HEADING]

 

Next, we'll move to Microsoft Fabric within the Power BI workspace.

 

  1. Access Power BI: Open another tab in your browser and navigate to app.powerbi.com. This is where you'll interact with Microsoft Fabric.
  2. Create a Dedicated Workspace:
    • Click on New Workspace at the bottom.
    • Workspace Name: Enter a name like News Bing.
    • Description: Optionally, add a brief description.
    • Licensing: Assign the Fabric trial license to this workspace.

 

Note: If you don’t have the necessary permissions to create a workspace, contact your Power BI admin to request access.

 

 

 

 

 

[ATTACH type=full" alt="PascalBurume08_0-1724232934684.png]63818[/ATTACH]

 

 

 

3. Enable Fabric: If you haven't already, ensure that Microsoft Fabric is enabled in your workspace. You can do this by navigating to the settings and activating the free trial provided by Microsoft.

 

 

4. Creating the Lakehouse Database: Finally, we'll set up the Lakehouse database, where all the data will be stored and processed.

 

5. Switch to the Data Engineering Component:

  • In the Power BI workspace, click the Data Engineering option in the bottom left corner.

 

6. Create the Lakehouse Database:

  • Click on Lakehouse at the top.
  • Database Name: Enter a name like bing_lake_db.
  • Click Create.

 

This Lakehouse database will store both the raw JSON data from the Bing API and the processed data in Delta table format.

 

 

[ATTACH type=full" alt="PascalBurume08_1-1724232934692.png]63819[/ATTACH]

 

 

 

[HEADING=1]

6. Data Ingestion with Microsoft Fabric: Step-by-Step Guide

[/HEADING]

 

In this section, we’ll dive into the data ingestion process using Microsoft Fabric’s Data Factory component. This is a critical step where we’ll connect to the Bing API, retrieve the latest news data about the mpox virus, and store it in our Lakehouse database as a JSON file.

 

Step 1: Accessing the Bing API

 

First, let’s revisit the Bing API resource we created earlier in Azure. This API provides the keys and endpoints necessary to connect and retrieve data.

 

  1. Retrieve API Details:
    • Go to the Azure Portal and navigate to the Bing API resource.
    • Under the Keys and Endpoints section, note the key and the base URL.

 

[ATTACH type=full" alt="PascalBurume08_2-1724232934698.png]63820[/ATTACH]

 

 

2. Documentation and Tutorials:

 

  • Azure’s Bing API resource includes a tutorials section that links to the official documentation. This documentation will help us configure our API calls, including endpoints, headers, and query parameters.

 

Step 2: Setting Up Data Factory in Microsoft Fabric

 

Next, we’ll switch over to Microsoft Fabric to set up our data ingestion pipeline.

 

  • Access Data Factory: In the bottom left, click on the Power BI icon and choose the Data Factory component.

 

Step 3: Creating the Data Ingestion Pipeline

 

We’ll now create a pipeline that will handle the data ingestion process.

 

  1. Create a New Pipeline:
    • Click on Data Pipeline and give it a meaningful name, such as News Ingestion Pipeline.
    • Click Create to set up the new pipeline.

 

[ATTACH type=full" alt="PascalBurume08_5-1724232934748.png]63821[/ATTACH]

 

 

2. Add Copy Data Activity:

  • In the pipeline workspace, click on Copy Data Activity and choose Add to Canvas.
  • Name this activity something descriptive, like Copy Latest News.
  • Move to the Source tab to configure the data source.

 

Step 4: Configuring the Data Source (Bing API)

 

Now, we’ll configure the source of our data, which is the Bing API.

 

 

  1. Select Data Store Type:
    • Choose More since the Bing API is outside the Fabric workspace.
    • Click REST to establish a new connection.

 

[ATTACH type=full" alt="PascalBurume08_6-1724232934775.png]63822[/ATTACH]

 

 

2. Set Up API Connection:

  • Select REST as the data source type and click Continue.
  • Enter the Base URL of the Bing News API from the Azure portal.
  • Use Anonymous as the authentication method because we’ll handle authentication with headers.

 

[ATTACH type=full" alt="PascalBurume08_7-1724232934783.png]63823[/ATTACH]

 

 

 

 

 

3. Add Headers for Authentication:

  • In the Source tab, expand the Advanced section.
  • Add a new header with the name Ocp-Apim-Subscription-Key (copied from the documentation) and paste in your API key from Azure.

 

[ATTACH type=full" alt="PascalBurume08_8-1724232934789.png]63824[/ATTACH]

 

 

4. Configure Query Parameters: Use the Relative URL field to add the following query parameters:

  • q=mpox for the search term.
  • count=100 to retrieve up to 100 news articles.
  • freshness=Day to get news from the past 24 hours.
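
For example, if the base URL you copied from the portal ends at the service root, the combined relative URL could look like the line below; the exact path segment is an assumption based on the Bing News Search v7 documentation, so adjust it to match your endpoint.

v7.0/news/search?q=mpox&count=100&freshness=Day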

 

[ATTACH type=full" alt="PascalBurume08_9-1724232934795.png]63825[/ATTACH]

 

 

 

  • Note that we are querying the Bing News engine specifically for the latest updates on the mpox virus.

 

 

 

[ATTACH type=full" alt="PascalBurume08_10-1724232934889.png]63826[/ATTACH]

 

 

 

5. Preview Data:

  • Click Preview Data to verify the setup. You should see JSON output with the latest news articles.

 

[ATTACH type=full" alt="PascalBurume08_11-1724232934899.png]63827[/ATTACH]

 

 

 

Step 5: Configuring the Destination (Lakehouse Database)

 

Now, we’ll set up the destination for the data – our Lakehouse database.

 

  1. Choose Workspace Option:
    • In the Destination tab, select Workspace as the data store type.
    • Select Lakehouse and choose the bing_lake_db.

  2. Set File Path:

  • Choose Files as the root folder.
  • Set the file name to bing-latest-news.json and select JSON as the file format.

 

[ATTACH type=full" alt="PascalBurume08_12-1724232934905.png]63828[/ATTACH]

 

 

 

3. Save and Run Pipeline:

  • Save your pipeline by clicking the Save button.
  • Run the pipeline by clicking Run.

 

[ATTACH type=full" alt="PascalBurume08_13-1724232934907.png]63829[/ATTACH]

 

 

 

Step 6: Verifying Data Ingestion

 

Once the pipeline runs successfully, we’ll verify that the data has been ingested correctly.

 

  1. Check Lakehouse Database:
    • Open the Lakehouse database in the Data Engineering component.
    • You should see the bing-latest-news.json file listed under Files.

  2. Review the Data:

  • Ensure that the JSON file contains the expected news data based on the query parameters you configured.

 

[ATTACH type=full" alt="PascalBurume08_14-1724232934939.png]63830[/ATTACH]

 

 

[HEADING=1]

7. Data Transformation Process

[/HEADING]

 

The process involves several steps, including reading the raw JSON file, processing the data, and loading it into a Delta table.

 

 

Step 1: Creating a Notebook

 

From the workload switcher at the bottom left of the screen, select Data Engineering, then select Notebook from the New section at the top of the landing page of the Data Engineering experience.

 

[ATTACH type=full" alt="PascalBurume08_0-1724234866261.png]63831[/ATTACH]

 

 

 

Once the notebook is created, go to the items view in your workspace, open it, and begin the transformation process.

 

[HEADING=3]

Step 2: Reading the Raw JSON File

[/HEADING]

 

The first step involves reading the raw JSON file from the Lakehouse database into a Spark DataFrame.

 

df = spark.read.option("multiline", "true").json("Files/bing-latest-news.json")
display(df)
   

 

[ATTACH type=full" alt="PascalBurume08_0-1724237077033.png]63832[/ATTACH]

 

[HEADING=3]

Step 3: Selecting the Relevant Column

[/HEADING]

 

Since we're only interested in the [iCODE]value[/iCODE] column that contains the actual JSON structure of the news articles, we select this column from the DataFrame.

 

df = df.select("value")
display(df)
   

 

[ATTACH type=full" alt="PascalBurume08_1-1724237111563.png]63833[/ATTACH]

 

[HEADING=3]

Step 4: Exploding the JSON Objects

[/HEADING]

 

We use the [iCODE]explode[/iCODE] function to explode all the JSON objects that exist in the [iCODE]value[/iCODE] column from a single row structure to multiple rows. This allows us to represent each news article as a separate row.

 

from pyspark.sql.functions import explode
df_exploded = df.select(explode(df["value"]).alias("json_object"))
display(df_exploded)
   

 

[ATTACH type=full" alt="PascalBurume08_2-1724237176845.png]63834[/ATTACH]

 

[HEADING=3]

Step 5: Converting JSON Objects to JSON Strings

[/HEADING]

 

Next, we convert the exploded JSON objects into a list of JSON strings.

 

json_list = df_exploded.toJSON().collect()
   

[HEADING=3]

Step 6: Parsing and Extracting Information from JSON Strings
[/HEADING]

 

Using the [iCODE]json[/iCODE] library, we parse the JSON strings into dictionaries and extract the required information for each news article, such as title, description, category, URL, image URL, provider, and publication date.

 

import json

# Initialize lists to store extracted information
title = []
description = []
category = []
url = []
image = []
provider = []
datePublished = []

# Process each JSON string in the list
for json_str in json_list:
   try:
       article = json.loads(json_str)

       if article["json_object"].get("name"):
           title.append(article["json_object"]["name"])

       if article["json_object"].get("description"):
           description.append(article["json_object"]["description"])

       if article["json_object"].get("category"):
           category.append(article["json_object"]["category"])
       else:
           category.append(None)

       if article["json_object"].get("url"):
           url.append(article["json_object"]["url"])

       if article["json_object"]["provider"][0].get("image"):
           image.append(article["json_object"]["provider"][0]["image"]["thumbnail"]["contentUrl"])

       if article["json_object"]["provider"][0].get("name"):
           provider.append(article["json_object"]["provider"][0]["name"])

       if article["json_object"].get("datePublished"):
           datePublished.append(article["json_object"]["datePublished"])

   except Exception as e:
       print(f"Error processing JSON object: {e}")
   

[HEADING=3]

Step 7: Creating a DataFrame with Extracted Information

[/HEADING]

 

We then combine all the extracted information into a structured DataFrame.

 

from pyspark.sql.types import StructType, StructField, StringType

# Combine the lists
data = list(zip(title, description, url, image, provider, datePublished))

# Define schema
schema = StructType([
   StructField("title", StringType(), True),
   StructField("description", StringType(), True),
   StructField("url", StringType(), True),
   StructField("image", StringType(), True),
   StructField("provider", StringType(), True),
   StructField("datePublished", StringType(), True)
])

# Create DataFrame
df_cleaned = spark.createDataFrame(data, schema=schema)
display(df_cleaned)
   

 

[ATTACH type=full" alt="PascalBurume08_5-1724237262207.png]63835[/ATTACH]

 

[HEADING=3]

Step 8: Formatting the Date Column

[/HEADING]

 

The [iCODE]datePublished[/iCODE] column, originally in timestamp format, is converted to a more readable date format.

 

from pyspark.sql.functions import to_date, date_format

df_cleaned_final = df_cleaned.withColumn("datePublished", date_format(to_date("datePublished"), "dd-MMM-yyyy"))
display(df_cleaned_final)
   

 

 

 

[ATTACH type=full" alt="PascalBurume08_6-1724237310249.png]63836[/ATTACH]

 

Step 9: Implementing Incremental Load in Data Transformation with Microsoft Fabric

 

In this section, we’ll cover how to handle the error that occurs when trying to write data to an existing table in your Lakehouse database and how to implement an incremental load using the Type 1 Slowly Changing Dimension (SCD) method. This will help ensure that only new or updated data is added to your table without unnecessarily duplicating or overwriting existing data.

 

First approach: Understanding the Error and the Overwrite Mode

 

When you attempt to write data to an existing table without handling it correctly, you might encounter an error stating that the "table or view already exists." This is because the system tries to create a new table with the same name as an existing one.

 

Overwrite Mode: One way to solve this is by using the overwrite mode, which replaces the entire content of the existing table with the new data. However, this approach can lead to performance issues, especially with large datasets, and can result in data loss if previous records are simply overwritten.
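
For reference, a minimal sketch of overwrite mode, using the same DataFrame and table name as this project, looks like this; every run replaces the full contents of the table:

# Overwrite mode: replaces the entire table contents on every run
df_cleaned_final.write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("tbl_latest_news")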

 

Second approach: Understanding the Append Mode

 

Another approach is the append mode, where the new data is simply added to the existing table.

 

Append Mode: This method appends new data to the existing table without checking for duplicates. As a result, it can lead to data duplication and an unnecessary increase in table size.
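
Append mode is the same call with a different mode setting, but it blindly adds every incoming row, duplicates included:

# Append mode: adds all incoming rows, even if they already exist in the table
df_cleaned_final.write.format("delta") \
    .mode("append") \
    .saveAsTable("tbl_latest_news")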

 

Third approach: Incremental Loading with Type 1 and Type 2 Merging

 

- Type 1 Merge Logic: When a record with the same unique identifier (in this case, the URL) exists in both the new data and the table, the system checks if any fields have changed. If there are changes, the system updates the existing record with the new data. If there are no changes, the record is ignored. Records that do not yet exist in the table are simply inserted.

 

 

- Type 2 Merge Logic: When a record with the same unique identifier (in this case, the URL) exists in both the new data and the table, the system checks if any fields have changed. If there are changes, the system inserts a new record with the new data, and the old record is marked as expired or archived.
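
Purely for illustration, a Type 2 merge could look roughly like the sketch below: it reuses the vw_df_cleaned_final temporary view created in the Type 1 code further down, and it assumes hypothetical is_active, valid_from, and valid_to columns that are not part of the table built in this project and would have to be added first.

# Hypothetical Type 2 sketch: expire changed rows, then insert new versions.
# The is_active / valid_from / valid_to columns are assumptions, not part of
# the tbl_latest_news table created in this tutorial.
spark.sql("""
    MERGE INTO tbl_latest_news AS target
    USING vw_df_cleaned_final AS source
    ON source.url = target.url AND target.is_active = true
    WHEN MATCHED AND source.description <> target.description
    THEN UPDATE SET target.is_active = false, target.valid_to = current_date()
""")

spark.sql("""
    INSERT INTO tbl_latest_news
    SELECT source.*, true AS is_active, current_date() AS valid_from,
           CAST(NULL AS DATE) AS valid_to
    FROM vw_df_cleaned_final AS source
    LEFT JOIN tbl_latest_news AS target
      ON source.url = target.url AND target.is_active = true
    WHERE target.url IS NULL
""")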

 

We implement an incremental load using a Type 1 merge to address these issues. This method ensures that only new or changed data is added to the table, avoiding duplicates and maintaining data integrity.

 

 

 

from pyspark.sql.utils import AnalysisException

try:
   # Define the table name
   table_name = "tbl_latest_news"

   # Attempt to write the DataFrame as a Delta table
   df_cleaned_final.write.format("delta").saveAsTable(table_name)

except AnalysisException:
   print("Table Already Exists")

# Merge the new data with the existing table
df_cleaned_final.createOrReplaceTempView("vw_df_cleaned_final")
spark.sql(f"""
   MERGE INTO {table_name} target_table
   USING vw_df_cleaned_final source_view
   ON source_view.url = target_table.url
   WHEN MATCHED AND 
       source_view.title <> target_table.title OR
       source_view.description <> target_table.description OR
       source_view.image <> target_table.image OR
       source_view.provider <> target_table.provider OR
       source_view.datePublished <> target_table.datePublished
   THEN UPDATE SET *
   WHEN NOT MATCHED THEN INSERT *
""")
   

 

 

 

Step 10: Verifying the Data

 

After the table is updated, you can verify the contents by running a simple SQL query to count the number of records in the table.

 

 

 

%%sql
SELECT count(*) FROM tbl_latest_news
   

 

 

 

[ATTACH type=full" alt="PascalBurume08_0-1724243709901.png]63837[/ATTACH]

 

[HEADING=1]

8. Performing Sentiment Analysis using Synapse Data Science in Microsoft Fabric
[/HEADING]

 

In this section, we delve into performing sentiment analysis on the news articles we ingested and processed earlier. We'll use the Synapse Data Science tool within Microsoft Fabric, leveraging the Synapse ML library to apply a pre-trained machine learning model for sentiment analysis. Let's walk through the process step by step.

 

 

Step 1: Accessing Synapse Data Science in Microsoft Fabric

 

  1. Navigating to Synapse Data Science:
    • Start by accessing the Synapse Data Science tool from the Microsoft Fabric workspace. You can do this by clicking on the Power BI icon in the bottom left and selecting the "Data Science" option.

 

[ATTACH type=full" alt="PascalBurume08_0-1724243772975.png]63838[/ATTACH]

 

 

2. Creating a Notebook:

  • Create a new notebook in the Synapse Data Science workspace. Rename it to something meaningful like "News-Sentiment-Analysis" for easy identification.

 

Step 2: Setting Up the Environment

 

Attach the Lakehouse Database:

 

  • To work with the data you've processed earlier, you need to attach the Lakehouse database to the notebook. This will allow you to access the clean table containing the news articles.

  • After attaching the Lakehouse database, you can load the data using the following code:

 

# Load the clean table into a DataFrame
df = spark.sql("SELECT * FROM tbl_latest_news")
df.show()
   

 

 

 

  • This code will display the data from the table, showing the columns and rows of news articles about the mpox virus.

 

[ATTACH type=full" alt="PascalBurume08_0-1724244223483.png]63839[/ATTACH]

 

Step 3: Implementing Sentiment Analysis with Synapse ML

 

  1. Import Synapse ML and Configure the Model
    :

    • Synapse ML (formerly MMLSpark) provides various pre-built models, including those for sentiment analysis. You can use the AnalyzeText model for this task.

    • First, import the necessary libraries and set up the model:

 

import synapse.ml.core
from synapse.ml.services import AnalyzeText

# Import the model and configure the input and output columns
model = (AnalyzeText()
       .setTextCol("description")
       .setKind("SentimentAnalysis")
       .setOutputCol("response")
       .setErrorCol("error"))
   

 

2. Apply the Model to the DataFrame: Once the model is configured, apply it to the DataFrame containing the mpox news descriptions:

 

# Apply the model to our dataframe
result = model.transform(df)
display(result)
   

 

 

 

[ATTACH type=full" alt="PascalBurume08_1-1724246369446.png]63840[/ATTACH]

 

3. Extract the Sentiment Value: The sentiment results are stored as a JSON object in the response column. You'll need to extract the actual sentiment value:

 

# Create Sentiment Column
from pyspark.sql.functions import col

sentiment_df = result.withColumn("sentiment", col("response.documents.sentiment"))
display(sentiment_df)
   

 

[ATTACH type=full" alt="PascalBurume08_2-1724246591811.png]63841[/ATTACH]

 

4. Remove the response and error columns:

 

sentiment_df_final = sentiment_df.drop("error", "response")
display(sentiment_df_final)
   

 

[ATTACH type=full" alt="PascalBurume08_3-1724270604733.png]63842[/ATTACH]

 

 

 

[ATTACH type=full" alt="PascalBurume08_5-1724246862144.png]63843[/ATTACH]

 

 

Step 4: Writing the Results to the Lakehouse Database with Incremental Load

 

  1. Perform Incremental Load with Type 1 Merge:

    • Similar to the data processing step, use a Type 1 merge to write the sentiment analysis results to the Lakehouse database, ensuring that only new or updated records are added.

 

from pyspark.sql.utils import AnalysisException

try:
   table_name = 'tbl_sentiment_analysis'
   sentiment_df_final.write.format("delta").saveAsTable(table_name)
except AnalysisException:
   print("Table Already Exists")

# Create the temporary view outside the except block so the merge below
# also works on the very first run, right after the table is created
sentiment_df_final.createOrReplaceTempView("vw_sentiment_df_final")
spark.sql(f"""
   MERGE INTO {table_name} target_table
   USING vw_sentiment_df_final source_view
   ON source_view.url = target_table.url
   WHEN MATCHED AND 
       source_view.title <> target_table.title OR
       source_view.description <> target_table.description OR
       source_view.image <> target_table.image OR
       source_view.provider <> target_table.provider OR
       source_view.datePublished <> target_table.datePublished
   THEN UPDATE SET *
   WHEN NOT MATCHED THEN INSERT *
""")
   

  • This code ensures that the sentiment analysis results are incrementally loaded into the tbl_sentiment_analysis table, updating only the changed records and adding new ones.


  • After running the merge operation, validate that the tbl_sentiment_analysis table in the Lakehouse database contains the correct data, with no duplicates and all sentiments accurately recorded, for example with the quick check shown below.
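
A minimal sketch of such a check, assuming the table name used above, could look like this:

# Quick sanity check: URLs appearing more than once would indicate duplicates
dupes = spark.sql("""
    SELECT url, COUNT(*) AS cnt
    FROM tbl_sentiment_analysis
    GROUP BY url
    HAVING COUNT(*) > 1
""")
print(f"Duplicate URLs found: {dupes.count()}")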

     

[HEADING=1]

9. Building a News Dashboard using Power BI in Microsoft Fabric

[/HEADING]

 

In this section, we will create a news dashboard using Power BI within Microsoft Fabric. This dashboard will visualize the news articles and their corresponding sentiment analysis results from the tbl_sentiment_analysis table. Here’s a step-by-step guide on how to build and customize this dashboard.

 

 

Step 1: Access the Data in Power BI

 

  1. Connect Power BI to the Lakehouse Database:
    • Navigate to the bing_lake_db database in Microsoft Fabric.
    • Identify the tbl_sentiment_analysis table, which contains the news articles along with the sentiment analysis results.

 

[ATTACH type=full" alt="PascalBurume08_6-1724248377279.png]63844[/ATTACH]

 

 

2. Create a Semantic Model:

 

  • A semantic model in Microsoft Fabric acts similarly to a Power BI dataset. It holds table information from the Lakehouse database, enabling Power BI to connect and build reports.

  • Create a new semantic model named News Dashboard DataSet and select the tbl_sentiment_analysis table for inclusion.

 

[ATTACH type=full" alt="PascalBurume08_7-1724248377284.png]63845[/ATTACH]

 

Step 2: Build the Initial Report

 

  1. Auto-Create Report:
    • Microsoft Fabric offers a feature called "Auto Create Report," which quickly generates an initial report based on your dataset. This is a good starting point for building more customized reports.
    • Navigate to Power BI within your workspace, select the News Dashboard DataSet, and use the "Auto Create Report" feature to generate a report.

  2. Edit and Customize the Report:

  • After the report is auto-generated, click the Edit button to modify and add your own visuals.
  • Create a new page in the report for building your custom visuals.

 

[ATTACH type=full" alt="PascalBurume08_9-1724248767767.png]63846[/ATTACH]

 

 

Step 3: Create and Customize Visuals

 

  1. Table Visual:
    • Add a Table visual to display key information such as the news Title, Provider, URL, Category, and Date Published.
    • Adjust the column widths to ensure all information is clearly visible.

  2. Filter Visual (Slicer):

  • Add a Slicer visual to filter the table based on the Date Published column. Convert this slicer to a dropdown format for a cleaner look.

  3. Convert URLs to Clickable Links:

  • Go back to the semantic model.

 

[ATTACH type=full" alt="PascalBurume08_10-1724248990393.png]63847[/ATTACH]

 

  • Convert the URL column to a web URL format under the Data Category in the column properties.

 

[ATTACH type=full" alt="PascalBurume08_11-1724249033281.png]63848[/ATTACH]

 

 

 

 

 

  • Refresh your Power BI report to see the URLs as clickable links, allowing users to directly access the full news articles.

 

[ATTACH type=full" alt="PascalBurume08_12-1724249211728.png]63849[/ATTACH]

 

  4. Apply Default Filter for Latest News:
    • Configure the table visual to always display the latest news by default. Use the Top N filtering option to ensure the table shows only the news articles with the most recent Date Published.

 

[ATTACH type=full" alt="PascalBurume08_13-1724249280217.png]63850[/ATTACH]

 

 

Step 4: Create Measures for Sentiment Analysis

 

  1. Create Measures in the Semantic Model:
    • In the semantic model, create three DAX measures to calculate the percentage of positive, negative, and neutral sentiments among the news articles.
    • Use the following DAX code snippets to create the measures:

 

 

 

 

 

 

 

Negative Sentiment % =
IF(COUNTROWS(FILTER(tbl_sentiment_analysis, tbl_sentiment_analysis[sentiment] = "negative")) > 0,
   DIVIDE(
       CALCULATE(
           COUNTROWS(FILTER(tbl_sentiment_analysis, tbl_sentiment_analysis[sentiment] = "negative"))
       ),
       COUNTROWS(tbl_sentiment_analysis)
   ) * 100,
   0
)

 

 

 

 

 

 

 

  • You can create similar measures for positive, neutral, and mixed sentiment.

 

Add Card Visuals for Sentiment Scores:

 

  • In the Power BI report, add three Card visuals to display the positive, negative, and neutral sentiment percentages.

  • Configure each card to use the corresponding measure you created.

  • Apply the same Top N filter to these card visuals to ensure they display sentiment scores for the latest news articles.

 

 

 

[ATTACH type=full" alt="PascalBurume08_8-1724248665554.png]63851[/ATTACH]

 

Step 5: Finalize and Save the Report

 

  1. Save the Report:
    • Name your report News Dashboard and save it within your workspace.
    • Ensure that the report is updated with the latest data every time new news articles are ingested.

  2. Review and Test:

  • Test the functionality of the report by filtering the news articles and ensuring that sentiment scores update accordingly.
  • Verify that the URLs are clickable and lead to the correct news articles.

[HEADING=1]

10. Setting Up Alerts in Power BI Reports Using Data Activator

[/HEADING]

 

In this section, we'll go through how to set up alerts in Power BI reports using the Data Activator tool in Microsoft Fabric. This will allow you to receive notifications when certain conditions are met in your Power BI visualizations, such as when a sentiment score crosses a specific threshold.

 

Step 1: Access the Data Activator Tool

 

  1. Open the Data Activator:
    • Navigate to the workspace in Microsoft Fabric.
    • Click on the Power BI icon at the bottom left of the screen.
    • Select Data Activator from the options.

 

[ATTACH type=full" alt="PascalBurume08_0-1724270604720.png]63852[/ATTACH]

 

 

 

Step 2: Setting Up Alerts in Power BI

 

  1. Navigate to Your Power BI Report:
    • Open the News Dashboard Power BI report that you created.
    • Go to the specific page where you want to set up the alert (e.g., Page 2 where your custom dashboard is located).

  2. Select a Visual to Monitor:

  • Identify the visual you want to monitor for alerts. For this example, we'll use the Neutral Sentiment card visual.
  • Click on the Edit button in Power BI.
  • Select the Neutral Sentiment card visual, then click on the three dots (More options) at the top-right of the visual.

  3. Configure the Alert:

  • Choose Set Alert from the dropdown.
  • On the right side, configure the alert options:
    • Visual: Automatically selected based on the visual you clicked.
    • Measure: Select the measure for neutral sentiment (e.g., Neutral Sentiment %).
    • Condition: Set the condition for the alert. For example, set it to "becomes less than 50" so that an alert is triggered when the neutral sentiment percentage drops below 50%.
    • Notification Type: Choose between Email and Teams. For this project, select Email to receive alerts by email.
    • Workspace and Reflex: Choose the workspace where this reflex item will be saved (e.g., News Bing). Create a new Reflex item, such as Neutral Sentiment Item, and make sure the Start My Alert checkbox is selected.

 

[ATTACH type=full" alt="PascalBurume08_1-1724270604728.png]63853[/ATTACH]

 

 

 

  4. Create the Alert:
    • After verifying your configurations, click Create Alert.
    • The system will create the alert and link it to a Reflex item in Data Activator.

 

Step 3: Managing and Monitoring Alerts

 

  1. View and Manage Alerts:
    • Once the alert is created, click on View Alert to be redirected to the Data Activator Reflex item.
    • In the Reflex item, you'll find tabs like Triggers, Properties, and Events:
      • Triggers: View the condition that will activate the alert.
      • Properties: Adjust settings related to the alert.
      • Events: Monitor all events triggered by this Reflex item.

 

[ATTACH type=full" alt="PascalBurume08_2-1724270604732.png]63854[/ATTACH]

 

 

 

  2. Modify or Stop Alerts:
    • If you need to adjust any settings, you can do so directly within the Reflex item.
    • You can also stop or delete the alert if it is no longer needed.

 

Step 4: Testing the Alerts

 

  1. Monitor Alerts in Email:
    • Once the alert is configured, test it by ingesting new data through your pipeline or waiting for the scheduled pipeline to run.

 

[ATTACH type=full" alt="PascalBurume08_1-1724271255226.png]63855[/ATTACH]

 

  • If the condition set in the alert is met, you will receive an email notification.

     

 

[ATTACH type=full" alt="PascalBurume08_0-1724271316408.png]63856[/ATTACH]

 

 

[HEADING=1]

11. Creating an End-to-End Pipeline Using Data Factory in Microsoft Fabric
[/HEADING]

 

In this section, we'll walk through creating an automated pipeline in Data Factory that orchestrates all the tasks we've done in the project so far. This pipeline will handle data ingestion, transformation, sentiment analysis, and update the Power BI reports with the latest news data.

 

Step 1: Review and Enhance the Existing Pipeline

 

  1. Open the Existing Pipeline:
    • Start by opening the news_ingestion_pipeline that was previously created to ingest data from the Bing API.

  2. Enhance the Pipeline with Additional Activities:

  • Add Data Transformation:
    • Drag a Notebook activity onto the canvas.
    • Connect the Copy Data activity's success output to this notebook activity using the On Success connection.
    • Rename the activity to Data Transformation.
    • In the Settings tab, select the appropriate workspace and choose the process_bing_news notebook.

  • Add Sentiment Analysis:
    • Add another Notebook activity and connect it to the success output of the Data Transformation activity.
    • Rename this activity to Sentiment Analysis.
    • In the Settings tab, select the sentiment analysis notebook you created earlier (News-Sentiment-Analysis).

 

[ATTACH type=full" alt="PascalBurume08_0-1724271445097.png]63857[/ATTACH]

 

Step 2: Schedule the Pipeline for Daily Execution

 

  1. Schedule the Pipeline:
    • Click on the Schedule button to set up a daily trigger.
    • Configure the trigger to run once a day at your preferred time.

 

[ATTACH type=full" alt="PascalBurume08_1-1724271492638.png]63858[/ATTACH]

 

  • Set the start date to today and the end date to one year from now.

 

2. Run the Pipeline Manually:

 

  • To test the pipeline, run it manually and confirm that the latest mpox news articles are ingested.

  • Verify the pipeline execution by checking each step to ensure it completes successfully.

 

Step 3: Save and Finalize

 

  1. Save the Pipeline and Report:
    • After making all changes, ensure that both the pipeline and the Power BI report are saved.

  2. Monitor and Validate:

  • Once the pipeline runs on schedule, monitor the report daily to ensure it updates with the latest news and sentiment analysis correctly.
     

 

[ATTACH type=full" alt="PascalBurume08_2-1724271613164.png]63859[/ATTACH]

 

 

 

 

 

Congratulations on successfully completing this end-to-end project! You've covered a comprehensive workflow, integrating multiple components of Microsoft Fabric to achieve a fully automated system for data ingestion, transformation, sentiment analysis, reporting, and alerting.

 

 

 

 

 

Key Takeaways:

 

  1. Data Ingestion: You created a pipeline in Data Factory to ingest news articles from the Bing API, storing them in a Lakehouse database as raw JSON files.
  2. Data Transformation: You processed the raw JSON files into a structured delta table, preparing the data for further analysis.
  3. Sentiment Analysis: Utilizing Synapse Data Science, you performed sentiment analysis on the news articles using a pre-trained Synapse ML model, storing the results in a delta table.
  4. Reporting with Power BI: You built a Power BI dashboard that dynamically displays the latest news articles and their associated sentiments, with a focus on the most recent data.
  5. Alert Configuration with Data Activator: You set up alerts using Data Activator to monitor the Power BI visuals, triggering an alert when the neutral sentiment percentage drops below the configured threshold, with notifications sent via email.
  6. End-to-End Testing: You tested the entire pipeline end to end with the mpox search term, verifying that the system correctly ingested the data, updated the dashboard, and sent the appropriate alerts.

 

Final Thoughts:

 

This project has provided a deep dive into Microsoft Fabric's capabilities, showcasing how its various components can be integrated to build a robust, automated data processing and reporting solution. By completing this project, you've gained valuable hands-on experience that will be incredibly useful in real-world Azure Data Engineering scenarios.

 

Next Steps:

 

  • Expand the Project: You can further enhance this project by adding more features, such as custom machine learning models, additional data sources, or advanced visualizations.
  • Optimize Performance: Consider exploring ways to optimize the pipeline for performance, especially when dealing with large datasets.
  • Explore More Features: Microsoft Fabric offers many more features. Delving deeper into its capabilities, like real-time streaming or advanced data governance, could further enhance your skills.

 

Thank you for following through this comprehensive project. Your dedication and attention to detail will undoubtedly pay off in your future endeavors as a data engineer.

 

 

 

Good luck, and keep learning!

 

Resources

 

Microsoft Certified: Fabric Analytics Engineer Associate - Certifications | Microsoft Learn

 

Query parameters used by News Search APIs - Bing Services | Microsoft Learn

 

Bing News Search APIs v7 Reference - Bing Services | Microsoft Learn

 

Explore end-to-end analytics with Microsoft Fabric - Training | Microsoft Learn

 

Lakehouse end-to-end scenario: overview and architecture - Microsoft Fabric | Microsoft Learn

 
