BrianSherwin
Introduction
For quite a while, I have been extremely interested in data visualization. Over the last few years, I have been focused on ways to visualize graph databases (regardless of where the data comes from). Using force-directed graphs to highlight the similarities or "connected communities" in data is incredibly powerful. The purpose of this post is to highlight the recent work the Kusto.Explorer team has done to visualize graphs in an Azure Data Explorer database, with the data coming from a Fabric KQL Database.
Note: The Kusto.Explorer application used to visualize the graph is currently only supported on Windows.
Background
Azure Data Explorer (ADX) is Microsoft's fully managed, high-performance analytics engine specializing in near real-time queries on high volumes of data. It is extremely useful for log analytics, time-series, and Internet of Things (IoT) scenarios. ADX is similar to traditional relational databases in that it organizes data into tables with strongly typed schemas.
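For example, a table and its strongly typed schema can be declared explicitly with a management command. Here is a minimal sketch; the column names mirror the airport fields used later in this post, but the exact names and types are up to you:
Code:
.create table airports (AirportID: long, Airport: string, AirportName: string, AirportCityName: string, AirportStateCode: string)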
In September 2023, the ADX team introduced extensions to the query language (KQL) that enabled graph semantics on top of the tabular data. These extensions enabled users to contextualize their data and its relationships as a graph structure of nodes and edges. Graphs are often an easier way to present and query complex or networked relationships. These are normally difficult to query because they require recursive joins on standard tables. Examples of common graphs include social networks (friends of friends), product recommendations (similar users also bought product x), connected assets (assembly line) or a knowledge graph.
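To make this concrete, here is a minimal, self-contained sketch of the graph semantics using hypothetical inline tables: "make-graph" builds the graph, and "graph-match" finds everyone within three reporting hops of Alice.
Code:
// Hypothetical node and edge tables defined inline
let Employees = datatable(name: string) ["Alice", "Bob", "Carol"];
let ReportsTo = datatable(manager: string, employee: string) ["Alice", "Bob", "Bob", "Carol"];
ReportsTo
| make-graph manager --> employee with Employees on name
| graph-match (boss)-[reports*1..3]->(worker)
    where boss.name == "Alice"
    project reportName = worker.name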
Fast forward to February 2024, when Microsoft Fabric introduced Eventhouse as a workload in a Fabric workspace. This brings the power of KQL and Real-Time Analytics to the Fabric ecosystem.
So now, I have a large amount of data in Fabric Eventhouse that I want to visualize with a force directed graph…
Let’s get started!
Pre-Requisites
If you want to follow along, you will need a Microsoft Fabric account (Get started with Fabric for Free).
Next, for this post, I used an open dataset from the Bureau of Transportation Statistics. The following files were used:
- Aviation Support Tables – Master Coordinate data
- When you download this file, you can choose the fields to be included in it. For this example, I only used AirportID, Airport, AirportName, AirportCityName and AirportStateCode.
- This Airport data will be loaded directly to a table in KQL.
- This file does not necessarily need to be unzipped.
- Airline Service Quality Performance 234 (On-Time performance data)
- For this blog, I only used the “April 2024” file from this link.
- This data will be accessed using a Lakehouse shortcut.
- Unzip this file to a local folder and change the extension from “.asc” to “.psv” because this is a pipe-separated file.
To use these downloaded files, I uploaded them to the "Files" section of the Lakehouse in my Fabric workspace. If you do not have a Lakehouse in your workspace, navigate to your workspace, select "New" -> "More Options", and choose "Lakehouse" from the Data Engineering workloads. Give your new Lakehouse a name and click "Create".
Once you have a Lakehouse, you can upload the files by clicking on the Lakehouse to bring up the Lakehouse Explorer. First, in the Lakehouse Explorer, click the three dots next to "Files", select "New subfolder", and create a folder named "Flights". Next, click the three dots next to the "Flights" subfolder, select "Upload" from the drop-down menu, and choose the on-time performance file. Confirm that the file was uploaded by refreshing the page.
Next, an Eventhouse will be used to host the KQL database where you will ingest the data for analysis. If you do not have an Eventhouse in your workspace, select "New" -> "More Options" and choose "Eventhouse" from the "Real-Time Intelligence" workloads. Give your new Eventhouse a name and click "Create".
Finally, we will use the Kusto.Explorer application (available only for Windows) to visualize the graph. It is a ClickOnce deployment application, so it may run an application update when you start it.
Ingest Data to KQL Database
When the Eventhouse was created, a default KQL database with the same name was created. To get data into the database, click the three dots next to the database name and select "Get Data" -> "Local File". In the dialog box that pops up, under "Select or create a destination table", click "New table" and give the table a name; in this case it will be "airports". Once you have a valid table name, the dialog will update so you can drag or browse for the file to load.
Note: You can upload files in a compressed format if the file is smaller than 1 GB.
Click “Next” to inspect the data for import. For the airports data, you will need to change the “Format” to CSV and enable the option for “First row is column header”.
Click “Finish” to load the file to the KQL table.
The airport data should now be loaded into the table, and you can query the table to view the results.
Here is a sample query to verify that the data was loaded:
Code:
airports
| take 100;
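Column names in KQL are case-sensitive, so it can also help to confirm exactly how the ingestion wizard named the columns:
Code:
airports
| getschema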
For the On-Time performance data, we will not ingest it into KQL. Instead, we will create a shortcut to the files in the Lakehouse storage.
Back in the KQL Database explorer, at the top, click on the “+ New -> OneLake shortcut” menu item.
In the dialog that comes up, choose "Microsoft OneLake" as the data source type, select the Lakehouse where the data was uploaded earlier, and click "Next".
Once the tree view of the OneLake populates the Tables and Files, expand "Files", select the subfolder that was created when uploading the On-Time data, and click "Create" to complete the shortcut creation.
Once the shortcut is created, you can view the data by clicking "Explore your data" and running the following query to validate it.
Code:
external_table('flights')
| count;
Note: When accessing the shortcut data, use the “external_table” and the name of the shortcut that was created. You cannot change the shortcut name.
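Since the shortcut exposes the pipe-separated file with generic column names (Column1, Column2, and so on), it helps to preview a few rows and work out which column holds which field before writing the graph query:
Code:
external_table('flights')
| take 10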
Query and Visualize with Kusto.Explorer
Now that the data is connected to an Eventhouse database, we want to start doing analytics on it. Fabric does have a way to run KQL queries directly, but it expects the result of a query to be a table. The only way to show the graph visualization is to use Kusto.Explorer.
To connect to the KQL database, you need to get the URI of the cluster from Fabric. Navigate to the KQL database in Fabric and find the panel that includes the "Database details".
Click "Copy URI" to the right of the Query URI to copy the cluster URI to the clipboard.
In the Kusto.Explorer application, right-click "Connections" and select "Add Connection".
In the popup, paste the Query URI into the "Cluster connection" textbox, replacing the text that is there. You can optionally give the connection an alias rather than using the URI. Finally, I chose AAD authentication; you can choose whatever is appropriate for your client access.
At this point, we can open a "New Tab" (Home menu) and type in a query that builds on the tables we validated above.
Code:
// Nodes: the airports table ingested earlier
let nodes = airports;
// Edges: one row per flight, projecting origin/destination codes and the carrier
let edges = external_table('flights')
| project origin = Column7, dest = Column8, flight = strcat(Column1, Column2), carrier = Column1;
// Build the graph, matching edge endpoints to the AIRPORT column of the nodes
edges
| make-graph origin --> dest with nodes on AIRPORT
Note: You may need to modify the table names (airports, flights) depending on the shortcut or table name you used when loading the data. These values are case-sensitive.
The points of interest in our graph will be the airports (nodes), and the connections (edges) will be the individual flights that were delayed. I am using the "make-graph" extension in KQL to build a graph of edges from origin to destination, using the three-character airport code as the link.
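If the full graph turns out to be too dense to read, one variation is to filter the edges down to a single carrier before building the graph. This is a sketch of that idea, assuming "DL" in Column1 is the carrier code for Delta:
Code:
let nodes = airports;
let edges = external_table('flights')
| project origin = Column7, dest = Column8, carrier = Column1
| where carrier == "DL";
edges
| make-graph origin --> dest with nodes on AIRPORT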
Visualize with “make-graph”
When this query is run, if the last line of the query is “make-graph”, Kusto.Explorer will automatically pop up a new window titled “Chart” to view the data. In the image below, I chose to change the visualization to a dark theme and then colored the edges based on the “carrier” column of the flight data.
Note: I have zoomed in on the cluster of interest.
If I drag a few of the nodes around, I can start to see that there are some nodes (airports) with a lot of orange connections. If I click on an orange link, I quickly learn that the orange lines are Delta flights, and the three nodes I pulled out in the image below are Atlanta, Minneapolis, and Detroit.
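This observation can be sanity-checked back in tabular KQL with a simple aggregation over the same edge data (again assuming Column1 holds the carrier code and Column7 the origin airport):
Code:
external_table('flights')
| project origin = Column7, carrier = Column1
| where carrier == "DL"
| summarize flights = count() by origin
| top 5 by flights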
Conclusion
I started with tables of text-based data and ended with a nice "network" visualization of my flights data. The ability to see the relationships in my data as a graph, rather than just reading tables, is invaluable.
Next, I am excited to start to explore visualizations of the data for supply chains and product recommendations.