Jose_Barreto
In this post, we’ll cover recommendations on how to gather insights from the SharePoint Files dataset in the Microsoft Graph Data Connect (MGDC). If you’re not familiar with MGDC for SharePoint, start with Links about Microsoft Graph Data Connect for SharePoint.
1. DO – Follow the process for other MGDC datasets
The SharePoint Files dataset in MGDC produces the largest results, reaching hundreds of millions or even billions of rows for the largest tenants: one row for each document (item in a document library) in SharePoint or OneDrive.
Even though it's typically very large, this dataset follows the same MGDC process as other datasets. From a pipeline and data source standpoint, it’s just another dataset and it uses the same procedure. There is also support for sampling, filtering, deltas and history of the last 21 days. For details, see MGDC for SharePoint FAQ: Dataset types and features.
2. DO NOT – Load hundreds of millions of rows in Power BI
In general, when using MGDC, you can pull the resulting datasets into Power BI to create your own analytics. Power BI is an amazing tool that makes it easy to pull lots of data and do all sorts of aggregations and summarizations, showing them as reports, charts and dashboards. Power BI can easily pull that data, which is basically a set of folders with JSON files inside.
However, if your tenant ends up outputting tens of millions of rows or more, it is possible that Power BI won’t be able to handle that large dataset. From my experience, Power BI Desktop running on a fast PC with 32GB of RAM can typically handle a few million rows of data. If you have more than that, which is common for the Files dataset, you will need to do some preparation work outside Power BI.
We will discuss a few of these preparation options below.
3. DO – Create a summarized table in ADLS Gen2
If you have too many rows to load into Power BI, you could run a Synapse pipeline to do some data aggregation before pulling the data into Power BI.
For instance, you could use a Synapse data flow to summarize the Files dataset by “Extension”, calculating the number of sites, files, and bytes for each file extension, and then pull that summary into Power BI. This will require some additional work in Synapse, and you will have less flexibility on how to pivot the data once you are in Power BI.
Here are the steps to follow, using a summary by file extension as an example:
- Go to the Azure portal and create a new Synapse instance.
- In Synapse, create a new data flow. Make sure to enable “data flow debug”.
- Add a source to your data flow, pointing to the existing ADLS Gen2 storage, using JSON as the format
- Use an aggregate transformation to summarize the data as needed
- In the aggregate settings, for the group by column, choose extension
- In the aggregate settings, for the aggregates, choose these 3 columns:
- SiteCount – countAllDistinct(SiteId)
- FileCount – count()
- TotalSizeGB – round(sum(SizeInBytes)/1024/1024/1024,2)
- Use the “data preview” tab to make sure things are working as expected (see picture below).
- Add a sink to write the summarized data back to a new JSON file in ADLS Gen2.
- Trigger the pipeline to execute the data flow and summarize the data.
- Pull the summarized data into Power BI.
4. DO – Create a summarized table using a Synapse Notebook with PySpark
You can use Azure Synapse Analytics with a notebook to summarize a JSON-file-based dataset in ADLS Gen2.
Here’s a step-by-step guide to help you get started:
- Go to the Azure portal and create a new Synapse workspace if you don’t already have one.
- Make sure the storage account is configured to allow access by the Synapse workspace. In the Access Control section of the storage account configuration, you can assign a role to the app that represents the Synapse workspace.
- In Synapse Studio, create a new notebook.
- Add some code to the notebook. Use PySpark to read the JSON file from ADLS Gen2, perform the necessary summarization operations (for example, group by a field and calculate the sum of another field) and write the summarized data back to ADLS Gen2.
- Here’s an example:
Code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, countDistinct, sum
spark = SparkSession.builder.appName("SummarizeJSON").getOrCreate()
# Read the file as text and parse each line as JSON
input_json_path = "abfss://<container>@<account>.dfs.core.windows.net/<filespath>/<filename.json>"
rdd = spark.sparkContext.textFile(input_json_path)
df = spark.read.json(rdd)
# Group by Extension and aggregate
summary_df = df.groupBy("Extension").agg(
    count("*").alias("FileCount"),
    countDistinct("SiteId").alias("SiteCount"),
    sum("SizeInBytes").alias("TotalBytes")
)
# Write the output
output_json_path = "abfss://<container>@<account>.dfs.core.windows.net/<extensionspath>"
summary_df.write.mode("overwrite").json(output_json_path)
- Execute the cell in your notebook to perform the summarization.
- Use the new summarized data when loading into Power BI.
5. DO – Load the data into a SQL Server database
For large datasets, you might also want to move the entire dataset from the folder with JSON files into tables in a SQL Server. If your dataset is larger than 100GB, this could become expensive, and you would need to consider using indexes to help with your query. Columnstore indexes might be particularly useful for analytical queries that end up reading the entire table.
In Azure Synapse, you can use a “Copy Data” task where the source is your Azure Data Lake Storage Gen2 (ADLSGen2) and the destination (called the sink) is a table in SQL. You could also use the “Data Flow” task shown previously to transform the data and sink to SQL.
Moving to SQL will typically also require you to flatten the dataset, projecting nested objects. That means that objects inside objects must be represented as a flat list of properties. For instance, instead of having an “Author” object with two properties inside (“Email” and “Name”), you get two columns (“Author.Email” and “Author.Name”). In the Files Dataset, you will need to flatten the “Sensitivity Label Info”, the “Author” and the “Modified By” columns.
Note that you must first land the Microsoft Graph Data Connect dataset in an Azure Storage account before you can transform it and/or move it to SQL Server.
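To make the flattening more concrete, here is a minimal sketch of what a flattened Files table could look like in SQL Server, with the nested objects projected into flat columns and a clustered columnstore index added for analytical queries. The column names and data types are illustrative assumptions, not the exact MGDC schema, so adjust them to match the columns you actually map:
Code:
-- Hypothetical flattened Files table. Column names and data types are
-- illustrative assumptions; adjust them to match your mapped schema.
CREATE TABLE Files (
    SiteId UNIQUEIDENTIFIER,
    ItemId UNIQUEIDENTIFIER,
    ScopeId UNIQUEIDENTIFIER,
    SiteUrl NVARCHAR(2048),
    DirName NVARCHAR(2048),
    FileName NVARCHAR(1024),
    Extension NVARCHAR(256),
    SizeInBytes BIGINT,
    Author_Email NVARCHAR(320),        -- flattened from the Author object
    Author_Name NVARCHAR(256),         -- flattened from the Author object
    ModifiedBy_Email NVARCHAR(320),    -- flattened from the Modified By object
    ModifiedBy_Name NVARCHAR(256),     -- flattened from the Modified By object
    SensitivityLabelInfo_DisplayName NVARCHAR(256)  -- flattened from Sensitivity Label Info
);
-- A clustered columnstore index helps analytical queries that scan the entire table.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_Files ON Files;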
After the data is available in SQL Server, use the Power BI option to get data using a SQL query. Here is an example of a SQL query to summarize the data in the Files table by extension:
Code:
SELECT
Extension,
COUNT(*) AS FileCount,
COUNT(DISTINCT SiteId) AS SiteCount,
SUM(SizeInBytes) AS TotalBytes
FROM Files
GROUP BY Extension
If performance is more important than absolute accuracy, you might want to help SQL by using approximate distinct counts. This delivers faster results and guarantees an error rate of at most 2% with 97% probability. Here is an example:
Code:
SELECT
Extension,
COUNT(*) AS FileCount,
APPROX_COUNT_DISTINCT(SiteId) AS SiteCount,
SUM(SizeInBytes) AS TotalBytes
FROM Files
GROUP BY Extension
6. DO – Filter data
Instead of reducing the size of the data by summarizing, you can also filter the data in the Files dataset. That could be done by filtering the dataset for a specific site or possibly looking only at files with a specific author. You can use any of the methods described here (Synapse data flow, Synapse notebook or SQL Server query) to perform this filtering.
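As a minimal sketch, here is what such a filter could look like as a SQL Server query against the flattened Files table described above. The site URL, the author address and the Author_Email column name are illustrative assumptions used only to show the shape of the query:
Code:
-- Hypothetical filter: files from one site, created by one author.
-- The SiteUrl value and the Author_Email column are illustrative assumptions.
SELECT
    SiteUrl,
    DirName,
    FileName,
    Extension,
    SizeInBytes
FROM Files
WHERE SiteUrl = 'https://contoso.sharepoint.com/sites/Marketing'
AND Author_Email = 'user@contoso.com'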
7. DO – Join the Files dataset with the Sites dataset
It might also be useful to join the Files dataset with the Sites dataset, so you can do specific aggregations or filtering. For instance, you could look into how the files are distributed across the different types of SharePoint sites, using the Template or Template Id.
Here is an example using a SQL query:
Code:
SELECT
S.RootWebTemplate AS Template,
COUNT(*) AS FileCount,
COUNT(DISTINCT F.SiteId) AS SiteCount,
SUM(F.SizeInBytes) AS TotalBytes
FROM Files AS F
LEFT OUTER JOIN Sites AS S ON F.SiteId = S.Id
GROUP BY S.RootWebTemplate
Here is an example where we first calculate a summary of files per site, then do the join. This eliminates the need to use a COUNT DISTINCT:
Code:
SELECT
S.RootWebTemplate AS Template,
COUNT(*) AS TotalSites,
SUM(G.FilesPerSite) AS TotalFiles,
SUM(G.BytesPerSite) AS TotalBytes
FROM
(SELECT
F.SiteId,
COUNT(*) AS FilesPerSite,
SUM(F.SizeInBytes) AS BytesPerSite
FROM Files AS F
GROUP BY F.SiteId) AS G
LEFT OUTER JOIN Sites AS S ON G.SiteId = S.Id
GROUP BY S.RootWebTemplate
8. DO NOT – Join Files with Permissions on ItemId
You should be very careful when you attempt to join the Files dataset with the Permissions dataset. These are typically huge datasets and the way they should be joined is a bit complicated. You definitely do not want to join them by ItemId, since not every permission has an ItemId (it could be a permission on a Site, Library or Folder) and not every file has an associated permission (again, it could be a permission declared further up in the hierarchy).
If you must find the permissions for a given ItemId, the correct way to do it is by ScopeId. I suggest that you first filter the Files dataset for a specific ItemId and then join that with the Permissions dataset using the ScopeId. Note that a single item might have multiple permissions (with different roles, for instance) and these permissions might be granted for different item types.
Here is an example of a SQL query that shows the permissions for a given file, identified by a SiteId and an ItemId. It is very important to filter the output; otherwise, the query might return billions of rows and take a very long time to process.
Code:
SELECT
F.SiteId,
F.ItemId,
F.ScopeId,
F.SiteUrl,
F.DirName,
F.FileName,
F.Extension,
P.ItemType,
P.ItemURL,
P.RoleDefinition,
P.LinkId,
P.LinkScope
FROM Files AS F
LEFT OUTER JOIN Permissions AS P
ON F.SiteId = P.SiteId AND F.ScopeId = P.ScopeId
WHERE F.SiteId = 'FCBDFC28-9335-4666-A852-6B1C1E7EC165'
AND F.ItemId = '647DCA3A-A3B8-4DBA-B1E8-6000389E696A'
To understand more about permission scopes, see MGDC for SharePoint FAQ: What is in the Permissions dataset?
9. Conclusion
I hope this clarifies what you can do with the Files dataset in MGDC for SharePoint. Let us know in the comments if you have other suggestions on how to get more out of the Files dataset.
For further details about the schema of all SharePoint datasets in MGDC, including SharePoint Sites and SharePoint File Actions, see Datasets, regions, and sinks supported by Microsoft Graph Data Connect - Microsoft Graph.