MGDC for SharePoint FAQ: Dataset types and features

Jose Barreto · Jul 8, 2024

Introduction

I commonly get asked questions about specific datasets and how they behave. For instance:

Why is the data for SharePoint Sites delayed by 48 hours when some other MGDC datasets are not?
Why can’t I do deltas with the SharePoint File Actions dataset? Or any of the non-SharePoint datasets?
Why can I only get the latest OneDrive Sync Health state and not a snapshot from last week, like I can with SharePoint Sites?
Why do Entra ID Group datasets sometimes take longer to collect than a larger SharePoint Permissions dataset?

To answer these questions, you must first understand how these datasets are collected and how your data requests are processed. In simple scenarios, you can overlook these details. But to answer questions like the ones above, we need to understand more about MGDC in general and SharePoint datasets specifically.

How datasets are pulled

The main difference between the SharePoint datasets and the other datasets in MGDC is that SharePoint datasets are pre-collected. That means that the datasets are collected ahead of any request, during off-peak hours for the region where the data lives. This pre-collected data is then kept within the Microsoft 365 cloud to be used when an MGDC request comes in. This makes serving the data faster and keeps MGDC requests from competing with regular user requests during peak hours.

SharePoint pre-collects data for SharePoint Sites, SharePoint Groups, SharePoint Permissions, SharePoint Files and SharePoint File Actions.

Other datasets in MGDC are collected at the time of the request, typically making Graph API calls in batch after your MGDC request comes in. These calls happen inside the Microsoft 365 cloud, in multiple parallel threads, so it’s faster than a regular user sending in Graph API calls.

This explains why the pre-collected SharePoint data pulls are usually faster than other MGDC datasets for a similar amount of data. It also explains why these SharePoint datasets are delayed by 2 days. The reason why the delay is 2 days and not only 1 day is a longer conversation, involving time zones and the time required to process the data after it is pre-collected.

21 days of SharePoint history

When SharePoint pre-collects datasets, the data is kept in a compliant workspace inside the Microsoft 365 cloud boundary for 21 days. We avoid storing it for longer due to regulatory requirements like those in GDPR.

Because we have a few days of data stored, we can look at the data as it was in the past. When pre-collecting the SharePoint Sites dataset, we keep a full list of all sites in the tenant for each day. For instance, if it’s January 25th, we have a complete state of Sites for each day from January 2nd to January 23rd.
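To make the date arithmetic concrete, here is a minimal Python sketch of how that window could be computed. It is only an illustration modeled on the January example above: the function name, the parameters and the exact way the boundaries are derived (latest date is today minus the 2-day delay, earliest date is 21 days before that) are my assumptions, not a description of how the service computes it.

```python
from datetime import date, timedelta


def available_snapshot_dates(today, delay_days=2, history_days=21):
    """Return (earliest, latest) snapshot dates that could be requested on `today`.

    Hypothetical helper: assumes the latest snapshot is `delay_days` behind
    today and that history reaches `history_days` further back from there.
    """
    latest = today - timedelta(days=delay_days)
    earliest = latest - timedelta(days=history_days)
    return earliest, latest


# The year is arbitrary; only the month and day matter for the example.
earliest, latest = available_snapshot_dates(date(2024, 1, 25))
print(earliest, latest)  # 2024-01-02 2024-01-23
```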

Having this history helps you request data from the recent past. That is useful if you missed one day of collection. This is also why we can do deltas. A delta dataset is basically created by pulling data for two different dates and comparing the two states. You can learn more about deltas at How can I use Delta State Datasets?
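Conceptually, comparing two states looks something like the Python sketch below. This is just an illustration of the idea: the function, the "Created", "Updated" and "Deleted" labels, the site ids and the StorageUsed field are hypothetical choices for this example and do not represent the actual MGDC output schema.

```python
def compute_delta(old_state, new_state):
    """Compare two snapshots (dicts keyed by object id) and list the changes."""
    delta = []
    for obj_id, row in new_state.items():
        if obj_id not in old_state:
            delta.append(("Created", obj_id, row))
        elif row != old_state[obj_id]:
            delta.append(("Updated", obj_id, row))
    for obj_id, row in old_state.items():
        if obj_id not in new_state:
            delta.append(("Deleted", obj_id, row))
    return delta


# Two hypothetical daily states of a tiny Sites dataset.
sites_jan_20 = {"site-a": {"StorageUsed": 100}, "site-b": {"StorageUsed": 50}}
sites_jan_23 = {"site-a": {"StorageUsed": 120}, "site-c": {"StorageUsed": 10}}

changes = compute_delta(sites_jan_20, sites_jan_23)
for change in changes:
    print(change)
```

In this sketch, site-a shows up as updated, site-c as created and site-b as deleted. Objects that did not change between the two dates are left out, which is what makes a delta smaller than a full pull.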

Note that deltas only apply to SharePoint state datasets (Sites, Groups, Permissions and Files). This is why the File Actions dataset does not support them. The data returned in the File Actions dataset covers a specific date (or range of dates) and it shows only the events that occurred in that period.

30 days of OneDrive Sync history

Other unique datasets include the OneDrive Sync Health and the OneDrive Sync Errors. They use a mechanism to keep track of the last 30 days of Sync data. When you request data for Sync Health, it shows data about every device that has Sync’ed in the last 30 days. If it’s been longer than 30 days since the last Sync, the device is dropped from the list.
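The rolling 30-day behavior can be pictured with a small Python sketch. This is only a model of the rule described above, not the actual mechanism: the function name, the device ids and the dates are made up for illustration.

```python
from datetime import date, timedelta


def devices_in_window(last_sync_by_device, today, window_days=30):
    """Keep only devices whose last Sync falls within the last `window_days` days."""
    cutoff = today - timedelta(days=window_days)
    return {
        device: last_sync
        for device, last_sync in last_sync_by_device.items()
        if last_sync >= cutoff
    }


# Hypothetical devices and last Sync dates.
last_sync = {
    "device-1": date(2024, 7, 1),   # synced a week ago: kept
    "device-2": date(2024, 5, 15),  # more than 30 days ago: dropped
}
current = devices_in_window(last_sync, today=date(2024, 7, 8))
print(sorted(current))  # ['device-1']
```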

This optimized data collection, all stored within the Microsoft 365 cloud, allows MGDC to quickly gather the OneDrive Sync Health and Sync Errors information. Since we’re always looking at the last 30 days of data, no date range is required as a parameter. It also means that new device information is constantly being added to this store.

You don’t have a day-to-day history, so there is no ability to look back in time or do deltas. But you are also looking at the latest data, not a snapshot from 2 days ago.

Filtering and Sampling

The SharePoint team also implemented features to sample or filter at the source.

For sampling, you can get the top X rows in the dataset. This can be used to look at the data schema and also to estimate the total number of items in the dataset (this is returned in the request metadata). You can read more about sampling at How can I sample or estimate the number of objects in a dataset?

A few SharePoint datasets also have filtering capabilities. That feature is limited to specific sites and specific columns (Site Id and Template Id). You can send a filter expression with the request and return only the rows that match the criteria. This can be used to include only OneDrive sites or to exclude a list of sensitive sites. You can read more about filtering at How can I filter rows on a dataset?
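To show the effect of such a filter, here is a minimal Python sketch that applies the same kind of criteria to a few rows locally. It does not show the actual filter expression syntax, which is covered in the linked article. The row layout, the key names, the site ids and the template id values (including 21 for OneDrive) are assumptions made for this example.

```python
ONEDRIVE_TEMPLATE_ID = 21  # assumed value for OneDrive sites, for illustration only


def keep_row(row, include_template_ids=None, exclude_site_ids=frozenset()):
    """Return True if a row passes the include/exclude criteria."""
    if row["SiteId"] in exclude_site_ids:
        return False
    if include_template_ids is not None and row["TemplateId"] not in include_template_ids:
        return False
    return True


# Hypothetical rows with only the two filterable columns.
rows = [
    {"SiteId": "site-a", "TemplateId": 21},
    {"SiteId": "site-b", "TemplateId": 64},
    {"SiteId": "site-c", "TemplateId": 21},
]

# Include only OneDrive sites.
onedrive_only = [r for r in rows if keep_row(r, include_template_ids={ONEDRIVE_TEMPLATE_ID})]

# Exclude a list of sensitive sites.
without_sensitive = [r for r in rows if keep_row(r, exclude_site_ids={"site-c"})]

print([r["SiteId"] for r in onedrive_only])      # ['site-a', 'site-c']
print([r["SiteId"] for r in without_sensitive])  # ['site-a', 'site-b']
```

The important difference is that MGDC applies the filter at the source, so the rows that do not match are never delivered, while this sketch filters rows that are already in memory.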

Some of the non-SharePoint datasets in MGDC can also limit the results to specific Scopes, including Groups and Users. More at Use MGDC to define the scope of a dataset.

Summary

Finally, here is the summary of the features offered for each of the SharePoint datasets in Microsoft Graph Data Connect:

Conclusion

I hope this post helped you understand the different mechanisms used by Microsoft Graph Data Connect (MGDC) and which features you can use in each type of dataset.
