Updated Fabric GitHub Repo for 250M rows of CMS Healthcare data

Greg_Beaumont · Oct 15, 2024

Last year I teamed up with my colleague Inder Rana to build and release a GitHub repo for using CMS Medicare Part D data within Microsoft Fabric. The repo is intended to provide an example of an end-to-end analytics solution in Fabric that can be easily deployed by anyone with a Fabric environment. We have updated the analytics solution with some valuable improvements:

The ELT (extract, load, and transform) process, end-to-end from CMS to the Gold layer of the Lakehouse, now takes less than 20 minutes to run with increased automation.
The repo now contains logic to import new data for the year 2022 so that the solution contains 10 years of data (2013-2022) and nearly 250 million rows.
There are two simple options to move the data from the CMS servers to the Gold layer in less than 20 minutes:
1. Spark Notebooks orchestrated with a Pipeline.
2. Spark Notebooks and SQL Stored Procedures. This option lands the Gold layer in the Fabric Warehouse, for those of you who come from a SQL rather than a Python background.
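
As a rough illustration of what shaping data for a Gold layer can involve, here is a minimal sketch that splits a few flat rows into a dimension table and a fact table, the star-schema shape that the repo name (analytics-bi-directlake-starschema) suggests. It uses plain Python instead of Spark so that it runs anywhere. The column names (drug_name, year, total_cost) and the function build_star_schema are hypothetical and are not taken from the repo or from the CMS files.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def build_star_schema(flat_rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Split flat rows into a drug dimension and a fact table.

    Hypothetical column names; an illustration only, not the repo's logic.
    """
    drug_keys: Dict[str, int] = {}   # drug name -> surrogate key
    dim_drug: List[Row] = []         # one row per distinct drug
    fact: List[Row] = []             # one row per input row, keyed by drug_key

    for row in flat_rows:
        name = str(row["drug_name"])
        if name not in drug_keys:
            # Assign the next surrogate key the first time a drug is seen.
            drug_keys[name] = len(drug_keys) + 1
            dim_drug.append({"drug_key": drug_keys[name], "drug_name": name})
        fact.append(
            {
                "drug_key": drug_keys[name],
                "year": row["year"],
                "total_cost": row["total_cost"],
            }
        )
    return dim_drug, fact


# Made-up sample rows, for illustration only.
sample = [
    {"drug_name": "DrugA", "year": 2021, "total_cost": 100.0},
    {"drug_name": "DrugB", "year": 2021, "total_cost": 50.0},
    {"drug_name": "DrugA", "year": 2022, "total_cost": 120.0},
]

dim_drug, fact = build_star_schema(sample)
print(dim_drug)
print(fact)
```

In the actual solution this kind of step would run in a Spark Notebook or a SQL Stored Procedure, depending on which of the two options you choose; the sketch only shows the shape of the result.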

The updated GitHub repo can be found at the link below. Please give us a “Star” if you find it useful: fabric-samples-healthcare/analytics-bi-directlake-starschema at main · isinghrana/fabric-samples-healthcare (github.com)

The first option, using three Spark Notebooks with a single Pipeline, is reviewed in the video below. A video reviewing the SQL Stored Procedure version is coming soon:

Here is a diagram reviewing the new and updated process:
