Updated Fabric GitHub Repo for 250M rows of CMS Healthcare data

  • Thread starter: Greg_Beaumont

Last year I teamed up with my colleague Inder Rana to build and release a GitHub repo for using CMS Medicare Part D data within Microsoft Fabric. The repo is intended to provide an example of an end-to-end analytics solution in Fabric that can be easily deployed by anyone with a Fabric environment. We have updated the analytics solution with some valuable improvements:

  • The ELT (extract, load, and transform) process, end-to-end from CMS to the Gold layer of the Lakehouse, now takes less than 20 minutes to run with increased automation.
  • The repo now contains logic to import new data for the year 2022 so that the solution contains 10 years of data (2013-2022) and nearly 250 million rows.
  • There are two simple options to move the data from the CMS servers to the Gold layer in less than 20 minutes:
    1. Spark Notebooks orchestrated with a Pipeline.
    2. Spark Notebooks and SQL Stored Procedures. This option lands the Gold layer in the Fabric Warehouse, for those of you who come from a SQL rather than a Python background.
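
To make the staged flow concrete, here is a minimal plain-Python sketch of the raw-to-Gold idea. It is not code from the repo: the column names, sample rows, and function names (`to_silver`, `to_gold`, `run_pipeline`) are hypothetical, and the actual solution does this work with Spark Notebooks, a Pipeline, and optionally SQL Stored Procedures in Fabric. The sketch only assumes that the Gold layer is shaped as a star schema, as the repo name suggests.

```python
# Hypothetical raw rows standing in for files landed from CMS (not the real schema).
RAW_ROWS = [
    {"year": "2021", "drug_name": " Atorvastatin ", "total_claims": "120", "total_cost": "950.50"},
    {"year": "2022", "drug_name": "atorvastatin", "total_claims": "140", "total_cost": "1010.00"},
    {"year": "2022", "drug_name": "Metformin", "total_claims": "", "total_cost": "300.25"},
    {"year": "2021", "drug_name": "Metformin", "total_claims": "80", "total_cost": "210.75"},
]


def to_silver(raw_rows):
    """Clean and type the raw rows; skip rows with a missing claim count."""
    silver = []
    for row in raw_rows:
        if not row["total_claims"]:
            continue  # incomplete row, left out of the cleaned layer
        silver.append({
            "year": int(row["year"]),
            "drug_name": row["drug_name"].strip().title(),
            "total_claims": int(row["total_claims"]),
            "total_cost": float(row["total_cost"]),
        })
    return silver


def to_gold(silver_rows):
    """Split cleaned rows into a drug dimension and a fact table (star schema)."""
    dim_drug = {}  # drug_name -> surrogate key
    fact = []
    for row in silver_rows:
        # Assign the next surrogate key the first time a drug name is seen.
        drug_key = dim_drug.setdefault(row["drug_name"], len(dim_drug) + 1)
        fact.append({
            "year": row["year"],
            "drug_key": drug_key,
            "total_claims": row["total_claims"],
            "total_cost": row["total_cost"],
        })
    return dim_drug, fact


def run_pipeline(raw_rows):
    """Run the stages in order, the way an orchestrating pipeline would."""
    return to_gold(to_silver(raw_rows))


dim_drug, fact = run_pipeline(RAW_ROWS)
```

Keeping each stage as its own function mirrors the idea of separate notebooks chained by one orchestrator: any stage can be rerun or replaced (for example, with a stored procedure) without touching the others.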

The updated GitHub repo can be found at the link below. Please give us a “Star” if you find it useful!

fabric-samples-healthcare/analytics-bi-directlake-starschema at main · isinghrana/fabric-samples-healthcare (github.com)



A video reviewing the SQL Stored Procedure version is coming soon. The first option, using three Spark Notebooks with a single Pipeline, is reviewed in the video below:



Here is a diagram reviewing the new and updated process:

Logical_Diagram_Star_new.png
