Mustafa-Al-Durra
Built on the medallion Lakehouse architecture, the DICOM data ingestion capability extends the healthcare data solutions (Preview) in Microsoft Fabric. It lets customers conduct exploratory analysis and run large-scale imaging analytics and radiomics in OneLake by transforming imaging data from the native DICOM® format into tabular shapes that can be persisted in the lake in FHIR® (Silver) and OMOP (Gold) formats. The solution is designed for Fabric workspaces and uses SQL analytics endpoints that can scale automatically to meet demand.
The medallion Lakehouse for medical imaging in Microsoft Fabric
Leveraging the medallion Lakehouse architecture, the DICOM data ingestion capability is built on the foundation of the healthcare data solutions in Microsoft Fabric, which comprises three fundamental layers:
- Bronze: The raw zone. This first layer stores the source imaging data in its original DICOM® format (dcm files), as well as a metastore that contains the full set of metadata (tags) extracted from the DICOM® files. DICOM data ingestion supports compression by design: dcm files in both native and ZIP format can be processed in the Bronze Lakehouse. Once metadata extraction is complete, the dcm files are compressed into ZIP format (this is configurable) for better cost and storage efficiency.
- Silver: The enriched zone. This layer stores the metadata of imaging data sourced from the Bronze Lakehouse, as well as referential file links to the dcm file locations in the Bronze Lakehouse. The Silver Lakehouse is based on the FHIR® specification; the imaging metadata and file references are stored in the ImagingStudy delta table, whose schema is a flattened form of the ImagingStudy FHIR® resource (R4.3).
- Gold: The curated zone. This final layer stores imaging data sourced from the ImagingStudy delta table in the Silver Lakehouse. The Gold Lakehouse is based on the OMOP specification; the imaging metadata and file references are stored in the Image_Occurence delta table, whose schema is based on the latest development of medical imaging data standardization for imaging-based observational research¹.
Conceptual architecture
Data has gravity, and DICOM® imaging datasets are usually on the order of petabytes for an average healthcare and life science organization. The DICOM data ingestion capability in Microsoft Fabric gives our customers and partners the flexibility to choose the ingestion pattern that best fits their existing data volume and storage needs. At a high level, there are three patterns for ingesting DICOM® data into the healthcare data solutions in Microsoft Fabric. Depending on the ingestion pattern, there are up to seven end-to-end execution steps to consider, from the ingestion of the raw DICOM® files to the transformation into the OMOP CDM format in the Gold Lakehouse, as depicted in the following conceptual architecture diagram:
Conceptual architecture and ingestion patterns of the DICOM data ingestion capability in Microsoft Fabric
You can find more details about each of the three ingestion patterns and the seven E2E execution steps in our public documentation: Use DICOM data ingestion - Microsoft Cloud for Healthcare | Microsoft Learn
Ingestion patterns
Option 1 (Ingest): This option is based on ingesting, i.e. copying, the DICOM® files, in their native or compressed format, into the Lakehouse. We call this the Ingest option. Customers will consider this option if they intend to migrate their DICOM® data from on-premises storage to the cloud and Microsoft Fabric.
Option 2 (BYOS): Thanks to the shortcuts capability in Microsoft OneLake, this option is based on in-place access to the DICOM® files in Azure Data Lake Storage (ADLS). Unlike the previous option, the DICOM® files are not copied or moved from their original location. We call this option Bring-Your-Own-Storage (BYOS). In many cases, customers have already migrated their DICOM® data to Azure Data Lake Storage, and this option enables them to unlock the power of those imaging datasets without moving or relocating the existing DICOM® files. In other cases, customers may have dependencies on those cloud datasets, which would add another layer of complexity in the absence of this ingestion pattern.
Option 3 (AHDS DICOM service): This option is based on leveraging the DICOM service from Azure Health Data Services (AHDS). We call this the AHDS DICOM service option. Here, the imaging data, i.e. the DICOM® files, are placed in an ADLS Gen2 location by the AHDS DICOM service. From there, the data flow and ingestion are identical to the BYOS flow described in the previous option. Our customers and partners who are already using our AHDS DICOM service can take advantage of this option and immediately unlock the power of our DICOM data ingestion capability.
E2E execution steps
There are seven end-to-end execution steps in our DICOM data ingestion capability. All seven steps are included in the first ingestion pattern (Ingest). In the other two ingestion patterns, i.e. BYOS and AHDS DICOM service, the first two steps are not required, and you can start from the third step.
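The relationship between ingestion patterns and execution steps can be written down as a small lookup. This is only an illustrative sketch: the pattern keys (`ingest`, `byos`, `ahds_dicom_service`), the step labels, and the `steps_for` helper are names we made up for this example and are not part of the product.

```python
# Step labels paraphrased from this article (hypothetical wording).
STEPS = {
    1: "Ingest DICOM files into OneLake",
    2: "Organize DICOM files in OneLake",
    3: "Extract DICOM metadata into the Bronze Lakehouse",
    4: "Convert to FHIR ImagingStudy NDJSON files",
    5: "Ingest into the ImagingStudy delta table (Bronze)",
    6: "Ingest the ImagingStudy delta table into Silver",
    7: "Convert and ingest into the Gold Lakehouse (OMOP)",
}

# BYOS and the AHDS DICOM service skip steps 1 and 2, because the
# DICOM files are accessed in place instead of being copied.
FIRST_STEP = {"ingest": 1, "byos": 3, "ahds_dicom_service": 3}


def steps_for(pattern: str) -> list:
    """Return the step numbers that apply to an ingestion pattern."""
    start = FIRST_STEP[pattern.lower()]
    return [n for n in sorted(STEPS) if n >= start]


if __name__ == "__main__":
    for name in FIRST_STEP:
        print(name, steps_for(name))
```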
Step 1: Ingestion of DICOM® files in OneLake
The Ingest folder in the Bronze Lakehouse represents a drop, or a queue, folder. You can simply drop the DICOM® files inside the Ingest folder in the Bronze Lakehouse. Go to the Ingest\Imaging\DICOM folder in the Bronze Lakehouse and click the ellipsis => Upload => Upload files.
Ingestion of DICOM® files in OneLake
Step 2: Organize the DICOM® files in OneLake
The data movement notebook, as part of the DICOM data ingestion capability, transfers all files from the Ingest folder to an optimized directory structure within the Bronze Lakehouse: Files\Process\Imaging\DICOM\yyyy\mm\dd. This reorganization facilitates scalability and is better suited to data lake storage (refer to data lake best practice for directory structure in Azure). If the files are compressed, the notebook extracts each dcm file and places it into the optimized directory, disregarding the original folder arrangement found within the ZIP. The data movement notebook also adds a Unix timestamp prefix, accurate to the millisecond, to each file name to maintain file name uniqueness. This measure is crucial for customers using multiple PACS and VNA systems, where file name uniqueness may not be guaranteed.
DICOM® files are re-organized within the Process folder in OneLake
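To make the renaming rule concrete, here is a minimal sketch of how a target path could be derived. It is not the notebook's actual code: the function names, the underscore separator between timestamp and file name, and the use of forward slashes are our own assumptions for illustration.

```python
import io
import zipfile
from datetime import datetime, timezone
from pathlib import PurePosixPath

# Directory layout taken from the article, written with forward slashes.
PROCESS_ROOT = PurePosixPath("Files/Process/Imaging/DICOM")


def target_path(file_name, now=None):
    """Build Files/Process/Imaging/DICOM/yyyy/mm/dd/<unix_ms>_<name>.

    The "_" separator is an assumption; the article only says that a
    millisecond Unix timestamp is used as a prefix.
    """
    now = now or datetime.now(timezone.utc)
    ts_ms = int(now.timestamp() * 1000)
    # Keep only the base name, so any source folder structure is dropped.
    base = PurePosixPath(file_name.replace("\\", "/")).name
    return PROCESS_ROOT / f"{now:%Y}" / f"{now:%m}" / f"{now:%d}" / f"{ts_ms}_{base}"


def plan_zip_extraction(zip_bytes, now=None):
    """Map each .dcm member of a ZIP archive to a flat target path."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        members = [n for n in zf.namelist() if n.lower().endswith(".dcm")]
    return {member: target_path(member, now) for member in members}
```

Note that this simplified version gives every member of one archive the same timestamp, so two members with the same base name would collide; a real implementation would need to handle that case.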
Step 3: Extract DICOM® metadata into the Bronze Lakehouse
In this step, the data extract notebook, as part of the DICOM data ingestion capability, tracks newly moved files in the Process folder. It extracts the DICOM® tags (DICOM® Data Elements) available in those dcm files and ingests them into the dicomimagingmetastore delta table in the Bronze Lakehouse.
DICOM® metadata extracted in the Bronze Lakehouse
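The "tracks newly moved files" part of this step can be pictured with the sketch below. It assumes a plain set of already-processed relative paths as the tracking state, which is our simplification and not how the notebook necessarily tracks files. It also checks the standard DICOM Part 10 file signature (a 128-byte preamble followed by `DICM`). The tag extraction itself needs a DICOM parser and is not shown here.

```python
from pathlib import Path


def is_dicom_part10(path: Path) -> bool:
    """True if the file has a 128-byte preamble followed by b'DICM'."""
    with open(path, "rb") as f:
        header = f.read(132)
    return len(header) == 132 and header[128:132] == b"DICM"


def find_new_dicom_files(process_root: Path, already_processed: set) -> list:
    """List .dcm files under process_root that have not been handled yet.

    already_processed holds POSIX-style paths relative to process_root;
    this in-memory set is a stand-in for whatever state a real
    pipeline would persist between runs.
    """
    new_files = []
    for path in sorted(process_root.rglob("*.dcm")):
        key = path.relative_to(process_root).as_posix()
        if key not in already_processed and is_dicom_part10(path):
            new_files.append(key)
    return new_files
```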
Step 4: Conversion to FHIR® ImagingStudy NDJSON files in OneLake
This step converts the DICOM® metadata to the FHIR® format. The data conversion notebook, as part of the DICOM data ingestion capability, tracks and processes the recently modified delta tables in the Bronze Lakehouse (including dicomimagingmetastore). It then converts the DICOM® metadata in the dicomimagingmetastore delta table to the ImagingStudy FHIR® resource (R4.3) and saves the output as NDJSON files.
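As a rough illustration of this conversion, the sketch below turns one metadata row into a minimal ImagingStudy resource and writes one JSON object per line. The row keys are standard DICOM keywords, but treating them as the column names of dicomimagingmetastore is an assumption on our part, and the resource is deliberately incomplete compared with what the notebook produces.

```python
import json


def to_imaging_study(row: dict) -> dict:
    """Build a minimal, instance-level ImagingStudy from one metadata row."""
    return {
        "resourceType": "ImagingStudy",
        "status": "available",
        "identifier": [{
            "system": "urn:dicom:uid",
            "value": f"urn:oid:{row['StudyInstanceUID']}",
        }],
        # A real pipeline would resolve a Patient reference here.
        "subject": {"identifier": {"value": row["PatientID"]}},
        "series": [{
            "uid": row["SeriesInstanceUID"],
            "modality": {
                "system": "http://dicom.nema.org/resources/ontology/DCM",
                "code": row["Modality"],
            },
            "instance": [{
                "uid": row["SOPInstanceUID"],
                "sopClass": {
                    "system": "urn:ietf:rfc:3986",
                    "code": f"urn:oid:{row['SOPClassUID']}",
                },
            }],
        }],
    }


def to_ndjson(rows) -> str:
    """Serialize rows as NDJSON: one compact JSON resource per line."""
    return "\n".join(
        json.dumps(to_imaging_study(r), separators=(",", ":")) for r in rows
    )
```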
Step 5: Ingestion into the ImagingStudy delta table in the Bronze Lakehouse
From this step onward, you will be re-using the notebooks from the FHIR® data ingestion capability in the healthcare data solutions in Microsoft Fabric. The bronze ingestion notebook, as part of the FHIR data ingestion capability, tracks the newly generated files in the configured folder location. The notebook groups the instance-level data of the same study into one DICOM® Study record and inserts a new record in the ImagingStudy delta table in the Bronze Lakehouse. Each record represents a Study object in the DICOM® hierarchy.
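The grouping described here can be sketched as a merge of instance-level resources by study identifier. This is a simplified illustration with plain Python dictionaries; the helper names are ours, and the notebook's actual merge logic may differ.

```python
def study_uid(resource: dict) -> str:
    """Return the study identifier used as the grouping key."""
    return resource["identifier"][0]["value"]


def merge_into_studies(resources) -> list:
    """Merge instance-level ImagingStudy dicts into one record per study."""
    studies = {}
    for res in resources:
        # First resource of a study seeds the record; series are rebuilt.
        study = studies.setdefault(study_uid(res), {**res, "series": []})
        by_series = {s["uid"]: s for s in study["series"]}
        for s in res["series"]:
            target = by_series.get(s["uid"])
            if target is None:
                target = {**s, "instance": []}
                study["series"].append(target)
                by_series[s["uid"]] = target
            # Skip instances that were already merged into this series.
            known = {i["uid"] for i in target["instance"]}
            target["instance"].extend(
                i for i in s["instance"] if i["uid"] not in known
            )
    for study in studies.values():
        study["numberOfSeries"] = len(study["series"])
        study["numberOfInstances"] = sum(
            len(s["instance"]) for s in study["series"]
        )
    return list(studies.values())
```

The result has one record per study, with its series and instances nested underneath, which matches the Study level of the DICOM® hierarchy.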
Step 6: Ingestion of ImagingStudy delta table into the Silver Lakehouse
The silver ingestion notebook, as part of the FHIR data ingestion capability, tracks the newly added records in the ImagingStudy delta table in the Bronze Lakehouse. The notebook flattens and transforms the data from the ImagingStudy delta table in the Bronze Lakehouse to the ImagingStudy delta table in the Silver Lakehouse in accordance with the ImagingStudy FHIR® resource (R4.3).
Step 7: Conversion and Ingestion of Image_Occurence into the Gold Lakehouse
The OMOP notebook, as part of the FHIR data ingestion capability, leverages the OMOP mappings to transform resources from the Silver Lakehouse into OMOP delta tables in the Gold Lakehouse. This notebook converts the data in the FHIR® delta tables in the Silver Lakehouse (including the ImagingStudy delta table) to the respective OMOP delta tables in the Gold Lakehouse (including the Image_Occurence delta table). Each record in the Image_Occurence delta table in the Gold Lakehouse represents a Series object in the DICOM® hierarchy.
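The change of grain between Silver (one record per Study) and Gold (one record per Series) can be illustrated as follows. The sketch assumes a nested, FHIR-shaped study record as input, and the output column names are hypothetical placeholders; they are not the actual Image_Occurence schema, and no OMOP concept mapping is attempted.

```python
def series_rows(study: dict) -> list:
    """Expand one study-level record into one row per series.

    Output keys are illustrative only and do not mirror the real
    Image_Occurence columns.
    """
    rows = []
    for series in study.get("series", []):
        rows.append({
            "study_uid": study["identifier"][0]["value"],
            "series_uid": series["uid"],
            "modality": series.get("modality", {}).get("code"),
            "instance_count": len(series.get("instance", [])),
        })
    return rows
```

A study with two series therefore yields two rows, each carrying the study identifier so that the rows can be traced back to their parent study.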
In this article, we shared how the DICOM data ingestion capability offers a robust, end-to-end solution for unifying medical imaging data with the clinical datasets in the healthcare data solutions in Microsoft Fabric and analyzing them in a harmonized pattern. For more details, please review our documentation:
- Overview of DICOM data ingestion - Microsoft Cloud for Healthcare | Microsoft Learn
- Deploy and configure DICOM data ingestion - Microsoft Cloud for Healthcare | Microsoft Learn
- Use DICOM data ingestion - Microsoft Cloud for Healthcare | Microsoft Learn
- DICOM metadata transformation mapping - Microsoft Cloud for Healthcare | Microsoft Learn
- DICOM data ingestion usage considerations - Microsoft Cloud for Healthcare | Microsoft Learn
1 Park, W.Y., Jeon, K., Schmidt, T.S. et al. Development of Medical Imaging Data Standardization for Imaging-Based Observational Research: OMOP Common Data Model Extension. J Digit Imaging. Inform. med. 37, 899–908 (2024). https://doi.org/10.1007/s10278-024-00982-6
DICOM® is the registered trademark of the National Electrical Manufacturers Association (NEMA) for its Standards publications relating to digital communications of medical information.
FHIR® is a registered trademark of Health Level Seven International, registered in the U.S. Trademark Office, and is used with their permission.