Guest Sally_Dabbah
Posted August 15, 2023

Introduction:

Data partitioning is a data management technique that divides a large dataset into smaller, more manageable subsets called partitions or shards. Each partition holds a portion of the data, and these partitions can be stored and processed independently. The primary goal of data partitioning is to improve performance, scalability, and manageability in large-scale data systems.

Imagine having a vast encyclopedia and needing to find the page containing a particular word. Instead of searching through the entire encyclopedia, you navigate directly to the section that is likely to contain the information. This narrows your search scope and lets you find the page quickly. This idea of narrowing the search space by dividing data into manageable sections is precisely what partitioning entails.

In this blog, I will present a proposed method for data partitioning using Fabric notebooks.

Part 1: Prepare data for partitioning

In this blog I'm using the taxi data provided at this link: taxi_data | Kaggle. I downloaded it and am using the file yellow_tripdata_2016-03.csv.

Since datasets in Fabric do not support on-premises files larger than 1 GB, and my file is 2 GB, I uploaded the file to my ADLS account and linked my ADLS container as a shortcut in my Fabric lakehouse. Check the documentation here: Options to get data into the Lakehouse - Microsoft Fabric | Microsoft Learn

Part 2: Create a new notebook and load data into it

Navigate through the following steps:
1. Click on your workspace.
2. Select "Create," then choose "Notebook."
3. Inside your notebook, locate your lakehouse on the left side.
4. Navigate to your desired file within the lakehouse.
5. Load the chosen file into your notebook using the provided options.

Part 3: Transform and partition the data

Our task now is to extract the year, month, and day from the pickup timestamp.
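Before running the Spark code, it helps to picture what partitioned output looks like on disk: Spark's partitionBy writes a Hive-style directory tree with one folder per year/month/day combination. The following pure-Python sketch (illustrative only; the folder-name format shown is the default Spark produces for string partition columns) shows how a row's timestamp maps to its partition folder:

```python
from datetime import datetime

def partition_path(base: str, ts: datetime) -> str:
    """Build the Hive-style partition directory that
    partitionBy('year', 'month', 'day') would place a row in."""
    return f"{base}/year={ts:%Y}/month={ts:%m}/day={ts:%d}"

# Example pickup timestamps from the March 2016 taxi file
pickups = [datetime(2016, 3, 1, 0, 15), datetime(2016, 3, 14, 8, 30)]
paths = [partition_path("output_partitioning", ts) for ts in pickups]
print(paths)
# → ['output_partitioning/year=2016/month=03/day=01',
#    'output_partitioning/year=2016/month=03/day=14']
```

Every row with the same year/month/day lands in the same folder, which is what makes the later lookup cheap.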
Afterward, we'll save the data as Parquet files in the designated destination.

```python
from pyspark.sql.functions import col, to_timestamp, date_format
from pyspark.sql import SparkSession

# In a Fabric notebook a SparkSession is already available as `spark`;
# getOrCreate() simply returns the existing session.
spark = SparkSession.builder.getOrCreate()

# Load data from the lakehouse into the notebook
df = (spark.read.format("csv")
      .option("header", "true")
      .load("Tables/ADLS-TestContainer/yellow_tripdata_2016-03.csv"))

output_folder = 'output_partitioning'

# Click on your lakehouse -> Copy ABFS path -> add your suffix to it.
# Note the f-prefix, so {output_folder} is actually substituted.
output_path = f'abfss://31e03aae-13c0-4c9b-a9a7-8941b7b73c98@msit-onelake.dfs.fabric.microsoft.com/8384d9c6-f10c-413e-a3d6-21b90a7f76f2/Tables/ADLS-TestContainer/{output_folder}'

# Add new columns to the dataframe: year, month and day
df_output = (df
    .withColumn("tpep_pickup_datetime", to_timestamp(col("tpep_pickup_datetime")))
    .withColumn("day", date_format(col("tpep_pickup_datetime"), "dd"))
    .withColumn("year", date_format(col("tpep_pickup_datetime"), "yyyy"))
    .withColumn("month", date_format(col("tpep_pickup_datetime"), "MM")))

df_output.select('day', 'month', 'year').show()

# Write data into the lakehouse, partitioned by year, month and day
df_output.write.mode("overwrite").partitionBy('year', 'month', 'day').parquet(output_path)
```

Part 4: View the output within the lakehouse

1. Refresh your lakehouse: navigate to the left pane in your notebook.
2. Click on refresh.
3. View the files in your destination; in my case this was the output:

Since we linked our data lake into our lakehouse, if you open your ADLS storage account you can see your files there as well.

Links:
Microsoft Fabric documentation - Microsoft Fabric | Microsoft Learn
Create a workspace - Microsoft Fabric | Microsoft Learn
Develop, execute, and manage notebooks - Microsoft Fabric | Microsoft Learn

Call-To-Action:
- Make sure to establish all connections before starting to work on Fabric.
- Check the links mentioned above.
- Follow me for more tutorials on LinkedIn: Sally Dabbah | LinkedIn
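As a closing note on why this layout pays off: when a query filters on the partition columns, the engine can skip entire directories instead of scanning every file (partition pruning). Here is a small pure-Python sketch of the idea, using hypothetical partition folder names like those the write above would produce:

```python
# Hypothetical listing of partition directories produced by the write step
partitions = [
    "year=2016/month=02/day=29",
    "year=2016/month=03/day=01",
    "year=2016/month=03/day=02",
    "year=2016/month=04/day=01",
]

def parse_partition(path: str) -> dict:
    """Turn 'year=2016/month=03/day=01' into {'year': '2016', ...}."""
    return dict(part.split("=") for part in path.split("/"))

def prune(partitions, **wanted):
    """Keep only directories whose partition values match every filter,
    skipping the rest without touching any data files."""
    return [p for p in partitions
            if all(parse_partition(p).get(k) == v for k, v in wanted.items())]

march = prune(partitions, year="2016", month="03")
print(march)  # only the two March 2016 directories would be scanned
```

Spark performs this pruning automatically when you filter on year, month, or day, which is the main performance benefit of partitioning the taxi data this way.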