
Efficient Data Partitioning with Microsoft Fabric: Best Practices and Implementation Guide


Posted by Sally Dabbah

Introduction:

Data partitioning is a data management technique used to divide a large dataset into smaller, more manageable subsets called partitions or shards. Each partition contains a portion of the data, and these partitions can be stored and processed independently. The primary goal of data partitioning is to improve performance, scalability, and manageability in large-scale data systems.

Imagine having a vast encyclopedia, and your task is to find a specific page containing a particular word. Instead of searching through the entire encyclopedia, you opt to navigate directly to the relevant section that might contain the desired information. This approach allows you to narrow down your search scope and quickly find the page you need. This concept of narrowing down the search space by dividing data into manageable sections is precisely what partitioning entails.
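
To make this concrete, here is a minimal PySpark sketch (the folder name and columns are hypothetical) of how a partitioned layout narrows the search space: filtering on the partition columns lets Spark skip every folder that cannot contain a match.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A dataset written with .partitionBy("year", "month") is laid out in folders like:
#   trips/year=2016/month=03/part-....parquet
#   trips/year=2016/month=04/part-....parquet
df = spark.read.parquet("Files/trips")  # hypothetical partitioned folder

# Filtering on the partition columns prunes folders: only year=2016/month=03 is scanned.
df.filter((df.year == 2016) & (df.month == 3)).show(5)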

 


In this blog, I will present a proposed method for data partitioning using Fabric notebooks.

 

 


Part 1: Prepare Data for Partitioning

In this blog, I'm using the taxi data available at this link: taxi_data | Kaggle

I downloaded it and I'm using the file yellow_tripdata_2016-03.csv.

Since uploading on-prem files larger than 1 GB is not supported in Fabric, and my file is 2 GB, I uploaded the file to my ADLS account and linked my ADLS container as a shortcut into my Fabric lakehouse. Check the documentation here:

Options to get data into the Lakehouse - Microsoft Fabric | Microsoft Learn
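
Once the shortcut is in place, a quick sanity check (a sketch, assuming the shortcut appears as Tables/ADLS-TestContainer in the lakehouse attached to your notebook, which is the path used in Part 3) confirms that the CSV is reachable:

from notebookutils import mssparkutils

# List the shortcut folder to confirm the uploaded file is visible from Fabric
for item in mssparkutils.fs.ls("Tables/ADLS-TestContainer"):
    print(item.name, item.size)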

 

Part 2: Create a new notebook and load data into it

 

Navigate through the following steps:

 

1. Click on your workspace.

2. Select "create," then choose "notebook."

3. Inside your notebook, locate your lakehouse on the left side.

4. Navigate to your desired file within the lakehouse.

5. Load the chosen file into your notebook using the provided options; a sketch of what the resulting load cell looks like follows this list.
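
The generated cell will look roughly like the following (the path matches the shortcut used later in this post; yours may differ), and it is worth previewing the schema before transforming anything:

# `spark` is pre-defined in Fabric notebooks
df = spark.read.format("csv").option("header", "true").load("Tables/ADLS-TestContainer/yellow_tripdata_2016-03.csv")

# Quick preview before any transformation
df.printSchema()
df.show(5)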

 

 

 


Part 3: Transform and partition data

Our task now is to extract the year, month, and day from the pickup timestamp. Afterwards, we'll save the data as Parquet files, partitioned by those columns, in the designated destination.

 

 

 

 

 

 

 

from pyspark.sql.functions import col, to_timestamp, date_format
from pyspark.sql import SparkSession

# The Spark session already exists in a Fabric notebook; getOrCreate() returns it
spark = SparkSession.builder.getOrCreate()

# Load data from the lakehouse into the notebook
df = spark.read.format("csv").option("header", "true").load("Tables/ADLS-TestContainer/yellow_tripdata_2016-03.csv")

output_folder = 'output_partitioning'
# Click on your lakehouse folder -> Copy ABFS path -> append your output folder to it
output_path = f'abfss://31e03aae-13c0-4c9b-a9a7-8941b7b73c98@msit-onelake.dfs.fabric.microsoft.com/8384d9c6-f10c-413e-a3d6-21b90a7f76f2/Tables/ADLS-TestContainer/{output_folder}'

# Add year, month and day columns derived from the pickup timestamp
df_output = (df
    .withColumn("tpep_pickup_datetime", to_timestamp(col("tpep_pickup_datetime")))
    .withColumn("year", date_format(col("tpep_pickup_datetime"), "yyyy"))
    .withColumn("month", date_format(col("tpep_pickup_datetime"), "MM"))
    .withColumn("day", date_format(col("tpep_pickup_datetime"), "dd")))

df_output.select('year', 'month', 'day').show()

# Write data into the lakehouse, partitioned by year/month/day
df_output.write.mode("overwrite").partitionBy('year', 'month', 'day').parquet(output_path)
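
To sanity-check the write (a sketch, reusing spark, col, and output_path from the cell above), read the partitioned output back and filter on the partition columns; Spark then scans only the matching year=/month=/day= folders:

# Read the partitioned output back; partition values are inferred from the folder
# names, so a filter on year/month/day prunes folders instead of scanning everything.
df_check = spark.read.parquet(output_path)
print(df_check.filter((col("year") == 2016) & (col("month") == 3) & (col("day") == 1)).count())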

 

 

 

 

 

 

Part 4: View the output within the lakehouse

1. Refresh your lakehouse: navigate to the lakehouse explorer on the left side of your notebook.

2. Click on refresh.

3. View the files in your destination folder. In my case, the output was a tree of year=.../month=.../day=... subfolders, which you can also list from code as shown below:
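
If you prefer to stay in the notebook, here is a small sketch (it reuses the output_path variable from Part 3) that lists the partition folders directly:

from notebookutils import mssparkutils

# Each entry corresponds to one partition value, e.g. year=2016
for item in mssparkutils.fs.ls(output_path):
    print(item.name)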

 


Since we linked our ADLS account to our lakehouse as a shortcut, if you open your ADLS storage account you can see the same files there.

 

 

 


Links:

Microsoft Fabric documentation - Microsoft Fabric | Microsoft Learn

Create a workspace - Microsoft Fabric | Microsoft Learn

Develop, execute, and manage notebooks - Microsoft Fabric | Microsoft Learn

 

 

Call-To-Action:

 

- Make sure to establish all connections (the ADLS shortcut and the lakehouse attached to your notebook) before starting to work in Fabric.

- Check the links mentioned above.

 

- Follow me for more tutorials on LinkedIn: Sally Dabbah | LinkedIn

 
