Guest Sally_Dabbah
Posted August 15, 2023

Introduction:

Data partitioning is a data management technique that divides a large dataset into smaller, more manageable subsets called partitions or shards. Each partition holds a portion of the data, and these partitions can be stored and processed independently. The primary goal of data partitioning is to improve performance, scalability, and manageability in large-scale data systems.

Imagine having a vast encyclopedia and needing to find the page containing a particular word. Instead of searching through the entire encyclopedia, you navigate directly to the section that is likely to contain the information. This narrows your search scope and lets you find the page quickly. This idea of narrowing the search space by dividing data into manageable sections is precisely what partitioning entails.

In this blog, I will present a proposed method for data partitioning using Fabric notebooks.

Part 1: Prepare data for partitioning

In this blog I'm using the taxi data provided at this link: taxi_data | Kaggle. I downloaded it and am using the file yellow_tripdata_2016-03.csv.

Since datasets in Fabric do not support on-premises files larger than 1 GB, and my file is 2 GB, I uploaded the file to my ADLS account and linked my ADLS container as a shortcut in my Fabric lakehouse. Check the documentation here: Options to get data into the Lakehouse - Microsoft Fabric | Microsoft Learn

Part 2: Create a new notebook and load data into it

Navigate through the following steps:
1. Click on your workspace.
2. Select "Create," then choose "Notebook."
3. Inside your notebook, locate your lakehouse on the left side.
4. Navigate to your desired file within the lakehouse.
5. Load the chosen file into your notebook using the provided options.

Part 3: Transform and partition the data

Our task now is to extract the year, month, and day from the pickup timestamp.
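Before running the Spark code, it helps to picture what partitioned output looks like on disk: Spark's partitionBy writes a Hive-style directory tree with one folder per year/month/day combination. The following pure-Python sketch (illustrative only; the folder-name format shown is the default Spark produces for string partition columns) shows how a row's timestamp maps to its partition folder:

```python
from datetime import datetime

def partition_path(base: str, ts: datetime) -> str:
    """Build the Hive-style partition directory that
    partitionBy('year', 'month', 'day') would place a row in."""
    return f"{base}/year={ts:%Y}/month={ts:%m}/day={ts:%d}"

# Example pickup timestamps from the March 2016 taxi file
pickups = [datetime(2016, 3, 1, 0, 15), datetime(2016, 3, 14, 8, 30)]
paths = [partition_path("output_partitioning", ts) for ts in pickups]
print(paths)
# → ['output_partitioning/year=2016/month=03/day=01',
#    'output_partitioning/year=2016/month=03/day=14']
```

Every row with the same year/month/day lands in the same folder, which is what makes the later lookup cheap.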
Afterward, we'll save the data as Parquet files in the designated destination.

```python
from pyspark.sql.functions import col, to_timestamp, date_format
from pyspark.sql import SparkSession

# In a Fabric notebook a SparkSession is already available as `spark`;
# getOrCreate() simply returns the existing session.
spark = SparkSession.builder.getOrCreate()

# Load data from the lakehouse into the notebook
df = (spark.read.format("csv")
      .option("header", "true")
      .load("Tables/ADLS-TestContainer/yellow_tripdata_2016-03.csv"))

output_folder = 'output_partitioning'

# Click on your lakehouse -> Copy ABFS path -> add your suffix to it.
# Note the f-prefix, so {output_folder} is actually substituted.
output_path = f'abfss://31e03aae-13c0-4c9b-a9a7-8941b7b73c98@msit-onelake.dfs.fabric.microsoft.com/8384d9c6-f10c-413e-a3d6-21b90a7f76f2/Tables/ADLS-TestContainer/{output_folder}'

# Add new columns to the dataframe: year, month and day
df_output = (df
    .withColumn("tpep_pickup_datetime", to_timestamp(col("tpep_pickup_datetime")))
    .withColumn("day", date_format(col("tpep_pickup_datetime"), "dd"))
    .withColumn("year", date_format(col("tpep_pickup_datetime"), "yyyy"))
    .withColumn("month", date_format(col("tpep_pickup_datetime"), "MM")))

df_output.select('day', 'month', 'year').show()

# Write data into the lakehouse, partitioned by year, month and day
df_output.write.mode("overwrite").partitionBy('year', 'month', 'day').parquet(output_path)
```

Part 4: View the output within the lakehouse

1. Refresh your lakehouse: navigate to the left pane in your notebook.
2. Click on refresh.
3. View the files in your destination; in my case this was the output:

Since we linked our data lake into our lakehouse, if you open your ADLS storage account you can see your files there as well.

Links:
Microsoft Fabric documentation - Microsoft Fabric | Microsoft Learn
Create a workspace - Microsoft Fabric | Microsoft Learn
Develop, execute, and manage notebooks - Microsoft Fabric | Microsoft Learn

Call-To-Action:
- Make sure to establish all connections before starting to work on Fabric.
- Check the links mentioned above.
- Follow me for more tutorials on LinkedIn: Sally Dabbah | LinkedIn
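As a closing note on why this layout pays off: when a query filters on the partition columns, the engine can skip entire directories instead of scanning every file (partition pruning). Here is a small pure-Python sketch of the idea, using hypothetical partition folder names like those the write above would produce:

```python
# Hypothetical listing of partition directories produced by the write step
partitions = [
    "year=2016/month=02/day=29",
    "year=2016/month=03/day=01",
    "year=2016/month=03/day=02",
    "year=2016/month=04/day=01",
]

def parse_partition(path: str) -> dict:
    """Turn 'year=2016/month=03/day=01' into {'year': '2016', ...}."""
    return dict(part.split("=") for part in path.split("/"))

def prune(partitions, **wanted):
    """Keep only directories whose partition values match every filter,
    skipping the rest without touching any data files."""
    return [p for p in partitions
            if all(parse_partition(p).get(k) == v for k, v in wanted.items())]

march = prune(partitions, year="2016", month="03")
print(march)  # only the two March 2016 directories would be scanned
```

Spark performs this pruning automatically when you filter on year, month, or day, which is the main performance benefit of partitioning the taxi data this way.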