
How To Convert Parquet Files into GeoJSON Files and Save Them in a Data Lake Using Synapse Notebooks


Posted by Sally Dabbah

Introduction:

 

GeoJSON files contain various types of geospatial data, such as point, line, and polygon features, as well as metadata and attributes. They can be used for a variety of purposes, such as creating interactive maps, analyzing spatial patterns, and visualizing geospatial data.

 

In this blog, we will discuss how to transform Parquet files into GeoJSON files using a Synapse notebook. This is a workaround, since the transformation is not currently supported by the Copy activity in Azure Synapse pipelines.
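For reference, a GeoJSON document is just JSON with a fixed structure: a FeatureCollection holding Feature objects, each with a geometry and optional properties. Here is a minimal sketch using the geojson Python package (the city name and coordinates are made up for illustration):

    from geojson import Feature, FeatureCollection, Point, dumps

    # One point feature with a single attribute.
    # Note that GeoJSON coordinates are ordered (longitude, latitude).
    feature = Feature(geometry=Point((4.3517, 50.8503)), properties={"name": "Brussels"})
    print(dumps(FeatureCollection([feature]), indent=2))
    # Prints (roughly):
    # {
    #   "type": "FeatureCollection",
    #   "features": [{
    #     "type": "Feature",
    #     "geometry": {"type": "Point", "coordinates": [4.3517, 50.8503]},
    #     "properties": {"name": "Brussels"}
    #   }]
    # }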

 

 

 

Prerequisites:

 


  1. Basic knowledge of Azure Synapse Analytics.

  2. An Azure Synapse Analytics workspace.

  3. A storage account (in this blog, we are using ADLS) linked to the Synapse workspace.

  4. Python and PySpark knowledge.

  5. Mock data (in this example, a Parquet file generated from a CSV containing 3 columns: name, lat, and lng; a sketch for producing such a file follows below).

 

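If you need to generate such a Parquet file yourself, a minimal PySpark sketch is below; the paths and file names are placeholders (not from the original post), so substitute your own container and account. In a Synapse notebook, spark is predefined:

    # Read the mock CSV (columns: name, lat, lng) and write it out as Parquet
    csv_df = spark.read.csv(
        'wasbs://<container>@<account>.blob.core.windows.net/staging/mock_points.csv',
        header=True, inferSchema=True)
    csv_df.write.mode('overwrite').parquet(
        'wasbs://<container>@<account>.blob.core.windows.net/staging/testVlaamse.parquet')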

 

 

 

Step 1: Create a Notebook in Azure Synapse Workspace

 

To create a notebook in your Azure Synapse workspace, open Synapse Studio, navigate to the Develop hub, and create a new notebook from there.

 

 

 

 

 

Step 2: Configure the Notebook and Convert the Data

 


  • Attach a Spark Pool to the Notebook
     
    You can create your own Spark pool or attach the default one.
     
     
  • In the language drop-down list, select PySpark.
     
     
     
  • In the notebook, add a code cell to install the packages that we will use later on:
     
    %pip install geojson geopandas
     
    Next, add another code cell. In this cell, we will build a GeoPandas DataFrame from the Parquet file.
     
    %%pyspark
    from pyspark.sql import SparkSession
    from geojson import Feature, FeatureCollection, Point
    import geopandas

    blob_account_name = "XXXX"
    blob_container_name = "XXX"

    spark = SparkSession.builder.getOrCreate()

    # Retrieve a SAS token for the storage account from the workspace linked service
    token_library = spark._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
    blob_sas_token = token_library.getConnectionString("AzureBlobStorage")
    output_path = f'wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net/output/test.geojson'

    spark.conf.set(
        'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
        blob_sas_token)

    # Read the Parquet file into a Spark DataFrame, then convert it to pandas
    df = spark.read.load(
        f'wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net/staging/testVlaamse.parquet',
        format='parquet')
    pdf = df.toPandas()

    # Converting the pandas DF into a GeoPandas DF:
    # build one GeoJSON Point feature per row (coordinates are ordered lng, lat)
    features = pdf.apply(
        lambda row: Feature(geometry=Point((float(row['lng']), float(row['lat'])))),
        axis=1).tolist()

    # all the other columns used as properties
    properties = pdf.drop(['lat', 'lng'], axis=1).to_dict('records')

    # whole geojson object
    feature_collection = FeatureCollection(features=features, properties=properties)

    gdf = geopandas.GeoDataFrame.from_features(feature_collection['features'])

    print(gdf)  # checking the GeoPandas DataFrame structure
    Because the features carry only geometry, the printed GeoDataFrame contains a single geometry column of points.
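    As an illustration (these coordinates are made up, not from the original data), the output has this shape:

    #                    geometry
    # 0  POINT (4.35170 50.85030)
    # 1  POINT (4.40260 51.21940)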
     
    Open another code cell. Here we use the Spark utilities library provided by Microsoft (mssparkutils) to write the GeoPandas DataFrame as a GeoJSON file and save it in Azure Data Lake Gen2.
    Unfortunately, writing the GeoPandas DataFrame directly from a Synapse notebook to Azure Data Lake Gen2 is not yet supported. Therefore, we use a workaround: write the GeoPandas DataFrame to a local temporary file, then copy that file into Azure Data Lake Gen2.
     
    Here's the code for copying the file into Azure Data Lake Gen 2:
     
     
    from notebookutils import mssparkutils

    tmp_file = 'file:/tmp/temporary/test.geojson'

    # to_json() serializes the GeoDataFrame as a GeoJSON FeatureCollection;
    # the last parameter set to True overwrites the file if it already exists
    mssparkutils.fs.put(tmp_file, gdf.to_json(), True)

    # Copy the temporary file to the output path defined in the previous cell
    mssparkutils.fs.cp(tmp_file, output_path)
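    To sanity-check the result, you can preview the beginning of the copied file with the same utilities library (a quick check, assuming the session and output_path from the previous cells):

    # Print the first bytes of the GeoJSON file we just copied
    print(mssparkutils.fs.head(output_path))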

 

Links:

 

Apache Spark pool concepts - Azure Synapse Analytics | Microsoft Learn

 

Introduction to Microsoft Spark utilities - Azure Synapse Analytics | Microsoft Learn

 

 

 

Call-to-Action:

If you have any questions, comments, or feedback about this topic, please feel free to share them in the comments section below. Don't forget to subscribe to our blog for more Microsoft-related content and updates.

 
