
MSSparkUtils is the Swiss Army knife inside Synapse Spark



Guest Liliam_Leme
Posted

I've been reviewing customer questions, and many of them come down to the same response: "Have you tried using MSSparkUtils to solve the problem?"

 

 

 

One of the questions asked was how to share results between notebooks. Every time you run a notebook, it starts its own Spark session, which means that each notebook runs in a different session. That makes it impossible to share results directly between notebook executions. MSSparkUtils offers a solution to handle this exact scenario.

 

 

 

What is MSSparkUtils?

 

MSSparkUtils (Microsoft Spark Utilities) is a built-in package that helps you easily perform common tasks. It is like a Swiss Army knife inside the Synapse Spark environment.

 

 

 

Some example scenarios where it can be used:

 

  • Work with file systems
  • Get environment variables
  • Chain notebooks together
  • Share a data frame between notebooks that run in the same session

 

These scenarios are covered in more detail in the doc: Introduction to Microsoft Spark utilities - Azure Synapse Analytics

 

 

 

This blog focuses on chained execution and sharing results between notebooks.

 

 

 

Chain notebooks together

 

 

You can execute more than one notebook from a root notebook using the run and exit methods of mssparkutils.notebook.

 

  • Run executes another (referenced) notebook in the same session as the main notebook
  • Exit ends a notebook and returns a value; when it is called in a referenced notebook, that notebook stops and the main notebook continues with the cells after the run call

 

When a notebook that calls exit is orchestrated from a Synapse pipeline, Azure Synapse will return an exit value, complete the pipeline run, and stop the Spark session.
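To make the run and exit pairing concrete, here is a minimal sketch of a small wrapper around the run call. The helper name run_child_notebook and its default timeout are hypothetical and not part of MSSparkUtils. The runner is passed in as an argument so the function can also be exercised outside Synapse; inside a Synapse notebook you would pass mssparkutils.notebook.run, which takes a notebook path, a timeout in seconds, and an optional parameter map.

```python
from typing import Any, Callable, Dict, Optional


def run_child_notebook(
    run: Callable[..., Any],
    path: str,
    timeout_seconds: int = 90,
    parameters: Optional[Dict[str, Any]] = None,
) -> str:
    """Run a referenced notebook and return its exit value as a string.

    `run` is assumed to behave like mssparkutils.notebook.run: it takes
    (path, timeout_seconds, parameters) and returns whatever the child
    notebook handed to mssparkutils.notebook.exit.
    """
    exit_value = run(path, timeout_seconds, parameters or {})
    # Normalize a missing exit value to an empty string so callers can compare safely.
    return "" if exit_value is None else str(exit_value)


# Inside a Synapse notebook (assumption: mssparkutils is preloaded there):
#   status = run_child_notebook(mssparkutils.notebook.run, "/Simple_read_", 1000)
# In the child notebook, to hand a value back to the caller:
#   mssparkutils.notebook.exit("done")
```

Passing the runner in keeps the wrapper independent of the Synapse runtime, which is a design choice for this sketch rather than something the service requires.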

 

 

 

Here is an example in two steps:

 

  1. A notebook called Simple_read_ just reads data from test.parquet into a data frame called df, as shown in Fig. 1 - simple_read:

 


 

Fig. 1 - simple_read

 

 

 

  2. Next, I created a second notebook called sharing_results, which executes the first notebook, Simple_read_, created in step 1. Fig. 2 - sharing_results shows the results of the chained execution:

 


 

Fig. 2 - sharing_results

 

 

 

 

How to share a data frame between notebooks?

 

This is a simple enhancement of the notebook chaining logic. If the Simple_read_ notebook registers its results as a temporary view and is executed from the main notebook, the execution happens in the same session. Therefore, the Sharing_results notebook can see the results from the Simple_read_ notebook.

 

 

 

For more information about what "same session" means, review the docs:

 

 

 

 

Code example for the notebook Simple_read_ in PySpark:

 

 

 

 

 

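%%pyspark
# Read the parquet file into a data frame
df = spark.read.load('abfss://parquet@contianername.dfs.core.windows.net/test.parquet', format='parquet')
#display(df.limit(10))
# Register the data frame as a temporary view so the calling notebook can query it
df.createOrReplaceTempView("pysparkdftemptable")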

 

 

 

 

 

 

 

 

 

The following Fig. 3 - Enhanc. Simple_read_ shows this idea:

 


 

Fig. 3 - Enhanc. Simple_read_

 

 

 

 

Code example for the notebook Sharing_results in PySpark:

 

 

 

 

 

 

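from pyspark.sql.functions import col, when
from pyspark.sql import SparkSession

# Run Simple_read_ in this notebook's session (timeout: 1000 seconds)
mssparkutils.notebook.run("/Simple_read_", 1000)

# Query the temporary view that Simple_read_ registered
dfread = spark.sql("select * from pysparkdftemptable")

display(dfread.limit(10))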
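The Sharing_results code above assumes that the child notebook really did register the view. As a hedged sketch, the check can be made explicit before querying. The helper name load_shared_view is hypothetical; spark is the notebook's Spark session and run is expected to behave like mssparkutils.notebook.run. The lookup relies on spark.catalog.listTables(), which lists temporary views of the current session, and names are compared in lowercase on the assumption that Spark's default case-insensitive identifier handling is in effect.

```python
from typing import Any, Callable


def load_shared_view(
    spark: Any,
    run: Callable[..., Any],
    notebook_path: str,
    view_name: str,
    timeout_seconds: int = 1000,
) -> Any:
    """Run a child notebook in the current session, then return its temp view as a data frame."""
    run(notebook_path, timeout_seconds)
    # Temporary views registered in this session appear in the catalog listing.
    visible = {table.name.lower() for table in spark.catalog.listTables()}
    if view_name.lower() not in visible:
        raise LookupError(
            f"{notebook_path} did not register the view '{view_name}' in this session"
        )
    return spark.table(view_name)


# Inside a Synapse notebook (assumption: spark and mssparkutils are preloaded):
#   dfread = load_shared_view(spark, mssparkutils.notebook.run,
#                             "/Simple_read_", "pysparkdftemptable")
#   display(dfread.limit(10))
```

If the view is missing, the error names the notebook that was expected to create it, which tends to be easier to diagnose than a failed SQL query.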

 

 

 

 

 

 

 

 

 

Fig. 4 - Enhance shows the results for the notebook that was chained and kept in the same session:

 

 

 


 

Fig. 4 - Enhance

 

 

 

 

The following image shows the process of how it works:

 

 

 


 

Fig. 5 - Flux

 

 

 

 

Summary

 

 

MSSparkUtils is like a Swiss Army knife inside the Synapse Spark environment. It lets you get more out of Synapse Spark and even share the same session between notebooks. That can also be useful in other scenarios, for example when you want to reuse parameters between notebooks in the same session.

 

 

 

That's it for this blog. I hope it helps you on your learning journey with Synapse!

 
