Guest Liliam_Leme
Posted December 15, 2022

I've been reviewing customer questions centered around "Have I tried using MSSparkUtils to solve the problem?" One of the questions asked was how to share results between notebooks. Every time you hit "run" in a notebook, it starts a new Spark session, which means that each notebook would be using a different session, making it impossible to share results between executions of notebooks. MSSparkUtils offers a solution to handle this exact scenario.

What is MSSparkUtils?

MSSparkUtils (Microsoft Spark Utilities) is a built-in package that helps you easily perform common tasks. It is like a Swiss Army knife inside the Synapse Spark environment. Some scenarios where it can be used are, for example:

- Work with file systems
- Get environment variables
- Chain notebooks together
- Share a data frame between notebook executions

These scenarios are covered in more detail in the documentation: Introduction to Microsoft Spark utilities - Azure Synapse Analytics.

This blog will focus on chained execution and on sharing results between notebooks.

Chain notebooks together

You can execute more than one notebook from a root notebook using the run or exit methods:

- run executes another, referenced notebook in the same session as the main notebook.
- exit exits the referenced notebook with a value. When the notebook is orchestrated in a Synapse pipeline, Azure Synapse will return the exit value, complete the pipeline run, and stop the Spark session.

The following example works in two steps:

1. A notebook called Simple_read_ simply reads data from a test.parquet file into a data frame called df, as shown in Fig. 1 - simple_read:

Fig. 1 - simple_read

2. A second notebook called sharing_results executes the first notebook, Simple_read_, created in step 1. Fig. 2 - sharing_results shows the results of the chained execution:

Fig. 2 - sharing_results

How to share a data frame between notebooks?

That is a quite simple enhancement of the notebook-chaining logic. If we register the results from the Simple_read_ notebook as a temporary view and execute that notebook from the main notebook, the execution happens in the same session. Therefore, the sharing_results notebook will be able to see the results from the Simple_read_ notebook. For more information about what "same session" means, review the docs:

- Apache Spark core concepts - Azure Synapse Analytics
- Understand Synapse Spark basic configuration

Code example for the notebook Simple_read_ in PySpark:

%%pyspark
# Read the parquet file into a data frame
df = spark.read.load('abfss://parquet@containername.dfs.core.windows.net/test.parquet', format='parquet')
#display(df.limit(10))

# Register the data frame as a temporary view so that other notebooks
# running in the same Spark session can query it
df.createOrReplaceTempView("pysparkdftemptable")

Fig. 3 - Enhanc. Simple_read_ shows this idea:

Fig. 3 - Enhanc. Simple_read_

Code example for the notebook sharing_results in PySpark:

# Run the Simple_read_ notebook in the same session,
# with a timeout of 1000 seconds
mssparkutils.notebook.run("/Simple_read_", 1000)

# The temporary view registered by Simple_read_ is visible here
# because both notebooks share the same Spark session
dfread = spark.sql("select * from pysparkdftemptable")
display(dfread.limit(10))

Fig. 4 - Enhance shows the results for the notebook that was chained and kept in the same session:

Fig. 4 - Enhance

The following image shows how the process works:

Fig. 5 - Flux
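Run and exit can also pass values between the notebooks, which is handy when the caller needs a result back rather than a shared view. Below is a minimal sketch of that round trip in PySpark, assuming a hypothetical child notebook saved as /Row_counter and a hypothetical table_name parameter; neither is part of the example above, and mssparkutils is available by default inside Synapse notebooks.

Child notebook (/Row_counter):

# Parameters cell (toggle "Parameters" on this cell in Synapse).
# A value passed by the caller overwrites this hypothetical default.
table_name = "pysparkdftemptable"

# Count the rows of the shared temporary view and hand the result
# back to the caller as this notebook's exit value
row_count = spark.sql(f"SELECT COUNT(*) FROM {table_name}").collect()[0][0]
mssparkutils.notebook.exit(str(row_count))

Caller notebook:

# Run the child in the same Spark session with a 90-second timeout,
# passing the view name as a parameter; run() returns whatever the
# child handed to exit()
returned = mssparkutils.notebook.run("/Row_counter", 90, {"table_name": "pysparkdftemptable"})
print(f"/Row_counter counted {returned} rows")

When exit() is called interactively, it skips the child's remaining cells but keeps the Spark session alive, so views registered before the call are still queryable from the caller.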
Summary

MSSparkUtils is like a Swiss Army knife inside the Synapse Spark environment. It allows you to achieve more from Synapse Spark, including sharing the same session between notebooks, which is also useful in other scenarios, for example when you want to reuse parameters between notebooks in the same session.

That's it for this blog, and I hope it can help you on your learning journey with Synapse!