Optimizing ETL Workflows: A Guide to Azure Integration and Authentication with Batch and Storage

Posted by Josedobla

Introduction


When it comes to building a robust foundation for ETL (Extract, Transform, Load) pipelines, the trio of Azure Data Factory or Azure Synapse Analytics, Azure Batch, and Azure Storage is indispensable. Together, these services enable efficient data movement, transformation, and processing across diverse data sources.



This document provides a comprehensive guide on how to authenticate to Azure Batch with the Synapse system-assigned managed identity (SAMI) and to Azure Storage with a user-assigned managed identity (UAMI). This enables user-driven connectivity to storage, facilitating data extraction. Furthermore, it allows the use of custom activities, such as High-Performance Computing (HPC) workloads, to process the extracted data.



The key enabler of these functionalities is the Synapse Pipeline. Serving as the primary orchestrator, the Synapse Pipeline is adept at integrating various Azure resources in a secure manner. Its capabilities can be extended to Azure Data Factory (ADF), providing a broader scope of data management and transformation.



Through this guide, you will gain insights into leveraging these powerful Azure services to optimize your data processing workflows.



Services Overview


This procedure uses several services; below you will find more details about each of them.



Azure Synapse Analytics / Data Factory

Azure Batch

Azure Storage

Managed Identities

  • Azure Managed Identities are a feature of Azure Active Directory that automatically manages credentials for applications to use when connecting to resources that support Azure AD authentication. They eliminate the need for developers to manage secrets, credentials, certificates, and keys.
  • There are two types of managed identities:
    • System-assigned: Created as part of an Azure resource (such as a Synapse workspace) and tied to its lifecycle.
    • User-assigned: A standalone Azure resource that can be assigned to one or more apps.
  • Documentation: Managed identities for Azure resources | Microsoft Learn
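Under the hood, code running on an Azure resource (including Batch pool nodes) obtains managed-identity tokens from the Azure Instance Metadata Service (IMDS). The sketch below builds the documented IMDS token request purely for illustration; in practice, libraries such as azure-identity handle this for you.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import urlencode

# Azure Instance Metadata Service (IMDS) endpoint that managed identities
# use to obtain Azure AD access tokens from inside an Azure VM.
IMDS_TOKEN_ENDPOINT = "http://169.254.169.254/metadata/identity/oauth2/token"

def build_imds_token_request(resource: str,
                             client_id: Optional[str] = None) -> Tuple[str, Dict[str, str]]:
    """Return (url, headers) for an IMDS managed-identity token request.

    Omit client_id for a system-assigned identity (SAMI); pass the UAMI's
    client ID to select a specific user-assigned identity.
    """
    params = {"api-version": "2018-02-01", "resource": resource}
    if client_id:
        params["client_id"] = client_id
    return f"{IMDS_TOKEN_ENDPOINT}?{urlencode(params)}", {"Metadata": "true"}

# Example: request a token for Azure Storage with the system-assigned identity
url, headers = build_imds_token_request("https://storage.azure.com/")
```

The required `Metadata: true` header prevents the request from being forwarded off the VM, which is part of what makes the identity tamper-resistant.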

Scenario


Run an ADF / Synapse pipeline that pulls a script from a Storage Account and executes it on the Batch nodes, using a User-Assigned Managed Identity (UAMI) to authenticate to Storage and a System-Assigned Managed Identity (SAMI) to authenticate to Batch.



Prerequisites

Procedure Overview


This procedure walks step by step through the following actions:



  • Create UAMI Credentials
  • Create Linked Services for Storage and Batch Accounts
  • Add UAMI and SAMI to Storage and Batch Accounts
  • Create, Configure and Execute an ADF / Synapse Pipeline
    • We will refer to ADF concepts (Portal, Workspace, Pipelines, Jobs, Linked Services) as Synapse throughout the exercise and examples to avoid redundancy.
  • Debugging



Procedure

Create UAMI Credentials


1. In your Synapse Portal, go to Manage -> Credentials -> New, fill in the details, and click Create.






Create Linked Services Connections for Storage and Batch


2. In your Synapse Portal, go to Manage -> Linked Services -> New -> Azure Blob Storage -> Continue and complete the form:

a. Authentication type: User Assigned Managed Identity

b. Azure subscription: choose your subscription

c. Storage account name: choose the account where the script to be used is stored

d. Credentials: choose the credential created in Step 1

e. Click Create



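Behind the form, Synapse stores the linked service as JSON. As a rough, hand-written sketch (all names here are hypothetical, and the exact schema may differ by service version), the definition produced in Step 2 looks approximately like this, with the UAMI referenced through the credential created in Step 1:

```python
def blob_linked_service(name: str, storage_account: str, credential_name: str) -> dict:
    """Approximate JSON body of an Azure Blob Storage linked service that
    authenticates with a user-assigned managed identity credential."""
    return {
        "name": name,
        "properties": {
            "type": "AzureBlobStorage",
            "typeProperties": {
                # Blob endpoint of the account that holds the script
                "serviceEndpoint": f"https://{storage_account}.blob.core.windows.net/",
            },
            # Reference to the UAMI credential created in Step 1
            "credential": {
                "referenceName": credential_name,
                "type": "CredentialReference",
            },
        },
    }

ls = blob_linked_service("LS_ScriptStorage", "mystorageacct", "uami-credential")
```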



3. In the Azure Portal, go to your Batch Account -> Keys and copy the Batch account name and account endpoint for the next step; also copy the Pool name to be used in this example.






4. In your Synapse Portal, go to Manage -> Linked Services -> New -> Azure Batch -> Continue and fill in the information:

a. Authentication method: System Assigned Managed Identity (copy the managed identity name to be used later)

b. Account name, Batch URL, and Pool name: paste the values copied in Step 3

c. Storage linked service name: choose the one created in Step 2



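As with the Storage connection, the Batch linked service is persisted as JSON. A sketch of roughly what Step 4 produces (names are hypothetical; with SAMI authentication, note the absence of any access key):

```python
def batch_linked_service(name: str, batch_account: str, batch_url: str,
                         pool_name: str, storage_ls: str) -> dict:
    """Approximate JSON body of an Azure Batch linked service that relies on
    the workspace's system-assigned managed identity (no access key stored)."""
    return {
        "name": name,
        "properties": {
            "type": "AzureBatch",
            "typeProperties": {
                "accountName": batch_account,  # Batch account name from Step 3
                "batchUri": batch_url,         # account endpoint from Step 3
                "poolName": pool_name,         # pool the tasks will run on
                # Storage linked service created in Step 2
                "linkedServiceName": {
                    "referenceName": storage_ls,
                    "type": "LinkedServiceReference",
                },
            },
        },
    }

batch_ls = batch_linked_service("LS_Batch", "mybatchacct",
                                "https://mybatchacct.eastus.batch.azure.com",
                                "mypool", "LS_ScriptStorage")
```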


5. Publish all your changes






Adding UAMI RBAC Roles to Storage Account


6. In the Azure Portal, go to your Storage Account -> Access Control (IAM)

a. Click Add, then Add role assignment; search for "Storage Blob Data Contributor" and click Next.






b. Choose Managed identity, select your UAMI, click Select, then click Next, Next, and Review + assign.






Adding SAMI RBAC Roles to Batch Account


7. In the Azure Portal, go to your Batch Account -> Access Control (IAM)

a. Click Add, then Add role assignment.






b. Click the "Privileged administrator roles" tab, choose the Contributor role, and click Next.






c. Choose Managed identity, search under Managed identity for "Synapse workspace", and select the same SAMI noted in Step 4a; then click Select, Next, Next, and Review + assign.





Adding UAMI to Batch Pool


If you need to create a new Batch Pool, create it first and then continue with the steps below.


8. If you already have a Batch Pool created, follow these steps:

a. In the Azure Portal, go to your Batch Account -> Pools -> choose your Pool -> Identity.






b. Click Add, choose the necessary UAMIs (in this example, the one used by the Synapse linked service for Storage plus another one used for other integrations), and click Add.

Important: If your Batch Pool uses multiple UAMIs (for example, to connect to Key Vault or other services), you must first remove the existing one and then add all of them together.








c. Then, scale the pool in and back out so the identity changes are applied to the nodes.






Setting up the Pipeline


9. In your Synapse Portal, go to Integrate -> Add New Resource -> Pipeline






10. In the Activities panel on the right, expand Batch Service and drag and drop the Custom activity onto the canvas.






11. In the Azure Batch tab of the Custom activity, select the Azure Batch linked service created in Step 4 and test the connection (if you receive a connection error, see Troubleshooting scenario 1).






12. Then go to the Settings tab and add your script. For this example, we will use a PowerShell script previously uploaded to a Storage blob container, sending the output to a txt file.


a. Command: your script invocation

b. Resource linked service: the Storage linked service configured in Step 2

c. Browse storage: look up the container where your script was uploaded



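For reference, the Custom activity configured in steps a–c is stored in the pipeline JSON roughly as sketched below (the script name, linked service name, and folder are hypothetical placeholders):

```python
def custom_activity(name: str, command: str, storage_ls: str, folder_path: str) -> dict:
    """Approximate JSON of a Custom activity that runs a script from Storage
    on the Batch pool."""
    return {
        "name": name,
        "type": "Custom",
        "typeProperties": {
            # Command executed on the Batch node; redirects output to a txt file
            "command": command,
            # Storage linked service (Step 2) that holds the script
            "resourceLinkedService": {
                "referenceName": storage_ls,
                "type": "LinkedServiceReference",
            },
            # Container/folder where the script was uploaded
            "folderPath": folder_path,
        },
    }

act = custom_activity("RunScript",
                      "powershell -File ./myscript.ps1 > output.txt",
                      "LS_ScriptStorage", "scripts")
```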



d. Publish your changes and perform a Debug run.






Debugging


13. Check the Synapse job logs and outputs.

a. Copy the Activity Run ID







b. Then, in the Azure Portal, go to your Storage Account -> Containers -> adfjobs -> select the folder with the activity ID -> output.

c. Here you will find two files, "stderr.txt" and "stdout.txt"; both contain information about the errors or the output of the commands executed during the task.
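Because this folder layout is fixed, the blob paths for a given run can be derived directly from the activity run ID. A small helper sketching that convention (container name adfjobs as described above):

```python
def adf_job_log_paths(activity_run_id: str, container: str = "adfjobs") -> dict:
    """Blob paths where the Custom activity's console output lands, following
    the adfjobs/<activityRunId>/output layout described above."""
    base = f"{container}/{activity_run_id}/output"
    return {"stdout": f"{base}/stdout.txt", "stderr": f"{base}/stderr.txt"}

# Example with a made-up activity run ID copied from the Synapse monitor
paths = adf_job_log_paths("d6b9f9aa-0000-1111-2222-333344445555")
```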








14. Check the Batch logs and outputs. There are different ways to get the Batch logs:


a. Via Nodes: In the Azure Portal, go to your Batch Account -> Pools -> choose your Pool -> Nodes -> in the node's folder details, open the folder for this Synapse execution -> job-x -> look for the activity ID.






b. Via Jobs: In the Azure Portal, go to your Batch Account -> Jobs -> select the job named adfv2-yourPoolName -> click the task whose ID matches the activity run ID copied earlier.



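The naming conventions described in steps a–b can be captured in a small helper, which is handy if you later script log retrieval (a sketch based only on the conventions above):

```python
def batch_job_and_task(pool_name: str, activity_run_id: str) -> dict:
    """Return the Batch identifiers used by ADF/Synapse custom activities:
    the job is named 'adfv2-<poolName>' and the task ID matches the
    pipeline's activity run ID, per the steps above."""
    return {"job_id": f"adfv2-{pool_name}", "task_id": activity_run_id}

ids = batch_job_and_task("mypool", "d6b9f9aa-0000-1111-2222-333344445555")
```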



What we have learned




During this walkthrough we have learned about and implemented the following:



  • Authentication: Utilizing User Assigned Managed Identities (UAMI) and System Assigned Managed Identity (SAMI) for secure connections.
  • Linked Services: Creation and configuration of linked services for Azure Storage and Azure Batch accounts.
  • Pipeline Execution: Steps to create, configure, and execute an ADF/Synapse Pipeline, emphasizing the use of Synapse as a unified term to avoid redundancy.
  • Configuration: Detailed instructions for creating credentials, adding RBAC roles, and setting up pipelines, along with troubleshooting tips.
  • Logs Analysis: How to access and analyze Synapse Jobs logs and Azure Batch logs for troubleshooting.
  • Error Handling: Understanding the significance of ‘stderr.txt’ and ‘stdout.txt’ files in identifying and resolving errors during task execution.

If you have any questions or feedback, please leave a comment below!
