Josedobla
Introduction
When it comes to building a robust foundation for ETL (Extract, Transform, Load) pipelines, the combination of Azure Data Factory or Azure Synapse Analytics, Azure Batch, and Azure Storage is indispensable. Together, these services enable efficient data movement, transformation, and processing across diverse data sources.
This document provides a comprehensive guide on how to authenticate to Azure Batch with the Synapse system-assigned managed identity (SAMI) and to Azure Storage with a user-assigned managed identity (UAMI). This enables identity-based connectivity to storage for data extraction, and allows the use of custom activities to run High-Performance Computing (HPC) workloads on the extracted data.
The key enabler of these functionalities is the Synapse pipeline. Serving as the primary orchestrator, it integrates various Azure resources in a secure manner. The same capabilities extend to Azure Data Factory (ADF), providing a broader scope of data management and transformation.
Through this guide, you will gain insights into leveraging these powerful Azure services to optimize your data processing workflows.
Services Overview
During this procedure we will use several Azure services; below you will find more details about each of them.
Azure Synapse Analytics / Data Factory
- Azure Synapse Analytics is an enterprise analytics service that accelerates time to insight across data warehouses and big data systems. Azure Synapse brings together the best of SQL technologies used in enterprise data warehousing, Spark technologies used for big data, Data Explorer for log and time series analytics, Pipelines for data integration and ETL/ELT, and deep integration with other Azure services such as Power BI, CosmosDB, and AzureML.
- Documentation:
Azure Batch
- Azure Batch is a powerful platform service designed for running large-scale parallel and high-performance computing (HPC) applications in the cloud.
- Documentation: Azure Batch runs large parallel jobs in the cloud - Azure Batch | Microsoft Learn
Azure Storage
- Azure Storage provides scalable and secure storage services for various data types, including services like Azure Blob storage, Azure Table storage, and Azure Queue storage.
- Documentation: Introduction to Azure Storage - Cloud storage on Azure | Microsoft Learn
Managed Identities
- Azure Managed Identities are a feature of Azure Active Directory that automatically manages credentials for applications to use when connecting to resources that support Azure AD authentication. They eliminate the need for developers to manage secrets, credentials, certificates, and keys.
- There are two types of managed identities:
- System-assigned: tied to a single Azure resource (such as a Synapse workspace) and deleted along with it.
- User-assigned: a standalone Azure resource that can be assigned to one or more resources.
- Documentation: Managed identities for Azure resources - Managed identities for Azure resources | Microsoft Learn
Scenario
Run an ADF / Synapse pipeline that pulls a script located in a Storage Account and executes it on the Batch nodes, using a User-Assigned Managed Identity (UAMI) for authentication to Storage and the System-Assigned Managed Identity (SAMI) to authenticate with Batch.
Prerequisites
- ADF / Synapse Workspace
- UA Managed Identity
- Storage Account
- Documentation: Create a storage account - Azure Storage | Microsoft Learn
Procedure Overview
During this procedure we will walk through, step by step, the following actions:
- Create UAMI Credentials
- Create Linked Services for Storage and Batch Accounts
- Add UAMI and SAMI to Storage and Batch Accounts
- Create, Configure and Execute an ADF / Synapse Pipeline
- Note: we will refer to ADF (Portal, Workspace, Pipelines, Jobs, Linked Services) as Synapse throughout the exercise and examples to avoid redundancy.
- Debugging
Procedure
Create UAMI Credentials
1. In your Synapse Portal, go to Manage -> Credentials -> New, fill in the details, and click Create.
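Under the hood, the credential is stored as a JSON resource in your workspace. A minimal sketch of what it looks like — the name and the resource ID below are placeholders you would replace with your own UAMI details:

```json
{
  "name": "uami-credential",
  "properties": {
    "type": "ManagedIdentity",
    "typeProperties": {
      "resourceId": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<uami-name>"
    }
  }
}
```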
Create Linked Services Connections for Storage and Batch
2. In your Synapse Portal, go to Manage -> Linked Services -> New -> Azure Blob Storage -> Continue and complete the form:
a. Authentication Type: UAMI
b. Azure Subscription: choose your subscription
c. Storage Account name: choose the storage account where the script to be used is stored
d. Credentials: choose the credential created in Step 1
e. Click Create
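For reference, the resulting linked service definition looks roughly like the sketch below; the linked service name, storage endpoint, and credential name are placeholders, with the credential reference pointing at the credential from Step 1:

```json
{
  "name": "LS_Storage_UAMI",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "serviceEndpoint": "https://<storage-account>.blob.core.windows.net/"
    },
    "credential": {
      "referenceName": "uami-credential",
      "type": "CredentialReference"
    }
  }
}
```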
3. In the Azure Portal, go to your Batch Account -> Keys and copy the Batch account name and account endpoint to be used in the next step; also copy the pool name to be used for this example.
4. In your Synapse Portal, go to Manage -> Linked Services -> New -> Azure Batch -> Continue and fill in the information:
a. Authentication Method: SAMI (copy the managed identity name to be used later)
b. Account Name, Batch URL and Pool Name: paste the values copied in Step 3
c. Storage linked service name: choose the one created in Step 2
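A sketch of the resulting Azure Batch linked service JSON is shown below; note that with SAMI authentication no access key appears in the definition, as the workspace's system-assigned identity is used implicitly. All names and the Batch URI are placeholders for the values collected in Step 3:

```json
{
  "name": "LS_Batch_SAMI",
  "properties": {
    "type": "AzureBatch",
    "typeProperties": {
      "accountName": "<batch-account>",
      "batchUri": "https://<batch-account>.<region>.batch.azure.com",
      "poolName": "<pool-name>",
      "linkedServiceName": {
        "referenceName": "<storage-linked-service>",
        "type": "LinkedServiceReference"
      }
    }
  }
}
```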
5. Publish all your changes
Adding UAMI RBAC Roles to Storage Account
6. In the Azure Portal, go to your Storage Account -> Access Control (IAM)
a. Click Add -> Add role assignment, search for "Storage Blob Data Contributor", and click Next.
b. Choose Managed identity, select your UAMI, click Select, then Next, Next, and Review + assign.
Adding SAMI RBAC Roles to Batch Account
7. In the Azure Portal, go to your Batch Account -> Access Control (IAM)
a. Click Add -> Add role assignment.
b. On the "Privileged administrator roles" tab, choose the Contributor role and click Next.
c. Choose Managed identity, filter by "Synapse workspace", and select the same SAMI noted in Step 4a; then click Select, Next, Next, and Review + assign.
Adding UAMI to Batch Pool
If you need to create a new Batch pool, you can follow this procedure:
- Documentation: Configure managed identities in Batch pools - Azure Batch | Microsoft Learn
- Make sure to select the UAMI configured in Step 1
8. If you already have a Batch pool, follow these steps:
a. In the Azure Portal, go to your Batch Account -> Pools -> choose your pool -> Identity
b. Click Add, choose the necessary UAMIs (in this example, the one used by the Synapse linked service for Storage and another one used for other integrations), and click Add.
Important: If your Batch pool uses multiple UAMIs (for example, to connect to Key Vault or other services), you must first remove the existing one and then add all of them together.
c. Finally, scale the pool in and back out to apply the changes.
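At the resource level, the pool's identity block ends up looking roughly like the sketch below. All assigned UAMIs live together in a single userAssignedIdentities map, which is why existing entries are removed and re-added as one set; the resource IDs are placeholders:

```json
{
  "identity": {
    "type": "UserAssigned",
    "userAssignedIdentities": {
      "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<uami-for-storage>": {},
      "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<uami-for-other-integrations>": {}
    }
  }
}
```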
Setting up the Pipeline
9. In your Synapse Portal, go to Integrate -> Add New Resource -> Pipeline
10. From the Activities panel on the right, expand Batch Service and drag and drop a Custom activity onto the canvas.
11. In the Azure Batch tab of the Custom activity, select the Azure Batch linked service created in Step 4 and test the connection (if you receive a connection error, please go to Troubleshooting scenario 1).
12. Then go to the Settings tab and add your script. For this example, we will use a PowerShell script previously uploaded to a Storage blob container, and send the output to a text file.
a. Command: your script details
b. Resource linked service: the Storage linked service configured previously in Step 2
c. Browse Storage: look for the container where your script was uploaded
d. Publish your changes and perform a Debug run
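Put together, the Custom activity in the pipeline JSON looks roughly like this sketch. The activity name, linked service names, folder path, and script name are placeholders for the ones you configured above:

```json
{
  "name": "RunScriptOnBatch",
  "type": "Custom",
  "linkedServiceName": {
    "referenceName": "<batch-linked-service>",
    "type": "LinkedServiceReference"
  },
  "typeProperties": {
    "command": "powershell -File myscript.ps1 > output.txt",
    "resourceLinkedService": {
      "referenceName": "<storage-linked-service>",
      "type": "LinkedServiceReference"
    },
    "folderPath": "<container-or-folder-with-script>"
  }
}
```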
Debugging
13. Check the Synapse job logs and outputs:
a. Copy the Activity Run ID.
b. Then, in the Azure Portal, go to your Storage Account -> Containers -> adfjobs -> select the folder named with the activity ID -> output.
c. Here you will find two files, "stderr.txt" and "stdout.txt"; both contain information about the errors or the outputs of the commands executed during the task.
14. Check the Batch logs and outputs. You can reach the Batch logs in different ways:
a. Via nodes: In the Azure Portal, go to your Batch Account -> Pools -> choose your pool -> Nodes, then in the folder details open the folder for this Synapse execution -> job-x -> look for the activity ID.
b. Via jobs: In the Azure Portal, go to your Batch Account -> Jobs -> select the job named adfv2-yourPoolName -> click the task whose ID matches the activity ID of the Synapse pipeline from Step 13a.
What we have learned
During this walkthrough we have learned about and implemented:
- Authentication: Utilizing User Assigned Managed Identities (UAMI) and System Assigned Managed Identity (SAMI) for secure connections.
- Linked Services: Creation and configuration of linked services for Azure Storage and Azure Batch accounts.
- Pipeline Execution: Steps to create, configure, and execute an ADF/Synapse Pipeline, emphasizing the use of Synapse as a unified term to avoid redundancy.
- Procedure: Detailed instructions for creating credentials, adding RBAC roles, and setting up pipelines, along with troubleshooting tips.
- Logs Analysis: How to access and analyze Synapse Jobs logs and Azure Batch logs for troubleshooting.
- Error Handling: Understanding the significance of ‘stderr.txt’ and ‘stdout.txt’ files in identifying and resolving errors during task execution.
If you have any questions or feedback, please leave a comment below!