Authenticating to an Azure CycleCloud Slurm cluster with Azure Active Directory



Overview:

 

Azure CycleCloud is an enterprise-friendly tool for orchestrating and managing High Performance Computing (HPC) environments on Azure. With CycleCloud, users can provision infrastructure for HPC systems, deploy familiar HPC schedulers, and automatically scale the infrastructure to run jobs efficiently at any scale.

As enterprises increasingly move to Azure Active Directory for their authentication needs, this blog explores how Azure AD and OpenSSH certificate-based authentication can be used to authenticate to a Slurm cluster. We also utilise the recent Azure Bastion native client support feature to provide remote access to the Login Node over the public internet.

In summary, we will use native Azure AD Linux authentication to access the Login Node through the Azure Bastion host, using a temporary, automatically provisioned SSH key. Once logged into the Login Node, the CycleCloud-provisioned user account and SSH keys provide authentication to the scheduler and compute nodes. AAD authentication improves the security of the environment by allowing conditional access policies, for example requiring multi-factor authentication before SSH access is granted.

Components:

 

This solution uses an existing Azure AD tenant and standard deployments of CycleCloud 8.2, Azure Files NFS (to provide a persistent /shared folder), a Login Node (more details later) and Azure Bastion (Standard SKU). I prefer to deploy these using Bicep. The OS used for all VMs is the AlmaLinux 8.5 HPC image.
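As a hedged example, the Azure Files NFS share backing /shared could also be created with the Azure CLI rather than Bicep. This is only a sketch: the resource group, storage account and share names are placeholders, an NFS 4.1 share needs a Premium FileStorage account with secure transfer disabled, and in practice access must also be restricted to the cluster's virtual network.

# NFS 4.1 shares require a Premium FileStorage account with
# secure transfer (HTTPS-only) disabled.
az storage account create \
  --name nfssharesexample \
  --resource-group myResourceGroup \
  --sku Premium_LRS \
  --kind FileStorage \
  --https-only false

# Create the NFS share that the cluster mounts as /shared.
az storage share-rm create \
  --storage-account nfssharesexample \
  --resource-group myResourceGroup \
  --name shared \
  --quota 1024 \
  --enabled-protocols NFS \
  --root-squash NoRootSquash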

Solution:

1) Azure Bastion: this is a typical deployment of Azure Bastion, the only additional considerations being to ensure it uses the Standard SKU and that enableTunneling is set to true.

resource azureBastion 'Microsoft.Network/bastionHosts@2022-01-01' = {
  name: bastionName
  location: location
  properties: {
    enableTunneling: true
    ipConfigurations: [
      {
        name: 'IpConf'
        properties: {
          subnet: {
            id: '${vnetId}/subnets/AzureBastionSubnet'
          }
          publicIPAddress: {
            id: pip.id
          }
        }
      }
    ]
  }
  sku: {
    name: 'Standard'
  }
}
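As a usage sketch, the template can be deployed with the Azure CLI; the file name and resource group below are placeholders, and bastionName, location, vnetId and the public IP are assumed to be declared as parameters or resources elsewhere in the template.

# Deploy the Bastion Bicep template into the target resource group.
az deployment group create \
  --resource-group myResourceGroup \
  --template-file bastion.bicep \
  --parameters bastionName=myBastion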

2) Login Node: again, a standard virtual machine deployment. To interact with the Slurm cluster it should have Slurm and Munge installed with configurations matching your Slurm cluster. The /shared folder is also mounted to provide access to the shared home folders. Typically, to enable AAD auth for Linux we would ensure the VM has a system-assigned managed identity and add the AADSSHLoginForLinux extension. As the extension does not currently support AlmaLinux, the required packages have been installed using cloud-init, referencing the RHEL 8 RPMs. Additionally, note how the default home directory for new users has been changed to /shared/home and SELinux has been configured to allow home directories on NFS.

#cloud-config

# [BUG] WALinuxAgent service should start after the cloud-config service (Azure/WALinuxAgent issue #1938)
bootcmd:
  - mkdir -p /etc/systemd/system/walinuxagent.service.d
  - printf "[Unit]\nAfter=cloud-final.service\n" > /etc/systemd/system/walinuxagent.service.d/override.conf
  - sed "s/After=multi-user.target//g" /lib/systemd/system/cloud-final.service > /etc/systemd/system/cloud-final.service
  - systemctl daemon-reload

# Microsoft package repository providing the aadsshlogin RPMs
yum_repos:
  packages-microsoft-com-prod:
    baseurl: https://packages.microsoft.com/rhel/8/prod/
    enabled: true
    gpgcheck: true
    gpgkey: https://packages.microsoft.com/keys/microsoft.asc
    name: packages-microsoft-com-prod

packages:
  - munge
  - nfs-utils
  - aadsshlogin-selinux.x86_64
  - aadsshlogin.x86_64

mounts:
  - ["nfsshares2960c680b0a1578.file.core.windows.net:/nfsshares2960c680b0a1578/shared", /shared, nfs, "vers=4,minorversion=1,sec=sys"]

runcmd:
  # Mount /shared and install the cluster's munge key
  - mkdir -p /shared
  - mount -t nfs nfsshares2960c680b0a1578.file.core.windows.net:/nfsshares2960c680b0a1578/shared /shared -o vers=4,minorversion=1,sec=sys
  - cp /shared/apps/slurm/munge.key /etc/munge
  - chown -R munge.munge /etc/munge/ /var/log/munge/
  - chmod 0700 /etc/munge/ /var/log/munge/
  - systemctl enable munge
  - systemctl stop munge
  - systemctl start munge
  # Install Slurm packages matching the CycleCloud cluster version
  - wget https://github.com/Azure/cyclecloud-slurm/releases/download/2.4.1/slurm-20.11.0-0rc2.el8.x86_64.rpm
  - wget https://github.com/Azure/cyclecloud-slurm/releases/download/2.4.1/slurm-perlapi-20.11.0-0rc2.el8.x86_64.rpm
  - dnf localinstall ./slurm-20.11.0-0rc2.el8.x86_64.rpm -y
  - dnf localinstall ./slurm-perlapi-20.11.0-0rc2.el8.x86_64.rpm -y
  # Create the slurm user/group and copy the cluster configuration
  - groupadd slurm --gid 11100
  - useradd -m -d /home/slurm --gid 11100 --uid 11100 slurm
  - mkdir /etc/slurm
  - cp /shared/apps/slurm/slurm.conf /etc/slurm
  - cp /shared/apps/slurm/cyclecloud.conf /etc/slurm
  - chown -R slurm.slurm /etc/slurm
  # Default new users' home directories to /shared/home and allow NFS home dirs under SELinux
  - sed -i --follow-symlinks "s/HOME=.*/HOME=\/shared\/home/g" /etc/default/useradd
  - setsebool -P use_nfs_home_dirs on
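For reference, the cloud-init above can be supplied when the Login Node is created. A minimal sketch with the Azure CLI, assuming the file is saved as login-cloud-init.yaml; the image URN, VM size and network names are placeholders to adjust for your environment.

# Create the Login Node with a system-assigned managed identity
# (used for AAD login) and pass the cloud-init as custom data.
# The image URN is illustrative; check `az vm image list` for the
# exact AlmaLinux 8.5 HPC URN in your region.
az vm create \
  --resource-group myResourceGroup \
  --name loginnode \
  --image almalinux:almalinux-hpc:8_5-hpc:latest \
  --size Standard_D4s_v5 \
  --vnet-name hpc-vnet \
  --subnet compute \
  --assign-identity \
  --custom-data login-cloud-init.yaml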

To be able to access the VM, the last thing to do is to assign an RBAC role that allows the user to log in. This could be the Virtual Machine User Login role, for normal users, or the Virtual Machine Administrator Login role, for system administrators. Here is an example of assigning the role to a standard user:

# Assign the Virtual Machine User Login role to the signed-in user,
# scoped to the Login Node VM.
username=$(az account show --query user.name --output tsv)
vm=$(az vm show --resource-group myResourceGroup --name loginnode --query id --output tsv)

az role assignment create \
  --role "Virtual Machine User Login" \
  --assignee $username \
  --scope $vm

More details about the steps needed to enable Azure AD SSH login on a VM can be found in the Azure AD and OpenSSH documentation linked in the Reference section below.

With the role assignment in place authorizing our user to log in to the VM, we can use the Azure CLI to connect to the Login Node via Azure Bastion.
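For example, a connection using the Bastion native client support in the Azure CLI might look like the following; the Bastion and resource group names are placeholders, and the CLI may prompt to install the bastion and ssh extensions.

# Look up the Login Node's resource ID, then open an SSH session
# through Azure Bastion, authenticating with Azure AD.
vm=$(az vm show --resource-group myResourceGroup --name loginnode --query id --output tsv)

az network bastion ssh \
  --name myBastion \
  --resource-group myResourceGroup \
  --target-resource-id $vm \
  --auth-type AAD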

Once connected, note the user's home directory name and uid/gid, and create an SSH key pair for use with the cluster. These details will be used to create a matching 'local' user in the CycleCloud user management system.
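A quick sketch of gathering those details on the Login Node; the key type and path are just one reasonable choice.

# Record the username, uid/gid and home directory of the AAD user.
id
echo $HOME

# Create an SSH key pair; the public key will be attached to the
# matching 'local' user in CycleCloud.
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa -N ""
cat ~/.ssh/id_rsa.pub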

3) CycleCloud: with user access to the Login Node now established, we must grant the user access to the compute nodes. For this, the built-in CycleCloud user management system is used to create a 'local' user matching our AAD principal.

The default CycleCloud Slurm template is used to create the cluster, with its default NFS mount pointed at the Azure Files NFS share.

Conclusion:

 

With the cluster started and the 'local' user assigned, we can update the Login Node to ensure it has the correct munge key and that slurm.conf points to the scheduler. From the Login Node, our AAD-authenticated user can now submit and run jobs on the cluster.
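For example, a quick smoke test from the Login Node might look like this; partition names and autoscaling behaviour depend on your cluster template.

# Confirm the Login Node can reach the scheduler.
sinfo

# Submit a trivial job and watch the queue; CycleCloud will
# autoscale a compute node to run it.
sbatch --wrap "hostname"
squeue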

Reference:

 

Azure AD and OpenSSH

 

Azure CycleCloud

 

Azure HPC

 

Azure Bastion - Native Client

 
