The process of integrating Azure HPC into the rest of Azure has raised many questions about why we do things differently from the general-purpose fleet. It turns out, though, that we don't work so differently after all, once you change your perspective and start thinking of the whole cluster as a single supercomputer.

[HEADING=1]Why can't we split this cluster in two?[/HEADING]
Replacing a big cluster with two smaller ones is a lot like replacing a big server with two smaller ones. To our customers, a bigger cluster isn't just more inventory; it's a faster supercomputer. The bigger we go, the harder the cluster is to place in a datacenter, but also the more customer requirements we can meet.

[HEADING=1]What's wrong with a deployment that spans clusters?[/HEADING]
Imagine the headaches from a VM that's somehow split across two machines. Even if you can get the software stack to work, performance will inevitably take a nosedive.

[HEADING=1]Why do we need multi-node testing?[/HEADING]
You can stress-test general-purpose nodes independently, but when it comes to HPC clusters, anything less than running the whole supercomputer together is incomplete. That said, you can get 90% of the value for 10% of the work by running at least two nodes together (see the sketch at the end of this post).

[HEADING=1]Does this belong in Azure Compute or Azure Networking?[/HEADING]
It takes a little bit of both. The backend network in an HPC cluster uses switches and routing like any other network, but all of those connections are internal to the supercomputer, which changes how you manage them.

[HEADING=1]Why do whole clusters go down for maintenance at once?[/HEADING]
An HPC cluster going down to upgrade the firmware on all of its backend switches is the equivalent of a single server going down for an OS upgrade. These components are so tightly coupled that you can't expect them to perform optimally unless they're upgraded together in lockstep.

[HEADING=1]Why is live migration so hard?[/HEADING]
You live-migrate a VM by gradually cloning it onto other hardware until the clone can replace the original. Trying to do that to just one node in a supercomputer is the equivalent of asking to restructure a VM in place while it's running.

[HEADING=1]Why don't IB SKUs support partial VM sizes?[/HEADING]
From our point of view, we already do, and with more flexibility than any general-purpose SKU. If you land on a 256-node cluster, you have up to 256 choices for the size of your partial supercomputer. Nothing stops us from getting even more granular, but you've got to remember that people renting supercomputers tend to think big, and then weigh that demand against problems like noisy neighbors on NICs or tenant isolation on internal fabrics like NVLink.

In fact, a lot of the hot new tech assumes there's an appetite for a bigger atomic unit of compute. One of the biggest examples is Nvidia's NVL72 system, where the NVLink domain has outgrown the server to envelop a whole rack. That effectively introduces a new unit of size between a server and a cluster, and you bet there's been plenty of discussion around here about the right way to fit it into the Azure model.

But that's what's fun about this space. The landscape is constantly changing, and you've got to keep updating your mental models to keep up.
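As promised above, here's a minimal sketch of what "running at least two nodes together" can look like in practice: a plain MPI ping-pong between two ranks placed on different nodes, which forces traffic across the backend fabric instead of staying inside one server. This is not our actual validation suite, and the hostfile name is made up for the example; it's just about the smallest program that tells you something a single-node burn-in never could.

[CODE]
/* pingpong.c - a minimal two-node sanity check over MPI.
 * Rank 0 bounces a buffer off rank 1, exercising the interconnect
 * path between two hosts rather than the hardware inside one node.
 * Build:  mpicc -O2 pingpong.c -o pingpong
 * Run:    mpirun -np 2 --hostfile two_nodes.txt ./pingpong
 *         (two_nodes.txt is a hypothetical hostfile listing one slot per node)
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0)
            fprintf(stderr, "Run with exactly 2 ranks, one per node.\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    const int msg_bytes = 8 * 1024 * 1024;   /* 8 MiB per message */
    const int iters = 100;
    char *buf = malloc(msg_bytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();

    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, msg_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, msg_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, msg_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double elapsed = MPI_Wtime() - start;
    if (rank == 0) {
        /* Each iteration moves the buffer across the link twice (out and back). */
        double gigabytes = 2.0 * iters * (double)msg_bytes / 1e9;
        printf("Round-trip bandwidth: %.2f GB/s over %d iterations\n",
               gigabytes / elapsed, iters);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
[/CODE]

If the reported number comes in far below what the fabric should sustain, you've caught a cabling, firmware, or placement problem that no amount of single-node testing would have surfaced.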