HPC Lift and Shift Cloud Migration: Architecture and Best Practices

  • Thread starter: Marco_Netto

Have you ever wondered what it takes to run an enterprise-level HPC environment in the Cloud? Which components should be in place, and which steps are needed to move from an on-premises environment to the Cloud? What are the best practices along the way? Everything starts with a Proof-of-Concept (PoC), in which an organization assesses how its key applications will perform in the Cloud, considering both performance and cost. Once the decision to migrate is made, it is important to understand what an enterprise-level HPC Cloud environment requires.
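As an illustration only, the performance-versus-cost assessment made during a PoC can be sketched as a small comparison of benchmark runs. The VM labels, runtimes, and prices below are invented placeholders, not measurements or real prices, and the cost model (nodes × hours × hourly price per node) is a simplifying assumption that ignores storage, licensing, and data transfer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PocRun:
    """One benchmark run of an application on a candidate VM type."""
    vm_type: str                # hypothetical label, not a real SKU name
    nodes: int                  # number of nodes used by the job
    runtime_hours: float        # measured wall-clock time of the job
    price_per_node_hour: float  # assumed price, in a single currency


def job_cost(run: PocRun) -> float:
    """Cost of one job: nodes x hours x hourly price per node."""
    return run.nodes * run.runtime_hours * run.price_per_node_hour


def cheapest_within_deadline(runs, max_runtime_hours):
    """Return the lowest-cost run that meets the runtime limit, or None."""
    eligible = [r for r in runs if r.runtime_hours <= max_runtime_hours]
    return min(eligible, key=job_cost, default=None)


# Placeholder PoC results (assumed values for illustration).
runs = [
    PocRun("vm-small", nodes=4, runtime_hours=10.0, price_per_node_hour=1.0),
    PocRun("vm-large", nodes=4, runtime_hours=4.0, price_per_node_hour=3.0),
]

for run in runs:
    print(run.vm_type, job_cost(run))
```

With a tight runtime limit the faster but more expensive option is selected, while a relaxed limit favors the cheaper one, which is the kind of trade-off a PoC is meant to expose.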



Based on our experience with various clients, partners, and product groups, we have put together comprehensive documentation on HPC lift and shift Cloud migration. This blog post gives an overview of what the document covers. Feedback is always welcome, as we will keep improving the documentation over time.



TL;DR

- We have just published detailed documentation on HPC lift and shift cloud migration, covering components, steps, examples, and best practices. We also provide references to products, code repositories, and blog posts.

- The documentation can be accessed here: End-to-end high-performance computing (HPC) lift and shift architecture overview





DOCUMENTATION OVERVIEW

Here we provide an overview of the documentation: LINK



On-premises. We start the document by describing a typical on-premises HPC environment: compute nodes, a job scheduler such as Slurm, PBS, or LSF, identity management, storage options, and monitoring tools, all hosted within a private network.


[Image: Marco_Netto_0-1727808206694.png]






Personas. After the on-premises environment, we turn to the personas. In our experience, there is a lot of discussion about what changes and what stays the same for the people involved when moving from on-premises to the Cloud. We describe their responsibilities and new tasks in an HPC Cloud setup for four personas:

- End-user (engineer / scientist / researcher)

- HPC administrator

- Cloud administrator

- Business manager / owner



HPC Cloud target architecture. Next comes an overview of the target HPC Cloud architecture, which shows that the conceptual components change little compared to an on-premises environment. One key difference is that resources are allocated on demand, so users can access more resources as needed.
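To make the on-demand idea concrete, here is a minimal sketch of how a scale-up decision could be computed from the job queue. It is not the logic of any particular scheduler or orchestrator: the function name, its parameters, and the rule that each job occupies whole nodes are all assumptions made for illustration.

```python
import math


def nodes_to_add(queued_job_cores, cores_per_node, idle_nodes,
                 current_nodes, max_nodes):
    """Estimate how many nodes to provision for the queued jobs.

    Assumptions (illustrative only): each job occupies whole nodes,
    all nodes are identical, and the cluster is capped at max_nodes.
    """
    if cores_per_node <= 0:
        raise ValueError("cores_per_node must be positive")

    # Whole nodes required by each queued job, summed over the queue.
    needed = sum(math.ceil(cores / cores_per_node)
                 for cores in queued_job_cores)

    # Idle nodes can absorb part of the demand before anything is added.
    shortfall = max(0, needed - idle_nodes)

    # Never grow beyond the configured cluster limit.
    headroom = max(0, max_nodes - current_nodes)
    return min(shortfall, headroom)


# Three queued jobs requesting 64, 200, and 16 cores on 64-core nodes.
print(nodes_to_add([64, 200, 16], cores_per_node=64,
                   idle_nodes=1, current_nodes=3, max_nodes=10))
```

The cap matters in practice because it is what keeps on-demand growth within quota and budget; an empty queue yields zero new nodes.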




[Image: Marco_Netto_0-1727808296650.png]




Migration guide. After a brief discussion on exploring the Cloud environment through a Proof-of-Concept (PoC), we dive deep into the migration guide itself. We have broken the guide into five steps.



  1. Basic infrastructure. The focus here is on setting up resource groups, networking, and basic storage, which serve as the backbone of a successful HPC lift-and-shift deployment.



  2. Base services. This section covers the core components related to the job scheduler, including the resource orchestrator for provisioning and setting up resources, identity management for user authentication, monitoring (including node health checks), and accounting to better understand the status and usage of resources. Each component plays a crucial role in ensuring the performance, scalability, and security of the HPC environment.



  3. Storage. This section highlights the critical considerations for managing storage in an HPC cloud environment, focusing on the variety of cloud storage options and the processes for migrating data. It also offers practical guidance for setting up storage and managing data migration, with an emphasis on scalability and automation as the HPC environment evolves.



  4. Compute nodes. This section provides guidance on selecting and managing compute resources efficiently for HPC workloads in the cloud, including some recommendations and pointers on VM images.



  5. End user entry point. This section explores the options for user interaction, emphasizing the importance of addressing potential latency issues that may arise when moving to the cloud. It also provides guidance on tools, services, and best practices for optimizing the user entry point for HPC lift-and-shift deployments. A quick start setup is included to help establish this component efficiently, with the goal of automating it as the cloud infrastructure matures.
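For the data migration part of the storage step above, one common safeguard is to verify that every migrated file matches its source. The sketch below is a generic checksum comparison, not a procedure taken from the documentation; it assumes both the on-premises tree and the cloud copy are reachable as local paths (for example through a mounted file share), and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large datasets do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(source_root: Path, target_root: Path) -> list[str]:
    """Return relative paths that are missing or differ in the target."""
    mismatches = []
    for src in sorted(source_root.rglob("*")):
        if not src.is_file():
            continue
        relative = src.relative_to(source_root)
        dst = target_root / relative
        # A file counts as mismatched if it is absent or its hash differs.
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            mismatches.append(relative.as_posix())
    return mismatches
```

Calling `find_mismatches(Path("/data/project"), Path("/mnt/cloud/project"))` (both paths are placeholders) returns an empty list when the copy is complete. For very large datasets, hashing every file can be slow, so a size or timestamp comparison may be a reasonable first pass.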



WHAT IS NEXT?

We will continue to improve and expand the documentation on this topic as new services, products, and learnings become available. The documentation is not intended to cover every possible deployment in the Cloud, but to provide guidance based on patterns we observe in how customers use the Cloud to run their HPC workloads. If there is any subject on which you need more detail, please send us a note!



LINK TO FULL DOCUMENTATION

End-to-end high-performance computing (HPC) lift and shift architecture overview





#AzureHPCAI
