Deploy Slurm on GKE
Introduction
Slurm (formerly known as Simple Linux Utility for Resource Management) is a powerful open-source workload manager that’s designed for Linux and Unix-like systems. It’s used extensively in high performance computing (HPC) environments, including many of the world’s supercomputers and large computer clusters.
Slurm uses a centralized manager (slurmctld) to monitor resources and work, with an optional backup manager for fault tolerance. Each node runs a slurmd daemon to execute work. An optional slurmdbd (Slurm DataBase Daemon) can record accounting information for multiple Slurm-managed clusters in a single database. More information about the Slurm architecture can be found on the SchedMD documentation website.
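To make the interface concrete, a minimal Slurm batch job looks like the following sketch. The partition name `slurmd` and the GPU request are placeholders; adjust them to match your cluster's configuration:

```shell
#!/bin/bash
#SBATCH --job-name=hello-slurm     # job name shown in squeue
#SBATCH --partition=slurmd         # placeholder partition name
#SBATCH --nodes=1                  # request a single node
#SBATCH --output=%x-%j.out         # stdout goes to <job-name>-<jobid>.out

# srun launches the task on the allocated node via slurmd
srun hostname
```

You would submit this with `sbatch hello.sh` and monitor it with `squeue`; `slurmctld` schedules the job onto a node running `slurmd`.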
Why Slurm on GKE
Because Slurm is a workload orchestrator, it might seem counterintuitive to run it on Kubernetes, and this scenario is not a common practice. However, a team in an enterprise or an AI startup might run Slurm on Kubernetes for the following reasons:
- To avoid fragmenting scarce resources, such as GPUs, across different infrastructures (for example, splitting GPUs between Kubernetes clusters and Slurm clusters running on VMs).
- To help teams that are familiar with Slurm, and less familiar with Kubernetes, learn how to use Kubernetes more quickly. These teams typically already use Kubernetes to run other types of workloads, such as inference workloads, web apps, or stateful apps.
This solution lets a platform administrator team with previous Kubernetes experience, or an AI/ML team with little-to-no Kubernetes experience, set up the cluster with ready-to-use Terraform modules.
Scope
This guide is intended for platform administrators in an enterprise environment who are already managing Kubernetes or GKE clusters, and who need to set up Slurm clusters for AI/ML teams on Kubernetes. This guide is also for AI/ML startups that already use Kubernetes or GKE to run their workloads, such as inference workloads or web apps, and want to use their existing infrastructure to run training workloads with a Slurm interface. AI/ML teams that are already using Slurm can continue to use a familiar interface while onboarding onto Kubernetes or GKE.
Solution details
This solution, and the Slurm setup presented here, does not support all Slurm features and is not intended for use outside the AI/ML domain. This guide is also not a recommendation about how or when to use Slurm for HPC workloads.
The following topics are also out of scope for this guide:
- How to integrate the various combinations of the Google Cloud VM types.
- How to dynamically split Kubernetes nodes between two orchestrators.
- How to do any advanced configuration beyond the one presented in this guide.
If you are searching for a guide or an automation tool that can help you set up Slurm on Google Cloud, with a focus on HPC workloads, we recommend that you use the Google Cloud Cluster Toolkit.
Solution architecture
The implemented architecture is composed of multiple StatefulSets and Deployments that cover the different Slurm components. The following diagram shows a Slurm control plane and a data plane that are running in GKE.
Although the function of each component won’t be described in detail, it’s important to highlight how each component is configured and how it can be customized. The previous diagram identifies two main components: the Slurm control plane and the data plane.
The Slurm control plane is hosted on nodepool-nogpu-1, and it contains at least three components:
- The slurmctld-0 Pod runs slurmctld and is managed by the slurmctld StatefulSet.
- The slurmdbd Pod, which is controlled by a Deployment.
- The login Pod.
The data plane consists of two different slurmd-X instances:
- The first instance, called slurmd-00, is hosted on the g2-standard-4 node pool.
- The second instance, called slurmd-01, is hosted on the n1-standard-8 node pool.
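Assuming the pod and node-pool names above, you can verify the mapping from Slurm nodes to GKE node pools with standard tooling. The label selector and the `login-0` pod name here are assumptions; the output shape depends on your cluster:

```shell
# List the slurmd pods and the GKE nodes they were scheduled on
# (the app=slurmd label is an assumption; check your manifests)
kubectl get pods -o wide -l app=slurmd

# From inside the login pod, confirm that both slurmd instances
# registered with the Slurm control plane
kubectl exec -it login-0 -- sinfo -N -l
```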
We open-sourced the full solution, including the Terraform code to set up Slurm on GKE; you can find the details in this repo. We won't go into every detail in this blog post, but here is a high-level description of the steps you will follow to set up the solution:
- Create a Google Cloud project, a VPC, and a NAT gateway; configure firewall rules; and create an Artifact Registry repository to store the Slurm container image.
- Build and push the Slurm container image to Artifact Registry.
- Set up a GKE cluster.
- Set up the Slurm control plane.
- Set up the Slurm worker nodes (this step can be repeated as needed).
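The steps above can be sketched with standard gcloud, Docker, and Terraform commands. The project ID, region, repository name, and Dockerfile location below are placeholders; the exact Terraform layout and variables are defined in the repository:

```shell
# Placeholder values -- substitute your own
PROJECT_ID=my-project
REGION=us-central1
REPO=slurm-images
IMAGE="$REGION-docker.pkg.dev/$PROJECT_ID/$REPO/slurm:latest"

# Create an Artifact Registry repository for the Slurm container image
gcloud artifacts repositories create "$REPO" \
  --project="$PROJECT_ID" --location="$REGION" --repository-format=docker

# Build and push the Slurm image (Dockerfile path is an assumption)
docker build -t "$IMAGE" .
docker push "$IMAGE"

# Provision the VPC, NAT gateway, firewall rules, GKE cluster, and
# Slurm manifests from the repository's Terraform modules
terraform init
terraform apply -var "project_id=$PROJECT_ID" -var "region=$REGION"
```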
Limitations
Because Slurm is a workload orchestrator, it overlaps with some of the components, scope, and features of Kubernetes. For example, both Slurm and Kubernetes can scale up a new node to accommodate requested resources, or allocate resources within each single node.
Although Kubernetes shouldn’t limit the number of StatefulSets that can be added to the Slurm cluster, this solution resolves DNS names by injecting the StatefulSet names directly into the manifests. This type of injection supports up to five static names, which include the three StatefulSets and two additional system domains.
Because of this limitation, we recommend that you use no more than three nodesets per cluster, and that you always use nodeset names such as slurmd, slurmd-1, or slurmd-2.
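The names being injected follow the standard Kubernetes pattern for StatefulSet pods behind a headless Service: `<pod-name>.<service-name>.<namespace>.svc.cluster.local`. Assuming a StatefulSet and headless Service both named `slurmd` in the `default` namespace, you can check a pod's stable DNS name like this:

```shell
# Stable per-pod DNS name for a StatefulSet pod behind a headless Service:
#   <pod-name>.<service-name>.<namespace>.svc.cluster.local
# (the "default" namespace is an assumption)
nslookup slurmd-0.slurmd.default.svc.cluster.local
```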
Credits
Credit for this work, the code, and the article goes to Daniel Marzini, who did a great job putting together the solution guide.
This is all open-source work, so contributions and feedback are more than welcome. We would love to see people deploy the solution and share their experience with it.
Thank you