This document outlines the deployment steps for provisioning an A3 Mega (a3-megagpu-8g) Slurm cluster that is ideal for running large-scale artificial intelligence (AI) and machine learning (ML) training workloads.
In the Google Cloud console, activate Cloud Shell.
At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
To identify the regions and zones where the a3-megagpu-8g machine type is available, run the following command:
gcloud compute machine-types list --filter="name=a3-megagpu-8g"
Verify that you have enough GPU quota. Each a3-megagpu-8g machine has eight H100 80GB GPUs attached, so you need at least eight NVIDIA H100 80GB GPUs in your selected region. To check, go to the Quotas page in the Google Cloud console and filter by gpu_family:NVIDIA_H100_MEGA; a command-line alternative is sketched after these checks.
Verify that you have enough Filestore quota. You need a minimum of 10,240 GiB of zonal (also known as high scale SSD) capacity. If you don't have enough quota, request a quota increase.
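If you prefer to check from the command line, the following is a minimal sketch that lists the GPU-related quota entries that Compute Engine reports for a region. It assumes the H100 Mega quota appears here under a metric name containing "H100", which might not hold for every project or gcloud release, so confirm the values on the console Quotas page.
# Show regional Compute Engine quotas and keep the entries that mention H100.
gcloud compute regions describe REGION --format="yaml(quotas)" | grep -i -B1 -A1 "H100"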
From the CLI, complete the following steps:
Install dependencies.
To provision Slurm clusters, we recommend that you use Cluster Toolkit version v1.51.1
or later. To install Cluster Toolkit, see Set up Cluster Toolkit.
After you have installed the Cluster Toolkit, check that you are in the Cluster Toolkit directory.
To go to the main Cluster Toolkit working directory, run the following command.
cd cluster-toolkit
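To confirm which Cluster Toolkit release you have checked out, a quick sketch (assuming you cloned the Git repository as described in Set up Cluster Toolkit) is:
# Print the checked-out release tag of the Cluster Toolkit repository.
git describe --tags
# If you have already built the binary, it can also report its version (flag assumed; check ./gcluster --help).
./gcluster --version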
Set up Cloud Storage bucket
Cluster blueprints use Terraform modules to provision Cloud infrastructure. A best practice when working with Terraform is to store the state remotely in versioned storage. On Google Cloud, you can do this by creating a Cloud Storage bucket that has versioning enabled.
To create this bucket and enable versioning from the CLI, run the following commands:
gcloud storage buckets create gs://BUCKET_NAME \
    --project=PROJECT_ID \
    --default-storage-class=STANDARD \
    --location=BUCKET_REGION \
    --uniform-bucket-level-access

gcloud storage buckets update gs://BUCKET_NAME --versioning
Replace the following:
BUCKET_NAME: a name for your Cloud Storage bucket that meets the bucket naming requirements.
PROJECT_ID: your project ID.
BUCKET_REGION: any Google Cloud region of your choice.
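To confirm that versioning is turned on before you reference the bucket in the deployment file, a quick hedged check (the exact field name in the output can differ between gcloud releases) is:
# Describe the bucket and look for the versioning setting.
gcloud storage buckets describe gs://BUCKET_NAME | grep -i versioning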
If you don't have a reservation provided by a Technical Account Manager (TAM), we recommend creating a reservation. For more information, see Choose a reservation type.
Reservations incur ongoing costs even after the Slurm cluster is destroyed. To manage your costs, delete a reservation when you no longer need it.
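For example, a sketch for deleting a reservation that is no longer referenced by any running cluster:
# Delete an unused reservation to stop its ongoing charges.
gcloud compute reservations delete RESERVATION_NAME --zone=ZONE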
To create a reservation, run the gcloud compute reservations create
command and specify the --require-specific-reservation
flag.
gcloud compute reservations create RESERVATION_NAME \
    --require-specific-reservation \
    --project=PROJECT_ID \
    --machine-type=a3-megagpu-8g \
    --vm-count=NUMBER_OF_VMS \
    --zone=ZONE
Replace the following:
RESERVATION_NAME: a name for your reservation.
PROJECT_ID: your project ID.
NUMBER_OF_VMS: the number of VMs needed for the cluster.
ZONE: a zone that has a3-megagpu-8g machine types.
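To confirm that the reservation was created with the machine type and VM count you expect, a quick check is:
# Show the reservation details, including machine type, VM count, and zone.
gcloud compute reservations describe RESERVATION_NAME --zone=ZONE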
Using a text editor, open the examples/machine-learning/a3-megagpu-8g/a3mega-slurm-deployment.yaml file.
In the deployment file, specify the Cloud Storage bucket, set names for your network and subnetwork, and set deployment variables such as project ID, region, and zone.
---
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  deployment_name: a3mega-base
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  network_name_system: NETWORK_NAME
  subnetwork_name_system: SUBNETWORK_NAME
  enable_ops_agent: true
  enable_nvidia_dcgm: true
  enable_nvidia_persistenced: true
  disk_size_gb: 200
  final_image_family: slurm-a3mega
  slurm_cluster_name: a3mega
  a3mega_reservation_name: RESERVATION_NAME
  a3mega_maintenance_interval: MAINTENANCE_INTERVAL
  a3mega_cluster_size: NUMBER_OF_VMS
Replace the following:
BUCKET_NAME: the name of your Cloud Storage bucket, created in the previous section.
PROJECT_ID: your project ID.
REGION: a region that has a3-megagpu-8g machine types.
ZONE: a zone that has a3-megagpu-8g machine types.
NETWORK_NAME: a name for your network. For example, a3mega-sys-net.
SUBNETWORK_NAME: a name for your subnetwork. For example, a3mega-sys-subnet.
RESERVATION_NAME: the name of your reservation provided by your Google Cloud account team when you requested capacity.
MAINTENANCE_INTERVAL: specify one of the following:
  a3mega_maintenance_interval: PERIODIC
  a3mega_maintenance_interval: "", which sets the maintenance value to an empty string. This is the default value.
NUMBER_OF_VMS: the number of VMs needed for the cluster.
If you have multiple reservations, you can update the deployment file to specify the additional reservations. To do this, see Scale A3 Mega clusters across multiple reservations.
Provision a Slurm cluster
Cluster Toolkit provisions the cluster based on the deployment file you created in the previous step and the default cluster blueprint.
To provision the cluster, run the command for your machine type from the Cluster Toolkit directory. This step takes approximately 30-40 minutes.
./gcluster deploy -d examples/machine-learning/a3-megagpu-8g/a3mega-slurm-deployment.yaml \
    examples/machine-learning/a3-megagpu-8g/a3mega-slurm-blueprint.yaml \
    --auto-approve
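When the deployment finishes, you can confirm that the cluster VMs exist. This sketch assumes the slurm_cluster_name of a3mega set in the deployment file above, so the VM names start with that prefix:
# List the cluster VMs, including the login and controller nodes, with their status.
gcloud compute instances list --filter="name ~ ^a3mega" --format="table(name,zone,status)"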
Connect to the A3 Mega Slurm cluster
To enable optimized NCCL communication tuning on your cluster, you must log in to the Slurm login node. To log in, you can use either the Google Cloud console or the Google Cloud CLI.
Console
Go to the Compute Engine > VM instances page.
Locate the login node. It should have a name similar to a3mega-login-001
.
From the Connect column of the login node, click SSH.
gcloud
To connect to the login node, use the gcloud compute ssh command.
gcloud compute ssh $(gcloud compute instances list --filter "name ~ login" --format "value(name)") \
    --tunnel-through-iap \
    --zone ZONE
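Once you are on the login node, a quick sanity check (using the standard Slurm client tools that the blueprint installs) is to confirm that the partitions and compute nodes are visible:
# Show Slurm partitions and node states.
sinfo
# Show any queued or running jobs.
squeue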
Run an NCCL test
After you connect to the login node, you can then Enable GPUDirect-TCPXO optimized NCCL communication.
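Before running the full NCCL benchmark, you might want a minimal GPU visibility check through Slurm. This sketch assumes the default partition targets the a3mega compute nodes and that GPU scheduling (GRES) is configured by the blueprint:
# Allocate one node and list its eight H100 GPUs.
srun -N 1 --gpus-per-node=8 nvidia-smi -L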
Redeploy the cluster
If you need to increase the number of compute nodes or add new partitions to your cluster, you might need to update the configuration of your Slurm cluster by redeploying. You can speed up redeployment by reusing an existing image from a previous deployment. To avoid creating new images during a redeploy, specify the --only flag. To redeploy the cluster using an existing image, run the following command from the main Cluster Toolkit directory:
./gcluster deploy -d \
    examples/machine-learning/a3-megagpu-8g/a3mega-slurm-deployment.yaml \
    examples/machine-learning/a3-megagpu-8g/a3mega-slurm-blueprint.yaml \
    --only primary,cluster --auto-approve -w
This command is only for redeployments where an image already exists, because it redeploys only the cluster and its infrastructure.
Destroy the Slurm cluster
By default, the A3 Mega blueprints enable deletion protection on the Filestore instance. For the Filestore instance to be deleted when you destroy the Slurm cluster, disable deletion protection before running the destroy command. To learn how, see how to set or remove deletion protection on an existing instance.
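To identify the Filestore instance that the deployment created, so that you can disable its deletion protection, one option is:
# List Filestore instances in the project; the cluster's shared file system appears here.
gcloud filestore instances list --project=PROJECT_ID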
Disconnect from the cluster if you haven't already.
Before running the destroy command, navigate to the root of the Cluster Toolkit directory. By default, DEPLOYMENT_FOLDER is located at the root of the Cluster Toolkit directory.
To destroy the cluster, run:
./gcluster destroy DEPLOYMENT_FOLDER --auto-approve
Replace DEPLOYMENT_FOLDER with the name of the deployment folder. It's typically the same as the deployment name (deployment_name in the deployment file, a3mega-base in this guide).
When destruction is complete, you should see a message similar to the following:
Destroy complete! Resources: xx destroyed.
To learn how to cleanly destroy infrastructure and for advanced manual deployment instructions, see the deployment folder located at the root of the Cluster Toolkit directory: DEPLOYMENT_FOLDER/instructions.txt