This page shows you how to create an AI-optimized Google Kubernetes Engine (GKE) cluster that uses Cluster Director for GKE to support your AI and ML workloads with A4 or A3 Ultra virtual machines (VMs).
Cluster Director for GKE lets you deploy and manage large AI-optimized clusters of accelerated VMs with features such as targeted workload placement, advanced cluster maintenance controls, and topology-aware scheduling. For more information, see Cluster Director.
GKE provides a single platform surface to run a diverse set of workloads for your organizations, reducing the operational burden of managing multiple platforms. You can run workloads such as high-performance distributed pre-training, model fine-tuning, model inference, application serving, and supporting services.
On this page, you learn how to create a GKE cluster with the Google Cloud CLI for maximum flexibility in configuring your cluster based on the needs of your workload. Alternatively, you can choose to use Cluster Toolkit to quickly deploy your cluster with default settings that reflect best practices for many use cases. For instructions on how to do this, see Create an AI-optimized GKE cluster with default configuration.
Cluster configuration options with GPUDirect RDMA

To create your cluster with the Google Cloud CLI, you can choose one of the following cluster configuration options:

- Create a cluster without GPUDirect RDMA.
- Create a cluster with GPUDirect RDMA, which is required for workloads that need high-bandwidth inter-node GPU-to-GPU communication.
Before you start, make sure that you have performed the following tasks:
Ensure that the gcloud CLI is installed and up to date. To get the latest version, run the following command:

gcloud components update

Note: For existing gcloud CLI installations, make sure to set the compute/region property. If you primarily use zonal clusters, set the compute/zone property instead. By setting a default location, you can avoid errors in the gcloud CLI like the following: One of [--zone, --region] must be supplied: Please specify location. You might need to specify the location in certain commands if the location of your cluster differs from the default that you set.
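For example, a minimal sketch of setting a default location (us-central1 and us-central1-a are placeholders; use your own region or zone):

# Set a default region for regional clusters.
gcloud config set compute/region us-central1
# Or set a default zone if you primarily use zonal clusters.
gcloud config set compute/zone us-central1-a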
Choose a consumption option. Make your choice based on how you want to get and use GPU resources. For more information, see Choose a consumption option.
For GKE, consider the following additional information when you choose a consumption option:
Obtain capacity. Learn how to obtain capacity for your consumption option.
The following requirements apply to an AI-optimized GKE cluster:

- Ensure that you use the minimum GPU driver version, depending on the machine type:
  - For A4 VMs, use the latest driver version.
  - For A3 Ultra VMs, you must set the value of the gpu-driver-version=latest field with GKE version 1.31. For GKE version 1.31.5-gke.1169000 or later, GKE automatically installs 550 GPU driver versions on A3 Ultra nodes by default, including when you omit the gpu-driver-version flag.

To use GPUDirect RDMA, the following additional requirements apply:

- Use at least 2 nodes (that is, 16 GPUs if you use the a4-highgpu-8g or a3-ultragpu-8g machine types).
- Ensure that you use a location that has availability for the machine type that you choose. For more information, see GPU availability by regions and zones.
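To check where accelerators are available, you can list the zones that offer a given accelerator type. A minimal sketch, using nvidia-b200 (A4 VMs) as an example:

# List the zones that offer the B200 accelerator type.
gcloud compute accelerator-types list \
    --filter="name=nvidia-b200"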
Follow the instructions in this section to create a GKE cluster that meets the requirements for AI-optimized GKE clusters. You can choose between creating a cluster with or without GPUDirect RDMA.
Considerations for creating a cluster

When you create a cluster, consider the following information:

- To create a zonal cluster, replace the --region flag with the --zone=COMPUTE_ZONE flag, where COMPUTE_ZONE is the zone of your control plane. Use the --node-locations flag to specify the zones for your GKE nodes.
- Choose a GPU driver version. The gpu-driver-version flag can take one of the following values:
  - default: install the default driver version for your GKE node version. For more information about the requirements for default driver versions, see the Requirements section.
  - latest: install the latest available driver version for your GKE version. This option is available only for nodes that use Container-Optimized OS.
  - disabled: skip automatic driver installation. You must manually install a driver after you create the node pool.
- Choose a reservation affinity. The --reservation-affinity flag can take the values of specific or any. When you use a specific reservation, including shared reservations, specify the value of the --reservation flag in the following format:
projects/PROJECT_ID/reservations/RESERVATION_NAME/reservationBlocks/BLOCK_NAME
Replace the following:

- PROJECT_ID: your Google Cloud project ID.
- RESERVATION_NAME: the name of your reservation.
- BLOCK_NAME: the name of a specific block within the reservation.

If the reservation is in your own project, you can omit projects/PROJECT_ID/reservations/ from the reservation value.
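To find the names of your reservation and its blocks, you can query the reservation. A minimal sketch, assuming the beta reservation blocks command is available in your gcloud CLI version (my-reservation and the zone are placeholders):

# List the blocks in a reservation to find BLOCK_NAME values.
gcloud beta compute reservations blocks list my-reservation \
    --zone=us-central1-a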
Create a cluster without GPUDirect RDMA

To create a cluster without GPUDirect RDMA, use the following instructions to create a cluster with a CPU-based default node pool and additional node pools with GPUs. This approach allows the default node pool to run other services.
Create the cluster:
gcloud container clusters create CLUSTER_NAME \
--cluster-version=CLUSTER_VERSION \
--region=COMPUTE_REGION
Replace the following:
- CLUSTER_NAME: the name of your new cluster.
- CLUSTER_VERSION: the version of your new cluster. For more information about which version of GKE supports your configuration, see the Requirements section.
- COMPUTE_REGION: the region of your new cluster. If you plan to create a zonal cluster, use the --zone flag instead of the --region flag, for example: --zone=COMPUTE_ZONE. Replace COMPUTE_ZONE with the zone of the control plane.

Create the GPU-based node pool with one of the following commands. The command that you need to run depends on the consumption option that you use for your deployment. Select the tab that corresponds to your consumption option's provisioning model.
Reservation-bound

For reservation-bound provisioning, run the following command:
gcloud container node-pools create NODE_POOL_NAME \
--region COMPUTE_REGION --cluster CLUSTER_NAME \
--node-locations COMPUTE_ZONE \
--accelerator type=GPU_TYPE,count=AMOUNT,gpu-driver-version=DRIVER_VERSION \
--machine-type MACHINE_TYPE \
--num-nodes=NUM_NODES \
--reservation-affinity=specific \
--reservation=RESERVATION_NAME/reservationBlocks/BLOCK_NAME
Replace the following:
- NODE_POOL_NAME: the name of the node pool.
- COMPUTE_REGION: the region of your new cluster.
- CLUSTER_NAME: the name of your new cluster.
- COMPUTE_ZONE: the zone of your node pool.
- GPU_TYPE: the type of GPU accelerator.
  - A4 VMs: enter nvidia-b200.
  - A3 Ultra VMs: enter nvidia-h200-141gb.
- AMOUNT: the number of GPUs to attach to nodes in the node pool. For example, for both a4-highgpu-8g and a3-ultragpu-8g VMs, the amount of GPUs is 8.
- DRIVER_VERSION: the NVIDIA driver version to install. It can be one of the following values: default, latest, or disabled.
- MACHINE_TYPE: the Compute Engine machine type for the nodes. For example, use a4-highgpu-8g for A4 VMs, and a3-ultragpu-8g for A3 Ultra VMs.
- NUM_NODES: the number of nodes for the node pool.
- RESERVATION_NAME: the name of your reservation. To find this value, you can query your reservation.
- BLOCK_NAME: the name of a specific block within the reservation. To find this value, you can query your reservation.

Flex-start (Preview)
This product or feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA products and features are available "as is" and might have limited support. For more information, see the launch stage descriptions.
For flex-start provisioning, run the following command:
gcloud container node-pools create NODE_POOL_NAME \
--region COMPUTE_REGION --cluster CLUSTER_NAME \
--node-locations COMPUTE_ZONE \
--accelerator type=GPU_TYPE,count=AMOUNT,gpu-driver-version=DRIVER_VERSION \
--machine-type MACHINE_TYPE \
--flex-start --enable-autoscaling --num-nodes=0 \
--total-max-nodes TOTAL_MAX_NODES \
--no-enable-autorepair --location-policy=ANY \
--reservation-affinity=none [\
--enable-queued-provisioning]
Replace the following:
- NODE_POOL_NAME: the name of the node pool.
- COMPUTE_REGION: the region of your new cluster.
- CLUSTER_NAME: the name of your new cluster.
- COMPUTE_ZONE: the zone of your node pool.
- GPU_TYPE: the type of GPU accelerator.
  - A4 VMs: enter nvidia-b200.
  - A3 Ultra VMs: enter nvidia-h200-141gb.
- AMOUNT: the number of GPUs to attach to nodes in the node pool. For example, for both a4-highgpu-8g and a3-ultragpu-8g VMs, the amount of GPUs is 8.
- DRIVER_VERSION: the NVIDIA driver version to install. It can be one of the following values: default, latest, or disabled.
- MACHINE_TYPE: the Compute Engine machine type for the nodes. For example, use a4-highgpu-8g for A4 VMs, and a3-ultragpu-8g for A3 Ultra VMs.
- TOTAL_MAX_NODES: the maximum number of nodes to automatically scale for the entire node pool.
If you want to use flex-start with queued provisioning, include the --enable-queued-provisioning
flag.
For more information about using flex-start, see Run large-scale workload with flex-start with queued provisioning.
Spot

For spot provisioning, run the following command:
gcloud container node-pools create NODE_POOL_NAME \
--region COMPUTE_REGION --cluster CLUSTER_NAME \
--node-locations COMPUTE_ZONE \
--accelerator type=GPU_TYPE,count=AMOUNT,gpu-driver-version=DRIVER_VERSION \
--machine-type MACHINE_TYPE \
--num-nodes=NUM_NODES \
--spot
Replace the following:
- NODE_POOL_NAME: the name of the node pool.
- COMPUTE_REGION: the region of your new cluster.
- CLUSTER_NAME: the name of your new cluster.
- COMPUTE_ZONE: the zone of your node pool.
- GPU_TYPE: the type of GPU accelerator.
  - A4 VMs: enter nvidia-b200.
  - A3 Ultra VMs: enter nvidia-h200-141gb.
- AMOUNT: the number of GPUs to attach to nodes in the node pool. For example, for both a4-highgpu-8g and a3-ultragpu-8g VMs, the amount of GPUs is 8.
- DRIVER_VERSION: the NVIDIA driver version to install. It can be one of the following values: default, latest, or disabled.
- MACHINE_TYPE: the Compute Engine machine type for the nodes. For example, use a4-highgpu-8g for A4 VMs, and a3-ultragpu-8g for A3 Ultra VMs.
- NUM_NODES: the number of nodes for the node pool.
For more information about creating clusters with Spot VMs, see Run fault-tolerant workloads at lower costs with Spot VMs.
Connect to your cluster, so that you can run the kubectl
commands in the next sections:
gcloud container clusters get-credentials CLUSTER_NAME --location=COMPUTE_REGION
Replace the following:
- CLUSTER_NAME: the name of your cluster.
- COMPUTE_REGION: the name of the compute region.
For more information, see Install kubectl and configure cluster access.
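To confirm that kubectl is pointed at the new cluster and that your nodes have registered, you can run a quick, optional check:

# List the cluster's nodes; GPU node pools appear after they're created.
kubectl get nodes -o wide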
Create a cluster with GPUDirect RDMA

For distributed AI workloads, multiple GPU nodes are often linked together to work as a single computer. The A4 and A3 Ultra VMs come with the Titanium ML network adapter, which is built on NVIDIA ConnectX-7 (CX7) NICs. Both A4 and A3 Ultra VMs deliver non-blocking 3.2 Tbps of inter-node GPU-to-GPU traffic by using RDMA over Converged Ethernet (RoCE). RoCE enables scaling and collaboration across multiple GPUs by delivering a high-performance cloud experience for AI workloads.
For more information about how to create your GKE clusters by using the Google Cloud CLI and GPUDirect TCPX (A3 High VMs) or GPUDirect TCPXO (A3 Mega VMs), see Maximize GPU network bandwidth in Autopilot mode clusters or Maximize GPU network bandwidth in Standard mode clusters.
To create your GKE clusters in Autopilot or Standard mode with GPUDirect RDMA, complete the following steps, which are described in the next sections:

1. Create the VPC networks and subnets.
2. Create a GKE cluster and a GPU node pool with multi-networking.
3. Create the GKE network objects.
4. Install the RDMA binary and configure NCCL.
5. Configure your Pod manifests for GPUDirect RDMA.

Both A4 VMs and A3 Ultra VMs have the following network configuration: two Google Titanium NICs that are associated with the CPU, and eight NVIDIA CX-7 NICs that are dedicated to RDMA traffic between GPUs.
AI and ML workloads, such as distributed training, require powerful acceleration to optimize performance by reducing job completion times. For workloads that require high performance, high throughput, and low latency, GPUDirect RDMA reduces the network hops that are required to transfer payloads to and from GPUs, which more efficiently uses the network bandwidth that's available. GPUDirect RDMA is designed to significantly improve throughput at scale compared to GPUs that don't use GPUDirect.
Create VPCs and subnets

One of the Google Titanium NICs that's associated with the CPU uses the default network in GKE. You don't need to create a new VPC for this NIC if you have enough IP address ranges for the default network. You can create one VPC for the second CPU Titanium NIC (gVNIC) and another VPC for the eight CX-7 RDMA NICs by using the following commands.
Set environment variables to match your deployment:
export REGION="COMPUTE_REGION"
export ZONE="COMPUTE_ZONE"
export PROJECT="PROJECT_ID"
export GVNIC_NETWORK_PREFIX="GVNIC_NETWORK_PREFIX"
export RDMA_NETWORK_PREFIX="RDMA_NETWORK_PREFIX"
Replace the following variables:
- COMPUTE_REGION: the region of your cluster.
- COMPUTE_ZONE: the zone of your node pool.
- PROJECT_ID: your Google Cloud project ID.
- GVNIC_NETWORK_PREFIX: either a4high-gvnic for A4 VMs, or a3ultra-gvnic for A3 Ultra VMs.
- RDMA_NETWORK_PREFIX: either a4high-rdma for A4 VMs, or a3ultra-rdma for A3 Ultra VMs.

Create two VPC networks:
# Create a VPC for the additional Google Titanium CPU NIC
gcloud compute --project=${PROJECT} \
networks create \
${GVNIC_NETWORK_PREFIX}-net \
--subnet-mode=custom
gcloud compute --project=${PROJECT} \
networks subnets create \
${GVNIC_NETWORK_PREFIX}-sub \
--network=${GVNIC_NETWORK_PREFIX}-net \
--region=${REGION} \
--range=192.168.0.0/24
gcloud compute --project=${PROJECT} \
firewall-rules create \
${GVNIC_NETWORK_PREFIX}-internal \
--network=${GVNIC_NETWORK_PREFIX}-net \
--action=ALLOW \
--rules=tcp:0-65535,udp:0-65535,icmp \
--source-ranges=192.168.0.0/16
# Create HPC VPC for the RDMA NICs with 8 subnets.
gcloud beta compute --project=${PROJECT} \
networks create ${RDMA_NETWORK_PREFIX}-net \
--network-profile=${ZONE}-vpc-roce \
--subnet-mode=custom
# Create subnets for the HPC VPC.
for N in $(seq 0 7); do
gcloud compute --project=${PROJECT} \
networks subnets create \
${RDMA_NETWORK_PREFIX}-sub-$N \
--network=${RDMA_NETWORK_PREFIX}-net \
--region=${REGION} \
--range=192.168.$((N+1)).0/24 & # offset to avoid overlap with gvnics
done
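Optionally, you can verify that the subnets were created before you continue. A minimal sketch using a name filter (repeat with ${GVNIC_NETWORK_PREFIX} to check the gVNIC subnet):

# List the eight RDMA subnets you just created.
gcloud compute networks subnets list \
    --project=${PROJECT} \
    --filter="name~${RDMA_NETWORK_PREFIX}"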
Autopilot

Create a GKE Autopilot cluster with multi-networking:
gcloud container clusters create-auto CLUSTER_NAME \
--enable-multi-networking \
--cluster-version=CLUSTER_VERSION \
--region=COMPUTE_REGION
Replace the following:
- CLUSTER_NAME: the name of your cluster.
- CLUSTER_VERSION: the version of your new cluster. To find out which version of GKE supports your configuration, see the Requirements section.
- COMPUTE_REGION: the name of the compute region.

Connect to your cluster, so that you can run the kubectl commands in the next sections:
gcloud container clusters get-credentials CLUSTER_NAME --location=COMPUTE_REGION
Replace the following:
- CLUSTER_NAME: the name of your cluster.
- COMPUTE_REGION: the name of the compute region.

For more information, see Install kubectl and configure cluster access.
Standard

Create a GKE Standard cluster and GPU node pool with multi-networking:
Create the cluster:
gcloud container clusters create CLUSTER_NAME \
--region=COMPUTE_REGION \
--cluster-version=CLUSTER_VERSION \
--enable-dataplane-v2 --enable-ip-alias --enable-multi-networking [\
--services-ipv4-cidr=SERVICE_CIDR \
--cluster-ipv4-cidr=POD_CIDR]
Replace the following:
- CLUSTER_NAME: the name of your cluster.
- CLUSTER_VERSION: the version of your new cluster. To find out which version of GKE supports your configuration, see the Requirements section.
- COMPUTE_REGION: the name of the compute region.

Optionally, you can explicitly provide the secondary CIDR ranges for services and Pods. If you use these optional flags, replace the following variables:

- SERVICE_CIDR: the secondary CIDR range for services.
- POD_CIDR: the secondary CIDR range for Pods.

When you use these flags, you must verify that the CIDR ranges don't overlap with subnet ranges for additional node networks. For example, the ranges in the SERVICE_CIDR=10.65.0.0/19 and POD_CIDR=10.64.0.0/19 values don't overlap with each other.
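If you want to double-check that two ranges don't overlap, a minimal sketch using Python's standard ipaddress module (python3 is assumed to be available; the example CIDRs above are shown):

python3 - <<'EOF'
import ipaddress
svc = ipaddress.ip_network("10.65.0.0/19")   # SERVICE_CIDR
pod = ipaddress.ip_network("10.64.0.0/19")   # POD_CIDR
print("overlap" if svc.overlaps(pod) else "no overlap")
EOF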
Create the node pool. The command that you need to run depends on the consumption option that you use for your deployment. Select the tab that corresponds to your consumption option's provisioning model.
Reservation-bound

For reservation-bound provisioning, run the following command:
gcloud container node-pools create NODE_POOL_NAME \
--region COMPUTE_REGION --cluster CLUSTER_NAME \
--node-locations COMPUTE_ZONE \
--accelerator type=GPU_TYPE,count=AMOUNT,gpu-driver-version=DRIVER_VERSION \
--machine-type MACHINE_TYPE \
--num-nodes=NUM_NODES \
--reservation-affinity=specific \
--reservation=RESERVATION_NAME/reservationBlocks/BLOCK_NAME \
--additional-node-network network=${GVNIC_NETWORK_PREFIX}-net,subnetwork=${GVNIC_NETWORK_PREFIX}-sub \
--additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-0 \
--additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-1 \
--additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-2 \
--additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-3 \
--additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-4 \
--additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-5 \
--additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-6 \
--additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-7
Replace the following:
- NODE_POOL_NAME: the name of the node pool.
- COMPUTE_REGION: the region of your new cluster.
- CLUSTER_NAME: the name of your new cluster.
- COMPUTE_ZONE: the zone of your node pool.
- GPU_TYPE: the type of GPU accelerator.
  - A4 VMs: enter nvidia-b200.
  - A3 Ultra VMs: enter nvidia-h200-141gb.
- AMOUNT: the number of GPUs to attach to nodes in the node pool. For example, for both a4-highgpu-8g and a3-ultragpu-8g VMs, the amount of GPUs is 8.
- DRIVER_VERSION: the NVIDIA driver version to install. It can be one of the following values: default, latest, or disabled.
- MACHINE_TYPE: the Compute Engine machine type for the nodes. For example, use a4-highgpu-8g for A4 VMs, and a3-ultragpu-8g for A3 Ultra VMs.
- NUM_NODES: the number of nodes for the node pool.
- RESERVATION_NAME: the name of your reservation. To find this value, you can query your reservation.
- BLOCK_NAME: the name of a specific block within the reservation. To find this value, you can query your reservation.

Flex-start (Preview)
This product or feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA products and features are available "as is" and might have limited support. For more information, see the launch stage descriptions.
For flex-start provisioning, run the following command:
gcloud container node-pools create NODE_POOL_NAME \
--region COMPUTE_REGION --cluster CLUSTER_NAME \
--node-locations COMPUTE_ZONE \
--accelerator type=GPU_TYPE,count=AMOUNT,gpu-driver-version=DRIVER_VERSION \
--machine-type MACHINE_TYPE \
--flex-start --num-nodes=0 --enable-autoscaling \
--total-max-nodes TOTAL_MAX_NODES \
--no-enable-autorepair --location-policy=ANY \
--reservation-affinity=none \
[--enable-queued-provisioning \]
--additional-node-network network=${GVNIC_NETWORK_PREFIX}-net,subnetwork=${GVNIC_NETWORK_PREFIX}-sub \
--additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-0 \
--additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-1 \
--additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-2 \
--additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-3 \
--additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-4 \
--additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-5 \
--additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-6 \
--additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-7
Replace the following:
- NODE_POOL_NAME: the name of the node pool.
- COMPUTE_REGION: the region of your new cluster.
- CLUSTER_NAME: the name of your new cluster.
- COMPUTE_ZONE: the zone of your node pool.
- GPU_TYPE: the type of GPU accelerator.
  - A4 VMs: enter nvidia-b200.
  - A3 Ultra VMs: enter nvidia-h200-141gb.
- AMOUNT: the number of GPUs to attach to nodes in the node pool. For example, for both a4-highgpu-8g and a3-ultragpu-8g VMs, the amount of GPUs is 8.
- DRIVER_VERSION: the NVIDIA driver version to install. It can be one of the following values: default, latest, or disabled.
- MACHINE_TYPE: the Compute Engine machine type for the nodes. For example, use a4-highgpu-8g for A4 VMs, and a3-ultragpu-8g for A3 Ultra VMs.
- TOTAL_MAX_NODES: the maximum number of nodes to automatically scale for the entire node pool.

If you want to use flex-start with queued provisioning, include the --enable-queued-provisioning
flag.
For more information about using flex-start, see Run large-scale workload with flex-start with queued provisioning.
Connect to your cluster, so that you can run the kubectl
commands in the next sections:
gcloud container clusters get-credentials CLUSTER_NAME --location=COMPUTE_REGION
Replace the following:
- CLUSTER_NAME: the name of your cluster.
- COMPUTE_REGION: the name of the compute region.

For more information, see Install kubectl and configure cluster access.
Create the GKE network objects

The VPC networks that you created in the previous section need to be configured through GKE network parameter sets. Specifically, the second CPU Titanium NIC (gVNIC) needs to be configured in NetDevice mode, and each of the eight CX-7 RDMA NICs needs to be configured in RDMA mode.
This command uses the following names:

- VPC network ${GVNIC_NETWORK_PREFIX}-net with a subnet named ${GVNIC_NETWORK_PREFIX}-sub
- VPC network ${RDMA_NETWORK_PREFIX}-net with subnets named ${RDMA_NETWORK_PREFIX}-sub-[0…7]
Create the GKE network objects by running the following command:
kubectl apply -f - <<EOF
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
name: gvnic-1
spec:
vpc: ${GVNIC_NETWORK_PREFIX}-net
vpcSubnet: ${GVNIC_NETWORK_PREFIX}-sub
deviceMode: NetDevice
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
name: gvnic-1
spec:
type: "Device"
parametersRef:
group: networking.gke.io
kind: GKENetworkParamSet
name: gvnic-1
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
name: rdma-0
spec:
vpc: ${RDMA_NETWORK_PREFIX}-net
vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-0
deviceMode: RDMA
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
name: rdma-0
spec:
type: "Device"
parametersRef:
group: networking.gke.io
kind: GKENetworkParamSet
name: rdma-0
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
name: rdma-1
spec:
vpc: ${RDMA_NETWORK_PREFIX}-net
vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-1
deviceMode: RDMA
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
name: rdma-1
spec:
type: "Device"
parametersRef:
group: networking.gke.io
kind: GKENetworkParamSet
name: rdma-1
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
name: rdma-2
spec:
vpc: ${RDMA_NETWORK_PREFIX}-net
vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-2
deviceMode: RDMA
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
name: rdma-2
spec:
type: "Device"
parametersRef:
group: networking.gke.io
kind: GKENetworkParamSet
name: rdma-2
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
name: rdma-3
spec:
vpc: ${RDMA_NETWORK_PREFIX}-net
vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-3
deviceMode: RDMA
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
name: rdma-3
spec:
type: "Device"
parametersRef:
group: networking.gke.io
kind: GKENetworkParamSet
name: rdma-3
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
name: rdma-4
spec:
vpc: ${RDMA_NETWORK_PREFIX}-net
vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-4
deviceMode: RDMA
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
name: rdma-4
spec:
type: "Device"
parametersRef:
group: networking.gke.io
kind: GKENetworkParamSet
name: rdma-4
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
name: rdma-5
spec:
vpc: ${RDMA_NETWORK_PREFIX}-net
vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-5
deviceMode: RDMA
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
name: rdma-5
spec:
type: "Device"
parametersRef:
group: networking.gke.io
kind: GKENetworkParamSet
name: rdma-5
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
name: rdma-6
spec:
vpc: ${RDMA_NETWORK_PREFIX}-net
vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-6
deviceMode: RDMA
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
name: rdma-6
spec:
type: "Device"
parametersRef:
group: networking.gke.io
kind: GKENetworkParamSet
name: rdma-6
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
name: rdma-7
spec:
vpc: ${RDMA_NETWORK_PREFIX}-net
vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-7
deviceMode: RDMA
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
name: rdma-7
spec:
type: "Device"
parametersRef:
group: networking.gke.io
kind: GKENetworkParamSet
name: rdma-7
EOF
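After you apply the manifest, you can optionally confirm that the objects exist. The fully qualified resource names are used to avoid clashing with other resources named networks:

# List the GKE network parameter sets and networks you just created.
kubectl get gkenetworkparamsets.networking.gke.io
kubectl get networks.networking.gke.io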
Install the RDMA binary and configure NCCL
Apply the following DaemonSet to install the RDMA binaries and the NCCL library on each node. On each underlying VM, the RDMA binaries are installed in the /home/kubernetes/bin/gib
directory, and the NCCL library is installed in the /home/kubernetes/bin/nvidia/lib64
directory.
Autopilot

For GKE Autopilot mode, run the following command:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-rdma-installer-autopilot.yaml
Standard
For GKE Standard mode, run the following command:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-rdma-installer.yaml
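You can watch the installer roll out to every node. A sketch, assuming the DaemonSet defined in the manifest is named nccl-rdma-installer and runs in the kube-system namespace:

# Wait until the installer DaemonSet is running on all GPU nodes.
kubectl rollout status daemonset/nccl-rdma-installer -n kube-system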
Run NCCL tests
To validate the functionality of the provisioned cluster, you can run an NCCL test. For instructions, see Deploy and run an NCCL test.
Configure your Pod manifests for GPUDirect RDMA

To run your workloads by using GPUDirect RDMA, configure your Pod manifests with the following steps:
Add the following annotations to the Pod metadata.
Autopilot

Use the following annotation for GKE Autopilot mode:
metadata:
annotations:
networking.gke.io/default-interface: 'eth0'
networking.gke.io/interfaces: |
[
{"interfaceName":"eth0","network":"default"},
{"interfaceName":"eth1","network":"gvnic-1"},
{"interfaceName":"eth2","network":"rdma-0"},
{"interfaceName":"eth3","network":"rdma-1"},
{"interfaceName":"eth4","network":"rdma-2"},
{"interfaceName":"eth5","network":"rdma-3"},
{"interfaceName":"eth6","network":"rdma-4"},
{"interfaceName":"eth7","network":"rdma-5"},
{"interfaceName":"eth8","network":"rdma-6"},
{"interfaceName":"eth9","network":"rdma-7"}
]
Standard
The following annotation for GKE Standard mode doesn't include a gvnic-1 specification, but you can add it if your workloads require it.
Use the following annotation for GKE Standard mode:
metadata:
annotations:
networking.gke.io/default-interface: 'eth0'
networking.gke.io/interfaces: |
[
{"interfaceName":"eth0","network":"default"},
{"interfaceName":"eth2","network":"rdma-0"},
{"interfaceName":"eth3","network":"rdma-1"},
{"interfaceName":"eth4","network":"rdma-2"},
{"interfaceName":"eth5","network":"rdma-3"},
{"interfaceName":"eth6","network":"rdma-4"},
{"interfaceName":"eth7","network":"rdma-5"},
{"interfaceName":"eth8","network":"rdma-6"},
{"interfaceName":"eth9","network":"rdma-7"}
]
Add the following volumes to the Pod spec:
spec:
volumes:
- name: library-dir-host
hostPath:
path: /home/kubernetes/bin/nvidia
- name: gib
hostPath:
path: /home/kubernetes/bin/gib
Add the following volume mounts, environment variables, and resources to the container that requests GPUs. Your workload container must request all eight GPUs:
Autopilot

For GKE Autopilot mode, configure the following resources:
containers:
- name: my-container
volumeMounts:
- name: library-dir-host
mountPath: /usr/local/nvidia
readOnly: true
- name: gib
mountPath: /usr/local/gib
readOnly: true
env:
- name: LD_LIBRARY_PATH
value: /usr/local/nvidia/lib64
resources:
limits:
nvidia.com/gpu: 8
Standard
For GKE Standard mode, configure the following resources:
containers:
- name: my-container
volumeMounts:
- name: library-dir-host
mountPath: /usr/local/nvidia
- name: gib
mountPath: /usr/local/gib
env:
- name: LD_LIBRARY_PATH
value: /usr/local/nvidia/lib64
resources:
limits:
nvidia.com/gpu: 8
Set all the required environment variables to configure NCCL by using the following shell script from the workload container:
source /usr/local/gib/scripts/set_nccl_env.sh
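For example, a minimal sketch of a container entrypoint that loads the NCCL environment before starting the workload (train.py is a hypothetical script name):

#!/bin/bash
# Load the NCCL environment variables provided by the gib installation.
source /usr/local/gib/scripts/set_nccl_env.sh
# Launch the workload after the environment is configured.
exec python3 train.py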
The following tabs include examples of completed Pod manifests.
Autopilot

For GKE Autopilot mode, a completed Pod manifest should look similar to the following:
apiVersion: v1
kind: Pod
metadata:
name: my-pod
labels:
k8s-app: my-pod
annotations:
networking.gke.io/default-interface: 'eth0'
networking.gke.io/interfaces: |
[
{"interfaceName":"eth0","network":"default"},
{"interfaceName":"eth1","network":"gvnic-1"},
{"interfaceName":"eth2","network":"rdma-0"},
{"interfaceName":"eth3","network":"rdma-1"},
{"interfaceName":"eth4","network":"rdma-2"},
{"interfaceName":"eth5","network":"rdma-3"},
{"interfaceName":"eth6","network":"rdma-4"},
{"interfaceName":"eth7","network":"rdma-5"},
{"interfaceName":"eth8","network":"rdma-6"},
{"interfaceName":"eth9","network":"rdma-7"}
]
spec:
...
volumes:
- name: library-dir-host
hostPath:
path: /home/kubernetes/bin/nvidia
- name: gib
hostPath:
path: /home/kubernetes/bin/gib
containers:
- name: my-container
volumeMounts:
- name: library-dir-host
mountPath: /usr/local/nvidia
readOnly: true
- name: gib
mountPath: /usr/local/gib
readOnly: true
env:
- name: LD_LIBRARY_PATH
value: /usr/local/nvidia/lib64
resources:
limits:
nvidia.com/gpu: 8
...
Standard
For GKE Standard mode, a completed Pod manifest should look similar to the following:
apiVersion: v1
kind: Pod
metadata:
name: my-pod
labels:
k8s-app: my-pod
annotations:
networking.gke.io/default-interface: 'eth0'
networking.gke.io/interfaces: |
[
{"interfaceName":"eth0","network":"default"},
{"interfaceName":"eth2","network":"rdma-0"},
{"interfaceName":"eth3","network":"rdma-1"},
{"interfaceName":"eth4","network":"rdma-2"},
{"interfaceName":"eth5","network":"rdma-3"},
{"interfaceName":"eth6","network":"rdma-4"},
{"interfaceName":"eth7","network":"rdma-5"},
{"interfaceName":"eth8","network":"rdma-6"},
{"interfaceName":"eth9","network":"rdma-7"}
]
spec:
...
volumes:
- name: library-dir-host
hostPath:
path: /home/kubernetes/bin/nvidia
- name: gib
hostPath:
path: /home/kubernetes/bin/gib
containers:
- name: my-container
volumeMounts:
- name: library-dir-host
mountPath: /usr/local/nvidia
- name: gib
mountPath: /usr/local/gib
env:
- name: LD_LIBRARY_PATH
value: /usr/local/nvidia/lib64
resources:
limits:
nvidia.com/gpu: 8
...
Deploy and run an NCCL test for clusters with GPUDirect RDMA

To validate the functionality of the provisioned cluster that uses GPUDirect RDMA, you can run an NCCL test. You can run a basic test on two nodes, which you must use for nodes that are provisioned with flex-start (Preview). Or, if you have a larger number of nodes that are not provisioned with flex-start, you can use an NCCL test with Topology Aware Scheduling (TAS).
Test on two nodes

Run the two-node test:
A4

To deploy an NCCL test workload of two test Pods that are running on two A4 nodes, apply one of the following manifests:
For an Autopilot cluster:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-test-a4-autopilot.yaml
For a Standard cluster:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-test-a4.yaml
Check that the Pods are scheduled and running on the nodes:
kubectl get pods nccl-test-host-1 nccl-test-host-2
If the two Pods have the Running
status, you can proceed to the next step. For nodes that are provisioned by flex-start, it might take a few minutes before the nodes are created and the Pods are scheduled on those nodes.
Trigger a NCCL all-gather test for the nodes:
kubectl exec nccl-test-host-1 -it -- /usr/local/gib/scripts/run_nccl_tests.sh -t all_gather -b 1K -e 8G nccl-host-1 nccl-host-2
The output should be similar to the following:
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1024 16 float none -1 48.17 0.02 0.02 0 47.21 0.02 0.02 0
2048 32 float none -1 47.23 0.04 0.04 0 47.17 0.04 0.04 0
4096 64 float none -1 47.43 0.09 0.08 0 47.48 0.09 0.08 0
8192 128 float none -1 47.93 0.17 0.16 0 47.98 0.17 0.16 0
16384 256 float none -1 48.90 0.34 0.31 0 48.75 0.34 0.32 0
32768 512 float none -1 50.10 0.65 0.61 0 49.59 0.66 0.62 0
65536 1024 float none -1 51.70 1.27 1.19 0 51.66 1.27 1.19 0
131072 2048 float none -1 52.23 2.51 2.35 0 55.60 2.36 2.21 0
262144 4096 float none -1 53.89 4.86 4.56 0 53.39 4.91 4.60 0
524288 8192 float none -1 56.80 9.23 8.65 0 57.66 9.09 8.52 0
1048576 16384 float none -1 87.85 11.94 11.19 0 88.47 11.85 11.11 0
2097152 32768 float none -1 92.52 22.67 21.25 0 93.22 22.50 21.09 0
4194304 65536 float none -1 97.41 43.06 40.37 0 96.15 43.62 40.90 0
8388608 131072 float none -1 110.0 76.27 71.51 0 110.9 75.66 70.93 0
16777216 262144 float none -1 141.3 118.77 111.35 0 140.7 119.27 111.81 0
33554432 524288 float none -1 203.2 165.14 154.82 0 202.3 165.90 155.53 0
67108864 1048576 float none -1 303.3 221.25 207.42 0 301.9 222.27 208.38 0
134217728 2097152 float none -1 513.2 261.56 245.21 0 509.3 263.56 247.08 0
268435456 4194304 float none -1 842.4 318.64 298.72 0 832.3 322.54 302.38 0
536870912 8388608 float none -1 1511.8 355.12 332.92 0 1502.5 357.31 334.98 0
1073741824 16777216 float none -1 2976.7 360.72 338.17 0 2923.2 367.32 344.36 0
2147483648 33554432 float none -1 5888.9 364.66 341.87 0 5766.2 372.43 349.15 0
4294967296 67108864 float none -1 11722 366.39 343.49 0 11457 374.88 351.45 0
8589934592 134217728 float none -1 23379 367.43 344.46 0 22818 376.45 352.92 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 120.845
A3 Ultra

To deploy an NCCL test workload of two test Pods that are running on two A3 Ultra nodes, apply one of the following manifests:
For an Autopilot cluster:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-test-autopilot.yaml
For a Standard cluster:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-test.yaml
Check that the Pods are scheduled and running on the nodes:
kubectl get pods nccl-test-host-1 nccl-test-host-2
If the two Pods have the Running
status, you can proceed to the next step. For nodes that are provisioned by flex-start, it might take a few minutes before the nodes are created and the Pods are scheduled on those nodes.
Trigger a NCCL all-gather test for the nodes:
kubectl exec nccl-test-host-1 -it -- /usr/local/gib/scripts/run_nccl_tests.sh -t all_gather -b 1K -e 8G nccl-host-1 nccl-host-2
The output should be similar to the following:
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1024 16 float none -1 56.00 0.02 0.02 0 55.59 0.02 0.02 0
2048 32 float none -1 55.79 0.04 0.03 0 55.57 0.04 0.03 0
4096 64 float none -1 56.29 0.07 0.07 0 57.35 0.07 0.07 0
8192 128 float none -1 56.44 0.15 0.14 0 56.32 0.15 0.14 0
16384 256 float none -1 57.57 0.28 0.27 0 57.60 0.28 0.27 0
32768 512 float none -1 57.92 0.57 0.53 0 59.35 0.55 0.52 0
65536 1024 float none -1 59.92 1.09 1.03 0 60.15 1.09 1.02 0
131072 2048 float none -1 59.21 2.21 2.08 0 61.82 2.12 1.99 0
262144 4096 float none -1 63.58 4.12 3.87 0 63.34 4.14 3.88 0
524288 8192 float none -1 64.89 8.08 7.57 0 65.09 8.06 7.55 0
1048576 16384 float none -1 80.90 12.96 12.15 0 77.49 13.53 12.69 0
2097152 32768 float none -1 80.22 26.14 24.51 0 79.88 26.25 24.61 0
4194304 65536 float none -1 82.86 50.62 47.45 0 82.47 50.86 47.68 0
8388608 131072 float none -1 95.83 87.53 82.06 0 93.27 89.94 84.32 0
16777216 262144 float none -1 122.8 136.58 128.04 0 121.7 137.86 129.24 0
33554432 524288 float none -1 180.6 185.75 174.14 0 179.2 187.19 175.49 0
67108864 1048576 float none -1 279.7 239.90 224.90 0 277.0 242.26 227.12 0
134217728 2097152 float none -1 507.5 264.46 247.93 0 485.1 276.66 259.37 0
268435456 4194304 float none -1 866.3 309.88 290.51 0 864.0 310.70 291.28 0
536870912 8388608 float none -1 1576.1 340.62 319.33 0 1558.2 344.54 323.01 0
1073741824 16777216 float none -1 3096.6 346.75 325.08 0 3047.5 352.33 330.31 0
2147483648 33554432 float none -1 6148.0 349.30 327.47 0 6034.3 355.88 333.64 0
4294967296 67108864 float none -1 12226 351.29 329.33 0 12000 357.92 335.55 0
8589934592 134217728 float none -1 24391 352.17 330.16 0 23920 359.11 336.67 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 120.94
Test with Topology Aware Scheduling

If you have more than two nodes, we recommend using the following test, which uses Topology Aware Scheduling (TAS). Follow the steps in the next sections to prepare and run the test on your cluster.
Set up your cluster with JobSet and the TAS plugin

Install the TAS plugin:

Clone the container-engine-accelerators Git repository:
cd ~
git clone https://github.com/GoogleCloudPlatform/container-engine-accelerators.git
Apply the TAS plugin:
cd container-engine-accelerators/gke-topology-scheduler
kubectl create configmap topology-scheduler-scripts --namespace kube-system --from-file=schedule-daemon.py=schedule-daemon.py --from-file=label-nodes-daemon.py=label-nodes-daemon.py
kubectl apply -f service-account.yaml
kubectl apply -f schedule-daemon.yaml
kubectl apply -f label-nodes-daemon.yaml
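To confirm that the scheduler daemons are running, you can list their Pods. A sketch, assuming the Pod names follow the manifest file names (schedule-daemon, label-nodes-daemon) and run in the kube-system namespace:

# Check that the TAS daemons are running on the cluster.
kubectl get pods -n kube-system | grep -E 'schedule-daemon|label-nodes-daemon'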
A4

Create the following nccl-jobset-test.yaml manifest:
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
name: nccl-allgather
spec:
ttlSecondsAfterFinished: 1200
suspend: False
network:
enableDNSHostnames: true
replicatedJobs:
- name: worker
template:
spec:
parallelism: NUM_NODES
completions: NUM_NODES
template:
metadata:
annotations:
networking.gke.io/default-interface: 'eth0'
networking.gke.io/interfaces: |
[
{"interfaceName":"eth0","network":"default"},
{"interfaceName":"eth2","network":"rdma-0"},
{"interfaceName":"eth3","network":"rdma-1"},
{"interfaceName":"eth4","network":"rdma-2"},
{"interfaceName":"eth5","network":"rdma-3"},
{"interfaceName":"eth6","network":"rdma-4"},
{"interfaceName":"eth7","network":"rdma-5"},
{"interfaceName":"eth8","network":"rdma-6"},
{"interfaceName":"eth9","network":"rdma-7"}
]
spec:
activeDeadlineSeconds: 3600
restartPolicy: Never
nodeSelector:
cloud.google.com/gke-accelerator: nvidia-b200
tolerations:
- key: cloud.google.com/gke-queued
effect: NoSchedule
value: "true"
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
setHostnameAsFQDN: true
volumes:
- name: gib
hostPath:
path: /home/kubernetes/bin/gib
- name: nvidia
hostPath:
path: /home/kubernetes/bin/nvidia
- name: lib64
hostPath:
path: /lib64
- name: shared-memory
emptyDir:
medium: "Memory"
sizeLimit: 250Gi
schedulingGates:
- name: "gke.io/topology-aware-auto-nccl-test"
containers:
- name: nccl-test
stdin: true
tty: true
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-diagnostic:v1.0.6
env:
- name: MY_NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: OMPI_ALLOW_RUN_AS_ROOT
value: "1"
- name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
value: "1"
- name: N_NODES
value: "NUM_NODES"
- name: LD_LIBRARY_PATH
value: /usr/local/nvidia/lib64
command:
- bash
- -c
- |
set -x
echo "Starting workload container on ${MY_NODE_NAME} for $N_NODES benchmark"
# Install ping
apt update -y
apt install -y iputils-ping
# Start sshd
/scripts/container_entry.sh daemon &
# Get helper variables to form all hostnames
export POSTFIX=$(hostname | cut -d . -f 2-)
export WORKERS_BASENAME=$(hostname | cut -d . -f 1 | rev | cut -d - -f 2- | rev )
export NODE_RANK=$JOB_COMPLETION_INDEX
# For every worker, wait till online and add to hostfile
for i in `seq 0 $(($N_NODES-1))`; do
OTHER=${WORKERS_BASENAME}-${i}.${POSTFIX}
until ssh -p 222 -o StrictHostKeyChecking=no $OTHER hostname; do
echo Waiting for ${OTHER}...
sleep 10
done
echo ${OTHER} port=222 slots=8 | tee -a /tmp/hostfile;
done
cat /tmp/hostfile
# Launch from head node
if [[ "${NODE_RANK}" -eq "0" ]]; then
# World Level = 0x0, Rail Aligned = 0x7
export NCCL_TESTS_SPLIT_MASK="0x0";
# Force use of libnccl-gib
export NCCL_NET=gIB
# Set all the correct libnccl-gib environment variables
source /usr/local/gib/scripts/set_nccl_env.sh
# Get all relevant NCCL / env vars to pass to all workers
ENV_VARS=$(echo ${!NCCL*} ${!OMPI*} LD_LIBRARY_PATH PATH | sed 's/ / -x /g')
mpirun --hostfile /tmp/hostfile \
-x $ENV_VARS \
-mca plm_rsh_no_tree_spawn 1 \
--mca mtl ^ofi \
--mca orte_keep_fqdn_hostnames 1 \
--mca btl self,tcp \
--mca btl_tcp_if_include eth0 \
--bind-to none \
--mca plm_rsh_agent "ssh -q -o LogLevel=ERROR -o StrictHostKeyChecking=no -p 222" \
/third_party/nccl-tests/build/all_gather_perf -b 1K -e 8G -f 2 -g 1 -w 5 --iters 100 -c 1
else
while ping -c 1 ${WORKERS_BASENAME}-0.${POSTFIX}; do
sleep 5
done
fi
exit 0
volumeMounts:
- name: nvidia
mountPath: /usr/local/nvidia
- name: gib
mountPath: /usr/local/gib
- name: shared-memory
mountPath: /dev/shm
resources:
limits:
nvidia.com/gpu: 8
requests:
nvidia.com/gpu: 8
restartPolicy: Never
Replace NUM_NODES with the number of nodes in the node pool.
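For example, a quick way to substitute the value in place (a sketch assuming a 4-node pool):

# Replace every NUM_NODES occurrence in the manifest with 4.
sed -i 's/NUM_NODES/4/g' nccl-jobset-test.yaml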
Make sure that you understand the following about this manifest:

- The JobSet is named nccl-allgather.
- The gke.io/topology-aware-auto-nccl-test scheduling gate is used to verify that the Pods are scheduled for colocation.
- The parallelism and completions fields are both set to the number of nodes that you want to use to run the NCCL test.

Apply the manifest:
kubectl apply -f nccl-jobset-test.yaml
Confirm that the workload is admitted:
kubectl get jobsets
The output is similar to the following:
NAME RESTARTS COMPLETED AGE
nccl-allgather 3s
Confirm that the workload is in the Completed
state:
kubectl get pods
The output is similar to the following:
NAME READY STATUS RESTARTS AGE
nccl-allgather-worker-0-0-n9s6j 0/1 Completed 0 9m34s
nccl-allgather-worker-0-1-rsf7r 0/1 Completed 0 9m34s
...
The logs of the Pod with the pattern nccl-allgather-worker-0-0-.*
contain the results of the test.
Fetch the logs for this Pod:
kubectl logs $(kubectl get pods -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep nccl-allgather-worker-0-0)
The output should be similar to the following:
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1024 16 float none -1 54.07 0.02 0.02 0 55.80 0.02 0.02 0
2048 32 float none -1 55.46 0.04 0.03 0 55.31 0.04 0.03 0
4096 64 float none -1 55.59 0.07 0.07 0 55.38 0.07 0.07 0
8192 128 float none -1 56.05 0.15 0.14 0 55.92 0.15 0.14 0
16384 256 float none -1 57.08 0.29 0.27 0 57.75 0.28 0.27 0
32768 512 float none -1 57.49 0.57 0.53 0 57.22 0.57 0.54 0
65536 1024 float none -1 59.20 1.11 1.04 0 59.20 1.11 1.04 0
131072 2048 float none -1 59.58 2.20 2.06 0 63.57 2.06 1.93 0
262144 4096 float none -1 63.87 4.10 3.85 0 63.61 4.12 3.86 0
524288 8192 float none -1 64.83 8.09 7.58 0 64.40 8.14 7.63 0
1048576 16384 float none -1 79.74 13.15 12.33 0 76.66 13.68 12.82 0
2097152 32768 float none -1 78.41 26.74 25.07 0 79.05 26.53 24.87 0
4194304 65536 float none -1 83.21 50.41 47.26 0 81.25 51.62 48.39 0
8388608 131072 float none -1 94.35 88.91 83.35 0 99.07 84.68 79.38 0
16777216 262144 float none -1 122.9 136.55 128.02 0 121.7 137.83 129.21 0
33554432 524288 float none -1 184.2 182.19 170.80 0 178.1 188.38 176.60 0
67108864 1048576 float none -1 294.7 227.75 213.51 0 277.7 241.62 226.52 0
134217728 2097152 float none -1 495.4 270.94 254.00 0 488.8 274.60 257.43 0
268435456 4194304 float none -1 877.5 305.92 286.80 0 861.3 311.65 292.17 0
536870912 8388608 float none -1 1589.8 337.71 316.60 0 1576.2 340.61 319.33 0
1073741824 16777216 float none -1 3105.7 345.74 324.13 0 3069.2 349.85 327.98 0
2147483648 33554432 float none -1 6161.7 348.52 326.74 0 6070.7 353.75 331.64 0
4294967296 67108864 float none -1 12305 349.03 327.22 0 12053 356.35 334.08 0
8589934592 134217728 float none -1 24489 350.77 328.85 0 23991 358.05 335.67 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 120.248
A3 Ultra

Create the following nccl-jobset-test.yaml manifest:
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
name: nccl-allgather
spec:
ttlSecondsAfterFinished: 1200
suspend: False
network:
enableDNSHostnames: true
replicatedJobs:
- name: worker
template:
spec:
parallelism: NUM_NODES
completions: NUM_NODES
template:
metadata:
annotations:
networking.gke.io/default-interface: 'eth0'
networking.gke.io/interfaces: |
[
{"interfaceName":"eth0","network":"default"},
{"interfaceName":"eth2","network":"rdma-0"},
{"interfaceName":"eth3","network":"rdma-1"},
{"interfaceName":"eth4","network":"rdma-2"},
{"interfaceName":"eth5","network":"rdma-3"},
{"interfaceName":"eth6","network":"rdma-4"},
{"interfaceName":"eth7","network":"rdma-5"},
{"interfaceName":"eth8","network":"rdma-6"},
{"interfaceName":"eth9","network":"rdma-7"}
]
spec:
activeDeadlineSeconds: 3600
restartPolicy: Never
nodeSelector:
cloud.google.com/gke-accelerator: nvidia-h200-141gb
tolerations:
- key: cloud.google.com/gke-queued
effect: NoSchedule
value: "true"
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
setHostnameAsFQDN: true
volumes:
- name: gib
hostPath:
path: /home/kubernetes/bin/gib
- name: nvidia
hostPath:
path: /home/kubernetes/bin/nvidia
- name: lib64
hostPath:
path: /lib64
- name: shared-memory
emptyDir:
medium: "Memory"
sizeLimit: 250Gi
schedulingGates:
- name: "gke.io/topology-aware-auto-nccl-test"
containers:
- name: nccl-test
stdin: true
tty: true
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-diagnostic:v1.0.6
securityContext:
privileged: true
env:
- name: MY_NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: OMPI_ALLOW_RUN_AS_ROOT
value: "1"
- name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
value: "1"
- name: N_NODES
value: "NUM_NODES"
- name: LD_LIBRARY_PATH
value: /usr/local/nvidia/lib64
command:
- bash
- -c
- |
set -x
echo "Starting workload container on ${MY_NODE_NAME} for $N_NODES benchmark"
# Install ping
apt update -y
apt install -y iputils-ping
# Start sshd
/scripts/container_entry.sh daemon &
# Get helper variables to form all hostnames
export POSTFIX=$(hostname | cut -d . -f 2-)
export WORKERS_BASENAME=$(hostname | cut -d . -f 1 | rev | cut -d - -f 2- | rev )
export NODE_RANK=$JOB_COMPLETION_INDEX
# For every worker, wait till online and add to hostfile
for i in `seq 0 $(($N_NODES-1))`; do
OTHER=${WORKERS_BASENAME}-${i}.${POSTFIX}
until ssh -p 222 -o StrictHostKeyChecking=no $OTHER hostname; do
echo Waiting for ${OTHER}...
sleep 10
done
echo ${OTHER} port=222 slots=8 | tee -a /tmp/hostfile;
done
cat /tmp/hostfile
# Launch from head node
if [[ "${NODE_RANK}" -eq "0" ]]; then
# World Level = 0x0, Rail Aligned = 0x7
export NCCL_TESTS_SPLIT_MASK="0x0";
# Force use of libnccl-gib
export NCCL_NET=gIB
# Set all the correct libnccl-gib environment variables
source /usr/local/gib/scripts/set_nccl_env.sh
# Get all relevant NCCL / env vars to pass to all workers
ENV_VARS=$(echo ${!NCCL*} ${!OMPI*} LD_LIBRARY_PATH PATH | sed 's/ / -x /g')
mpirun --hostfile /tmp/hostfile \
-x $ENV_VARS \
-mca plm_rsh_no_tree_spawn 1 \
--mca orte_keep_fqdn_hostnames 1 \
--mca btl self,tcp \
--mca btl_tcp_if_include eth0 \
--bind-to none \
--mca plm_rsh_agent "ssh -q -o LogLevel=ERROR -o StrictHostKeyChecking=no -p 222" \
/third_party/nccl-tests/build/all_gather_perf -b 1K -e 8G -f 2 -g 1 -w 5 --iters 100 -c 1
else
while ping -c 1 ${WORKERS_BASENAME}-0.${POSTFIX}; do
sleep 5
done
fi
exit 0
volumeMounts:
- name: nvidia
mountPath: /usr/local/nvidia
- name: gib
mountPath: /usr/local/gib
- name: shared-memory
mountPath: /dev/shm
resources:
limits:
nvidia.com/gpu: 8
requests:
nvidia.com/gpu: 8
restartPolicy: Never
Replace NUM_NODES with the number of nodes in the node pool.
Make sure that you understand the following about this manifest:

- The JobSet is named nccl-allgather.
- The gke.io/topology-aware-auto-nccl-test scheduling gate is used to verify that the Pods are scheduled for colocation.
- The parallelism and completions fields are both set to the number of nodes that you want to use to run the NCCL test.

Apply the manifest:
kubectl apply -f nccl-jobset-test.yaml
Confirm that the workload is admitted:
kubectl get jobsets
The output is similar to the following:
NAME RESTARTS COMPLETED AGE
nccl-allgather 3s
Confirm that the workload is in the Completed
state:
kubectl get pods
The output is similar to the following:
NAME READY STATUS RESTARTS AGE
nccl-allgather-worker-0-0-n9s6j 0/1 Completed 0 9m34s
nccl-allgather-worker-0-1-rsf7r 0/1 Completed 0 9m34s
...
The logs of the Pod with the pattern nccl-allgather-worker-0-0-.*
contain the results of the test.
Fetch the logs for this Pod:
kubectl logs $(kubectl get pods -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep nccl-allgather-worker-0-0)
The output should be similar to the following:
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1024 16 float none -1 54.07 0.02 0.02 0 55.80 0.02 0.02 0
2048 32 float none -1 55.46 0.04 0.03 0 55.31 0.04 0.03 0
4096 64 float none -1 55.59 0.07 0.07 0 55.38 0.07 0.07 0
8192 128 float none -1 56.05 0.15 0.14 0 55.92 0.15 0.14 0
16384 256 float none -1 57.08 0.29 0.27 0 57.75 0.28 0.27 0
32768 512 float none -1 57.49 0.57 0.53 0 57.22 0.57 0.54 0
65536 1024 float none -1 59.20 1.11 1.04 0 59.20 1.11 1.04 0
131072 2048 float none -1 59.58 2.20 2.06 0 63.57 2.06 1.93 0
262144 4096 float none -1 63.87 4.10 3.85 0 63.61 4.12 3.86 0
524288 8192 float none -1 64.83 8.09 7.58 0 64.40 8.14 7.63 0
1048576 16384 float none -1 79.74 13.15 12.33 0 76.66 13.68 12.82 0
2097152 32768 float none -1 78.41 26.74 25.07 0 79.05 26.53 24.87 0
4194304 65536 float none -1 83.21 50.41 47.26 0 81.25 51.62 48.39 0
8388608 131072 float none -1 94.35 88.91 83.35 0 99.07 84.68 79.38 0
16777216 262144 float none -1 122.9 136.55 128.02 0 121.7 137.83 129.21 0
33554432 524288 float none -1 184.2 182.19 170.80 0 178.1 188.38 176.60 0
67108864 1048576 float none -1 294.7 227.75 213.51 0 277.7 241.62 226.52 0
134217728 2097152 float none -1 495.4 270.94 254.00 0 488.8 274.60 257.43 0
268435456 4194304 float none -1 877.5 305.92 286.80 0 861.3 311.65 292.17 0
536870912 8388608 float none -1 1589.8 337.71 316.60 0 1576.2 340.61 319.33 0
1073741824 16777216 float none -1 3105.7 345.74 324.13 0 3069.2 349.85 327.98 0
2147483648 33554432 float none -1 6161.7 348.52 326.74 0 6070.7 353.75 331.64 0
4294967296 67108864 float none -1 12305 349.03 327.22 0 12053 356.35 334.08 0
8589934592 134217728 float none -1 24489 350.77 328.85 0 23991 358.05 335.67 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 120.248