This page shows you how to create an AI-optimized Google Kubernetes Engine (GKE) cluster that uses Cluster Director for GKE, with A4 or A3 Ultra virtual machines (VMs), to support your AI and ML workloads.
Cluster Director for GKE lets you deploy and manage large AI-optimized clusters of accelerated VMs with features such as targeted workload placement, advanced cluster maintenance controls, and topology-aware scheduling. For more information, see Cluster Director.
GKE provides a single platform surface to run a diverse set of workloads for your organization's needs. This includes high performance distributed pre-training, model fine-tuning, model inference, application serving, and supporting services. GKE reduces the operational burden of managing multiple platforms.
Choose how to create an AI-optimized GKE cluster
The following options for cluster creation each provide varying degrees of ease and flexibility in cluster configuration and workload scheduling:
Create clusters with the default configuration for compute, storage, and networking resources, and with GPUDirect RDMA-over-Converged-Ethernet (RoCE) enabled, by using either Cluster Toolkit or Accelerated Processing Kit (XPK).
Alternatively, you can create your GKE cluster manually for precise customization or expansion of existing production GKE environments. To create an AI-optimized GKE cluster manually, see Create a custom AI-optimized GKE cluster.
Before you start, make sure that you have performed the following tasks:
If you previously installed the Google Cloud CLI, get the latest version by running the following command:
gcloud components update
Note: For existing gcloud CLI installations, make sure to set the compute/region property. If you primarily use zonal clusters, set the compute/zone property instead. By setting a default location, you can avoid errors in the gcloud CLI like the following: One of [--zone, --region] must be supplied: Please specify location. You might still need to specify the location in certain commands if the location of your cluster differs from the default that you set. (Example commands for setting a default location and checking your roles follow this section.)
Ensure that you have the following IAM roles on the project:
Kubernetes Engine Admin (roles/container.admin)
Compute Admin (roles/compute.admin)
Storage Admin (roles/storage.admin)
Project IAM Admin (roles/resourcemanager.projectIamAdmin)
Service Account Admin (roles/iam.serviceAccountAdmin)
Service Account User (roles/iam.serviceAccountUser)
Service Usage Consumer (roles/serviceusage.serviceUsageConsumer)
Choose a consumption option. Make your choice based on how you want to get and use GPU resources. To learn more, see Choose a consumption option.
For GKE, consider the following additional information when choosing a consumption option:
Obtain capacity. Learn how to obtain capacity for your consumption option.
To learn more, see Capacity overview.
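For example, the following commands show how to set a default compute region and how to list the roles granted to your account so that you can confirm that the required roles are present. This is a minimal sketch: the region, PROJECT_ID, and YOUR_EMAIL values are placeholders for your own values.
# Set a default compute region (or set compute/zone instead for zonal clusters).
gcloud config set compute/region us-central1
# List the roles granted to your account on the project.
gcloud projects get-iam-policy PROJECT_ID \
    --flatten="bindings[].members" \
    --filter="bindings.members:user:YOUR_EMAIL" \
    --format="value(bindings.role)"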
The following requirements apply to an AI-optimized GKE cluster:
Ensure that you use the minimum GPU driver version, depending on the machine type:
For A4, use the latest driver version.
For A3 Ultra, you must set gpu-driver-version=latest with GKE 1.31. For GKE version 1.31.5-gke.1169000 or later, GKE automatically installs 550 GPU driver versions on A3 Ultra nodes by default.
For A3 Ultra node pools, you must set the disk type to hyperdisk-balanced.
To use GPUDirect RDMA, your cluster must meet minimum version requirements that depend on the machine type.
To use GPUDirect RDMA, the GKE nodes must use a Container-Optimized OS node image. Ubuntu and Windows node images are not supported.
A hedged sketch of a node pool configuration that reflects these requirements follows this list.
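The following sketch shows how these requirements might map to a manually created node pool. This page creates node pools through Cluster Toolkit or XPK, so treat this only as an illustration; in particular, the a3-ultra-pool name, the a3-ultragpu-8g machine type, the nvidia-h200-141gb accelerator type, and the placeholder values are assumptions that you should verify against the machine type documentation.
gcloud container node-pools create a3-ultra-pool \
    --cluster=CLUSTER_NAME \
    --location=COMPUTE_REGION \
    --node-locations=COMPUTE_ZONE \
    --machine-type=a3-ultragpu-8g \
    --accelerator="type=nvidia-h200-141gb,count=8,gpu-driver-version=latest" \
    --disk-type=hyperdisk-balanced \
    --num-nodes=NODE_COUNT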
Use the following instructions to create a cluster either using Cluster Toolkit or XPK.
Create a cluster using Cluster Toolkit
This section guides you through the cluster creation process, ensuring that your project follows best practices and meets the requirements for an AI-optimized GKE cluster.
Note: If you create multiple clusters using these same cluster blueprints, ensure that all VPC and subnet names are unique per project to prevent errors.
A4
Clone the Cluster Toolkit from the git repository:
cd ~
git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
Install the Cluster Toolkit:
cd cluster-toolkit && git checkout main && make
Create a Cloud Storage bucket to store the state of the Terraform deployment:
gcloud storage buckets create gs://BUCKET_NAME \
--default-storage-class=STANDARD \
--project=PROJECT_ID \
--location=COMPUTE_REGION_TERRAFORM_STATE \
--uniform-bucket-level-access
gcloud storage buckets update gs://BUCKET_NAME --versioning
Replace the following variables:
BUCKET_NAME: the name of the new Cloud Storage bucket.
PROJECT_ID: your Google Cloud project ID.
COMPUTE_REGION_TERRAFORM_STATE: the compute region where you want to store the state of the Terraform deployment.
The files that you need to edit to create a cluster depend on the consumption option that you're using for your deployment. Select the tab that corresponds to your consumption option's provisioning model.
Reservation-bound
In the examples/gke-a4/gke-a4-deployment.yaml blueprint from the GitHub repo, fill in the following settings in the terraform_backend_defaults and vars sections to match the specific values for your deployment:
DEPLOYMENT_NAME: a unique name for the deployment. If the deployment name isn't unique within a project, cluster creation fails.
BUCKET_NAME: the name of the Cloud Storage bucket you created in the previous step.
PROJECT_ID: your Google Cloud project ID.
COMPUTE_REGION: the compute region for the cluster.
COMPUTE_ZONE: the compute zone for the node pool of A4 machines. This zone should match the zone where machines are available in your reservation.
NODE_COUNT: the number of A4 nodes in your cluster.
IP_ADDRESS/SUFFIX: the IP address range that you want to allow to connect with the cluster. This CIDR block must include the IP address of the machine that calls Terraform. For more information, see How authorized networks work.
For the extended_reservation field, use one of the following, depending on whether you want to target specific blocks in a reservation when provisioning the node pool:
To target the whole reservation, use only the reservation name (RESERVATION_NAME).
To target a specific block within your reservation, use the reservation and block names in the following format: RESERVATION_NAME/reservationBlocks/BLOCK_NAME
If you don't know which blocks are available in your reservation, see View a reservation topology.
Set the boot disk sizes for each node of the system and A4 node pools. The disk size that you need depends on your use case. For example, if you use the disk as a cache to reduce the latency of pulling an image repeatedly, you can set a larger disk size to accommodate your framework, model, or container image:
SYSTEM_NODE_POOL_DISK_SIZE_GB: the size of the boot disk for each node of the system node pool. The smallest allowed disk size is 10. The default value is 100.
A4_NODE_POOL_DISK_SIZE_GB: the size of the boot disk for each node of the A4 node pool. The smallest allowed disk size is 10. The default value is 100.
To modify advanced settings, edit examples/gke-a4/gke-a4.yaml. (A sketch of a filled-in deployment file follows this list.)
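For orientation, the following is a hedged sketch of what the filled-in terraform_backend_defaults and vars sections might look like. The exact key names come from the examples/gke-a4/gke-a4-deployment.yaml blueprint in the repository; keys not quoted elsewhere on this page (for example, the authorized-network and disk-size keys below) are illustrative, so edit the file that you cloned rather than copying this sketch.
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: my-tf-state-bucket          # BUCKET_NAME

vars:
  deployment_name: gke-a4-demo          # DEPLOYMENT_NAME; must be unique within the project
  project_id: my-project                # PROJECT_ID
  region: us-central1                   # COMPUTE_REGION
  zone: us-central1-a                   # COMPUTE_ZONE; must match the reservation zone
  static_node_count: 2                  # NODE_COUNT
  authorized_cidr: 203.0.113.0/32       # IP_ADDRESS/SUFFIX; illustrative key name
  extended_reservation: my-reservation  # or my-reservation/reservationBlocks/my-block
  system_node_pool_disk_size_gb: 100    # illustrative key name
  a4_node_pool_disk_size_gb: 100        # illustrative key name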
Flex-start
Preview
This product or feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA products and features are available "as is" and might have limited support. For more information, see the launch stage descriptions.
In the examples/gke-a4/gke-a4-deployment.yaml blueprint from the GitHub repo, fill in the following settings in the terraform_backend_defaults and vars sections to match the specific values for your deployment:
DEPLOYMENT_NAME: a unique name for the deployment. If the deployment name isn't unique within a project, cluster creation fails.
BUCKET_NAME: the name of the Cloud Storage bucket you created in the previous step.
PROJECT_ID: your Google Cloud project ID.
COMPUTE_REGION: the compute region for the cluster.
COMPUTE_ZONE: the compute zone for the node pool of A4 machines.
Remove static_node_count.
IP_ADDRESS/SUFFIX: the IP address range that you want to allow to connect with the cluster. This CIDR block must include the IP address of the machine that calls Terraform. For more information, see How authorized networks work.
Remove the extended_reservation field and replace it with enable_flex_start: true. On the next line, add enable_queued_provisioning: true if you also want to use queued provisioning. For more information, see Use node pools with flex-start with queued provisioning.
SYSTEM_NODE_POOL_DISK_SIZE_GB: the size of the boot disk for each node of the system node pool. The smallest allowed disk size is 10. The default value is 100.
A4_NODE_POOL_DISK_SIZE_GB: the size of the boot disk for each node of the A4 node pool. The smallest allowed disk size is 10. The default value is 100.
In the examples/gke-a4/gke-a4.yaml blueprint from the GitHub repo, make the following changes:
In the vars block, remove static_node_count.
In the vars block, make sure that the version_prefix value is "1.32." or higher. To use flex-start in GKE, your cluster must use version 1.32.2-gke.1652000 or later.
In the vars block, replace the entire extended_reservation block (including the extended_reservation line itself) with enable_flex_start: true, and, optionally, enable_queued_provisioning: true.
In the vars block, if you don't require queued provisioning, remove the following line: kueue_configuration_path: $(ghpc_stage("./kueue-configuration.yaml.tftpl"))
Under id: a4-pool, remove the following line: static_node_count: $(vars.static_node_count)
Under id: a4-pool, remove the reservation_affinity block. Replace this block with the following lines:
enable_flex_start: $(vars.enable_flex_start)
auto_repair: false
enable_queued_provisioning: $(vars.enable_queued_provisioning)
autoscaling_total_min_nodes: 0
Under id: workload-manager-install
, remove the following block:
kueue:
  install: true
  config_path: $(vars.kueue_configuration_path)
  config_template_vars:
    num_gpus: $(a4-pool.static_gpu_count)
    accelerator_type: $(vars.accelerator_type)
For flex-start with queued provisioning, do the following:
Add gpu_nominal_quota: NOMINAL_QUOTA to the vars block. The gpu_nominal_quota value is used to set the nominalQuota of GPUs in the ClusterQueue spec (see the third step below). In this example, the ClusterQueue only admits workloads if the sum of their GPU requests is less than or equal to the NOMINAL_QUOTA value. For more information, see ClusterQueue in the Kueue documentation.
Update the kueue block to the following:
kueue:
  install: true
  config_path: $(vars.kueue_configuration_path)
  config_template_vars:
    num_gpus: $(vars.gpu_nominal_quota)
Replace the content of the kueue-configuration.yaml.tftpl file with the following:
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "default-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: AdmissionCheck
metadata:
  name: dws-prov
spec:
  controllerName: kueue.x-k8s.io/provisioning-request
  parameters:
    apiGroup: kueue.x-k8s.io
    kind: ProvisioningRequestConfig
    name: dws-config
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ProvisioningRequestConfig
metadata:
  name: dws-config
spec:
  provisioningClassName: queued-provisioning.gke.io
  managedResources:
  - nvidia.com/gpu
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "dws-cluster-queue"
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: ${num_gpus}
  admissionChecks:
  - dws-prov
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: "default"
  name: "dws-local-queue"
spec:
  clusterQueue: "dws-cluster-queue"
---
Under id: job-template, replace the node_count variable with 2. (A consolidated sketch of the edited vars block follows these steps.)
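To summarize the flex-start edits, a hedged sketch of the resulting vars block in examples/gke-a4/gke-a4.yaml might look like the following. Key names that aren't quoted in the steps above, and all of the example values, are illustrative.
vars:
  project_id: my-project             # PROJECT_ID
  deployment_name: gke-a4-flex       # DEPLOYMENT_NAME
  region: us-central1                # COMPUTE_REGION
  zone: us-central1-a                # COMPUTE_ZONE
  version_prefix: "1.32."            # 1.32.2-gke.1652000 or later is required for flex-start
  enable_flex_start: true            # replaces the extended_reservation block
  enable_queued_provisioning: true   # optional; for flex-start with queued provisioning only
  gpu_nominal_quota: 16              # NOMINAL_QUOTA; for queued provisioning only
  kueue_configuration_path: $(ghpc_stage("./kueue-configuration.yaml.tftpl"))  # keep only for queued provisioning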
Generate Application Default Credentials (ADC) to provide access to Terraform. If you're using Cloud Shell, you can run the following command:
gcloud auth application-default login
Deploy the blueprint to provision the GKE infrastructure using A4 machine types:
cd ~/cluster-toolkit
./gcluster deploy -d \
examples/gke-a4/gke-a4-deployment.yaml \
examples/gke-a4/gke-a4.yaml
When prompted, select (A)pply to deploy the blueprint.
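After the deployment finishes, you can optionally confirm that the cluster was created. This is a minimal check; replace PROJECT_ID with your project ID:
gcloud container clusters list --project=PROJECT_ID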
Note: Due to the fio-bench-job-template job template in the blueprint, Google Cloud bucket, network storage, and persistent volume resources are created.
A3 Ultra
Clone the Cluster Toolkit from the git repository:
cd ~
git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
Install the Cluster Toolkit:
cd cluster-toolkit && git checkout main && make
Create a Cloud Storage bucket to store the state of the Terraform deployment:
gcloud storage buckets create gs://BUCKET_NAME \
--default-storage-class=STANDARD \
--project=PROJECT_ID \
--location=COMPUTE_REGION_TERRAFORM_STATE \
--uniform-bucket-level-access
gcloud storage buckets update gs://BUCKET_NAME --versioning
Replace the following variables:
BUCKET_NAME: the name of the new Cloud Storage bucket.
PROJECT_ID: your Google Cloud project ID.
COMPUTE_REGION_TERRAFORM_STATE: the compute region where you want to store the state of the Terraform deployment.
The files that you need to edit to create a cluster depend on the consumption option that you're using for your deployment. Select the tab that corresponds to your consumption option's provisioning model.
Reservation-bound
In the examples/gke-a3-ultragpu/gke-a3-ultragpu-deployment.yaml blueprint from the GitHub repo, replace the following variables in the terraform_backend_defaults and vars sections to match the specific values for your deployment:
DEPLOYMENT_NAME: a unique name for the deployment. If the deployment name isn't unique within a project, cluster creation fails.
BUCKET_NAME: the name of the Cloud Storage bucket you created in the previous step.
PROJECT_ID: your Google Cloud project ID.
COMPUTE_REGION: the compute region for the cluster.
COMPUTE_ZONE: the compute zone for the node pool of A3 Ultra machines. This zone should match the zone where machines are available in your reservation.
NODE_COUNT: the number of A3 Ultra nodes in your cluster.
IP_ADDRESS/SUFFIX: the IP address range that you want to allow to connect with the cluster. This CIDR block must include the IP address of the machine that calls Terraform. For more information, see How authorized networks work.
For the extended_reservation field, use one of the following, depending on whether you want to target specific blocks in a reservation when provisioning the node pool:
To target the whole reservation, use only the reservation name (RESERVATION_NAME).
To target a specific block within your reservation, use the reservation and block names in the following format: RESERVATION_NAME/reservationBlocks/BLOCK_NAME
If you don't know which blocks are available in your reservation, see View a reservation topology.
Set the boot disk sizes for each node of the system and A3 Ultra node pools. The disk size that you need depends on your use case. For example, if you use the disk as a cache to reduce the latency of pulling an image repeatedly, you can set a larger disk size to accommodate your framework, model, or container image:
SYSTEM_NODE_POOL_DISK_SIZE_GB: the size of the boot disk for each node of the system node pool. The smallest allowed disk size is 10. The default value is 100.
A3ULTRA_NODE_POOL_DISK_SIZE_GB: the size of the boot disk for each node of the A3 Ultra node pool. The smallest allowed disk size is 10. The default value is 100.
To modify advanced settings, edit examples/gke-a3-ultragpu/gke-a3-ultragpu.yaml.
Flex-start
Preview
This product or feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA products and features are available "as is" and might have limited support. For more information, see the launch stage descriptions.
In the examples/gke-a3-ultragpu/gke-a3-ultragpu-deployment.yaml blueprint from the GitHub repo, replace the following variables in the terraform_backend_defaults and vars sections to match the specific values for your deployment:
DEPLOYMENT_NAME: a unique name for the deployment. If the deployment name isn't unique within a project, cluster creation fails.
BUCKET_NAME: the name of the Cloud Storage bucket you created in the previous step.
PROJECT_ID: your Google Cloud project ID.
COMPUTE_REGION: the compute region for the cluster.
COMPUTE_ZONE: the compute zone for the node pool of A3 Ultra machines.
Remove static_node_count.
IP_ADDRESS/SUFFIX: the IP address range that you want to allow to connect with the cluster. This CIDR block must include the IP address of the machine that calls Terraform. For more information, see How authorized networks work.
Remove the extended_reservation field and replace it with enable_flex_start: true. On the next line, add enable_queued_provisioning: true if you also want to use queued provisioning. For more information, see Use node pools with flex-start with queued provisioning.
SYSTEM_NODE_POOL_DISK_SIZE_GB: the size of the boot disk for each node of the system node pool. The smallest allowed disk size is 10. The default value is 100.
A3ULTRA_NODE_POOL_DISK_SIZE_GB: the size of the boot disk for each node of the A3 Ultra node pool. The smallest allowed disk size is 10. The default value is 100.
In the examples/gke-a3-ultragpu/gke-a3-ultragpu.yaml blueprint from the GitHub repo, make the following changes:
In the vars block, remove static_node_count.
In the vars block, update the version_prefix value to "1.32." or higher. To use flex-start in GKE, your cluster must use version 1.32.2-gke.1652000 or later.
In the vars block, replace the entire extended_reservation block (including the extended_reservation line itself) with enable_flex_start: true, and, optionally, enable_queued_provisioning: true.
In the vars block, if you don't require queued provisioning, remove the following line: kueue_configuration_path: $(ghpc_stage("./kueue-configuration.yaml.tftpl"))
Under id: a3-ultragpu-pool, remove the following line: static_node_count: $(vars.static_node_count)
Under id: a3-ultragpu-pool, remove the reservation_affinity block. Replace this block with the following lines:
enable_flex_start: $(vars.enable_flex_start)
auto_repair: false
enable_queued_provisioning: $(vars.enable_queued_provisioning)
autoscaling_total_min_nodes: 0
Under id: workload-manager-install
, remove the following block:
kueue:
  install: true
  config_path: $(vars.kueue_configuration_path)
  config_template_vars:
    num_gpus: $(a3-ultragpu-pool.static_gpu_count)
    accelerator_type: $(vars.accelerator_type)
For flex-start with queued provisioning, follow these three steps:
Add gpu_nominal_quota: NOMINAL_QUOTA to the vars block. The gpu_nominal_quota value is used to set the nominalQuota of GPUs in the ClusterQueue specification. In this example, the ClusterQueue only admits workloads if the sum of their GPU requests is less than or equal to the NOMINAL_QUOTA value. For more information, see ClusterQueue in the Kueue documentation.
Update the kueue block to the following:
kueue:
  install: true
  config_path: $(vars.kueue_configuration_path)
  config_template_vars:
    num_gpus: $(vars.gpu_nominal_quota)
Replace the content of the kueue-configuration.yaml.tftpl file with the following:
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "default-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: AdmissionCheck
metadata:
  name: dws-prov
spec:
  controllerName: kueue.x-k8s.io/provisioning-request
  parameters:
    apiGroup: kueue.x-k8s.io
    kind: ProvisioningRequestConfig
    name: dws-config
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ProvisioningRequestConfig
metadata:
  name: dws-config
spec:
  provisioningClassName: queued-provisioning.gke.io
  managedResources:
  - nvidia.com/gpu
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "dws-cluster-queue"
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: ${num_gpus}
  admissionChecks:
  - dws-prov
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: "default"
  name: "dws-local-queue"
spec:
  clusterQueue: "dws-cluster-queue"
---
In the id: job-template field, replace the node_count variable with 2.
Generate Application Default Credentials (ADC) to provide access to Terraform. If you're using Cloud Shell, you can run the following command:
gcloud auth application-default login
Deploy the blueprint to provision the GKE infrastructure using A3 Ultra machine types:
cd ~/cluster-toolkit
./gcluster deploy -d \
examples/gke-a3-ultragpu/gke-a3-ultragpu-deployment.yaml \
examples/gke-a3-ultragpu/gke-a3-ultragpu.yaml
When prompted, select (A)pply to deploy the blueprint.
Note: Due to the fio-bench-job-template job template in the blueprint, Google Cloud bucket, network storage, and persistent volume resources are created.
Create a cluster using XPK
Accelerated Processing Kit (XPK) lets you quickly provision and use clusters. XPK generates preconfigured, training-optimized infrastructure, which is ideal when workload execution is your primary focus.
Create a cluster and run workloads with A3 Ultra VMs using XPK:
In the following commands, replace XPK_TAG with the latest XPK version number.
Open a shell window on a Linux machine, and enter the following commands to clone XPK from the Git repository and install the required packages:
## Setup virtual environment.
VENV_DIR=~/venvp3
python3 -m venv $VENV_DIR
source $VENV_DIR/bin/activate
## Clone the repository.
git clone --branch XPK_TAG https://github.com/google/xpk.git
cd xpk
## Install required packages
make install && export PATH=$PATH:$PWD/bin
Create a Standard cluster using A3 Ultra VMs. You can provision the cluster's nodes using reserved capacity:
python3 xpk.py cluster create \
--cluster=CLUSTER_NAME \
--device-type=h200-141gb-8 \
--zone=COMPUTE_ZONE \
--project=PROJECT_ID \
--num-nodes=NUM_NODES \
--reservation=RESERVATION_NAME
Replace the following variables:
CLUSTER_NAME: a name for the cluster.
COMPUTE_ZONE: the compute zone for the node pool of A3 Ultra machines. To use reserved capacity, ensure that you use the zone where you reserved the capacity. We also generally recommend choosing a zone near the user to minimize latency.
PROJECT_ID: your Google Cloud project ID.
NUM_NODES: the number of worker nodes in the node pool.
RESERVATION_NAME: the name of your reservation.
XPK offers additional arguments for cluster creation, including those for creating private clusters, creating Vertex AI Tensorboards, and using node auto-provisioning. For more information, refer to the cluster creation guide for XPK.
Verify that the cluster was created successfully:
python3 xpk.py cluster list --zone=COMPUTE_ZONE --project=PROJECT_ID
Optional: Run a workload to test the cluster environment:
python3 xpk.py workload create \
--workload WORKLOAD_NAME --command "echo goodbye" \
--cluster CLUSTER_NAME \
--device-type=h200-141gb-8 \
--num-nodes=WORKLOAD_NUM_NODES \
--zone=COMPUTE_ZONE \
--project=PROJECT_ID
Replace the following variables:
WORKLOAD_NAME: the name of your workload.
CLUSTER_NAME: the name of the cluster.
WORKLOAD_NUM_NODES: the number of worker nodes used for workload execution.
COMPUTE_ZONE: the compute zone for the node pool of A3 Ultra machines.
PROJECT_ID: your Google Cloud project ID.
Run an NCCL test
To validate the functionality of the provisioned cluster, you can run the following NCCL test. For nodes provisioned with reservations, you run the NCCL test with Topology Aware Scheduling (TAS). Nodes that are provisioned with flex-start don't use TAS.
Run the NCCL test by completing the following steps:
Connect to your cluster:
gcloud container clusters get-credentials CLUSTER_NAME --location=COMPUTE_REGION
Replace CLUSTER_NAME with the name of your cluster, which, for clusters created with Cluster Toolkit, is based on the DEPLOYMENT_NAME variable. Replace COMPUTE_REGION with the name of the compute region.
Deploy an all-gather NCCL performance test with Topology Aware Scheduling enabled by using the gke-a3-ultragpu/nccl-jobset-example.yaml file for A3 Ultra VMs or the gke-a4/nccl-jobset-example.yaml file for A4 VMs.
Modify the YAML file in the following ways, if the conditions apply to your deployment:
The tests use a certain number of nodes by default. If you want to change the number of nodes, change the following values to your required number of nodes:
parallelism
completions
N_NODES
If you want to test nodes provisioned by flex-start, in the metadata field, do the following:
Replace the kueue.x-k8s.io/queue-name value with dws-local-queue.
Add the following annotation:
annotations:
  provreq.kueue.x-k8s.io/maxRunDurationSeconds: "600"
A hedged sketch of where these fields sit in the manifest follows these steps.
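The following fragment shows roughly where the node-count fields and the flex-start metadata changes live in a JobSet manifest. The surrounding structure of nccl-jobset-example.yaml may differ, and the generateName value is only an assumption, so use this as a map of what to edit rather than as a replacement for the provided file.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  generateName: all-gather-                              # assumed name
  labels:
    kueue.x-k8s.io/queue-name: dws-local-queue           # flex-start only
  annotations:
    provreq.kueue.x-k8s.io/maxRunDurationSeconds: "600"  # flex-start only
spec:
  replicatedJobs:
  - name: w
    template:
      spec:
        parallelism: 2    # set to your required number of nodes
        completions: 2    # set to your required number of nodes
        # The Pod template also typically sets an N_NODES environment
        # variable; update it to the same node count.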
Create the resources to run the test.
For A3 Ultra VMs, use the following:
kubectl create -f ~/cluster-toolkit/examples/gke-a3-ultragpu/nccl-jobset-example.yaml
For A4 VMs, use the following:
kubectl create -f ~/cluster-toolkit/examples/gke-a4/nccl-jobset-example.yaml
This command returns a JobSet name.
The output should be similar to the following:
jobset.jobset.x-k8s.io/all-gather8t7dt created
To view the results of the NCCL test, run this command to view all of the running Pods:
kubectl get pods
The output should be similar to the following:
NAME READY STATUS RESTARTS AGE
all-gather8t7dt-w-0-0-n9s6j 0/1 Completed 0 9m34s
all-gather8t7dt-w-0-1-rsf7r 0/1 Completed 0 9m34s
Find a Pod name matching the pattern jobset-name-w-0-0-*. The logs of this Pod contain the results of the NCCL test.
To fetch the logs for this Pod, run this command:
kubectl logs all-gather8t7dt-w-0-0-n9s6j
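If you prefer not to copy the Pod name by hand, the following one-liner is a sketch that assumes the default namespace and a single matching JobSet run; it selects the first worker Pod and prints its logs:
kubectl logs "$(kubectl get pods -o name | grep -- '-w-0-0-' | head -n 1)"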
The output should be similar to the following:
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1024 16 float none -1 54.07 0.02 0.02 0 55.80 0.02 0.02 0
2048 32 float none -1 55.46 0.04 0.03 0 55.31 0.04 0.03 0
4096 64 float none -1 55.59 0.07 0.07 0 55.38 0.07 0.07 0
8192 128 float none -1 56.05 0.15 0.14 0 55.92 0.15 0.14 0
16384 256 float none -1 57.08 0.29 0.27 0 57.75 0.28 0.27 0
32768 512 float none -1 57.49 0.57 0.53 0 57.22 0.57 0.54 0
65536 1024 float none -1 59.20 1.11 1.04 0 59.20 1.11 1.04 0
131072 2048 float none -1 59.58 2.20 2.06 0 63.57 2.06 1.93 0
262144 4096 float none -1 63.87 4.10 3.85 0 63.61 4.12 3.86 0
524288 8192 float none -1 64.83 8.09 7.58 0 64.40 8.14 7.63 0
1048576 16384 float none -1 79.74 13.15 12.33 0 76.66 13.68 12.82 0
2097152 32768 float none -1 78.41 26.74 25.07 0 79.05 26.53 24.87 0
4194304 65536 float none -1 83.21 50.41 47.26 0 81.25 51.62 48.39 0
8388608 131072 float none -1 94.35 88.91 83.35 0 99.07 84.68 79.38 0
16777216 262144 float none -1 122.9 136.55 128.02 0 121.7 137.83 129.21 0
33554432 524288 float none -1 184.2 182.19 170.80 0 178.1 188.38 176.60 0
67108864 1048576 float none -1 294.7 227.75 213.51 0 277.7 241.62 226.52 0
134217728 2097152 float none -1 495.4 270.94 254.00 0 488.8 274.60 257.43 0
268435456 4194304 float none -1 877.5 305.92 286.80 0 861.3 311.65 292.17 0
536870912 8388608 float none -1 1589.8 337.71 316.60 0 1576.2 340.61 319.33 0
1073741824 16777216 float none -1 3105.7 345.74 324.13 0 3069.2 349.85 327.98 0
2147483648 33554432 float none -1 6161.7 348.52 326.74 0 6070.7 353.75 331.64 0
4294967296 67108864 float none -1 12305 349.03 327.22 0 12053 356.35 334.08 0
8589934592 134217728 float none -1 24489 350.77 328.85 0 23991 358.05 335.67 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 120.248
Reproduce pre-training benchmarks
You can reproduce pre-training benchmarks for large machine learning open models on A4 and A3 Ultra VMs on GKE. Each recipe provides you with the instructions that you need to run the benchmark. To view all the available recipes, see the GPU recipes GitHub repository.
Note: The pre-training benchmark recipes are tested on A3 Ultra VMs on GKE clusters that were created using Cluster Toolkit.
Clean up
To avoid recurring charges for the resources used on this page, clean up the resources provisioned by Cluster Toolkit, including the VPC networks and GKE cluster:
cd ~/cluster-toolkit
./gcluster destroy CLUSTER_NAME/
Replace CLUSTER_NAME with the name of your cluster. For clusters created with Cluster Toolkit, the cluster name is based on the DEPLOYMENT_NAME variable.