This page shows you how to use the NVIDIA Collective Communication Library (NCCL) Fast Socket plugin to run more efficient workloads on your Google Kubernetes Engine (GKE) clusters.
Before you begin
Before you start, make sure that you have performed the following tasks:
Update the gcloud CLI to the latest version:
gcloud components update
Note: For existing gcloud CLI installations, make sure to set the compute/region property. If you primarily use zonal clusters, set the compute/zone property instead. By setting a default location, you can avoid errors in the gcloud CLI like the following: One of [--zone, --region] must be supplied: Please specify location. You might need to specify the location in certain commands if the location of your cluster differs from the default that you set.
GKE Autopilot:
For details, see Creating an Autopilot cluster.
GKE Standard:
For details, see Creating a regional cluster.
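The default location described in the note above is set with gcloud config set. The following is a minimal sketch, not an official procedure: the helper name default_location_cmd is hypothetical, and it assumes that a location with two hyphens (such as us-central1-a) is a zone and a location with one hyphen (such as us-central1) is a region.

```shell
# Build the gcloud command that sets a default location.
# Assumption of this sketch: a name with two hyphens (us-central1-a) is a
# zone, and a name with one hyphen (us-central1) is a region.
default_location_cmd() {
  case "$1" in
    *-*-*) echo "gcloud config set compute/zone $1" ;;
    *-*)   echo "gcloud config set compute/region $1" ;;
    *)     echo "unrecognized location: $1" >&2; return 1 ;;
  esac
}

default_location_cmd us-central1     # region, for regional clusters
default_location_cmd us-central1-a   # zone, for zonal clusters
```

The helper only prints the command. Run the printed command yourself to apply the setting.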
Enable NCCL Fast Socket in Standard clusters
This section shows you how to enable the NCCL Fast Socket plugin in GKE Standard node pools. If you use GKE Autopilot clusters, GKE automatically enables the plugin when you request NCCL Fast Socket in your workloads. For instructions, see the NCCL Fast Socket in Autopilot section.
For Standard clusters, create a node pool that uses the NCCL Fast Socket plugin. You can also update an existing node pool using gcloud container node-pools update.
gcloud container node-pools create NODEPOOL_NAME \
--accelerator type=ACCELERATOR_TYPE,count=ACCELERATOR_COUNT \
--machine-type=MACHINE_TYPE \
--cluster=CLUSTER_NAME \
--enable-fast-socket \
--enable-gvnic
Replace the following:
NODEPOOL_NAME: the name of the new node pool.
CLUSTER_NAME: the name of the cluster.
ACCELERATOR_TYPE: the type of GPU accelerator that you use. For example, nvidia-tesla-t4.
ACCELERATOR_COUNT: the number of GPUs per node.
MACHINE_TYPE: the type of machine you want to use. NCCL Fast Socket is not supported on memory-optimized machine types.
In Autopilot, GPU device drivers are automatically installed.
For Standard clusters, follow the instructions in Installing NVIDIA GPU device drivers to install the required NVIDIA device drivers on your nodes.
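The node pool creation step above can be wrapped in a small script that rejects memory-optimized machine types before it calls gcloud. This is a sketch under stated assumptions: the function names run, is_memory_optimized, and create_fast_socket_pool are hypothetical, and the check assumes that memory-optimized machine types are the ones whose names start with "m" followed by a digit (for example the M1, M2, and M3 families). Setting DRY_RUN prints the command instead of running it.

```shell
# Print the command instead of running it when DRY_RUN is set.
run() { ${DRY_RUN:+echo} "$@"; }

# Assumption: memory-optimized machine types are those whose names start
# with "m" followed by a digit (for example m1-..., m2-..., m3-...).
is_memory_optimized() {
  case "$1" in
    m[0-9]*) return 0 ;;
    *)       return 1 ;;
  esac
}

# Arguments: node pool, cluster, machine type, GPU type, GPUs per node.
create_fast_socket_pool() {
  if is_memory_optimized "$3"; then
    echo "NCCL Fast Socket is not supported on memory-optimized type $3" >&2
    return 1
  fi
  run gcloud container node-pools create "$1" \
    --cluster="$2" \
    --machine-type="$3" \
    --accelerator "type=$4,count=$5" \
    --enable-fast-socket \
    --enable-gvnic
}

# Example (all values are placeholders):
# DRY_RUN=1 create_fast_socket_pool pool-1 my-cluster n1-standard-8 nvidia-tesla-t4 1
```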
NCCL Fast Socket in Autopilot
In Autopilot clusters, you request NCCL Fast Socket in your workloads by using the cloud.google.com/gke-nccl-fastsocket node selector. When you request NCCL Fast Socket in a workload, GKE enables gVNIC and NCCL Fast Socket on nodes that GKE provisions for the workload. You can use NCCL Fast Socket with any GPU type that Autopilot supports.
The following pod requests NCCL Fast Socket:
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: GPU_TYPE
    cloud.google.com/gke-nccl-fastsocket: "true"
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      limits:
        nvidia.com/gpu: GPU_QUANTITY
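One way to fill in the two placeholders in the manifest above is to render it from a shell function. The following is a sketch, not part of GKE tooling: the function name render_fast_socket_pod is hypothetical, and the accepted GPU types mirror the allowed values listed on this page.

```shell
# Arguments: GPU type (for example nvidia-l4), number of GPUs.
render_fast_socket_pod() {
  # Reject GPU types that are not in this page's list of allowed values.
  case "$1" in
    nvidia-b200|nvidia-h200-141gb|nvidia-h100-mega-80gb|nvidia-h100-80gb) ;;
    nvidia-a100-80gb|nvidia-tesla-a100|nvidia-l4|nvidia-tesla-t4) ;;
    *) echo "unsupported GPU type: $1" >&2; return 1 ;;
  esac
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: $1
    cloud.google.com/gke-nccl-fastsocket: "true"
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      limits:
        nvidia.com/gpu: $2
EOF
}

# Example: render_fast_socket_pod nvidia-l4 1 | kubectl apply -f -
```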
Replace the following:
GPU_TYPE: the type of GPU hardware. Allowed values are the following:
  nvidia-b200: NVIDIA B200 (180GB)
  nvidia-h200-141gb: NVIDIA H200 (141GB)
  nvidia-h100-mega-80gb: NVIDIA H100 Mega (80GB)
  nvidia-h100-80gb: NVIDIA H100 (80GB)
  nvidia-a100-80gb: NVIDIA A100 (80GB)
  nvidia-tesla-a100: NVIDIA A100 (40GB)
  nvidia-l4: NVIDIA L4
  nvidia-tesla-t4: NVIDIA T4
GPU_QUANTITY: the number of GPUs to allocate to the container.
To verify that NCCL Fast Socket is enabled, view the kube-system pods:
kubectl get pods -n kube-system
The output is similar to the following:
NAME                              READY   STATUS    RESTARTS   AGE
nccl-fastsocket-installer-qvfdw   2/2     Running   0          10m
nccl-fastsocket-installer-rtjs4   2/2     Running   0          10m
nccl-fastsocket-installer-tm294   2/2     Running   0          10m
In this output, the number of Pods should be equal to the number of nodes in the node pool.
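The comparison between installer Pods and nodes can be scripted. The following sketch counts the nccl-fastsocket-installer Pods that report a Running status in the kubectl output. The function name count_running_installers is hypothetical, and the node-count command in the comments assumes that nodes carry the cloud.google.com/gke-nodepool label.

```shell
# Read `kubectl get pods -n kube-system` output on stdin and print the
# number of nccl-fastsocket-installer Pods whose STATUS column is Running.
count_running_installers() {
  awk '$1 ~ /^nccl-fastsocket-installer-/ && $3 == "Running" { n++ }
       END { print n + 0 }'
}

# Example (NODEPOOL_NAME is a placeholder):
#   pods=$(kubectl get pods -n kube-system | count_running_installers)
#   nodes=$(kubectl get nodes -l cloud.google.com/gke-nodepool=NODEPOOL_NAME \
#             --no-headers | wc -l)
#   [ "$pods" -eq "$nodes" ] && echo "installer is running on every node"
```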
Disable NCCL Fast Socket
In GKE Autopilot clusters, the NCCL Fast Socket plugin is disabled by default. To disable the plugin on an existing workload, redeploy the workload without the NCCL Fast Socket node selector.
To disable NCCL Fast Socket for a node pool in Standard clusters, run the following command:
gcloud container node-pools update NODEPOOL_NAME \
--cluster=CLUSTER_NAME \
--no-enable-fast-socket
After you run this command, existing nodes still have the plugin installed. To move workloads onto nodes without the plugin, you must manually resize the node pool so that the workloads migrate to new nodes.
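The disable-and-resize sequence might be scripted as follows. This is a sketch under stated assumptions, not a documented procedure: the function names run and disable_fast_socket are hypothetical, and the sketch assumes that scaling the node pool to zero and back to its previous size is an acceptable way to replace the nodes, which interrupts any workloads running in the pool. Setting DRY_RUN prints the commands instead of running them.

```shell
# Print the command instead of running it when DRY_RUN is set.
run() { ${DRY_RUN:+echo} "$@"; }

# Assumption of this sketch: scaling to zero and back replaces every node,
# which interrupts any workloads running in the node pool.
# Arguments: node pool, cluster, number of nodes to restore.
disable_fast_socket() {
  run gcloud container node-pools update "$1" \
    --cluster="$2" --no-enable-fast-socket &&
  run gcloud container clusters resize "$2" \
    --node-pool="$1" --num-nodes=0 --quiet &&
  run gcloud container clusters resize "$2" \
    --node-pool="$1" --num-nodes="$3" --quiet
}

# Example (all values are placeholders):
# DRY_RUN=1 disable_fast_socket pool-1 my-cluster 3
```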
Troubleshooting
To troubleshoot gVNIC, see Troubleshooting Google Virtual NIC.
Last updated 2025-08-12 UTC.