This page shows you how to resolve issues with Google Kubernetes Engine (GKE) Autopilot clusters.
Cluster issues
Cannot create a cluster: 0 nodes registered
The following issue occurs when you try to create an Autopilot cluster with an IAM service account that's disabled or doesn't have the required permissions. Cluster creation fails with the following error message:
All cluster resources were brought up, but: only 0 nodes out of 2 have registered.
To resolve the issue, do the following:
Check whether the default Compute Engine service account or the custom IAM service account that you want to use is disabled:
gcloud iam service-accounts describe SERVICE_ACCOUNT
Replace SERVICE_ACCOUNT with the service account email address, such as my-iam-account@my-first-project.iam.gserviceaccount.com.
If the service account is disabled, the output is similar to the following:
disabled: true
displayName: my-service-account
email: my-service-account@my-project.iam.gserviceaccount.com
...
If the service account is disabled, enable it:
gcloud iam service-accounts enable SERVICE_ACCOUNT
If the service account is enabled and the error persists, grant the service account the minimum permissions required for GKE:
gcloud projects add-iam-policy-binding PROJECT_ID \
--member "serviceAccount:SERVICE_ACCOUNT" \
--role roles/container.defaultNodeServiceAccount
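To confirm that the role was granted, you can list the project's IAM bindings filtered to that service account. This is a minimal sketch; adjust the filter to the exact member string of your service account:
gcloud projects get-iam-policy PROJECT_ID \
    --flatten="bindings[].members" \
    --filter="bindings.members:serviceAccount:SERVICE_ACCOUNT" \
    --format="table(bindings.role)"
The output should include roles/container.defaultNodeServiceAccount among the listed roles.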
Namespace stuck in the Terminating state when cluster has 0 nodes
The following issue occurs when you delete a namespace in a cluster after the cluster scales down to zero nodes. The metrics-server
component can't accept the namespace deletion request because the component has zero replicas.
To diagnose this issue, run the following command:
kubectl describe ns/NAMESPACE_NAME
Replace NAMESPACE_NAME
with the name of the namespace.
The output is the following:
Discovery failed for some groups, 1 failing: unable to retrieve the complete
list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to
handle the request
To resolve this issue, scale any workload up to trigger GKE to create a new node. When the node is ready, the namespace deletion request automatically completes. After GKE deletes the namespace, scale the workload back down.
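For example, you might temporarily scale an existing Deployment to one replica to trigger the scale-up; the Deployment name here is a placeholder:
kubectl scale deployment DEPLOYMENT_NAME --replicas=1
After the namespace deletion completes, scale the Deployment back down, for example with --replicas=0.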
Scaling issues
Node scale up failed: Pod is at risk of not being scheduled
The following issue occurs when serial port logging is disabled in your Google Cloud project. GKE Autopilot clusters require serial port logging to effectively debug node issues. If serial port logging is disabled, Autopilot can't provision nodes to run your workloads.
The error message in your Kubernetes event log is similar to the following:
LAST SEEN TYPE REASON OBJECT MESSAGE
12s Warning FailedScaleUp pod/pod-test-5b97f7c978-h9lvl Node scale up in zones associated with this pod failed: Internal error. Pod is at risk of not being scheduled
Serial port logging might be disabled at the organization level through an organization policy that enforces the compute.disableSerialPortLogging
constraint. Serial port logging could also be disabled at the project or virtual machine (VM) instance level.
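To check whether the constraint is enforced on your project, you can query the effective organization policy. This is a sketch that assumes the gcloud resource-manager org-policies commands; your organization might manage policies through a different surface:
gcloud resource-manager org-policies describe \
    compute.disableSerialPortLogging \
    --project=PROJECT_ID --effective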
To resolve this issue, do the following:
Ask your organization policy administrator to remove or stop enforcing the compute.disableSerialPortLogging constraint in the project with your Autopilot cluster.
If the constraint isn't enforced, enable serial port logging in your project metadata (a sketch follows this list). This action requires the compute.projects.setCommonInstanceMetadata IAM permission.
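As a sketch of that second step, serial port logging can be enabled project-wide through the serial-port-logging-enable metadata key; adapt this to your project's policy setup:
gcloud compute project-info add-metadata \
    --project=PROJECT_ID \
    --metadata=serial-port-logging-enable=true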
Node scale up failed: GCE out of resources
The following issue occurs when your workloads request more resources than are available in that Compute Engine region or zone. Your Pods might remain in the Pending state.
Check your Pod events:
kubectl events --for='pod/POD_NAME' --types=Warning
Replace POD_NAME with the name of the pending Pod, such as example-pod.
The output is similar to the following:
LAST SEEN TYPE REASON OBJECT Message
19m Warning FailedScheduling pod/example-pod gke.io/optimize-utilization-scheduler 0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
14m Warning FailedScheduling pod/example-pod gke.io/optimize-utilization-scheduler 0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
12m (x2 over 18m) Warning FailedScaleUp cluster-autoscaler Node scale up in zones us-central1-f associated with this pod failed: GCE out of resources. Pod is at risk of not being scheduled.
34s (x3 over 17m) Warning FailedScaleUp cluster-autoscaler Node scale up in zones us-central1-b associated with this pod failed: GCE out of resources. Pod is at risk of not being scheduled.
To resolve this issue, try the following:
To avoid scale-up issues caused by resource availability in the future, consider the following approaches:
Nodes fail to scale up: Pod zonal resources exceeded
The following issue occurs when Autopilot doesn't provision new nodes for a Pod in a specific zone because a new node would violate resource limits.
The error message in your logs is similar to the following:
"napFailureReasons": [
{
"messageId": "no.scale.up.nap.pod.zonal.resources.exceeded",
...
This error refers to a noScaleUp
event, where node auto-provisioning did not provision any node group for the Pod in the zone.
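To find these events, you can query the cluster autoscaler visibility logs in Cloud Logging. This is a sketch that assumes the standard log name for those events and uses a free-text match on the message ID:
gcloud logging read \
    'logName:"container.googleapis.com%2Fcluster-autoscaler-visibility" AND "no.scale.up.nap.pod.zonal.resources.exceeded"' \
    --project=PROJECT_ID --limit=10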
If you encounter this error, confirm the following:
Pods not created: ephemeral storage requests exceed the maximum
In GKE version 1.28.6-gke.1317000 and later, GKE won't create Pods if your Pod's ephemeral storage requests exceed the Autopilot maximum of 10 GiB.
To diagnose this issue, describe the workload controller, like the Deployment or the Job:
kubectl describe CONTROLLER_TYPE/CONTROLLER_NAME
Replace the following:
CONTROLLER_TYPE: the type of workload controller, like replicaset or daemonset. For a list of controller types, see Workload management.
CONTROLLER_NAME: the name of the stuck workload.
If the Pod is not created because the ephemeral storage request exceeds the maximum, the output is similar to the following:
# lines omitted for clarity
Events:
{"[denied by autogke-pod-limit-constraints]":["Max ephemeral-storage requested by init containers for workload '' is higher than the Autopilot maximum of '10Gi'.","Total ephemeral-storage requested by containers for workload '' is higher than the Autopilot maximum of '10Gi'."]}
To resolve this issue, update your ephemeral storage requests so that the total ephemeral storage requested by workload containers and by containers that webhooks inject is less than or equal to the allowed maximum. For more information about the maximum, see Resource requests in Autopilot.
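For example, a container spec that stays within the limit might look like the following sketch; the Pod name, container name, image, and 9Gi value are placeholders chosen to stay under the 10 GiB maximum:
apiVersion: v1
kind: Pod
metadata:
  name: ephemeral-example
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        ephemeral-storage: "9Gi"  # keep the total across all containers at or below 10Gi
      limits:
        ephemeral-storage: "9Gi"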
Pods stuck in Pending state
A Pod might get stuck in the Pending status if you select a specific node for your Pod to use, but the sum of resource requests in the Pod and in DaemonSets that must run on the node exceeds the maximum allocatable capacity of the node. This might cause your Pod to get a Pending status and remain unscheduled.
To avoid this issue, evaluate the sizes of your deployed workloads to ensure that they're within the supported maximum resource requests for Autopilot.
You can also try scheduling your DaemonSets before you schedule your regular workload Pods.
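To compare your requests against the capacity of the selected node, you can inspect the node's allocatable resources; NODE_NAME is a placeholder:
kubectl describe node NODE_NAME | grep -A 7 Allocatable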
Consistently unreliable workload performance on a specific node
In GKE version 1.24 and later, if your workloads on a specific node consistently experience disruptions, crashes, or similar unreliable behavior, you can tell GKE about the problematic node by cordoning it using the following command:
kubectl drain NODE_NAME --ignore-daemonsets
Replace NODE_NAME
with the name of the problematic node. You can find the node name by running kubectl get nodes
.
After you drain the node, GKE reschedules the evicted workloads onto other nodes. The drain might report errors for Pods in the kube-system namespace; you can safely ignore draining errors in this namespace. GKE might also rate-limit your use of this command.

Pods take longer than expected to schedule on empty clusters
This event occurs when you deploy a workload to an Autopilot cluster that has no other workloads. Autopilot clusters start with zero usable nodes and scale to zero nodes if the cluster is empty to avoid having unutilized compute resources in the cluster. Deploying a workload in a cluster that has zero nodes triggers a scale-up event.
If you experience this, Autopilot is functioning as intended, and no action is necessary. Your workload will deploy as expected after the new nodes boot up.
Check whether your Pods are waiting for new nodes:
Describe your pending Pod:
kubectl describe pod POD_NAME
Replace POD_NAME
with the name of your pending Pod.
Check the Events
section of the output. If the Pod is waiting for new nodes, the output is similar to the following:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 11s gke.io/optimize-utilization-scheduler no nodes available to schedule pods
Normal TriggeredScaleUp 4s cluster-autoscaler pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/example-project/zones/example-zone/instanceGroups/gk3-example-cluster-pool-2-9293c6db-grp 0->1 (max: 1000)} {https://www.googleapis.com/compute/v1/projects/example-project/zones/example-zone/instanceGroups/gk3-example-cluster-pool-2-d99371e7-grp 0->1 (max: 1000)}]
The TriggeredScaleUp event shows that your cluster is scaling up from zero nodes to as many nodes as are required to run your deployed workload.
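To watch the Pod transition from Pending to Running as the new nodes become ready, you can run a command like the following; POD_NAME is your Pod's name:
kubectl get pod POD_NAME --watch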
System Pods stuck in the Pending state on empty clusters
This event occurs when none of your own workloads are running in a cluster, which results in the cluster scaling down to zero nodes. Autopilot clusters start with zero usable nodes and scale down to zero nodes if you aren't running any of your workloads in the cluster. This behavior minimizes wasted compute resources in the cluster.
When a cluster scales down to zero nodes, GKE system workloads won't schedule and stay in the Pending
state. This is intended behavior and no action is necessary. The next time that you deploy a workload to the cluster, GKE will scale the cluster up and the pending system Pods will run on those nodes.
To check whether system Pods are pending because of an empty cluster, do the following:
Check whether your cluster has any nodes:
kubectl get nodes
The output is the following, which indicates that the cluster has zero nodes:
No resources found
Check the status of system Pods:
kubectl get pods --namespace=kube-system
The output is similar to the following:
NAME READY STATUS RESTARTS AGE
antrea-controller-horizontal-autoscaler-6d97f7cf7c-ngfd2 0/1 Pending 0 9d
egress-nat-controller-84bc985778-6jcwl 0/1 Pending 0 9d
event-exporter-gke-5c5b457d58-7njv7 0/2 Pending 0 3d5h
event-exporter-gke-6cd5c599c6-bn665 0/2 Pending 0 9d
konnectivity-agent-694b68fb7f-gws8j 0/2 Pending 0 3d5h
konnectivity-agent-7d659bf64d-lp4kt 0/2 Pending 0 9d
konnectivity-agent-7d659bf64d-rkrw2 0/2 Pending 0 9d
konnectivity-agent-autoscaler-5b6ff64fcd-wn7fw 0/1 Pending 0 9d
konnectivity-agent-autoscaler-cc5bd5684-tgtwp 0/1 Pending 0 3d5h
kube-dns-65ccc769cc-5q5q7 0/5 Pending 0 3d5h
kube-dns-7f7cdb9b75-qkq4l 0/5 Pending 0 9d
kube-dns-7f7cdb9b75-skrx4 0/5 Pending 0 9d
kube-dns-autoscaler-6ffdbff798-vhvkg 0/1 Pending 0 9d
kube-dns-autoscaler-8b7698c76-mgcx8 0/1 Pending 0 3d5h
l7-default-backend-87b58b54c-x5q7f 0/1 Pending 0 9d
metrics-server-v1.31.0-769c5b4896-t5jjr 0/1 Pending 0 9d
Check the reason why the system Pods have the Pending
status:
kubectl describe pod --namespace=kube-system SYSTEM_POD_NAME
Replace SYSTEM_POD_NAME
with the name of any system Pod from the output of the preceding command.
The output is similar to the following:
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 4m35s (x27935 over 3d5h) default-scheduler no nodes available to schedule pods
...
In the output, the no nodes available to schedule pods
value in the Message
field for the FailedScheduling
event indicates that the system Pod didn't schedule because the cluster is empty.
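No action is required, but if you want the system Pods to schedule sooner, deploying any workload triggers a scale-up. As a sketch, the Deployment name and image below are placeholders:
kubectl create deployment scale-up-trigger --image=nginx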
Permission error when running tcpdump in a Pod
Access to the underlying nodes is prohibited in a GKE Autopilot cluster, so you must run the tcpdump utility from within a Pod and then copy the capture file with the kubectl cp command. If you run the tcpdump utility from within a Pod in a GKE Autopilot cluster, you might see the following error:
tcpdump: eth0: You don't have permission to perform this capture on that device
(socket: Operation not permitted)
This happens because, by default, GKE Autopilot applies a security context that drops the NET_RAW capability to all Pods, which mitigates potential vulnerabilities. For example:
apiVersion: v1
kind: Pod
metadata:
labels:
app: tcpdump
name: tcpdump
spec:
containers:
- image: nginx
name: nginx
resources:
limits:
cpu: 500m
ephemeral-storage: 1Gi
memory: 2Gi
requests:
cpu: 500m
ephemeral-storage: 1Gi
memory: 2Gi
securityContext:
capabilities:
drop:
- NET_RAW
To resolve this issue, if your workload requires the NET_RAW capability, you can re-enable it:
Add the NET_RAW
capability to the securityContext
section of your Pod's YAML specification:
securityContext:
capabilities:
add:
- NET_RAW
Run tcpdump
from within a Pod:
tcpdump port 53 -w packetcap.pcap
tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
Use the kubectl cp command to copy the capture file to your local machine for further analysis:
kubectl cp POD_NAME:/PATH_TO_FILE/FILE_NAME /PATH_TO_FILE/FILE_NAME
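For example, assuming the Pod named tcpdump from the earlier manifest and that the capture file was written to the container's working directory, the copy might look like this:
kubectl cp tcpdump:packetcap.pcap ./packetcap.pcap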
Alternatively, use kubectl exec to run the tcpdump command and redirect the packet capture output to a local file:
kubectl exec -it POD_NAME -- bash -c "tcpdump port 53 -w -" > packet-new.pcap
If you can't find a solution to your problem in the documentation, see Get support for further help, including advice on the following topics:
Getting support from the community by asking questions on StackOverflow and using the google-kubernetes-engine tag to search for similar issues. You can also join the #kubernetes-engine Slack channel for more community support.