This page introduces you to fundamental troubleshooting techniques for Google Kubernetes Engine (GKE). This page is for users who are new to Kubernetes and GKE and who want to learn effective troubleshooting practices.
This page provides an overview of the following tools and techniques for monitoring, diagnosing, and resolving issues with GKE:
The Google Cloud console: review the health of your clusters and workloads at a glance.
The kubectl command-line tool: use these commands to view the live status of resources such as nodes and Pods.
Cloud Logging: conduct historical analysis by querying logs from your clusters and apps.
Cloud Monitoring: perform proactive monitoring with metrics, dashboards, and alerts.
Gemini Cloud Assist: accelerate diagnosis with AI-assisted analysis and recommendations.
If you're new to Kubernetes and GKE, understanding core concepts, like cluster architecture and the relationship between Pods and nodes, is essential before you start to troubleshoot. If you want to learn more, see Start learning about GKE.
It's also helpful to understand which parts of GKE you're responsible for maintaining and which parts Google Cloud is responsible for maintaining. For more information, see GKE shared responsibility.
Review cluster and workload health in the Google Cloud console
The Google Cloud console is a good starting point for troubleshooting because it provides a quick view of the health of your clusters and workloads. Cluster health refers to the health of the underlying GKE infrastructure like nodes and networking, while workload health refers to the status and performance of your apps running on the cluster.
The following sections describe the cluster and workload pages. To provide a complete picture of your app's health, the Google Cloud console also gives you access to powerful logging and monitoring tools, letting you investigate the root cause of past failures and proactively prevent future ones. For more information about these tools, see the Conduct historical analysis with Cloud Logging and Perform proactive monitoring with Cloud Monitoring sections.
Find cluster issues
The Kubernetes clusters page provides you with an overview of the health of your clusters. To identify problems with any of your clusters, start on this page.
To get started, in the Google Cloud console, go to the Kubernetes clusters page.
Here are some examples of how you can use this page for troubleshooting:
After you discover a problem with a cluster, explore the cluster's Details page for in-depth information that helps you troubleshoot your cluster and understand its configuration.
To go to a cluster's Details page, do the following:
Go to the Kubernetes clusters page.
Review the Name column and click the name of the cluster that you want to investigate.
Here are some examples of how to use the cluster Details page to troubleshoot your cluster:
For general health checks, try the following options:
To view cluster-level dashboards, go to the Observability tab. By default, GKE enables Cloud Monitoring when you create a cluster. When Cloud Monitoring is enabled, GKE automatically sets up the dashboards on this page. Here are some of the views you might find most useful for troubleshooting:
Control plane: view the control plane's health and performance. This dashboard lets you monitor key metrics of components such as kube-apiserver and etcd, identify performance bottlenecks, and detect component failures.
To view recent app errors, go to the App errors tab. The information on this tab can help you prioritize and resolve errors by showing the number of occurrences, when an error first appeared, and when it last happened.
To investigate an error further, click the error message to view a detailed error report, including links to relevant logs.
If you're troubleshooting issues after a recent upgrade or change, check the Cluster basics section in the cluster Details tab. Confirm that the version listed in the Version field is what you expect. For further investigation, click Show upgrade history in the Upgrades section.
If you're using a Standard cluster and your Pods are stuck in a Pending state, or you suspect that nodes are overloaded, check the Nodes tab. The Nodes tab isn't available for Autopilot clusters because GKE manages nodes for you. On this tab, check that each node has a status of Ready. A NotReady status indicates a problem with the node itself, such as resource pressure or an issue with the kubelet (the kubelet is the agent that runs on each node to manage containers).
Find workload issues
When you suspect that there's a problem with a specific app, like a failed Deployment, go to the Workloads page in the Google Cloud console. This page provides a centralized view of all of the apps that run within your clusters.
To get started, in the Google Cloud console, go to the Workloads page.
Here are some examples of how you can use this page for troubleshooting:
After you identify a problematic workload from the overview, explore the workload Details page to begin to isolate the root cause.
To go to a workload's Details page, do the following:
Go to the Workloads page.
View the Name column and click the name of the workload that you want to investigate.
Here are some examples of how to use the workload Details page to troubleshoot your workloads:
To check the workload's configuration, use the workload Overview and Details tabs. You can use this information to verify details such as whether the correct container image tag was deployed, or to check the workload's resource requests and limits.
Note: Depending on your workload type, you might not have an Overview tab. For example, StatefulSets have only a Details tab. However, both tabs help you review your configuration.
To find the name of a specific crashing Pod, go to the Managed Pods section. You might need this information for kubectl commands. This section lists all the Pods controlled by the workload, along with their statuses.
To see a history of recent changes to a workload, go to the Revision history tab. If you notice performance issues after a new deployment, then use this section to identify which revision is active. You can then compare the configurations of the current revision with previous ones to pinpoint the source of the problem. If this tab isn't visible, the workload is either a type that doesn't use revisions or it hasn't yet had any updates.
If a Deployment seems to have failed, go to the Events tab. This page is often the most valuable source of information because it shows Kubernetes-level events.
To look at your app's logs, click the Logs tab. This page helps you understand what's happening inside your cluster. Look here for error messages and stack traces that can help you diagnose issues.
To confirm exactly what was deployed, view the YAML tab. This page shows the live YAML manifest for the workload as it exists on the cluster. This information is useful for finding any discrepancies from your source-controlled manifests. If you're viewing a single Pod's YAML manifest, this tab also shows you the status of the Pod, which provides insights about Pod-level failures.
Investigate cluster state with the kubectl command-line tool
Although the Google Cloud console helps you understand if there's a problem, the kubectl command-line tool is essential for discovering why. By communicating directly with the Kubernetes control plane, the kubectl command-line tool lets you gather the detailed information that you need to troubleshoot your GKE environment.
The following sections introduce you to some essential commands that are a powerful starting point for GKE troubleshooting.
Before you begin
Before you start, perform the following tasks:
Configure the kubectl command-line tool to communicate with your cluster:
gcloud container clusters get-credentials CLUSTER_NAME \
--location=LOCATION
Replace the following:
CLUSTER_NAME: the name of your cluster.
LOCATION: the Compute Engine location of the control plane of your cluster. Provide a region for regional clusters, or a zone for zonal clusters.
Review your permissions. To see if you have the required permissions to run kubectl commands, use the kubectl auth can-i command. For example, to see if you have permission to run kubectl get nodes, run the kubectl auth can-i get nodes command.
If you have the required permissions, the command returns yes; otherwise, the command returns no.
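For example, the following commands check whether you can view nodes across the cluster and list Pods in a specific namespace, where NAMESPACE_NAME is a placeholder for your own namespace:
kubectl auth can-i get nodes
kubectl auth can-i list pods --namespace=NAMESPACE_NAME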
If you lack permission to run a kubectl command, you might see an error message similar to the following:
Error from server (Forbidden): pods "POD_NAME" is forbidden: User
"USERNAME@DOMAIN.com" cannot list resource "pods" in API group "" in the
namespace "default"
If you don't have the required permissions, ask your cluster administrator to assign the necessary roles to you.
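As an illustration only, a cluster administrator might grant read-only access to cluster resources with a Kubernetes RBAC binding similar to the following command. The binding name is a placeholder, and in GKE the equivalent access is often granted through IAM roles instead:
kubectl create clusterrolebinding troubleshooting-viewer \
    --clusterrole=view \
    --user=USERNAME@DOMAIN.com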
The kubectl get command helps you to see an overall view of what's happening in your cluster. Use the following commands to see the status of two of the most important cluster components, nodes and Pods:
To check if your nodes are healthy, view details about all nodes and their statuses:
kubectl get nodes
The output is similar to the following:
NAME STATUS ROLES AGE VERSION
gke-cs-cluster-default-pool-8b8a777f-224a Ready <none> 4d23h v1.32.3-gke.1785003
gke-cs-cluster-default-pool-8b8a777f-egb2 Ready <none> 4d22h v1.32.3-gke.1785003
gke-cs-cluster-default-pool-8b8a777f-p5bn Ready <none> 4d22h v1.32.3-gke.1785003
Any status other than Ready requires additional investigation.
To check if your Pods are healthy, view details about all Pods and their statuses:
kubectl get pods --all-namespaces
The output is similar to the following:
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system netd-6nbsq 3/3 Running 0 4d23h
kube-system netd-g7tpl 3/3 Running 0 4d23h
Any status other than Running requires additional investigation. Here are some common statuses that you might see:
Running: a healthy, running state.
Pending: the Pod is waiting to be scheduled on a node.
CrashLoopBackOff: the containers in the Pod are repeatedly crashing in a loop because the app starts, exits with an error, and is then restarted by Kubernetes.
ImagePullBackOff: the Pod can't pull the container image.
The preceding commands are only two examples of how you can use the kubectl get command. You can also use the command to learn more about many types of Kubernetes resources. For a full list of the resources that you can explore, see kubectl get in the Kubernetes documentation.
You can also add -o wide to your kubectl get commands to see additional information about your resources. For example, the kubectl get nodes -o wide command adds the following columns to the output: Internal-IP, External-IP, OS-Image, Kernel-Version, and Container-Runtime.
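For example, to see which node each Pod is scheduled on and each Pod's IP address, you can run the following command:
kubectl get pods -o wide --all-namespaces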
Learn more about specific resources
After you identify a problem, you need to get more details. For example, the problem could be a Pod that doesn't have a status of Running. To get those details, use the kubectl describe command.
For example, to describe a specific Pod, run the following command:
kubectl describe pod POD_NAME -n NAMESPACE_NAME
Replace the following:
POD_NAME: the name of the Pod experiencing issues.
NAMESPACE_NAME: the namespace that the Pod is in. If you're not sure what the namespace is, review the Namespace column from the output of the kubectl get pods command.
The output of the kubectl describe command includes detailed information about your resource. Here are some of the most helpful sections to review when you troubleshoot a Pod:
Status: the current status of the Pod.
Conditions: the overall health and readiness of the Pod.
Restart Count: how many times the containers in the Pod have restarted. High numbers can be a cause for concern.
Events: a log of important things that have happened to this Pod, like being scheduled to a node, pulling its container image, and whether any errors occurred. The Events section is often where you can find the direct clues to why a Pod is failing.
As with the kubectl get command, you can use the kubectl describe command to learn more about multiple types of resources. For a full list of the resources that you can explore, see kubectl describe in the Kubernetes documentation.
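For example, the following commands describe a node and a Deployment; the names are placeholders for your own resources:
kubectl describe node NODE_NAME
kubectl describe deployment DEPLOYMENT_NAME -n NAMESPACE_NAME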
Conduct historical analysis with Cloud Logging
Although the kubectl command-line tool is invaluable for inspecting the live state of your Kubernetes objects, its view is often limited to the present moment. To understand the root cause of a problem, you often need to investigate what happened over time. When you need that historical context, use Cloud Logging.
Cloud Logging aggregates logs from your GKE clusters, containerized apps, and other Google Cloud services.
Understand key log types for troubleshooting
Cloud Logging automatically collects several different types of GKE logs that can help you troubleshoot:
Node and runtime logs (kubelet, containerd): the logs from the underlying node services. Because the kubelet manages the lifecycle of all Pods on the node, its logs are essential for troubleshooting issues like container startups, Out of Memory (OOM) events, probe failures, and volume mount errors. These logs are also crucial for diagnosing node-level problems, such as a node that has a NotReady status; an example query for these logs appears after this list.
Because containerd manages the lifecycle of your containers, including pulling images, its logs are crucial for troubleshooting issues that happen before the kubelet can start the container. containerd logs help you diagnose node-level problems in GKE, as they document the specific activities and potential errors of the container runtime.
App logs (stdout, stderr): the standard output and error streams from your containerized processes. These logs are essential for debugging app-specific issues like crashes, errors, or unexpected behavior.
Audit logs: these logs answer "who did what, where, and when?" for your cluster. They track administrative actions and API calls made to the Kubernetes API server, which is useful for diagnosing issues caused by configuration changes or unauthorized access.
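For example, to view the node and runtime logs for a single node, you might use a Logs Explorer query similar to the following. This is a minimal sketch: NODE_NAME is a placeholder, and kubelet and container-runtime are the log IDs that GKE typically uses for these streams:
resource.type="k8s_node"
resource.labels.node_name="NODE_NAME"
(log_id("kubelet") OR log_id("container-runtime"))
Similarly, to use the audit logs to see who performed administrative actions on Deployments, a sketch like the following might help; the substring match on the method name is an assumption about how Kubernetes audit entries are named:
log_id("cloudaudit.googleapis.com/activity")
resource.type="k8s_cluster"
protoPayload.methodName:"deployments"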
After you identify an issue, you can query these logs to find out what happened. To help you get started, here are some issues that reviewing logs can help you investigate:
If a node has a NotReady status, review its node logs. The kubelet and containerd logs often reveal the underlying cause, such as network problems or resource constraints.
Use Logs Explorer to query, view, and analyze GKE logs in the Google Cloud console. Logs Explorer provides powerful filtering options that help you to isolate your issue.
To access and use Logs Explorer, complete the following steps:
In the Google Cloud console, go to the Logs Explorer page.
In the query pane, enter a query. Use the Logging query language to write targeted queries. Here are some common filters to get you started:
resource.type: the type of Kubernetes resource. Example values: k8s_cluster, k8s_node, k8s_pod, k8s_container.
log_id: the log stream from the resource. Example values: stdout, stderr.
resource.labels.RESOURCE_TYPE.name: filters for resources with a specific name. Replace RESOURCE_TYPE with the type of resource that you want to query, for example, namespace or pod. Example values: example-namespace-name, example-pod-name.
severity: the log severity level. Example values: DEFAULT, INFO, WARNING, ERROR, CRITICAL.
jsonPayload.message=~: a regular expression search for text within the log message. Example value: scale.down.error.failed.to.delete.node.min.size.reached.
For example, to troubleshoot a specific Pod, you might want to isolate its error logs. To see only logs with an ERROR severity for that Pod, use the following query:
resource.type="k8s_container"
resource.labels.pod_name="POD_NAME"
resource.labels.namespace_name="NAMESPACE_NAME"
severity=ERROR
Replace the following:
POD_NAME: the name of the Pod experiencing issues.
NAMESPACE_NAME: the namespace that the Pod is in. If you're not sure what the namespace is, review the Namespace column from the output of the kubectl get pods command.
For more examples, see Kubernetes-related queries in the Google Cloud Observability documentation.
Click Run query.
To see the full log message, including the JSON payload, metadata, and timestamp, click the log entry.
For more information about GKE logs, see About GKE logs.
Perform proactive monitoring with Cloud Monitoring
After an issue occurs, reviewing logs is a critical step in troubleshooting. However, a truly resilient system also requires a proactive approach to identify problems before they cause an outage.
To proactively identify future problems and track key performance indicators over time, use Cloud Monitoring. Cloud Monitoring provides dashboards, metrics, and alerting capabilities. These tools help you find rising error rates, increasing latency, or resource constraints so that you can act before users are affected.
Review useful metrics
GKE automatically sends a set of metrics to Cloud Monitoring. The following sections list some of the most important metrics for troubleshooting:
For a complete list of GKE metrics, see GKE system metrics.
Container performance and health metrics
Start with these metrics when you suspect a problem with a specific app. These metrics help you monitor the health of your app, including discovering if a container is restarting frequently, running out of memory, or being throttled by CPU limits.
kubernetes.io/container/cpu/limit_utilization: the fraction of the CPU limit that is currently in use on the instance. This value can be greater than 1 because a container might be allowed to exceed its CPU limit. Troubleshooting significance: identifies CPU throttling. High values can lead to performance degradation.
kubernetes.io/container/memory/limit_utilization: the fraction of the memory limit that is currently in use on the instance. This value cannot exceed 1. Troubleshooting significance: monitors for risk of OutOfMemory (OOM) errors.
kubernetes.io/container/memory/used_bytes: actual memory consumed by the container in bytes. Troubleshooting significance: tracks memory consumption to identify potential memory leaks or risk of OOM errors.
kubernetes.io/container/memory/page_fault_count: number of page faults, broken down by type: major and minor. Troubleshooting significance: indicates significant memory pressure. Major page faults mean memory is being read from disk (swapping), even if memory limits aren't reached.
kubernetes.io/container/restart_count: number of times the container has restarted. Troubleshooting significance: a high or increasing number of restarts highlights potential problems such as crashing apps, misconfigurations, or resource exhaustion.
kubernetes.io/container/ephemeral_storage/used_bytes: local ephemeral storage usage in bytes. Troubleshooting significance: monitors temporary disk usage to prevent Pod evictions due to full ephemeral storage.
kubernetes.io/container/cpu/request_utilization: the fraction of the requested CPU that is currently in use on the instance. This value can be greater than 1 because usage can exceed the request. Troubleshooting significance: identifies over- or under-provisioned CPU requests to help you optimize resource allocation.
kubernetes.io/container/memory/request_utilization: the fraction of the requested memory that is currently in use on the instance. This value can be greater than 1 because usage can exceed the request. Troubleshooting significance: identifies over- or under-provisioned memory requests to improve scheduling and prevent OOM errors.
Node performance and health metrics
Examine these metrics when you need to diagnose issues with the underlying GKE infrastructure. These metrics are crucial for understanding the overall health and capacity of your nodes, helping you investigate whether the node is unhealthy or under pressure, or whether the node has enough memory to schedule new Pods.
kubernetes.io/node/cpu/allocatable_utilization: the fraction of the allocatable CPU that is currently in use on the instance. Troubleshooting significance: indicates whether the sum of Pod usage is straining the node's available CPU resources.
kubernetes.io/node/memory/allocatable_utilization: the fraction of the allocatable memory that is currently in use on the instance. This value cannot exceed 1 because usage cannot exceed allocatable memory bytes. Troubleshooting significance: high values suggest that the node lacks memory for scheduling new Pods or for existing Pods to operate.
kubernetes.io/node/status_condition (BETA): condition of a node from the node status condition field. Troubleshooting significance: reports node health conditions like Ready, MemoryPressure, or DiskPressure.
kubernetes.io/node/ephemeral_storage/used_bytes: local ephemeral storage bytes used by the node. Troubleshooting significance: helps prevent Pod startup failures or evictions by providing warnings about high ephemeral storage usage.
kubernetes.io/node/ephemeral_storage/inodes_free: free number of index nodes (inodes) on local ephemeral storage. Troubleshooting significance: monitors the number of free inodes. Running out of inodes can halt operations even if disk space is available.
kubernetes.io/node/interruption_count (BETA): interruptions are system evictions of infrastructure while the customer is in control of that infrastructure. This metric is the current count of interruptions by type and reason. Troubleshooting significance: explains why a node might disappear unexpectedly due to system evictions.
Pod performance and health metrics
These metrics help you troubleshoot issues related to a Pod's interaction with its environment, such as networking and storage. Use these metrics when you need to diagnose slow-starting Pods, investigate potential network connectivity issues, or proactively manage storage to prevent write failures from full volumes.
kubernetes.io/pod/network/received_bytes_count: cumulative number of bytes received by the Pod over the network. Troubleshooting significance: identifies unusual network activity (high or low) that can indicate app or network issues.
kubernetes.io/pod/network/policy_event_count (BETA): change in the number of network policy events seen in the dataplane. Troubleshooting significance: identifies connectivity issues caused by network policies.
kubernetes.io/pod/volume/utilization: the fraction of the volume that is currently being used by the instance. This value cannot be greater than 1 because usage cannot exceed the total available volume space. Troubleshooting significance: enables proactive management of volume space by warning when high utilization (approaching 1) might lead to write failures.
kubernetes.io/pod/latencies/pod_first_ready (BETA): the Pod end-to-end startup latency (from Pod Created to Ready), including image pulls. Troubleshooting significance: diagnoses slow-starting Pods.
Visualize metrics with Metrics Explorer
To visualize the state of your GKE environment, create charts based on metrics with Metrics Explorer.
To use Metrics Explorer, complete the following steps:
In the Google Cloud console, go to the Metrics Explorer page.
In the Metrics field, select or enter the metric that you want to inspect.
View the results and observe any trends over time.
For example, to investigate the memory consumption of Pods in a specific namespace, you can do the following:
In the Metrics field, select or enter kubernetes.io/container/memory/used_bytes.
Add a filter for the namespace that you want to investigate and click Apply.
The resulting chart shows you the memory usage for each Pod over time, which can help you visually identify any Pods with unusually high or spiking memory consumption.
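If you prefer to write the query yourself, Metrics Explorer also accepts PromQL. The following line is a minimal sketch that assumes the standard Cloud Monitoring PromQL name mapping for this metric and uses NAMESPACE_NAME as a placeholder:
kubernetes_io:container_memory_used_bytes{namespace_name="NAMESPACE_NAME"}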
Metrics Explorer has a great deal of flexibility in how to construct the metrics that you want to view. For more information about advanced Metrics Explorer options, see Create charts with Metrics Explorer in the Cloud Monitoring documentation.
Create alerts for proactive issue detection
To receive notifications when things go wrong or when metrics breach certain thresholds, set up alerting policies in Cloud Monitoring.
For example, to set up an alerting policy that notifies you when the container CPU limit is over 80% for five minutes, do the following:
In the Google Cloud console, go to the Alerting page.
Click Create policy.
In the Select a metric box, filter for CPU limit utilization and then select the following metric: kubernetes.io/container/cpu/limit_utilization.
Click Apply.
Leave the Add a filter field blank. This setting triggers an alert when any cluster violates your threshold.
In the Transform data section, do the following:
In the Rolling window list, select 1 min.
In the Rolling window function list, select mean.
Both of these settings average the CPU limit utilization for each container every minute.
Click Next.
In the Configure alert section, do the following:
In the Threshold value field, enter 0.8. This value represents the 80% threshold that you want to monitor for.
In the Configure the notifications and finalize the alert section, do the following:
Review your policy, and if it all looks correct, click Create policy.
To learn about the additional ways that you can create alerts, see Alerting overview in the Cloud Monitoring documentation.
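For example, one additional way is to manage alerting policies as code. The following JSON is a hedged sketch of a policy file that expresses the same condition as the preceding steps; you could create it with a command such as gcloud alpha monitoring policies create --policy-from-file=policy.json, if that command is available in your gcloud release, and you still need to attach your own notification channels:
{
  "displayName": "Container CPU limit utilization above 80%",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "CPU limit utilization > 0.8 for 5 minutes",
      "conditionThreshold": {
        "filter": "resource.type = \"k8s_container\" AND metric.type = \"kubernetes.io/container/cpu/limit_utilization\"",
        "comparison": "COMPARISON_GT",
        "thresholdValue": 0.8,
        "duration": "300s",
        "aggregations": [
          {
            "alignmentPeriod": "60s",
            "perSeriesAligner": "ALIGN_MEAN"
          }
        ]
      }
    }
  ]
}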
Accelerate diagnosis with Gemini Cloud Assist
Preview: This product or feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA products and features are available "as is" and might have limited support. For more information, see the launch stage descriptions.
Sometimes, the cause of your issue isn't immediately obvious, even after you use the tools discussed in the preceding sections. Investigating complex cases can be time-consuming and requires deep expertise. For scenarios like this, Gemini Cloud Assist can help. It can automatically detect hidden patterns, surface anomalies, and provide summaries to help you quickly pinpoint a likely cause.
As an early-stage technology, Gemini for Google Cloud products can generate output that seems plausible but is factually incorrect. We recommend that you validate all output from Gemini for Google Cloud products before you use it. For more information, see Gemini for Google Cloud and responsible AI.
Access Gemini Cloud Assist
To access Gemini Cloud Assist, complete the following steps:
In the Google Cloud console toolbar, click spark Open or close Gemini Cloud Assist chat.
The Cloud Assist panel opens. You can click example prompts if they are displayed, or you can enter a prompt in the Enter a prompt field.
To help you understand how Gemini Cloud Assist can help you, here are some example prompts:
Confusing error message. Scenario: a Pod has the CrashLoopBackoff status, but the error message is hard to understand. Example prompt: "What does this GKE Pod error mean and what are common causes: panic: runtime error: invalid memory address or nil pointer dereference?" How Gemini Cloud Assist can help: it analyzes the message and explains it in clear terms. It also offers potential causes and solutions.
Performance issues. Scenario: your team notices high latency for an app that runs in GKE. Example prompt: "My api-gateway service in the prod GKE cluster is experiencing high latency. What metrics should I check first, and can you suggest some common GKE-related causes for this?" How Gemini Cloud Assist can help: it suggests key metrics to examine, explores potential issues (for example, resource constraints or network congestion), and recommends tools and techniques for further investigation.
Node issues. Scenario: a GKE node is stuck with a status of NotReady. Example prompt: "One of my GKE nodes (node-xyz) is showing a NotReady status. What are the typical steps to troubleshoot this?" How Gemini Cloud Assist can help: it provides a step-by-step investigation plan, explaining concepts like node auto-repair and suggesting relevant kubectl commands.
Understanding GKE. Scenario: you're unsure about a specific GKE feature or how to implement a best practice. Example prompt: "What are the best practices for securing a GKE cluster? Is there any way I can learn more?" How Gemini Cloud Assist can help: it provides clear explanations of GKE best practices. Click Show related content to see links to official documentation.
For more information, see the following resources:
In addition to interactive chat, Gemini Cloud Assist can perform more automated, in-depth analysis through Gemini Cloud Assist Investigations. This feature is integrated directly into workflows like Logs Explorer, and is a powerful root-cause analysis tool.
When you initiate an investigation from an error or a specific resource, Gemini Cloud Assist analyzes logs, configurations, and metrics. It uses this data to produce ranked observations and hypotheses about probable root causes, and then provides you with recommended next steps. You can also transfer these results to a Google Cloud support case to provide valuable context that can help you resolve your issue faster.
For more information, see Gemini Cloud Assist Investigations in the Gemini documentation.
Put it all together: Example troubleshooting scenario
This example shows how you can use a combination of GKE tools to diagnose and understand a common real-world problem: a container that is repeatedly crashing due to insufficient memory.
The scenario
You are the on-call engineer for a web app named product-catalog that runs in GKE.
Your investigation begins when you receive an automated alert from Cloud Monitoring:
Alert: High memory utilization for container 'product-catalog' in 'prod' cluster.
This alert tells you that a problem exists and indicates that the problem has something to do with the product-catalog workload.
You start with a high-level view of your workloads to confirm the issue.
In the Google Cloud console, you go to the Workloads page and locate the product-catalog workload.
Instead of the healthy value of 3/3, you see the value steadily showing an unhealthy status: 2/3. This value tells you that one of your app's Pods doesn't have a status of Ready.
You click the name of the product-catalog workload to go to its details page.
On the details page, the Restarts column for your Pod shows 14, an unusually high number.
This high restart count confirms the issue is causing app instability, and suggests that a container is failing its health checks or crashing.
Find the reason with kubectl commands
Now that you know that your app is repeatedly restarting, you need to find out why. The kubectl describe command is a good tool for this.
You get the exact name of the unstable Pod:
kubectl get pods -n prod
The output is the following:
NAME READY STATUS RESTARTS AGE
product-catalog-d84857dcf-g7v2x 0/1 CrashLoopBackOff 14 25m
product-catalog-d84857dcf-lq8m4 1/1 Running 0 2h30m
product-catalog-d84857dcf-wz9p1 1/1 Running 0 2h30m
You describe the unstable Pod to get the detailed event history:
kubectl describe pod product-catalog-d84857dcf-g7v2x -n prod
You review the output and find clues under the Last State and Events sections:
Containers:
product-catalog-api:
...
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Mon, 23 Jun 2025 10:50:15 -0700
Finished: Mon, 23 Jun 2025 10:54:58 -0700
Ready: False
Restart Count: 14
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 25m default-scheduler Successfully assigned prod/product-catalog-d84857dcf-g7v2x to gke-cs-cluster-default-pool-8b8a777f-224a
Normal Pulled 8m (x14 over 25m) kubelet Container image "us-central1-docker.pkg.dev/my-project/product-catalog/api:v1.2" already present on machine
Normal Created 8m (x14 over 25m) kubelet Created container product-catalog-api
Normal Started 8m (x14 over 25m) kubelet Started container product-catalog-api
Warning BackOff 3m (x68 over 22m) kubelet Back-off restarting failed container
The output gives you two critical clues:
The Last State section shows that the container was terminated with Reason: OOMKilled, which tells you it ran out of memory. This reason is confirmed by the Exit Code: 137, which is the standard Linux exit code for a process that has been killed due to excessive memory consumption.
The Events section shows a Warning: BackOff event with the message Back-off restarting failed container. This message confirms that the container is in a failure loop, which is the direct cause of the CrashLoopBackOff status that you saw earlier.
The kubectl describe command told you what happened, but Cloud Monitoring can show you the behavior of your environment over time.
In Cloud Monitoring, you chart the container/memory/used_bytes metric for the affected container. The chart shows a distinct pattern: the memory usage climbs steadily, then abruptly drops to zero when the container is OOM killed and restarts. This visual evidence confirms either a memory leak or an insufficient memory limit.
Find the root cause in logs
You now know the container is running out of memory, but you still don't know exactly why. To discover the root cause, use Logs Explorer.
You write a query to filter for your specific container's logs from just before the time of the last crash (which you saw in the output of the kubectl describe command):
resource.type="k8s_container"
resource.labels.cluster_name="example-cluster"
resource.labels.namespace_name="prod"
resource.labels.pod_name="product-catalog-d84857dcf-g7v2x"
timestamp >= "2025-06-23T17:50:00Z"
timestamp < "2025-06-23T17:55:00Z"
In the logs, you find a repeating pattern of messages right before each crash:
{
"message": "Processing large image file product-image-large.jpg",
"severity": "INFO"
},
{
"message": "WARN: Memory cache size now at 248MB, nearing limit.",
"severity": "WARNING"
}
These log entries tell you that the app is trying to process large image files by loading them entirely into memory, which eventually exhausts the container's memory limit.
The findings
By using the tools together, you have a complete picture of the problem:
The Cloud Monitoring alert and the Google Cloud console showed you that the product-catalog workload was unhealthy and restarting repeatedly.
kubectl commands pinpointed the exact reason for the restarts (OOMKilled).
Cloud Monitoring charts confirmed the pattern: memory usage climbing until the container was OOM killed.
Cloud Logging revealed the root cause: the app loads large image files entirely into memory.
You're now ready to implement a solution. You can either optimize the app code to handle large files more efficiently or, as a short-term fix, increase the container's memory limit (specifically, the spec.containers.resources.limits.memory value) in the workload's YAML manifest.
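For example, here is a hedged sketch of how the container's memory settings might look after that short-term fix. In a Deployment, this block sits under spec.template.spec; the container name and image come from the earlier kubectl describe output, and the values are illustrative only:
containers:
- name: product-catalog-api
  image: us-central1-docker.pkg.dev/my-project/product-catalog/api:v1.2
  resources:
    requests:
      memory: "256Mi"  # amount reserved for scheduling (illustrative)
    limits:
      memory: "512Mi"  # illustrative new limit, raised from the value the app was exceeding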
For advice about resolving specific problems, review GKE's troubleshooting guides.
If you can't find a solution to your problem in the documentation, see Get support for further help, including advice on the following topics:
Getting support from the community by using the google-kubernetes-engine tag to search for similar issues. You can also join the #kubernetes-engine Slack channel for more community support.