Introduction to GKE troubleshooting

This page introduces you to fundamental troubleshooting techniques for Google Kubernetes Engine (GKE). This page is for users who are new to Kubernetes and GKE and who want to learn effective troubleshooting practices.

This page provides an overview of tools and techniques for monitoring, diagnosing, and resolving issues with GKE.

Understand core concepts

If you're new to Kubernetes and GKE, understanding core concepts, like cluster architecture and the relationship between Pods and nodes, is essential before you start to troubleshoot. If you want to learn more, see Start learning about GKE.

It's also helpful to understand which parts of GKE you're responsible for maintaining and which parts Google Cloud is responsible for maintaining. For more information, see GKE shared responsibility.

Review cluster and workload health in the Google Cloud console

The Google Cloud console is a good starting point for troubleshooting because it provides a quick view of the health of your clusters and workloads. Cluster health refers to the health of the underlying GKE infrastructure like nodes and networking, while workload health refers to the status and performance of your apps running on the cluster.

The following sections describe the cluster and workload pages. To provide a complete picture of your app's health, the Google Cloud console also gives you access to powerful logging and monitoring tools, letting you investigate the root cause of past failures and proactively prevent future ones. For more information about these tools, see the Conduct historical analysis with Cloud Logging and Perform proactive monitoring with Cloud Monitoring sections.

Find cluster issues

The Kubernetes clusters page provides you with an overview of the health of your clusters. To identify problems with any of your clusters, start on this page.

Here are some examples of how you can use this page for troubleshooting:

Investigate a specific cluster

After you discover a problem with a cluster, explore the cluster's Details page for in-depth information that helps you troubleshoot your cluster and understand its configuration.

To go to a cluster's Details page, do the following:

  1. Go to the Kubernetes clusters page.

    Go to Kubernetes clusters

  2. Review the Name column and click the name of the cluster that you want to investigate.

Here are some examples of how to use the cluster Details page to troubleshoot your cluster:

Find workload issues

When you suspect that there's a problem with a specific app, like a failed Deployment, go to the Workloads page in the Google Cloud console. This page provides a centralized view of all of the apps that run within your clusters.

Tip: If you work in a project with many resources, use the Filter bar to narrow the view to a specific cluster or namespace.

Here are some examples of how you can use this page for troubleshooting:

Investigate a specific workload

After you identify a problematic workload from the overview, explore the workload Details page to begin to isolate the root cause.

To go to a workload's Details page, do the following:

  1. Go to the Workloads page.

    Go to Workloads

  2. View the Name column and click the name of the workload that you want to investigate.

Here are some examples of how to use the workload Details page to troubleshoot your workloads:

Investigate the cluster's state with the kubectl command-line tool

Although the Google Cloud console helps you understand if there's a problem, the kubectl command-line tool is essential for discovering why. By communicating directly with the Kubernetes control plane, the kubectl command-line tool lets you gather the detailed information that you need to troubleshoot your GKE environment.

The following sections introduce you to some essential commands that are a powerful starting point for GKE troubleshooting.

Before you begin

Before you start, make sure that the kubectl command-line tool is installed and configured to communicate with your cluster.
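
A minimal sketch of that setup with the gcloud CLI follows; CLUSTER_NAME, LOCATION, and PROJECT_ID are placeholders for your own values, and the exact steps might differ in your environment.

# Install kubectl and the GKE auth plugin if they aren't already installed.
gcloud components install kubectl gke-gcloud-auth-plugin

# Fetch cluster credentials so that kubectl can reach the control plane.
gcloud container clusters get-credentials CLUSTER_NAME \
    --location=LOCATION \
    --project=PROJECT_ID

# Verify that kubectl can talk to the cluster.
kubectl cluster-info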

Get an overview of what's running

The kubectl get command helps you to see an overall view of what's happening in your cluster. Use the following commands to see the status of two of the most important cluster components, nodes and Pods:

  1. To check if your nodes are healthy, view details about all nodes and their statuses:

    kubectl get nodes
    

    The output is similar to the following:

    NAME                                        STATUS   ROLES    AGE     VERSION
    
    gke-cs-cluster-default-pool-8b8a777f-224a   Ready    <none>   4d23h   v1.32.3-gke.1785003
    gke-cs-cluster-default-pool-8b8a777f-egb2   Ready    <none>   4d22h   v1.32.3-gke.1785003
    gke-cs-cluster-default-pool-8b8a777f-p5bn   Ready    <none>   4d22h   v1.32.3-gke.1785003
    

    Any status other than Ready requires additional investigation.

  2. To check if your Pods are healthy, view details about all Pods and their statuses:

    kubectl get pods --all-namespaces
    

    The output is similar to the following:

    NAMESPACE   NAME       READY   STATUS      RESTARTS   AGE
    kube-system netd-6nbsq 3/3     Running     0          4d23h
    kube-system netd-g7tpl 3/3     Running     0          4d23h
    

    Any status other than Running requires additional investigation. Here are some common statuses that you might see:

    * Pending: the Pod hasn't been scheduled to a node yet, often because of insufficient resources.
    * CrashLoopBackOff: a container keeps crashing, and the kubelet is waiting before restarting it again.
    * ImagePullBackOff: the node can't pull the container image, for example because of a wrong image name or missing permissions.
    * Completed: the Pod ran to completion, which is expected for Jobs rather than long-running apps.

The preceding commands are only two examples of how you can use the kubectl get command. You can also use the command to learn more about many types of Kubernetes resources. For a full list of the resources that you can explore, see kubectl get in the Kubernetes documentation.
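
For example, on a busy cluster you might want to list only the Pods that need attention. The following sketch shows two common ways to narrow the output; NAMESPACE_NAME and the app label are placeholders.

# List only Pods that are not in the Running phase, across all namespaces.
kubectl get pods --all-namespaces --field-selector=status.phase!=Running

# List the Pods for one app by label in a specific namespace.
kubectl get pods -n NAMESPACE_NAME -l app=example-app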

Tip: Add -o wide to your kubectl get commands to see additional information about your resources. For example, the kubectl get nodes -o wide command adds the following columns to the output: Internal-IP, External-IP, OS-Image, Kernel-Version, and Container-Runtime.

Learn more about specific resources

After you identify a problem, you need to get more details. An example of a problem could be a Pod that doesn't have a status of Running. To get more details, use the kubectl describe command.

For example, to describe a specific Pod, run the following command:

kubectl describe pod POD_NAME -n NAMESPACE_NAME

Replace the following:

  * POD_NAME: the name of the Pod that you want to investigate.
  * NAMESPACE_NAME: the namespace that the Pod runs in.

The output of the kubectl describe command includes detailed information about your resource. Here are some of the most helpful sections to review when you troubleshoot a Pod:

  * State and Last State: the current and previous states of each container, including the reason for the last termination (for example, OOMKilled) and the exit code.
  * Restart Count: how many times each container has restarted, which helps you spot crash loops.
  * Events: a chronological record of what happened to the Pod, such as scheduling decisions, image pulls, container restarts, and failed probes.

As with the kubectl get command, you can use the kubectl describe command to learn more about many types of resources. For a full list of the resources that you can explore, see kubectl describe in the Kubernetes documentation.
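
For example, the same approach works for nodes, and the namespace's event stream is often a quick way to spot recent failures. A short sketch; NODE_NAME and NAMESPACE_NAME are placeholders.

# Describe a node to review its conditions, capacity, and recent events.
kubectl describe node NODE_NAME

# List recent events in a namespace, sorted so the newest appear last.
kubectl get events -n NAMESPACE_NAME --sort-by=.lastTimestamp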

Conduct historical analysis with Cloud Logging

Although the kubectl command-line tool is invaluable for inspecting the live state of your Kubernetes objects, its view is often limited to the present moment. To understand the root cause of a problem, you often need to investigate what happened over time. When you need that historical context, use Cloud Logging.

Cloud Logging aggregates logs from your GKE clusters, containerized apps, and other Google Cloud services.

Understand key log types for troubleshooting

Cloud Logging automatically collects several different types of GKE logs that can help you troubleshoot:

Common troubleshooting scenarios

After you identify an issue, you can query these logs to find out what happened. For example, reviewing logs can help you diagnose issues like the following:

How to access logs

Use Logs Explorer to query, view, and analyze GKE logs in the Google Cloud console. Logs Explorer provides powerful filtering options that help you to isolate your issue.

To access and use Logs Explorer, complete the following steps:

  1. In the Google Cloud console, go to the Logs Explorer page.

    Go to Logs Explorer

  2. In the query pane, enter a query. Use the Logging query language to write targeted queries. Here are some common filters to get you started:

    * resource.type: The type of Kubernetes resource. Example values: k8s_cluster, k8s_node, k8s_pod, k8s_container.
    * log_id: The log stream from the resource. Example values: stdout, stderr.
    * resource.labels.RESOURCE_TYPE.name: Filter for resources with a specific name. Replace RESOURCE_TYPE with the name of the resource that you want to query, for example namespace or pod. Example values: example-namespace-name, example-pod-name.
    * severity: The log severity level. Example values: DEFAULT, INFO, WARNING, ERROR, CRITICAL.
    * jsonPayload.message=~: A regular expression search for text within the log message. Example value: scale.down.error.failed.to.delete.node.min.size.reached.

    For example, to troubleshoot a specific Pod, you might want to isolate its error logs. To see only logs with an ERROR severity for that Pod, use the following query:

    resource.type="k8s_container"
    resource.labels.pod_name="POD_NAME"
    resource.labels.namespace_name="NAMESPACE_NAME"
    severity=ERROR
    

    Replace the following:

    For more examples, see Kubernetes-related queries in the Google Cloud Observability documentation.

  3. Click Run query.

  4. To see the full log message, including the JSON payload, metadata, and timestamp, click the log entry.

For more information about GKE logs, see About GKE logs.
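
If you prefer the command line, you can run the same kind of filter with the gcloud CLI. The following is a minimal sketch, assuming the gcloud CLI is already configured for your project; POD_NAME and NAMESPACE_NAME are placeholders.

# Read the 20 most recent ERROR logs for a Pod from the last day.
gcloud logging read \
    'resource.type="k8s_container" AND resource.labels.pod_name="POD_NAME" AND resource.labels.namespace_name="NAMESPACE_NAME" AND severity=ERROR' \
    --limit=20 \
    --freshness=1d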

Perform proactive monitoring with Cloud Monitoring

After an issue occurs, reviewing logs is a critical step in troubleshooting. However, a truly resilient system also requires a proactive approach to identify problems before they cause an outage.

To proactively identify future problems and track key performance indicators over time, use Cloud Monitoring. Cloud Monitoring provides dashboards, metrics, and alerting capabilities. These tools help you find rising error rates, increasing latency, or resource constraints, which help you act before users are affected.

Review useful metrics

GKE automatically sends a set of metrics to Cloud Monitoring. The following sections list some of the most important metrics for troubleshooting:

For a complete list of GKE metrics, see GKE system metrics.

Container performance and health metrics

Start with these metrics when you suspect a problem with a specific app. These metrics help you monitor the health of your app, including discovering if a container is restarting frequently, running out of memory, or being throttled by CPU limits.

  * kubernetes.io/container/cpu/limit_utilization
    Description: The fraction of the CPU limit that is currently in use on the instance. This value can be greater than 1 as a container might be allowed to exceed its CPU limit.
    Troubleshooting significance: Identifies CPU throttling. High values can lead to performance degradation.

  * kubernetes.io/container/memory/limit_utilization
    Description: The fraction of the memory limit that is currently in use on the instance. This value cannot exceed 1.
    Troubleshooting significance: Monitors for risk of OutOfMemory (OOM) errors.

  * kubernetes.io/container/memory/used_bytes
    Description: Actual memory consumed by the container in bytes.
    Troubleshooting significance: Tracks memory consumption to identify potential memory leaks or risk of OOM errors.

  * kubernetes.io/container/memory/page_fault_count
    Description: Number of page faults, broken down by type: major and minor.
    Troubleshooting significance: Indicates significant memory pressure. Major page faults mean memory is being read from disk (swapping), even if memory limits aren't reached.

  * kubernetes.io/container/restart_count
    Description: Number of times the container has restarted.
    Troubleshooting significance: Highlights potential problems such as crashing apps, misconfigurations, or resource exhaustion through a high or increasing number of restarts.

  * kubernetes.io/container/ephemeral_storage/used_bytes
    Description: Local ephemeral storage usage in bytes.
    Troubleshooting significance: Monitors temporary disk usage to prevent Pod evictions due to full ephemeral storage.

  * kubernetes.io/container/cpu/request_utilization
    Description: The fraction of the requested CPU that is currently in use on the instance. This value can be greater than 1 as usage can exceed the request.
    Troubleshooting significance: Identifies over- or under-provisioned CPU requests to help you optimize resource allocation.

  * kubernetes.io/container/memory/request_utilization
    Description: The fraction of the requested memory that is currently in use on the instance. This value can be greater than 1 as usage can exceed the request.
    Troubleshooting significance: Identifies over- or under-provisioned memory requests to improve scheduling and prevent OOM errors.

Node performance and health metrics

Examine these metrics when you need to diagnose issues with the underlying GKE infrastructure. These metrics are crucial for understanding the overall health and capacity of your nodes, helping you investigate whether the node is unhealthy or under pressure, or whether the node has enough memory to schedule new Pods.

  * kubernetes.io/node/cpu/allocatable_utilization
    Description: The fraction of the allocatable CPU that is currently in use on the instance.
    Troubleshooting significance: Indicates if the sum of Pod usage is straining the node's available CPU resources.

  * kubernetes.io/node/memory/allocatable_utilization
    Description: The fraction of the allocatable memory that is currently in use on the instance. This value cannot exceed 1 as usage cannot exceed allocatable memory bytes.
    Troubleshooting significance: Suggests that the node lacks memory for scheduling new Pods or for existing Pods to operate, especially when values are high.

  * kubernetes.io/node/status_condition (BETA)
    Description: Condition of a node from the node status condition field.
    Troubleshooting significance: Reports node health conditions like Ready, MemoryPressure, or DiskPressure.

  * kubernetes.io/node/ephemeral_storage/used_bytes
    Description: Local ephemeral storage bytes used by the node.
    Troubleshooting significance: Helps prevent Pod startup failures or evictions by providing warnings about high ephemeral storage usage.

  * kubernetes.io/node/ephemeral_storage/inodes_free
    Description: Free number of index nodes (inodes) on local ephemeral storage.
    Troubleshooting significance: Monitors the number of free inodes. Running out of inodes can halt operations even if disk space is available.

  * kubernetes.io/node/interruption_count (BETA)
    Description: Interruptions are system evictions of infrastructure while the customer is in control of that infrastructure. This metric is the current count of interruptions by type and reason.
    Troubleshooting significance: Explains why a node might disappear unexpectedly due to system evictions.

Pod performance and health metrics

These metrics help you troubleshoot issues related to a Pod's interaction with its environment, such as networking and storage. Use these metrics when you need to diagnose slow-starting Pods, investigate potential network connectivity issues, or proactively manage storage to prevent write failures from full volumes.

  * kubernetes.io/pod/network/received_bytes_count
    Description: Cumulative number of bytes received by the Pod over the network.
    Troubleshooting significance: Identifies unusual network activity (high or low) that can indicate app or network issues.

  * kubernetes.io/pod/network/policy_event_count (BETA)
    Description: Change in the number of network policy events seen in the dataplane.
    Troubleshooting significance: Identifies connectivity issues caused by network policies.

  * kubernetes.io/pod/volume/utilization
    Description: The fraction of the volume that is currently being used by the instance. This value cannot be greater than 1 as usage cannot exceed the total available volume space.
    Troubleshooting significance: Enables proactive management of volume space by warning when high utilization (approaching 1) might lead to write failures.

  * kubernetes.io/pod/latencies/pod_first_ready (BETA)
    Description: The Pod end-to-end startup latency (from Pod Created to Ready), including image pulls.
    Troubleshooting significance: Diagnoses slow-starting Pods.

Visualize metrics with Metrics Explorer

To visualize the state of your GKE environment, create charts based on metrics with Metrics Explorer.

To use Metrics Explorer, complete the following steps:

  1. In the Google Cloud console, go to the Metrics Explorer page.

    Go to Metrics Explorer

  2. In the Metrics field, select or enter the metric that you want to inspect.

  3. View the results and observe any trends over time.

For example, to investigate the memory consumption of Pods in a specific namespace, you can do the following:

  1. In the Select a metric list, choose the metric kubernetes.io/container/memory/used_bytes and click Apply.
  2. Click Add filter and select namespace_name.
  3. In the Value list, select the namespace you want to investigate.
  4. In the Aggregation field, select Sum > pod_name and click OK. This setting displays a separate time series line for each Pod.
  5. Click Save chart.

The resulting chart shows you the memory usage for each Pod over time, which can help you visually identify any Pods with unusually high or spiking memory consumption.

Metrics Explorer provides a great deal of flexibility in how you construct the charts that you want to view. For more information about advanced Metrics Explorer options, see Create charts with Metrics Explorer in the Cloud Monitoring documentation.
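
For a quick point-in-time check from the command line, you can also read current usage with kubectl top, which relies on the Kubernetes metrics API that GKE provides. This complements, rather than replaces, the historical charts in Metrics Explorer; NAMESPACE_NAME is a placeholder.

# Show current memory usage for each Pod in a namespace, highest first.
kubectl top pods -n NAMESPACE_NAME --sort-by=memory

# Show current CPU and memory usage for each node.
kubectl top nodes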

Create alerts for proactive issue detection

To receive notifications when things go wrong or when metrics breach certain thresholds, set up alerting policies in Cloud Monitoring.

For example, to set up an alerting policy that notifies you when the container CPU limit is over 80% for five minutes, do the following:

  1. In the Google Cloud console, go to the Alerting page.

    Go to Alerting

  2. Click Create policy.

  3. In the Select a metric box, filter for CPU limit utilization and then select the following metric: kubernetes.io/container/cpu/limit_utilization.

  4. Click Apply.

  5. Leave the Add a filter field blank. This setting triggers an alert when any cluster violates your threshold.

  6. In the Transform data section, do the following:

    1. In the Rolling window list, select 1 minute. This setting means that Google Cloud calculates an average value every minute.
    2. In the Rolling window function list, select mean.

      Both of these settings average the CPU limit utilization for each container every minute.

  7. Click Next.

  8. In the Configure alert section, do the following:

    1. For Condition type, select Threshold.
    2. For Alert trigger, select Any time series violates.
    3. For Threshold position, select Above threshold.
    4. For Threshold value, enter 0.8. This value represents the 80% threshold that you want to monitor for.
    5. Click Advanced options.
    6. In the Retest window list, select 5 min. This setting means that the alert triggers only if the CPU utilization stays over 80% for a continuous five-minute period, which reduces false alarms from brief spikes.
    7. In the Condition name field, give the condition a descriptive name.
    8. Click Next.
  9. In the Configure the notifications and finalize the alert section, do the following:

    1. In the Notification channels list, select the channel where you want to receive the alert. If you don't have a channel, click Manage notification channels to create one.
    2. In the Name the alert policy field, give the policy a clear and descriptive name.
    3. Leave all other fields with their default values.
    4. Click Next.
  10. Review your policy, and if it all looks correct, click Create policy.

To learn about the additional ways that you can create alerts, see Alerting overview in the Cloud Monitoring documentation.
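
If you manage alerting as configuration rather than through the console, a similar policy can be defined in a file and created with the gcloud CLI. The following is a rough sketch of an equivalent policy, not a verified recipe: the display names and the file name are hypothetical, and you should check the fields against the current Cloud Monitoring API before relying on it.

# Write an alerting policy that fires when mean CPU limit utilization
# stays above 0.8 (80%) for 300 seconds (5 minutes).
cat > cpu-limit-policy.json <<'EOF'
{
  "displayName": "Container CPU limit utilization above 80%",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "CPU limit utilization > 0.8 for 5 minutes",
      "conditionThreshold": {
        "filter": "metric.type=\"kubernetes.io/container/cpu/limit_utilization\" AND resource.type=\"k8s_container\"",
        "comparison": "COMPARISON_GT",
        "thresholdValue": 0.8,
        "duration": "300s",
        "aggregations": [
          { "alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_MEAN" }
        ]
      }
    }
  ]
}
EOF

# Create the policy. Add --notification-channels=CHANNEL_ID to route the alert.
gcloud alpha monitoring policies create --policy-from-file=cpu-limit-policy.json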

Accelerate diagnosis with Gemini Cloud Assist

Preview

This product or feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA products and features are available "as is" and might have limited support. For more information, see the launch stage descriptions.

Sometimes, the cause of your issue isn't immediately obvious, even after you use the tools discussed in the preceding sections. Investigating complex cases can be time-consuming and require deep expertise. For scenarios like this, Gemini Cloud Assist can help. It can automatically detect hidden patterns, surface anomalies, and provide summaries to help you quickly pinpoint a likely cause.

As an early-stage technology, Gemini for Google Cloud products can generate output that seems plausible but is factually incorrect. We recommend that you validate all output from Gemini for Google Cloud products before you use it. For more information, see Gemini for Google Cloud and responsible AI.

Access Gemini Cloud Assist

To access Gemini Cloud Assist, complete the following steps:

  1. In the Google Cloud console, go to any page.
  2. In the Google Cloud console toolbar, click Open or close Gemini Cloud Assist chat.

    The Cloud Assist panel opens. You can click example prompts if they are displayed, or you can enter a prompt in the Enter a prompt field.

Explore example prompts

To help you understand how Gemini Cloud Assist can help you, here are some example prompts:

  * Confusing error message
    Scenario: A Pod has the CrashLoopBackOff status, but the error message is hard to understand.
    Example prompt: "What does this GKE Pod error mean and what are common causes: panic: runtime error: invalid memory address or nil pointer dereference?"
    How Gemini Cloud Assist can help: Gemini Cloud Assist analyzes the message and explains it in clear terms. It also offers potential causes and solutions.

  * Performance issues
    Scenario: Your team notices high latency for an app that runs in GKE.
    Example prompt: "My api-gateway service in the prod GKE cluster is experiencing high latency. What metrics should I check first, and can you suggest some common GKE-related causes for this?"
    How Gemini Cloud Assist can help: Gemini Cloud Assist suggests key metrics to examine, explores potential issues (for example, resource constraints or network congestion), and recommends tools and techniques for further investigation.

  * Node issues
    Scenario: A GKE node is stuck with a status of NotReady.
    Example prompt: "One of my GKE nodes (node-xyz) is showing a NotReady status. What are the typical steps to troubleshoot this?"
    How Gemini Cloud Assist can help: Gemini Cloud Assist provides a step-by-step investigation plan, explaining concepts like node auto-repair and suggesting relevant kubectl commands.

  * Understanding GKE
    Scenario: You're unsure about a specific GKE feature or how to implement a best practice.
    Example prompt: "What are the best practices for securing a GKE cluster? Is there any way I can learn more?"
    How Gemini Cloud Assist can help: Gemini Cloud Assist provides clear explanations of GKE best practices. Click Show related content to see links to official documentation.

For more information, see the following resources:

Use Gemini Cloud Assist Investigations

In addition to interactive chat, Gemini Cloud Assist can perform more automated, in-depth analysis through Gemini Cloud Assist Investigations. This feature is integrated directly into workflows like Logs Explorer, and is a powerful root-cause analysis tool.

When you initiate an investigation from an error or a specific resource, Gemini Cloud Assist analyzes logs, configurations, and metrics. It uses this data to produce ranked observations and hypotheses about probable root causes, and then provides you with recommended next steps. You can also transfer these results to a Google Cloud support case to provide valuable context that can help you resolve your issue faster.

For more information, see Gemini Cloud Assist Investigations in the Gemini documentation.

Put it all together: Example troubleshooting scenario

This example shows how you can use a combination of GKE tools to diagnose and understand a common real-world problem: a container that is repeatedly crashing due to insufficient memory.

The scenario

You are the on-call engineer for a web app named product-catalog that runs in GKE.

Your investigation begins when you receive an automated alert from Cloud Monitoring:

Alert: High memory utilization for container 'product-catalog' in 'prod' cluster.

This alert tells you that a problem exists and indicates that the problem has something to do with the product-catalog workload.

Confirm the problem in the Google Cloud console

You start with a high-level view of your workloads to confirm the issue.

  1. In the Google Cloud console, you navigate to the Workloads page and filter for your product-catalog workload.
  2. You look at the Pods status column. Instead of the healthy value of 3/3, you see an unhealthy value: 2/3. This value tells you that one of your app's Pods doesn't have a status of Ready.
  3. You want to investigate further, so you click the name of the product-catalog workload to go to its details page.
  4. On the details page, you view the Managed Pods section. You immediately identify a problem: the Restarts column for your Pod shows 14, an unusually high number.

This high restart count confirms that the issue is causing app instability and suggests that a container is failing its health checks or crashing.

Find the reason with kubectl commands

Now that you know that your app is repeatedly restarting, you need to find out why. The kubectl describe command is a good tool for this.

  1. You get the exact name of the unstable Pod:

    kubectl get pods -n prod
    

    The output is the following:

    NAME                             READY  STATUS            RESTARTS  AGE
    product-catalog-d84857dcf-g7v2x  0/1    CrashLoopBackOff  14        25m
    product-catalog-d84857dcf-lq8m4  1/1    Running           0         2h30m
    product-catalog-d84857dcf-wz9p1  1/1    Running           0         2h30m
    
  2. You describe the unstable Pod to get the detailed event history:

    kubectl describe pod product-catalog-d84857dcf-g7v2x -n prod
    
  3. You review the output and find clues under the Last State and Events sections:

    Containers:
      product-catalog-api:
        ...
        State:          Waiting
          Reason:       CrashLoopBackOff
        Last State:     Terminated
          Reason:       OOMKilled
          Exit Code:    137
          Started:      Mon, 23 Jun 2025 10:50:15 -0700
          Finished:     Mon, 23 Jun 2025 10:54:58 -0700
        Ready:          False
        Restart Count:  14
    ...
    Events:
      Type     Reason     Age                           From                Message
      ----     ------     ----                          ----                -------
      Normal   Scheduled  25m                           default-scheduler   Successfully assigned prod/product-catalog-d84857dcf-g7v2x to gke-cs-cluster-default-pool-8b8a777f-224a
      Normal   Pulled     8m (x14 over 25m)             kubelet             Container image "us-central1-docker.pkg.dev/my-project/product-catalog/api:v1.2" already present on machine
      Normal   Created    8m (x14 over 25m)             kubelet             Created container product-catalog-api
      Normal   Started    8m (x14 over 25m)             kubelet             Started container product-catalog-api
      Warning  BackOff    3m (x68 over 22m)             kubelet             Back-off restarting failed container
    

    The output gives you two critical clues:

    * In the Last State section, Reason: OOMKilled with Exit Code: 137 tells you that the container was terminated because it exceeded its memory limit.
    * In the Events section, the repeated Created and Started events (x14 over 25m) followed by the BackOff warning tell you that the kubelet keeps restarting the failing container, which matches the CrashLoopBackOff status.

Visualize the behavior with metrics

The kubectl describe command told you what happened, but Cloud Monitoring can show you the behavior of your environment over time.

  1. In the Google Cloud console, you go to Metrics Explorer.
  2. You select the container/memory/used_bytes metric.
  3. You filter the output down to your specific cluster, namespace, and Pod name.

The chart shows a distinct pattern: the memory usage climbs steadily, then abruptly drops to zero when the container is OOM killed and restarts. This visual evidence confirms either a memory leak or an insufficient memory limit.

Find the root cause in logs

You now know the container is running out of memory, but you still don't know exactly why. To discover the root cause, use Logs Explorer.

  1. In the Google Cloud console, you navigate to Logs Explorer.
  2. You write a query to filter for your specific container's logs from just before the time of the last crash (which you saw in the output of the kubectl describe command):

    resource.type="k8s_container"
    resource.labels.cluster_name="example-cluster"
    resource.labels.namespace_name="prod"
    resource.labels.pod_name="product-catalog-d84857dcf-g7v2x"
    timestamp >= "2025-06-23T17:50:00Z"
    timestamp < "2025-06-23T17:55:00Z"
    
  3. In the logs, you find a repeating pattern of messages right before each crash:

    {
      "message": "Processing large image file product-image-large.jpg",
      "severity": "INFO"
    },
    {
      "message": "WARN: Memory cache size now at 248MB, nearing limit.",
      "severity": "WARNING"
    }
    

These log entries tell you that the app is trying to process large image files by loading them entirely into memory, which eventually exhausts the container's memory limit.

The findings

By using the tools together, you have a complete picture of the problem:

  * A Cloud Monitoring alert flagged high memory utilization for the product-catalog container.
  * The Google Cloud console showed an unhealthy Pod count (2/3) and a high restart count (14).
  * The kubectl describe command showed that the container is terminated with Reason: OOMKilled and then enters CrashLoopBackOff.
  * Metrics Explorer showed memory usage climbing steadily until the container is OOM killed.
  * Cloud Logging showed that the app loads large image files entirely into memory right before each crash.

You're now ready to implement a solution. You can either optimize the app code to handle large files more efficiently or, as a short-term fix, increase the container's memory limit (specifically, the spec.containers.resources.limits.memory value) in the workload's YAML manifest.
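
For example, the short-term fix can be applied without hand-editing the manifest. The following sketch assumes the Deployment is named product-catalog and the container is product-catalog-api, as in this scenario; the 512Mi value is only an illustration, so choose a limit based on your own measurements.

# Raise the container's memory limit as a stopgap while the code is fixed.
kubectl set resources deployment/product-catalog \
    -c=product-catalog-api \
    --limits=memory=512Mi \
    -n prod

# Watch the rollout and confirm that the Pods become Ready and stay Ready.
kubectl rollout status deployment/product-catalog -n prod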

What's next
