Troubleshoot OOM events | Google Kubernetes Engine (GKE)

This page helps you troubleshoot and resolve Out of Memory (OOM) events in Google Kubernetes Engine (GKE). Learn to identify the common causes of OOM events, distinguish between container-level and node-level occurrences, and apply solutions.

This page is for Application developers who want to verify that their apps are successfully deployed and for Platform admins and operators who want to understand the root cause of OOM events and verify platform configuration. For more information about common roles and example tasks that we reference in Google Cloud content, see Common GKE user roles and tasks.

Common causes of OOM events

OOM events typically occur during load or traffic spikes, where app memory usage surges and reaches the memory limit configured for the container.
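
Container memory limits are defined in the Pod spec under resources.limits.memory. As a quick sketch (POD_NAME and NAMESPACE are placeholders), you can read the configured limit for each container with kubectl:

    # Print each container's name and configured memory limit.
    kubectl get pod POD_NAME -n NAMESPACE \
      -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.limits.memory}{"\n"}{end}'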

The following scenarios can cause an OOM event: a memory limit that is set too low for the workload, a memory leak in the app, or a sudden increase in traffic that pushes memory usage past the configured limit.

OOM events can initiate a cascading failure: when containers are terminated, fewer containers remain to handle the traffic, which increases the load on those that remain. These containers might then also be terminated.

How Kubernetes handles OOM events

The Linux OOM Killer handles every OOM event. The OOM Killer is a kernel process that activates when the system is critically low on memory. Its purpose is to prevent a total system crash by strategically terminating processes to free up resources. The kernel uses a scoring system to select which process to stop, aiming to preserve system stability and minimize data loss.

In a Kubernetes environment, the OOM Killer operates at two different scopes: the control group (cgroup), which affects one container; and the system, which affects the entire node.
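
As a rough illustration of that scoring (PID is a placeholder for a process ID on the node), you can inspect the values the kernel uses:

    # Higher oom_score means the process is more likely to be chosen by the OOM Killer.
    cat /proc/PID/oom_score
    # oom_score_adj biases the score; the kubelet sets it based on the Pod's QoS class.
    cat /proc/PID/oom_score_adj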

Container-level OOM kill

A container-level OOM kill occurs when a container attempts to exceed its predefined memory limit. Kubernetes assigns each container to a specific cgroup with a hard memory limit. When a container's memory usage reaches this limit, the kernel first tries to reclaim memory within that cgroup. If the kernel cannot reclaim enough memory by using this process, the cgroup OOM Killer is invoked. It terminates processes within that specific cgroup to enforce the resource boundary.

When the main process in a container is terminated this way, Kubernetes observes the event and marks the container's status as OOMKilled. The Pod's configured restartPolicy then dictates the outcome: with Always or OnFailure, the kubelet restarts the container; with Never, the container stays terminated and the Pod reports the failure.

By isolating the failure to the offending container, the OOM Killer prevents a single faulty Pod from crashing the entire node.
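
As a rough sketch (assuming cgroup v2 and a container image that includes cat; POD_NAME and NAMESPACE are placeholders), you can read the hard memory limit that the cgroup enforces from inside the container:

    # Prints the cgroup v2 memory limit in bytes, or "max" if no limit is set.
    kubectl exec POD_NAME -n NAMESPACE -- cat /sys/fs/cgroup/memory.max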

How cgroup version affects OOM Killer behavior

OOM kill behavior can differ significantly between cgroup versions. If you're not sure which cgroup version you use, check the cgroup mode of cluster nodes.
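
One quick way to check, assuming you can open a shell on the node (for example, over SSH as described later on this page), is to inspect the filesystem type mounted at /sys/fs/cgroup:

    # cgroup2fs indicates cgroup v2; tmpfs indicates cgroup v1.
    stat -fc %T /sys/fs/cgroup/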

System-level OOM kill

A system-level OOM kill is a more serious event that occurs when the entire node, not just a single container, runs out of available memory. This event can happen if the combined memory usage of all processes (including all Pods and system daemons) exceeds the node's capacity.

When the node runs out of memory, the global OOM Killer assesses all processes on the node and terminates a process to reclaim memory for the entire system. The selected process is usually one that is short-lived and uses a large amount of memory.

To prevent severe OOM situations, Kubernetes uses node-pressure eviction to manage node resources. This process involves evicting less critical Pods from a node when resources, such as memory or disk space, become critically low. A system-level OOM kill indicates that this eviction process couldn't free up memory fast enough to prevent the issue.
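
To see whether a node is reporting memory pressure, you can check its conditions; a quick sketch (NODE_NAME is a placeholder):

    # The MemoryPressure condition is True when the kubelet detects low node memory.
    kubectl describe node NODE_NAME | grep -A 8 "Conditions:"

    # Current memory usage per node (requires cluster metrics to be available).
    kubectl top node NODE_NAME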

If the OOM Killer terminates a container's process, the effect is usually identical to a cgroup-triggered kill: the container is marked OOMKilled and restarted based on its policy. However, if a critical system process is killed (which is rare), the node itself could become unstable.

Investigate OOM events

The following sections help you detect and confirm an OOM event, starting with the simplest Kubernetes tools and moving to more detailed log analysis.

Note: Don't rely solely on memory utilization metrics from monitoring tools. These metrics are collected at intervals and often miss the rapid spike that triggers an OOM kill.

Check the Pod status for visible OOM events

The first step in confirming an OOM event is to check if Kubernetes observed the OOM event. Kubernetes observes the event when the container's main process is killed, which is standard behavior in cgroup v2 environments.
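
A sketch of that check (POD_NAME and NAMESPACE are placeholders): describe the Pod and look at the last terminated state of its containers, or query the relevant fields directly.

    # Look for "Last State: Terminated" with "Reason: OOMKilled" and "Exit Code: 137".
    kubectl describe pod POD_NAME -n NAMESPACE

    # Or print each container's last termination reason and exit code:
    kubectl get pod POD_NAME -n NAMESPACE \
      -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'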

If you see OOMKilled in the Reason field, you have confirmed the event. An Exit Code of 137 also indicates an OOM kill. If the Reason field has a different value, or the Pod is still running despite app errors, proceed to the next section for further investigation.

Search logs for invisible OOM events

An OOM kill is "invisible" to Kubernetes if a child process is killed but the main container process continues to run (a common scenario in cgroup v1 environments). You must search the node's logs to find evidence of these events.

To find invisible OOM kills, use Logs Explorer:

  1. In the Google Cloud console, go to Logs Explorer.

    Go to Logs Explorer

  2. In the query pane, enter a query that surfaces OOM events from node kernel logs, scoped either to all GKE workloads or to a specific workload (a sketch is shown after this procedure).

  3. Click Run query.

  4. In the output, locate OOM events by searching for log entries containing the string TaskOOM.

  5. Optional: if you searched for OOM events for all GKE workloads and want to identify the specific Pod that experienced the OOM events, complete the following steps:

    1. For each event, make a note of the container ID that's associated with it.
    2. Identify container stoppages by looking for log entries that contain the string ContainerDied and that occurred shortly after the OOM events. Match the container ID from the OOM event to the corresponding ContainerDied line.

      Note: ContainerDied lines without a corresponding TaskOOM line (sharing the same container ID) indicate that the container terminated for reasons other than OOMs. You can rule out OOM as a cause for these containers and investigate alternative possibilities. For example, the container could have crashed due to a bug such as an unhandled exception or the failure of a container liveness probe. In those cases, Kubernetes restarts the container, resulting in a ContainerDied event.
    3. The matching ContainerDied line typically includes the name of the Pod that the failed container belonged to. This Pod was affected by the OOM event.
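
The exact queries depend on your project, but as a rough sketch (assuming node kernel logs are ingested under the k8s_node resource type; CLUSTER_NAME is a placeholder), a filter along these lines surfaces OOM events for all GKE workloads in a cluster:

    resource.type="k8s_node"
    resource.labels.cluster_name="CLUSTER_NAME"
    "TaskOOM"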

Use journalctl for real-time information

If you need to perform real-time analysis of your system, use journalctl commands.

  1. Connect to the node by using SSH:

    gcloud compute ssh NODE_NAME --location ZONE
    

    Replace the following:

    - NODE_NAME: the name of the node that you want to inspect.
    - ZONE: the Compute Engine zone of the node.

  2. In the shell, explore the kernel messages from the node's system journal:

    journalctl -k
    
  3. Analyze the output to distinguish the event type, as shown in the sketch after this procedure.
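
A rough sketch of that analysis (the exact kernel message text can vary by kernel version):

    # Container-level (cgroup) OOM kills mention the memory cgroup:
    journalctl -k | grep -i "memory cgroup out of memory"

    # System-level OOM kills report the whole node running out of memory:
    journalctl -k | grep -iE "out of memory: kill"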

Resolve OOM events

To resolve an OOM event, try the following solutions: increase the container's memory request and limit so that they reflect actual peak usage, fix memory leaks in the app, scale the workload horizontally so that traffic spikes are spread across more replicas, and, for node-level memory pressure, use node pools with more memory per node.
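
For example, raising a container's memory request and limit is often the first step. As a rough sketch (the Deployment name web, the container name app, and the values shown are placeholders, not recommendations):

    # Raise the memory request and limit for one container in a Deployment.
    kubectl set resources deployment/web --containers=app \
      --requests=memory=512Mi --limits=memory=1Gi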

What's next
