Source: https://github.com/kubernetes/kubernetes/issues/88153

Windows node `PLEG not healthy` during load test with 1pod/s rate · Issue #88153 · kubernetes/kubernetes · GitHub

What happened:
When running a Windows pod startup latency load test at a pod creation rate of 1 pod/s, the kubelet on the node becomes NotReady with the error message `PLEG is not healthy: pleg was last seen active 3m8.068354s ago; threshold is 3m0s`:

Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Tue, 04 Feb 2020 19:11:39 -0800   Tue, 04 Feb 2020 19:11:39 -0800   RouteCreated                 NodeController create implicit route
  MemoryPressure       False   Fri, 07 Feb 2020 08:04:21 -0800   Wed, 05 Feb 2020 06:03:47 -0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Fri, 07 Feb 2020 08:04:21 -0800   Wed, 05 Feb 2020 06:03:47 -0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Fri, 07 Feb 2020 08:04:21 -0800   Wed, 05 Feb 2020 06:03:47 -0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                False   Fri, 07 Feb 2020 08:04:21 -0800   Fri, 07 Feb 2020 08:04:21 -0800   KubeletNotReady              PLEG is not healthy: pleg was last seen active 3m8.068354s ago; threshold is 3m0s

What you expected to happen:
By comparison, in the same load test on Linux nodes, even a rate of 5 pods/s still works fine.

Is there anything we can do to improve the performance of Windows nodes?

How to reproduce it (as minimally and precisely as possible):
For simplicity, I created a script that reproduces it:
https://gist.github.com/YangLu1031/a318ad5e92ae1e61102801fdb9109788
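The gist is the source of truth; as a rough sketch, a rate-limited creation loop might look like the following. The template file name, placeholder, and pod names are hypothetical, and `KUBECTL` can be overridden (e.g. `KUBECTL=echo`) for a dry run:

```shell
#!/usr/bin/env bash
# Sketch of a throttled pod-creation loop: one pod per second, so the
# kubelet's PLEG relist never has to absorb a large burst at once.
# "win-pod-template.yaml" is a hypothetical Windows pod spec containing
# the placeholder POD_NAME.
create_pods_throttled() {
  local count="${1:-100}" pause="${2:-1}"
  for i in $(seq 1 "$count"); do
    sed "s/POD_NAME/win-pod-$i/" win-pod-template.yaml \
      | "${KUBECTL:-kubectl}" apply -f -
    sleep "$pause"   # throttle: default 1 pod/s
  done
}

# Example: create_pods_throttled 100 1
```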

Anything else we need to know?:
#45419

Scenarios in which this failure happens:
There appear to be situations in current GKE Windows clusters where this issue can occur and then cause cascading, continuous node failures:

  1. A user slowly brings up 100 pods on Windows Node A in their cluster.
  2. Node A restarts for some reason: it crashes and reboots, the user manually restarts it, whatever.
  3. Node A comes back up in a few minutes and rejoins the cluster.
  4. Kubernetes tries to restart all 100 pods on Node A, all at the same time. Because the pods are not started in a rate-limited manner this time, this leads to the PLEG not healthy issue, and Node A becomes Unhealthy.
  5. Kubernetes notices that Node A is now unhealthy, stops trying to run the 100 pods on Node A, and schedules them all on Windows Node B instead. Now Node B hits the PLEG-not-healthy issue, and so on.

Steps to reproduce these cascading node failures through a Deployment / ReplicationController:

  1. Created a cluster with 2 Windows nodes.
  2. Slowly brought up 30 pods on Windows node A through a Deployment, by gradually updating replicas 10 -> 20 -> 30.
  3. Killed the kubelet on node A to simulate a crash; the node became NotReady.
  4. After 5 minutes, node B was scheduled all 30 pods at the same time, and then node B became unhealthy.
  5. Also tried setting a RollingUpdate strategy, but it does not seem to help in this scenario (the strategy only rate-limits Deployment rollouts, not rescheduling after a node failure):
     strategy:
       type: RollingUpdate
       rollingUpdate:
         maxSurge: 2        # how many pods can be added at a time
         maxUnavailable: 0  # how many pods can be unavailable during the rolling update
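For reference, the steps above can be sketched with kubectl. The deployment name `win-deploy` and the ramp schedule are placeholders; in a real run you would wait for pods to become Ready between scale steps:

```shell
#!/usr/bin/env bash
# Gradually ramp a Windows deployment, as in step 2 above.
# KUBECTL can be overridden (e.g. KUBECTL=echo) for a dry run.
ramp_deployment() {
  local deploy="$1"; shift
  for replicas in "$@"; do
    "${KUBECTL:-kubectl}" scale deployment "$deploy" --replicas="$replicas"
    # in a real run, also wait here: kubectl rollout status deployment "$deploy"
  done
}

# Step 2: ramp_deployment win-deploy 10 20 30
# Step 3: on Windows node A, stop the kubelet to simulate a crash,
#         e.g. in a PowerShell session on the node: Stop-Service kubelet
# Step 4: after the ~5 min pod-eviction timeout, watch the pods land on
#         node B all at once: kubectl get pods -o wide -w
```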

/sig windows
/cc @PatrickLang @dineshgovindasamy @pjh @yliaog @ddebroy
