This page describes flex-start provisioning mode in Google Kubernetes Engine (GKE). Flex-start, powered by Dynamic Workload Scheduler, provides a flexible and cost-effective way to obtain GPUs and TPUs when you need to run AI/ML workloads.
Flex-start lets you dynamically provision GPUs and TPUs as needed, for up to seven days, without being bound to a specific start time and without having to manage long-term reservations. Therefore, flex-start works well for small to medium-sized workloads with fluctuating demand or short durations, such as small model pre-training, model fine-tuning, or scalable model serving.
The information on this page can help you understand when to use flex-start, how it works, and how to configure and optimize it for your workloads.
This page is intended for Platform admins and operators and Machine learning (ML) engineers who want to optimize accelerator infrastructure for their workloads.
When to use flex-start

We recommend that you use flex-start if your workloads meet all of the following conditions:

- Your workloads require GPU or TPU resources that are provisioned dynamically, as needed, for up to seven days with short-term reservations.
- Your workloads don't require complex quota management.
- Your workloads need cost-effective access to accelerators.

Flex-start is powered by Dynamic Workload Scheduler and is billed at Dynamic Workload Scheduler pricing.
To use flex-start in GKE, your cluster must meet the following requirements:
To run TPUs, use GKE version 1.33.0-gke.1712000 or later. Flex-start for TPUs is supported only for specific TPU versions and only in specific zones, including asia-northeast1-b, us-east5-a, us-east5-b, and us-west4-a. TPU v3 and TPU v4 are not supported.
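If you want to confirm which GKE versions are available in a zone before you create or upgrade a cluster, you can query the server configuration. The following command is a sketch; us-east5-a is used only as an example zone:

# List the GKE control plane versions that are available in the zone.
gcloud container get-server-config \
    --location=us-east5-a \
    --format="yaml(validMasterVersions)"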
With flex-start, you specify the required GPU or TPU capacity in your workloads. Additionally, with Standard clusters, you configure flex-start on specific node pools. GKE automatically provisions VMs by completing the following process when capacity becomes available:
- GKE provisions the nodes and runs your workloads for up to the duration that you specify in the maxRunDurationSeconds parameter. If you don't specify a value for the maxRunDurationSeconds parameter, the default is seven days.
- When the duration specified in the maxRunDurationSeconds parameter ends, the nodes and the Pods are preempted.

GKE counts the duration for each flex-start request separately, at the node level. The time that's available for running Pods might be slightly shorter because of delays during startup. Pod retries share this duration, which means that less time is available for Pods after a retry.
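As an illustrative sketch of how a workload can request flex-start capacity, the following Job selects flex-start nodes by using the cloud.google.com/gke-flex-start node selector and requests one GPU. The Job name and container image are placeholders, and the manifest assumes that your cluster is already set up for flex-start:

apiVersion: batch/v1
kind: Job
metadata:
  name: flex-start-training-job  # placeholder name for illustration
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-flex-start: "true"  # run on flex-start (Dynamic Workload Scheduler) nodes
      containers:
        - name: trainer
          image: us-docker.pkg.dev/PROJECT_ID/REPO/trainer:latest  # placeholder image
          resources:
            requests:
              nvidia.com/gpu: "1"  # request one GPU from the flex-start node
            limits:
              nvidia.com/gpu: "1"
      restartPolicy: Never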
Flex-start configurations

GKE supports the following flex-start configurations:

- Flex-start, where GKE allocates the requested resources node by node as capacity becomes available. To use this configuration, you add the --flex-start flag during node creation.
- Flex-start with queued provisioning, where GKE allocates all requested resources at the same time. To use this configuration, you add the --flex-start and --enable-queued-provisioning flags when you create the node pool. GKE follows the process in How flex-start provisioning mode works in this document, but also applies additional conditions.

Example node pool creation commands for both configurations follow this list.
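The following commands are a sketch of how you might create a Standard node pool for each configuration. The uppercase values and the accelerator settings are placeholders for your environment. The queued provisioning command also sets the --reservation-affinity=none flag and the ANY location policy that are described in the Limitations section of this page:

# Flex-start: nodes are provisioned individually as capacity becomes available.
gcloud container node-pools create FLEX_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --machine-type=MACHINE_TYPE \
    --accelerator=type=ACCELERATOR_TYPE,count=ACCELERATOR_COUNT \
    --flex-start \
    --enable-autoscaling \
    --num-nodes=0 \
    --total-max-nodes=MAX_NODES \
    --reservation-affinity=none \
    --no-enable-autorepair

# Flex-start with queued provisioning: all requested nodes are allocated at the same time.
gcloud container node-pools create QUEUED_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --machine-type=MACHINE_TYPE \
    --accelerator=type=ACCELERATOR_TYPE,count=ACCELERATOR_COUNT \
    --flex-start \
    --enable-queued-provisioning \
    --enable-autoscaling \
    --num-nodes=0 \
    --total-max-nodes=MAX_NODES \
    --location-policy=ANY \
    --reservation-affinity=none \
    --no-enable-autorepair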
The following table compares the flex-start configurations:
Optimize flex-start configuration

To create robust and cost-optimized AI/ML infrastructure, you can combine flex-start configurations with available GKE features. We recommend that you use compute classes to define a prioritized list of node configurations based on your workload requirements. GKE selects the most suitable configuration based on availability and your defined priority.
Manage disruptions in workloads that use Dynamic Workload Scheduler

Workloads that require the availability of all or most nodes in a node pool are sensitive to evictions. In addition, nodes that are provisioned by using Dynamic Workload Scheduler requests don't support automatic repair, because automatic repair removes all workloads from a node and thus prevents them from running.
All nodes using flex-start, queued provisioning, or both, use short-lived upgrades when the cluster control plane runs the minimum version for flex-start, 1.32.2-gke.1652000 or later.
Short-lived upgrades update a Standard node pool or group of nodes in an Autopilot cluster without disrupting running nodes. New nodes are created with the new configuration, gradually replacing existing nodes with the old configuration over time. Earlier versions of GKE, which don't support flex-start or short-lived upgrades, require different best practices.
Best practices to minimize workload disruptions for nodes using short-lived upgrades

Flex-start nodes and nodes that use queued provisioning are automatically configured to use short-lived upgrades when the cluster runs version 1.32.2-gke.1652000 or later.
To minimize disruptions to workloads running on nodes that use short-lived upgrades, perform the following tasks:
For nodes on clusters running versions earlier than 1.32.2-gke.1652000, and thus not using short-lived upgrades, refer to the specific guidance for those nodes.
Best practices to minimize workload disruption for queued provisioning nodes without short-lived upgrades

Nodes that use queued provisioning on a cluster running a GKE version earlier than 1.32.2-gke.1652000 don't use short-lived upgrades. Clusters upgraded to 1.32.2-gke.1652000 or later with existing queued provisioning nodes are automatically updated to use short-lived upgrades.
For nodes running these earlier versions, refer to the following guidance:
GKE updates existing nodes using queued provisioning to use short-lived upgrades when the cluster is upgraded to version 1.32.2-gke.1652000 or later. GKE doesn't update other settings, such as enabling node auto-upgrades if you disabled them for a specific node pool.
We recommend that you consider implementing the following best practices now that your node pools use short-lived upgrades:
- If you previously disabled node auto-upgrades by using the --no-enable-autoupgrade flag, this migration doesn't re-enable node auto-upgrades for the node pool. We recommend that you enable node auto-upgrades, because short-lived upgrades are not disruptive to existing nodes and the workloads that run on them (a sample command follows this item). For more information, see Short-lived upgrades.
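If auto-upgrades were previously disabled, you can re-enable them for the node pool. The following command is a sketch with placeholder names:

# Re-enable node auto-upgrades for an existing node pool.
gcloud container node-pools update NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --enable-autoupgrade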
Node recycling

To help ensure a smooth transition of nodes and prevent downtime for your running jobs, flex-start supports node recycling. When a node reaches the end of its duration, GKE automatically replaces the node with a new one to preserve your running workloads.
To use node recycling, you must create a custom compute class profile and include the nodeRecycling field in the flexStart specification with the leadTimeSeconds parameter.

The leadTimeSeconds parameter lets you balance resource availability and cost efficiency. This parameter specifies how early, in seconds, before a node reaches the end of its seven-day duration a new node provisioning process should start to replace it. A longer lead time increases the probability that the new node is ready before the old one is removed, but might incur additional costs.
The node recycling process consists of the following steps:
Recycling phase: GKE validates that a flex-start-provisioned node has the nodeRecycling field with the leadTimeSeconds parameter set. If so, GKE starts the node recycling phase when the current date is greater than or equal to the difference between the values from the following fields:

- creationTimestamp plus maxRunDurationSeconds
- leadTimeSeconds

The creationTimestamp field records the time when the node was created. The maxRunDurationSeconds field can be specified in the custom compute class, and defaults to seven days. A worked example of this timing follows this list.
Node creation: the creation process for the new node begins, proceeding through queueing and provisioning phases. The duration of the queueing phase can vary dynamically depending on the zone and specific accelerator capacity.
Cordon the node that's reaching the end of its seven-day duration: after the new node is running, the old node is cordoned. This action prevents any new Pods from being scheduled on it. Existing Pods in that node continue to run.
Node deprovisioning: the node that's reaching the end of its seven-day duration is eventually deprovisioned after a suitable period, which helps ensure that running workloads have migrated to the new node.
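For example, suppose that a node was created at 2025-01-01 00:00 UTC with maxRunDurationSeconds set to 604800 (seven days) and leadTimeSeconds set to 3600. GKE starts the recycling phase at 2025-01-07 23:00 UTC, which is the creation time plus the run duration, minus the lead time. The replacement node then has roughly one hour to queue and provision before the original node reaches the end of its duration. The dates and values in this example are illustrative only.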
The following example of a compute class configuration includes the leadTimeSeconds and maxRunDurationSeconds fields:
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: dws-model-inference-class
spec:
  priorities:
    - machineType: g2-standard-24
      spot: true
    - machineType: g2-standard-24
      maxRunDurationSeconds: 72000
      flexStart:
        enabled: true
        nodeRecycling:
          leadTimeSeconds: 3600
  nodePoolAutoCreation:
    enabled: true
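To run a workload on nodes that GKE creates for this compute class, reference the class by name with the cloud.google.com/compute-class node selector in your Pod template. The following snippet is a minimal sketch; only the node selector is specific to the example class:

    spec:
      nodeSelector:
        cloud.google.com/compute-class: dws-model-inference-class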
For more information about how to use node recycling, try the Serve LLMs on GKE with a cost-optimized and high-availability GPU provisioning strategy tutorial.
Limitations

- You must specify the --reservation-affinity=none flag when you create the node pool. Dynamic Workload Scheduler requires and supports only the ANY location policy for cluster autoscaling.
- GKE uses the ACTIVE_RESIZE_REQUESTS quota to control the number of Dynamic Workload Scheduler requests that are pending in a queue. By default, this quota has a limit of 100 requests per Google Cloud project. If you attempt to create a Dynamic Workload Scheduler request that exceeds this quota, the new request fails.
- Jobs that use flex-start with queued provisioning must use the Indexed completion mode.