
About GPU and TPU provisioning with flex-start provisioning mode | Google Kubernetes Engine (GKE)

This page describes flex-start provisioning mode in Google Kubernetes Engine (GKE). Flex-start, powered by Dynamic Workload Scheduler, provides a flexible and cost-effective technique to obtain GPUs and TPUs when you need to run AI/ML workloads.

Flex-start lets you dynamically provision GPUs and TPUs as needed, for up to seven days, without being bound to a specific start time and without managing long-term reservations. Flex-start therefore works well for small to medium-sized workloads with fluctuating demand or short durations, such as small model pre-training, model fine-tuning, or scalable model serving.

The information on this page can help you to do the following:

This page is intended for platform admins, operators, and machine learning (ML) engineers who want to optimize accelerator infrastructure for their workloads.

When to use flex-start

We recommend that you use flex-start if your workloads meet all of the following conditions:

Flex-start pricing

Flex-start is recommended if your workload needs resources that are provisioned dynamically as needed, for up to seven days with short-term reservations, without complex quota management, and with cost-effective access. Flex-start is powered by Dynamic Workload Scheduler and is billed using Dynamic Workload Scheduler pricing:

Requirements

To use flex-start in GKE, your cluster must meet the following requirements:

How flex-start provisioning mode works

With flex-start, you specify the required GPU or TPU capacity in your workloads. Additionally, with Standard clusters, you configure flex-start on specific node pools. GKE automatically provisions VMs by completing the following process when capacity becomes available:

  1. The workload requests capacity that is not immediately available. This request can be made directly in the workload specification or through scheduling tools such as custom compute classes or Kueue.
  2. GKE identifies that your node has flex-start enabled and that the workload can wait for an indeterminate amount of time.
  3. The cluster autoscaler accepts your request and calculates the number of necessary nodes, treating them as a single unit.
  4. The cluster autoscaler provisions the necessary nodes when they are available. These nodes run for a maximum of seven days, or for a shorter duration if you specify a value in the maxRunDurationSeconds parameter. If you don't specify a value for the maxRunDurationSeconds parameter, the default is seven days.
  5. After the running time you defined in the maxRunDurationSeconds parameter ends, the nodes and the Pods are preempted.
  6. If the Pods finish sooner and the nodes are no longer utilized, the cluster autoscaler removes them according to the autoscaling profile.

GKE counts the duration for each flex-start request separately, at the node level. The time available for running Pods might be slightly shorter because of delays during startup. Pod retries share this duration, which means that less time is available for Pods after a retry.
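
For example, the following Job is a minimal sketch of a workload that requests flex-start capacity directly in its specification (step 1 of the process above). It assumes a Standard cluster with a flex-start node pool that provides NVIDIA L4 GPUs; the cloud.google.com/gke-flex-start node selector, the accelerator type, and the container image are assumptions for illustration, not values stated on this page.

apiVersion: batch/v1
kind: Job
metadata:
  name: flex-start-training-job
spec:
  template:
    spec:
      # Schedule the Pod only on nodes provisioned through flex-start.
      nodeSelector:
        cloud.google.com/gke-flex-start: "true"
        cloud.google.com/gke-accelerator: nvidia-l4
      containers:
      - name: trainer
        # Placeholder image; replace with your training container.
        image: us-docker.pkg.dev/my-project/training/trainer:latest
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: Never

Because flex-start nodes are preempted at the end of their duration (seven days, or the value you set in maxRunDurationSeconds), design such Jobs to checkpoint their progress so that a retry can resume instead of restarting from the beginning.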

Flex-start configurations

GKE supports two flex-start configurations: flex-start, in which GKE provisions nodes individually as capacity becomes available, and flex-start with queued provisioning, in which GKE provisions all of the requested capacity at the same time.

The following table compares the flex-start configurations:

Optimize flex-start configuration

To create robust and cost-optimized AI/ML infrastructure, you can combine flex-start configurations with other available GKE features. We recommend that you use custom compute classes to define a prioritized list of node configurations based on your workload requirements, as shown in the sketch after this paragraph. GKE selects the most suitable configuration based on availability and your defined priority.
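
The following custom compute class is a minimal sketch of that pattern: it asks GKE to try Spot L4 GPU nodes first and to fall back to flex-start nodes when Spot capacity isn't available. The class name and the gpu priority fields are assumptions for illustration; the node recycling example later on this page expresses a similar fallback with machine types instead.

apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  # Hypothetical class name.
  name: l4-spot-then-flex-start
spec:
  priorities:
    # First choice: cheaper Spot capacity.
    - gpu:
        type: nvidia-l4
        count: 1
      spot: true
    # Fallback: flex-start capacity when Spot isn't available.
    - gpu:
        type: nvidia-l4
        count: 1
      flexStart:
        enabled: true
  nodePoolAutoCreation:
    enabled: true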

Manage disruptions in workloads that use Dynamic Workload Scheduler

Workloads that require the availability of all nodes, or most nodes, in a node pool are sensitive to evictions. In addition, nodes that are provisioned by using Dynamic Workload Scheduler requests don't support automatic repair. Automatic repair removes all workloads from a node, and thus prevents them from running.

All nodes using flex-start, queued provisioning, or both, use short-lived upgrades when the cluster control plane runs the minimum version for flex-start, 1.32.2-gke.1652000 or later.

Short-lived upgrades update a Standard node pool or group of nodes in an Autopilot cluster without disrupting running nodes. New nodes are created with the new configuration, gradually replacing existing nodes with the old configuration over time. Earlier versions of GKE, which don't support flex-start or short-lived upgrades, require different best practices.

Best practices to minimize workload disruptions for nodes using short-lived upgrades

Flex-start nodes and nodes that use queued provisioning are automatically configured to use short-lived upgrades when the cluster runs version 1.32.2-gke.1652000 or later.

To minimize disruptions to workloads running on nodes that use short-lived upgrades, perform the following tasks:

For nodes on clusters running versions earlier than 1.32.2-gke.1652000, and thus not using short-lived upgrades, refer to the specific guidance for those nodes.

Best practices to minimize workload disruption for queued provisioning nodes without short-lived upgrades

Nodes using queued provisioning on a cluster running a GKE version earlier than 1.32.2-gke.1652000 don't use short-lived upgrades. Clusters upgraded to 1.32.2-gke.1652000 or later with existing queued provisioning nodes are automatically updated to use short-lived upgrades.

For nodes running these earlier versions, refer to the following guidance:

Considerations for when your cluster migrates to short-lived upgrades

GKE updates existing nodes that use queued provisioning to use short-lived upgrades when the cluster is upgraded to version 1.32.2-gke.1652000 or later. GKE doesn't change other settings; for example, it doesn't re-enable node auto-upgrades if you disabled them for a specific node pool.

We recommend that you consider implementing the following best practices now that your node pools use short-lived upgrades:

Node recycling in flex-start

To help ensure a smooth transition of nodes and prevent downtime for your running jobs, flex-start supports node recycling. When a node reaches the end of its duration, GKE automatically replaces the node with a new one to preserve your running workloads.

To use node recycling, you must create a custom compute class profile and include the nodeRecycling field in the flexStart specification with the leadTimeSeconds parameter.

The leadTimeSeconds parameter lets you balance resource availability and cost efficiency. This parameter specifies how many seconds before a node reaches the end of its duration GKE should start provisioning a new node to substitute it. A longer lead time increases the probability that the new node is ready before the old one is removed, but might incur additional costs. For example, with the default seven-day duration (604,800 seconds) and leadTimeSeconds set to 3600, GKE starts provisioning the replacement node about one hour before the old node's duration ends.

The node recycling process consists of the following steps:

  1. Recycling phase: GKE validates that a flex-start-provisioned node has the nodeRecycling field with the leadTimeSeconds parameter set. If so, GKE starts the node recycling phase when the current time is greater than or equal to creationTimeStamp + maxRunDurationSeconds - leadTimeSeconds.

    The creationTimeStamp field records the time when the node was created. The maxRunDurationSeconds field can be specified in the custom compute class, and defaults to seven days.

  2. Node creation: the creation process for the new node begins, proceeding through queueing and provisioning phases. The duration of the queueing phase can vary dynamically depending on the zone and specific accelerator capacity.

  3. Cordon the node that's reaching the end of its seven-day duration: after the new node is running, the old node is cordoned. This action prevents any new Pods from being scheduled on it. Existing Pods on that node continue to run.

  4. Node deprovisioning: the node that's reaching the end of its seven-day duration is eventually deprovisioned after a suitable period, which helps ensure that running workloads have migrated to the new node.

The following example of a compute class configuration includes the leadTimeSeconds and maxRunDurationSeconds fields:

apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: dws-model-inference-class
spec:
  priorities:
    # First choice: Spot VMs.
    - machineType: g2-standard-24
      spot: true
    # Fallback: flex-start nodes that run for at most 20 hours (72,000 seconds)
    # and are recycled starting one hour before the end of their duration.
    - machineType: g2-standard-24
      maxRunDurationSeconds: 72000
      flexStart:
        enabled: true
        nodeRecycling:
          leadTimeSeconds: 3600
  nodePoolAutoCreation:
    enabled: true

For more information about how to use node recycling, try the Serve LLMs on GKE with a cost-optimized and high-availability GPU provisioning strategy tutorial.
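
To run a workload on this compute class, select the class by name in the Pod specification. The following Deployment is a minimal sketch of that usage; the Deployment name and container image are placeholders, and the two-GPU request matches the two NVIDIA L4 GPUs attached to a g2-standard-24 machine.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: model-inference
  template:
    metadata:
      labels:
        app: model-inference
    spec:
      # Request nodes that match the compute class defined above.
      nodeSelector:
        cloud.google.com/compute-class: dws-model-inference-class
      containers:
      - name: inference-server
        # Placeholder image; replace with your serving container.
        image: us-docker.pkg.dev/my-project/serving/inference-server:latest
        resources:
          limits:
            nvidia.com/gpu: 2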

Limitations

What's next
