
About GKE Inference Gateway


Preview

This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.

This page explains the key concepts and features of Google Kubernetes Engine (GKE) Inference Gateway, an extension to the GKE Gateway for optimized serving of generative AI applications.

This page assumes that you know about the following:

This page is intended for the following personas:

Overview

GKE Inference Gateway is an extension to the GKE Gateway that provides optimized routing and load balancing for serving generative Artificial Intelligence (AI) workloads. It simplifies the deployment, management, and observability of AI inference workloads.

To choose the optimal load balancing strategy for your AI/ML workloads, see Choose a load balancing strategy for AI inference on GKE.

Features and benefits

GKE Inference Gateway provides the following key capabilities for efficiently serving generative AI models on GKE:

Understand key concepts

GKE Inference Gateway enhances the existing GKE Gateway that uses GatewayClass objects. GKE Inference Gateway introduces the following new Gateway API Custom Resource Definitions (CRDs), aligned with the OSS Kubernetes Gateway API extension for Inference:
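
For reference, the OSS Gateway API inference extension defines resources such as InferencePool, which groups the model server Pods that sit behind a shared endpoint picker, and an inference model resource that maps a client-facing model name to a pool. The following manifest is a minimal sketch of an InferencePool based on that OSS extension; the apiVersion, field names, and all example names are assumptions that can differ by CRD version.

```yaml
# Minimal sketch of an InferencePool, based on the OSS Gateway API
# inference extension. The apiVersion, field names, and all example
# names are illustrative and may differ depending on the CRD version
# installed in your cluster.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-8b             # hypothetical pool name
spec:
  selector:
    app: vllm-llama3-8b            # labels of the model server Pods to include
  targetPortNumber: 8000           # port the model servers listen on
  extensionRef:
    name: vllm-llama3-8b-epp       # endpoint picker extension (a Service)
```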

The following diagram illustrates GKE Inference Gateway and its integration with AI safety, observability, and model serving within a GKE cluster.

Figure: GKE Inference Gateway resource model

The following diagram illustrates the resource model, which focuses on the two new inference-focused personas and the resources they manage.

Figure: GKE Inference Gateway resource model

How GKE Inference Gateway works

GKE Inference Gateway uses Gateway API extensions and model-specific routing logic to handle client requests to an AI model. The following steps describe the request flow.

How the request flow works

GKE Inference Gateway routes each client request from its arrival at the gateway to a model instance. This section describes how GKE Inference Gateway handles requests. This request flow is common to all clients.

  1. The client sends a request, formatted as described in the OpenAI API specification, to the model running in GKE.
  2. GKE Inference Gateway processes the request using the following inference extensions:
    1. Body-based routing extension: extracts the model identifier from the client request body and sends it to GKE Inference Gateway. GKE Inference Gateway then uses this identifier to route the request based on rules defined in the Gateway API HTTPRoute object, as sketched in the example after this list. Request body routing is similar to routing based on the URL path, except that it uses data from the request body.
    2. Security extension: uses Model Armor or supported third-party solutions to enforce model-specific security policies that include content filtering, threat detection, sanitization, and logging. The security extension applies these policies on both the request and response processing paths, so it can sanitize and log both requests and responses.
    3. Endpoint picker extension: monitors key metrics from model servers within the InferencePool. It tracks the key-value cache (KV-cache) utilization, queue length of pending requests, and active LoRA adapters on each model server. It then routes the request to the optimal model replica based on these metrics to minimize latency and maximize throughput for AI inference.
  3. GKE Inference Gateway routes the request to the model replica returned by the endpoint picker extension.
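
As a sketch of the body-based routing step above, the following HTTPRoute attaches to a Gateway and forwards matched inference traffic to an InferencePool backend rather than a regular Service. The route, Gateway, and pool names are hypothetical, and the backend group and kind follow the OSS Gateway API inference extension.

```yaml
# Minimal sketch: an HTTPRoute that forwards matched inference traffic to
# an InferencePool backend instead of a regular Service. The Gateway,
# route, and pool names are hypothetical; the backend group and kind
# follow the OSS Gateway API inference extension.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route                         # hypothetical route name
spec:
  parentRefs:
  - name: inference-gateway               # hypothetical Gateway name
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /                          # send traffic on this Gateway to the pool
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-llama3-8b                # the pool sketched earlier
```

In this sketch, any request on the Gateway reaches the pool; the model identifier extracted by the body-based routing extension then determines which backend model ultimately serves it.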

The following diagram illustrates the request flow from a client to a model instance through GKE Inference Gateway.

Figure: GKE Inference Gateway request flow

How traffic distribution works

GKE Inference Gateway dynamically distributes inference requests to model servers within the InferencePool object. This helps optimize resource utilization and maintains performance under varying load conditions. GKE Inference Gateway uses the following two mechanisms to manage traffic distribution:

GKE Inference Gateway supports the following Criticality levels: Critical, Standard, and Sheddable.

When the system is under resource pressure, Standard and Sheddable requests are immediately dropped with a 429 error code to safeguard Critical workloads.
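
Criticality is declared on the inference model resource that maps a client-facing model name to an InferencePool. The following manifest is a minimal sketch based on the OSS Gateway API inference extension; the apiVersion, field names, and example names are assumptions that can differ by CRD version.

```yaml
# Minimal sketch: declaring Criticality for a served model, based on the
# OSS Gateway API inference extension. The apiVersion, field names, and
# example names are illustrative and may differ by CRD version.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: llama3-chat
spec:
  modelName: llama3-chat          # model name extracted from the request body
  criticality: Critical           # other accepted values: Standard, Sheddable
  poolRef:
    name: vllm-llama3-8b          # InferencePool whose servers handle this model
```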

Streaming inference

GKE Inference Gateway supports streaming inference for applications like chatbots and live translation that require continuous or near-real-time updates. Streaming inference delivers responses in incremental chunks or segments, rather than as a single, complete output. If an error occurs during a streaming response, the stream terminates, and the client receives an error message. GKE Inference Gateway does not retry streaming responses.

Explore application examples

This section provides examples of how to address various generative AI application scenarios by using GKE Inference Gateway.

Example 1: Serve multiple generative AI models on a GKE cluster

A company wants to deploy multiple large language models (LLMs) to serve different workloads. For example, they might want to deploy a Gemma3 model for a chatbot interface and a Deepseek model for a recommendation application. The company needs to ensure optimal serving performance for these LLMs.

Using GKE Inference Gateway, you can deploy these LLMs on your GKE cluster with your chosen accelerator configuration in an InferencePool. You can then route requests based on the model name (such as chatbot and recommender) and the Criticality property.
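
Following the same pattern as the earlier sketches, a hypothetical configuration for this scenario maps each client-facing model name to its own InferencePool with a different Criticality. All names, the apiVersion, and field names below are illustrative assumptions based on the OSS Gateway API inference extension.

```yaml
# Illustrative sketch for this scenario: each client-facing model name maps
# to its own InferencePool (gemma3-pool and deepseek-pool, assumed to be
# defined elsewhere) with a different Criticality. All names are hypothetical.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: chatbot
spec:
  modelName: chatbot              # requests for "chatbot" go to the Gemma3 pool
  criticality: Critical
  poolRef:
    name: gemma3-pool
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: recommender
spec:
  modelName: recommender          # requests for "recommender" go to the Deepseek pool
  criticality: Standard           # shed before Critical traffic under resource pressure
  poolRef:
    name: deepseek-pool
```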

The following diagram illustrates how GKE Inference Gateway routes requests to different models based on the model name and Criticality.

Figure: Serving multiple generative AI models on a GKE cluster using GKE Inference Gateway

Example 2: Serve LoRA adapters on a shared accelerator

A company wants to serve LLMs for document analysis to audiences in multiple languages, such as English and Spanish. They have fine-tuned models for each language, but need to use their GPU and TPU capacity efficiently. You can use GKE Inference Gateway to deploy dynamic LoRA fine-tuned adapters for each language (for example, english-bot and spanish-bot) on a common base model (for example, llm-base) and accelerator. This lets you reduce the number of required accelerators by densely packing multiple models on a common accelerator.
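
A hypothetical configuration for this scenario exposes each adapter under its own model name while pointing both at the same InferencePool that backs the shared llm-base model servers. The apiVersion, field names, and names below are illustrative assumptions based on the OSS Gateway API inference extension.

```yaml
# Illustrative sketch for this scenario: both language-specific LoRA adapters
# are exposed as their own model names but share one InferencePool
# (llm-base-pool, assumed to be defined elsewhere) whose servers host the
# llm-base model. The apiVersion, field names, and names are assumptions.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: english-bot
spec:
  modelName: english-bot
  criticality: Standard
  poolRef:
    name: llm-base-pool
  targetModels:
  - name: english-bot             # LoRA adapter loaded on the shared model servers
    weight: 100
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: spanish-bot
spec:
  modelName: spanish-bot
  criticality: Standard
  poolRef:
    name: llm-base-pool
  targetModels:
  - name: spanish-bot             # second adapter densely packed on the same accelerators
    weight: 100
```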

The following diagram illustrates how GKE Inference Gateway serves multiple LoRA adapters on a shared accelerator.

Figure: Serving LoRA adapters on a shared accelerator

What's next

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-08-13 UTC.

[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-08-13 UTC."],[],[]]


RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4