InferenceService is the primary resource that manages the deployment and serving of machine learning models in OME.
What is an InferenceService?
An InferenceService is the central Kubernetes resource in OME that orchestrates the complete lifecycle of model serving. It acts as a declarative specification that describes how you want your AI models deployed, scaled, and served across your cluster.
Think of InferenceService as the “deployment blueprint” for your AI workloads. It brings together models (defined by BaseModel/ClusterBaseModel), runtimes (defined by ServingRuntime/ClusterServingRuntime), and infrastructure configuration to create a complete serving solution.
Architecture Overview
OME uses a component-based architecture in which an InferenceService can be composed of multiple specialized components:
apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: llama-chat   # Example name
spec:
  model:
    name: llama-3-70b-instruct
  runtime:
    name: vllm-text-generation
  engine:
    minReplicas: 1
    maxReplicas: 3
    resources:
      requests:
        nvidia.com/gpu: "1"
Component Types
Engine Component
The Engine is the primary inference component that processes model requests. It handles model loading, inference execution, and response generation.
spec:
  engine:
    # Pod-level configuration
    serviceAccountName: custom-sa
    nodeSelector:
      accelerator: nvidia-a100
    # Component configuration
    minReplicas: 1
    maxReplicas: 10
    scaleMetric: cpu
    scaleTarget: 70
    # Container configuration
    runner:
      image: custom-vllm:latest
      resources:
        requests:
          nvidia.com/gpu: "2"
        limits:
          nvidia.com/gpu: "2"
      env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0,1"
Decoder Component
The Decoder is used for disaggregated serving architectures where the prefill (prompt processing) and decode (token generation) phases are separated for better resource utilization.
spec:
  decoder:
    minReplicas: 2
    maxReplicas: 8
    runner:
      resources:
        requests:
          cpu: "4"
          memory: "8Gi"
Router Component
The Router handles request routing, including cache-aware load balancing and load balancing between prefill and decode components in disaggregated serving.
spec:
  router:
    minReplicas: 1
    maxReplicas: 3
    config:
      routing_strategy: "round_robin"
      health_check_interval: "30s"
    runner:
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
Deployment Modes
OME automatically selects the optimal deployment mode based on your configuration:
| Mode | Description | Use Cases | Infrastructure |
|------|-------------|-----------|----------------|
| Raw Deployment | Standard Kubernetes Deployment | Stable workloads, predictable traffic, no cold starts | Kubernetes Deployments + Services |
| Serverless | Knative-based auto-scaling | Variable workloads, cost optimization, scale-to-zero | Knative Serving |
| Multi-Node | Distributed inference across multiple nodes | Large models (e.g., DeepSeek) that cannot fit on a single node | LeaderWorkerSet |
| Prefill-Decode Disaggregation | Disaggregated serving architecture | Maximizing resource utilization, better performance | Raw Deployments or LeaderWorkerSet (if the model cannot fit on a single node) |

Raw Deployment Mode (Default)
Uses standard Kubernetes Deployments with full control over pod lifecycle and scaling.
apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: llama-chat
spec:
  model:
    name: llama-3-70b-instruct
  engine:
    minReplicas: 2
    maxReplicas: 10
This deployment mode offers direct Kubernetes management with standard HPA-based autoscaling, no cold starts, and is ideal for stable, predictable workloads.
Serverless Mode
Leverages Knative Serving for automatic scaling, including scale-to-zero capabilities.
apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: llama-chat
spec:
  model:
    name: llama-3-70b-instruct
  engine:
    minReplicas: 0   # Enables scale-to-zero
    maxReplicas: 10
    scaleTarget: 10  # Concurrent requests per pod
This deployment mode leverages Knative Serving for request-based autoscaling, scale-to-zero when idle, and is ideal for variable workloads and cost-sensitive environments.
⚠️ WARNING: Serverless mode may introduce additional startup latency for large language models due to cold starts and model loading time.
Multi-Node Mode
Enables distributed model serving across multiple nodes using LeaderWorkerSet or Ray clusters.
apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: deepseek-chat
spec:
  model:
    name: deepseek-r1   # Large model requiring multiple GPUs
  engine:
    minReplicas: 1
    maxReplicas: 2
    # Worker node configuration
    worker:
      size: 1   # Number of worker nodes
This deployment mode enables distributed inference using LeaderWorkerSet or Ray, with support for multi-GPU and multi-node setups, and is optimized for large language models through automatic coordination between nodes.
⚠️ WARNING: Multi-node configurations typically require high-performance networking such as RoCE or InfiniBand, and performance may vary depending on the underlying network topology and hardware provided by different cloud vendors.
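To steer multi-node pods onto suitable hardware, the shared scheduling fields (nodeSelector, tolerations; see the Component Configuration reference below) can be combined with the worker setting. The following is a minimal sketch: the node label, taint key, and GPU counts are illustrative assumptions, not values required by OME.

spec:
  model:
    name: deepseek-r1
  engine:
    minReplicas: 1
    maxReplicas: 1
    # Place pods on GPU nodes with fast interconnect (label is an assumption)
    nodeSelector:
      node-pool: gpu-h100
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
    runner:
      resources:
        requests:
          nvidia.com/gpu: "8"
        limits:
          nvidia.com/gpu: "8"
    worker:
      size: 1   # Number of worker nodes per engine replica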
Disaggregated Serving (Prefill-Decode)
apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: deepseek-ep-disaggregated
spec:
  model:
    name: deepseek-r1
  # Router handles request routing and load balancing for prefill-decode disaggregation
  router:
    minReplicas: 1
    maxReplicas: 3
  # Engine handles prefill phase
  engine:
    minReplicas: 1
    maxReplicas: 3
  # Decoder handles token generation
  decoder:
    minReplicas: 2
    maxReplicas: 8
Specification Reference

| Attribute | Type | Description |
|-----------|------|-------------|
| **Core References** | | |
| model | ModelRef | Reference to BaseModel/ClusterBaseModel to serve |
| runtime | ServingRuntimeRef | Reference to ServingRuntime/ClusterServingRuntime to use |
| **Components** | | |
| engine | EngineSpec | Main inference component configuration |
| decoder | DecoderSpec | Optional decoder component for disaggregated serving |
| router | RouterSpec | Optional router component for request routing |
| **Autoscaling** | | |
| kedaConfig | KedaConfig | KEDA event-driven autoscaling configuration |

ModelRef Specification

| Attribute | Type | Description |
|-----------|------|-------------|
| name | string | Name of the BaseModel/ClusterBaseModel |
| kind | string | Resource kind (defaults to "ClusterBaseModel") |
| apiGroup | string | API group (defaults to "ome.io") |
| fineTunedWeights | []string | Optional fine-tuned weight references |

ServingRuntimeRef Specification

| Attribute | Type | Description |
|-----------|------|-------------|
| name | string | Name of the ServingRuntime/ClusterServingRuntime |
| kind | string | Resource kind (defaults to "ClusterServingRuntime") |
| apiGroup | string | API group (defaults to "ome.io") |
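As a concrete illustration of these reference fields, the sketch below spells out the defaults explicitly; the fine-tuned weight name my-lora-adapter is a hypothetical placeholder, not a real resource.

spec:
  model:
    name: llama-3-70b-instruct
    kind: ClusterBaseModel        # Default kind
    apiGroup: ome.io              # Default API group
    fineTunedWeights:
      - my-lora-adapter           # Hypothetical fine-tuned weight reference
  runtime:
    name: vllm-text-generation
    kind: ClusterServingRuntime   # Default kind
    apiGroup: ome.io              # Default API group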
Component Configuration

All components (Engine, Decoder, Router) share this common configuration structure:
| Attribute | Type | Description |
|-----------|------|-------------|
| **Pod Configuration** | | |
| serviceAccountName | string | Service account for the component pods |
| nodeSelector | map[string]string | Node labels for pod placement |
| tolerations | []Toleration | Pod tolerations for tainted nodes |
| affinity | Affinity | Pod affinity and anti-affinity rules |
| volumes | []Volume | Additional volumes to mount |
| containers | []Container | Additional sidecar containers |
| **Scaling Configuration** | | |
| minReplicas | int | Minimum number of replicas (default: 1) |
| maxReplicas | int | Maximum number of replicas |
| scaleTarget | int | Target value for autoscaling metric |
| scaleMetric | string | Metric to use for scaling (cpu, memory, concurrency, rps) |
| containerConcurrency | int64 | Maximum concurrent requests per container |
| timeoutSeconds | int64 | Request timeout in seconds |
| **Traffic Management** | | |
| canaryTrafficPercent | int64 | Percentage of traffic to route to canary version |
| **Resource Configuration** | | |
| runner | RunnerSpec | Main container configuration |
| leader | LeaderSpec | Leader node configuration (multi-node only) |
| worker | WorkerSpec | Worker node configuration (multi-node only) |
| **Deployment Strategy** | | |
| deploymentStrategy | DeploymentStrategy | Kubernetes deployment strategy (RawDeployment only) |
| **KEDA Configuration** | | |
| kedaConfig | KedaConfig | Component-specific KEDA configuration |
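The sketch below combines several of these shared fields on the engine component; the concurrency, timeout, and canary values are illustrative assumptions rather than recommended settings.

spec:
  engine:
    serviceAccountName: custom-sa
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
    minReplicas: 1
    maxReplicas: 5
    scaleMetric: concurrency
    scaleTarget: 8
    containerConcurrency: 8    # Maximum concurrent requests per container
    timeoutSeconds: 300        # Request timeout in seconds
    canaryTrafficPercent: 10   # Route 10% of traffic to the canary version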
RunnerSpec Configuration

| Attribute | Type | Description |
|-----------|------|-------------|
| name | string | Container name |
| image | string | Container image |
| command | []string | Container command |
| args | []string | Container arguments |
| env | []EnvVar | Environment variables |
| resources | ResourceRequirements | CPU, memory, and GPU resource requirements |
| volumeMounts | []VolumeMount | Volume mount points |
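A runner that overrides the container command and mounts an extra volume might look like the following sketch; the image tag, command, flags, and volume name are assumptions for illustration only.

spec:
  engine:
    volumes:
      - name: model-cache          # Hypothetical extra volume
        emptyDir: {}
    runner:
      name: vllm
      image: custom-vllm:latest
      command: ["python", "-m", "vllm.entrypoints.openai.api_server"]  # Illustrative entrypoint
      args: ["--tensor-parallel-size", "2"]                            # Illustrative flags
      env:
        - name: HF_HOME
          value: /model-cache
      resources:
        requests:
          nvidia.com/gpu: "2"
        limits:
          nvidia.com/gpu: "2"
      volumeMounts:
        - name: model-cache
          mountPath: /model-cache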
KEDA Configuration

| Attribute | Type | Description |
|-----------|------|-------------|
| enableKeda | bool | Whether to enable KEDA autoscaling |
| promServerAddress | string | Prometheus server URL for metrics |
| customPromQuery | string | Custom Prometheus query for scaling |
| scalingThreshold | string | Threshold value for scaling decisions |
| scalingOperator | string | Comparison operator (GreaterThanOrEqual, LessThanOrEqual) |
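Putting these fields together, a component-level KEDA configuration could look like the sketch below; the Prometheus address and query are assumptions that depend on your monitoring setup.

spec:
  engine:
    minReplicas: 1
    maxReplicas: 10
    kedaConfig:
      enableKeda: true
      promServerAddress: "http://prometheus.monitoring.svc.cluster.local:9090"   # Assumed Prometheus URL
      customPromQuery: "sum(rate(http_requests_total{service='llama-chat'}[2m]))" # Illustrative query
      scalingThreshold: "100"
      scalingOperator: "GreaterThanOrEqual"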
Status and Monitoring
InferenceService Status

The InferenceService status provides comprehensive information about the deployment state:
status:
  url: "http://llama-chat.default.example.com"
  address:
    url: "http://llama-chat.default.svc.cluster.local"
  conditions:
    - type: Ready
      status: "True"
      lastTransitionTime: "2024-01-15T10:30:00Z"
    - type: IngressReady
      status: "True"
      lastTransitionTime: "2024-01-15T10:25:00Z"
  components:
    engine:
      url: "http://llama-chat-engine.default.example.com"
      latestReadyRevision: "llama-chat-engine-00001"
      latestCreatedRevision: "llama-chat-engine-00001"
      traffic:
        - revisionName: "llama-chat-engine-00001"
          percent: 100
          latestRevision: true
    router:
      url: "http://llama-chat-router.default.example.com"
      latestReadyRevision: "llama-chat-router-00001"
  modelStatus:
    transitionStatus: "UpToDate"
    modelRevisionStates:
      activeModelState: "Loaded"
      targetModelState: "Loaded"
Condition Types

| Condition | Description |
|-----------|-------------|
| Ready | Overall readiness of the InferenceService |
| IngressReady | Network routing is configured and ready |
| EngineReady | Engine component is ready to serve requests |
| DecoderReady | Decoder component is ready (if configured) |
| RouterReady | Router component is ready (if configured) |
| PredictorReady | Deprecated: Legacy predictor readiness |

Model Status States

| State | Description |
|-------|-------------|
| Pending | Model is not yet registered |
| Standby | Model is available but not loaded |
| Loading | Model is currently loading |
| Loaded | Model is loaded and ready for inference |
| FailedToLoad | Model failed to load |

Deployment Mode Selection
Choose the appropriate deployment mode based on your requirements:
| Requirement | Recommended Mode |
|-------------|------------------|
| Stable, predictable load | Raw Deployment |
| No cold starts | Raw Deployment |
| Variable workload | Serverless |
| Cost optimization | Serverless |
| Scale-to-zero capability | Serverless |
| Large model requiring multiple GPUs | Multi-Node |
| Distributed inference | Multi-Node |
| Maximum performance | Multi-Node |

Best Practices
Resource Management

# Request and limit GPUs explicitly
runner:
  resources:
    requests:
      nvidia.com/gpu: "1"
    limits:
      nvidia.com/gpu: "1"

# Size memory to the model
runner:
  resources:
    requests:
      memory: "32Gi"   # For 8B parameter model

# Reserve CPU for the serving container
runner:
  resources:
    requests:
      cpu: "4"
Scaling Configuration
engine:
  minReplicas: 1   # Prevent scale-to-zero for latency
  maxReplicas: 10  # Control costs
  scaleTarget: 70  # 70% CPU utilization target

kedaConfig:
  enableKeda: true
  customPromQuery: "avg_over_time(vllm:request_latency_seconds{service='%s'}[5m])"
  scalingThreshold: "0.5"  # 500ms latency threshold
Troubleshooting

# Inspect the full InferenceService object
kubectl get inferenceservice llama-chat -o yaml

# Check events and condition details
kubectl describe inferenceservice llama-chat

# View logs from the serving pods
kubectl logs -l serving.ome.io/inferenceservice=llama-chat

# Check pod resource usage
kubectl top pods -l serving.ome.io/inferenceservice=llama-chat