ClusterServingRuntime is a cluster-scoped resource that manages the runtime environment for model serving. It has a namespace-scoped counterpart, ServingRuntime; the only difference between the two is that one is namespace-scoped and the other is cluster-scoped.
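To illustrate the distinction, here is a minimal, hedged sketch of the namespace-scoped variant; the name and namespace are placeholders, and the spec fields are the same ones documented for ClusterServingRuntime below.

```yaml
# Sketch only: the namespace-scoped counterpart of a ClusterServingRuntime.
# Apart from its kind and metadata.namespace, the spec has the same shape.
apiVersion: ome.io/v1beta1
kind: ServingRuntime
metadata:
  name: srt-example        # placeholder name
  namespace: my-team       # visible only within this namespace
spec:
  supportedModelFormats:   # same fields as a ClusterServingRuntime spec
    - modelFormat:
        name: safetensors
```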
A ClusterServingRuntime defines the templates for Pods that can serve one or more particular models. Each ClusterServingRuntime defines key information such as the container image of the runtime and a list of the models that the runtime supports. Other configuration settings for the runtime can be conveyed through environment variables in the container specification.
These CRDs allow for improved flexibility and extensibility, enabling users to quickly define or customize reusable runtimes without having to modify any controller code or any resources in the controller namespace.
The following is an example of a ClusterServingRuntime:
```yaml
apiVersion: ome.io/v1beta1
kind: ClusterServingRuntime
metadata:
  name: srt-mistral-7b-instruct
spec:
  supportedModelFormats:
    - name: safetensors
      modelFormat:
        name: safetensors
        version: "1.0.0"
      modelFramework:
        name: transformers
        version: "4.36.2"
      modelArchitecture: MistralForCausalLM
      autoSelect: true
      priority: 1
  protocolVersions:
    - openAI
  modelSizeRange:
    max: 9B
    min: 5B
  engineConfig:
    runner:
      image: lmsysorg/sglang:v0.4.6.post6
      resources:
        requests:
          cpu: 10
          memory: 30Gi
          nvidia.com/gpu: 2
        limits:
          cpu: 10
          memory: 30Gi
          nvidia.com/gpu: 2
    minReplicas: 1
    maxReplicas: 3
```
Several out-of-the-box ClusterServingRuntimes are provided with OME so that users can quickly deploy common models without having to define the runtimes themselves.
SGLang Runtimes

!!! Note SGLang is our flagship runtime, offering the latest serving engine with the best performance. It provides cutting-edge features including multi-node serving, prefill-decode disaggregated serving, and large-scale cross-node Expert Parallelism (EP) for optimal performance at scale.

| Name | Model Framework | Model Format | Model Architecture |
| --- | --- | --- | --- |
| deepseek-rdma-pd-rt | transformers | safetensors | DeepseekV3ForCausalLM |
| deepseek-rdma-rt | transformers | safetensors | DeepseekV3ForCausalLM |
| e5-mistral-7b-instruct-rt | transformers | safetensors | MistralModel |
| llama-3-1-70b-instruct-pd-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-1-70b-instruct-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-2-11b-vision-instruct-rt | transformers | safetensors | MllamaForConditionalGeneration |
| llama-3-2-1b-instruct-pd-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-2-1b-instruct-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-2-3b-instruct-pd-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-2-3b-instruct-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-2-90b-vision-instruct-rt | transformers | safetensors | MllamaForConditionalGeneration |
| llama-3-3-70b-instruct-pd-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-3-70b-instruct-rt | transformers | safetensors | LlamaForCausalLM |
| llama-4-maverick-17b-128e-instruct-fp8-pd-rt | transformers | safetensors | Llama4ForConditionalGeneration |
| llama-4-maverick-17b-128e-instruct-fp8-rt | transformers | safetensors | Llama4ForConditionalGeneration |
| llama-4-scout-17b-16e-instruct-pd-rt | transformers | safetensors | Llama4ForConditionalGeneration |
| llama-4-scout-17b-16e-instruct-rt | transformers | safetensors | Llama4ForConditionalGeneration |
| mistral-7b-instruct-pd-rt | transformers | safetensors | MistralForCausalLM |
| mistral-7b-instruct-rt | transformers | safetensors | MistralForCausalLM |
| mixtral-8x7b-instruct-pd-rt | transformers | safetensors | MixtralForCausalLM |
| mixtral-8x7b-instruct-rt | transformers | safetensors | MixtralForCausalLM |

vLLM Runtimes

| Name | Model Framework | Model Format | Model Architecture |
| --- | --- | --- | --- |
| e5-mistral-7b-instruct-rt | transformers | safetensors | MistralModel |
| llama-3-1-405b-instruct-fp8-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-1-nemotron-nano-8b-v1-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-1-nemotron-ultra-253b-v1-rt | transformers | safetensors | DeciLMForCausalLM |
| llama-3-2-11b-vision-instruct-rt | transformers | safetensors | MllamaForConditionalGeneration |
| llama-3-2-1b-instruct-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-2-3b-instruct-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-3-70b-instruct-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-3-nemotron-super-49b-v1-rt | transformers | safetensors | DeciLMForCausalLM |
| llama-4-maverick-17b-128e-instruct-fp8-rt | transformers | safetensors | Llama4ForConditionalGeneration |
| llama-4-scout-17b-16e-instruct-rt | transformers | safetensors | Llama4ForConditionalGeneration |
| mistral-7b-instruct-rt | transformers | safetensors | MistralForCausalLM |
| mixtral-8x7b-instruct-rt | transformers | safetensors | MixtralForCausalLM |
Spec Attributes

Available attributes in the ServingRuntime spec:

| Attribute | Description |
| --- | --- |
| `disabled` | Disables this runtime |
| `supportedModelFormats` | List of model formats, architectures, and types supported by the current runtime |
| `supportedModelFormats[].name` | Name of the model format (deprecated, use `modelFormat.name` instead) |
| `supportedModelFormats[].modelFormat` | ModelFormat specification, including name and version |
| `supportedModelFormats[].modelFormat.name` | Name of the model format, e.g., "safetensors", "ONNX", "TensorFlow SavedModel" |
| `supportedModelFormats[].modelFormat.version` | Version of the model format, used in validating that a runtime supports a model. Can be "major", "major.minor", or "major.minor.patch" |
| `supportedModelFormats[].modelFramework` | ModelFramework specification, including name and version |
| `supportedModelFormats[].modelFramework.name` | Name of the framework library, e.g., "transformers", "TensorFlow", "PyTorch", "ONNX", "TensorRTLLM" |
| `supportedModelFormats[].modelFramework.version` | Version of the framework library |
| `supportedModelFormats[].modelArchitecture` | Name of the model architecture, used in validating that a model is supported by a runtime, e.g., "LlamaForCausalLM", "GemmaForCausalLM" |
| `supportedModelFormats[].quantization` | Quantization scheme applied to the model, e.g., "fp8", "fbgemm_fp8", "int4" |
| `supportedModelFormats[].autoSelect` | Set to `true` to allow the ServingRuntime to be used for automatic model placement when a model is specified with no explicit runtime. Defaults to `false` |
| `supportedModelFormats[].priority` | Priority of this serving runtime for auto-selection, used to pick a runtime when more than one supports the same model format |
| `protocolVersions` | Supported protocol versions (i.e., `openAI`, `cohere`, `openInference-v1`, or `openInference-v2`) |
| `modelSizeRange` | Range of model sizes supported by this runtime |
| `modelSizeRange.min` | Minimum size of the model, expressed as a parameter count (e.g., `5B`) |
| `modelSizeRange.max` | Maximum size of the model, expressed as a parameter count (e.g., `9B`) |
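As an example of the format-related attributes above, the following hedged snippet declares support for an fp8-quantized Llama model. The version strings and the priority value are illustrative assumptions, not requirements.

```yaml
# Sketch only: a supportedModelFormats entry that also pins a quantization
# scheme. Version strings and the priority value are illustrative.
supportedModelFormats:
  - modelFormat:
      name: safetensors
      version: "1.0.0"
    modelFramework:
      name: transformers
      version: "4.36.2"
    modelArchitecture: LlamaForCausalLM
    quantization: fp8      # must match the quantization scheme of the model
    autoSelect: true
    priority: 1
```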
Component Configuration

The ServingRuntime spec supports three main component configurations:
Engine Configuration

| Attribute | Description |
| --- | --- |
| `engineConfig` | Engine configuration for model serving |
| `engineConfig.runner` | Container specification for the main engine container |
| `engineConfig.runner.image` | Container image for the engine |
| `engineConfig.runner.resources` | Kubernetes resource requests and limits |
| `engineConfig.runner.env` | List of environment variables to pass to the container |
| `engineConfig.minReplicas` | Minimum number of replicas, defaults to 1 but can be set to 0 to enable scale-to-zero |
| `engineConfig.maxReplicas` | Maximum number of replicas for autoscaling |
| `engineConfig.scaleTarget` | Integer target value for the autoscaler metric |
| `engineConfig.scaleMetric` | Scaling metric type (concurrency, rps, cpu, memory) |
| `engineConfig.volumes` | List of volumes that can be mounted by containers |
| `engineConfig.nodeSelector` | Node selector for pod scheduling |
| `engineConfig.affinity` | Affinity rules for pod scheduling |
| `engineConfig.tolerations` | Tolerations for pod scheduling |
| `engineConfig.leader` | Leader configuration for multi-node deployments |
| `engineConfig.worker` | Worker configuration for multi-node deployments |

Router Configuration

| Attribute | Description |
| --- | --- |
| `routerConfig` | Router configuration for request routing |
| `routerConfig.runner` | Container specification for the router container |
| `routerConfig.config` | Additional configuration parameters for the router |
| `routerConfig.minReplicas` | Minimum number of router replicas |
| `routerConfig.maxReplicas` | Maximum number of router replicas |

Decoder Configuration

| Attribute | Description |
| --- | --- |
| `decoderConfig` | Decoder configuration for PD (Prefill-Decode) disaggregated deployments |
| `decoderConfig.runner` | Container specification for the decoder container |
| `decoderConfig.minReplicas` | Minimum number of decoder replicas |
| `decoderConfig.maxReplicas` | Maximum number of decoder replicas |
| `decoderConfig.leader` | Leader configuration for multi-node decoder deployments |
| `decoderConfig.worker` | Worker configuration for multi-node decoder deployments |
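To make these attributes concrete, here is a hedged sketch of a spec fragment that combines an autoscaled engine with a router and a decoder for PD disaggregated serving. It uses only the fields documented above; the router image name, the environment variable, and all numeric values are illustrative assumptions.

```yaml
# Sketch only: engineConfig autoscaling plus routerConfig and decoderConfig
# for PD (Prefill-Decode) disaggregated serving. Values are placeholders.
spec:
  engineConfig:                # prefill side in a PD deployment
    runner:
      image: lmsysorg/sglang:v0.4.6.post6
      env:
        - name: LOG_LEVEL      # hypothetical variable, illustrates runner.env
          value: info
    minReplicas: 0             # 0 enables scale-to-zero
    maxReplicas: 4
    scaleMetric: concurrency   # one of: concurrency, rps, cpu, memory
    scaleTarget: 10            # target value for the chosen autoscaler metric
  routerConfig:
    runner:
      image: example.com/pd-router:latest   # placeholder image
    minReplicas: 1
    maxReplicas: 2
  decoderConfig:               # decode side in a PD deployment
    runner:
      image: lmsysorg/sglang:v0.4.6.post6
    minReplicas: 1
    maxReplicas: 4
```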
Multi-Node Configuration

For both `engineConfig` and `decoderConfig`, multi-node deployments are supported:

| Attribute | Description |
| --- | --- |
| `leader` | Leader node configuration for coordinating distributed processing |
| `leader.runner` | Container specification for the leader node |
| `worker` | Worker nodes configuration for distributed processing |
| `worker.size` | Number of worker pod instances |
| `worker.runner` | Container specification for worker nodes |
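As an illustration, the following hedged sketch shows how `leader` and `worker` might be combined inside `engineConfig` for a two-node deployment; the image and the worker count are assumptions.

```yaml
# Sketch only: a multi-node engineConfig with one leader and one worker node.
# Image name and worker count are illustrative assumptions.
spec:
  engineConfig:
    leader:
      runner:
        image: lmsysorg/sglang:v0.4.6.post6   # coordinates distributed serving
    worker:
      size: 1                                 # number of worker pod instances
      runner:
        image: lmsysorg/sglang:v0.4.6.post6
```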
Using ClusterServingRuntimes

!!! Note ClusterServingRuntimes support the use of template variables of the form `{{.Variable}}` inside the container spec. These should map to fields inside an InferenceService's metadata object. The primary use of this is for passing InferenceService-specific information, such as a name, to the runtime environment.
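For example, a runtime container could surface the InferenceService name through an environment variable. This is a hedged sketch: the variable name `SERVED_MODEL_NAME` is hypothetical, and `{{.Name}}` is an assumption based on the metadata fields mentioned in the note above.

```yaml
# Sketch only: passing InferenceService metadata into the runtime container.
# {{.Name}} is assumed to resolve to the InferenceService's metadata.name.
spec:
  engineConfig:
    runner:
      image: lmsysorg/sglang:v0.4.6.post6
      env:
        - name: SERVED_MODEL_NAME   # hypothetical variable name
          value: "{{.Name}}"
```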
When users define a predictor in their InferenceService, they can explicitly specify the name of a ClusterServingRuntime or ServingRuntime. For example:
```yaml
apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: mistral-7b-instruct
  namespace: mistral-7b-instruct
spec:
  engine:
    minReplicas: 1
    maxReplicas: 1
  model:
    name: mistral-7b-instruct
  runtime:
    name: srt-mistral-7b-instruct
```
Here, the runtime specified is `srt-mistral-7b-instruct`, so the OME controller will first search the namespace for a ServingRuntime with that name. If none exists, the controller will then search the list of ClusterServingRuntimes.
Users can also implicitly specify the runtime by setting the `autoSelect` field to `true` in the `supportedModelFormats` field of the ClusterServingRuntime.
```yaml
apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: mistral-7b-instruct
  namespace: mistral-7b-instruct
spec:
  engine:
    minReplicas: 1
    maxReplicas: 1
  model:
    name: mistral-7b-instruct
```
Runtime Selection Logic
The OME controller uses an enhanced runtime selection algorithm to automatically choose the best runtime for a given model. The selection process includes several steps:
Runtime Discovery

The controller searches for compatible runtimes in the following order: first, ServingRuntimes in the InferenceService's namespace; then, ClusterServingRuntimes across the cluster. A candidate runtime must satisfy all of the requirements below.

Disabled check: The runtime must not be disabled. Runtimes can be disabled by setting the `disabled` field to `true` in the ServingRuntime spec.
Format matching: The runtime must support the model's complete format specification, which includes several components: the model format (name and version), the model framework (name and version), the model architecture, and the quantization scheme. All of these attributes must match between the model and the runtime's `supportedModelFormats` for the runtime to be considered compatible.
Size range: The `modelSizeRange` field defines the minimum and maximum model sizes that the runtime can support. This field is optional, but when provided, it helps the controller identify a runtime whose range covers the model size. If multiple runtimes meet the size requirement, the controller chooses the runtime whose range is closest to the model size, as sketched below.
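For instance, assuming a 7B-parameter model, both of the hypothetical ranges below are compatible, but the tighter one wins the comparison; all values are illustrative.

```yaml
# Sketch only: two candidate runtimes' size ranges for a 7B-parameter model.
# Both ranges contain 7B, so the tighter range (5B-9B) is selected.
modelSizeRange:   # runtime A: selected, its range is closest to the model size
  min: 5B
  max: 9B
---
modelSizeRange:   # runtime B: compatible, but its range is much wider
  min: 1B
  max: 70B
```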
Protocol support: The runtime must support the requested protocol version. Protocol versions include:

- `openAI`: OpenAI-compatible API format
- `openInference-v1`: Open Inference Protocol version 1
- `openInference-v2`: Open Inference Protocol version 2

If no protocol version is specified in the InferenceService, the controller defaults to `openAI`.
Auto-selection: The runtime must have `autoSelect` enabled for at least one supported format. This ensures that only runtimes explicitly marked for automatic selection are considered during the selection process.
If more than one serving runtime supports the same model architecture, format, framework, quantization, and size range with the same version, you can optionally specify a `priority` for each serving runtime. Based on the `priority`, a runtime is automatically selected when no runtime is explicitly specified. Note that `priority` is valid only if `autoSelect` is `true`. A higher value means higher priority.
For example, consider the serving runtimes `srt-mistral-7b-instruct` and `srt-mistral-7b-instruct-2`. Both serving runtimes support the `MistralForCausalLM` model architecture, the `transformers` model framework, the `safetensors` model format version "1.0.0", and the `openAI` protocolVersion. Note also that `autoSelect` is enabled in both serving runtimes.
```yaml
apiVersion: ome.io/v1beta1
kind: ClusterServingRuntime
metadata:
  name: srt-mistral-7b-instruct
spec:
  supportedModelFormats:
    - name: safetensors
      modelFormat:
        name: safetensors
        version: "1.0.0"
      modelFramework:
        name: transformers
        version: "4.36.2"
      modelArchitecture: MistralForCausalLM
      autoSelect: true
      priority: 1
  protocolVersions:
    - openAI
  modelSizeRange:
    max: 9B
    min: 5B
  engineConfig:
    runner:
      image: lmsysorg/sglang:v0.4.6.post6
      resources:
        requests:
          cpu: 10
          memory: 30Gi
          nvidia.com/gpu: 2
        limits:
          cpu: 10
          memory: 30Gi
          nvidia.com/gpu: 2
    minReplicas: 1
    maxReplicas: 3
```
```yaml
apiVersion: ome.io/v1beta1
kind: ClusterServingRuntime
metadata:
  name: srt-mistral-7b-instruct-2
spec:
  supportedModelFormats:
    - name: safetensors
      modelFormat:
        name: safetensors
        version: "1.0.0"
      modelFramework:
        name: transformers
        version: "4.36.2"
      modelArchitecture: MistralForCausalLM
      autoSelect: true
      priority: 2
  protocolVersions:
    - openAI
  modelSizeRange:
    max: 9B
    min: 5B
  engineConfig:
    runner:
      image: lmsysorg/sglang:v0.4.6.post6
      resources:
        requests:
          cpu: 10
          memory: 30Gi
          nvidia.com/gpu: 2
        limits:
          cpu: 10
          memory: 30Gi
          nvidia.com/gpu: 2
```

Because `srt-mistral-7b-instruct-2` carries the higher `priority` (2 versus 1), the controller automatically selects it when no runtime is explicitly specified.
Constraints of priority

!!! Warning If multiple runtimes list the same format and/or version as auto-selectable and no `priority` is specified, the runtime is selected based on the `creationTimestamp`, i.e., the most recently created runtime is selected, so there is no guarantee which runtime will be chosen. Users and cluster administrators should therefore enable `autoSelect` with care.