This is a regular release with many new optimizations, features, and fixes. Please check out the following exciting roadmaps and blogs.
--model as an alias for --model-path in server_args by @CatherineSue in #7505

Re-structured the OpenAI-compatible server to support production and enterprise environments. Key improvements include:
Consistent metrics and logging for better observability and debugging.
Unified error handling, request validation, and processing logic for improved reliability and maintainability.
Improved request tracking across sessions and components.
Fixed bugs in embedding requests and reasoning parsers.
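As a quick illustration of the refactored endpoints, here is a minimal sketch that queries the OpenAI-compatible server with the standard openai Python client; the port and model name below are assumptions that depend on how you launched the server.

```python
# Minimal sketch: chat completion against a locally running SGLang
# OpenAI-compatible server. Port 30000 and the model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",  # placeholder; use the model name your server reports
    messages=[{"role": "user", "content": "Hello, SGLang!"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```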
This work was a collaborative effort involving engineers from academic and industry institutions. Special thanks to the Oracle Cloud team and the SGLang team and community — including @slin1237, @CatherineSue, @key4ng, @JustinTong0323, @jhinpan, @yhyang201, @woodx9 and @whybeyoung — for their invaluable contributions.
DeepSeek R1 FP4 on Blackwell GPU

Added support for DeepSeek R1 with FP4 and MTP on NVIDIA Blackwell GPUs.
Integrated FlashInfer NVFP4 MoE, supporting TP, EP, and DP.
Supported 2-stream shared expert execution.
Achieved up to 90 TPS per user on B200 at isl/osl/bs = 1k/1k/16 (input sequence length / output sequence length / batch size).
Further optimization in progress. Special thanks to the FlashInfer, NVIDIA Enterprise Products, Novita AI, DataCrunch, Google Cloud, and SGLang teams — especially @Alcanderian and @pyc96 — for their critical contributions.
Breaking Change: OpenAI-Compatible API Module Moved

The sglang/srt/openai_api directory has been removed and replaced with sglang/srt/entrypoints/openai.

Update your imports to the new module path. For example:

- from sglang.srt.openai_api.protocol import Tool
+ from sglang.srt.entrypoints.openai.protocol import Tool
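During migration, a small compatibility shim can keep code importable across both layouts. This is only a sketch, reusing the Tool symbol from the example above:

```python
# Minimal compatibility sketch: prefer the new module path, fall back to the
# old one on releases prior to this change. Reuses the Tool symbol shown above.
try:
    from sglang.srt.entrypoints.openai.protocol import Tool  # new location
except ImportError:
    from sglang.srt.openai_api.protocol import Tool  # old location (pre-refactor)
```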
What's Changed

The PD disaggregation and large-scale EP functionalities from the blog post have now been fully merged into the latest release.
The blog post's results have been successfully reproduced by over six industry teams, including the TensorRT-LLM team.
SGLang’s large-scale EP is now actively used by leading organizations such as Cursor, Qwen, Alimama, Alibaba Cloud, iFlytek, and more. It has been deployed and validated at large scale, running on GPU clusters with thousands of devices.
PD disaggregation and large-scale EP, in addition to supporting DeepSeek V3/R1, now also support Qwen 3 in the latest release.
Full Blackwell support for DeepSeek V3/R1, Llama 4, and Qwen 3. Further optimizations are underway.
SGLang's DeepSeek V3/R1 now achieves 190 TPS on a single H200, outperforming other frameworks by over 50%.
We extend our sincere thanks to the following contributors, listed in alphabetical order: Alibaba Cloud, AMD Team, Ant Group, Baseten Team, Cursor Team, Dynamo Team, EAGLE Team, FlashInfer Team, Google Vertex AI Team, iFlytek MaaS Team, Intel Team, LinkedIn Team, Meituan Team, Microsoft Copilot Team, Mooncake Team, NVIDIA Team, Oracle Team, Qwen Team, Voltage Park Team and open source community users. Your support and collaboration are deeply appreciated!
What's Changed

max_completion_tokens for OpenAIChatCompletions by @CatherineSue in #5857
calculate_num_image_tokens from qwen2_vl.py by @JustinTong0323 in #5783

Thanks very much to the LinkedIn team, Alibaba Cloud, Mooncake team, NVIDIA team, AMD team, PyTorch team, Ant Group, Baseten team, Oracle team, Meituan team, iFlytek MaaS team, and the open source community users for their contributions!
We’re thrilled about these advancements and eager to hear your feedback! Join us on our Slack channel at slack.sglang.ai to connect and share your thoughts. Cheers!
Coming Soon

bench_one_batch support for enable_dp_attention by @fzyzcjy in #4058
--enable-llama4-multimodal by @ch-wan in #5254

The SGLang team is excited to announce the release of v0.4.5! This version introduces several significant features, including Llama 4 support, the FlashAttention 3 backend, EAGLE3 speculative decoding, DeepEP integration, and disaggregated prefill and decoding.
New Features

Llama 4 Support: We added support for Llama 4 models with accuracy matching the official benchmark numbers, achieving a zero-shot score of 75.2 on the MMLU Pro dataset for the Llama-4-Scout-17B-16E-Instruct model and 80.7 for the Llama-4-Maverick-17B-128E-Instruct model. #5092 (see the sketch after this list)
FlashAttention 3 Backend: Our implementation of the FlashAttention 3 backend delivers significant acceleration for long-context tasks. #4709
EAGLE3 Speculative Decoding: We’re proud to be the first to support EAGLE3 speculative decoding, offering substantial gains in decoding throughput. Learn more in our documentation and the EAGLE3 paper. #4247
DeepEP Integration: By incorporating DeepEP, we enhanced performance for MoE inference.
Disaggregated Prefill and Decoding: We introduced a prototype for disaggregated prefill and decoding, with plans for further optimizations.
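To make the feature list above concrete, the following sketch shows one way to try the new Llama 4 support together with the FlashAttention 3 backend via the offline Engine API. The model path, tp_size, and attention_backend value are assumptions to adapt to your own setup, not an official recipe.

```python
# Minimal sketch (not an official recipe): offline generation with Llama 4 and
# the FlashAttention 3 backend. Model path and keyword arguments are assumptions.
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed HF repo id
    tp_size=8,                   # adjust to the number of available GPUs
    attention_backend="fa3",     # assumed name for the FlashAttention 3 backend
)

outputs = llm.generate(
    ["Summarize what speculative decoding does."],
    {"temperature": 0, "max_new_tokens": 64},
)
print(outputs[0]["text"])
llm.shutdown()
```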
Thanks very much to the NVIDIA team, LinkedIn team, EAGLE team, Oracle team, Meituan team, and our incredible open-source community for their invaluable contributions!
Coming Soon

Disaggregated Prefill and Decoding: #4655
Llama 4 Optimization: #5118
EP Enhancement: #4734
FA3 Enhancement: #4709
We’re thrilled about these advancements and eager to hear your feedback! Join us on our Slack channel at slack.sglang.ai to connect and share your thoughts. Cheers!
What's Changed

torch.cat instead of torch.concat to prevent entering the Autograd backends. by @Alcanderian in #4466
torch.inference_mode() instead of torch.no_grad() by @Alcanderian in #4372

The SGLang team is excited to announce the release of v0.4.4. We will keep improving DeepSeek V3/R1 performance. With the combination of FlashInfer, MTP, DeepGEMM, and Torch Compile optimizations on H200, it can achieve nearly 100 tokens/s, which is currently the fastest open-source implementation. Look out for new optimizations coming soon!
Thanks very much to xAI Team, NVIDIA Team, AMD Team, LinkedIn team, Baseten Team, Oracle Team, Meituan Team and the open source community users for their contributions!
In addition to the users mentioned in the announcement, teams such as Tencent and Ant Group are also using SGLang for DeepSeek R1 inference acceleration. We are very happy to have received their recognition and adoption!
There will surely be bugs and fixes that we'll be discovering and quickly patching in the coming days, including today :) Let's build and ship. Please feel free to join our Slack channel at https://slack.sglang.ai/. Cheers!
Optimizations

AMD Performance Leadership: SGLang is now the fastest LLM engine for DeepSeek V3/R1 inference on AMD hardware, as confirmed by AMD's technical blog.
Enhanced FlashInfer MLA Support: Now fully compatible with radix cache, chunked prefill, and MTP optimizations - enable with --enable-flashinfer-mla (see the sketch after this list).
Advanced MTP Capabilities: Both the Triton and FlashInfer backends now offer comprehensive Multi-Token Prediction support, easily tunable via the bench_speculative script and compatible with radix cache and chunked prefill.
DeepGEMM Integration: Full integration of DeepGEMM for NVIDIA Hopper architectures - enable with export SGL_ENABLE_JIT_DEEPGEMM=1.
Pioneering INT8 Quantization: First industry implementation of INT8 support for DeepSeek R1 models.
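A minimal sketch of enabling the FlashInfer MLA and DeepGEMM paths from Python, assuming the CLI flag and environment variable above map onto sgl.Engine keyword arguments and the process environment in the usual way; the model path and tp_size are placeholders.

```python
# Minimal sketch, under the assumption that --enable-flashinfer-mla maps to an
# enable_flashinfer_mla keyword argument on sgl.Engine. DeepGEMM is enabled via
# the SGL_ENABLE_JIT_DEEPGEMM environment variable quoted in the notes above.
import os
import sglang as sgl

os.environ["SGL_ENABLE_JIT_DEEPGEMM"] = "1"  # DeepGEMM JIT on NVIDIA Hopper

llm = sgl.Engine(
    model_path="deepseek-ai/DeepSeek-R1",  # placeholder model path
    tp_size=8,                             # adjust to your hardware
    enable_flashinfer_mla=True,            # mirrors --enable-flashinfer-mla
)
```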
Other Optimizations:
Blackwell architecture Block Scale FP8 GEMM support
Support page size greater than 1 #4356
Optimized W8A8 FP8 implementation with performance gains across all architectures (sm80, sm89, sm90), featuring 15%+ improvement specifically on sm89
Enhanced distributed parallelism capabilities (e.g., two-node configurations with DP 2, TP 16) #4390
Integrate Flash Attention #4385
Integrate FlashMLA #4384
EAGLE 2 optimization #4383
EAGLE 3 day one support #4247
Integrate DeepEP #4232
Prefill and Decoding Disaggregation
tl.range() in block GEMM kernels with num_stages set by host. by @whchung in #3535
tl.range() in block GEMM kernels with `num_stage… by @zhyncs in #3632

The SGLang team is excited to announce the release of v0.4.3. We will keep improving DeepSeek V3/R1 performance. In the last six weeks, SGLang has been the fastest engine running DeepSeek V3/R1 among all open-source LLM inference engines. We stay ahead by integrating FlashInfer MLA and optimizing further. Look out for new optimizations coming soon! Please feel free to join our Slack channel at https://slack.sglang.ai. Cheers!
Performance Improvements

DeepSeek V3/R1 Optimizations

padding circular import by @BBuf in #2624
update_weights_from_tensor by @fzyzcjy in #2631
get_cuda_graph_seq_len_fill_value by @merrymercy in #1783
ZMQ buffer size heuristic by @hnyls2002 in #1801
SGLANG_CPU_COUNT by @ByronHsu in #1803
engine.generate by @ByronHsu in #1820
use_thread in the run_program for easier debugging. by @liuyanyi in #1823