The SGLang team is excited to announce the release of v0.4.4. We will keep improving DeepSeek V3/R1 performance. By combining the FlashInfer, MTP, DeepGEMM, and torch.compile optimizations on H200, SGLang can achieve nearly 100 tokens/s, currently the fastest open-source implementation. Look out for new optimizations coming soon!
Many thanks to the xAI, NVIDIA, AMD, LinkedIn, Baseten, Oracle, and Meituan teams, and to the open-source community users, for their contributions!
Beyond the users mentioned in the announcement, teams such as Tencent and Ant Group are also using SGLang to accelerate DeepSeek R1 inference. We are very happy to have earned the recognition and adoption of these teams!
There will surely be bugs that we'll be discovering and quickly patching in the coming days, including today :) Let's build and ship. Please feel free to join our Slack channel: https://slack.sglang.ai/ Cheers!
Optimizations

AMD Performance Leadership: SGLang is now the fastest LLM engine for DeepSeek V3/R1 inference on AMD hardware, as confirmed by AMD's technical blog.
Enhanced FlashInfer MLA Support: now fully compatible with radix cache, chunked prefill, and MTP optimizations. Enable with `--enable-flashinfer-mla`.
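For reference, a minimal launch sketch; the model path, tensor-parallel size, and remaining flags are illustrative assumptions rather than a tuned configuration:

```bash
# Sketch: serve DeepSeek R1 with the FlashInfer MLA backend enabled.
# Model path and --tp value are assumptions; adjust for your hardware.
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1 \
  --tp 8 \
  --trust-remote-code \
  --enable-flashinfer-mla
```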
Advanced MTP Capabilities: both the Triton and FlashInfer backends now offer comprehensive Multi-Token Prediction support, easily tunable via the `bench_speculative` script and compatible with radix cache and chunked prefill.
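A hedged sketch of turning MTP on at launch; the speculative flag spellings, the NEXTN algorithm name, and the draft-model path are assumptions based on SGLang's speculative-decoding options at the time, so verify them against `--help` and tune the numeric values, e.g. with `bench_speculative`:

```bash
# Sketch only: flag names and values below are assumptions to verify and tune.
# lmsys/DeepSeek-R1-NextN is assumed to hold the extracted NextN draft weights.
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1 \
  --tp 8 --trust-remote-code \
  --speculative-algorithm NEXTN \
  --speculative-draft-model-path lmsys/DeepSeek-R1-NextN \
  --speculative-num-steps 2 \
  --speculative-eagle-topk 4 \
  --speculative-num-draft-tokens 4
```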
DeepGEMM Integration: full integration of DeepGEMM for NVIDIA Hopper architectures. Enable with `export SGL_ENABLE_JIT_DEEPGEMM=1`.
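The environment variable is read at server launch; a minimal sketch (model path and parallelism are assumptions, as above):

```bash
# Enable the DeepGEMM JIT kernels (Hopper only), then launch as usual.
export SGL_ENABLE_JIT_DEEPGEMM=1
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1 \
  --tp 8 \
  --trust-remote-code
```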
Pioneering INT8 Quantization: the first industry implementation of INT8 support for DeepSeek R1 models.
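As a sketch of how this might be used, assuming an INT8-quantized R1 checkpoint and SGLang's `w8a8_int8` quantization option (both are assumptions here, not confirmed by these notes):

```bash
# Hypothetical: <your-int8-r1-checkpoint> stands in for an INT8-quantized
# DeepSeek R1 model; w8a8_int8 is assumed to be the matching quantization arg.
python3 -m sglang.launch_server \
  --model-path <your-int8-r1-checkpoint> \
  --quantization w8a8_int8 \
  --tp 8 \
  --trust-remote-code
```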
Other Optimizations:
Block-scale FP8 GEMM support for the Blackwell architecture
Support page size greater than 1 #4356
Optimized W8A8 FP8 implementation with performance gains across all architectures (sm80, sm89, sm90), featuring 15%+ improvement specifically on sm89
Enhanced distributed parallelism capabilities (e.g., two-node configurations with DP 2, TP 16; see the launch sketch after this list) #4390
Integrate Flash Attention #4385
Integrate FlashMLA #4384
EAGLE 2 optimization #4383
EAGLE 3 day one support #4247
Integrate DeepEP #4232
Prefill and Decoding Disaggregation
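For the two-node DP 2 / TP 16 configuration mentioned above, a hedged sketch; the rendezvous address is a placeholder and the flag spellings are assumptions to adapt to your cluster:

```bash
# Node 0 (sketch; 192.168.0.1:5000 is a placeholder rendezvous address):
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1 \
  --tp 16 --dp 2 --enable-dp-attention \
  --nnodes 2 --node-rank 0 \
  --dist-init-addr 192.168.0.1:5000 \
  --trust-remote-code

# Node 1: run the same command with --node-rank 1.
```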
What's Changed

`tl.range()` in block GEMM kernels with num_stages set by host. by @whchung in #3535
`tl.range()` in block GEMM kernels with `num_stage… by @zhyncs in #3632
`debug_tensor_dump_output_folder` optional key missing by @Qubitium in #4046
`lm_head` Quantization by @Qubitium in #3790
`__init__` function of model_runner.py shorter by @merrymercy in #4132
`aiohttp` into public dependencies by @stevapple in #3980
`torch.compile` by @junliu-mde in #3844

Full Changelog: v0.4.3...v0.4.4