This is a regular release with many new optimizations, features, and fixes. Please check out the following exciting roadmaps and blogs.
--model as an alias for --model-path in server_args by @CatherineSue in #7505

Re-structured the OpenAI-compatible server to support production and enterprise environments. Key improvements include:
Consistent metrics and logging for better observability and debugging.
Unified error handling, request validation, and processing logic for improved reliability and maintainability.
Improved request tracking across sessions and components.
Fixed bugs in embedding requests and reasoning parsers.
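As a quick illustration of the refactored endpoints, here is a minimal sketch that queries the OpenAI-compatible server with the standard openai Python client; the port and model name below are assumptions that depend on how you launched the server.

```python
# Minimal sketch: chat completion against a locally running SGLang
# OpenAI-compatible server. Port 30000 and the model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",  # placeholder; use the model name your server reports
    messages=[{"role": "user", "content": "Hello, SGLang!"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```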
This work was a collaborative effort involving engineers from academic and industry institutions. Special thanks to the Oracle Cloud team and the SGLang team and community — including @slin1237, @CatherineSue, @key4ng, @JustinTong0323, @jhinpan, @yhyang201, @woodx9 and @whybeyoung — for their invaluable contributions.
DeepSeek R1 FP4 on Blackwell GPU

Added support for DeepSeek R1 with FP4 and MTP on NVIDIA Blackwell GPUs.
Integrated FlashInfer NVFP4 MoE, supporting TP, EP, and DP.
Supported 2-stream shared expert execution.
Achieved up to 90 TPS per user on B200 at isl/osl/bs = 1k/1k/16 (input sequence length / output sequence length / batch size).
Further optimization in progress. Special thanks to the FlashInfer, NVIDIA Enterprise Products, Novita AI, DataCrunch, Google Cloud, and SGLang teams — especially @Alcanderian and @pyc96 — for their critical contributions.
Breaking Change: OpenAI-Compatible API Module Moved

The sglang/srt/openai_api directory has been removed and replaced with sglang/srt/entrypoints/openai.

Update your imports to the new module path. For example:

- from sglang.srt.openai_api.protocol import Tool
+ from sglang.srt.entrypoints.openai.protocol import Tool
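During migration, a small compatibility shim can keep code importable across both layouts. This is only a sketch, reusing the Tool symbol from the example above:

```python
# Minimal compatibility sketch: prefer the new module path, fall back to the
# old one on releases prior to this change. Reuses the Tool symbol shown above.
try:
    from sglang.srt.entrypoints.openai.protocol import Tool  # new location
except ImportError:
    from sglang.srt.openai_api.protocol import Tool  # old location (pre-refactor)
```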
What's Changed

The PD disaggregation and large-scale EP functionalities from the blog post have now been fully merged into the latest release.
The blog post's results have been successfully reproduced by over six industry teams, including the TensorRT-LLM team.
SGLang’s large-scale EP is now actively used by leading organizations such as Cursor, Qwen, Alimama, Alibaba Cloud, iFlytek, and more. It has been deployed and validated at large scale, running on GPU clusters with thousands of devices.
PD disaggregation and large-scale EP, in addition to supporting DeepSeek V3/R1, now also support Qwen 3 in the latest release.
Full Blackwell support for DeepSeek V3/R1, Llama 4, and Qwen 3. Further optimizations are underway.
SGLang's DeepSeek V3/R1 now achieves 190 TPS on a single H200, outperforming other frameworks by over 50%.
We extend our sincere thanks to the following contributors, listed in alphabetical order: Alibaba Cloud, AMD Team, Ant Group, Baseten Team, Cursor Team, Dynamo Team, EAGLE Team, FlashInfer Team, Google Vertex AI Team, iFlytek MaaS Team, Intel Team, LinkedIn Team, Meituan Team, Microsoft Copilot Team, Mooncake Team, NVIDIA Team, Oracle Team, Qwen Team, Voltage Park Team and open source community users. Your support and collaboration are deeply appreciated!
What's Changed

max_completion_tokens for OpenAIChatCompletions by @CatherineSue in #5857
calculate_num_image_tokens from qwen2_vl.py by @JustinTong0323 in #5783

Thanks very much to the LinkedIn team, Alibaba Cloud, Mooncake team, NVIDIA team, AMD team, PyTorch team, Ant Group, Baseten team, Oracle team, Meituan team, iFlytek MaaS team, and the open source community users for their contributions!
We’re thrilled about these advancements and eager to hear your feedback! Join us on our Slack channel at slack.sglang.ai to connect and share your thoughts. Cheers!
Coming Soon

bench_one_batch support for enable_dp_attention by @fzyzcjy in #4058
--enable-llama4-multimodal by @ch-wan in #5254

The SGLang team is excited to announce the release of v0.4.5! This version introduces several significant features, including Llama 4 support, the FlashAttention 3 backend, EAGLE3 speculative decoding, DeepEP integration, and disaggregated prefill and decoding.
New Features

Llama 4 Support: We added support for Llama 4 models with accuracy matching the official benchmark numbers, achieving a zero-shot score of 75.2 on the MMLU Pro dataset for the Llama-4-Scout-17B-16E-Instruct model and 80.7 for the Llama-4-Maverick-17B-128E-Instruct model. #5092 (see the sketch after this list)
FlashAttention 3 Backend: Our implementation of the FlashAttention 3 backend delivers significant acceleration for long-context tasks. #4709
EAGLE3 Speculative Decoding: We’re proud to be the first to support EAGLE3 speculative decoding, offering substantial gains in decoding throughput. Learn more in our documentation and the EAGLE3 paper. #4247
DeepEP Integration: By incorporating DeepEP, we enhanced performance for MoE inference.
Disaggregated Prefill and Decoding: We introduced a prototype for disaggregated prefill and decoding, with plans for further optimizations.
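To make the feature list above concrete, the following sketch shows one way to try the new Llama 4 support together with the FlashAttention 3 backend via the offline Engine API. The model path, tp_size, and attention_backend value are assumptions to adapt to your own setup, not an official recipe.

```python
# Minimal sketch (not an official recipe): offline generation with Llama 4 and
# the FlashAttention 3 backend. Model path and keyword arguments are assumptions.
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed HF repo id
    tp_size=8,                   # adjust to the number of available GPUs
    attention_backend="fa3",     # assumed name for the FlashAttention 3 backend
)

outputs = llm.generate(
    ["Summarize what speculative decoding does."],
    {"temperature": 0, "max_new_tokens": 64},
)
print(outputs[0]["text"])
llm.shutdown()
```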
Thanks very much to the NVIDIA team, LinkedIn team, EAGLE team, Oracle team, Meituan team, and our incredible open-source community for their invaluable contributions!
Coming Soon

Disaggregated Prefill and Decoding: #4655
Llama 4 Optimization: #5118
EP Enhancement: #4734
FA3 Enhancement: #4709
We’re thrilled about these advancements and eager to hear your feedback! Join us on our Slack channel at slack.sglang.ai to connect and share your thoughts. Cheers!
What's Changed

torch.cat instead of torch.concat to prevent entering the Autograd backends. by @Alcanderian in #4466
torch.inference_mode() instead of torch.no_grad() by @Alcanderian in #4372

The SGLang team is excited to announce the release of v0.4.4. We will keep improving DeepSeek V3/R1 performance. With the combination of FlashInfer, MTP, DeepGEMM, and Torch Compile optimizations on H200, it can achieve nearly 100 tokens/s, which is currently the fastest open-source implementation. Look out for new optimizations coming soon!
Thanks very much to xAI Team, NVIDIA Team, AMD Team, LinkedIn team, Baseten Team, Oracle Team, Meituan Team and the open source community users for their contributions!
In addition to the users mentioned in the announcement, teams such as Tencent and Ant Group are also using SGLang for DeepSeek R1 inference acceleration. We are very happy to have received their recognition and adoption!
There will surely be bugs and fixes that we'll be discovering and quickly patching in the coming days, including today :) Let's build and ship. Please feel free to join our Slack channel at https://slack.sglang.ai/. Cheers!
Optimizations

AMD Performance Leadership: SGLang is now the fastest LLM engine for DeepSeek V3/R1 inference on AMD hardware, as confirmed by AMD's technical blog.
Enhanced FlashInfer MLA Support: Now fully compatible with radix cache, chunked prefill, and MTP optimizations - enable with --enable-flashinfer-mla (see the sketch after this list).
Advanced MTP Capabilities: Both the Triton and FlashInfer backends now offer comprehensive Multi-Token Prediction support, easily tunable via the bench_speculative script and compatible with radix cache and chunked prefill.
DeepGEMM Integration: Full integration of DeepGEMM for NVIDIA Hopper architectures - enable with export SGL_ENABLE_JIT_DEEPGEMM=1.
Pioneering INT8 Quantization: First industry implementation of INT8 support for DeepSeek R1 models.
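A minimal sketch of enabling the FlashInfer MLA and DeepGEMM paths from Python, assuming the CLI flag and environment variable above map onto sgl.Engine keyword arguments and the process environment in the usual way; the model path and tp_size are placeholders.

```python
# Minimal sketch, under the assumption that --enable-flashinfer-mla maps to an
# enable_flashinfer_mla keyword argument on sgl.Engine. DeepGEMM is enabled via
# the SGL_ENABLE_JIT_DEEPGEMM environment variable quoted in the notes above.
import os
import sglang as sgl

os.environ["SGL_ENABLE_JIT_DEEPGEMM"] = "1"  # DeepGEMM JIT on NVIDIA Hopper

llm = sgl.Engine(
    model_path="deepseek-ai/DeepSeek-R1",  # placeholder model path
    tp_size=8,                             # adjust to your hardware
    enable_flashinfer_mla=True,            # mirrors --enable-flashinfer-mla
)
```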
Other Optimizations:
Blackwell architecture Block Scale FP8 GEMM support
Support page size greater than 1 #4356
Optimized W8A8 FP8 implementation with performance gains across all architectures (sm80, sm89, sm90), featuring 15%+ improvement specifically on sm89
Enhanced distributed parallelism capabilities (e.g., two-node configurations with DP 2, TP 16) #4390
Integrate Flash Attention #4385
Integrate FlashMLA #4384
EAGLE 2 optimization #4383
EAGLE 3 day one support #4247
Integrate DeepEP #4232
Prefill and Decoding Disaggregation
tl.range() in block GEMM kernels with num_stages set by host. by @whchung in #3535
tl.range() in block GEMM kernels with `num_stage… by @zhyncs in #3632

The SGLang team is excited to announce the release of v0.4.3. We will keep improving DeepSeek V3/R1 performance. In the last six weeks, SGLang has been the fastest engine running DeepSeek V3/R1 among all open-source LLM inference engines. We stay ahead by integrating FlashInfer MLA and optimizing further. Look out for new optimizations coming soon! Please feel free to join our Slack channel at https://slack.sglang.ai. Cheers!
Performance Improvements

DeepSeek V3/R1 Optimizations

padding circular import by @BBuf in #2624
update_weights_from_tensor by @fzyzcjy in #2631
get_cuda_graph_seq_len_fill_value by @merrymercy in #1783
ZMQ buffer size heuristic by @hnyls2002 in #1801
SGLANG_CPU_COUNT by @ByronHsu in #1803
engine.generate by @ByronHsu in #1820
use_thread in the run_program for easier debugging. by @liuyanyi in #1823