Release v0.4.10 · sgl-project/sglang
Highlights
This is a regular release with many new optimizations, features, and fixes. Please check out the following roadmaps and blogs.
What's Changed
- [AMD] add aiter fused moe in DeepEP path by @alexsun07 in #7268
- enable aiter_biased_grouped_topk kernel by @valarLip in #7423
- [PD Disaggregation] replace transfer with batch transfer for better performance by @ssssnow in #7236
- Remove cumsum_buffer initialization by @ispobock in #7439
- [benchmark] fbgemm benchmark support bandwidth report and support fbgemm_cutlass_gmm by @BBuf in #7422
- Support multi-thread model weight loading by @xianzhiT in #7277
- [PD] NIXL: Register kv args in advance and cleanup finished requests by @trevor-m in #6717
- fix: Add --model as an alias for --model-path in server_args by @CatherineSue in #7505
- misc: Improvement to serving_chat.py and add more ut by @CatherineSue in #7489 (see the chat-completions sketch after this list)
- Fuse sorted_token_ids padding to moe_align_block_size kernel by @ispobock in #7437
- [OAI] patch origin request_id logic by @whybeyoung in #7508
- [PD][Spec] Fix hidden state transfer for spec decode by @ShangmingCai in #7516
- EPLB support for MTP by @yilian49 in #7510
- clean duplicate code by @habaohaba in #7512
- [ci] add router benchmark script and CI by @slin1237 in #7498
- fix: force synchronization between TP workers when update_weights by @dangkai4u in #6626
- [CPU] [BF16] Call fused_experts_cpu, weight_packed_linear and bmm_cpu kernel in DeepSeek model by @chunyuan-w in #6641
- [CI] Upgrade mooncake to v0.3.4.post2 to fix potential slice failed bug by @ShangmingCai in #7522
- npu fused op by @ll819214 in #7386
- feat: send kvmetrics from sglang scheduler by @zixuanzhang226 in #6721
- [PD] Add different TP sizes support for no-MLA models by @Hongbosherlock in #6793
- enable aiter fp8 blockscale quant by @valarLip in #7520
- take aiter get_rope back by @valarLip in #7521
- Fix typo of flash_cache by @hebiao064 in #7513
- feat: add return hidden_states at async generation by @yyihuang in #7507
- minor: 'role' must be system/assistant/tool, but case insensitive for now by @minleminzui in #7499
- Fix FP8 KV Cache Support in FA3 Backend by @guoyuhong in #7148
- Fix gathered_buffer issues in tbo by @Qiaolin-Yu in #7531
- [PD] Raise error for incompatible mooncake version and some minor fixes by @ShangmingCai in #7527
- [CMake] Fix sgl-kernel CMakeLists for Blackwell by @MasterJH5574 in #7543
- Add Tencent HunYuanMoEV1 model support by @mpjlu in #7549
- Update seed in CPU UTs to avoid flaky failure with single test by @yanbing-j in #7544
- chore: improve ci bug reporting by @mickqian in #7542
- chore: remove vlm unnecessary import by @JustinTong0323 in #7541
- chore: bump v0.4.8.post1 by @zhyncs in #7559
- [PD][NIXL] Set is_sorted=False to fix NIXL_ERR_NOT_FOUND by @trevor-m in #7330
- [Fix] incorrect assert in EPLB by @ch-wan in #7575
- Updates Gemma3n MLP layer to adapt latest transformers version by @JustinTong0323 in #7573
- Fix MTP error when enabling two-batch overlap by @fzyzcjy in #7569
- Add e2e test for multi-instance multi-stage memory release/resume occupation by @MrAta in #7208
- [CI] Add CI Testing for Prefill-Decode Disaggregation with Router by @key4ng in #7540
- Updates transformers and timm dependencies by @JustinTong0323 in #7577
- feat: support compatibility between MTP and two-batch-overlap by @Qiaolin-Yu in #7225
- Move multimodal processors into a separate folder by @merrymercy in #7581
- Fix broken CI TestVILAServer by @lifuhuang in #7610
- [router] add centralized configuration module for sgl-router by @slin1237 in #7588
- Fix: Minicpm by @JustinTong0323 in #7612
- Hybrid kv cache for LLaMA4 by @tarinkk in #6563
- [CPU] add optimizations for INT8 and FP8 DeepSeek by @chunyuan-w in #6769
- Tiny add logs for expert location updater by @fzyzcjy in #7308
- Fix flakiness in LoRA batch test. by @lifuhuang in #7552
- [BUG] fix local_rank in initialize_dp_attention by @TomQuartz in #7584
- Support dynamic LoRA loading / unloading in engine/server API by @lifuhuang in #7446 (see the usage sketch after this list)
- [PD] Respect sampling_params.max_new_tokens when PD disaggregation is activated by @ShangmingCai in #7598
- fix unit tests by @zhyncs in #7618
- Let ep_scatter support arbitrary strides / ue8m0 format by @fzyzcjy in #7309
- Let EP prefill support new DeepGEMM by @fzyzcjy in #7310
- docs: add gb200 nvl72 and a16z grant by @zhyncs in #7620
- Adds support for OpenAI chat completions API in bench_serving by @JustinTong0323 in #7036
- [bugfix] Remove PR comment posting from Rust benchmark workflow by @slin1237 in #7625
- [Minor] clean up multimodal processor and tokenizer manager by @merrymercy in #7624
- Add dsv3 fused a gemm to sgl-kernel by @ispobock in #7630
- Add @mickqian as the CODEOWNERS of multimodal by @merrymercy in #7636
- Fix stream reasoning parser and Adds Kimi reasoning parser by @JustinTong0323 in #7432
- Fix sgl-router startup crash by @finetunej in #7619
- [bugfix] fix runtime dropping panic in editable by @slin1237 in #7628
- Move files related to EPLB by @fzyzcjy in #7580
- [misc] reduce weird rope_scaling_factor warning by @Alcanderian in #7176
- [AMD] Add unit-test-sgl-kernel-amd to AMD CI by @hubertlu-tw in #7539
- Update CODEOWNERS by @merrymercy in #7640
- [EAGLE] remove a wrong adjustment for page_size > 1 & topk > 1 in server_args.py by @merrymercy in #7643
- [CPU] add c++ kernel to bind CPU cores and memory node by @chunyuan-w in #7524
- Improve streaming, log_level, memory report, weight loading, and benchmark script by @merrymercy in #7632
- Add dsv3 router gemm kernel by @Fridge003 in #7627
- chore: upgrade flashinfer v0.2.7 jit by @zhyncs in #7663
- [doc] update lws doc for pd by @whybeyoung in #7318
- Fix: sync prepare_fp8_layer_for_marlin with latest vllm changes by @narutolhy in #7648
- Add small requirements for benchmark/parse_result tools by @BBuf in #7671
- [CPU] remove process_group from inputs of shm_allreduce and shm_allgather by @chunyuan-w in #7486
- chore: bump sgl-kernel v0.2.1 by @zhyncs in #7675
- support llama4 eagle3 by @sleepcoo in #6985
- Refactor mm processors and Enable mixed modality processing by @JustinTong0323 in #7629
- upgrade sgl kernel to 0.2.1 for main by @xiezhq-hermann in #7676
- add description for llama4 eagle3 by @yizhang2077 in #7688
- fix(model loader): use safe_open to prevent file handle leaks. by @SimonCqk in #7684
- chore: upgrade flashinfer v0.2.7.post1 by @zhyncs in #7698
- Improve error handling for requests with unloaded LoRA path(s) by @lifuhuang in #7642
- Apply dsv3_fused_a_gemm kernel by @ispobock in #7635
- Fix GPTQMarlinMoE by @lkm2835 in #7697
- [1/n] apply wna16marlin kernel in moe weight only quantization by @AniZpZ in #7683
- Apply dsv3 router gemm kernel for deepseek-r1 fp4 by @Fridge003 in #7677
- [AMD] Temporarily disable test_no_overlap_scheduler and test_vision_chunked_prefill by @hubertlu-tw in #7717
- [RL] add --skip-warmup by @zhuzilin in #7416
- [RL] support update_weights_from_distributed with different group and multiple weights by @zhuzilin in #7292
- [router] add --log-level to sgl-router by @zhuzilin in #6512
- [b200] support trt-llm allreduce fuse rms_norm_add kernel by @BBuf in #7621
- [CPU] Bind threads and numa node for each TP rank by @chunyuan-w in #6549
- Support non-contiguous query input for extend/decode attention by @yanbing-j in #7462
- Support updating weights at once by stopping all requests by @tianyuzhou95 in #6698
- Fix num_tokens_pre_allocated in disaggregation log by @ZeldaHuang in #7714
- [CPU] [sgl-kernel] set dispatch key of initialize to CatchAll by @chunyuan-w in #7734
- [CPU] fix all_reduce and all_gather by @chunyuan-w in #6770
- fix awq and dsv3 fused gemm compatible by @AniZpZ in #7735
- [CI][Router] Fix bench_one_batch_server for pd router test by @ShangmingCai in #7731
- Add CUTLASS FP8 Blockscale MoE kernel for Hopper architecture by @ayrnb in #7278
- fix dsv3 fused proj check by @AniZpZ in #7738
- Ascend attention backend(PA&MLA) by @ping1jing2 in #7722
- [fix] fix dsv3_router_gemm filter by @Alcanderian in #7750
- [CPU] refine CPU integration code by @chunyuan-w in #7647
- [CPU] support the case where num_attention_heads or intermediate_size is not divisible by the TP size by @chunyuan-w in #6771
- support qwen3 dense model dp attention by @yizhang2077 in #7681
- [optimize] add two stream norm for qwen3 by @yizhang2077 in #7740
- feat: use D2D instead of H2H in pp by @TianyuZhang1214 in #7673
- [Bug] add flashinfer bool check for fusedmoe in Qwen moe models by @yilian49 in #7723
- [fix] put cpu in the first priority in get_device() by @Alcanderian in #7752
- [optimize] fuse renormalize into moe_topk_softmax by @yizhang2077 in #7744
- chore: bump sgl-kernel 0.2.2 by @zhyncs in #7755
- fix CI: update native api ipynb by @JustinTong0323 in #7754
- fuse renormal into moe topk softmax kernel python code by @yizhang2077 in #7751
- Remove type conversion and fix id map in topk by @ispobock in #7759
- Add V2-lite model test by @yanbing-j in #7390
- refactor llama4 dp attention logic by @yizhang2077 in #7729
- fix(docs): fix the broken link in docs/references/production_metrics.md by @rudeigerc in #7741
- [fix] update bench_speculative.py for compatibility by @yankay in #7764
- Move mem_fraction_static adjustment for multimodal models to server_args.py & Fix session control & Other cleanups by @merrymercy in #7748
- [RL] Add --nccl-port to prevent port conflict by @zhuzilin in #7418
- [RL] add pause and continue generation for async rl training by @zhuzilin in #7419
- [Fix] Alloc return type error by @Capronir in #7778
- [feat] Support EAGLE3 for Qwen by @Ximingwang-09 in #7745
- saving hidden_states.clone() by @ch-wan in #7705
- [1/n]: add cutlass W4A8 moe kernel for hopper architecture by @yangsijia-serena in #7772
- add model: qwen2-audio by @leng-yue in #7596
- Optimize Hopper CUTLASS FP8 Blockwise Grouped GEMM Kernel in Small K Scenario by @HydraQYH in #7782
- Embedding parallel by attn_tp by @MoonBall in #7623
- fix: fix apply_shuffle_mul_sum by @mickqian in #7444
- chore: bump sgl-kernel v0.2.3 by @zhyncs in #7784
- fix: use nvidia-nccl-cu12 2.27.5 by @zhyncs in #7787
- DP Attention with Auto DeepEP Dispatch by @ch-wan in #7222
- chore: upgrade sgl-kernel v0.2.3 by @zhyncs in #7786
- Fix incorrect spec_num_draft_tokens in draft_extend by @ch-wan in #7757
- [fix] fix misusing of is_cuda by @Alcanderian in #7790
- Add treemask mode to build_eagle_tree & release sgl-kernel 0.2.3 by @merrymercy in #7756
- chore: bump sgl-kernel v0.2.4 by @zhyncs in #7800
- ci: fix port args by @mickqian in #7792
- Fix CI test OOM issue. by @lifuhuang in #7799
- chore: upgrade sgl-kernel v0.2.4 by @zhyncs in #7801
- chore: bump v0.4.9 by @zhyncs in #7802
- [misc] remove pdlb rust by @slin1237 in #7796
- fix: free disk space by @zhyncs in #7803
- fix: disable dsv3_router_gemm in dsv3_nextn by @Alcanderian in #7793
- Support logprobs in two-batch overlap by @fzyzcjy in #7709
- Fix division-by-zero bug in LoRA triton kernels. by @lifuhuang in #7785
- [AMD] Add test_fused_moe.py and test_rope_rocm.py to AMD CI by @hubertlu-tw in #5246
- [RL] Fix illegal memory for _import_static_state by @hebiao064 in #7733
- Fix _import_static_state issue by @nanjiangwill in #7812
- Optimize moe align block size kernel by @ispobock in #7794
- Log the timestamps of each prefill/decode iteration by @yuhsuan-t in #6094
- [bugfix] Fix sgl-router get_server_info endpoint compatibility issue by @slin1237 in #7813
- Integrate triton moe kernel by @yuan-luo in #7689
- Kernels for efficient KV cache IO by @xiezhq-hermann in #7313
- [docs] update router readme by @slin1237 in #7797
- [misc] release new router version by @slin1237 in #7798
- fix duplicate args in schedule_batch by @ZeldaHuang in #7816
- [AMD] Fail gracefully when AITER is unavailable on gfx90a GPUs by @haohui in #7187
- docs: update README by @zhyncs in #7821
- feat: support DeepSeek-R1-W4AFP8 model with ep-moe mode by @yangsijia-serena in #7762
- Enable ModelOpt Llama4 fp8 checkpoint deployment in SGLang by @Edwardf0t1 in #7129
- [Minor] Fix sporadic CI timeout caused by underestimated tests. by @lifuhuang in #7850
- [Bugfix] Fix two batch overlap with auto DeepEP Dispatch by @ShangmingCai in #7853
- Fix cache modules of triton import error by @kkHuang-amd in #7832
- [router] forward stream_options in request by @ZhangShuaiyi in #7860
- Fix illegal memory in trtllm allreduce fusion by @BBuf in #7864
- Fix llama4 vision by @JustinTong0323 in #7840
- Support Mimo-VL by @JustinTong0323 in #7579
- fix: Handles input_embeds in GenerateReqInput when n>1 by @JustinTong0323 in #7830
- [Multimodal][Perf] Use pybase64 instead of base64 by @b8zhong in #7724
- Bump xgrammar's version to 0.1.20 by @whybeyoung in #7866
- [CPU]convert topk_weights to fp32 for INT8 and FP8 paths (for llama4) and fix LmHead weight pack by @chunyuan-w in #7818
- [PD] Add guidance for prefill bootstrap timeout by @ShangmingCai in #7846
- Update native_api doc to match the change in the get_model_info endpoint by @Arist12 in #7660
- Revert "Embedding parallel by attn_tp (#7623)" by @zhyncs in #7880
- chore: bump v0.4.9.post1 by @zhyncs in #7882
- Fixes typo in assertion message by @JustinTong0323 in #7895
- [CI] Add deepep tests to CI by @ch-wan in #7872
- [CPU] [FP8] set SGLANG_CPU_FP8_CVT_FTZ in CMakeLists.txt by @chunyuan-w in #7885
- [CPU][Qwen3 MoE] Enable fused_topk CPU fusion and enhance FP8 TP padding by @jianan-gu in #7838
- Remove unused imports by @almaslof in #7898
- [router] Update metrics when request completes by @ZhangShuaiyi in #7899
- [feature] Add start step profile argument in /start_profile by @kyleliang-nv in #7608
- [bugfix] add pd router policy validation by @slin1237 in #7904
- vlm: support video as an input modality by @mickqian in #5888
- Feat: Support Phi-3.5-MoE in SGLang by @byjiang1996 in #7907
- add sentencepiece as dependency explicitly by @ZailiWang in #7922
- Fix bug of deepseek-v3 under DP+EP mode with large batchsize/seqlen by @likesen-alibaba in #6449
- [feature]Ascend quantization support by @ping1jing2 in #7791
- [ready b200] fuse allreduce+add_rmsnorm in prepare_attention + mlp module by @BBuf in #7775
- Support Kimi K2 by @Atream in #7940
- [feature] kv transfer support of ascend npu by @ping1jing2 in #7795
- fix: minor fix for modelopt weight load compatibility by @AniZpZ in #7953
- temporarily disable deepep-8-gpu and activate two small tests by @ch-wan in #7961
- [fix]Update unitest for fp8_blockwise_scaled_grouped_mm kernel by @HydraQYH in #7932
- chore: bump sgl-kernel v0.2.5 by @zhyncs in #7964
- Revert "[PD Disaggregation] replace transfer with batch transfer for better performance (#7236)" by @fzyzcjy in #7968
- chore: upgrade xgrammar 0.1.21 by @zhyncs in #7962
- delete useless code caused by fuse allreduce+add_rmsnorm PR by @BBuf in #7970
- Fix wrong gemm branch cause 250us slower by @fzyzcjy in #7969
- [router] add worker abstraction by @slin1237 in #7960
- chore: upgrade sgl-kernel 0.2.5 by @zhyncs in #7971
- chore: bump v0.4.9.post2 by @zhyncs in #7963
- [minor fix] llama4 hybrid memory by @Ying1123 in #7950
- [minor fix] SWA missing methods by @Ying1123 in #7972
- [script] update loogle test by @Ying1123 in #7975
- docs: update README by @zhyncs in #7985
- Overlap the gating function with shared experts in DeepSeek by @ch-wan in #7978
- [BugFix] fix pre_reorder_triton_kernel default int32 issue by @Yuechguo in #7814
- [minor] Add server_args check for Llama4 with hybrid by @Ying1123 in #7988
- Tiny fix mooncake log warning wrong output by @fzyzcjy in #7952
- [BugFix] add verify logit_bias to avoid crash because of IndexError by @ehuaa in #7749
- SWA Prefix Cache by @hanming-lu in #7367
- chore: remove unnecessary limits on quantization methods in test script by @AniZpZ in #7997
- Refactor dynamic LoRA update to fix incorrect handling of variant weight shapes by @lifuhuang in #7844
- Support for Phi-1.5 & Phi-2 models by @ppraneth in #7862
- [Dockerfile] Multi-arch support for ROCm by @mqhc2020 in #7902
- [CPU] fix no attribute 'can_fuse_mlp_allreduce' error by @chunyuan-w in #8010
- perf: add kimi k2 fused_moe tuning config for h30_3e by @GaoYusong in #8021
- [ci] CI supports use cached models by @HanHan009527 in #7874
- [Minor] Remove redundant print by @merrymercy in #8005
- [Feature]TP Group Switching for PD-Multiplexing by @ykcombat in #7653
- [Feature] CUDA Green Context Support by @ykcombat in #7649
- Fix flaky CI: test_vlm_models by @lifuhuang in #8006
- Fix Bug 'get_cpu_copy not Implemented' in pd offloading mode by @hzh0425 in #7982
- prevent server crash from potential invalid grammar by @ehuaa in #7897
- Setup workflow for releasing mi300x and mi350x dockers. by @saienduri in #8035
- fix: modality length mismatch with image_data by @Yangruipis in #7887
- Update CODEOWNERS by @CatherineSue in #8044
- [feat]Support fusion kernel for constructing quant input and scale factor for fp8_blockwise_scaled_grouped_mm by @HydraQYH in #8023
- feat: update multimodal data handling in engine entrypoint by @JustinTong0323 in #8002
- fix: remove redundant rotary embedding cache recomputation in MiniCPM by @JustinTong0323 in #8022
- Fix the input tools format and history tool_calls in OpenAI API by @chen700564 in #6556
- fix: resolve arm build issue by @zhyncs in #8052
- concurrently load weights of DeepseekV2ForCausalLM by @tianyuzhou95 in #7943
- H20 tune config for Kimi by @artetaout in #8047
- Update amd docker image. by @saienduri in #8045
- feat: replace Decord with video_reader-rs by @kozoy in #5163
- remove kv_a.contiguous in DeepseekV2AttentionMLA by @strgrb in #8058
- update transformers to 4.53.2 by @JustinTong0323 in #8029
- Fix different device type adjustment in PP by @Qiaolin-Yu in #7760
- Use device_group for all_gather when disabling overlap scheduling by @Qiaolin-Yu in #8001
- Revert "feat: replace Decord with video_reader-rs" by @mickqian in #8077
- Fix CI xeon test with triton 3.3.1 by @yanbing-j in #8086
- fix greenctx stream compatibility by @AniZpZ in #8090
- [misc] update nvshmem and pin deepEP commit hash by @slin1237 in #8098
- [Feature] Layer-wise Prefill by @jason-fxz in #7634
- [1/n] chore: decouple quantization implementation from vLLM dependency by @AniZpZ in #7992
- refactor: unify names of the feature field of MultimodalDataItem by @mickqian in #8075
- feat: add tp_rank, pp_rank and dp_rank labels for scheduler metrics by @acelyc111 in #7597
- [ci] limit cmake build nproc by @slin1237 in #8100
- [ci] disable memory imbalance check for draft worker by @ch-wan in #8108
- [Fix] ensure DeepGEMM is only enabled for FP8_W8A8 models by @hzh0425 in #8110
- [ci] recover 8-gpu deepep test by @ch-wan in #8105
- Refactor: move all quantization-related code to srt/layer/quantization by @ch-wan in #7989
- [kernel] opt moe align block kernel by block/warp scan algorithm by @yuan-luo in #7884
- Super tiny fix typo by @fzyzcjy in #8046
- fix: update HostKVCache init to report correct msg when available memory is not enough by @ziqifan617 in #8102
- [Hunyuan]: Fix Dense Model Support by @kzjeef in #8117
- feat: add production metric for retracted requests due to insufficient kvcache by @aftersnow in #7030
- refactor: simplify MultimodalTokens logic by @mickqian in #7924
- [Fix][Ready]Fix register spilling in cutlass nvfp4 gemm kernel on Blackwell by @HydraQYH in #8127
- Feat: Support Granite 3.0 MoE in SGLang by @zminglei in #7959
- load draft model fix by @yilian49 in #7506
- [CPU][Llama4] Fix Llama4 MoE inputs with "apply_router_weight_on_input" by @jianan-gu in #7889
- [Quantization][w8a8_int8] Fix weight loading issue for w8a8_int8 path with "ignore" layer list in quantization config by @jianan-gu in #7820
- Hicache Storage Layer Prototype by @xiezhq-hermann in #7704
- Revert "Fix different device type adjustment in PP" by @saienduri in #8141
- feat: enhance green context stream creation robustness with backward compatibility by @AniZpZ in #8136
- fix compressed tensors WNA16 imports by @qeternity in #8142
- [Bugfix] Fix w8a8_int8 import error on NPU by @iforgetmyname in #8147
- [3/n] chore: decouple AWQ implementation from vLLM dependency by @Hongbosherlock in #8113
- [router] Refactor router and policy traits with dependency injection by @slin1237 in #7987
- [AMD] Add triton awq_dequantize kernel to support AWQ on ROCm by @hubertlu-tw in #7661
- [Doc] Steps to add a new attention backend by @merrymercy in #8155
- chore: tune mem fraction static for vlm by @mickqian in #6881
- Support NVFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs by @haohui in #7302
- Feat: Support audio in Phi4-mm model by @byjiang1996 in #8048
- [PD] Support non-MLA models PD different TP with DP attention by @ShangmingCai in #7931
- [health_generate] fix: fix the /health_generate always success bug by @acelyc111 in #8028
- [router] router metrics cleanup by @slin1237 in #8158
- [router] allow router to have empty workers by @slin1237 in #8160
- Add GB200 wide-EP docker by @kyleliang-nv in #8157
- [1/N] MoE Refactor: refactor select_experts by @ch-wan in #7966
- chore: bump sgl-kernel v0.2.6 by @zhyncs in #8165
- chore: upgrade sgl-kernel 0.2.6 by @zhyncs in #8166
- Fix suffix mismatch for the metrics. by @Charles-L-Chen in #8168
- Update README.md by @merrymercy in #8171
- Clean up server args by @merrymercy in #8161
- Fix LoRA buffer contamination during adapter eviction by @lifuhuang in #8103
- Fix Dockerfile.gb200 by @kyleliang-nv in #8169
- [router] add ut for worker and errors by @slin1237 in #8170
- bugfix: fix sglang crash in NVIDIA MIG container by @Garrybest in #8167
- Support start up LoRA server without initial adapters by @lifuhuang in #8019
- Clean warning logs for gate_proj loading in Lora by @Fridge003 in #8172
- Fix tuning_fused_moe_triton.py by @ch-wan in #8175
- [Feature] Simple Improve Health Check Mechanism for Production-Grade Stability by @whybeyoung in #8115
- Add bf16 output option for dsv3_router_gemm kernel by @Fridge003 in #7999
- Enable FlashInfer support encoder models and add head_dim padding workaround by @ccs96307 in #6230
- Add get_hidden_dim to qwen3.py for correct lora by @logachevpa in #7312
- feat: add h200 tp 16 kimi k2 moe config by @zhyncs in #8176
- feat: add b200 tp 16 kimi k2 moe config by @zhyncs in #8178
- fix moe gate dtype, fix tbo, fix fake dispatch by @Atream in #7825
- Revert "[Feature] Simple Improve Health Check Mechanism for Production-Grade Stability" by @merrymercy in #8181
- feat: update nccl 2.27.6 by @zhyncs in #8182
- Feat: Support for Persimmon Model by @ppraneth in #7983
- feat: add h200 tp 16 kimi k2 moe config by @Qiaolin-Yu in #8183
- Fix eagle3 cuda graph by @Ja1Zhou in #8163
- fix: fix the bug of loading Internvl3 by @coco-alen in #8067
- Fix dtype error in CI by @ispobock in #8197
- [router] add ut for pd request, metrics and config by @slin1237 in #8184
- [feature] enable NPU CI by @ping1jing2 in #7935
- [fix] fix modelopt fp4 on b200 by @Alcanderian in #8195
- chore: bump sgl-kernel v0.2.6.post1 by @zhyncs in #8200
- Apply fused sorted token ids padding by @ispobock in #8193
- [Refactor] simplify multimodal data processing by @JustinTong0323 in #8107
- [router] add ut for pd router by @slin1237 in #8208
- [router] upgrade router version to 0.1.6 by @slin1237 in #8209
- Remove router gemm output dtype conversion by @ispobock in #8204
- chore: upgrade sgl-kernel 0.2.6.post1 by @zhyncs in #8202
- [Feature] Add a test for Layer-wise Prefill by @jason-fxz in #8231
- docs: update 2025 h2 roadmap by @zhyncs in #8237
- fix: retrieve mm token by modality, raise error if none by @JustinTong0323 in #8221
- [AMD] Remove vllm's scaled_fp8_quant and moe_sum when SGLANG_USE_AITER=1 by @hubertlu-tw in #7484
- fix: sgl-router remove dead code by @oldsharp in #8257
- [fix] benchmark : routed_scaling_factor is None by @panpan0000 in #8059
- [Benchmark] add disable-auto-run param for hicache/bench_multiturn by @rzwei in #7822
- Preliminary Support for Qwen3XMLDetector by @yhyang201 in #8260
- chore: bump v0.4.9.post3 by @zhyncs in #8265
- Skip llama4 vision module loading when multimodal disabled by @ispobock in #8272
- Fix sgl-kernel ci test by @ispobock in #8284
- Introduce Stable LoRA ID System for Overlapped Updates and Prefix Caching by @lifuhuang in #8261
- Hicache IO kernel refactoring by @xiezhq-hermann in #8264
- bug fix and tag by @xiezhq-hermann in #8282
- HiCache Fix by @xiezhq-hermann in #8288
- [sgl-kernel] Opt per_token_quant_fp8 with warp reduce by @yuan-luo in #8130
- [router] add common ut infra to mock worker and app by @slin1237 in #8295
- fix: workaround for deepgemm warmup issue by @zhyncs in #8302
- [Performance][PD Disaggregation] optimize TokenToKVPoolAllocator by sorting free pages by @YiXR in #8133
- Fix the issue of incorrect finish reason in final stream response chunk returned during tool call by @xianzhiT in #7708
- fix: match chat-template for internvl3 by @JustinTong0323 in #8262
- Fix gemma3n with hybrid swa by @JustinTong0323 in #8240
- chore: upgrade sgl-kernel 0.2.7 by @zhyncs in #8304
- fix: prevent crashes due to logit bias dimension mismatch by @0xymoro in #7685
- feat(function call): complete utility method for KimiK2Detector and enhance documentation by @CatherineSue in #8043
- Fix incomplete tool call capture issue in streaming response of DeepSeek-V3 when enable MTP by @xianzhiT in #7562
- [AMD] Pull latest image for AMD CI by @michael-amd in #8070
- Pin the version of petit kernel to fix the APIs by @haohui in #8235
- [bug] fix pd completion protocol for batching support by @slin1237 in #8317
- [router] fix pd model completion request by @slin1237 in #8303
- fix bug when eos_ids==0 by @bzantium in #8315
- [router] add endpoint unit test by @slin1237 in #8298
- [code style] Clean dead triton kernel code in fused_moe and useless vllm_ops import by @BBuf in #8310
- chore: upgrade flashinfer v0.2.9rc1 by @Swipe4057 in #8301
- [router] add streaming unit test by @slin1237 in #8299
- [router] add request format unit test by @slin1237 in #8300
- HiCache Storage TP Refinement by @xiezhq-hermann in #8307
- breakdown kernel update by @xiezhq-hermann in #8334
- support idle batch for TBO by @sherry-1001 in #8233
- [Feature] Integrate quick allreduce and select the best allreduce implementation by @lihaoyang-amd in #6619
- DP Enhancement by @ch-wan in #8280
- fix: Fix failed functional tests https://github.com/meta-llama/llama-stack-evals by @ynwang007 in #8266
- [AMD] Add silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, and gelu_quick kernels for AMD GPUs by @hubertlu-tw in #7135
- [CPU] Add tutorial docs for SGL on CPU by @ZailiWang in #8000
- chore: upgrade mooncake 0.3.5 by @ShangmingCai in #8341
- [torch.compile bug] avoid biased_grouped_topk_impl func repeatedly triggering torch.compile in forward pass by @BBuf in #8353
- [P/D] Support ipv6 in P/D scenario by @thefacetakt in #7858
- Add H20-3e fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct by @Xu-Wenqing in #8344
- [Bugfix][Feat] Add XML-ish grammar in EBNFComposer and fix misc bugs in Qwen3 detector by @CatherineSue in #8357
- Clean up server_args, triton cache manager by @merrymercy in #8332
- fix: upgrade nccl version by @zhyncs in #8359
- [Feat] Add reasoning parser for Qwen/Qwen3-235B-A22B-Thinking-2507 by @CatherineSue in #8363
- fix: kimi k2 xgrammar crash by @zhyncs in #8367
- Fix FP4 MoE accuracy from missing routed_scaling_factor by @trevor-m in #8333
- [CI] Fix flaky threshold by @merrymercy in #8370
- chore: bump v0.4.9.post4 by @zhyncs in #8305
- Fix test_moe_fused_gate_combined sgl-kernel ci test by @ispobock in #8374
- Update Dockerfile.gb200 to latest sglang by @kyleliang-nv in #8356
- chore: improve mmmu benchmark by @mickqian in #7000
- Save peak memory in logits processor by @ch-wan in #8343
- Extract update_weights from RL Engine to SGLang to keep simplicity and fix torch reduce by @hebiao064 in #8267
- chore: improvements on mm_utils by @mickqian in #7737
- vlm: optimize tensor transport by @mickqian in #6003
- Tiny assert EPLB is used together with expert parallel by @fzyzcjy in #8381
- model: support intern-s1 by @RunningLeon in #8350
- Add perf tests for LoRA by @lifuhuang in #8314
- Remove slot usage in code to be backward-compatible with python 3.9 by @lifuhuang in #8396
- Add docker release flow for gb200 by @kyleliang-nv in #8394
- HiCache, check before terminate prefetching by @xiezhq-hermann in #8372
- Add nvfp4 scaled mm benchmark. by @HydraQYH in #8401
- Urgent Fix: intern-s1 chat-template matching by @JustinTong0323 in #8403
- Tool to dump and compare internal activation tensors by @fzyzcjy in #7976
- Minor tool for comparison of benchmark results by @fzyzcjy in #7974
- Fix bench script making input data on L2 cache by @fzyzcjy in #7739
- [NVIDIA] Add Flashinfer MoE blockscale fp8 backend by @kaixih in #8036
- Update Cutlass in sgl-kernel to v4.1 by @Fridge003 in #8392
- fix: minor fix TransportProxyTensor under tp by @mickqian in #8382
- [router] add different policies for p node and d node by @slin1237 in #8395
- Add A800 fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct by @lambert0312 in #8351
- fix: fix the missing metrics on non-rank0 nodes by @acelyc111 in #7720
- [2/N] MoE Refactor: Unify weight loader and quant methods by @ch-wan in #8397
- Use FlashInfer FP4 gemm. by @elfiegg in #8241
- Support precomputed_embeddings for Llama 4 by @AlienKevin in #8156
- [hotfix] fix merge conflicts in FlashInferEPMoE by @ch-wan in #8405
- chore: update CODEOWNERS by @zhyncs in #8407
- chore: upgrade flashinfer v0.2.9rc2 by @zhyncs in #8406
- Support triton kernels v3.4.0 for fused_moe by @yuan-luo in #8258
- [Bugfix] Prevent PD server crash from invalid grammar by @ShangmingCai in #8062
- Change to use native arm runner by @kyleliang-nv in #8414
- Support overlapped lora updates by @lifuhuang in #8213
- Support ue8m0 for triton quant kernel by @fzyzcjy in #7603
- Fix: Improve test_openai_function_calling unit test and fix reasoning_parser.py think_start_token logic by @byjiang1996 in #8316
- bugfix: Fix multiple finish_reason chunks and tool_calls finish reason check by @CatherineSue in #8417
- Fix test_openai_server by @CatherineSue in #8419
- Fix docker buildx push error by @kyleliang-nv in #8425
- bugfix: Fix XGrammar backend to use model's EOS tokens for constrained generation by @CatherineSue in #8422
- [router] improve router logs and request id header by @slin1237 in #8415
- [feat] Support different attention backends for prefill and decode by @Qiaolin-Yu in #6338
- chore: bump transformers to 4.54.0 by @hebiao064 in #8416
- [PD] Fix abort_request for PD disaggregation by @ShangmingCai in #8352
- GLM-4.5 Model Support by @zRzRzRzRzRzRzR in #8224
- Remove zstd compression for building Dockerfile.gb200 by @kyleliang-nv in #8442
- doc: add bench_one_batch_server in the benchmark doc by @Qiaolin-Yu in #8441
- GLM-4.5 Model Support Follow-up by @byjiang1996 in #8445
- fix GLM4_MOE launch with compressed_tensor quant model by @zminglei in #8456
- Fix per_token_group_quant_8bit when hidden_dim // group_size is not divided by 4. by @strgrb in #8449
- Revert "[kernel] opt moe align block kernel by block/warp scan algorithm" by @BBuf in #8457
- chore: bump v0.4.9.post5 by @zhyncs in #8458
- fix: reorder topk experts to ensure shared expert replaces minimal score by @erictanjn in #8125
- Update PR template by @ispobock in #8465
- feat: throttle requests at scheduler based on --max_queued_requests by @harrisonlimh in #7565
- fix: update dep by @zhyncs in #8467
- [NVIDIA] Change to use num_local_experts by @kaixih in #8453
- Fix parsing ChatCompletionMessage by @Onyad in #7273
- [3/N] MoE Refactor: Simplify DeepEP Output by @ch-wan in #8421
- feat: support glm4 tuning by @zhyncs in #8473
- Fix DEEPEP BF16 compatibility for Deepseek Style model like GLM 4.5 by @hebiao064 in #8469
- Update codeowner by @merrymercy in #8476
- chore: add glm4 fp8 tp8 config by @zhyncs in #8478
- chore: add glm 4.5 fp8 tp4 config by @zhyncs in #8480
- [CI]Add genai-bench Performance Validation for PD Router by @key4ng in #8477
- Update CODEOWNERS by @merrymercy in #8485
- Rename the last step in pr-test.yml as pr-test-finish by @merrymercy in #8486
- Reduce memory usage for fp4 moe by @fzyzcjy in #8413
- Tiny add warnings for DeepEP when it is suboptimal by @fzyzcjy in #8426
- Support colocating requests by @fzyzcjy in #7973
- Fix incorrect KV cache allocation for MTP models. by @lifuhuang in #8482
- Add PVC and update resource limits in k8s config by @haitwang-cloud in #8489
- chore: bump v0.4.9.post6 by @zhyncs in #8517
- Always trigger pr-test by @merrymercy in #8527
- Update README.md by @merrymercy in #8528
- [sgl-kernel performance] fix fp8 quant kernels dispatch __nv_fp8_e4m3 bug to improve performance by 10%-20% by @BBuf in #8499
- Update cutlass_moe.py by @elfiegg in #8535
- Fix moe align kernel test by @ispobock in #8531
- Split the scheduler into multiple mixin classes to reduce the file size by @merrymercy in #8483
- bring back kimi vl ci by @hebiao064 in #8537
- fix: temporarily disable cuda-ipc for mm data tensor by @mickqian in #8431
- Support EPLB in FusedMoE by @ch-wan in #8448
- feat(hicache): support file backend reading directory config from env by @hzh0425 in #8498
- feature(pd-hicache): Prefill instances support reusing the RemoteStorage Cache via HiCache. by @hzh0425 in #8516
- [router] allow longer time out for router e2e by @slin1237 in #8560
- Update cutlass_moe.py by @elfiegg in #8545
- Update CODEOWNERS by @ShangmingCai in #8562
- [feature] [sgl-router] Add a dp-aware routing strategy by @oldsharp in #6869
- [Hot-Fix] Fix moe_aligned_block_size CI failure on AMD by @yuan-luo in #8461
- [Model] Add support for Arcee Foundational Model by @adarshxs in #8154
- Revert "Fix the input tools format and history tool_calls in OpenAI API (#6556)" by @CatherineSue in #8584
- Add hf3fs support for hicache storage (based on #7704) by @pansicheng in #7280
- [router] migrate router from actix to axum by @slin1237 in #8479
- [Fix]Fix index oob in get_group_gemm_starts kernel. by @HydraQYH in #8564
- Bump transformers to 4.54.1 to fix Gemma cache issue by @lifuhuang in #8541
- Add GKE's default CUDA runtime lib location to PATH and LD_LIBRARY_PATH. by @pyc96 in #8544
- Bug: Fix google gemma3n-mm audio input not working bug by @byjiang1996 in #8365
- update sgl-kernel for EP: kernel part by @ch-wan in #8514
- chore: bump sgl-kernel v0.2.8 by @zhyncs in #8599
- [bugfix] Fix 2 minor bugs in the hicache storage layer by @yapple in #8404
- fix incorrect increase of hit count by @huangtingwei9988 in #8533
- Support l3 cache (mooncake store) for hiradix cache by @huangtingwei9988 in #7211
- update sgl-kernel for EP: python part by @ch-wan in #8550
- add SVG logo by @hnyls2002 in #8603
- [4/N] MoE Refactor: Unified Triton Kernel for FusedMoE and EPMoE by @ch-wan in #8515
- fix: fork should not run pypi router by @yihong0618 in #8604
- model: support Step3V by @CatherineSue in #8583
- [Feature] Hybrid EP and TP by @ch-wan in #8590
- chore: bump v0.4.10 by @zhyncs in #8608
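Several of the entries above (the serving_chat.py improvements, request_id patching, finish_reason and EOS-token fixes, and the tool-call parsers) all touch SGLang's OpenAI-compatible serving path. As a point of reference only, here is a minimal sketch of calling a locally launched server through that API; the port (30000) and the placeholder model name are assumptions for a default local launch, not something this changelog specifies.

```python
from openai import OpenAI

# SGLang exposes an OpenAI-compatible API; base URL and API key below are
# assumptions for a default local launch (the server does not need a real key).
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",  # placeholder; a single-model server typically ignores this
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain KV-cache reuse in one sentence."},
    ],
    max_tokens=64,
)

print(response.choices[0].message.content)
# finish_reason is one of the fields touched by the fixes above, e.g. "stop" or "tool_calls".
print(response.choices[0].finish_reason)
```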
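PR #7446 above adds dynamic LoRA loading and unloading to the engine/server API. The sketch below shows one way a client might drive it over HTTP; the endpoint names (/load_lora_adapter, /unload_lora_adapter), the payload fields, and the per-request lora_path selector are assumptions based on the PR title and common SGLang usage, so treat it as illustrative rather than the exact API.

```python
import requests

BASE_URL = "http://localhost:30000"  # default SGLang server port (assumption)

# Load an adapter at runtime (endpoint name and payload fields are assumptions).
resp = requests.post(
    f"{BASE_URL}/load_lora_adapter",
    json={"lora_name": "my-adapter", "lora_path": "/path/to/adapter"},
)
resp.raise_for_status()

# Generate with the adapter; "lora_path" selecting the loaded adapter by name
# is an assumption based on SGLang's per-request LoRA usage.
out = requests.post(
    f"{BASE_URL}/generate",
    json={
        "text": "Write a haiku about GPUs.",
        "sampling_params": {"max_new_tokens": 32},
        "lora_path": "my-adapter",
    },
).json()
print(out)

# Unload when no longer needed (endpoint name is an assumption).
requests.post(f"{BASE_URL}/unload_lora_adapter", json={"lora_name": "my-adapter"})
```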
New Contributors
Full Changelog: v0.4.8...v0.4.10