Release v0.4.10 · sgl-project/sglang
Highlights
This is a regular release with many new optimizations, features, and fixes. Please check out the following roadmaps and blogs.
What's Changed
- [AMD] add aiter fused moe in DeepEP path by @alexsun07 in #7268
- enable aiter_biased_grouped_topk kernel by @valarLip in #7423
- [PD Disaggregation] replace transfer with batch transfer for better performance by @ssssnow in #7236
- Remove cumsum_buffer initialization by @ispobock in #7439
- [benchmark] fbgemm benchmark support bandwidth report and support fbgemm_cutlass_gmm by @BBuf in #7422
- Support multi-thread model weight loading by @xianzhiT in #7277
- [PD] NIXL: Register kv args in advance and cleanup finished requests by @trevor-m in #6717
- fix: Add --model as an alias for --model-path in server_args by @CatherineSue in #7505
- misc: Improvement to serving_chat.py and add more ut by @CatherineSue in #7489 (see the chat-completions sketch after this list)
- Fuse sorted_token_ids padding to moe_align_block_size kernel by @ispobock in #7437
- [OAI] patch origin request_id logic by @whybeyoung in #7508
- [PD][Spec] Fix hidden state transfer for spec decode by @ShangmingCai in #7516
- EPLB support for MTP by @yilian49 in #7510
- clean duplicate code by @habaohaba in #7512
- [ci] add router benchmark script and CI by @slin1237 in #7498
- fix: force synchronization between TP workers when update_weights by @dangkai4u in #6626
- [CPU] [BF16] Call fused_experts_cpu, weight_packed_linear and bmm_cpu kernel in DeepSeek model by @chunyuan-w in #6641
- [CI] Upgrade mooncake to v0.3.4.post2 to fix potential slice failed bug by @ShangmingCai in #7522
- npu fused op by @ll819214 in #7386
- feat: send kvmetrics from sglang scheduler by @zixuanzhang226 in #6721
- [PD] Add different TP sizes support for no-MLA models by @Hongbosherlock in #6793
- enable aiter fp8 blockscale quant by @valarLip in #7520
- take aiter get_rope back by @valarLip in #7521
- Fix typo of flash_cache by @hebiao064 in #7513
- feat: add return hidden_states at async generation by @yyihuang in #7507
- minor: 'role' must be system/assistant/tool, but case insensitive for now by @minleminzui in #7499
- Fix FP8 KV Cache Support in FA3 Backend by @guoyuhong in #7148
- Fix gathered_buffer issues in tbo by @Qiaolin-Yu in #7531
- [PD] Raise error for incompatible mooncake version and some minor fixes by @ShangmingCai in #7527
- [CMake] Fix sgl-kernel CMakeLists for Blackwell by @MasterJH5574 in #7543
- Add Tencent HunYuanMoEV1 model support by @mpjlu in #7549
- Update seed in CPU UTs to avoid flaky failure with single test by @yanbing-j in #7544
- chore: improve ci bug reporting by @mickqian in #7542
- chore: remove vlm unnecessary import by @JustinTong0323 in #7541
- chore: bump v0.4.8.post1 by @zhyncs in #7559
- [PD][NIXL] Set is_sorted=False to fix NIXL_ERR_NOT_FOUND by @trevor-m in #7330
- [Fix] incorrect assert in EPLB by @ch-wan in #7575
- Updates Gemma3n MLP layer to adapt latest transformers version by @JustinTong0323 in #7573
- Fix MTP error when enabling two-batch overlap by @fzyzcjy in #7569
- Add e2e test for multi-instance multi-stage memory release/resume occupation by @MrAta in #7208
- [CI] Add CI Testing for Prefill-Decode Disaggregation with Router by @key4ng in #7540
- Updates transformers and timm dependencies by @JustinTong0323 in #7577
- feat: support compatibility between MTP and two-batch-overlap by @Qiaolin-Yu in #7225
- Move multimodal processors into a separate folder by @merrymercy in #7581
- Fix broken CI TestVILAServer by @lifuhuang in #7610
- [router] add centralized configuration module for sgl-router by @slin1237 in #7588
- Fix: Minicpm by @JustinTong0323 in #7612
- Hybrid kv cache for LLaMA4 by @tarinkk in #6563
- [CPU] add optimizations for INT8 and FP8 DeepSeek by @chunyuan-w in #6769
- Tiny add logs for expert location updater by @fzyzcjy in #7308
- Fix flakiness in LoRA batch test. by @lifuhuang in #7552
- [BUG] fix local_rank in initialize_dp_attention by @TomQuartz in #7584
- Support dynamic LoRA loading / unloading in engine/server API by @lifuhuang in #7446 (see the usage sketch after this list)
- [PD] Respect sampling_params.max_new_tokens when PD disaggregation is activated by @ShangmingCai in #7598
- fix unit tests by @zhyncs in #7618
- Let ep_scatter support arbitrary strides / ue8m0 format by @fzyzcjy in #7309
- Let EP prefill support new DeepGEMM by @fzyzcjy in #7310
- docs: add gb200 nvl72 and a16z grant by @zhyncs in #7620
- Adds support for OpenAI chat completions API in bench_serving by @JustinTong0323 in #7036
- [bugfix] Remove PR comment posting from Rust benchmark workflow by @slin1237 in #7625
- [Minor] clean up multimodal processor and tokenizer manager by @merrymercy in #7624
- Add dsv3 fused a gemm to sgl-kernel by @ispobock in #7630
- Add @mickqian as the CODEOWNERS of multimodal by @merrymercy in #7636
- Fix stream reasoning parser and Adds Kimi reasoning parser by @JustinTong0323 in #7432
- Fix sgl-router startup crash by @finetunej in #7619
- [bugfix] fix runtime dropping panic in editable by @slin1237 in #7628
- Move files related to EPLB by @fzyzcjy in #7580
- [misc] reduce weird rope_scaling_factor warning by @Alcanderian in #7176
- [AMD] Add unit-test-sgl-kernel-amd to AMD CI by @hubertlu-tw in #7539
- Update CODEOWNERS by @merrymercy in #7640
- [EAGLE] remove a wrong adjustment for page_size > 1 & topk > 1 in server_args.py by @merrymercy in #7643
- [CPU] add c++ kernel to bind CPU cores and memory node by @chunyuan-w in #7524
- Improve streaming, log_level, memory report, weight loading, and benchmark script by @merrymercy in #7632
- Add dsv3 router gemm kernel by @Fridge003 in #7627
- chore: upgrade flashinfer v0.2.7 jit by @zhyncs in #7663
- [doc] update lws doc for pd by @whybeyoung in #7318
- Fix: sync prepare_fp8_layer_for_marlin with latest vllm changes by @narutolhy in #7648
- Add small requirements for benchmark/parse_result tools by @BBuf in #7671
- [CPU] remove process_group from inputs of shm_allreduce and shm_allgather by @chunyuan-w in #7486
- chore: bump sgl-kernel v0.2.1 by @zhyncs in #7675
- support llama4 eagle3 by @sleepcoo in #6985
- Refactor mm processors and Enable mixed modality processing by @JustinTong0323 in #7629
- upgrade sgl kernel to 0.2.1 for main by @xiezhq-hermann in #7676
- add description for llama4 eagle3 by @yizhang2077 in #7688
- fix(model loader): use safe_open to prevent file handle leaks. by @SimonCqk in #7684
- chore: upgrade flashinfer v0.2.7.post1 by @zhyncs in #7698
- Improve error handling for requests with unloaded LoRA path(s) by @lifuhuang in #7642
- Apply dsv3_fused_a_gemm kernel by @ispobock in #7635
- Fix GPTQMarlinMoE by @lkm2835 in #7697
- [1/n] apply wna16marlin kernel in moe weight only quantization by @AniZpZ in #7683
- Apply dsv3 router gemm kernel for deepseek-r1 fp4 by @Fridge003 in #7677
- [AMD] Temporarily disable test_no_overlap_scheduler and test_vision_chunked_prefill by @hubertlu-tw in #7717
- [RL] add --skip-warmup by @zhuzilin in #7416
- [RL] support update_weights_from_distributed with different group and multiple weights by @zhuzilin in #7292
- [router] add --log-level to sgl-router by @zhuzilin in #6512
- [b200] support trt-llm allreduce fuse rms_norm_add kernel by @BBuf in #7621
- [CPU] Bind threads and numa node for each TP rank by @chunyuan-w in #6549
- Support non-contiguous query input for extend/decode attention by @yanbing-j in #7462
- Support updating weights at once by stopping all requests by @tianyuzhou95 in #6698
- Fix num_tokens_pre_allocated in disaggregation log by @ZeldaHuang in #7714
- [CPU] [sgl-kernel] set dispatch key of initialize to CatchAll by @chunyuan-w in #7734
- [CPU] fix all_reduce and all_gather by @chunyuan-w in #6770
- fix awq and dsv3 fused gemm compatible by @AniZpZ in #7735
- [CI][Router] Fix bench_one_batch_server for pd router test by @ShangmingCai in #7731
- Add CUTLASS FP8 Blockscale MoE kernel for Hopper architecture by @ayrnb in #7278
- fix dsv3 fused proj check by @AniZpZ in #7738
- Ascend attention backend(PA&MLA) by @ping1jing2 in #7722
- [fix] fix dsv3_router_gemm filter by @Alcanderian in #7750
- [CPU] refine CPU integration code by @chunyuan-w in #7647
- [CPU] support the case where num_attention_heads or intermediate_size is not divisible by the TP size by @chunyuan-w in #6771
- support qwen3 dense model dp attention by @yizhang2077 in #7681
- [optimize] add two stream norm for qwen3 by @yizhang2077 in #7740
- feat: use D2D instead of H2H in pp by @TianyuZhang1214 in #7673
- [Bug] add flashinfer bool check for fusedmoe in Qwen moe models by @yilian49 in #7723
- [fix] put cpu in the first priority in get_device() by @Alcanderian in #7752
- [optimize] fuse renormalize into moe_topk_softmax by @yizhang2077 in #7744
- chore: bump sgl-kernel 0.2.2 by @zhyncs in #7755
- fix CI: update native api ipynb by @JustinTong0323 in #7754
- fuse renormal into moe topk softmax kernel python code by @yizhang2077 in #7751
- Remove type conversion and fix id map in topk by @ispobock in #7759
- Add V2-lite model test by @yanbing-j in #7390
- refactor llama4 dp attention logic by @yizhang2077 in #7729
- fix(docs): fix the broken link in docs/references/production_metrics.md by @rudeigerc in #7741
- [fix] update bench_speculative.py for compatibility by @yankay in #7764
- Move mem_fraction_static adjustment for multimodal models to server_args.py & Fix session control & Other cleanups by @merrymercy in #7748
- [RL] Add --nccl-port to prevent port conflict by @zhuzilin in #7418
- [RL] add pause and continue generation for async rl training by @zhuzilin in #7419
- [Fix] Alloc return type error by @Capronir in #7778
- [feat] Support EAGLE3 for Qwen by @Ximingwang-09 in #7745
- saving hidden_states.clone() by @ch-wan in #7705
- [1/n]: add cutlass W4A8 moe kernel for hopper architecture by @yangsijia-serena in #7772
- add model: qwen2-audio by @leng-yue in #7596
- Optimize Hopper CUTLASS FP8 Blockwise Grouped GEMM Kernel in Small K Scenario by @HydraQYH in #7782
- Embedding parallel by attn_tp by @MoonBall in #7623
- fix: fix apply_shuffle_mul_sum by @mickqian in #7444
- chore: bump sgl-kernel v0.2.3 by @zhyncs in #7784
- fix: use nvidia-nccl-cu12 2.27.5 by @zhyncs in #7787
- DP Attention with Auto DeepEP Dispatch by @ch-wan in #7222
- chore: upgrade sgl-kernel v0.2.3 by @zhyncs in #7786
- Fix incorrect spec_num_draft_tokens in draft_extend by @ch-wan in #7757
- [fix] fix misusing of is_cuda by @Alcanderian in #7790
- Add treemask mode to build_eagle_tree & release sgl-kernel 0.2.3 by @merrymercy in #7756
- chore: bump sgl-kernel v0.2.4 by @zhyncs in #7800
- ci: fix port args by @mickqian in #7792
- Fix CI test OOM issue. by @lifuhuang in #7799
- chore: upgrade sgl-kernel v0.2.4 by @zhyncs in #7801
- chore: bump v0.4.9 by @zhyncs in #7802
- [misc] remove pdlb rust by @slin1237 in #7796
- fix: free disk space by @zhyncs in #7803
- fix: disable dsv3_router_gemm in dsv3_nextn by @Alcanderian in #7793
- Support logprobs in two-batch overlap by @fzyzcjy in #7709
- Fix division-by-zero bug in LoRA triton kernels. by @lifuhuang in #7785
- [AMD] Add test_fused_moe.py and test_rope_rocm.py to AMD CI by @hubertlu-tw in #5246
- [RL] Fix illegal memory for _import_static_state by @hebiao064 in #7733
- Fix _import_static_state issue by @nanjiangwill in #7812
- Optimize moe align block size kernel by @ispobock in #7794
- Log the timestamps of each prefill/decode iteration by @yuhsuan-t in #6094
- [bugfix] Fix sgl-router get_server_info endpoint compatibility issue by @slin1237 in #7813
- Integrate triton moe kernel by @yuan-luo in #7689
- Kernels for efficient KV cache IO by @xiezhq-hermann in #7313
- [docs] update router readme by @slin1237 in #7797
- [misc] release new router version by @slin1237 in #7798
- fix duplicate args in schedule_batch by @ZeldaHuang in #7816
- [AMD] Fail gracefully when AITER is unavailable on gfx90a GPUs by @haohui in #7187
- docs: update README by @zhyncs in #7821
- feat: support DeepSeek-R1-W4AFP8 model with ep-moe mode by @yangsijia-serena in #7762
- Enable ModelOpt Llama4 fp8 checkpoint deployment in SGLang by @Edwardf0t1 in #7129
- [Minor] Fix sporadic CI timeout caused by underestimated tests. by @lifuhuang in #7850
- [Bugfix] Fix two batch overlap with auto DeepEP Dispatch by @ShangmingCai in #7853
- Fix cache modules of triton import error by @kkHuang-amd in #7832
- [router] forward stream_options in request by @ZhangShuaiyi in #7860
- Fix illegal memory in trtllm allreduce fusion by @BBuf in #7864
- Fix llama4 vision by @JustinTong0323 in #7840
- Support Mimo-VL by @JustinTong0323 in #7579
- fix: Handles input_embeds in GenerateReqInput when n>1 by @JustinTong0323 in #7830
- [Multimodal][Perf] Use pybase64 instead of base64 by @b8zhong in #7724
- Bump xgrammar's version to 0.1.20 by @whybeyoung in #7866
- [CPU]convert topk_weights to fp32 for INT8 and FP8 paths (for llama4) and fix LmHead weight pack by @chunyuan-w in #7818
- [PD] Add guidance for prefill bootstrap timeout by @ShangmingCai in #7846
- Update native_api doc to match the change in the get_model_info endpoint by @Arist12 in #7660
- Revert "Embedding parallel by attn_tp (#7623)" by @zhyncs in #7880
- chore: bump v0.4.9.post1 by @zhyncs in #7882
- Fixes typo in assertion message by @JustinTong0323 in #7895
- [CI] Add deepep tests to CI by @ch-wan in #7872
- [CPU] [FP8] set SGLANG_CPU_FP8_CVT_FTZ in CMakeLists.txt by @chunyuan-w in #7885
- [CPU][Qwen3 MoE] Enable fused_topk CPU fusion and enhance FP8 TP padding by @jianan-gu in #7838
- Remove unused imports by @almaslof in #7898
- [router] Update metrics when request completes by @ZhangShuaiyi in #7899
- [feature] Add start step profile argument in /start_profile by @kyleliang-nv in #7608
- [bugfix] add pd router policy validation by @slin1237 in #7904
- vlm: support video as an input modality by @mickqian in #5888
- Feat: Support Phi-3.5-MoE in SGLang by @byjiang1996 in #7907
- add sentencepiece as dependency explicitly by @ZailiWang in #7922
- Fix bug of deepseek-v3 under DP+EP mode with large batchsize/seqlen by @likesen-alibaba in #6449
- [feature]Ascend quantization support by @ping1jing2 in #7791
- [ready b200] fuse allreduce+add_rmsnorm in prepare_attention + mlp module by @BBuf in #7775
- Support Kimi K2 by @Atream in #7940
- [feature] kv transfer support of ascend npu by @ping1jing2 in #7795
- fix: minor fix for modelopt weight load compatibility by @AniZpZ in #7953
- temporarily disable deepep-8-gpu and activate two small tests by @ch-wan in #7961
- [fix]Update unitest for fp8_blockwise_scaled_grouped_mm kernel by @HydraQYH in #7932
- chore: bump sgl-kernel v0.2.5 by @zhyncs in #7964
- Revert "[PD Disaggregation] replace transfer with batch transfer for better performance (#7236)" by @fzyzcjy in #7968
- chore: upgrade xgrammar 0.1.21 by @zhyncs in #7962
- delete useless code caused by fuse allreduce+add_rmsnorm PR by @BBuf in #7970
- Fix wrong gemm branch cause 250us slower by @fzyzcjy in #7969
- [router] add worker abstraction by @slin1237 in #7960
- chore: upgrade sgl-kernel 0.2.5 by @zhyncs in #7971
- chore: bump v0.4.9.post2 by @zhyncs in #7963
- [minor fix] llama4 hybrid memory by @Ying1123 in #7950
- [minor fix] SWA missing methods by @Ying1123 in #7972
- [script] update loogle test by @Ying1123 in #7975
- docs: update README by @zhyncs in #7985
- Overlap the gating function with shared experts in DeepSeek by @ch-wan in #7978
- [BugFix] fix pre_reorder_triton_kernel default int32 issue by @Yuechguo in #7814
- [minor] Add server_args check for Llama4 with hybrid by @Ying1123 in #7988
- Tiny fix mooncake log warning wrong output by @fzyzcjy in #7952
- [BugFix] add verify logit_bias to avoid crash because of IndexError by @ehuaa in #7749
- SWA Prefix Cache by @hanming-lu in #7367
- chore: remove unnecessary limits on quantization methods in test script by @AniZpZ in #7997
- Refactor dynamic LoRA update to fix incorrect handling of variant weight shapes by @lifuhuang in #7844
- Support for Phi-1.5 & Phi-2 models by @ppraneth in #7862
- [Dockerfile] Multi-arch support for ROCm by @mqhc2020 in #7902
- [CPU] fix no attribute 'can_fuse_mlp_allreduce' error by @chunyuan-w in #8010
- perf: add kimi k2 fused_moe tuning config for h30_3e by @GaoYusong in #8021
- [ci] CI supports use cached models by @HanHan009527 in #7874
- [Minor] Remove redundant print by @merrymercy in #8005
- [Feature]TP Group Switching for PD-Multiplexing by @ykcombat in #7653
- [Feature] CUDA Green Context Support by @ykcombat in #7649
- Fix flaky CI: test_vlm_models by @lifuhuang in #8006
- Fix Bug 'get_cpu_copy not Implemented' in pd offloading mode by @hzh0425 in #7982
- prevent server crash from potential invalid grammar by @ehuaa in #7897
- Setup workflow for releasing mi300x and mi350x dockers. by @saienduri in #8035
- fix: modality length mismatch with image_data by @Yangruipis in #7887
- Update CODEOWNERS by @CatherineSue in #8044
- [feat]Support fusion kernel for constructing quant input and scale factor for fp8_blockwise_scaled_grouped_mm by @HydraQYH in #8023
- feat: update multimodal data handling in engine entrypoint by @JustinTong0323 in #8002
- fix: remove redundant rotary embedding cache recomputation in MiniCPM by @JustinTong0323 in #8022
- Fix the input tools format and history tool_calls in OpenAI API by @chen700564 in #6556
- fix: resolve arm build issue by @zhyncs in #8052
- concurrently load weights of DeepseekV2ForCausalLM by @tianyuzhou95 in #7943
- H20 tune config for Kimi by @artetaout in #8047
- Update amd docker image. by @saienduri in #8045
- feat: replace Decord with video_reader-rs by @kozoy in #5163
- remove kv_a.contiguous in DeepseekV2AttentionMLA by @strgrb in #8058
- update transformers to 4.53.2 by @JustinTong0323 in #8029
- Fix different device type adjustment in PP by @Qiaolin-Yu in #7760
- Use device_group for all_gather when disabling overlap scheduling by @Qiaolin-Yu in #8001
- Revert "feat: replace Decord with video_reader-rs" by @mickqian in #8077
- Fix CI xeon test with triton 3.3.1 by @yanbing-j in #8086
- fix greenctx stream compatibility by @AniZpZ in #8090
- [misc] update nvshmem and pin deepEP commit hash by @slin1237 in #8098
- [Feature] Layer-wise Prefill by @jason-fxz in #7634
- [1/n] chore: decouple quantization implementation from vLLM dependency by @AniZpZ in #7992
- refactor: unify names of the feature field of MultimodalDataItem by @mickqian in #8075
- feat: add tp_rank, pp_rank and dp_rank labels for scheduler metrics by @acelyc111 in #7597
- [ci] limit cmake build nproc by @slin1237 in #8100
- [ci] disable memory imbalance check for draft worker by @ch-wan in #8108
- [Fix] ensure DeepGEMM is only enabled for FP8_W8A8 models by @hzh0425 in #8110
- [ci] recover 8-gpu deepep test by @ch-wan in #8105
- Refactor: move all quantization-related code to srt/layer/quantization by @ch-wan in #7989
- [kernel] opt moe align block kernel by block/warp scan algorithm by @yuan-luo in #7884
- Super tiny fix typo by @fzyzcjy in #8046
- fix: update HostKVCache init to report correct msg when available memory is not enough by @ziqifan617 in #8102
- [Hunyuan]: Fix Dense Model Support by @kzjeef in #8117
- feat: add production metric for retracted requests due to insufficient kvcache by @aftersnow in #7030
- refactor: simplify MultimodalTokens logic by @mickqian in #7924
- [Fix][Ready]Fix register spilling in cutlass nvfp4 gemm kernel on Blackwell by @HydraQYH in #8127
- Feat: Support Granite 3.0 MoE in SGLang by @zminglei in #7959
- load draft model fix by @yilian49 in #7506
- [CPU][Llama4] Fix Llama4 MoE inputs with "apply_router_weight_on_input" by @jianan-gu in #7889
- [Quantization][w8a8_int8] Fix weight loading issue for w8a8_int8 path with "ignore" layer list in quantization config by @jianan-gu in #7820
- Hicache Storage Layer Prototype by @xiezhq-hermann in #7704
- Revert "Fix different device type adjustment in PP" by @saienduri in #8141
- feat: enhance green context stream creation robustness with backward compatibility by @AniZpZ in #8136
- fix compressed tensors WNA16 imports by @qeternity in #8142
- [Bugfix] Fix w8a8_int8 import error on NPU by @iforgetmyname in #8147
- [3/n] chore: decouple AWQ implementation from vLLM dependency by @Hongbosherlock in #8113
- [router] Refactor router and policy traits with dependency injection by @slin1237 in #7987
- [AMD] Add triton awq_dequantize kernel to support AWQ on ROCm by @hubertlu-tw in #7661
- [Doc] Steps to add a new attention backend by @merrymercy in #8155
- chore: tune mem fraction static for vlm by @mickqian in #6881
- Support NVFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs by @haohui in #7302
- Feat: Support audio in Phi4-mm model by @byjiang1996 in #8048
- [PD] Support non-MLA models PD different TP with DP attention by @ShangmingCai in #7931
- [health_generate] fix: fix the /health_generate always success bug by @acelyc111 in #8028
- [router] router metrics cleanup by @slin1237 in #8158
- [router] allow router to have empty workers by @slin1237 in #8160
- Add GB200 wide-EP docker by @kyleliang-nv in #8157
- [1/N] MoE Refactor: refactor select_experts by @ch-wan in #7966
- chore: bump sgl-kernel v0.2.6 by @zhyncs in #8165
- chore: upgrade sgl-kernel 0.2.6 by @zhyncs in #8166
- Fix suffix mismatch for the metrics. by @Charles-L-Chen in #8168
- Update README.md by @merrymercy in #8171
- Clean up server args by @merrymercy in #8161
- Fix LoRA buffer contamination during adapter eviction by @lifuhuang in #8103
- Fix Dockerfile.gb200 by @kyleliang-nv in #8169
- [router] add ut for worker and errors by @slin1237 in #8170
- bugfix: fix sglang crash in NVIDIA MIG container by @Garrybest in #8167
- Support start up LoRA server without initial adapters by @lifuhuang in #8019
- Clean warning logs for gate_proj loading in Lora by @Fridge003 in #8172
- Fix tuning_fused_moe_triton.py by @ch-wan in #8175
- [Feature] Simple Improve Health Check Mechanism for Production-Grade Stability by @whybeyoung in #8115
- Add bf16 output option for dsv3_router_gemm kernel by @Fridge003 in #7999
- Enable FlashInfer support encoder models and add head_dim padding workaround by @ccs96307 in #6230
- Add get_hidden_dim to qwen3.py for correct lora by @logachevpa in #7312
- feat: add h200 tp 16 kimi k2 moe config by @zhyncs in #8176
- feat: add b200 tp 16 kimi k2 moe config by @zhyncs in #8178
- fix moe gate dtype, fix tbo, fix fake dispatch by @Atream in #7825
- Revert "[Feature] Simple Improve Health Check Mechanism for Production-Grade Stability" by @merrymercy in #8181
- feat: update nccl 2.27.6 by @zhyncs in #8182
- Feat: Support for Persimmon Model by @ppraneth in #7983
- feat: add h200 tp 16 kimi k2 moe config by @Qiaolin-Yu in #8183
- Fix eagle3 cuda graph by @Ja1Zhou in #8163
- fix: fix the bug of loading Internvl3 by @coco-alen in #8067
- Fix dtype error in CI by @ispobock in #8197
- [router] add ut for pd request, metrics and config by @slin1237 in #8184
- [feature] enable NPU CI by @ping1jing2 in #7935
- [fix] fix modelopt fp4 on b200 by @Alcanderian in #8195
- chore: bump sgl-kernel v0.2.6.post1 by @zhyncs in #8200
- Apply fused sorted token ids padding by @ispobock in #8193
- [Refactor] simplify multimodal data processing by @JustinTong0323 in #8107
- [router] add ut for pd router by @slin1237 in #8208
- [router] upgrade router version to 0.1.6 by @slin1237 in #8209
- Remove router gemm output dtype conversion by @ispobock in #8204
- chore: upgrade sgl-kernel 0.2.6.post1 by @zhyncs in #8202
- [Feature] Add a test for Layer-wise Prefill by @jason-fxz in #8231
- docs: update 2025 h2 roadmap by @zhyncs in #8237
- fix: retrieve mm token by modality, raise error if none by @JustinTong0323 in #8221
- [AMD] Remove vllm's scaled_fp8_quant and moe_sum when SGLANG_USE_AITER=1 by @hubertlu-tw in #7484
- fix: sgl-router remove dead code by @oldsharp in #8257
- [fix] benchmark : routed_scaling_factor is None by @panpan0000 in #8059
- [Benchmark] add disable-auto-run param for hicache/bench_multiturn by @rzwei in #7822
- Preliminary Support for Qwen3XMLDetector by @yhyang201 in #8260
- chore: bump v0.4.9.post3 by @zhyncs in #8265
- Skip llama4 vision module loading when multimodal disabled by @ispobock in #8272
- Fix sgl-kernel ci test by @ispobock in #8284
- Introduce Stable LoRA ID System for Overlapped Updates and Prefix Caching by @lifuhuang in #8261
- Hicache IO kernel refactoring by @xiezhq-hermann in #8264
- bug fix and tag by @xiezhq-hermann in #8282
- HiCache Fix by @xiezhq-hermann in #8288
- [sgl-kernel] Opt per_token_quant_fp8 with warp reduce by @yuan-luo in #8130
- [router] add common ut infra to mock worker and app by @slin1237 in #8295
- fix: workaround for deepgemm warmup issue by @zhyncs in #8302
- [Performance][PD Disaggregation] optimize TokenToKVPoolAllocator by sorting free pages by @YiXR in #8133
- Fix the issue of incorrect finish reason in final stream response chunk returned during tool call by @xianzhiT in #7708
- fix: match chat-template for internvl3 by @JustinTong0323 in #8262
- Fix gemma3n with hybrid swa by @JustinTong0323 in #8240
- chore: upgrade sgl-kernel 0.2.7 by @zhyncs in #8304
- fix: prevent crashes due to logit bias dimension mismatch by @0xymoro in #7685
- feat(function call): complete utility method for KimiK2Detector and enhance documentation by @CatherineSue in #8043
- Fix incomplete tool call capture issue in streaming response of DeepSeek-V3 when enable MTP by @xianzhiT in #7562
- [AMD] Pull latest image for AMD CI by @michael-amd in #8070
- Pin the version of petit kernel to fix the APIs by @haohui in #8235
- [bug] fix pd completion protocol for batching support by @slin1237 in #8317
- [router] fix pd model completion request by @slin1237 in #8303
- fix bug when eos_ids==0 by @bzantium in #8315
- [router] add endpoint unit test by @slin1237 in #8298
- [code style] Clean dead triton kernel code in fused_moe and useless vllm_ops import by @BBuf in #8310
- chore: upgrade flashinfer v0.2.9rc1 by @Swipe4057 in #8301
- [router] add streaming unit test by @slin1237 in #8299
- [router] add request format unit test by @slin1237 in #8300
- HiCache Storage TP Refinement by @xiezhq-hermann in #8307
- breakdown kernel update by @xiezhq-hermann in #8334
- support idle batch for TBO by @sherry-1001 in #8233
- [Feature] Integrate quick allreduce and select the best allreduce implementation by @lihaoyang-amd in #6619
- DP Enhancement by @ch-wan in #8280
- fix: Fix failed functional tests https://github.com/meta-llama/llama-stack-evals by @ynwang007 in #8266
- [AMD] Add silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, and gelu_quick kernels for AMD GPUs by @hubertlu-tw in #7135
- [CPU] Add tutorial docs for SGL on CPU by @ZailiWang in #8000
- chore: upgrade mooncake 0.3.5 by @ShangmingCai in #8341
- [torch.compile bug] avoid biased_grouped_topk_impl func repeatedly triggering torch.compile in forward pass by @BBuf in #8353
- [P/D] Support ipv6 in P/D scenario by @thefacetakt in #7858
- Add H20-3e fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct by @Xu-Wenqing in #8344
- [Bugfix][Feat] Add XML-ish grammar in EBNFComposer and fix misc bugs in Qwen3 detector by @CatherineSue in #8357
- Clean up server_args, triton cache manager by @merrymercy in #8332
- fix: upgrade nccl version by @zhyncs in #8359
- [Feat] Add reasoning parser for Qwen/Qwen3-235B-A22B-Thinking-2507 by @CatherineSue in #8363
- fix: kimi k2 xgrammar crash by @zhyncs in #8367
- Fix FP4 MoE accuracy from missing routed_scaling_factor by @trevor-m in #8333
- [CI] Fix flaky threshold by @merrymercy in #8370
- chore: bump v0.4.9.post4 by @zhyncs in #8305
- Fix test_moe_fused_gate_combined sgl-kernel ci test by @ispobock in #8374
- Update Dockerfile.gb200 to latest sglang by @kyleliang-nv in #8356
- chore: improve mmmu benchmark by @mickqian in #7000
- Save peak memory in logits processor by @ch-wan in #8343
- Extract update_weights from RL Engine to SGLang to keep simplicity and fix torch reduce by @hebiao064 in #8267
- chore: improvements on mm_utils by @mickqian in #7737
- vlm: optimize tensor transport by @mickqian in #6003
- Tiny assert EPLB is used together with expert parallel by @fzyzcjy in #8381
- model: support intern-s1 by @RunningLeon in #8350
- Add perf tests for LoRA by @lifuhuang in #8314
- Remove slot usage in code to be backward-compatible with python 3.9 by @lifuhuang in #8396
- Add docker release flow for gb200 by @kyleliang-nv in #8394
- HiCache, check before terminate prefetching by @xiezhq-hermann in #8372
- Add nvfp4 scaled mm benchmark. by @HydraQYH in #8401
- Urgent Fix: intern-s1 chat-template matching by @JustinTong0323 in #8403
- Tool to dump and compare internal activation tensors by @fzyzcjy in #7976
- Minor tool for comparison of benchmark results by @fzyzcjy in #7974
- Fix bench script making input data on L2 cache by @fzyzcjy in #7739
- [NVIDIA] Add Flashinfer MoE blockscale fp8 backend by @kaixih in #8036
- Update Cutlass in sgl-kernel to v4.1 by @Fridge003 in #8392
- fix: minor fix TransportProxyTensor under tp by @mickqian in #8382
- [router] add different policies for p node and d node by @slin1237 in #8395
- Add A800 fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct by @lambert0312 in #8351
- fix: fix the missing metrics on non-rank0 nodes by @acelyc111 in #7720
- [2/N] MoE Refactor: Unify weight loader and quant methods by @ch-wan in #8397
- Use FlashInfer FP4 gemm. by @elfiegg in #8241
- Support precomputed_embeddings for Llama 4 by @AlienKevin in #8156
- [hotfix] fix merge conflicts in FlashInferEPMoE by @ch-wan in #8405
- chore: update CODEOWNERS by @zhyncs in #8407
- chore: upgrade flashinfer v0.2.9rc2 by @zhyncs in #8406
- Support triton kernels v3.4.0 for fused_moe by @yuan-luo in #8258
- [Bugfix] Prevent PD server crash from invalid grammar by @ShangmingCai in #8062
- Change to use native arm runner by @kyleliang-nv in #8414
- Support overlapped lora updates by @lifuhuang in #8213
- Support ue8m0 for triton quant kernel by @fzyzcjy in #7603
- Fix: Improve test_openai_function_calling unit test and fix reasoning_parser.py think_start_token logic by @byjiang1996 in #8316
- bugfix: Fix multiple finish_reason chunks and tool_calls finish reason check by @CatherineSue in #8417
- Fix test_openai_server by @CatherineSue in #8419
- Fix docker buildx push error by @kyleliang-nv in #8425
- bugfix: Fix XGrammar backend to use model's EOS tokens for constrained generation by @CatherineSue in #8422
- [router] improve router logs and request id header by @slin1237 in #8415
- [feat] Support different attention backends for prefill and decode by @Qiaolin-Yu in #6338
- chore: bump transformers to 4.54.0 by @hebiao064 in #8416
- [PD] Fix abort_request for PD disaggregation by @ShangmingCai in #8352
- GLM-4.5 Model Support by @zRzRzRzRzRzRzR in #8224
- Remove zstd compression for building Dockerfile.gb200 by @kyleliang-nv in #8442
- doc: add bench_one_batch_server in the benchmark doc by @Qiaolin-Yu in #8441
- GLM-4.5 Model Support Follow-up by @byjiang1996 in #8445
- fix GLM4_MOE launch with compressed_tensor quant model by @zminglei in #8456
- Fix per_token_group_quant_8bit when hidden_dim // group_size is not divided by 4. by @strgrb in #8449
- Revert "[kernel] opt moe align block kernel by block/warp scan algorithm" by @BBuf in #8457
- chore: bump v0.4.9.post5 by @zhyncs in #8458
- fix: reorder topk experts to ensure shared expert replaces minimal score by @erictanjn in #8125
- Update PR template by @ispobock in #8465
- feat: throttle requests at scheduler based on --max_queued_requests by @harrisonlimh in #7565
- fix: update dep by @zhyncs in #8467
- [NVIDIA] Change to use num_local_experts by @kaixih in #8453
- Fix parsing ChatCompletionMessage by @Onyad in #7273
- [3/N] MoE Refactor: Simplify DeepEP Output by @ch-wan in #8421
- feat: support glm4 tuning by @zhyncs in #8473
- Fix DEEPEP BF16 compatibility for Deepseek Style model like GLM 4.5 by @hebiao064 in #8469
- Update codeowner by @merrymercy in #8476
- chore: add glm4 fp8 tp8 config by @zhyncs in #8478
- chore: add glm 4.5 fp8 tp4 config by @zhyncs in #8480
- [CI]Add genai-bench Performance Validation for PD Router by @key4ng in #8477
- Update CODEOWNERS by @merrymercy in #8485
- Rename the last step in pr-test.yml as pr-test-finish by @merrymercy in #8486
- Reduce memory usage for fp4 moe by @fzyzcjy in #8413
- Tiny add warnings for DeepEP when it is suboptimal by @fzyzcjy in #8426
- Support colocating requests by @fzyzcjy in #7973
- Fix incorrect KV cache allocation for MTP models. by @lifuhuang in #8482
- Add PVC and update resource limits in k8s config by @haitwang-cloud in #8489
- chore: bump v0.4.9.post6 by @zhyncs in #8517
- Always trigger pr-test by @merrymercy in #8527
- Update README.md by @merrymercy in #8528
- [sgl-kernel performance] fix fp8 quant kernels dispatch __nv_fp8_e4m3 bug to improve performance by 10%-20% by @BBuf in #8499
- Update cutlass_moe.py by @elfiegg in #8535
- Fix moe align kernel test by @ispobock in #8531
- Split the scheduler into multiple mixin classes to reduce the file size by @merrymercy in #8483
- bring back kimi vl ci by @hebiao064 in #8537
- fix: temporarily disable cuda-ipc for mm data tensor by @mickqian in #8431
- Support EPLB in FusedMoE by @ch-wan in #8448
- feat(hicache): support file backend reading directory config from env by @hzh0425 in #8498
- feature(pd-hicache): Prefill instances support reusing the RemoteStorage Cache via HiCache. by @hzh0425 in #8516
- [router] allow longer time out for router e2e by @slin1237 in #8560
- Update cutlass_moe.py by @elfiegg in #8545
- Update CODEOWNERS by @ShangmingCai in #8562
- [feature] [sgl-router] Add a dp-aware routing strategy by @oldsharp in #6869
- [Hot-Fix] Fix moe_aligned_block_size CI failure on AMD by @yuan-luo in #8461
- [Model] Add support for Arcee Foundational Model by @adarshxs in #8154
- Revert "Fix the input tools format and history tool_calls in OpenAI API (#6556)" by @CatherineSue in #8584
- Add hf3fs support for hicache storage (based on #7704) by @pansicheng in #7280
- [router] migrate router from actix to axum by @slin1237 in #8479
- [Fix]Fix index oob in get_group_gemm_starts kernel. by @HydraQYH in #8564
- Bump transformers to 4.54.1 to fix Gemma cache issue by @lifuhuang in #8541
- Add GKE's default CUDA runtime lib location to PATH and LD_LIBRARY_PATH. by @pyc96 in #8544
- Bug: Fix google gemma3n-mm audio input not working bug by @byjiang1996 in #8365
- update sgl-kernel for EP: kernel part by @ch-wan in #8514
- chore: bump sgl-kernel v0.2.8 by @zhyncs in #8599
- [bugfix] Fix 2 minor bugs in the hicache storage layer by @yapple in #8404
- fix incorrect increase of hit count by @huangtingwei9988 in #8533
- Support l3 cache (mooncake store) for hiradix cache by @huangtingwei9988 in #7211
- update sgl-kernel for EP: python part by @ch-wan in #8550
- add SVG logo by @hnyls2002 in #8603
- [4/N] MoE Refactor: Unified Triton Kernel for FusedMoE and EPMoE by @ch-wan in #8515
- fix: fork should not run pypi router by @yihong0618 in #8604
- model: support Step3V by @CatherineSue in #8583
- [Feature] Hybrid EP and TP by @ch-wan in #8590
- chore: bump v0.4.10 by @zhyncs in #8608
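Several of the entries above (the serving_chat.py improvements, request_id patching, finish_reason and EOS-token fixes, and the tool-call parsers) all touch SGLang's OpenAI-compatible serving path. As a point of reference only, here is a minimal sketch of calling a locally launched server through that API; the port (30000) and the placeholder model name are assumptions for a default local launch, not something this changelog specifies.

```python
from openai import OpenAI

# SGLang exposes an OpenAI-compatible API; base URL and API key below are
# assumptions for a default local launch (the server does not need a real key).
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",  # placeholder; a single-model server typically ignores this
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain KV-cache reuse in one sentence."},
    ],
    max_tokens=64,
)

print(response.choices[0].message.content)
# finish_reason is one of the fields touched by the fixes above, e.g. "stop" or "tool_calls".
print(response.choices[0].finish_reason)
```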
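PR #7446 above adds dynamic LoRA loading and unloading to the engine/server API. The sketch below shows one way a client might drive it over HTTP; the endpoint names (/load_lora_adapter, /unload_lora_adapter), the payload fields, and the per-request lora_path selector are assumptions based on the PR title and common SGLang usage, so treat it as illustrative rather than the exact API.

```python
import requests

BASE_URL = "http://localhost:30000"  # default SGLang server port (assumption)

# Load an adapter at runtime (endpoint name and payload fields are assumptions).
resp = requests.post(
    f"{BASE_URL}/load_lora_adapter",
    json={"lora_name": "my-adapter", "lora_path": "/path/to/adapter"},
)
resp.raise_for_status()

# Generate with the adapter; "lora_path" selecting the loaded adapter by name
# is an assumption based on SGLang's per-request LoRA usage.
out = requests.post(
    f"{BASE_URL}/generate",
    json={
        "text": "Write a haiku about GPUs.",
        "sampling_params": {"max_new_tokens": 32},
        "lora_path": "my-adapter",
    },
).json()
print(out)

# Unload when no longer needed (endpoint name is an assumption).
requests.post(f"{BASE_URL}/unload_lora_adapter", json={"lora_name": "my-adapter"})
```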
New Contributors
Full Changelog: v0.4.8...v0.4.10