Release v0.4.7 · sgl-project/sglang
Highlights
- The PD disaggregation and large-scale EP functionalities from the blog post have now been fully merged into the latest release.
- The blog's results have been successfully reproduced by more than six industry teams, including the TensorRT-LLM team.
- SGLang's large-scale EP is now actively used by leading organizations such as Cursor, Qwen, Alimama, Alibaba Cloud, iFlytek, and more. It has been deployed and validated at scale, running on GPU clusters with thousands of devices.
- PD disaggregation and large-scale EP, in addition to supporting DeepSeek V3/R1, now also support Qwen 3 in the latest release.
- Full Blackwell support for DeepSeek V3/R1, Llama 4, and Qwen 3, with further optimizations underway.
- SGLang's DeepSeek V3/R1 now achieves 190 TPS on a single H200, outperforming other frameworks by over 50%.

We extend our sincere thanks to the following contributors, listed in alphabetical order: Alibaba Cloud, AMD Team, Ant Group, Baseten Team, Cursor Team, Dynamo Team, EAGLE Team, FlashInfer Team, Google Vertex AI Team, iFlytek MaaS Team, Intel Team, LinkedIn Team, Meituan Team, Microsoft Copilot Team, Mooncake Team, NVIDIA Team, Oracle Team, Qwen Team, Voltage Park Team, and open source community users. Your support and collaboration are deeply appreciated!
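For readers who want to try the release, here is a minimal sketch of querying such a deployment through SGLang's OpenAI-compatible API. It assumes a server has already been launched (for example, `python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B --tp 8`) and is listening on the default port 30000; the model name and the `enable_thinking` chat-template field are illustrative and should be checked against the docs for your version.

```python
# Minimal sketch (not the release's own example): query an SGLang
# OpenAI-compatible endpoint. Assumes a server is already running on
# localhost:30000; model name, port, and the chat-template field are
# deployment-specific assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",
    messages=[{"role": "user", "content": "Briefly explain expert parallelism."}],
    max_tokens=256,
    # Thinking-mode toggle (see #5551): forwarded to the chat template via
    # chat_template_kwargs; verify the exact field name in the SGLang docs.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)
```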
What's Changed
- Update nightly-test.yml by @merrymercy in #5797
- [CI] Improve github summary & enable fa3 for more models by @merrymercy in #5796
- [Docs] update grafana setup guide in production metrics by @PopSoda2002 in #5643
- [Misc] add structure logging, write to file and log tracing for SGL R… by @slin1237 in #5741
- Improve overlap scheduling by @hnyls2002 in #5788
- Add Cutlass MLA attention backend by @trevor-m in #5390
- chore: upgrade sgl-kernel 0.1.0 by @zhyncs in #5690
- Dockerfile.dev pip scikit_build_core by @BBuf in #5807
- Add a doc to fix sgl-kernel build link error in py39 with ccache by @BBuf in #5809
- Turn on overlap scheduler for multimodal models by @merrymercy in #5771
- Tiny refactor DefaultModelLoader.Source by @fzyzcjy in #5482
- [Docs] Replace lists with tables for cleanup and readability in server_arguments by @windsonsea in #5276
- Revert "Tiny refactor DefaultModelLoader.Source" by @merrymercy in #5825
- Feat: add support for thinking mode via chat_template_kwargs.enable_t… by @minleminzui in #5551
- fix: fix the error where the content is None when reasoning and tool … by @minleminzui in #5838
- feat: Add fused moe triton config for qwen3 moe on h100 by @JustinTong0323 in #5833
- fused moe triton tuning script support qwen3 by @BBuf in #5842
- feat: Add fused moe triton config for qwen3bf16 moe on h20 by @yhyang201 in #5839
- [PD] support pd fake transfer for warmup by @whybeyoung in #5726
- [qwen3] qwen3moe_tune_h20 fp8 tp4 by @whybeyoung in #5846
- [Doc] Recover history of server_arguments.md by @Fridge003 in #5851
- feat: Add fused moe triton config for qwen3-30b-fp8 moe on h20 by @GeLee-Q in #5850
- [CI] test chunked prefill more by @merrymercy in #5798
- ROCm: update AITER by @HaiShaw in #5816
- [Feat] QWen-1M context support[1/2]: Update block sparse attention backend utils kernel by @yinfan98 in #5847
- [Fix] Missing bootstrap_port field by @xutianyi1999 in #5823
- feat: update is_fa3_default_architecture by @zhyncs in #5854
- add fused moe config for qwen3moe fp8/bf16 by @yizhang2077 in #5849
- chore: bump v0.4.6.post1 by @zhyncs in #5845
- Support `max_completion_tokens` for OpenAIChatCompletions by @CatherineSue in #5857 (usage sketch after this list)
- simplify fused_moe config logging by @BBuf in #5801
- [CI] tune the test order to warmup the server by @merrymercy in #5860
- Cutlass MLA decode - fix dtype error by @trevor-m in #5868
- Support cutlass 3.9 to improve fp8_blockwise_gemm by @BBuf in #5820
- [Feature] support auto chat template by @woodx9 in #4949
- Feat: support cuda graph for LoRA by @Qiaolin-Yu in #4115
- Add qwen3 30b fused moe config by @JustinTong0323 in #5859
- [Fix] Fix a bug for flashmla to run R1 model by @pengcuo in #5875
- Add A800 fused moe config for qwen3 30b by @lambert0312 in #5880
- [Misc] add service discovery for sgl router by @slin1237 in #5865
- [fix]: PyO3 macOS linking and consolidate on tracing for logging by @slin1237 in #5856
- chore: update Dockerfile by @zhyncs in #5894
- [Docs] Update docs for Qwen3 and Qwen3MoE by @adarshxs in #5836
- Tables instead of bulletpoints for sampling doc by @simveit in #5841
- chore: update CODEOWNERS by @zhyncs in #5895
- [FEATURE] Enhance platform compatibility for ARM by @johnnynunez in #5746
- [CI] Add test_function_calling.py to run_suite.py by @CatherineSue in #5896
- Auto set draft model path for MTP by @ispobock in #5793
- [fix] relax mem_fraction_static for h200 by @Alcanderian in #5893
- feat: support pythonic tool call and index in tool call streaming by @CatherineSue in #5725
- [Bugfix]: fix missing queue_time_start for requests from grammar_queue by @CatherineSue in #5696
- Add AMD MI300x Nightly Testing. by @saienduri in #5861
- chore: use torch 2.6 for sgl-kernel build by @zhyncs in #5898
- Fix check_env script by @lambert0312 in #5901
- [PD] Fix Assertion failed: /DeepEP/csrc/kernels/internode.cu:483, condition: ibgda_get_state()->num_rc_per_pe >= num_channels #134 by @whybeyoung in #5830
- Bump Flashinfer to 0.2.5 by @Fridge003 in #5870
- [Fix] Unload lora in HF_Runner if needed by @Qiaolin-Yu in #5899
- Add A800 fused moe config for qwen3 235b by @lambert0312 in #5900
- Add sm_120 for blackwell by @zhjunqin in #5903
- [Feature] add support kimi vl model by @liwenju0 in #5383
- support vlm benchmark profile by @yizhang2077 in #5905
- [fix] kimi-vl test in test_vision_openai_server.py by @Alcanderian in #5910
- [Misc] use parallel build for cmake in sgl-kernel by @yinfan98 in #5919
- [qwen3] support qwen3 ep moe by @laixinn in #5917
- Add TP2 MOE benchmarks for AMD. by @saienduri in #5909
- [Feat] Scale up fa3 kernel to sm8x arch by @yinfan98 in #5912
- chore: bump sgl-kernel 0.1.1 by @zhyncs in #5932
- chore: upgrade sgl-kernel 0.1.1 by @zhyncs in #5933
- Remove unused method `calculate_num_image_tokens` from qwen2_vl.py by @JustinTong0323 in #5783
- [PP] Add pipeline parallelism by @Ying1123 in #5724
- Fix lora batch processing when input lora_path contains None by @Qiaolin-Yu in #5930
- add Thor & Spark by @johnnynunez in #5915
- fix: correct stream response when enable_thinking is set to false by @minleminzui in #5881
- fix: update model runner by @zhyncs in #5934
- chore: bump v0.4.6.post2 by @zhyncs in #5939
- Support XiaomiMiMo/MiMo model inference by @ryang-max in #5921
- [PD] Vectorise group_concurrent_contiguous in NumPy by @yuan-luo in #5834
- Remove extra contiguous by @ispobock in #5953
- Update ci test and doc for MTP api change by @ispobock in #5952
- docs: Fix Qwen model typo by @JiangJiaWei1103 in #5944
- Optimize a pad operation to save 25us by @hebiao064 in #5945
- Properly return error response in vertex_generate HTTP endpoint by @KCFindstr in #5956
- feat: add concurrency evaluation logic in mmmu benchmark by @JustinTong0323 in #5782
- Add 1 gpu perf and 2 gpu accuracy tests for AMD MI300x CI. by @saienduri in #5960
- feat: Refactor DeepSeekV3 function call by @CatherineSue in #5908
- Remove token in token out in Native API by @zhaochenyang20 in #5967
- Support InternVL3 by @xiaomin-D in #5350
- Support MMMU benchmark for InternVL by @JustinTong0323 in #5968
- FA3 speed up: skip len operation and get batch size directly from forward batch by @lifuhuang in #5969
- [PD] NIXL backend Prefill TP & Decode TP+DP by @jokerwyt in #5681
- Fix set kv cache multi-stream by @ispobock in #5975
- Overlap qk norm with two streams by @ispobock in #5977
- fix: only upgrade nccl for cu128 by @zhyncs in #5986
- Fix Phi3 serving which was broken by an earlier change by @hebiao064 in #5991
- [perf] H100 DeepSeek-V3 fused moe tuned config by @Alcanderian in #5998
- [Fix] Suppress dynamo logging when using flashinfer backend with torch compile by @Fridge003 in #5992
- [Minor] Fix duplicate method definitions in conversation.py by @lifuhuang in #6012
- Fix flaky issues of lora and add multi batch tests by @Qiaolin-Yu in #5957
- Add `chat_template_kwargs` documentation by @vincentzed in #5679
- fix: fix broadcast_pyobj breaking VerlEngine by @ocss884 in #5997
- [PD] Allow customizing reserved tokens to avoid KV cache waste by @fzyzcjy in #6002
- Update dev container config to support live code sync and improve docker setup guide by @lifuhuang in #6018
- [PD] Optimize disaggregation ib device help info by @ShangmingCai in #5781
- [Test] Add flashmla attention backend test by @PopSoda2002 in #5587
- Fix "Avoid computing lse in Ragged Prefill when there's no prefix match" by @Edenzzzz in #5555
- feat: Add a unified merge_state API by @DefTruth in #5428
- feat: append more comprehensive fields in messages instead of merely role and content by @minleminzui in #5996
- [Security][Bug] Prevent binding to all TCP interfaces by @adarshxs in #5752
- Fix prefill OOM error in the case of large page size by @xiezhq-hermann in #5081
- Fix problem of large page size with chunked prefill by @xiezhq-hermann in #6046
- docs: add Google Cloud Vertex AI in Adoption and Sponsorship by @zhyncs in #6047
- docs: add new blog by @zhyncs in #6048
- Fix missing "import os" by @hnyls2002 in #6057
- Better PD initialization by @hnyls2002 in #5751
- fix: deepep dockerfile, use pip install deepep. by @HanHan009527 in #5885
- [Fix] Fix and rename flashmla CI test by @Fridge003 in #6045
- chore: upgrade cutlass 3.9.2 by @zhyncs in #6004
- Fix sgl-kernel build on aarch64 platforms by @Qiaolin-Yu in #6062
- Add DeepEP to CI PR Test by @liz-badada in #5655
- fix custom_allreduce namespace by @BBuf in #6039
- feat: add release workflow for SGLang kernels on aarch64 by @johnnynunez in #6010
- [Feature] Support for Ascend NPU backend by @botieking98 in #3853
- Fix the timeout for 8 gpu tests by @merrymercy in #6084
- Hint users DeepEP normal mode is incompatible with CUDA Graph by @fzyzcjy in #5014
- Super tiny fix doc by @fzyzcjy in #5233
- [Doc]Fix description for dp_size argument by @Fridge003 in #6063
- feat(engine): add bootstrap parameters to generate methods (dynamo) by @ishandhanani in #6075
- [refactor] slightly tidy fp8 module by @Alcanderian in #5993
- Clean up fa3 test from 8 gpus by @hebiao064 in #6105
- Deferring 8 GPU test by @ch-wan in #6102
- Update doc for MLA attention backends by @Fridge003 in #6034
- Clean logs for DeepSeek-V3 launching by @Fridge003 in #6079
- [CI]Add performance CI for VLM by @JustinTong0323 in #6038
- adding Triton configs for DeepSeekV3 FusedMoE kernel on Blackwell by @Fridge003 in #6111
- optimize pad operations in fa3 to save 100+us by @zminglei in #6077
- Overlap shared expert and routed expert computations by @fzyzcjy in #5121
- Tiny refactor ModelConfig.from_server_args by @fzyzcjy in #5219
- Tiny refactor weight loading logic by @fzyzcjy in #5232
- [PD] Add control to slow down a server by @fzyzcjy in #5572
- Change AMD test threshold by @fzyzcjy in #6091
- DeepEP normal support deepgemm-contiguous by @sleepcoo in #5626
- [fix] fix pyproject.toml dependencies by @Alcanderian in #6119
- [Feature] Add FlashAttention3 as a backend for VisionAttention by @Othame in #5764
- [perf] dsv3 bmm fallback to bf16 by @Alcanderian in #5662
- [AMD] switch to custom allreduce regardless of MSCCL setting on ROCm by @hubertlu-tw in #6097
- [sgl-kernel] fix: fix cu118 compile error by @yinfan98 in #6123
- upgrade xgrammar to 0.1.19 by @Ubospica in #6129
- Remove unnecessary is_fa3_supported check by @hebiao064 in #6112
- chore: bump sgl-kernel 0.1.2 by @zhyncs in #6131
- docs: update README by @zhyncs in #6132
- [Fix] Incorrect Memory Allocation on CUDA:0 by Non-Zero CUDA Processes in TP/DP by @yhyang201 in #5745
- Cutlass MLA: Disable split kv due to NVIDIA/cutlass#2274 by @trevor-m in #6101
- opt flashinfer mla cat by @xu-yfei in #5822
- Update amd nightly concurrency. by @saienduri in #6141
- sampling_params: add thinking_budget by @thyecust in #6089
- [Bugfix] Fix Llama4 gibberish output with long context and CUDA graph by @CatherineSue in #6162
- fix bug that gpu0 occupies more memory when hicache is turned on by @huangtingwei9988 in #5778
- chore: bump v0.4.6.post3 by @zhyncs in #6165
- KV-Cache (MHA, MLA): add missing start_layer / end_layer fields to MHATokenToKVPoolHost and MLATokenToKVPoolHost by @Simon-Li in #6016
- [fix] fix determine_n_share_experts_fusion by @Alcanderian in #6118
- Fix and Clean up chat-template requirement for VLM by @JustinTong0323 in #6114
- [Docs]Delete duplicate content by @Ximingwang-09 in #6146
- Revert "feat: add thinking_budget (#6089)" by @zhyncs in #6181
- Added async_encode method to Engine by @shimizust in #4701
- Fix data parallel perf regression by @merrymercy in #6183
- Fix request abortion by @merrymercy in #6184
- Add typo checker in pre-commit by @applesaucethebun in #6179
- Remove duplicate IO Struct test by @emmanuel-ferdman in #6180
- [PD] Add simple unit test for disaggregation feature by @ShangmingCai in #5654
- [CI] Disabled deepep tests temporarily because it takes too much time. by @merrymercy in #6186
- feat: support loogle eval by @zhyncs in #6190
- [fix] remove mixtral from is_fa3_default_architecture by @Alcanderian in #6191
- fix: handle None multimodal_inputs during merging and filtering batches in disaggregation decode mode by @GaoYusong in #6169
- chore: upgrade deepgemm by @zhyncs in #6073
- chore: bump sgl-kernel v0.1.2.post1 by @zhyncs in #6195
- chore: upgrade sgl-kernel v0.1.2.post1 by @zhyncs in #6196
- Handle empty input string for embedding models by @ravi03071991 in #5621
- doc: fix the erroneous documents and example codes about Alibaba-NLP/gme-Qwen2-VL-2B-Instruct by @minleminzui in #6199
- [Docs] minor Qwen3 and reasoning parser docs fix by @adarshxs in #6032
- Improve structured outputs: fix race condition, server crash, metrics and style by @merrymercy in #6188
- [CI] Reorganize the 8 gpu tests by @merrymercy in #6192
- Add dev-deepep docker image by @fzyzcjy in #6198
- Replace time.time() with time.perf_counter() for benchmarking by @lifuhuang in #6178
- Update README.md by @merrymercy in #6202
- Fix release-docs.yml to not use python 3.9 by @merrymercy in #6204
- Fix start_profile does not support with_stack and record_shapes by @fzyzcjy in #6043
- [doc] add a note for --n-share-experts-fusion args by @BBuf in #6154
- Performing Vocabulary Parallelism for LM Head across Attention TP Groups by @ch-wan in #5558
- Update AMD CI docker to v0.4.6.post3-rocm630. by @saienduri in #6213
- Log if cuda graph is used & extend cuda graph capture to cuda-graph-max-bs by @merrymercy in #6201
- [CI] Fix PD mooncake dependency error by @ShangmingCai in #6212
- [CI] Re-enable pd disaggregation test by @ShangmingCai in #6231
- fix some typos by @applesaucethebun in #6209
- [Docs] Add docs for `SGLANG_` and `SGL_` environment variables by @b8zhong in #6206
- [PP] Fix init_memory_pool desync & add PP for mixtral by @Ying1123 in #6223
- Revert "fix some typos" by @merrymercy in #6244
- chore: add hf_xet dep by @zhyncs in #6243
- Update AMD nightly deps. by @saienduri in #6241
- [PD] Add support for different TP sizes per DP rank by @ShangmingCai in #5922
- Support incremental streaming of logprob/token_ids between scheduler and detokenizer by @merrymercy in #6225
- fix typo by @zhyncs in #6248
- Support tuning moe for llama 4 model by @fzyzcjy in #6042
- Skip the flaky test_stateful_custom_logit_processor by @merrymercy in #6251
- [Llama4] Add docs note about enable multimodal by @b8zhong in #6235
- [VERL Use Case] Add torch_memory_saver into deps by @hebiao064 in #6247
- Fix two issues related to `--moe-dense-tp-size=1` by @ch-wan in #5657
- model(vlm): pixtral by @KivenChen in #5084
- [misc] deep_gemm fallback to NVRTC when NVCC not found by @Alcanderian in #6252
- Enable MI325X AMD CI. by @saienduri in #6259
- chore: bump v0.4.6.post4 by @zhyncs in #6245
- [CPU] Add CMakeLists.txt for sgl-kernel by @blzheng in #6115
- perf: optimize local_block_table memory allocation by @CatherineSue in #6273
- Fix a bug in schedule_policy by @Ying1123 in #6276
- [Bug] Fix accidental logger override caused by internVL. by @lifuhuang in #6282
- doc: update developer guide regarding mllms by @mickqian in #6138
- docs: fix a bad redirect by @b8zhong in #6300
- Enable unit tests for AMD CI. by @saienduri in #6283
- [AMD] Fix Llama 4 Scout and Maverick accuracy issues on MI300X by @hubertlu-tw in #6274
- feat: add flush cache to EngineBase and HttpServerEngineAdapter by @ocss884 in #6009
- [fix][RL] Remove the incorrect barrier in init_weights_update_group by @zhuzilin in #5914
- [Feat] Support FlashMLA backend with MTP and FP8 KV cache by @quinnrong94 in #6109
- [misc] remove redundant platform codes by @Alcanderian in #6298
- Add fp8 gemm kernel for CPU in sgl-kernel and add gemm UT by @chunyuan-w in #6216
- Fix gpu mem check on CPU by @yiliu30 in #6317
- Reduce MoE memory usage by @fzyzcjy in #6147
- Fix lora bench by @Qiaolin-Yu in #6302
- Minor improvements of TokenizerManager / health check by @merrymercy in #6327
- Upgrade CUTLASS 4.0 by @elfiegg in #6336
- Support precomputed multimodal features for Qwen-VL and Gemma3 models. by @ysulsky in #6136
- [Fix] Improve dependencies for Blackwell image by @Fridge003 in #6334
- [2/2] Add python wrapper for CUTLASS FP8 Blockscale MoE Kernel. by @elfiegg in #5694
- feat: add dp attention support for Qwen 2/3 MoE models, fixes #6088 by @Fr4nk1inCs in #6121
- Update CODEOWNERS by @merrymercy in #6359
- [Minor] cleanup unused imports by @merrymercy in #6358
- Fix amd ci by @merrymercy in #6360
- docs: update readme by @zhyncs in #6361
- model(vlm): mistral 3.1 by @KivenChen in #5099
- Fix one wasted kernel in DeepSeek and minor refactor by @fzyzcjy in #6316
- Minor code cleanup refactor for DeepSeek models by @fzyzcjy in #6324
- chore: bump sgl-kernel v0.1.3 by @zhyncs in #6368
- perf: Optimize local attention memory allocation in FlashAttentionBackend by @CatherineSue in #6356
- docs: Update the MD files by @vincentzed in #6373
- [router] Add /list_workers endpoint to router by @zhuzilin in #6366
- Speed up when having padding tokens in DeepEP by @fzyzcjy in #6175
- Use monotonic clock for interval measurement by @lifuhuang in #6211
- [fix] illegal memory in _fwd_kernel_ep_scatter_2 and _fwd_kernel_ep_gather by @xutizhou in #6348
- Fix stop_profile does not wait for finishing by @fzyzcjy in #4741
- Support outputting details for bench_serving by @fzyzcjy in #6107
- Tiny refactor bench_serving to improve extensibility by @fzyzcjy in #6134
- Tiny refactor bench_serving to extract RequestFuncOutput.init_new by @fzyzcjy in #6108
- Support custom DeepEP tuning config by @fzyzcjy in #6257
- Fix expert distribution recorder and profiler command stuck forever by @fzyzcjy in #6284
- Reland tiny refactor DefaultModelLoader.Source by @fzyzcjy in #6041
- Add expert distribution APIs for engine by @fzyzcjy in #6290
- fix: allow `launch_dummy_health_check_server` to start inside of a running asyncio loop by @ishandhanani in #6330
- [Fix Chat API] add request id for chat/completion for tracing by @whybeyoung in #6364
- Fix CI tests by @merrymercy in #6362
- chore: upgrade sgl-kernel v0.1.3 by @zhyncs in #6377
- Do not use FA3 for mistral by @merrymercy in #6379
- refactor: minor refactors regarding multimodal processing by @mickqian in #6187
- Add pipeline parallelism for Qwen2 and Qwen3 Model by @libratiger in #6250
- Clean up AMD CI. by @saienduri in #6365
- feat: add long context example by @zhyncs in #6391
- The Gemma template is missing a newline after the user role. by @ysulsky in #6331
- chore: tiny remove duplicated code by @doujiang24 in #6392
- Add 4-GPU runner tests and split existing tests by @fzyzcjy in #6383
- Add fp8 shared_expert kernel for CPU in sgl-kernel and add UT by @chunyuan-w in #6339
- [fix] fix fa3 forward_decode with spec_decode by @Alcanderian in #6395
- Add missing model to doc by @applesaucethebun in #6396
- [OAI] Add rid tracing for v1/embeddings and fix rid type in Chat by @CatherineSue in #6397
- [Misc] Implement RankZeroFilter for rank-specific logging in model_runner.py by @CatherineSue in #6333
- refactor: Extract repeated member variables in KVCache subclasses to base class. by @wangxiyu191 in #6323
- Refactor DeepSeek MoE layer to unify the two forward branches by @fzyzcjy in #6325
- vlm: tensor hash kernel by @mickqian in #5974
- [Bugfix] Fix field error in v1_embedding_request by @CatherineSue in #6400
- Fix request id error by @fzyzcjy in #6401
- Implement `return_hidden_states` for the OpenAI API by @kyle-pena-kuzco in #6137
- Fix nodeepgemm init by @sleepcoo in #6417
- Improve supported models doc by @simveit in #6430
- Fix throughput threshold for amd ci test by @Fridge003 in #6414
- [Metrics] Add KV events publishing by @trevor-m in #6098
- [BUG] fix stop_profile crash by @yizhang2077 in #6431
- Revert "Implement `return_hidden_states` for the OpenAI API (#6137)" by @zhyncs in #6440
- Expert distribution recording without overhead for EPLB by @fzyzcjy in #4957
- Refactor communication logic of DeepSeek for extensibility and understandability by @fzyzcjy in #6321
- Remove `Cargo.lock`, add it into `.gitignore` by @hnyls2002 in #6438
- Refactor DeepSeek logic into atomic operations by @fzyzcjy in #6326
- Support loading weights when physical experts are different from logical experts by @fzyzcjy in #6386
- Support DeepSeek EPLB algorithm with static distributions by @fzyzcjy in #6387
- Address performance regression: disable multiple streams on ROCm by @HaiShaw in #6412
- [QuickFix] fix gptq model initialize by @yinfan98 in #6429
- Update extend/decode attention kernel for CPU in sgl-kernel and add UTs by @yanbing-j in #6405
- [doc] add note for get_num_kv_splits in triton_backend by @Alcanderian in #6444
- Support dispatching logical to physical experts by @fzyzcjy in #6385
- Fix master CI for DeepSeek by @fzyzcjy in #6447
- [docs] Fix torch version by @Edenzzzz in #6472
- Disable all two stream overlap on amd by @merrymercy in #6475
- Refactor group_concurrent_contiguous in NIXL by @yuan-luo in #6214
- aiter attention-backend (default enabled on AMD/ROCm) by @HaiShaw in #6381
- Implement Siglip Vision model, and support BNB quantization for gemma3-mm by @guapisolo in #5339
- [router] support http2 in router by @zhuzilin in #6487
- [RL] allow weight updates with dp attention enabled by @zhuzilin in #6311
- Refactor DeepSeek attention dispatching by @fzyzcjy in #6476
- Fix num_qps_per_rank computation when providing custom DeepEP configuration by @fzyzcjy in #6468
- Tiny add stage assertions to DeepEPDispatcher to avoid misuse by @fzyzcjy in #6467
- Support redundant experts in expert parallel by @fzyzcjy in #6461
- Tiny make Lint CI show diff by @fzyzcjy in #6445
- Let bench_one_batch_server use sharegpt data to make expert distribution more natural by @fzyzcjy in #5573
- Fix bench_one_batch_server by @fzyzcjy in #6503
- [Fix]Fix capture fail bug for DeepSeek by @Fridge003 in #6275
- [CPU] Fix build issue by @blzheng in #6419
- fix: EXAONE when using tie_word_embeddings by @lkm2835 in #5759
- doc: Update README.md with adding deepwiki badge to enable weekly auto-refresh by @JustinTong0323 in #6508
- Recover from corrupted cache file in bench serving by @fzyzcjy in #6510
- Apply constraint grammar to EAGLE by @ispobock in #6499
- [1/2] Support Qserve by @HandH1998 in #6457
- [PD] Add doc and simplify sender.send by @ByronHsu in #6019
- [PD] Abort request if transfer fails by @ByronHsu in #6504
- Add main for merge state tests by @yuan-luo in #6492
- Support updating expert locations dynamically by @fzyzcjy in #6388
- [RL] Remove the w13 weight_scale and input_scale for UnquantizedEPMoE… by @zhuzilin in #6308
- Support logging expert balancedness metrics by @fzyzcjy in #6482
- Support dynamically rebalancing experts using EPLB by @fzyzcjy in #6469
- Fix missing http status import for PD failure handler by @ShangmingCai in #6520
- chore: bump sgl-kernel v0.1.4 by @zhyncs in #6522
- Support qwen3 deepep by @sleepcoo in #6120
- chore: upgrade sgl-kernel v0.1.4 by @zhyncs in #6532
- Support XiaomiMiMo inference with mtp by @ryang-max in #6059
- misc: fix accept_length by @zhyncs in #6536
- [PD] Fix failure abort by @ByronHsu in #6535
- [VLM] Support chunk prefill for VLM by @CatherineSue in #6355
- Add fp8 qkv_proj_with_rope kernel for CPU in sgl-kernel and add UT by @blzheng in #6493
- Add fp8 fused_experts kernel for CPU in sgl-kernel and add UT by @chunyuan-w in #6404
- Update sgl-kernel UTs for activation/topk/norm/rope kernels by @yanbing-j in #6452
- Fix topk inference performance regression by @lambert0312 in #6474
- [PD] support spec decode by @ByronHsu in #6507
- [2/2] Support Qserve by @HandH1998 in #6521
- [PD] Support logprob & Add failure test by @ByronHsu in #6558
- fix: remove content=none test when tool called by @shuaills in #6347
- Update cmdline --enable-dp-attention help string for Qwen 2/3 Moe models. by @MiterV1 in #6524
- Bugfix: handle flatten_batch constraint for multiple images by @CatherineSue in #6562
- support eplb for qwen3 by @yizhang2077 in #6533
- feat(Tool Calling): Support `required` and specific function mode by @CatherineSue in #6550 (usage sketch after this list)
- [PD] Support structured output by @ByronHsu in #6560
- [FIX] Remove ServerArgs duplicate code by @pc-neo in #6485
- Fix zero accuracy when enabling moe-dense-tp-size in large-scale EP by @fzyzcjy in #6567
- chore: bump v0.4.6.post5 by @zhyncs in #6566
- Temporarily disable MI325x 8 gpu testing. by @saienduri in #6576
- Fix GPU OOM by @kkHuang-amd in #6564
- Refactor attention into multiple stages by @fzyzcjy in #6477
- Add back DeepSeek non-TBO branches by @fzyzcjy in #6578
- Utilize static dispatching for communicator by @fzyzcjy in #6577
- Support overlapping two batches by @fzyzcjy in #4068
- Refactor vlm embedding routine to use precomputed feature by @JustinTong0323 in #6543
- [OAI] Support non-normalized logprobs in OpenAI server by @CatherineSue in #5961
- Support Phi-4 Multi-Modal (text + vision only) by @lifuhuang in #6494
- Sgl-router Prometheus metrics endpoint and usage track metrics by @upfixer in #6537
- added support for tied weights in qwen pipeline parallelism by @FrankLeeeee in #6546
- Hint users when weight updates time out by @fzyzcjy in #6570
- Fix some issues with current docs. by @simveit in #6588
- [PD] Fix prefill_servers in mini_lb by @wangxiyu191 in #6527
- Fix bench_serving does not support changing warmup requests by @fzyzcjy in #6439
- Support fake perfectly balanced EP dispatch algorithm by @fzyzcjy in #6571
- Fix profiling will crash the server when using num_steps by @fzyzcjy in #6586
- Improve performance of two batch overlap in some imbalanced cases by @fzyzcjy in #6593
- Logging and minor fixes to two batch overlap and EPLB by @fzyzcjy in #6595
- Tiny change killall_sglang.sh by @fzyzcjy in #6596
- Auto handle PD disaggregation in bench_serving by @fzyzcjy in #6587
- Support accurate length control for bench serving by @fzyzcjy in #6594
- Tiny fix lint CI does not trigger on master by @fzyzcjy in #6609
- chore: upgrade transformers 4.52.3 by @zhyncs in #6575
- Revert "Tiny fix lint CI does not trigger on master (#6609)" by @zhyncs in #6610
- refactor qwen moe code, use communicator to support tp+dp by @yizhang2077 in #6581
- feat: Improve Mistral and Qwen25 function call parsing by @CatherineSue in #6597
- qwen3moe support two batch overlap by @yizhang2077 in #6598
- Tiny fix CI by @fzyzcjy in #6611
- Supported precomputed feature for Kimi VL by @lifuhuang in #6599
- [FA][Test] Fix Sparse FA test by @b8zhong in #6306
- fix qwen3moe eplb prefill bug by @yizhang2077 in #6617
- Automatically configure for EPLB-related args by @fzyzcjy in #6628
- Fix EPLB algorithm fail to run when using 3 nodes for prefill by @fzyzcjy in #6629
- Tiny fix missing expert location dispatch info by @fzyzcjy in #6620
- Update nightly thresholds and dependencies. by @saienduri in #6635
- Tiny fix sampler error when prob is not contiguous by @fzyzcjy in #6639
- [PD] Handle P/D failure and reconnect without affecting other instances by @ShangmingCai in #6263
- follow-up: move Idefics2 to a shared location to eliminate unexpected dependency. by @lifuhuang in #6603
- fix: added "\n" to qwen25 tool parser structural tags by @shuaills in #6631
- [New Model] Devstral support by @JustinTong0323 in #6547
- chore: upgrade mooncake-transfer-engine by @zhyncs in #6643
- Tiny refactor communicator by @fzyzcjy in #6646
- Support TP in attention for two batch overlap by @fzyzcjy in #6634
- Super tiny rename environment variable by @fzyzcjy in #6648
- Refactor LoRA handling to support adapter tensors in fused format by @lifuhuang in #6585
- [Bugfix]: Fix call for function_call_parser.multi_format_detector in adapter.py by @CatherineSue in #6650
- Update doc TOC and Dockerfile code style format by @habaohaba in #6450
- Add note to add supported model to documentation by @b8zhong in #6640
- docs: Update documentation to reflect xgrammar as default grammar backend by @vincentzed in #6601
- Add environment flag for disabling message queue broadcaster by @Fridge003 in #6403
- fix: fix nightly test from updating transformers by @mickqian in #6658
- Fix qwen3 tbo/dp-lm-head by @yizhang2077 in #6652
- fix communicator for non-dp lm head by @ch-wan in #6662
- Support EAGLE draft extend CUDA graph by @ispobock in #6606
- DeepSeek: enable non-block-quant FP8 quantizations by @HaiShaw in #6638
- Fix OOM when updating expert locations by @fzyzcjy in #6660
- Speed up expert location update by @fzyzcjy in #6661
- Revert "fix communicator for non-dp lm head (#6662)" by @zhyncs in #6677
- [PD] Make bootstrap code common between NIXL and Mooncake by @trevor-m in #6473
- [CI] update verlengine ci to 4-gpu test by @ocss884 in #6007
- Fix DeepEP error in Qwen 3 MoE models by @fzyzcjy in #6673
- Support gathering expert distribution details by @fzyzcjy in #6665
- Disable compiling arch below sm_90 in aarch64 by default by @Qiaolin-Yu in #6380
- fix(tool call): Fix tool_index in PythonicDetector and issues with mixed output in non-streaming by @CatherineSue in #6678
- Add batch test for draft extend by @ispobock in #6672
- feat: Add warnings for invalid tool_choice and UTs by @CatherineSue in #6582
- Update amd docker and nightly models. by @saienduri in #6687
- Refine pre_reorder_triton_kernel slightly to improve performance by @yuan-luo in #6627
- fix log_info_on_rank0 error when running benchmarks by @BBuf in #6260
- fix(deepseekv3): Fix DeepSeekV3Detector tool_index assignment and multi-tool call streaming support by @CatherineSue in #6655
- [Bugfix] Fix missing abort finish reason for PD with ChatCompletion by @ShangmingCai in #6693
- [CI] Fix flaky pp single node test by @ShangmingCai in #6689
- [PD] Abort unbootstrapped prefill requests through timeout by @ShangmingCai in #6685
- [PD Perf] replace Queue with FastQueue by @whybeyoung in #6649
- [Bugfix] Fix slice operation when chunk size mismatch by @ShangmingCai in #6697
- [Bugfix] Fix ChatCompletion endpoint of mini_lb when stream is set by @ShangmingCai in #6703
- [CI] Fix setup of disaggregation with different tp by @ShangmingCai in #6706
- [PD] Remove Unnecessary Exception Handling for FastQueue.get() by @Hongbosherlock in #6712
- Fuse routed_scaling_factor in DeepSeek by @fzyzcjy in #6710
- Overlap two kernels in DeepSeek with communication by @fzyzcjy in #6711
- Minor refactor two-batch overlap by @fzyzcjy in #6682
- Speed up when having padding tokens two-batch overlap by @fzyzcjy in #6668
- [Feature] Support Flashinfer fp8 blockwise GEMM kernel on Blackwell by @Fridge003 in #6479
- Fix LoRA bench by @Edenzzzz in #6719
- Fix PP for Qwen3 MoE by @jinyouzhi in #6709
- [feat] triton kernel for get_last_loc by @Alcanderian in #6676
- [fix] more mem for draft_extend cuda_graph by @Alcanderian in #6726
- [PD] bug fix: Update status if nixl receiver sends a dummy req by @thesues in #6720
- Tune memory arguments on B200 by @Fridge003 in #6718
- Add DeepSeek-R1-0528 function call chat template by @Xu-Wenqing in #6725
- refactor(tool call): Fix BaseFormatDetector tool_index issue and refactor `parse_streaming_increment` by @CatherineSue in #6715
- Add draft extend CUDA graph for Triton backend by @ispobock in #6705
- refactor apply_w8a8_block_fp8_linear in fp by @ChangyiYang in #6545
- [PD] Support completion endpoint by @ShangmingCai in #6729
- Init PD Rust LB (PO2) by @hnyls2002 in #6437
- Super tiny enable sole usage of expert distribution metrics and update doc by @fzyzcjy in #6680
- Support picking variants of EPLB algorithms by @fzyzcjy in #6728
- Support tuning DeepEP configs by @fzyzcjy in #6742
- [test] add ut and bm for get_last_loc by @Alcanderian in #6746
- Fix mem_fraction_static for AMD CI by @Fridge003 in #6748
- [fix][RL] Fix DeepSeekV3ForCausalLM.post_load_weights for multiple update weight by @zhuzilin in #6265
- Improve EPLB logical to physical dispatch map by @fzyzcjy in #6727
- Update DeepSeek-R1-0528 function call chat template by @Xu-Wenqing in #6765
- [PD] Optimize time out logic and add env var doc for mooncake by @ShangmingCai in #6761
- Fix aiohttp 'Chunk too big' in bench_serving by @guoyuhong in #6737
- Support sliding window in triton backend by @NorthmanPKU in #6509
- Fix shared experts fusion error by @lambert0312 in #6289
- Fix one bug in the grouped-gemm triton kernel by @ch-wan in #6772
- update llama4 chat template and pythonic parser by @upfixer in #6679
- feat(tool call): Enhance Llama32Detector for improved JSON parsing in non-stream by @CatherineSue in #6784
- Support token-level quantization for EP MoE by @ch-wan in #6782
- Temporarily lower mmlu threshold for triton sliding window backend by @NorthmanPKU in #6785
- ci: relax test_function_call_required by @CatherineSue in #6786
- Add intel_amx backend for Radix Attention for CPU by @yanbing-j in #6408
- Fix incorrect LoRA weight loading for fused gate_up_proj by @lifuhuang in #6734
- fix(PD-disaggregation): Can not get local ip by @storyicon in #6792
- [FIX] mmmu bench serving result display error (#6525) by @Arist12 in #6791
- Bump torch to 2.7.0 by @Qiaolin-Yu in #6788
- chore: bump sgl-kernel v0.1.5 by @zhyncs in #6794
- Improve profiler and integrate profiler in bench_one_batch_server by @merrymercy in #6787
- chore: upgrade sgl-kernel v0.1.5 by @zhyncs in #6795
- [Minor] Always append newline after image token when parsing chat message by @lifuhuang in #6797
- Update CI tests for Llama4 models by @ravi03071991 in #6421
- [Feat] Enable PDL automatically on Hopper architecture by @PopSoda2002 in #5981
- chore: update blackwell docker by @zhyncs in #6800
- misc: cache is_hopper_arch by @Edenzzzz in #6799
- Remove contiguous before Flashinfer groupwise fp8 gemm by @Fridge003 in #6804
- Correctly abort the failed grammar requests & Improve the handling of abort by @merrymercy in #6803
- [EP] Add cuda kernel for moe_ep_pre_reorder by @yuan-luo in #6699
- Add draft extend CUDA graph for flashinfer backend by @ispobock in #6805
- Refactor CustomOp to avoid confusing bugs by @fzyzcjy in #5382
- Tiny log prefill time by @fzyzcjy in #6780
- Tiny fix EPLB assertion about rebalancing period and recorder window size by @fzyzcjy in #6813
- Add simple utility to dump tensors for debugging by @fzyzcjy in #6815
- Fix profiles do not have consistent names by @fzyzcjy in #6811
- Speed up rebalancing when using non-static dispatch algorithms by @fzyzcjy in #6812
- [1/2] Add Kernel support for Cutlass based Fused FP4 MoE by @pavanimajety in #6093
- [Router] Fix k8s Service Discovery by @YouNeedCryDear in #6766
- Add CPU optimized kernels for topk and rope fusions by @jianan-gu in #6456
- fix new_page_count_next_decode by @pansicheng in #6671
- Fix wrong weight reference in dynamic EPLB by @fzyzcjy in #6818
- Minor add metrics to expert location updater by @fzyzcjy in #6816
- [Refactor] Rename `n_share_experts_fusion` as `num_fused_shared_experts` by @ch-wan in #6735
- [FEAT] Add transformers backend support by @SunMarc in https://github.com/sgl-project/sglang/pull/5929
- [fix] recover auto-dispatch for rmsnorm and rope by @Alcanderian in https://github.com/sgl-project/sglang/pull/6745
- fix ep_moe_reorder kernel bugs by @BBuf in https://github.com/sgl-project/sglang/pull/6858
- [Refactor] Multimodal data processing for VLM by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/6659
- Decoder-only Scoring API by @chanh in https://github.com/sgl-project/sglang/pull/6460
- feat: add dp-rank to KV events by @ishandhanani in https://github.com/sgl-project/sglang/pull/6852
- Set `num_fused_shared_experts` as `num_shared_experts` when shared_experts fusion is not disabled by @ch-wan in https://github.com/sgl-project/sglang/pull/6736
- Fix one missing arg in DeepEP by @ch-wan in https://github.com/sgl-project/sglang/pull/6878
- Support LoRA in TestOpenAIVisionServer and fix fused kv_proj loading bug. by @lifuhuang in https://github.com/sgl-project/sglang/pull/6861
- support one-shot allreduce in 1-node and 2-node setups using mscclpp by @zyksir in https://github.com/sgl-project/sglang/pull/6277
- Fix Qwen3MoE missing token padding optimization by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6820
- Tiny update error hints by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6846
- Support layerwise rebalancing experts by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6851
- Tiny allow profiler API to auto create directory by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6865
- Support Blackwell DeepEP docker images by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6868
- [EP] Add cuda kernel for moe_ep_post_reorder by @yuan-luo in https://github.com/sgl-project/sglang/pull/6837
- Fix OpenAI Client error with single request via batch api by @ravi03071991 in https://github.com/sgl-project/sglang/pull/6170
- [PD] Fix potential perf spike caused by tracker gc and optimize doc by @ShangmingCai in https://github.com/sgl-project/sglang/pull/6764
- Use deepgemm instead of triton for fused_qkv_a_proj_with_mqa by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6890
- [CUTLASS-FP4-MOE] Introduce CutlassMoEParams class for easy initialization of Cutlass Grouped Gemms Metadata by @pavanimajety in https://github.com/sgl-project/sglang/pull/6887
- bugfix(OAI): Fix image_data processing for jinja chat templates by @CatherineSue in https://github.com/sgl-project/sglang/pull/6877
- [CPU] enable CI for PRs, add Dockerfile and auto build task by @ZailiWang in https://github.com/sgl-project/sglang/pull/6458
- AITER backend extension and workload optimizations by @HaiShaw in https://github.com/sgl-project/sglang/pull/6838
- [Feature] Support Flashinfer fmha on Blackwell by @NorthmanPKU in https://github.com/sgl-project/sglang/pull/6930
- Fix a bug in abort & Improve docstrings for abort by @merrymercy in https://github.com/sgl-project/sglang/pull/6931
- Tiny support customize DeepEP max dispatch tokens per rank by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6934
- Sync the changes on cuda graph runners by @merrymercy in https://github.com/sgl-project/sglang/pull/6932
- [PD] Optimize transfer queue forward logic for dummy rank by @ShangmingCai in https://github.com/sgl-project/sglang/pull/6922
- [Refactor] image data process in bench_serving by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/6879
- [fix] logical_to_all_physical_map index 256 is out of bounds in EP parallel. by @MiterV1 in https://github.com/sgl-project/sglang/pull/6767
- Add triton fused moe kernel config for E=257 on B200 by @Fridge003 in https://github.com/sgl-project/sglang/pull/6939
- [sgl-kernel] update deepgemm by @Alcanderian in https://github.com/sgl-project/sglang/pull/6942
- chore: bump sgl-kernel v0.1.6 by @zhyncs in https://github.com/sgl-project/sglang/pull/6943
- Minor compile fused topk by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6944
- [Bugfix] pipeline parallelism and Eagle Qwen2 by @Swipe4057 in https://github.com/sgl-project/sglang/pull/6910
- Tiny re-introduce profile id logging by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6912
- Add triton version as a fused_moe_triton config search key to avoid performance decrease across different Triton versions by @BBuf in https://github.com/sgl-project/sglang/pull/5955
- reduce torch.zeros overhead in moe align block size kernel by @BBuf in https://github.com/sgl-project/sglang/pull/6369
- chore: upgrade sgl-kernel v0.1.6 by @Alcanderian in https://github.com/sgl-project/sglang/pull/6945
- add fbgemm moe grouped gemm kernel benchmark by @BBuf in https://github.com/sgl-project/sglang/pull/6924
- [Docker] Add docker file for SGL Router by @YouNeedCryDear in https://github.com/sgl-project/sglang/pull/6915
- Disabling mixed chunked prefill when eagle is enabled by @Swipe4057 in https://github.com/sgl-project/sglang/pull/6874
- Add canary for EPLB rebalancing by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6895
- Refactor global_server_args_dict by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6866
- Fuse routed scaling factor in topk_reduce kernel by @BBuf in https://github.com/sgl-project/sglang/pull/6220
- Update server timeout time in AMD CI. by @saienduri in https://github.com/sgl-project/sglang/pull/6953
- [misc] add is_cpu() by @Alcanderian in https://github.com/sgl-project/sglang/pull/6950
- Add H20 fused MoE kernel tuning configs for DeepSeek-R1/V3 by @Xu-Wenqing in https://github.com/sgl-project/sglang/pull/6885
- Add a CUDA kernel for fusing mapping and weighted sum for MoE. by @elfiegg in https://github.com/sgl-project/sglang/pull/6916
- chore: bump sgl-kernel v0.1.6.post1 by @zhyncs in https://github.com/sgl-project/sglang/pull/6955
- chore: upgrade sgl-kernel v0.1.6.post1 by @zhyncs in https://github.com/sgl-project/sglang/pull/6957
- [DeepseekR1-FP4] Add Support for nvidia/DeepSeekR1-FP4 model by @pavanimajety in https://github.com/sgl-project/sglang/pull/6853
- Revert "Fuse routed scaling factor in topk_reduce kernel (#6220)" by @zhyncs in https://github.com/sgl-project/sglang/pull/6968
- [AMD] Add more tests to per-commit-amd by @hubertlu-tw in https://github.com/sgl-project/sglang/pull/6926
- chore: bump sgl-kernel v0.1.7 by @zhyncs in https://github.com/sgl-project/sglang/pull/6963
- Slightly improve the sampler to skip unnecessary steps by @merrymercy in https://github.com/sgl-project/sglang/pull/6956
- rebase h20 fused_moe config by @BBuf in https://github.com/sgl-project/sglang/pull/6966
- Fix CI and triton moe Configs by @merrymercy in https://github.com/sgl-project/sglang/pull/6974
- Remove unnecessary kernels of num_token_non_padded by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6965
- Extend cuda graph capture bs for B200 by @Fridge003 in https://github.com/sgl-project/sglang/pull/6937
- Fuse routed scaling factor in deepseek by @BBuf in https://github.com/sgl-project/sglang/pull/6970
- Sync cuda graph runners by @merrymercy in https://github.com/sgl-project/sglang/pull/6976
- Fix draft extend ut stability with flush cache by @ispobock in https://github.com/sgl-project/sglang/pull/6979
- Fix triton sliding window test case by @merrymercy in https://github.com/sgl-project/sglang/pull/6981
- Fix expert distribution dumping causes OOM by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6967
- Minor remove one kernel for DeepSeek by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6977
- [perf][sgl-kernel] extend cutlass_mla_decode to support num_head < 128 by @Alcanderian in https://github.com/sgl-project/sglang/pull/6929
- Enable more unit tests for AMD CI. by @saienduri in https://github.com/sgl-project/sglang/pull/6983
- Use torch.compile to fuse flash attention decode metadata preparation by @merrymercy in https://github.com/sgl-project/sglang/pull/6973
- Speed up set_lora_info by eliminating unnecessary H2D transfers by @lifuhuang in https://github.com/sgl-project/sglang/pull/6960
- support qwen3 embedding by @Titan-p in https://github.com/sgl-project/sglang/pull/6990
- Fix torch profiler bugs for bench_offline_throughput.py by @PanJason in https://github.com/sgl-project/sglang/pull/6557
- chore: upgrade flashinfer v0.2.6.post1 jit by @zhyncs in https://github.com/sgl-project/sglang/pull/6958
- cleanup tmp dir by @zhyncs in https://github.com/sgl-project/sglang/pull/7007
- chore: update pr test xeon by @zhyncs in https://github.com/sgl-project/sglang/pull/7008
- Fix cutlass MLA gets almost zero accuracy by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6998
- Update amd nightly models CI. by @saienduri in https://github.com/sgl-project/sglang/pull/6992
- feat: add direct routing strategy to DP worker by @ishandhanani in https://github.com/sgl-project/sglang/pull/6884
- Fallback to lower triton version for unfound fused moe configs by @Fridge003 in https://github.com/sgl-project/sglang/pull/7013
- Fix torchvision version for Blackwell by @Edenzzzz in https://github.com/sgl-project/sglang/pull/7015
- Simplify prepare_extend_after_decode by @merrymercy in https://github.com/sgl-project/sglang/pull/6987
- Migrate to assertEqual by @emmanuel-ferdman in https://github.com/sgl-project/sglang/pull/6741
- Fix torch version in blackwell dockerfile by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/7017
- chore: update pr test xeon by @zhyncs in https://github.com/sgl-project/sglang/pull/7018
- Update default settings for blackwell by @Fridge003 in https://github.com/sgl-project/sglang/pull/7023
- Support both approximate and exact expert distribution collection by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6964
- Add decode req pool by @ByronHsu in https://github.com/sgl-project/sglang/pull/6980
- [CI] Add CI workflow for sgl-router docker build by @YouNeedCryDear in https://github.com/sgl-project/sglang/pull/7027
- Fix fused_moe triton configs by @yudian0504 in https://github.com/sgl-project/sglang/pull/7029
- CPU: map changes from developing branch in sgl-kernel by @yanbing-j in https://github.com/sgl-project/sglang/pull/6833
- chore: bump v0.4.7 by @zhyncs in https://github.com/sgl-project/sglang/pull/7038
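As noted in the `max_completion_tokens` (#5857) and tool-calling (#6550) entries above, several of this release's additions surface as request-level options on the OpenAI-compatible endpoint. A minimal sketch combining the two follows; it assumes a running SGLang server with a tool-calling-capable model, and the tool definition and model name are hypothetical, so adapt them to your deployment.

```python
# Minimal sketch combining two changelog items: max_completion_tokens
# (#5857) and tool_choice="required" (#6550). The get_weather tool is a
# hypothetical example; the served model name is an assumption.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="required",      # force the model to emit a tool call
    max_completion_tokens=128,   # OpenAI-style cap on generated tokens
)
print(response.choices[0].message.tool_calls)
```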
New Contributors
Full Changelog: v0.4.6...v0.4.7