Release v0.4.7 · sgl-project/sglang
Highlights
- The PD disaggregation and large-scale EP functionalities from the blog post have now been fully merged into the latest release.
- The blog's results have been successfully reproduced by more than six industry teams, including the TensorRT-LLM team.
- SGLang's large-scale EP is now actively used by leading organizations such as Cursor, Qwen, Alimama, Alibaba Cloud, iFlytek, and more. It has been deployed and validated at scale, running on GPU clusters with thousands of devices.
- PD disaggregation and large-scale EP, in addition to supporting DeepSeek V3/R1, now also support Qwen 3 in the latest release.
- Full Blackwell support for DeepSeek V3/R1, Llama 4, and Qwen 3, with further optimizations underway.
- SGLang's DeepSeek V3/R1 now achieves 190 TPS on a single H200, outperforming other frameworks by over 50%.

We extend our sincere thanks to the following contributors, listed in alphabetical order: Alibaba Cloud, AMD Team, Ant Group, Baseten Team, Cursor Team, Dynamo Team, EAGLE Team, FlashInfer Team, Google Vertex AI Team, iFlytek MaaS Team, Intel Team, LinkedIn Team, Meituan Team, Microsoft Copilot Team, Mooncake Team, NVIDIA Team, Oracle Team, Qwen Team, Voltage Park Team, and open source community users. Your support and collaboration are deeply appreciated!
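For readers who want to try the release, here is a minimal sketch of querying such a deployment through SGLang's OpenAI-compatible API. It assumes a server has already been launched (for example, `python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B --tp 8`) and is listening on the default port 30000; the model name and the `enable_thinking` chat-template field are illustrative and should be checked against the docs for your version.

```python
# Minimal sketch (not the release's own example): query an SGLang
# OpenAI-compatible endpoint. Assumes a server is already running on
# localhost:30000; model name, port, and the chat-template field are
# deployment-specific assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",
    messages=[{"role": "user", "content": "Briefly explain expert parallelism."}],
    max_tokens=256,
    # Thinking-mode toggle (see #5551): forwarded to the chat template via
    # chat_template_kwargs; verify the exact field name in the SGLang docs.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)
```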
What's Changed
- Update nightly-test.yml by @merrymercy in #5797
- [CI] Improve github summary & enable fa3 for more models by @merrymercy in #5796
- [Docs] update grafana setup guide in production metrics by @PopSoda2002 in #5643
- [Misc] add structure logging, write to file and log tracing for SGL R… by @slin1237 in #5741
- Improve overlap scheduling by @hnyls2002 in #5788
- Add Cutlass MLA attention backend by @trevor-m in #5390
- chore: upgrade sgl-kernel 0.1.0 by @zhyncs in #5690
- Dockerfile.dev pip scikit_build_core by @BBuf in #5807
- Add a doc to fix sgl-kernel build link error in py39 with ccache by @BBuf in #5809
- Turn on overlap scheduler for multimodal models by @merrymercy in #5771
- Tiny refactor DefaultModelLoader.Source by @fzyzcjy in #5482
- [Docs] Replace lists with tables for cleanup and readability in server_arguments by @windsonsea in #5276
- Revert "Tiny refactor DefaultModelLoader.Source" by @merrymercy in #5825
- Feat: add support for thinking mode via chat_template_kwargs.enable_t… by @minleminzui in #5551
- fix: fix the error where the content is None when reasoning and tool … by @minleminzui in #5838
- feat: Add fused moe triton config for qwen3 moe on h100 by @JustinTong0323 in #5833
- fused moe triton tuning script support qwen3 by @BBuf in #5842
- feat: Add fused moe triton config for qwen3bf16 moe on h20 by @yhyang201 in #5839
- [PD] support pd fake transfer for warmup by @whybeyoung in #5726
- [qwen3] qwen3moe_tune_h20 fp8 tp4 by @whybeyoung in #5846
- [Doc] Recover history of server_arguments.md by @Fridge003 in #5851
- feat: Add fused moe triton config for qwen3-30b-fp8 moe on h20 by @GeLee-Q in #5850
- [CI] test chunked prefill more by @merrymercy in #5798
- ROCm: update AITER by @HaiShaw in #5816
- [Feat] QWen-1M context support[1/2]: Update block sparse attention backend utils kernel by @yinfan98 in #5847
- [Fix] Missing bootstrap_port field by @xutianyi1999 in #5823
- feat: update is_fa3_default_architecture by @zhyncs in #5854
- add fused moe config for qwen3moe fp8/bf16 by @yizhang2077 in #5849
- chore: bump v0.4.6.post1 by @zhyncs in #5845
- Support `max_completion_tokens` for OpenAIChatCompletions by @CatherineSue in #5857 (usage sketch after this list)
- simplify fused_moe config logging by @BBuf in #5801
- [CI] tune the test order to warmup the server by @merrymercy in #5860
- Cutlass MLA decode - fix dtype error by @trevor-m in #5868
- Support cutlass 3.9 to improve fp8_blockwise_gemm by @BBuf in #5820
- [Feature] support auto chat template by @woodx9 in #4949
- Feat: support cuda graph for LoRA by @Qiaolin-Yu in #4115
- Add qwen3 30b fused moe config by @JustinTong0323 in #5859
- [Fix] Fix a bug for flashmla to run R1 model by @pengcuo in #5875
- Add A800 fused moe config for qwen3 30b by @lambert0312 in #5880
- [Misc] add service discovery for sgl router by @slin1237 in #5865
- [fix]: PyO3 macOS linking and consolidate on tracing for logging by @slin1237 in #5856
- chore: update Dockerfile by @zhyncs in #5894
- [Docs] Update docs for Qwen3 and Qwen3MoE by @adarshxs in #5836
- Tables instead of bulletpoints for sampling doc by @simveit in #5841
- chore: update CODEOWNERS by @zhyncs in #5895
- [FEATURE] Enhance platform compatibility for ARM by @johnnynunez in #5746
- [CI] Add test_function_calling.py to run_suite.py by @CatherineSue in #5896
- Auto set draft model path for MTP by @ispobock in #5793
- [fix] relax mem_fraction_static for h200 by @Alcanderian in #5893
- feat: support pythonic tool call and index in tool call streaming by @CatherineSue in #5725
- [Bugfix]: fix missing queue_time_start for requests from grammar_queue by @CatherineSue in #5696
- Add AMD MI300x Nightly Testing. by @saienduri in #5861
- chore: use torch 2.6 for sgl-kernel build by @zhyncs in #5898
- Fix check_env script by @lambert0312 in #5901
- [PD] Fix Assertion failed: /DeepEP/csrc/kernels/internode.cu:483, condition: ibgda_get_state()->num_rc_per_pe >= num_channels #134 by @whybeyoung in #5830
- Bump Flashinfer to 0.2.5 by @Fridge003 in #5870
- [Fix] Unload lora in HF_Runner if needed by @Qiaolin-Yu in #5899
- Add A800 fused moe config for qwen3 235b by @lambert0312 in #5900
- Add sm_120 for blackwell by @zhjunqin in #5903
- [Feature] add support kimi vl model by @liwenju0 in #5383
- support vlm benchmark profile by @yizhang2077 in #5905
- [fix] kimi-vl test in test_vision_openai_server.py by @Alcanderian in #5910
- [Misc] use parallel build for cmake in sgl-kernel by @yinfan98 in #5919
- [qwen3] support qwen3 ep moe by @laixinn in #5917
- Add TP2 MOE benchmarks for AMD. by @saienduri in #5909
- [Feat] Scale up fa3 kernel to sm8x arch by @yinfan98 in #5912
- chore: bump sgl-kernel 0.1.1 by @zhyncs in #5932
- chore: upgrade sgl-kernel 0.1.1 by @zhyncs in #5933
- Remove unused method `calculate_num_image_tokens` from qwen2_vl.py by @JustinTong0323 in #5783
- [PP] Add pipeline parallelism by @Ying1123 in #5724
- Fix lora batch processing when input lora_path contains None by @Qiaolin-Yu in #5930
- add Thor & Spark by @johnnynunez in #5915
- fix: correct stream response when enable_thinking is set to false by @minleminzui in #5881
- fix: update model runner by @zhyncs in #5934
- chore: bump v0.4.6.post2 by @zhyncs in #5939
- Support XiaomiMiMo/MiMo model inference by @ryang-max in #5921
- [PD] Vectorise group_concurrent_contiguous in NumPy by @yuan-luo in #5834
- Remove extra contiguous by @ispobock in #5953
- Update ci test and doc for MTP api change by @ispobock in #5952
- docs: Fix Qwen model typo by @JiangJiaWei1103 in #5944
- Optimize a pad operation to save 25us by @hebiao064 in #5945
- Properly return error response in vertex_generate HTTP endpoint by @KCFindstr in #5956
- feat: add concurrency evaluation logic in mmmu benchmark by @JustinTong0323 in #5782
- Add 1 gpu perf and 2 gpu accuracy tests for AMD MI300x CI. by @saienduri in #5960
- feat: Refactor DeepSeekV3 function call by @CatherineSue in #5908
- Remove token in token out in Native API by @zhaochenyang20 in #5967
- Support InternVL3 by @xiaomin-D in #5350
- Support MMMU benchmark for InternVL by @JustinTong0323 in #5968
- FA3 speed up: skip len operation and get batch size directly from forward batch by @lifuhuang in #5969
- [PD] NIXL backend Prefill TP & Decode TP+DP by @jokerwyt in #5681
- Fix set kv cache multi-stream by @ispobock in #5975
- Overlap qk norm with two streams by @ispobock in #5977
- fix: only upgrade nccl for cu128 by @zhyncs in #5986
- Fix Phi3 serving which was broken by an earlier change by @hebiao064 in #5991
- [perf] H100 DeepSeek-V3 fused moe tuned config by @Alcanderian in #5998
- [Fix] Suppress dynamo logging when using flashinfer backend with torch compile by @Fridge003 in #5992
- [Minor] Fix duplicate method definitions in conversation.py by @lifuhuang in #6012
- Fix flaky issues of lora and add multi batch tests by @Qiaolin-Yu in #5957
- Add `chat_template_kwargs` documentation by @vincentzed in #5679
- fix: fix broadcast_pyobj breaking VerlEngine by @ocss884 in #5997
- [PD] Allow customizing reserved tokens to avoid KV cache waste by @fzyzcjy in #6002
- Update dev container config to support live code sync and improve docker setup guide by @lifuhuang in #6018
- [PD] Optimize disaggregation ib device help info by @ShangmingCai in #5781
- [Test] Add flashmla attention backend test by @PopSoda2002 in #5587
- Fix "Avoid computing lse in Ragged Prefill when there's no prefix match" by @Edenzzzz in #5555
- feat: Add a unified merge_state API by @DefTruth in #5428
- feat: append more comprehensive fields in messages instead of merely role and content by @minleminzui in #5996
- [Security][Bug] Prevent binding to all TCP interfaces by @adarshxs in #5752
- Fix prefill OOM error in the case of large page size by @xiezhq-hermann in #5081
- Fix problem of large page size with chunked prefill by @xiezhq-hermann in #6046
- docs: add Google Cloud Vertex AI in Adoption and Sponsorship by @zhyncs in #6047
- docs: add new blog by @zhyncs in #6048
- Fix missing "import os" by @hnyls2002 in #6057
- Better PD initialization by @hnyls2002 in #5751
- fix: deepep dockerfile, use pip install deepep. by @HanHan009527 in #5885
- [Fix] Fix and rename flashmla CI test by @Fridge003 in #6045
- chore: upgrade cutlass 3.9.2 by @zhyncs in #6004
- Fix sgl-kernel build on aarch64 platforms by @Qiaolin-Yu in #6062
- Add DeepEP to CI PR Test by @liz-badada in #5655
- fix custom_allreduce namespace by @BBuf in #6039
- feat: add release workflow for SGLang kernels on aarch64 by @johnnynunez in #6010
- [Feature] Support for Ascend NPU backend by @botieking98 in #3853
- Fix the timeout for 8 gpu tests by @merrymercy in #6084
- Hint users DeepEP normal mode is incompatible with CUDA Graph by @fzyzcjy in #5014
- Super tiny fix doc by @fzyzcjy in #5233
- [Doc]Fix description for dp_size argument by @Fridge003 in #6063
- feat(engine): add bootstrap parameters to generate methods (dynamo) by @ishandhanani in #6075
- [refactor] slightly tidy fp8 module by @Alcanderian in #5993
- Clean up fa3 test from 8 gpus by @hebiao064 in #6105
- Deferring 8 GPU test by @ch-wan in #6102
- Update doc for MLA attention backends by @Fridge003 in #6034
- Clean logs for DeepSeek-V3 launching by @Fridge003 in #6079
- [CI]Add performance CI for VLM by @JustinTong0323 in #6038
- adding Triton configs for DeepSeekV3 FusedMoE kernel on Blackwell by @Fridge003 in #6111
- optimize pad operations in fa3 to save 100+us by @zminglei in #6077
- Overlap shared expert and routed expert computations by @fzyzcjy in #5121
- Tiny refactor ModelConfig.from_server_args by @fzyzcjy in #5219
- Tiny refactor weight loading logic by @fzyzcjy in #5232
- [PD] Add control to slow down a server by @fzyzcjy in #5572
- Change AMD test threshold by @fzyzcjy in #6091
- DeepEP normal support deepgemm-contiguous by @sleepcoo in #5626
- [fix] fix pyproject.toml dependencies by @Alcanderian in #6119
- [Feature] Add FlashAttention3 as a backend for VisionAttention by @Othame in #5764
- [perf] dsv3 bmm fallback to bf16 by @Alcanderian in #5662
- [AMD] switch to custom allreduce regardless of MSCCL setting on ROCm by @hubertlu-tw in #6097
- [sgl-kernel] fix: fix cu118 compile error by @yinfan98 in #6123
- upgrade xgrammar to 0.1.19 by @Ubospica in #6129
- Remove unnecessary is_fa3_supported check by @hebiao064 in #6112
- chore: bump sgl-kernel 0.1.2 by @zhyncs in #6131
- docs: update README by @zhyncs in #6132
- [Fix] Incorrect Memory Allocation on CUDA:0 by Non-Zero CUDA Processes in TP/DP by @yhyang201 in #5745
- Cutlass MLA: Disable split kv due to NVIDIA/cutlass#2274 by @trevor-m in #6101
- opt flashinfer mla cat by @xu-yfei in #5822
- Update amd nightly concurrency. by @saienduri in #6141
- sampling_params: add thinking_budget by @thyecust in #6089
- [Bugfix] Fix Llama4 gibberish output with long context and CUDA graph by @CatherineSue in #6162
- fix bug that gpu0 occupies more memory when hicache is turned on by @huangtingwei9988 in #5778
- chore: bump v0.4.6.post3 by @zhyncs in #6165
- KV-Cache (MHA, MLA): add missing start_layer / end_layer fields to MHATokenToKVPoolHost and MLATokenToKVPoolHost by @Simon-Li in #6016
- [fix] fix determine_n_share_experts_fusion by @Alcanderian in #6118
- Fix and Clean up chat-template requirement for VLM by @JustinTong0323 in #6114
- [Docs]Delete duplicate content by @Ximingwang-09 in #6146
- Revert "feat: add thinking_budget (#6089)" by @zhyncs in #6181
- Added async_encode method to Engine by @shimizust in #4701
- Fix data parallel perf regression by @merrymercy in #6183
- Fix request abortion by @merrymercy in #6184
- Add typo checker in pre-commit by @applesaucethebun in #6179
- Remove duplicate IO Struct test by @emmanuel-ferdman in #6180
- [PD] Add simple unit test for disaggregation feature by @ShangmingCai in #5654
- [CI] Disabled deepep tests temporarily because it takes too much time. by @merrymercy in #6186
- feat: support loogle eval by @zhyncs in #6190
- [fix] remove mixtral from is_fa3_default_architecture by @Alcanderian in #6191
- fix: handle None multimodal_inputs during merging and filtering batches in disaggregation decode mode by @GaoYusong in #6169
- chore: upgrade deepgemm by @zhyncs in #6073
- chore: bump sgl-kernel v0.1.2.post1 by @zhyncs in #6195
- chore: upgrade sgl-kernel v0.1.2.post1 by @zhyncs in #6196
- Handle empty input string for embedding models by @ravi03071991 in #5621
- doc: fix the erroneous documents and example codes about Alibaba-NLP/gme-Qwen2-VL-2B-Instruct by @minleminzui in #6199
- [Docs] minor Qwen3 and reasoning parser docs fix by @adarshxs in #6032
- Improve structured outputs: fix race condition, server crash, metrics and style by @merrymercy in #6188
- [CI] Reorganize the 8 gpu tests by @merrymercy in #6192
- Add dev-deepep docker image by @fzyzcjy in #6198
- Replace time.time() with time.perf_counter() for benchmarking by @lifuhuang in #6178
- Update README.md by @merrymercy in #6202
- Fix release-docs.yml to not use python 3.9 by @merrymercy in #6204
- Fix start_profile does not support with_stack and record_shapes by @fzyzcjy in #6043
- [doc] add a note for --n-share-experts-fusion args by @BBuf in #6154
- Performing Vocabulary Parallelism for LM Head across Attention TP Groups by @ch-wan in #5558
- Update AMD CI docker to v0.4.6.post3-rocm630. by @saienduri in #6213
- Log if cuda graph is used & extend cuda graph capture to cuda-graph-max-bs by @merrymercy in #6201
- [CI] Fix PD mooncake dependency error by @ShangmingCai in #6212
- [CI] Re-enable pd disaggregation test by @ShangmingCai in #6231
- fix some typos by @applesaucethebun in #6209
- [Docs] Add docs for `SGLANG_` and `SGL_` environment variables by @b8zhong in #6206
- [PP] Fix init_memory_pool desync & add PP for mixtral by @Ying1123 in #6223
- Revert "fix some typos" by @merrymercy in #6244
- chore: add hf_xet dep by @zhyncs in #6243
- Update AMD nightly deps. by @saienduri in #6241
- [PD] Add support for different TP sizes per DP rank by @ShangmingCai in #5922
- Support incremental streaming of logprob/token_ids between scheduler and detokenizer by @merrymercy in #6225
- fix typo by @zhyncs in #6248
- Support tuning moe for llama 4 model by @fzyzcjy in #6042
- Skip the flaky test_stateful_custom_logit_processor by @merrymercy in #6251
- [Llama4] Add docs note about enable multimodal by @b8zhong in #6235
- [VERL Use Case] Add torch_memory_saver into deps by @hebiao064 in #6247
- Fix two issues related to `--moe-dense-tp-size=1` by @ch-wan in #5657
- model(vlm): pixtral by @KivenChen in #5084
- [misc] deep_gemm fallback to NVRTC when NVCC not found by @Alcanderian in #6252
- Enable MI325X AMD CI. by @saienduri in #6259
- chore: bump v0.4.6.post4 by @zhyncs in #6245
- [CPU] Add CMakeLists.txt for sgl-kernel by @blzheng in #6115
- perf: optimize local_block_table memory allocation by @CatherineSue in #6273
- Fix a bug in schedule_policy by @Ying1123 in #6276
- [Bug] Fix accidental logger override caused by internVL. by @lifuhuang in #6282
- doc: update developer guide regarding mllms by @mickqian in #6138
- docs: fix a bad redirect by @b8zhong in #6300
- Enable unit tests for AMD CI. by @saienduri in #6283
- [AMD] Fix Llama 4 Scout and Maverick accuracy issues on MI300X by @hubertlu-tw in #6274
- feat: add flush cache to EngineBase and HttpServerEngineAdapter by @ocss884 in #6009
- [fix][RL] Remove the incorrect barrier in init_weights_update_group by @zhuzilin in #5914
- [Feat] Support FlashMLA backend with MTP and FP8 KV cache by @quinnrong94 in #6109
- [misc] remove redundant platform codes by @Alcanderian in #6298
- Add fp8 gemm kernel for CPU in sgl-kernel and add gemm UT by @chunyuan-w in #6216
- Fix gpu mem check on CPU by @yiliu30 in #6317
- Reduce MoE memory usage by @fzyzcjy in #6147
- Fix lora bench by @Qiaolin-Yu in #6302
- Minor improvements of TokenizerManager / health check by @merrymercy in #6327
- Upgrade CUTLASS 4.0 by @elfiegg in #6336
- Support precomputed multimodal features for Qwen-VL and Gemma3 models. by @ysulsky in #6136
- [Fix] Improve dependencies for Blackwell image by @Fridge003 in #6334
- [2/2] Add python wrapper for CUTLASS FP8 Blockscale MoE Kernel. by @elfiegg in #5694
- feat: add dp attention support for Qwen 2/3 MoE models, fixes #6088 by @Fr4nk1inCs in #6121
- Update CODEOWNERS by @merrymercy in #6359
- [Minor] cleanup unused imports by @merrymercy in #6358
- Fix amd ci by @merrymercy in #6360
- docs: update readme by @zhyncs in #6361
- model(vlm): mistral 3.1 by @KivenChen in #5099
- Fix one wasted kernel in DeepSeek and minor refactor by @fzyzcjy in #6316
- Minor code cleanup refactor for DeepSeek models by @fzyzcjy in #6324
- chore: bump sgl-kernel v0.1.3 by @zhyncs in #6368
- perf: Optimize local attention memory allocation in FlashAttentionBackend by @CatherineSue in #6356
- docs: Update the MD files by @vincentzed in #6373
- [router] Add /list_workers endpoint to router by @zhuzilin in #6366
- Speed up when having padding tokens in DeepEP by @fzyzcjy in #6175
- Use monotonic clock for interval measurement by @lifuhuang in #6211
- [fix] illegal memory in _fwd_kernel_ep_scatter_2 and _fwd_kernel_ep_gather by @xutizhou in #6348
- Fix stop_profile does not wait for finishing by @fzyzcjy in #4741
- Support outputting details for bench_serving by @fzyzcjy in #6107
- Tiny refactor bench_serving to improve extensibility by @fzyzcjy in #6134
- Tiny refactor bench_serving to extract RequestFuncOutput.init_new by @fzyzcjy in #6108
- Support custom DeepEP tuning config by @fzyzcjy in #6257
- Fix expert distribution recorder and profiler command stuck forever by @fzyzcjy in #6284
- Reland tiny refactor DefaultModelLoader.Source by @fzyzcjy in #6041
- Add expert distribution APIs for engine by @fzyzcjy in #6290
- fix: allow `launch_dummy_health_check_server` to start inside of a running asyncio loop by @ishandhanani in #6330
- [Fix Chat API] add request id for chat/completion for tracing by @whybeyoung in #6364
- Fix CI tests by @merrymercy in #6362
- chore: upgrade sgl-kernel v0.1.3 by @zhyncs in #6377
- Do not use FA3 for mistral by @merrymercy in #6379
- refactor: minor refactors regarding multimodal processing by @mickqian in #6187
- Add pipeline parallelism for Qwen2 and Qwen3 Model by @libratiger in #6250
- Clean up AMD CI. by @saienduri in #6365
- feat: add long context example by @zhyncs in #6391
- The Gemma template is missing a newline after the user role. by @ysulsky in #6331
- chore: tiny remove duplicated code by @doujiang24 in #6392
- Add 4-GPU runner tests and split existing tests by @fzyzcjy in #6383
- Add fp8 shared_expert kernel for CPU in sgl-kernel and add UT by @chunyuan-w in #6339
- [fix] fix fa3 forward_decode with spec_decode by @Alcanderian in #6395
- Add missing model to doc by @applesaucethebun in #6396
- [OAI] Add rid tracing for v1/embeddings and fix rid type in Chat by @CatherineSue in #6397
- [Misc] Implement RankZeroFilter for rank-specific logging in model_runner.py by @CatherineSue in #6333
- refactor: Extract repeated member variables in KVCache subclasses to base class. by @wangxiyu191 in #6323
- Refactor DeepSeek MoE layer to unify the two forward branches by @fzyzcjy in #6325
- vlm: tensor hash kernel by @mickqian in #5974
- [Bugfix] Fix field error in v1_embedding_request by @CatherineSue in #6400
- Fix request id error by @fzyzcjy in #6401
- Implement `return_hidden_states` for the OpenAI API by @kyle-pena-kuzco in #6137
- Fix nodeepgemm init by @sleepcoo in #6417
- Improve supported models doc by @simveit in #6430
- Fix throughput threshold for amd ci test by @Fridge003 in #6414
- [Metrics] Add KV events publishing by @trevor-m in #6098
- [BUG] fix stop_profile crash by @yizhang2077 in #6431
- Revert "Implement `return_hidden_states` for the OpenAI API (#6137)" by @zhyncs in #6440
- Expert distribution recording without overhead for EPLB by @fzyzcjy in #4957
- Refactor communication logic of DeepSeek for extensibility and understandability by @fzyzcjy in #6321
- Remove `Cargo.lock`, add it into `.gitignore` by @hnyls2002 in #6438
- Refactor DeepSeek logic into atomic operations by @fzyzcjy in #6326
- Support loading weights when physical experts are different from logical experts by @fzyzcjy in #6386
- Support DeepSeek EPLB algorithm with static distributions by @fzyzcjy in #6387
- Address performance regression: disable multiple streams on ROCm by @HaiShaw in #6412
- [QuickFix] fix gptq model initialize by @yinfan98 in #6429
- Update extend/decode attention kernel for CPU in sgl-kernel and add UTs by @yanbing-j in #6405
- [doc] add note for get_num_kv_splits in triton_backend by @Alcanderian in #6444
- Support dispatching logical to physical experts by @fzyzcjy in #6385
- Fix master CI for DeepSeek by @fzyzcjy in #6447
- [docs] Fix torch version by @Edenzzzz in #6472
- Disable all two stream overlap on amd by @merrymercy in #6475
- Refactor group_concurrent_contiguous in NIXL by @yuan-luo in #6214
- aiter attention-backend (default enabled on AMD/ROCm) by @HaiShaw in #6381
- Implement Siglip Vision model, and support BNB quantization for gemma3-mm by @guapisolo in #5339
- [router] support http2 in router by @zhuzilin in #6487
- [RL] allow weight updates with dp attention enabled by @zhuzilin in #6311
- Refactor DeepSeek attention dispatching by @fzyzcjy in #6476
- Fix num_qps_per_rank computation when providing custom DeepEP configuration by @fzyzcjy in #6468
- Tiny add stage assertions to DeepEPDispatcher to avoid misuse by @fzyzcjy in #6467
- Support redundant experts in expert parallel by @fzyzcjy in #6461
- Tiny make Lint CI show diff by @fzyzcjy in #6445
- Let bench_one_batch_server use sharegpt data to make expert distribution more natural by @fzyzcjy in #5573
- Fix bench_one_batch_server by @fzyzcjy in #6503
- [Fix]Fix capture fail bug for DeepSeek by @Fridge003 in #6275
- [CPU] Fix build issue by @blzheng in #6419
- fix: EXAONE when using tie_word_embeddings by @lkm2835 in #5759
- doc: Update README.md with adding deepwiki badge to enable weekly auto-refresh by @JustinTong0323 in #6508
- Recover from corrupted cache file in bench serving by @fzyzcjy in #6510
- Apply constraint grammar to EAGLE by @ispobock in #6499
- [1/2] Support Qserve by @HandH1998 in #6457
- [PD] Add doc and simplify sender.send by @ByronHsu in #6019
- [PD] Abort request if transfer fails by @ByronHsu in #6504
- Add main for merge state tests by @yuan-luo in #6492
- Support updating expert locations dynamically by @fzyzcjy in #6388
- [RL] Remove the w13 weight_scale and input_scale for UnquantizedEPMoE… by @zhuzilin in #6308
- Support logging expert balancedness metrics by @fzyzcjy in #6482
- Support dynamically rebalancing experts using EPLB by @fzyzcjy in #6469
- Fix missing http status import for PD failure handler by @ShangmingCai in #6520
- chore: bump sgl-kernel v0.1.4 by @zhyncs in #6522
- Support qwen3 deepep by @sleepcoo in #6120
- chore: upgrade sgl-kernel v0.1.4 by @zhyncs in #6532
- Support XiaomiMiMo inference with mtp by @ryang-max in #6059
- misc: fix accept_length by @zhyncs in #6536
- [PD] Fix failure abort by @ByronHsu in #6535
- [VLM] Support chunk prefill for VLM by @CatherineSue in #6355
- Add fp8 qkv_proj_with_rope kernel for CPU in sgl-kernel and add UT by @blzheng in #6493
- Add fp8 fused_experts kernel for CPU in sgl-kernel and add UT by @chunyuan-w in #6404
- Update sgl-kernel UTs for activation/topk/norm/rope kernels by @yanbing-j in #6452
- Fix topk inference performance regression by @lambert0312 in #6474
- [PD] support spec decode by @ByronHsu in #6507
- [2/2] Support Qserve by @HandH1998 in #6521
- [PD] Support logprob & Add failure test by @ByronHsu in #6558
- fix: remove content=none test when tool called by @shuaills in #6347
- Update cmdline --enable-dp-attention help string for Qwen 2/3 Moe models. by @MiterV1 in #6524
- Bugfix: handle flatten_batch constraint for multiple images by @CatherineSue in #6562
- support eplb for qwen3 by @yizhang2077 in #6533
- feat(Tool Calling): Support `required` and specific function mode by @CatherineSue in #6550 (usage sketch after this list)
- [PD] Support structured output by @ByronHsu in #6560
- [FIX] Remove ServerArgs duplicate code by @pc-neo in #6485
- Fix zero accuracy when enabling moe-dense-tp-size in large-scale EP by @fzyzcjy in #6567
- chore: bump v0.4.6.post5 by @zhyncs in #6566
- Temporarily disable MI325x 8 gpu testing. by @saienduri in #6576
- Fix GPU OOM by @kkHuang-amd in #6564
- Refactor attention into multiple stages by @fzyzcjy in #6477
- Add back DeepSeek non-TBO branches by @fzyzcjy in #6578
- Utilize static dispatching for communicator by @fzyzcjy in #6577
- Support overlapping two batches by @fzyzcjy in #4068
- Refactor vlm embedding routine to use precomputed feature by @JustinTong0323 in #6543
- [OAI] Support non-normalized logprobs in OpenAI server by @CatherineSue in #5961
- Support Phi-4 Multi-Modal (text + vision only) by @lifuhuang in #6494
- Sgl-router Prometheus metrics endpoint and usage track metrics by @upfixer in #6537
- added support for tied weights in qwen pipeline parallelism by @FrankLeeeee in #6546
- Hint users when weight updates time out by @fzyzcjy in #6570
- Fix some issues with current docs. by @simveit in #6588
- [PD] Fix prefill_servers in mini_lb by @wangxiyu191 in #6527
- Fix bench_serving does not support changing warmup requests by @fzyzcjy in #6439
- Support fake perfectly balanced EP dispatch algorithm by @fzyzcjy in #6571
- Fix profiling will crash the server when using num_steps by @fzyzcjy in #6586
- Improve performance of two batch overlap in some imbalanced cases by @fzyzcjy in #6593
- Logging and minor fixes to two batch overlap and EPLB by @fzyzcjy in #6595
- Tiny change killall_sglang.sh by @fzyzcjy in #6596
- Auto handle PD disaggregation in bench_serving by @fzyzcjy in #6587
- Support accurate length control for bench serving by @fzyzcjy in #6594
- Tiny fix lint CI does not trigger on master by @fzyzcjy in #6609
- chore: upgrade transformers 4.52.3 by @zhyncs in #6575
- Revert "Tiny fix lint CI does not trigger on master (#6609)" by @zhyncs in #6610
- refactor qwen moe code, use communicator to support tp+dp by @yizhang2077 in #6581
- feat: Improve Mistral and Qwen25 function call parsing by @CatherineSue in #6597
- qwen3moe support two batch overlap by @yizhang2077 in #6598
- Tiny fix CI by @fzyzcjy in #6611
- Supported precomputed feature for Kimi VL by @lifuhuang in #6599
- [FA][Test] Fix Sparse FA test by @b8zhong in #6306
- fix qwen3moe eplb prefill bug by @yizhang2077 in #6617
- Automatically configure for EPLB-related args by @fzyzcjy in #6628
- Fix EPLB algorithm fail to run when using 3 nodes for prefill by @fzyzcjy in #6629
- Tiny fix missing expert location dispatch info by @fzyzcjy in #6620
- Update nightly thresholds and dependencies. by @saienduri in #6635
- Tiny fix sampler error when prob is not contiguous by @fzyzcjy in #6639
- [PD] Handle P/D failure and reconnect without affecting other instances by @ShangmingCai in #6263
- follow-up: move Idefics2 to a shared location to eliminate unexpected dependency. by @lifuhuang in #6603
- fix: added "\n" to qwen25 tool parser structural tags by @shuaills in #6631
- [New Model] Devstral support by @JustinTong0323 in #6547
- chore: upgrade mooncake-transfer-engine by @zhyncs in #6643
- Tiny refactor communicator by @fzyzcjy in #6646
- Support TP in attention for two batch overlap by @fzyzcjy in #6634
- Super tiny rename environment variable by @fzyzcjy in #6648
- Refactor LoRA handling to support adapter tensors in fused format by @lifuhuang in #6585
- [Bugfix]: Fix call for function_call_parser.multi_format_detector in adapter.py by @CatherineSue in #6650
- Update doc TOC and Dockerfile code style format by @habaohaba in #6450
- Add note to add supported model to documentation by @b8zhong in #6640
- docs: Update documentation to reflect xgrammar as default grammar backend by @vincentzed in #6601
- Add environment flag for disabling message queue broadcaster by @Fridge003 in #6403
- fix: fix nightly test from updating transformers by @mickqian in #6658
- Fix qwen3 tbo/dp-lm-head by @yizhang2077 in #6652
- fix communicator for non-dp lm head by @ch-wan in #6662
- Support EAGLE draft extend CUDA graph by @ispobock in #6606
- DeepSeek: enable non-block-quant FP8 quantizations by @HaiShaw in #6638
- Fix OOM when updating expert locations by @fzyzcjy in #6660
- Speed up expert location update by @fzyzcjy in #6661
- Revert "fix communicator for non-dp lm head (#6662)" by @zhyncs in #6677
- [PD] Make bootstrap code common between NIXL and Mooncake by @trevor-m in #6473
- [CI] update verlengine ci to 4-gpu test by @ocss884 in #6007
- Fix DeepEP error in Qwen 3 MoE models by @fzyzcjy in #6673
- Support gathering expert distribution details by @fzyzcjy in #6665
- Disable compiling arch below sm_90 in aarch64 by default by @Qiaolin-Yu in #6380
- fix(tool call): Fix tool_index in PythonicDetector and issues with mixed output in non-streaming by @CatherineSue in #6678
- Add batch test for draft extend by @ispobock in #6672
- feat: Add warnings for invalid tool_choice and UTs by @CatherineSue in #6582
- Update amd docker and nightly models. by @saienduri in #6687
- Refine pre_reorder_triton_kernel slightly to improve performance by @yuan-luo in #6627
- fix log_info_on_rank0 error when running benchmarks by @BBuf in #6260
- fix(deepseekv3): Fix DeepSeekV3Detector tool_index assignment and multi-tool call streaming support by @CatherineSue in #6655
- [Bugfix] Fix missing abort finish reason for PD with ChatCompletion by @ShangmingCai in #6693
- [CI] Fix flaky pp single node test by @ShangmingCai in #6689
- [PD] Abort unbootstrapped prefill requests through timeout by @ShangmingCai in #6685
- [PD Perf] replace Queue with FastQueue by @whybeyoung in #6649
- [Bugfix] Fix slice operation when chunk size mismatch by @ShangmingCai in #6697
- [Bugfix] Fix ChatCompletion endpoint of mini_lb when stream is set by @ShangmingCai in #6703
- [CI] Fix setup of disaggregation with different tp by @ShangmingCai in #6706
- [PD] Remove Unnecessary Exception Handling for FastQueue.get() by @Hongbosherlock in #6712
- Fuse routed_scaling_factor in DeepSeek by @fzyzcjy in #6710
- Overlap two kernels in DeepSeek with communication by @fzyzcjy in #6711
- Minor refactor two-batch overlap by @fzyzcjy in #6682
- Speed up when having padding tokens two-batch overlap by @fzyzcjy in #6668
- [Feature] Support Flashinfer fp8 blockwise GEMM kernel on Blackwell by @Fridge003 in #6479
- Fix LoRA bench by @Edenzzzz in #6719
- Fix PP for Qwen3 MoE by @jinyouzhi in #6709
- [feat] triton kernel for get_last_loc by @Alcanderian in #6676
- [fix] more mem for draft_extend cuda_graph by @Alcanderian in #6726
- [PD] bug fix: Update status if nixl receiver sends a dummy req by @thesues in #6720
- Tune memory arguments on B200 by @Fridge003 in #6718
- Add DeepSeek-R1-0528 function call chat template by @Xu-Wenqing in #6725
- refactor(tool call): Fix BaseFormatDetector tool_index issue and refactor `parse_streaming_increment` by @CatherineSue in #6715
- Add draft extend CUDA graph for Triton backend by @ispobock in #6705
- refactor apply_w8a8_block_fp8_linear in fp by @ChangyiYang in #6545
- [PD] Support completion endpoint by @ShangmingCai in #6729
- Init PD Rust LB (PO2) by @hnyls2002 in #6437
- Super tiny enable sole usage of expert distribution metrics and update doc by @fzyzcjy in #6680
- Support picking variants of EPLB algorithms by @fzyzcjy in #6728
- Support tuning DeepEP configs by @fzyzcjy in #6742
- [test] add ut and bm for get_last_loc by @Alcanderian in #6746
- Fix mem_fraction_static for AMD CI by @Fridge003 in #6748
- [fix][RL] Fix DeepSeekV3ForCausalLM.post_load_weights for multiple update weight by @zhuzilin in #6265
- Improve EPLB logical to physical dispatch map by @fzyzcjy in #6727
- Update DeepSeek-R1-0528 function call chat template by @Xu-Wenqing in #6765
- [PD] Optimize time out logic and add env var doc for mooncake by @ShangmingCai in #6761
- Fix aiohttp 'Chunk too big' in bench_serving by @guoyuhong in #6737
- Support sliding window in triton backend by @NorthmanPKU in #6509
- Fix shared experts fusion error by @lambert0312 in #6289
- Fix one bug in the grouped-gemm triton kernel by @ch-wan in #6772
- update llama4 chat template and pythonic parser by @upfixer in #6679
- feat(tool call): Enhance Llama32Detector for improved JSON parsing in non-stream by @CatherineSue in #6784
- Support token-level quantization for EP MoE by @ch-wan in #6782
- Temporarily lower mmlu threshold for triton sliding window backend by @NorthmanPKU in #6785
- ci: relax test_function_call_required by @CatherineSue in #6786
- Add intel_amx backend for Radix Attention for CPU by @yanbing-j in #6408
- Fix incorrect LoRA weight loading for fused gate_up_proj by @lifuhuang in #6734
- fix(PD-disaggregation): Can not get local ip by @storyicon in #6792
- [FIX] mmmu bench serving result display error (#6525) by @Arist12 in #6791
- Bump torch to 2.7.0 by @Qiaolin-Yu in #6788
- chore: bump sgl-kernel v0.1.5 by @zhyncs in #6794
- Improve profiler and integrate profiler in bench_one_batch_server by @merrymercy in #6787
- chore: upgrade sgl-kernel v0.1.5 by @zhyncs in #6795
- [Minor] Always append newline after image token when parsing chat message by @lifuhuang in #6797
- Update CI tests for Llama4 models by @ravi03071991 in #6421
- [Feat] Enable PDL automatically on Hopper architecture by @PopSoda2002 in #5981
- chore: update blackwell docker by @zhyncs in #6800
- misc: cache is_hopper_arch by @Edenzzzz in #6799
- Remove contiguous before Flashinfer groupwise fp8 gemm by @Fridge003 in #6804
- Correctly abort the failed grammar requests & Improve the handling of abort by @merrymercy in #6803
- [EP] Add cuda kernel for moe_ep_pre_reorder by @yuan-luo in #6699
- Add draft extend CUDA graph for flashinfer backend by @ispobock in #6805
- Refactor CustomOp to avoid confusing bugs by @fzyzcjy in #5382
- Tiny log prefill time by @fzyzcjy in #6780
- Tiny fix EPLB assertion about rebalancing period and recorder window size by @fzyzcjy in #6813
- Add simple utility to dump tensors for debugging by @fzyzcjy in #6815
- Fix profiles do not have consistent names by @fzyzcjy in #6811
- Speed up rebalancing when using non-static dispatch algorithms by @fzyzcjy in #6812
- [1/2] Add Kernel support for Cutlass based Fused FP4 MoE by @pavanimajety in #6093
- [Router] Fix k8s Service Discovery by @YouNeedCryDear in #6766
- Add CPU optimized kernels for topk and rope fusions by @jianan-gu in #6456
- fix new_page_count_next_decode by @pansicheng in #6671
- Fix wrong weight reference in dynamic EPLB by @fzyzcjy in #6818
- Minor add metrics to expert location updater by @fzyzcjy in #6816
- [Refactor] Rename `n_share_experts_fusion` as `num_fused_shared_experts` by @ch-wan in #6735
- [FEAT] Add transformers backend support by @SunMarc in https://github.com/sgl-project/sglang/pull/5929
- [fix] recover auto-dispatch for rmsnorm and rope by @Alcanderian in https://github.com/sgl-project/sglang/pull/6745
- fix ep_moe_reorder kernel bugs by @BBuf in https://github.com/sgl-project/sglang/pull/6858
- [Refactor] Multimodal data processing for VLM by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/6659
- Decoder-only Scoring API by @chanh in https://github.com/sgl-project/sglang/pull/6460
- feat: add dp-rank to KV events by @ishandhanani in https://github.com/sgl-project/sglang/pull/6852
- Set `num_fused_shared_experts` as `num_shared_experts` when shared_experts fusion is not disabled by @ch-wan in https://github.com/sgl-project/sglang/pull/6736
- Fix one missing arg in DeepEP by @ch-wan in https://github.com/sgl-project/sglang/pull/6878
- Support LoRA in TestOpenAIVisionServer and fix fused kv_proj loading bug. by @lifuhuang in https://github.com/sgl-project/sglang/pull/6861
- support one-shot allreduce in 1-node and 2-node setups using mscclpp by @zyksir in https://github.com/sgl-project/sglang/pull/6277
- Fix Qwen3MoE missing token padding optimization by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6820
- Tiny update error hints by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6846
- Support layerwise rebalancing experts by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6851
- Tiny allow profiler API to auto create directory by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6865
- Support Blackwell DeepEP docker images by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6868
- [EP] Add cuda kernel for moe_ep_post_reorder by @yuan-luo in https://github.com/sgl-project/sglang/pull/6837
- Fix OpenAI Client error with single request via batch api by @ravi03071991 in https://github.com/sgl-project/sglang/pull/6170
- [PD] Fix potential perf spike caused by tracker gc and optimize doc by @ShangmingCai in https://github.com/sgl-project/sglang/pull/6764
- Use deepgemm instead of triton for fused_qkv_a_proj_with_mqa by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6890
- [CUTLASS-FP4-MOE] Introduce CutlassMoEParams class for easy initialization of Cutlass Grouped Gemms Metadata by @pavanimajety in https://github.com/sgl-project/sglang/pull/6887
- bugfix(OAI): Fix image_data processing for jinja chat templates by @CatherineSue in https://github.com/sgl-project/sglang/pull/6877
- [CPU] enable CI for PRs, add Dockerfile and auto build task by @ZailiWang in https://github.com/sgl-project/sglang/pull/6458
- AITER backend extension and workload optimizations by @HaiShaw in https://github.com/sgl-project/sglang/pull/6838
- [Feature] Support Flashinfer fmha on Blackwell by @NorthmanPKU in https://github.com/sgl-project/sglang/pull/6930
- Fix a bug in abort & Improve docstrings for abort by @merrymercy in https://github.com/sgl-project/sglang/pull/6931
- Tiny support customize DeepEP max dispatch tokens per rank by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6934
- Sync the changes on cuda graph runners by @merrymercy in https://github.com/sgl-project/sglang/pull/6932
- [PD] Optimize transfer queue forward logic for dummy rank by @ShangmingCai in https://github.com/sgl-project/sglang/pull/6922
- [Refactor] image data process in bench_serving by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/6879
- [fix] logical_to_all_physical_map index 256 is out of bounds in EP parallel. by @MiterV1 in https://github.com/sgl-project/sglang/pull/6767
- Add triton fused moe kernel config for E=257 on B200 by @Fridge003 in https://github.com/sgl-project/sglang/pull/6939
- [sgl-kernel] update deepgemm by @Alcanderian in https://github.com/sgl-project/sglang/pull/6942
- chore: bump sgl-kernel v0.1.6 by @zhyncs in https://github.com/sgl-project/sglang/pull/6943
- Minor compile fused topk by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6944
- [Bugfix] pipeline parallelism and Eagle Qwen2 by @Swipe4057 in https://github.com/sgl-project/sglang/pull/6910
- Tiny re-introduce profile id logging by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6912
- Add triton version as a fused_moe_triton config search key to avoid performance decrease across different Triton versions by @BBuf in https://github.com/sgl-project/sglang/pull/5955
- reduce torch.zeros overhead in moe align block size kernel by @BBuf in https://github.com/sgl-project/sglang/pull/6369
- chore: upgrade sgl-kernel v0.1.6 by @Alcanderian in https://github.com/sgl-project/sglang/pull/6945
- add fbgemm moe grouped gemm kernel benchmark by @BBuf in https://github.com/sgl-project/sglang/pull/6924
- [Docker] Add docker file for SGL Router by @YouNeedCryDear in https://github.com/sgl-project/sglang/pull/6915
- Disabling mixed chunked prefill when eagle is enabled by @Swipe4057 in https://github.com/sgl-project/sglang/pull/6874
- Add canary for EPLB rebalancing by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6895
- Refactor global_server_args_dict by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6866
- Fuse routed scaling factor in topk_reduce kernel by @BBuf in https://github.com/sgl-project/sglang/pull/6220
- Update server timeout time in AMD CI. by @saienduri in https://github.com/sgl-project/sglang/pull/6953
- [misc] add is_cpu() by @Alcanderian in https://github.com/sgl-project/sglang/pull/6950
- Add H20 fused MoE kernel tuning configs for DeepSeek-R1/V3 by @Xu-Wenqing in https://github.com/sgl-project/sglang/pull/6885
- Add a CUDA kernel for fusing mapping and weighted sum for MoE. by @elfiegg in https://github.com/sgl-project/sglang/pull/6916
- chore: bump sgl-kernel v0.1.6.post1 by @zhyncs in https://github.com/sgl-project/sglang/pull/6955
- chore: upgrade sgl-kernel v0.1.6.post1 by @zhyncs in https://github.com/sgl-project/sglang/pull/6957
- [DeepseekR1-FP4] Add Support for nvidia/DeepSeekR1-FP4 model by @pavanimajety in https://github.com/sgl-project/sglang/pull/6853
- Revert "Fuse routed scaling factor in topk_reduce kernel (#6220)" by @zhyncs in https://github.com/sgl-project/sglang/pull/6968
- [AMD] Add more tests to per-commit-amd by @hubertlu-tw in https://github.com/sgl-project/sglang/pull/6926
- chore: bump sgl-kernel v0.1.7 by @zhyncs in https://github.com/sgl-project/sglang/pull/6963
- Slightly improve the sampler to skip unnecessary steps by @merrymercy in https://github.com/sgl-project/sglang/pull/6956
- rebase h20 fused_moe config by @BBuf in https://github.com/sgl-project/sglang/pull/6966
- Fix CI and triton moe Configs by @merrymercy in https://github.com/sgl-project/sglang/pull/6974
- Remove unnecessary kernels of num_token_non_padded by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6965
- Extend cuda graph capture bs for B200 by @Fridge003 in https://github.com/sgl-project/sglang/pull/6937
- Fuse routed scaling factor in deepseek by @BBuf in https://github.com/sgl-project/sglang/pull/6970
- Sync cuda graph runners by @merrymercy in https://github.com/sgl-project/sglang/pull/6976
- Fix draft extend ut stability with flush cache by @ispobock in https://github.com/sgl-project/sglang/pull/6979
- Fix triton sliding window test case by @merrymercy in https://github.com/sgl-project/sglang/pull/6981
- Fix expert distribution dumping causes OOM by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6967
- Minor remove one kernel for DeepSeek by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6977
- [perf][sgl-kernel] extend cutlass_mla_decode to support num_head < 128 by @Alcanderian in https://github.com/sgl-project/sglang/pull/6929
- Enable more unit tests for AMD CI. by @saienduri in https://github.com/sgl-project/sglang/pull/6983
- Use torch.compile to fuse flash attention decode metadata preparation by @merrymercy in https://github.com/sgl-project/sglang/pull/6973
- Speed up set_lora_info by eliminating unnecessary H2D transfers by @lifuhuang in https://github.com/sgl-project/sglang/pull/6960
- support qwen3 embedding by @Titan-p in https://github.com/sgl-project/sglang/pull/6990
- Fix torch profiler bugs for bench_offline_throughput.py by @PanJason in https://github.com/sgl-project/sglang/pull/6557
- chore: upgrade flashinfer v0.2.6.post1 jit by @zhyncs in https://github.com/sgl-project/sglang/pull/6958
- cleanup tmp dir by @zhyncs in https://github.com/sgl-project/sglang/pull/7007
- chore: update pr test xeon by @zhyncs in https://github.com/sgl-project/sglang/pull/7008
- Fix cutlass MLA gets almost zero accuracy by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6998
- Update amd nightly models CI. by @saienduri in https://github.com/sgl-project/sglang/pull/6992
- feat: add direct routing strategy to DP worker by @ishandhanani in https://github.com/sgl-project/sglang/pull/6884
- Fallback to lower triton version for unfound fused moe configs by @Fridge003 in https://github.com/sgl-project/sglang/pull/7013
- Fix torchvision version for Blackwell by @Edenzzzz in https://github.com/sgl-project/sglang/pull/7015
- Simplify prepare_extend_after_decode by @merrymercy in https://github.com/sgl-project/sglang/pull/6987
- Migrate to assertEqual by @emmanuel-ferdman in https://github.com/sgl-project/sglang/pull/6741
- Fix torch version in blackwell dockerfile by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/7017
- chore: update pr test xeon by @zhyncs in https://github.com/sgl-project/sglang/pull/7018
- Update default settings for blackwell by @Fridge003 in https://github.com/sgl-project/sglang/pull/7023
- Support both approximate and exact expert distribution collection by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6964
- Add decode req pool by @ByronHsu in https://github.com/sgl-project/sglang/pull/6980
- [CI] Add CI workflow for sgl-router docker build by @YouNeedCryDear in https://github.com/sgl-project/sglang/pull/7027
- Fix fused_moe triton configs by @yudian0504 in https://github.com/sgl-project/sglang/pull/7029
- CPU: map changes from developing branch in sgl-kernel by @yanbing-j in https://github.com/sgl-project/sglang/pull/6833
- chore: bump v0.4.7 by @zhyncs in https://github.com/sgl-project/sglang/pull/7038
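As noted in the `max_completion_tokens` (#5857) and tool-calling (#6550) entries above, several of this release's additions surface as request-level options on the OpenAI-compatible endpoint. A minimal sketch combining the two follows; it assumes a running SGLang server with a tool-calling-capable model, and the tool definition and model name are hypothetical, so adapt them to your deployment.

```python
# Minimal sketch combining two changelog items: max_completion_tokens
# (#5857) and tool_choice="required" (#6550). The get_weather tool is a
# hypothetical example; the served model name is an assumption.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="required",      # force the model to emit a tool call
    max_completion_tokens=128,   # OpenAI-style cap on generated tokens
)
print(response.choices[0].message.tool_calls)
```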
New Contributors
Full Changelog: v0.4.6...v0.4.7