The SGLang team is excited to announce the release of v0.4.5! This version introduces several significant features, including Llama 4 support, a FlashAttention 3 backend, EAGLE3 speculative decoding, DeepEP integration, and disaggregated prefill and decoding.
## New Features

- **Llama 4 Support**: We added support for the Llama 4 models, with accuracy matching the official benchmark numbers: a zero-shot score of 75.2 on the MMLU Pro dataset for the Llama-4-Scout-17B-16E-Instruct model and 80.7 for the Llama-4-Maverick-17B-128E-Instruct model (a launch sketch follows this list). #5092
- **FlashAttention 3 Backend**: Our implementation of the FlashAttention 3 backend delivers significant acceleration for long-context tasks (see the second sketch below). #4709
- **EAGLE3 Speculative Decoding**: We're proud to be the first to support EAGLE3 speculative decoding, offering substantial gains in decoding throughput (see the third sketch below). Learn more in our documentation and the EAGLE3 paper. #4247
- **DeepEP Integration**: By incorporating DeepEP, we enhanced performance for MoE inference.
- **Disaggregated Prefill and Decoding**: We introduced a prototype for disaggregated prefill and decoding, with further optimizations planned.
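For reference, here is a minimal sketch of running a Llama 4 checkpoint with SGLang's offline engine. The model path, `tp_size`, and sampling settings are illustrative assumptions, not part of this release's changes; adjust them to your hardware.

```python
# Minimal sketch: running Llama 4 with SGLang's offline Engine.
# The checkpoint id and tp_size are assumptions; Scout (16 experts) is a
# large model, so set tp_size to match the GPUs you actually have.
import sglang as sgl

if __name__ == "__main__":
    llm = sgl.Engine(
        model_path="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed HF repo id
        tp_size=8,  # tensor parallelism across 8 GPUs (assumption)
    )
    prompts = ["The capital of France is"]
    outputs = llm.generate(prompts, {"temperature": 0.0, "max_new_tokens": 32})
    for prompt, out in zip(prompts, outputs):
        print(prompt, "->", out["text"])
    llm.shutdown()
```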
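The new backend is selected with the `--attention-backend fa3` server flag (see #4680 in the changelog below). As a sketch, the equivalent offline-engine call would look like the following, assuming the flag maps to an `attention_backend` keyword argument:

```python
# Sketch: enabling the FlashAttention 3 backend. The --attention-backend fa3
# flag from the changelog is assumed to map to the attention_backend keyword.
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct",  # assumed example model
    attention_backend="fa3",  # FlashAttention 3 targets Hopper GPUs (sm90)
)
print(llm.generate("Hello", {"max_new_tokens": 8})["text"])
llm.shutdown()
```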
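And a sketch of turning on EAGLE3 speculative decoding. The draft-model path is a placeholder, and the speculative-decoding keywords mirror SGLang's existing EAGLE flags, which we assume carry over to EAGLE3; the step/top-k/draft-token values are illustrative, not tuned settings.

```python
# Sketch: EAGLE3 speculative decoding. The draft model path is a placeholder,
# and the numeric settings below are illustrative assumptions.
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct",   # assumed target model
    speculative_algorithm="EAGLE3",
    speculative_draft_model_path="<eagle3-draft-model>",  # placeholder
    speculative_num_steps=5,
    speculative_eagle_topk=8,
    speculative_num_draft_tokens=32,
)
print(llm.generate("Speculative decoding works by", {"max_new_tokens": 32})["text"])
llm.shutdown()
```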
Thanks very much to the NVIDIA team, LinkedIn team, EAGLE team, Oracle team, Meituan team, and our incredible open-source community for their invaluable contributions!
## Coming Soon

- Disaggregated Prefill and Decoding: #4655
- Llama 4 Optimization: #5118
- EP Enhancement: #4734
- FA3 Enhancement: #4709
We’re thrilled about these advancements and eager to hear your feedback! Join us on our Slack channel at slack.sglang.ai to connect and share your thoughts. Cheers!
## What's Changed

- Use `torch.cat` instead of `torch.concat` to prevent entering the `Autograd` backends by @Alcanderian in #4466
- Use `torch.inference_mode()` instead of `torch.no_grad()` by @Alcanderian in #4372
- `n` in OpenAI API completions by @ChuyueSun in #3446
- `SGLANG_LOGGING_CONFIG_PATH` by @guoyuhong in #4592
- `--attention-backend fa3` by @hebiao064 in #4680
- Fix typo `reply` to `replay` in `base_attn_backend.py` by @Thysrael in #4784
- `get_attention_sliding_window_size` for attn init by @vhain in #4823
- `gguf` and `torchvision` by @vhain in #4826
- `neuralmagic/gemma-2-2b-it-FP8` by @merrymercy in #4830
- `self.worker` assignment in `TpModelWorker` and refactor references by @JustinTong0323 in #4788
- `mem_fraction_static` for gemma3 vision test by @vhain in #4840
- `import vllm` in `quantization/__init__.py` by @merrymercy in #4834
- `cuda_device_count_stateless` by @Alcanderian in #5060

**Full Changelog**: v0.4.4...v0.4.5