Showing content from https://github.com/sgl-project/sglang/issues/8004 below:

SGLang NPU support on 2025 H2

During 2025 H1, we contributed initial NPU support (#3853, #7022), which makes it possible for users to run SGLang on NPU hardware.

Our goal for 2025 H2 is to provide a seamless experience running SGLang on NPUs. Here is a rough development roadmap:

CI on NPU hardware

User / Developer experience

User experience is also a priority: containers and documentation will be provided soon.

Model support

We will start by supporting the most popular models.

Performance Enhancement

Attention Backend

[July] Ascend Attention Backend implementation w/ PA & MLA fused kernels (Ascend attention backend(PA&MLA), #7722)

Parallelism

[August] Support DeepEP expert parallelism ([Feature] Optimize DeepSeek's DeepEP on Ascend NPU, #8355)
[August] Optimization on DeepEPMoE implementation with fused kernels

Quantization

[July] Support for Ascend-specific W8A8 quant method ([feature]Ascend quantization support, #7791)
[September] Support for AWQ quant method ([Feature] Support AWQ quantization on NPU, #9104), thanks @ErvinXie
[September] Support for GPTQ quant method

Cache

[July] A new transfer-engine implementation that supports device-to-device transfer on NPUs ([feature] kv transfer support of ascend npu, #7795)
[November] A new cache pooling system that supports HBM & DRAM mixed pooling, coherent memory access, and direct copy from remote L3 cache to L1 cache on NPUs
[October] An optimized bucketing router policy for extremely uneven prompt lengths

Support Graph Mode

EPLB

Speculative Decoding

Community
