AWQ is a 4-bit weight quantization method that offers a strong trade-off between efficiency and accuracy. Enabling AWQ quantization on the NPU backend in SGLang would allow 8-card NPU servers to run the DeepSeek 671B model. This feature follows the NPU roadmap (#8004).
Proposal
Add AWQ quantization format support for the Ascend NPU backend.
Use MM (matrix multiplication) kernels for MLP layers and GMM (grouped matrix multiplication) kernels for MoE layers to fully utilize NPU performance.
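To make the proposal concrete, here is a minimal pure-Python reference sketch of the math involved. It is not the SGLang or Ascend implementation. It assumes the common AWQ layout of unsigned 4-bit weights with one scale and one zero point per group of input channels, and it skips the int32 bit-packing that real checkpoints use. The function names `dequantize_awq`, `matmul`, and `grouped_matmul` are hypothetical and exist only for illustration.

```python
from typing import List, Optional

Matrix = List[List[float]]


def dequantize_awq(qweight: List[List[int]], scales: Matrix,
                   zeros: Matrix, group_size: int) -> Matrix:
    """Recover floating-point weights from unpacked 4-bit values.

    qweight is [K][N] with entries in 0..15; scales and zeros are
    [K // group_size][N]. Each group of `group_size` input channels
    shares one scale and one zero point per output channel.
    """
    return [
        [(q - zeros[k // group_size][n]) * scales[k // group_size][n]
         for n, q in enumerate(row)]
        for k, row in enumerate(qweight)
    ]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain dense matmul: the role an MM kernel plays for MLP layers."""
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def grouped_matmul(x: Matrix, expert_weights: List[Matrix],
                   expert_ids: List[int]) -> List[Optional[List[float]]]:
    """Multiply each token by the weight of the expert it is routed to.

    This is the role a GMM kernel plays for MoE layers: tokens are
    grouped by expert, and each group gets its own matmul.
    """
    out: List[Optional[List[float]]] = [None] * len(x)
    for e, w in enumerate(expert_weights):
        rows = [i for i, eid in enumerate(expert_ids) if eid == e]
        if not rows:
            continue  # no token was routed to this expert
        for i, y in zip(rows, matmul([x[i] for i in rows], w)):
            out[i] = y
    return out
```

A fused NPU kernel would dequantize on the fly inside the matmul rather than materializing the full-precision weight as this sketch does. The reference version is mainly useful for checking a kernel's numerical output.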