After several prerequisite features were merged, we can run SGLang on Ascend servers in eager mode. But to get better performance, we now need to implement ACLGraph/NPUGraph support.
Goals
Goal 1: Define an NPUGraphRunner class for SGLang that provides the basic functionality and supports Llama and Qwen models.
Goal 2: Adapt to TP/DP, GraphTree, and dynamic-shape scenarios, including memory reuse.
Goal 3: Improve performance based on torch.compile.
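For Goal 1, the rough shape of the class might look like the sketch below, modeled on SGLang's existing CUDAGraphRunner: one captured graph per padded batch size, with static buffers that real inputs are copied into before replay. Everything here is illustrative; helper names like `make_dummy_inputs` and the `torch_npu.npu.graph` capture context are assumptions, not the final design.

```python
import torch
import torch_npu


class NPUGraphRunner:
    """Sketch of Goal 1: capture the model forward into NPU graphs,
    one graph per pre-selected batch size, and replay them at decode time."""

    def __init__(self, model_runner, capture_bs):
        self.model_runner = model_runner
        self.capture_bs = capture_bs   # batch sizes to pre-capture, e.g. [1, 2, 4, 8]
        self.graphs = {}               # bs -> torch_npu.npu.NPUGraph
        self.static_inputs = {}        # bs -> pre-allocated input buffers
        self.static_outputs = {}       # bs -> output buffers written by replay

    def can_run(self, batch_size: int) -> bool:
        # Fall back to eager mode when no graph was captured for this size.
        return batch_size in self.graphs

    def capture(self):
        for bs in self.capture_bs:
            inputs = self.model_runner.make_dummy_inputs(bs)  # assumed helper
            graph = torch_npu.npu.NPUGraph()
            with torch_npu.npu.graph(graph):  # assumed analogue of torch.cuda.graph
                outputs = self.model_runner.forward(**inputs)
            self.graphs[bs] = graph
            self.static_inputs[bs] = inputs
            self.static_outputs[bs] = outputs

    def replay(self, bs: int, real_inputs: dict):
        # Copy live data into the captured buffers, then rerun the graph.
        for name, buf in self.static_inputs[bs].items():
            buf.copy_(real_inputs[name])
        self.graphs[bs].replay()
        return self.static_outputs[bs]
```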
Key messages
We have torch_npu.npu.NPUGraph, which has interfaces and functionality similar to torch.cuda.CUDAGraph (sketched below).
Concerning the RTS level, we can refer to this document.
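Given that parity, basic capture and replay should follow the familiar torch.cuda.CUDAGraph pattern. A minimal sketch, assuming torch_npu.npu.graph mirrors the torch.cuda.graph capture context:

```python
import torch
import torch_npu

model = torch.nn.Linear(1024, 1024).npu().eval()
static_input = torch.randn(8, 1024, device="npu")

# Warm up once so one-time initialization is not captured into the graph.
with torch.no_grad():
    model(static_input)

graph = torch_npu.npu.NPUGraph()
with torch_npu.npu.graph(graph):  # assumed analogue of torch.cuda.graph(...)
    static_output = model(static_input)

# Replay with new data: refill the captured input buffer in place,
# then relaunch the whole captured kernel sequence at once.
static_input.copy_(torch.randn(8, 1024, device="npu"))
graph.replay()
```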
Phase 1: Basic support
Implement NPUGraphRunner by referring to CUDAGraphRunner, but we need to handle a special case:
Because we use the torch_npu.npu_fused_infer_attention_score API, which has a host_list input, we have to update its value on every replay using torch_npu.npu.NPUGraph.update (sketched below). For more details, please refer to task update.
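A minimal sketch of that update-then-replay step. Only the method name NPUGraph.update comes from the docs; the payload shape (cpu_update_input carrying the actual sequence lengths) is an assumption about how the host_list values get refreshed:

```python
def replay_with_host_update(graph, seq_lens_host):
    # host_list inputs such as the actual sequence lengths live on the host,
    # so an in-place device-buffer copy cannot refresh them; the captured
    # graph itself must be updated before each replay.
    graph.update(cpu_update_input=[{"actual_seq_lengths": seq_lens_host}])  # assumed kwargs
    graph.replay()
```

In the runner, this would run once per decode step with the current batch's sequence lengths, in place of a plain graph.replay().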
Phase 2: