Currently, PyTorch is the only framework supported by sglang. We propose supporting additional frameworks alongside PyTorch, for the following reasons:
Optimizations in specific fields. For example, TF/JAX and MindSpore both invest heavily in static-graph optimizations, and JAX works well with XLA on TPU. In the case of Ascend, sglang currently has some support for Ascend NPU through torch_npu, but MindSpore can bring performance benefits such as graph/op fusion and integrated compilation, as well as developer-friendly Triton op support.
Each framework has its own user base and ecosystem. Supporting multiple frameworks makes sglang more accessible to these users than torch-exclusive LLM serving frameworks.
The core philosophies are:
Co-existence of torch and the new framework: PyTorch is deeply integrated into the project structure, so replacing it entirely would be rather expensive. Instead, the new framework focuses on model execution and the attention backend, and tensors are converted across frameworks through dlpack (see the sketch after this list).
Maximal reuse: the frontend language, scheduler, tokenizer/detokenizer, sampler, etc. are reused without change.
Avoid intrusive changes: the current code structure is kept intact, and code relevant to the new framework is clustered in a separate directory.
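As an illustration, here is a minimal sketch of zero-copy tensor exchange through dlpack, using JAX as a stand-in for "framework A"; the helper names are ours, not sglang APIs:

import jax.dlpack
import torch
from torch.utils.dlpack import from_dlpack, to_dlpack

def torch_to_framework_a(t: torch.Tensor):
    # torch tensors expose the dlpack protocol, so the importing framework
    # can wrap the same buffer without copying it.
    return jax.dlpack.from_dlpack(to_dlpack(t))

def framework_a_to_torch(x) -> torch.Tensor:
    return from_dlpack(jax.dlpack.to_dlpack(x))

The same pattern applies to any framework that implements the dlpack protocol, which is what lets the torch-side scheduler and sampler keep operating on the new framework's outputs.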
In a single forward call, the original call stack is:
Engine -> Scheduler -> TpModelWorker -> Model Runner -> Model/Model Loader -> Layer/Ops/Attn
The new framework takes over from the model runner onward. After the model returns the hidden states, they are converted back to torch tensors and handed to the reused sampling path.
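For illustration, the takeover could look roughly like the following sketch (FrameworkAModelRunner and the conversion helpers are hypothetical placeholders, not existing sglang code):

import torch

class FrameworkAModelRunner:
    """Hypothetical runner that replaces the torch model-execution path."""

    def __init__(self, model):
        # A model implemented in framework A; its attention backend and
        # KV cache live entirely in framework A.
        self.model = model

    def forward(self, forward_batch) -> torch.Tensor:
        # Convert the inputs prepared by the torch-side scheduler.
        input_ids = torch_to_framework_a(forward_batch.input_ids)
        positions = torch_to_framework_a(forward_batch.positions)

        # Run the framework-A model.
        hidden_states = self.model(input_ids, positions, forward_batch)

        # Hand torch hidden states back so the reused logits processor and
        # sampler continue to work unchanged.
        return framework_a_to_torch(hidden_states)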
Code Structure
Suppose we want to support framework A; we would add the following directories. There are two plans, and we would like to know which one the community prefers:
Plan 1:
/python
  /sglang
    /srt
      ...
      /distributed_A
      /layers_A
      /models_A
      /model_loader_A

Plan 2:
/python
  /sglang
    /srt
      ...
    /srt_A
      /distributed
      /layers
      /models
      /model_loader
      ...
Scripts will also be added to /examples and /tests
Proof of Concept
We made an early attempt at running a MindSpore model on an Ascend NPU. Currently it uses torch on CPU and a MindSpore-native attention implementation. We can now successfully run a basic inference sample on a single device (Atlas 800I A2).
Current Modifications
HEAD detached at v0.4.9
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: python/sglang/srt/configs/load_config.py
modified: python/sglang/srt/layers/quantization/compressed_tensors/compressed_tensors_moe.py
modified: python/sglang/srt/layers/quantization/utils.py
modified: python/sglang/srt/model_executor/model_runner.py
modified: python/sglang/srt/model_loader/loader.py
modified: python/sglang/srt/server_args.py
Untracked files:
(use "git add <file>..." to include in what will be committed)
ms_example/
python/sglang/srt/models/ms_model/
These changes include:
ms_model: the MindSpore model implementation, plus a corresponding wrapper that processes ForwardBatch as input.
server_args, load_config, loader: add a new load-format option for MindSpore and a corresponding MindSpore model loader (see the sketch after this list).
quantization: resolve some import conflicts with vllm and sgl_kernel.
model_runner: some changes for distributed init, not related to single-card inference.
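As a rough, hedged sketch (assuming load_config.py keeps a LoadFormat enum and loader.py dispatches on it; MindSporeModelLoader is a placeholder name for the POC code), the load-format change looks conceptually like this:

import enum

class LoadFormat(str, enum.Enum):
    AUTO = "auto"            # existing formats elided
    MINDSPORE = "mindspore"  # new option added by this POC

class MindSporeModelLoader:
    """Placeholder: builds the ms_model implementation and loads the
    Hugging Face checkpoint into MindSpore parameters."""

    def __init__(self, load_config):
        self.load_config = load_config

def get_model_loader(load_config):
    # Dispatch to the MindSpore loader when load_format="mindspore";
    # otherwise keep using the existing torch-based loaders.
    if load_config.load_format == LoadFormat.MINDSPORE:
        return MindSporeModelLoader(load_config)
    raise NotImplementedError("existing torch loaders elided in this sketch")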
In the current POC implementation, we take the KV cache from the PyTorch ForwardBatch on the first run, then cache and maintain it on the device with MindSpore for subsequent steps. For each step we convert the input tensors from torch on CPU to MindSpore on NPU. In the future we will eliminate this conversion through dlpack.
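Today this per-step conversion is a host round-trip along these lines (a simplified sketch; the helper name is ours):

import mindspore as ms
import torch

# Select the Ascend NPU as the execution target for MindSpore ops.
ms.set_context(device_target="Ascend")

def torch_cpu_to_mindspore(t: torch.Tensor) -> ms.Tensor:
    # Current POC path: copy through host memory via numpy; MindSpore then
    # places the data on the NPU when the tensor is consumed by an op.
    return ms.Tensor(t.detach().cpu().numpy())

Replacing this host round-trip with a dlpack-based hand-off, as noted above, removes the extra copies.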
Experiment
We run a Qwen2.5-7B model with the following script:
import sglang as sgl

def main():
    llm = sgl.Engine(
        model_path="/home/ckpt/qwen2.5-7b-hf",
        device="cpu",
        load_format="mindspore",
        max_total_tokens=20000,
        disable_overlap_schedule=True,
        tp_size=1,
        dp_size=1,
    )
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = {"temperature": 0.01, "top_p": 0.9}
    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print("===============================")
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

if __name__ == "__main__":
    main()
Screenshot of the result:
As for performance, MindSpore compiles the graph twice (once for prefill and once for decode), which takes time. For prefill we use the FlashAttention kernel for Ascend; for decode we use a MindSpore implementation similar to the TorchNativeAttnBackend.
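Conceptually, that decode path is a naive single-token attention over the cached keys and values, roughly like the sketch below (simplified with MindSpore functional ops; not the actual POC code):

import math

import mindspore as ms
from mindspore import ops

def naive_decode_attention(q: ms.Tensor, k_cache: ms.Tensor, v_cache: ms.Tensor) -> ms.Tensor:
    # q:       (num_heads, 1, head_dim)        query of the newly decoded token
    # k_cache: (num_heads, seq_len, head_dim)  cached keys kept on the device
    # v_cache: (num_heads, seq_len, head_dim)  cached values kept on the device
    head_dim = q.shape[-1]
    scores = ops.matmul(q, ops.transpose(k_cache, (0, 2, 1))) / math.sqrt(head_dim)
    probs = ops.softmax(scores, axis=-1)
    return ops.matmul(probs, v_cache)  # (num_heads, 1, head_dim)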
Here is the time (in seconds) of each forward pass, including the torch-to-MindSpore conversion; the two long entries correspond to the one-time graph compilation mentioned above.
ms run time: 17.34398579597473
ms run time: 0.014678716659545898
ms run time: 15.733514785766602
ms run time: 0.057671546936035156
ms run time: 0.054368019104003906
ms run time: 0.05373978614807129
ms run time: 0.05384349822998047
ms run time: 0.054900407791137695
ms run time: 0.055121660232543945
ms run time: 0.052907705307006836
ms run time: 0.05253124237060547
Roadmap
We hope this work can be included in sglang's H2 roadmap. Our current tentative schedule is:
Q3:
Q4: Full-feature support including all parallelisms (DP/TP/EP/PP), PD disaggregation, quantization, speculative decoding, LoRA, etc.