[RFC] Multi-framework support for sglang · Issue #7941 · sgl-project/sglang · GitHub

Motivation

Currently, PyTorch is the only framework supported by sglang. We would like to propose supporting additional frameworks alongside PyTorch.

The core philosophies are:

  1. Co-existence of torch and the new framework: Since PyTorch is deeply integrated into the project structure, replacing it outright would be expensive. Instead, the new framework focuses on model execution and the attention backend, and tensors are converted across frameworks via DLPack (a minimal sketch follows this list).

  2. Maximal reuse: The frontend language, scheduler, tokenizer/detokenizer, sampler, etc. are reused without change.

  3. Avoid intrusive changes: The current code structure is kept intact; code relevant to the new framework is clustered in a separate directory.
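
As a minimal illustration of the DLPack exchange mentioned in point 1 (torch is used on both ends here purely to keep the snippet self-contained; the consuming framework would use its own from_dlpack):

import torch
from torch.utils.dlpack import to_dlpack, from_dlpack

# Producer side: export a tensor as a DLPack capsule (no data copy).
src = torch.randn(4, 8)
capsule = to_dlpack(src)

# Consumer side: any framework implementing the DLPack protocol can ingest
# the capsule; we re-import it with torch only to show the round trip.
dst = from_dlpack(capsule)
assert dst.data_ptr() == src.data_ptr()  # same underlying buffer, zero copy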

Design

Model Execution

In a single forward call, the original call stack is:

Engine -> Scheduler -> TpModelWorker -> Model Runner -> Model/Model Loader -> Layer/Ops/Attn

The new framework takes over from the model runner onward. After the model returns the hidden states, they are converted back to torch tensors and handed to the existing sampler, so everything upstream of the model runner stays unchanged.
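
A rough sketch of this hand-off is shown below. The class and helper names are ours for illustration, not actual sglang symbols, and numpy stands in for "framework A" so the snippet runs as-is:

import numpy as np
import torch

class FrameworkAModelRunner:
    """Hypothetical replacement for the torch model runner from this point on."""

    def __init__(self, model_fn):
        self.model_fn = model_fn  # framework-A model forward

    def forward(self, input_ids: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
        # torch -> framework A (a copy here; DLPack would make this zero-copy)
        a_ids = input_ids.cpu().numpy()
        a_pos = positions.cpu().numpy()

        # Framework A owns the model forward, attention backend and KV cache.
        hidden = self.model_fn(a_ids, a_pos)

        # Back to torch so the unchanged sampler consumes the hidden states.
        return torch.from_numpy(np.ascontiguousarray(hidden))

# Toy check: a fake "model" that just lifts token ids into a feature dimension.
runner = FrameworkAModelRunner(lambda ids, pos: ids[:, :, None].astype(np.float32))
print(runner.forward(torch.tensor([[1, 2, 3]]), torch.tensor([[0, 1, 2]])).shape)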

Code Structure

Suppose we want to support framework A; we would add the following directories. There are two plans, and we would like to know which one the community prefers:

Plan 1:

/python
  /sglang
    /srt
      ...
      /distributed_A
      /layers_A
      /models_A
      /model_loader_A

Plan 2:

/python
  /sglang
    /srt
    ...
    /srt_A
      /distributed
      /layers
      /models
      /model_loader
      ...

Scripts will also be added to /examples and /tests.

Proof of Concept

We made some early attempts at running a MindSpore model on an Ascend NPU. Currently it uses torch on CPU and a MindSpore-native attention implementation. We can now successfully run a basic inference sample on a single device (Atlas 800I A2).

Current Modifications
HEAD detached at v0.4.9
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   python/sglang/srt/configs/load_config.py
        modified:   python/sglang/srt/layers/quantization/compressed_tensors/compressed_tensors_moe.py
        modified:   python/sglang/srt/layers/quantization/utils.py
        modified:   python/sglang/srt/model_executor/model_runner.py
        modified:   python/sglang/srt/model_loader/loader.py
        modified:   python/sglang/srt/server_args.py

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        ms_example/
        python/sglang/srt/models/ms_model/

That includes:

ms_model: the MindSpore model implementation, plus a wrapper that processes ForwardBatch as input.
server_args, load_config, loader: add a new load option for MindSpore and a corresponding MindSpore model loader (see the sketch below).
quantization: resolve some import conflicts with vllm and sgl_kernel.
model_runner: some changes for distributed init, not related to single-card inference.
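
Roughly, the load-format plumbing looks like the sketch below; the names mirror the loader structure sglang already has, but the exact signatures are illustrative rather than copied from the tree:

import enum

class LoadFormat(str, enum.Enum):
    AUTO = "auto"
    PT = "pt"
    SAFETENSORS = "safetensors"
    MINDSPORE = "mindspore"  # new value exposed through --load-format

class MindSporeModelLoader:
    """Illustrative loader: builds the ms_model and maps the HF checkpoint weights."""

    def load_model(self, model_config):
        raise NotImplementedError("construct the MindSpore model and load weights")

def get_model_loader(load_format: LoadFormat):
    # Dispatch on the requested format; other formats keep their existing paths.
    if load_format == LoadFormat.MINDSPORE:
        return MindSporeModelLoader()
    raise NotImplementedError("other formats handled by the existing loaders")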

PyTorch-MindSpore conversion:

In the current POC implementation, we obtain the KV cache from PyTorch's ForwardBatch on the first run, then cache and maintain it on the device with MindSpore for subsequent steps. For each step we convert the input tensors from torch on CPU to MindSpore on NPU. In the future we will eliminate this conversion through DLPack.
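
The per-step copy in the POC amounts to something like the helper below (a sketch; MindSpore places the tensor on the NPU selected via its device context, and a DLPack path would later remove the host round trip):

import torch
import mindspore as ms

def torch_cpu_to_mindspore(t: torch.Tensor) -> ms.Tensor:
    # POC-style copy path: torch (CPU) -> numpy -> MindSpore tensor.
    return ms.Tensor(t.detach().cpu().numpy())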

Experiment

We run a Qwen2.5 7B model with the following script:

import sglang as sgl

def main():
    llm = sgl.Engine(model_path="/home/ckpt/qwen2.5-7b-hf",
                     device="cpu",
                     load_format="mindspore",
                     max_total_tokens=20000,
                     disable_overlap_schedule=True,
                     tp_size=1,
                     dp_size=1)

    prompts = [
         "Hello, my name is",
         "The president of the United States is",
         "The capital of France is",
         "The future of AI is",
    ]

    sampling_params = {"temperature": 0.01, "top_p": 0.9}

    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print("===============================")
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")


if __name__ == "__main__":
    main()

Screenshot of the result:

Regarding performance, MindSpore compiles the graph twice (once for prefill and once for decode), which takes time. For prefill we use the FlashAttention kernel for Ascend. For decode we use a MindSpore implementation similar to TorchNativeAttnBackend.
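
Conceptually the decode path is the naive single-token attention below (illustrative torch code in the spirit of TorchNativeAttnBackend, not the actual MindSpore kernel):

import math
import torch

def naive_decode_attention(q, k_cache, v_cache):
    # q:       [num_heads, head_dim]           (the single new decode token)
    # k_cache: [seq_len, num_heads, head_dim]  (cached keys for this request)
    # v_cache: [seq_len, num_heads, head_dim]  (cached values for this request)
    scale = 1.0 / math.sqrt(q.shape[-1])
    scores = torch.einsum("hd,shd->hs", q, k_cache) * scale  # [num_heads, seq_len]
    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("hs,shd->hd", probs, v_cache)         # [num_heads, head_dim]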

Here is the time (in seconds) for each forward pass, including the torch-to-MindSpore conversion:

ms run time:  17.34398579597473
ms run time:  0.014678716659545898
ms run time:  15.733514785766602
ms run time:  0.057671546936035156
ms run time:  0.054368019104003906
ms run time:  0.05373978614807129
ms run time:  0.05384349822998047
ms run time:  0.054900407791137695
ms run time:  0.055121660232543945
ms run time:  0.052907705307006836
ms run time:  0.05253124237060547
Roadmap

We hope this work will be included in sglang's H2 roadmap. Our current tentative schedule is:

Q3:

Q4: Full-feature support including all parallelisms (DP/TP/EP/PP), PD disaggregation, quantization, speculative decoding, LoRA, etc.
