As described in #5858 (comment), DP Attention (together with the new EPMoE support introduced by #5917) is failing on idle batches.
Reproduction
python -m sglang.launch_server \
    --model-path path-to-Qwen2-57B-A14B --trust-remote-code \
    --mem-fraction-static 0.8 \
    --max-prefill-tokens 2048 \
    --chunked-prefill-size -1 \
    --tp-size 4 \
    --enable-dp-attention \
    --dp-size 4
import requests

port = 30000
prompt = "The capital of France is"

response = requests.post(
    f"http://localhost:{port}/generate",
    json={
        "text": prompt,
        "sampling_params": {
            "temperature": 0.7,
            "max_new_tokens": 2048,
        },
        "stream": True,
    },
    stream=True,
)
This causes the server to crash. Error from one of the TP workers:
[2025-05-07 12:19:42 DP1 TP1] TpModelWorkerClient hit an exception: Traceback (most recent call last):
File "/code/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 118, in forward_thread_func
self.forward_thread_func_()
File "/opt/conda/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/code/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 148, in forward_thread_func_
logits_output, next_token_ids = self.worker.forward_batch_generation(
File "/code/sglang/python/sglang/srt/managers/tp_worker.py", line 206, in forward_batch_generation
logits_output = self.model_runner.forward(
File "/code/sglang/python/sglang/srt/model_executor/model_runner.py", line 1105, in forward
return self.forward_idle(forward_batch, pp_proxy_tensors=pp_proxy_tensors)
File "/code/sglang/python/sglang/srt/model_executor/model_runner.py", line 1071, in forward_idle
return self.model.forward(
File "/opt/conda/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/code/sglang/python/sglang/srt/models/qwen2_moe.py", line 418, in forward
hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
File "/opt/conda/envs/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/code/sglang/python/sglang/srt/models/qwen2_moe.py", line 379, in forward
hidden_states, residual = layer(
File "/opt/conda/envs/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/code/sglang/python/sglang/srt/models/qwen2_moe.py", line 318, in forward
hidden_states = self.input_layernorm(hidden_states)
File "/opt/conda/envs/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/code/sglang/python/sglang/srt/layers/layernorm.py", line 56, in forward
return self.forward_cuda(*args, **kwargs)
File "/code/sglang/python/sglang/srt/layers/layernorm.py", line 70, in forward_cuda
out = rmsnorm(x, self.weight.data, self.variance_epsilon)
File "/opt/conda/envs/sglang/lib/python3.10/site-packages/sgl_kernel/elementwise.py", line 41, in rmsnorm
torch.ops.sgl_kernel.rmsnorm.default(out, input, weight, eps, enable_pdl)
File "/opt/conda/envs/sglang/lib/python3.10/site-packages/torch/_ops.py", line 723, in __call__
return self._op(*args, **kwargs)
RuntimeError: RMSNorm failed with error code invalid configuration argument
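One possible reading of this traceback, offered as a sketch rather than a confirmed diagnosis: CUDA reports "invalid configuration argument" when a kernel is launched with a zero-sized grid or block, and the call arrives through `forward_idle`, so the idle batch may be handing RMSNorm a tensor with zero rows. The pure-Python model below illustrates that failure mode and an early-return guard. Every function name in it is hypothetical, and the one-block-per-token launch geometry is an assumption, not something taken from `sgl_kernel`.

```python
from typing import List, Tuple


def rmsnorm_launch_dims(num_tokens: int, hidden_size: int,
                        max_threads: int = 1024) -> Tuple[int, int]:
    """Hypothetical launch geometry: one block per token row (assumption)."""
    grid = num_tokens
    block = min(hidden_size, max_threads)
    return grid, block


def is_valid_launch(grid: int, block: int, max_threads: int = 1024) -> bool:
    """CUDA rejects launches whose grid or block dimension is zero."""
    return grid >= 1 and 1 <= block <= max_threads


def rmsnorm_reference(rows: List[List[float]], weight: List[float],
                      eps: float = 1e-6) -> List[List[float]]:
    """Plain RMSNorm: x / sqrt(mean(x^2) + eps) * weight, row by row."""
    out = []
    for row in rows:
        mean_sq = sum(v * v for v in row) / len(row)
        scale = (mean_sq + eps) ** -0.5
        out.append([v * scale * w for v, w in zip(row, weight)])
    return out


def guarded_rmsnorm(rows: List[List[float]], weight: List[float],
                    eps: float = 1e-6) -> List[List[float]]:
    """Skip the (modelled) kernel launch when the batch has no rows."""
    grid, block = rmsnorm_launch_dims(len(rows), len(weight))
    if not is_valid_launch(grid, block):
        # Idle batch: nothing to normalize, so return an empty result
        # instead of attempting a launch with an invalid configuration.
        return []
    return rmsnorm_reference(rows, weight, eps)


if __name__ == "__main__":
    print(is_valid_launch(*rmsnorm_launch_dims(0, 4096)))  # idle batch
    print(guarded_rmsnorm([[3.0, 4.0]], [1.0, 1.0]))       # normal batch
```

If the assumption holds, a similar empty-input check before the `rmsnorm` call (or inside the layer's forward) would be one way to let idle DP ranks pass through the layer without launching the kernel.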
Environment
$ python3 -m sglang.check_env
Python: 3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA A40
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.6
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.99
CUDA Driver Version: 535.146.02
PyTorch: 2.6.0+cu124
sglang: 0.4.6.post2
sgl_kernel: 0.1.1
flashinfer_python: 0.2.5
triton: 3.2.0
transformers: 4.51.1
torchao: 0.10.0
numpy: 2.2.5
aiohttp: 3.11.18
fastapi: 0.115.12
hf_transfer: 0.1.9
huggingface_hub: 0.30.2
interegular: 0.3.3
modelscope: 1.25.0
orjson: 3.10.16
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.3
python-multipart: 0.0.20
pyzmq: 26.4.0
uvicorn: 0.34.2
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.17
openai: 1.76.0
tiktoken: 0.9.0
anthropic: 0.50.0
litellm: 1.67.4
decord: 0.6.0
NVIDIA Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     PXB   PXB   PXB   SYS   SYS   SYS   SYS   PXB   PXB   0-25,52-77    0              N/A
GPU1  PXB   X     PXB   PXB   SYS   SYS   SYS   SYS   PIX   PIX   0-25,52-77    0              N/A
GPU2  PXB   PXB   X     PIX   SYS   SYS   SYS   SYS   PXB   PXB   0-25,52-77    0              N/A
GPU3  PXB   PXB   PIX   X     SYS   SYS   SYS   SYS   PXB   PXB   0-25,52-77    0              N/A
GPU4  SYS   SYS   SYS   SYS   X     PXB   PXB   PXB   SYS   SYS   26-51,78-103  1              N/A
GPU5  SYS   SYS   SYS   SYS   PXB   X     PXB   PXB   SYS   SYS   26-51,78-103  1              N/A
GPU6  SYS   SYS   SYS   SYS   PXB   PXB   X     PIX   SYS   SYS   26-51,78-103  1              N/A
GPU7  SYS   SYS   SYS   SYS   PXB   PXB   PIX   X     SYS   SYS   26-51,78-103  1              N/A
NIC0  PXB   PIX   PXB   PXB   SYS   SYS   SYS   SYS   X     PIX
NIC1  PXB   PIX   PXB   PXB   SYS   SYS   SYS   SYS   PIX   X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:
  NIC0: mlx5_0
  NIC1: mlx5_1

ulimit soft: 1024