Showing content from https://github.com/sgl-project/sglang/issues/6017 below:

Instruction for Running DeepSeek with Large-scale PD and EP · Issue #6017 · sgl-project/sglang · GitHub

Using main branch

NOTE: The feature is already on main, but its performance on the main branch still needs some improvement. It will be good after a few already-open PRs land: PR 6680, 6727, and 6728.

NOTE: I will try other configurations later, such as 4 nodes for P and 9 nodes for D. (Updated: see below.)

Environment Preparation

Using SGLang and DeepEP on master is sufficient. Also remember to upgrade Mooncake.

4P + 9D experiments

Start server
The DeepEP config can be tuned via #6742.

# prefill nodes
MC_TE_METRIC=true SGLANG_TBO_DEBUG=1 python3 -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-ib-device mlx5_1 --disaggregation-mode prefill --dist-init-addr 10.5.55.3:5757 --nnodes 4 --node-rank 0 --tp-size 32 --dp-size 32 --enable-dp-attention --decode-log-interval 1 --enable-deepep-moe --page-size 1 --host 0.0.0.0 --trust-remote-code --moe-dense-tp-size 1 --enable-dp-lm-head --disable-radix-cache --watchdog-timeout 1000000 --enable-two-batch-overlap --deepep-mode normal --mem-fraction-static 0.85 --chunked-prefill-size 524288 --max-running-requests 8192 --max-total-tokens 131072 --context-length 8192 --init-expert-location YOUR_PATH --ep-num-redundant-experts 32 --ep-dispatch-algorithm dynamic --eplb-algorithm deepseek --deepep-config YOUR_PATH

# decode nodes
MC_TE_METRIC=true SGLANG_TBO_DEBUG=1 python3 -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-ib-device mlx5_1 --disaggregation-mode decode --dist-init-addr 10.5.55.7:5757 --nnodes 9 --node-rank 0 --tp-size 72 --dp-size 72 --enable-dp-attention --decode-log-interval 1 --enable-deepep-moe --page-size 1 --host 0.0.0.0 --trust-remote-code --moe-dense-tp-size 1 --enable-dp-lm-head --disable-radix-cache --watchdog-timeout 1000000 --enable-two-batch-overlap --deepep-mode low_latency --mem-fraction-static 0.835 --max-running-requests 18432 --context-length 4500 --init-expert-location YOUR_PATH --ep-num-redundant-experts 32 --cuda-graph-bs 128 --num-reserved-decode-tokens YOUR_VALUE

# load balancer
python3 -m sglang.srt.disaggregation.mini_lb --prefill "http://YOUR_FIRST_PREFILL_NODE_IP:30000" --decode "http://YOUR_FIRST_DECODE_NODE_IP:30000"
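Note that the launch commands above show --node-rank 0 only; every node in a group runs the same command with its own rank. A minimal dry-run sketch (launch_prefill is a hypothetical wrapper that only prints the command; the elided flags are the ones shown above):

```shell
#!/usr/bin/env bash
# Each of the 4 prefill nodes runs the same launch command, differing only
# in --node-rank. This helper just echoes the command (dry run); in practice
# you would run it via ssh/pdsh on the node with that rank.
launch_prefill() {
  echo "python3 -m sglang.launch_server ... --nnodes 4 --node-rank $1"
}

for rank in 0 1 2 3; do
  launch_prefill "$rank"
done
```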

Benchmark for prefill

# benchmark
python3 -m sglang.bench_one_batch_server --model-path ${model_path} --base-url http://YOUR_IP:8000 --batch-size 8192 --input-len 4096 --output-len 5 --skip-warmup
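As a back-of-the-envelope sanity check on the scale of this benchmark (numbers taken from the commands above, not from the tooling itself):

```python
# Rough scale of the prefill stress test above.
batch_size = 8192             # --batch-size
input_len = 4096              # --input-len
chunked_prefill_size = 524288 # from the prefill launch command

total_prefill_tokens = batch_size * input_len
num_chunks = total_prefill_tokens // chunked_prefill_size

print(total_prefill_tokens)  # 33554432
print(num_chunks)            # 64
```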

Benchmark for decode

# slow down D nodes
curl -H "Content-Type: application/json" -d '{"forward_sleep_time": 90.0}' -X POST "http://YOUR_FIRST_DECODE_NODE_IP:30000/slow_down"

# start benchmark; do not wait for this to finish before running the next line
python3 -m sglang.bench_one_batch_server --model-path /dev/shm/DeepSeek-V3-0324 --base-url http://10.10.37.16:7000 --batch-size 40000 --input-len 2000 --output-len 100 --skip-warmup

# after some time (e.g. 10 minutes), once the D nodes are saturated, run the following command
# finish slowing down D nodes
curl -H "Content-Type: application/json" -d '{"forward_sleep_time": null}' -X POST "http://YOUR_FIRST_DECODE_NODE_IP:30000/slow_down"
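The slow-down toggling above can also be scripted. A minimal Python sketch using only the standard library (the endpoint and payload mirror the curl commands above; the host is a placeholder):

```python
import json
import urllib.request

def build_payload(forward_sleep_time):
    # forward_sleep_time=None serializes to JSON null, which restores full speed.
    return json.dumps({"forward_sleep_time": forward_sleep_time}).encode()

def set_slow_down(base_url, forward_sleep_time):
    """POST to /slow_down on a decode node, mirroring the curl calls above."""
    req = urllib.request.Request(
        f"{base_url}/slow_down",
        data=build_payload(forward_sleep_time),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(req)

# Usage (placeholder host, as in the curl commands):
# set_slow_down("http://YOUR_FIRST_DECODE_NODE_IP:30000", 90.0)  # slow down
# set_slow_down("http://YOUR_FIRST_DECODE_NODE_IP:30000", None)  # restore
```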

4P + 9D + dynamic EPLB

These are preliminary tests; there may still be room for improvement.

# prefill
MC_TE_METRIC=true SGLANG_TORCH_PROFILER_DIR=/host_home/temp_sglang_server2local SGLANG_TBO_DEBUG=1 PYTHONUNBUFFERED=1 python3 -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-ib-device mlx5_1 --disaggregation-mode prefill --dist-init-addr 10.5.55.3:5757 --nnodes 4 --node-rank 0 --tp-size 32 --dp-size 32 --enable-dp-attention --decode-log-interval 1 --enable-deepep-moe --page-size 1 --host 0.0.0.0 --trust-remote-code --moe-dense-tp-size 1 --enable-dp-lm-head --disable-radix-cache --watchdog-timeout 1000000 --enable-two-batch-overlap --deepep-mode normal --mem-fraction-static 0.85 --chunked-prefill-size 524288 --max-running-requests 8192 --max-total-tokens 65536 --context-length 8192 --enable-eplb --ep-num-redundant-experts 32 --eplb-rebalance-num-iterations YOUR_VALUE --ep-dispatch-algorithm dynamic --deepep-config YOUR_PATH

# decode
MC_TE_METRIC=true SGLANG_TORCH_PROFILER_DIR=/host_home/temp_sglang_server2local SGLANG_TBO_DEBUG=1 PYTHONUNBUFFERED=1 python3 -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-ib-device mlx5_1 --disaggregation-mode decode --dist-init-addr 10.5.55.7:5757 --nnodes 9 --node-rank 0 --tp-size 72 --dp-size 72 --enable-dp-attention --decode-log-interval 1 --enable-deepep-moe --page-size 1 --host 0.0.0.0 --trust-remote-code --moe-dense-tp-size 1 --enable-dp-lm-head --disable-radix-cache --watchdog-timeout 1000000 --enable-two-batch-overlap --deepep-mode low_latency --mem-fraction-static 0.82 --max-running-requests 18432 --context-length 4500 --enable-eplb --ep-num-redundant-experts 32 --eplb-rebalance-num-iterations YOUR_VALUE --cuda-graph-bs 256  --num-reserved-decode-tokens YOUR_VALUE

Create expert distribution data

Requires PRs 6964 and 6967.

# prefill
SGLANG_DISAGGREGATION_THREAD_POOL_SIZE=4 MC_TE_METRIC=true SGLANG_TORCH_PROFILER_DIR=/host_home/temp_sglang_server2local SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR=/host_home/temp_sglang_server2local SGLANG_TBO_DEBUG=1 PYTHONUNBUFFERED=1 python3 -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-ib-device mlx5_1 --disaggregation-mode prefill --dist-init-addr 10.5.55.1:5757 --nnodes 4 --node-rank 0 --tp-size 32 --dp-size 32 --enable-dp-attention --decode-log-interval 1 --enable-deepep-moe --page-size 1 --host 0.0.0.0 --trust-remote-code --moe-dense-tp-size 1 --enable-dp-lm-head --disable-radix-cache --watchdog-timeout 1000000 --enable-two-batch-overlap --expert-distribution-recorder-mode stat --disable-overlap-schedule --expert-distribution-recorder-buffer-size -1 --deepep-mode normal --mem-fraction-static 0.82 --chunked-prefill-size 524288 --max-running-requests 8192 --max-total-tokens 131072 --context-length 8192 --ep-num-redundant-experts 32 --ep-dispatch-algorithm dynamic --eplb-algorithm deepseek --deepep-config /host_home/primary_synced/tom_sglang_server/misc/deepep_vp.json

# decode
MC_TE_METRIC=true SGLANG_TORCH_PROFILER_DIR=/host_home/temp_sglang_server2local SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR=/host_home/temp_sglang_server2local SGLANG_TBO_DEBUG=1 PYTHONUNBUFFERED=1 python3 -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-ib-device mlx5_1 --disaggregation-mode decode --dist-init-addr 10.5.55.5:5757 --nnodes 9 --node-rank 0 --tp-size 72 --dp-size 72 --enable-dp-attention --decode-log-interval 1 --enable-deepep-moe --page-size 1 --host 0.0.0.0 --trust-remote-code --moe-dense-tp-size 1 --enable-dp-lm-head --disable-radix-cache --watchdog-timeout 1000000 --enable-two-batch-overlap --expert-distribution-recorder-mode stat --disable-overlap-schedule --expert-distribution-recorder-buffer-size -1 --deepep-mode low_latency --mem-fraction-static 0.81 --max-running-requests 18432 --context-length 4500 --ep-num-redundant-experts 32 --cuda-graph-bs 256  --num-reserved-decode-tokens YOUR_VALUE

curl -X POST -H 'Content-Type: application/json' 'http://10.5.55.1:30000/start_expert_distribution_record' -d '{}' 
curl -X POST -H 'Content-Type: application/json' 'http://10.5.55.5:30000/start_expert_distribution_record' -d '{}' 
curl -X POST -H 'Content-Type: application/json' 'http://10.5.55.5:30000/slow_down' -d '{"forward_sleep_time": 90.0}' 
python3 -m sglang.bench_one_batch_server  --base-url http://10.5.55.1:8000 --model-path /dev/shm/DeepSeek-V3-0324 --batch-size 40000 --input-len 2000 --output-len 100 --skip-warmup 
# after a while
curl -X POST -H 'Content-Type: application/json' 'http://10.5.55.5:30000/slow_down' -d '{"forward_sleep_time": null}' 
# after a while
curl -X POST -H 'Content-Type: application/json' 'http://10.5.55.1:30000/dump_expert_distribution_record' -d '{}' 
curl -X POST -H 'Content-Type: application/json' 'http://10.5.55.5:30000/dump_expert_distribution_record' -d '{}' 

Then you will get one .pt file for prefill and one for decode. They can be used in --init-expert-location.
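The dumps land under SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR (set in the launch commands above). A small stdlib-only sketch for picking the most recent dump to pass to --init-expert-location (the flat *.pt layout is an assumption; the helper name is hypothetical):

```python
from pathlib import Path

def newest_expert_dump(recorder_dir: str) -> Path:
    """Return the most recently modified .pt dump in the recorder directory."""
    dumps = sorted(
        Path(recorder_dir).glob("*.pt"),
        key=lambda p: p.stat().st_mtime,
    )
    if not dumps:
        raise FileNotFoundError(f"no .pt dumps in {recorder_dir}")
    return dumps[-1]

# Usage: pass the returned path as the value of --init-expert-location
# when relaunching the corresponding prefill/decode servers.
```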

Using the blog branch

Environment Preparation

It is suggested to use this Dockerfile (https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile.deepep) to prepare the DeepEP dependencies.
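A dry-run sketch of building an image from that Dockerfile (the image tag is a placeholder, and the helper only echoes the command rather than running docker):

```shell
#!/usr/bin/env bash
# Build the DeepEP-ready image from the SGLang repo root.
# IMAGE_TAG is arbitrary; the Dockerfile path is from the link above.
IMAGE_TAG=sglang-deepep

build_deepep_image() {
  # Dry run: print the docker build command instead of executing it.
  echo "docker build -f docker/Dockerfile.deepep -t $IMAGE_TAG ."
}

build_deepep_image
```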

Stress-testing Prefill Nodes

# prefill nodes
MC_TE_METRIC=true SGLANG_HACK_DEEPEP_NEW_MODE=0 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ${model_path} --disaggregation-mode prefill --disaggregation-ib-device ${device_name} --host ${node_ip} --trust-remote-code --dist-init-addr ${master_ip}:5757 --nnodes ${num_prefill} --node-rank ${node_rank} --tp-size $((${num_prefill}*8)) --dp-size $((${num_prefill}*8)) --enable-dp-attention --enable-deepep-moe --deepep-mode normal --mem-fraction-static 0.85 --chunked-prefill-size $((${num_prefill}*131072)) --max-running-requests $((${num_prefill}*2048)) --max-total-tokens 131072 --context-length 8192 --init-expert-location YOUR_EXPERT_LOCATION_HERE --ep-num-redundant-experts 32 --enable-two-batch-overlap --moe-dense-tp-size 1 --disable-radix-cache --ep-dispatch-algorithm random

# decode nodes
SGLANG_HACK_DEEPEP_NEW_MODE=0 SGLANG_HACK_PD_DECODE_NUM_RESERVED_DECODE_TOKENS=102 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ${model_path} --disaggregation-mode decode --disaggregation-ib-device ${device_name} --host ${node_ip} --trust-remote-code --dist-init-addr ${master_ip}:5757 --nnodes ${num_decode} --node-rank ${node_rank} --tp-size $((${num_decode}*8)) --dp-size $((${num_decode}*8)) --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --mem-fraction-static 0.82 --max-running-requests $((${num_decode}*1024)) --context-length 4500 --init-expert-location YOUR_EXPERT_LOCATION_HERE --enable-two-batch-overlap --moe-dense-tp-size 1 --cuda-graph-bs 128 --disable-radix-cache --decode-log-interval 1
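The size flags above are derived from the node count, assuming 8 GPUs per node (as the $((num_prefill*8)) arithmetic implies). A small Python sketch of the same arithmetic (function name hypothetical):

```python
GPUS_PER_NODE = 8  # implied by tp-size = num_nodes * 8 in the commands above

def prefill_size_flags(num_prefill: int) -> dict:
    """Mirror the $((num_prefill*...)) arithmetic in the prefill launch command."""
    return {
        "tp_size": num_prefill * GPUS_PER_NODE,
        "dp_size": num_prefill * GPUS_PER_NODE,
        "chunked_prefill_size": num_prefill * 131072,
        "max_running_requests": num_prefill * 2048,
    }

print(prefill_size_flags(4))
# {'tp_size': 32, 'dp_size': 32, 'chunked_prefill_size': 524288, 'max_running_requests': 8192}
```

For num_prefill=4 this reproduces the values used in the 4P + 9D commands earlier.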

# load balancer
python3 -m sglang.srt.disaggregation.mini_lb --prefill "http://YOUR_FIRST_PREFILL_NODE_IP:30000" --decode "http://YOUR_FIRST_DECODE_NODE_IP:30000"

# benchmark
python3 -m sglang.bench_one_batch_server --model-path ${model_path} --base-url http://YOUR_IP:8000 --batch-size 8192 --input-len 4096 --output-len 5 --skip-warmup

Stress-testing Decode Nodes

# prefill nodes
SGLANG_HACK_DEEPEP_NEW_MODE=0 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ${model_path} --disaggregation-mode prefill --disaggregation-ib-device ${device_name} --host ${node_ip} --trust-remote-code --dist-init-addr ${master_ip}:5050 --nnodes ${num_prefill} --node-rank ${node_rank} --tp-size $((${num_prefill}*8)) --dp-size $((${num_prefill}*8)) --enable-dp-attention --enable-deepep-moe --deepep-mode normal --mem-fraction-static 0.85 --chunked-prefill-size $((${num_prefill}*65536)) --max-running-requests $((${num_prefill}*2048)) --max-total-tokens 131076 --context-length 8192 --init-expert-location YOUR_EXPERT_LOCATION_HERE --ep-num-redundant-experts 32 --enable-two-batch-overlap --moe-dense-tp-size 1 --disable-radix-cache

# decode nodes
SGLANG_HACK_DEEPEP_NEW_MODE=0 SGLANG_HACK_PD_DECODE_NUM_RESERVED_DECODE_TOKENS=YOUR_NUM_HERE SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ${model_path} --disaggregation-mode decode --disaggregation-ib-device ${device_name} --host ${node_ip} --trust-remote-code --dist-init-addr ${master_ip}:5050 --nnodes ${num_decode} --node-rank ${node_rank} --tp-size $((${num_decode}*8)) --dp-size $((${num_decode}*8)) --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --mem-fraction-static 0.846 --chunked-prefill-size 81920 --max-running-requests $((${num_decode}*2048)) --context-length 4096 --init-expert-location YOUR_EXPERT_LOCATION_HERE --ep-num-redundant-experts 32 --enable-two-batch-overlap --moe-dense-tp-size 1 --cuda-graph-bs 256 --disable-radix-cache --decode-log-interval 1

# load balancer
python3 -m sglang.srt.disaggregation.mini_lb --prefill "http://YOUR_FIRST_PREFILL_NODE_IP:30000" --decode "http://YOUR_FIRST_DECODE_NODE_IP:30000"

# slow down D nodes
curl -H "Content-Type: application/json" -d '{"forward_sleep_time": 90.0}' -X POST "http://YOUR_FIRST_DECODE_NODE_IP:30000/slow_down"

# start benchmark; do not wait for this to finish before running the next line
python3 -m sglang.bench_one_batch_server --model-path /dev/shm/DeepSeek-V3-0324 --base-url http://10.10.37.16:7000 --batch-size 40000 --input-len 2000 --output-len 100 --skip-warmup

# after some time (e.g. 10 minutes), once the D nodes are saturated, run the following command
# finish slowing down D nodes
curl -H "Content-Type: application/json" -d '{"forward_sleep_time": null}' -X POST "http://YOUR_FIRST_DECODE_NODE_IP:30000/slow_down"

Analyzing Results

Since we are stress-testing only one side (P or D), we need to look at the server logs rather than the benchmark script's output.

Remarks

Report Template

If you face any issues, feel free to discuss them here or in the Slack channel; it would be great to include the relevant details when reporting.
