LightLLM supports a range of deployment solutions for DeepSeek models, including DeepSeek-R1, DeepSeek-V2, and DeepSeek-V3. This document describes the available deployment modes and their configurations in detail.
Deployment Mode Overview

LightLLM supports the following deployment modes:
- Single node TP Mode: tensor parallelism on a single node
- Single node EP Mode: expert parallelism on a single node
- Multi-node TP Mode: tensor parallelism across multiple nodes
- Multi-node EP Mode: expert parallelism across multiple nodes
- PD disaggregation Mode: prefill and decode deployed separately
- Multi PD Master Mode: multiple PD Master nodes
1. Single node Deployment Solutions

1.1 Single node TP Mode (Tensor Parallel)

Suitable for deploying the DeepSeek-R1 model on a single H200 node.
Launch Command:
# H200 Single node DeepSeek-R1 TP Mode
LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 8 \
--enable_fa3
Parameter Description:
- LOADWORKER=18: Number of model-loading threads; a higher value speeds up loading
- --tp 8: Tensor parallelism across 8 GPUs
- --enable_fa3: Enable Flash Attention 3.0
- --port 8088: Service port
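Once the server is up, a quick smoke test (this mirrors the basic functionality test in section 4.1; port 8088 matches the launch command above):

curl http://127.0.0.1:8088/generate \
  -H "Content-Type: application/json" \
  -d '{"inputs": "What is AI?", "parameters": {"max_new_tokens": 17}}'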
1.2 Single node DP + EP Mode (Data Parallel + Expert Parallel)

Suitable for expert-parallel deployment of MoE models such as DeepSeek-V2/V3.
Launch Command:
# H200 Single node DeepSeek-R1 DP + EP Mode
MOE_MODE=EP LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 8 \
--dp 8 \
--enable_fa3
Parameter Description:
- MOE_MODE=EP: Enable expert parallelism mode
- --tp 8: Tensor parallelism
- --dp 8: Data parallelism, usually set to the same value as tp
- --enable_fa3: Enable Flash Attention 3.0
Optional Optimization Parameters (combined in the sketch below):
- --enable_prefill_microbatch_overlap: Enable prefill microbatch overlap
- --enable_decode_microbatch_overlap: Enable decode microbatch overlap
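For reference, a sketch of the single-node DP + EP launch with both overlap flags appended. Whether they help depends on workload and hardware, so treat this as an illustration rather than a tuned configuration:

# Single node DP + EP with microbatch overlap (illustrative)
MOE_MODE=EP LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 8 \
--dp 8 \
--enable_fa3 \
--enable_prefill_microbatch_overlap \
--enable_decode_microbatch_overlap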
2. Multi-node Deployment Solutions

2.1 Multi-node TP Mode

Suitable for deployment across multiple H200/H100 nodes.
Node 0 Launch Command:
# H200/H100 Multi-node DeepSeek-R1 TP Mode Node 0
# Usage: sh multi_node_tp_node0.sh <nccl_host>
export nccl_host=$1
LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 16 \
--enable_fa3 \
--nnodes 2 \
--node_rank 0 \
--nccl_host $nccl_host \
--nccl_port 2732
Node 1 Launch Command:
# H200/H100 Multi-node DeepSeek-R1 TP Mode Node 1
# Usage: sh multi_node_tp_node1.sh <nccl_host>
export nccl_host=$1
LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 16 \
--enable_fa3 \
--nnodes 2 \
--node_rank 1 \
--nccl_host $nccl_host \
--nccl_port 2732
Parameter Description:
- --nnodes 2: Total number of nodes
- --node_rank 0/1: Rank of the current node
- --nccl_host: Host address for NCCL communication
- --nccl_port 2732: Port for NCCL communication
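A usage sketch for the two scripts above, assuming node 0's address is 192.168.0.10 (a placeholder; both nodes must be given the same <nccl_host>, typically node 0's IP):

# On node 0 (placeholder IP)
sh multi_node_tp_node0.sh 192.168.0.10
# On node 1, pointing at the same NCCL host
sh multi_node_tp_node1.sh 192.168.0.10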
2.2 Multi-node EP Mode

Suitable for deploying MoE models across multiple nodes.
Node 0 Launch Command:
# H200 Multi-node DeepSeek-R1 EP Mode Node 0
# Usage: sh multi_node_ep_node0.sh <nccl_host>
export nccl_host=$1
MOE_MODE=EP LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 16 \
--dp 16 \
--enable_fa3 \
--nnodes 2 \
--node_rank 0 \
--nccl_host $nccl_host \
--nccl_port 2732
Node 1 Launch Command:
# H200 Multi-node DeepSeek-R1 EP Mode Node 1
# Usage: sh multi_node_ep_node1.sh <nccl_host>
export nccl_host=$1
MOE_MODE=EP LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 16 \
--dp 16 \
--enable_fa3 \
--nnodes 2 \
--node_rank 1 \
--nccl_host $nccl_host \
--nccl_port 2732
Optional Optimization Parameters (appended to the launch command on every node, in the same way as the single-node sketch above):
- --enable_prefill_microbatch_overlap: Enable prefill microbatch overlap
- --enable_decode_microbatch_overlap: Enable decode microbatch overlap
3. PD disaggregation Deployment Solutions

PD (Prefill-Decode) disaggregation mode deploys the prefill and decode stages separately, which can make better use of hardware resources.
3.1 Single PD Master Mode

Step 1: Launch PD Master Service
# PD Master for DeepSeek-R1
# Usage: sh pd_master.sh <pd_master_ip>
export pd_master_ip=$1
python -m lightllm.server.api_server --model_dir /path/DeepSeek-R1 \
--run_mode "pd_master" \
--host $pd_master_ip \
--port 60011
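A usage sketch, with 10.0.0.5 as a placeholder for the master machine's address; the prefill and decode services below register with this master via --pd_master_ip:

# On the PD Master machine (placeholder IP)
sh pd_master.sh 10.0.0.5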
Step 2: Launch Prefill Service
# PD prefill mode for DeepSeek-R1 (DP+EP) on H200
# Usage: sh pd_prefill.sh <host> <pd_master_ip>
# nvidia-cuda-mps-control -d starts MPS (optional; performance is much better
# with MPS, but some GPUs may encounter errors when it is enabled. Upgrading
# to a newer driver version is recommended, especially for H-series cards.)
export host=$1
export pd_master_ip=$2
nvidia-cuda-mps-control -d
MOE_MODE=EP KV_TRANS_USE_P2P=1 LOADWORKER=18 python -m lightllm.server.api_server \
--model_dir /path/DeepSeek-R1 \
--run_mode "prefill" \
--tp 8 \
--dp 8 \
--host $host \
--port 8019 \
--nccl_port 2732 \
--enable_fa3 \
--disable_cudagraph \
--pd_master_ip $pd_master_ip
Step 3: Launch Decode Service
# PD decode mode for DeepSeek-R1 (DP+EP) on H200
# Usage: sh pd_decode.sh <host> <pd_master_ip>
export host=$1
export pd_master_ip=$2
nvidia-cuda-mps-control -d
MOE_MODE=EP KV_TRANS_USE_P2P=1 LOADWORKER=18 python -m lightllm.server.api_server \
--model_dir /path/DeepSeek-R1 \
--run_mode "decode" \
--tp 8 \
--dp 8 \
--host $host \
--port 8121 \
--nccl_port 12322 \
--enable_fa3 \
--disable_cudagraph \
--pd_master_ip $pd_master_ip \
--pd_master_port 60011
# If you want to enable microbatch overlap, uncomment the following line:
#--enable_decode_microbatch_overlap
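After both services have registered with the master, clients send requests to the PD Master rather than to the prefill/decode workers directly. A minimal check, assuming the placeholder master address 10.0.0.5 from the sketch above and the same /generate endpoint used in section 4.1:

curl http://10.0.0.5:60011/generate \
  -H "Content-Type: application/json" \
  -d '{"inputs": "What is AI?", "parameters": {"max_new_tokens": 17}}'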
3.2 Multi PD Master Mode

Supports multiple PD Master nodes, providing better load balancing and high availability.
Step 1: Launch Config Server
# Config Server
# Usage: sh config_server.sh <config_server_host>
export config_server_host=$1
python -m lightllm.server.api_server \
--run_mode "config_server" \
--config_server_host $config_server_host \
--config_server_port 60088
Step 2: Launch Multiple PD Masters
# PD Master 1
# Usage: sh pd_master_1.sh <host> <config_server_host>
export host=$1
export config_server_host=$2
python -m lightllm.server.api_server \
--model_dir /path/DeepSeek-R1 \
--run_mode "pd_master" \
--host $host \
--port 60011 \
--config_server_host $config_server_host \
--config_server_port 60088

# PD Master 2
# Usage: sh pd_master_2.sh <host> <config_server_host>
export host=$1
export config_server_host=$2
python -m lightllm.server.api_server \
--model_dir /path/DeepSeek-R1 \
--run_mode "pd_master" \
--host $host \
--port 60012 \
--config_server_host $config_server_host \
--config_server_port 60088
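An invocation sketch with placeholder addresses (10.0.0.4 for the config server, 10.0.0.5 and 10.0.0.6 for the two masters; substitute your real hosts):

# On the config server machine
sh config_server.sh 10.0.0.4
# On each PD Master machine: its own host first, then the config server host
sh pd_master_1.sh 10.0.0.5 10.0.0.4
sh pd_master_2.sh 10.0.0.6 10.0.0.4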
Step 3: Launch Prefill and Decode Services
# Prefill Service
export host=$1
export config_server_host=$2
nvidia-cuda-mps-control -d
MOE_MODE=EP LOADWORKER=18 python -m lightllm.server.api_server \
--model_dir /path/DeepSeek-R1 \
--run_mode "prefill" \
--host $host \
--port 8019 \
--tp 8 \
--dp 8 \
--nccl_port 2732 \
--enable_fa3 \
--disable_cudagraph \
--config_server_host $config_server_host \
--config_server_port 60088
# If you want to enable microbatch overlap, uncomment the following line:
#--enable_prefill_microbatch_overlap

# Decode Service
export host=$1
export config_server_host=$2
nvidia-cuda-mps-control -d
MOE_MODE=EP LOADWORKER=18 python -m lightllm.server.api_server \
--model_dir /path/DeepSeek-R1 \
--run_mode "decode" \
--host $host \
--port 8121 \
--nccl_port 12322 \
--tp 8 \
--dp 8 \
--enable_fa3 \
--config_server_host $config_server_host \
--config_server_port 60088
# If you want to enable microbatch overlap, uncomment the following line:
#--enable_decode_microbatch_overlap
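With both masters registered against the config server, requests can go to either master, which is what provides the load balancing and high availability described above. Reusing the placeholder addresses from the sketch in Step 2:

# Either PD Master can serve the request (ports from the launch commands above)
curl http://10.0.0.5:60011/generate \
  -H "Content-Type: application/json" \
  -d '{"inputs": "What is AI?", "parameters": {"max_new_tokens": 17}}'
curl http://10.0.0.6:60012/generate \
  -H "Content-Type: application/json" \
  -d '{"inputs": "What is AI?", "parameters": {"max_new_tokens": 17}}'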
4. Testing and Validation

4.1 Basic Functionality Testing

curl http://server_ip:server_port/generate \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "What is AI?",
    "parameters": {
      "max_new_tokens": 17,
      "frequency_penalty": 1
    }
  }'

4.2 Performance Benchmark Testing
# DeepSeek-R1 Performance Testing
cd test
python benchmark_client.py \
--num_clients 100 \
--input_num 2000 \
--tokenizer_path /path/DeepSeek-R1/ \
--url http://127.0.0.1:8088/generate_stream
All of the above scripts can be found for reference in the test/start_scripts/multi_pd_master/ directory.