Report of performance regression: No response

Misc discussion on performance

To reproduce vLLM's performance benchmark, please launch a shell in the following docker images (an example `docker run` invocation is sketched after the list):
lmsysorg/sglang:v0.3.0-cu124
openmmlab/lmdeploy:v0.6.0a0-cu12
nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
vllm/vllm-openai:v0.6.0
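
For example, a shell in the vLLM image could be started roughly as follows. This is a minimal sketch, not taken from the issue: the GPU flags, the Hugging Face cache mount, and the entrypoint override are assumptions about a typical single-node setup and should be adapted to your machine.

```bash
# Sketch: open an interactive shell in the vLLM benchmark image.
# The vllm/vllm-openai image normally launches the API server as its entrypoint,
# so we override it with /bin/bash. The cache mount is optional but avoids
# re-downloading model weights between runs.
docker run --gpus all -it --rm \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  --entrypoint /bin/bash \
  vllm/vllm-openai:v0.6.0
```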
And then run the following bash script (don't forget to replace `<your HF TOKEN>` with your Hugging Face token that has Llama-3 model access):
```bash
export HF_TOKEN=<your HF TOKEN>
apt update
apt install -y wget unzip
# download benchmarking code
wget -O benchmarking_code.zip https://buildkite.com/organizations/vllm/pipelines/performance-benchmark/builds/8532/jobs/0191bbbf-c603-4c15-9f5d-e0b2933ba097/artifacts/0191bd2a-d6cd-4f6d-b618-a7aa2c39456c
unzip benchmarking_code.zip
# remove previous results
rm -r ./benchmarks/results
VLLM_SOURCE_CODE_LOC=$(pwd) bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
```
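
Since each run takes over an hour, it may be worth confirming that the token actually has access to a gated Llama-3 repository before launching the benchmark. A minimal sketch, assuming `huggingface_hub` is available in the container and using `meta-llama/Meta-Llama-3-8B-Instruct` as a stand-in for whichever model the benchmark pulls (this check is not part of the original instructions):

```bash
# Optional sanity check (assumption, not from the issue):
# verify HF_TOKEN can read a gated Llama-3 repository before the long run.
python3 - <<'EOF'
import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ["HF_TOKEN"])
# model_info() raises an error if the token cannot access the gated repo
info = api.model_info("meta-llama/Meta-Llama-3-8B-Instruct")
print("token can access:", info.id)
EOF
```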
Your benchmarking results will be written to `./benchmarks/results`, with the name format `xxx_nightly_results.json`; each file can be loaded and converted to a pandas dataframe via `pandas.DataFrame.from_dict()` (a loading sketch is given below). Each benchmark run takes roughly 1 hour 10 minutes, assuming the model weights are already downloaded (and roughly 1 hour 30 minutes for TensorRT-LLM, as it first needs to convert the model into a Triton inference engine).
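
For example, one result file could be inspected roughly like this (the file name `vllm_nightly_results.json` is a placeholder for whatever your run actually produces):

```bash
# Sketch: load one nightly results file into a pandas DataFrame.
python3 - <<'EOF'
import json
import pandas as pd

# placeholder file name; substitute the actual *_nightly_results.json from your run
with open("./benchmarks/results/vllm_nightly_results.json") as f:
    results = json.load(f)

df = pd.DataFrame.from_dict(results)
print(df.head())
EOF
```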
When you run the H100 benchmark inside the TensorRT-LLM docker container, you may experience a memory leak issue (issue link). In this case, please add the following code
```bash
# temporary fix for trt
kill_gpu_processes
bash -c "python3 /tensorrtllm_backend/scripts/launch_triton_server.py \
  --world_size=${tp} \
  --model_repo=/tensorrtllm_backend/triton_model_repo & " </dev/null >/dev/null 2>&1 &
wait_for_server
```
to Line 211 (right after the for loop) in `./.buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh`, to force TensorRT-LLM to restart the server more often.
Known issue: the number of generated tokens may differ across engines (we do not strictly enforce `ignore_eos` or `max_length`, due to imperfect implementation of these two flags in different engines). That said, the number of tokens generated by vLLM is roughly aligned with other engines, as all engines perform greedy sampling using the same model.