reproducing vLLM performance benchmark · Issue #8176 · vllm-project/vllm

Proposal to improve performance

No response

Report of performance regression

No response

Misc discussion on performance

To reproduce vLLM's performance benchmark, please launch a shell in the following docker images:

Then run the following bash script (remember to replace `<your HF TOKEN>` with a Hugging Face token that has Llama-3 model access):

export HF_TOKEN=<your HF TOKEN>
apt update
apt install -y wget unzip 
# download benchmarking code
wget -O benchmarking_code.zip https://buildkite.com/organizations/vllm/pipelines/performance-benchmark/builds/8532/jobs/0191bbbf-c603-4c15-9f5d-e0b2933ba097/artifacts/0191bd2a-d6cd-4f6d-b618-a7aa2c39456c
unzip benchmarking_code.zip
# remove previous results
rm -r ./benchmarks/results
VLLM_SOURCE_CODE_LOC=$(pwd) bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh

Your benchmarking results will be in ./benchmarks/results, with filenames of the form xxx_nightly_results.json; each file can be loaded and converted to a pandas DataFrame via pandas.DataFrame.from_dict(). Each benchmark run takes roughly 1 hour 10 minutes, assuming the model weights are already downloaded (and about 1 hour 30 minutes for TensorRT-LLM, since it needs to convert the model into a Triton inference engine first).
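For example, a minimal sketch of loading one result file into a DataFrame (the filename below is a placeholder; use whichever xxx_nightly_results.json your run produced):

```python
import json
import pandas as pd

# Hypothetical filename -- substitute the actual xxx_nightly_results.json from your run.
with open("./benchmarks/results/vllm_nightly_results.json") as f:
    results = json.load(f)

# Convert the parsed JSON into a DataFrame, as described above.
df = pd.DataFrame.from_dict(results)
print(df.head())
```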

When you run the H100 benchmark inside the TensorRT-LLM docker container, you may hit a memory leak (issue link). In that case, please add the following code

      # temporary fix for trt
      kill_gpu_processes
      bash -c "python3 /tensorrtllm_backend/scripts/launch_triton_server.py \
              --world_size=${tp} \
              --model_repo=/tensorrtllm_backend/triton_model_repo & " </dev/null >/dev/null 2>&1 &
      wait_for_server

to Line 211 (right after the for loop) in ./.buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh to force TensorRT-LLM to restart the server more often.

Known issue:

Your current environment (if you think it is necessary)
The output of `python collect_env.py`
Before submitting a new issue...


