The LLM API is a high-level Python API designed to streamline LLM inference workflows.
It supports a broad range of use cases, from single-GPU setups to multi-GPU and multi-node deployments, with built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA Dynamo.
While the LLM API simplifies inference workflows with a high-level interface, it is also designed with flexibility in mind. Under the hood, it uses a PyTorch-native and modular backend, making it easy to customize, extend, or experiment with the runtime.
Quick Start Example

A simple inference example with TinyLlama using the LLM API:
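A minimal sketch, assuming TensorRT-LLM is installed; the prompts and sampling settings are illustrative:

from tensorrt_llm import LLM, SamplingParams

def main():
    prompts = [
        "Hello, my name is",
        "The capital of France is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Downloads TinyLlama from the Hugging Face Hub on first use.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")

if __name__ == "__main__":
    main()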
For more advanced usage, including distributed inference, multimodal inputs, and speculative decoding, please refer to this README.
Model Input

The LLM() constructor accepts either a Hugging Face model ID or a local model path as input.

1. Using a Model from the Hugging Face Hub

To load a model directly from the Hugging Face Model Hub, simply pass its model ID (i.e., repository name) to the LLM constructor. The model will be automatically downloaded:
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
You can also use quantized checkpoints (FP4, FP8, etc.) of popular models provided by NVIDIA in the same way.
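As a sketch, loading an NVIDIA-published quantized checkpoint looks identical; the model ID below is illustrative, so substitute the repository you actually want:

# Illustrative example: pass the Hugging Face model ID of a quantized
# checkpoint exactly as you would for an unquantized one.
llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8")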
2. Using a Local Hugging Face Model

To use a model from local storage, first download it manually:
git lfs install
git clone https://huggingface.co/meta-llama/Meta-Llama-3.1-8B
Then, load the model by specifying a local directory path:
llm = LLM(model=<local_path_to_model>)
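For instance, assuming the repository above was cloned into the working directory (the path is just for illustration):

# Illustrative local path; point this at wherever you cloned the model.
llm = LLM(model="./Meta-Llama-3.1-8B")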
Tips and Troubleshooting

Note: Some models require accepting specific license agreements. Make sure you have agreed to the terms and authenticated with Hugging Face before downloading.
The following tips typically help new LLM API users who are already familiar with other TensorRT-LLM APIs:
RuntimeError: only rank 0 can start multi-node session, got 1
There is no need to add an mpirun prefix when launching single-node multi-GPU inference with the LLM API. For example, you can run python llm_inference_distributed.py directly to perform multi-GPU inference on a single node, as sketched below.
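A minimal sketch of a single-node multi-GPU setup, assuming two GPUs are available; the tensor_parallel_size value is illustrative:

from tensorrt_llm import LLM

# Run this with plain `python script.py`; the LLM API spawns the extra
# ranks itself, so no mpirun prefix is needed on a single node.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    tensor_parallel_size=2,  # shard the model across 2 GPUs on this node
)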
Hang issue on Slurm Node
If you experience a hang or other issues on a node managed with Slurm, add the prefix mpirun -n 1 --oversubscribe --allow-run-as-root to your launch command. For example, try mpirun -n 1 --oversubscribe --allow-run-as-root python llm_inference_distributed.py.
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1.
Because the LLM API relies on the mpi4py library, construct the LLM instance inside a function and protect the program's main entrypoint under the __main__ namespace to avoid recursive process spawning in mpi4py, as sketched below. This limitation applies to multi-GPU inference only.
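A minimal sketch of this pattern, assuming a two-GPU setup; the model ID and sampling settings are illustrative:

from tensorrt_llm import LLM, SamplingParams

def run_inference():
    # Keeping LLM construction inside a function prevents module-level
    # code from re-running in the worker processes spawned by mpi4py.
    llm = LLM(
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        tensor_parallel_size=2,
    )
    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
    for output in outputs:
        print(output.outputs[0].text)

if __name__ == "__main__":  # guard the entrypoint to avoid recursive spawns
    run_inference()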
Cannot quit after generation
The LLM instance manages threads and processes, which may prevent its reference count from reaching zero. To address this issue, there are two common solutions:
Wrap the LLM instance in a function, as demonstrated in the quickstart guide. This will reduce the reference count and trigger the shutdown process.
Use LLM as a context manager: with LLM(...) as llm: .... The shutdown method is invoked automatically once execution leaves the with-statement block, as sketched below.
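A minimal sketch of the context-manager approach; the model ID and prompt are illustrative:

from tensorrt_llm import LLM, SamplingParams

def main():
    # The context manager invokes the shutdown logic on exit, so the
    # process terminates cleanly after generation.
    with LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0") as llm:
        outputs = llm.generate(["The capital of France is"],
                               SamplingParams(max_tokens=16))
        print(outputs[0].outputs[0].text)

if __name__ == "__main__":
    main()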