

LLM API Introduction#

The LLM API is a high-level Python API designed to streamline LLM inference workflows.

It supports a broad range of use cases, from single-GPU setups to multi-GPU and multi-node deployments, with built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA Dynamo.

While the LLM API simplifies inference workflows with a high-level interface, it is also designed with flexibility in mind. Under the hood, it uses a PyTorch-native and modular backend, making it easy to customize, extend, or experiment with the runtime.

Quick Start Example#

A simple inference example with TinyLlama using the LLM API:

For more advanced usage including distributed inference, multimodal, and speculative decoding, please refer to this README.

Model Input#

The LLM() constructor accepts either a Hugging Face model ID or a local model path as input.

1. Using a Model from the Hugging Face Hub#

To load a model directly from the Hugging Face Model Hub, simply pass its model ID (i.e., repository name) to the LLM constructor. The model will be automatically downloaded:

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

You can also use quantized checkpoints (FP4, FP8, etc.) of popular models provided by NVIDIA in the same way.

2. Using a Local Hugging Face Model#

To use a model from local storage, first download it manually:

git lfs install
git clone https://huggingface.co/meta-llama/Meta-Llama-3.1-8B

Then, load the model by specifying a local directory path:

llm = LLM(model=<local_path_to_model>)

Note: Some models require accepting specific license agreements. Make sure you have agreed to the terms and authenticated with Hugging Face before downloading.

Tips and Troubleshooting#

The following tips are typically helpful for new LLM API users who are already familiar with other TensorRT-LLM APIs:

