TensorRT Model Optimizer's goal is to provide a unified library that enables developers to easily achieve state-of-the-art model optimizations resulting in the best inference speed-ups. Model Optimizer will continuously enhance its existing features and leverage advanced capabilities to introduce new cutting-edge techniques, staying at the forefront of AI model optimization.
In striving for this, our roadmap and development follow these product strategies:
In the following sections, we outline our key investment areas and upcoming features. All of this is subject to change, and we'll update this document regularly. Our goal in sharing the roadmap is to increase visibility into Model Optimizer's direction and upcoming features.
Community contributions are highly encouraged. If you're interested in contributing to specific features, we welcome questions and feedback in this thread and feature requests in GitHub Issues 😊.
Roadmap
We'll do our best to provide visibility into our upcoming releases. Details are subject to change and this table is not comprehensive.
High level goals
Quantization
The NVIDIA Blackwell platform powers a new era of computing with FP4 AI inference capabilities. Model Optimizer has provided initial FP4 recipes and quantization techniques and will continue to improve FP4 with advanced techniques:
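For context, the post-training quantization flow that these FP4 recipes plug into looks roughly like the sketch below. This is a minimal sketch: the mtq.quantize entry point is the documented API, but the NVFP4 config name and the example model id are assumptions to verify against your installed Model Optimizer version.

```python
# Minimal post-training quantization sketch with TensorRT Model Optimizer.
# NVFP4_DEFAULT_CFG and the model id below are assumptions / placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical example model
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def forward_loop(m):
    # Push a small calibration set through the model so activation ranges can be collected.
    for prompt in ["Hello, world!", "FP4 inference on Blackwell GPUs"]:
        inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
        m(**inputs)

# Quantize to NVFP4 (FP4 weights/activations) for Blackwell-class GPUs.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```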
Model Optimizer collaborates with NVIDIA and external research labs to continuously develop and integrate state-of-the-art techniques into our library for faster inference. Our recent focus areas include:
Model Optimizer works with TensorRT-LLM, vLLM, and SGLang to streamline optimized model deployment, including an expanding focus on model optimizations that require fine-tuning. To provide a streamlined experience for these optimizations, Model Optimizer is working with Hugging Face, NVIDIA NeMo, and Megatron-LM to deliver an exceptional end-to-end solution. Our focus areas include:
Model Optimizer will continue to accelerate image generation inference by investing in these areas:
To provide extensibility and transparency for everyone, Model Optimizer is now open source! Paired with continued documentation and code additions to improve extensibility and usability, Model Optimizer will keep a strong focus on enabling our community to extend and contribute for their own use cases. This enables developers, for example, to experiment with custom calibration algorithms or contribute the latest techniques. Users can also add their own model support or non-standard data types, and benefit from improved debuggability and accessibility.
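For example, a developer can start from one of the built-in recipes and override it for their own model, roughly as sketched below. The wildcard/"enable" override pattern follows the documented config structure, but treat the exact keys and the layer-name pattern as assumptions.

```python
# Sketch of a custom quantization recipe built by editing a stock Model Optimizer config.
# The "*router*" pattern is a hypothetical layer-name filter.
import copy
import modelopt.torch.quantization as mtq

custom_cfg = copy.deepcopy(mtq.FP8_DEFAULT_CFG)
# Keep layers matching the pattern in high precision by disabling their quantizers.
custom_cfg["quant_cfg"]["*router*"] = {"enable": False}

# The customized config is then passed to mtq.quantize() exactly like a built-in one:
# model = mtq.quantize(model, custom_cfg, forward_loop)
```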
3.2 Ready-to-deploy optimized checkpoints
For developers who have limited GPU resources for optimizing large models, or who prefer to skip the optimization steps, we currently offer quantized checkpoints of popular models in the Hugging Face Model Optimizer collection. Developers can deploy these optimized checkpoints directly on TensorRT-LLM, vLLM, and SGLang (depending on the checkpoint). We have published FP8/FP4/Medusa checkpoints for the Llama model family and an FP4 checkpoint for DeepSeek-R1. In the near future, we are working to expand to optimized FLUX, diffusion, Medusa-trained, and Eagle-trained checkpoints, and more.
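Deploying one of these pre-quantized checkpoints can be as simple as pointing an inference framework at the Hugging Face repository, as in the sketch below. The repository id is illustrative; browse the Hugging Face Model Optimizer collection for the checkpoints that are actually published, and check the framework's documentation for the minimum version that understands ModelOpt quantized checkpoints.

```python
# Serving a pre-quantized checkpoint with vLLM (minimal sketch; repo id is hypothetical).
from vllm import LLM, SamplingParams

llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8")  # quantization is read from the checkpoint config
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Explain FP8 inference in one sentence."], params)
print(outputs[0].outputs[0].text)
```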
4. Choice of Deployment
4.1 Popular Community Frameworks
To offer greater flexibility, we've been investing in supporting popular inference and serving frameworks like vLLM and SGLang, in addition to seamless integration with the NVIDIA AI software ecosystem. We currently provide an initial workflow for vLLM deployment and an example for deploying the unified Hugging Face checkpoint, with more model support planned.
4.2 In-Framework Deployment
We have enabled and released a path for deployment within native PyTorch. This decouples model build/compile from runtime and offers several benefits:
Developers can utilize AutoDeploy or Real Quantization for these in-framework deployments.
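Continuing the quantization sketch from the Quantization section above, in-framework deployment means the quantized model can be exercised directly with standard PyTorch and Hugging Face APIs, without a separate engine build. A minimal sketch (generation arguments are illustrative):

```python
# Run the quantized model in native PyTorch -- no engine build/compile step.
# `model`, `tokenizer`, and `torch` are the objects from the quantization sketch above.
prompt = tokenizer("The fastest path to deployment is", return_tensors="pt").to(model.device)
with torch.inference_mode():
    out = model.generate(**prompt, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```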
5. Expand Support Matrix
5.1 Data types
Alongside our existing supported dtypes, we've recently added MXFP4 support and will soon expand to emerging popular dtypes like FP6 and sub-4-bit formats. Our focus is to further speed up GenAI inference with the least possible impact on model fidelity.
5.2 Model Support
We strive to streamline our techniques to minimize the time from a new model or feature to an optimized model, giving our community the shortest possible time to deploy. We'll continue to expand LLM and diffusion model support, invest more in multi-modal LLMs (vision, video, audio, image generation, and action), and continuously expand our model support based on community interests.
5.3 Platform & Other Support
Model Optimizer's explicit quantization will be part of upcoming NVIDIA DriveOS releases. We recently added an end-to-end BEVFormer INT8 example in NVIDIA DL4AGX, with more model support coming soon for automotive customers. Model Optimizer also has planned support for ONNX FP4 on DRIVE Thor.
In Q4 2024, Model Optimizer added formal support for Windows (see Model-Optimizer-Windows), targeting Windows RTX PC systems with tight integration with the Windows ecosystem, including torch.onnx.export, Hugging Face Optimum, GenAI, and Olive. It currently supports quantization techniques such as INT4 AWQ, INT8, and FP8, and we'll expand to more techniques suited to Windows.
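On Windows, the typical entry point is ONNX post-training quantization of a model exported with torch.onnx.export. A minimal sketch follows; the module path and parameter names are assumptions based on the documented ONNX PTQ workflow, so verify them against the Model-Optimizer-Windows documentation for your installed version.

```python
# Sketch of ONNX post-training quantization with Model Optimizer (parameter names are assumptions).
from modelopt.onnx.quantization import quantize

quantize(
    onnx_path="model.onnx",          # FP16/FP32 model exported via torch.onnx.export
    quantize_mode="int8",            # other modes (e.g. fp8 or int4 variants) depend on the release
    output_path="model.quant.onnx",  # where the quantized ONNX model is written
)
```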