The latest state-of-the-art foundation large language models (LLMs) have billions of parameters and are pretrained on trillions of tokens of input text. They often achieve striking results on a wide variety of use cases without any need for customization. Despite this, studies have shown that the best accuracy on downstream tasks can be achieved by adapting LLMs with high-quality, domain-specific datasets.
In many cases, smaller customized models can match or even outperform larger generic LLMs while offering significantly lower deployment costs. However, customizing models for specific downstream tasks can bring significant challenges, during both creation and deployment.
Full fine-tuning (that is, updating all parameters of the model) for the largest LLMs can be difficult due to the amount of computational infrastructure required to learn across the whole model. Infrastructure costs are also increased at deployment time, where users are required to either host multiple large models in memory or tolerate increased latency as entire models are swapped in and out. Low-rank adaptation (LoRA) is a technique for mitigating both of these issues.
This post provides a brief overview of LoRA and explains two ways to deploy LoRA fine-tuned models. We also discuss our approach to heterogeneous deployment of a swarm of LoRA adapters, which enables mixed-batch inference requests.
Low-rank adaptation
In the past few years, LoRA has emerged as a popular technique that tunes a very small number of additional parameters, as compared to full fine-tuning. These additional parameters, called the LoRA adapter, represent the low-rank decomposition of the changes in the dense layers of the network. LoRA operates on the observation that LLMs are overparameterized, and that newly learned information during fine-tuning has a low “intrinsic rank.” In other words, the effective changes in the model parameters are confined to a lower-dimensional subspace of the entire, very high-dimensional parameter space. With LoRA, it’s possible to reduce the number of trainable parameters by 10,000x.
Figure 1. Parameters in A and B represent the newly added information. Image credit: LoRA: Low-Rank Adaptation of Large Language Models
Figure 1 depicts the core idea behind LoRA: the pretrained weights W stay frozen, while a pair of low-rank matrices A and B is trained to capture the task-specific update.
The ranks of the A and B matrices are small values, such as 8 or 16. Cumulatively, they have far fewer trainable parameters than W, which makes customization efficient in both compute and memory. The rank (r) is typically customizable at training time.
There is a tradeoff between rank and computational efficiency. A larger rank enables greater expressivity, so the model can capture more patterns relevant to the downstream task. Very high rank values (such as 64) approach the learning capacity of full supervised fine-tuning, that is, updating all the parameters in the model. On the downside, larger ranks are also more expensive to train and serve, in terms of both memory and compute. In practice, LoRA fine-tuning with a rank as small as 8 is already very effective and is a good starting point for a variety of downstream tasks.
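To make the decomposition concrete, here is a minimal sketch of a LoRA linear layer in PyTorch. The class name, rank, scaling factor, and initialization are illustrative choices for explanation only, not NeMo or NIM internals; the shapes follow Figure 1 (A is d-by-r, B is r-by-d).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank update A @ B (Figure 1)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                            # the pretrained weights W stay frozen
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)     # down-projection, shaped d-by-r
        self.B = nn.Parameter(torch.zeros(r, d_out))           # up-projection, shaped r-by-d, zero init
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base projection plus the low-rank correction x @ A @ B.
        return self.base(x) + self.scaling * (x @ self.A @ self.B)

# Trainable parameters per layer: r * (d_in + d_out) instead of d_in * d_out.
layer = LoRALinear(nn.Linear(4096, 4096), r=8)
out = layer(torch.randn(2, 4096))
```

With d_in = d_out = 4096 and r = 8, the adapter adds roughly 65K trainable parameters per layer versus about 16.7M in the full weight matrix, which is the source of the parameter savings described above.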
Deploying a LoRA-tuned model
LoRA fine-tunes can be deployed in the following ways.
Option 1: Merging the LoRA adapter
The additional LoRA weights can be merged with the pretrained model to create a purpose-built variant that is structurally equivalent to its predecessor. This avoids the additional inference latency of managing the adapter separately. Merging weights is the simpler approach, but it is less flexible: the whole model becomes bespoke and can serve only the single task it was fine-tuned for. This makes it difficult to batch together inputs for different tasks for deployment efficiency. Merging is recommended only if you plan to serve a single task per deployment.
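Numerically, the merge simply folds the low-rank product back into the base weight so the deployed model has the same shape as the original. Frameworks such as Hugging Face PEFT provide merge utilities; the function below is only an illustrative sketch using the "y = x @ W" convention.

```python
import torch

def merge_lora(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
               alpha: float, r: int) -> torch.Tensor:
    """Fold the LoRA update back into the base weight.

    W is (d_in, d_out), A is (d_in, r), B is (r, d_out), matching Figure 1.
    The merged weight W + (alpha / r) * A @ B is a drop-in replacement for W,
    so inference adds no adapter overhead, but the resulting checkpoint serves
    only the one task it was tuned for.
    """
    return W + (alpha / r) * (A @ B)
```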
Option 2: Dynamically loading the LoRA adapter
LoRA adapters (A and B in Figure 1) are kept separate from the base model (W). At inference time, the runtime dynamically loads the adapter weights corresponding to each incoming request. This enables flexibility in serving and batching inputs from various tasks concurrently, making the best use of the available compute without having to maintain separate custom models.
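Conceptually, dynamic loading keeps a registry of adapters next to the shared base weights and applies whichever adapter a request names. The registry, adapter names, and shapes below are illustrative only, not the NIM implementation.

```python
import torch

d = 4096  # illustrative hidden size

# Illustrative registry: adapter name -> (A, B, scaling), kept separate from the base weight W.
adapters = {
    "customer-support": (torch.randn(d, 8) * 0.01, torch.zeros(8, d), 2.0),
    "sql-generation":   (torch.randn(d, 16) * 0.01, torch.zeros(16, d), 1.0),
}

def serve(x: torch.Tensor, W: torch.Tensor, adapter_name: str) -> torch.Tensor:
    """Apply the shared base weight plus whichever adapter the incoming request names."""
    A, B, scaling = adapters[adapter_name]
    return x @ W + scaling * (x @ A @ B)   # W in "y = x @ W" form, shaped (d, d)
```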
Some use cases require several, or even hundreds or thousands of, LoRA adapters over the same base model. For these, dynamic LoRA adapter selection is the better path.
NVIDIA NIM offers optimized inference microservices that support such dynamic loading of LoRA adapters and allow sending mixed-batch requests. The following sections take a deeper look at our approach.
Heterogeneous, multiple LoRA deployment with NVIDIA NIM
With NIM, each inference microservice is associated with a single foundation model. This model can have any number of “customizations” in the form of low-rank adapters associated with it.
The requests in one batch might use different LoRA adapters to support different tasks. Therefore, one traditional General Matrix Multiplication (GEMM) can’t be used to compute all the requests together. Computing them one-by-one sequentially would lead to significant additional overhead. To solve this problem, we used NVIDIA CUTLASS to implement a batched GEMM to fuse batched, heterogeneous request processing into a single kernel. This improves GPU utilization and performance.
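The sketch below is the naive reference computation that the fused kernel replaces: a loop that launches a separate matmul pair per request, chosen by adapter id. It is meant only to make the heterogeneity of a mixed batch concrete; the actual NIM kernel fuses this work into a single CUTLASS batched GEMM.

```python
import torch

def hetero_lora_reference(xs, adapter_ids, As, Bs):
    """Naive reference for a mixed batch: one matmul pair per request.

    xs:          per-request activations, each of shape (tokens_i, d)
    adapter_ids: index of the adapter each request uses
    As, Bs:      adapter matrices, A_i shaped (d, r_i) and B_i shaped (r_i, d)
    A fused batched GEMM computes the same outputs in a single kernel launch,
    avoiding the per-request launch overhead modeled by this loop.
    """
    outputs = []
    for x, idx in zip(xs, adapter_ids):
        outputs.append(x @ As[idx] @ Bs[idx])
    return outputs
```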
Furthermore, we found that GPU utilization of the batched GEMM is not sufficiently high for the first matrix component of each adapter, because this first matrix has a very large input dimension and a small output dimension. Each adapter has two matrix components, A (shaped d-by-r) and B (shaped r-by-d), as seen in Figure 1. Since d is typically much larger than the LoRA rank r, we applied the splitK method to split the GEMM into several tiles across more streaming multiprocessors (SMs), improving GPU utilization, and used an additional reduction kernel to combine the partial results after the splitK batched GEMM.
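The split-K idea itself can be illustrated outside of CUDA: partition the large shared dimension d into chunks, compute partial products that could each occupy their own SMs, then reduce them. A minimal sketch with illustrative shapes, not the kernel implementation:

```python
import torch

def splitk_lora_A(x: torch.Tensor, A: torch.Tensor, num_splits: int = 4) -> torch.Tensor:
    """Compute x @ A (the d -> r down-projection) by splitting the shared dimension d.

    x is (tokens, d) and A is (d, r) with d >> r, so a single tile under-utilizes the GPU.
    Each chunk's partial product can run on its own set of SMs; the final sum plays the
    role of the additional reduction kernel.
    """
    x_chunks = torch.chunk(x, num_splits, dim=1)   # split x along d
    A_chunks = torch.chunk(A, num_splits, dim=0)   # split A along the same d
    partials = [xc @ Ac for xc, Ac in zip(x_chunks, A_chunks)]
    return torch.stack(partials).sum(dim=0)
```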
Best practices for performance benchmarking
Evaluating the latency and throughput performance of such a multi-LoRA deployment is nontrivial. In this section, we discuss several major considerations worth looking at when benchmarking the performance of an LLM LoRA inference framework.

The ignore_eos parameter tells the inference framework to continue generating text until it reaches the max_token_length limit. This ensures the use case OSL (output sequence length) specification is met. The parameter is increasingly supported by LLM inference frameworks and significantly simplifies benchmarking setup. Notably, with ignore_eos you don’t have to train on “real” tasks for performance profiling purposes. Other supplementary metrics include total requests per second and end-to-end request latency.
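For example, a benchmarking client might send fixed-length synthetic requests along these lines. The endpoint path, model name, and prompt are placeholders, and exact parameter names vary by serving framework, so check your framework's API reference.

```python
import requests

# Placeholder endpoint and model name; parameter names can differ between frameworks.
payload = {
    "model": "llama3-8b-instruct-lora-example",
    "prompt": "word " * 512,   # synthetic prompt sized to the target input sequence length
    "max_tokens": 256,         # target output sequence length (OSL)
    "ignore_eos": True,        # keep generating until the token limit, ignoring end-of-sequence
}
response = requests.post("http://localhost:8000/v1/completions", json=payload)
print(response.json())
```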
Compared to serving a base model (or a merged LoRA model), adding dynamic LoRAs, whether a single LoRA, multiple LoRAs of the same rank, or multiple LoRAs of different ranks, induces increasing cost in both latency and throughput. Ideally, this cost should be reasonable in exchange for the improved accuracy and flexibility that dynamic LoRAs provide.
In the coming weeks and months, we’ll have more to share on the performance characteristics of NIM when serving LoRA.
What’s next
There are exciting new enhancements to LoRA in research that aim to improve the efficiency or accuracy of fine-tuned models. Our future direction includes incorporating these into NIM.
Tied-LoRA
Tied-LoRA is a technique from NVIDIA Research that increases the parameter efficiency of LoRA. In LoRA, task-specific low-rank matrices are added to approximate the weight updates for each layer of the LLM. In Tied-LoRA, these low-rank matrices are shared (“tied”) across layers, further reducing the number of trainable parameters. Additionally, the technique allows selective training or freezing of the different LoRA components (low-rank matrices and scaling vectors), enabling the user to experiment with trade-offs between performance and parameter efficiency.
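Based on the paper's description, the tying can be sketched as a single A/B pair shared by every layer, with small per-layer scaling vectors that remain layer-specific and can be trained or frozen. Shapes and initialization below are illustrative, not the reference implementation.

```python
import torch
import torch.nn as nn

class TiedLoRA(nn.Module):
    """One low-rank A/B pair shared ("tied") across all layers, plus per-layer scaling vectors."""

    def __init__(self, d: int, r: int, num_layers: int):
        super().__init__()
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)   # shared down-projection, d-by-r
        self.B = nn.Parameter(torch.zeros(r, d))          # shared up-projection, r-by-d
        # Per-layer scaling vectors; training or freezing them trades quality for parameter count.
        self.u = nn.Parameter(torch.ones(num_layers, r))
        self.v = nn.Parameter(torch.ones(num_layers, d))

    def delta(self, layer_idx: int, x: torch.Tensor) -> torch.Tensor:
        """Low-rank update for one layer: shared A, per-layer scaling, shared B, per-layer scaling."""
        h = (x @ self.A) * self.u[layer_idx]              # (tokens, r)
        return (h @ self.B) * self.v[layer_idx]           # (tokens, d)
```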
Support for this method with NVIDIA NIM is planned for future releases.
DoRA
DoRA, another technique developed by NVIDIA Research, bridges the performance gap between fully fine-tuned models and LoRA tuning. It achieves this by decomposing the pretrained weights into two components: magnitude and direction. For fine-tuning, DoRA uses LoRA specifically for directional updates, keeping the number of trainable parameters small. This approach enhances the learning capacity and training stability of LoRA without incurring additional inference overhead. DoRA consistently outperforms LoRA in fine-tuning models such as LLaMA, LLaVA, and VL-BART across various downstream tasks, including commonsense reasoning, visual instruction tuning, and image and video-text understanding.
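Following the published DoRA formulation, the sketch below splits a pretrained weight into a magnitude vector and a direction matrix and applies LoRA only to the direction. It is a simplified illustration in the same "y = x @ W" convention as above, not NIM or NeMo code.

```python
import torch
import torch.nn as nn

class DoRALinear(nn.Module):
    """Weight = magnitude * unit direction; LoRA adapts only the direction component."""

    def __init__(self, W0: torch.Tensor, r: int = 8, alpha: int = 16):
        super().__init__()
        # W0 is the frozen pretrained weight in "y = x @ W" form, shaped (d_in, d_out).
        self.register_buffer("W0", W0)
        self.A = nn.Parameter(torch.randn(W0.shape[0], r) * 0.01)   # (d_in, r)
        self.B = nn.Parameter(torch.zeros(r, W0.shape[1]))          # (r, d_out)
        self.m = nn.Parameter(W0.norm(dim=0, keepdim=True))         # trainable magnitude per column
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        V = self.W0 + self.scaling * (self.A @ self.B)   # direction plus low-rank directional update
        V = V / V.norm(dim=0, keepdim=True)              # normalize each column to unit length
        return x @ (self.m * V)                          # rescale by the learned magnitude
```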
Conclusion
NVIDIA NIM enables you to seamlessly deploy and scale multiple LoRA adapters. NIM is generally available now, starting with support for Meta Llama 3 8B and Llama 3 70B, and LoRA adapters in both NVIDIA NeMo and Hugging Face model formats. We’re committed to adding support for additional state-of-the-art community models in future releases.
To get started with multi-LoRA in NIM, check out the Jupyter Notebook tutorial on LoRA tuning a Llama 3 model using NVIDIA NeMo, deploying fine-tuned adapter(s) with NIM, and sending mixed inference requests. For more information about NIM, see the documentation.
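As a hint of what mixed requests look like in practice, the snippet below targets two different adapters over the same base model by setting the model field to the adapter name. The endpoint, API key, and adapter names are placeholders; see the tutorial and the NIM documentation for the exact deployment and naming conventions.

```python
from openai import OpenAI

# Placeholder endpoint and adapter names; substitute the values from your NIM deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

for adapter in ["llama3-8b-lora-math", "llama3-8b-lora-squad"]:
    completion = client.chat.completions.create(
        model=adapter,   # the adapter name selects which LoRA serves this request
        messages=[{"role": "user", "content": "Summarize LoRA in one sentence."}],
        max_tokens=64,
    )
    print(adapter, "->", completion.choices[0].message.content)
```

Requests that name different adapters can still be batched together by the server, which is the mixed-batch behavior described earlier in this post.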