torchtune is a PyTorch library for easily authoring, post-training, and experimenting with LLMs.
torchtune supports the entire post-training lifecycle. A successfully post-trained model will likely use several of the methods below.
Supervised Finetuning (SFT)

| Type of Weight Update | 1 Device | >1 Device | >1 Node |
|---|---|---|---|
| Full | ✅ | ✅ | ✅ |
| LoRA/QLoRA | ✅ | ✅ | ✅ |

Example: tune run lora_finetune_single_device --config llama3_2/3B_lora_single_device
You can also run, e.g., tune ls lora_finetune_single_device for a full list of available configs.
Example: tune run knowledge_distillation_distributed --config qwen2/1.5B_to_0.5B_KD_lora_distributed
You can also run, e.g., tune ls knowledge_distillation_distributed for a full list of available configs.
Example: tune run lora_dpo_single_device --config llama3_1/8B_dpo_single_device
You can also run, e.g., tune ls full_dpo_distributed for a full list of available configs.
Example: tune run qat_distributed --config llama3_1/8B_qat_lora
You can also run, e.g., tune ls qat_distributed or tune ls qat_single_device for a full list of available configs.
The above configs are just examples to get you started. The full list of recipes can be found here. If you'd like to work on one of the gaps you see, please submit a PR! If there's an entirely new post-training method you'd like to see implemented in torchtune, feel free to open an Issue.
For the above recipes, torchtune supports many state-of-the-art models available on the Hugging Face Hub or Kaggle Hub. Some of our supported models:
| Model | Sizes |
|---|---|
| Llama4 | Scout (17B x 16E) [models, configs] |
| Llama3.3 | 70B [models, configs] |
| Llama3.2-Vision | 11B, 90B [models, configs] |
| Llama3.2 | 1B, 3B [models, configs] |
| Llama3.1 | 8B, 70B, 405B [models, configs] |
| Mistral | 7B [models, configs] |
| Gemma2 | 2B, 9B, 27B [models, configs] |
| Microsoft Phi4 | 14B [models, configs] |
| Microsoft Phi3 | Mini [models, configs] |
| Qwen3 | 0.6B, 1.7B, 4B, 8B, 14B, 32B [models, configs] |
| Qwen2.5 | 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B [models, configs] |
| Qwen2 | 0.5B, 1.5B, 7B [models, configs] |

We're always adding new models, but feel free to file an issue if there's a new one you would like to see in torchtune.
Memory and training speed
Below is an example of the memory requirements and training speed for different Llama 3.1 models.
Note
For ease of comparison, all the below numbers are provided for batch size 2 (without gradient accumulation), a dataset packed to sequence length 2048, and torch compile enabled.
If you are interested in running on different hardware or with different models, check out our documentation on memory optimizations here to find the right setup for you.
| Model | Finetuning Method | Runnable On | Peak Memory per GPU | Tokens/sec * |
|---|---|---|---|---|
| Llama 3.1 8B | Full finetune | 1x 4090 | 18.9 GiB | 1650 |
| Llama 3.1 8B | Full finetune | 1x A6000 | 37.4 GiB | 2579 |
| Llama 3.1 8B | LoRA | 1x 4090 | 16.2 GiB | 3083 |
| Llama 3.1 8B | LoRA | 1x A6000 | 30.3 GiB | 4699 |
| Llama 3.1 8B | QLoRA | 1x 4090 | 7.4 GiB | 2413 |
| Llama 3.1 70B | Full finetune | 8x A100 | 13.9 GiB ** | 1568 |
| Llama 3.1 70B | LoRA | 8x A100 | 27.6 GiB | 3497 |
| Llama 3.1 405B | QLoRA | 8x A100 | 44.8 GB | 653 |

*= Measured over one full training epoch
**= Uses CPU offload with fused optimizer
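Two small calculations can help when reading the table above. This is a rough sketch, not part of torchtune: the helper names are made up for illustration. It assumes that the 405B row really is in decimal gigabytes (the source does not say whether "GB" there is intentional), and that every packed sequence is completely full, so one step processes batch size times sequence length tokens.

```python
GIB = 2**30  # bytes in one gibibyte (binary unit)
GB = 10**9   # bytes in one gigabyte (decimal unit)


def gb_to_gib(gigabytes: float) -> float:
    """Convert decimal gigabytes to gibibytes."""
    return gigabytes * GB / GIB


def seconds_per_step(batch_size: int, seq_len: int, tokens_per_sec: float) -> float:
    """Estimate seconds per training step.

    Assumes fully packed sequences, so each step processes
    batch_size * seq_len tokens (hypothetical helper, not a torchtune API).
    """
    return batch_size * seq_len / tokens_per_sec


if __name__ == "__main__":
    # 405B QLoRA row, if the figure is decimal GB rather than GiB.
    print(f"44.8 GB is about {gb_to_gib(44.8):.2f} GiB")
    # Llama 3.1 8B full finetune on 1x 4090: batch size 2, sequence length 2048.
    print(f"{seconds_per_step(2, 2048, 1650):.2f} s per step at 1650 tokens/sec")
```

The step-time estimate is only meaningful for the single-GPU rows unless you know whether the throughput figure is per GPU or aggregate.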
torchtune exposes a number of levers for memory efficiency and performance. The table below demonstrates the effects of applying some of these techniques sequentially to the Llama 3.2 3B model. Each technique is added on top of the previous one, except for LoRA and QLoRA, which do not use optimizer_in_bwd or the AdamW8bit optimizer.
Baseline uses Recipe=full_finetune_single_device, Model=Llama 3.2 3B, Batch size=2, Max sequence length=4096, Precision=bf16, Hardware=A100
Compared with baseline + Packed Dataset, the final row in the table uses 81.9% less memory and delivers a 284.3% increase in tokens per second.
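To make the percentages concrete, here is a minimal sketch of how such figures are typically computed and what they mean as multipliers. The function names are illustrative, and the numbers in the usage lines are made-up inputs, not measurements from the table.

```python
def percent_reduction(baseline: float, optimized: float) -> float:
    """Percentage by which `optimized` is smaller than `baseline`."""
    return (baseline - optimized) / baseline * 100.0


def percent_increase(baseline: float, optimized: float) -> float:
    """Percentage by which `optimized` is larger than `baseline`."""
    return (optimized - baseline) / baseline * 100.0


def increase_to_multiplier(percent: float) -> float:
    """Turn an 'X% increase' into a plain multiplier of the baseline."""
    return 1.0 + percent / 100.0


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to exercise the formulas.
    print(percent_reduction(10.0, 2.5))      # memory: 10.0 -> 2.5 units
    print(percent_increase(1000.0, 3000.0))  # throughput: 1000 -> 3000 tokens/sec
    # The claim in the text, expressed as a throughput multiplier.
    print(round(increase_to_multiplier(284.3), 3))
```

Read this way, an 81.9% memory reduction leaves 18.1% of the baseline footprint, and a 284.3% throughput increase is a bit under 4x the baseline speed.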
Command to reproduce final row.
tune run lora_finetune_single_device --config llama3_2/3B_qlora_single_device \
  dataset.packed=True \
  compile=True \
  loss=torchtune.modules.loss.CEWithChunkedOutputLoss \
  enable_activation_checkpointing=True \
  optimizer_in_bwd=False \
  enable_activation_offloading=True \
  optimizer=torch.optim.AdamW \
  tokenizer.max_seq_len=4096 \
  gradient_accumulation_steps=1 \
  epochs=1 \
  batch_size=2
torchtune is only tested with the latest stable PyTorch release (currently 2.6.0) as well as the preview nightly version, and leverages torchvision for finetuning multimodal LLMs and torchao for the latest in quantization techniques; you should install these as well.
# Install stable PyTorch, torchvision, torchao stable releases
pip install torch torchvision torchao
pip install torchtune
# Install PyTorch, torchvision, torchao nightlies.
pip install --pre --upgrade torch torchvision torchao --index-url https://download.pytorch.org/whl/nightly/cu126 # full options are cpu/cu118/cu124/cu126/xpu/rocm6.2/rocm6.3/rocm6.4
pip install --pre --upgrade torchtune --extra-index-url https://download.pytorch.org/whl/nightly/cpu
You can also check out our install documentation for more information, including installing torchtune from source.
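Since torchtune depends on matching versions of several packages, it can be handy to list what is actually installed. The following is a small sketch using only the Python standard library; `installed_versions` is a hypothetical helper, and the package names are simply the pip names from the commands above.

```python
from importlib import metadata
from typing import Dict, Optional, Sequence


def installed_versions(
    packages: Sequence[str] = ("torch", "torchvision", "torchao", "torchtune"),
) -> Dict[str, Optional[str]]:
    """Return the installed version of each package, or None if it is missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

This only reports versions; it does not check that they are compatible with each other.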
To confirm that the package is installed correctly, you can run the following command:
tune --help
You should see the following output:
usage: tune [-h] {ls,cp,download,run,validate} ...

Welcome to the torchtune CLI!

options:
  -h, --help  show this help message and exit

...
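If you want to run the same check from a script (for example in a setup or CI step), something like the following sketch could work. `tune_help_works` is a hypothetical helper; it assumes the help text starts with "usage:" as in the output shown above.

```python
import shutil
import subprocess
from typing import Optional


def tune_help_works(executable: str = "tune") -> Optional[bool]:
    """Check that the CLI is on PATH and that `--help` exits cleanly.

    Returns None if the executable cannot be found, otherwise True or False.
    """
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run([path, "--help"], capture_output=True, text=True)
    return result.returncode == 0 and "usage:" in result.stdout


if __name__ == "__main__":
    status = tune_help_works()
    if status is None:
        print("tune was not found on PATH")
    else:
        print("tune --help succeeded" if status else "tune --help failed")
```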
To get started with torchtune, see our First Finetune Tutorial. Our End-to-End Workflow Tutorial will show you how to evaluate, quantize, and run inference with a Llama model. The rest of this section will provide a quick overview of these steps with Llama3.1.
Follow the instructions on the official meta-llama repository to ensure you have access to the official Llama model weights. Once you have confirmed access, you can run the following command to download the weights to your local machine. This will also download the tokenizer model and a responsible use guide.
To download Llama3.1, you can run:
tune download meta-llama/Meta-Llama-3.1-8B-Instruct \
  --output-dir /tmp/Meta-Llama-3.1-8B-Instruct \
  --ignore-patterns "original/consolidated.00.pth" \
  --hf-token <HF_TOKEN>
Running finetuning recipes
You can finetune Llama3.1 8B with LoRA on a single GPU using the following command:
tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device
For distributed training, the tune CLI integrates with torchrun. To run a full finetune of Llama3.1 8B on two GPUs:
tune run --nproc_per_node 2 full_finetune_distributed --config llama3_1/8B_full
Tip
Make sure to place any torchrun commands before the recipe specification. Any CLI args after this will override the config and not impact distributed training.
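The argument ordering described in the tip can be sketched as a small command builder. This is an illustrative helper, not a torchtune API: `build_tune_run` is a made-up name, and it only uses the flags that appear in the commands above (`--nproc_per_node` and `--config`). It builds the argument list without running anything.

```python
import shlex
from typing import List, Optional, Sequence


def build_tune_run(
    recipe: str,
    config: str,
    nproc_per_node: Optional[int] = None,
    overrides: Sequence[str] = (),
) -> List[str]:
    """Assemble a `tune run` command with arguments in the required order."""
    cmd = ["tune", "run"]
    # torchrun options must come before the recipe name.
    if nproc_per_node is not None:
        cmd += ["--nproc_per_node", str(nproc_per_node)]
    cmd += [recipe, "--config", config]
    # Anything after the config is treated as a key=value config override.
    cmd += list(overrides)
    return cmd


if __name__ == "__main__":
    cmd = build_tune_run(
        "full_finetune_distributed",
        "llama3_1/8B_full",
        nproc_per_node=2,
        overrides=["batch_size=8"],
    )
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` if you want to launch the recipe from Python.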
There are two ways in which you can modify configs:
Config Overrides
You can directly overwrite config fields from the command line:
tune run lora_finetune_single_device \
  --config llama2/7B_lora_single_device \
  batch_size=8 \
  enable_activation_checkpointing=True \
  max_steps_per_epoch=128
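Conceptually, each key=value override replaces one field of the loaded config, with dotted keys (such as dataset.packed=True elsewhere in this document) addressing nested fields. The sketch below illustrates that idea on a plain dictionary. It is not torchtune's actual implementation, the function names are hypothetical, and the value parsing is deliberately simplistic.

```python
import copy
from typing import Any, Dict, List


def parse_value(text: str) -> Any:
    """Best-effort conversion of an override value to bool, int, float, or str."""
    if text in ("True", "False"):
        return text == "True"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Return a copy of `config` with key=value overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        key, sep, raw = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = key.split(".")
        node = result
        # Walk (or create) nested dictionaries for dotted keys.
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = parse_value(raw)
    return result


if __name__ == "__main__":
    # Illustrative config fragment, not a real torchtune config.
    base = {"batch_size": 2, "dataset": {"packed": False}}
    print(apply_overrides(base, ["batch_size=8", "dataset.packed=True"]))
```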
Update a Local Copy
You can also copy the config to your local directory and modify the contents directly:
tune cp llama3_1/8B_full ./my_custom_config.yaml
Copied to ./my_custom_config.yaml
Then, you can run your custom recipe by directing the tune run command to your local files:
tune run full_finetune_distributed --config ./my_custom_config.yaml
Check out tune --help for all possible CLI commands and options. For more information on using and updating configs, take a look at our config deep-dive.
torchtune supports finetuning on a variety of different datasets, including instruct-style, chat-style, preference datasets, and more. If you want to learn more about how to apply these components to finetune on your own custom dataset, please check out the provided links along with our API docs.
torchtune supports finetuning on a variety of devices, including NVIDIA GPU, Intel XPU, AMD ROCm, Apple MPS, and Ascend NPU. If you're interested in running recipes on a custom device, such as Intel XPU, follow the steps below.
Step 1: Refer to the Getting Started on Intel GPU guide to configure your environment.
Step 2: Update device information via either CLI override or config changes. You can directly overwrite config fields from the command line:
tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device device=xpu
Or edit your local copy of configuration files and replace device: cuda with device: xpu.
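If you maintain several local config copies, the device edit can be scripted. The following is a minimal sketch that rewrites a top-level `device:` line in the config text with a regular expression. `set_device` is a hypothetical helper, and it assumes the key sits unindented at the start of a line; a proper YAML parser would be more robust.

```python
import re

# Matches an unindented "device: <value>" line anywhere in the text.
_DEVICE_LINE = re.compile(r"^(device:\s*)\S+", flags=re.MULTILINE)


def set_device(yaml_text: str, device: str) -> str:
    """Return the config text with the top-level device value replaced."""
    new_text, count = _DEVICE_LINE.subn(lambda m: m.group(1) + device, yaml_text)
    if count == 0:
        raise ValueError("no top-level 'device:' key found")
    return new_text


if __name__ == "__main__":
    # Illustrative config fragment; real configs contain many more fields.
    original = "device: cuda\ndtype: bf16\n"
    print(set_device(original, "xpu"))
```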
torchtune focuses on integrating with popular tools and libraries from the ecosystem. These are just a few examples, with more under development:
We really value our community and the contributions made by our wonderful users. We'll use this section to call out some of these contributions. If you'd like to help out as well, please see the CONTRIBUTING guide.
The transformer code in this repository is inspired by the original Llama2 code. We also want to give a huge shout-out to EleutherAI, Hugging Face and Weights & Biases for being wonderful collaborators and for working with us on some of these integrations within torchtune. In addition, we want to acknowledge some other awesome libraries and tools from the ecosystem:
If you find the torchtune library useful, please cite it in your work as below.
@software{torchtune,
  title = {torchtune: PyTorch's finetuning library},
  author = {torchtune maintainers and contributors},
  url = {https://github.com/pytorch/torchtune},
  license = {BSD-3-Clause},
  month = apr,
  year = {2024}
}
torchtune is released under the BSD 3 license. However, you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models.