Pretraining models made easy
Nanotron is a library for pretraining transformer models. It provides a simple and flexible API to pretrain models on custom datasets, and it is designed to be easy to use, fast, and scalable.
📚 Check out our Ultrascale Playbook - A comprehensive guide to efficiently scale LLM training with Nanotron!
📝 AI-generated docs thanks to DeepWiki
To run the code in this project, first create a Python virtual environment using e.g. uv:
uv venv nanotron --python 3.11 && source nanotron/bin/activate && uv pip install --upgrade pip
Tip
For Hugging Face cluster users, add export UV_LINK_MODE=copy to your .bashrc to suppress cache warnings from uv.
Next, install PyTorch:
uv pip install torch --index-url https://download.pytorch.org/whl/cu124
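If you want to confirm that the CUDA build of PyTorch was picked up before moving on, a quick check like the following (run inside the activated environment) is enough:

import torch

print(torch.__version__)          # should report a +cu124 build
print(torch.version.cuda)         # CUDA toolkit version PyTorch was built against
print(torch.cuda.is_available())  # True if a GPU and a compatible driver are visible
print(torch.cuda.device_count())  # number of GPUs torchrun will be able to use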
Then install the core dependencies (an editable install of this repository from its root pulls them in):
uv pip install -e .
To run the example scripts, install the remaining dependencies as follows:
uv pip install datasets transformers datatrove[io] numba wandb
# Fused kernels
uv pip install ninja triton "flash-attn>=2.5.0" --no-build-isolation
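Optionally, you can verify that the fused-kernel dependencies built and import cleanly (flash-attn in particular is compiled from source, so this catches build problems early):

import flash_attn
import triton

print(flash_attn.__version__)  # expect >= 2.5.0
print(triton.__version__)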
Next, log into your Hugging Face and Weights and Biases accounts as follows:
huggingface-cli login
wandb login
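On clusters or in batch jobs where interactive prompts are inconvenient, you can authenticate programmatically instead. A minimal sketch, assuming you export your own tokens as HF_TOKEN and WANDB_API_KEY environment variables:

import os

import wandb
from huggingface_hub import login

# Both tokens are read from environment variables you set yourself.
login(token=os.environ["HF_TOKEN"])
wandb.login(key=os.environ["WANDB_API_KEY"])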
Finally, check whether your system has Git LFS installed so that you can load and push models/datasets to the Hugging Face Hub:
git-lfs --version
If it isn't installed, run:
sudo apt-get install git-lfs

Training a tiny Llama model
The following command will train a tiny Llama model on a single node of 8 x H100s in about 10 minutes:
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=8 run_train.py --config-file examples/config_tiny_llama.yaml
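Before launching, it can help to check how the 8 GPUs are split across data, tensor, and pipeline parallelism. A minimal sketch, assuming the YAML layout used by examples/config_tiny_llama.yaml with a parallelism section:

import yaml

with open("examples/config_tiny_llama.yaml") as f:
    config = yaml.safe_load(f)

# dp * tp * pp should match the total number of processes launched by torchrun.
print(config["parallelism"])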
The model will be saved in the checkpoints directory as specified in the config file.
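Once training finishes, something like the following lists the saved checkpoint steps (assuming the default layout of one numbered sub-directory per step under checkpoints/):

from pathlib import Path

steps = sorted(int(p.name) for p in Path("checkpoints").iterdir()
               if p.is_dir() and p.name.isdigit())
print(steps)  # use one of these as {checkpoint_number} when running generation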
Note
You can use examples/config_tiny_llama.py to generate your own training config.
For detailed instructions on training your first model, check out our Your First Training guide. For multi-node training with Slurm, see our Multi-Node Training guide.
Run generation from your checkpoint
torchrun --nproc_per_node=1 run_generate.py --ckpt-path checkpoints/{checkpoint_number}/ --tp 1 --pp 1
Increase the value of --tp (tensor parallel) to accelerate generation with multiple GPUs, and use a larger value of --pp (pipeline parallel) for very large models.
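As a rule of thumb (an assumption based on how the model is sharded, not a constraint documented here), the process count passed to torchrun should cover the tensor-parallel x pipeline-parallel grid:

# Hypothetical sizes for illustration.
tp, pp = 2, 1
nproc_per_node = tp * pp  # e.g. 2 processes for --tp 2 --pp 1
print(f"torchrun --nproc_per_node={nproc_per_node} run_generate.py "
      f"--ckpt-path checkpoints/<step>/ --tp {tp} --pp {pp}")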
To debug with VSCode, add the following configuration to your launch.json file:
{ "name": "run_train.py", "type": "python", "request": "launch", "program": "torchrun", // or full path to torchrun by running `which torchrun` "console": "integratedTerminal", "justMyCode": false, "args": [ "--nproc_per_node=2", "run_train.py", "--config-file=examples/config_tiny_llama.yaml", // or use examples/config_tiny_llama.py to generate your own config ], "env": { // "NANOTRON_BENCHMARK": "1", // enable to benchmark your training for a couple of steps "CUDA_DEVICE_MAX_CONNECTIONS": "1", "WANDB_MODE": "disabled", } },
You can find more examples in the /examples directory:
custom-dataloader: Plug a custom dataloader into Nanotron
datatrove: Use the datatrove library to load data
doremi: Use DoReMi to speed up training
mamba: Train an example Mamba model
moe: Train an example Mixture-of-Experts (MoE) model
mup: Use spectral µTransfer to scale up your model
examples/config_tiny_llama_with_s3_upload.yaml: For automatically uploading checkpoints to S3
We're working on adding more examples soon! Feel free to open a PR with your own example. 🚀
We've conducted extensive benchmarking of Nanotron across various model sizes and configurations. The complete benchmark data, configurations, and logs are available in our ultrascale-playbook-data repository.
The diagram above showcases the best configurations we discovered for each model size and node count in nanotron v0.5, highlighting optimal MFU (Model FLOPS Utilization) and memory usage. These represent the most efficient training setups identified through our comprehensive benchmarking process. Stay tuned for even more optimizations coming soon! 🚀
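For context, MFU compares the FLOPs a run actually achieves against the hardware's theoretical peak. A back-of-the-envelope estimate, using the common 6 x parameters FLOPs-per-token approximation for a forward+backward pass (the numbers below are placeholders, not benchmark results):

num_params = 1.0e9            # model parameters
tokens_per_second = 2.5e5     # measured end-to-end training throughput
num_gpus = 8
peak_flops_per_gpu = 989e12   # e.g. H100 BF16 dense peak; check your hardware spec

achieved_flops = 6 * num_params * tokens_per_second
mfu = achieved_flops / (num_gpus * peak_flops_per_gpu)
print(f"MFU ~ {mfu:.1%}")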
For detailed analysis and best practices derived from these benchmarks, see our Ultrascale Playbook.
We currently support the following features:
And we have on our roadmap:
torch.compile support

We would like to thank everyone working on LLMs, especially those sharing their work openly, from which we took great inspiration: Nvidia for Megatron-LM/apex, Microsoft for DeepSpeed, HazyResearch for flash-attn.