Let's reproduce GPT-2 (124M) in llm.c (~4,000 lines of C/CUDA) in 90 minutes for $20. The 124M model is the smallest model in the GPT-2 series released by OpenAI in 2019, and it is actually quite accessible today, even for the GPU poor. llm.c is quite efficient, at up to ~60% model flops utilization, so reproducing this model on one 8X A100 80GB SXM node takes ~90 minutes. For example, on Lambda this node goes for ~$14/hr, so the total cost of reproducing this model today is about $20. You can train the model with a single GPU too; it would just take proportionally longer (e.g. ~4-24 hours depending on the GPU). In addition, llm.c still has a lot of pending optimizations and people haven't tried to tune the training in the style of cramming, so I'd say we're likely to see significant improvements on this number. So here is the run, training the 12-layer, 12-head, 768-dimensional, 124M-parameter Transformer on 10 billion tokens of FineWeb:
The left pane shows that we outperform the checkpoint released by OpenAI on the withheld FineWeb validation set. This is not the ideal metric, because the data distribution of GPT-2 was different (it was trained on the never-released "WebText" dataset) and the statistics of the internet may have been different 5 years ago, so it's not a super fair comparison. Therefore, on the right we also plot the HellaSwag accuracy, a benchmark commonly used to assess LLM capability that is nice, smooth, and well-behaved. I'd mostly look at HellaSwag, but FineWeb val is a nice confirmation. That said, HellaSwag has no math/code, so it slightly favors our setting (common crawl-like data). One more point of reference: the GPT-3 paper (Appendix H) cites a HellaSwag accuracy of 33.7 for the GPT-3 Small (124M) model. We get to 29.9 here, which surpasses GPT-2 (124M) at 29.4. Keep in mind that here we trained for 10B tokens, while the GPT-3 models were all trained for 300B tokens.
Now here is the shortest path to reproducing this result yourself. You'll need a GPU. I like and run my work on Lambda Labs (who graciously sponsor llm.c development), though the inventory can be limited at times. Many other providers exist, and you can use the Discussion below for tips and tricks around this. Here is the example process for Linux x86_64 Ubuntu 22.04 with CUDA 12 (this is somewhere around the current, default "modern" configuration). If you're on a different system, the comments and discussion in the main README file might be helpful.
```bash
# install miniconda
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init bash
source ~/.bashrc
# pytorch nightly (optional) https://pytorch.org/get-started/locally/
# conda install --yes pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch-nightly -c nvidia
# pip installs so we can tokenize the FineWeb dataset
yes | pip install tqdm tiktoken requests datasets
# install cudnn so we can use FlashAttention and run fast (optional)
# https://developer.nvidia.com/cudnn-downloads
# for me, CUDA 12 (run `nvcc --version`) running on Linux x86_64 Ubuntu 22.04
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install libcudnn9-dev-cuda-12
# "install" cudnn-frontend to ~/
git clone https://github.com/NVIDIA/cudnn-frontend.git
# install MPI (optional, if you intend to use multiple GPUs)
sudo apt install openmpi-bin openmpi-doc libopenmpi-dev
# tokenize the FineWeb dataset 10B tokens sample (takes ~1 hour, get lunch?)
# writes ~19GB of raw GPT-2 tokens to dev/data/fineweb10B
# and ~46GB in ~/.cache/huggingface/datasets/HuggingFaceFW___fineweb
git clone https://github.com/karpathy/llm.c.git
cd llm.c
python dev/data/fineweb.py --version 10B
# compile llm.c (mixed precision, with cuDNN flash-attention)
# first compilation is ~1 minute, mostly due to cuDNN
make train_gpt2cu USE_CUDNN=1
# train on a single GPU
./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M \
    -e "d12" \
    -b 64 -t 1024 \
    -d 524288 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -v 250 -s 20000 \
    -h 1
# if you have multiple GPUs (e.g. 8), simply prepend the mpi command, e.g.:
# mpirun -np 8 ./train_gpt2cu \ ... (the rest of the args are same)
```
Args guide. A lot of these hyperparameters follow the GPT-3 paper instead of the GPT-2 paper, because it was a lot more detailed. Args explanation:
- `-i -j` are the training and validation splits (token files), written by `fineweb.py`.
- `-o` is the output directory to write logs and checkpoints into.
- `-e "d12"` asks to initialize a depth-12 GPT-2 model from scratch.
- `-b 64` sets the micro-batch size to 64. If you are running out of memory, decrease this value, e.g. try 32, 16, 8, all the way down to 1 potentially.
- `-t 1024` sets the maximum sequence length to 1024, as GPT-2 did.
- `-d 524288` requests that the total batch size per single update be ~0.5M tokens. The code will take this desired batch size and calculate the needed gradient accumulation "inner loop" steps of the optimization (see the small sketch just below). For example on 8 GPUs, at `-b 64` and `-t 1024`, every micro-step is doing exactly 8 x 64 x 1024 = 524,288 tokens, so there is no need for gradient accumulation. But if we only have 1 GPU, the code will set gradient accumulation to 8 and do an inner loop of 8 iterations to add up to this "total batch size" per step. While the batch size used to train GPT-2 is unknown, this ~0.5M number comes from the GPT-3 paper table, for this model size.
- `-r 1` sets the recompute setting to 1, so we re-compute the GeLU activations during the backward pass instead of caching them. This slightly increases the runtime, but saves quite a bit of memory, allowing us to increase the batch size and get a net increase in token throughput.
- `-z 1` turns on ZeRO-1 (i.e. optimizer state sharding) across multiple GPUs. If you're training with more than 1 GPU, this setting is a no-brainer and should basically always be on. On 1 GPU it is a no-op.
- `-c 0.1` sets the weight decay to 0.1. Only (2D) weights are decayed, exactly as in GPT-2, and this number comes from the GPT-3 paper.
- `-l 0.0006` sets the maximum learning rate, from the GPT-3 paper.
- `-q 0.0` says that we will decay the learning rate to 0 over the course of training.
- `-u 700` says that we will ramp up the learning rate from 0 to the max learning rate over the first 700 iterations, which at a total batch size of 0.5M is 350M tokens, following the GPT-3 paper.
- `-n 5000` asks to save model checkpoints every 5000 steps.
- `-v 250` asks to evaluate and log the validation loss every 250 steps.
- `-s 20000` asks to sample some tokens every 20000 steps. Because the total number of steps will be less than this (see below), this effectively turns generation off and we only sample a single time at the very end.
- `-h 1` asks to evaluate the HellaSwag accuracy, something we can compare across papers.
- `-x` is not given, so it defaults to exactly one epoch over the training data, i.e. 10B tokens. Because the total batch size is ~0.5M and the total number of tokens is 10B, there will be a total of ~10B/0.5M = 20K steps.

There's a lot of detail above but the TLDR is that we're training a 12-layer GPT-2 (124M), from scratch, on 10B tokens of FineWeb, with a max sequence length of 1024 tokens. If you are running out of memory, I would first make sure you have `-r 1` turned on, and then I would start decreasing the batch size `-b` by dividing it by 2 until the run fits. Once it runs, I'd see if you can get away with switching back to `-r 0` to recover a little bit of speed.
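As a small illustration of how `-d`, `-b`, `-t` and the number of GPUs interact, here is a minimal Python sketch of that gradient accumulation arithmetic. It just mirrors the description in the `-d` bullet above; it is not llm.c code, and the function name is made up for illustration:

```python
# Sketch of the gradient-accumulation arithmetic described above (illustration only, not llm.c code).
def grad_accum_steps(total_batch_size, micro_batch, seq_len, num_gpus):
    tokens_per_micro_step = micro_batch * seq_len * num_gpus  # tokens processed per forward/backward
    assert total_batch_size % tokens_per_micro_step == 0, "total batch size must divide evenly"
    return total_batch_size // tokens_per_micro_step

print(grad_accum_steps(524288, 64, 1024, 8))  # -> 1, no gradient accumulation needed on 8 GPUs
print(grad_accum_steps(524288, 64, 1024, 1))  # -> 8, inner loop of 8 micro-steps on a single GPU
```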
Training. The code will print something like this over time (this example is from a single A100 40GB PCIe GPU, $1.29/hr):

```
step 80/18865 | train loss 7.577051 | norm 1.1461 | lr 6.86e-05 | 2950.68 ms | 49.0% A100 fp16 MFU | 177968 tok/s
step 81/18865 | train loss 7.540626 | norm 1.4001 | lr 6.94e-05 | 2952.59 ms | 49.0% A100 fp16 MFU | 177948 tok/s
step 82/18865 | train loss 7.465753 | norm 1.0613 | lr 7.03e-05 | 2953.98 ms | 48.9% A100 fp16 MFU | 177924 tok/s
step 83/18865 | train loss 7.472681 | norm 1.1553 | lr 7.11e-05 | 2955.67 ms | 48.9% A100 fp16 MFU | 177897 tok/s
```
What is going on? Well, we have 10B training tokens and our batch size is ~0.5M, so we'd expect about 10B/0.5M ~= 20K steps in total. It actually works out to exactly 18,865 because one of the data shards is reserved for validation data and the exact batch size is 524,288 (a nice power of 2). So here we are on step 80/18865, and this step took 2950.68 ms. MFU is short for "Model Flops Utilization". The A100 claims to offer 312 TFLOPS, but in practice this is very hard to achieve because the training is memory-bound and we can't feed the TensorCores that do the matrix multiplies fast enough. On this A100 40GB PCIe GPU, when we count up the FLOPs we're doing and divide by time, we're roughly at half of the theoretical peak, which is quite good. If you use the A100 80GB SXM, with higher memory bandwidth and max thermal design power, this goes up to ~60%. (If you use a GPU that is not an A100, ignore this number, because it is in units of A100 fp16 FLOPS.) We also see that the token throughput we are achieving is about 178K tok/s. Next, our current loss is 7.577. The lower this is, the better our model is at predicting the next token in the sequence on average. Step 80 is very early in the training. Because the perplexity is exp(7.577) ~= 2K, our model is on average as confused about each next token as if it were guessing at random from about 2,000 tokens; the full vocab size is 50,257. By the end of the optimization we'll get to a loss of about 3.29, as if we were guessing uniformly at random from exp(3.29) ~= 27 tokens at each time step. Finally, we see the gradient norm is 1.1461. When this number spikes, the gradient is exploding and that is very bad. To mitigate gradient explosions, as is standard, llm.c uses gradient clipping at 1.0: if the gradient norm exceeds 1.0 (as it does in this time step) we forcefully scale it down so that its norm is at most 1.0. Later in the optimization, the gradient norm usually "calms down" to lower values.
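To make the MFU and perplexity numbers concrete, here is a rough sketch of the arithmetic. The 6N FLOPs-per-token rule of thumb used below is only an approximation of the model's true FLOPs, which is presumably why the MFU reported in the log above is somewhat higher than this estimate:

```python
import math

# Rough MFU estimate from the log line above, using the ~6*N FLOPs-per-token rule of thumb.
N = 124e6                    # model parameters
tok_per_sec = 177968         # "tok/s" column from the log
achieved_flops = 6 * N * tok_per_sec
peak_flops = 312e12          # A100 fp16/bf16 tensor core peak
print(f"MFU ~= {achieved_flops / peak_flops:.1%}")  # ~42% with this shortcut; the log reports ~49%

# Perplexity: how many tokens the model is effectively guessing among, uniformly at random.
print(math.exp(7.577))  # ~1950 at step 80
print(math.exp(3.29))   # ~27 by the end of training
```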
Visualization. Finally, you'll want to make pretty charts like the one I posted up above. For that, our program is printing some very rudimentary logs to an improvised `log124M/main.log` file. I have attached an example Jupyter notebook that parses these files and visualizes them in the style above.
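If you just want a quick plot without the notebook, a few lines of Python are enough. The sketch below parses stdout captured in the format shown above; the `main.log` format is more compact, and the input file name here is hypothetical, so adjust the parsing to whatever you actually saved:

```python
# Minimal sketch: pull the train loss out of captured stdout lines like the ones shown above.
import re
import matplotlib.pyplot as plt

pattern = re.compile(r"step\s+(\d+)/\d+ \| train loss ([\d.]+)")
steps, losses = [], []
with open("train_stdout.txt") as f:  # hypothetical file containing the captured stdout
    for line in f:
        m = pattern.search(line)
        if m:
            steps.append(int(m.group(1)))
            losses.append(float(m.group(2)))

plt.plot(steps, losses)
plt.xlabel("step")
plt.ylabel("train loss")
plt.savefig("loss.png")
```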
Tokenizer. When you're training up above, you'll see a warning that llm.c couldn't find the GPT-2 tokenizer .bin file. That's totally fine for training, but it means that we can't decode - i.e. we can't convert integer tokens that we sample into little string pieces, to create text that we can read. Here is how we can generate it:
```bash
# install pytorch nightly
conda install --yes pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch-nightly -c nvidia
# install huggingface transformers
pip install transformers
# preprocess the TinyShakespeare dataset (very fast, much faster than FineWeb)
python dev/data/tinyshakespeare.py
# run a little training loop in Python/PyTorch
# it saves a lot of .bin files, including the Tokenizer
python train_gpt2.py
```
The Python script is a parallel implementation to llm.c, used for error checking and unit tests (it doesn't have full feature parity). In particular, if we run it as above it will write the file `gpt2_tokenizer.bin`, which the C code can read and use to output nice text during sampling.
Sampling. The code is currently not really intended for inference, but you can hack the code to do inference very inefficiently (without any kv-cache etc.) with something like this:
```bash
make train_gpt2cu USE_CUDNN=1
./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -e "log124M/gpt2_124M_00018865.bin" \
    -b 1 -t 1024 \
    -x 1 \
    -l 0.0 \
    -s 1 -g 256
```
The `-i -j` flags are spurious. The `-e` flag is pointing at the final model checkpoint of our GPT-2 124M model, which llm.c will initialize the model from. The `-b 1` is saying to use only a single batch element (one row of length 1024 tokens in which we sample from left to right). The `-x 1` is saying we only want to run for a single step, and `-l 0.0` is setting the learning rate to zero so we don't actually train the model on this single step. Finally, `-s 1` is saying "sample every step" and `-g 256` is saying to sample 256 tokens.
Now, the above is just unconditional sampling. It's possible to hack the code to do conditional sampling, i.e. sequence completion. E.g. I asked our 124M model to complete the text "The GitHub project llm.c is a", and it continued: "free service to enhance the scholarly infrastructure of the academic community.". I then re-sampled with a different seed and got "The GitHub project llm.c is a collaborative effort that rocks GitHub itself". So, not bad I guess :) I had to directly hack the code by setting `gen_tokens[1:10]` to be the prompt tokens 464, 21722, 1628, 32660, 76, 13, 66, 318, 257 (from tiktokenizer, ty), then hacked the loop index that samples to start at token position 10, ... you get the idea. TLDR: conditional generation is not really supported, but it is in principle possible, and possibly coming soon.
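If you want to reproduce those prompt token ids yourself, the GPT-2 tokenizer is available via the tiktoken package; the expected output below is simply the list of ids quoted above:

```python
# Encode a prompt with the GPT-2 tokenizer, e.g. to paste into a conditional-sampling hack.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode("The GitHub project llm.c is a")
print(tokens)              # [464, 21722, 1628, 32660, 76, 13, 66, 318, 257]
print(enc.decode(tokens))  # round-trips back to the prompt string
```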
Code. 95% of the heavy lifting is in the `train_gpt2.cu` file. It started as a nice clean 1,000 LOC of C code, but has grown quite a bit and is now closer to 3,500 LOC, with 4 supporting files for file I/O utilities, the tokenizer, the dataloader, and random number generation. Roughly speaking, the first 500 LOC are just basic setup of MPI, NCCL, cuDNN, cuBLAS, etc. The next 1,500 LOC are all the layers of the Transformer, with both their forward and backward implementations in efficient CUDA code. All the CUDA kernel development for these happens in `dev/cuda`. So for example there is a `gelu_forward()` and also a `gelu_backward()`, and the same for all the other layers. The next 1,000 LOC are the `gpt2` model, which just strings together the layers and itself has one big `gpt2_forward()` and `gpt2_backward()`. The last 1,000 LOC are `int main()`, which has the main training loop and all the related bookkeeping and argument parsing, plus a lot of tedious code around e.g. resuming training from a previous checkpoint, etc.
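To give a feel for what one of those forward/backward pairs computes, independent of the CUDA details, here is a tiny NumPy sketch of the tanh-approximate GeLU that GPT-2 uses. This is for illustration only; the real `gelu_forward()` / `gelu_backward()` in llm.c are CUDA kernels with their own signatures:

```python
# NumPy sketch of the tanh-approximation GeLU and its backward pass (illustration only).
import numpy as np

S = np.sqrt(2.0 / np.pi)
A = 0.044715

def gelu_forward(x):
    return 0.5 * x * (1.0 + np.tanh(S * (x + A * x**3)))

def gelu_backward(x, dout):
    # derivative of 0.5*x*(1+tanh(u)) with u = S*(x + A*x^3), chained with the upstream gradient
    u = S * (x + A * x**3)
    t = np.tanh(u)
    local = 0.5 * (1.0 + t) + 0.5 * x * (1.0 - t**2) * S * (1.0 + 3.0 * A * x**2)
    return dout * local
```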
350M model. Overnight I also reproduced the 350M parameter model. Take a look at the file scripts/run_gpt2_350M.sh for the exact launch command. I found that 10B tokens was not enough for the 350M model, so you'll have to download and preprocess the FineWeb100B (or try to do multiple epochs on just the 10B above, which might work, I have not checked). I configured it to train for 30B tokens, so we have that:
FLOPs, using the 6ND approximation (total training compute ≈ 6 × parameters × training tokens):
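A rough back-of-the-envelope sketch of that estimate, just multiplying out the parameter and token counts quoted in this post (not something llm.c prints):

```python
# Total training compute via the 6*N*D approximation (N = parameters, D = training tokens).
for name, N, D in [("GPT-2 350M, 30B tokens", 350e6, 30e9),
                   ("GPT-2 124M, 10B tokens", 124e6, 10e9)]:
    print(f"{name}: ~{6 * N * D:.1e} FLOPs")
# -> ~6.3e+19 FLOPs for the 350M run, ~7.4e+18 for the 124M run above
```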
On 8X A100 80GB SXM, the 350M model stepped at 820 ms/iter. It trained for 60K steps (instead of ~20K), for a total of ~30B tokens (instead of ~10B). Total training time was 14 hours, at $14/hr => 14 x 14 ~= $200 (10X the cost of the 124M model). However, looking at the plot, it's possible that we could have gotten away with slightly less.
Coming up. That's it for now! We are moving on to the 740M and then, of course, the actual "GPT-2" 1558M. If I can find the GPUs... By very rough napkin math, on my single 8X A100 80GB GPU box, the 1558M model would take ~1 week and cost ~$2.5K. This is in acceptable territory, but we'll want to take some time to make the current code better, cleaner, better tested, and add multi-node training support. And also very much still on my mind, I want to build the whole thing again, from scratch and piece by piece, coming to you soon^TM.
FAQ: Can I do this training in PyTorch instead of llm.c? The parallel PyTorch implementation in `train_gpt2.py` does not have full feature parity (e.g. it doesn't do sharded data loading, etc.) and is meant more as a reference, but I think you can get something similar to the 124M model above stepping as follows: `torchrun --standalone --nproc_per_node=4 train_gpt2.py --input_bin dev/data/fineweb10B/fineweb_train_000001.bin --write_tensors 0 --model d12 --batch_size 64 --sequence_length 1024 --total_batch_size 524288 --dtype bfloat16 --compile 1 --tensorcores 1 --flash 1 --num_iterations 18865 --weight_decay 0.1 --overfit_single_batch 0`. I am interested in and would accept PRs that bring the PyTorch training closer to feature parity with the llm.c training loop.

Acknowledgements
Call out to @ngc92 and @ademeure who have both made substantial contributions to llm.c across the board and especially on CUDA kernel optimization, @chinthysl and @PeterZhizhin for distributed optimization PRs, and @rosslwheeler for Windows support and tooling.
Please feel free to use the Discussions for any FAQs and related topics, or, if you'd like something faster, the #llmc channel on Discord, or #llmdotc on the CUDA MODE Discord.