Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20 #481

karpathy · 2024-05-28T15:32:20Z

karpathy
May 28, 2024
Maintainer

Let's reproduce the GPT-2 (124M) in llm.c (~4,000 lines of C/CUDA) in 90 minutes for $20. The 124M model is the smallest model in the GPT-2 series released by OpenAI in 2019, and is actually quite accessible today, even for the GPU poor. With llm.c, which is quite efficient at up to ~60% model flops utilization, reproducing this model on one 8X A100 80GB SXM node takes ~90 minutes. For example, on Lambda this node goes for ~$14/hr, so the total cost of reproducing this model today is about $20. You can train the model with a single GPU too, it would just take proportionally longer (e.g. ~4-24 hours depending on the GPU). In addition, llm.c still has a lot of pending optimizations and people haven't tried to tune the training in the style of cramming, so I'd say we're likely to see significant improvements on this number. So here is the run, training the 12-layer, 12-headed, 768-dimension, 124M Transformer on 10 billion tokens of FineWeb:

The left pane shows that we outperform the checkpoint released by OpenAI on the FineWeb withheld validation dataset. This is not the ideal metric because the data distribution of GPT-2 was different (it was trained on the never released "WebText" dataset) and the statistics of the internet may have been different 5 years ago, so it's not a super fair comparison. Therefore, in addition on the right we also plot the HellaSwag accuracy, a benchmark commonly used to assess LLM capability that is nice, smooth, and well-behaved. I'd mostly look at HellaSwag, but FineWeb val is a nice confirmation. That said, HellaSwag has no math/code so it slightly favors our setting (common crawl-like data). One more point of reference is that GPT-3 in Appendix H cites HellaSwag accuracy at 33.7 for GPT-3 Small (124M) model. We get to 29.9 here, which surpasses GPT-2 (124M) at 29.4. Keep in mind that here we trained for 10B tokens, while GPT-3 models were all trained for 300B tokens.

Now here is the shortest path to reproducing this result yourself. You'll need a GPU. I like and run my work on Lambda labs (who graciously sponsors in llm.c development), though the inventory can be limited at times. Many other providers exist and you can use the Discussion below for tips and tricks around this. Here is the example process for a Linux x86 64bit Ubuntu 22.04 with CUDA 12 (this is somewhere around the current, default "modern" configuration). If you're on a different system, the comments and discussion in the main README file might be helpful.

# install miniconda
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init bash
source ~/.bashrc

# pytorch nightly (optional) https://pytorch.org/get-started/locally/
# conda install --yes pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch-nightly -c nvidia

# pip installs so we can tokenize the FineWeb dataset
yes | pip install tqdm tiktoken requests datasets

# install cudnn so we can use FlashAttention and run fast (optional)
# https://developer.nvidia.com/cudnn-downloads
# for me, CUDA 12 (run `nvcc --version`) running on Linux x86_64 Ubuntu 22.04
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install libcudnn9-dev-cuda-12

# "install" cudnn-frontend to ~/
git clone https://github.com/NVIDIA/cudnn-frontend.git

# install MPI (optional, if you intend to use multiple GPUs)
sudo apt install openmpi-bin openmpi-doc libopenmpi-dev

# tokenize the FineWeb dataset 10B tokens sample (takes ~1 hour, get lunch?)
# writes ~19GB of raw GPT-2 tokens to dev/data/fineweb10B
# and ~46GB in ~/.cache/huggingface/datasets/HuggingFaceFW___fineweb
git clone https://github.com/karpathy/llm.c.git
cd llm.c
python dev/data/fineweb.py --version 10B

# compile llm.c (mixed precision, with cuDNN flash-attention)
# first compilation is ~1 minute, mostly due to cuDNN
make train_gpt2cu USE_CUDNN=1

# train on a single GPU
./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M \
    -e "d12" \
    -b 64 -t 1024 \
    -d 524288 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -v 250 -s 20000 \
    -h 1

# if you have multiple GPUs (e.g. 8), simply prepend the mpi command, e.g.:
# mpirun -np 8 ./train_gpt2cu \ ... (the rest of the args are same)

Args guide. A lot of these hyperparameters follow the GPT-3 paper instead of the GPT-2 paper, because it was a lot more detailed. Args explanation:

-i -j are training and validation splits token files, written by fineweb.py
-o is the output directory to write logs and checkpoints into
-e "d12" asks to initialize, a depth 12 GPT-2 model from scratch
-b 64 sets the micro-batch size to 64 . If you are running out of memory, decrease this value, e.g. try 32, 16, 8, all the way down to 1 potentially.
-t 1024 sets the maximum sequence length to 1024, as GPT-2 did
-d 524288 requests that the total batch size per single update be ~0.5M tokens. The code will take this desired batch size and calculate the needed gradient accumulation "inner loop" steps of the optimization. For example on 8 GPUs, at -b 64 and -t 1024, every microbatch is doing exactly 8 X 64 X 1024 = 524288 tokens, so there is no need for gradient accumulation. But if we we only have 1 GPU, then the code will set it to 8, and do an inner loop of 8 iterations to add up to this "total batch size" per step. While the batch size used to train GPT-2 is unknown, this number ~0.5M comes from the GPT-3 paper table, for this model size.
-r 1 sets the recompute setting = 1, so we will re-compute the GeLU activations. This slightly increases the runtime, but saves quite a bit of memory, allowing us to increase the batch size and get a net increase in token throughput.
-z 1 turns on ZeRO-1 (i.e. optimizer state sharding) across multiple GPUs. If you're training with > 1 GPU, this setting is a no-brainer and should basically always be on. On 1 GPU this setting is a no-op.
-c 0.1 sets the weight decay to 0.1. Only (2D) weights are decayed exactly as in GPT-2, and this number comes from the GPT-3 paper
-l 0.0006 sets the maximum learning rate, from GPT-3 paper.
-q 0.0 says that we will decay the learning rate to 0 over the course of training.
-u 700 says that we will ramp up the learning rate from 0 to max learning rate over the first 700 iterations, which at total batch size 0.5M is 350M tokens, following GPT-3 paper.
-n 5000 asks to save model checkpoints every 5000 steps.
-v 250 asks to evaluate and log the validation loss every 250 steps
-s 20000 asks to sample some tokens every 20000 steps. Because the total number of steps will be less than this (see below), this basically turns generation off and we will only basically sample a single time at the very end.
-h 1 asks to evaluate the HellaSwag accuracy, something we can compare across papers.
Because we did not set the maximum number of steps using -x flag, it defaults to exactly one epoch over the training data, i.e. 10B tokens. Because the total batch size is ~0.5M and total number of tokens is 10B, there will be a total of ~ 10B/0.5M = 20K steps.

There's a lot of detail above but the TLDR is that we're training a 12-layer GPT-2 (124M), from scratch, on 10B tokens of FineWeb, with max sequence length of 1024 tokens. If you are running out of memory, I would first make sure you have -r 1 turned on, and then I would start decreasing the batch size -b by dividing it by 2, until the runs. Once it runs, I'd see if you can get away with turning -r 0 back on to recover a little bit of speed.

Training. The code will print something like this over time (this is an example of a single A100 40GB PCIe GPU, $1.29/hr):

step   80/18865 | train loss 7.577051 | norm 1.1461 | lr 6.86e-05 | 2950.68 ms | 49.0% A100 fp16 MFU | 177968 tok/s
step   81/18865 | train loss 7.540626 | norm 1.4001 | lr 6.94e-05 | 2952.59 ms | 49.0% A100 fp16 MFU | 177948 tok/s
step   82/18865 | train loss 7.465753 | norm 1.0613 | lr 7.03e-05 | 2953.98 ms | 48.9% A100 fp16 MFU | 177924 tok/s
step   83/18865 | train loss 7.472681 | norm 1.1553 | lr 7.11e-05 | 2955.67 ms | 48.9% A100 fp16 MFU | 177897 tok/s

What is going on? Well, we have 10B training tokens and our batch size is ~0.5M, so we'd expect about 10B/0.5M ~= 20K steps in total. It actually works out to exactly 18,865 because one of the data shards is reserved for validation data and the exact batch size is a nice power of 2 @ 524,288. So here we are on step 80/18865, which in total took 2950.68ms. MFU is short for "Model Flops Utilization". The A100 claims to offer 312 TFLOPS, but in practice this is very hard to achieve because the training is memory-bound and we can't feed the TensorCores that do the matrix multiplies. On this A100 40GB PCIe GPU, we see that when we count up the FLOPs we're doing and divide by time, we're roughly at half the theoretical, maximum peak FLOPS, which is quite good. If you used the A100 80GB SXM with higher memory bandwidth and max thermal design power, this goes up to ~60%. (If you use a GPU that is not A100, ignore this number because it is in units of A100 fp16 FLOPS). We also see that the token throughput we are achieving is about 178K tok/s. Next, our current loss is 7.577. The lower this is, the better our model is at predicting the next token in the sequence on average. Step 80 is very early in the training here. Because the perplexity is exp(7.577) ~= 2K, our model is as confused about each next token on average, as if it was guessing at random from 2,000 tokens. The full vocab size is 50,257. By the end of the optimization we'll get to about 3.29, so it's as if we're guessing uniformly at random from exp(3.29) ~= 27 tokens at each time step. Finally we see the gradient norm is 1.1461. When this number spikes, the gradient is exploding and this is very bad. To mitigate gradient explosions, as is standard, llm.c uses gradient clipping at 1.0, so if the gradient norm exceeds 1.0 (like in this time step) we forcefully scale it down so it's norm is up to 1.0. Later in the optimization, the gradient norm usually "calms down" to lower values.

Visualization. Finally, you'll want to make pretty charts like the one I posted up above. For that, our program is printing some very rudimentary logs to an improvised log124M/main.log file. I have attached an example Jupyter notebook that parses these files and visualizes them in the style above.

Tokenizer. When you're training up above, you'll see a warning that llm.c couldn't find the GPT-2 tokenizer .bin file. That's totally fine for training, but it means that we can't decode - i.e. we can't convert integer tokens that we sample into little string pieces, to create text that we can read. Here is how we can generate it:

# install pytorch nightly
conda install --yes pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch-nightly -c nvidia

# install huggingface transformers
pip install transformers

# preprocess the TinyShakespeare dataset (very fast, much faster than FineWeb)
python dev/data/tinyshakespeare.py

# run a little training loop in Python/PyTorch
# it saved a lot of .bin files, including the Tokenizer
python train_gpt2.py

The Python script is a parallel implementation to llm.c used for error checking and unit tests (but doesn't have full feature parity). In particular, if we run it like above it will write the file gpt2_tokenizer.bin, which the C code can read and use to output nice text during sampling.

Sampling. The code is currently not really intended for inference, but you can hack the code to do inference very inefficiently (without any kv-cache etc.) with something like this:

make train_gpt2cu USE_CUDNN=1
./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -e "log124M/gpt2_124M_00018865.bin" \
    -b 1 -t 1024 \
    -x 1 \
    -l 0.0 \
    -s 1 -g 256

The -i -j flags are spurious. -e flag is pointing at the final model checkpoint of our GPT-2 124M model, which llm.c will initialize the model from. The -b 1 is saying to use only a single batch element (one row of length 1024 tokens in which we sample from left to right). The -x 1 is saying we only want to run for a single step, and -l 0.0 is setting the learning rate to zero so we don't actually train the model on this single step. Finally -s 1 is saying "sample every step" and -g 256 is saying sample 256 tokens.

Now, the above is just unconditional sampling. It's possible to hack the code to do conditional sampling, i.e. sequence completion. E.g. I asked our 124M model to complete the text "The GitHub project llm.c is a", and it continued: "free service to enhance the scholarly infrastructure of the academic community.". I then re-sampled with a different seed and got "The GitHub project llm.c is a collaborative effort that rocks GitHub itself". So, not bad I guess :) I had to directly hack the code by setting gen_tokens[1:10] to be the prompt tokens 464, 21722, 1628, 32660, 76, 13, 66, 318, 257 (from tiktokenizer ty), then hacked the loop index that samples to start at token position 10, ... you get the idea TLDR conditional generation is not really supported but in principle possible, possibly coming soon.

Code. 95% of the heavy lifting is in the train_gpt2.cu file. It started as a nice clean 1,000 LOC C code, but has grown quite a bit and now it's closer to 3,500 LOC, with 4 supporting files of file I/O utils, tokenizer, dataloader, and random number generation. Roughly speaking, the first 500 LOC are just basic setup of up MPI, NCCL, cuDNN, cuBLAS, etc etc. The next 1,500 LOC are all the layers of the Transformer, and both their forward and backward implementation in efficient CUDA code. All the CUDA kernel development for these files happens in dev/cuda. So for example there is a gelu_forward() and then also a gelu_backward(), and the same way for all the other layers. The next 1,000 LOC are the gpt2 model, which just strings together the layers and itself has one big gpt2_forward() and gpt2_backward(). The last 1,000 LOC are int main(), which has the main training loop and all the related bookkeeping and argument parsing, and a lot of tedious code around e.g. resuming training from a previous checkpoint, etc.

350M model. Overnight I also reproduced the 350M parameter model. Take a look at the file scripts/run_gpt2_350M.sh for the exact launch command. I found that 10B tokens was not enough for the 350M model, so you'll have to download and preprocess the FineWeb100B (or try to do multiple epochs on just the 10B above, which might work, I have not checked). I configured it to train for 30B tokens, so we have that:

FLOPS using 6ND approximation:

124M on 10B tokens => 6 * 124e6 * 10e9 = 7.44e18 ~= 7e18 capability model
350M on 30B tokens => 6 * 350e6 * 31.5e9 = 6.615e19 ~= 7e19 capability model (~10X)

On 8X A100 80GB SXM the 350M stepped at 820ms/iter. Trained for 60K steps (instead of ~20K), for a total of ~30B tokens (instead of ~10B tokens). Total training time 14 hours. Cost $14/hr => 14 X 14 ~= $200 (10X of 124M). However looking at the plot, it's possible that we could have gotten away with slightly less:

Coming up. That's it for now! We are moving on to the 740M and then, of course, the actual "GPT-2" 1558M. If I can find the GPUs... By very rough napkin math, on my single 8X A100 80GB GPU box, the 1558M model would take ~1 week and cost ~$2.5K. This is in acceptable territory, but we'll want to take some time to make the current code better, cleaner, better tested, and add multi-node training support. And also very much still on my mind, I want to build the whole thing again, from scratch and piece by piece, coming to you soon^TM.

FAQ:

Can I sample from it? kind of, but it's inefficient and a bit weird.
Can I chat with it? no, this is currently only pretraining, not chat finetuning.
Can you train multi-node distributed? in principle yes, there is a slurm PR up that got this working for up 50 nodes. In practice I personally haven't tried yet.
Are you bitwise deterministic? No but we are very close, one more kernel to patch.
Can you train in fp8? No, we're currently mostly training in bf16, but coming soon.
I have a non-NVIDIA GPU (AMD, Apple Silicon, etc.) can I run llm.c? No, llm.c supports C/CUDA only, but I am very happy to link to any forks under "notable forks" section, or accept PRs that would make porting llm.c to other platforms easier.
I only have a CPU, can I play? You won't be able to reproduce GPT-2 models, but you can take on fun projects by finetuning OpenAI GPT-2 models on other data, e.g. TinyShakespeare or TinyStories. Support for these datasets, initialization, and CPU finetuning exists in llm.c in train_gpt2.c. (It's a lot more rudimentary though, intended mostly as a reference for the CUDA code).
How does this compare to PyTorch? llm.c is a "straight up" C/CUDA implementation. The PyTorch code at train_gpt2.py does not have full feature parity (e.g. doesn't do sharded data loading, etc.) and is meant to be more as a reference, but I think you can get something similar to the 124M model above stepping as follows: torchrun --standalone --nproc_per_node=4 python train_gpt2.py --input_bin dev/data/fineweb10B/fineweb_train_000001.bin --write_tensors 0 --model d12 --batch_size 64 --sequence_length 1024 --total_batch_size 524288 --dtype bfloat16 --compile 1 --tensorcores 1 --flash 1 --num_iterations 18865 --weight_decay 0.1 --overfit_single_batch 0. I am interested in and would accept PRs that bring the PyTorch training closer up to feature parity to the llm.c training loop.
Why do you care so much about GPT-2? GPT-2 is the grand-daddy of LLMs, the first time that the modern LLM stack came together in a recognizably modern form, and the parameters were released by OpenAI. GPT-3 actually didn't change too much at all about the model (context size 1024 -> 2048, I think that's it?). GPT-4 details were never published. Many other LLMs also strongly resemble GPT-2, despite it being from 2019, e.g. Llama 3 from the architecture perspective is a non-linearity change in the MLP and the addition of the RoPE relative positional encoding.

Acknowledgements
Call out to @ngc92 and @ademeure who have both made substantial contributions to llm.c across the board and especially on CUDA kernel optimization, @chinthysl and @PeterZhizhin for distributed optimization PRs, and @rosslwheeler for Windows support and tooling.

Please feel free to use the Discussions for any FAQ and related, or if you'd like something faster, #llmc on Discord, or #llmdotc on CUDA MODE Discord.

timlmit · 2024-05-28T17:17:29Z

timlmit
May 28, 2024

boss <3

0 replies

karpathy · 2024-05-28T17:45:11Z

karpathy
May 28, 2024
Maintainer Author

Few more pointers:

I answered some questions in HN thread
I answered some question in X thread

0 replies

Niskarsh12 · 2024-05-28T18:33:30Z

Niskarsh12
May 28, 2024

I wanted to try it but sadly I am gpu poor :/

8 replies

timlmit May 30, 2024

rent in cloud

dbmanifest May 30, 2024

rent in cloud

Terrible advice. Take the money you'd spend on the cloud and save until you can afford a GPU rig. Breakeven is less than a year of GPU time and if you're taking ML/AI research seriously, than you'll get your money's worth.

Bezos isnt one of the richest men on earth because he's giving everyone a sick deal on compute.

hotwa Jun 1, 2024

same

Eliah7 Jun 27, 2024

Check this out
#562

Minecon724 Aug 13, 2024

if anybody else doesn't want to pay that much try vast.ai where the exact same machine is 5$/hr right now

(just remember it might not be as reliable or even available as it's a marketplace)

sebhtml · 2024-05-28T18:44:06Z

sebhtml
May 28, 2024

Thank you @karpathy for your valuable teaching lessons in your GitHub repositories.

I cloned llm.c to check how you do the dropout.

I found some random number generation functions that run on the NVIDIA CUDA GPU devices.

Where is the Dropout being performed ?

2 replies

karpathy May 28, 2024
Maintainer Author

There is no Dropout right now. It probably works a bit better to add weak Dropout, e.g. 0.05, but it introduces a layer of complexity around having a train and eval mode, and dealing with all of that is too much headache at this point. We're training very small models on sufficiently large datasets, so I think there is less need for regularization. It still probably helps a bit.

sebhtml May 28, 2024

OK, thank you Mr @karpathy !

sebhtml · 2024-05-28T19:00:05Z

sebhtml
May 28, 2024

Hello Mr @karpathy

I saw MPI_Allgather in https://github.com/karpathy/llm.c/blob/master/train_gpt2.cu#L425

Why is MPI_Allgather used here if all the 8 A100 80GB SXM GPUs are on the same node ?

3 replies

karpathy May 28, 2024
Maintainer Author

We wish to support multinode training imminently.

sebhtml May 28, 2024

Oh this is exciting !

karpathy May 28, 2024
Maintainer Author

👍 There's already a PR up (linked to above in the post), but I haven't gotten around to it yet.

YuchenJin · 2024-05-28T19:23:43Z

YuchenJin
May 28, 2024

There are a few places where train_gpt2cu should be changed to train_gpt2.cu.

11 replies

YuchenJin May 28, 2024

@ngc92 that would be awesome!

ngc92 May 28, 2024

@YuchenJin could you check what name the H100 gets when we print out the device name in the beginning?

YuchenJin May 28, 2024

It prints "NVIDIA H100 80GB HBM3", the full screenshot is attached.

Satyam7166-tech May 29, 2024

so in theory is it possible to do in mps?

Only slightly joking but how long do you think it will take to run on an m4 ultra 192gb ram?

Chetan-svg May 29, 2024

Yeah I got the same result.

banyan-god · 2024-05-28T20:30:31Z

banyan-god
May 28, 2024

Here is a model (500M) i am training it for last few days using llama2 architecture. Hoping to train it to around 200 billion tokens . This is using fineweb 2024 and 4x 4090
https://wandb.ai/banyan-t/llamac/runs/zjaods8q
n_heads:36
n_kv_heads:36
n_layers:36
dim: 970

1 reply

banyan-god May 31, 2024

bprimal22 · 2024-05-28T21:38:05Z

bprimal22
May 28, 2024

What a fucking legend. I'm starting on this tonight!!

0 replies

dbmanifest · 2024-05-28T22:01:25Z

dbmanifest
May 28, 2024

"You can train the model with a single GPU too, it would just take proportionally longer (e.g. ~4-24 hours depending on the GPU)"

Sorry to bother, but whats the oldest/cheapest/weakest GPU that will be able to train this within 24 hours?

4 replies

fernand May 28, 2024

24 hours needs 160+ theoretical BF16 Tensor Core TOPs assuming ~60% MFU, so just over one RTX 3090. 50% MFU is pretty likely for most high memory bandwidth PCIe GPUs.

fernand May 28, 2024

The Titan RTX should likely take <30 hours with FP16 training using loss scaling.

ngc92 May 28, 2024

loss scaling isn't implemented (yet)

anthonix May 29, 2024

Just get one 7900 XTX, it costs less than $1k and will train it in about 24 hours. I'm about 10% of the way through training on 4x 7900 XTX and it's looking good!

KarolDuracz · 2024-05-29T01:50:12Z

KarolDuracz
May 29, 2024

Wow. Nice work. Are you planning to make a video about llm repo and this post as a summary of this work? Regards.

3 replies

karpathy May 29, 2024
Maintainer Author

Yeah maybe I could try that, ok.

YuchenJin May 29, 2024

I thought making a video was always in the plan :)

KarolDuracz May 29, 2024

I know there is still a lot of work before such a video is created. But it would be nice to see somewhere (maybe) at the end the possibilities and differences between models 124M, 350M and with more parameters like 1.5B when you reduce training cost per hour. Because you are probably trying in future to obtain more parameters, such as 175B like GPT-3. This is ~100x more parameters than the 1.5B model. But what's interesting is the gains and differences between <1.5B models and gigant models like 175B parameters. But this is only next issue in the list to do maybe. Regards.

dbmanifest · 2024-05-29T03:26:58Z

dbmanifest
May 29, 2024

I just want to confirm how much VRAM will be needed?

…

On Tue, May 28, 2024 at 6:57 PM Yuchen Jin ***@***.***> wrote: I thought making a video was always in the plan :) — Reply to this email directly, view it on GitHub <#481 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/A34CCZUAROV53MQOJVGFTZDZEUYY3AVCNFSM6AAAAABINF4WSKVHI2DSMVQWIX3LMV43SRDJONRXK43TNFXW4Q3PNVWWK3TUHM4TKOBXGM2DE> . You are receiving this because you commented.Message ID: <karpathy/llm. ***@***.***>

1 reply

YuchenJin May 29, 2024

depending on your batch size, others said 1 RTX 3090 can train the model, just reduce the batch size (and probably adjust learning rate).

dhbrojas · 2024-05-29T04:34:45Z

dhbrojas
May 29, 2024

Running at 4M tokens/s on 8xH100 80GB HBM3! Insane 😱

step 1979/18882 | train loss 1.932283 | norm 0.4582 | lr 5.93e-04 | 132.83 ms | 136.0% A100 fp16 MFU | 3913480 tok/s
step 1980/18882 | train loss 1.985383 | norm 0.5368 | lr 5.93e-04 | 134.43 ms | 134.4% A100 fp16 MFU | 3912810 tok/s
step 1981/18882 | train loss 1.926125 | norm 0.4854 | lr 5.93e-04 | 135.17 ms | 133.7% A100 fp16 MFU | 3911104 tok/s
step 1982/18882 | train loss 1.870695 | norm 0.4489 | lr 5.93e-04 | 132.60 ms | 136.3% A100 fp16 MFU | 3913252 tok/s
step 1983/18882 | train loss 1.862540 | norm 0.4103 | lr 5.93e-04 | 133.60 ms | 135.2% A100 fp16 MFU | 3913810 tok/s

3 replies

karpathy May 29, 2024
Maintainer Author

jelly!!

Abdeldjalil-H May 29, 2024

@rojas-diego How long does it take to complete the training?

dhbrojas May 29, 2024

@Abdeldjalil-H it takes ~40 minutes which corresponds to a 2.1x speedup over 8xA100 🚀 Someone on X gave more details.

michaelshekasta · 2024-05-29T08:16:33Z

michaelshekasta
May 29, 2024

@karpathy
Are you thinking about making a YouTube video to showcase this project? I'd really appreciate it if you could, as I sometimes find it difficult to understand all the details, and a summary would be incredibly helpful.

1 reply

michaelshekasta Jun 10, 2024

@karpathy

Thank you for sharing the awesome video:
https://www.youtube.com/watch?v=l8pRSuU81PU

Have you considered making a new one to explain the llm.c project? It could be really informative.

Devadeut · 2024-05-29T11:59:51Z

Devadeut
May 29, 2024

What kind of controls do we have while training GPT-2 on multi-node?
Maybe a different kind of parallelism, partition for heterogeneous configuration!

2 replies

Devadeut May 29, 2024

@karpathy @YuchenJin help!!

chinthysl May 29, 2024

If you have slurm installed in your multinode setup, git checkout to this PR and give it a try.
Outdated with master branch now. Will be updated later.
If you don't have slurm, try with mpirun -np 16 --host node1:8,node2:8 ./train_gptcu ...
Before that you might need to setup ssh using shared public key between nodes.

Chetan-svg · 2024-05-29T13:54:10Z

Chetan-svg
May 29, 2024

Thank you very much @karpathy.
You don't know how much your work is helping people like me who are curious to learn about AI and wants to make the world a better place.
This democratization of knowledge, I thinks is the only thing which can save our world rather than the concentration of knowledge in the hands of a few greedy elites who can can go to any extent to squeeze a scent.

0 replies

bigsnarfdude · 2024-05-31T21:15:21Z

bigsnarfdude
May 31, 2024

Single 4070 Ti Super trains with Batch_Size = 16 taking 14G VRAM in around 30 hours

0 replies

gavi · 2024-05-31T22:17:02Z

gavi
May 31, 2024

4x A6000 - Around 8 hours - GPT 2 - 124M

0 replies

eliebak · 2024-06-01T01:42:28Z

eliebak
Jun 1, 2024

Test it with a new learning rate schedule ("better" than cosine) and here are the result

Also if you want to see the code it's here #508 :)

0 replies

Devadeut · 2024-06-02T11:50:07Z

Devadeut
Jun 2, 2024

Will running on rtx 3060ti be possible?
Right now I don't have access to GPU, tomorrow I will have. I was getting errors related to nccl.

1 reply

ngc92 Jun 2, 2024

nccl is only required (and useful) for multi-GPU training

Abdeldjalil-H · 2024-06-02T17:09:20Z

Abdeldjalil-H
Jun 2, 2024

How can I use the final checkpoint with pytorch or transformers?

0 replies

scouzi1966 · 2024-06-02T22:32:42Z

scouzi1966
Jun 2, 2024

It took me 19h15min on an RTX 4090. That's not counting the pre-processing of the training set.

0 replies

Devadeut · 2024-06-06T10:47:06Z

Devadeut
Jun 6, 2024

If the model can fit on a single GPU, why are we using ZeRO-DP? Won't it increase communication overhead? This is for future extension, correct?

1 reply

el-tocino Jun 6, 2024

If the model can fit on a single GPU, why are we using ZeRO-DP? Won't it increase communication overhead? This is for future extension, correct?

In his first post he mentions...

-z 1 turns on ZeRO-1 (i.e. optimizer state sharding) across multiple GPUs. If you're training with > 1 GPU, this setting is a no-brainer and should basically always be on. On 1 GPU this setting is a no-op.

linxkon · 2024-06-14T14:52:51Z

linxkon
Jun 14, 2024

great work🐂🍺

0 replies

virdel · 2024-07-09T15:14:10Z

virdel
Jul 9, 2024

YYDS

0 replies

Felix-Zhenghao · 2024-07-19T17:12:58Z

Felix-Zhenghao
Jul 19, 2024

Awesome

0 replies

iminfine · 2024-07-25T21:28:32Z

iminfine
Jul 25, 2024

I had trained a GPT-2-124M model with batch_size 32. However, I have unexpectedly observed that when the evaluation batch size is varied from the size used during training( i.e 8,16,or 64), the evaluation loss also varies. This observation is counterintuitive. The evaluation loss should remain consistent regardless of the evaluation batch size. Does any one know the reason why?

0 replies

BatchLion · 2024-09-08T02:32:24Z

BatchLion
Sep 8, 2024

It seems that all the bias going into the matmul_forward_cublaslt() function are ignored because of beta=0, including l_fcb and l_fcprojb, which should not be zero. Can anyone tell me whether I was wrong or not?

0 replies

torotoki · 2024-10-05T10:31:15Z

torotoki
Oct 5, 2024

@karpathy
I needed to run pip install -r requirements.txt on the llm.c directory before this:

python dev/data/fineweb.py --version 10B

0 replies

T4ras123 · 2024-10-07T15:13:39Z

T4ras123
Oct 7, 2024

this is truly amazing

0 replies

ChenyuanHu · 2024-10-21T13:51:12Z

ChenyuanHu
Oct 21, 2024

I reproduced this training process on my personal computer, which took 44 hours. The setup includes an i7 processor, 64GB of RAM, a single RTX 4080 Super GPU, Windows 11, and Ubuntu 22.04 in WSL. Using the following training parameters, CUDNN was not used.:

make train_gpt2cu
./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M \
    -e "d12" \
    -b 16 -t 1024 \
    -d 524288 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -v 250 -s 20000 \
    -h 1

step 19553/19560 | loss 3.339984 (+1.07z)| norm 0.2221 (+0.08z)| lr 2.68e-10 | 7115.34 ms | 56.7% bf16 MFU | 73666 tok/s
step 19554/19560 | loss 3.297637 (-0.09z)| norm 0.2207 (-0.04z)| lr 1.97e-10 | 7114.66 ms | 56.7% bf16 MFU | 73667 tok/s
step 19555/19560 | loss 3.291130 (-0.28z)| norm 0.2171 (-0.33z)| lr 1.43e-10 | 7115.60 ms | 56.7% bf16 MFU | 73668 tok/s
step 19556/19560 | loss 3.274483 (-0.74z)| norm 0.2166 (-0.37z)| lr 1.07e-10 | 7118.47 ms | 56.6% bf16 MFU | 73667 tok/s
step 19557/19560 | loss 3.255625 (-1.25z)| norm 0.2118 (-0.75z)| lr 7.15e-11 | 7116.38 ms | 56.7% bf16 MFU | 73668 tok/s
step 19558/19560 | loss 3.289978 (-0.30z)| norm 0.2280 (+0.56z)| lr 3.58e-11 | 7114.96 ms | 56.7% bf16 MFU | 73669 tok/s
step 19559/19560 | loss 3.310406 (+0.25z)| norm 0.2179 (-0.25z)| lr 1.79e-11 | 7116.34 ms | 56.7% bf16 MFU | 73669 tok/s
step 19560/19560 | loss 3.283601 (-0.48z)| norm 0.2274 (+0.51z)| lr 0.00e+00 | 7116.36 ms | 56.7% bf16 MFU | 73669 tok/s
val loss 3.258295
HellaSwag: 2997/10042 = 0.298447
generating:
---
The camp was inaugurated on 24th July 2011 for the Duke and Duchess of Cambridge University. Today, the Duke and Duchess of Cambridge High School were rechristened the Duke and Princess of Cambridge High School in memory of the Duke and Duchess of Cambridge. As a chapel they founded the Dreyer Kings Hardenberg
---
Writing checkpoint at step 19560
Writing model to log124M/model_00019560.bin
Writing state to log124M/state_00019560_00000.bin
total average iteration time: 7147.485949 ms

The total power consumption of the system was 400W, and it took 44 hours, resulting in approximately 17.6 kWh.

0 replies

Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20 #481

karpathy May 28, 2024 Maintainer

Replies: 42 comments · 79 replies

karpathy May 28, 2024 Maintainer Author

karpathy May 28, 2024 Maintainer Author

karpathy May 28, 2024 Maintainer Author

karpathy May 28, 2024 Maintainer Author

karpathy May 29, 2024 Maintainer Author

karpathy May 29, 2024 Maintainer Author

karpathy
May 28, 2024
Maintainer

Replies: 42 comments 79 replies

karpathy
May 28, 2024
Maintainer Author

karpathy May 28, 2024
Maintainer Author

karpathy May 28, 2024
Maintainer Author

karpathy May 28, 2024
Maintainer Author

karpathy May 29, 2024
Maintainer Author

karpathy May 29, 2024
Maintainer Author