#13: How LLMs Learn: A Beginner’s Guide to Training

In our last few posts, we explored what large language models (LLMs) are, how their data is collected and curated, and how that data is broken down into tokens. Now comes the question at the heart of it all: How do these models actually learn?
Let’s unpack the training process.
What Is “Training,” Really?
Training in machine learning means learning from historical data so that the model can perform a given task in the real world. In the context of LLMs, that task is predicting what comes next in a sequence of text. It's a bit like teaching a kid to complete your sentences, except that instead of relying on intuition, the model processes enormous amounts of text and makes billions of tiny adjustments to itself along the way.
At its core, the model tries to guess the next word (or rather, the next token), based on the ones before it. When it guesses wrong, it learns from its mistake. And it does this millions of times.
Predicting the Next Token
As we discussed in the Data Collection post, training an LLM starts with a vast amount of internet text containing billions of sentences. The goal of training is to adjust the model's weights and biases so that, when shown part of a sentence, the model can complete it on its own. Here's how a single step works:
The model is shown a sequence of tokens (like: "The cat sat on the").
Its job is to predict the next token (like: "mat") depending on the context.
If it predicts correctly, great! If not, it measures how wrong it was using a loss function: a score that quantifies the gap between the predicted token and the true one. The goal of training is to make that score as small as possible.
Based on that score, it makes adjustments to its internal wiring (specifically, its weights and biases) to do better next time.
This cycle repeats billions of times across vast datasets.
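To make this loop concrete, here is a minimal sketch of a single training step in PyTorch. The tiny model, the token IDs, and the sentence they stand for are all made up for illustration; a real LLM is a transformer with billions of parameters, but the four moves (guess, score, backpropagate, adjust) are the same.

```python
# A toy next-token training step. "TinyLM" and the token IDs are illustrative
# stand-ins, not a real LLM or a real tokenizer.
import torch
import torch.nn as nn

vocab_size, embed_dim = 50_000, 64

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.head = nn.Linear(embed_dim, vocab_size)   # a score for every token in the vocabulary

    def forward(self, token_ids):
        return self.head(self.embed(token_ids))        # (batch, seq_len, vocab_size) logits

model = TinyLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()                         # the "how wrong was I?" score

# Pretend these IDs encode "The cat sat on the" with "mat" as the final answer:
# at every position, the target is simply the next token in the text.
inputs  = torch.tensor([[11, 42, 97, 23, 11]])          # "The cat sat on the"
targets = torch.tensor([[42, 97, 23, 11, 77]])          # "cat sat on the mat"

logits = model(inputs)                                          # 1. guess the next token at each position
loss = loss_fn(logits.view(-1, vocab_size), targets.view(-1))   # 2. measure how wrong the guesses were
loss.backward()                                                 # 3. trace the error back through the network
optimizer.step()                                                # 4. nudge the weights and biases
optimizer.zero_grad()                                           # 5. reset, ready for the next batch
```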
Behind the Scenes: Gradient Descent and Backpropagation
The model adjusts itself using an algorithm called gradient descent, which is basically a way of nudging the model’s settings in the right direction, one small step at a time. Imagine trying to roll a ball down a hill to find the lowest point — that’s gradient descent. Each step tries to lower the "loss."
To figure out which knobs to tweak, it uses backpropagation, which traces the error backward through the network and says, “Here’s how much each layer contributed to the mistake.”
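If the hill analogy feels abstract, here is a tiny self-contained sketch of gradient descent on a made-up one-dimensional loss with a single knob. Real models do exactly this, just simultaneously across billions of knobs, with backpropagation supplying the slope for each one.

```python
# Gradient descent on a single "knob" x, with a toy loss of (x - 3)**2.
# The lowest point of this hill is at x = 3; watch the ball roll toward it.
x = 0.0                  # starting position of the ball
learning_rate = 0.1      # how big each step is

for step in range(50):
    loss = (x - 3) ** 2                    # how high up the hill we are
    gradient = 2 * (x - 3)                 # slope of the hill at our current position
    x = x - learning_rate * gradient       # take a small step downhill

print(round(x, 3))   # prints a value very close to 3.0, the bottom of the hill
```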
You don’t need to know the calculus; just know that these tools help the model improve, one guess at a time. This YouTube series from Andrej Karpathy is an exceptional tutorial on how training is done, and I would highly recommend watching it.
Data Drives Everything
In a previous post, we talked about how important it is to have good data. Here’s why: during training, the model sees the data over and over again — sometimes for multiple epochs (passes through the dataset).
The better the data, the better the model’s understanding of grammar, facts, reasoning, and even tone. On the flip side, low-quality or biased data can lead to weird or even harmful outputs. That’s why data curation is such a critical (and human-led) step.
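For a feel of what an epoch is, here is a tiny illustration. The token IDs and the sentences they supposedly encode are made up; the point is simply that the same data gets replayed several times.

```python
# One "epoch" = one complete pass over the training data.
# These token-ID sequences are invented for illustration.
dataset = [
    [11, 42, 97, 23, 11, 77],   # pretend: "The cat sat on the mat"
    [11, 19, 88, 23, 11, 54],   # pretend: "The dog lay by the door"
]

num_epochs = 3                  # the model will see every sequence three times
for epoch in range(num_epochs):
    for sequence in dataset:
        inputs, targets = sequence[:-1], sequence[1:]   # predict each next token
        # ... run one training step on (inputs, targets), as sketched earlier ...
    print(f"finished epoch {epoch + 1}")
```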
Scale and Compute: Why Training Is Expensive
Training a model like GPT-4o or GPT-4.1 isn’t just about having a smart algorithm; it also requires staggering amounts of computing power.
This includes:
Terabytes of text
Hundreds of billions of parameters
Hundreds of thousands of GPUs running for weeks
Energy bills that rival small cities
This is why LLMs are often trained by well-funded companies or research labs. It’s not a weekend project — it’s a massive operation.
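To see why, here is a rough back-of-the-envelope estimate. It uses the common rule of thumb of about 6 floating-point operations per parameter per token; the parameter count, token count, and GPU throughput below are illustrative assumptions, not figures for any real model.

```python
# Rough training-compute estimate using the ~6 FLOPs per parameter per token heuristic.
# All numbers below are hypothetical, chosen only to show the order of magnitude.
params = 100e9            # 100 billion parameters (assumed)
tokens = 1e12             # 1 trillion training tokens (assumed)

total_flops = 6 * params * tokens          # ~6e23 floating-point operations
gpu_flops_per_sec = 300e12                 # assume ~300 TFLOP/s sustained per GPU
gpu_seconds = total_flops / gpu_flops_per_sec
gpu_years = gpu_seconds / (3600 * 24 * 365)

print(f"{gpu_years:.0f} GPU-years")        # ~63 GPU-years: weeks of wall-clock time across thousands of GPUs
```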
Evaluation and Avoiding Overfitting
As the model trains, we need to make sure it’s not just memorizing the training data. That’s where validation data comes in — a separate set of examples the model hasn't seen before. We use it to check whether the model is learning real patterns or just cramming.
We also save checkpoints — snapshots of the model at various stages — and sometimes stop training early if it starts to overfit (learn the training data too well and perform poorly on new data).
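Here is a toy sketch of how validation and early stopping fit together. The per-epoch losses are made-up numbers standing in for a real training run; in practice they would come from evaluating the model on the held-out validation set.

```python
# Toy early-stopping loop. The validation losses are invented to mimic a run
# that improves for a while and then starts to overfit.
import math

simulated_val_losses = [2.9, 2.4, 2.1, 1.9, 1.85, 1.9, 2.0, 2.1]

best_val_loss = math.inf
patience, bad_epochs = 2, 0    # tolerate 2 bad epochs before giving up

for epoch, val_loss in enumerate(simulated_val_losses):
    if val_loss < best_val_loss:          # still learning real patterns
        best_val_loss = val_loss
        bad_epochs = 0
        print(f"epoch {epoch}: new best validation loss, saving a checkpoint")
    else:                                 # validation loss got worse: likely memorizing
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"epoch {epoch}: stopping early to avoid overfitting")
            break
```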
Pretraining vs. Fine-Tuning
There are two key phases in a model’s life:
Pretraining: This is the massive, general training we’ve been talking about.
Fine-tuning: This happens later, on smaller, more specific datasets (like legal text or medical notes), to make the model better at certain tasks. For example, if we want a chatbot, we take the general-purpose pretrained model and fine-tune it to respond conversationally.
Think of pretraining as the broad education, and fine-tuning as job-specific training. We will discuss fine-tuning in the next blog post.
Wrapping Up
Training is where the magic happens. It’s how a pile of data and math becomes a model that can write poetry, explain science, or help you code. In the next post, we will dig into fine-tuning techniques. Thanks for reading.