#14: How LLMs Generate Text: A Peek Inside the Inference Process
In the last few posts, we walked through how large language models (LLMs) are built — from collecting training data to breaking down words into tokens, and finally training the model on massive datasets. Today, we will discuss the inference process, which is the reason LLMs exist.
How do LLMs generate a response to your prompt?
This phase is called inference, and it’s what happens every time you interact with a model like ChatGPT, Claude, or Gemini.
What Is Inference?
Inference is the process of using a trained model to generate text. Unlike training, which involves updating the model’s internal weights by learning from data, inference is all about using what the model has already learned.
When you type a prompt, the model doesn't go back and learn something new¹. It simply tries to predict the most likely next word — or more precisely, the next token — based on the input you gave.
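Concretely, "predicting the next token" means turning the model's raw scores into a probability distribution over its vocabulary and reading off the most likely entry. The tiny vocabulary and scores below are invented purely for illustration; a real model assigns a score to every one of its tens of thousands of tokens.

```python
import numpy as np

# Toy example: pretend a model has scored four candidate next tokens for the
# prompt "The cat sat on". Both the vocabulary and the scores (logits) here
# are made up for illustration only.
vocab = ["the", "a", "my", "quietly"]
logits = np.array([4.2, 3.1, 1.5, -0.3])

# Softmax turns raw scores into a probability distribution.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

for token, p in zip(vocab, probs):
    print(f"{token!r}: {p:.2f}")

# "The most likely next token" is simply the highest-probability entry.
print("prediction:", vocab[int(np.argmax(probs))])
```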
One Token at a Time
This is the core idea: LLMs generate text one token at a time.
Let’s say you enter this prompt:
"The cat sat on"
The model will:
Break that into tokens.
Look at the tokens you've given so far.
Predict the most likely next token — say, "the".
Add that to the input and predict the next one — maybe "mat".
Repeat until it hits a stop condition (like a special "end" token or a length limit).
So the full output might be:
"The cat sat on the mat and purred softly."
This prediction process is fast, but it happens one token at a time — every word you see was chosen based on what came before it.
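To make that loop concrete, here is a minimal greedy version of it in Python. I'm using the small open-source GPT-2 model through Hugging Face's transformers library purely for illustration; the production models mentioned above work the same way in principle, but at a vastly larger scale and with a lot more engineering around them.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Step 1: break the prompt into tokens.
input_ids = tokenizer("The cat sat on", return_tensors="pt").input_ids

max_new_tokens = 10  # the length limit acts as one of our stop conditions
with torch.no_grad():
    for _ in range(max_new_tokens):
        # Steps 2-3: look at all tokens so far and score every possible next token.
        logits = model(input_ids).logits[:, -1, :]
        # Greedy choice: take the single most likely next token.
        next_id = torch.argmax(logits, dim=-1, keepdim=True)
        # Step 4: append it to the input and repeat.
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        # Step 5: stop early if the model emits its special end-of-text token.
        if next_id.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(input_ids[0]))
```

Production systems add batching, caching of intermediate results, and the sampling tricks described next, but the basic append-and-predict loop is the same.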
Sampling: How Models Decide What to Say
If the model always picked the most likely next token, its responses would sound robotic and repetitive. So instead, sampling techniques are used to make outputs more interesting. Here are a few of the decoding methods commonly used to vary a model's responses:
Greedy decoding: Picks the single most probable next token at every step. Fast, but dull.
Top-k sampling: Only considers the top k most likely tokens and samples from them according to their (renormalized) probabilities.
Top-p sampling: Picks from the smallest group of tokens whose combined probability exceeds p (like 90%).
Temperature: A knob that controls randomness.
Low temperature = safe and predictable.
High temperature = creative and risk-taking.
These techniques help balance accuracy and creativity — essential for everything from poetry to customer support.
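To see how these knobs interact, here is a simplified, self-contained sketch of a single sampling step over a made-up set of scores. The function name, the example logits, and the exact cutoff logic are my own simplifications for illustration; real inference stacks implement these ideas with more care.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Pick the index of the next token from raw scores (a toy sketch)."""
    # Temperature: divide logits before softmax. Low values sharpen the
    # distribution (safe, predictable); high values flatten it (more random).
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-k: keep only the k most likely tokens.
    if top_k is not None:
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
        probs /= probs.sum()

    # Top-p (nucleus): keep the smallest set whose cumulative probability >= p.
    if top_p is not None:
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: int(np.searchsorted(cumulative, top_p)) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask

    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Greedy decoding is essentially the limit of very low temperature:
print(sample_next_token([2.0, 1.0, 0.1], temperature=1e-6))   # effectively always 0
# A more "creative" configuration:
print(sample_next_token([2.0, 1.0, 0.1, -1.0], temperature=0.8, top_k=3, top_p=0.9))
```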
LLMs Don’t "Think Ahead"
One key point: LLMs don’t plan an entire sentence or story in advance.
They don’t know how the sentence will end — they just predict the next token based on what's already been written.
This can lead to moments where the model:
Repeats itself
Changes tone or logic mid-sentence
Contradicts something it said earlier
It’s not reasoning in the way humans do². It’s predicting the future — one tiny step at a time.
The Role of the Context Window
Every model has a context window, which is the limit of how much text it can “see” at once. Older models had 2,000–4,000 token windows. Newer ones (like GPT-4.1) can handle 1 million tokens.
This matters because:
If something falls outside the context window, the model forgets it.
Larger windows = more memory = more coherent long outputs.
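As a toy illustration (pretending each word is one token, which real subword tokenizers don't quite do), here is what "falling outside the window" looks like in practice:

```python
# Toy illustration of the context window; not a real tokenizer or model.
CONTEXT_WINDOW = 8  # real models range from a few thousand to around a million tokens

history = "My name is Priya . I like hiking , tea , and long walks on weekends".split()
prompt = history + "What is my name ?".split()

visible = prompt[-CONTEXT_WINDOW:]
print(" ".join(visible))
# prints: "walks on weekends What is my name ?"
# "Priya" has been pushed out of the window, so the model literally cannot
# see the name when it answers.
```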
When Inference Happens
Inference is used every single time you:
Ask ChatGPT a question
Autocomplete an email
Talk to a customer service chatbot
Behind the scenes, it's a giant model making token-by-token predictions using the techniques we just discussed.
Coming Up Next: RLHF
Now that we understand how LLMs generate responses, we can look at how we make those responses more aligned with human values. That’s where Reinforcement Learning from Human Feedback (RLHF) comes in — the technique that helps models be more helpful, harmless, and honest.
Stay tuned for the next post!
¹ This has changed in recent times. Most LLMs now have a search option that allows them to “learn” new information from the web.
² Most LLMs now also have a reasoning option that lets them organize their thoughts.