LLMs Do Not Predict the Next Word
March 2025
Way back in the day, Newton discovered an equation for gravity. Remarkably, this single equation was super simple (high school algebra at most) and yet it could predict not only the elliptical motion of planets and their moons, but even apples falling here on Earth.
Despite the power and simplicity of Newton's equation, there were a few small issues with it. Most famously, Mercury's orbit didn't match predictions. Einstein solved these when he came up with general relativity, which also predicted black holes and gravitational waves.
You've probably heard a phrase something like “LLMs are just statistical models that predict the next word.” Like Newton's gravity, this is a very good approximation of the truth. But there are some deeper layers that are worth looking at.
Today, I want to investigate LLMs through a reinforcement learning lens, treating them as thinking agents rather than advanced autocomplete models. I'll be touching on fundamental ideas like instruction finetuning and reinforcement learning from human feedback. My goal is not a deep dive into these ideas, but to use them to explain to what extent LLMs really do something beyond predicting the next word, taking “actions” of their own “will”. Next I'll compare and contrast this idea with the buzz around AI agents, and finally I'll speculate on what this means for the future of AI agent development.
Why LLMs Predict the Next Word
Before we get into my hot take, I want to briefly talk about the idea that LLMs predict the next word, and why it's mostly true.
LLMs are pretrained with something called the next-token objective. This part really is as simple as just predicting the next token. So if you have a sentence “I do not like green eggs and ham” in your training data, the training examples might be something like:
(I, do), (I do, not), (I do not, like), ..., (I do not like green eggs and, ham)
That is, at each token in the sentence, the model is given everything up to that token and has to predict the next one. In pretraining, the model is given a huge amount of text like that and has to predict what comes next.[]
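As a quick illustration, here's a minimal Python sketch of how a single sentence turns into next-token training pairs, using whole words as stand-ins for tokens:

```python
sentence = "I do not like green eggs and ham"
tokens = sentence.split()  # real models use subword tokens; words keep the sketch simple

# Each example pairs everything seen so far (the context) with the token that follows.
examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in examples:
    print(" ".join(context), "->", target)
# I -> do
# I do -> not
# ...
# I do not like green eggs and -> ham
```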
Mathematically, the model's outputs are judged according to cross-entropy loss, which measures the difference between the model's output probabilities and the real next token. One possible formula in the case of language modeling is $$ \mathcal{L} = -\log p_y $$ where \( p_y \) is the probability given by the language model for the correct next token.[] (This formula is greatly simplified from the general cross-entropy formula, but is still valid in the special case of language modeling.) So if the model gives a probability 1 to the actual next token, the loss is zero. Lower probabilities (if the model thinks some other token is likely to come next) lead to higher losses.
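To make that concrete, here's a small sketch, assuming PyTorch, with a made-up four-token vocabulary and made-up logits, showing that \(-\log p_y\) is exactly what a standard cross-entropy call computes:

```python
import torch
import torch.nn.functional as F

vocab = ["ham", "eggs", "green", "like"]      # hypothetical tiny vocabulary
logits = torch.tensor([2.0, 0.5, -1.0, 0.1])  # model's raw scores for the next token
target = torch.tensor(0)                      # the true next token is "ham"

# -log p_y, where p_y is the softmax probability of the correct token
p = F.softmax(logits, dim=-1)
manual_loss = -torch.log(p[target])

# PyTorch's cross_entropy computes the same value directly from the logits
library_loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))

print(manual_loss.item(), library_loss.item())  # the two values match
```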

So in pretraining, by trying to minimize the loss, we maximize the probability of correctly predicting the next token. This is why LLMs are so good at predicting the next token. In fact, they are even better than humans at this.[]
Instruction Finetuning
But this method alone — training an LLM to predict the next word on a large string of text — is not enough to make a chatbot. For example, if you asked GPT-3 to “Write an article about American football”, instead of writing an article, it might continue the sentence by predicting the most likely next tokens: “Write an article about American football and its influence on television in America.”
This is where instruction finetuning comes in, also known colloquially as instruction tuning.[] It improves zero-shot performance, meaning you can get the model to perform a task just by telling it what to do, without needing to include examples of the task.
The way instruction tuning is actually done is by training on a new dataset of instructions separate from the much larger dataset used in pretraining. FLAN, an early example of instruction tuning, trained on about 250 million tokens during finetuning. In contrast, the pretraining that FLAN built on used 2.49 trillion tokens.[]
The exact format of instruction tuning depends on the model. An example from Llama 3 is:[]
<|start_header_id|>user<|end_header_id|>
Hi! I am a human.<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
Hello there! Nice to meet you! I'm Meta AI, your friendly AI assistant<|eot_id|>
Unlike in pretraining, the model is usually trained only on the completion (labeled assistant in this example), not the rest of the prompt. But other than that, instruction tuning is essentially the same as pretraining, just with a new dataset specialized for prompting.[] The loss function is the same, meaning the model is still being trained to predict the next token.
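A common way to implement “train only on the completion” is to mask the prompt tokens out of the loss. Here's a minimal sketch, assuming PyTorch conventions and made-up token IDs (the usual shift-by-one between inputs and labels is omitted to keep it short):

```python
import torch
import torch.nn.functional as F

IGNORE = -100                       # cross_entropy ignores this label by default
prompt_ids = [17, 42, 5, 9]         # hypothetical tokens for the user turn
completion_ids = [23, 8, 31]        # hypothetical tokens for the assistant turn

input_ids = torch.tensor(prompt_ids + completion_ids)
labels = torch.tensor([IGNORE] * len(prompt_ids) + completion_ids)

# Pretend logits from a model with a 50-token vocabulary, one row per position.
logits = torch.randn(len(input_ids), 50)

# The loss is averaged only over the completion positions; prompt tokens
# contribute nothing to the gradient.
loss = F.cross_entropy(logits, labels, ignore_index=IGNORE)
```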
Reinforcement Learning
So far, the model is just predicting the next token. First it learned to do so on a big dataset (pretraining), then it was fine-tuned on a more specific dataset designed for prompting (instruction finetuning). So is the claim “LLMs just predict the next token” true?
Even up to this point, you could make the argument that something deeper is happening. There's a lot of evidence that in order to predict the next word, models have to store detailed information about the world and its facts in their weights. Predicting the next word might be their loss function, but they have developed a rich internal world in the process. It's like saying humans are just gene-copying machines, ignoring all the complexity that humans have developed as a byproduct of evolution. There's even an argument that human brains are just advanced prediction machines.[]
All that is still assuming that the objective an LLM is trained on really is next-token prediction. But in fact, after pretraining and instruction finetuning, LLMs are trained on an objective that is fundamentally different from next-token prediction. This is where reinforcement learning (RL) comes in.
To be more precise, LLMs are trained with something called reinforcement learning from human feedback (RLHF). It's debatable whether this is really RL (Andrej Karpathy says just barely, Yann LeCun says no), but the point is that it's a very different objective from next-token prediction. This has important implications for what the model is really doing.
At a high level, there are two steps to RLHF:[]
- We let the model produce many different outputs for various prompts. For each prompt, we ask humans to rank the outputs. (That's the human feedback in RLHF.) This is used to train a reward model that predicts which output humans will prefer.
- We use this reward model to train the LLM to produce outputs that humans will like. (That's the reinforcement learning in RLHF.)
I won't go into all the details here (both the original paper and the RLHF book are great resources), but today we are focused on the question: do LLMs really just predict the next word? For this, what matters is the loss functions of each step.
Reward Modeling
The first step is reward modeling, where we train a separate model to predict the reward of an output: essentially, how good humans think the output is. The loss function for the reward model is $$ \mathcal{L}(\theta) = -\log\left( \sigma\left( r_{\theta}(x,y_w) - r_{\theta}(x,y_l) \right) \right) $$ This is simplified from the original formula,[] but captures the core idea. In this formula, our input is a prompt \(x\) along with a pair of outputs \( y_w \) and \( y_l \), where \( y_w \) is the output that the human labeler prefers and \( y_l \) is the one they don't like as much. The function \( r_{\theta} \) is the reward model, which takes an output and returns a score. \( \sigma \) is the sigmoid function.
If we graph the loss as a function of \(r_{\theta}(x,y_w) - r_{\theta}(x,y_l)\) (how much higher the given reward is for the winning output), we get a curve that looks like this:

We can see that as we assign a higher score to the output that humans preferred, the loss goes down, approaching zero as the gap grows. On the other hand, if we assign a lower score to the output that humans actually preferred, the loss goes up.
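You can check this numerically with a few lines of Python (a minimal sketch of the formula above, with hand-picked score gaps):

```python
import math

def reward_loss(delta):
    """Loss -log(sigmoid(delta)), where delta = r(x, y_w) - r(x, y_l)."""
    return -math.log(1.0 / (1.0 + math.exp(-delta)))

for delta in [-4, -2, 0, 2, 4]:
    print(f"delta = {delta:+d}  loss = {reward_loss(delta):.3f}")
# delta = -4  loss = 4.018
# delta = -2  loss = 2.127
# delta = +0  loss = 0.693
# delta = +2  loss = 0.127
# delta = +4  loss = 0.018
```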
By training the reward model on labels provided by humans in this way, we end up with a model that can predict how much humans will like an output.
Proximal Policy Optimization
So now we have this function \(r_{\theta}\) for predicting rewards. That's cool, but what we actually wanted was to train the model. Now we can use the reward model in our loss function for training the actual LLM! At the same time, our LLM is already pretty good with instruction finetuning, so we'll try not to change it too much. This is the idea of proximal policy optimization (PPO).
The new objective function for the LLM is[]
$$ \text{objective}(\phi) = \mathbb{E}_{(x,y) \sim D_{\pi_{\phi}^{\text{RL}}}} \left[ r_{\theta}(x,y) - \beta \log \frac{\pi_{\phi}^{\text{RL}}(y \mid x)}{\pi^{\text{SFT}}(y \mid x)} \right] + \gamma \, \mathbb{E}_{x \sim D_{\text{pretrain}}} \left[ \log \pi_{\phi}^{\text{RL}}(x) \right] $$
This is a lot more complicated than the previous loss functions! I'll try to break it down step by step.
First of all, this is technically an objective function, not a loss function. So we are trying to maximize this function, not minimize it.
Let's start with the first term:
$$ \mathbb{E}_{(x,y) \sim D_{\pi_{\phi}^{\text{RL}}}} \left[ r_{\theta}(x,y) \right] $$
Here, \(x\) is a prompt from the dataset \(D_{\pi_{\phi}^{\text{RL}}}\) that we are using for RL training, and \(y\) is the output that the model produced for that prompt. So far, our objective function is just the expected value, over these prompts and outputs, of the reward model's score \(r_{\theta}(x,y)\). So we are trying to maximize the reward predicted by the reward model that we trained earlier.
Onto the next term:
$$ -\beta \, \mathbb{E}_{(x,y) \sim D_{\pi_{\phi}^{\text{RL}}}} \left[ \log \frac{\pi_{\phi}^{\text{RL}}(y \mid x)}{\pi^{\text{SFT}}(y \mid x)} \right] $$
Again, \(x\) is the prompt and \(y\) is the model output. Now \(\pi_{\phi}^{\text{RL}}(y\mid x)\) is the predicted probabilities of the current model we're training, while \( \pi^{\text{SFT}}(y \mid x) \) is the predicted probabilities of the base model we started from, after pretraining and instruction finetuning.
If we move around some notation, letting \(p = \pi_{\phi}^{\text{RL}}(y \mid x)\) and \(q = \pi^{\text{SFT}}(y \mid x)\), this is also
$$ -\beta \, \mathbb{E}\left[ \log \frac{p}{q} \right] $$
so we are taking the expected value of \(\log (p/q)\), and using that times \(\beta\) as a penalty. This expected value is the Kullback-Leibler divergence (or KL divergence) between the two distributions \(p\) and \(q\), which represents how different the two distributions are. So by applying a penalty on the difference, we are making sure that as we train the model, its output probabilities stay somewhat close to what they were in the base model, when all we had done was pretraining and instruction finetuning.
Finally the last term:
$$ \gamma \, \mathbb{E}_{x \sim D_{\text{pretrain}}} \left[ \log \pi_{\phi}^{\text{RL}}(x) \right] $$
This time, instead of using our RLHF dataset of prompts and outputs, we are going back to the pretraining dataset \(D_{\text{pretrain}}\). In fact, this is just next-token prediction again: maximizing \(\log \pi_{\phi}^{\text{RL}}(x)\) is the same as minimizing the loss \(-\log p_y\) from before, summed over the tokens of \(x\) and scaled by a constant \(\gamma\). We add this term so that as we do RLHF, the model maintains good performance at predicting the next token on the pretraining data.
Let's recap the meaning of the three terms of this objective function:
- We try to maximize the reward given by the reward model we trained before. Hopefully, this means we are making outputs that humans will like.
- We add a penalty for outputting a distribution that goes too far from the base model.
- We mix in some normal next-token prediction on the pretraining data.
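To see how the three terms fit together, here is a minimal PyTorch sketch computing the objective for one batch. All the numbers are made up, and the per-example rewards and log-probabilities are assumed to come from the reward model and the two policies described above:

```python
import torch

# Hypothetical per-example quantities for one small batch:
reward = torch.tensor([1.2, 0.4, 0.9])         # r_theta(x, y) from the reward model
logp_rl = torch.tensor([-35.1, -42.0, -28.7])  # log pi_RL(y | x) under the model being trained
logp_sft = torch.tensor([-34.8, -41.5, -28.9]) # log pi_SFT(y | x) under the frozen base model
logp_pretrain = torch.tensor([-120.3, -98.6])  # log pi_RL(x) on pretraining text

beta, gamma = 0.02, 0.5                        # hypothetical coefficients

objective = (
    reward.mean()                              # term 1: maximize predicted reward
    - beta * (logp_rl - logp_sft).mean()       # term 2: KL-style penalty toward the base model
    + gamma * logp_pretrain.mean()             # term 3: next-token objective on pretraining data
)
loss = -objective                              # maximizing the objective = minimizing its negative
```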
All this is called proximal policy optimization (PPO). Proximal because we are staying close to the base model, and policy optimization because the output probabilities of the model are called the model's policy in reinforcement learning.
Only one of these terms (#3) is directly training the model to predict the next token. Arguably, term #2 is also a proxy for next-token prediction, since we stay close to a base model that was trained on next-token prediction.
But term #1, the RLHF term, is fundamentally different from next-token prediction.
LLMs as Chess Players
Imagine a chess-playing model like AlphaZero. If we ignore the details of tree search, the model takes in a chessboard and outputs a distribution over possible moves (its policy). The model is trained based on the results of games it plays, and over time, it gets better at playing games.[]
This is the essence of reinforcement learning: we have an agent (the chess-playing model) that interprets the environment (the chessboard) and takes actions (moves on the chessboard) that in turn affect the environment. It tries to choose actions that maximize its perceived reward based on the environment. Its actions are expressed as a policy, which is a probability distribution over possible next moves.
Regardless of whether RLHF is truly RL, there is an important analogy we can draw here. Rather than next-token prediction machines, LLMs are agents that interpret their environment (the prompt and output so far) and take actions (the next token) that affect the environment. They try to choose actions that maximize their perceived reward (the reward model, which helps them produce outputs that humans like).

It's important to note that this is a consequence of how we trained the LLM in RLHF. The part that doesn't change, no matter how we train the LLM, is its input space (strings of tokens) and output space (distributions over tokens). We can interpret this in so many different ways:
- During pretraining, an LLM becomes an agent that tries to take actions (next tokens) in order to predict the next token.
- During RLHF, an LLM becomes an agent that tries to take actions (next tokens) to ultimately produce outputs in a way that (indirectly, via a reward model) appeals to human judges.
- During chain-of-thought RL training like in DeepSeek R1,[] an LLM becomes an agent that tries to take actions (next tokens during both reasoning and output) to ultimately produce outputs that are more likely to be correct.
From here, we can imagine LLMs that write code and evaluate it by running it, or even solve math problems and evaluate their solutions with proof assistants.[] On the more sinister side, we can imagine LLMs that are rewarded for spreading misinformation.
In short, to the extent that words matter, LLMs can be thought of as agents that take actions rather than just statistical models, even if those “actions” are words.
Why AI Agents?
If LLMs are already agents, what's the buzz around “AI agents”?
LLMs alone are agents whose actions are limited to producing tokens. By mapping tokens to real-world actions, we can turn the agent-like behavior already baked into LLMs into something even more tangible.
Remember that the reward function in RLHF is a proxy for how much humans like the output of an LLM. So an LLM is already trained to generally follow instructions in a way that is appealing to humans. This can easily be extended to all sorts of actions that an LLM can take.
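As a toy illustration of that mapping, here's a minimal Python sketch where the model's tokens are parsed into a tool call. The action format, the tool names, and the call_llm stand-in are all assumptions for the sake of the example, not any particular framework's API:

```python
import json

def call_llm(prompt: str) -> str:
    # Placeholder for a real model call; imagine the model has been prompted
    # to reply with a JSON action whenever it wants to use a tool.
    return '{"action": "search", "argument": "weather in Paris"}'

TOOLS = {
    "search": lambda query: f"(pretend search results for: {query})",
}

def run_agent_step(prompt: str) -> str:
    output = call_llm(prompt)
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return output                            # plain text: just show it to the user
    tool = TOOLS.get(parsed.get("action"))
    if tool is None:
        return output
    return tool(parsed.get("argument", ""))      # the tokens become a real action

print(run_agent_step("What's the weather in Paris?"))
```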
Does that mean you don't have to finetune an agentic LLM? Not necessarily. As is, the model is mainly trained to appeal to human judges. Code is a clear example: an agent could instead be trained with RL to write code that actually works, rather than just assuming that whatever code the evaluators tend to like is correct. Prompt engineering can help with this, but there's a lot of room to improve through actual training, especially in clear-cut domains with easy feedback loops, like writing code to solve well-defined problems.
Additionally, the proxy goal of pleasing human evaluators is great, but it's not perfect. LLMs can fool humans, creating outputs that seem good when they are actually flawed. In fact, this has already been demonstrated: when training on question answering and code generation, one team found that RLHF made the model produce outputs that evaluators liked better over time, even though it actually got worse at the task itself.[]
It's not easy to find a better technique, though: reinforcement learning of all kinds is very prone to reward hacking, where models learn to exploit the reward function without really improving at the task.
Although RLHF isn't perfect, it is extremely powerful. It's surprising that a “mindless next-token prediction machine” can appear to show intelligence, but if we reframe an LLM as a machine that aims to appeal to humans through producing tokens, this starts to make a lot more sense.
It's still true that next-token prediction is an important part of LLMs, not only in pretraining but even as a component of RLHF. But I hope I was able to explain why there are much deeper layers to how an LLM works, first with RLHF, and second with other kinds of RL like chain-of-thought reasoning.
Fundamentally, an LLM is not a next-token predictor. It's actually something even more basic: a machine that outputs tokens. We can choose whether we train that machine to predict the next token, appeal to human evaluators, write code, or do something else entirely. And we can choose whether we simply display those tokens to a user or use them to call functions and create effects in the real world. It's up to us to make the best choices.
References
- Reinforcement Learning from Human Feedback (Nathan Lambert, 2024)
- CrossEntropyLoss (PyTorch Contributors, 2024)
- Language models are better than humans at next-token prediction (Buck Shlegeris, Fabien Roger, Lawrence Chan, & Euan McLean, 2022)
- Finetuned Language Models Are Zero-Shot Learners (Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, & Quoc V. Le, 2022)
- Training language models to follow instructions with human feedback (Ouyang et al., OpenAI, 2022)
- Fine-tuning Llama3 with Chat Data (torchtune Contributors, 2023)
- My objection(s) to the "LLMs are just next-token predictors" take (Alejandro Tlaie Boria, 2025)
- Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm (David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, & Demis Hassabis, DeepMind, 2017)
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek-AI, 2025)
- AI achieves silver-medal standard solving International Mathematical Olympiad problems (AlphaProof and AlphaGeometry teams, 2024)
- Language Models Learn to Mislead Humans via RLHF (Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R. Bowman, He He, & Shi Feng, 2024)