How AI Models are Trained
by Aldous Gerbrot
In the previous article we looked at how modern AI systems turn language into geometry. Words, sentences, and even whole documents become points in a high‑dimensional space, and most of what the model “does” is clever linear algebra on those points. This time, we zoom out.
Instead of asking what the model’s internal space looks like, we ask how such a model gets built in the first place. How do we go from a pile of raw text to something that can carry on a conversation, write code, and reason through legal hypotheticals without (usually) going off the rails?
The short answer is that today’s large models are not born; they are cooked in several stages, each with its own recipes, safety checks, and failure modes.
Stage 1: Pre‑training – letting the model soak in the world
Every large language model begins life with a massive unsupervised training phase called pre‑training.
The basic idea is simple:
Collect a huge corpus of text (and sometimes images, code, audio) from books, websites, documentation, and other sources.
Hide the next token (the next word or piece of a word) from the model, while showing it everything that came before.
Train a huge neural network to predict that next token from the previous ones.
The task is incredibly dull from the model’s point of view: “given this sequence, what comes next?” But if you do this billions or trillions of times, patterns start to emerge.
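To make “given this sequence, what comes next?” concrete, here is a toy version in Python. A real pre‑training run fits a deep neural network to trillions of tokens; this count‑based bigram model is a deliberately tiny stand‑in that exposes the same core idea: turn contexts into probability distributions over the next token.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Build P(next token | current token) from a list of token sequences."""
    counts = defaultdict(Counter)
    for tokens in corpus:
        for cur, nxt in zip(tokens, tokens[1:]):
            counts[cur][nxt] += 1  # count which token follows which
    # Normalize the counts into probability distributions.
    model = {}
    for cur, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        model[cur] = {tok: c / total for tok, c in nxt_counts.items()}
    return model

corpus = [
    "the cat sat on the mat".split(),
    "the cat ate the fish".split(),
]
model = train_bigram(corpus)
print(model["the"])  # distribution over tokens that followed "the"
```

A large language model does the same job with a context of thousands of tokens instead of one, and with a learned network instead of a lookup table, but the output is still a probability distribution over what comes next.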
During pre‑training, the model learns:
The statistical structure of language (grammar, idioms, rhetorical patterns).
Factual associations (Paris is in France, water boils at 100 °C, basic physics).
Styles and registers (formal vs. informal, technical vs. conversational).
Crucially, this stage is not interactive. The model does not learn from your questions; it learns from the pre‑training corpus and nothing else. When people say a model has “175 billion parameters,” they are describing the size of this pre‑trained network: a gigantic function mapping strings of tokens to probability distributions over next tokens.
On its own, a pre‑trained model is powerful but awkward. It can generate plausible text, but it is not yet a cooperative assistant. It will happily ramble, contradict itself, or ignore instructions.
That’s where the next stage comes in.
Stage 2: Fine‑tuning – teaching the model to follow instructions
Once the base model has absorbed the raw statistical patterns of language, developers fine‑tune it on more structured data.
Fine‑tuning looks like this:
Collect a dataset of example prompts and good responses:
“Explain quantum entanglement in simple terms.” → a clear, short explanation.
“Translate this sentence into Spanish.” → a correct translation.
“Write a polite email declining a meeting.” → an appropriate email.
Train the model further so that, given the prompt, it is more likely to produce the desired kind of answer.
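The prompt‑and‑response setup above can be sketched in code. One common convention (the `-100` “ignore” label follows the practice of libraries such as PyTorch and Hugging Face Transformers, though everything else here is illustrative) is to concatenate prompt and response and mask the prompt out of the loss, so the model is only graded on producing the answer:

```python
IGNORE = -100  # conventional "compute no loss here" label

def build_example(prompt_ids, response_ids):
    """Concatenate prompt and response; mask prompt positions in the labels."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE] * len(prompt_ids) + response_ids
    return input_ids, labels

prompt = [101, 102, 103]   # stand-in token ids for the instruction
response = [201, 202]      # stand-in token ids for the desired answer
inputs, labels = build_example(prompt, response)
print(inputs)  # [101, 102, 103, 201, 202]
print(labels)  # [-100, -100, -100, 201, 202]
```

The effect is that gradient updates push the model toward producing good answers, rather than toward imitating users’ prompts.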
Fine‑tuning repurposes the base model from “autocomplete everything” to “behave like a helpful tool.” It makes the model:
More responsive to explicit instructions.
Better at staying on topic.
Less likely to drift into free association.
You can think of pre‑training as giving the model knowledge of the language and the world, and fine‑tuning as giving it a role.
Stage 3: RLHF – aligning the model with human preferences
Fine‑tuning gets you partway. But “one good answer” is not the same as “the best answer according to human judgment.” This is where Reinforcement Learning from Human Feedback (RLHF) or, increasingly, from AI feedback (RLAIF) comes in.
RLHF has three main steps:
1. Generate candidate answers.
Take many prompts. For each, ask the model to produce several different responses.
2. Have humans (or judge models) rank them.
For each set of responses, label which is better, which is worse, and why.
Criteria include helpfulness, accuracy, safety, and tone.
3. Train a reward model and improve the base model.
A smaller network (the reward model) learns to predict the human rankings.
The main model is then trained with reinforcement learning to produce outputs that score higher according to this reward signal, while not straying too far from its original behavior.
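Step 3 can be sketched with the pairwise objective commonly used to train reward models: score the human‑preferred answer higher than the rejected one. The scores below are made‑up numbers standing in for the reward model’s outputs.

```python
import math

def pairwise_loss(score_chosen, score_rejected):
    """-log sigmoid(chosen - rejected): small when chosen scores well above rejected."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Reward model confidently agrees with the human ranking: low loss.
print(pairwise_loss(2.0, -1.0))
# Reward model has the ranking inverted: high loss, big gradient.
print(pairwise_loss(-1.0, 2.0))
```

During the reinforcement‑learning step, the main model is then rewarded for outputs this reward model scores highly, typically with an added penalty for drifting too far from the original model’s distribution, which is what keeps its behavior recognizable.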
After RLHF/RLAIF, the same prompt that once produced rambling or unsafe text is more likely to yield a concise, polite, “on policy” answer. The model has not become moral or wise; it has been trained to act as if it were, according to the patterns in the feedback it received.
Do user conversations go back into the model?
A natural question is whether models “learn” from every interaction, the way a person might.
The answer is nuanced:
Core weights are not updated in real time.
The model you talk to is usually a fixed snapshot. It does not change its parameters because of your specific conversation; if it did, behavior would drift unpredictably, and it would be very hard to debug or guarantee safety.
But interactions can be logged and used later.
Subject to privacy policies and opt‑in/opt‑out settings, providers may:
Sample and anonymize a subset of user chats.
Filter out problematic content.
Use the remaining data as additional fine‑tuning or RLHF training examples in future model versions.
So your conversation is not literally “growing” the model in the moment. Instead, it may become part of the training diet for its next incarnation.
On top of that, there are non‑learning layers:
Safety filters, classifiers, and routing logic that monitor prompts and outputs in real time (for self‑harm, threats, hate speech, etc.).
These can block or rewrite responses, or escalate to human review, without altering the underlying model’s long‑term memory.
You can think of this as a distinction between the model’s brain (its weights), which is updated in controlled training runs, and its reflexes, which are governed by separate runtime systems.
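Those runtime “reflexes” can be sketched as a wrapper around the model’s draft output. The keyword list here is a deliberately crude stand‑in for the real learned classifiers, which are separate models in their own right:

```python
# Illustrative only: real systems use trained classifiers, not keyword lists.
BLOCKED_TOPICS = {"build a weapon", "credit card dump"}

def moderate(draft_response: str) -> str:
    """Block or pass a draft response; the model's weights are never touched."""
    lowered = draft_response.lower()
    if any(topic in lowered for topic in BLOCKED_TOPICS):
        return "[response withheld by safety filter]"
    return draft_response

print(moderate("Here is a recipe for banana bread."))  # passes through
print(moderate("Step one to build a weapon is..."))    # blocked
```

The key design point is that this layer can be updated in hours when a new exploit appears, while retraining the underlying model takes weeks or months.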
Emerging work: online learning and “growing” models
Researchers are exploring online learning, where models are updated incrementally from ongoing feedback rather than only in big periodic training runs.
The appeal is obvious:
The system could adapt quickly to new facts, norms, or user needs.
You could imagine a model that genuinely “learns” from corrections mid‑conversation.
The risks are equally clear:
A coordinated group could try to push the model toward harmful behavior.
Noise, spam, or adversarial inputs could corrupt its behavior.
It becomes harder to reproduce or audit what the model “knew” at a given time.
To mitigate this, proposed algorithms try to:
Weight feedback by source reliability and diversity.
Detect and discount strategic or adversarial feedback.
Keep strong regularization (staying close to a baseline model) so a bad day on the internet does not rewrite the model’s personality.
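A minimal sketch of those mitigations, with every name and number purely illustrative: each piece of feedback nudges a single parameter, the step is scaled by a trust weight for the source, and a regularization term keeps pulling the parameter back toward a frozen baseline.

```python
def online_update(param, baseline, feedback, trust, lr=0.1, reg=0.05):
    """One incremental step: trusted feedback moves param; reg anchors it."""
    step = lr * trust * feedback           # discount unreliable sources
    pull_back = reg * (baseline - param)   # stay close to the frozen baseline
    return param + step + pull_back

param = 0.0
baseline = 0.0
# A sustained burst of hostile feedback from a low-trust source...
for _ in range(100):
    param = online_update(param, baseline, feedback=-1.0, trust=0.05)
# ...moves the parameter only a little, because regularization keeps
# pulling it back toward the baseline.
print(round(param, 3))
```

Even a hundred adversarial updates leave the parameter near its baseline; a single update from a fully trusted source moves it further than all of them combined.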
These ideas are still largely in the research stage; the big commercial models remain mostly on the “snapshot plus periodic update” regime.
Who keeps it “on path”?
All of this raises the question: if these systems can, in principle, absorb huge amounts of human input, what keeps them from drifting into madness or manipulation?
There are several layers of steering:
1. Data curation
Training data is filtered, deduplicated, and scrubbed to remove certain categories of content (spam, obvious abuse, personal data, known falsehoods where possible).
Feedback used for RLHF is screened; not every user rating is trusted equally.
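A toy curation pass might look like the following. Real pipelines use fuzzy deduplication (e.g. MinHash) and learned quality classifiers, so both checks here are stand‑ins for much heavier machinery:

```python
import hashlib

def looks_like_spam(doc: str) -> bool:
    """Toy heuristic standing in for a real learned quality classifier."""
    return "BUY NOW" in doc

def curate(docs):
    """Drop exact duplicates (by hash) and documents flagged as spam."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode()).hexdigest()
        if digest in seen or looks_like_spam(doc):
            continue
        seen.add(digest)
        kept.append(doc)
    return kept

docs = ["A useful article.", "A useful article.", "BUY NOW cheap pills"]
print(curate(docs))  # ['A useful article.']
```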
2. Alignment objectives
Reward models encode desired behavior: helpful, harmless, honest.
During RL, penalties discourage unsafe, evasive, or hallucinated answers, and regularization prevents radical personality shifts.
3. Evaluation and red‑teaming
Before deployment, models are stress‑tested with adversarial prompts to see how they handle jailbreak attempts, misinformation, or harmful instructions.
Standardized evaluation suites and internal “red teams” help identify and patch failure modes.
4. Runtime monitoring and versioning
Live traffic is monitored for emerging issues; if a new exploit or pattern appears, filters can be updated quickly.
Models are versioned. If a new release misbehaves, providers can roll back to a prior version while investigating.
None of this is perfect. But it means that the evolution of these systems is directed rather than purely organic: there are humans (and meta‑models) in the loop deciding what counts as good behavior and what gets optimized for.
From geometry to governance
The first article explained how AI “thinks” in terms of geometry: concepts as points and regions in an abstract space. The training story adds another layer. Those spaces are not static landscapes; they are engineered terrains shaped by data, feedback, and explicit goals.
Pre‑training discovers a wild, high‑dimensional jungle of patterns.
Fine‑tuning cuts paths through that jungle so the system can follow instructions.
RLHF and safety layers build fences, bridges, and warning signs.
Seen this way, a large AI model is less like an alien mind and more like a public work, an evolving piece of infrastructure built on mathematical foundations, fed by human language, and steered by human judgments about what counts as helpful, truthful, and safe.
As these systems become more capable, the interesting questions shift from “How do they work?” to “Who gets to shape their training diet, their reward signals, and their guardrails?” That is where technical details about pre‑training and fine‑tuning spill over into law, politics, and, as Philip K. Dick might have put it, the fragile question of what is real enough to live by.


Copyright 2026 Brighid Media