LLM Training Explained
A technical deep-dive into how Large Language Models actually work, explained for developers who want to understand the magic behind AI.
📖 The Story Begins...
Imagine you're texting with a friend who's incredibly good at predicting what you're going to say next. Not because they're psychic, but because they've read every book, article, and conversation in existence.
That's essentially what an LLM is — a statistical prediction machine that learned patterns from trillions of words. But how does it actually work under the hood? Let's dive into the technical magic, step by step.
What is an LLM, Actually?
The Next Token Prediction Machine
At its core, an LLM (Large Language Model) is doing one thing and one thing only: predicting the next token.
🎯 The Core Task
"The capital of France is"
"Paris" (probability: 99.9%)
That's it. The entire "intelligence" of ChatGPT, Claude, or any LLM comes from doing this one task extremely well, billions of times over.
💡 Why This Works
If you can predict "Paris" after "The capital of France is", and you can predict "def" after "class MyClass:", and you can predict "sincerely" after "Yours", then you've learned the patterns of language, facts about the world, programming syntax, and letter-writing etiquette — all from next-token prediction.
The Autoregressive Loop
Here's the clever part: once the model predicts "Paris", it adds that to the context and predicts the NEXT token:
Step 1: "The capital of France is" → predicts "Paris"
Step 2: "The capital of France is Paris" → predicts ","
Step 3: "The capital of France is Paris," → predicts "a"
Step 4: "The capital of France is Paris, a" → predicts "city"
Step 5: ...continues until it predicts an "end" token
This is called autoregressive generation — the model feeds its own predictions back as input to generate the next part. This is how it writes essays, answers questions, or generates code one piece at a time.
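The loop above can be sketched in a few lines of Python. Everything here is a toy: `predict_next` just looks up a hard-coded continuation instead of running a real model, but the feed-the-output-back-in structure is exactly what real generation does.

```python
# Toy sketch of autoregressive generation (illustrative only).
# `predict_next` stands in for a real model: it looks up a hard-coded
# continuation, but the loop structure is the real thing.

CONTINUATIONS = {
    "The capital of France is": " Paris",
    "The capital of France is Paris": ",",
    "The capital of France is Paris,": " a",
    "The capital of France is Paris, a": " city",
    "The capital of France is Paris, a city": "<end>",
}

def predict_next(context: str) -> str:
    return CONTINUATIONS.get(context, "<end>")

def generate(prompt: str, max_tokens: int = 10) -> str:
    context = prompt
    for _ in range(max_tokens):
        token = predict_next(context)   # model forward pass
        if token == "<end>":            # stop at the end token
            break
        context += token                # feed the prediction back in
    return context

print(generate("The capital of France is"))
# "The capital of France is Paris, a city"
```

Swap in a real model for `predict_next` and this is, structurally, how every chat response gets written.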
Tokens: The Building Blocks
Not Words, Not Characters
LLMs don't actually work with words or characters. They work with tokens — pieces of text that are somewhere in between.
🔧 What is a Tokenizer?
A tokenizer is like a smart text splitter. It's an algorithm (not AI, just clever math) that:
- Breaks text into pieces: "ChatGPT" → ["Chat", "G", "PT"]
- Assigns each piece a number: "Chat" = 15496, "G" = 47, "PT" = 1234
- Converts numbers back to text: When the model generates [47, 1234], the tokenizer turns it back into "GPT"
Think of it as a translator between human text ↔ machine numbers. Every LLM has its own tokenizer trained on its vocabulary.
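A minimal sketch of the encode/decode contract, using the made-up IDs from the example above. Real tokenizers (BPE, SentencePiece) learn their vocabulary from data, but they expose the same interface:

```python
# Minimal toy tokenizer: a fixed vocabulary mapping text pieces to IDs.
# The IDs are invented for this example.

VOCAB = {"Chat": 15496, "G": 47, "PT": 1234}
REVERSE = {i: piece for piece, i in VOCAB.items()}

def encode(text: str) -> list[int]:
    """Greedily match the longest known piece at each position."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try longest match first
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i:]!r}")
    return ids

def decode(ids: list[int]) -> str:
    return "".join(REVERSE[i] for i in ids)

print(encode("ChatGPT"))   # [15496, 47, 1234]
print(decode([47, 1234]))  # "GPT"
```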
How Tokenization Works
Original text: "ChatGPT is amazing!"
Tokenized: ["Chat", "G", "PT", " is", " amazing", "!"]
6 tokens total
Why Tokens?
✅ Efficient
Common words like "the", "and", "is" are single tokens. Rare words get broken into subword pieces. This gives a good balance between vocabulary size and sequence length.
✅ Handles Any Text
By breaking unknown words into pieces (like "unbelievable" → "un" + "believable"), the model can handle words it's never seen before.
📊 Token Count Examples
| Text | Tokens |
|---|---|
| "Hello" | 1 |
| "Hello world" | 2 |
| "The quick brown fox" | 4 |
| "uncharacteristically" | 4 (broken into pieces) |
| A full paragraph (~100 words) | ~130-150 tokens |
💡 Why This Matters for You
When you're charged by the token, or when your model has a "context window" of 4096 tokens, you're not being limited by words or characters — you're being limited by these token pieces. That's why a 100-word paragraph might be 130 tokens, not 100.
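For quick budgeting, a common rule of thumb for English text is roughly 4 characters per token. This is only an estimate; the tokenizer itself is the only source of exact counts:

```python
# Rough token-count estimate using the common "~4 characters per token"
# rule of thumb for English. Exact counts require the model's tokenizer.

def estimate_tokens(text: str) -> int:
    return max(1, round(len(text) / 4))

paragraph = "word " * 100          # ~100 words, 500 characters
print(estimate_tokens(paragraph))  # 125 — in the same ballpark as the table above
```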
Neural Networks: Pattern Recognizers
The Pattern Recognition Machine
At the heart of an LLM is a neural network — a massive system of interconnected nodes that learns patterns from data. Think of it like a giant sieve that filters information, learning which patterns are important.
How a Neural Network Works (Simplified)
Input (token IDs): [15496, 11, 616, 329, 11406] ("The cat sat...")
↓
Millions of mathematical operations: matrix multiplications, activations, and transformations through layers
↓
Output (probability distribution over all tokens): "on" = 45%, "down" = 30%, "there" = 15%, ...
The Key Idea: Learning from Examples
The network doesn't "know" anything initially. It starts with random values (weights). During training, it sees millions of examples like:
Input: "The cat sat on the"
Expected Output: "mat"
Network's Guess: "floor" (wrong!)
→ Adjust weights slightly to do better next time
After seeing this pattern millions of times across different contexts, the network learns:
- Cats often sit on things
- "Mat" commonly follows "sat on the"
- Grammar patterns (articles, prepositions, word order)
- World knowledge (cats are pets, mats are for sitting)
🧠 The Emergence of "Understanding"
Notice how the model doesn't have a "cat database" or a "mat definition." It just learned statistical patterns. Yet from these patterns, complex behaviors emerge — answering questions, writing code, reasoning through problems. This emergent complexity is what makes LLMs so powerful (and surprising).
The Transformer Architecture
"Attention Is All You Need"
In 2017, Google researchers published a paper with that title. It revolutionized AI. The key insight: attention mechanisms allow models to focus on relevant parts of the input when making predictions.
The Attention Analogy
Imagine you're reading a long paragraph and encounter the word "it" at the end:
"The computer was old and slow. The user tried to run a new program on it, but..."
To understand what "it" refers to, your brain looks back and attends to the most relevant words: "computer", "old", "slow". You don't equally consider every word — you focus on what matters.
Self-Attention Mechanism
The transformer uses self-attention to let every token "look at" every other token and decide which ones are important for understanding its meaning.
How It Works
Query, Key, Value
For each token, the model creates three vectors:
- Query: "What am I looking for?"
- Key: "What do I contain?"
- Value: "What information do I have?"
Compute Attention Scores
Each token's Query is compared to every other token's Key. High match = high attention weight. The model learns which relationships matter.
Weighted Sum
Each token's new representation becomes a weighted combination of all tokens' Values, weighted by attention scores.
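The three steps above can be sketched in plain Python. The vectors are hand-picked 2-dimensional toys rather than learned projections, but the score, softmax, weighted-sum pipeline is the real scaled dot-product attention:

```python
import math

# Single-head self-attention over 3 tokens with 2-dim vectors.
# Real models learn projection matrices for Q, K, V; these are hand-picked.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    d = len(keys[0])
    out = []
    for q in queries:
        # 1. score this Query against every Key (scaled dot product)
        scores = [dot(q, k) / math.sqrt(d) for k in keys]
        # 2. turn scores into attention weights
        weights = softmax(scores)
        # 3. weighted sum of all Values
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(attention(Q, K, V))  # each row is a similarity-weighted mix of V's rows
```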
Multi-Head Attention
The model doesn't just do this once — it runs multiple "attention heads" in parallel. Each head can learn different types of relationships:
Head A: Syntax
Learns grammatical relationships — subjects match with verbs, pronouns match with nouns.
Head B: Semantics
Learns meaning relationships — "king" relates to "queen", "Paris" relates to "France".
Head C: Long-Range
Learns connections across long distances in text — a character introduced in paragraph 1 mentioned again in paragraph 5.
Head D: Context
Learns task-specific patterns — in code, variable definitions match with usages.
Visual: Attention Pattern
"The animal didn't cross the street because it was too tired."
When processing "it", the model's attention might look like:
The model learns "it" refers to "animal" (75%), with some attention to "tired" (3%) to understand context.
The Transformer Stack
A modern LLM like GPT-4 or Llama has dozens of these attention layers stacked on top of each other. Each layer refines the understanding:
Lower layers handle local patterns (words, grammar). Higher layers handle global patterns (meaning, reasoning, context).
How Training Actually Works
From Random to Brilliant
Training an LLM is like teaching a student who starts knowing nothing. You show them examples, correct their mistakes, and gradually they improve.
The Training Loop
Feed Input
Give the model a sequence: "The capital of France is"
Make Prediction
Model runs through layers and guesses: "Paris" (or maybe "London" if it's early in training)
Compare to Truth
We know the answer should be "Paris". Calculate how wrong the model was.
Adjust Weights
Use calculus (backpropagation) to figure out which weights to tweak so the model does better next time.
Repeat Billions of Times
Do this for trillions of tokens across the entire internet. Gradually, the model gets better.
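The whole loop in miniature, assuming an invented three-word vocabulary: the "model" is just three raw scores, and the weight update uses the standard cross-entropy gradient (predicted probability minus the one-hot truth):

```python
import math
import random

# Toy version of the training loop: learn which token follows "sat on the"
# from examples, by nudging raw scores (the "weights") downhill on the loss.

vocab = ["mat", "floor", "roof"]
data = ["mat"] * 8 + ["floor"] * 2     # "mat" is correct 80% of the time

logits = [0.0, 0.0, 0.0]               # the model starts knowing nothing
lr = 0.1                               # learning rate

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

random.seed(0)
for step in range(500):
    target = random.choice(data)                  # 1. feed an example
    probs = softmax(logits)                       # 2. make a prediction
    loss = -math.log(probs[vocab.index(target)])  # 3. measure how wrong we were
    for i in range(len(logits)):                  # 4. adjust weights using the
        grad = probs[i] - (1.0 if vocab[i] == target else 0.0)
        logits[i] -= lr * grad                    #    cross-entropy gradient

probs = softmax(logits)
print({w: round(p, 2) for w, p in zip(vocab, probs)})
# probabilities drift toward the data: "mat" high, "floor" low, "roof" near zero
```

Real training does exactly this, except the "weights" number in the billions and backpropagation carries the gradient through every layer.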
Key Concepts
Loss Function
A mathematical measure of "how wrong" the model was. Lower loss = better predictions. Training tries to minimize this.
Learning Rate
How big of adjustments to make. Too big = unstable. Too small = slow. Like turning the steering wheel when driving.
Epochs
How many times the model sees the entire dataset. More epochs = more learning, but too many = overfitting.
Batch Size
How many examples to process before updating weights. Larger batches = more stable but need more memory.
⚠️ Why This Takes So Long
GPT-3 was trained on roughly 300 billion tokens — on the order of millions of books. Each token requires running billions of mathematical operations through the network. Even with thousands of GPUs, this takes weeks or months. That's why pre-trained models are so valuable — you're leveraging weeks of computation!
Fine-Tuning: Teaching the Specialist
Why Fine-Tune?
A pre-trained model knows general language and facts, but it doesn't know YOUR specific domain. Fine-tuning is like giving it specialized training.
The Analogy: Medical School
Pre-training = College
The model learns general knowledge, critical thinking, and how to communicate. Like a college graduate who knows a bit about everything.
Fine-tuning = Medical School
Now you give them specialized training. Thousands of examples of medical cases, diagnoses, patient interactions. They become a doctor.
How Fine-Tuning Works
Instead of training from scratch (which takes weeks and costs millions), you start with a pre-trained model and continue training on your specific dataset. This is much faster because:
- The model already knows language, grammar, and general facts
- You only need to teach it your specific domain
- Training takes hours or days, not weeks
- Costs hundreds of dollars, not millions
- Needs hundreds or thousands of examples, not billions
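The idea can be shown with the same kind of toy setup: start from weights that already encode "general knowledge" and nudge them on a handful of domain examples. The vocabulary, weights, and data here are all invented for illustration:

```python
import math

# Toy illustration of fine-tuning: instead of starting from zeros (random),
# start from "pretrained" weights and continue training on domain data.

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def train(logits, dataset, steps, lr=0.1):
    vocab = ["mat", "rug", "carpet"]
    for _ in range(steps):
        for target in dataset:                # one pass over the data
            probs = softmax(logits)
            for i in range(len(logits)):
                grad = probs[i] - (1.0 if vocab[i] == target else 0.0)
                logits[i] -= lr * grad
    return logits

pretrained = [2.0, 0.5, 0.5]   # "pretrained" weights already favor "mat"
domain_data = ["rug"] * 5      # but YOUR domain says the answer is "rug"

finetuned = train(list(pretrained), domain_data, steps=50)
print(softmax(finetuned))      # probability mass has shifted toward "rug"
```

Because the starting weights are already sensible, a few examples and a short run are enough to shift behavior — which is the entire economic argument for fine-tuning over training from scratch.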
✨ What Changes During Fine-Tuning?
The model's weights adjust to better predict your specific examples. It learns:
- Your terminology and jargon
- Your preferred response style and tone
- Patterns specific to your domain
- How to format responses the way you want
- Your specific knowledge base
🎯 Why Your Dataset Matters
Every example in your dataset is teaching the model: "When you see this kind of input, produce this kind of output." The quality and diversity of your examples directly determines the quality of your fine-tuned model. That's why edukaAI focuses so much on helping you create great examples — they become the training signal that shapes your AI's behavior.
Understanding Model Sizes
What Does "7B" Mean?
When you see "Llama 2 7B" or "GPT-3 175B", the "B" stands for billion parameters. Think of parameters as the "knobs" or "dials" inside the neural network that get adjusted during training. More parameters = more capacity to learn, but also more compute needed.
The Parameter Scale
Small (1B - 7B)
Examples: TinyLlama, Phi-2, Llama 2 7B
Good for: Testing, edge devices, simple tasks
Hardware: Runs on consumer GPUs (RTX 3060)
Speed: Very fast, low latency
Medium (13B - 30B)
Examples: Llama 2 13B, CodeLlama 13B, Mistral 7B (punches above its weight!)
Good for: Production use, most practical applications
Hardware: RTX 3090, RTX 4090, or cloud A10G
Speed: Good balance of quality and speed
Large (70B - 175B)
Examples: Llama 2 70B, GPT-3, Claude 2
Good for: Complex reasoning, research, maximum capability
Hardware: Multiple GPUs, A100s, or API access only
Speed: Slower but smartest
Bigger Isn't Always Better
It's tempting to think "bigger model = better," but that's not always true. A well-trained 13B model can outperform a poorly-trained 70B model on specific tasks. Plus, bigger models have downsides:
❌ Large Model Problems
- Higher inference costs (more $ per request)
- Slower responses
- Requires expensive hardware
- Higher energy consumption
- Harder to deploy on edge devices
✅ Right-Size Benefits
- Faster responses = better UX
- Lower costs = scalable
- Runs on affordable hardware
- Easier to fine-tune
- Can deploy anywhere
💡 The Sweet Spot for Beginners
For your first fine-tuning project, we recommend starting with 7B-13B models. They're big enough to learn your domain well, small enough to train affordably, and can run on consumer hardware. Once you master these, you can experiment with larger models.
Quantization: Making Models Smaller
The Magic of Model Compression
Remember those billions of parameters? Each one is stored as a number (usually 16 or 32 bits). Quantization is a technique that reduces the precision of these numbers, making the model smaller and faster while keeping most of its intelligence. Think of it like compressing an MP3 — smaller file, same song.
How It Works (The Simple Version)
Normal (FP16) - 16-bit precision
Weight value: 0.3847265849234712
Very precise, but takes 16 bits to store. A 7B model needs ~14GB RAM.
Quantized (INT8) - 8-bit precision
Weight value: 0.38
Less precise, but only 8 bits. Same 7B model now needs ~7GB RAM — half the size!
Highly Quantized (INT4) - 4-bit precision
Weight value: 0.4
Even less precise, only 4 bits. Same 7B model now needs ~3.5GB RAM — quarter the size!
Common Quantization Formats
| Format | Bits | Size (7B model) | Quality Loss | Use Case |
|---|---|---|---|---|
| FP16 | 16 | ~14 GB | None | Training, max quality |
| INT8 | 8 | ~7 GB | Minimal | Production inference |
| INT4 (Q4) | 4 | ~3.5 GB | Small | Consumer hardware |
| INT4 (Q2/Q3) | 2-3 | ~2-2.5 GB | Noticeable | Edge devices, testing |
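A minimal sketch of symmetric 8-bit quantization, assuming one scale for the whole list of weights (real schemes typically use per-channel or per-group scales, but the round-trip idea is the same):

```python
# Symmetric INT8 quantization: map floats into integer levels in
# [-127, 127] with a single scale, then reconstruct. The small round-trip
# error is the "quality loss" in the table above.

def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127   # fit the range into int8
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.3847, -0.1291, 0.9014, -0.5402]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)         # small integers: 1 byte each instead of 2-4
print(restored)  # close to the originals, but not exact
```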
When to Use Quantization
✅ Quantize When:
- Running inference (generating responses)
- Deploying to consumer hardware
- API cost reduction is important
- Mobile/edge device deployment
- Speed is critical
⚠️ Don't Quantize When:
- Training/fine-tuning (use FP16)
- Maximum accuracy is required
- Complex reasoning tasks
- Medical/legal applications
- You have plenty of GPU memory
🎯 Practical Example
Let's say you want to run Llama 2 13B on your laptop:
- FP16: 13B × 2 bytes ≈ 26 GB — won't fit on any consumer GPU
- INT8: ≈ 13 GB — still too big for most laptop GPUs
- INT4: ≈ 6.5 GB — fits comfortably in 8GB of VRAM
Result: By quantizing to 4-bit, you can run a 13B model on a laptop with 8GB VRAM (like an RTX 3070) with minimal quality loss!
🔧 Tools for Quantization
Popular tools for quantizing models:
- llama.cpp — Most popular, supports GGUF format
- AutoGPTQ — Easy quantization for HuggingFace models
- BitsAndBytes — 8-bit quantization for training
- ExLlama — Fast inference for 4-bit models
Good news: Many pre-quantized models are already available on HuggingFace — just download and use!
🎓 What You Now Understand
- LLMs: they predict one token at a time, feeding predictions back as input.
- Tokens: not words or characters, but pieces somewhere in between.
- Neural networks: they adjust millions of weights to get better at predictions.
- Attention: lets tokens focus on other relevant tokens in the context.
- Training: show an example, predict, compare to truth, adjust, repeat billions of times.
- Fine-tuning: start with general knowledge, train on your specific examples.
- Model sizes: 7B-13B is the sweet spot. Bigger = smarter but slower and costlier.
- Quantization: compress 16-bit weights to 4-bit. Run big models on consumer hardware!
Ready to Apply This Knowledge?
Now you understand how LLMs work. Time to build your dataset!
Further Reading
- "Attention Is All You Need" — Vaswani et al. (2017) — The original transformer paper
- The Illustrated Transformer — Jay Alammar — Visual guide to how transformers work
- Dive into Deep Learning — Attention Mechanisms — Technical deep dive