AI Research

Building an LLM from Scratch — How Large Language Models Actually Work

📅 May 2026 ⏱ 12 min read ✍️ AsmiAI Editorial 🎓 Based on peer-reviewed research

ChatGPT, Claude, Gemini — you use them every day. But how do they actually work under the hood? A research team from Maharaja Agrasen Institute of Technology broke down the full pipeline: from raw internet data all the way to a reasoning AI. Here's everything you need to know.

Phase 1 — Pretraining on Massive Data

Before a model can answer a single question, it needs to be exposed to an enormous amount of text. This is called pretraining, and it's where the foundational knowledge of an LLM comes from.

Researchers typically use datasets like FineWeb — hosted on Hugging Face — which contains roughly 15 trillion tokens (~44 TB of disk space). But raw internet data is messy. A multi-stage cleaning pipeline is applied before any training begins:

1

URL Filtering

Blocklists remove malicious, adult, and racially inappropriate sources before any content is downloaded.

2

Text Extraction

Raw HTML pages are stripped down to clean text — removing tags, markup, and non-linguistic patterns.

3

Language Filtering

A classifier keeps only pages with more than 65% English content (or the target language).

4

PII Removal

Phone numbers, addresses, and social security numbers are scrubbed to comply with ethical AI standards.

Key insight: For non-English LLMs (e.g., Hindi), FineWeb isn't appropriate. Custom datasets need to be built that adequately represent the target language — otherwise model performance suffers significantly.

How Tokenization Works

Neural networks can't process raw text — they operate on numbers. So every word (or piece of a word) must be converted into a number first. These numbers are called tokens.

The challenge is balancing vocabulary size against representational power. Too small a vocabulary and the model can't express nuance. Too large, and training becomes computationally prohibitive.

Byte Pair Encoding (BPE)

The dominant solution is Byte Pair Encoding (BPE). BPE works by iteratively merging the most frequently occurring character pairs into a single token. A word like "unbelievable" might be broken into un, believ, able — each a distinct token.

GPT-4 uses approximately 100,277 unique token combinations. Every sentence you type gets silently split into these subword units before the model ever sees it.

Why does this matter for you? Tokenization explains why LLMs struggle with unusual spellings, counting letters in words, and rare proper nouns — the tokens don't always align with how humans naturally think about characters.

Model Training and Inference

Once data is tokenized, training begins. The model is given a sliding window of tokens and tasked with one job: predict the next token.

For each position in the input, the model outputs a probability distribution over the entire vocabulary. If the correct next word is "cat" and the model assigns it a 0.02% probability, the error is large and the weights get adjusted. Do this billions of times across trillions of tokens, and patterns in language begin to emerge.

z = W₁X₁ + W₂X₂ + … + WₙXₙ + b   →   output = f(z)

This formula — a weighted sum of inputs plus a bias, passed through an activation function — is the core computation of every neuron in the network. During training, W (weights) and b (bias) are continuously adjusted to minimize prediction error.

Inference: What Happens When You Type a Message

During inference, the model doesn't "look up" an answer. It generates each token one at a time based on probability distributions learned during training. The output you receive is a statistically likely continuation — not a retrieved fact. This is why LLMs can sound confident while being completely wrong.

GPT-2 vs LLaMA 3.1 — A Quick Comparison

Model Parameters Context Length Training Data Training Cost (2019)
GPT-2 1.5 billion 1,024 tokens 40 GB (WebText) $50K–$100K
LLaMA 3.1 (400B) 400 billion 8,192 tokens 15 trillion tokens Significantly higher

GPT-2, trained on 256 V100 GPUs over several weeks, would cost just $10,000–$20,000 today thanks to better hardware, mixed-precision training, and distributed training frameworks. The economics of AI training have shifted dramatically.

Both GPT-2 and LLaMA 3.1 are fundamentally token simulators — at the base model level, they predict the next token. The intelligence that feels like understanding is an emergent property of scale and training data.

Transformer Architecture Explained

Every modern LLM is built on the Transformer architecture, introduced in the landmark 2017 paper "Attention Is All You Need" by Vaswani et al. Before transformers, models processed text sequentially — slow and unable to capture long-range dependencies. Transformers changed everything by processing all tokens in parallel using self-attention.

Three Types of Transformer Models

Encoder-Only

Converts input to dense vector representations. Best for understanding tasks: text classification, sentiment analysis, named entity recognition. Examples: BERT, RoBERTa, DeBERTa. Cannot generate text.

Encoder-Decoder

Encoder captures context, decoder generates output token-by-token. Best for translation, summarization, captioning. Examples: original Transformer, T5, BART.

Decoder-Only

Autoregressive models that generate text based on prior tokens. Best for chatbots, creative writing, Q&A. Examples: GPT-3, PaLM, LLaMA. This is what ChatGPT uses.

How Self-Attention Works

Self-attention allows each token to "look at" every other token in the sequence simultaneously and decide which ones are most relevant to its meaning. When processing the word "bank" in a sentence, attention determines whether the nearby words are about rivers or money — and weights the representation accordingly.

This is what allows transformers to model context at scale, and why they outperform every previous architecture on language tasks.

Post-Training: Turning a Base Model into an Assistant

A pretrained base model is not ChatGPT. It's a powerful pattern-matcher that will continue any text you give it — including continuing a hate speech prompt or a coding tutorial with equal indifference.

Post-training transforms this into a helpful assistant. Here's how:

  1. Human annotators write ideal responses to thousands of diverse prompts.
  2. These annotated conversations are tokenized with special role markers (user / assistant).
  3. The base model is fine-tuned on this curated dataset — adjusting its statistical weights toward helpful, assistant-like behavior.
  4. During inference, the model now generates responses that resemble the patterns from annotated conversations, not raw internet text.

OpenAI's 2022 InstructGPT paper revealed that roughly 40 contractors were hired through platforms like Upwork and ScaleAI to create this initial dataset. The dataset was never made public — but the open-source community responded with alternatives like OpenAssistant's OASST1 (160,000+ messages, 10,000+ conversation trees).

Synthetic data shift: Since InstructGPT, the field has moved toward LLM-generated training data (like UltraChat), with human annotators providing quality control rather than generating from scratch. Faster, cheaper, and increasingly indistinguishable in quality from human-written data.

Why LLMs Hallucinate — And What's Being Done About It

Hallucination happens when a model generates confident-sounding information that is simply wrong or fabricated. The root cause is structural: human annotators write confident, authoritative responses, and the model learns to imitate that confidence — even when it's guessing.

Meta's Approach in LLaMA 3

Meta addressed this directly in the LLaMA 3 paper with a principle called "knowing what it knows":

  1. Take small chunks from the pretraining data.
  2. Ask the model to generate questions based on those chunks.
  3. Ask the model to answer those questions.
  4. Use the model as a judge: if it consistently gives confident but wrong answers, train it to refuse instead.

The result: a model that says "I don't know" more often but says it accurately. This is more useful than a model that always answers — and always sounds certain.

The Tool Use Solution

The second approach is giving the model access to real-time tools. Certain trigger tokens initiate a web search, the retrieved content is inserted into the context window, and the model reasons over fresh, verified information. This is how Perplexity and ChatGPT's web browsing work.

Reinforcement Learning and DeepSeek R1

Supervised fine-tuning teaches a model to imitate good responses. Reinforcement learning takes a different approach: rather than being told what to do, the model discovers which behaviors lead to correct outcomes through trial and error.

The process:

  1. Generate 100+ different responses to a single prompt.
  2. Evaluate which responses are correct.
  3. Train repeatedly on the correct responses.
  4. Scale this to tens of thousands of prompts.

DeepSeek R1 — RL in Practice

DeepSeek R1 is the clearest demonstration of RL's power. The model was evaluated on AIME — advanced mathematics competition problems — during training. Early in training, accuracy was low. As RL progressed, accuracy climbed steadily.

What made this remarkable wasn't just the numbers. The model began producing longer, more detailed solutions — exploring multiple approaches, reconsidering assumptions, and arriving at answers through visible reasoning steps. This behavior was never programmed — it emerged from reinforcement learning alone.

🧠 The Key Insight About RL

Chain-of-thought reasoning — where the model "thinks out loud" before answering — isn't just a prompting trick. In models trained with RL, it emerges naturally as the strategy that leads to correct answers. The model discovers that thinking longer produces better results, and so it does.

RLHF — Teaching AI with Human Preferences

RL works well for verifiable tasks (math, coding, logic) where correct answers exist. But what about subjective prompts like "write me a poem" or "what are the best holiday destinations"? There's no single right answer.

This is where Reinforcement Learning from Human Feedback (RLHF) comes in, first proposed in the paper "Fine-Tuning Language Models from Human Preferences":

  1. Human annotators score multiple model responses to subjective prompts.
  2. A separate neural network — the reward model — is trained to predict those human scores.
  3. RL is then performed against the reward model, letting the LLM optimize for responses that the reward model (as a proxy for humans) rates highly.

Advantages of RLHF

  • Aligns models with human values in open-ended scenarios where ground truth doesn't exist.
  • Enables training on tasks that are impossible to supervise directly.
  • Significantly improves user satisfaction and helpfulness.

Limitations of RLHF

  • Reward hacking: Models learn to exploit imperfections in the reward model to achieve high scores without genuinely being helpful.
  • Bias amplification: If human annotators are biased, the reward model learns those biases — and RL then optimizes for them at scale.
  • Cost: High-quality human feedback is expensive and time-consuming to collect.

📋 The Full LLM Pipeline

Pretraining on massive data → Tokenization (BPE) → Model training (next-token prediction) → Post-training on annotated conversations → Reinforcement Learning on verifiable tasks → RLHF for subjective alignment. Each stage builds on the last, and the result is what you interact with every day.