ChatGPT, Claude, Gemini — you use them every day. But how do they actually work under the hood? A research team from Maharaja Agrasen Institute of Technology broke down the full pipeline: from raw internet data all the way to a reasoning AI. Here's everything you need to know.
Before a model can answer a single question, it needs to be exposed to an enormous amount of text. This is called pretraining, and it's where the foundational knowledge of an LLM comes from.
Researchers typically use datasets like FineWeb — hosted on Hugging Face — which contains roughly 15 trillion tokens (~44 TB of disk space). But raw internet data is messy. A multi-stage cleaning pipeline is applied before any training begins:
Blocklists remove malicious, adult, and racially inappropriate sources before any content is downloaded.
Raw HTML pages are stripped down to clean text — removing tags, markup, and non-linguistic patterns.
A classifier keeps only pages with more than 65% English content (or the target language).
Phone numbers, addresses, and social security numbers are scrubbed to comply with ethical AI standards.
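The four cleaning stages above can be pictured as a chain of small filters. The sketch below is illustrative only and is not FineWeb's actual code: the blocklist entries, the regular expressions, and the `english_score` heuristic (share of letters that are ASCII) are all assumptions standing in for curated lists and a trained language classifier.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical blocklist; real pipelines use large curated lists.
BLOCKED_DOMAINS = {"malware.example", "adult.example"}

# Simplified US-style patterns, for illustration only.
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def url_allowed(url):
    """Stage 1: drop pages from blocklisted hosts."""
    host = urlparse(url).hostname or ""
    return host not in BLOCKED_DOMAINS


def extract_text(html):
    """Stage 2: strip markup and collapse whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def english_score(text):
    """Stage 3 stand-in: fraction of letters that are ASCII (a crude proxy)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isascii() for c in letters) / len(letters)


def scrub_pii(text):
    """Stage 4: mask SSN-like and phone-like numbers."""
    text = SSN.sub("[SSN]", text)
    return PHONE.sub("[PHONE]", text)


def clean_page(url, html, min_english=0.65):
    """Run all stages; return cleaned text, or None if the page is rejected."""
    if not url_allowed(url):
        return None
    text = extract_text(html)
    if english_score(text) <= min_english:
        return None
    return scrub_pii(text)
```

Calling `clean_page` on a page from a blocklisted host returns `None`, while an ordinary English page comes back as plain text with phone-like numbers masked.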
Key insight: For non-English LLMs (e.g., Hindi), FineWeb isn't appropriate. Custom datasets need to be built that adequately represent the target language — otherwise model performance suffers significantly.
Neural networks can't process raw text; they operate on numbers. So every word (or piece of a word) must first be mapped to a number. The pieces are called tokens, and each token is represented by an integer ID.
The challenge is balancing vocabulary size against representational power. Too small a vocabulary and the model can't express nuance. Too large, and training becomes computationally prohibitive.
The dominant solution is Byte Pair Encoding (BPE). BPE works by iteratively merging the most frequently occurring pair of adjacent symbols (characters or bytes) into a single new token. A word like "unbelievable" might be broken into un, believ, able, each a distinct token.
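To make the merge loop concrete, here is a toy character-level BPE trainer. It is a minimal sketch, assuming whitespace-separated words and first-seen tie-breaking; production tokenizers typically work on bytes and add pre-tokenization rules that are not shown here. The function names are illustrative.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # max() keeps the first pair seen when counts tie.
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn an ordered list of merges from a whitespace-separated corpus."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges


def tokenize(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = next(iter(merge_pair({symbols: 1}, pair)))
    return list(symbols)


merges = train_bpe("low low low lower", num_merges=2)
print(merges)                      # [('l', 'o'), ('lo', 'w')]
print(tokenize("lowest", merges))  # ['low', 'e', 's', 't']
```

Note how "lowest", a word never seen in training, is still representable: the learned subword "low" is reused and the rest falls back to single characters.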
GPT-4's tokenizer has a vocabulary of 100,277 distinct tokens. Every sentence you type gets silently split into these subword units before the model ever sees it.
Why does this matter for you? Tokenization explains why LLMs struggle with unusual spellings, counting letters in words, and rare proper nouns — the tokens don't always align with how humans naturally think about characters.
Once data is tokenized, training begins. The model is given a sliding window of tokens and tasked with one job: predict the next token.
For each position in the input, the model outputs a probability distribution over the entire vocabulary. If the correct next word is "cat" and the model assigns it a 0.02% probability, the error is large and the weights get adjusted. Do this billions of times across trillions of tokens, and patterns in language begin to emerge.
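The size of that error is usually measured with cross-entropy, the negative log of the probability the model gave to the correct token. The snippet below works through the 0.02% example; the other probabilities in `probs` are made up for illustration.

```python
import math


def next_token_loss(probs, target):
    """Cross-entropy at one position: -log(probability of the correct token)."""
    return -math.log(probs[target])


# Hypothetical distribution over a three-token vocabulary.
probs = {"cat": 0.0002, "dog": 0.6, "the": 0.3998}

print(round(next_token_loss(probs, "cat"), 3))  # 8.517 (0.02% on the right token)
print(round(next_token_loss(probs, "dog"), 3))  # 0.511 (60% on the right token)
```

A loss near zero means the model already put almost all of its probability on the right token; a large loss produces a large weight update.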
The core computation of every neuron in the network is y = f(Wx + b): a weighted sum of the inputs x plus a bias, passed through an activation function f. During training, W (weights) and b (bias) are continuously adjusted to minimize prediction error.
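As a small illustration, the sketch below computes one neuron's output and applies a single gradient-descent update. The sigmoid activation, the squared-error objective, and the learning rate are assumptions chosen for simplicity; the text does not specify them, and real LLMs use other activations and a cross-entropy loss.

```python
import math


def neuron(inputs, weights, bias):
    """Weighted sum of inputs plus bias, passed through a sigmoid activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train_step(inputs, weights, bias, target, lr=0.5):
    """One gradient-descent step on 0.5 * (output - target)^2."""
    y = neuron(inputs, weights, bias)
    dz = (y - target) * y * (1.0 - y)  # chain rule through the sigmoid
    new_weights = [w - lr * dz * x for w, x in zip(weights, inputs)]
    new_bias = bias - lr * dz
    return new_weights, new_bias


inputs, weights, bias = [1.0, 2.0], [0.5, -0.25], 0.0
print(neuron(inputs, weights, bias))  # 0.5

weights, bias = train_step(inputs, weights, bias, target=1.0)
print(neuron(inputs, weights, bias) > 0.5)  # True: output moved toward the target
```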
During inference, the model doesn't "look up" an answer. It generates each token one at a time based on probability distributions learned during training. The output you receive is a statistically likely continuation — not a retrieved fact. This is why LLMs can sound confident while being completely wrong.
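The generation loop itself is short. In the sketch below, `TOY_MODEL` is a hypothetical stand-in for a trained network: a hand-written table of next-token probabilities that conditions only on the previous token, whereas a real LLM conditions on the whole context window. The sampling loop has the same shape either way.

```python
import random

# Hypothetical next-token probabilities, keyed by the previous token.
TOY_MODEL = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 1.0},
    "sat": {"</s>": 1.0},
    "ran": {"</s>": 1.0},
}


def generate(model, rng, max_tokens=10):
    """Sample one token at a time until the end marker or the length limit."""
    tokens = ["<s>"]
    while len(tokens) <= max_tokens:
        dist = model[tokens[-1]]
        next_token = rng.choices(list(dist), weights=list(dist.values()))[0]
        if next_token == "</s>":
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start marker


print(" ".join(generate(TOY_MODEL, random.Random())))
```

Running it repeatedly gives different sentences, because each token is drawn from a distribution rather than looked up. Nothing in the loop checks whether the result is true.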
| Model | Parameters | Context Length | Training Data | Training Cost (2019) |
|---|---|---|---|---|
| GPT-2 | 1.5 billion | 1,024 tokens | 40 GB (WebText) | $50K–$100K |
| LLaMA 3.1 (405B) | 405 billion | 128K tokens (8,192 for most of pretraining) | 15 trillion tokens | Significantly higher |
GPT-2, trained on 256 V100 GPUs over several weeks, would cost just $10,000–$20,000 today thanks to better hardware, mixed-precision training, and distributed training frameworks. The economics of AI training have shifted dramatically.
Both GPT-2 and LLaMA 3.1 are fundamentally token simulators — at the base model level, they predict the next token. The intelligence that feels like understanding is an emergent property of scale and training data.
Every modern LLM is built on the Transformer architecture, introduced in the landmark 2017 paper "Attention Is All You Need" by Vaswani et al. Before transformers, recurrent models processed text one token at a time, which was slow to train and made long-range dependencies hard to capture. Transformers changed everything by processing all tokens in parallel using self-attention.
Encoder-only: Converts input into dense vector representations. Best for understanding tasks: text classification, sentiment analysis, named entity recognition. Examples: BERT, RoBERTa, DeBERTa. Not designed to generate text.
Encoder-decoder: The encoder captures context and the decoder generates output token by token. Best for translation, summarization, captioning. Examples: original Transformer, T5, BART.
Decoder-only: Autoregressive models that generate text based on prior tokens. Best for chatbots, creative writing, Q&A. Examples: GPT-3, PaLM, LLaMA. This is what ChatGPT uses.
Self-attention allows each token to "look at" every other token in the sequence simultaneously and decide which ones are most relevant to its meaning. When processing the word "bank" in a sentence, attention determines whether the nearby words are about rivers or money — and weights the representation accordingly.
This is what allows transformers to model context at scale, and why they outperform every previous architecture on language tasks.
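The core of self-attention fits in a few lines. The sketch below implements scaled dot-product attention in plain Python. It is deliberately stripped down: the learned query, key, and value projections, multiple heads, and the causal mask used by decoder-only models are all omitted, and the token vectors are passed in directly.

```python
import math


def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(queries, keys, values):
    """Scaled dot-product attention: each output is a weighted mix of values."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this token against every token, then normalize.
        scores = [dot(q, k) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Two toy token vectors, used as queries, keys, and values alike.
x = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = self_attention(x, x, x)
print(weights[0])  # the first token weights itself more than the second
```

Each row of `weights` says how much one token draws on every other token; the output vector for that token is the corresponding blend of value vectors.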
A pretrained base model is not ChatGPT. It's a powerful pattern-matcher that will continue any text you give it — including continuing a hate speech prompt or a coding tutorial with equal indifference.
Post-training transforms this into a helpful assistant. Here's how:
OpenAI's 2022 InstructGPT paper revealed that roughly 40 contractors were hired through platforms like Upwork and Scale AI to write the initial dataset of example prompts and ideal responses. The dataset was never made public, but the open-source community responded with alternatives like OpenAssistant's OASST1 (160,000+ messages, 10,000+ conversation trees).
Synthetic data shift: Since InstructGPT, the field has moved toward LLM-generated training data (like UltraChat), with human annotators providing quality control rather than generating from scratch. Faster, cheaper, and increasingly indistinguishable in quality from human-written data.
Hallucination happens when a model generates confident-sounding information that is simply wrong or fabricated. The root cause is structural: human annotators write confident, authoritative responses, and the model learns to imitate that confidence — even when it's guessing.
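Meta addressed this directly in the LLaMA 3 paper with a principle called "knowing what it knows": the model is probed with factual questions generated from its own training data, and where its answers are consistently wrong, training examples are added that teach it to decline instead.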
The result: a model that says "I don't know" more often but says it accurately. This is more useful than a model that always answers — and always sounds certain.
The second approach is giving the model access to real-time tools. Certain trigger tokens initiate a web search, the retrieved content is inserted into the context window, and the model reasons over that fresh content instead of relying on memory alone. This is how Perplexity and ChatGPT's web browsing work.
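A rough sketch of that loop follows. Everything named here is hypothetical: the `<search>` and `<result>` markers stand in for whatever special tokens a given system uses, and `stub_model` and `stub_search` are placeholders for a real LLM call and a real search backend.

```python
import re

# Hypothetical trigger syntax; real systems use their own special tokens.
SEARCH_PATTERN = re.compile(r"<search>(.*?)</search>", re.DOTALL)


def run_with_tools(model, search, prompt, max_rounds=3):
    """Let the model request searches; feed results back into its context."""
    context = prompt
    output = ""
    for _ in range(max_rounds):
        output = model(context)
        match = SEARCH_PATTERN.search(output)
        if match is None:
            return output  # no tool call, so this is the final answer
        results = search(match.group(1).strip())
        # Insert the retrieved text into the context window and continue.
        context += output + "\n<result>" + results + "</result>\n"
    return output


def stub_model(context):
    """Placeholder model: searches once, then answers from the result."""
    if "<result>" in context:
        found = context.split("<result>")[1].split("</result>")[0]
        return "Answer grounded in: " + found
    return "<search>example query</search>"


def stub_search(query):
    """Placeholder search backend."""
    return "stub page text for '" + query + "'"


print(run_with_tools(stub_model, stub_search, "Question?\n"))
# Answer grounded in: stub page text for 'example query'
```

The key design point is that the model never browses by itself: the surrounding program watches its output for the trigger, runs the search, and appends the result as more context.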
Supervised fine-tuning teaches a model to imitate good responses. Reinforcement learning takes a different approach: rather than being told what to do, the model discovers which behaviors lead to correct outcomes through trial and error.
The process:
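One way to picture such a trial-and-error loop, as a toy sketch under heavy simplifying assumptions: the "model" is just a preference over two hypothetical behaviors, the task is multiplication with a checkable answer, and the update is a basic policy-gradient rule with a fixed baseline. This is not DeepSeek's or any lab's actual training setup.

```python
import math
import random


def verify(answer, a, b):
    """Verifiable reward: 1.0 if the answer to a * b is exactly right."""
    return 1.0 if answer == a * b else 0.0


def answer_immediately(a, b):
    """Hypothetical behavior: a quick rounded estimate."""
    return round(a, -1) * round(b, -1)


def work_step_by_step(a, b):
    """Hypothetical behavior: slower, exact repeated addition."""
    return sum(a for _ in range(b))


STRATEGIES = {
    "answer_immediately": answer_immediately,
    "work_step_by_step": work_step_by_step,
}


def probabilities(logits):
    """Softmax over strategy preferences."""
    m = max(logits.values())
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def train(problems, steps=200, lr=0.5, baseline=0.5, seed=0):
    """Try a behavior, check the answer, shift preference by the reward."""
    rng = random.Random(seed)
    logits = {name: 0.0 for name in STRATEGIES}
    for _ in range(steps):
        a, b = rng.choice(problems)
        probs = probabilities(logits)
        names = list(probs)
        chosen = rng.choices(names, weights=[probs[n] for n in names])[0]
        reward = verify(STRATEGIES[chosen](a, b), a, b)
        advantage = reward - baseline
        for name in names:
            indicator = 1.0 if name == chosen else 0.0
            logits[name] += lr * advantage * (indicator - probs[name])
    return probabilities(logits)


PROBLEMS = [(17, 24), (13, 29), (36, 42)]
print(train(PROBLEMS))
```

Nobody tells the policy which behavior is better. It starts at 50/50 and ends up strongly preferring the step-by-step behavior purely because that behavior keeps earning the reward.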
DeepSeek R1 is the clearest demonstration of RL's power. The model was evaluated on AIME — advanced mathematics competition problems — during training. Early in training, accuracy was low. As RL progressed, accuracy climbed steadily.
What made this remarkable wasn't just the numbers. The model began producing longer, more detailed solutions — exploring multiple approaches, reconsidering assumptions, and arriving at answers through visible reasoning steps. This behavior was never programmed — it emerged from reinforcement learning alone.
Chain-of-thought reasoning — where the model "thinks out loud" before answering — isn't just a prompting trick. In models trained with RL, it emerges naturally as the strategy that leads to correct answers. The model discovers that thinking longer produces better results, and so it does.
RL works well for verifiable tasks (math, coding, logic) where correct answers exist. But what about subjective prompts like "write me a poem" or "what are the best holiday destinations"? There's no single right answer.
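This is where Reinforcement Learning from Human Feedback (RLHF) comes in, applied to language models in the 2019 paper "Fine-Tuning Language Models from Human Preferences": human raters compare candidate responses, a reward model is trained to predict their preferences, and the LLM is then optimized against that reward model.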
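A common way to turn human comparisons into a training signal is a pairwise loss on a reward model: the response the rater preferred should receive a higher score than the one they rejected. The sketch below shows that loss in isolation. It illustrates the general technique and is not taken from the cited paper; the reward scores are made-up numbers.

```python
import math


def win_probability(reward_chosen, reward_rejected):
    """Modeled probability that the chosen response beats the rejected one."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: small when the preferred response is scored higher."""
    return -math.log(win_probability(reward_chosen, reward_rejected))


print(round(preference_loss(2.0, 0.0), 3))  # 0.127 (reward model agrees with the rater)
print(round(preference_loss(0.0, 0.0), 3))  # 0.693 (reward model has no opinion)
print(round(preference_loss(0.0, 2.0), 3))  # 2.127 (reward model disagrees)
```

Minimizing this loss over many human comparisons yields a scorer that can then grade new responses automatically, which is what makes reinforcement learning possible on prompts with no single right answer.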
Collecting and cleaning massive web data → Tokenization (BPE) → Pretraining (next-token prediction) → Post-training on annotated conversations → Reinforcement Learning on verifiable tasks → RLHF for subjective alignment. Each stage builds on the last, and the result is what you interact with every day.