What is NLP?
1. What Problem Does NLP Solve?
Computers only understand numbers — 0s and 1s. But humans communicate through language — a complex system of words, sentences, grammar, tone, context, and meaning. The fundamental problem NLP solves is this: How do we bridge the gap between human language (text and speech) and computers that only process numbers?
Without NLP, a computer cannot understand what you write in an email, what you ask in a search box, or what you say to a voice assistant. NLP gives computers the ability to read, understand, generate, and interact in human language.
2. Why Was NLP Invented?
In the early days of computing (1940s–1960s), scientists dreamed of building machines that could communicate in human language. The original motivation was Machine Translation — translating Russian scientific documents to English during the Cold War. The US government funded massive research into getting computers to "understand" text.
As the internet grew (1990s–2000s), humans created vast amounts of text data (emails, websites, documents). Organizations needed automated ways to process, search, and analyze this text. That's when NLP became an industrial necessity, not just academic curiosity.
3. Historical Background
- 1950 — Alan Turing's Test: Turing proposed the famous "Turing Test" — if a computer can convince a human it's also human through conversation, it's "intelligent".
- 1957 — Chomsky's Grammar: Noam Chomsky showed that language has formal, hierarchical structure — this shaped how early NLP was built (rule-based systems).
- 1966 — ELIZA: First chatbot created at MIT by Joseph Weizenbaum, simulating a psychotherapist using pattern matching.
- 1980s–1990s — Statistical NLP: Shift from hand-written rules to learning from data. Hidden Markov Models, probabilistic methods emerged.
- 2001 — Neural Language Models: Bengio et al. showed neural networks could learn word representations.
- 2013 — Word2Vec: Google's Tomas Mikolov created word embeddings — words as vectors in space. Massive breakthrough.
- 2017 — Transformer Architecture: Google Brain published "Attention Is All You Need" — changed everything.
- 2018 — BERT & GPT: The era of Large Language Models begins.
- 2022+ — ChatGPT, Claude, Llama: LLMs become consumer products.
4. Real-World Analogy
Think of NLP like a Universal Translator from Star Trek. When Captain Kirk speaks English, the device instantly translates it for alien species. Similarly, NLP translates human language (which computers don't naturally understand) into a form computers can process (numbers, vectors, patterns), and then back into human language for the response.
5. Explain Like I'm 10
You know how your calculator can do math — add 2+3 and get 5? But if you wrote "two plus three" in words, the calculator would get confused and show an error. NLP is like giving computers a superpower so they can understand words, just like how you understand your teacher's instructions! It's the technology that makes Siri, Google Translate, and ChatGPT understand what you're saying.
6. Explain Like a College Student
NLP is the subfield of Artificial Intelligence that deals with the interaction between computers and human (natural) language. It combines linguistics (study of language), computer science (algorithms and data structures), and machine learning (learning patterns from data).
Modern NLP is mostly data-driven: instead of writing rules like "if the word 'not' appears before an adjective, flip its sentiment", we feed millions of examples to statistical/neural models and let them discover patterns themselves. This approach — called supervised learning and self-supervised learning — has proven far more powerful than hand-crafted rules.
7. Explain Like an AI Engineer
NLP is the set of techniques for processing, analyzing, and generating human language data. Modern NLP pipelines typically involve: tokenization → embedding → encoding → task head.
Pre-2017, this typically meant bag-of-words or TF-IDF features fed to logistic regression or SVMs, or word embeddings fed to RNNs/LSTMs. Post-2017, it means transformer-based models (BERT family for understanding, GPT family for generation) fine-tuned on task-specific data. Production NLP systems now often use foundation models (GPT-4, Claude, Llama) with prompt engineering or fine-tuning, plus RAG for knowledge-intensive tasks.
8. Terminology Breakdown
| Term | Simple Meaning |
|---|---|
| NLP | Natural Language Processing — teaching computers to understand and generate human text |
| Corpus | A large collection of text used for training (plural: corpora) |
| Token | A basic unit of text — usually a word or sub-word piece |
| Model | A mathematical system that has "learned" patterns from data |
| Training | The process of showing a model examples so it can learn patterns |
| Inference | Using a trained model to make predictions on new data |
| LLM | Large Language Model — a very large model trained on massive text data |
| Pipeline | A series of steps that text goes through for processing |
11. Visual: NLP vs Traditional Programming
TRADITIONAL PROGRAMMING:
┌─────────────┐ ┌───────────────┐ ┌──────────┐
│ Input │───▶│ RULES (you │───▶│ Output │
│ "Hello!" │ │ write them) │ │ (result) │
└─────────────┘ └───────────────┘ └──────────┘
NLP / MACHINE LEARNING:
┌─────────────┐ ┌───────────────┐ ┌──────────┐
│ Input │───▶│ MODEL learns │───▶│ Output │
│ + Examples │ │ rules itself │ │ (result) │
└─────────────┘ └───────────────┘ └──────────┘
NLP vs LLM:
┌──────────────────────────────────────────────────────┐
│ NLP (broad field) │
│ ┌──────────────────────────────────────────────┐ │
│ │ Classical NLP: Rules, TF-IDF, SVM, LSTM │ │
│ └──────────────────────────────────────────────┘ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Modern NLP: Transformers, BERT, GPT, LLMs │ │ ◀── This is what we focus on
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────┘
Real-World Applications
Google uses NLP to understand your query and match it to relevant pages — even if you don't use exact keywords.
Siri, Alexa, ChatGPT, Claude — all use NLP to understand questions and generate appropriate responses.
Google Translate converts text between 100+ languages using neural NLP models.
Sentiment analysis tools classify product reviews, social media posts, or customer feedback as positive, negative, or neutral.
Gmail's spam filter uses NLP to classify emails as spam or not-spam based on content patterns.
GitHub Copilot uses NLP trained on code to suggest code completions and write functions from comments.
The NLP Pipeline
What Is A Pipeline?
A pipeline is a series of steps where the output of one step becomes the input of the next — like an assembly line in a factory. In NLP, raw text enters the pipeline and a useful prediction or generated output comes out the other end.
The 5-Stage NLP Pipeline
Raw Text Input
│
▼
┌─────────────────────────────────────────────────────┐
│ STAGE 1: TEXT INPUT │
│ "The cat sat on the mat. It was happy." │
└─────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ STAGE 2: PREPROCESSING │
│ • Cleaning: remove HTML tags, punctuation noise │
│ • Normalization: lowercase, fix typos │
│ • Tokenization: split into ["The","cat","sat",...] │
│ • Stop word removal (optional) │
│ • Stemming/Lemmatization (optional) │
└─────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ STAGE 3: REPRESENTATION │
│ Convert tokens → numbers computers can process │
│ • One-Hot Encoding → [0,0,1,0,0,...,0] │
│ • Bag of Words → [2,1,0,3,...] │
│ • Word Embeddings → [0.23,-0.45,0.12,...] │
│ • Transformer Enc. → contextual vector │
└─────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ STAGE 4: MODELING │
│ The mathematical model processes representations │
│ • Classic: Logistic Regression, SVM, Naive Bayes │
│ • Neural: LSTM, CNN for text │
│ • Modern: BERT, GPT, T5, LLaMA │
└─────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ STAGE 5: OUTPUT │
│ • Classification: "POSITIVE" / "NEGATIVE" │
│ • Generation: "The quick brown fox..." │
│ • Extraction: [("Apple", ORG), ("Tim Cook", PER)] │
│ • Translation: "El gato se sentó en la alfombra" │
└─────────────────────────────────────────────────────┘
Step-by-Step Worked Example
Task: Sentiment analysis on the sentence: "The food was absolutely terrible!"
- 1. Input: Raw text: "The food was absolutely terrible!"
- 2. Preprocessing: Lowercase → "the food was absolutely terrible!" → Tokenize → ["the", "food", "was", "absolutely", "terrible"]
- 3. Representation: Convert each token to a number vector. "terrible" → [-0.85, 0.12, -0.67, ...] (a vector pointing toward "negative" words)
- 4. Modeling: The trained model processes the vectors, recognizes the "absolutely terrible" pattern it learned during training → assigns high probability to the NEGATIVE class
- 5. Output: Classification: "NEGATIVE" (with 96% confidence)
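The five stages above can be sketched end to end in a few lines of plain Python. This is only a toy: the word lists `POSITIVE_WORDS` and `NEGATIVE_WORDS` and the counting rule in `model` are illustrative assumptions standing in for a trained model, and no confidence score is produced.

```python
import re
from collections import Counter

# Hypothetical mini-lexicons standing in for what a trained model would learn.
POSITIVE_WORDS = {"great", "delicious", "amazing", "good"}
NEGATIVE_WORDS = {"terrible", "awful", "bad", "bland"}

def preprocess(text: str) -> list[str]:
    """Stage 2: lowercase, drop punctuation, split into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def represent(tokens: list[str]) -> Counter:
    """Stage 3: bag-of-words counts (token -> frequency)."""
    return Counter(tokens)

def model(features: Counter) -> int:
    """Stage 4: toy scoring rule, positive hits minus negative hits."""
    pos = sum(features[w] for w in POSITIVE_WORDS)
    neg = sum(features[w] for w in NEGATIVE_WORDS)
    return pos - neg

def classify(text: str) -> str:
    """Stages 1-5 chained: raw text in, label out."""
    score = model(represent(preprocess(text)))
    if score > 0:
        return "POSITIVE"
    if score < 0:
        return "NEGATIVE"
    return "NEUTRAL"

tokens = preprocess("The food was absolutely terrible!")
label = classify("The food was absolutely terrible!")
print(tokens)  # ['the', 'food', 'was', 'absolutely', 'terrible']
print(label)   # NEGATIVE
```

A real system would replace `represent` with learned embeddings and `model` with a trained classifier, but the data flow between stages stays the same.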
Common Misconceptions
"NLP models understand language like humans do."
False. Models learn statistical patterns over text — they've seen "terrible" co-occur with negative reviews millions of times, so they associate the word with negativity. They don't "understand" the word the way you do. They have no concept of food, taste, or emotions. This distinction matters when models fail in unexpected ways (they can be fooled by novel phrasing they've never seen).
NLP Challenges
THE 7 CORE NLP CHALLENGES
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. AMBIGUITY
"I shot the elephant in my pajamas"
├─ I was wearing pajamas while shooting → ✓ likely
└─ The elephant was in my pajamas → ✓ grammatically valid
2. SYNONYMS
"car", "automobile", "vehicle", "wheels" all mean ≈ same
Model must know these are semantically related
3. POLYSEMY (one word, many meanings)
"bank" → financial institution
"bank" → river bank
"bank" → to bank on something (rely on)
4. CONTEXT DEPENDENCY
"it" in: "The trophy didn't fit in the suitcase because it was too big"
WHAT is "it"? → trophy (too big to fit)
"it" in: "The trophy didn't fit in the suitcase because it was too small"
WHAT is "it"? → suitcase (too small to hold trophy)
5. SARCASM / IRONY
"Oh great, another Monday!" → NEGATIVE sentiment
"Oh great, the presentation went perfectly!" → POSITIVE sentiment
Same words, opposite meaning based on tone/context
6. LONG DOCUMENTS
A 100-page contract — the model must track information
mentioned on page 2 when answering a question on page 87
7. WORLD KNOWLEDGE
"Paris is the capital of..." requires world knowledge
"The bank approved the loan after it verified the..."
requires knowing banks verify documents before approving loans
All these challenges are exactly what modern LLMs like GPT-4 and Claude are designed to handle better. The Transformer's attention mechanism (Module 8) specifically addresses the context dependency and long document problems. This is why understanding NLP challenges first makes the "why" of transformer design obvious.
Worked Examples for Each Challenge
| Challenge | Example | Why It's Hard |
|---|---|---|
| Ambiguity | "I saw her duck" | Did she dodge (duck = verb) or was there a duck (noun)? |
| Synonyms | "The dog barked" vs "The canine made noise" | Model must know dog=canine, barked≈made noise |
| Polysemy | "He sat on the bank" | Is "bank" a river bank or a financial institution? Only the surrounding context decides |
| Context | "He gave John his book" | Whose book? The giver's or John's? |
| Sarcasm | "What a wonderful day!" (said in a storm) | Requires knowing it's stormy to detect sarcasm |
| Long docs | Legal contract: Clause 1 defines a term used in Clause 47 | Must track information across thousands of tokens |
Text as Data, Corpus, Vocabulary & OOV
ELI10: What is Text as Data?
Imagine you have a big box of LEGO bricks. Individual bricks are like characters. Groups of bricks that form a shape are like words. A complete LEGO scene is like a sentence. Your entire LEGO collection is like a document. A library of instruction manuals is like a corpus. The total set of all unique brick types you own is your vocabulary.
Text Granularity Levels
TEXT: "NLP is fun!"
LEVEL 1 — CHARACTERS (smallest unit):
['N', 'L', 'P', ' ', 'i', 's', ' ', 'f', 'u', 'n', '!']
• Pro: No unknown characters (all text = chars we know)
• Con: Very long sequences, loses word meaning
LEVEL 2 — WORDS (most common):
['NLP', 'is', 'fun']
• Pro: Preserves word meaning, natural unit
• Con: Huge vocabulary, rare/new words cause problems
LEVEL 3 — SENTENCES:
["NLP is fun!", "I love learning."]
• Used for document retrieval, semantic search
LEVEL 4 — DOCUMENTS:
Complete Wikipedia article, entire book chapter
• Used in document classification, summarization
CORPUS → A collection of documents:
Wikipedia = 6 million+ English documents = one corpus
Common Crawl = 80+ billion web pages = one (huge) corpus
GPT-3 trained on ~570GB of text data
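The granularity levels can be reproduced with a few lines of Python. This is a rough sketch: the regular expressions below are naive assumptions (real tokenizers and sentence splitters handle abbreviations, quotes, and Unicode far more carefully).

```python
import re

text = "NLP is fun! I love learning."

# Level 1: characters (every symbol, including spaces and punctuation)
chars = list(text)

# Level 2: words (naive: runs of letters/digits, punctuation dropped)
words = re.findall(r"\w+", text)

# Level 3: sentences (naive: split on whitespace that follows ., ! or ?)
sentences = re.split(r"(?<=[.!?])\s+", text)

print(len(chars), chars[:5])  # 28 ['N', 'L', 'P', ' ', 'i']
print(words)                  # ['NLP', 'is', 'fun', 'I', 'love', 'learning']
print(sentences)              # ['NLP is fun!', 'I love learning.']
```

Note how the same 28-character string becomes 6 word tokens or 2 sentence units: the coarser the unit, the shorter the sequence and the larger the set of possible units.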
What is a Vocabulary?
The vocabulary (often written as V) is the set of all unique words (or tokens) that a model knows. Think of it as the model's "dictionary".
When you train a model, you first scan all the training text and collect every unique word. That becomes your vocabulary. The size of the vocabulary — called |V| — is important:
- Small vocabulary (e.g., 10,000 words) → Many words labeled "unknown"
- Large vocabulary (e.g., 100,000 words) → Better coverage but uses more memory
- GPT-4's tokenizer vocabulary: ~100,000 tokens (not words, but sub-word pieces)
Example corpus: 3 sentences
"I love cats"
"I love dogs"
"cats and dogs"
Vocabulary = {I, love, cats, dogs, and} → |V| = 5
Each word gets an index:
I → 0
love → 1
cats → 2
dogs → 3
and → 4
"I love cats" → [0, 1, 2] ← Computer can now process this!
OOV — Out of Vocabulary Problem
OOV stands for "Out Of Vocabulary" — it refers to words that appear at inference (test) time that the model has never seen during training.
Why is OOV a problem? Imagine you built a model and your vocabulary is {cat, dog, bird}. Now someone inputs the word "ferret" — your model has no representation for it! It's like asking someone to recognize a face they've never seen before.
TRAINING VOCABULARY: {cat, dog, bird, run, jump}
INPUT AT INFERENCE TIME: "The ferret jumped over the fence"
Word check:
"The" → OOV! Not in vocabulary
"ferret" → OOV! Not in vocabulary
"jumped" → OOV! "jump" is there but "jumped" (past tense) is not!
"over" → OOV!
"the" → OOV! (lowercase "the" different from "The")
"fence" → OOV!
RESULT: Model maps all these to [UNK] token → loses meaning!
SOLUTION PREVIEW → Subword Tokenization (Module 3)
"ferret" → "fer" + "ret" (sub-pieces it knows!)
"jumped" → "jump" + "##ed" (root + suffix)
The OOV problem is precisely WHY modern tokenizers like BPE and WordPiece (used by GPT and BERT) use subword tokenization. By breaking rare words into smaller pieces, every word — even completely new words — can be represented. This is one of the most important innovations in modern NLP.
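The effect of a closed word-level vocabulary can be sketched as follows. The vocabulary and the `[UNK]` ID are toy assumptions taken from the example above; the point is only that every unseen word collapses to the same ID.

```python
UNK = "[UNK]"
# Toy training vocabulary, with ID 0 reserved for unknown words.
vocab = {UNK: 0, "cat": 1, "dog": 2, "bird": 3, "run": 4, "jump": 5}

def encode(text: str) -> list[int]:
    """Map each whitespace-separated word to its ID, or to the [UNK] ID."""
    return [vocab.get(word, vocab[UNK]) for word in text.split()]

def oov_rate(text: str) -> float:
    """Fraction of words in the text that are missing from the vocabulary."""
    words = text.split()
    return sum(word not in vocab for word in words) / len(words)

sentence = "The ferret jumped over the fence"
ids = encode(sentence)
rate = oov_rate(sentence)
print(ids)                       # [0, 0, 0, 0, 0, 0]
print(rate)                      # 1.0
print(encode("cat jump fence"))  # [1, 5, 0]
```

Because all six words map to ID 0, the model cannot tell "ferret" from "fence", which is exactly the information loss that subword tokenization avoids.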
Complete Tokenization Guide: BPE, WordPiece, SentencePiece
Why Models Cannot Read Text Directly
Neural networks are mathematical functions. They take numbers as input and produce numbers as output. A sentence like "I love NLP" is a string of characters — NOT numbers. Tokenization is the process of converting this string into a sequence of integer IDs that the model can actually process.
Raw Text: "I love NLP"
│
▼
Tokenizer
│
▼
Tokens: ["I", "love", "NLP"]
│
▼
Token IDs: [40, 1842, 27207] ← Illustrative IDs; the actual values depend on the tokenizer's vocabulary
│
▼
Model processes these integers → generates output integers
│
▼
Decode: "It is fascinating" ← Convert IDs back to text
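The text → IDs → text round trip can be sketched with a tiny word-level vocabulary. This reuses the three-sentence corpus from the vocabulary section; the helper names `encode` and `decode` are illustrative, and real tokenizers use subword units rather than whole words.

```python
corpus = ["I love cats", "I love dogs", "cats and dogs"]

# Give each new word the next free index, in order of first appearance.
word_to_id: dict[str, int] = {}
for sentence in corpus:
    for word in sentence.split():
        if word not in word_to_id:
            word_to_id[word] = len(word_to_id)

# Reverse mapping, needed to turn model output IDs back into text.
id_to_word = {i: w for w, i in word_to_id.items()}

def encode(sentence: str) -> list[int]:
    """Text -> list of integer token IDs."""
    return [word_to_id[w] for w in sentence.split()]

def decode(ids: list[int]) -> str:
    """List of integer token IDs -> text."""
    return " ".join(id_to_word[i] for i in ids)

print(word_to_id)             # {'I': 0, 'love': 1, 'cats': 2, 'dogs': 3, 'and': 4}
print(encode("I love cats"))  # [0, 1, 2]
print(decode([3, 4, 2]))      # dogs and cats
```

The model itself only ever sees the integer lists; `decode` is applied to whatever IDs it produces.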
3 Types of Tokenization — Comparison
INPUT TEXT: "unhappiness"
1. WORD TOKENIZATION
→ ["unhappiness"]
✓ Simple, preserves words
✗ "unhappiness" might be OOV if not in vocabulary!
✗ Vocabulary can have 1M+ words
2. CHARACTER TOKENIZATION
→ ["u","n","h","a","p","p","i","n","e","s","s"]
✓ Never OOV: every string is built from a small, fixed set of characters
✓ Very small vocabulary (e.g., 256 symbols when working at the byte level)
✗ Very long sequences (3x–10x longer than words)
✗ Each character alone has little meaning
3. SUBWORD TOKENIZATION (BPE / WordPiece)
→ ["un", "##happiness"] ← WordPiece style
→ ["un", "happ", "iness"] ← BPE style
✓ Balances vocabulary size and sequence length
✓ Rare words decomposed into known sub-pieces
✓ Common words kept whole: "the" → ["the"]
✓ Never truly OOV (worst case: character level)
← USED BY ALL MODERN LLMs!
BPE — Byte Pair Encoding (Used by GPT)
BPE was originally a data compression algorithm, adapted for NLP by Sennrich et al. in 2016. The key insight is: instead of having a fixed word vocabulary, start with individual characters and iteratively merge the most frequent adjacent pairs into new tokens.
- 1. Start with character vocabulary: Every unique character in your training data = initial vocabulary. For English text this is a-z, A-Z, 0-9, and punctuation (roughly 100 symbols); byte-level BPE, as used by GPT-2, starts from all 256 byte values.
- 2. Count all adjacent character pairs: In "the cat sat", count: ('t','h')=1, ('h','e')=1, ('c','a')=1, ('a','t')=2, ('s','a')=1 → ('a','t') most frequent!
- 3. Merge the most frequent pair: 'a'+'t' → 'at'. Now vocabulary includes 'at' as a token. Update all occurrences in text.
- 4. Repeat until vocabulary size reached: Keep merging most frequent pairs. GPT-2 does 50,000 merge operations to reach 50,257 tokens (256 byte tokens + 50,000 merges + 1 special token).
BPE STEP-BY-STEP WORKED EXAMPLE
Corpus: "aab aac ab ac"
Initial vocabulary: {a, b, c, space}
STEP 1: Count pair frequencies (pairs are counted inside words; spaces only mark word boundaries here)
(a,a) = 2 ← from "aab" and "aac"
(a,b) = 2 ← from "aab" and "ab"
(a,c) = 2 ← from "aac" and "ac"
Three-way tie, so pick one by a fixed rule. Let's merge (a,b) → "ab":
After merge: "a[ab] aac [ab] ac"
vocab += "ab"
STEP 2: Count again in updated corpus:
"a","[ab]" → (a,ab)=1
"a","a","c" → (a,a)=1, (a,c)=1
"[ab]" → single token, no pairs
"a","c" → (a,c)=1
Totals: (a,c)=2 ← most frequent, (a,a)=1, (a,ab)=1
Merge (a,c) → "ac":
After merge: "a[ab] a[ac] [ab] [ac]"
vocab += "ac"
CONTINUE until target vocab size reached...
RESULT: Frequent sequences become single tokens
"the" → one token (very common)
"ing" → one token (common suffix)
"xyzzy" → "x" + "y" + "z" + "z" + "y" (rare = char-level)
WordPiece (Used by BERT)
WordPiece is similar to BPE but uses a different criterion for merging: instead of frequency, it maximizes the likelihood of the training data given the vocabulary. In practice, it tends to create more linguistically meaningful pieces.
Key difference: WordPiece marks continuation pieces with ##. So "playing" might tokenize as ["play", "##ing"] where ##ing means "this piece continues a word from the previous token".
WordPiece tokenization examples (BERT-style, illustrative; exact splits depend on the learned vocabulary):
"playing" → ["play", "##ing"]
"unbelievable" → ["un", "##believ", "##able"]
"ChatGPT" → ["Chat", "##GP", "##T"]
"COVID-19" → ["CO", "##VID", "-", "19"]
"hello" → ["hello"] ← common word, whole token
"the" → ["the"] ← very common, whole token
The ## prefix tells the model: "I am a continuation of the
previous token, not the start of a new word"
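At tokenization time, WordPiece segments each word greedily, always taking the longest vocabulary piece that matches at the current position. A minimal sketch of that matching loop follows; `toy_vocab` is a hypothetical hand-written vocabulary, not BERT's real one, so the splits are for illustration only.

```python
def wordpiece_tokenize(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first segmentation of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark as a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits here: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabulary, hand-picked for this demo.
toy_vocab = {"play", "##ing", "##ed", "un", "##believ", "##able", "the"}

print(wordpiece_tokenize("playing", toy_vocab))       # ['play', '##ing']
print(wordpiece_tokenize("unbelievable", toy_vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("the", toy_vocab))           # ['the']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

With a real vocabulary that contains every single character (and its `##` form), the `[UNK]` branch is rarely reached.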
SentencePiece (Used by T5, LLaMA, Mistral)
SentencePiece, developed by Google, solves a key problem: BPE and WordPiece require pre-tokenization (splitting text into words first using spaces), which is language-specific. Chinese, Japanese, Thai don't use spaces between words!
SentencePiece treats the input as a raw character stream (including spaces). It uses ▁ (a special underscore character) to mark word boundaries. This makes it language-independent.
SentencePiece treats spaces as characters:
Input: "Hello world"
Tokens: ["▁Hello", "▁world"]
↑ underscore marks start of word
Works for ANY language because no pre-tokenization needed:
Chinese: "我爱NLP" → ["▁我", "爱", "NL", "P"]
Japanese: "自然言語処理" → ["▁自然", "言語", "処理"]
LLaMA 1/2 use SentencePiece with vocabulary size 32,000
Mistral uses SentencePiece with vocabulary size 32,000
T5 uses SentencePiece with vocabulary size 32,128
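The whitespace trick can be illustrated without the real library. The sketch below is not SentencePiece itself: `to_pieces` and `from_pieces` are hypothetical helpers that only show how turning spaces into ▁ makes tokenization reversible. A real model would further split each piece using a learned BPE or unigram vocabulary.

```python
WS = "\u2581"  # "▁", the symbol used to represent a space

def to_pieces(text: str) -> list[str]:
    """Replace spaces with ▁ (plus one leading ▁) and cut before each ▁."""
    marked = WS + text.replace(" ", WS)
    pieces: list[str] = []
    for ch in marked:
        if ch == WS:
            pieces.append(ch)   # a ▁ always starts a new piece
        else:
            pieces[-1] += ch    # any other character extends the current piece
    return pieces

def from_pieces(pieces: list[str]) -> str:
    """Concatenate pieces, turn ▁ back into spaces, drop the leading one."""
    return "".join(pieces).replace(WS, " ")[1:]

pieces = to_pieces("Hello world")
print(pieces)               # ['▁Hello', '▁world']
print(from_pieces(pieces))  # Hello world
```

Because spaces are ordinary symbols inside the pieces, decoding is plain string concatenation and needs no language-specific rules about where spaces go.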
Special Tokens — Critical for Understanding LLMs
Special tokens are reserved tokens with specific roles in the pipeline. They are NOT normal vocabulary words — they signal structural information to the model.
| Token | Full Name | Used By | Purpose |
|---|---|---|---|
| [PAD] | Padding Token | BERT, most models | Used to make sequences the same length in a batch. An attention mask tells the model to ignore padded positions. |
| [UNK] | Unknown Token | Older models, BERT | Replaces tokens not in vocabulary. Less needed now with subword tokenization. |
| [CLS] | Classification Token | BERT | Prepended to every sequence. BERT learns to put the meaning of the whole sentence in this token's representation — used for classification tasks. |
| [SEP] | Separator Token | BERT | Separates two sentences in a pair (e.g., question vs context in QA). Also marks end of sequence. |
| <BOS> | Beginning of Sequence | GPT, LLaMA | Signals to the model that a new sequence is starting. GPT uses <|endoftext|> for this. |
| <EOS> | End of Sequence | GPT, LLaMA, all models | Signals that the model should stop generating. Critical for knowing when to stop during inference. |
BERT INPUT FORMAT:
[CLS] sentence_A [SEP] sentence_B [SEP] [PAD] [PAD]
[CLS] = classification, [SEP] = separator, [PAD] = padding
Example: "Is the cat cute?" → "Yes it is"
[CLS] Is the cat cute ? [SEP] Yes it is [SEP] [PAD] [PAD]
GPT INPUT FORMAT (no [CLS] or [SEP] needed):
<BOS> The cat sat on the mat <EOS>
<BOS> = start, <EOS> = stop
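Assembling a BERT-style sentence pair with special tokens and padding can be sketched as below. The function name `build_bert_input` and the choice to raise on overlong input are assumptions for illustration; real tokenizers also emit integer IDs rather than token strings.

```python
def build_bert_input(tokens_a: list[str], tokens_b: list[str], max_len: int):
    """Return (tokens, attention_mask) padded to exactly max_len positions."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_len:
        raise ValueError("sequence longer than max_len; truncate it first")
    # 1 = real token the model should attend to, 0 = padding to be ignored.
    attention_mask = [1] * len(tokens)
    pad_count = max_len - len(tokens)
    tokens = tokens + ["[PAD]"] * pad_count
    attention_mask = attention_mask + [0] * pad_count
    return tokens, attention_mask

tokens, mask = build_bert_input(
    ["Is", "the", "cat", "cute", "?"], ["Yes", "it", "is"], max_len=13
)
print(tokens)
# ['[CLS]', 'Is', 'the', 'cat', 'cute', '?', '[SEP]', 'Yes', 'it', 'is', '[SEP]', '[PAD]', '[PAD]']
print(mask)
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

The mask is what lets sequences of different lengths share one batch without the padding influencing the result.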
Context Window & Token Limits
Every LLM has a context window — the maximum number of tokens it can process at once (both input and output). This is not the same as words! Due to subword tokenization:
- 1 word ≈ 1.3–1.5 tokens on average for English
- Code and rare words tokenize into more pieces
- A 4,096-token context ≈ 3,000 words ≈ 12 pages of text
- 16,384 tokens ≈ 12,000 words ≈ 48 pages
- 128,000 tokens ≈ 96,000 words ≈ 380 pages
- 200,000 tokens ≈ 150,000 words ≈ 600 pages
- 1,000,000 tokens ≈ 750,000 words ≈ 3,000 pages
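The conversions above are simple arithmetic and can be sketched as helper functions. The constants are rough assumptions (about 0.75 English words per token and 250 words per page), so treat the results as estimates; an exact count requires running the model's actual tokenizer.

```python
WORDS_PER_TOKEN = 0.75  # assumed English average (about 1.33 tokens per word)
WORDS_PER_PAGE = 250    # assumed page density

def tokens_to_words(tokens: int) -> int:
    """Approximate number of English words that fit in a token budget."""
    return round(tokens * WORDS_PER_TOKEN)

def tokens_to_pages(tokens: int) -> float:
    """Approximate number of pages that fit in a token budget."""
    return tokens_to_words(tokens) / WORDS_PER_PAGE

def fits(prompt_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """Input and output share one window, so both must fit together."""
    return prompt_tokens + max_output_tokens <= context_window

print(tokens_to_words(128_000))  # 96000
print(tokens_to_pages(128_000))  # 384.0
print(fits(3_500, 1_000, 4_096)) # False
```

The `fits` check reflects the fact that a long prompt leaves less room for the generated answer.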
Python Implementation — BPE from Scratch
# ============================================================
# BPE (Byte Pair Encoding) Tokenizer - Built from Scratch
# ============================================================
from collections import Counter, defaultdict


class SimpleBPE:
    """
    A minimal BPE tokenizer to understand the core algorithm.
    NOT optimized for production - purely educational.
    """

    def __init__(self, vocab_size: int = 300):
        # Target vocabulary size (initial chars + merged pairs)
        self.vocab_size = vocab_size
        self.merges = {}    # stores all merge operations, in learned order
        self.vocab = set()  # our complete vocabulary

    def get_vocab(self, corpus: list) -> dict:
        """
        Convert corpus to word-frequency dict where each word is
        represented as a tuple of characters + end marker.
        'hello' with freq 3 → ('h','e','l','l','o','</w>') : 3
        '</w>' marks end of word (helps track word boundaries)
        """
        vocab = Counter()
        for sentence in corpus:
            for word in sentence.split():
                # Convert word to tuple of chars + end marker
                char_tuple = tuple(word) + ('</w>',)
                vocab[char_tuple] += 1
        return vocab

    def get_pair_frequencies(self, vocab: dict) -> dict:
        """
        Count all adjacent pairs across all words in vocab.
        Example: ('h','e','l','l','o','</w>') with freq 3
        Pairs counted: (h,e):3, (e,l):3, (l,l):3, (l,o):3, (o,</w>):3
        """
        pairs = defaultdict(int)
        for word_tuple, freq in vocab.items():
            # Look at each adjacent pair of tokens
            for i in range(len(word_tuple) - 1):
                pair = (word_tuple[i], word_tuple[i + 1])
                pairs[pair] += freq  # weighted by word frequency!
        return pairs

    def merge_pair(self, best_pair: tuple, vocab: dict) -> dict:
        """
        Merge best_pair everywhere in the vocabulary.
        ('h','e') → 'he' everywhere it appears.
        """
        new_vocab = {}
        left, right = best_pair
        bigram = left + right  # merged token
        for word_tuple, freq in vocab.items():
            # Replace each occurrence of (left, right) with bigram
            new_word = []
            i = 0
            while i < len(word_tuple):
                if (i < len(word_tuple) - 1
                        and word_tuple[i] == left
                        and word_tuple[i + 1] == right):
                    new_word.append(bigram)  # replace pair with merge
                    i += 2                   # skip both tokens
                else:
                    new_word.append(word_tuple[i])
                    i += 1
            new_vocab[tuple(new_word)] = freq
        return new_vocab

    def train(self, corpus: list):
        """
        Main training loop: keep merging most frequent pairs
        until we reach our target vocabulary size.
        """
        # Step 1: Build initial character-level vocabulary
        vocab = self.get_vocab(corpus)

        # Collect all unique characters (initial vocab)
        initial_tokens = set()
        for word_tuple in vocab.keys():
            initial_tokens.update(word_tuple)
        self.vocab = initial_tokens.copy()
        print(f"Initial vocab size: {len(self.vocab)}")
        print(f"Initial tokens: {sorted(self.vocab)}")

        # Step 2: Iteratively merge most frequent pairs
        num_merges = self.vocab_size - len(self.vocab)
        for i in range(num_merges):
            # Count all adjacent pairs
            pair_freqs = self.get_pair_frequencies(vocab)
            if not pair_freqs:
                break  # No more pairs to merge

            # Find the most frequent pair
            best_pair = max(pair_freqs, key=pair_freqs.get)
            best_freq = pair_freqs[best_pair]

            # Record this merge operation
            merged_token = best_pair[0] + best_pair[1]
            self.merges[best_pair] = merged_token
            self.vocab.add(merged_token)

            # Apply the merge to all vocabulary entries
            vocab = self.merge_pair(best_pair, vocab)

            if i < 5:  # Print first 5 merges for inspection
                print(f"Merge {i+1}: {best_pair} → '{merged_token}' (freq={best_freq})")

        print(f"\nFinal vocab size: {len(self.vocab)}")
        return self


# ── DEMO ──────────────────────────────────────────
# Simple corpus for demonstration
corpus = [
    "low low low low low",
    "lower lower",
    "newest newest newest newest",
    "widest widest",
]

# Inspect the initial word representations and pair counts
bpe = SimpleBPE(vocab_size=20)
vocab = bpe.get_vocab(corpus)
print("Initial word representations:")
for w, f in vocab.items():
    print(f"  {w}: {f}")

pairs = bpe.get_pair_frequencies(vocab)
print("\nTop 5 most frequent pairs:")
for pair, freq in sorted(pairs.items(), key=lambda x: -x[1])[:5]:
    print(f"  {pair}: {freq}")

# Run the full training loop
bpe.train(corpus)
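Training only produces a list of merges; tokenizing a new word means replaying those merges in the order they were learned. A minimal sketch of that step follows. The `merges` list here is a hypothetical hand-written example (not the output of an actual training run), and `apply_merges` is an illustrative helper name.

```python
def apply_merges(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Segment `word` by replaying merge rules in their learned order."""
    symbols = list(word) + ["</w>"]  # start from characters + end-of-word marker
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)  # fuse the pair into one symbol
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge list, in the order training might have produced it.
merges = [("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w")]

print(apply_merges("lowest", merges))  # ['low', 'est</w>']
print(apply_merges("slow", merges))    # ['s', 'low', '</w>']
```

Neither "lowest" nor "slow" needs to appear in the training corpus: both are assembled from pieces learned from other words, and anything unmergeable falls back to single characters.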
Using Hugging Face Tokenizers
# pip install transformers tokenizers
from transformers import AutoTokenizer

# ── GPT-2 Tokenizer (BPE) ─────────────────────────
gpt_tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Hello, I love Natural Language Processing!"
tokens = gpt_tokenizer.tokenize(text)
token_ids = gpt_tokenizer.encode(text)
print("GPT-2 Tokens:", tokens)
# ['Hello', ',', 'ĠI', 'Ġlove', 'ĠNatural', ...]
# Note: Ġ = space (BPE encodes the leading space into the token!)
print("Token IDs:", token_ids)
# [15496, 11, 314, 1842, ...]
print("Vocab size:", gpt_tokenizer.vocab_size)  # 50257

# Decode back to text
decoded = gpt_tokenizer.decode(token_ids)
print("Decoded:", decoded)
# "Hello, I love Natural Language Processing!"

# ── BERT Tokenizer (WordPiece) ─────────────────────
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens_bert = bert_tokenizer.tokenize(text)
print("\nBERT Tokens:", tokens_bert)
# ['hello', ',', 'i', 'love', 'natural', 'language', 'processing', '!']
# BERT lowercases (bert-base-uncased)

# See special tokens (plain Python lists, so PyTorch is not required)
encoded_bert = bert_tokenizer(text)
print("\nWith special tokens:",
      bert_tokenizer.convert_ids_to_tokens(encoded_bert["input_ids"]))
# ['[CLS]', 'hello', ',', 'i', 'love', ..., '[SEP]']
#  ↑ automatically added!               ↑

# ── Test OOV handling ─────────────────────────────
oov_text = "supercalifragilisticexpialidocious"
gpt_oov = gpt_tokenizer.tokenize(oov_text)
bert_oov = bert_tokenizer.tokenize(oov_text)
print("\nOOV word tokenization:")
print("GPT-2:", gpt_oov)
# Several subword pieces, e.g. ['super', 'cali', 'fra', ...] (exact split depends on the vocabulary)
print("BERT:", bert_oov)
# Several pieces with ## continuations, e.g. ['super', '##cal', '##if', ...]
Interview Questions & Traps
BPE: Merges the pair with the highest raw frequency count in the corpus. "How often do these two tokens appear adjacent?"
WordPiece: Merges the pair that maximizes the likelihood of the training data. Specifically, it selects the pair (A, B) where freq(AB) / (freq(A) × freq(B)) is maximized. This tends to merge pairs that appear together MORE than chance would predict.
In practice: BPE is used by GPT (all versions), RoBERTa, and most modern LLMs. WordPiece is used by BERT and its variants.
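The two selection rules can be compared on a toy corpus. The words and frequencies below are made-up numbers chosen so that the rules disagree; `wp_score` is an illustrative helper implementing the freq(AB) / (freq(A) × freq(B)) score described above.

```python
from collections import Counter

# Toy corpus: word (as a tuple of symbols) -> frequency. Illustrative numbers.
corpus = {("t", "h", "e"): 10, ("t", "h", "a", "t"): 4, ("q", "u", "a"): 2}

sym_freq: Counter = Counter()
pair_freq: Counter = Counter()
for symbols, freq in corpus.items():
    for s in symbols:
        sym_freq[s] += freq
    for a, b in zip(symbols, symbols[1:]):
        pair_freq[(a, b)] += freq

def wp_score(pair: tuple[str, str]) -> float:
    """WordPiece-style score: pair frequency normalized by its parts."""
    return pair_freq[pair] / (sym_freq[pair[0]] * sym_freq[pair[1]])

bpe_choice = max(pair_freq, key=pair_freq.get)  # highest raw count
wordpiece_choice = max(pair_freq, key=wp_score) # highest normalized score

print(bpe_choice, pair_freq[bpe_choice])                   # ('t', 'h') 14
print(wordpiece_choice, wp_score(wordpiece_choice))        # ('q', 'u') 0.5
```

BPE picks ('t','h') because it is the most common pair overall; the WordPiece score prefers ('q','u') because 'q' and 'u' never appear apart in this corpus, even though the pair is rare.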
Practice Questions
Step 1: Count pairs: (a,a)=2, (a,</w>)=4, (a,b)=1, (b,</w>)=2, (b,a)=2, (b,b)=1. The count for (a,</w>) is 4 because every word ending in 'a' contributes: 'aa'×2 + 'ba'×2. Most frequent: (a,</w>)=4. Merge 'a'+'</w>' → 'a</w>', which becomes one token.
Step 2: Recount. ('a','a</w>') appears 2 times (from 'aa'×2) and ('b','a</w>') appears 2 times (from 'ba'×2). No pair occurs more than twice, so this is a tie; pick one by a fixed rule, say merge ('b','a</w>') → 'ba</w>'. The vocabulary now includes 'a', 'b', '</w>', 'a</w>', 'ba</w>'.
1. Use shorter prompts: Avoid verbose system prompts. Every word costs money.
2. Avoid token-inefficient languages: Non-English text often uses more tokens per character. Chinese characters may be 1-2 tokens each, but some scripts use 3-4 tokens per character.
3. Avoid whitespace waste: Extra spaces, newlines, and indentation all consume tokens.
4. Use tiktoken to count first: `import tiktoken; enc = tiktoken.encoding_for_model("gpt-4"); len(enc.encode(text))` — always check before sending.
5. Truncate context: Don't send entire conversation history every time; summarize older turns.
6. Use streaming: Doesn't reduce tokens but improves user experience while generation happens.
Relationships to LLMs
GPT (all versions)
Uses BPE tokenization. GPT-2: 50,257 tokens. GPT-4: ~100,256 tokens. The tokenizer is the very first step before any GPT processing.
Claude
Uses a proprietary tokenizer whose vocabulary details are not publicly documented for recent models. Claude 3's context window is 200K tokens, and tokenization determines how much text fits.
LLaMA
LLaMA 1/2 use SentencePiece BPE with a 32K vocab. LLaMA 3 switched to a tiktoken-based BPE tokenizer and expanded to a 128K vocabulary, significantly improving multilingual performance.
DeepSeek
DeepSeek uses a custom BPE tokenizer optimized for both English and Chinese. Uses cl100k_base-compatible tokenizer with extended Chinese tokens.
Qwen
Qwen uses tiktoken-based BPE with ~150K vocabulary, heavily optimized for Chinese — Chinese characters get dedicated tokens for efficiency.
Kimi
Kimi (by Moonshot AI) uses a custom tokenizer optimized for long-context Chinese+English tasks with 128K context window support.
Cheat Sheet
BPE: Merge most frequent character pairs. Space encoded into tokens. GPT-2: 50K vocab, GPT-4: 100K vocab.
WordPiece: Merge pairs with max likelihood score. ## prefix for continuations. BERT: 30K vocab (uncased).
SentencePiece: Language-independent. ▁ for word boundaries. No pre-tokenization. LLaMA 1/2: 32K vocab.
Context window: Max tokens the model can see. 1 word ≈ 1.3 tokens. Claude 3.5: 200K, GPT-4o: 128K.
[CLS]: BERT prepends this. The final hidden state of [CLS] represents the whole sentence and is used for classification.
OOV: Subword tokenization means never truly OOV. Worst case: single characters are always in vocabulary.
Mini Project
Compare how different tokenizers handle the same text — great for building intuition about LLM costs and behavior.
- 1. Install transformers and tiktoken: pip install transformers tiktoken
- 2. Load 3 tokenizers: GPT-2 (BPE), BERT-base-uncased (WordPiece), and a LLaMA tokenizer
- 3. Tokenize the same 10 sentences: Include normal English, a technical term, a rare proper noun, code, Chinese/Arabic text
- 4. Compare token counts: Which tokenizer is most "efficient" for each type of text? Build a comparison table.
- 5. Estimate API costs: Given GPT-4 costs $0.03/1K tokens, calculate cost for processing your 10 sentences 1000 times.
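The cost estimate in the last step is plain arithmetic once the token counts are known. The sketch below assumes the $0.03 per 1K tokens figure from the project brief and uses made-up token counts as placeholders; in the real project they would come from the tokenizers loaded earlier.

```python
PRICE_PER_1K_TOKENS = 0.03  # USD, the figure quoted in the project brief

def estimate_cost(token_counts: list[int], repetitions: int,
                  price_per_1k: float = PRICE_PER_1K_TOKENS) -> float:
    """Total cost in USD for processing all sentences `repetitions` times."""
    total_tokens = sum(token_counts) * repetitions
    return total_tokens / 1000 * price_per_1k

# Placeholder token counts for 10 sentences (replace with real tokenizer output).
token_counts = [12, 9, 15, 22, 8, 30, 11, 14, 19, 10]

cost = estimate_cost(token_counts, repetitions=1000)
print(sum(token_counts))  # 150
print(f"${cost:.2f}")     # $4.50
```

Running the same calculation with each tokenizer's counts shows directly how tokenizer efficiency translates into cost.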
One-Hot, BoW, N-Grams & TF-IDF
One-Hot Encoding
One-hot encoding is the simplest way to represent a word as a number vector. Each word gets a unique position in a vector, and only that position is "1" — everything else is "0".
Vocabulary: {cat:0, dog:1, bird:2, runs:3, jumps:4}
Vocabulary size |V| = 5
One-hot vectors:
"cat" → [1, 0, 0, 0, 0]
"dog" → [0, 1, 0, 0, 0]
"bird" → [0, 0, 1, 0, 0]
"runs" → [0, 0, 0, 1, 0]
"jumps" → [0, 0, 0, 0, 1]
CRITICAL PROBLEMS:
1. For 50,000 word vocab → each vector has 50,000 dimensions!
99.998% of each vector is zeros → SPARSE & WASTEFUL
2. "cat" and "dog" are exactly as "far apart" as "cat" and "airplane"
The vectors don't capture that cat/dog are both animals!
3. No way to compute meaningful similarity:
cat · dog = [1,0,0,0,0] · [0,1,0,0,0] = 0 + 0 = 0
cat · airplane = 0 too! ← SAME DISTANCE = MEANINGLESS
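The similarity problem can be checked in a few lines. This is a minimal sketch using plain Python lists; the vocabulary is the one above with "airplane" added as an assumption so the comparison can be run:

```python
# Vocabulary from the example, plus "airplane" (added here for the comparison).
vocab = {"cat": 0, "dog": 1, "bird": 2, "runs": 3, "jumps": 4, "airplane": 5}


def one_hot(word, vocab):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec


def dot(u, v):
    """Dot product: multiply element-wise, then sum."""
    return sum(a * b for a, b in zip(u, v))


cat, dog, airplane = (one_hot(w, vocab) for w in ("cat", "dog", "airplane"))

print("cat      :", cat)
print("cat · dog      =", dot(cat, dog))
print("cat · airplane =", dot(cat, airplane))
print("cat · cat      =", dot(cat, cat))
```

Every pair of distinct words gets a dot product of 0, so the representation carries no information about which words are related.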
Bag of Words (BoW)
Bag of Words represents an entire document (not just a word) as a vector by counting how many times each vocabulary word appears. It's called "bag" because it ignores word ORDER — it just counts.
Vocabulary: {I:0, love:1, NLP:2, hate:3, Python:4}
Document 1: "I love NLP and I love Python"
"I" appears 2 times
"love" appears 2 times
"NLP" appears 1 time
"Python" appears 1 time
BoW vector: [2, 2, 1, 0, 1]
I love NLP hate Python
Document 2: "I hate Python but I love NLP"
BoW vector: [2, 1, 1, 1, 1]
I love NLP hate Python
Similarity: Both have [I×2, NLP×1] → related topics ✓
PROBLEM — Order is completely lost:
"Dog bites man" → [1,1,1] (dog, bites, man counts)
"Man bites dog" → [1,1,1] (SAME VECTOR!)
But these mean very different things!
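The counting above can be reproduced with the standard library. A minimal sketch, assuming simple whitespace tokenization and ignoring words outside the fixed vocabulary (such as "and" and "but"):

```python
from collections import Counter


def bow_vector(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocab]


vocab = ["I", "love", "NLP", "hate", "Python"]
doc1 = bow_vector("I love NLP and I love Python", vocab)
doc2 = bow_vector("I hate Python but I love NLP", vocab)
print("Document 1:", doc1)
print("Document 2:", doc2)

# Word order is lost: both sentences map to the same vector.
order_vocab = ["dog", "bites", "man"]
a = bow_vector("dog bites man", order_vocab)
b = bow_vector("man bites dog", order_vocab)
print("Same vector?", a == b)
```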
N-Grams — Capturing Some Context
N-grams capture some word order by creating features from sequences of N consecutive words. Instead of individual words, you count sequences.
Text: "I love natural language processing"
UNIGRAMS (N=1) — individual words:
{I, love, natural, language, processing}
BIGRAMS (N=2) — pairs of adjacent words:
{(I,love), (love,natural), (natural,language), (language,processing)}
TRIGRAMS (N=3) — triplets:
{(I,love,natural), (love,natural,language), (natural,language,processing)}
WHY N-GRAMS HELP:
"not good" as a bigram captures negation that unigrams miss
"New York" as bigram = city name; "New" + "York" alone = misleading
"not bad" bigram → mildly positive sentiment (a negated negative word)
WHY N-GRAMS STILL HAVE LIMITS:
N=2: captures 2-word context
N=3: captures 3-word context
N=10: captures 10-word context... but vocabulary explodes!
With 50K words: unigrams=50K features, bigrams=2.5 BILLION possible features
(Though most don't appear in practice → sparsity again)
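Extracting n-grams is a sliding window over the token list. A minimal sketch, assuming whitespace tokenization; the last lines reproduce the feature-count estimate for a 50K-word vocabulary:

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "I love natural language processing".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print("Bigrams :", bigrams)
print("Trigrams:", trigrams)

# Number of *possible* bigrams grows with the square of the vocabulary size.
vocab_size = 50_000
possible_bigrams = vocab_size ** 2
print(f"Possible bigrams for a 50K vocabulary: {possible_bigrams:,}")
```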
TF-IDF — The Classic Information Retrieval Method
TF-IDF (Term Frequency — Inverse Document Frequency) is still used today in search engines and information retrieval. It solves a key problem with BoW: common words like "the", "is", "and" appear in every document and are useless for distinguishing content. TF-IDF weights words by how distinctive they are.
Intuition: A word is important to a document if it appears frequently IN THAT document BUT rarely across ALL documents.
$$\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \log\frac{N}{\text{df}(t)}$$
| t | = a specific term (word) |
| d | = the specific document we're scoring |
| D | = the entire collection of documents (corpus) |
| TF(t,d) | = how often term t appears in document d (normalized by document length) |
| N | = total number of documents in corpus D |
| df(t) | = document frequency = how many documents contain term t at least once |
| log | = natural logarithm (used to dampen extreme values) |
WORKED EXAMPLE:
Corpus of 1,000 documents.
Word "the":
TF: appears 50 times in a 500-word document = 50/500 = 0.1
IDF: appears in ALL 1,000 documents = log(1000/1000) = log(1) = 0
TF-IDF = 0.1 × 0 = 0 ← "the" gets ZERO weight! ✓
Word "photosynthesis":
TF: appears 10 times in a 500-word biology document = 10/500 = 0.02
IDF: appears in only 5 documents = log(1000/5) = log(200) = 5.3
TF-IDF = 0.02 × 5.3 = 0.106 ← HIGH weight! ✓
Word "cancer" in a medical report:
TF: appears 20 times in 1000-word document = 0.02
IDF: appears in 100 documents = log(1000/100) = log(10) = 2.3
TF-IDF = 0.02 × 2.3 = 0.046 ← moderate weight ✓
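The worked example can be verified directly. A minimal sketch of the unsmoothed definition used above (TF normalized by document length, IDF as the natural log of N over document frequency); the function names are illustrative:

```python
import math


def tf(term_count, doc_length):
    """Term frequency normalized by document length."""
    return term_count / doc_length


def idf(n_docs, doc_freq):
    """Inverse document frequency with the natural logarithm."""
    return math.log(n_docs / doc_freq)


def tf_idf(term_count, doc_length, n_docs, doc_freq):
    return tf(term_count, doc_length) * idf(n_docs, doc_freq)


N_DOCS = 1000
# word -> (count in document, document length, documents containing the word)
examples = {
    "the": (50, 500, 1000),
    "photosynthesis": (10, 500, 5),
    "cancer": (20, 1000, 100),
}

scores = {
    word: tf_idf(count, length, N_DOCS, doc_freq)
    for word, (count, length, doc_freq) in examples.items()
}
for word, score in scores.items():
    print(f"{word:15s} TF-IDF = {score:.3f}")
```

Library implementations often differ in detail. scikit-learn's TfidfVectorizer, for example, smooths the IDF term and L2-normalizes each document vector by default, so its numbers will not match this hand calculation exactly.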
TF-IDF with cosine similarity still underpins many production search systems. BM25, a TF-IDF variant, is the default ranking function in Elasticsearch and is widely used in the retrieval stage of RAG pipelines. Don't dismiss these "classical" methods: they're fast, interpretable, and often competitive with expensive neural approaches for keyword-heavy search.
Python Implementation
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "The cat sat on the mat",
    "The dog sat on the floor",
    "Cats and dogs are common pets",
    "NLP is the study of language",
]

# ── 1. Bag of Words ──────────────────────────────
bow_vectorizer = CountVectorizer()
bow_matrix = bow_vectorizer.fit_transform(corpus)
print("BOW Vocabulary:", bow_vectorizer.vocabulary_)
print("BOW Matrix shape:", bow_matrix.shape)
# shape = (4 documents, N unique words)

# ── 2. TF-IDF ─────────────────────────────────────
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)
print("\nTF-IDF Matrix shape:", tfidf_matrix.shape)

# Find most similar documents to query
query = "cats and dogs"
query_vec = tfidf_vectorizer.transform([query])
similarities = cosine_similarity(query_vec, tfidf_matrix)
print("\nSimilarity to 'cats and dogs':")
for i, sim in enumerate(similarities[0]):
    print(f" Doc {i}: {corpus[i][:40]}... → {sim:.3f}")

# ── 3. N-Grams ────────────────────────────────────
ngram_vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
ngram_matrix = ngram_vectorizer.fit_transform(corpus)

# Show bigram features
features = ngram_vectorizer.get_feature_names_out()
bigrams = [f for f in features if ' ' in f]
print("\nBigrams found:", bigrams[:10])
Word2Vec, GloVe, FastText & Embedding Space
The Problem with One-Hot Encoding
Problem 1: DIMENSIONALITY
Vocab size = 100,000 words
Each word = 100,000-dimensional vector
99.999% of each vector = zeros!
MASSIVE memory waste → [0,0,0,...,1,...,0,0,0]
Problem 2: NO SEMANTIC SIMILARITY
cat = [1,0,0,0,...]
kitten = [0,1,0,0,...]
car = [0,0,1,0,...]
distance(cat, kitten) = distance(cat, car)
BUT cat and kitten are much more related!
Problem 3: NO RELATIONSHIPS
king - man + woman = ???
One-hot can't do word arithmetic.
Word vectors CAN:
vec("king") - vec("man") + vec("woman") ≈ vec("queen") ← MAGIC!
Distributional Semantics — The Core Idea
"You shall know a word by the company it keeps."
Words that appear in similar contexts tend to have similar meanings. "cat" and "dog" both appear near "pet", "feed", "vet", "cute" → they should have similar vector representations. We don't need to hand-code that cats and dogs are both animals — we can LEARN this from billions of sentences!
ELI10: What is an Embedding?
Imagine you want to describe your classmates. You could give each one a unique number (1, 2, 3...) — but that number says nothing about them. Or you could describe each person using 3 attributes: height (0-1), age (0-1), how funny they are (0-1). Now two similar people would have similar numbers! Embeddings do the same for words — but with 300 or 768 "attribute" dimensions instead of 3, capturing aspects of meaning we can't even name.
Word2Vec — How It Works
Word2Vec, introduced by Mikolov et al. at Google in 2013, learns word vectors by training a simple neural network on a "fake" task. There are two variants:
CBOW (Continuous Bag of Words):
TASK: Predict CENTER word from SURROUNDING words
Context: ["The", "_?_", "sat", "on"]
Target: "cat"
Input: embed(The) + embed(sat) + embed(on) → average
↓
Small neural network
↓
Output: probability distribution over all words
P("cat" | context) should be highest!
Training: Adjust all word embeddings to make the model
predict the correct center word from context.
After training, the embeddings ARE the representation.
Use case: Works better for frequent words, faster training.
═══════════════════════════════════════════════════════════
SKIP-GRAM:
TASK: Predict SURROUNDING words from CENTER word (opposite of CBOW!)
Input: embed("cat")
↓
Small neural network
↓
Output: P("The"), P("sat"), P("on") should all be high!
The name refers to skip-grams: word pairs that may "skip" over
intervening words, since the center word is paired with every word in its context window, not only its direct neighbors.
Use case: Works better for rare words, captures more semantics.
NEGATIVE SAMPLING — Why it's needed:
Naively, each training step requires computing probabilities
over the ENTIRE vocabulary (50,000 words):
P("cat" | context) = softmax over 50,000 outputs
= VERY SLOW
Negative Sampling shortcut:
Instead of asking "what's the probability for ALL words?"
Ask: "Is 'cat' a real context word OR a random (negative) sample?"
Positive: ("The", "cat") → label 1 (real context pair)
Negatives: ("The", "pizza"), ("The", "quantum"), ("The", "treaty")
→ label 0 (fake pairs, randomly sampled)
Train a binary classifier on 1 positive + ~5-20 negatives
FAR cheaper than softmax over 50K words!
Works because: "don't need to learn ALL wrong answers,
just enough negatives to learn good representations"
Mathematical Explanation
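For Skip-Gram with negative sampling, the objective for one (center word w, context word c) pair is to maximize:
$$J = \log \sigma(v'_c \cdot v_w) + \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P(w)}\left[\log \sigma(-v'_{w_k} \cdot v_w)\right]$$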
| J | = objective function (what we maximize) |
| v_w | = embedding vector of center word w (the "input" embedding) |
| v'_c | = context embedding of word c (the "output" embedding) |
| σ | = sigmoid function: σ(x) = 1/(1+e^-x), outputs value between 0 and 1 |
| K | = number of negative samples (typically 5-20) |
| w_k | = the k-th negative sample (random word) |
| P(w) | = unigram distribution raised to 3/4 power (sampling distribution) |
| · | = dot product (element-wise multiply and sum) |
The first term: maximize σ(v'_c · v_w) = maximize the dot product between actual context word c and center word w. High dot product = vectors point in similar direction = words are "close" in embedding space.
The second term: maximize σ(-v'_{w_k} · v_w) = maximize σ of NEGATIVE dot product for random words = push random words AWAY from center word in embedding space.
Net result: context words cluster together, random words are pushed apart → geometry of embedding space captures meaning!
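To see the two terms at work, the objective can be evaluated for a single training pair. This is a minimal sketch with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions and are updated by gradient ascent, which is omitted here); `sgns_objective` is an illustrative name:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sgns_objective(v_center, v_context, v_negatives):
    """log σ(v'_c · v_w) plus the sum over negatives of log σ(-v'_neg · v_w)."""
    positive_term = math.log(sigmoid(dot(v_context, v_center)))
    negative_term = sum(
        math.log(sigmoid(-dot(v_neg, v_center))) for v_neg in v_negatives
    )
    return positive_term + negative_term


# Toy vectors, chosen by hand for illustration only.
v_center = [0.9, 0.1, 0.3]    # center word
v_context = [0.8, 0.2, 0.4]   # true context word, roughly aligned with the center

negatives_far = [[-0.7, 0.1, -0.5], [-0.9, -0.2, -0.1]]  # point away from the center
negatives_near = [[0.8, 0.2, 0.3], [0.9, 0.0, 0.4]]      # point toward the center

good = sgns_objective(v_center, v_context, negatives_far)
bad = sgns_objective(v_center, v_context, negatives_near)
print(f"Objective with negatives pushed away: {good:.3f}")
print(f"Objective with negatives nearby:      {bad:.3f}")
```

The objective is higher (closer to 0) when the random words point away from the center word, which is exactly the pressure that training applies.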
GloVe — Global Vectors for Word Representation
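GloVe (Pennington et al., Stanford, 2014) takes a different approach. Instead of training on one local context window at a time, GloVe first aggregates global co-occurrence statistics: how often does word A appear within a context window of word B, counted across the ENTIRE corpus?
Build a co-occurrence matrix X where X_ij = how many times word i appears in the context of word j globally. Then fit word vectors whose dot products approximate log X_ij by minimizing a weighted least-squares objective:
$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$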
| X_{ij} | = co-occurrence count of word i with word j in corpus |
| w_i | = word vector for word i |
| w̃_j | = context word vector for word j |
| b_i, b̃_j | = bias terms for word i and context j |
| f(X_{ij}) | = weighting function (reduces weight of very common co-occurrences) |
| log X_{ij} | = log of co-occurrence count (target value) |
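The terms defined above combine into a weighted squared error per word pair. A minimal sketch of that per-pair loss, assuming the weighting function from the GloVe paper, f(x) = (x / x_max)^α capped at 1, with x_max = 100 and α = 0.75; the vectors and counts are made up:

```python
import math

X_MAX = 100.0  # co-occurrence count at which the weight saturates
ALPHA = 0.75   # exponent of the weighting function


def weight(x):
    """f(X_ij): down-weights rare pairs, caps the weight of very frequent ones."""
    return (x / X_MAX) ** ALPHA if x < X_MAX else 1.0


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pair_loss(w_i, w_j_ctx, b_i, b_j_ctx, x_ij):
    """f(X_ij) * (w_i · w̃_j + b_i + b̃_j - log X_ij)^2 for one word pair."""
    prediction = dot(w_i, w_j_ctx) + b_i + b_j_ctx
    return weight(x_ij) * (prediction - math.log(x_ij)) ** 2


# Toy example: vectors chosen so the prediction matches log(50) exactly.
w_i = [1.0, 0.0]
w_j_ctx = [math.log(50.0), 0.0]
perfect = pair_loss(w_i, w_j_ctx, 0.0, 0.0, 50.0)
off = pair_loss(w_i, [0.0, 0.0], 0.0, 0.0, 50.0)

print(f"weight(1)    = {weight(1.0):.4f}")
print(f"weight(1000) = {weight(1000.0):.4f}")
print(f"loss when prediction = log X_ij: {perfect:.4f}")
print(f"loss when prediction = 0:        {off:.4f}")
```

Training sums this loss over all pairs with a nonzero count and minimizes it with respect to the vectors and biases.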
Word2Vec: local context windows, predict-based, captures syntactic relationships well
GloVe: global corpus statistics, count-based, captures word association patterns well
In practice: very similar quality. GloVe embeddings are easier to train reproducibly. Both are static (one vector per word, regardless of context) — superseded by contextual embeddings from BERT.
FastText — Handling Rare & Morphological Words
FastText (Facebook AI, 2017) extends Word2Vec by representing each word as a bag of character n-grams. The embedding for a word is the sum of embeddings for all its character n-grams.
FastText character n-grams for "playing" (n=3):
Special boundaries: <playing>
Trigrams: <pl, pla, lay, ayi, yin, ing, ng>
Embedding("playing") = Σ embedding(trigram) for all trigrams
= embed(<pl) + embed(pla) + embed(lay) + ...
ADVANTAGES:
1. "played", "playing", "plays" share trigrams → similar vectors!
(All contain "<pl", "pla", "lay")
2. Rare word "photosynthesizing" gets a good vector even if it
appeared only once, because its character pieces appear elsewhere
3. Works great for morphologically rich languages (German, Finnish, Turkish)
where words change form dramatically with suffixes/prefixes
USED IN: Facebook's production NLP systems, multilingual tasks,
languages where word boundaries are complex
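The n-gram extraction is easy to reproduce. A minimal sketch for n=3 only; the real FastText library by default uses a range of n-gram lengths (3 to 6), also keeps the whole word as its own unit, and hashes n-grams into a fixed number of buckets, all of which are omitted here:

```python
def char_ngrams(word, n=3):
    """Wrap the word in boundary markers and return its character n-grams."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


playing = char_ngrams("playing")
print("playing:", playing)

# Trigrams shared by all three inflections of "play".
shared = (
    set(char_ngrams("playing"))
    & set(char_ngrams("played"))
    & set(char_ngrams("plays"))
)
print("shared :", sorted(shared))
```

A word's vector is then the sum of the vectors of its n-grams, so words that share pieces start out with related representations.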
Embedding Space — The Magic Properties
2D PROJECTION of 300D word embedding space:
(Real embeddings are 300D, this is conceptual illustration)
Animals Royalty
│ │
cat ● │ dog ● king ●
│ │
kitten ● │ puppy ● queen ●
│ │
feline ● │ canine ● prince ●
│
Countries princess ●
│
France ●
Paris ● ← France + capital → Paris
Germany ●
Berlin ● ← Germany + capital → Berlin
Japan ●
Tokyo ●
ANALOGIES (semantic algebra!):
king - man + woman ≈ queen
Paris - France + Germany ≈ Berlin
doctor - man + woman ≈ nurse (controversial! shows bias in data)
CLUSTERING:
Sports words cluster together
Food words cluster together
Medical terms cluster together
Code keywords cluster together
SIMILARITY (illustrative values; actual numbers depend on the model and training corpus):
cos_sim(cat, dog) ≈ 0.85 (very similar)
cos_sim(cat, car) ≈ 0.15 (not similar)
cos_sim(Paris, Tokyo) ≈ 0.72 (both capital cities)
Python Implementation
# pip install gensim numpy
import gensim.downloader as api
import numpy as np

# ── Load pre-trained Word2Vec (Google News, 300D) ─
# This downloads 1.6GB - for demo use smaller model:
model = api.load("word2vec-google-news-300")
# OR smaller: model = api.load("glove-wiki-gigaword-50")

# ── Basic operations ──────────────────────────────
# Get vector for a word
cat_vec = model['cat']
print("Shape of 'cat' vector:", cat_vec.shape)  # (300,)
print("First 5 dims:", cat_vec[:5])

# ── Semantic similarity ───────────────────────────
sim_cat_dog = model.similarity('cat', 'dog')
sim_cat_car = model.similarity('cat', 'car')
print(f"\nSimilarity(cat, dog): {sim_cat_dog:.3f}")
print(f"Similarity(cat, car): {sim_cat_car:.3f}")

# ── Most similar words ────────────────────────────
similar_to_king = model.most_similar('king', topn=5)
print("\nMost similar to 'king':", similar_to_king)

# ── FAMOUS ANALOGY: king - man + woman = queen ────
result = model.most_similar(
    positive=['king', 'woman'],  # add these
    negative=['man'],            # subtract this
    topn=1
)
print(f"\nking - man + woman = {result[0][0]}")  # Expected: queen!

# ── Word doesn't belong ──────────────────────────
odd_one_out = model.doesnt_match(['cat', 'dog', 'bird', 'car'])
print(f"\nDoesn't belong: {odd_one_out}")  # 'car'

# ── Manual cosine similarity computation ─────────
def cosine_sim(v1, v2):
    """
    Cosine similarity = how aligned are two vectors?
    Range: -1 (opposite) to +1 (identical)
    0 = perpendicular (unrelated)
    """
    dot_product = np.dot(v1, v2)   # v1 · v2
    norm_v1 = np.linalg.norm(v1)   # |v1|
    norm_v2 = np.linalg.norm(v2)   # |v2|
    return dot_product / (norm_v1 * norm_v2)

manual_sim = cosine_sim(model['paris'], model['france'])
print(f"\nManual cosine_sim(paris, france): {manual_sim:.3f}")

# ── Train your own Word2Vec on custom data ───────
from gensim.models import Word2Vec

sentences = [
    ["i", "love", "natural", "language", "processing"],
    ["nlp", "is", "the", "study", "of", "language"],
    ["transformers", "revolutionized", "nlp"],
    ["bert", "and", "gpt", "are", "transformer", "models"],
]
custom_model = Word2Vec(
    sentences=sentences,
    vector_size=50,  # embedding dimensions
    window=3,        # context window size
    min_count=1,     # include words seen at least once
    workers=4,       # parallel training
    sg=1             # 0=CBOW, 1=Skip-Gram
)
print("\nCustom model - 'nlp' vector shape:", custom_model.wv['nlp'].shape)
print("Most similar to 'nlp':", custom_model.wv.most_similar('nlp', topn=3))
Critical Limitation: Static Embeddings
Word2Vec and GloVe give EVERY word ONE fixed vector, regardless of context. This is called a "static embedding".
Consider "bank":
— "I went to the bank to deposit money" → financial institution
— "The fish swam near the river bank" → geographical feature
Word2Vec has ONE vector for "bank" — it's the average of both meanings. But BERT and GPT produce contextual embeddings — different vectors for "bank" depending on the surrounding context. This is why transformers are so much more powerful than Word2Vec.
Interview Questions
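Why is cosine similarity usually preferred over Euclidean distance for comparing word embeddings?
1. Word frequency affects vector magnitude, so Euclidean distance partly reflects how common a word is rather than what it means. "the" can end up "far" from many words in Euclidean space just because it is a very common word.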
2. Cosine similarity captures semantic similarity regardless of how often words appear.
3. Example: vec("cat") might have magnitude 2.1 and vec("dog") magnitude 3.4, but if they point in the same direction (angle ≈ 0°), cosine similarity ≈ 1.0 (very similar) even though their Euclidean distance is large.
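The example in point 3 can be checked numerically. A minimal sketch with two made-up 2-dimensional vectors that share a direction but have magnitudes 2.1 and 3.4:

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (norm(u) * norm(v))


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


direction = [0.6, 0.8]                 # unit-length direction shared by both words
cat = [2.1 * x for x in direction]     # magnitude 2.1
dog = [3.4 * x for x in direction]     # magnitude 3.4

cos_cat_dog = cosine(cat, dog)
dist_cat_dog = euclidean(cat, dog)
print(f"cosine similarity : {cos_cat_dog:.3f}")
print(f"Euclidean distance: {dist_cat_dog:.3f}")
```

Cosine similarity sees the two vectors as identical in direction, while Euclidean distance reports the full difference in magnitude.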
SBERT, BGE, E5 & Similarity Measures
Why Sentence Embeddings?
Word embeddings give us one vector per word. But we often need ONE vector for an entire sentence or paragraph. This is needed for:
- Semantic search: "Find all documents about neural networks" — compare query vector to document vectors
- RAG: Convert knowledge base documents to vectors, find relevant ones for a user query
- Duplicate detection: "Is this question already answered in our FAQ?"
- Clustering: Group similar customer feedback together
The naive approach: average all word vectors in a sentence. Problem: "The dog bit the man" and "The man bit the dog" have the same average word vector but different meanings!
SBERT — Sentence-BERT
SBERT (Reimers & Gurevych, 2019) — Siamese Network Architecture:
Sentence A Sentence B
"The cat sat on mat" "A feline rested on rug"
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ BERT │ │ BERT │ ← SAME WEIGHTS (siamese!)
│ Encoder │ │ Encoder │
└─────────────┘ └─────────────┘
│ │
▼ ▼
Mean Pooling Mean Pooling
(avg all token (avg all token
hidden states) hidden states)
│ │
▼ ▼
Sentence Vector u Sentence Vector v
[0.23, -0.12, ...] [0.21, -0.15, ...]
│ │
▼ ▼
Cosine Similarity(u, v) = 0.94 ← HIGH!
(These sentences ARE semantically similar)
TRAINING:
SBERT uses pairs/triplets of sentences:
- Positive pair (similar): ("The cat sat", "A feline rested") → high similarity
- Negative pair (dissimilar): ("The cat sat", "I love Python") → low similarity
- Uses Triplet Loss or Cosine Similarity loss to train
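The mean-pooling step in the diagram above averages token hidden states while skipping padding positions. A minimal NumPy sketch, assuming random numbers as a stand-in for the encoder output and an attention mask where 1 marks a real token and 0 marks padding:

```python
import numpy as np


def mean_pool(token_states, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    token_states:   (batch, seq_len, hidden) array of hidden states
    attention_mask: (batch, seq_len) array, 1 = real token, 0 = padding
    """
    mask = attention_mask[:, :, None].astype(token_states.dtype)
    summed = (token_states * mask).sum(axis=1)       # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts


rng = np.random.default_rng(0)
token_states = rng.normal(size=(2, 4, 8))   # stand-in for BERT output
attention_mask = np.array([
    [1, 1, 1, 1],   # sentence with 4 real tokens
    [1, 1, 0, 0],   # sentence with 2 real tokens + 2 padding
])

sentence_vecs = mean_pool(token_states, attention_mask)
print("Sentence embedding shape:", sentence_vecs.shape)
```

Averaging without the mask would let padding vectors leak into short sentences, which is why pooling code usually takes the mask as an input.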
Production Embedding Models
| Model | By | Dimensions | Best For |
|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | General purpose, cost-effective, great for RAG |
| text-embedding-3-large | OpenAI | 3072 | High accuracy tasks, multilingual |
| BGE-large-en | BAAI | 1024 | Top-performing open source English embeddings |
| E5-large-v2 | Microsoft | 1024 | Strong English retrieval (the multilingual-e5 variants cover cross-lingual use) |
| Instructor-XL | HKU | 768 | Task-specific embeddings with instructions |
| all-MiniLM-L6-v2 | SBERT | 384 | Fast, small, good quality — great for edge deployment |
Similarity Measures — Choosing the Right One
$$\cos(\theta) = \frac{u \cdot v}{\|u\| \, \|v\|}$$
| u, v | = two embedding vectors being compared |
| u · v | = dot product: element-wise multiplication then sum |
| ||u|| | = L2 norm (length/magnitude) of vector u = √(u₁² + u₂² + ... + u_n²) |
| θ | = angle between the two vectors |
WHEN TO USE WHICH:
COSINE SIMILARITY: Range [-1, +1]
✓ Use for MOST text similarity tasks
✓ Not affected by vector magnitude (length)
✓ 1.0 = identical direction, 0 = perpendicular, -1 = opposite
✓ Standard choice for semantic search
DOT PRODUCT: Range (-∞, +∞)
✓ Faster to compute (no normalization)
✓ Used when vectors are already L2-normalized (then = cosine!)
✓ OpenAI recommends this for their normalized embeddings
✗ Affected by magnitude — longer vectors score higher even if angle is same
EUCLIDEAN DISTANCE: Range [0, +∞)
✓ Intuitive: physical distance in space
✓ Used in some clustering algorithms (k-means)
✗ More affected by vector dimension count and magnitude
✗ Less common for semantic similarity
PRACTICAL NOTE:
Many embedding models and APIs output NORMALIZED vectors (||v|| = 1), but not all do
When both vectors are normalized:
cos(θ) = u · v (they become equivalent!)
So check if your embedding model normalizes outputs!
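The equivalence for normalized vectors is quick to confirm. A minimal sketch with two hand-picked 2-dimensional vectors; it also shows that the dot product, unlike cosine similarity, changes when a vector is merely scaled:

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    return math.sqrt(dot(v, v))


def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))


def euclidean(u, v):
    return norm([a - b for a, b in zip(u, v)])


def normalize(v):
    """Scale a vector to unit length (L2 norm = 1)."""
    length = norm(v)
    return [x / length for x in v]


u, v = [3.0, 4.0], [4.0, 3.0]
u_hat, v_hat = normalize(u), normalize(v)

cos_uv = cosine(u, v)
dot_unit = dot(u_hat, v_hat)
dist_unit = euclidean(u_hat, v_hat)
u_scaled = [2 * x for x in u]

print(f"cosine(u, v)               = {cos_uv:.3f}")
print(f"dot of normalized vectors  = {dot_unit:.3f}")
print(f"dot(u, v) vs dot(2u, v)    = {dot(u, v):.1f} vs {dot(u_scaled, v):.1f}")
print(f"cosine(2u, v)              = {cosine(u_scaled, v):.3f}")
print(f"squared distance (unit)    = {dist_unit ** 2:.3f}")
```

For unit vectors the squared Euclidean distance equals 2 - 2·cos(θ), so all three measures rank neighbors identically once the vectors are normalized.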
Hugging Face Implementation
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util
import numpy as np

# ── Load SBERT model ──────────────────────────────
model = SentenceTransformer('all-MiniLM-L6-v2')  # 80MB, fast
# For production quality: 'BAAI/bge-large-en-v1.5'

# ── Encode sentences ──────────────────────────────
sentences = [
    "The cat is sitting on the mat.",
    "A feline is resting on a rug.",      # same meaning, different words
    "I love programming in Python.",      # different topic
    "Machine learning is fascinating.",
    "Deep learning is a subset of ML.",   # related to ML
]
embeddings = model.encode(sentences, normalize_embeddings=True)
print("Embedding shape:", embeddings.shape)  # (5, 384)

# ── Pairwise similarity ───────────────────────────
cos_sim_matrix = util.cos_sim(embeddings, embeddings)
print("\nSimilarity matrix (rounded):")
for i, sent_i in enumerate(sentences):
    for j, sent_j in enumerate(sentences):
        if i < j:
            sim = cos_sim_matrix[i][j].item()
            print(f" [{sim:.2f}] '{sent_i[:30]}...' ↔ '{sent_j[:30]}...'")

# ── Semantic search ───────────────────────────────
knowledge_base = [
    "Python is a high-level programming language.",
    "Transformers use self-attention mechanisms.",
    "RAG combines retrieval with generation.",
    "BERT is an encoder-only transformer model.",
    "GPT uses autoregressive generation.",
]
kb_embeddings = model.encode(knowledge_base, normalize_embeddings=True)

query = "How does BERT work?"
query_emb = model.encode([query], normalize_embeddings=True)

# Find most similar documents
similarities = util.cos_sim(query_emb, kb_embeddings)[0]
top_results = sorted(enumerate(similarities), key=lambda x: -x[1])

print(f"\nQuery: '{query}'")
print("Top 3 most relevant documents:")
for rank, (idx, score) in enumerate(top_results[:3], 1):
    print(f" {rank}. [{score:.3f}] {knowledge_base[idx]}")
# Expected: BERT document should score highest!
Relationship to RAG & Semantic Search
RAG Systems
Sentence embeddings ARE the backbone of RAG. Every document chunk in your knowledge base is stored as an embedding vector. User queries become embedding vectors. Cosine similarity finds relevant chunks. This is exactly Module 6 in production!
Semantic Search
Unlike keyword search (TF-IDF), semantic search finds documents that MEAN the same thing as the query, even if they use different words. "What are ML models?" would find "deep learning algorithms" because their embeddings are similar.
Claude
Claude's API can be paired with embedding models for RAG. At the time of writing, Anthropic does not offer its own embeddings API, so practitioners use a third-party provider (Anthropic's documentation points to Voyage AI), OpenAI embeddings, or open-source SBERT models.
OpenAI Embeddings
OpenAI's text-embedding-3 models are among the most popular commercial embeddings for RAG. text-embedding-3-small was priced at $0.02/1M tokens at the time of writing, which is very cheap for building knowledge bases.
N-gram LMs, Perplexity & Text Generation
What is a Language Model?
A language model is a probability distribution over sequences of tokens. Given any sequence of words, it assigns a probability to that sequence. This seemingly simple idea is the foundation of GPT, Claude, and LLaMA.
A language model assigns probabilities to sentences:
P("The cat sat on the mat") = 0.0000023 (reasonable sentence)
P("The mat sat on the cat") = 0.0000001 (weird but grammatical)
P("sat the mat on cat the") = 0.0000000001 (not a sentence)
P("I love eating pizza") = 0.0000087 (very natural)
P("I pizza eating love") = 0.0000000003 (unnatural)
KEY INSIGHT: Language models learn "what is natural English"
from billions of examples. They assign HIGH probability to
natural sequences and LOW probability to unnatural ones.
APPLICATION: Given "The sky is ___", predict most likely next word:
P("blue" | "The sky is") = 0.23 ← high
P("clear" | "The sky is") = 0.18 ← high
P("falling" | "The sky is") = 0.01 ← low
P("pizza" | "The sky is") = 0.0001 ← very low
N-Gram Language Models
Before neural networks, N-gram models were the standard language models. They estimate the probability of a word based only on the previous N-1 words (the Markov assumption).
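For the bigram case (N=2):
$$P(w_1, w_2, \ldots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1}), \qquad P(w_i \mid w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i)}{\text{count}(w_{i-1})}$$
| P(w₁, w₂, ..., wₙ) | = probability of an entire sentence |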
| Π | = product (multiply all terms together) |
| P(wᵢ | wᵢ₋₁) | = probability of word wᵢ given the previous word wᵢ₋₁ |
| count(wᵢ₋₁, wᵢ) | = how many times words wᵢ₋₁ and wᵢ appear consecutively in training data |
| count(wᵢ₋₁) | = how many times word wᵢ₋₁ appears in training data |
BIGRAM EXAMPLE:
Training corpus: "I love NLP. I love Python. I enjoy coding."
Count pairs:
(I, love) = 2
(I, enjoy) = 1
(love, NLP) = 1
(love, Python) = 1
(enjoy, coding) = 1
Compute bigram probabilities:
P(love | I) = count(I,love) / count(I) = 2/3 = 0.667
P(enjoy | I) = count(I,enjoy) / count(I) = 1/3 = 0.333
P(NLP | love) = count(love,NLP) / count(love) = 1/2 = 0.5
Compute sentence probability:
P("I love NLP") = P(I) × P(love|I) × P(NLP|love)
where P(I) = count(I) / total tokens = 3/9 = 0.333
= 0.333 × 0.667 × 0.5 = 0.111
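The bigram example can be computed from raw counts. A minimal sketch, assuming each sentence is counted separately (so no bigram spans a sentence boundary) and the first word is scored with its unigram probability. Note that count(love) = 2, so P(NLP | love) = 1/2, and P(I) = 3/9 because "I" accounts for 3 of the 9 tokens:

```python
from collections import Counter

sentences = [
    ["I", "love", "NLP"],
    ["I", "love", "Python"],
    ["I", "enjoy", "coding"],
]

unigrams = Counter(word for sent in sentences for word in sent)
bigrams = Counter(
    (prev, word) for sent in sentences for prev, word in zip(sent, sent[1:])
)
total_tokens = sum(unigrams.values())


def p_bigram(word, prev):
    """P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]


def p_sentence(tokens):
    """Unigram probability of the first word times the chain of bigram probabilities."""
    prob = unigrams[tokens[0]] / total_tokens
    for prev, word in zip(tokens, tokens[1:]):
        prob *= p_bigram(word, prev)
    return prob


print(f"P(love | I)    = {p_bigram('love', 'I'):.3f}")
print(f"P(NLP | love)  = {p_bigram('NLP', 'love'):.3f}")
print(f"P('I love NLP') = {p_sentence(['I', 'love', 'NLP']):.3f}")
print(f"P('I love coding') = {p_sentence(['I', 'love', 'coding']):.3f}")
```

The last line comes out as zero because the pair (love, coding) never occurs in training, which is the sparsity problem that smoothing techniques address.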
LIMITATIONS:
"I love NLP" — bigram only looks back 1 word
Can't capture: "The man who ate sushi ... enjoyed IT"
"IT" = "sushi" but it's 7 words back! Bigram can't know.
Perplexity — Evaluating Language Models
Perplexity measures how "confused" or "surprised" a language model is by a test set. Lower perplexity = model is less surprised = model is better at predicting language.
$$PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N}$$
| PP(W) | = perplexity of word sequence W |
| P(w₁...wₙ) | = probability the model assigns to the test sequence |
| N | = number of tokens in the sequence |
| ^(-1/N) | = raise to the power of -1/N (geometric mean normalization) |
INTUITION:
PP = 1 → Perfect model! Always predicts the correct next word.
PP = 10 → On average, model is choosing between 10 equally likely words.
PP = 100 → On average, 100 equally likely choices. Not great.
PP = 50000 → Random guessing over full vocabulary. Terrible model.
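Reported zero-shot perplexities (only comparable on the same dataset and tokenization):
GPT-2 (small, 117M params): PP ≈ 65.9 on Penn Treebank
GPT-2 (large, 774M params): PP ≈ 40.3 on Penn Treebank
GPT-3 (175B params): PP ≈ 20.5 on Penn Treebank
Human: figures around 60-80 are sometimes quoted for reading tasks, but such estimates vary widely and are not directly comparable (humans are uncertain too!)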
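WHY LOWER IS BETTER:
A model with PP=10 is, on average, as uncertain as if it were choosing uniformly among 10 words at each step.
A model with PP=5 is as uncertain as a uniform choice among 5 words.
Halving perplexity from 10 to 5 means the model gives the correct next token twice the probability on (geometric) average, a saving of one bit per token.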
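Perplexity follows directly from the per-token probabilities a model assigns to a test sequence. A minimal sketch that works in log space to avoid numerical underflow on long sequences; the probability lists are made up for illustration:

```python
import math


def perplexity(token_probs):
    """PP = P(w_1...w_N)^(-1/N), computed as exp of the mean negative log-probability."""
    n = len(token_probs)
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / n)


# A model that always spreads its belief evenly over 10 words.
uniform_10 = perplexity([0.1] * 20)

# A model that is always certain and always right.
perfect = perplexity([1.0] * 5)

# Three tokens scored with probabilities 1/3, 2/3 and 1/2.
three_tokens = perplexity([1 / 3, 2 / 3, 1 / 2])

print(f"Uniform over 10 words: PP = {uniform_10:.2f}")
print(f"Perfect model:         PP = {perfect:.2f}")
print(f"Three-token example:   PP = {three_tokens:.2f}")
```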
Text Generation Strategies — How GPT Decides What to Say
SETUP: Model outputs probability distribution over vocabulary at each step.
Input: "The sky is"
Output probabilities:
blue: 0.30, clear: 0.20, beautiful: 0.15, falling: 0.05, ...
1. GREEDY SEARCH — Always pick most likely token
Step 1: "blue" (0.30) → "The sky is blue"
Step 2: "and" (0.25) → "The sky is blue and"
Step 3: "blue" (0.30) → "The sky is blue and blue" ← LOOPS!
✓ Fast, deterministic
✗ Can get stuck in repetitive loops
2. BEAM SEARCH — Keep top-K partial sequences simultaneously
Width=2: Track 2 candidate sequences at each step:
Step 1: Keep ["blue" (0.30), "clear" (0.20)]
Step 2 from "blue": ["blue and" (0.3×0.25=0.075), "blue sky" (0.3×0.12=0.036)]
Step 2 from "clear": ["clear blue" (0.2×0.22=0.044), "clear sky" (0.2×0.18=0.036)]
Keep top 2: ["blue and" (0.075), "clear blue" (0.044)]
✓ Better quality than greedy
✗ Still can sound generic and boring
3. TOP-K SAMPLING — Sample from top-K most likely tokens
K=5: Only consider ["blue", "clear", "beautiful", "bright", "vast"]
Sample randomly from these 5 (weighted by probability)
✓ Introduces diversity/creativity
✗ A fixed K fits some steps badly: K=5 can cut off good options when many words are plausible, while K=100 can let in weird tokens when only a few are
4. TOP-P (NUCLEUS) SAMPLING — Sample from tokens covering top P% probability
P=0.9: Add tokens until cumulative probability ≥ 0.9
blue=0.30 (total 0.30) → clear=0.20 (0.50) → beautiful=0.15 (0.65)
→ bright=0.12 (0.77) → vast=0.08 (0.85) → calm=0.06 (0.91) STOP
Sample from these 6 tokens proportionally.
✓ Dynamic vocabulary size — automatically adjusts to model's uncertainty
✓ Used by most LLMs in production (including GPT, Claude)
5. TEMPERATURE — Controls randomness
T=0.0: Always pick max probability (= greedy; implemented as argmax, since dividing logits by 0 is undefined)
T=1.0: Sample from original distribution (default)
T=2.0: Flatten distribution → more random/creative
T=0.1: Sharpen distribution → very focused/deterministic
Logit scaling: adjusted_logit = raw_logit / temperature
Then softmax to get probabilities.
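The strategies above can be sketched on the toy distribution for "The sky is". This is a minimal, standard-library illustration rather than how any particular LLM implements decoding; the probabilities for "falling" and "pizza" are filled in here so the distribution sums to 1, and beam search is left out for brevity:

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


def sample(probs, rng):
    """Draw one token according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


probs = {
    "blue": 0.30, "clear": 0.20, "beautiful": 0.15, "bright": 0.12,
    "vast": 0.08, "calm": 0.06, "falling": 0.05, "pizza": 0.04,
}

greedy = max(probs, key=probs.get)
top5 = top_k_filter(probs, 5)
nucleus = top_p_filter(probs, 0.9)

logits = [math.log(p) for p in probs.values()]
base = softmax_with_temperature(logits, 1.0)   # reproduces the original distribution
sharp = softmax_with_temperature(logits, 0.1)  # almost all mass on the top token
flat = softmax_with_temperature(logits, 2.0)   # closer to uniform

rng = random.Random(0)
print("Greedy pick      :", greedy)
print("Top-k (k=5) keeps:", list(top5))
print("Top-p (0.9) keeps:", list(nucleus))
print("One top-p sample :", sample(nucleus, rng))
print(f"P(blue) at T=0.1 / 1.0 / 2.0: {sharp[0]:.3f} / {base[0]:.3f} / {flat[0]:.3f}")
```

In practice temperature is applied first and the top-k or top-p filter second, after which one token is sampled from what remains.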
Most production LLM APIs (OpenAI, Anthropic, Together.ai) expose temperature and top_p parameters (allowed ranges differ by provider). Common starting points:
— Creative writing: temperature=0.9, top_p=0.95
— Code generation: temperature=0.2, top_p=0.95
— Factual Q&A: temperature=0.1
— Brainstorming: temperature=1.2, top_p=0.99
Interview Q&A
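How does the temperature setting change what the model generates?
Temperature=0: The model always picks the highest-probability token (greedy decoding). Deterministic.
Temperature=1: The model samples from the probability distribution directly. Higher probability tokens are still more likely to be chosen, but lower probability tokens can also appear. Non-deterministic. Useful for: creative writing, brainstorming, generating diverse outputs.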
Temperature=2: The probability distribution is flattened (more uniform). All tokens become more equally likely. Results are very random and often incoherent.
Temperature between 0 and 1 (e.g., 0.7) is common: adds some diversity while maintaining coherence.
Self-Attention, Multi-Head Attention & Contextual Embeddings
Why RNNs Failed — The Problem Attention Solves
RNN (Recurrent Neural Network) — The Old Way:
Text: "The animal didn't cross the street because it was too tired"
RNN processes LEFT TO RIGHT, one token at a time:
"The" → h₁
"animal" → h₂ (modified by h₁)
"didn't" → h₃ (modified by h₂)
...
"it" → h₈ (modified by h₇)
...
"tired" → h₁₁ (final state)
PROBLEM 1: VANISHING GRADIENTS
Information from early tokens (h₁) gets diluted/lost
by the time we reach h₁₁.
"animal" information is hard to recover at "tired"
→ RNNs struggle with long-range dependencies
PROBLEM 2: SEQUENTIAL PROCESSING
Can't process "The" and "animal" simultaneously
Must wait: token 1 → token 2 → token 3 → ...
→ SLOW! Time steps can't be computed in parallel, so GPUs sit underused
WHAT "IT" REFERS TO?
"it" = animal (not street)
To understand this, model needs to connect "it" (position 8, counting whole words as tokens)
to "animal" (position 2), 6 tokens apart!
RNN information of "animal" is mostly gone by "it"...
ATTENTION SOLUTION:
Let "it" directly attend to "animal" regardless of distance!
No information decay — any token can directly look at any other!
Real-World Analogy for Attention
Imagine you're a researcher. You have a Query (your research question). The library has many books, each with a Key (the label on the spine). When you search, you compare your query against all keys. The Values are the actual content inside matching books.
Attention works the same way: for each token (the query), it looks at all other tokens (the keys), computes how relevant each is, then blends their information (values) weighted by relevance.
Query, Key, Value — Deep Dive
For each token in the input, we create THREE vectors:
Q (Query): "What information am I looking for?"
K (Key): "What information do I contain?"
V (Value): "What information do I actually provide?"
These are created by multiplying the token embedding by
learned weight matrices:
Q = x · Wq (x = token embedding, Wq = learned matrix)
K = x · Wk
V = x · Wv
Example: "The cat sat on the mat"
Token: "sat"
Q_sat = "I need to know WHO sat and WHERE they sat"
K_sat = "I contain information about the sitting action"
V_sat = "My actual content: verb, past tense, sitting"
How "sat" attends to "cat":
Score = Q_sat · K_cat (dot product)
High score → "sat" will use a lot of "cat"'s value
Low score → "sat" will mostly ignore "cat"'s value
Scaled Dot-Product Attention — The Formula
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
| Q | = Query matrix: shape [n_tokens × d_k], stacks all Q vectors for all tokens |
| K | = Key matrix: shape [n_tokens × d_k], stacks all K vectors for all tokens |
| V | = Value matrix: shape [n_tokens × d_v], stacks all V vectors for all tokens |
| K^T | = Transpose of K matrix (flip rows and columns): shape [d_k × n_tokens] |
| QK^T | = Attention score matrix: shape [n_tokens × n_tokens]. Entry (i,j) = how much token i attends to token j |
| d_k | = dimension of each Q/K vector (e.g., 64) |
| √d_k | = scaling factor (prevents very large dot products causing softmax to saturate) |
| softmax | = converts raw scores to probabilities that sum to 1.0 for each query token |
| × V | = multiply the attention weights by the Value matrix to get the output |
STEP-BY-STEP FOR "The cat sat" (simplified, 3 tokens, d_k=2):
TOKEN EMBEDDINGS (input):
x_the = [1.0, 0.5]
x_cat = [0.8, 0.9]
x_sat = [0.6, 0.7]
WEIGHT MATRICES (learned):
Wq = [[0.1, 0.2], [0.3, 0.1]]
Wk = [[0.2, 0.1], [0.1, 0.3]]
Wv = [[0.5, 0.1], [0.2, 0.5]]
COMPUTE Q, K, V:
Q = X · Wq (multiply each token emb by Wq)
K = X · Wk
V = X · Wv
ATTENTION SCORES:
Scores = Q × K^T (3×2) × (2×3) = (3×3) matrix
Score[i,j] = how much token i should attend to token j
(illustrative values, not computed from the toy matrices above):
        the    cat    sat
the   [0.85, 0.78, 0.62]
cat   [0.79, 0.91, 0.74] ← "cat" attends most to itself
sat   [0.63, 0.88, 0.70] ← "sat" attends most to "cat" (subject!)
SCALE by √d_k = √2 ≈ 1.41:
Divide all scores by 1.41
SOFTMAX (convert each row to probabilities summing to 1, rounded):
        the    cat    sat
the   [0.36, 0.34, 0.30]
cat   [0.33, 0.36, 0.32]
sat   [0.31, 0.37, 0.32] ← "sat" gives 37% weight to "cat"
OUTPUT = Attention_weights × V:
Each token's output = weighted blend of ALL value vectors
Output_sat = 0.31×V_the + 0.37×V_cat + 0.32×V_sat
= "sat" gets a vector that blends info from all tokens,
with the largest share from "cat" (the subject who sat!)
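The steps above can be run end to end with NumPy. This is a minimal single-head sketch using the toy embeddings and weight matrices from the walkthrough; the function names `softmax` and `scaled_dot_product_attention` are illustrative, not from any library. Because these toy matrices are tiny, the computed weights come out close to uniform and will not match the round numbers in the tables above.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over the rows of X (one row per token)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # [n_tokens, n_tokens]
    weights = softmax(scores)         # each row sums to 1
    return weights @ V, weights

# Toy inputs from the walkthrough: rows are "the", "cat", "sat"
X = np.array([[1.0, 0.5], [0.8, 0.9], [0.6, 0.7]])
Wq = np.array([[0.1, 0.2], [0.3, 0.1]])
Wk = np.array([[0.2, 0.1], [0.1, 0.3]])
Wv = np.array([[0.5, 0.1], [0.2, 0.5]])

output, weights = scaled_dot_product_attention(X, Wq, Wk, Wv)
print(np.round(weights, 3))   # attention weights, one row per query token
print(np.round(output, 3))    # blended value vectors, one row per token
```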
Multi-Head Attention — The Power Upgrade
Instead of ONE attention function, use H parallel attention heads, each with their own Q, K, V weight matrices. Each head can focus on different aspects of the relationships between words.
MULTI-HEAD ATTENTION with H=3 heads:
Input: "I saw her duck"
HEAD 1 (Syntactic attention — "who does what"):
"saw" ←→ "I" (subject-verb)
"duck" ←→ "saw" (verb-object)
Learns: "her duck" → she owns a duck (grammatical parsing)
HEAD 2 (Coreference — "what refers to what"):
"her" → "woman" (previously mentioned context)
Learns: pronoun resolution
HEAD 3 (Semantic attention — "what concepts are related"):
"saw" ←→ "duck" (actions + animals are related somehow)
Learns: contextual word meaning
Each head produces its own output vector.
Concatenate all head outputs → project to final output.
Multi-head output = Concat(head₁, head₂, head₃) × W_o
WHERE:
headᵢ = Attention(Q·Wqᵢ, K·Wkᵢ, V·Wvᵢ)
Each head has its own Wqᵢ, Wkᵢ, Wvᵢ matrices
W_o = output projection matrix
BERT-base: 12 attention heads, 64 dims each = 768 total
GPT-2 (small): 12 attention heads, 64 dims each = 768 total
GPT-3: 96 attention heads, 128 dims each = 12,288 total
LLaMA-3-70B: 64 attention heads, 128 dims each = 8,192 total
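A minimal NumPy sketch of multi-head attention follows. It assumes one large projection matrix per role (Wq, Wk, Wv) that is sliced into heads, which is equivalent to giving each head its own smaller matrices; the sizes (4 tokens, d_model=12, 3 heads) and the random weights are made up for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Split d_model into n_heads slices, attend per head, concat, project."""
    n_tokens, d_model = X.shape
    d_head = d_model // n_heads

    def split(M):
        # [n_tokens, d_model] -> [n_heads, n_tokens, d_head]
        return M.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # [h, n, n]
    heads = softmax(scores) @ V                           # [h, n, d_head]
    concat = heads.transpose(1, 0, 2).reshape(n_tokens, d_model)
    return concat @ Wo                                    # output projection

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 4, 12, 3
X = rng.normal(size=(n_tokens, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads)
print(out.shape)  # (4, 12)
```

The output has the same shape as the input, which is what lets attention layers be stacked.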
Contextual Embeddings — The Result of Attention
After attention, each token has a new representation that incorporates information from ALL other tokens. This is called a contextual embedding — unlike Word2Vec which gives "bank" the same vector always, BERT/GPT give "bank" a different vector based on surrounding context.
"I went to the bank to get money" → bank vector points toward financial concepts
"The boat docked on the river bank" → bank vector points toward geographic concepts
This is the FUNDAMENTAL advantage of transformers over all previous approaches.
Interview Q&A — High Frequency Questions
Self-attention compares every token with every other token, so its cost grows as O(n²) in the sequence length n:
- 1K tokens: ~1M pairwise scores
- 4K tokens: ~16M pairwise scores
- 128K tokens: ~16 BILLION pairwise scores
This is why extending context length is expensive: long context windows (for example 128K or 200K tokens) need far more compute and memory in every attention layer. Researchers are actively working on sub-quadratic alternatives, such as linear attention and state-space or recurrent designs (Mamba, RetNet, RWKV), that scale as O(n). This is one of the most active research areas in LLMs today.
Cross-attention: Q comes from one sequence, K and V come from a DIFFERENT sequence. "I attend to someone else." It is used in the decoder of encoder-decoder models (T5, BART): the decoder's queries (Q) attend to the encoder's keys and values (K, V). This is how the decoder "looks at" the input while generating the output (for example in translation and summarization).
Complete Transformer Architecture
Full Transformer Architecture
ORIGINAL TRANSFORMER (Vaswani et al., 2017)
Used for: Machine Translation (English → German)
INPUT SIDE (ENCODER):            OUTPUT SIDE (DECODER):
"How are you?"                   "Wie geht es Ihnen?"
┌──────────────────┐             ┌──────────────────┐
│ Input Tokens     │             │ Output Tokens    │
│ [How, are, you]  │             │ [Wie, geht, ...] │
└────────┬─────────┘             └────────┬─────────┘
         │                                │
┌────────▼─────────┐             ┌────────▼─────────┐
│ Token            │             │ Token            │
│ Embeddings       │             │ Embeddings       │
└────────┬─────────┘             └────────┬─────────┘
         │                                │
┌────────▼─────────┐             ┌────────▼─────────┐
│ Positional       │             │ Positional       │
│ Encoding (+)     │             │ Encoding (+)     │
└────────┬─────────┘             └────────┬─────────┘
         │                                │
┌────────▼─────────┐             ┌────────▼─────────┐
│ ENCODER          │             │ DECODER          │
│ BLOCK × N        │             │ BLOCK × N        │
│                  │             │                  │
│ [Multi-Head      │             │ [Masked Multi-   │
│  Self-Attention] │             │  Head Self-Attn] │
│        ↓         │             │        ↓         │
│ [Add & LayerNorm]│             │ [Add & LayerNorm]│
│        ↓         │             │        ↓         │
│ [Feed Forward]   │     ┌───────│ [Cross-Attention]│
│        ↓         │     │       │ (Q from decoder  │
│ [Add & LayerNorm]│     │       │  K,V from encdr) │
└────────┬─────────┘     │       │        ↓         │
         │               │       │ [Add & LayerNorm]│
         └───────────────┘       │        ↓         │
  (encoder outputs               │ [Feed Forward]   │
   → K, V for                    │        ↓         │
   cross-attention)              │ [Add & LayerNorm]│
                                 └────────┬─────────┘
                                          │
                                 ┌────────▼─────────┐
                                 │ Linear + Softmax │
                                 │ (predict next    │
                                 │  word probs)     │
                                 └──────────────────┘
Positional Encoding — Teaching Transformers About Order
Self-attention has no sense of order — it looks at ALL tokens simultaneously. "cat ate mouse" and "mouse ate cat" would produce the same attention scores if we don't tell the model about position. Positional encoding adds position information to each token's embedding.
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
| pos | = position of the token in the sequence (0, 1, 2, ...) |
| i | = dimension-pair index (0, 1, 2, ..., d_model/2 − 1) |
| d_model | = model embedding dimension (e.g., 512) |
| sin/cos | = sine and cosine functions — alternate between even and odd dimensions |
| 10000 | = base constant — creates different frequencies at different dimensions |
POSITIONAL ENCODING INTUITION:
Each position gets a unique "fingerprint" vector
Made of sine and cosine waves at different frequencies
Position 0: [sin(0), cos(0), sin(0), cos(0), ...]
= [0, 1, 0, 1, ...]
Position 1: [sin(1), cos(1), ..., sin(≈1/10000), cos(≈1/10000)]
= [0.84, 0.54, ..., 0.0001, 1.0]
(the first pair uses the highest frequency, the last pair the lowest)
WHY SINE/COSINE?
1. Bounded: always between -1 and +1
2. Unique: each position gets a unique pattern
3. Relative: model can learn "position A is 3 steps before position B"
because sin(a+b) = sin(a)cos(b) + cos(a)sin(b) — linear relationship!
4. Defined for any length: the formula can be evaluated at positions beyond the training length (in practice, quality often still degrades on much longer sequences)
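The sinusoidal scheme is short enough to compute directly. The sketch below follows the sine/cosine definition from Vaswani et al. (2017), with even dimensions holding sines and odd dimensions holding cosines; the function name and the sizes (50 positions, d_model=512) are illustrative, and d_model is assumed to be even.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """Return an [n_positions, d_model] matrix of sinusoidal encodings."""
    positions = np.arange(n_positions)[:, None]      # [n, 1]
    i = np.arange(d_model // 2)[None, :]             # [1, d_model/2] pair index
    angles = positions / (10000 ** (2 * i / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(50, 512)
print(np.round(pe[0, :4], 2))   # [0. 1. 0. 1.]
print(np.round(pe[1, :2], 2))   # [0.84 0.54]
```

The resulting matrix is simply added to the token embeddings before the first transformer block.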
MODERN ALTERNATIVE: RoPE (Rotary Positional Embedding)
Used by: LLaMA, Mistral, Qwen, Falcon, Phi
Encodes position as rotation of Q and K vectors
Attention scores then depend on the relative distance between tokens, which tends to work well for long contexts
RoPE(q, position) = q × rotation_matrix(position)
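A minimal sketch of the rotation idea, assuming the common formulation in which consecutive pairs of dimensions are rotated by an angle proportional to the position (the pairing convention and the `rope` helper are illustrative; real implementations differ in layout details). The check at the end shows the key property: the dot product of a rotated query and key depends only on the offset between their positions.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate pairs (x[0], x[1]), (x[2], x[3]), ... by position-dependent angles."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    theta = position * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

score_a = rope(q, 5) @ rope(k, 2)       # positions 5 and 2 (offset 3)
score_b = rope(q, 105) @ rope(k, 102)   # positions 105 and 102 (offset 3)
print(np.isclose(score_a, score_b))     # True: only the offset matters
```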
Feed-Forward Layer, Residual Connections & LayerNorm
TRANSFORMER BLOCK COMPONENTS:
1. FEED-FORWARD NETWORK (FFN):
Applied to each token INDEPENDENTLY (no cross-token interaction):
FFN(x) = max(0, x·W₁ + b₁) · W₂ + b₂
Typical sizes: d_model=512, FFN inner dim=2048 (4× expansion)
GPT-3: d_model=12288, FFN=49152
What does FFN do?
Attention = "relate tokens to each other"
FFN = "think about each token individually"
Evidence suggests FFN layers store much of the model's factual knowledge (Meng et al., 2022)
"Paris is the capital of ___" → knowledge stored in FFN weights
2. RESIDUAL CONNECTIONS (Skip Connections):
Output = x + SubLayer(x)
↑ original input is ADDED back
Why? Prevents vanishing gradients in deep networks!
Gradient can flow DIRECTLY through addition → no degradation
Allows training 12, 24, 96+ layers deep
3. LAYER NORMALIZATION:
LayerNorm(x) = γ × (x - μ)/σ + β
μ = mean of x's values
σ = standard deviation of x's values
γ, β = learned scale and shift parameters
Stabilizes training by normalizing activations.
Each token's vector is normalized to mean≈0, std≈1
then rescaled by learned γ, β
FINAL TRANSFORMER BLOCK (Post-LN style):
x → [Multi-Head Self-Attention] → + x → LayerNorm → h
h → [Feed Forward Network] → + h → LayerNorm → output
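The three components combine into one block as sketched below. This is a toy NumPy version under stated assumptions: `mix_tokens` is a hypothetical stand-in for the attention sublayer (it just averages tokens), the sizes are made up, and the weights are random rather than trained.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token vector (last axis), then rescale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: max(0, x·W1 + b1)·W2 + b2."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def post_ln_block(x, attention_fn, p):
    """Post-LN block: sublayer -> add residual -> LayerNorm, applied twice."""
    h = layer_norm(x + attention_fn(x), p["gamma1"], p["beta1"])
    f = ffn(h, p["W1"], p["b1"], p["W2"], p["b2"])
    return layer_norm(h + f, p["gamma2"], p["beta2"])

def mix_tokens(x):
    # Hypothetical stand-in for self-attention: every token gets the mean vector
    return np.tile(x.mean(axis=0), (x.shape[0], 1))

rng = np.random.default_rng(0)
n_tokens, d_model, d_ff = 3, 8, 32   # 4x expansion in the FFN
p = {
    "W1": rng.normal(size=(d_model, d_ff)) * 0.1, "b1": np.zeros(d_ff),
    "W2": rng.normal(size=(d_ff, d_model)) * 0.1, "b2": np.zeros(d_model),
    "gamma1": np.ones(d_model), "beta1": np.zeros(d_model),
    "gamma2": np.ones(d_model), "beta2": np.zeros(d_model),
}
x = rng.normal(size=(n_tokens, d_model))
out = post_ln_block(x, mix_tokens, p)
print(out.shape)  # (3, 8)
```

Note that the second residual adds the output of the first LayerNorm (h), not the original input x.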
Three Transformer Flavors
╔═══════════════╦═══════════════════╦══════════════════════╗
║ TYPE ║ ENCODER-ONLY ║ DECODER-ONLY ║
╠═══════════════╬═══════════════════╬══════════════════════╣
║ Models ║ BERT, RoBERTa, ║ GPT (all), Claude, ║
║ ║ DeBERTa, ALBERT ║ LLaMA, Mistral, ║
║ ║ ║ DeepSeek, Qwen ║
╠═══════════════╬═══════════════════╬══════════════════════╣
║ Attention ║ Full bidirectional║ Causal (left-only) ║
║ ║ Every token sees ║ Each token only sees ║
║ ║ all other tokens ║ past tokens ║
╠═══════════════╬═══════════════════╬══════════════════════╣
║ Training ║ MLM: predict ║ Next token prediction║
║ Objective ║ masked tokens ║ (autoregressive) ║
╠═══════════════╬═══════════════════╬══════════════════════╣
║ Best For ║ Understanding ║ Generation: ║
║ ║ tasks: ║ chatbots, writing, ║
║ ║ classification, ║ code, QA, everything ║
║ ║ NER, QA extract. ║ ║
╠═══════════════╬═══════════════════╬══════════════════════╣
║ TYPE ║ ENCODER-DECODER ║ ║
╠═══════════════╬═══════════════════╬══════════════════════╣
║ Models ║ T5, BART, mT5 ║ ║
║ ║ Pegasus, MarianMT ║ ║
╠═══════════════╬═══════════════════╬══════════════════════╣
║ Best For ║ Seq2Seq tasks: ║ ║
║ ║ translation, ║ ║
║ ║ summarization, ║ ║
║ ║ question gen. ║ ║
╚═══════════════╩═══════════════════╩══════════════════════╝
Hugging Face — Using Any Transformer Model
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import pipeline
import torch

# ── EASY WAY: Pipeline API ───────────────────────────────
# Handles tokenization + model + postprocessing automatically

# Sentiment analysis (encoder-only: the default checkpoint is a DistilBERT model)
sentiment = pipeline("sentiment-analysis")
result = sentiment("I absolutely love this course on NLP!")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.9998}]

# Text generation (decoder-only: uses GPT-2)
generator = pipeline("text-generation", model="gpt2")
result = generator(
    "The future of AI is",
    max_new_tokens=50,
    temperature=0.8,
    do_sample=True,
    top_p=0.9
)
print(result[0]['generated_text'])

# Summarization (encoder-decoder: uses BART)
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
long_text = """
Transformers were introduced in the paper "Attention Is All You Need" by
Vaswani et al. in 2017. They use self-attention mechanisms to process
sequences in parallel, solving the vanishing gradient problem that plagued
RNNs. The architecture consists of an encoder and decoder, each containing
multiple layers of multi-head self-attention and feed-forward networks.
Transformers have since become the dominant architecture in NLP.
"""
summary = summarizer(long_text, max_length=60, min_length=20)
print("\nSummary:", summary[0]['summary_text'])

# ── MANUAL WAY: Full control ──────────────────────────────
# Load tokenizer and model from the SAME checkpoint so the tokenizer's
# outputs match the inputs the model expects
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

text = "Transformers are amazing!"
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():  # inference only: no gradients needed
    outputs = model(**inputs)

logits = outputs.logits
probs = torch.softmax(logits, dim=-1)
pred_class = torch.argmax(probs).item()
labels = model.config.id2label
print(f"\nPrediction: {labels[pred_class]} ({probs[0][pred_class]:.3f})")
MLM, NSP, CLS Token & Fine-Tuning
BERT Architecture Overview
BERT = Bidirectional Encoder Representations from Transformers
= ENCODER-ONLY Transformer (uses only the Encoder stack)
BERT-base: 12 transformer encoder layers
768 hidden dimensions
12 attention heads
110M parameters
BERT-large: 24 transformer encoder layers
1024 hidden dimensions
16 attention heads
340M parameters
KEY INNOVATION: BIDIRECTIONAL attention
Previous models (GPT, ELMo): read text left→right OR right→left
BERT: reads text in BOTH directions simultaneously!
"The bank can guarantee deposits will eventually cover..."
When processing "bank", BERT sees:
← "The" (left) AND "can guarantee deposits" (right) →
Both directions together let BERT figure out "bank" = financial!
INPUT FORMAT:
[CLS] token_1 token_2 ... [SEP] token_A token_B ... [SEP]
  ↑                         ↑                         ↑
always first       separates sentences           always last
MLM — Masked Language Modeling (BERT's Training Task)
BERT is trained on a "fill in the blank" task:
Original: "The cat sat on the mat"
Masked: "The [MASK] sat on the mat"
Task: Predict what [MASK] is → "cat"
MASKING STRATEGY (15% of tokens are selected):
80% replaced with [MASK]: "The [MASK] sat"
10% replaced with random word: "The dog sat" (still predict "cat"!)
10% kept unchanged: "The cat sat" (but still predict "cat"!)
Why NOT just always use [MASK]?
At fine-tuning/inference, [MASK] never appears!
If model only ever sees [MASK], it won't learn good representations
for non-masked tokens. The random replacement forces the model
to develop good representations for ALL tokens.
TRAINING OBJECTIVE:
Loss = Cross-entropy on masked positions ONLY
Don't penalize predictions on non-masked positions
RESULT: BERT learns deep bidirectional representations because
to predict [MASK], it must understand context from both sides!
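The 80/10/10 masking strategy can be sketched in a few lines. This is a simplified word-level illustration, not BERT's actual preprocessing code: the function name `mask_tokens`, the tiny vocabulary, and the fixed seed are assumptions, and real implementations work on subword IDs and skip special tokens.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15, seed=0):
    """Return (model_inputs, labels); labels are None where no loss is computed."""
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                    # not selected: left as-is, no loss
        labels[i] = tok                 # the model must predict the original
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK            # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)   # 10%: replace with a random word
        # remaining 10%: keep the original token unchanged
    return inputs, labels

tokens = "the cat sat on the mat".split() * 5
vocab = ["dog", "ran", "under", "a", "tree"]
inputs, labels = mask_tokens(tokens, vocab)
for original, model_input, label in zip(tokens, inputs, labels):
    if label is not None:
        print(f"{original!r} -> input {model_input!r}, target {label!r}")
```

The loss is then computed only at positions where the label is not None.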
NSP — Next Sentence Prediction
NSP is BERT's second pre-training task:
Task: Given two sentences, does sentence B follow sentence A naturally?
Positive example (IsNext=True):
Sentence A: "The man went to the store."
Sentence B: "He bought a gallon of milk."
Label: 1 (IsNext)
Negative example (IsNext=False):
Sentence A: "The man went to the store."
Sentence B: "Penguins live in Antarctica."
Label: 0 (NotNext) ← randomly sampled, unrelated!
50% of training pairs are IsNext, 50% are NotNext.
The [CLS] token's final representation is used to predict
IsNext/NotNext → BERT learns sentence-level coherence!
NOTE: NSP has been questioned in later research.
RoBERTa (Facebook) removed NSP entirely and got better results!
But [CLS] token's usefulness for classification tasks remains.
CLS Token — BERT's Secret Weapon for Classification
CLS TOKEN MECHANICS:
Input:    [CLS]  I  love  NLP  [SEP]
Position:   0    1   2     3     4
After 12 transformer layers of attention, EACH token has
a contextual representation. The [CLS] token is special:
• It attends to ALL other tokens (via self-attention)
• All other tokens can also influence [CLS]
• After training, [CLS] learns to aggregate the meaning
of the ENTIRE SEQUENCE into one vector
During fine-tuning for classification:
[CLS] representation (768-dim vector)
↓
Linear layer (768 → num_classes)
↓
Softmax → class probabilities
So for sentiment analysis:
[CLS] "I love NLP" [SEP]
After BERT → CLS vector = [0.23, -0.12, 0.89, ...]
Linear layer → [0.05, 0.95] (neg, pos)
→ POSITIVE (95% confidence)
Fine-Tuning BERT
BERT's power comes from pre-training + fine-tuning. Pre-training on about 3.3 billion words (BooksCorpus plus English Wikipedia) gives BERT general language understanding. Fine-tuning on a small task-specific dataset adapts it to your specific problem.
FINE-TUNING STRATEGY:
PRE-TRAINED BERT (general knowledge)
↓ (add task-specific head)
┌─────────────────────┬─────────────────────┬───────────┐
│ Task                │ Head                │ Input     │
├─────────────────────┼─────────────────────┼───────────┤
│ Sentiment Analysis  │ [CLS] → linear      │ [CLS] S   │
│ Named Entity Recog. │ Each token → linear │ tokens    │
│ Question Answering  │ Start/End logits    │ Q [SEP] C │
│ Sentence Similarity │ [CLS] → regression  │ S₁[SEP]S₂ │
│ Text Classification │ [CLS] → linear      │ [CLS] S   │
└─────────────────────┴─────────────────────┴───────────┘
Training settings for fine-tuning:
Learning rate: 2e-5 to 5e-5 (small! don't destroy pre-trained weights)
Epochs: 2-4 (don't overfit on small dataset)
Batch size: 16-32
Sequence length: up to 512 tokens
Why so few epochs?
BERT already knows language. You're just teaching it your
specific task, not starting from scratch.
Fine-tuning on too many epochs → "catastrophic forgetting"!
Fine-Tuning BERT for Sentiment Analysis
# pip install transformers datasets torch scikit-learn
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    pipeline,
)
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score

# ── 1. Load Dataset ───────────────────────────────────────
dataset = load_dataset("imdb")  # 25K training, 25K test reviews
print("Dataset loaded:", dataset)

# ── 2. Load Tokenizer ─────────────────────────────────────
MODEL_NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize_function(examples):
    """Tokenize texts and return input_ids, attention_mask, etc."""
    return tokenizer(
        examples['text'],
        truncation=True,        # cut off if too long
        max_length=512,         # BERT max sequence length
        padding='max_length'    # pad shorter sequences
    )

# Apply tokenization to entire dataset
tokenized = dataset.map(tokenize_function, batched=True)

# ── 3. Load BERT Model ────────────────────────────────────
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,  # IMDB labels: 0 = negative, 1 = positive
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1},
)
# This adds a classification head on top of BERT's [CLS] output

# ── 4. Define Metrics ─────────────────────────────────────
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {'accuracy': accuracy_score(labels, predictions)}

# ── 5. Training Arguments ─────────────────────────────────
training_args = TrainingArguments(
    output_dir="./bert-sentiment",
    num_train_epochs=3,              # usually 2-4 for fine-tuning
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,              # small! don't destroy pre-training
    evaluation_strategy="epoch",     # evaluate at end of each epoch
                                     # (newer transformers releases name this eval_strategy)
    save_strategy="epoch",
    load_best_model_at_end=True,
    warmup_steps=500,                # gradually increase LR at start
    weight_decay=0.01,               # regularization
)

# ── 6. Train! ─────────────────────────────────────────────
# Shuffle before taking a subset: the IMDB splits are ordered by label,
# so the first N rows would all belong to one class.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized['train'].shuffle(seed=42).select(range(5000)),  # small for demo
    eval_dataset=tokenized['test'].shuffle(seed=42).select(range(1000)),
    compute_metrics=compute_metrics
)
trainer.train()
# Fine-tuning takes 10-30 min on GPU, hours on CPU

# ── 7. Inference ──────────────────────────────────────────
sentiment_pipeline = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer
)
test_texts = [
    "This movie was absolutely fantastic!",
    "Worst film I've ever seen. Terrible.",
    "It was okay, nothing special."
]
for text in test_texts:
    result = sentiment_pipeline(text)[0]
    print(f"{result['label']} ({result['score']:.3f}): {text}")
Interview Q&A
GPT is decoder-only with CAUSAL (unidirectional) attention — each token only attends to PAST tokens (left only). This makes GPT great for GENERATION tasks.
Training objectives also differ: BERT uses MLM (fill in blanks) + NSP. GPT uses autoregressive next-token prediction (predict next word from all previous words).
Result: BERT better at classification, NER, extractive QA. GPT better at chat, writing, code generation, reasoning.
This is also why modern models use RoPE (Rotary Position Embedding): it has no fixed table of learned positions, so with suitable scaling it can be extended to sequences longer than those seen during training. LLaMA 3.1, which uses RoPE, handles up to 128K tokens.
GPT Architecture, In-Context Learning & Chat Models
GPT Architecture
GPT = Decoder-only Transformer (stacks ONLY decoder blocks)
TOKENS:  "The"  "cat"  "sat"  "on"   "the"   "mat"
MASKS:   ─────────────────────────────────────────
         ✓      ✓✓     ✓✓✓    ✓✓✓✓   ✓✓✓✓✓   ✓✓✓✓✓✓
         (1)    (1,2)  (1-3)  (1-4)  (1-5)   (1-6)
CAUSAL MASKING = "Autoregressive mask":
When processing "sat", it can ONLY see:
→ "The" (position 1) ✓
→ "cat" (position 2) ✓
→ "sat" (position 3) ✓ (itself)
→ "on" (position 4) ✗ BLOCKED! Future token!
→ "the" (position 5) ✗ BLOCKED!
→ "mat" (position 6) ✗ BLOCKED!
WHY CAUSAL MASKING?
During TRAINING: "The cat sat on the mat" is given
GPT is trained to predict: cat|The, sat|The cat, on|The cat sat...
Without masking, GPT could "cheat" by looking at future words!
During INFERENCE: We don't HAVE future tokens — we're generating them!
So causal masking matches inference reality.
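In code, causal masking amounts to setting the scores for future positions to negative infinity before the softmax. The sketch below uses all-zero scores purely for illustration, so each row spreads its weight evenly over the tokens it is allowed to see; the function name is an assumption.

```python
import numpy as np

def causal_attention_weights(scores):
    """Softmax over each row after blocking positions to the right of the diagonal."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
    masked = np.where(future, -np.inf, scores)           # block future tokens
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)                                   # exp(-inf) = 0
    return e / e.sum(axis=-1, keepdims=True)

tokens = ["The", "cat", "sat", "on", "the", "mat"]
scores = np.zeros((len(tokens), len(tokens)))   # placeholder scores
weights = causal_attention_weights(scores)

# Row for "sat" (index 2): weight ~0.33 on "The", "cat", "sat"; exactly 0 on the rest
print(np.round(weights[2], 2))
```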
GENERATION PROCESS (AUTOREGRESSIVE):
Input: "The cat"
Step 1: Generate next token → "sat" (append to input)
Input: "The cat sat"
Step 2: Generate next token → "on" (append)
Input: "The cat sat on"
Step 3: Generate "the" → and so on...
STOP when an end-of-sequence (EOS) token is generated
In-Context Learning — GPT's Emergent Magic
In-Context Learning (ICL) means GPT can learn to do a new task just from examples given in the prompt — without updating any weights. You don't need to fine-tune. Just show examples in the prompt and the model follows the pattern.
ZERO-SHOT:
Prompt: "Translate English to French: Hello"
GPT: "Bonjour" ← No examples given, just instruction
ONE-SHOT:
Prompt: "Translate English to French:
English: Thank you → French: Merci
English: Hello →"
GPT: "Bonjour"
FEW-SHOT:
Prompt: "Classify sentiment (POSITIVE/NEGATIVE):
'I loved it!' → POSITIVE
'Terrible food' → NEGATIVE
'Best movie ever!' →"
GPT: "POSITIVE"
WHY DOES THIS WORK?
GPT has seen BILLIONS of examples during training.
It has learned patterns like: "question → answer",
"English: X → French: Y", "input: X, output: Y".
Given a few examples, it recognizes the pattern and continues it.
This is why bigger models (GPT-4 vs GPT-2) are better at ICL —
they've compressed more patterns from more data.
RLHF — How GPT Becomes ChatGPT
Raw GPT → text completion model (predicts next token)
"Tell me how to make a bomb" → GPT just continues the text!
ChatGPT = GPT + RLHF (Reinforcement Learning from Human Feedback)
STAGE 1: SUPERVISED FINE-TUNING (SFT)
Human trainers write ideal prompt-response pairs:
Prompt: "Explain quantum physics"
Response: "Quantum physics is the branch of physics..."
Fine-tune GPT on these demonstrations.
STAGE 2: REWARD MODEL TRAINING
Show humans multiple GPT outputs for same prompt.
Humans rank them: Response A > Response C > Response B
Train a "reward model" to predict human preferences.
STAGE 3: RL OPTIMIZATION (PPO)
Generate responses → reward model scores them
→ PPO algorithm updates GPT to maximize reward
→ GPT learns to produce responses humans prefer
RESULT: GPT that follows instructions, refuses harmful requests,
stays on topic, is helpful, harmless, and honest!
Claude (Anthropic) uses Constitutional AI instead of pure RLHF:
A set of principles ("be helpful, harmless, honest") guides the model
The model critiques and revises its own outputs against the constitution
GPT Versions — Evolution
| Model | Year | Params | Key Innovation |
|---|---|---|---|
| GPT-1 | 2018 | 117M | First GPT: decoder-only transformer pre-trained on BooksCorpus |
| GPT-2 | 2019 | 1.5B | Zero-shot task performance; OpenAI initially withheld it as "dangerous" |
| GPT-3 | 2020 | 175B | Few-shot learning emerges; in-context learning discovered |
| InstructGPT | 2022 | 175B | RLHF applied — follows instructions, much safer |
| ChatGPT | 2022 | ~175B | Chat interface + RLHF; 100M users in 2 months |
| GPT-4 | 2023 | ~1.8T? (rumored) | Multimodal, expert-level reasoning; 8K/32K context at launch, 128K with GPT-4 Turbo |
| GPT-4o | 2024 | — | Omni-modal: text, audio, vision in single model |
GPT API Usage + Streaming
# pip install openai
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY env variable

# ── Basic Chat Completion ──────────────────────────────────
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are an expert NLP teacher."},
        {"role": "user", "content": "Explain transformers in 2 sentences."}
    ],
    temperature=0.7,
    max_tokens=200
)
print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")

# ── Few-Shot Prompting ─────────────────────────────────────
few_shot_messages = [
    {"role": "system", "content": "You classify sentiment."},
    {"role": "user", "content": "I loved this movie!"},
    {"role": "assistant", "content": "POSITIVE"},
    {"role": "user", "content": "Terrible service."},
    {"role": "assistant", "content": "NEGATIVE"},
    {"role": "user", "content": "It was okay I guess."},
]
resp = client.chat.completions.create(model="gpt-4o-mini", messages=few_shot_messages)
print("Few-shot result:", resp.choices[0].message.content)  # a sentiment label; exact output may vary

# ── Streaming Response ─────────────────────────────────────
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about NLP."}],
    stream=True
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks carry no text
        print(delta, end="", flush=True)

# ── Using open-source GPT-style model (local) ─────────────
# pip install transformers torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "microsoft/phi-2"  # small 2.7B GPT-style model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

prompt = "Transformers in NLP are"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, temperature=0.8, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Most Major LLMs Are GPT-Style Decoder-Only Transformers
GPT (OpenAI)
The original. Pre-train on next-token prediction → RLHF → ChatGPT. GPT-4 is multimodal with ~1.8T params (rumored MoE architecture).
Claude (Anthropic)
Decoder-only transformer. Constitutional AI instead of pure RLHF. Claude 3.5 Sonnet: 200K context, strong reasoning and coding.
LLaMA (Meta)
Open-weights GPT-style model. LLaMA 3: 8B/70B/405B. Uses RoPE, GQA (Grouped Query Attention), SwiGLU activation.
DeepSeek
Decoder-only. DeepSeek V3 (671B MoE, 37B active). DeepSeek R1 adds chain-of-thought reasoning. MLA (Multi-head Latent Attention) to reduce KV cache.
Qwen (Alibaba)
Decoder-only, heavily optimized for Chinese+English. Qwen2.5: 0.5B to 72B. Uses GQA, RoPE, strong code and math capabilities.
Kimi (Moonshot AI)
Decoder-only optimized for very long context (1M tokens). Strong at document analysis, research tasks, long-form Chinese content.
BM25, Dense Retrieval, Vector Databases & ANN
Sparse vs Dense Retrieval
SPARSE RETRIEVAL (BM25, TF-IDF):
Query: "What is machine learning?"
Method: Count keyword overlaps between query and documents
Finds: Docs containing "machine", "learning" (exact match)
Misses: "What is ML?" → "ML" ≠ "machine learning" in keyword space
Vector: [0, 0, 1, 0, 0, 1, 0, ...] ← mostly zeros (sparse)
DENSE RETRIEVAL (Neural Embeddings):
Query: "What is machine learning?"
Method: Embed query → find closest embedding vectors
Finds: Docs about "ML", "AI training", "supervised algorithms"
even if they never use the exact phrase "machine learning"!
Vector: [0.23, -0.12, 0.87, ...] ← dense (no zeros)
HYBRID RETRIEVAL (Best of both):
Score = α × BM25_score + (1-α) × Dense_score
Use BM25 for keyword precision + dense for semantic recall
Used by: Elasticsearch, Weaviate, Qdrant in production
WHEN TO USE WHICH:
Keyword search (exact product names, IDs): BM25 wins
Semantic search (meaning, paraphrase): Dense wins
General enterprise search: Hybrid wins
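The hybrid formula can be sketched as follows. BM25 scores and dense similarities live on different scales, so this sketch min-max normalizes each before mixing; that normalization step, the function names, and the example scores are all assumptions for illustration (production systems often use other fusion methods).

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_rank(bm25_scores, dense_scores, alpha=0.5):
    """Score = alpha * BM25 + (1 - alpha) * dense, on normalized scores."""
    bm25_n, dense_n = min_max(bm25_scores), min_max(dense_scores)
    docs = set(bm25_n) | set(dense_n)
    combined = {
        d: alpha * bm25_n.get(d, 0.0) + (1 - alpha) * dense_n.get(d, 0.0)
        for d in docs
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Made-up scores for three documents
bm25 = {"doc_a": 7.2, "doc_b": 1.1, "doc_c": 0.0}
dense = {"doc_a": 0.62, "doc_b": 0.81, "doc_c": 0.35}

ranking = hybrid_rank(bm25, dense, alpha=0.5)
for doc, score in ranking:
    print(f"{doc}: {score:.3f}")
```

Raising alpha favors exact keyword matches; lowering it favors semantic similarity.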
BM25 — The Gold Standard Sparse Retrieval
score(d, Q) = Σᵢ IDF(qᵢ) × f(qᵢ,d) × (k₁ + 1) / ( f(qᵢ,d) + k₁ × (1 − b + b × |d| / avgdl) )
| qᵢ | = each query term |
| f(qᵢ,d) | = frequency of term qᵢ in document d |
| |d| | = length of document d (in words) |
| avgdl | = average document length in the corpus |
| k₁ | = term saturation parameter (typically 1.2-2.0). Controls how much repeated terms boost score. |
| b | = length normalization (typically 0.75). Higher b = more penalty for long documents. |
| IDF | = Inverse Document Frequency (same as TF-IDF — rare terms score higher) |
BM25 adds two improvements over TF-IDF: (1) Term frequency saturation — mentioning a word 20 times vs 10 times doesn't double the score; there's diminishing returns. (2) Document length normalization — a short document with "machine learning" mentioned once is more relevant than a 10-page document with it mentioned once in passing.
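A compact BM25 scorer is sketched below to make both effects concrete. It assumes whitespace tokenization and the smoothed IDF variant log((N − df + 0.5)/(df + 0.5) + 1), which is one common choice and not the only one; the function name and the toy documents are made up.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in docs against the query with BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for q in query.lower().split():
            if q not in tf:
                continue
            idf = math.log((n_docs - df[q] + 0.5) / (df[q] + 0.5) + 1)
            length_norm = 1 - b + b * len(d) / avgdl
            score += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    "machine learning is fun",
    "deep learning uses neural networks for machine perception and much more",
    "cooking pasta at home",
]
scores = bm25_scores("machine learning", docs)
for doc, s in zip(docs, scores):
    print(f"{s:.3f}  {doc}")
```

The first two documents both contain each query term once, but the shorter one scores higher because of length normalization; the third scores zero.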
Vector Databases — Storing and Searching Embeddings
VECTOR DATABASE WORKFLOW:
INDEXING PHASE (one-time):
Documents → Embedding Model → Vectors → Store in Vector DB
"Paris is in France" → [0.23, -0.12, 0.87, ...] → stored at id=1
QUERY PHASE (real-time):
User Query → Embedding Model → Query Vector
"Where is Paris?" → [0.21, -0.10, 0.89, ...]
Vector DB: Find k nearest vectors to query vector
Returns: Top-5 most similar document IDs + scores
NAIVE APPROACH — Exact Nearest Neighbor:
Compare query vector to EVERY vector in database
100M docs × 1536 dims ≈ 154 BILLION multiply-add operations per query
At 1 ns per operation ≈ 154 seconds per query 😱 WAY TOO SLOW!
ANN — Approximate Nearest Neighbor:
Sacrifice tiny bit of accuracy for 100-1000× speed gain
HNSW, IVF, PQ — different algorithms to find "good enough" neighbors
Typical: find 95-99% of true nearest neighbors in milliseconds!
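For reference, the naive exact search is only a few lines of NumPy, which makes it a useful baseline for measuring an ANN index's recall on small collections. The sketch below uses cosine similarity and random vectors; the function name and sizes are illustrative.

```python
import numpy as np

def exact_top_k(query, vectors, k=5):
    """Brute-force cosine search: compare the query with EVERY stored vector."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q                    # one similarity per stored vector
    top = np.argsort(-sims)[:k]     # indices of the k most similar
    return top, sims[top]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64))                    # 1000 "documents"
query = vectors[42] + 0.01 * rng.normal(size=64)         # near document 42

ids, sims = exact_top_k(query, vectors, k=5)
print(ids[0])   # 42: the nearest neighbor is the document we perturbed
```

The cost is one dot product per stored vector, which is exactly what ANN indexes avoid.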
HNSW — The Standard ANN Algorithm
HNSW = Hierarchical Navigable Small World
STRUCTURE: Multi-layer graph (like a highway system)
Layer 2 (highway): Few nodes, long jumps
A ─────────────────────── E
Layer 1 (roads): More nodes, medium connections
A ──── B ──── C ──── D ── E
Layer 0 (streets): All nodes, local connections
A ─ B ─ C ─ D ─ E ─ F ─ G ─ H ─ I
SEARCH ALGORITHM:
1. Start at top layer (Layer 2), find closest node to query
2. Drop down to Layer 1, search around that node
3. Drop down to Layer 0, do fine-grained local search
4. Return k nearest neighbors found
Like navigating a city: take the highway to the right
neighborhood, then local streets to the exact address.
PERFORMANCE:
Index build: O(n log n)
Search: O(log n) ← logarithmic! Very fast.
Memory: O(n × M × d) where M = neighbors per node
USED BY: Faiss (Facebook), Weaviate, Qdrant, Pinecone,
Chroma, Milvus — all major vector databases!
Popular Vector Databases
| Database | Type | Best For | Notes |
|---|---|---|---|
| Chroma | Open source, local | Prototyping, local dev | Easiest to start with, Python-native, stores on disk |
| Faiss | Library (Meta) | Research, large scale | Not a full DB, just ANN library. Very fast. Used internally at Meta. |
| Pinecone | Managed cloud | Production RAG | Fully managed, easy API, expensive at scale |
| Weaviate | Open source/cloud | Hybrid search | Built-in BM25 + dense hybrid, GraphQL API |
| Qdrant | Open source/cloud | High performance | Rust-based, very fast, good filtering support |
| pgvector | PostgreSQL extension | Existing Postgres users | Add vector search to your existing Postgres DB |
Building Semantic Search with Chroma
# pip install chromadb sentence-transformers
import chromadb
from sentence_transformers import SentenceTransformer

# ── 1. Setup ──────────────────────────────────────────────
client = chromadb.Client()  # in-memory for demo
# For persistence: chromadb.PersistentClient(path="./my_db")
collection = client.create_collection("nlp_knowledge_base")
model = SentenceTransformer("all-MiniLM-L6-v2")

# ── 2. Index Documents ────────────────────────────────────
documents = [
    "BERT is an encoder-only transformer model trained with MLM.",
    "GPT uses autoregressive decoder-only architecture.",
    "Transformers use self-attention to process sequences in parallel.",
    "BPE tokenization splits rare words into subword pieces.",
    "RAG combines retrieval with language model generation.",
    "Word2Vec learns word embeddings using context window prediction.",
    "LLaMA is Meta's open-source large language model.",
    "Fine-tuning adapts pre-trained models to specific tasks.",
]
embeddings = model.encode(documents).tolist()
collection.add(
    documents=documents,
    embeddings=embeddings,
    ids=[f"doc_{i}" for i in range(len(documents))]
)
print(f"Indexed {collection.count()} documents")

# ── 3. Semantic Search ────────────────────────────────────
queries = [
    "How does BERT learn language representations?",
    "What is the architecture of GPT models?",
    "How are words split into tokens?",
]
for query in queries:
    query_emb = model.encode([query]).tolist()
    results = collection.query(
        query_embeddings=query_emb,
        n_results=2
    )
    print(f"\nQuery: '{query}'")
    for doc, dist in zip(results['documents'][0], results['distances'][0]):
        # Chroma returns a distance, not a similarity: smaller means closer
        print(f"  [distance={dist:.3f}] {doc}")
Complete RAG Pipeline
Why RAG? The Problem It Solves
1. Knowledge cutoff: An LLM's training data ends at a fixed date (for example, the original GPT-4's ended in September 2021). It doesn't know about events after that date.
2. Hallucination: LLMs confidently state incorrect facts when they don't know the answer.
3. Private data: LLMs don't have access to your company's documents, databases, or proprietary knowledge.
4. No citations: Can't easily trace WHERE the information came from.
RAG = Retrieval-Augmented Generation. First retrieve relevant documents from a knowledge base, then augment the LLM's prompt with those documents, then generate an answer grounded in the retrieved evidence. The LLM only needs to reason over and summarize the retrieved text; the facts come from the documents rather than from the model's parameters.
Full RAG Pipeline Architecture
INDEXING PHASE (offline, one-time):
─────────────────────────────────────────────────────────────
Documents (PDFs, websites, docs)
│
▼
[Chunking] → Split into 256-512 token overlapping chunks
│
▼
[Embedding Model] → Each chunk → dense vector
│
▼
[Vector Database] → Store (chunk_text, vector, metadata)
─────────────────────────────────────────────────────────────
QUERY PHASE (real-time, each user query):
─────────────────────────────────────────────────────────────
User Query: "What is LLaMA's context window?"
│
▼
[Query Embedding] → Query → dense vector via same embedding model
│
▼
[Retrieval]
├── Dense: Find top-k chunks by cosine similarity
└── Sparse: BM25 keyword matching (optional)
→ Merge, rerank, return top-5 most relevant chunks
│
▼
[Context Assembly]:
System: "Answer using ONLY the provided context."
Context: [chunk1: "LLaMA 3.1 supports 128K tokens..."]
[chunk2: "Meta released LLaMA 3 with..."]
Question: "What is LLaMA's context window?"
│
▼
[LLM Generation] (GPT-4, Claude, LLaMA, etc.)
│
▼
Answer: "LLaMA 3.1 supports a context window of 128,000 tokens
according to the documentation. [Source: chunk1]"
─────────────────────────────────────────────────────────────
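The [Chunking] step of the indexing phase can be sketched as a sliding window over tokens. This is a minimal sketch, not the pipeline's actual chunker: `chunk_tokens` is a hypothetical helper, whitespace splitting stands in for a real tokenizer, and the tiny `chunk_size` and `overlap` values are chosen only to keep the output readable.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=32):
    """Split a token list into windows of `chunk_size` that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Assumption: whitespace splitting stands in for a real tokenizer.
text = " ".join(f"w{i}" for i in range(10))
chunks = chunk_tokens(text.split(), chunk_size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk starts with the last token of the previous one, so a sentence that straddles a boundary is still seen whole by at least one chunk. Larger overlaps reduce boundary losses at the cost of storing more vectors.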
Extractive vs Abstractive QA
EXTRACTIVE QA (BERT-style):
Context: "The Eiffel Tower was completed in 1889 in Paris."
Question: "When was the Eiffel Tower completed?"
Model output: [start_pos=6, end_pos=6] → "1889" (0-indexed word positions)
Just EXTRACTS a span from the context. No generation!
Use BERT + span prediction head.
Fast, factual, no hallucination (can't generate text not in context).
Limitation: Can't synthesize across multiple paragraphs.
ABSTRACTIVE QA (GPT-style):
Context: "The Eiffel Tower was completed in 1889..."
Question: "When was the Eiffel Tower completed?"
Model GENERATES: "The Eiffel Tower was completed in the year 1889,
during the World's Fair in Paris."
Can synthesize, paraphrase, summarize multiple sources.
More flexible but can hallucinate.
RAG QA = Abstractive QA + Retrieved Context:
Best of both worlds:
→ Retrieved context grounds the generation (reduces hallucination)
→ LLM generation allows synthesis across multiple chunks
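The span-prediction idea behind extractive QA can be illustrated without a model. In this sketch the start and end scores are made-up numbers standing in for what a trained span-prediction head would output, and `best_span` is a hypothetical helper that picks the highest-scoring valid span.

```python
def best_span(start_scores, end_scores, max_len=10):
    """Return (start, end) maximizing start_scores[s] + end_scores[e] with s <= e."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only consider spans that start at s and are at most max_len tokens long.
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


tokens = "The Eiffel Tower was completed in 1889 in Paris .".split()
# Assumption: illustrative scores; a real model would compute these per token.
start_scores = [0.1, 0.2, 0.1, 0.0, 0.3, 0.5, 4.0, 0.2, 0.6, 0.0]
end_scores = [0.0, 0.1, 0.3, 0.0, 0.2, 0.1, 3.5, 0.3, 1.0, 0.0]

s, e = best_span(start_scores, end_scores)
answer = " ".join(tokens[s:e + 1])
print(s, e, answer)
```

Because the answer is always a slice of `tokens`, the model cannot output words that are absent from the context, which is the property the comparison above relies on.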
Production RAG Pipeline — Full Code
# pip install chromadb sentence-transformers openai
import chromadb
from sentence_transformers import SentenceTransformer
from openai import OpenAI
import textwrap

# ── INDEXING PHASE ────────────────────────────────────────
KNOWLEDGE_BASE = [
    "LLaMA 3.1 by Meta supports a 128,000 token context window and comes in 8B, 70B, and 405B parameter sizes.",
    "Claude 3.5 Sonnet by Anthropic has a 200,000 token context window and excels at coding and reasoning tasks.",
    "GPT-4o by OpenAI is multimodal (text, image, audio) with a 128,000 token context window.",
    "DeepSeek V3 uses a Mixture-of-Experts architecture with 671B total parameters but only 37B active per token.",
    "RAG (Retrieval Augmented Generation) reduces hallucinations by grounding LLM responses in retrieved documents.",
    "BPE tokenization was introduced for NMT by Sennrich et al. in 2016 and is used by GPT models.",
    "BERT uses WordPiece tokenization and a vocabulary of 30,000 tokens. It processes text bidirectionally.",
    "Transformers were introduced in 'Attention Is All You Need' by Vaswani et al. at Google in 2017.",
    "Qwen 2.5 by Alibaba supports both English and Chinese and comes in sizes from 0.5B to 72B parameters.",
    "Fine-tuning adapts a pre-trained model to a specific task using a small labeled dataset.",
]

embed_model = SentenceTransformer("all-MiniLM-L6-v2")
chroma = chromadb.Client()
collection = chroma.create_collection("llm_knowledge")

# Embed and store all documents
embeddings = embed_model.encode(KNOWLEDGE_BASE).tolist()
collection.add(
    documents=KNOWLEDGE_BASE,
    embeddings=embeddings,
    ids=[f"doc_{i}" for i in range(len(KNOWLEDGE_BASE))]
)
print(f"✓ Indexed {collection.count()} documents into vector store")

# ── RAG QUERY FUNCTION ────────────────────────────────────
openai_client = OpenAI()

def rag_query(question: str, top_k: int = 3) -> str:
    """
    Full RAG pipeline:
    1. Embed the question
    2. Retrieve top-k similar chunks
    3. Build prompt with retrieved context
    4. Generate answer with LLM
    """
    # Step 1: Embed query
    query_emb = embed_model.encode([question]).tolist()

    # Step 2: Retrieve top-k documents
    results = collection.query(
        query_embeddings=query_emb,
        n_results=top_k
    )
    retrieved_docs = results['documents'][0]

    # Step 3: Build augmented prompt
    context = "\n".join([f"[{i+1}] {doc}" for i, doc in enumerate(retrieved_docs)])
    system_prompt = """You are a helpful AI assistant. Answer questions based ONLY on the provided context.
If the context doesn't contain enough information, say 'I don't have enough information in my knowledge base to answer this.'
Always cite which context number [1], [2], [3] supports your answer."""
    user_prompt = f"""Context:
{context}

Question: {question}

Answer:"""

    # Step 4: Generate answer
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.1  # low temperature for factual tasks
    )
    return response.choices[0].message.content

# ── DEMO ──────────────────────────────────────────────────
questions = [
    "What context window does Claude 3.5 support?",
    "Which models use Mixture of Experts?",
    "How does RAG help reduce hallucinations?",
]
for q in questions:
    print(f"\n{'='*60}")
    print(f"Q: {q}")
    answer = rag_query(q)
    print(f"A: {answer}")
Advanced RAG Techniques
HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer first, embed that answer, then search with it. The hypothetical answer often sits closer to the relevant documents in embedding space than the raw question does.
Reranking: After BM25/dense retrieval, use a cross-encoder (like BGE-Reranker) to re-score the top 100 candidates and keep the top 5. More accurate than embedding similarity alone, at a higher cost per candidate.
Parent-Document Retrieval: Store small chunks for retrieval but return the larger parent context. Best of both: precise retrieval, rich context.
Multi-Query Retrieval: Rephrase the user query into 3-5 variants, retrieve for all, merge results. Reduces the bias of any single phrasing.
FLARE (Forward-Looking Active Retrieval): Retrieve on demand when the model is uncertain during generation, not just once upfront.
Agentic RAG: Let the LLM decide WHEN and WHAT to retrieve using tool calls. Multiple retrieval steps in one answer session.
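Several of the techniques above end with a "merge results" step, for example when multiple query variants or a dense and a sparse retriever each return their own ranking. One common way to do that merge is Reciprocal Rank Fusion (RRF); the source does not say which merge method it has in mind, so this is a sketch of one option. The document IDs are hypothetical and `k=60` is a conventional default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several rankings: each document scores sum(1 / (k + rank)) over the lists."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Assumption: hypothetical retrieval results for three rephrasings of one question.
rankings = [
    ["doc_3", "doc_1", "doc_7"],
    ["doc_1", "doc_3", "doc_5"],
    ["doc_1", "doc_9", "doc_3"],
]
fused = reciprocal_rank_fusion(rankings)
print(fused)
```

RRF uses only ranks, not raw scores, so it can combine BM25 scores and cosine similarities without having to normalize them onto a common scale.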
Interview Q&A
Complete Evaluation Metrics Guide
Classification Metrics — Accuracy, Precision, Recall, F1
CONFUSION MATRIX (for binary classification):
                      PREDICTED
                   Positive   Negative
ACTUAL  Positive    TP=80      FN=20
        Negative    FP=10      TN=90
TP = True Positives: Predicted positive, actually positive ✓
FP = False Positives: Predicted positive, actually negative ✗ (false alarm)
FN = False Negatives: Predicted negative, actually positive ✗ (missed it)
TN = True Negatives: Predicted negative, actually negative ✓
ACCURACY = (TP + TN) / Total = (80+90)/200 = 85%
Problem: Misleading for imbalanced datasets!
If 95% of emails are NOT spam, always predicting "not spam" = 95% accuracy
but you've built a useless spam filter!
PRECISION = TP / (TP + FP) = 80/(80+10) = 88.9%
"Of all the things I said were positive, how many actually were?"
High precision = few false alarms
RECALL = TP / (TP + FN) = 80/(80+20) = 80%
"Of all the actual positives, how many did I catch?"
High recall = few misses
F1 SCORE = 2 × (Precision × Recall) / (Precision + Recall)
= 2 × (0.889 × 0.80) / (0.889 + 0.80) = 84.2%
Harmonic mean — penalizes extreme imbalance between P and R
Use F1 when both precision AND recall matter equally
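The arithmetic above is easy to check in code. The following sketch recomputes the four metrics from the confusion-matrix counts and also reproduces the spam-filter caveat; `classification_metrics` is a hypothetical helper, and the 5/95 split is an assumed dataset of 100 emails matching the 95% figure.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Counts from the confusion matrix above.
m = classification_metrics(tp=80, fp=10, fn=20, tn=90)
print({name: round(value, 3) for name, value in m.items()})

# Assumption: 100 emails, 5 spam; the "filter" always predicts "not spam".
always_negative = classification_metrics(tp=0, fp=0, fn=5, tn=95)
print({name: round(value, 3) for name, value in always_negative.items()})
```

The second call shows 95% accuracy alongside zero recall and zero F1, which is why accuracy alone is not enough on imbalanced data.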
BLEU — Evaluating Machine Translation
BLEU (Bilingual Evaluation Understudy) measures how much overlap there is between a model's output and human-written reference translations. It counts n-gram overlaps.
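BLEU = BP × exp( Σₙ wₙ · log pₙ ), summed over n = 1 … N (typically N = 4)
| BP | = Brevity Penalty: penalizes translations that are too short. BP = 1 if c ≥ r, else e^(1 − r/c), where c is the candidate length and r is the reference length |
| pₙ | = modified n-gram precision: the fraction of candidate n-grams that also appear in the reference, with each n-gram's count clipped to its count in the reference |
| wₙ | = weights for each n-gram order (typically 0.25 each for n = 1, 2, 3, 4) |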
BLEU WORKED EXAMPLE:
Reference: "The cat sat on the mat"
Candidate: "The cat is on the mat"
Unigram precision (1-gram):
Candidate words: The, cat, is, on, the, mat (6 words)
Words in reference: The✓, cat✓, is✗, on✓, the✓, mat✓ = 5/6 = 83.3%
Bigram precision (2-gram):
Candidate bigrams: (The,cat)✓, (cat,is)✗, (is,on)✗, (on,the)✓, (the,mat)✓
= 3/5 = 60%
p₁ = 83.3%, p₂ = 60% → BLEU-2 (BP = 1, equal weights) = √(0.833 × 0.60) ≈ 70.7%
(Actual BLEU also considers 3-grams and 4-grams)
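The worked example can be reproduced with a few lines of plain Python. This is a simplified single-reference sketch, not a replacement for a full BLEU implementation: it stops at bigrams and applies no smoothing, and the helper names are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


def brevity_penalty(cand_len, ref_len):
    """1.0 when the candidate is at least as long as the reference, else e^(1 - r/c)."""
    if cand_len == 0:
        return 0.0
    return 1.0 if cand_len >= ref_len else math.exp(1 - ref_len / cand_len)


reference = "The cat sat on the mat".split()
candidate = "The cat is on the mat".split()

p1 = modified_precision(candidate, reference, 1)
p2 = modified_precision(candidate, reference, 2)
bp = brevity_penalty(len(candidate), len(reference))
# Geometric mean of p1 and p2 with equal weights (BLEU up to bigrams).
bleu2 = bp * math.exp(0.5 * math.log(p1) + 0.5 * math.log(p2))
print(f"p1={p1:.3f} p2={p2:.3f} BP={bp:.1f} BLEU-2={bleu2:.3f}")
```

Extending the loop to n = 4 needs care: this candidate shares no 4-gram with the reference, so p₄ = 0 and unsmoothed BLEU-4 collapses to 0. That is why sentence-level BLEU is usually computed with smoothing or reported at corpus level.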
BLEU LIMITATIONS:
✗ Ignores semantic similarity: "automobile" ≠ "car" even if synonymous
✗ Doesn't capture fluency well
✗ Multiple references needed for reliability
✓ Still industry standard for MT benchmarks
Used by: WMT translation benchmarks, academic comparisons
ROUGE — Evaluating Summarization
ROUGE = Recall-Oriented Understudy for Gisting Evaluation
(Used for summarization evaluation)
KEY DIFFERENCE FROM BLEU:
BLEU focuses on PRECISION (how much of output is in reference)
ROUGE focuses on RECALL (how much of reference is in output)
→ For summarization, recall matters more: did we capture key info?
ROUGE-N: n-gram RECALL
Reference summary: "The transformer architecture uses attention."
Model summary: "Transformers use self-attention mechanisms."
ROUGE-1 (unigram recall):
Reference words: {The, transformer, architecture, uses, attention}
Found in output (after stemming): {transformer ≈ transformers, uses ≈ use, attention ≈ self-attention} = ~3/5 = 60%
ROUGE-L: Longest Common Subsequence
Measures longest matching word sequence (allows gaps)
Reference: "The transformer architecture uses attention"
Output: "Transformers use self-attention mechanisms"
LCS: "transformer ... uses ... attention" = 3 words
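ROUGE-L (recall) = 3/5 = 60%
Unlike ROUGE-1, it only credits words that appear in the same order as in the reference, without requiring them to be adjacent as ROUGE-2 does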
ROUGE IN PRACTICE:
ROUGE-1: used for general content overlap
ROUGE-2: stricter, requires bigram matches
ROUGE-L: used for quality of sentence structure preservation
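To see how ROUGE-1 and ROUGE-L differ, the sketch below computes both recalls with exact token matching. It is deliberately simplified: there is no stemming or lowercasing, only the recall side is shown, and it uses a different sentence pair from the example above so that exact matching is enough.

```python
from collections import Counter


def rouge1_recall(reference, candidate):
    """Fraction of reference unigrams that also occur in the candidate (clipped)."""
    ref_counts, cand_counts = Counter(reference), Counter(candidate)
    overlap = sum(min(count, cand_counts[tok]) for tok, count in ref_counts.items())
    return overlap / len(reference) if reference else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(reference, candidate):
    """LCS length divided by the reference length."""
    return lcs_length(reference, candidate) / len(reference) if reference else 0.0


reference = "the cat sat on the mat".split()
in_order = "the cat is on the mat".split()
shuffled = "mat the on is cat the".split()  # same words, scrambled order

for name, cand in [("in_order", in_order), ("shuffled", shuffled)]:
    print(name, round(rouge1_recall(reference, cand), 3),
          round(rouge_l_recall(reference, cand), 3))
```

Both candidates get the same ROUGE-1 recall because they contain the same words, but the scrambled one scores lower on ROUGE-L, which is the word-order sensitivity described above.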
LLM-as-Judge — The Modern Evaluation Approach
BLEU and ROUGE fail for open-ended generation. If you ask GPT-4 "Explain gravity" and it gives an excellent explanation using different words than the reference answer, BLEU might score it near 0. Modern LLM evaluation uses a stronger LLM (GPT-4 or Claude) to judge the quality of outputs.
LLM-AS-JUDGE WORKFLOW:
User Question: "What is attention in transformers?"
Reference Answer: "Attention allows tokens to focus on relevant..."
Model Output: "The attention mechanism enables each token to..."
Judge Prompt:
"Rate this answer from 1-10 for:
- Factual accuracy
- Completeness
- Clarity
Reference: [reference answer]
Model output: [model output]
Provide score and brief justification."
GPT-4 Judge Output:
Factual accuracy: 9/10
Completeness: 8/10
Clarity: 9/10
Overall: 8.7/10
"The answer correctly explains attention but misses..."
FRAMEWORKS:
RAGAS: Evaluates RAG pipelines (faithfulness, answer relevancy,
context precision, context recall)
MT-Bench: Multi-turn conversation quality evaluation
AlpacaEval: Pairwise win rate against a baseline model's responses, judged by an LLM such as GPT-4
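The judge workflow boils down to two pieces of glue code: building the judge prompt and parsing the scores out of the reply. The sketch below shows both without calling any API; the judge reply is a canned string copied from the example output above, and `build_judge_prompt` and `parse_scores` are hypothetical helper names. In a real run, the prompt would be sent to whichever judge model is in use.

```python
import re

CRITERIA = ["Factual accuracy", "Completeness", "Clarity"]


def build_judge_prompt(question, reference, output):
    """Assemble the rubric, the reference answer and the model output into one prompt."""
    criteria = "\n".join(f"- {c}" for c in CRITERIA)
    return (
        "Rate this answer from 1-10 for:\n"
        f"{criteria}\n"
        f"Question: {question}\n"
        f"Reference: {reference}\n"
        f"Model output: {output}\n"
        "Provide score and brief justification."
    )


def parse_scores(judge_text):
    """Extract 'Criterion: N/10' scores; criteria missing from the reply are skipped."""
    scores = {}
    for criterion in CRITERIA:
        pattern = rf"{re.escape(criterion)}:\s*(\d+(?:\.\d+)?)\s*/\s*10"
        match = re.search(pattern, judge_text)
        if match:
            scores[criterion] = float(match.group(1))
    return scores


prompt = build_judge_prompt(
    "What is attention in transformers?",
    "Attention allows tokens to focus on relevant...",
    "The attention mechanism enables each token to...",
)
# Assumption: canned reply standing in for a real judge-model response.
judge_reply = "Factual accuracy: 9/10\nCompleteness: 8/10\nClarity: 9/10"
scores = parse_scores(judge_reply)
overall = sum(scores.values()) / len(scores)
print(scores, round(overall, 1))
```

Asking for a fixed "Criterion: N/10" format makes the reply machine-parseable; free-form judge output is much harder to aggregate across a test set.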
Evaluation in Python
# pip install evaluate scikit-learn rouge-score
import evaluate
from sklearn.metrics import classification_report

# ── CLASSIFICATION METRICS ─────────────────────────────────
y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions
print(classification_report(y_true, y_pred, target_names=['NEG', 'POS']))

# ── BLEU SCORE ──────────────────────────────────────────────
bleu = evaluate.load("bleu")
references = [["the cat sat on the mat"]]  # list of lists
predictions = ["the cat is on the mat"]
result = bleu.compute(predictions=predictions, references=references)
print(f"\nBLEU score: {result['bleu']:.4f}")

# ── ROUGE SCORE ─────────────────────────────────────────────
rouge = evaluate.load("rouge")
reference_summaries = ["The transformer architecture uses self-attention mechanisms."]
model_summaries = ["Transformers leverage attention to process sequences in parallel."]
result = rouge.compute(
    predictions=model_summaries,
    references=reference_summaries
)
print(f"ROUGE-1: {result['rouge1']:.4f}")
print(f"ROUGE-L: {result['rougeL']:.4f}")

# ── RAGAS — RAG EVALUATION ──────────────────────────────────
# pip install ragas
# Imported under an alias so it does not shadow the `evaluate` module above.
from ragas import evaluate as ragas_evaluate
from ragas.metrics import (
    faithfulness,       # is answer grounded in context?
    answer_relevancy,   # does answer address the question?
    context_precision,  # are retrieved chunks relevant?
    context_recall      # do chunks contain the answer?
)
from datasets import Dataset

eval_data = {
    "question": ["What context window does Claude 3.5 have?"],
    "answer": ["Claude 3.5 Sonnet supports a 200,000 token context window."],
    "contexts": [["Claude 3.5 Sonnet by Anthropic has a 200,000 token context window."]],
    "ground_truth": ["Claude 3.5 Sonnet has a 200K token context window."]
}
dataset = Dataset.from_dict(eval_data)
# score = ragas_evaluate(dataset, metrics=[faithfulness, answer_relevancy])
# print(score)  # Runs LLM-based evaluation
print("RAGAS evaluation code ready (requires OpenAI API key)")
Interview Q&A
Practice Questions — All Levels
Given TP=45, FP=10, FN=5, TN=90 (150 examples in total):
Accuracy = (45+90)/150 = 90%
Precision = TP/(TP+FP) = 45/(45+10) = 81.8%
Recall = TP/(TP+FN) = 45/(45+5) = 90%
F1 = 2×(0.818×0.90)/(0.818+0.90) = 85.7%
Complete Cheat Sheet — All Metrics
Accuracy: Correct/Total. Good for balanced classes. Misleading for imbalanced data.
Precision: TP/(TP+FP). "Of my positive predictions, how many were right?" Use when FP is costly.
Recall: TP/(TP+FN). "Of actual positives, how many did I find?" Use when FN is costly.
F1 Score: Harmonic mean of precision and recall. Use for imbalanced classes or when both precision and recall matter.
BLEU: N-gram precision. Used for translation. BLEU-1 to BLEU-4. Higher = better. Max=1.0.
ROUGE: N-gram recall. Used for summarization. ROUGE-L uses longest common subsequence.
Perplexity: How "surprised" the LM is. Lower = better. PP=10 means ~10 choices per token on average.
RAGAS: RAG evaluation: Faithfulness + Answer Relevancy + Context Precision + Context Recall.
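The "PP=10 means ~10 choices per token" reading of perplexity follows directly from its definition as the exponential of the average negative log-probability. A minimal sketch, where the probability lists are made-up values standing in for what a language model assigns to each actual next token:

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the observed tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Assumption: a model that spreads probability evenly over 10 candidates
# gives every observed token p = 0.1.
uniform = perplexity([0.1] * 20)

# Assumption: a more confident model on a four-token sequence.
confident = perplexity([0.9, 0.8, 0.95, 0.7])

print(round(uniform, 2), round(confident, 2))
```

The uniform case comes out at 10, matching the "10 equally likely choices" intuition, while the confident model lands close to 1, the lower bound for perplexity.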
And why is subword tokenization used?
With explanation of each part
Core architectural difference
And its 4 main steps
And which models use it?
Formula + when to use it
3 stages
What does PP=10 mean?
What problem does BERT solve?
In attention mechanism
When does accuracy lie?
What problem does it solve?