NLP for AI Engineers — Complete Roadmap Notes

1.1

What is NLP?

🔴

1. What Problem Does NLP Solve?

Computers only understand numbers — 0s and 1s. But humans communicate through language — a complex system of words, sentences, grammar, tone, context, and meaning. The fundamental problem NLP solves is this: How do we bridge the gap between human language (text and speech) and computers that only process numbers?

Without NLP, a computer cannot understand what you write in an email, what you ask in a search box, or what you say to a voice assistant. NLP gives computers the ability to read, understand, generate, and interact in human language.

💡

2. Why Was NLP Invented?

In the early days of computing (1940s–1960s), scientists dreamed of building machines that could communicate in human language. The original motivation was Machine Translation — translating Russian scientific documents to English during the Cold War. The US government funded massive research into getting computers to "understand" text.

As the internet grew (1990s–2000s), humans created vast amounts of text data (emails, websites, documents). Organizations needed automated ways to process, search, and analyze this text. That's when NLP became an industrial necessity, not just academic curiosity.

📜

3. Historical Background

1950 — Alan Turing's Test: Turing proposed the famous "Turing Test" — if a computer can convince a human it's also human through conversation, it's "intelligent".
1957 — Chomsky's Grammar: Noam Chomsky showed that language has formal, hierarchical structure — this shaped how early NLP was built (rule-based systems).
1966 — ELIZA: First chatbot created at MIT by Joseph Weizenbaum, simulating a psychotherapist using pattern matching.
1980s–1990s — Statistical NLP: Shift from hand-written rules to learning from data. Hidden Markov Models, probabilistic methods emerged.
2001 — Neural Language Models: Bengio et al. showed neural networks could learn word representations.
2013 — Word2Vec: Google's Tomas Mikolov created word embeddings — words as vectors in space. Massive breakthrough.
2017 — Transformer Architecture: Google Brain published "Attention Is All You Need" — changed everything.
2018 — BERT & GPT: The era of Large Language Models begins.
2022+ — ChatGPT, Claude, Llama: LLMs become consumer products.

🎯

4. Real-World Analogy

Analogy

Think of NLP like a Universal Translator from Star Trek. When Captain Kirk speaks English, the device instantly translates it for alien species. Similarly, NLP translates human language (which computers don't naturally understand) into a form computers can process (numbers, vectors, patterns), and then back into human language for the response.

👦

5. Explain Like I'm 10

ELI10

You know how your calculator can do math — add 2+3 and get 5? But if you wrote "two plus three" in words, the calculator would get confused and show an error. NLP is like giving computers a superpower so they can understand words, just like how you understand your teacher's instructions! It's the technology that makes Siri, Google Translate, and ChatGPT understand what you're saying.

🎓

6. Explain Like a College Student

NLP is the subfield of Artificial Intelligence that deals with the interaction between computers and human (natural) language. It combines linguistics (study of language), computer science (algorithms and data structures), and machine learning (learning patterns from data).

Modern NLP is mostly data-driven: instead of writing rules like "if the word 'not' appears before an adjective, flip its sentiment", we feed millions of examples to statistical or neural models and let them discover the patterns themselves. This data-driven approach, built on supervised and self-supervised learning, has proven far more powerful than hand-crafted rules.

⚙️

7. Explain Like an AI Engineer

NLP is the set of techniques for processing, analyzing, and generating human language data. Modern NLP pipelines typically involve: tokenization → embedding → encoding → task head.

Before 2017, this typically meant bag-of-words or TF-IDF features fed to logistic regression or SVMs, or word embeddings fed to LSTMs. Since 2017, it has meant transformer-based models (BERT family for understanding, GPT family for generation) fine-tuned on task-specific data. Production NLP systems now often use foundation models (GPT-4, Claude, Llama) with prompt engineering or fine-tuning, plus RAG (retrieval-augmented generation) for knowledge-intensive tasks.

📖

8. Terminology Breakdown

Term	Simple Meaning
NLP	Natural Language Processing — teaching computers to understand and generate human text
Corpus	A large collection of text used for training (plural: corpora)
Token	A basic unit of text — usually a word or sub-word piece
Model	A mathematical system that has "learned" patterns from data
Training	The process of showing a model examples so it can learn patterns
Inference	Using a trained model to make predictions on new data
LLM	Large Language Model — a very large model trained on massive text data
Pipeline	A series of steps that text goes through for processing

🗺️

11. Visual: NLP vs Traditional Programming

TRADITIONAL PROGRAMMING:
┌─────────────┐    ┌───────────────┐    ┌──────────┐
│    Input    │───▶│  RULES (you   │───▶│  Output  │
│  "Hello!"   │    │   write them) │    │ (result) │
└─────────────┘    └───────────────┘    └──────────┘

NLP / MACHINE LEARNING:
┌─────────────┐    ┌───────────────┐    ┌──────────┐
│    Input    │───▶│  MODEL learns │───▶│  Output  │
│ + Examples  │    │  rules itself │    │ (result) │
└─────────────┘    └───────────────┘    └──────────┘

NLP vs LLM:
┌──────────────────────────────────────────────────────┐
│ NLP (broad field)                                    │
│  ┌──────────────────────────────────────────────┐   │
│  │ Classical NLP: Rules, TF-IDF, SVM, LSTM      │   │
│  └──────────────────────────────────────────────┘   │
│  ┌──────────────────────────────────────────────┐   │
│  │ Modern NLP: Transformers, BERT, GPT, LLMs    │   │  ◀── This is what we focus on
│  └──────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────┘

🌍

Real-World Applications

Search Engines

Google uses NLP to understand your query and match it to relevant pages — even if you don't use exact keywords.

Chatbots & Assistants

Siri, Alexa, ChatGPT, Claude — all use NLP to understand questions and generate appropriate responses.

Machine Translation

Google Translate converts text between 100+ languages using neural NLP models.

Sentiment Analysis

Analyzing product reviews, social media posts, or customer feedback for positive/negative/neutral sentiment.

Spam Detection

Gmail's spam filter uses NLP to classify emails as spam or not-spam based on content patterns.

Code Completion

GitHub Copilot uses NLP trained on code to suggest code completions and write functions from comments.

🎙️

Interview Questions & Answers

Q

What is the difference between NLP and LLMs?

NLP (Natural Language Processing) is the broad field of teaching computers to understand and work with human language. LLMs (Large Language Models) are a specific, modern type of NLP model that uses the Transformer architecture and is trained on massive amounts of text data. All LLMs are NLP models, but NLP also includes many older techniques (TF-IDF, SVMs, word2vec) that are not LLMs. Think of NLP as the entire science, and LLMs as the most powerful current tool within that science.

Q

Why is NLP hard for computers?

Language is ambiguous, context-dependent, and full of nuance. The same word can mean different things ("bank" = financial institution OR river bank). Sarcasm conveys the opposite of what is literally said. Pronouns like "it" refer to things mentioned earlier. Long documents require tracking information across many sentences. Cultural context and world knowledge are often required to understand even simple statements. These challenges compound: a single sentence can involve several of them at once.

📝

Practice Questions

Easy Name 3 real-world applications of NLP you use daily.

1. Google Search — understanding your search query to return relevant results. 2. Autocorrect/Autocomplete on your phone — predicting your next word. 3. Email spam filter — classifying emails as spam or not. Others: voice assistants (Siri, Alexa), translation apps, customer service chatbots.

Medium Explain why "I saw the man with the telescope" is ambiguous, and why this makes NLP hard.

This sentence has two valid interpretations: (1) "I used a telescope to see the man" or (2) "I saw a man who was holding a telescope". Humans use context and world knowledge to determine which is more likely. But computers need to explicitly resolve this ambiguity — called "structural ambiguity" or "PP-attachment ambiguity". This is one of many reasons NLP is hard: the same string of tokens can have multiple valid parse trees, and choosing the right one requires real-world common sense.

1.2

The NLP Pipeline

🔴

What Is A Pipeline?

A pipeline is a series of steps where the output of one step becomes the input of the next — like an assembly line in a factory. In NLP, raw text enters the pipeline and a useful prediction or generated output comes out the other end.

🗺️

The 5-Stage NLP Pipeline

Raw Text Input
     │
     ▼
┌─────────────────────────────────────────────────────┐
│  STAGE 1: TEXT INPUT                                │
│  "The cat sat on the mat. It was happy."            │
└─────────────────────────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────────────────────────┐
│  STAGE 2: PREPROCESSING                             │
│  • Cleaning: remove HTML tags, punctuation noise    │
│  • Normalization: lowercase, fix typos              │
│  • Tokenization: split into ["The","cat","sat",...] │
│  • Stop word removal (optional)                     │
│  • Stemming/Lemmatization (optional)                │
└─────────────────────────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────────────────────────┐
│  STAGE 3: REPRESENTATION                            │
│  Convert tokens → numbers computers can process     │
│  • One-Hot Encoding  → [0,0,1,0,0,...,0]           │
│  • Bag of Words      → [2,1,0,3,...]               │
│  • Word Embeddings   → [0.23,-0.45,0.12,...]       │
│  • Transformer Enc.  → contextual vector            │
└─────────────────────────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────────────────────────┐
│  STAGE 4: MODELING                                  │
│  The mathematical model processes representations   │
│  • Classic: Logistic Regression, SVM, Naive Bayes  │
│  • Neural:  LSTM, CNN for text                     │
│  • Modern:  BERT, GPT, T5, LLaMA                  │
└─────────────────────────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────────────────────────┐
│  STAGE 5: OUTPUT                                    │
│  • Classification: "POSITIVE" / "NEGATIVE"          │
│  • Generation: "The quick brown fox..."             │
│  • Extraction: [("Apple", ORG), ("Tim Cook", PER)] │
│  • Translation: "El gato se sentó en la alfombra"  │
└─────────────────────────────────────────────────────┘
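
As a rough illustration of Stage 3, the first two representations can be computed by hand. This is a minimal sketch: the helper names `one_hot` and `bag_of_words` and the five-word toy vocabulary are assumptions made for the example, not part of any library.

```python
# Toy vocabulary (hypothetical): token -> index
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

def one_hot(token: str, vocab: dict) -> list:
    """Vector of zeros with a single 1 at the token's index."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec

def bag_of_words(tokens: list, vocab: dict) -> list:
    """Count how often each vocabulary word occurs; word order is discarded."""
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec

tokens = ["the", "cat", "sat", "on", "the", "mat"]
print(one_hot("sat", vocab))        # [0, 0, 1, 0, 0]
print(bag_of_words(tokens, vocab))  # [2, 1, 1, 1, 1]
```

Both vectors have one slot per vocabulary entry, which is why these representations become very wide and sparse for realistic vocabularies.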

🎯

Step-by-Step Worked Example

Task: Sentiment analysis on the sentence: "The food was absolutely terrible!"

1
Input
Raw text: "The food was absolutely terrible!"
2
Preprocessing
Lowercase → "the food was absolutely terrible!" → Strip punctuation and tokenize → ["the", "food", "was", "absolutely", "terrible"]
3
Representation
Convert each token to a number vector. "terrible" → [-0.85, 0.12, -0.67, ...] (a vector pointing toward "negative" words)
4
Modeling
Model processes the vectors, recognizes the "absolutely terrible" pattern it learned during training → assigns high probability to NEGATIVE class
5
Output
Classification: "NEGATIVE" (with 96% confidence)
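
The five steps above can be sketched end to end with a toy classifier. This is only an illustration under loud assumptions: the `WEIGHTS` table and the `INTENSIFIERS` set are hand-written stand-ins for what a trained model would learn, and the resulting probability is not meant to match the 96% figure in the example.

```python
import math
import string

# Hypothetical hand-set weights standing in for a trained model
WEIGHTS = {"terrible": -3.0, "awful": -2.5, "great": 2.5, "delicious": 3.0}
INTENSIFIERS = {"absolutely", "very", "really"}

def preprocess(text: str) -> list:
    """Stage 2: lowercase, strip punctuation, split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def score(tokens: list) -> float:
    """Stages 3-4: look up a weight per token; an intensifier doubles the next weight."""
    total, boost = 0.0, 1.0
    for tok in tokens:
        if tok in INTENSIFIERS:
            boost = 2.0
            continue
        total += boost * WEIGHTS.get(tok, 0.0)
        boost = 1.0
    return total

def classify(text: str) -> tuple:
    """Stage 5: squash the score into a probability and pick a label."""
    p_positive = 1.0 / (1.0 + math.exp(-score(preprocess(text))))
    label = "POSITIVE" if p_positive >= 0.5 else "NEGATIVE"
    return label, p_positive

tokens = preprocess("The food was absolutely terrible!")
print(tokens)  # ['the', 'food', 'was', 'absolutely', 'terrible']
label, p = classify("The food was absolutely terrible!")
print(label)   # NEGATIVE
```

A real system replaces the lookup table with learned embeddings and a trained model, but the data flow (text, tokens, numbers, score, label) is the same.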

🎙️

Common Misconceptions

❌ Misconception

"NLP models understand language like humans do."

False. Models learn statistical patterns over text — they've seen "terrible" co-occur with negative reviews millions of times, so they associate the word with negativity. They don't "understand" the word the way you do. They have no concept of food, taste, or emotions. This distinction matters when models fail in unexpected ways (they can be fooled by novel phrasing they've never seen).

1.3

NLP Challenges

THE 7 CORE NLP CHALLENGES
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. AMBIGUITY
   "I shot the elephant in my pajamas"
    ├─ I was wearing pajamas while shooting → ✓ likely
    └─ The elephant was in my pajamas      → ✓ grammatically valid

2. SYNONYMS
   "car", "automobile", "vehicle", "wheels" all mean ≈ same
   Model must know these are semantically related

3. POLYSEMY (one word, many meanings)
   "bank" → financial institution
   "bank" → river bank
   "bank" → to bank on something (rely on)

4. CONTEXT DEPENDENCY
   "it" in: "The trophy didn't fit in the suitcase because it was too big"
   WHAT is "it"?  → trophy (too big to fit)
   "it" in: "The trophy didn't fit in the suitcase because it was too small"
   WHAT is "it"?  → suitcase (too small to hold trophy)

5. SARCASM / IRONY
   "Oh great, another Monday!" → NEGATIVE sentiment
   "Oh great, the presentation went perfectly!" → POSITIVE sentiment
   Same words, opposite meaning based on tone/context

6. LONG DOCUMENTS
   A 100-page contract — the model must track information
   mentioned on page 2 when answering a question on page 87

7. WORLD KNOWLEDGE
   "Paris is the capital of..." requires world knowledge
   "The bank approved the loan after it verified the..."
   requires knowing banks verify documents before approving loans

⚠️ Why This Matters for LLMs

All these challenges are exactly what modern LLMs like GPT-4 and Claude are designed to handle better. The Transformer's attention mechanism (Module 8) specifically addresses the context dependency and long document problems. This is why understanding NLP challenges first makes the "why" of transformer design obvious.

📝

Worked Examples for Each Challenge

Challenge	Example	Why It's Hard
Ambiguity	"I saw her duck"	Did she dodge (duck = verb) or was there a duck (noun)?
Synonyms	"The dog barked" vs "The canine made noise"	Model must know dog=canine, barked≈made noise
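Polysemy	"She sat by the bank"	Financial institution or river bank? Only the surrounding context decides
Context (pragmatics)	"Can you pass the salt?"	Literally a question about ability, but actually a polite request
Context	"He gave John his book"	Whose book? The giver's or John's?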
Sarcasm	"What a wonderful day!" (said in a storm)	Requires knowing it's stormy to detect sarcasm
Long docs	Legal contract: Clause 1 defines a term used in Clause 47	Must track information across thousands of tokens

2.1–2.4

Text as Data, Corpus, Vocabulary & OOV

👦

ELI10: What is Text as Data?

Simple Analogy

Imagine you have a big box of LEGO bricks. Individual bricks are like characters. Groups of bricks that form a shape are like words. A complete LEGO scene is like a sentence. Your entire LEGO collection is like a document. A library of instruction manuals is like a corpus. The total set of all unique brick types you own is your vocabulary.

🗺️

Text Granularity Levels

TEXT: "NLP is fun!"

LEVEL 1 — CHARACTERS (smallest unit):
  ['N', 'L', 'P', ' ', 'i', 's', ' ', 'f', 'u', 'n', '!']
  • Pro: No unknown characters (all text = chars we know)
  • Con: Very long sequences, loses word meaning

LEVEL 2 — WORDS (most common):
  ['NLP', 'is', 'fun']
  • Pro: Preserves word meaning, natural unit
  • Con: Huge vocabulary, rare/new words cause problems

LEVEL 3 — SENTENCES:
  ["NLP is fun!", "I love learning."]
  • Used for document retrieval, semantic search

LEVEL 4 — DOCUMENTS:
  Complete Wikipedia article, entire book chapter
  • Used in document classification, summarization

CORPUS → A collection of documents:
  Wikipedia = 6 million+ English documents = one corpus
  Common Crawl = 80+ billion web pages = one (huge) corpus
  GPT-3's filtered Common Crawl portion alone was ~570GB of text
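
The first three granularity levels can be approximated with a few lines of standard-library Python. This is a rough sketch: the two regular expressions are simplifying assumptions and will mis-split abbreviations such as "Dr." or hyphenated words.

```python
import re

text = "NLP is fun! I love learning."

# Level 1: characters (spaces and punctuation included)
chars = list(text)

# Level 2: words (runs of letters, digits, or apostrophes; punctuation dropped)
words = re.findall(r"[A-Za-z0-9']+", text)

# Level 3: sentences (naive split after ., ! or ? followed by whitespace)
sentences = re.split(r"(?<=[.!?])\s+", text)

print(len(chars))  # 28
print(words)       # ['NLP', 'is', 'fun', 'I', 'love', 'learning']
print(sentences)   # ['NLP is fun!', 'I love learning.']
```

The same 28 characters become 6 words or 2 sentences, which is the sequence-length trade-off the levels describe.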

🔴

What is a Vocabulary?

The vocabulary (often written as V) is the set of all unique words (or tokens) that a model knows. Think of it as the model's "dictionary".

When you train a model, you first scan all the training text and collect every unique word. That becomes your vocabulary. The size of the vocabulary — called |V| — is important:

Small vocabulary (e.g., 10,000 words) → Many words labeled "unknown"
Large vocabulary (e.g., 100,000 words) → Better coverage but uses more memory
GPT-4's tokenizer vocabulary: ~100,000 tokens (not words, but sub-word pieces)

Example corpus: 3 sentences
  "I love cats"
  "I love dogs"
  "cats and dogs"

Vocabulary = {I, love, cats, dogs, and}  → |V| = 5

Each word gets an index:
  I     → 0
  love  → 1
  cats  → 2
  dogs  → 3
  and   → 4

"I love cats" → [0, 1, 2]  ← Computer can now process this!

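Building that vocabulary and encoding a sentence can be sketched as follows. The first-seen ordering is an assumption chosen so the indices match the example; real tokenizers often order by frequency instead, and the `encode` helper is hypothetical.

```python
corpus = ["I love cats", "I love dogs", "cats and dogs"]

# Assign each new word the next free index, in first-seen order
vocab = {}
for sentence in corpus:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab)

def encode(sentence: str, vocab: dict) -> list:
    """Replace each whitespace-separated word with its vocabulary index."""
    return [vocab[word] for word in sentence.split()]

print(vocab)                         # {'I': 0, 'love': 1, 'cats': 2, 'dogs': 3, 'and': 4}
print(len(vocab))                    # 5
print(encode("I love cats", vocab))  # [0, 1, 2]
```

Note that `encode` raises a `KeyError` for any word that was not in the corpus.
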
🔴

OOV — Out of Vocabulary Problem

OOV stands for "Out Of Vocabulary" — it refers to words that appear at inference (test) time that the model has never seen during training.

Why is OOV a problem? Imagine you built a model and your vocabulary is {cat, dog, bird}. Now someone inputs the word "ferret" — your model has no representation for it! It's like asking someone to recognize a face they've never seen before.

TRAINING VOCABULARY: {cat, dog, bird, run, jump}

INPUT AT INFERENCE TIME: "The ferret jumped over the fence"

Word check:
  "The"    → OOV! Not in vocabulary
  "ferret" → OOV! Not in vocabulary
  "jumped" → OOV! "jump" is there but "jumped" (past tense) is not!
  "over"   → OOV!
  "the"    → OOV! (lowercase "the" different from "The")
  "fence"  → OOV!

RESULT: Model maps all these to [UNK] token → loses meaning!

SOLUTION PREVIEW → Subword Tokenization (Module 3)
  "ferret" → "fer" + "ret" (sub-pieces it knows!)
  "jumped" → "jump" + "##ed" (root + suffix)

⚠️ Key Insight

The OOV problem is precisely WHY modern tokenizers like BPE and WordPiece (used by GPT and BERT) use subword tokenization. By breaking rare words into smaller pieces, every word — even completely new words — can be represented. This is one of the most important innovations in modern NLP.
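
To make the OOV problem concrete, the word check from the example can be reproduced with a word-level vocabulary and an `[UNK]` fallback. This is a minimal sketch; the `encode_with_unk` helper and the ID assignments are assumptions made for illustration.

```python
UNK = "[UNK]"
vocab = {UNK: 0, "cat": 1, "dog": 2, "bird": 3, "run": 4, "jump": 5}

def encode_with_unk(sentence: str, vocab: dict) -> list:
    """Map each whitespace-separated word to its ID, or to the [UNK] ID if missing."""
    return [vocab.get(word, vocab[UNK]) for word in sentence.split()]

ids = encode_with_unk("The ferret jumped over the fence", vocab)
oov_rate = ids.count(vocab[UNK]) / len(ids)

print(ids)                                # [0, 0, 0, 0, 0, 0]
print(oov_rate)                           # 1.0
print(encode_with_unk("dog run", vocab))  # [2, 4]
```

Every word of the test sentence collapses to the same ID, so the model receives no usable signal from it.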

🎙️

Interview Q&A

Q

What is the vocabulary size of BERT and GPT-2?

BERT uses WordPiece tokenization with a vocabulary of ~30,000 tokens. GPT-2 uses BPE (Byte Pair Encoding) with a vocabulary of 50,257 tokens. GPT-4 uses a larger BPE vocabulary of approximately 100,256 tokens. Note: these are not "words" — they're sub-word pieces. For example, "unbelievable" might be tokenized as ["un", "believ", "able"] using 3 tokens from the vocabulary.

Q

What is the difference between vocabulary and corpus?

A corpus is the entire collection of text data used for training — it can contain billions of sentences with many repeated words. The vocabulary is the set of UNIQUE words/tokens extracted from the corpus. For example, the word "the" appears billions of times in a corpus, but it appears only ONCE in the vocabulary. Corpus = all the data. Vocabulary = unique tokens dictionary.

3.1–3.9

Complete Tokenization Guide: BPE, WordPiece, SentencePiece

🔴

Why Models Cannot Read Text Directly

Core Problem

Neural networks are mathematical functions. They take numbers as input and produce numbers as output. A sentence like "I love NLP" is a string of characters — NOT numbers. Tokenization is the process of converting this string into a sequence of integer IDs that the model can actually process.

Raw Text: "I love NLP"
             │
             ▼
        Tokenizer
             │
             ▼
Tokens:  ["I", "love", "NLP"]
             │
             ▼
Token IDs: [40, 1842, 27207]   ← illustrative IDs; real values depend on the tokenizer's vocabulary
             │
             ▼
Model processes these integers → generates output integers
             │
             ▼
Decode:   "It is fascinating"   ← Convert IDs back to text

📖

3 Types of Tokenization — Comparison

INPUT TEXT: "unhappiness"

1. WORD TOKENIZATION
   → ["unhappiness"]
   ✓ Simple, preserves words
   ✗ "unhappiness" might be OOV if not in vocabulary!
   ✗ Vocabulary can have 1M+ words

2. CHARACTER TOKENIZATION
   → ["u","n","h","a","p","p","i","n","e","s","s"]
   ✓ Never OOV: every text is built from the same small character set
   ✓ Very small vocabulary (a few hundred characters, or 256 byte values)
   ✗ Very long sequences (3x–10x longer than words)
   ✗ Each character alone has little meaning

3. SUBWORD TOKENIZATION (BPE / WordPiece)
   → ["un", "##happiness"]  ← WordPiece style
   → ["un", "happ", "iness"] ← BPE style
   ✓ Balances vocabulary size and sequence length
   ✓ Rare words decomposed into known sub-pieces
   ✓ Common words kept whole: "the" → ["the"]
   ✓ Never truly OOV (worst case: character level)
   ← USED BY ALL MODERN LLMs!

🎓

BPE — Byte Pair Encoding (Used by GPT)

BPE was originally a data compression algorithm, adapted for NLP by Sennrich et al. in 2016. The key insight is: instead of having a fixed word vocabulary, start with individual characters and iteratively merge the most frequent adjacent pairs into new tokens.

1
Start with character vocabulary
Every unique character in your training data = initial vocabulary. For English: a-z, A-Z, 0-9, punctuation ≈ 100 symbols. Byte-level BPE (used by GPT-2) instead starts from all 256 byte values.
2
Count all adjacent character pairs
In "the cat sat", count: ('t','h')=1, ('h','e')=1, ('c','a')=1, ('a','t')=2, ('s','a')=1 → ('a','t') most frequent!
3
Merge the most frequent pair
'a'+'t' → 'at'. Now vocabulary includes 'at' as a token. Update all occurrences in text.
4
Repeat until vocabulary size reached
Keep merging the most frequent pairs. GPT-2 learns 50,000 merges on top of 256 byte tokens, plus one special token, to reach 50,257 tokens.

BPE STEP-BY-STEP WORKED EXAMPLE

Corpus: "aab aac ab ac"
Initial vocabulary: {a, b, c}
(Pairs are counted inside words only, so pairs involving a space are never candidates.)

STEP 1: Count pair frequencies
  (a,a) = 2   ("aab", "aac")
  (a,b) = 2   ("aab", "ab")
  (a,c) = 2   ("aac", "ac")
  Three-way tie: any fixed tie-break rule works. Here we pick (a,b).

Merge (a,b) → "ab":
After merge: "a[ab] aac [ab] ac"
vocab += "ab"

STEP 2: Count again in updated corpus:
  "a","[ab]"   → (a,ab)=1
  "a","a","c"  → (a,a)=1, (a,c)=1
  "[ab]"       → single token, no pairs
  "a","c"      → (a,c)=1
  Totals: (a,c)=2  ← most frequent, (a,a)=1, (a,ab)=1

Merge (a,c) → "ac":
After merge: "a[ab] a[ac] [ab] [ac]"
vocab += "ac"

CONTINUE until target vocab size reached...

RESULT: Frequent sequences become single tokens
  "the" → one token (very common)
  "ing" → one token (common suffix)
  "xyzzy" → "x" + "y" + "z" + "z" + "y" (rare = char-level)

🎓

WordPiece (Used by BERT)

WordPiece is similar to BPE but uses a different merge criterion: instead of raw frequency, it picks the merge that most increases the likelihood of the training data given the vocabulary. This favors pairs that occur together more often than their individual frequencies would predict.

Key difference: WordPiece marks continuation pieces with ##. So "playing" might tokenize as ["play", "##ing"] where ##ing means "this piece continues a word from the previous token".

WordPiece tokenization examples (BERT-style, illustrative; exact splits depend on the trained vocabulary):

"playing"    → ["play", "##ing"]
"unbelievable" → ["un", "##believ", "##able"]
"ChatGPT"    → ["Chat", "##GP", "##T"]
"COVID-19"   → ["CO", "##VID", "-", "19"]
"hello"      → ["hello"]          ← common word, whole token
"the"        → ["the"]            ← very common, whole token

The ## prefix tells the model: "I am a continuation of the 
previous token, not the start of a new word"
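
At inference time, BERT-style tokenizers typically split each word greedily, always taking the longest piece found in the vocabulary. The sketch below shows that idea under stated assumptions: `toy_vocab` is a hypothetical seven-entry vocabulary and `wordpiece_tokenize` is an illustrative helper, not the library implementation.

```python
def wordpiece_tokenize(word: str, vocab: set, unk: str = "[UNK]") -> list:
    """Greedy longest-match-first split of a single word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no known piece fits: the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary
toy_vocab = {"play", "##ing", "un", "##believ", "##able", "hello", "the"}

print(wordpiece_tokenize("playing", toy_vocab))       # ['play', '##ing']
print(wordpiece_tokenize("unbelievable", toy_vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("hello", toy_vocab))         # ['hello']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

A real BERT vocabulary also contains single characters and their `##` variants, so the `[UNK]` fallback is rarely reached in practice.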

🎓

SentencePiece (Used by T5, LLaMA, Mistral)

SentencePiece, developed by Google, solves a key problem: BPE and WordPiece require pre-tokenization (splitting text into words first using spaces), which is language-specific. Chinese, Japanese, Thai don't use spaces between words!

SentencePiece treats the input as a raw character stream (including spaces). It uses ▁ (a special underscore character) to mark word boundaries. This makes it language-independent.

SentencePiece treats spaces as characters:
  Input: "Hello world"
  Tokens: ["▁Hello", "▁world"]
         ↑ underscore marks start of word

Works for ANY language because no pre-tokenization needed:
  Chinese: "我爱NLP" → ["▁我", "爱", "NL", "P"]
  Japanese: "自然言語処理" → ["▁自然", "言語", "処理"]

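LLaMA 1 and 2 use SentencePiece with vocabulary size 32,000 (LLaMA 3 switched to a tiktoken-style BPE with ~128,000 tokens)
Mistral 7B uses SentencePiece with vocabulary size 32,000
T5 uses SentencePiece with 32,000 pieces plus 100 sentinel tokens (the model's embedding table is sized 32,128)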
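
The whitespace handling is the part that makes SentencePiece reversible, and it can be sketched without the library. This is only an illustration: the helper names are hypothetical, the piece list is a made-up segmentation, and the real library additionally learns the segmentation itself with BPE or a unigram model.

```python
MARK = "\u2581"  # "▁", the symbol SentencePiece uses in place of a space

def mark_spaces(text: str) -> str:
    """Turn spaces into ordinary symbols so no separate pre-tokenization is needed."""
    return MARK + text.replace(" ", MARK)

def detokenize(pieces: list) -> str:
    """Concatenate pieces and turn markers back into spaces (lossless round trip)."""
    text = "".join(pieces).replace(MARK, " ")
    return text[1:] if text.startswith(" ") else text

marked = mark_spaces("Hello world")
print(marked)  # ▁Hello▁world

pieces = ["\u2581Hello", "\u2581wor", "ld"]  # hypothetical segmentation
print(detokenize(pieces))  # Hello world
```

Because the space is stored inside the pieces, decoding is plain concatenation and needs no language-specific rules.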

📖

Special Tokens — Critical for Understanding LLMs

Special tokens are reserved tokens with specific roles in the pipeline. They are NOT normal vocabulary words — they signal structural information to the model.

Token	Full Name	Used By	Purpose
[PAD]	Padding Token	BERT, most models	Used to make sequences the same length in a batch. An attention mask tells the model to ignore padded positions.
[UNK]	Unknown Token	Older models, BERT	Replaces tokens not in vocabulary. Less needed now with subword tokenization.
[CLS]	Classification Token	BERT	Prepended to every sequence. BERT learns to put the meaning of the whole sentence in this token's representation — used for classification tasks.
[SEP]	Separator Token	BERT	Separates two sentences in a pair (e.g., question vs context in QA). Also marks end of sequence.
<BOS>	Beginning of Sequence	GPT, LLaMA	Signals to the model that a new sequence is starting. GPT uses <|endoftext|> for this.
<EOS>	End of Sequence	GPT, LLaMA, all models	Signals that the model should stop generating. Critical for knowing when to stop during inference.

BERT INPUT FORMAT:
[CLS] sentence_A [SEP] sentence_B [SEP] [PAD] [PAD]
  ↑                 ↑                ↑     ↑     ↑
classification    separator      separator padding padding

Example:
"Is the cat cute?" → "Yes it is"

[CLS] Is the cat cute ? [SEP] Yes it is [SEP] [PAD] [PAD]

GPT INPUT FORMAT (no [CLS] or [SEP] needed):
<BOS> The cat sat on the mat <EOS>
  ↑                            ↑
start                         stop
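
The BERT layout above can be assembled by hand to see what each special token contributes. This is a minimal sketch assuming pre-tokenized input; `build_bert_input` is a hypothetical helper, and in practice the tokenizer produces these lists for you.

```python
def build_bert_input(tokens_a: list, tokens_b: list, max_len: int) -> tuple:
    """Return (tokens, segment_ids, attention_mask) for a BERT sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS] + sentence A + first [SEP]; segment 1 covers B + last [SEP]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)  # 1 = real token

    pad = max_len - len(tokens)
    if pad < 0:
        raise ValueError("sequence is longer than max_len")
    tokens += ["[PAD]"] * pad
    segment_ids += [0] * pad
    attention_mask += [0] * pad  # 0 = padding, ignored by attention
    return tokens, segment_ids, attention_mask

a = ["is", "the", "cat", "cute", "?"]
b = ["yes", "it", "is"]
tokens, segment_ids, attention_mask = build_bert_input(a, b, max_len=13)

print(tokens)
print(segment_ids)     # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

The attention mask, not the `[PAD]` token itself, is what keeps padded positions from influencing the result.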

📖

Context Window & Token Limits

Every LLM has a context window — the maximum number of tokens it can process at once (both input and output). This is not the same as words! Due to subword tokenization:

1 word ≈ 1.3–1.5 tokens on average for English
Code and rare words tokenize into more pieces
A 4,096-token context ≈ 3,000 words ≈ 12 pages of text (at ~250 words per page)

GPT-3.5-turbo

16,384 tokens ≈ 12,000 words ≈ 48 pages

GPT-4o

128,000 tokens ≈ 96,000 words ≈ 380 pages

Claude 3.5

200,000 tokens ≈ 150,000 words ≈ 600 pages

Gemini 1.5 Pro

1,000,000 tokens ≈ 750,000 words ≈ 3,000 pages

LLaMA 3.1

128,000 tokens context window

DeepSeek V3

128,000 tokens context window
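
A quick budget check can be built from the rule of thumb above. This is a rough sketch: the 1.3 tokens-per-word ratio is the lower end of the estimate in these notes, both helper names are hypothetical, and an exact count requires running the model's actual tokenizer.

```python
# Rule of thumb from these notes: ~1.3 tokens per English word.
# Stored as tokens per 10 words to keep the arithmetic in integers.
TOKENS_PER_10_WORDS = 13

def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count, rounded up."""
    n_words = len(text.split())
    return (n_words * TOKENS_PER_10_WORDS + 9) // 10

def fits_in_context(text: str, context_window: int, reserved_for_output: int = 0) -> bool:
    """True if the estimated prompt plus the reserved output budget fits the window."""
    return estimate_tokens(text) + reserved_for_output <= context_window

doc = "word " * 3000  # a 3,000-word stand-in document

print(estimate_tokens(doc))                                 # 3900
print(fits_in_context(doc, 4096))                           # True
print(fits_in_context(doc, 4096, reserved_for_output=500))  # False
```

Reserving room for the output matters because the context window covers input and generated tokens together.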

💻

Python Implementation — BPE from Scratch

Python

# ============================================================
# BPE (Byte Pair Encoding) Tokenizer - Built from Scratch
# ============================================================

from collections import Counter, defaultdict

class SimpleBPE:
    """
    A minimal BPE tokenizer to understand the core algorithm.
    NOT optimized for production - purely educational.
    """
    
    def __init__(self, vocab_size: int = 300):
        # Target vocabulary size (initial chars + merged pairs)
        self.vocab_size = vocab_size
        self.merges = {}      # stores all merge operations
        self.vocab = set()    # our complete vocabulary
    
    def get_vocab(self, corpus: list) -> dict:
        """
        Convert corpus to word-frequency dict where each word
        is represented as a tuple of characters + end marker.
        
        'hello' with freq 3 → ('h','e','l','l','o','</w>') : 3
        '</w>' marks end of word (helps track word boundaries)
        """
        vocab = Counter()
        for sentence in corpus:
            for word in sentence.split():
                # Convert word to tuple of chars + end marker
                char_tuple = tuple(word) + ('</w>',)
                vocab[char_tuple] += 1
        return vocab
    
    def get_pair_frequencies(self, vocab: dict) -> dict:
        """
        Count all adjacent pairs across all words in vocab.
        
        Example: ('h','e','l','l','o','</w>') with freq 3
        Pairs counted: (h,e):3, (e,l):3, (l,l):3, (l,o):3, (o,</w>):3
        """
        pairs = defaultdict(int)
        for word_tuple, freq in vocab.items():
            # Look at each adjacent pair of tokens
            for i in range(len(word_tuple) - 1):
                pair = (word_tuple[i], word_tuple[i+1])
                pairs[pair] += freq  # weighted by word frequency!
        return pairs
    
    def merge_pair(self, best_pair: tuple, vocab: dict) -> dict:
        """
        Merge best_pair everywhere in the vocabulary.
        ('h','e') → 'he'  everywhere it appears.
        """
        new_vocab = {}
        left, right = best_pair
        bigram = left + right  # merged token
        
        for word_tuple, freq in vocab.items():
            # Replace each occurrence of (left, right) with bigram
            new_word = []
            i = 0
            while i < len(word_tuple):
                if (i < len(word_tuple)-1 and 
                    word_tuple[i] == left and 
                    word_tuple[i+1] == right):
                    new_word.append(bigram)  # replace pair with merge
                    i += 2                  # skip both tokens
                else:
                    new_word.append(word_tuple[i])
                    i += 1
            new_vocab[tuple(new_word)] = freq
        return new_vocab
    
    def train(self, corpus: list):
        """
        Main training loop: keep merging most frequent pairs
        until we reach our target vocabulary size.
        """
        # Step 1: Build initial character-level vocabulary
        vocab = self.get_vocab(corpus)
        
        # Collect all unique characters (initial vocab)
        initial_tokens = set()
        for word_tuple in vocab.keys():
            initial_tokens.update(word_tuple)
        self.vocab = initial_tokens.copy()
        
        print(f"Initial vocab size: {len(self.vocab)}")
        print(f"Initial tokens: {sorted(self.vocab)}")
        
        # Step 2: Iteratively merge most frequent pairs
        num_merges = self.vocab_size - len(self.vocab)
        
        for i in range(num_merges):
            # Count all adjacent pairs
            pair_freqs = self.get_pair_frequencies(vocab)
            
            if not pair_freqs:
                break  # No more pairs to merge
            
            # Find the most frequent pair
            best_pair = max(pair_freqs, key=pair_freqs.get)
            best_freq = pair_freqs[best_pair]
            
            # Record this merge operation
            merged_token = best_pair[0] + best_pair[1]
            self.merges[best_pair] = merged_token
            self.vocab.add(merged_token)
            
            # Apply the merge to all vocabulary entries
            vocab = self.merge_pair(best_pair, vocab)
            
            if i < 5:  # Print first 5 merges for inspection
                print(f"Merge {i+1}: {best_pair} → '{merged_token}' (freq={best_freq})")
        
        print(f"\nFinal vocab size: {len(self.vocab)}")
        return self

# ── DEMO ──────────────────────────────────────────
# Simple corpus for demonstration
corpus = [
    "low low low low low",
    "lower lower",
    "newest newest newest newest",
    "widest widest",
]

# Walk through the first step by hand: build the word table, then count pairs
bpe = SimpleBPE(vocab_size=20)
vocab = bpe.get_vocab(corpus)
print("Initial word representations:")
for w, f in vocab.items():
    print(f"  {w}: {f}")

pairs = bpe.get_pair_frequencies(vocab)
print("\nTop 5 most frequent pairs:")
for pair, freq in sorted(pairs.items(), key=lambda x: -x[1])[:5]:
    print(f"  {pair}: {freq}")
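
Training only produces a list of merges; tokenizing new text means replaying them in order. The sketch below shows that step in isolation. The `merges` list is hypothetical (similar to what training on the demo corpus might produce, not its actual output), and `bpe_encode` is an illustrative helper rather than part of the class above.

```python
def bpe_encode(word: str, merges: list) -> list:
    """Split a word into characters plus '</w>', then apply merges in learned order."""
    symbols = list(word) + ["</w>"]
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)  # replace the pair with the merged token
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge list, in the order it was learned
merges = [("l", "o"), ("lo", "w"), ("e", "s"), ("es", "t"), ("est", "</w>")]

print(bpe_encode("lowest", merges))  # ['low', 'est</w>']
print(bpe_encode("new", merges))     # ['n', 'e', 'w', '</w>']
```

"lowest" never appears in the demo corpus, yet it decomposes into two known pieces, which is how BPE sidesteps the OOV problem.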

💻

Using Hugging Face Tokenizers

Python

# pip install transformers tokenizers

from transformers import AutoTokenizer

# ── GPT-2 Tokenizer (BPE) ─────────────────────────
gpt_tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Hello, I love Natural Language Processing!"
tokens = gpt_tokenizer.tokenize(text)
token_ids = gpt_tokenizer.encode(text)

print("GPT-2 Tokens:", tokens)
# ['Hello', ',', 'ĠI', 'Ġlove', 'ĠNatural', ...]
# Note: Ġ marks a leading space (GPT-2's BPE folds the space into the next token)

print("Token IDs:", token_ids)
# [15496, 11, 314, 1842, 8823, 15417, ...]

print("Vocab size:", gpt_tokenizer.vocab_size)
# 50257

# Decode back to text
decoded = gpt_tokenizer.decode(token_ids)
print("Decoded:", decoded)
# "Hello, I love Natural Language Processing!"

# ── BERT Tokenizer (WordPiece) ─────────────────────
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens_bert = bert_tokenizer.tokenize(text)
print("\nBERT Tokens:", tokens_bert)
# ['hello', ',', 'i', 'love', 'natural', 'language', 'processing', '!']
# BERT lowercases (bert-base-uncased)

# See special tokens
encoded_bert = bert_tokenizer(text, return_tensors="pt")
print("\nWith special tokens:", bert_tokenizer.convert_ids_to_tokens(
    encoded_bert['input_ids'][0]
))
# ['[CLS]', 'hello', ',', 'i', 'love', ..., '[SEP]']
#    ↑ automatically added!                    ↑

# ── Test OOV handling ─────────────────────────────
oov_text = "supercalifragilisticexpialidocious"
gpt_oov = gpt_tokenizer.tokenize(oov_text)
bert_oov = bert_tokenizer.tokenize(oov_text)

print("\nOOV word tokenization:")
print("GPT-2:", gpt_oov)
# Illustrative only: ['super', 'cali', 'fra', 'gil', 'istic', 'exp', 'iali', 'do', 'cious']
# (the exact pieces depend on the learned merges; run it to see the real split)
print("BERT:", bert_oov)
# Illustrative only: ['super', '##cal', '##if', '##rag', '##ili', '##stic', ...]
# (## marks a continuation piece; the exact split depends on BERT's vocabulary)

🎙️

Interview Questions & Traps

Q

What is the difference between BPE and WordPiece?

Both are subword tokenization algorithms, but they differ in how they decide which pairs to merge:

BPE: Merges the pair with the highest raw frequency count in the corpus. "How often do these two tokens appear adjacent?"

WordPiece: Merges the pair that maximizes the likelihood of the training data. Specifically, it selects the pair (A, B) where freq(AB) / (freq(A) × freq(B)) is maximized. This tends to merge pairs that appear together MORE than chance would predict.

In practice: BPE is used by GPT (all versions), RoBERTa, and most modern LLMs. WordPiece is used by BERT and its variants.
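
The two selection rules can be compared on a toy example. This is a minimal sketch: the counts below are hypothetical, picked only so that the two criteria disagree, and the score follows the freq(AB) / (freq(A) × freq(B)) form given above.

```python
from collections import Counter

# Hypothetical toy counts, chosen only to make the two criteria disagree.
token_freq = Counter({"t": 100, "h": 80, "q": 5, "u": 6})
pair_freq = Counter({("t", "h"): 40, ("q", "u"): 5})

def bpe_score(pair):
    # BPE: raw frequency of the adjacent pair
    return pair_freq[pair]

def wordpiece_score(pair):
    # WordPiece-style score: freq(AB) / (freq(A) * freq(B))
    a, b = pair
    return pair_freq[pair] / (token_freq[a] * token_freq[b])

bpe_choice = max(pair_freq, key=bpe_score)
wordpiece_choice = max(pair_freq, key=wordpiece_score)

print("BPE merges first:      ", bpe_choice)
print("WordPiece merges first:", wordpiece_choice)
```

BPE picks the frequent pair ("t", "h"). The WordPiece-style score picks ("q", "u") because "q" almost never appears without "u", even though the pair itself is rare.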

Q

⚠️ TRAP: "The same word always tokenizes to the same tokens" — Is this true?

FALSE, and it is a common trap. A tokenizer is deterministic for a given input string, but the same word rarely arrives as the same string: leading whitespace, capitalization, and neighboring characters all change the pieces. In GPT tokenizers, "dog" at the very start of a text is encoded as "dog", while " dog" in the middle of a sentence (with its leading space) becomes the single, different token "Ġdog" in GPT-2. The two have different token IDs and different embeddings. This is why the way BPE folds spaces into tokens matters!
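
The leading-space effect can be seen even before any merges are applied, at the pre-tokenization step. The sketch below uses a simplified, ASCII-only stand-in for GPT-2's splitting pattern (the real pattern handles Unicode and runs of whitespace differently), so treat `PRETOKENIZE` and `pre_tokens` as illustrative names, not the library's API.

```python
import re

# Simplified stand-in for GPT-2's pre-tokenization pattern (ASCII only).
PRETOKENIZE = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

def pre_tokens(text):
    # Show the attached leading space as 'Ġ', the way GPT-2 vocabularies display it.
    return [piece.replace(" ", "Ġ") for piece in PRETOKENIZE.findall(text)]

print(pre_tokens("dog"))            # word at the start of the text
print(pre_tokens("my dog"))         # same word after a space
print(pre_tokens("Hello, I love"))
```

Because the space travels with the following word, "dog" and "Ġdog" enter the BPE merge stage as different strings and end up as different tokens.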

Q

How many tokens does "ChatGPT is amazing!" have in GPT-2?

You can check this with the tokenizer. "ChatGPT is amazing!" would likely tokenize as: ["Chat", "G", "PT", "Ġis", "Ġamazing", "!"] = 6 tokens. Note: "ChatGPT" breaks into 3 pieces because the string did not exist when GPT-2's vocabulary was learned (GPT-2 predates ChatGPT), so no merge for the full word was ever created. Newer vocabularies may split it differently; count with tiktoken rather than guessing.

📝

Practice Questions

Easy What are the 6 most common special tokens and their purposes?

Medium Given corpus ["aa ab", "ba bb", "aa ba"], perform 2 BPE merge steps manually.

Initial representation (chars): {('a','a','</w>'):2, ('a','b','</w>'):1, ('b','a','</w>'):2, ('b','b','</w>'):1}

Step 1: Count pairs, weighting each by its word's frequency: (a,a)=2, (a,</w>)=4, (a,b)=1, (b,</w>)=2, (b,a)=2, (b,b)=1. The pair (a,</w>) wins because four word occurrences end in 'a' ('aa'×2 + 'ba'×2). Merge 'a'+'</w>'→'a</w>'. Now: {('a','a</w>'):2, ('a','b','</w>'):1, ('b','a</w>'):2, ('b','b','</w>'):1}

Step 2: Recount. ('a','a</w>')=2 (from 'aa'×2), ('b','a</w>')=2 (from 'ba'×2), ('b','</w>')=2 (from 'ab' and 'bb'), (a,b)=1, (b,b)=1. Three pairs tie at 2, so the result depends on the implementation's tie-breaking rule. Picking ('b','a</w>') gives the merge 'ba</w>'. The vocabulary now includes 'a', 'b', '</w>', 'a</w>', 'ba</w>'.
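
A hand count like this is easy to get wrong, so it helps to check it in code. The following is a small self-contained sketch (the helper names `word_counts`, `pair_counts`, and `merge` are made up for this check and are separate from the SimpleBPE class):

```python
from collections import Counter

def word_counts(corpus):
    # Each word becomes a tuple of characters plus an end-of-word marker.
    counts = Counter()
    for line in corpus:
        for word in line.split():
            counts[tuple(word) + ("</w>",)] += 1
    return counts

def pair_counts(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge(vocab, pair):
    # Replace every occurrence of `pair` with the concatenated symbol.
    merged = Counter()
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

vocab = word_counts(["aa ab", "ba bb", "aa ba"])
step1 = pair_counts(vocab)
best1 = max(step1, key=step1.get)
vocab = merge(vocab, best1)
step2 = pair_counts(vocab)
tied = [p for p, c in step2.items() if c == max(step2.values())]

print("Step 1 counts:", dict(step1))
print("Step 1 merge: ", best1)
print("Step 2 tie:   ", tied)
```

The step 2 tie is resolved differently by different implementations, which is why two BPE trainers can produce different vocabularies from the same corpus.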

Hard Why does tokenization affect the cost of using GPT-4 API, and how can you optimize token usage?

GPT-4 API charges per token (input + output). Cost optimization strategies:

1. Use shorter prompts: Avoid verbose system prompts. Every word costs money.
2. Avoid token-inefficient languages: Non-English text often uses more tokens per character. Chinese characters may be 1-2 tokens each, but some scripts use 3-4 tokens per character.
3. Avoid whitespace waste: Extra spaces, newlines, and indentation all consume tokens.
4. Use tiktoken to count first: `import tiktoken; enc = tiktoken.encoding_for_model("gpt-4"); len(enc.encode(text))` — always check before sending.
5. Truncate context: Don't send entire conversation history every time; summarize older turns.
6. Use streaming: Doesn't reduce tokens but improves user experience while generation happens.
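
A quick cost calculator makes the effect of prompt trimming concrete. This is a sketch under stated assumptions: the per-1K prices are hypothetical placeholders (check the provider's current price list), and `estimate_cost` is an illustrative helper, not part of any SDK.

```python
def estimate_cost(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    # Cost = tokens / 1000 * price per 1K tokens, summed over input and output.
    return (input_tokens / 1000 * price_in_per_1k
            + output_tokens / 1000 * price_out_per_1k)

# Hypothetical prices in dollars per 1K tokens.
PRICE_IN, PRICE_OUT = 0.03, 0.06

verbose = estimate_cost(1200, 300, PRICE_IN, PRICE_OUT)
trimmed = estimate_cost(400, 300, PRICE_IN, PRICE_OUT)

print(f"Verbose prompt: ${verbose:.4f} per call")
print(f"Trimmed prompt: ${trimmed:.4f} per call")
print(f"Saved over 100,000 calls: ${(verbose - trimmed) * 100_000:,.2f}")
```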

🔗

Relationships to LLMs

GPT (all versions)

Uses BPE tokenization. GPT-2: 50,257 tokens. GPT-4: ~100K tokens (the cl100k_base encoding). The tokenizer is the very first step before any GPT processing.

Claude

Uses a proprietary subword tokenizer; Anthropic has not published its full details for Claude 3. Claude 3's context window is 200K tokens, so tokenization determines how much text fits.

LLaMA

LLaMA 1 and 2 use a SentencePiece BPE tokenizer with a 32K vocabulary. LLaMA 3 switched to a tiktoken-based BPE tokenizer with a 128K vocabulary, which encodes text (especially non-English text) in fewer tokens.

DeepSeek

DeepSeek uses a custom byte-level BPE tokenizer trained on both English and Chinese text, with a vocabulary of roughly 100K tokens in the original DeepSeek LLM release. It is its own vocabulary, not a drop-in match for OpenAI's cl100k_base.

Qwen

Qwen uses tiktoken-based BPE with ~150K vocabulary, heavily optimized for Chinese — Chinese characters get dedicated tokens for efficiency.

Kimi

Kimi (by Moonshot AI) uses a custom tokenizer optimized for long-context Chinese+English tasks with 128K context window support.

📋

Cheat Sheet

BPE (GPT)

Merge most frequent character pairs. Space encoded into tokens. GPT-2: 50K vocab, GPT-4: 100K vocab.

WordPiece (BERT)

Merge pairs with max likelihood score. ## prefix for continuations. BERT: 30K vocab (uncased).

SentencePiece (LLaMA)

Language-independent. ▁ for word boundaries. No pre-tokenization. LLaMA 1/2: 32K vocab (LLaMA 3 moved to a tiktoken-style BPE with 128K vocab).

Context Window

Max tokens model can see. 1 word ≈ 1.3 tokens. Claude 3.5: 200K, GPT-4o: 128K.

[CLS] Token

BERT prepends this. The final hidden state of [CLS] represents whole sentence — used for classification.

OOV Solution

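Subword tokenization nearly eliminates OOV. Byte-level BPE (GPT-2 and later) can encode any string because all 256 byte values are in the vocabulary. WordPiece (BERT) still falls back to [UNK] for characters it has never seen.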

🎯

Mini Project

🔬 Build a Tokenization Analyzer

Compare how different tokenizers handle the same text — great for building intuition about LLM costs and behavior.

1. Install transformers and tiktoken: pip install transformers tiktoken
2. Load 3 tokenizers: GPT-2 (BPE), BERT-base-uncased (WordPiece), and a LLaMA tokenizer
3. Tokenize the same 10 sentences. Include: normal English, a technical term, a rare proper noun, code, Chinese/Arabic text
4. Compare token counts: Which tokenizer is most "efficient" for each type of text? Build a comparison table.
5. Estimate API costs: Assuming a price of $0.03 per 1K input tokens (check current pricing), calculate the cost of processing your 10 sentences 1000 times.

4.1–4.4

One-Hot, BoW, N-Grams & TF-IDF

📖

One-Hot Encoding

One-hot encoding is the simplest way to represent a word as a number vector. Each word gets a unique position in a vector, and only that position is "1" — everything else is "0".

Vocabulary: {cat:0, dog:1, bird:2, runs:3, jumps:4}
Vocabulary size |V| = 5

One-hot vectors:
  "cat"   → [1, 0, 0, 0, 0]
  "dog"   → [0, 1, 0, 0, 0]
  "bird"  → [0, 0, 1, 0, 0]
  "runs"  → [0, 0, 0, 1, 0]
  "jumps" → [0, 0, 0, 0, 1]

CRITICAL PROBLEMS:
  1. For 50,000 word vocab → each vector has 50,000 dimensions!
     99.998% of each vector is zeros → SPARSE & WASTEFUL
     
  2. "cat" and "dog" are exactly as far apart as "cat" and "runs"
     The vectors don't capture that cat/dog are both animals!
     
  3. No way to compute meaningful similarity:
     cat · dog  = [1,0,0,0,0] · [0,1,0,0,0] = 0
     cat · runs = [1,0,0,0,0] · [0,0,0,1,0] = 0  ← SAME SCORE = MEANINGLESS
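
The similarity problem is easy to confirm in a few lines. A minimal sketch using the 5-word vocabulary above (`one_hot` and `dot` are small helpers written for this example):

```python
vocab = {"cat": 0, "dog": 1, "bird": 2, "runs": 3, "jumps": 4}

def one_hot(word):
    # A vector of zeros with a single 1 at the word's index.
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

print(one_hot("cat"))
print("cat · dog  =", dot(one_hot("cat"), one_hot("dog")))
print("cat · runs =", dot(one_hot("cat"), one_hot("runs")))
print("cat · cat  =", dot(one_hot("cat"), one_hot("cat")))
```

Every pair of distinct words scores 0, so the representation cannot rank "dog" as closer to "cat" than "runs" is.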

📖

Bag of Words (BoW)

Bag of Words represents an entire document (not just a word) as a vector by counting how many times each vocabulary word appears. It's called "bag" because it ignores word ORDER — it just counts.

Vocabulary: {I:0, love:1, NLP:2, hate:3, Python:4}

Document 1: "I love NLP and I love Python"
  "I" appears 2 times
  "love" appears 2 times
  "NLP" appears 1 time
  "Python" appears 1 time
BoW vector: [2, 2, 1, 0, 1]
             I  love NLP hate Python

Document 2: "I hate Python but I love NLP"
BoW vector: [2, 1, 1, 1, 1]
             I  love NLP hate Python

Similarity: Both have [I×2, NLP×1] → related topics  ✓

PROBLEM — Order is completely lost:
  "Dog bites man" → [1,1,1] (dog, bites, man counts)
  "Man bites dog" → [1,1,1] (SAME VECTOR!)
  But these mean very different things!
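
The counts above can be reproduced with a few lines of standard-library Python. This is a sketch with naive whitespace splitting (no lowercasing or punctuation handling), and `bow` is a helper defined just for this example:

```python
from collections import Counter

def bow(text, vocab):
    # Count whitespace-separated tokens, then read counts off in vocab order.
    counts = Counter(text.split())
    return [counts[word] for word in vocab]

vocab = ["I", "love", "NLP", "hate", "Python"]
doc1 = bow("I love NLP and I love Python", vocab)
doc2 = bow("I hate Python but I love NLP", vocab)
print(doc1)
print(doc2)

# Word order is lost: both sentences map to the same vector.
order_vocab = ["dog", "bites", "man"]
print(bow("dog bites man", order_vocab) == bow("man bites dog", order_vocab))
```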

📖

N-Grams — Capturing Some Context

N-grams capture some word order by creating features from sequences of N consecutive words. Instead of individual words, you count sequences.

Text: "I love natural language processing"

UNIGRAMS (N=1) — individual words:
  {I, love, natural, language, processing}

BIGRAMS (N=2) — pairs of adjacent words:
  {(I,love), (love,natural), (natural,language), (language,processing)}

TRIGRAMS (N=3) — triplets:
  {(I,love,natural), (love,natural,language), (natural,language,processing)}

WHY N-GRAMS HELP:
  "not good" as a bigram captures negation that unigrams miss
  "New York" as bigram = city name; "New" + "York" alone = misleading
  "not bad" bigram → positive sentiment (double negation!)

WHY N-GRAMS STILL HAVE LIMITS:
  N=2: captures 2-word context
  N=3: captures 3-word context  
  N=10: captures 10-word context... but vocabulary explodes!
  With 50K words: unigrams=50K features, bigrams=2.5 BILLION possible features
  (Though most don't appear in practice → sparsity again)
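
Extracting n-grams is a one-line sliding window. A minimal sketch (the `ngrams` helper is written for this example; libraries such as scikit-learn do the same thing internally via `ngram_range`):

```python
def ngrams(tokens, n):
    # Slide a window of width n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love natural language processing".split()
print(ngrams(tokens, 1))
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))

# Size of the possible bigram feature space for a 50K-word vocabulary.
print(f"{50_000 ** 2:,} possible bigrams")
```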

📖

TF-IDF — The Classic Information Retrieval Method

TF-IDF (Term Frequency — Inverse Document Frequency) is still used today in search engines and information retrieval. It solves a key problem with BoW: common words like "the", "is", "and" appear in every document and are useless for distinguishing content. TF-IDF weights words by how distinctive they are.

Intuition: A word is important to a document if it appears frequently IN THAT document BUT rarely across ALL documents.

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

TF(t, d) = count(t in d) / total_tokens(d)

IDF(t, D) = log( N / df(t) )

t	= a specific term (word)
d	= the specific document we're scoring
D	= the entire collection of documents (corpus)
TF(t,d)	= how often term t appears in document d (normalized by document length)
N	= total number of documents in corpus D
df(t)	= document frequency = how many documents contain term t at least once
log	= natural logarithm (used to dampen extreme values)

WORKED EXAMPLE:
Corpus of 1,000 documents.

Word "the":
  TF: appears 50 times in a 500-word document = 50/500 = 0.1
  IDF: appears in ALL 1,000 documents = log(1000/1000) = log(1) = 0
  TF-IDF = 0.1 × 0 = 0  ← "the" gets ZERO weight! ✓

Word "photosynthesis":
  TF: appears 10 times in a 500-word biology document = 10/500 = 0.02
  IDF: appears in only 5 documents = log(1000/5) = log(200) = 5.3
  TF-IDF = 0.02 × 5.3 = 0.106  ← HIGH weight! ✓

Word "cancer" in a medical report:
  TF: appears 20 times in 1000-word document = 0.02
  IDF: appears in 100 documents = log(1000/100) = log(10) = 2.3
  TF-IDF = 0.02 × 2.3 = 0.046  ← moderate weight ✓
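
The worked example can be checked directly against the formulas. A minimal sketch using the natural logarithm, as defined above (the function names are illustrative):

```python
import math

def tf(count, doc_len):
    # Term frequency, normalized by document length.
    return count / doc_len

def idf(n_docs, doc_freq):
    # Inverse document frequency with the natural log.
    return math.log(n_docs / doc_freq)

def tf_idf(count, doc_len, n_docs, doc_freq):
    return tf(count, doc_len) * idf(n_docs, doc_freq)

N = 1000  # documents in the corpus
print(f"the:            {tf_idf(50, 500, N, 1000):.3f}")
print(f"photosynthesis: {tf_idf(10, 500, N, 5):.3f}")
print(f"cancer:         {tf_idf(20, 1000, N, 100):.3f}")
```

Library implementations often use a smoothed IDF and normalize the resulting vectors, so their numbers will not match this textbook version exactly.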

⚠️ TF-IDF in Production

TF-IDF + cosine similarity is still the backbone of many production search systems (including some parts of Elasticsearch). BM25, a TF-IDF variant, is used in retrieval stages of RAG pipelines right now. Don't dismiss these "classical" methods — they're fast, interpretable, and often competitive with expensive neural approaches for keyword-heavy search.

💻

Python Implementation

Python

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "The cat sat on the mat",
    "The dog sat on the floor",
    "Cats and dogs are common pets",
    "NLP is the study of language",
]

# ── 1. Bag of Words ──────────────────────────────
bow_vectorizer = CountVectorizer()
bow_matrix = bow_vectorizer.fit_transform(corpus)

print("BOW Vocabulary:", bow_vectorizer.vocabulary_)
print("BOW Matrix shape:", bow_matrix.shape)
# shape = (4 documents, N unique words)

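# ── 2. TF-IDF ─────────────────────────────────────
# Note: scikit-learn's defaults use a smoothed IDF, ln((1+N)/(1+df)) + 1,
# and L2-normalize each row, so values differ from the textbook formula.
tfidf_vectorizer = TfidfVectorizer()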
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

print("\nTF-IDF Matrix shape:", tfidf_matrix.shape)

# Find most similar documents to query
query = "cats and dogs"
query_vec = tfidf_vectorizer.transform([query])
similarities = cosine_similarity(query_vec, tfidf_matrix)
print("\nSimilarity to 'cats and dogs':")
for i, sim in enumerate(similarities[0]):
    print(f"  Doc {i}: {corpus[i][:40]}... → {sim:.3f}")

# ── 3. N-Grams ────────────────────────────────────
ngram_vectorizer = CountVectorizer(ngram_range=(1,2))  # unigrams + bigrams
ngram_matrix = ngram_vectorizer.fit_transform(corpus)

# Show bigram features
features = ngram_vectorizer.get_feature_names_out()
bigrams = [f for f in features if ' ' in f]
print("\nBigrams found:", bigrams[:10])

🎙️

Interview Q&A

Q

What are the main weaknesses of TF-IDF?

1. No semantic understanding: "car" and "automobile" are treated as completely different terms. A document about "automobiles" won't match a query for "cars".
2. No context: "not good" and "very good" have the same TF-IDF weights for "good".
3. Sparsity: Vectors are huge (vocab size) but mostly zeros, which is memory inefficient.
4. No word order: "dog bites man" = "man bites dog" in TF-IDF.
5. Corpus-dependent: IDF values change if you add/remove documents, so you can't easily update in real-time.

This is why modern systems use dense neural embeddings (covered in Module 5-6).

Q

Is TF-IDF still used in production in 2024?

Yes! BM25 (a TF-IDF variant) is used as the first-stage retriever in many RAG (Retrieval-Augmented Generation) pipelines. Elasticsearch and OpenSearch use BM25 by default. Hybrid search systems combine BM25 (keyword matching) + dense retrieval (semantic similarity) for best results. TF-IDF is also used for keyword extraction, document clustering, and as a fast baseline. For pure semantic similarity tasks (question answering, semantic search), dense embeddings win. For exact keyword matching, TF-IDF/BM25 often wins or ties.

5.1–5.7

Word2Vec, GloVe, FastText & Embedding Space

🔴

The Problem with One-Hot Encoding

Problem 1: DIMENSIONALITY
  Vocab size = 100,000 words
  Each word = 100,000-dimensional vector
  99.999% of each vector = zeros!
  MASSIVE memory waste → [0,0,0,...,1,...,0,0,0]

Problem 2: NO SEMANTIC SIMILARITY
  cat  = [1,0,0,0,...]
  kitten = [0,1,0,0,...] 
  car  = [0,0,1,0,...]
  
  distance(cat, kitten) = distance(cat, car)
  BUT cat and kitten are much more related!

Problem 3: NO RELATIONSHIPS
  king - man + woman = ??? 
  One-hot can't do word arithmetic.
  Word vectors CAN:
  vec("king") - vec("man") + vec("woman") ≈ vec("queen")  ← MAGIC!

💡

Distributional Semantics — The Core Idea

The Distributional Hypothesis (Firth, 1957)

"You shall know a word by the company it keeps."

Words that appear in similar contexts tend to have similar meanings. "cat" and "dog" both appear near "pet", "feed", "vet", "cute" → they should have similar vector representations. We don't need to hand-code that cats and dogs are both animals — we can LEARN this from billions of sentences!

👦

ELI10: What is an Embedding?

Simple Analogy

Imagine you want to describe your classmates. You could give each one a unique number (1, 2, 3...) — but that number says nothing about them. Or you could describe each person using 3 attributes: height (0-1), age (0-1), how funny they are (0-1). Now two similar people would have similar numbers! Embeddings do the same for words — but with 300 or 768 "attribute" dimensions instead of 3, capturing aspects of meaning we can't even name.

🎓

Word2Vec — How It Works

Word2Vec, introduced by Mikolov et al. at Google in 2013, learns word vectors by training a simple neural network on a "fake" task. There are two variants:

CBOW (Continuous Bag of Words):
  TASK: Predict CENTER word from SURROUNDING words
  
  Context: ["The", "_?_", "sat", "on"]
  Target:  "cat"
  
  Input: embed(The) + embed(sat) + embed(on) → average
                    ↓
         Small neural network
                    ↓
         Output: probability distribution over all words
         P("cat" | context) should be highest!
  
  Training: Adjust all word embeddings to make the model
  predict the correct center word from context.
  After training, the embeddings ARE the representation.

  Use case: Works better for frequent words, faster training.

═══════════════════════════════════════════════════════════

SKIP-GRAM:
  TASK: Predict SURROUNDING words from CENTER word (opposite of CBOW!)
  
  Input: embed("cat")
                ↓
       Small neural network
                ↓
  Output: P("The"), P("sat"), P("on") should all be high!
  
  The name comes from "skip-grams": word pairs that are allowed to
  skip over intervening words, here (center, context) pairs taken
  from anywhere inside the context window.
  
  Use case: Works better for rare words, captures more semantics.

NEGATIVE SAMPLING — Why it's needed:

Naively, each training step requires computing probabilities
over the ENTIRE vocabulary (50,000 words):
  P("cat" | context) = softmax over 50,000 outputs
  = VERY SLOW

Negative Sampling shortcut:
  Instead of asking "what's the probability for ALL words?"
  Ask: "Is 'cat' a real context word OR a random (negative) sample?"
  
  Positive: ("The", "cat") → label 1 (real context pair)
  Negatives: ("The", "pizza"), ("The", "quantum"), ("The", "treaty")
             → label 0 (fake pairs, randomly sampled)
  
  Train a binary classifier on 1 positive + ~5-20 negatives
  FAR cheaper than softmax over 50K words!
  Works because: "don't need to learn ALL wrong answers,
                   just enough negatives to learn good representations"
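
To make the training data concrete, here is a sketch of how (center, context) pairs and negative samples could be generated. It assumes a symmetric window and the unigram^(3/4) noise distribution; `skipgram_pairs` and `noise_distribution` are illustrative helpers, not gensim's internals.

```python
import random
from collections import Counter

def skipgram_pairs(tokens, window):
    # Every word within `window` positions of the center is a positive pair.
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def noise_distribution(tokens, power=0.75):
    # Unigram counts raised to the 3/4 power, then renormalized.
    counts = Counter(tokens)
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: wt / total for w, wt in weights.items()}

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=1)
noise = noise_distribution(tokens)

print("Positive pairs:", pairs)
print("P_noise(the) =", round(noise["the"], 3), "vs raw frequency", round(2 / 6, 3))

# Draw 5 negatives for one positive pair (label 0 for each of these).
negatives = random.choices(list(noise), weights=list(noise.values()), k=5)
print("Negatives for", pairs[0], "→", negatives)
```

The 3/4 power flattens the distribution: "the" is sampled less often than its raw frequency would suggest, so rarer words get a fair share of negative examples. A drawn negative can occasionally coincide with a true context word; real implementations typically tolerate or skip such draws.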

📐

Mathematical Explanation

For Skip-Gram with negative sampling, the objective is to maximize:

J = log σ(v'_c · v_w) + Σ_{k=1}^{K} E_{w_k~P(w)} [log σ(-v'_{w_k} · v_w)]

J	= objective function (what we maximize)
v_w	= embedding vector of center word w (the "input" embedding)
v'_c	= context embedding of word c (the "output" embedding)
σ	= sigmoid function: σ(x) = 1/(1+e^-x), outputs value between 0 and 1
K	= number of negative samples (typically 5-20)
w_k	= the k-th negative sample (random word)
P(w)	= unigram distribution raised to 3/4 power (sampling distribution)
·	= dot product (element-wise multiply and sum)

Intuition of the Formula

The first term: maximize σ(v'_c · v_w) = maximize the dot product between actual context word c and center word w. High dot product = vectors point in similar direction = words are "close" in embedding space.

The second term: maximize σ(-v'_{w_k} · v_w) = maximize σ of NEGATIVE dot product for random words = push random words AWAY from center word in embedding space.

Net result: context words cluster together, random words are pushed apart → geometry of embedding space captures meaning!
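
The objective for a single (center, context) pair can be evaluated by hand. In this sketch the expectation over negatives is replaced by a sum over K already-drawn samples, and the 2-D vectors are hypothetical toy values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_objective(v_w, v_c, negatives):
    # First term: pull the true context vector toward the center vector.
    positive = math.log(sigmoid(dot(v_c, v_w)))
    # Second term: push each sampled negative away from the center vector.
    negative = sum(math.log(sigmoid(-dot(v_n, v_w))) for v_n in negatives)
    return positive + negative

v_w = [1.0, 0.0]  # center word (hypothetical toy vector)

good = sgns_objective(v_w, v_c=[1.0, 0.0], negatives=[[-1.0, 0.0]])
bad = sgns_objective(v_w, v_c=[-1.0, 0.0], negatives=[[1.0, 0.0]])

print(f"Context aligned, negative opposed: J = {good:.4f}")
print(f"Context opposed, negative aligned: J = {bad:.4f}")
```

J is always negative (it is a sum of log-probabilities) and moves toward 0 as the true context aligns with the center word and the negatives point away from it.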

📖

GloVe — Global Vectors for Word Representation

GloVe (Pennington et al., Stanford, 2014) takes a different approach. Instead of streaming over the corpus one local window at a time, GloVe first aggregates global co-occurrence statistics: how often does word A appear within a context window of word B, summed across the ENTIRE corpus?

Build a co-occurrence matrix X where X_ij = how many times word i appears in the context of word j globally. Then factorize this matrix to get embeddings.

J = Σ_{i,j=1}^{V} f(X_{ij}) (w_i^T · w̃_j + b_i + b̃_j - log X_{ij})²

X_{ij}	= co-occurrence count of word i with word j in corpus
w_i	= word vector for word i
w̃_j	= context word vector for word j
b_i, b̃_j	= bias terms for word i and context j
f(X_{ij})	= weighting function (down-weights rare, noisy co-occurrences and caps the influence of very frequent ones)
log X_{ij}	= log of co-occurrence count (target value)
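
The pieces of the GloVe objective can be sketched in plain Python. Assumptions: a symmetric window with every co-occurrence counted as 1 (the reference implementation also discounts by distance), and the weighting function f(x) = (x / x_max)^α capped at 1, with x_max = 100 and α = 0.75 as in the original paper. The helper names are illustrative.

```python
import math
from collections import defaultdict

def cooccurrence(sentences, window):
    # X[(i, j)] = number of times word j appears within `window` of word i.
    X = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    X[(word, tokens[j])] += 1.0
    return X

def weight(x, x_max=100.0, alpha=0.75):
    # Grows with the count, then saturates at 1 for x >= x_max.
    return (x / x_max) ** alpha if x < x_max else 1.0

def pair_loss(w_i, w_j, b_i, b_j, x_ij):
    # One term of J: f(X_ij) * (w_i · w_j + b_i + b_j - log X_ij)^2
    dot = sum(a * b for a, b in zip(w_i, w_j))
    return weight(x_ij) * (dot + b_i + b_j - math.log(x_ij)) ** 2

X = cooccurrence([["the", "cat", "sat"], ["the", "cat", "ran"]], window=1)
print(dict(X))
print("f(1) =", round(weight(1.0), 4), " f(50) =", round(weight(50.0), 4), " f(500) =", weight(500.0))

# The loss is zero when the dot product plus biases equals log X_ij.
print(pair_loss([1.0, 0.0], [math.log(5.0), 0.0], 0.0, 0.0, 5.0))
```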

GloVe vs Word2Vec

Word2Vec: local context windows, predict-based, captures syntactic relationships well
GloVe: global corpus statistics, count-based, captures word association patterns well
In practice: very similar quality. GloVe embeddings are easier to train reproducibly. Both are static (one vector per word, regardless of context) — superseded by contextual embeddings from BERT.

📖

FastText — Handling Rare & Morphological Words

FastText (Facebook AI, 2017) extends Word2Vec by representing each word as a bag of character n-grams. The embedding for a word is the sum of embeddings for all its character n-grams.

FastText character n-grams for "playing" (n=3):
  Special boundaries: <playing>
  
  Trigrams: <pl, pla, lay, ayi, yin, ing, ng>
  
  Embedding("playing") = Σ embedding(trigram) for all trigrams
                       = embed(<pl) + embed(pla) + embed(lay) + ...
  (The library defaults to n = 3 through 6 and also adds a vector
   for the whole word <playing>; n=3 is shown here for brevity.)
  
ADVANTAGES:
  1. "played", "playing", "plays" share several trigrams → similar vectors!
     (All contain "<pl", "pla", "lay")
     
  2. Rare word "photosynthesizing" gets a good vector even if it
     appeared only once, because its character pieces appear elsewhere
     
  3. Works great for morphologically rich languages (German, Finnish, Turkish)
     where words change form dramatically with suffixes/prefixes

USED IN: Facebook's production NLP systems, multilingual tasks,
         languages where word boundaries are complex
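
The n-gram extraction itself is simple to reproduce. A minimal sketch for a single n (`char_ngrams` is an illustrative helper, not the fastText API):

```python
def char_ngrams(word, n=3):
    # Wrap the word in boundary markers, then slide a window of width n.
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

print(char_ngrams("playing"))

# Related word forms share n-grams, which is what ties their vectors together.
shared = (set(char_ngrams("playing"))
          & set(char_ngrams("played"))
          & set(char_ngrams("plays")))
print(sorted(shared))
```

Only the prefix trigrams are common to all three forms; the suffix trigrams ("ing", "yed", "ys>") are what distinguish them.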

🗺️

Embedding Space — The Magic Properties

2D PROJECTION of 300D word embedding space:
(Real embeddings are 300D, this is conceptual illustration)

          Animals        Royalty
            │              │
    cat ●   │   dog ●   king ●
            │              │
  kitten ●  │  puppy ●  queen ●
            │              │
   feline ● │  canine ●  prince ●
                           │
            Countries     princess ●
            │
    France ●
    Paris ●   ← France + capital → Paris
    Germany ●
    Berlin ●  ← Germany + capital → Berlin
    Japan ●
    Tokyo ●

ANALOGIES (semantic algebra!):
  king - man + woman ≈ queen
  Paris - France + Germany ≈ Berlin
  doctor - man + woman ≈ nurse (controversial! shows bias in data)
  
CLUSTERING:
  Sports words cluster together
  Food words cluster together
  Medical terms cluster together
  Code keywords cluster together

SIMILARITY (illustrative values; exact numbers depend on the model):
  cos_sim(cat, dog) ≈ 0.85  (very similar)
  cos_sim(cat, car) ≈ 0.15  (not similar)
  cos_sim(Paris, Tokyo) ≈ 0.72 (both capital cities)

💻

Python Implementation

Python

# pip install gensim numpy
import gensim.downloader as api
import numpy as np

# ── Load pre-trained Word2Vec (Google News, 300D) ─
# This download is about 1.6GB. For a quick demo, swap in the smaller
# GloVe model on the commented line (its vectors are 50-D, not 300-D).
model = api.load("word2vec-google-news-300")
# model = api.load("glove-wiki-gigaword-50")

# ── Basic operations ──────────────────────────────
# Get vector for a word
cat_vec = model['cat']
print("Shape of 'cat' vector:", cat_vec.shape)  # (300,)
print("First 5 dims:", cat_vec[:5])

# ── Semantic similarity ───────────────────────────
sim_cat_dog = model.similarity('cat', 'dog')
sim_cat_car = model.similarity('cat', 'car')
print(f"\nSimilarity(cat, dog): {sim_cat_dog:.3f}")
print(f"Similarity(cat, car): {sim_cat_car:.3f}")

# ── Most similar words ────────────────────────────
similar_to_king = model.most_similar('king', topn=5)
print("\nMost similar to 'king':", similar_to_king)

# ── FAMOUS ANALOGY: king - man + woman = queen ────
result = model.most_similar(
    positive=['king', 'woman'],   # add these
    negative=['man'],             # subtract this
    topn=1
)
print(f"\nking - man + woman = {result[0][0]}")
# Expected: queen!

# ── Word doesn't belong ──────────────────────────
odd_one_out = model.doesnt_match(['cat', 'dog', 'bird', 'car'])
print(f"\nDoesn't belong: {odd_one_out}")  # 'car'

# ── Manual cosine similarity computation ─────────
def cosine_sim(v1, v2):
    """
    Cosine similarity = how aligned are two vectors?
    Range: -1 (opposite) to +1 (identical)
    0 = perpendicular (unrelated)
    """
    dot_product = np.dot(v1, v2)          # v1 · v2
    norm_v1 = np.linalg.norm(v1)          # |v1|
    norm_v2 = np.linalg.norm(v2)          # |v2|
    return dot_product / (norm_v1 * norm_v2)

manual_sim = cosine_sim(model['paris'], model['france'])
print(f"\nManual cosine_sim(paris, france): {manual_sim:.3f}")

# ── Train your own Word2Vec on custom data ───────
from gensim.models import Word2Vec

sentences = [
    ["i", "love", "natural", "language", "processing"],
    ["nlp", "is", "the", "study", "of", "language"],
    ["transformers", "revolutionized", "nlp"],
    ["bert", "and", "gpt", "are", "transformer", "models"],
]

custom_model = Word2Vec(
    sentences=sentences,
    vector_size=50,        # embedding dimensions
    window=3,              # context window size
    min_count=1,           # include words seen at least once
    workers=4,             # parallel training
    sg=1                   # 0=CBOW, 1=Skip-Gram
)

print("\nCustom model - 'nlp' vector shape:", custom_model.wv['nlp'].shape)
print("Most similar to 'nlp':", custom_model.wv.most_similar('nlp', topn=3))

⚠️

Critical Limitation: Static Embeddings

❌ Major Problem with Word2Vec/GloVe

Word2Vec and GloVe give EVERY word ONE fixed vector, regardless of context. This is called a "static embedding".

Consider "bank":
— "I went to the bank to deposit money" → financial institution
— "The fish swam near the river bank" → geographical feature

Word2Vec has ONE vector for "bank": a blend of both meanings, weighted toward whichever sense is more common in the training data. BERT and GPT instead produce contextual embeddings, with a different vector for "bank" depending on the surrounding context. This is one reason transformer models handle ambiguity so much better than Word2Vec.

🎙️

Interview Questions

Q

Why is cosine similarity preferred over Euclidean distance for word embeddings?

Cosine similarity measures the ANGLE between two vectors, ignoring their magnitude (length). Euclidean distance measures the actual distance in space. For embeddings, direction matters more than magnitude because:

1. Vector magnitude carries information unrelated to meaning. In trained embeddings the norm of a vector depends on how often the word was seen and updated during training, so two words with similar meanings can sit at very different distances from the origin.
2. Cosine similarity captures semantic similarity regardless of how often words appear.
3. Example: vec("cat") might have magnitude 2.1 and vec("dog") magnitude 3.4. If they point in exactly the same direction (angle = 0°), cosine similarity is 1.0 (maximally similar), yet their Euclidean distance is 3.4 - 2.1 = 1.3 rather than 0.
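
The magnitude example can be verified numerically. The vectors below are hypothetical: one shared unit direction scaled to lengths 2.1 and 3.4.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

direction = [0.6, 0.8]                 # unit vector (0.36 + 0.64 = 1)
cat = [2.1 * x for x in direction]     # magnitude 2.1
dog = [3.4 * x for x in direction]     # magnitude 3.4, same direction

print(f"cosine    = {cosine(cat, dog):.3f}")
print(f"euclidean = {euclidean(cat, dog):.3f}")
```

Cosine reports the two vectors as identical in direction, while Euclidean distance reports a gap that comes purely from the difference in length.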

Q

⚠️ TRAP: "Word2Vec understands context" — True or False?

FALSE! This is a common misconception. Word2Vec uses a CONTEXT WINDOW during TRAINING to learn representations, but the resulting embeddings are STATIC — one fixed vector per word. Word2Vec doesn't understand context at inference time. When you look up `model['bank']`, you always get the same vector regardless of what sentence "bank" appears in. TRUE contextual understanding comes from BERT and GPT, which produce different embeddings for the same word depending on its context (Module 8-11).

📝

Practice Questions

Easy What is the key difference between CBOW and Skip-Gram?

CBOW predicts the CENTER word from surrounding context words. Skip-Gram predicts the SURROUNDING words from the center word. CBOW: context → center. Skip-Gram: center → context. CBOW is faster and works better for frequent words. Skip-Gram works better for rare words and learns better representations for smaller datasets.

Medium If a word appears in your test data that wasn't in the training corpus, how does FastText handle it differently from Word2Vec?

Word2Vec: The word has no vector at all. The vocabulary is fixed at training time, so a lookup fails (gensim raises a KeyError) unless you add your own fallback, such as a shared unknown-word vector. FastText: The word is broken into character n-grams (e.g., "gaming" → <ga, gam, ami, min, ing, ng>). Even if "gaming" never appeared during training, its component n-grams likely did appear in "game", "games", "gamble", etc. FastText sums up the embeddings of all these n-grams to produce a meaningful vector for "gaming". This is the key advantage of FastText for morphologically rich languages and domain-specific text.

6.1–6.5

SBERT, BGE, E5 & Similarity Measures

🔴

Why Sentence Embeddings?

Word embeddings give us one vector per word. But we often need ONE vector for an entire sentence or paragraph. This is needed for:

Semantic search: "Find all documents about neural networks" — compare query vector to document vectors
RAG: Convert knowledge base documents to vectors, find relevant ones for a user query
Duplicate detection: "Is this question already answered in our FAQ?"
Clustering: Group similar customer feedback together

The naive approach: average all word vectors in a sentence. Problem: "The dog bit the man" and "The man bit the dog" have the same average word vector but different meanings!
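
The order problem with averaging shows up immediately on a toy example. The 2-D word vectors below are hypothetical values picked for illustration; real word vectors behave the same way because addition is order-independent.

```python
import math

# Hypothetical 2-D toy vectors (real embeddings have hundreds of dimensions).
word_vecs = {
    "the": [0.0, 0.5],
    "dog": [1.0, 0.25],
    "bit": [0.5, 1.0],
    "man": [0.75, 0.5],
}

def average_embedding(sentence):
    # Look up each token and average the vectors dimension by dimension.
    vectors = [word_vecs[token] for token in sentence.lower().split()]
    dims = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dims)]

a = average_embedding("The dog bit the man")
b = average_embedding("The man bit the dog")
print(a)
print(b)
print("Same vector:", all(math.isclose(x, y) for x, y in zip(a, b)))
```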

📖

SBERT — Sentence-BERT

SBERT (Reimers & Gurevych, 2019) — Siamese Network Architecture:

         Sentence A                    Sentence B
    "The cat sat on mat"         "A feline rested on rug"
             │                              │
             ▼                              ▼
      ┌─────────────┐              ┌─────────────┐
      │    BERT     │              │    BERT     │  ← SAME WEIGHTS (siamese!)
      │  Encoder    │              │  Encoder    │
      └─────────────┘              └─────────────┘
             │                              │
             ▼                              ▼
       Mean Pooling                   Mean Pooling
    (avg all token               (avg all token
      hidden states)               hidden states)
             │                              │
             ▼                              ▼
    Sentence Vector u              Sentence Vector v
    [0.23, -0.12, ...]            [0.21, -0.15, ...]
                   │              │
                   ▼              ▼
              Cosine Similarity(u, v) = 0.94  ← HIGH!
              (These sentences ARE semantically similar)

TRAINING:
  SBERT uses pairs/triplets of sentences:
  - Positive pair (similar): ("The cat sat", "A feline rested") → high similarity
  - Negative pair (dissimilar): ("The cat sat", "I love Python") → low similarity
  - Uses Triplet Loss or Cosine Similarity loss to train
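
The mean pooling step in the diagram can be sketched without any deep learning library. Assumptions: each token's hidden state is given as a plain list, and an attention mask marks padding tokens with 0 so they are excluded from the average. `mean_pool` is an illustrative helper, not the sentence-transformers API.

```python
def mean_pool(token_vectors, attention_mask):
    # Average the hidden states of real tokens only (mask == 1).
    dims = len(token_vectors[0])
    total = [0.0] * dims
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            for d in range(dims):
                total[d] += vec[d]
            count += 1
    return [t / count for t in total]

# Three token positions; the last one is padding and must be ignored.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

sentence_vector = mean_pool(hidden_states, mask)
print(sentence_vector)
```

Ignoring the mask would let padding tokens drag the sentence vector toward meaningless values, which matters when sentences in a batch have different lengths.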

📖

Production Embedding Models

Model	By	Dimensions	Best For
text-embedding-3-small	OpenAI	1536	General purpose, cost-effective, great for RAG
text-embedding-3-large	OpenAI	3072	High accuracy tasks, multilingual
BGE-large-en	BAAI	1024	Top-performing open source English embeddings
E5-large-v2	Microsoft	1024	Strong English retrieval (the multilingual-e5 variants cover cross-lingual search)
Instructor-XL	HKU	768	Task-specific embeddings with instructions
all-MiniLM-L6-v2	SBERT	384	Fast, small, good quality — great for edge deployment

📐

Similarity Measures — Choosing the Right One

Cosine Similarity: cos(θ) = (u · v) / (||u|| × ||v||)

Dot Product: u · v = Σ u_i × v_i

Euclidean Distance: d(u,v) = √Σ(u_i - v_i)²

u, v	= two embedding vectors being compared
u · v	= dot product: element-wise multiplication then sum
||u||	= L2 norm (length/magnitude) of vector u = √(u₁² + u₂² + ... + u_n²)
θ	= angle between the two vectors

WHEN TO USE WHICH:

COSINE SIMILARITY: Range [-1, +1]
  ✓ Use for MOST text similarity tasks
  ✓ Not affected by vector magnitude (length)
  ✓ 1.0 = identical direction, 0 = perpendicular, -1 = opposite
  ✓ Standard choice for semantic search

DOT PRODUCT: Range (-∞, +∞)
  ✓ Faster to compute (no normalization)
  ✓ Used when vectors are already L2-normalized (then = cosine!)
  ✓ OpenAI embeddings are unit-length, so dot product and cosine give the same ranking
  ✗ Affected by magnitude — longer vectors score higher even if angle is same

EUCLIDEAN DISTANCE: Range [0, +∞)
  ✓ Intuitive: physical distance in space
  ✓ Used in some clustering algorithms (k-means)
  ✗ More affected by vector dimension count and magnitude
  ✗ Less common for semantic similarity

PRACTICAL NOTE:
  Many embedding models output NORMALIZED vectors (||v|| = 1), or can be asked to
  When both vectors are normalized:
    cos(θ) = u · v  (they become equivalent!)
  So check if your embedding model normalizes outputs!
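
The three measures and the normalization shortcut can be compared on a small example. A minimal pure-Python sketch with hypothetical 2-D vectors:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def euclidean(u, v):
    return norm([a - b for a, b in zip(u, v)])

def normalize(u):
    length = norm(u)
    return [a / length for a in u]

u, v = [3.0, 4.0], [4.0, 3.0]
print("cosine   :", cosine(u, v))
print("dot      :", dot(u, v))
print("euclidean:", round(euclidean(u, v), 4))

# After L2 normalization, dot product equals cosine similarity,
# and squared Euclidean distance equals 2 - 2 * cosine.
un, vn = normalize(u), normalize(v)
print("dot (normalized)        :", round(dot(un, vn), 4))
print("euclidean² (normalized) :", round(euclidean(un, vn) ** 2, 4))
```

On unit vectors all three measures produce the same ranking, so the choice between them only matters when the model's outputs are not normalized.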

💻

Hugging Face Implementation

Python

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util
import numpy as np

# ── Load SBERT model ──────────────────────────────
model = SentenceTransformer('all-MiniLM-L6-v2')  # 80MB, fast
# For production quality: 'BAAI/bge-large-en-v1.5'

# ── Encode sentences ──────────────────────────────
sentences = [
    "The cat is sitting on the mat.",
    "A feline is resting on a rug.",      # same meaning, different words
    "I love programming in Python.",       # different topic
    "Machine learning is fascinating.",
    "Deep learning is a subset of ML.",     # related to ML
]

embeddings = model.encode(sentences, normalize_embeddings=True)
print("Embedding shape:", embeddings.shape)  # (5, 384)

# ── Pairwise similarity ───────────────────────────
cos_sim_matrix = util.cos_sim(embeddings, embeddings)
print("\nSimilarity matrix (rounded):")
for i, sent_i in enumerate(sentences):
    for j, sent_j in enumerate(sentences):
        if i < j:
            sim = cos_sim_matrix[i][j].item()
            print(f"  [{sim:.2f}] '{sent_i[:30]}...' ↔ '{sent_j[:30]}...'")

# ── Semantic search ───────────────────────────────
knowledge_base = [
    "Python is a high-level programming language.",
    "Transformers use self-attention mechanisms.",
    "RAG combines retrieval with generation.",
    "BERT is an encoder-only transformer model.",
    "GPT uses autoregressive generation.",
]

kb_embeddings = model.encode(knowledge_base, normalize_embeddings=True)

query = "How does BERT work?"
query_emb = model.encode([query], normalize_embeddings=True)

# Find most similar documents
similarities = util.cos_sim(query_emb, kb_embeddings)[0]
top_results = sorted(enumerate(similarities), key=lambda x: -x[1])

print(f"\nQuery: '{query}'")
print("Top 3 most relevant documents:")
for rank, (idx, score) in enumerate(top_results[:3], 1):
    print(f"  {rank}. [{score:.3f}] {knowledge_base[idx]}")
# Expected: BERT document should score highest!

🔗

Relationship to RAG & Semantic Search

RAG Systems

Sentence embeddings ARE the backbone of RAG. Every document chunk in your knowledge base is stored as an embedding vector. User queries become embedding vectors. Cosine similarity finds relevant chunks. This is exactly Module 6 in production!

Semantic Search

Unlike keyword search (TF-IDF), semantic search finds documents that MEAN the same thing as the query, even if they use different words. "What are ML models?" would find "deep learning algorithms" because their embeddings are similar.

Claude

Claude's API can be paired with embedding models for RAG. Anthropic does not offer its own embeddings model; its documentation points to third-party providers such as Voyage AI, and practitioners also use OpenAI embeddings or open-source SBERT models.

OpenAI Embeddings

OpenAI's text-embedding-3 models are among the most widely used commercial embeddings for RAG. text-embedding-3-small was priced at $0.02 per 1M tokens at launch, which makes building large knowledge bases very cheap.

7.1–7.6

N-gram LMs, Perplexity & Text Generation

🔴

What is a Language Model?

Core Definition

A language model is a probability distribution over sequences of tokens. Given any sequence of words, it assigns a probability to that sequence. This seemingly simple idea is the foundation of GPT, Claude, LLaMA, and every other LLM.

A language model assigns probabilities to sentences:

P("The cat sat on the mat")     = 0.0000023  (reasonable sentence)
P("The mat sat on the cat")     = 0.0000001  (weird but grammatical)
P("sat the mat on cat the")     = 0.0000000001 (not a sentence)
P("I love eating pizza")         = 0.0000087  (very natural)
P("I pizza eating love")         = 0.0000000003 (unnatural)

KEY INSIGHT: Language models learn "what is natural English"
from billions of examples. They assign HIGH probability to
natural sequences and LOW probability to unnatural ones.

APPLICATION: Given "The sky is ___", predict most likely next word:
  P("blue" | "The sky is") = 0.23  ← high
  P("clear" | "The sky is") = 0.18 ← high
  P("falling" | "The sky is") = 0.01 ← low
  P("pizza" | "The sky is") = 0.0001 ← very low

📖

N-Gram Language Models

Before neural networks, N-gram models were the standard language models. They estimate the probability of a word based only on the previous N-1 words (the Markov assumption).

P(w₁, w₂, ..., wₙ) ≈ Π P(wᵢ | wᵢ₋₁) [Bigram Model]

P(wᵢ | wᵢ₋₁) = count(wᵢ₋₁, wᵢ) / count(wᵢ₋₁)

P(w₁, w₂, ..., wₙ)	= probability of an entire sentence
Π	= product (multiply all terms together)
P(wᵢ | wᵢ₋₁)	= probability of word wᵢ given the previous word wᵢ₋₁
count(wᵢ₋₁, wᵢ)	= how many times words wᵢ₋₁ and wᵢ appear consecutively in training data
count(wᵢ₋₁)	= how many times word wᵢ₋₁ appears in training data

BIGRAM EXAMPLE:
Training corpus: "I love NLP. I love Python. I enjoy coding."

Count pairs:
  (I, love) = 2
  (I, enjoy) = 1
  (love, NLP) = 1
  (love, Python) = 1
  (enjoy, coding) = 1

Compute bigram probabilities:
  P(love | I) = count(I,love) / count(I) = 2/3 = 0.667
  P(enjoy | I) = count(I,enjoy) / count(I) = 1/3 = 0.333

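Compute sentence probability:
  P("I love NLP") = P(I) × P(love|I) × P(NLP|love)
                  = 3/9 × 2/3 × 1/2 = 0.333 × 0.667 × 0.5 = 0.111
  (P(I) = count(I) / total tokens = 3/9; P(NLP|love) = count(love,NLP) / count(love) = 1/2)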
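
The counts in this example can be reproduced with a few lines of Python. This is a sketch under two assumptions the text does not spell out: bigrams are not counted across sentence boundaries, and the first word's probability is its unigram frequency (count / total tokens). The function names are illustrative, and no smoothing is applied, so an unseen previous word would cause a division by zero.

```python
from collections import Counter

corpus = "I love NLP. I love Python. I enjoy coding."

# One word list per sentence, so bigrams never span a sentence boundary
sentences = [s.split() for s in corpus.split(".") if s.strip()]

unigrams = Counter(word for s in sentences for word in s)
bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
total_tokens = sum(unigrams.values())

def bigram_prob(prev, word):
    """P(word | prev) = count(prev, word) / count(prev)"""
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(words):
    """Unigram estimate for the first word, times the chain of bigram probabilities."""
    prob = unigrams[words[0]] / total_tokens
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(f"P(love | I)     = {bigram_prob('I', 'love'):.3f}")    # 2/3
print(f"P(NLP | love)   = {bigram_prob('love', 'NLP'):.3f}")  # 1/2
print(f"P('I love NLP') = {sentence_prob(['I', 'love', 'NLP']):.3f}")  # 3/9 × 2/3 × 1/2
```

Any sentence containing a bigram that never occurred in training gets probability 0 here, which is why practical N-gram models add smoothing.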

LIMITATIONS:
  "I love NLP" — bigram only looks back 1 word
  Can't capture: "The man who ate sushi ... enjoyed IT"
  "IT" = "sushi" but it's 7 words back! Bigram can't know.

📖

Perplexity — Evaluating Language Models

Perplexity measures how "confused" or "surprised" a language model is by a test set. Lower perplexity = model is less surprised = model is better at predicting language.

PP(W) = P(w₁, w₂, ..., wₙ)^(-1/N)

PP(W) = ᴺ√( 1 / P(w₁ w₂ ... wₙ) )     (the N-th root of the inverse probability)

PP(W)	= perplexity of word sequence W
P(w₁...wₙ)	= probability the model assigns to the test sequence
N	= number of tokens in the sequence
^(-1/N)	= raise to the power of -1/N (geometric mean normalization)

INTUITION:
  PP = 1   → Perfect model! Always predicts the correct next word.
  PP = 10  → On average, model is choosing between 10 equally likely words.
  PP = 100 → On average, 100 equally likely choices. Not great.
  PP = 50000 → Random guessing over a 50,000-word vocabulary. Terrible model.

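Reported zero-shot perplexities on Penn Treebank (from the GPT-2 and GPT-3 papers):
  GPT-2 (small, 117M params): PP ≈ 65.9
  GPT-2 (large, 762M params): PP ≈ 40.3
  GPT-3 (175B params):        PP ≈ 20.5
  Human: no agreed figure; estimates depend heavily on how they are measured (humans are uncertain too!)
  Perplexities are only comparable on the same dataset with the same tokenization.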

WHY LOWER IS BETTER:
  A model with PP=10 is, on average, as uncertain as if it were choosing uniformly among 10 words.
  A model with PP=5 is as uncertain as choosing among 5.
  So an autocomplete model with PP=5 faces half as many effective choices as one with PP=10.
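
A small sketch of the formula, assuming we already have the probability the model assigned to each token of the test sequence (the numbers below are invented). It works in log space because multiplying many small probabilities underflows; `perplexity` is an illustrative helper name.

```python
import math

def perplexity(token_probs):
    """PP = exp(-(1/N) * Σ log p_i), which equals P(w_1 ... w_N)^(-1/N)."""
    n = len(token_probs)
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_prob)

# Every token received probability 1/10: like choosing among 10 equally likely words
uniform_10 = perplexity([0.1] * 20)

# A model that was fairly sure of each token
confident = perplexity([0.9, 0.8, 0.95, 0.85])

print(f"uniform over 10 words: PP = {uniform_10:.2f}")
print(f"confident model:       PP = {confident:.2f}")
```

The uniform case comes out at 10, matching the "choosing between 10 equally likely words" intuition above.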

📖

Text Generation Strategies — How GPT Decides What to Say

SETUP: Model outputs probability distribution over vocabulary at each step.
  Input: "The sky is"
  Output probabilities:
    blue: 0.30, clear: 0.20, beautiful: 0.15, falling: 0.05, ...

1. GREEDY SEARCH — Always pick most likely token
   Step 1: "blue" (0.30) → "The sky is blue"
   Step 2: "and" (0.25) → "The sky is blue and"
   Step 3: "blue" (0.30) → "The sky is blue and blue"  ← LOOPS!
   ✓ Fast, deterministic
   ✗ Can get stuck in repetitive loops

2. BEAM SEARCH — Keep top-K partial sequences simultaneously
   Width=2: Track 2 candidate sequences at each step:
   Step 1: Keep ["blue" (0.30), "clear" (0.20)]
   Step 2 from "blue": ["blue and" (0.3×0.25=0.075), "blue sky" (0.3×0.12=0.036)]
   Step 2 from "clear": ["clear blue" (0.2×0.22=0.044), "clear sky" (0.2×0.18=0.036)]
   Keep top 2: ["blue and" (0.075), "clear blue" (0.044)]
   ✓ Better quality than greedy
   ✗ Still can sound generic and boring

3. TOP-K SAMPLING — Sample from top-K most likely tokens
   K=5: Only consider ["blue", "clear", "beautiful", "bright", "vast"]
   Sample randomly from these 5 (weighted by probability)
   ✓ Introduces diversity/creativity
   ✗ K=5 might be too few (misses good options) or too many (K=100 includes weird tokens)

4. TOP-P (NUCLEUS) SAMPLING — Sample from tokens covering top P% probability
   P=0.9: Add tokens until cumulative probability ≥ 0.9
     blue=0.30 (total 0.30) → clear=0.20 (0.50) → beautiful=0.15 (0.65)
     → bright=0.12 (0.77) → vast=0.08 (0.85) → calm=0.06 (0.91) STOP
   Renormalize these 6 probabilities to sum to 1, then sample from them.
   ✓ Dynamic vocabulary size — automatically adjusts to model's uncertainty
   ✓ Exposed as top_p by most production LLM APIs (including OpenAI and Anthropic)

5. TEMPERATURE — Controls randomness
   T=0.0: Always pick max probability (= greedy; implementations special-case it, since dividing by 0 is undefined)
   T=1.0: Sample from original distribution (default)
   T=2.0: Flatten distribution → more random/creative
   T=0.1: Sharpen distribution → very focused/deterministic
   
   Logit scaling: adjusted_logit = raw_logit / temperature
   Then softmax to get probabilities.
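
The sampling strategies can be sketched in NumPy as follows. The probabilities are the ones from the example above, with a hypothetical final entry ("pizza") added so they sum to 1. The function names are illustrative; this is not how any particular library implements sampling.

```python
import numpy as np

vocab = ["blue", "clear", "beautiful", "bright", "vast", "calm", "falling", "pizza"]
probs = np.array([0.30, 0.20, 0.15, 0.12, 0.08, 0.06, 0.05, 0.04])

def apply_temperature(probs, temperature):
    """Divide logits by T, then re-apply softmax (T must be > 0)."""
    logits = np.log(probs) / temperature
    exp = np.exp(logits - logits.max())      # subtract max for numerical stability
    return exp / exp.sum()

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    keep = np.argsort(probs)[::-1][:k]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # include the token that crosses p
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

rng = np.random.default_rng(0)
nucleus = top_p_filter(probs, 0.9)
print("nucleus size:", np.count_nonzero(nucleus))   # 6 tokens, as in the example
print("sampled token:", rng.choice(vocab, p=nucleus))
print("max prob at T=0.1:", apply_temperature(probs, 0.1).max().round(3))
print("max prob at T=2.0:", apply_temperature(probs, 2.0).max().round(3))
```

Low temperature pushes almost all probability onto "blue", while high temperature spreads it out; top-k and top-p then decide which tokens are eligible at all.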

✅ Production Practice

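Most production LLM APIs (OpenAI, Anthropic, Together.ai) expose temperature and top_p parameters. Allowed ranges differ by provider (for example, OpenAI accepts temperature from 0 to 2 while Anthropic accepts 0 to 1), so check the API reference. Common settings: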
— Creative writing: temperature=0.9, top_p=0.95
— Code generation: temperature=0.2, top_p=0.95
— Factual Q&A: temperature=0.1
— Brainstorming: temperature=1.2, top_p=0.99

🎙️

Interview Q&A

Q

What happens when temperature=0 vs temperature=1 in GPT?

Temperature=0: The model ALWAYS picks the token with the highest probability (equivalent to greedy decoding). The output is deterministic in principle: run it 100 times and you should get the same result, although hosted APIs can still show small variations caused by floating-point and batching effects. Useful for: factual questions, code generation, tasks where you want consistency.

Temperature=1: The model samples from the probability distribution directly. Higher probability tokens are still more likely to be chosen, but lower probability tokens can also appear. Non-deterministic. Useful for: creative writing, brainstorming, generating diverse outputs.

Temperature=2: The probability distribution is flattened (more uniform). All tokens become more equally likely. Results are very random and often incoherent.

Temperature between 0 and 1 (e.g., 0.7) is common: adds some diversity while maintaining coherence.

8.1–8.10

Self-Attention, Multi-Head Attention & Contextual Embeddings

🔴

Why RNNs Failed — The Problem Attention Solves

RNN (Recurrent Neural Network) — The Old Way:

Text: "The animal didn't cross the street because it was too tired"

RNN processes LEFT TO RIGHT, one token at a time:
  "The" → h₁
  "animal" → h₂ (modified by h₁)
  "didn't" → h₃ (modified by h₂)
  ...
  "it" → h₈ (modified by h₇)
  ...
  "tired" → h₁₁ (final state)

PROBLEM 1: LONG-RANGE DEPENDENCIES (VANISHING GRADIENTS)
  Information from early tokens (h₁, h₂) gets diluted/lost
  by the time we reach h₁₁, and gradients flowing back through
  many steps shrink toward zero, so the link is hard to learn.
  "animal" information is hard to recover at "tired"
  → RNNs struggle with long-range dependencies

PROBLEM 2: SEQUENTIAL PROCESSING
  Can't process "The" and "animal" simultaneously
  Must wait: token 1 → token 2 → token 3 → ...
  → SLOW! Can't parallelize on GPUs

WHAT DOES "IT" REFER TO?
  "it" = animal (not street)
  To understand this, model needs to connect "it" (position 8)
  to "animal" (position 2), 6 tokens apart!
  RNN information of "animal" is mostly gone by "it"...

ATTENTION SOLUTION:
  Let "it" directly attend to "animal" regardless of distance!
  No information decay — any token can directly look at any other!

🎯

Real-World Analogy for Attention

Library Analogy

Imagine you're a researcher. You have a Query (your research question). The library has many books, each with a Key (the label on the spine). When you search, you compare your query against all keys. The Values are the actual content inside matching books.

Attention works the same way: for each token (the query), it looks at all tokens in the sequence, including itself (the keys), computes how relevant each is, then blends their information (values) weighted by relevance.

📖

Query, Key, Value — Deep Dive

For each token in the input, we create THREE vectors:
  Q (Query):  "What information am I looking for?"
  K (Key):    "What information do I contain?"
  V (Value):  "What information do I actually provide?"

These are created by multiplying the token embedding by
learned weight matrices:
  Q = x · Wq    (x = token embedding, Wq = learned matrix)
  K = x · Wk
  V = x · Wv

Example: "The cat sat on the mat"
Token:    "sat"
  Q_sat = "I need to know WHO sat and WHERE they sat"
  K_sat = "I contain information about the sitting action"
  V_sat = "My actual content: verb, past tense, sitting"

How "sat" attends to "cat":
  Score = Q_sat · K_cat  (dot product)
  High score → "sat" will use a lot of "cat"'s value
  Low score → "sat" will mostly ignore "cat"'s value

📐

Scaled Dot-Product Attention — The Formula

Attention(Q, K, V) = softmax( QK^T / √d_k ) × V

Q	= Query matrix: shape [n_tokens × d_k], stacks all Q vectors for all tokens
K	= Key matrix: shape [n_tokens × d_k], stacks all K vectors for all tokens
V	= Value matrix: shape [n_tokens × d_v], stacks all V vectors for all tokens
K^T	= Transpose of K matrix (flip rows and columns): shape [d_k × n_tokens]
QK^T	= Attention score matrix: shape [n_tokens × n_tokens]. Entry (i,j) = how much token i attends to token j
d_k	= dimension of each Q/K vector (e.g., 64)
√d_k	= scaling factor (prevents very large dot products causing softmax to saturate)
softmax	= converts raw scores to probabilities that sum to 1.0 for each query token
× V	= multiply the attention weights by the Value matrix to get the output

STEP-BY-STEP FOR "The cat sat" (simplified, 3 tokens, d_k=2):

TOKEN EMBEDDINGS (input):
  x_the = [1.0, 0.5]
  x_cat = [0.8, 0.9]
  x_sat = [0.6, 0.7]

WEIGHT MATRICES (learned):
  Wq = [[0.1, 0.2], [0.3, 0.1]]
  Wk = [[0.2, 0.1], [0.1, 0.3]]
  Wv = [[0.5, 0.1], [0.2, 0.5]]

COMPUTE Q, K, V:
  Q = X · Wq    (multiply each token emb by Wq)
  K = X · Wk
  V = X · Wv

ATTENTION SCORES:
  Scores = Q × K^T  (3×2) × (2×3) = (3×3) matrix
  
  Score[i,j] = how much token i should attend to token j
  (illustrative values chosen to make the pattern visible; the tiny
   toy weights above would give much smaller, nearly equal scores):
  
         the   cat   sat
  the  [0.85, 0.78, 0.62]
  cat  [0.79, 0.91, 0.74]   ← "cat" attends most to itself
  sat  [0.63, 0.88, 0.70]   ← "sat" attends most to "cat" (subject!)

SCALE by √d_k = √2 = 1.41:
  Divide all scores by 1.41

SOFTMAX (convert each row to probabilities summing to 1, rounded):
         the   cat   sat
  the  [0.36, 0.34, 0.30]
  cat  [0.33, 0.36, 0.32]
  sat  [0.31, 0.37, 0.32]   ← "sat" gives 37% weight to "cat"

OUTPUT = Attention_weights × V:
  Each token's output = weighted blend of ALL value vectors
  Output_sat = 0.31×V_the + 0.37×V_cat + 0.32×V_sat
  = "sat" gets a vector that blends info from all tokens,
    with the largest share from "cat" (the subject who sat!)
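
The whole pipeline fits in a few lines of NumPy. The sketch below uses the toy embeddings and weight matrices listed above; the function names are illustrative. With weights this small, the real scores come out small and nearly uniform, so the score table in the walkthrough is best read as an illustration of the pattern and not as the output of this code.

```python
import numpy as np

def softmax_rows(x):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / √d_k) × V"""
    d_k = K.shape[-1]
    weights = softmax_rows(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

# Toy inputs from the walkthrough: rows are "the", "cat", "sat"
X = np.array([[1.0, 0.5], [0.8, 0.9], [0.6, 0.7]])
Wq = np.array([[0.1, 0.2], [0.3, 0.1]])
Wk = np.array([[0.2, 0.1], [0.1, 0.3]])
Wv = np.array([[0.5, 0.1], [0.2, 0.5]])

Q, K, V = X @ Wq, X @ Wk, X @ Wv
raw_scores = Q @ K.T                      # (3, 3) before scaling
output, weights = scaled_dot_product_attention(Q, K, V)

print("raw scores:\n", raw_scores.round(3))
print("attention weights (rows sum to 1):\n", weights.round(3))
print("output:\n", output.round(3))
```

In a trained model the learned weights are what make some scores stand out, which is the behavior the illustrative table depicts.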

📖

Multi-Head Attention — The Power Upgrade

Instead of ONE attention function, use H parallel attention heads, each with their own Q, K, V weight matrices. Each head can focus on different aspects of the relationships between words.

MULTI-HEAD ATTENTION with H=3 heads (idealized roles; real heads are rarely this cleanly interpretable):

Input: "I saw her duck"

HEAD 1 (Syntactic attention — "who does what"):
  "saw" ←→ "I"        (subject-verb)
  "duck" ←→ "saw"     (verb-object)
  Learns: "her duck" → she owns a duck (grammatical parsing)

HEAD 2 (Coreference — "what refers to what"):
  "her" → "woman" (previously mentioned context)
  Learns: pronoun resolution

HEAD 3 (Semantic attention — "what concepts are related"):
  "saw" ←→ "duck"     (actions + animals are related somehow)
  Learns: contextual word meaning

Each head produces its own output vector.
Concatenate all head outputs → project to final output.

Multi-head output = Concat(head₁, head₂, head₃) × W_o

WHERE:
  headᵢ = Attention(X·Wqᵢ, X·Wkᵢ, X·Wvᵢ)   (X = token embeddings, as in Q = x · Wq earlier)
  Each head has its own Wqᵢ, Wkᵢ, Wvᵢ matrices
  W_o = output projection matrix

BERT-base: 12 attention heads, 64 dims each = 768 total
GPT-2:     12 attention heads, 64 dims each = 768 total  
GPT-3:     96 attention heads, 128 dims each = 12,288 total
LLaMA-3-70B: 64 attention heads
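
A compact NumPy sketch of multi-head self-attention, assuming d_model = 512 and H = 8 (so 64 dimensions per head) and random weights in place of learned ones. One d_model × d_model matrix per projection is used here, which is equivalent to stacking the per-head Wqᵢ, Wkᵢ, Wvᵢ side by side. The function name is illustrative.

```python
import numpy as np

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads):
    n_tokens, d_model = X.shape
    d_k = d_model // num_heads

    def split_heads(M):
        # (n_tokens, d_model) -> (num_heads, n_tokens, d_k)
        return M.reshape(n_tokens, num_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ Wq), split_heads(X @ Wk), split_heads(X @ Wv)

    # Each head computes its own (n_tokens × n_tokens) attention pattern
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    heads = weights @ V                                   # (num_heads, n_tokens, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n_tokens, d_model)
    return concat @ Wo, weights                           # project back to d_model

rng = np.random.default_rng(0)
n_tokens, d_model, num_heads = 4, 512, 8
X = rng.normal(size=(n_tokens, d_model))
Wq, Wk, Wv, Wo = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
                  for _ in range(4))

out, weights = multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads)
print(out.shape)       # (4, 512): same shape as the input
print(weights.shape)   # (8, 4, 4): one attention matrix per head
```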

📖

Contextual Embeddings — The Result of Attention

✅ The Big Payoff

After attention, each token has a new representation that incorporates information from ALL other tokens. This is called a contextual embedding — unlike Word2Vec which gives "bank" the same vector always, BERT/GPT give "bank" a different vector based on surrounding context.

"I went to the bank to get money" → bank vector points toward financial concepts
"The boat docked on the river bank" → bank vector points toward geographic concepts

This is the FUNDAMENTAL advantage of transformer models over static embeddings such as Word2Vec and GloVe.

🎙️

Interview Q&A — High Frequency Questions

Q

Why do we divide by √d_k in scaled dot-product attention?

The dot products in QK^T grow in magnitude as d_k increases: if the components of q and k are independent with mean 0 and variance 1, then q·k has variance d_k. With large d_k (like 64 or 128), the dot products can become very large numbers. When you put large numbers through softmax, it creates extremely peaked distributions (one token gets ~1.0, all others get ~0.0). This "saturation" means gradients through the softmax become very small and the model learns slowly. Dividing by √d_k brings the variance of the scores back to about 1 before softmax, leading to more balanced attention distributions and better gradient flow during training.
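
This can be checked empirically. The sketch below assumes query and key components drawn independently from a standard normal distribution (a simplification of what a real model produces) and measures the variance of their dot products with and without the √d_k scaling. The function name is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_variance(d_k, n_samples=20000):
    """Return (variance of q·k, variance of q·k / √d_k) for random unit-variance vectors."""
    q = rng.normal(size=(n_samples, d_k))
    k = rng.normal(size=(n_samples, d_k))
    dots = np.sum(q * k, axis=1)
    return dots.var(), (dots / np.sqrt(d_k)).var()

for d_k in (4, 64, 256):
    raw, scaled = dot_product_variance(d_k)
    print(f"d_k={d_k:>3}: var(q·k) ≈ {raw:6.1f}   var(q·k/√d_k) ≈ {scaled:.2f}")
```

The unscaled variance tracks d_k, while the scaled variance stays close to 1 regardless of head size.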

Q

⚠️ TRAP: What is the computational complexity of self-attention and why does it matter?

Self-attention has O(n²) time and memory complexity, where n = sequence length (number of tokens), because computing QK^T creates an n×n score matrix. Counting one operation per score, this means:
— 1K tokens: 1M operations
— 4K tokens: 16M operations
— 128K tokens: 16 BILLION operations

This is why extending context length is expensive. Long-context models such as GPT-4 Turbo (128K) and Claude (200K) need massive compute, and how they achieve it is not publicly documented in detail. Researchers are actively working on sub-quadratic alternatives, including linear attention and recurrent or state-space architectures (RetNet, RWKV, Mamba), that scale as O(n) in sequence length. This is one of the most active research areas in LLMs today.
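
A back-of-the-envelope sketch of the quadratic growth. It assumes one score matrix stored in 16-bit floats (2 bytes per entry) for a single head in a single layer; both helper names are illustrative.

```python
def attention_score_entries(n_tokens):
    """Number of entries in the n × n attention score matrix."""
    return n_tokens * n_tokens

def score_matrix_gib(n_tokens, bytes_per_entry=2):
    """Memory for one fully materialized score matrix, in GiB."""
    return attention_score_entries(n_tokens) * bytes_per_entry / 2**30

for n in (1_000, 4_000, 128_000):
    print(f"{n:>7} tokens: {attention_score_entries(n):>14,} scores, "
          f"{score_matrix_gib(n):8.3f} GiB per head per layer")
```

Numbers like these are why memory-efficient kernels such as FlashAttention avoid materializing the full matrix, even though the amount of computation stays quadratic.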

Q

What is the difference between self-attention and cross-attention?

Self-attention: Q, K, V all come from the SAME sequence. "I attend to myself." Used in encoders (BERT) and in decoder self-attention (GPT). In an encoder, each token can attend to every token in the sequence; in a decoder, a causal mask restricts each token to itself and earlier tokens.

Cross-attention: Q comes from one sequence, K and V come from a DIFFERENT sequence. "I attend to someone else." Used in the decoder of encoder-decoder models (T5, BART): the decoder's queries attend to the encoder's keys and values. This is how the decoder "looks at" the input when generating the output (used in translation, summarization).
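
The difference shows up directly in the shapes. In this sketch the encoder and decoder states are random stand-ins, and the same projection matrices are reused for both calls purely for brevity (real models learn separate ones for self-attention and cross-attention).

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention; returns (output, attention weights)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
d_k = 8
encoder_states = rng.normal(size=(5, d_k))   # 5 source tokens
decoder_states = rng.normal(size=(3, d_k))   # 3 target tokens generated so far
Wq, Wk, Wv = (rng.normal(size=(d_k, d_k)) for _ in range(3))

# Self-attention: Q, K, V all come from the same sequence
_, self_weights = attention(encoder_states @ Wq,
                            encoder_states @ Wk,
                            encoder_states @ Wv)

# Cross-attention: Q from the decoder, K and V from the encoder
cross_out, cross_weights = attention(decoder_states @ Wq,
                                     encoder_states @ Wk,
                                     encoder_states @ Wv)

print(self_weights.shape)    # (5, 5): every source token over every source token
print(cross_weights.shape)   # (3, 5): every target token over every source token
print(cross_out.shape)       # (3, 8): one output vector per target token
```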

📝

Practice Questions

Easy In the sentence "She gave her friend the book", which word should "her" attend to most strongly?

"Her" should attend most strongly to "She": on the most natural reading they refer to the same person (coreference). Analyses of BERT's attention (for example Clark et al., 2019) found individual heads whose attention closely tracks coreference, with pronouns like "her", "it", "they" attending strongly to their antecedents. Such attention patterns are suggestive evidence, not a guaranteed explanation of how the model resolves pronouns.

Hard In multi-head attention with H=8 heads and model dimension d_model=512, what is d_k and why?

d_k = d_model / H = 512 / 8 = 64 dimensions per head. The original "Attention is All You Need" paper uses this factorization: instead of one large attention with 512-dim Q, K, V matrices, use 8 smaller heads with 64-dim Q, K, V each. The intuition: multiple smaller attention computations, each potentially specializing in different relationship types, with total parameter count similar to one large head. After computing each head's output (64-dim), concatenate all 8 heads: 8 × 64 = 512 dimensions, which is projected back to d_model=512 via W_o.

9.1–9.6

Complete Transformer Architecture

🗺️

Full Transformer Architecture

ORIGINAL TRANSFORMER (Vaswani et al., 2017)
Used for: Machine Translation (English → German)

INPUT SIDE (ENCODER):           OUTPUT SIDE (DECODER):
"How are you?"                  "Wie geht es Ihnen?"

    ┌──────────────────┐            ┌──────────────────┐
    │  Input Tokens    │            │  Output Tokens   │
    │ [How, are, you]  │            │ [Wie, geht, ...] │
    └────────┬─────────┘            └────────┬─────────┘
             │                               │
    ┌────────▼─────────┐            ┌────────▼─────────┐
    │  Token           │            │  Token           │
    │  Embeddings      │            │  Embeddings      │
    └────────┬─────────┘            └────────┬─────────┘
             │                               │
    ┌────────▼─────────┐            ┌────────▼─────────┐
    │  Positional      │            │  Positional      │
    │  Encoding (+)    │            │  Encoding (+)    │
    └────────┬─────────┘            └────────┬─────────┘
             │                               │
    ┌────────▼─────────┐            ┌────────▼─────────┐
    │  ENCODER         │            │  DECODER         │
    │  BLOCK × N       │            │  BLOCK × N       │
    │                  │            │                  │
    │ [Multi-Head      │            │ [Masked Multi-   │
    │  Self-Attention] │            │  Head Self-Attn] │
    │        ↓         │            │        ↓         │
    │ [Add & LayerNorm]│            │ [Add & LayerNorm]│
    │        ↓         │            │        ↓         │
    │ [Feed Forward]   │    ┌───────│ [Cross-Attention]│
    │        ↓         │    │       │  (Q from decoder │
    │ [Add & LayerNorm]│    │       │   K,V from encdr)│
    └────────┬─────────┘    │       │        ↓         │
             │              │       │ [Add & LayerNorm]│
             └──────────────┘       │        ↓         │
             (encoder outputs       │ [Feed Forward]   │
              → K, V for            │        ↓         │
              cross-attention)      │ [Add & LayerNorm]│
                                    └────────┬─────────┘
                                             │
                                    ┌────────▼─────────┐
                                    │ Linear + Softmax │
                                    │  (predict next   │
                                    │   word probs)    │
                                    └──────────────────┘

📖

Positional Encoding — Teaching Transformers About Order

Self-attention has no built-in sense of order: it looks at ALL tokens simultaneously. Without position information, "cat ate mouse" and "mouse ate cat" would give each word exactly the same representation in both sentences, so the model could not tell them apart. Positional encoding adds position information to each token's embedding.

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))

PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

pos	= position of the token in the sequence (0, 1, 2, ...)
i	= dimension-pair index (0, 1, 2, ..., d_model/2 − 1)
d_model	= model embedding dimension (e.g., 512)
sin/cos	= sine and cosine functions — alternate between even and odd dimensions
10000	= base constant — creates different frequencies at different dimensions

POSITIONAL ENCODING INTUITION:
  Each position gets a unique "fingerprint" vector
  Made of sine and cosine waves at different frequencies
  
  Position 0: [sin(0), cos(0), sin(0), cos(0), ...]
            = [0,      1,      0,      1,      ...]
  
  Position 1: [sin(1), cos(1), ..., sin(≈1/10000), cos(≈1/10000)]
            = [0.84,   0.54,   ..., 0.0001,        1.0]
  (the first pair has the highest frequency; the divisor grows toward 10000 for the last pair)

WHY SINE/COSINE?
  1. Bounded: always between -1 and +1
  2. Unique: each position gets a unique pattern
  3. Relative: model can learn "position A is 3 steps before position B"
     because sin(a+b) = sin(a)cos(b) + cos(a)sin(b) — linear relationship!
  4. Unseen lengths: defined for any position, so the authors hoped it would extrapolate
     to sequences longer than those seen in training (in practice this works poorly)
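
The formula translates directly into NumPy. This sketch assumes an even d_model and uses an illustrative function name; it reproduces the position 0 and position 1 fingerprints described above.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model/2)
    angles = positions / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)              # (50, 512)
print(pe[0, :4])             # [0. 1. 0. 1.]
print(pe[1, :2].round(2))    # [0.84 0.54]
print(pe[1, -2:].round(4))   # slowest-varying pair: sin is about 0.0001, cos about 1
```

The resulting matrix is simply added to the token embeddings before the first transformer block.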

MODERN ALTERNATIVE: RoPE (Rotary Positional Embedding)
  Used by: LLaMA, Mistral, Qwen, Falcon, Phi
  Encodes position as rotation of Q and K vectors
  Attention scores depend only on the relative offset between tokens; with scaling tricks it extends well to long contexts
  RoPE(q, position) = q × rotation_matrix(position)
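
A minimal sketch of the rotation idea, assuming consecutive dimension pairs are rotated together (production implementations such as LLaMA's pair the dimensions differently, but the principle is the same). The function name `rope` is illustrative. The check at the end shows the key property: the query-key dot product depends only on the distance between positions.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate each pair (x[0], x[1]), (x[2], x[3]), ... by a position-dependent angle."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)    # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin       # standard 2D rotation
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same relative offset (2 positions apart) at two different absolute positions
score_near = rope(q, 3) @ rope(k, 1)
score_far = rope(q, 12) @ rope(k, 10)
print(np.isclose(score_near, score_far))   # True: only the offset matters
```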

📖

Feed-Forward Layer, Residual Connections & LayerNorm

TRANSFORMER BLOCK COMPONENTS:

1. FEED-FORWARD NETWORK (FFN):
   Applied to each token INDEPENDENTLY (no cross-token interaction):
   FFN(x) = max(0, x·W₁ + b₁) · W₂ + b₂
   
   Typical sizes: d_model=512, FFN inner dim=2048 (4× expansion)
   GPT-3: d_model=12288, FFN=49152
   
   What does FFN do? 
   Attention = "relate tokens to each other"
   FFN = "think about each token individually" 
   FFN layers appear to store factual knowledge (evidence: Geva et al., 2021; Meng et al., 2022)
   "Paris is the capital of ___" → knowledge stored in FFN weights

2. RESIDUAL CONNECTIONS (Skip Connections):
   Output = x + SubLayer(x)
                 ↑ original input is ADDED back
   
   Why? Prevents vanishing gradients in deep networks!
   Gradient can flow DIRECTLY through addition → no degradation
   Allows training 12, 24, 96+ layers deep

3. LAYER NORMALIZATION:
   LayerNorm(x) = γ × (x - μ)/√(σ² + ε) + β
   
   μ = mean of x's values (computed per token, across its features)
   σ² = variance of x's values
   ε = small constant for numerical stability
   γ, β = learned scale and shift parameters
   
   Stabilizes training by normalizing activations.
   Each token's vector is normalized to mean≈0, std≈1
   then rescaled by learned γ, β

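FINAL TRANSFORMER BLOCK (Post-LN style, as in the original Transformer and BERT):
  y      = LayerNorm(x + MultiHeadSelfAttention(x))
  output = LayerNorm(y + FeedForward(y))
  (each residual adds the sublayer's own input; most recent LLMs use Pre-LN instead,
   applying LayerNorm before each sublayer)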
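
Putting the three components together, a Post-LN block can be sketched in NumPy as below. This is a simplified illustration with a single attention head, random weights, and hypothetical parameter names in the `params` dictionary; it is not a drop-in replacement for a real implementation.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's vector to mean 0 / std 1, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def self_attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x·W1 + b1) · W2 + b2, applied to every token independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def post_ln_block(x, p):
    # Residual connection around attention, then LayerNorm
    y = layer_norm(x + self_attention(x, p["Wq"], p["Wk"], p["Wv"]), p["gamma1"], p["beta1"])
    # Residual connection around the FFN (adds y, the FFN's own input), then LayerNorm
    return layer_norm(y + feed_forward(y, p["W1"], p["b1"], p["W2"], p["b2"]),
                      p["gamma2"], p["beta2"])

rng = np.random.default_rng(0)
n_tokens, d_model, d_ff = 5, 16, 64          # 4× expansion in the FFN
params = {
    "Wq": rng.normal(scale=0.1, size=(d_model, d_model)),
    "Wk": rng.normal(scale=0.1, size=(d_model, d_model)),
    "Wv": rng.normal(scale=0.1, size=(d_model, d_model)),
    "W1": rng.normal(scale=0.1, size=(d_model, d_ff)), "b1": np.zeros(d_ff),
    "W2": rng.normal(scale=0.1, size=(d_ff, d_model)), "b2": np.zeros(d_model),
    "gamma1": np.ones(d_model), "beta1": np.zeros(d_model),
    "gamma2": np.ones(d_model), "beta2": np.zeros(d_model),
}

x = rng.normal(size=(n_tokens, d_model))
out = post_ln_block(x, params)
print(out.shape)   # (5, 16): a block preserves the input shape, so blocks can be stacked
```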

📖

Three Transformer Flavors

╔═══════════════╦════════════════════╦═══════════════════════╦═══════════════════════╗
║ TYPE          ║ ENCODER-ONLY       ║ DECODER-ONLY          ║ ENCODER-DECODER       ║
╠═══════════════╬════════════════════╬═══════════════════════╬═══════════════════════╣
║ Models        ║ BERT, RoBERTa,     ║ GPT (all), Claude,    ║ T5, BART, mT5,        ║
║               ║ DeBERTa, ALBERT    ║ LLaMA, Mistral,       ║ Pegasus, MarianMT     ║
║               ║                    ║ DeepSeek, Qwen        ║                       ║
╠═══════════════╬════════════════════╬═══════════════════════╬═══════════════════════╣
║ Attention     ║ Full bidirectional ║ Causal (left-only)    ║ Bidirectional encoder ║
║               ║ Every token sees   ║ Each token only sees  ║ + causal decoder      ║
║               ║ all other tokens   ║ past tokens           ║ + cross-attention     ║
╠═══════════════╬════════════════════╬═══════════════════════╬═══════════════════════╣
║ Training      ║ MLM: predict       ║ Next token prediction ║ T5: span corruption   ║
║ Objective     ║ masked tokens      ║ (autoregressive)      ║ BART: denoising       ║
╠═══════════════╬════════════════════╬═══════════════════════╬═══════════════════════╣
║ Best For      ║ Understanding      ║ Generation:           ║ Seq2Seq tasks:        ║
║               ║ tasks:             ║ chatbots, writing,    ║ translation,          ║
║               ║ classification,    ║ code, QA, everything  ║ summarization,        ║
║               ║ NER, QA extract.   ║                       ║ question gen.         ║
╚═══════════════╩════════════════════╩═══════════════════════╩═══════════════════════╝
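
The "Attention" row comes down to a mask applied before the softmax. The sketch below uses equal raw scores so the effect of the mask is easy to see; the function name is illustrative.

```python
import numpy as np

def attention_weights(scores, mask=None):
    """Softmax over each row; positions where mask is False get zero weight."""
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)   # blocked positions: exp(-inf) = 0
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

n = 4
scores = np.zeros((n, n))                          # equal raw scores for every pair

# Encoder-only (BERT style): every token sees every token
bidirectional = attention_weights(scores)

# Decoder-only (GPT style): token i sees only tokens 0..i
causal_mask = np.tril(np.ones((n, n), dtype=bool))
causal = attention_weights(scores, causal_mask)

print("bidirectional:\n", bidirectional)
print("causal:\n", causal.round(2))
```

In the causal case the first token can only attend to itself, and the upper triangle of the weight matrix is exactly zero, which is what prevents a decoder from looking at future tokens.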

💻

Hugging Face — Using Any Transformer Model

Python

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import pipeline
import torch

# ── EASY WAY: Pipeline API ───────────────────────────────
# Handles tokenization + model + postprocessing automatically

# Sentiment analysis (encoder-only: the default model is DistilBERT fine-tuned on SST-2)
sentiment = pipeline("sentiment-analysis")
result = sentiment("I absolutely love this course on NLP!")
print(result)  # [{'label': 'POSITIVE', 'score': 0.9998}]

# Text generation (decoder-only: uses GPT-2)
generator = pipeline("text-generation", model="gpt2")
result = generator(
    "The future of AI is",
    max_new_tokens=50,
    temperature=0.8,
    do_sample=True,
    top_p=0.9
)
print(result[0]['generated_text'])

# Summarization (encoder-decoder: uses BART)
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
long_text = """
Transformers were introduced in the paper "Attention Is All You Need" 
by Vaswani et al. in 2017. They use self-attention mechanisms to process
sequences in parallel, solving the vanishing gradient problem that plagued
RNNs. The architecture consists of an encoder and decoder, each containing
multiple layers of multi-head self-attention and feed-forward networks.
Transformers have since become the dominant architecture in NLP.
"""
summary = summarizer(long_text, max_length=60, min_length=20)
print("\nSummary:", summary[0]['summary_text'])

# ── MANUAL WAY: Full control ──────────────────────────────
# Load the tokenizer and the model from the SAME checkpoint so the
# vocabulary and the expected input fields match. (A BERT tokenizer
# emits token_type_ids, which DistilBERT's forward() does not accept.)
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # disable dropout for inference

text = "Transformers are amazing!"
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():  # no gradients needed for inference
    outputs = model(**inputs)
    logits = outputs.logits                  # shape: (1, num_labels)
    probs = torch.softmax(logits, dim=-1)

pred_class = probs.argmax(dim=-1).item()
labels = model.config.id2label
print(f"\nPrediction: {labels[pred_class]} ({probs[0][pred_class]:.3f})")

10.1–10.7

MLM, NSP, CLS Token & Fine-Tuning

📖

BERT Architecture Overview

BERT = Bidirectional Encoder Representations from Transformers
     = ENCODER-ONLY Transformer (uses only the Encoder stack)

BERT-base: 12 transformer encoder layers
           768 hidden dimensions
           12 attention heads
           110M parameters

BERT-large: 24 transformer encoder layers
            1024 hidden dimensions
            16 attention heads
            340M parameters

KEY INNOVATION: BIDIRECTIONAL attention
  Previous models: GPT reads left→right only; ELMo combines separate
  left→right and right→left LSTMs
  BERT: every layer attends in BOTH directions simultaneously!
  
  "The bank can guarantee deposits will eventually cover..."
  When processing "bank", BERT sees:
  ← "The" (left) AND "can guarantee deposits" (right) →
  Both directions together let BERT figure out "bank" = financial!

INPUT FORMAT:
  [CLS] token_1 token_2 ... [SEP] token_A token_B ... [SEP]
   ↑                         ↑                         ↑
  always first           separates sentences        always last

📖

MLM — Masked Language Modeling (BERT's Training Task)

BERT is trained on a "fill in the blank" task:

Original:   "The cat sat on the mat"
Masked:     "The [MASK] sat on the mat"
Task:       Predict what [MASK] is → "cat"

MASKING STRATEGY (15% of tokens are selected):
  80% replaced with [MASK]:  "The [MASK] sat" 
  10% replaced with random word: "The dog sat" (still predict "cat"!)
  10% kept unchanged: "The cat sat" (but still predict "cat"!)

Why NOT just always use [MASK]?
  At fine-tuning/inference, [MASK] never appears!
  If model only ever sees [MASK], it won't learn good representations
  for non-masked tokens. The random replacement forces the model
  to develop good representations for ALL tokens.

TRAINING OBJECTIVE:
  Loss = Cross-entropy on the selected 15% of positions ONLY
  (including the ones replaced by a random word or left unchanged)
  Don't penalize predictions on the other positions
  
RESULT: BERT learns deep bidirectional representations because
to predict [MASK], it must understand context from both sides!
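
The 80/10/10 masking rule can be sketched in plain Python. This is a simplified illustration on whitespace tokens with a tiny made-up vocabulary; real BERT preprocessing works on WordPiece tokens and never selects special tokens such as [CLS] and [SEP]. The function name `mlm_mask` is illustrative.

```python
import random

MASK = "[MASK]"

def mlm_mask(tokens, vocab, select_prob=0.15, rng=None):
    """Return (model_inputs, labels); labels[i] is None where no loss is computed."""
    rng = rng or random.Random()
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                          # not selected: no prediction target
        labels[i] = token                     # always predict the ORIGINAL token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK                  # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)     # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return inputs, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokens = ["the", "cat", "sat", "on", "the", "mat"] * 200
inputs, labels = mlm_mask(tokens, vocab, rng=random.Random(0))

selected = [i for i, label in enumerate(labels) if label is not None]
print(f"selected {len(selected) / len(tokens):.1%} of tokens as prediction targets")
print(" ".join(inputs[:12]))
```

The loss would be computed only at positions where `labels` is not None, regardless of which of the three branches was taken.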

📖

NSP — Next Sentence Prediction

NSP is BERT's second pre-training task:

Task: Given two sentences, does sentence B follow sentence A naturally?

Positive example (IsNext=True):
  Sentence A: "The man went to the store."
  Sentence B: "He bought a gallon of milk."
  Label: 1 (IsNext)

Negative example (IsNext=False):  
  Sentence A: "The man went to the store."
  Sentence B: "Penguins live in Antarctica."
  Label: 0 (NotNext) ← randomly sampled, unrelated!

50% of training pairs are IsNext, 50% are NotNext.

The [CLS] token's final representation is used to predict
IsNext/NotNext → BERT learns sentence-level coherence!

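NOTE: NSP has been questioned in later research.
RoBERTa (Facebook) removed NSP entirely and got equal or better results!
But [CLS] token's usefulness for classification tasks remains.
Label numbering is only a convention: Hugging Face's BertForNextSentencePrediction
uses 0 = IsNext and 1 = NotNext.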

📖

CLS Token — BERT's Secret Weapon for Classification

CLS TOKEN MECHANICS:

Input:     [CLS] I   love  NLP  [SEP]
Position:    0   1    2    3    4

After 12 transformer layers of attention, EACH token has
a contextual representation. The [CLS] token is special:
  • It attends to ALL other tokens (via self-attention)
  • All other tokens can also influence [CLS]
  • After training, [CLS] learns to aggregate the meaning
    of the ENTIRE SEQUENCE into one vector

During fine-tuning for classification:
  [CLS] representation (768-dim vector)
            ↓
  Linear layer (768 → num_classes)
            ↓
  Softmax → class probabilities

So for sentiment analysis:
  [CLS] "I love NLP" [SEP]
  After BERT → CLS vector = [0.23, -0.12, 0.89, ...]
  Linear layer → [0.05, 0.95] (neg, pos)
  → POSITIVE (95% confidence)
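
The [CLS] → linear → softmax path can be sketched with NumPy. The encoder output and the head weights below are random placeholders (assumptions for illustration), so the printed probabilities are meaningless; only the shapes and the flow mirror the description above. Library implementations may also apply an extra pooling layer to the [CLS] vector before the classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, num_classes = 768, 2

# Stand-in for BERT's final-layer output for "[CLS] I love NLP [SEP]" (5 tokens).
# Real values would come from the encoder; these are random placeholders.
last_hidden_state = rng.normal(size=(5, hidden_size))

cls_vector = last_hidden_state[0]            # position 0 is [CLS]

# Hypothetical, untrained classification head: 768 -> num_classes
W = rng.normal(scale=0.02, size=(hidden_size, num_classes))
b = np.zeros(num_classes)

logits = cls_vector @ W + b

def softmax(x):
    z = np.exp(x - x.max())                  # subtract max for numerical stability
    return z / z.sum()

probs = softmax(logits)
print(probs, "-> predicted class", int(probs.argmax()))
```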

📖

Fine-Tuning BERT

BERT's power comes from pre-training + fine-tuning. Pre-training on about 3.3 billion words (BooksCorpus plus English Wikipedia) gives BERT general language understanding. Fine-tuning on a small task-specific dataset adapts it to your specific problem.

FINE-TUNING STRATEGY:

PRE-TRAINED BERT (general knowledge)
  ↓ (add task-specific head)
┌─────────────────────────────────────────────────────┐
│ Task                  │ Head               │ Input   │
│──────────────────────────────────────────────────────│
│ Sentiment Analysis    │ [CLS] → linear     │ [CLS] S │
│ Named Entity Recog.   │ Each token → linear │ tokens  │
│ Question Answering    │ Start/End logits   │ Q [SEP] C│
│ Sentence Similarity   │ [CLS] → regression │ S₁[SEP]S₂│
│ Text Classification   │ [CLS] → linear     │ [CLS] S │
└─────────────────────────────────────────────────────┘

Training settings for fine-tuning:
  Learning rate: 2e-5 to 5e-5 (small! don't destroy pre-trained weights)
  Epochs: 2-4 (don't overfit on small dataset)
  Batch size: 16-32
  Sequence length: up to 512 tokens

Why so few epochs?
  BERT already knows language. You're just teaching it your
  specific task, not starting from scratch.
  Too many epochs on a small dataset → overfitting, and the model can
  drift away from its pre-trained knowledge ("catastrophic forgetting")!

💻

Fine-Tuning BERT for Sentiment Analysis

Python

# pip install transformers datasets torch
from transformers import (
    AutoTokenizer, 
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score

# ── 1. Load Dataset ───────────────────────────────────────
dataset = load_dataset("imdb")  # 25K training, 25K test reviews
print("Dataset loaded:", dataset)

# ── 2. Load Tokenizer ─────────────────────────────────────
MODEL_NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize_function(examples):
    """Tokenize texts and return input_ids, attention_mask, etc."""
    return tokenizer(
        examples['text'],
        truncation=True,      # cut off if too long
        max_length=512,       # BERT max sequence length
        padding='max_length'  # pad shorter sequences
    )

# Apply tokenization to entire dataset
tokenized = dataset.map(tokenize_function, batched=True)

# ── 3. Load BERT Model ────────────────────────────────────
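model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},   # IMDB labels: 0 = negative, 1 = positive
    label2id={"NEGATIVE": 0, "POSITIVE": 1},   # without these, the pipeline prints LABEL_0 / LABEL_1
)
# This adds a randomly initialized classification head on top of BERT's [CLS] output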

# ── 4. Define Metrics ─────────────────────────────────────
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {'accuracy': accuracy_score(labels, predictions)}

# ── 5. Training Arguments ─────────────────────────────────
training_args = TrainingArguments(
    output_dir="./bert-sentiment",
    num_train_epochs=3,          # usually 2-4 for fine-tuning
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,          # small! don't destroy pre-training
    evaluation_strategy="epoch",  # evaluate at end of each epoch (renamed eval_strategy in newer transformers releases)
    save_strategy="epoch",
    load_best_model_at_end=True,
    warmup_steps=500,            # gradually increase LR at start
    weight_decay=0.01,           # regularization
)

# ── 6. Train! ─────────────────────────────────────────────
trainer = Trainer(
    model=model,
    args=training_args,
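    # Shuffle before subsampling: the IMDB splits are ordered by label,
    # so the first N rows would all belong to one class.
    train_dataset=tokenized['train'].shuffle(seed=42).select(range(5000)),  # small for demo
    eval_dataset=tokenized['test'].shuffle(seed=42).select(range(1000)),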
    compute_metrics=compute_metrics
)

trainer.train()
# Fine-tuning takes 10-30 min on GPU, hours on CPU

# ── 7. Inference ──────────────────────────────────────────
from transformers import pipeline
sentiment_pipeline = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer
)
test_texts = [
    "This movie was absolutely fantastic!",
    "Worst film I've ever seen. Terrible.",
    "It was okay, nothing special."
]
for text in test_texts:
    result = sentiment_pipeline(text)[0]
    print(f"{result['label']} ({result['score']:.3f}): {text}")

🎙️

Interview Q&A

Q

What is the difference between BERT and GPT architecture?

BERT is encoder-only with BIDIRECTIONAL attention — each token attends to ALL other tokens (left AND right). This makes BERT great for UNDERSTANDING tasks.

GPT is decoder-only with CAUSAL (unidirectional) attention — each token only attends to PAST tokens (left only). This makes GPT great for GENERATION tasks.

Training objectives also differ: BERT uses MLM (fill in blanks) + NSP. GPT uses autoregressive next-token prediction (predict next word from all previous words).

Result: BERT better at classification, NER, extractive QA. GPT better at chat, writing, code generation, reasoning.

Q

Why can BERT only handle 512 tokens?

BERT uses absolute positional encodings — it has a learned embedding for each position (0-511). Since only positions 0-511 were trained, the model has no positional embedding for position 512+. If you try to give it a 600-token sequence, it literally doesn't know what position those extra tokens are at.

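This is also why many modern models use RoPE (Rotary Position Embedding): it encodes position by rotating query and key vectors instead of looking up a fixed table, so there is no hard-coded maximum position. In practice, going far beyond the training length still takes RoPE scaling tricks or additional long-context training. LLaMA 3.1, which uses RoPE, supports a context of up to 128K tokens.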
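
The hard limit is easy to see once the position embeddings are pictured as a lookup table with one row per trained position. The sketch below uses random placeholder values and a hypothetical helper `position_embeddings`; it only illustrates why position 512 and beyond has nothing to look up.

```python
import numpy as np

max_positions, hidden_size = 512, 768
rng = np.random.default_rng(0)

# One learned row per position 0..511 (random placeholders here)
position_table = rng.normal(size=(max_positions, hidden_size))

def position_embeddings(seq_len):
    """Return the rows for positions 0..seq_len-1, or fail if the table is too short."""
    if seq_len > max_positions:
        raise ValueError(
            f"sequence length {seq_len} exceeds {max_positions} learned positions"
        )
    return position_table[:seq_len]

print(position_embeddings(128).shape)
try:
    position_embeddings(600)
except ValueError as err:
    print("Error:", err)
```

This is why long inputs to BERT are truncated or split into windows of at most 512 tokens.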

11.1–11.7

GPT Architecture, In-Context Learning & Chat Models

🗺️

GPT Architecture

GPT = Decoder-only Transformer (stacks ONLY decoder blocks)

TOKENS:  "The" "cat"  "sat"  "on"   "the"   "mat"
MASKS:   ─────────────────────────────────────────
         ✓     ✓✓     ✓✓✓   ✓✓✓✓  ✓✓✓✓✓  ✓✓✓✓✓✓
         (1)   (1,2)  (1-3) (1-4)  (1-5)  (1-6)

CAUSAL MASKING = "Autoregressive mask":
  When processing "sat", it can ONLY see:
    → "The" (position 1) ✓
    → "cat" (position 2) ✓  
    → "sat" (position 3) ✓ (itself)
    → "on"  (position 4) ✗ BLOCKED! Future token!
    → "the" (position 5) ✗ BLOCKED!
    → "mat" (position 6) ✗ BLOCKED!

WHY CAUSAL MASKING?
  During TRAINING: "The cat sat on the mat" is given
  GPT is trained to predict: cat|The, sat|The cat, on|The cat sat...
  Without masking, GPT could "cheat" by looking at future words!
  
  During INFERENCE: We don't HAVE future tokens — we're generating them!
  So causal masking matches inference reality.

GENERATION PROCESS (AUTOREGRESSIVE):
  Input: "The cat"
  Step 1: Generate next token → "sat" (append to input)
  Input: "The cat sat"
  Step 2: Generate next token → "on" (append)
  Input: "The cat sat on"
  Step 3: Generate "the" → and so on...
  STOP when an end-of-sequence (EOS) token is generated
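
The causal mask and the generation loop above can be sketched as follows. The attention scores are random and `NEXT` is a hypothetical lookup table standing in for a real model's next-token prediction; both are assumptions for illustration only.

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular mask: row i is True for positions 0..i."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -np.inf)       # block future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

tokens = ["The", "cat", "sat", "on", "the", "mat"]
mask = causal_mask(len(tokens))
rng = np.random.default_rng(0)
attn = masked_softmax(rng.normal(size=(6, 6)), mask)
print(mask.astype(int))
print(attn[2].round(2))   # row for "sat": weights on "on", "the", "mat" are 0

# Toy autoregressive loop: append one token at a time until EOS.
NEXT = {"The": "cat", "cat": "sat", "sat": "on",
        "on": "the", "the": "mat", "mat": "<eos>"}

def generate(prompt, max_new_tokens=10):
    out = list(prompt)
    for _ in range(max_new_tokens):
        nxt = NEXT[out[-1]]
        if nxt == "<eos>":
            break
        out.append(nxt)
    return out

print(generate(["The", "cat"]))
```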

💡

In-Context Learning — GPT's Emergent Magic

One of the Most Important Discoveries in AI

In-Context Learning (ICL) means GPT can learn to do a new task just from examples given in the prompt — without updating any weights. You don't need to fine-tune. Just show examples in the prompt and the model follows the pattern.

ZERO-SHOT:
  Prompt: "Translate English to French: Hello"
  GPT: "Bonjour"  ← No examples given, just instruction

ONE-SHOT:
  Prompt: "Translate English to French:
           English: Good night → French: Bonne nuit
           English: Hello →"
  GPT: "Bonjour"

FEW-SHOT:
  Prompt: "Classify sentiment (POSITIVE/NEGATIVE):
           'I loved it!' → POSITIVE
           'Terrible food' → NEGATIVE
           'Best movie ever!' →"
  GPT: "POSITIVE"

WHY DOES THIS WORK?
  GPT has seen BILLIONS of examples during training.
  It has learned patterns like: "question → answer",
  "English: X → French: Y", "input: X, output: Y".
  Given a few examples, it recognizes the pattern and continues it.
  This is why bigger models (GPT-4 vs GPT-2) are better at ICL —
  they've compressed more patterns from more data.
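
Zero-shot, one-shot, and few-shot prompts differ only in how many worked examples are placed before the query. A small sketch makes that explicit; `build_prompt` is a hypothetical helper and the arrow format simply mirrors the prompts shown above.

```python
def build_prompt(instruction, examples, query):
    """Assemble an in-context learning prompt; an empty examples list gives zero-shot."""
    lines = [instruction]
    for text, label in examples:
        lines.append(f"'{text}' → {label}")
    lines.append(f"'{query}' →")        # the model is expected to continue from here
    return "\n".join(lines)

instruction = "Classify sentiment (POSITIVE/NEGATIVE):"
examples = [("I loved it!", "POSITIVE"), ("Terrible food", "NEGATIVE")]

zero_shot = build_prompt(instruction, [], "Best movie ever!")
few_shot = build_prompt(instruction, examples, "Best movie ever!")
print(few_shot)
```

No weights change between the two prompts; only the text the model conditions on does.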

📖

RLHF — How GPT Becomes ChatGPT

Raw GPT → text completion model (predicts next token)
          "Tell me how to make a bomb" → GPT just continues the text!

ChatGPT = GPT + RLHF (Reinforcement Learning from Human Feedback)

STAGE 1: SUPERVISED FINE-TUNING (SFT)
  Human trainers write ideal prompt-response pairs:
  Prompt: "Explain quantum physics"
  Response: "Quantum physics is the branch of physics..."
  Fine-tune GPT on these demonstrations.

STAGE 2: REWARD MODEL TRAINING
  Show humans multiple GPT outputs for same prompt.
  Humans rank them: Response A > Response C > Response B
  Train a "reward model" to predict human preferences.

STAGE 3: RL OPTIMIZATION (PPO)
  Generate responses → reward model scores them
  → PPO algorithm updates GPT to maximize reward
  → GPT learns to produce responses humans prefer

RESULT: GPT that follows instructions, refuses harmful requests,
        stays on topic, is helpful, harmless, and honest!

Claude (Anthropic) combines RLHF with Constitutional AI:
  A set of principles ("be helpful, harmless, honest") guides the model
  The model critiques and revises its own outputs against the constitution
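
Stage 2 does not say how rankings become a training signal. One commonly described approach, assumed here, is to split each ranking into (preferred, rejected) pairs and minimize a pairwise logistic loss on the reward difference. The reward values below are made up.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)): small when the chosen response scores higher."""
    diff = reward_chosen - reward_rejected
    return math.log1p(math.exp(-diff))

# A ranking A > C > B expands into the pairs (A, C), (A, B), (C, B).
scores = {"A": 2.0, "B": -1.0, "C": 0.5}      # hypothetical reward-model outputs
pairs = [("A", "C"), ("A", "B"), ("C", "B")]

loss = sum(pairwise_preference_loss(scores[w], scores[l]) for w, l in pairs) / len(pairs)
print(f"mean pairwise loss: {loss:.4f}")
```

The loss shrinks as the reward model scores preferred responses further above rejected ones, which is the behaviour Stage 3 then optimizes against.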

📖

GPT Versions — Evolution

Model	Year	Params	Key Innovation
GPT-1	2018	117M	First GPT: decoder-only transformer pre-trained on BooksCorpus
GPT-2	2019	1.5B	Zero-shot task performance; OpenAI initially withheld it as "dangerous"
GPT-3	2020	175B	Few-shot learning emerges; in-context learning discovered
InstructGPT	2022	175B	RLHF applied — follows instructions, much safer
ChatGPT	2022	~175B	Chat interface + RLHF; 100M users in 2 months
GPT-4	2023	undisclosed (rumored ~1.8T)	Multimodal (image input), expert-level reasoning; 8K/32K context at launch, 128K with GPT-4 Turbo
GPT-4o	2024	—	Omni-modal: text, audio, vision in single model

💻

GPT API Usage + Streaming

Python

# pip install openai
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY env variable

# ── Basic Chat Completion ──────────────────────────────────
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are an expert NLP teacher."},
        {"role": "user",   "content": "Explain transformers in 2 sentences."}
    ],
    temperature=0.7,
    max_tokens=200
)
print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")

# ── Few-Shot Prompting ─────────────────────────────────────
few_shot_messages = [
    {"role": "system", "content": "You classify sentiment."},
    {"role": "user",   "content": "I loved this movie!"},
    {"role": "assistant", "content": "POSITIVE"},
    {"role": "user",   "content": "Terrible service."},
    {"role": "assistant", "content": "NEGATIVE"},
    {"role": "user",   "content": "It was okay I guess."},
]
resp = client.chat.completions.create(model="gpt-4o-mini", messages=few_shot_messages)
print("Few-shot result:", resp.choices[0].message.content)  # output varies; only POSITIVE/NEGATIVE were demonstrated

# ── Streaming Response ─────────────────────────────────────
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about NLP."}],
    stream=True
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

# ── Using open-source GPT-style model (local) ─────────────
# pip install transformers torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

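model_id = "microsoft/phi-2"  # small 2.7B GPT-style model
device = "cuda" if torch.cuda.is_available() else "cpu"
# float16 halves memory on GPU; fall back to float32 on CPU,
# where half precision is slow or unsupported for some ops
dtype = torch.float16 if device == "cuda" else torch.float32
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype).to(device)

prompt = "Transformers in NLP are"
inputs = tokenizer(prompt, return_tensors="pt").to(device)  # inputs must be on the same device as the model
outputs = model.generate(**inputs, max_new_tokens=100, temperature=0.8, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))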

🔗

Most Modern LLMs Are GPT-Style Decoder-Only Transformers

GPT (OpenAI)

The original. Pre-train on next-token prediction → RLHF → ChatGPT. GPT-4 is multimodal; its size is undisclosed (rumored ~1.8T params with an MoE architecture).

Claude (Anthropic)

Decoder-only transformer. Constitutional AI instead of pure RLHF. Claude 3.5 Sonnet: 200K context, strong reasoning and coding.

LLaMA (Meta)

Open-weights GPT-style model. LLaMA 3: 8B/70B/405B. Uses RoPE, GQA (Grouped Query Attention), SwiGLU activation.

DeepSeek

Decoder-only. DeepSeek V3 (671B MoE, 37B active). DeepSeek R1 adds chain-of-thought reasoning. MLA (Multi-head Latent Attention) to reduce KV cache.

Qwen (Alibaba)

Decoder-only, heavily optimized for Chinese+English. Qwen2.5: 0.5B to 72B. Uses GQA, RoPE, strong code and math capabilities.

Kimi (Moonshot AI)

Decoder-only optimized for very long context (1M tokens). Strong at document analysis, research tasks, long-form Chinese content.

🎙️

Interview Q&A

Q

What is the difference between ChatGPT and GPT-4?

GPT-4 is a language model: a decoder-only transformer pre-trained on next-token prediction. ChatGPT is a product built on top of GPT-3.5 or GPT-4 models that have been further tuned with supervised fine-tuning and RLHF to be conversational, helpful, and safe. A purely pre-trained base model would just continue any text given to it (including harmful text); note that the GPT-4 models exposed through OpenAI's API are themselves already instruction-tuned. ChatGPT has been tuned to refuse harmful requests, follow instructions, maintain conversation context, and give helpful responses. Think of the pre-trained model as the engine and ChatGPT as the car with safety features and a steering wheel.

Q

⚠️ TRAP: Does GPT actually "understand" what it generates?

This is philosophically contested. The practical answer: GPT learns statistical patterns over text and predicts which tokens are most likely given the context. It has no direct real-world grounding, and whether it has anything like beliefs or intentions is disputed. However, GPT-4 exhibits behaviors (code debugging, math reasoning, analogical reasoning) that look remarkably like understanding. There is no consensus. A defensible interview answer: GPT has learned compressed statistical representations of human knowledge that produce outputs functionally similar to understanding, while whether the underlying mechanism amounts to genuine comprehension remains an open question.

12.1–12.6

BM25, Dense Retrieval, Vector Databases & ANN

📖

Sparse vs Dense Retrieval

SPARSE RETRIEVAL (BM25, TF-IDF):
  Query:  "What is machine learning?"
  Method: Count keyword overlaps between query and documents
  Finds:  Docs containing "machine", "learning" (exact match)
  Misses: "What is ML?" → "ML" ≠ "machine learning" in keyword space
  
  Vector: [0, 0, 1, 0, 0, 1, 0, ...] ← mostly zeros (sparse)

DENSE RETRIEVAL (Neural Embeddings):
  Query:  "What is machine learning?"
  Method: Embed query → find closest embedding vectors
  Finds:  Docs about "ML", "AI training", "supervised algorithms"
          even if they never use the exact phrase "machine learning"!
  
  Vector: [0.23, -0.12, 0.87, ...] ← dense (almost every value is non-zero)

HYBRID RETRIEVAL (Best of both):
  Score = α × BM25_score + (1-α) × Dense_score
  (normalize both scores to the same range first; raw BM25 scores are unbounded)
  Use BM25 for keyword precision + dense for semantic recall
  Used by: Elasticsearch, Weaviate, Qdrant in production

WHEN TO USE WHICH:
  Keyword search (exact product names, IDs): BM25 wins
  Semantic search (meaning, paraphrase): Dense wins
  General enterprise search: Hybrid wins
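
The weighted hybrid score can be sketched as below. Min-max normalization is one simple way to put BM25 and dense scores on the same scale and is an assumption here, as are the example scores; production systems may use other fusion schemes.

```python
def min_max(scores):
    """Rescale a {doc: score} dict to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_scores(bm25, dense, alpha=0.5):
    """alpha weights the keyword score, (1 - alpha) the dense score."""
    bm25_n, dense_n = min_max(bm25), min_max(dense)
    docs = set(bm25_n) | set(dense_n)
    return {d: alpha * bm25_n.get(d, 0.0) + (1 - alpha) * dense_n.get(d, 0.0)
            for d in docs}

bm25 = {"doc1": 12.4, "doc2": 3.1, "doc3": 0.0}       # made-up BM25 scores
dense = {"doc1": 0.62, "doc2": 0.81, "doc3": 0.40}    # made-up cosine similarities

ranked = sorted(hybrid_scores(bm25, dense, alpha=0.5).items(),
                key=lambda kv: kv[1], reverse=True)
for doc, score in ranked:
    print(f"{doc}: {score:.3f}")
```

Raising `alpha` favours exact keyword matches; lowering it favours semantic similarity.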

📖

BM25 — The Gold Standard Sparse Retrieval

BM25(d, q) = Σ IDF(qᵢ) × [f(qᵢ,d) × (k₁+1)] / [f(qᵢ,d) + k₁×(1-b+b×|d|/avgdl)]

qᵢ	= each query term
f(qᵢ,d)	= frequency of term qᵢ in document d
|d|	= length of document d (in words)
avgdl	= average document length in the corpus
k₁	= term saturation parameter (typically 1.2-2.0). Controls how much repeated terms boost score.
b	= length normalization (typically 0.75). Higher b = more penalty for long documents.
IDF	= Inverse Document Frequency (same idea as in TF-IDF: rare terms score higher; BM25 typically uses a smoothed variant of the formula)

Why BM25 Beats TF-IDF

BM25 adds two improvements over TF-IDF: (1) Term frequency saturation: mentioning a word 20 times vs 10 times doesn't double the score; there are diminishing returns. (2) Document length normalization: a short document that mentions "machine learning" once is likely more relevant than a 10-page document that mentions it once in passing.
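
The formula translates almost directly into code. The sketch below assumes whitespace tokenization and one common smoothed IDF variant (the notes do not specify which IDF form is meant); `bm25_scores` is a hypothetical helper name.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in docs against query with BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs
    df = Counter(term for d in tokenized for term in set(d))   # document frequency

    def idf(term):
        # Smoothed IDF variant (an assumption); always non-negative
        return math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            f = tf[term]
            denom = f + k1 * (1 - b + b * len(d) / avgdl)
            score += idf(term) * f * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = [
    "machine learning is fun",
    "deep learning is a branch of machine learning research",
    "deep sea fishing guide",
]
print(bm25_scores("machine learning", docs))
```

Setting `b=0` switches off length normalization, which makes the saturation effect easy to inspect: doubling a term's count raises the score by far less than a factor of two.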

📖

Vector Databases — Storing and Searching Embeddings

VECTOR DATABASE WORKFLOW:

INDEXING PHASE (one-time):
  Documents → Embedding Model → Vectors → Store in Vector DB
  "Paris is in France" → [0.23, -0.12, 0.87, ...] → stored at id=1

QUERY PHASE (real-time):
  User Query → Embedding Model → Query Vector
  "Where is Paris?" → [0.21, -0.10, 0.89, ...]
  
  Vector DB: Find k nearest vectors to query vector
  Returns: Top-5 most similar document IDs + scores

NAIVE APPROACH — Exact Nearest Neighbor:
  Compare query vector to EVERY vector in database
  100M docs × 1536 dims ≈ 154 BILLION multiply-adds per query
  At 1 ns per operation ≈ 154 seconds per query 😱 WAY TOO SLOW!

ANN — Approximate Nearest Neighbor:
  Sacrifice tiny bit of accuracy for 100-1000× speed gain
  HNSW, IVF, PQ — different algorithms to find "good enough" neighbors
  Typical: find 95-99% of true nearest neighbors in milliseconds!
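
For reference, the naive exact search is only a few lines of NumPy. The sketch uses a small random matrix as a stand-in for real embeddings (an assumption); its cost grows linearly with the number of stored vectors, which is the bottleneck ANN indexes avoid.

```python
import numpy as np

def exact_top_k(query, vectors, k=5):
    """Exact cosine-similarity search: scores every stored vector, O(n * d)."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q                              # one dot product per stored vector
    top = np.argsort(-sims)[:k]               # indices of the k highest similarities
    return top, sims[top]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 64))       # random stand-in for a document index
query = vectors[42] + 0.01 * rng.normal(size=64)   # slightly perturbed copy of vector 42

ids, sims = exact_top_k(query, vectors, k=5)
print(ids, sims.round(3))
```

Exact search like this is also useful as ground truth when measuring the recall of an ANN index.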

📖

HNSW — The Standard ANN Algorithm

HNSW = Hierarchical Navigable Small World

STRUCTURE: Multi-layer graph (like a highway system)
  Layer 2 (highway): Few nodes, long jumps
    A ─────────────────────── E
    
  Layer 1 (roads): More nodes, medium connections
    A ──── B ──── C ──── D ── E
    
  Layer 0 (streets): All nodes, local connections
    A ─ B ─ C ─ D ─ E ─ F ─ G ─ H ─ I

SEARCH ALGORITHM:
  1. Start at top layer (Layer 2), find closest node to query
  2. Drop down to Layer 1, search around that node
  3. Drop down to Layer 0, do fine-grained local search
  4. Return k nearest neighbors found

Like navigating a city: take the highway to the right
neighborhood, then local streets to the exact address.

PERFORMANCE:
  Index build: O(n log n)
  Search: ≈ O(log n) on average (empirical, not a worst-case guarantee)
  Memory: O(n × (d + M))  where d = vector dimension, M = neighbors per node

USED BY: Faiss (Facebook), Weaviate, Qdrant, Pinecone,
         Chroma, Milvus — all major vector databases!
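
The layer-by-layer descent can be illustrated with a deliberately simplified toy: nine points on a line, three hand-built layers, and a purely greedy walk. The layer assignment and 1-D coordinates are assumptions for the example; real HNSW assigns layers randomly and keeps a candidate list rather than a single current node.

```python
# Nine points A..I placed at coordinates 0..8 on a line
coords = {name: float(i) for i, name in enumerate("ABCDEFGHI")}

def chain(names):
    """Link each node to its left and right neighbour within one layer."""
    graph = {n: [] for n in names}
    for a, b in zip(names, names[1:]):
        graph[a].append(b)
        graph[b].append(a)
    return graph

# Top layer is sparse (long jumps), bottom layer contains every node
layers = [chain("AE"), chain("ACEGI"), chain("ABCDEFGHI")]

def greedy_search(query, entry="A"):
    """Walk towards the query on each layer, then drop to the next layer down."""
    current = entry
    for graph in layers:
        improved = True
        while improved:
            improved = False
            for nb in graph[current]:
                if abs(coords[nb] - query) < abs(coords[current] - query):
                    current, improved = nb, True
    return current

print(greedy_search(6.2))   # nearest point to 6.2 is G (coordinate 6)
```

The top layers cover most of the distance in a couple of hops, and the bottom layer only refines the result locally.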

📖

Popular Vector Databases

Database	Type	Best For	Notes
Chroma	Open source, local	Prototyping, local dev	Easiest to start with, Python-native, stores on disk
Faiss	Library (Meta)	Research, large scale	Not a full DB, just ANN library. Very fast. Used internally at Meta.
Pinecone	Managed cloud	Production RAG	Fully managed, easy API, expensive at scale
Weaviate	Open source/cloud	Hybrid search	Built-in BM25 + dense hybrid, GraphQL API
Qdrant	Open source/cloud	High performance	Rust-based, very fast, good filtering support
pgvector	PostgreSQL extension	Existing Postgres users	Add vector search to your existing Postgres DB

💻

Building Semantic Search with Chroma

Python

# pip install chromadb sentence-transformers
import chromadb
from sentence_transformers import SentenceTransformer

# ── 1. Setup ──────────────────────────────────────────────
client = chromadb.Client()  # in-memory for demo
# For persistence: chromadb.PersistentClient(path="./my_db")

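collection = client.create_collection(
    "nlp_knowledge_base",
    metadata={"hnsw:space": "cosine"},  # default metric is L2; with cosine, 1 - distance is a similarity
)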
model = SentenceTransformer("all-MiniLM-L6-v2")

# ── 2. Index Documents ────────────────────────────────────
documents = [
    "BERT is an encoder-only transformer model trained with MLM.",
    "GPT uses autoregressive decoder-only architecture.",
    "Transformers use self-attention to process sequences in parallel.",
    "BPE tokenization splits rare words into subword pieces.",
    "RAG combines retrieval with language model generation.",
    "Word2Vec learns word embeddings using context window prediction.",
    "LLaMA is Meta's open-source large language model.",
    "Fine-tuning adapts pre-trained models to specific tasks.",
]
embeddings = model.encode(documents).tolist()

collection.add(
    documents=documents,
    embeddings=embeddings,
    ids=[f"doc_{i}" for i in range(len(documents))]
)
print(f"Indexed {collection.count()} documents")

# ── 3. Semantic Search ────────────────────────────────────
queries = [
    "How does BERT learn language representations?",
    "What is the architecture of GPT models?",
    "How are words split into tokens?",
]

for query in queries:
    query_emb = model.encode([query]).tolist()
    results = collection.query(
        query_embeddings=query_emb,
        n_results=2
    )
    print(f"\nQuery: '{query}'")
    for doc, dist in zip(results['documents'][0], results['distances'][0]):
        sim = 1 - dist  # cosine similarity; valid only when the collection uses cosine distance
        print(f"  [{sim:.3f}] {doc}")

🎙️

Interview Q&A

Q

What is the difference between semantic search and keyword search?

Keyword search (BM25) finds documents that contain the exact words from your query. It's fast and precise but misses synonyms and paraphrases. Semantic search converts the query and documents to dense vectors and finds those with similar meaning: it can match "automobile" to a query about "cars", or "cardiac arrest" to a query about "heart attack". Both scale to large corpora: keyword search uses an inverted index, so its cost depends on the posting lists of the query terms rather than on corpus size, and semantic search uses an ANN index such as HNSW with roughly O(log n) search. In practice, hybrid search (combining both) often outperforms either alone: keyword search catches exact matches, semantic search catches paraphrases.

Q

How do you handle document chunking for RAG?

Chunking strategy significantly impacts RAG quality. Key decisions:
1) Chunk size: 256-512 tokens is common. Too small = loses context. Too large = includes irrelevant info in the retrieved chunk.
2) Overlap: Add 10-20% overlap between chunks so context isn't lost at boundaries.
3) Strategy: Fixed-size (simple, predictable), sentence-based (preserves logical units), semantic (split on topic changes).
4) Metadata: Store source URL, page number, and section header alongside each chunk, and pass them to the LLM for citations.
Best practice: experiment with chunk sizes for your specific domain. Code usually needs larger chunks, FAQ answers need smaller ones.
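
Fixed-size chunking with overlap can be sketched as follows. The helper `chunk_tokens` is hypothetical, and the "tokens" are plain list items standing in for real tokenizer output.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=32):
    """Split tokens into windows of chunk_size that share `overlap` tokens with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break                      # the last window already reaches the end
    return chunks

tokens = [f"t{i}" for i in range(600)]          # stand-in for a tokenized document
chunks = chunk_tokens(tokens, chunk_size=256, overlap=32)
print([len(c) for c in chunks])
```

An overlap of 32 on a 256-token chunk is 12.5%, inside the 10-20% range suggested above.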

13.1–13.6

Complete RAG Pipeline

💡

Why RAG? The Problem It Solves

❌ Problems with Pure LLMs

1. Knowledge cutoff: Every model's training data stops at some date (April 2023 for the first GPT-4 Turbo release, for example). It doesn't know about events after that date.
2. Hallucination: LLMs confidently state incorrect facts when they don't know the answer.
3. Private data: LLMs don't have access to your company's documents, databases, or proprietary knowledge.
4. No citations: Can't easily trace WHERE the information came from.

✅ RAG Solution

RAG = Retrieval-Augmented Generation. First retrieve relevant documents from a knowledge base, then augment the LLM's prompt with those documents, then generate an answer grounded in retrieved evidence. Ideally the LLM only needs to reason and summarize, and the facts come from the retrieved documents.

🗺️

Full RAG Pipeline Architecture

INDEXING PHASE (offline, one-time):
─────────────────────────────────────────────────────────────
Documents (PDFs, websites, docs)
   │
   ▼
[Chunking] → Split into 256-512 token overlapping chunks
   │
   ▼
[Embedding Model] → Each chunk → dense vector
   │
   ▼
[Vector Database] → Store (chunk_text, vector, metadata)
─────────────────────────────────────────────────────────────

QUERY PHASE (real-time, each user query):
─────────────────────────────────────────────────────────────
User Query: "What is LLaMA's context window?"
   │
   ▼
[Query Embedding] → Query → dense vector via same embedding model
   │
   ▼
[Retrieval]
  ├── Dense: Find top-k chunks by cosine similarity
  └── Sparse: BM25 keyword matching (optional)
  → Merge, rerank, return top-5 most relevant chunks
   │
   ▼
[Context Assembly]:
  System: "Answer using ONLY the provided context."
  Context: [chunk1: "LLaMA 3.1 supports 128K tokens..."]
            [chunk2: "Meta released LLaMA 3 with..."]
  Question: "What is LLaMA's context window?"
   │
   ▼
[LLM Generation] (GPT-4, Claude, LLaMA, etc.)
   │
   ▼
Answer: "LLaMA 3.1 supports a context window of 128,000 tokens
         according to the documentation. [Source: chunk1]"
─────────────────────────────────────────────────────────────

📖

Extractive vs Abstractive QA

EXTRACTIVE QA (BERT-style):
  Context: "The Eiffel Tower was completed in 1889 in Paris."
  Question: "When was the Eiffel Tower completed?"
  
  Model output: [start_pos=6, end_pos=6] → "1889"  (0-indexed word positions)
  Just EXTRACTS a span from the context. No generation!
  
  Use BERT + span prediction head.
  Fast and factual: it cannot invent text that is not in the context
  (though it can still select the wrong span).
  Limitation: Can't synthesize across multiple paragraphs.

ABSTRACTIVE QA (GPT-style):
  Context: "The Eiffel Tower was completed in 1889..."
  Question: "When was the Eiffel Tower completed?"
  
  Model GENERATES: "The Eiffel Tower was completed in the year 1889,
                    during the World Fair in Paris."
  
  Can synthesize, paraphrase, summarize multiple sources.
  More flexible but can hallucinate.

RAG QA = Abstractive QA + Retrieved Context:
  Best of both worlds:
  → Retrieved context grounds the generation (reduces hallucination)
  → LLM generation allows synthesis across multiple chunks

💻

Production RAG Pipeline — Full Code

Python

# pip install chromadb sentence-transformers openai
import chromadb
from sentence_transformers import SentenceTransformer
from openai import OpenAI

# ── INDEXING PHASE ────────────────────────────────────────
KNOWLEDGE_BASE = [
    "LLaMA 3.1 by Meta supports a 128,000 token context window and comes in 8B, 70B, and 405B parameter sizes.",
    "Claude 3.5 Sonnet by Anthropic has a 200,000 token context window and excels at coding and reasoning tasks.",
    "GPT-4o by OpenAI is multimodal (text, image, audio) with a 128,000 token context window.",
    "DeepSeek V3 uses a Mixture-of-Experts architecture with 671B total parameters but only 37B active per token.",
    "RAG (Retrieval Augmented Generation) reduces hallucinations by grounding LLM responses in retrieved documents.",
    "BPE tokenization was introduced for NMT by Sennrich et al. in 2016 and is used by GPT models.",
    "BERT uses WordPiece tokenization and a vocabulary of 30,000 tokens. It processes text bidirectionally.",
    "Transformers were introduced in 'Attention Is All You Need' by Vaswani et al. at Google in 2017.",
    "Qwen 2.5 by Alibaba supports both English and Chinese and comes in sizes from 0.5B to 72B parameters.",
    "Fine-tuning adapts a pre-trained model to a specific task using a small labeled dataset.",
]

embed_model = SentenceTransformer("all-MiniLM-L6-v2")
chroma = chromadb.Client()
collection = chroma.create_collection("llm_knowledge")

# Embed and store all documents
embeddings = embed_model.encode(KNOWLEDGE_BASE).tolist()
collection.add(
    documents=KNOWLEDGE_BASE,
    embeddings=embeddings,
    ids=[f"doc_{i}" for i in range(len(KNOWLEDGE_BASE))]
)
print(f"✓ Indexed {collection.count()} documents into vector store")

# ── RAG QUERY FUNCTION ────────────────────────────────────
openai_client = OpenAI()

def rag_query(question: str, top_k: int = 3) -> str:
    """
    Full RAG pipeline:
    1. Embed the question
    2. Retrieve top-k similar chunks
    3. Build prompt with retrieved context
    4. Generate answer with LLM
    """
    # Step 1: Embed query
    query_emb = embed_model.encode([question]).tolist()
    
    # Step 2: Retrieve top-k documents
    results = collection.query(
        query_embeddings=query_emb,
        n_results=top_k
    )
    retrieved_docs = results['documents'][0]
    
    # Step 3: Build augmented prompt
    context = "\n".join([f"[{i+1}] {doc}" for i, doc in enumerate(retrieved_docs)])
    
    system_prompt = """You are a helpful AI assistant. Answer questions based ONLY 
on the provided context. If the context doesn't contain enough information, 
say 'I don't have enough information in my knowledge base to answer this.'
Always cite which context number [1], [2], [3] supports your answer."""
    
    user_prompt = f"""Context:
{context}

Question: {question}

Answer:"""
    
    # Step 4: Generate answer
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user",   "content": user_prompt}
        ],
        temperature=0.1  # low temperature for factual tasks
    )
    return response.choices[0].message.content

# ── DEMO ──────────────────────────────────────────────────
questions = [
    "What context window does Claude 3.5 support?",
    "Which models use Mixture of Experts?",
    "How does RAG help reduce hallucinations?",
]

for q in questions:
    print(f"\n{'='*60}")
    print(f"Q: {q}")
    answer = rag_query(q)
    print(f"A: {answer}")

📖

Advanced RAG Techniques

HyDE

Hypothetical Document Embeddings: Generate a hypothetical answer first, embed THAT, then search. Better semantic match.

Reranking

After BM25/dense retrieval, use a cross-encoder (like BGE-Reranker) to re-score top-100 → keep top-5. More accurate.

Parent-Child Chunks

Store small chunks for retrieval but return parent (larger) context. Best of both: precise retrieval, rich context.

Multi-Query

Rephrase the user query into 3-5 variants, retrieve for all, merge results. Reduces single-query biases.

FLARE

Forward-Looking Active Retrieval: Retrieve on-demand when model is uncertain during generation, not just once upfront.

Agentic RAG

Let the LLM decide WHEN and WHAT to retrieve using tool calls. Multiple retrieval steps in one answer session.
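
Several of these techniques (Multi-Query, hybrid BM25 + dense retrieval) end with a "merge the result lists" step without naming a method. Reciprocal rank fusion (RRF) is one common choice and is assumed here; the document IDs are made up, and k=60 is the conventional default constant.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each document scores the sum of 1 / (k + rank) over all lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Results for three rephrasings of the same user query (hypothetical IDs)
rankings = [
    ["doc3", "doc1", "doc7"],
    ["doc1", "doc3", "doc9"],
    ["doc1", "doc4", "doc3"],
]
fused = reciprocal_rank_fusion(rankings)
print(fused)
```

RRF uses only ranks, so it needs no score normalization across retrievers.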

🎙️

Interview Q&A

Q

What are the main failure modes of RAG systems?

1. Retrieval failure: Wrong chunks retrieved because query and documents use different vocabulary. Fix: Hybrid search, HyDE, query expansion.
2. Context window overflow: Too many retrieved chunks exceed the context limit. Fix: Rerank and trim.
3. Lost in the middle: LLMs pay more attention to context at the beginning and end, so middle chunks tend to be ignored. Fix: Pass fewer chunks and place the most relevant ones at the start or end of the context.
4. Hallucination despite context: LLM ignores context and generates from parametric memory. Fix: Stronger system prompt, use temperature=0.
5. Chunking boundary artifacts: A key sentence cut across two chunks. Fix: Overlap chunks, semantic chunking.
6. Stale knowledge base: Documents not updated. Fix: Incremental indexing pipelines.

Q

When would you NOT use RAG?

RAG is not always the right choice:
1. The task is pure reasoning (math, coding logic), so no external knowledge is needed.
2. Very small knowledge base that fits in the context window: just include it all directly.
3. Latency-critical applications: RAG adds 100-500ms for retrieval + embedding.
4. Knowledge is highly interconnected: chunked retrieval misses relationships. Use a knowledge graph instead.
5. Fine-tuning is viable: if you have a small, static, well-defined domain, fine-tuning might produce better results than RAG with less latency overhead.

14.1–14.6

Complete Evaluation Metrics Guide

📖

Classification Metrics — Accuracy, Precision, Recall, F1

CONFUSION MATRIX (for binary classification):

                      PREDICTED
                  Positive  Negative
ACTUAL  Positive  TP=80     FN=20
        Negative  FP=10     TN=90

TP = True Positives:  Predicted positive, actually positive ✓
FP = False Positives: Predicted positive, actually negative ✗ (false alarm)
FN = False Negatives: Predicted negative, actually positive ✗ (missed it)
TN = True Negatives:  Predicted negative, actually negative ✓

ACCURACY = (TP + TN) / Total = (80+90)/200 = 85%
  Problem: Misleading for imbalanced datasets!
  If 95% of emails are NOT spam, always predicting "not spam" = 95% accuracy
  but you've built a useless spam filter!

PRECISION = TP / (TP + FP) = 80/(80+10) = 88.9%
  "Of all the things I said were positive, how many actually were?"
  High precision = few false alarms

RECALL = TP / (TP + FN) = 80/(80+20) = 80%
  "Of all the actual positives, how many did I catch?"
  High recall = few misses

F1 SCORE = 2 × (Precision × Recall) / (Precision + Recall)
         = 2 × (0.889 × 0.80) / (0.889 + 0.80) = 84.2%
  Harmonic mean — penalizes extreme imbalance between P and R
  Use F1 when both precision AND recall matter equally
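The arithmetic above can be checked with a few lines of Python. This is a minimal sketch using the confusion-matrix counts from this section; the function name `classification_metrics` is illustrative, not part of any library.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when there are no predicted or actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


metrics = classification_metrics(tp=80, fp=10, fn=20, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.850, precision: 0.889, recall: 0.800, f1: 0.842
```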

📖

BLEU — Evaluating Machine Translation

BLEU (Bilingual Evaluation Understudy) measures how much overlap there is between a model's output and human-written reference translations. It counts n-gram overlaps.

BLEU = BP × exp( Σ wₙ × log pₙ )

BP	= Brevity Penalty: penalizes translations that are too short. BP = 1 if c ≥ r, else e^(1-r/c), where c = candidate (output) length and r = reference length
pₙ	= modified n-gram precision: the fraction of the output's n-grams that appear in the reference, with each n-gram's count clipped to its count in the reference
wₙ	= weights for each n-gram order (typically 0.25 each for n=1,2,3,4)

BLEU WORKED EXAMPLE:
Reference: "The cat sat on the mat"
Candidate: "The cat is on the mat"

Unigram precision (1-gram): 
  Candidate words: The, cat, is, on, the, mat (6 words)
  Words in reference: The✓, cat✓, is✗, on✓, the✓, mat✓ = 5/6 = 83.3%

Bigram precision (2-gram):
  Candidate bigrams: (The,cat)✓, (cat,is)✗, (is,on)✗, (on,the)✓, (the,mat)✓
  = 3/5 = 60%

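p₁ = 83.3%, p₂ = 60%, BP = 1 (both sentences have 6 words)
→ BLEU-2 = √(0.833 × 0.60) ≈ 70.7% (geometric mean of p₁ and p₂)
(Standard BLEU-4 also includes 3-gram and 4-gram precision. Here p₃ = 1/4 and p₄ = 0/3, so unsmoothed BLEU-4 is 0.)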
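The worked example can be reproduced with a small from-scratch implementation. This is a sketch under simplifying assumptions (a single reference, whitespace tokenization, lowercase input, uniform weights, no smoothing); the helper names are illustrative and real toolkits differ in these details.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Clip each candidate n-gram count by how often it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


def bleu(candidate, reference, max_n=4):
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # the geometric mean is zero if any precision is zero
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c >= r else math.exp(1 - r / c)
    # Uniform weights w_n = 1 / max_n.
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()

p1 = modified_precision(candidate, reference, 1)  # 5/6
p2 = modified_precision(candidate, reference, 2)  # 3/5
bleu2 = bleu(candidate, reference, max_n=2)       # sqrt(0.5), about 0.7071
bleu4 = bleu(candidate, reference, max_n=4)       # 0.0, no 4-gram matches
print(f"p1={p1:.3f} p2={p2:.3f} BLEU-2={bleu2:.4f} BLEU-4={bleu4:.4f}")
```

The zero BLEU-4 on such a short sentence is why sentence-level BLEU is usually smoothed, and why BLEU is normally reported over a whole test corpus.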

BLEU LIMITATIONS:
  ✗ Ignores semantic similarity: "automobile" ≠ "car" even if synonymous
  ✗ Doesn't capture fluency well
  ✗ Multiple references needed for reliability
  ✓ Still industry standard for MT benchmarks
  Used by: WMT translation benchmarks, academic comparisons

📖

ROUGE — Evaluating Summarization

ROUGE = Recall-Oriented Understudy for Gisting Evaluation
(Used for summarization evaluation)

KEY DIFFERENCE FROM BLEU:
  BLEU focuses on PRECISION (how much of output is in reference)
  ROUGE focuses on RECALL (how much of reference is in output)
  → For summarization, recall matters more: did we capture key info?

ROUGE-N: n-gram RECALL
  Reference summary: "The transformer architecture uses attention."
  Model summary: "Transformers use self-attention mechanisms."
  
  ROUGE-1 (unigram recall):
  Reference words: {The, transformer, architecture, uses, attention}
  Found in output (after stemming): {transformer≈transformers, uses≈use, attention} = 3/5 = 60%
  With exact token matching, at most "attention" matches (1/5 = 20%), and only if "self-attention" is split at the hyphen

ROUGE-L: Longest Common Subsequence
  Measures longest matching word sequence (allows gaps)
  Reference: "The transformer architecture uses attention"
  Output:    "Transformers use self-attention mechanisms"
  LCS: "transformer ... uses ... attention" = 3 words (same stemming assumption)
  ROUGE-L = 3/5 = 60%
  
  Unlike ROUGE-1, it rewards matches that appear in the same order, without requiring them to be adjacent

ROUGE IN PRACTICE:
  ROUGE-1: used for general content overlap
  ROUGE-2: stricter, requires bigram matches
  ROUGE-L: used for quality of sentence structure preservation
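ROUGE-N recall can be sketched in a few lines. The version below is a simplified illustration with exact token matching and no stemming (the `tokenize` and `rouge_n_recall` names are made up for this example), so near matches such as "transformer" and "transformers" do not count and the score for the example pair comes out lower than the approximate 3/5 shown above.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge_n_recall(reference, candidate, n=1):
    ref_tokens, cand_tokens = tokenize(reference), tokenize(candidate)
    ref_counts = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    cand_counts = Counter(tuple(cand_tokens[i:i + n]) for i in range(len(cand_tokens) - n + 1))
    # Count reference n-grams that also occur in the candidate (clipped counts).
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0


reference = "The transformer architecture uses attention."
candidate = "Transformers use self-attention mechanisms."

r1 = rouge_n_recall(reference, candidate, n=1)  # only "attention" matches: 1/5
r2 = rouge_n_recall(reference, candidate, n=2)  # no shared bigram: 0/4
print(f"ROUGE-1 recall: {r1:.2f}, ROUGE-2 recall: {r2:.2f}")
```

The gap between 0.20 here and the stemmed 0.60 shows how much a ROUGE score depends on preprocessing, so tokenization and stemming settings should be reported alongside the number.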

📖

LLM-as-Judge — The Modern Evaluation Approach

Why Traditional Metrics Are Not Enough for LLMs

BLEU and ROUGE fail for open-ended generation. If you ask GPT-4 "Explain gravity" and it gives an excellent explanation using different words than the reference answer, BLEU might score it near 0. Modern LLM evaluation uses a stronger LLM (GPT-4 or Claude) to judge the quality of outputs.

LLM-AS-JUDGE WORKFLOW:

User Question: "What is attention in transformers?"
Reference Answer: "Attention allows tokens to focus on relevant..."
Model Output: "The attention mechanism enables each token to..."

Judge Prompt:
  "Rate this answer from 1-10 for:
   - Factual accuracy
   - Completeness  
   - Clarity
   Reference: [reference answer]
   Model output: [model output]
   Provide score and brief justification."

GPT-4 Judge Output:
  Factual accuracy: 9/10
  Completeness: 8/10
  Clarity: 9/10
  Overall: 8.7/10
  "The answer correctly explains attention but misses..."

FRAMEWORKS:
  RAGAS: Evaluates RAG pipelines (faithfulness, answer relevancy,
         context precision, context recall)
  MT-Bench: Multi-turn conversation quality evaluation
  AlpacaEval: Pairwise comparison against a baseline model's responses, judged by an LLM
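The workflow above can be sketched as two small helpers: one that builds the judge prompt and one that parses the scores out of the judge's reply. This is an illustrative outline, not a real framework API. The helper names and the reply format are assumptions, and the call to the judge model itself is deliberately left out.

```python
import re
from statistics import mean

CRITERIA = ["Factual accuracy", "Completeness", "Clarity"]


def build_judge_prompt(question, reference, model_output):
    criteria = "\n".join(f"- {name}" for name in CRITERIA)
    return (
        "Rate this answer from 1-10 for:\n"
        f"{criteria}\n"
        f"Question: {question}\n"
        f"Reference: {reference}\n"
        f"Model output: {model_output}\n"
        "Reply with one line per criterion as '<criterion>: <score>/10', "
        "then a brief justification."
    )


def parse_judge_scores(judge_reply):
    """Extract '<criterion>: <score>/10' lines; criteria missing from the reply are skipped."""
    scores = {}
    for name in CRITERIA:
        match = re.search(rf"{re.escape(name)}:\s*(\d+(?:\.\d+)?)\s*/\s*10", judge_reply)
        if match:
            scores[name] = float(match.group(1))
    return scores


prompt = build_judge_prompt(
    "What is attention in transformers?",
    "Attention allows tokens to focus on relevant...",
    "The attention mechanism enables each token to...",
)

# Hypothetical judge reply, mirroring the sample output above.
sample_reply = "Factual accuracy: 9/10\nCompleteness: 8/10\nClarity: 9/10\nJustification: (free text)"
scores = parse_judge_scores(sample_reply)
overall = mean(list(scores.values()))
print(scores, f"overall={overall:.1f}")
```

Asking the judge for a fixed reply format keeps parsing simple; in practice the parser should also handle replies that ignore the format.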

💻

Evaluation in Python

Python

# pip install evaluate scikit-learn rouge-score
import evaluate
from sklearn.metrics import classification_report

# ── CLASSIFICATION METRICS ─────────────────────────────────
y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions

print(classification_report(y_true, y_pred, target_names=['NEG', 'POS']))

# ── BLEU SCORE ──────────────────────────────────────────────
bleu = evaluate.load("bleu")

references = [["the cat sat on the mat"]]  # list of lists
predictions = ["the cat is on the mat"]

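result = bleu.compute(predictions=predictions, references=references)
# Default BLEU-4 is 0 here: the candidate shares no 4-gram with the reference.
print(f"\nBLEU score: {result['bleu']:.4f}")

# Restricting to unigrams and bigrams gives the ~0.71 from the worked example.
result_bleu2 = bleu.compute(predictions=predictions, references=references, max_order=2)
print(f"BLEU-2 score: {result_bleu2['bleu']:.4f}")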

# ── ROUGE SCORE ─────────────────────────────────────────────
rouge = evaluate.load("rouge")

reference_summaries = ["The transformer architecture uses self-attention mechanisms."]
model_summaries = ["Transformers leverage attention to process sequences in parallel."]

result = rouge.compute(
    predictions=model_summaries,
    references=reference_summaries
)
print(f"ROUGE-1: {result['rouge1']:.4f}")
print(f"ROUGE-L: {result['rougeL']:.4f}")

# ── RAGAS — RAG EVALUATION ──────────────────────────────────
# pip install ragas datasets
# Alias the import: a bare `from ragas import evaluate` would shadow the
# Hugging Face `evaluate` module imported at the top of this script.
# Written against the ragas 0.1-style API; newer releases changed some names.
from ragas import evaluate as ragas_evaluate
from ragas.metrics import (
    faithfulness,         # is answer grounded in context?
    answer_relevancy,     # does answer address the question?
    context_precision,    # are retrieved chunks relevant?
    context_recall        # do chunks contain the answer?
)
from datasets import Dataset

eval_data = {
    "question": ["What context window does Claude 3.5 have?"],
    "answer": ["Claude 3.5 Sonnet supports a 200,000 token context window."],
    "contexts": [["Claude 3.5 Sonnet by Anthropic has a 200,000 token context window."]],
    "ground_truth": ["Claude 3.5 Sonnet has a 200K token context window."]
}
dataset = Dataset.from_dict(eval_data)

# score = ragas_evaluate(dataset, metrics=[faithfulness, answer_relevancy])
# print(score)  # Runs LLM-based evaluation
print("RAGAS evaluation code ready (requires OpenAI API key)")

🎙️

Interview Q&A

Q

When would you use F1 instead of accuracy for NLP evaluation?

Use F1 when: (1) Your dataset is imbalanced, e.g., 95% not-spam vs 5% spam. Accuracy of 95% looks great, but a model predicting "not-spam" for everything achieves it without finding any actual spam. (2) Both false positives and false negatives have significant cost: in medical NER (Named Entity Recognition), missing a disease name (FN) and incorrectly tagging non-diseases (FP) are both costly. (3) You care about per-class performance. For NER and POS tagging, micro-F1 (pool TP, FP, and FN across all classes, then compute F1) and macro-F1 (compute F1 per class, then average) are standard. In practice: use accuracy only for balanced datasets with equal-cost errors.
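The spam example can be made concrete. The sketch below builds a hypothetical 95/5 dataset and a "model" that always predicts the majority class; the helper name `f1_for_class` is illustrative.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, treating that class as the positive label."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives means precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical imbalanced dataset: 5 spam, 95 not-spam ("ham").
y_true = ["spam"] * 5 + ["ham"] * 95
y_pred = ["ham"] * 100  # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
spam_f1 = f1_for_class(y_true, y_pred, "spam")
ham_f1 = f1_for_class(y_true, y_pred, "ham")
macro_f1 = (spam_f1 + ham_f1) / 2

print(f"accuracy={accuracy:.2f} spam_f1={spam_f1:.2f} macro_f1={macro_f1:.3f}")
# accuracy=0.95 spam_f1=0.00 macro_f1=0.487
```

Accuracy rewards the useless baseline, while the spam-class F1 and the macro average expose it.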

Q

What is RAGAS and what does it measure?

RAGAS (Retrieval-Augmented Generation Assessment) is a framework for evaluating RAG pipelines end-to-end with little or no human annotation. It measures 4 key metrics: (1) Faithfulness: Is the generated answer factually consistent with the retrieved context? Detects hallucination. (2) Answer Relevancy: Does the answer actually address the question asked? (3) Context Precision: Are the retrieved chunks relevant, and are the relevant ones ranked near the top? (4) Context Recall: Were all the relevant pieces of information retrieved? Low recall = missed important context. This metric compares the retrieved context against a reference (ground-truth) answer, so it needs one per question. RAGAS uses an LLM (typically a strong model such as GPT-4) to score each metric, making it practical for evaluating RAG without expensive human annotation.

📝

Practice Questions — All Levels

Easy A model correctly identifies 45 out of 50 positive samples and 90 out of 100 negative samples. Calculate accuracy, precision, and recall.

TP=45, FN=5, TN=90, FP=10. Total=150.
Accuracy = (45+90)/150 = 90%
Precision = TP/(TP+FP) = 45/(45+10) = 81.8%
Recall = TP/(TP+FN) = 45/(45+5) = 90%
F1 = 2×(0.818×0.90)/(0.818+0.90) = 85.7%

Medium When would ROUGE-L be more appropriate than ROUGE-1 for summarization evaluation?

ROUGE-L is more appropriate when word order and sentence structure matter. ROUGE-1 only counts individual word overlaps, ignoring order. If a summary says "The cat chased the mouse" and the reference is "The mouse was chased by the cat", every summary word appears in the reference, so ROUGE-1 precision is 5/5 (recall is 5/7) even though the meaning is reversed. ROUGE-L scores lower because the longest common subsequence is only 3 words ("the ... chased ... the"): precision 3/5, recall 3/7. Use ROUGE-L when you care about preserving narrative flow, maintaining logical sentence structure, or scoring summaries whose matching words are separated by inserted or reworded material, since the LCS does not require contiguous matches.
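The cat-and-mouse comparison can be verified with a short longest-common-subsequence routine. This is a minimal sketch with whitespace tokens and exact matching; the function name is illustrative.

```python
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


summary = "the cat chased the mouse".split()
reference = "the mouse was chased by the cat".split()

# ROUGE-1: clipped unigram overlap, order ignored.
overlap = sum((Counter(summary) & Counter(reference)).values())
rouge1_precision = overlap / len(summary)    # 5/5
rouge1_recall = overlap / len(reference)     # 5/7

# ROUGE-L: only in-order matches count.
lcs = lcs_length(summary, reference)         # 3
rougeL_precision = lcs / len(summary)        # 3/5
rougeL_recall = lcs / len(reference)         # 3/7

print(f"ROUGE-1 P={rouge1_precision:.2f} R={rouge1_recall:.2f}")
print(f"ROUGE-L P={rougeL_precision:.2f} R={rougeL_recall:.2f}")
```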

Hard Design a complete evaluation framework for a customer support RAG chatbot.

A comprehensive evaluation framework would include:
Retrieval evaluation: Context Precision (are retrieved chunks relevant to the question?), Context Recall (do retrieved chunks contain all needed info?), MRR (Mean Reciprocal Rank: where does the right answer rank?).
Generation evaluation: Faithfulness (RAGAS): does the answer stick to retrieved context or hallucinate? Answer Relevancy: does it address the customer's actual question?
Task-specific metrics: Resolution rate (did the customer issue get resolved?), Escalation rate (how often does it fail and escalate to a human?), Response length (are answers appropriately concise?).
Offline evaluation: Build a golden test set of 200-500 real customer questions with correct answers + relevant document IDs. Run weekly to catch regressions.
Online evaluation: A/B test new model vs old. Track CSAT scores. Monitor thumbs up/down. Use LLM-as-judge on sampled production conversations.
Combine all these into a dashboard with trend lines.
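Of the retrieval metrics listed, MRR is the simplest to compute by hand. The sketch below assumes each query in the golden test set comes with a ranked list of retrieved document IDs and a set of relevant IDs; the function name and the example IDs are hypothetical.

```python
def mean_reciprocal_rank(ranked_ids_per_query, relevant_ids_per_query):
    """Average of 1/rank of the first relevant document; 0 if none was retrieved."""
    total = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_ids_per_query)


# Three hypothetical support questions from a golden test set.
retrieved = [
    ["kb-12", "kb-07", "kb-33"],  # relevant doc at rank 1
    ["kb-02", "kb-19", "kb-41"],  # relevant doc at rank 3
    ["kb-05", "kb-08", "kb-09"],  # relevant doc not retrieved
]
relevant = [{"kb-12"}, {"kb-41"}, {"kb-77"}]

mrr = mean_reciprocal_rank(retrieved, relevant)
print(f"MRR = {mrr:.3f}")  # (1 + 1/3 + 0) / 3 = 0.444
```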

📋

Complete Cheat Sheet — All Metrics

Accuracy

Correct/Total. Good for balanced classes. Misleading for imbalanced data.

Precision

TP/(TP+FP). "Of my positive predictions, how many were right?" Use when FP is costly.

Recall

TP/(TP+FN). "Of actual positives, how many did I find?" Use when FN is costly.

F1 Score

Harmonic mean of P&R. Use for imbalanced classes or when both P and R matter.

BLEU

N-gram precision. Used for translation. BLEU-1 to BLEU-4. Higher = better. Max=1.0.

ROUGE-1/2/L

N-gram recall. Used for summarization. ROUGE-L uses longest common subsequence.

Perplexity

How "surprised" the LM is. Lower = better. PP=10 means ~10 choices per token on average.

RAGAS

RAG evaluation: Faithfulness + Answer Relevancy + Context Precision + Context Recall.

What is tokenization?

And why is subword tokenization used?

Converting raw text into integer IDs a model can process. Subword tokenization (BPE, WordPiece) is used because it largely eliminates OOV words by breaking rare words into known smaller pieces, and keeps vocabulary size manageable (30K-128K).

Self-Attention formula?

With explanation of each part

Attention(Q,K,V) = softmax(QKᵀ/√d_k) × V. Q=query (what am I looking for?), K=key (what do I contain?), V=value (what info do I provide?). √d_k scaling prevents softmax saturation from large dot products.

BERT vs GPT

Core architectural difference

BERT: encoder-only, BIDIRECTIONAL attention, trained with MLM (masked prediction). Best for understanding tasks. GPT: decoder-only, CAUSAL (left-only) attention, trained with next-token prediction. Best for generation tasks.

What is RAG?

And its 4 main steps

Retrieval-Augmented Generation: 1) Embed query, 2) Retrieve top-k similar chunks from vector DB, 3) Augment LLM prompt with retrieved context, 4) LLM generates grounded answer. Reduces hallucination by grounding output in retrieved facts.

What is BPE?

And which models use it?

Byte Pair Encoding: start with characters, iteratively merge most frequent adjacent pairs until target vocabulary size. Used by GPT (all versions), RoBERTa, Falcon. GPT-4: ~100K vocab. GPT-2: 50K vocab.

What is cosine similarity?

Formula + when to use it

cos(θ) = (u·v)/(||u||×||v||). Measures angle between vectors, range [-1,+1]. Use for text similarity because it ignores vector magnitude — frequent vs rare words don't bias the score. 1.0=identical direction, 0=unrelated, -1=opposite.
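A minimal pure-Python sketch of the formula, with toy vectors chosen to hit the three reference points (same direction, orthogonal, opposite). The function name is illustrative; in practice a numerical library would be used on real embeddings.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])  # about 1.0: magnitude is ignored
orthogonal = cosine_similarity([1, 0], [0, 1])            # 0.0
opposite = cosine_similarity([1, 2], [-1, -2])            # about -1.0
print(same_direction, orthogonal, opposite)
```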

What is RLHF?

3 stages

Reinforcement Learning from Human Feedback: 1) SFT — fine-tune on human-written demonstrations. 2) Train reward model from human preference rankings. 3) PPO — optimize LLM outputs to maximize reward model score. Transforms raw GPT into ChatGPT.

Perplexity intuition?

What does PP=10 mean?

Perplexity = how many equally likely choices the model sees per token, computed as exp of the average negative log-likelihood per token. PP=10: model is as confused as if choosing uniformly from 10 words. PP=1: perfect. Lower = better language model. GPT-3 reports a zero-shot perplexity of about 20.5 on Penn Treebank.
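The "equally likely choices" intuition follows directly from the definition, perplexity = exp(average negative log-likelihood per token). The sketch below uses made-up per-token probabilities rather than a real language model.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability the model assigned to each actual token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


uniform_over_10 = perplexity([0.1] * 5)       # about 10: like guessing among 10 words
perfect = perplexity([1.0] * 5)               # 1.0: every token predicted with certainty
mixed = perplexity([0.5, 0.25, 0.125])        # about 4: geometric mean probability is 1/4
print(uniform_over_10, perfect, mixed)
```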

Word2Vec limitation?

What problem does BERT solve?

Word2Vec gives ONE static vector per word regardless of context. "Bank" (financial) = "bank" (river) in Word2Vec. BERT produces contextual embeddings — different vectors for the same word in different contexts, by attending to surrounding tokens.

Why √d_k scaling?

In attention mechanism

Large d_k → large dot products → softmax becomes very peaked (one near 1, rest near 0) → vanishing gradients → poor learning. Dividing by √d_k keeps variance ≈ 1, preventing saturation and enabling stable training of deep transformer networks.
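The effect can be illustrated numerically. The sketch below assumes query and key components drawn independently from a standard normal distribution (the usual assumption behind the variance argument), with d_k = 64 and a toy score vector chosen for illustration.

```python
import math
import random


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def variance(values):
    avg = sum(values) / len(values)
    return sum((v - avg) ** 2 for v in values) / len(values)


random.seed(0)
d_k = 64

# Dot products of random q and k vectors have variance close to d_k.
dots = []
for _ in range(2000):
    q = [random.gauss(0, 1) for _ in range(d_k)]
    k = [random.gauss(0, 1) for _ in range(d_k)]
    dots.append(sum(a * b for a, b in zip(q, k)))

raw_var = variance(dots)
scaled_var = variance([d / math.sqrt(d_k) for d in dots])
print(f"variance before scaling: {raw_var:.1f}, after: {scaled_var:.2f}")

# Toy scores with a spread typical of unscaled dot products at d_k = 64.
raw_scores = [8.0, 0.0, -8.0]
peaked = softmax(raw_scores)
softer = softmax([s / math.sqrt(d_k) for s in raw_scores])
print(f"largest weight unscaled: {peaked[0]:.4f}, scaled: {softer[0]:.4f}")
```

Without scaling, nearly all attention mass lands on one key; after dividing by √d_k the weights stay spread out, which keeps softmax gradients usable.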

F1 vs Accuracy

When does accuracy lie?

Accuracy lies on imbalanced datasets. If 99% of samples are class A, predicting A always = 99% accuracy but 0% recall for class B. F1 = harmonic mean of Precision and Recall, penalizing models that ignore the minority class. Always use F1 for imbalanced NLP tasks (NER, spam detection).

HNSW algorithm

What problem does it solve?

Approximate Nearest Neighbor search. Solves slow exact nearest-neighbor (O(n) per query) with a multi-layer graph: top layers = highway (few nodes, long jumps), bottom layer = streets (all nodes, local). Search: roughly O(log n), at the cost of occasionally missing the true nearest neighbor. Used by most vector databases for fast embedding lookup.
