Tokenizer Playground


What Is Tokenization?

Tokenization is the process of breaking text into smaller units called tokens before feeding it to a language model. Tokens are not necessarily whole words. Depending on the tokenizer, a single word might be split into multiple sub-word tokens, or several characters might be merged into one token. For example, the word “unhappiness” might be tokenized as [“un”, “happiness”] or [“un”, “happi”, “ness”] depending on the model’s vocabulary.
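The splitting idea can be sketched with a toy greedy longest-match tokenizer over a small hypothetical vocabulary (real tokenizers use learned rules, but the sub-word output looks the same):

```python
def tokenize(word, vocab):
    """Greedy longest-match tokenization: at each position, take the
    longest vocabulary entry that matches, else fall back to one character."""
    tokens, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            tokens.append(word[start])  # unknown character fallback
            start += 1
    return tokens

# Hypothetical vocabulary for illustration only.
vocab = {"un", "happi", "ness", "happiness"}
print(tokenize("unhappiness", vocab))  # → ['un', 'happiness']
```

With "happiness" removed from the vocabulary, the same word would come out as ["un", "happi", "ness"], matching the article's second example.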

Tokenization matters because it directly affects how models understand and generate text. It determines the model’s context window (measured in tokens, not characters), influences generation cost (APIs charge per token), and affects how well a model handles different languages, code, and special characters.

Byte Pair Encoding (BPE)

Byte Pair Encoding is the most widely used tokenization algorithm in modern language models, adopted by GPT-2, GPT-3, GPT-4, LLaMA 3, and Mistral. BPE starts with individual bytes or characters and iteratively merges the most frequent adjacent pairs to build up a vocabulary of sub-word tokens. The merge rules are learned from a training corpus during a one-time preprocessing step.
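The merge-learning loop can be sketched in a few lines of Python. This is a simplified training step over a toy word-frequency corpus, not a production implementation (real BPE operates on bytes and handles pre-tokenization):

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Learn BPE merge rules from a {word: frequency} corpus.
    Words are represented as tuples of symbols, initially single characters."""
    vocab = {tuple(word): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere it occurs.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

corpus = {"hug": 10, "pug": 5, "hugs": 5}
print(bpe_train(corpus, 2))  # → [('u', 'g'), ('h', 'ug')]
```

The pair ("u", "g") occurs 20 times, so it is merged first; "ug" then pairs with "h" 15 times and becomes the second merge. Encoding a new word simply replays these merges in order.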

The resulting vocabulary naturally captures common words as single tokens while gracefully handling rare or novel words by breaking them into known sub-word pieces. This means BPE never encounters an “unknown token” problem — any input can be encoded, even if the model has never seen that specific word before. The vocabulary size is a key hyperparameter: GPT-2 uses ~50k tokens, GPT-4 uses ~100k, and GPT-4o expanded to ~200k for better multilingual efficiency.

WordPiece

WordPiece is the tokenization algorithm used by BERT and other models from Google. Like BPE, WordPiece builds a sub-word vocabulary, but it uses a different criterion for merging: instead of frequency, it maximizes the likelihood of the training data. WordPiece tokens that continue a word are prefixed with “##” — for example, “playing” becomes [“play”, “##ing”]. This explicit marking of word continuations makes it easy to reconstruct original word boundaries.
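WordPiece encoding itself is a greedy longest-match-first search, with continuation pieces stored under a "##" prefix. A minimal sketch, using a made-up four-entry vocabulary:

```python
def wordpiece_encode(word, vocab):
    """Greedy longest-match-first encoding, WordPiece style.
    Pieces that continue a word carry a '##' prefix in the vocabulary."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no valid segmentation exists
        tokens.append(piece)
        start = end
    return tokens

vocab = {"play", "##ing", "##ed", "##er"}
print(wordpiece_encode("playing", vocab))  # → ['play', '##ing']
```

Note that unlike byte-level BPE, WordPiece can fail to segment a word, which is why BERT's vocabulary reserves a [UNK] token.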

SentencePiece and Unigram

SentencePiece is a language-independent tokenization framework developed by Google that treats the input as a raw character stream, whitespace included, rather than as pre-tokenized words. This makes it particularly effective for languages that don’t use spaces between words, like Japanese and Chinese. LLaMA 2 uses SentencePiece with a BPE algorithm, while T5 uses SentencePiece with the Unigram algorithm.

The Unigram model takes the opposite approach from BPE: instead of starting small and merging up, it starts with a large vocabulary and iteratively removes tokens that contribute least to the training data likelihood. This top-down approach can produce more linguistically meaningful sub-word units in some cases.

Why Vocabulary Size Matters

The size of a tokenizer’s vocabulary has direct practical implications. A larger vocabulary means each token covers more text on average, so a given input requires fewer tokens. This means more text fits within the model’s context window and API costs are lower per character. However, a larger vocabulary also means a bigger embedding table in the model, requiring more parameters and memory.
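The embedding-table cost is easy to quantify: it grows linearly with vocabulary size. A back-of-envelope calculation, assuming a hypothetical hidden size of 4096 for illustration:

```python
# Embedding parameters = vocab_size * d_model (one vector per token).
d_model = 4096  # hypothetical hidden size, for illustration only
for vocab_size in (50_000, 100_000, 200_000):
    params = vocab_size * d_model
    print(f"{vocab_size:>7,} tokens -> {params / 1e6:,.1f}M embedding parameters")
```

Going from a 50k to a 200k vocabulary quadruples the embedding table (roughly 205M to 819M parameters at this hidden size), which is the memory side of the fewer-tokens-per-input trade-off.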

Compare how different models tokenize the same text to see this trade-off in action. GPT-4o (200k vocab) typically produces 15-20% fewer tokens than GPT-4 (100k vocab) for the same English text, and the difference is even more dramatic for non-English languages and code. BERT’s smaller 30k WordPiece vocabulary tends to produce more tokens but was sufficient for its original masked language modeling task.

How to Use This Tool

Type or paste any text in the input area and select a tokenizer from the dropdown. The tool runs the actual tokenizer from each model directly in your browser — no server is needed and your text never leaves your device. Switch between models to compare how they tokenize the same input. Toggle “#IDs” to see the raw token IDs (the numbers the model actually processes). Try pasting code, non-English text, or emoji to see how different tokenizers handle them.