
First Came The Tokenizer

Why the Humble Tokenizer Is Where It All Starts



Cover image: The process of tokenization that turns raw text into a stream of integers for LLMs

This is blog post #5 in my 101 Days of Blogging series. If it sparks anything: ideas, questions, or critique, my DMs are open. Hope it gives you something useful to walk away with.

Before Everything

Before an LLM has a chance to process your message, the tokenizer has to digest the text into a stream of integer token IDs (usually stored as 16- or 32-bit integers). It’s not glamorous work, but make no mistake: this step has major implications for your model’s context window, training speed, prompt cost, and whether or not your favorite Unicode emoji makes it out alive.
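If you want to see that digestion step for yourself, here’s a minimal sketch using the tiktoken package and its cl100k_base encoding (any encoding would do; the sentence is just an example):

```python
import tiktoken  # pip install tiktoken

# Grab a stock encoding; cl100k_base is the one used by GPT-4-era models.
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("Before everything, the tokenizer digests your text.")
print(ids)              # a list of integer token IDs
print(len(ids))         # how many tokens that sentence really costs
print(enc.decode(ids))  # and back to the original string, losslessly
```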

Interactive Visual Tokenizer
www.ahmadosman.com/tokenizer

Tokenizers: The Hidden Operators Behind LLMs

From Whitespace to Subwords: A Lightning Tour

Let’s run through the main flavors of tokenizers, in true “what do I actually care about” fashion:

Word-level

The OG tokenizer: split on whitespace and assign every word an ID. It’s simple, but your vocab balloons to 500,000+ entries for English alone, and “dog” and “dogs” end up as completely unrelated tokens.
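A word-level tokenizer is little more than a dictionary built from whitespace splits; here’s a toy sketch (the corpus and variable names are made up for illustration):

```python
# Toy word-level tokenizer: split on whitespace, give each distinct word an ID.
corpus = "the dog chased the dogs across the park"

vocab = {}
for word in corpus.split():
    vocab.setdefault(word, len(vocab))

print(vocab["dog"], vocab["dogs"])  # two unrelated IDs for near-identical words
print(len(vocab))                   # the vocab grows with every new surface form
```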

Character-level

Shrink the vocab down to a few hundred entries; now every letter or symbol is its own token. Problem: even simple words turn into sprawling token chains, which slows training and loses the semantic chunking that makes LLMs shine.
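Character-level tokenization is trivial to implement, and this toy sketch shows the downside: the token count of even a short sentence balloons.

```python
# Toy character-level tokenizer: every character is its own token.
sentence = "tokenizers are not glamorous"

char_tokens = list(sentence)
word_tokens = sentence.split()

print(len(word_tokens))  # 4 word-level tokens
print(len(char_tokens))  # 28 character-level tokens for the same text
```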

Subword-level

This is where the modern magic happens: break rare words into pieces, keep common ones whole. It’s the approach adopted by basically every transformer since 2017: not too big, not too small, just right for GPU memory and token throughput.
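You can watch the “keep common words whole, split rare ones” behavior with any modern tokenizer; here’s a quick sketch using tiktoken’s cl100k_base encoding (exact splits depend on the vocabulary, so treat the output as illustrative):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["information", "tokenization", "perpendicularity"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    # Common words tend to stay whole; rarer ones split into subword pieces.
    print(word, "->", pieces)
```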

Tokenizer Algorithms

Here are the big players, how they work, and why they matter:

Byte-Pair Encoding (BPE) — Used in GPT family, RoBERTa, DeBERTa

How it works: Start from single characters (or raw bytes), count the most frequent adjacent pair of symbols in the training corpus, merge it into a new token, and repeat until you hit your target vocab size.

Why it matters: It’s simple, fast, and deterministic, and it’s the backbone of most GPT-style vocabularies, so its learned merge table decides exactly how your prompts get split, and billed.
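To make the merge loop concrete, here’s a from-scratch toy sketch in the spirit of the classic BPE algorithm (real implementations work on bytes and are heavily optimized; the tiny corpus here is made up):

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of the winning pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Corpus as {word-as-symbol-tuple: frequency}, starting from single characters.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}

for step in range(5):
    pair = get_pair_counts(words).most_common(1)[0][0]  # most frequent adjacent pair
    words = merge_pair(words, pair)
    print(f"merge {step + 1}: {pair}")
```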

WordPiece — Used in BERT, DistilBERT, ELECTRA

How it works: Like BPE, it builds a vocabulary by merging symbol pairs, but it picks the merge that most improves the likelihood of the training data rather than the most frequent pair; word-internal pieces carry the familiar ## prefix.

Why it matters: It’s the tokenizer behind the BERT family, so most encoder-style models you fine-tune inherit its vocabulary and its ## subword conventions.

Unigram LM — Used in T5, mBART, XLNet

How it works: Start from a big candidate vocabulary, give every piece a probability, then iteratively prune the pieces whose removal hurts the corpus likelihood the least until you reach the target size; at tokenization time, the most probable segmentation wins.

Why it matters: Because it scores whole segmentations instead of greedily merging, it tends to produce cleaner splits and supports subword regularization (sampling alternative segmentations) during training; the SentencePiece sketch below trains exactly this kind of model.

SentencePiece — Designed for multilingual/agglutinative languages

How it works: Treats the input as a raw character stream (whitespace becomes the ▁ meta-symbol), so it needs no language-specific pre-tokenization; under the hood it trains either a BPE or a Unigram LM vocabulary.

Why it matters: It behaves the same whether your text is English, Japanese, or Turkish, which is exactly why multilingual models lean on it: no assumptions about spaces or word boundaries.
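Training your own SentencePiece model takes a few lines; here’s a sketch assuming you have the sentencepiece package installed and some corpus.txt on disk (the file name, model prefix, and vocab size are placeholders):

```python
import sentencepiece as spm  # pip install sentencepiece

# Train a Unigram LM model directly on raw text; no pre-tokenization required.
spm.SentencePieceTrainer.train(
    input="corpus.txt",     # any plain-text file, one sentence per line works well
    model_prefix="toy_sp",  # writes toy_sp.model and toy_sp.vocab
    vocab_size=4000,
    model_type="unigram",   # or "bpe"
)

sp = spm.SentencePieceProcessor(model_file="toy_sp.model")
print(sp.encode("Tokenizers are the hidden operators behind LLMs.", out_type=str))
```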

Byte-level BPE (tiktoken) — Used in GPT-2, GPT-4o, OpenAI API

How it works: Runs BPE over raw UTF-8 bytes instead of characters, so the base vocabulary is just 256 byte values and any string, emoji included, can be encoded without a single unknown token.

Why it matters: No <unk> tokens, ever, and tiktoken’s Rust-backed implementation is fast enough to tokenize at API scale; it’s the thing actually metering your OpenAI bill.
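The byte-level trick is what keeps your emoji alive; a small sketch with tiktoken shows the UTF-8 bytes being carved into tokens and reassembled losslessly:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "I 🦙 tokenizers"
ids = enc.encode(text)

# Each token maps back to raw bytes, so nothing is ever "unknown".
print([enc.decode_single_token_bytes(i) for i in ids])
print(enc.decode(ids) == text)  # True: lossless round trip, emoji included
```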

Vocabulary Size: A Tradeoff

Scaling law nerds (hey, that’s us) know: as you scale up your model, you need to scale your tokenizer’s vocabulary too, and that choice shapes how efficiently your model chews through raw text.

Take Llama 3, for example: its tokenizer jumped to a whopping 128K vocab size, compared to Llama 2’s more modest 32K. Why? With a bigger vocab, each token can represent longer or more meaningful text chunks. That means your input sequences get shorter (the model has fewer tokens to process for the same amount of text), which can speed up both training and inference. You’re basically packing more info into every step.
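You can feel that effect by running the same text through a smaller and a larger vocabulary; here’s a sketch comparing tiktoken’s stock gpt2 (~50K) and cl100k_base (~100K) encodings (exact counts will vary with your text):

```python
import tiktoken

text = (
    "Scaling up the vocabulary lets each token cover a longer chunk of text, "
    "so the same sentence costs fewer tokens to process."
)

small = tiktoken.get_encoding("gpt2")         # ~50K vocab
large = tiktoken.get_encoding("cl100k_base")  # ~100K vocab

print("gpt2:       ", len(small.encode(text)), "tokens")
print("cl100k_base:", len(large.encode(text)), "tokens")
```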

But there’s no free lunch. A larger vocabulary means your embedding matrix (the lookup table mapping token IDs to vectors) grows right along with it. More tokens = more rows = more parameters = more GPU RAM required. For massive models or memory-constrained deployments, that can be a real pain. You also run into diminishing returns: after a certain point, adding more tokens doesn’t buy you much, but it still taxes your hardware.
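Some back-of-the-envelope arithmetic makes the cost concrete (a sketch; the 4096 hidden size matches Llama-class 7B/8B models, and the counts ignore any tied output head):

```python
# Embedding parameters = vocab_size * hidden_dim (each row is one token's vector).
hidden_dim = 4096  # typical for a 7B/8B-class model

for name, vocab_size in [("Llama-2-style 32K", 32_000), ("Llama-3-style 128K", 128_256)]:
    params = vocab_size * hidden_dim
    fp16_gb = params * 2 / 1024**3  # two bytes per parameter in fp16/bf16
    print(f"{name}: {params / 1e6:.0f}M embedding params, ~{fp16_gb:.2f} GB in fp16")
```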

On the flip side, if your vocabulary is too small, your tokenizer starts breaking up words into lots of tiny pieces (“subwords”). Your model ends up wrestling with long, fragmented sequences, burning compute on basic reconstruction instead of actual language understanding. That’s inefficient, especially for languages with rich morphology or lots of rare words.

So where’s the “just right” zone? It depends: the ideal vocab size is a balancing act between shorter sequences, embedding-matrix size, and how well you cover the languages and domains you actually serve.

At the end of the day, picking a vocab size isn’t just a technical tweak, it’s a core design choice that shapes everything downstream, from model cost to multilingual robustness. Getting this wrong means you’re either bottlenecked by hardware or wasting compute on needlessly long sequences.

Tokenizer Quirks: Fun Ways To Sabotage Yourself

Picking a Tokenizer for Your LLM Playground: My Cheat Sheet

Want to skip the theory and just get your hands dirty? Here’s the cheat sheet:

If you care about…

Training speed & GPU RAM:

On-device (mobile) inference:

Cross-language coverage:

DIY research tinkering:

Final Words: The Humble Tokenizer Is Doing More Than You Think

Tokenizers are never the sexiest part of an LLM stack, but they are very impactful. Understand them, and you control real prompting costs, inference throughput, and the very shape of what your models can express.

Next time your model is spitting out garbage or can’t fit your prompt, don’t blame the GPUs 😝

Give a little respect to the humble tokenizer.

In the meantime, if you want to see tokenizers in action, check out www.ahmadosman.com/tokenizer.

PS: Remember, you can always use my DeepResearch Workflow to learn more, if you’re stuck, need something broken down to first principles, want material tailored to your level, need to identify gaps, or just want to explore deeper.