What Is a Tokenizer? How Text Gets Split into Tokens
Published March 14, 2026 by Joel Thyberg

The term tokenizer appears everywhere as soon as you start working with language models. It affects cost, quality, speed, and how much text a model can actually handle. And yet it is often one of the least explained layers in the entire LLM stack.
The short version is simple: a tokenizer is the system that splits text into smaller pieces called tokens and translates those tokens into numbers the model can work with. It sounds technical, but the core idea is actually quite intuitive.
Here we go deeper into tokenization itself, why it is needed, and why models do not work directly with words or raw binary data.
Short answer
A tokenizer splits text into tokens and maps each token to a number. It is the bridge between human text and the model's mathematics.
Why it matters
Context windows, pricing, and a model's ability to interpret your text are measured in tokens, not in words, sentences, or characters.
Common misconception
That a token is the same thing as a word. In practice, a token can be a whole word, a piece of a word, a space, or punctuation.
What Is a Token?
A token is a piece of text that the model uses as its basic unit. Sometimes it happens to be a full word. Sometimes it is only the beginning of a word, the end of a word, a space, or punctuation.
What matters is not whether a token looks natural to us. What matters is that tokenization gives the model a practical way to represent text as a sequence of integers.
Illustration: how a piece of text is represented as tokens.
Why Use Tokens Instead of Words?
The most intuitive choice might have been to let the model work directly with words. But whole words turn out to be a poor technical fit.
If every word in a language got its own ID, several problems appear immediately:
- the vocabulary explodes with inflections, compounds, names, spelling mistakes, and domain language
- new or rare words become hard to handle if they are not already in the vocabulary
- Swedish compounds, code, product names, and mixed-language text make a purely word-based system fragile
With tokens, the model can instead reuse smaller building blocks. An unusual word does not need to be unknown just because the full word has never been seen before. It can be split into parts that already exist in the vocabulary.
- Words: easy for humans to think in, but quickly leads to a huge and fragile vocabulary.
- Characters or bytes: universal and robust, but sequences become long and the model must take many more steps to understand the same text.
- Tokens: the practical middle ground. Small enough to be flexible, large enough to keep sequences shorter.
Why Not Let the Model Work Directly with Binary Data?
That is a reasonable question. Computers ultimately work with binary numbers, zeros and ones. So why not just feed the model raw binary data from the start?
The answer is that it becomes very inefficient.
Text is stored in computers as bits and bytes, often through encodings such as UTF-8. But if a language model worked directly at the bit level, every small piece of text would become very long. The same sentence would require many more steps, and patterns that are obvious in language would be harder for the model to detect.
One byte contains 8 bits. With 8 bits you can represent 256 different values, from 0 to 255. For simple characters like A, I, and !, one byte each is enough. In UTF-8 they therefore get the byte values 65, 73, and 33. Characters such as å, ä, and ö require more bytes.
That is why tokenization exists. It builds a more meaningful discrete representation on top of the raw data. The model still receives numbers, but at a level where language patterns are far more learnable.
8 bits become one byte value
Example: the character A
01000001 means 64 + 1 = 65.
Three simple UTF-8 characters
For simple ASCII characters, the byte values map directly in UTF-8. Many other characters require more than one byte.
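These byte values are easy to check yourself. A small Python sketch using only the standard library:

```python
# Inspect how UTF-8 turns characters into byte values.
# A, I, and ! fit in one byte each; å needs two bytes.
for ch in ["A", "I", "!", "å"]:
    data = ch.encode("utf-8")
    bits = " ".join(f"{b:08b}" for b in data)
    print(f"{ch!r}: bytes={list(data)} bits={bits}")
```

Running this prints 65 for A, 73 for I, and 33 for !, while å comes out as the two bytes 195 and 165.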
Simplified pipeline from text to model input:
1. Text: AI!
2. Bytes: 65 73 33 (in binary: 01000001 01001001 00100001)
3. Tokens: the bytes are grouped into vocabulary pieces, here AI and !
4. Token IDs: [6157, 0]
Illustrative IDs. The key point is that text first becomes bytes, then tokens, and finally integers that the model can process.
Once the tokenizer vocabulary has been built, each token gets an ID. At a basic level, that is just the index of the token in the vocabulary. That is why the same text can get different token IDs in different models.
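A minimal sketch of that index-based mapping, with two invented vocabularies, shows why the same token gets different IDs in different models:

```python
# Two hypothetical tokenizer vocabularies. A token's ID is simply its
# index in the vocabulary, so the same token maps to different IDs.
vocab_model_a = ["<pad>", "the", " the", "token", "izer"]
vocab_model_b = ["izer", "token", " the", "the"]

print(vocab_model_a.index("token"))  # -> 3
print(vocab_model_b.index("token"))  # -> 1
```

Both vocabularies here are made up; real ones contain tens or hundreds of thousands of entries, but the lookup works the same way.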
How Do You Get from Text to Token IDs?
This happens in several steps:
- Your text is read as bytes according to a text encoding.
- The tokenizer compares the text against its vocabulary, meaning a list of common text pieces.
- The text is split into the pieces that best fit that vocabulary.
- Each piece is replaced with an integer, a token ID.
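The steps above can be sketched with an invented vocabulary and a plain greedy longest-match rule. Real tokenizers follow learned merge rules rather than longest-match, but the mapping from text pieces to integers works the same way:

```python
# Toy text -> token ID conversion. The vocabulary below is invented.
VOCAB = ["token", "izer", " ", "t", "o", "k", "e", "n", "i", "z", "r"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def encode(text: str) -> list[int]:
    ids = []
    pos = 0
    while pos < len(text):
        # Take the longest vocabulary piece that matches at this position.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                pos = end
                break
        else:
            raise ValueError(f"cannot tokenize at position {pos}")
    return ids

print(encode("tokenizer"))  # -> [0, 1], i.e. "token" + "izer"
```

A common word maps to a single ID, while any string can still be covered by falling back to smaller pieces.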
So the model does not read letters or words directly. By the time the text reaches the model, it has already been turned into a sequence of numbers.
After that comes the next step in the LLM pipeline: each token ID is looked up in an embedding table and becomes a vector. That is where the model's actual mathematics begins.
How Is a Tokenizer Built?
A modern tokenizer is usually not written manually word by word. It is learned from large text corpora.
The common core idea is:
- Start from small building blocks, often bytes or very small text units.
- Go through huge corpora and measure which sequences appear often.
- Merge common sequences into larger units.
- Repeat until you have a practical vocabulary, often tens or hundreds of thousands of tokens.
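A toy version of this loop, on an invented word list instead of a real corpus, looks roughly like this:

```python
from collections import Counter

# Toy byte-pair-encoding training loop: repeatedly find the most
# frequent adjacent pair and merge it into a new vocabulary unit.
# The word list and merge count are invented for illustration.
def train_bpe(words, num_merges):
    seqs = [list(w) for w in words]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        for seq in seqs:  # apply the merge everywhere
            i = 0
            while i < len(seq) - 1:
                if seq[i] == a and seq[i + 1] == b:
                    seq[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

print(train_bpe(["tokenize", "tokenizer", "tokens"] * 10, num_merges=4))
# -> ['to', 'tok', 'toke', 'token']
```

Notice how the unit token emerges on its own, simply because its characters keep appearing together in the data.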
That is why common words often become a single token, while unusual words can be split into several pieces. It is also why a space is sometimes included at the beginning of a token. The tokenizer has learned that certain patterns frequently appear together.
Many modern tokenizers build on byte pair encoding or similar statistical methods. The exact algorithm varies, but the goal is the same: find a balance between a reasonably sized vocabulary and reasonably short sequences.
Simplified view of byte pair encoding across many merge rounds:
1. Training data: large amounts of raw text.
2. Start small: in a byte-based tokenizer, you effectively begin from very small units.
3. Merge common pairs: frequent patterns get their own building blocks because they occur over and over in the data.
4. Vocabulary and encoding: the word tokenizer can then be split as [token][izer] instead of nine separate characters.
Simplified illustration. In reality this happens across enormous corpora and thousands of merge rounds.
The point is not that someone manually decides that strings like token or izer should become tokens. They emerge because they appear often enough in the training data to become useful building blocks.
How Are Tokens Used in an LLM?
A language model does not receive text directly. The flow is basically this:
- Text is written by the user.
- The tokenizer converts the text into token IDs.
- The model processes the sequence of token IDs.
- The model predicts the next token ID.
- The token ID is decoded back into text.
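The final decode step is just the reverse lookup. With a made-up vocabulary:

```python
# Decoding reverses the mapping: token IDs back to text pieces, joined.
# The IDs and vocabulary here are invented for illustration.
ID_TO_TOKEN = {0: "token", 1: "izer", 2: " works"}

def decode(ids):
    return "".join(ID_TO_TOKEN[i] for i in ids)

print(decode([0, 1, 2]))  # -> "tokenizer works"
```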
That is why LLMs are described as systems that predict the next token, not the next word. The token is the unit the model actually works with.
Why Does Tokenization Affect Cost and Context Window?
This is the practical part many people miss.
Language models are normally priced per input token and per output token. At the same time, model capacity is almost always stated as a context window in tokens. That means tokenization directly controls two central questions:
- how expensive a request becomes
- how much information the model can fit at the same time
If you want to go deeper into what actually fills the context, how reasoning tokens are counted, and why context engineering has become so important, see our deeper guide to the context window.
That is also why the same amount of text does not always cost the same. A compact sentence may turn into relatively few tokens. Code, tables, JSON, mixed-language text, or long Swedish compounds can turn into many more.
Some practical consequences:
- a text with few words can still be token-heavy
- two models can produce different token counts for exactly the same text, because they use different tokenizers
- when the context window fills up, it is tokens that count, not words or characters
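A hypothetical illustration of the second point: the same text, split with two invented toy vocabularies, yields different token counts, and therefore different costs against a per-token price.

```python
# Both vocabularies below are invented. A greedy longest-match split,
# falling back to single characters, stands in for a real tokenizer.
def count_tokens(text, vocab):
    count, pos = 0, 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            if text[pos:end] in vocab or end == pos + 1:
                count += 1
                pos = end
                break
    return count

text = "tokenization is practical"
vocab_a = {"token", "ization", " is", " practical"}  # larger pieces
vocab_b = {"tok", "en", "iza", "tion"}               # smaller pieces

print(count_tokens(text, vocab_a))  # -> 4
print(count_tokens(text, vocab_b))  # -> 17
```

Same input, more than four times as many tokens: a tokenizer whose vocabulary fits your text poorly makes every request longer and more expensive.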
If you build systems with RAG, long instructions, or lots of conversation history, this quickly becomes decisive. At that point tokenization is not just theory, but actual system design.
When Should You Care a Lot About the Tokenizer?
If you only test a chatbot once in a while, you rarely need to think much about tokenization. But as soon as you build something more serious, it becomes important:
- when you want to understand or optimize cost
- when you operate close to the model's context limit
- when you send in long documents, code, or tables
- when you compare different models and see them behave differently despite the same input
- when you build RAG or agent flows where a lot of text passes through the model
Summary
A tokenizer is the translation layer between human text and the model's mathematics. It splits text into tokens and maps them to integers the model can work with.
That is why language models do not work directly with words, and not with raw binary data either. Words are too rigid. Binary data is too low-level and too inefficient. Tokens are the compromise that makes modern LLMs practical.
And when someone says a model has a context window of 128,000, they always mean tokens.