Tokenization is the backbone of any modern LLM and will be a basic building block of our own AI assistant. Whether you’re developing agentic AI systems or exploring how language models work, understanding tokenization is essential. Here we will explore in detail what goes into creating a tokenizer, why tokenizers are required, and how they are built using BytePairEncoding and SentencePiece tokenization.
What is this tutorial for?
This tutorial aims to make you familiar with the core workings of tokenization – not to hand you a quick, easy solution. We could import a library and finish the job in a few lines, and that library would in fact be better than the tokenizer we are about to build. But then you would have no insight into how tokenization actually happens. Instead, we will walk you step by step toward your own simple tokenizer. Although modest, it will give you the understanding needed to move on to building more complex AI tools (explore our developer tools for practical applications). Eventually, we will build one that is functional and usable in our assistant.
What does a tokenizer do?
Tokenization refers to breaking raw text down into individual building blocks that can be fed to a model for processing. At its core, an LLM is a next-token predictor that is later adapted to specific tasks. To predict the next part of a sequence, it works with these blocks called tokens – it takes in tokens and spits out tokens.
A sentence like “I’m a Nepali boy” might be tokenized into blocks such as: ["I", "'", "m", " a", " Nep", "ali", " boy"]
Notice that a token is not quite a word—punctuation marks and subwords also count as tokens. A tokenizer is separate from the language model; we train the two separately. The goal of the tokenizer is to come up with a vocabulary of mappings (similar to how our time zone planner maps different time representations to standardized formats), i.e., a lookup table for these blocks, while the language model itself is trained to understand the context, structure, and meaning of these tokens.

This token-based understanding is fundamental to how AI systems build and maintain memory of conversations, enabling them to recall and reference previous interactions efficiently.
Interactive Demo: To understand this better, you can visit https://tiktokenizer.vercel.app/, which lets you choose the tokenizer of different models and play around with the tokens they produce.
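If you would rather poke at this locally, a minimal sketch with OpenAI's tiktoken package (assuming it is installed via pip install tiktoken) does the same thing; the exact split you get depends on which model's tokenizer you load:

import tiktoken

enc = tiktoken.get_encoding("gpt2")        # GPT-2's BPE tokenizer
ids = enc.encode("I'm a Nepali boy")
print(ids)                                  # the token ids
print([enc.decode([i]) for i in ids])       # the text piece behind each id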

BytePairEncoding (BPE)
BytePairEncoding (BPE) is the algorithm used by GPT models. The algorithm repeatedly scans the text stream for repeated patterns and creates a new symbol (token) to stand for each pattern. This is repeated until no pattern occurs more than once. However, GPT models do not implement this in its pure form, as that would be very computationally expensive—they add a regex layer and some further augmentations that are not fully public.
Example of Basic BPE:
Assume a string: aaabdaaabac
The pair aa occurs most often, so we replace it with Z and add it to a replacement table:
Z = aa
Pattern becomes: ZabdZabac
Now ab occurs repeatedly, so we replace it with Y:
ZYdZYac
Y = ab
Z = aa
Again, ZY now repeats on its own, so we merge it as X:
XdXac
X = ZY
Y = ab
Z = aa
No pair occurs more than once, so we stop.
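Here is a minimal character-level sketch of that procedure, just to make the example concrete. The helper name toy_bpe and the symbols drawn from "ZYXWVU" are ours; when two pairs tie in frequency, the table you get may differ from the one above, but the compressed string comes out the same:

def toy_bpe(s):
    # repeatedly replace the most frequent adjacent pair with a fresh symbol
    symbols = list(s)
    fresh = iter("ZYXWVU")        # replacement symbols for this toy example
    table = {}
    while len(symbols) >= 2:
        counts = {}
        for pair in zip(symbols, symbols[1:]):
            counts[pair] = counts.get(pair, 0) + 1
        pair, freq = max(counts.items(), key=lambda kv: kv[1])
        if freq < 2:              # no pair repeats any more -> stop
            break
        new = next(fresh)
        table[new] = "".join(pair)
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(new)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return "".join(symbols), table

print(toy_bpe("aaabdaaabac"))     # compresses to 'XdXac' plus a three-entry table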
It is clearly evident that this is not feasible to implement as-is, since the number of merges would exhaust our memory. So here are some tweaks that are made in practice:
- Precompute a fixed vocabulary with a limited number of merges
- Byte-level tokenization (so any Unicode text can be represented)
- Regex preprocessing (a sketch of this step follows the list)
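To see what the regex layer does, here is a small sketch using the splitting pattern from OpenAI's published GPT-2 code (it needs the third-party regex package, since the standard re module does not support \p{...} classes):

import regex as re   # pip install regex

# GPT-2's pre-tokenization pattern: contractions, words, numbers,
# punctuation runs, and whitespace are split apart before any BPE merging.
GPT2_SPLIT_PATTERN = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

print(re.findall(GPT2_SPLIT_PATTERN, "I'm a Nepali boy, born in 2001!"))
# -> ["I", "'m", " a", " Nepali", " boy", ",", " born", " in", " 2001", "!"]
# Merges are then learned and applied inside each chunk, so a merge never
# crosses a word or punctuation boundary.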
GPT-2 has 50,257 tokens—a number that trades off vocabulary coverage against the size of the model's embedding and output layers. This comprises 256 base byte tokens, 50,000 merges learned during training, and one special token (<|endoftext|>).
Implementation
We must first encode the text stream as UTF-8 bytes and then start creating merges. You can follow along with the write-up here or download the notebook directly from https://github.com/biterdevs/TokenizerCourse.
Encoding in UTF-8:
text = 'Your text here………………..'
tokens = text.encode('utf-8')       # raw UTF-8 bytes
tokens = list(map(int, tokens))     # turn the bytes into a list of integers in 0..255
print('-----')
print(text)
print(tokens)
print(f'Length of text: {len(text)}')
print(f'Length of tokens: {len(tokens)}')
When you run this, the token count will match the character count exactly for plain-ASCII text; any non-ASCII character expands to several bytes, so the byte count can be larger (see the sketch below).
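To see the difference on non-ASCII input, try a short Devanagari string (this snippet is just an illustration, not part of the tokenizer itself):

s = "नमस्ते"                             # 6 Unicode characters
print(len(s), len(s.encode("utf-8")))    # 6 characters vs. 18 UTF-8 bytes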
The next step is to look for merges—count up the most frequently recurring pairs in the text. This can be implemented however you wish, but we opted for the simplest method we could come across:
def get_stats(ids):
    # count how often each adjacent pair of ids occurs
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts
This iterates over consecutive pairs in the id list and retains the stats—the number of occurrences of each pair—in a dictionary.
Now we ought to merge the most frequent pair wherever it occurs, which can be done as:
def merge(ids, pair, idx):
    # replace every occurrence of `pair` in `ids` with the new token id `idx`
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i+1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids
This builds a new id list in which every occurrence of the chosen pair has been replaced by a single new id.
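A tiny sanity check on a made-up id list shows what merge does—every (6, 7) pair collapses into the new id 99:

print(merge([5, 6, 6, 7, 9, 1], (6, 7), 99))   # -> [5, 6, 99, 9, 1]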
Finally, we can put the two functions together to train the merges:
vocab_size = 276               # desired final vocabulary size (hyperparameter)
num_merges = vocab_size - 256  # 256 ids are reserved for the raw bytes
ids = list(tokens)             # copy, so the original token list is kept
merges = {}                    # (pair) -> new token id
for i in range(num_merges):
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)   # most frequent pair
    idx = 256 + i                      # next unused token id
    ids = merge(ids, pair, idx)
    print(f"merged {pair} into a token {idx}")
    merges[pair] = idx
The number of merges (via vocab_size) is a hyperparameter we can tune as we wish, and we now have the learned merges from our text. We can check the initial compression as:
print(f"Initial length: {len(tokens)}")
print(f"ID length: {len(ids)}")
print(f"Compression ratio: {len(tokens) / len(ids):.2f}X")Now, of course, the ID length will emerge lesser than initial length as merges were done. Now we need to append code for decode and encoding back and forth from tokens. This is relatively simple and can be done as follows:
vocab = {idx: bytes([idx]) for idx in range(256)}   # base vocabulary: one entry per byte
for (p0, p1), idx in merges.items():
    vocab[idx] = vocab[p0] + vocab[p1]              # a merged token is the concatenation of its parts
def decode(ids):
    # map each id to its bytes, concatenate, and decode back to a string
    tokens = b"".join(vocab[idx] for idx in ids)
    text = tokens.decode("utf-8", errors='replace')
    return text
def encode(text):
    # start from raw bytes and greedily apply the learned merges, earliest merge first
    tokens = list(text.encode("utf-8"))
    while len(tokens) >= 2:
        stats = get_stats(tokens)
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break
        idx = merges[pair]
        tokens = merge(tokens, pair, idx)
    return tokens
print(encode("h"))
Conclusion
Believe it or not, we have now arrived at essentially everything we need for tokenization. Of course, the code is far from complete, but it captures the gist of what a tokenizer has to do: we looked for repeating byte pairs, merged them into a vocabulary, and wrote functions to encode and decode our tokens.
The job now is to scale up and train on a large amount of text, preferably packaging the code as a Python class. For more insights on implementing AI systems that scale and handle complex reasoning, check out our guide on building agentic AI with the ACE framework.
Related Resources & Tools
- AI Memory Systems – How tokens are stored and retrieved in AI memory
- Agentic AI Architecture – Complete guide to building autonomous AI systems
- Cognitive Debt in AI – Understanding token limitations and AI performance
- BiterDevs Tools – Explore our suite of developer tools and calculators
- Economics Calculator – AI-powered economic analysis tools


