Everything about Tokenization
This was a fun project I worked on in November 2023. Tokenization is an oft-neglected topic in natural language processing. With the recent explosion of interest in language models, I thought it might be good to step back and really get into the guts of what tokenization is. I created a repository to serve as a deep dive into different aspects of tokenization. It was initially supposed to be a single blog post, but it evolved into a full mini-course! It’s been organized as bite-sized chapters for easy navigation, with some code samples and (badly designed) walkthrough notebooks. It is NOT meant to be a complete reference in itself; rather, it’s meant to accompany other excellent resources like HuggingFace’s NLP course. The following topics are covered:
- Intro: A quick introduction to tokens and the different tokenization algorithms out there.
- BPE: A closer look at the Byte-Pair Encoding tokenization algorithm. We’ll also go over a minimal implementation for training a BPE model (a sketch of the core merge loop follows this list).
- 🤗 Tokenizer: The internals of HuggingFace tokenizers! We look at state (what’s saved by a tokenizer), data structures (how it stores what it saves), and methods (what functionality you get). We also implement a minimal 🤗 Tokenizer in Python for GPT2 (see the encode/decode example below).
- Challenges with Tokenization: Challenges with integer tokenization (demonstrated below), tokenization for non-English languages, and going multilingual, with a focus on the recent No Language Left Behind (NLLB) effort from Meta.
- Puzzles: Some simple puzzles to get you thinking about pre-tokenization, vocabulary size, etc.
- PostProcessing and more: A look at special tokens and post-processing (a small demo follows this list), glitch tokens, and why you might want to shrink your tokenizer.
- Galactica: Thinking about tokenizer design by diving into the Galactica paper.
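To make the BPE chapter concrete, here’s a minimal sketch of the training loop on a toy corpus. It follows the classic algorithm (count adjacent symbol pairs, merge the most frequent pair, repeat); the corpus and function names here are illustrative, not taken from the repository.

```python
import re
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Merge every occurrence of the pair into a single new symbol."""
    # Match the pair only at symbol boundaries (symbols are space-separated).
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in word_freqs.items()}

# Toy corpus: words represented as space-separated characters, with frequencies.
word_freqs = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

merges = []
for _ in range(10):  # number of merge rules to learn
    pair_counts = get_pair_counts(word_freqs)
    if not pair_counts:
        break
    best = max(pair_counts, key=pair_counts.get)
    word_freqs = merge_pair(best, word_freqs)
    merges.append(best)

print(merges)  # the learned merge rules, most frequent first
```

Real implementations add a base vocabulary, byte-level fallbacks, and tie-breaking rules, but this loop is the whole algorithm.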
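For the 🤗 Tokenizer chapter, a quick pass over the public API makes the state/methods split concrete. This uses the real `transformers` library with the pretrained GPT2 tokenizer; only the output directory name is made up.

```python
from transformers import AutoTokenizer

# Load the pretrained GPT2 tokenizer from the HuggingFace Hub.
tok = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization is fun!"
enc = tok(text)
print(enc["input_ids"])                             # token ids
print(tok.convert_ids_to_tokens(enc["input_ids"]))  # the underlying subword tokens
print(tok.decode(enc["input_ids"]))                 # decodes back to the original text

# The tokenizer's state is just a handful of files (vocab, merges, configs):
print(tok.save_pretrained("gpt2-tokenizer"))
```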
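The integer tokenization problem is easy to see for yourself. Running GPT2’s tokenizer over a few numbers shows that multi-digit numbers get chunked inconsistently (the exact splits depend on the learned merges), which is one reason arithmetic is hard for these models.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
for n in ["1", "12", "123", "1234", "12345"]:
    print(n, "->", tok.tokenize(n))
# Multi-digit numbers are split into chunks determined by BPE merge
# frequencies, so similar numbers can get very different segmentations.
```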
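Finally, post-processing is where special tokens get attached. A one-liner with BERT’s tokenizer (a real model on the Hub) shows the [CLS] and [SEP] tokens that its post-processor wraps around every sequence.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok("Hello world")
print(tok.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'hello', 'world', '[SEP]'] -- the post-processing step adds
# [CLS] and [SEP] automatically; they never appear in the raw text.
```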