What is a Tokenizer? How do they affect the training of large language models?

These are lecture notes from AI Code and Coffee meetup on (2024-03-06) (https://www.facebook.com/events/1058526405447690/), where we workrf through and discussing Tokenizers used for large language models (LLMs).

Code and coffee meetups, are practical mentoring and coding sessions - this week we’ll work through code based of Andrej Karpathy’s tutorial

https://www.youtube.com/watch?v=zduSFxRajkE

Google colab for the video:

https://colab.research.google.com/drive/1y0KnCFZvGVf_odSfcNAws6kcDD7HsI0L?usp=sharing

GitHub repo for the video: https://github.com/karpathy/minbpe

Lecture Notes on ‘Let’s Build the GPT Tokenizer’

Introduction

Understanding Tokenization

Tokenization is the process of translating strings or text into sequences of tokens (integers), and also vice versa.
It serves as the fundamental unit or atom of large language models.
The tokenization process involves converting each possible occurrence of a string piece into an integer, after creating a look-up table.
The integer associated with every single token is used to look up into this table and extracts the corresponding row.
The row contains trainable parameters that can be trained through back-propagation.
The vectors fed into the transformer perceive every single token.

Need for a More Refined Approach

While character-level tokenization may appear functional at the basic level, working with state-of-the-art language models necessitates the use of more sophisticated methods of constructing token vocabularies. Therefore, tokenization operates at chunk level and employs algorithms such as Byte Pair Encoding (BPE).

Byte Pair Encoding (BPE)

BPE is a mechanism for tokenization that was introduced for use in large language models.
It provides a tokenizer with a vocabulary of potential tokens.
It also designates the context size, determining the number of previous tokens each token can attend to in the transformer’s attention layer.

Building the BPE Tokenizer

The BPE algorithm might appear complex, but it’s feasible to build it from scratch ourselves.
Understanding the BPE encoding algorithm will also reveal how several issues that large language models encounter can be traced back to the tokenization process and how the finer elements of implementing these models can be tackled effectively.

By understanding and effectively implementing tokenization techniques like BPE, it is possible to significantly improve the performance and reliability of large language models.

Note: This lecture also indicates that the GPT2 paper and the Llama 2 paper can be useful resources for understanding the role of tokenization in the operation and implementation of large language models.

Building the GPT Tokenizer

This lecture discusses the GPT tokenizer, highlighting its functionality, how it breaks down English and non-English sentences, its interaction with Python code, and the significant differences and enhancements from GPT-2 to GPT-4 tokenizer.

Section 1: Introduction to GPT Tokenizer

The GPT tokenizer breaks down sentences into tokens, with the process executed live in the browser via JavaScript.
For instance, a string such as “hello world” would be broken down into individual tokens, each assigned a unique ID.
Even spaces are considered as tokens. For example, the word “tokenization” is split into two tokens and the space is a separate token.
Some multi-digit numbers can be broken down into multiple tokens. For example, ‘804’ breaks up into two tokens.

Section 2: Inconsistent Token Splitting

An interesting observation is that the same string can be tokenized differently based on its position in the text.
For instance, ‘egg’ at the beginning of a sentence is broken into two tokens, but when preceded by a space (‘ egg’), it appears as a single token.
The tokenization is case-sensitive as well, with ‘egg’, ‘Egg’, and ‘EGG’ each represented as different tokens.
This inconsistency means the model must learn from raw data that these are very similar concepts and group them correctly in the neural network.

Section 3: Non-English Languages and Tokens

Non-English languages, like Korean, don’t tokenize as efficiently as English, resulting in a larger number of tokens for the same sentence when translated.
This inefficiency is due to the fact that English is predominant in the training data used for the tokenizer.
The consequent longer sequence length causes a strain on the Transformer’s context length, essentially stretching the non-English text from the Transformer’s perspective.

Section 4: Python Code and Tokenizer

In the case of Python code, spaces are tokenized individually, meaning a line with indentation would create numerous tokens for spaces alone. This results in wasteful bloating of text, making GPT-2 less efficient in handling Python code.

Section 5: Transition from GPT-2 to GPT-4 Tokenizer

The GPT-4 tokenizer reduces the token count significantly. For example, a string that was tokenized into 300 tokens in GPT-2 is broken into only 185 tokens with GPT-4.
This reduction is achieved by doubling the number of tokens, from roughly 50k in GPT-2 to around 100k in GPT-4.
This change densifies the input to the Transformer, enabling it to process twice the amount of text for context. However, increasing tokens indefinitely is not beneficial as the SoftMax function and embedding table grow significantly.
For Python code, the GPT-4 tokenizer has improved handling of white space. In GPT-4, multiple spaces are grouped into a single token, making the tokenization more efficient.

The enhancements in GPT-4’s tokenizer, especially in handling Python code and white space, result in more efficient tokenization and denser input to the Transformer. These improvements contribute significantly to the overall efficiency and effectiveness of the large language model. Therefore, the tokenizer’s design plays a crucial role in improving the language model’s performance.

Strings in Python. Unicode

1. Introduction

The purpose of this lecture is to explore how text is prepared (tokenized) to be used in training the GPT (Generative Pre-training Transformer) language model. This process includes steps from taking raw string inputs, tokenizing them into a set of integers, and then using these integers as indices into a lookup table to pass into the transformer model. The complexity arises when considering the need to support the array of languages and character types that exist beyond just basic English alphabets.

2. Understanding the Role of Unicode

In Python, strings are defined as mutable sequences of Unicode code points. Unicode is a standard that defines unique integers (Unicode code points) for a huge set of characters used in written languages across the world. The standard is defined and maintained by the Unicode Consortium and it currently encompasses 150,000 characters across 161 scripts. The standard continues to be updated and the latest one in this lecture is version 15.1 in September 2023.

To access the Unicode code point of a single character in Python, you can use the ord() function. For instance:

        
      
unicode_val = ord('h')  # This returns 104, the Unicode code point for 'h'

You need to ensure that your input to ord() function is a single character only, since only individual characters have Unicode code points.

The Unicode standard encompasses a vast range of characters, from common alphabets, numerals, special characters, to even emojis. Hence, it provides a powerful tool to process text inputs across various languages and platforms.

3. From Unicode to Raw Integers

By using the ord() function, you can convert each character in a string into their corresponding Unicode code point, essentially turning them into a sequence of integers:

        
      
unicode_sequence = [ord(x) for x in 'hello']  # Converts 'hello' into sequence of Unicode code points

4. Limitations of Direct Unicode Integration

You may wonder why we can’t use these integers (derived directly from Unicode code points) for tokenization, thus eliminating the need for a tokenization step. However, there are two main obstacles:

Vocabulary size: When using Unicode directly, you would potentially need to deal with a vocabulary size of 150,000 different Unicode code points. This can lead to computational inefficiency and more complex models.
Standards instability: The Unicode standard is alive and frequently evolving, meaning that the integer representation of certain characters could change over time, leading to instability in the representation.

Hence, while the Unicode standard serves as a great starting point to capture the variety in text data, additional steps (like tokenization) are necessary to prepare the text for language models like GPT.

5. Next Steps

The follow-up topic would be to further delve into the concept of tokenization and how it can help in making our language processing more efficient and effective.

Unicode byte encodings

The process of tokenization involves encoding a text in a digitally understandable form. Different types of encoding are used for this purpose. These include UTF-8, UTF-16, and UTF-32 which are defined by the Unicode Consortium.

UTF-8, UTF-16, UTF-32:

These are encodings that help in converting Unicode text into binary data or byte streams.

UTF-8: It is a variable-length encoding which takes each Unicode point and translates it into a byte stream. The resultant byte stream can have a length between a single byte to four bytes, depending upon the Unicode point. UTF-8, being backward-compatible with the simple ASCII encoding, is the most prominently used encoding on the internet.
UTF-16 and UTF-32: While UTF-32 can offer fixed-length encoding, both UTF-16 and UTF-32 have downsides, such as potential wastefulness. For simple ASCII characters, we might end up with sequences like zero-something, zero-something, suggesting potential redundancy in encoding.

Python Implementation:

The string class in Python provides a convenient way to encode a string into different types of encodings like UTF-8, UTF-16, and UTF-32. As byte stream object printouts aren’t clear, we can translate the byte stream into a list to understand the raw bytes of the encoding.

        
some_string.encode("utf8") # encodes a string into UTF-8
list(some_string.encode("utf8")) # outputs the raw bytes of the encoding

Through analyzing the raw encoding bytes, it’s clear how UTF-16 and UTF-32 can be wasteful for our purposes, especially when encoding simple ASCII or English characters. This wastefulness becomes more evident as we see many zeros in the byte streams.

Challenge with Small Vocabulary Size:

Using raw bytes from UTF-8 encoding effectively limits our vocabulary size to only 256 tokens, which is very small. This limitation results in our text stretched over lengthy sequences of bytes. While this scenario makes the embedding table and prediction at the final layer tiny, it significantly elongates our sequences.

Long sequences hinder the transformer’s computational efficiency because they restrict us from attending to sufficiently long text. A small vocabulary size is thus undesirable and inefficient as it limits our context length. Instead, we need a larger tunable vocabulary size.

Solution: Byte Pair Encoding (BPE)

To accommodate a larger vocabulary size while using UTF-8 encoding, we turn to the Byte Pair Encoding algorithm. BPE helps compress byte sequences to a variable extent, thus efficiently reducing sequence lengths without compromising on the power of encoding.

This lecture underscores the importance of encoding in tokenization and how different encodings operate. A special emphasis is placed on why UTF-8 is universally preferred and the need to utilize BPE for accommodating a larger vocabulary size while ensuring computational efficiency.

Daydreaming Deleting Tokenization

The objective of this is to explore the concept of feeding raw byte sequences directly into language models like GPT. Although this is an exciting and promising approach, it has its complexities and challenges.

A potential method for this approach was detailed in a paper published in the summer of the previous year.

Discussion on Previous Literature

The paper proposed that feeding raw byte sequences directly into language models could be possible yet complex. Some important points mentioned in the paper were:

Modifying the Transformer architecture is necessary to accommodate raw byte sequences.
This necessity arises from the fact that the sequence length becomes very long when they are raw bytes, making computation with transformers extremely expensive.
To mitigate this issue, the paper proposed a hierarchical structuring of the Transformer architecture, which would allow the feeding of raw bytes without causing a significant load on the system.

Feasibility of Tokenization-free Approach

The paper successfully established the feasibility of tokenization-free autoregressive sequence modeling at scale. This would allow us to input raw byte streams directly into our models, bypassing the need for processing or compressing the data.

However, certain reservations exist:

The method hasn’t been widely validated or adopted by many research groups.
There’s a lack of confidence that this approach would operate effectively on a large scale.

Current Status

While the idea of completely bypassing tokenization has its appeal, the practicality of such a system is yet to be proven. As of now, the common practice is to compress the raw data using the Byte Pair Encoding (BPE) algorithm before feeding it into language models.

Despite the challenges, the pursuit of a potentially tokenization-free methodology is an interesting area. It’s hoped that future research and developments will find a solution to the problems that currently confine us to using the existing BPE algorithm. For the time being, however, we must continue to rely on data compression methods like BPE before inputting data into language models.

BytePair Encoding Algorithm Walkthrough

The lecturer started off by expressing a desire to feed raw byte sequences directly into language models. They referenced a paper from the previous summer that proposed a potential methodology for this. The challenge lies within the necessity to modify the Transformer architecture since attention becomes extremely expensive if sequences are excessively long.

The mentioned paper proposed a hierarchical structure for the Transformer that could potentially solve this problem, enabling raw bytes to be fed into it directly. This proposed approach is referred to as tokenization-free autoregressive sequence modelling. Despite the exciting possibilities of this approach, the lecturer pointed out that it has not been extensively proven by many groups at a sufficient scale.

The lecturer then transitioned into discussing the alternative method of compressing data using the Byte Pair Encoding (BPE) algorithm.

Byte Pair Encoding (BPE) Algorithm

The premise of BPE is simple and leans on the principle of replacing the most frequent pair of tokens iteratively with a single new token. This algorithm is used as a form of data compression.

To illustrate this, the lecturer used a sequence of 11 characters having only four elements in the vocabulary (a, b, c, d) as an example. The idea is to iterate through the sequence and identify the pair of tokens that occur most frequently - say ‘aa’. This pair is then replaced with a single new token, ‘Z’, thus reducing the sequence from 11 characters to nine, but increasing the vocabulary to five (a, b, c, d, Z).

This process is then repeated. If the next most frequent pair is ‘ab’, it is replaced with another new token, ‘Y’. The sequence is further compressed, leading to an increased vocabulary (a, b, c, d, Z, Y). The process continues until the sequence is sufficiently compressed.

By applying this strategy to byte sequences starting with a vocabulary of size 256 (for all possible byte values), the BPE process iteratively “mints” new tokens. These are appended to the vocabulary and replace the most frequently occurring byte pairs.

This approach results in a compressed training dataset and an algorithm that can encode any arbitrary sequence using the newly built vocabulary and decode it back to strings.

The lecture continues with the implementation of the above process, guided by a reference block post paragraph serving as the source data. The goal is to obtain a more manageable sequence length with a larger vocabulary, using the BPE algorithm for compression.

Code and Coffee Meeetup - Notes on LLM tokenizers