machine-learning · September 20, 2025 · 7 min read

Understanding LLM Tokens in TypeScript: A Comprehensive Developer's Guide

Jamal Washington, Infrastructure Lead


Many developers are working with Large Language Models (LLMs) without understanding the fundamental concepts that power them. One of the most important concepts to grasp is tokens - the basic units that LLMs process. In this deep dive, we'll explore what tokens are, how they work, and why they matter, all through the lens of TypeScript code examples.

What Are Tokens in LLMs?

Tokens are the currency of LLMs. When you send an input like "Hello World" to an LLM, that text gets broken down into its constituent tokens. These tokens are then processed by the model, which produces output tokens that are converted back into human-readable text.

Each LLM provider has its own tokenization system. For example, sending "Hello World" to OpenAI might result in 3 tokens, while the same prompt sent to different models could result in different token counts. This variation affects both processing and cost, since LLM providers typically charge based on the number of tokens processed.

How Token Counting Works

Let's demonstrate token counting by calling two different models through their providers' SDKs:

TYPESCRIPT
// Anthropic and Google clients (API keys are read from environment variables)
import Anthropic from "@anthropic-ai/sdk";
import { GoogleGenAI } from "@google/genai";

const anthropic = new Anthropic();
const genAI = new GoogleGenAI({});

// Using Anthropic's Claude 3 Haiku
const anthropicResponse = await anthropic.messages.create({
  model: "claude-3-haiku-20240307",
  max_tokens: 1000,
  messages: [{ role: "user", content: "Hello world" }]
});

console.log("Text:", anthropicResponse.content);
console.log("Usage:", anthropicResponse.usage);
// Example output: 11 input tokens, 20 output tokens

// Using Google's Gemini 2.0 Flash-Lite
const googleResponse = await genAI.models.generateContent({
  model: "gemini-2.0-flash-lite",
  contents: [{ role: "user", parts: [{ text: "Hello world" }] }]
});

console.log("Text:", googleResponse.text);
console.log("Usage:", googleResponse.usageMetadata);
// Example output: 4 input tokens, 11 output tokens

Why do we see different token counts for the same input? This is because each model has its own token vocabulary - a collection of words, subwords, and characters that it recognizes as individual tokens.

The Token Encoding Process

Every LLM has a different token vocabulary. These tokens are all the different words, subwords, and characters that the model knows. Each of these linguistic units gets assigned a number, and that number is the token.

When we call an LLM, it encodes the text we send it into tokens. For example, "Hello World" gets split into the largest chunks that exist in the vocabulary (for GPT-style tokenizers this is typically "Hello" and " World", with the leading space attached to the second chunk). The model then looks up the number assigned to each chunk in the vocabulary.

LLM token processing using TypeScript code instead of Python

We can examine this process using the js-tiktoken library, a JavaScript implementation of OpenAI's tokenizer:

TYPESCRIPT
import { getEncoding } from "js-tiktoken";

// Using the o200k_base encoding, which GPT-4o uses
const tokenizer = getEncoding("o200k_base");

// Some sample text to tokenize
const content = "The wise owl of moonlight forest, where ancient trees stretch their branches toward the starry sky.";

// Encode the text into tokens
const tokens = tokenizer.encode(content);

console.log(`Content length: ${content.length} characters`);
console.log(`Number of tokens: ${tokens.length}`);
console.log(`Tokens: ${tokens}`);

// Try with a simpler example
const simpleContent = "Hello world";
const simpleTokens = tokenizer.encode(simpleContent);
console.log(`Content length: ${simpleContent.length} characters`);
console.log(`Number of tokens: ${simpleTokens.length}`);
console.log(`Tokens: ${simpleTokens}`);
Tokenizing text with the GPT-4o tokenizer using sample text from a markdown file

The Full LLM Process

Let's look at the complete LLM process from end to end:

  1. Text input (e.g., "Hello world") is sent to the model
  2. The text is split into the largest chunks recognized in the vocabulary
  3. These chunks are encoded into tokens (numbers) by looking up their assigned values in the vocabulary
  4. The LLM processes these tokens and generates output tokens
  5. The output tokens are decoded back into text by reversing the process
  6. The decoded text is returned as the response

It's important to understand that all the computation in an LLM is done on numbers (tokens), not on text directly. The model only deals with these numeric representations of chunks of text.

Decoding tokens back to text using the tokenizer's decode function
TYPESCRIPT
// Decoding tokens back into text
const decodedText = tokenizer.decode(tokens);
console.log(`Decoded text: ${decodedText}`);

How Tokenizers Are Built

To understand why different models tokenize text differently, we need to explore how tokenizers are trained. Tokenizers are typically built using the same corpus of text that the model itself is trained on.

Let's implement a simple character-level tokenizer in TypeScript to understand the basics:

TYPESCRIPT
class CharacterLevelTokenizer {
  // mapping is public so callers can inspect the vocabulary
  readonly mapping: Record<string, number> = {};
  private reverseMapping: Record<number, string> = {};

  constructor(dataset: string) {
    // Extract unique characters from the dataset
    const uniqueChars = [...new Set(dataset)].sort();
    
    // Assign a token ID to each unique character
    uniqueChars.forEach((char, index) => {
      this.mapping[char] = index;
      this.reverseMapping[index] = char;
    });
  }

  encode(text: string): number[] {
    // Convert each character to its token ID
    return [...text].map(char => this.mapping[char] ?? 0); // unknown characters fall back to token 0
  }

  decode(tokens: number[]): string {
    // Convert each token ID back to its character
    return tokens.map(token => this.reverseMapping[token] ?? '').join('');
  }
}

// Usage
const dataset = "the cat sat on the mat";
const tokenizer = new CharacterLevelTokenizer(dataset);

const input = "cats sat mat";
const tokens = tokenizer.encode(input);

console.log(`Input length: ${input.length} characters`);
console.log(`Number of tokens: ${tokens.length}`);
console.log(`Tokenizer vocabulary: ${JSON.stringify(tokenizer.mapping)}`);

With this character-level tokenizer, the number of tokens will always equal the number of characters, which isn't efficient. Real tokenizers use subword tokenization to reduce the number of tokens needed to represent text.

Subword Tokenization

A more efficient approach is subword tokenization, which identifies common character sequences and treats them as single tokens. Here's a simplified implementation:

TYPESCRIPT
class SubwordLevelTokenizer {
  // mapping is public so callers can inspect the vocabulary
  readonly mapping: Record<string, number> = {};
  private reverseMapping: Record<number, string> = {};
  private subwords: string[] = [];

  constructor(dataset: string) {
    // First, add all individual characters
    const uniqueChars = [...new Set(dataset)].sort();
    let tokenId = 0;
    
    uniqueChars.forEach(char => {
      this.mapping[char] = tokenId;
      this.reverseMapping[tokenId] = char;
      tokenId++;
    });
    
    // Find common character pairs
    for (let i = 0; i < dataset.length - 1; i++) {
      const pair = dataset.substring(i, i + 2);
      if (!this.subwords.includes(pair)) {
        this.subwords.push(pair);
        this.mapping[pair] = tokenId;
        this.reverseMapping[tokenId] = pair;
        tokenId++;
      }
    }
  }

  encode(text: string): number[] {
    const tokens: number[] = [];
    let i = 0;
    
    while (i < text.length) {
      // Try to match the longest possible subword
      let matched = false;
      
      // Check if we can match a 2-character subword
      if (i < text.length - 1) {
        const pair = text.substring(i, i + 2);
        if (this.mapping[pair] !== undefined) {
          tokens.push(this.mapping[pair]);
          i += 2;
          matched = true;
          continue;
        }
      }
      
      // If no subword matched, use the single character
      if (!matched) {
        tokens.push(this.mapping[text[i]] ?? 0); // unknown characters fall back to token 0
        i++;
      }
    }
    
    return tokens;
  }

  decode(tokens: number[]): string {
    return tokens.map(token => this.reverseMapping[token] || '').join('');
  }
}

// Usage
const dataset = "the cat sat on the mat";
const tokenizer = new SubwordLevelTokenizer(dataset);

const input = "cats sat mat";
const tokens = tokenizer.encode(input);

console.log(`Input length: ${input.length} characters`);
console.log(`Number of tokens: ${tokens.length}`);
console.log(`Tokenizer vocabulary: ${JSON.stringify(tokenizer.mapping)}`);

With this subword tokenizer, we can represent the same text with fewer tokens, making processing more efficient. Real tokenizers use more sophisticated algorithms to identify larger subword chunks based on frequency in the training corpus.
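
To give a flavor of that frequency-based approach, here is a minimal sketch of repeated byte-pair-style merges: count the most frequent adjacent pair of units and merge it into a new vocabulary entry. This is an illustration only; the helpers (mostFrequentPair, mergePair) are hypothetical and not part of any real tokenizer library.

TYPESCRIPT
// Find the most frequent adjacent pair of units (illustrative BPE-style step)
function mostFrequentPair(units: string[]): [string, string] | null {
  const counts = new Map<string, number>();
  for (let i = 0; i < units.length - 1; i++) {
    const key = `${units[i]}\u0000${units[i + 1]}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  let best: string | null = null;
  let bestCount = 0;
  for (const [key, count] of counts) {
    if (count > bestCount) {
      best = key;
      bestCount = count;
    }
  }
  return best ? (best.split("\u0000") as [string, string]) : null;
}

// Merge every occurrence of the pair into a single new unit
function mergePair(units: string[], pair: [string, string]): string[] {
  const merged: string[] = [];
  let i = 0;
  while (i < units.length) {
    if (i < units.length - 1 && units[i] === pair[0] && units[i + 1] === pair[1]) {
      merged.push(pair[0] + pair[1]); // the merged pair becomes a new vocabulary entry
      i += 2;
    } else {
      merged.push(units[i]);
      i++;
    }
  }
  return merged;
}

// Start from characters and repeatedly merge the most frequent adjacent pair
let units = [..."the cat sat on the mat"];
for (let step = 0; step < 3; step++) {
  const pair = mostFrequentPair(units);
  if (!pair) break;
  units = mergePair(units, pair);
  console.log(`After merging "${pair[0]}${pair[1]}":`, units.join("|"));
}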

Vocabulary Size Trade-offs

The size of a tokenizer's vocabulary significantly impacts efficiency. With a small vocabulary of around 1,000 tokens, a word like "understanding" might be split into four tokens: "under", "st", "and", "ing". With a larger vocabulary of 50,000 tokens, it might be just two tokens: "under", "standing". The larger the vocabulary, the fewer tokens are needed to represent text.

However, there's a trade-off: larger vocabularies increase model size and memory requirements. Different model providers make different decisions about this trade-off, which is why the same text can result in different token counts across models.
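
You can observe this trade-off directly with js-tiktoken by encoding the same text with two OpenAI encodings of different vocabulary sizes: cl100k_base (roughly 100k entries, used by GPT-4) and o200k_base (roughly 200k entries, used by GPT-4o). Exact counts depend on the text, but the larger vocabulary usually needs fewer tokens:

TYPESCRIPT
import { getEncoding } from "js-tiktoken";

const smallerVocab = getEncoding("cl100k_base"); // ~100k-entry vocabulary (GPT-4)
const largerVocab = getEncoding("o200k_base");   // ~200k-entry vocabulary (GPT-4o)

const sample = "Understanding tokenization helps developers optimize their LLM applications.";

console.log(`cl100k_base: ${smallerVocab.encode(sample).length} tokens`);
console.log(`o200k_base:  ${largerVocab.encode(sample).length} tokens`);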

How Tokenizers Handle Unusual Words

When a tokenizer encounters an unusual word that wasn't common in its training data, it typically breaks it down into smaller subwords or even individual characters. For example, a made-up word like "frabjous" (from Lewis Carroll's poem) might be split into multiple tokens because it's not common in the training data.

TYPESCRIPT
// Tokenizing unusual words with the GPT-4o tokenizer from earlier
const unusualText = "O frabjous day";
const unusualTokens = tokenizer.encode(unusualText);

console.log(`Text: ${unusualText}`);
console.log(`Tokens: ${unusualTokens}`);
console.log(`Number of tokens: ${unusualTokens.length}`);

// Decode to see how it was split
const decodedPieces = unusualTokens.map(t => tokenizer.decode([t]));
console.log(`Token pieces: ${JSON.stringify(decodedPieces)}`);

This behavior has important implications for using LLMs with different languages or specialized domains. Text in languages that were less represented in the training data will typically require more tokens, as will technical jargon or domain-specific terminology.
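
A quick way to see this effect is to count tokens for roughly equivalent text in different languages. The sample sentences below are only illustrative, and exact counts depend on the encoding, but less-represented languages typically need more tokens per character of meaning:

TYPESCRIPT
import { getEncoding } from "js-tiktoken";

const enc = getEncoding("o200k_base");

const samples: Record<string, string> = {
  English: "The quick brown fox jumps over the lazy dog.",
  Japanese: "素早い茶色の狐がのろまな犬を飛び越える。",
  Code: "const result = items.filter((item) => item.isActive);",
};

// Compare character counts with token counts for each sample
for (const [label, text] of Object.entries(samples)) {
  console.log(`${label}: ${text.length} characters -> ${enc.encode(text).length} tokens`);
}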

Practical Implications for Developers

  • Cost optimization: Understanding tokens helps you estimate and reduce API costs when working with LLMs (see the sketch after this list)
  • Context window management: Since LLMs have limited context windows measured in tokens, efficient tokenization helps maximize the information you can include
  • Performance considerations: Fewer tokens generally means faster processing
  • Language support: Be aware that non-English languages may use more tokens, affecting costs and context limits
  • Programming languages: Common programming languages like JavaScript typically tokenize more efficiently than less common ones
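
As a rough sketch of the first two points, you can count tokens locally before sending a request, estimate cost from your provider's published per-token prices, and check the prompt against the model's context window. The price and window values below are placeholders, not real figures:

TYPESCRIPT
import { getEncoding } from "js-tiktoken";

const enc = getEncoding("o200k_base");

// Placeholder values: substitute your provider's actual pricing and limits
const PRICE_PER_INPUT_TOKEN = 0.15 / 1_000_000; // dollars per input token (hypothetical)
const CONTEXT_WINDOW = 128_000;                 // tokens (hypothetical)

function estimatePrompt(prompt: string) {
  const tokenCount = enc.encode(prompt).length;
  return {
    tokenCount,
    estimatedCostUSD: tokenCount * PRICE_PER_INPUT_TOKEN,
    fitsInContext: tokenCount <= CONTEXT_WINDOW,
  };
}

console.log(estimatePrompt("Summarize the following report in three bullet points."));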

Summary

Tokens are the fundamental units that LLMs process. They represent words, subwords, or characters in a numeric format that models can work with. The process involves encoding text into tokens, processing those tokens, and then decoding the output tokens back into text.

Different model providers use different tokenization strategies, which is why the same text can result in different token counts. Understanding how tokenization works helps developers optimize their LLM applications for cost, performance, and effectiveness.

By implementing simple tokenizers in TypeScript, we've seen how the process works under the hood and gained insights into the trade-offs involved in tokenizer design. This knowledge is essential for any developer working with LLMs in production environments.
