Community Blog

Resource Center

Newsroom

Artificial Intelligence

From characters to tokens: Dynamic grouping with hierarchical BPE

February 17, 2026

From characters to tokens: Dynamic grouping with hierarchical BPE

Authored by:

Rares Dolga

Summarize:

Transformers work particularly well with sub-word tokenisations, which improve model quality and shorten sequences for faster inference. However, the gains taper off once the vocabulary becomes very large (> 500K): many tokens are rare, their embeddings are under-trained, and the bloated embedding matrix adds cost without proportional benefit. At the other end of the spectrum, character-level models avoid the rare-token problem by construction, but their sequences are much longer; with standard self-attention’s quadratic time and memory, this often makes them impractical.

Recent hierarchical approaches aim to reconcile character-level robustness with subword-level efficiency by grouping characters into patches. However, existing patching strategies often depend on whitespace-restricting applicability to languages without clear word boundaries-or require auxiliary models that introduce additional complexity and dependencies.

We propose a dynamic character-grouping method that leverages the structure of standard BPE tokenisation without auxiliary models. Specifically, we append explicit end-of-patch markers to BPE tokens and apply a second, lightweight BPE pass to regulate patch granularity. This design yields representations that are efficient, flexible, and language-agnostic, while preserving a simple training and deployment pipeline.

Figure 1. Two-stage tokenisation—BPE patches (Tokenizer 1) followed by a tiny BPE inside each patch (Tokenizer 2).

Algorithm: dynamic patching with two-stage BPE

Figure 1 provides a schematic overview. Below is the procedure in a compact format.

Notation

Symbol	Meaning
T₁	Existing subword tokenizer (BPE; “Tokenizer 1”)
<!>	End-of-patch marker (EoP) appended to every T₁ token
T₂	Lightweight inner BPE (“Tokenizer 2”) applied within each patch
S	Maximum sub-token length per patch after T₂ (small cap, e.g. 4–8)
E_local / D_local	Local encoder/decoder operating inside each patch
E_global	Global causal transformer over patch embeddings
C	Raw character stream
pᵢ	i-th patch (character of a T₁ token + <!> )
zᵢ	Patch embedding produced by E_local

High-level idea (TL;DR)

Start from any existing BPE (T1). Each token becomes a patch.
Append an end-of-patch marker so decoding is incremental.
Apply a lightweight BPE (T2) inside each patch to cap length at a small S.
Use local encoders and decoders per patch + a global causal transformer over patch embeddings.
Outcome: faster models and state-of-the-art BPB among dynamic patching baselines - without a boundary model and without whitespace assumptions.
More details about the local and global models can be found in the architecture section.

Procedure

1. Outer tokenisation (patching)

Tokenise text with T₁ to obtain [t₁, t₂, …, tₙ]. For each tᵢ, form a patch of characters within the token: pᵢ = (c¹ᵢ, c²ᵢ, …, cᴺᵢ).

2. Within-patch compression

For each pᵢ, apply T₂ to obtain a bounded sequence uᵢ = T₂(pᵢ) with |uᵢ| ≤ S.

3. Local encoding

Encode uᵢ with E_local to produce a fixed-size embedding zᵢ.

4. Global modelling

Feed the sequence [z₁, z₂, …, zₙ] to E_global (causal transformer) for autoregressive modelling over patches.

5. Decoding (incremental)

During generation, Eglobal predicts the next patch representation; Dlocal decodes uᵢ → pᵢ, then strip the EoP ⟂ to recover tᵢ and characters.

Architecture

Figure 2: Local encoders/decoders operate within patches; a global transformer models the compressed patch sequence.

Local encoder: turns each bounded-length patch into an embedding.
Global transformer: models the sequence of patch embeddings (short!).
Local decoder: reconstructs the next patch internally during training or inference.

Why it is efficient: replacing a huge embedding matrix with light local modules + shorter global sequences reduces FLOPs and parameters while improving bits-per-byte (BPB).

Results (high level)

Lower BPB than standard BPE and entropy-based patching; competitive with space-based patching without relying on spaces.
Compact models: fewer parameters than BPE baselines yet better compression (lower BPB).
Language-agnostic: strong results on non-whitespace scripts (e.g., Chinese).
Patch sizes ~4 bytes in practice → short global sequences and faster compute.

Table 1 — Comparison for S = 10, BPE tokenisation, Space and Entropy patching (Ent.-Patch, Ent.-Space) for 128M models trained for 13,500 steps. FLOPs are in billions. * denotes the unbounded model. We report the mean over multiple runs on the test set; max std dev across experiments: ±0.007.

Model	FLOPs ^↓	Fertility ^↓	Params	BPB ^↓
Ent.-Patch*	509	1.41	351M	1.24
Ent.-Patch	668	1.83	351M	1.20
Space-Patch	534	1.45 ↓	351M	1.14
Char - Level	4214	4.5	85M	1.16
BPE	562	1.51	123M	1.16
BPE-Patch	554	1.51	99.7M	1.11

Table 2 shows that our approach outperforms all baselines in this setup. The entropy-patching method yields the weakest results, likely due to a domain mismatch between its entropy model’s training data and SlimPajama. Although the unbounded entropy variant is more FLOP-efficient, its longer patches force the local models to learn more complex representations with limited capacity, causing a marked drop in performance. Constraining the maximum patch length to 6 improves accuracy, but at higher FLOPs-and it still trails our method.

Space-based patching is the most FLOP-efficient baseline and reasonably strong, yet it fails to generalise to languages without whitespace, an issue our method avoids. Character-level modelling uses fewer parameters but is both slower and less predictive. Notably, we also surpass standard BPE despite using fewer parameters, suggesting our local encoders learn richer token-level representations than a fixed embedding matrix.

Downstream evaluation

To further verify our model, we conduct experiments on question-answering tasks. We first pre-train a medium model on next-token prediction for 15B tokens, then perform zero-shot evaluations on:

Commonsense reasoning (0-shot): HellaSwag Zellers et al., 2019, PIQA Bisk et al., 2020 WinoGrande Sakaguchi et al., 2020, and ARC-e/-c Clark et al., 2018.
Broad Context understanding: LAMBADA (OpenAI) Paperno et al., 2016.
Language-agnostic: strong results on non-whitespace scripts (e.g., Chinese).
Popular aggregated results: MMLU Hendrycks et al., 2021

To obtain the scores, we compute the log-likelihood per character within each patch and select the answer with the highest log-likelihood. Table 2 shows that we outperform the standard BPE approach and entropy-based patching, while performing on par with space patching on most benchmarks. We also evaluate language-modelling capacity via BPB as shown in Table 2, the medium model improves with scale (lower BPB than the small model) and outperforms other baselines at larger sizes.

Table 2: Common academic evaluation benchmarks comparing a standard embedding matrix vs. our learnable embeddings. Task performance is zero-shot.

Model	Param	MMLU (acc)	Hella (acc_norm)	Lmb. (acc)	ARC-c (acc_norm)	ARC-e (acc)	Wino. (acc)	Piqa (acc)
BPE	359M	23.00	31.3	29.89	23.63	38.05	50.5	61.43
Ent.-Patch	575M	23.07	32.91	29.40	20.90	36.60	48.8	59.60
Space-Patch	575M	23.24	35.9	32.8	21.50	39.02	52.5	64.2
BPE-Patch	323M	22.90	34.6	32.5	22.10	40.6	53.4	63.3
BPE-Patch	1.3B	24.95	52.3	47.3	28.1	40.6	54.9	70.1

Practical tips

Cap S around 8–10 for a sweet spot between speed and fidelity.
Works with any BPE (GPT-2, LLaMA-3 SPM, etc.).
Drop the auxiliary boundary network used in entropy patching — the end marker is enough.

Topics:

Agentic orchestration

Rares Dolga

Research Intern, UiPath