
Summarize:
Transformers work particularly well with sub-word tokenisations, which improve model quality and shorten sequences for faster inference. However, the gains taper off once the vocabulary becomes very large (> 500K): many tokens are rare, their embeddings are under-trained, and the bloated embedding matrix adds cost without proportional benefit. At the other end of the spectrum, character-level models avoid the rare-token problem by construction, but their sequences are much longer; with standard self-attention’s quadratic time and memory, this often makes them impractical.
Recent hierarchical approaches aim to reconcile character-level robustness with subword-level efficiency by grouping characters into patches. However, existing patching strategies often depend on whitespace-restricting applicability to languages without clear word boundaries-or require auxiliary models that introduce additional complexity and dependencies.
We propose a dynamic character-grouping method that leverages the structure of standard BPE tokenisation without auxiliary models. Specifically, we append explicit end-of-patch markers to BPE tokens and apply a second, lightweight BPE pass to regulate patch granularity. This design yields representations that are efficient, flexible, and language-agnostic, while preserving a simple training and deployment pipeline.

Figure 1. Two-stage tokenisation—BPE patches (Tokenizer 1) followed by a tiny BPE inside each patch (Tokenizer 2).
Figure 1 provides a schematic overview. Below is the procedure in a compact format.
Symbol | Meaning |
| T₁ | Existing subword tokenizer (BPE; “Tokenizer 1”) |
| <!> | End-of-patch marker (EoP) appended to every T₁
token |
| T₂ | Lightweight inner BPE (“Tokenizer 2”) applied within each patch |
| S | Maximum sub-token length per patch after T₂
(small cap, e.g. 4–8) |
| E_local / D_local | Local encoder/decoder operating inside each patch |
| E_global | Global causal transformer over patch embeddings |
| C | Raw character stream |
| pᵢ | i-th patch (character of a T₁
token + <!> ) |
| zᵢ | Patch embedding produced by E_local |
Start from any existing BPE (T1). Each token becomes a patch.
Append an end-of-patch marker so decoding is incremental.
Apply a lightweight BPE (T2) inside each patch to cap length at a small S.
Use local encoders and decoders per patch + a global causal transformer over patch embeddings.
Outcome: faster models and state-of-the-art BPB among dynamic patching baselines - without a boundary model and without whitespace assumptions.
More details about the local and global models can be found in the architecture section.
Tokenise text with T₁ to obtain [t₁, t₂, …, tₙ]. For each tᵢ, form a patch of characters within the token: pᵢ = (c¹ᵢ, c²ᵢ, …, cᴺᵢ).
For each pᵢ, apply T₂ to obtain a bounded sequence uᵢ = T₂(pᵢ) with |uᵢ| ≤ S.
Encode uᵢ with E_local to produce a fixed-size embedding zᵢ.
Feed the sequence [z₁, z₂, …, zₙ] to E_global (causal transformer) for autoregressive modelling over patches.
During generation, Eglobal predicts the next patch representation; Dlocal decodes uᵢ → pᵢ, then strip the EoP ⟂ to recover tᵢ and characters.

Figure 2: Local encoders/decoders operate within patches; a global transformer models the compressed patch sequence.
Local encoder: turns each bounded-length patch into an embedding.
Global transformer: models the sequence of patch embeddings (short!).
Local decoder: reconstructs the next patch internally during training or inference.
Why it is efficient: replacing a huge embedding matrix with light local modules + shorter global sequences reduces FLOPs and parameters while improving bits-per-byte (BPB).
Lower BPB than standard BPE and entropy-based patching; competitive with space-based patching without relying on spaces.
Compact models: fewer parameters than BPE baselines yet better compression (lower BPB).
Language-agnostic: strong results on non-whitespace scripts (e.g., Chinese).
Patch sizes ~4 bytes in practice → short global sequences and faster compute.
Table 1 — Comparison for S = 10, BPE tokenisation, Space and Entropy patching (Ent.-Patch, Ent.-Space) for 128M models trained for 13,500 steps. FLOPs are in billions. * denotes the unbounded model. We report the mean over multiple runs on the test set; max std dev across experiments: ±0.007.
Model | FLOPs ↓ | Fertility ↓ | Params | BPB ↓ |
Ent.-Patch* | 509 | 1.41 | 351M | 1.24 |
Ent.-Patch | 668 | 1.83 | 351M | 1.20 |
Space-Patch | 534 | 1.45 | 351M | 1.14 |
Char - Level | 4214 | 4.5 | 85M | 1.16 |
BPE | 562 | 1.51 | 123M | 1.16 |
BPE-Patch | 554 | 1.51 | 99.7M | 1.11 |
Table 2 shows that our approach outperforms all baselines in this setup. The entropy-patching method yields the weakest results, likely due to a domain mismatch between its entropy model’s training data and SlimPajama. Although the unbounded entropy variant is more FLOP-efficient, its longer patches force the local models to learn more complex representations with limited capacity, causing a marked drop in performance. Constraining the maximum patch length to 6 improves accuracy, but at higher FLOPs-and it still trails our method.
Space-based patching is the most FLOP-efficient baseline and reasonably strong, yet it fails to generalise to languages without whitespace, an issue our method avoids. Character-level modelling uses fewer parameters but is both slower and less predictive. Notably, we also surpass standard BPE despite using fewer parameters, suggesting our local encoders learn richer token-level representations than a fixed embedding matrix.
To further verify our model, we conduct experiments on question-answering tasks. We first pre-train a medium model on next-token prediction for 15B tokens, then perform zero-shot evaluations on:
Commonsense reasoning (0-shot): HellaSwag Zellers et al., 2019, PIQA Bisk et al., 2020 WinoGrande Sakaguchi et al., 2020, and ARC-e/-c Clark et al., 2018.
Broad Context understanding: LAMBADA (OpenAI) Paperno et al., 2016.
Language-agnostic: strong results on non-whitespace scripts (e.g., Chinese).
Popular aggregated results: MMLU Hendrycks et al., 2021
To obtain the scores, we compute the log-likelihood per character within each patch and select the answer with the highest log-likelihood. Table 2 shows that we outperform the standard BPE approach and entropy-based patching, while performing on par with space patching on most benchmarks. We also evaluate language-modelling capacity via BPB as shown in Table 2, the medium model improves with scale (lower BPB than the small model) and outperforms other baselines at larger sizes.
Table 2: Common academic evaluation benchmarks comparing a standard embedding matrix vs. our learnable embeddings. Task performance is zero-shot.
Model | Param | MMLU (acc) | Hella (acc_norm) | Lmb. (acc) | ARC-c (acc_norm) | ARC-e (acc) | Wino. (acc) | Piqa (acc) |
BPE | 359M | 23.00 | 31.3 | 29.89 | 23.63 | 38.05 | 50.5 | 61.43 |
Ent.-Patch | 575M | 23.07 | 32.91 | 29.40 | 20.90 | 36.60 | 48.8 | 59.60 |
Space-Patch | 575M | 23.24 | 35.9 | 32.8 | 21.50 | 39.02 | 52.5 | 64.2 |
BPE-Patch | 323M | 22.90 | 34.6 | 32.5 | 22.10 | 40.6 | 53.4 | 63.3 |
BPE-Patch | 1.3B | 24.95 | 52.3 | 47.3 | 28.1 | 40.6 | 54.9 | 70.1 |
Cap S around 8–10 for a sweet spot between speed and fidelity.
Works with any BPE (GPT-2, LLaMA-3 SPM, etc.).
Drop the auxiliary boundary network used in entropy patching — the end marker is enough.
Topics:
Agentic orchestration
Research Intern, UiPath
Sign up today and we'll email you the newest articles every week.
Thank you for subscribing! Each week, we'll send the best automation blog posts straight to your inbox.
Sign up today and we'll email you the newest articles every week.
Thank you for subscribing! Each week, we'll send the best automation blog posts straight to your inbox.