Community Blog

Resource Center

Newsroom

Artificial Intelligence

Latte

October 1, 2024

Latte Latent attention for linear time Transformers

Authored by:

Marius Cobzarenco

UiPath DocPath fine-tuned LLM for information extraction from documents

Summarize:

Transformers are one of the most popular architectures used in both sequence modeling and computer vision. At the center of Transformers is the attention mechanism, which compares each element of a sequence with every other element. This pairwise similarity score is used to decide how much the other tokens contribute to the new representation of one element. While the approach gives state-of-the-art results, it comes at the cost of quadratic time complexity. Additionally, for language generation, the next token prediction is linear in the prompt length, compared to the constant time complexity of approaches like Structured State Models (SSMs).

We introduce Latte, a new linear time and memory replacement for standard attention, which achieves a comparable performance to Transformers while being more efficient during training and inference. These properties are important for document modeling or high-resolution visual question answering, where input can be very long. In this blog post, we focus on an intuitive explanation of Latte, but the approach is inspired and can be easily understood from the lens of latent variables. For a concise mathematical explanation, check out our paper.

We will first rewrite the classic attention mechanism in the non-vectorized form, which will help us describe the idea behind Latte.

Quick recap

One of the most common ways of writing a standard attention layer is using the matrix form:

Equation showing the standard attention layer in matrix form with queries Q, keys K, values V and softmax normalization, the baseline mechanism Latte replaces with linear-time attention.

Nonetheless, bearing in mind that standard attention is based on pairwise interactions between elements of a sequence, the formula can be written more intuitively without any vectorization. Defining a sequence of tokens

Inline formula X with subscript 1 superscript T denoting a sequence of input tokens in the Latte linear attention model for Transformers.

from which we obtain the queries

Inline formula q with subscript t denoting the queries vector for token t in standard attention used by the Latte Transformer.

keys

and values

Inline formula v with subscript t denoting the values vector for token t in standard attention used by the Latte Transformer.

we explain standard attention in Figure 1. Mathematically, one can write the above for the new

vector:

and

Equation defining the standard attention weight alpha subscript ts using softmax over the dot product of queries and keys in the Latte derivation.

Diagram illustrating the standard attention mechanism with pairwise interactions between every pair of sequence tokens, the quadratic-cost baseline Latte replaces with linear attention.

Hence, the new representation of

is a combination of all elements, weighted by their similarity with the current token

Inline formula x with subscript t denoting an input token at position t in the Latte attention mechanism for Transformers.

. To make the connection with the matrix form, we observe that

Inline formula q transpose subscript t equals Q subscript t showing the matrix-form equivalence in Latte attention.

and

Inline formula k subscript s equals K transpose subscript s showing the matrix-form equivalence in Latte attention.

. Notice that we can think of

Inline formula alpha subscript ts denoting attention weights between tokens t and s in standard attention, the bottleneck Latte mitigates.

as the probability of occurrence for a token at position

Inline formula lowercase t denoting a token position index in the Latte sequence modeling notation.

. This observation will help us understand our latent approach. Here we defined bidirectional standard attention, but for the causal case, we would sum up to index

instead of the entire sequence length

Inline formula uppercase T denoting the total sequence length in the Latte attention model.

. This ensures that the new representation does not incorporate tokens in the future which are not available at test time.

Bidirectional Latte

As previously stated, the bottleneck of standard attention is computing weights

. We mitigate this by introducing learnable latent tokens that are compared to each element of the sequence. Since the number of latent tokens is fixed, the computation cost becomes linear in the sequence length. Intuitively, we can think of the latent variables as concepts like colors or shapes to which we compare the input. Then our method creates a new representation using all the sequence tokens and their similarity to the learned high-level concepts. In Figure 2, we show the difference between bidirectional Latte and bidirectional standard attention methods.

Diagram comparing bidirectional Latte against bidirectional standard attention, showing how Latte routes information through learnable latent variables instead of pairwise token comparisons.

The approach has similarities with sparse attention methods such as BigBird, which only compute attention between a set of learnable global tokens and all the sequence elements. However, the main difference is that the sparse methods are weighted sums of the global tokens, while in our approach we consider the entire sequence. Specifically, we define a different parametrization of full attention using latent variables, instead of only performing attention between the latents and the sequence elements.

Defining our previous observation that attention has a probabilistic interpretation, we can re-parameterize

with a weighted sum based on

learnable latent variables:

Equation re-parameterizing the attention weight alpha subscript ts as a sum over latent-variable factors in the Latte bidirectional attention formulation.

In the above, we assumed independence between

Inline formula lowercase s denoting a token position index in the Latte attention model.

and

give

Inline formula lowercase l denoting a latent variable index in the Latte latent-attention parameterization.

. Intuitively, we compute the similarity between a high-level concept and each element, then we re-weight it based on the similarity between our current token and the high-level concepts. These concepts are our latent variables, which we learn end-to-end in tasks like classification or language modeling. Hence, they might not necessarily be interpretable. To calculate the probabilities above, we can reuse the attention matrices

Inline formula Q denoting the queries matrix in the Latte attention reformulation with latent variables.

and

Inline formula K denoting the keys matrix in the Latte attention reformulation with latent variables.

, giving us the new vector representation of each token:

Note that

and

have different sizes than the queries and keys in the standard attention. Figure 3 describes in detail how we obtain these matrices.

Our formulation results in

Inline formula big-O of T L D denoting the linear time complexity of the Latte attention mechanism in sequence length and latent count.

time and

Inline formula big-O of T L plus L D denoting the linear memory complexity of the Latte attention mechanism.

memory complexity, compared to the

Inline formula big-O of T squared D denoting the quadratic complexity of standard Transformer attention that Latte improves upon.

of the standard bidirectional approach. We defined

Inline formula uppercase D denoting the vector dimension in the Latte attention mechanism.

to be the vector dimension. The approach is not entirely new. Other works have decomposed attention in the same fashion for the bidirectional case. However, our probabilistic framework easily allows us to extend our model to the causal case.

Causal Latte

In the previous sections, we described the bidirectional case, but for problems like language generation, we need a causal mechanism. The change can be trivially seen by looking at the formula for

and only sum up to index

instead of the entire sequence. This means that we have a cumulative sum and we cannot simply apply the softmax function over

. Instead, we need an approach which updates sequentially the normalisation factor and the weight given by

. Defining

Inline formula Q-prime equals Softmax denoting the softmax-normalized query matrix in Causal Latte.

and

Inline formula K equals exp denoting the exponential of the keys matrix used in the cumulative sum of Causal Latte.

we can write:

Equation defining the causal Latte mechanism as a cumulative sum over the sequence using softmax-normalized queries and exponential keys for linear-time language generation.

The formulation above can be vectorized. However, a sequential implementation has the benefit of constant time complexity for the next token prediction task. Hence, predicting

tokens requires

time and

memory.

Relative embeddings

Relative embeddings generalize better to unseen sequence length when compared to additive positional embeddings. However, in their standard form, they do not make sense to be used for latent tokens. We therefore introduce VAPOR (value embedded positional rotations) which computes the relative distance between tokens, but without affecting the attention weights:

Equation defining VAPOR, the value embedded positional rotations used by Latte for relative positional encoding without altering the attention weights.

Results

Runtime Efficiency

We developed a method with linear time and memory complexity in the sequence length. One drawback is that the causal version needs to be implemented sequentially to decrease memory usage and have constant time inference. If the sequence length is small, this can be slower than a vectorized version of standard attention on GPUs. To see the benefits of Latte, we perform an analysis of runtime performance in Figure 4.

Runtime comparison chart showing memory and time efficiency of bidirectional and causal Latte against standard attention across varying sequence lengths.

From the above, we can see that the bidirectional case is faster than the standard attention even when the sequence length is small. However, the sequential causal model has a better runtime performance than causal attention only for sequences longer than 3,000 tokens. In terms of memory, Latte is more efficient even when the sequence has a smaller length. The results are dependent on the number of latent variables which give a tradeoff between runtime efficiency and the complexity of a model.

Long Range Arena

Long Range Arena is a synthetic benchmark which tests the ability of models to capture long-range dependencies on sequences of 2,000 to 16,000 tokens. All the tasks in the benchmark treat the input as a sequence of tokens and are formulated as classification problems. Consequently, the performance of the model is measured with accuracy, where a higher score means a better model.

We implement the tasks with a bidirectional Latte model using 40 latents and show that we outperform the standard attention. The low number of latents results in a model which is faster than the standard attention, while still having better performance. We also compare Bidirectional Latte to other efficient Transformers and obtain comparable results, with the benefit that our method could easily be applied in both causal and bidirectional cases.

Table comparing bidirectional Latte against standard attention and other efficient Transformers on Long Range Arena tasks using accuracy scores.

Language generation

For language modeling, we train a Causal Latte model on the next token prediction task. The datasets used are Wiki103, OpenWebText, and Enwik8. We tokenize the first two with a byte pair encoding tokenizer, while for the latter we used a character tokenizer. The sequence lengths are 1,024 and 2,048 for the two tokenization types. Two common metrics that we also use to measure the success of this task are perplexity (PPL) and bits-per-character (BPC). PPL is the exponential of the negative log-likelihood, meaning that a lower score indicates a better model. Similarly, BPC is the negative log-likelihood transformed in based two such that it indicates the number of bits used to represent a character. Again, a lower score means a better model.

We set the number of latent variables

Inline formula L equals 128 denoting the number of latent variables in the Causal Latte language modeling experiments.

such that the model is faster than the standard causal attention while obtaining comparable results, as reported in Table 2.

Table comparing Causal Latte with VAPOR against standard attention and Transformer-XL on Wiki103, OpenWebText, and Enwik8 language modeling tasks using perplexity and bits-per-character.

On token-level language modeling tasks, Latte combined with VAPOR obtains scores close to the standard attention. This is shown by experiments on Wiki103 and OpenWebText datasets. We also benchmark against Transformer-XL, a recursive model built for long sequences, and we get better results for a comparable number of parameters. While these results are promising considering the runtime gains, our model has some disadvantages on character-level data sets like Enwik8. For this setting, patterns are more difficult to observe and elementwise interaction between characters might be required to increase performance. Nonetheless, the results show a tradeoff between computational complexity and model capacity.

Final thoughts

Inspired by the fact that language can be decomposed into higher-level concepts, we developed a simple framework for bidirectional and causal cases that acts as a replacement for standard attention. Following the probabilistic interpretation, our model is easy to implement and has a fast and memory-effective runtime while it achieves better or comparable performance on classification and language generation tasks. Another benefit of our approach is that the next token prediction runs in constant time, resulting in a fast model during generation. Latte is a flexible model, which we would also like to apply in multimodal tasks like visual question answering. Check out our code for more details!

Topics:

Agentic orchestration

Marius Cobzarenco

Senior Director, Software Engineering, UiPath