Abstract

Large language models have an exceptional capability to incorporate new information in a contextual manner. However, the full potential of such an approach is often constrained by the limited effective context length. One solution to this issue is to endow an attention layer with access to an external memory, which consists of (key, value) pairs. Yet, as the number of documents increases, the proportion of relevant keys to irrelevant ones decreases, leading the model to focus more on the irrelevant keys. We identify a significant challenge, dubbed the distraction issue, where keys linked to different semantic values might overlap, making them hard to distinguish. To tackle this problem, we introduce the Focused Transformer (FoT), a technique that employs a training process inspired by contrastive learning. This novel approach enhances the structure of the (key, value) space, enabling an extension of the context length. Our method allows for fine-tuning pre-existing, large-scale models to lengthen their effective context. This is demonstrated by our fine-tuning of 3B and 7B OpenLLaMA checkpoints. The resulting models, which we name LongLLaMA, exhibit advancements in tasks requiring a long context. We further illustrate that our LongLLaMA models adeptly manage a 256k context length for passkey retrieval.
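For intuition, here is a minimal sketch (not the authors' code) of what "an attention layer with access to an external memory of (key, value) pairs" can look like: the query retrieves its most similar memory keys and attends over them together with the local context. The single-head setup, NumPy implementation, function names, and top-k inner-product retrieval rule are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch of memory-augmented attention: attend over local (key, value)
# pairs plus the top_k best-matching entries from an external memory.
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def memory_attention(query, local_keys, local_values, mem_keys, mem_values, top_k=4):
    """Single-head attention over local context plus top_k retrieved memory entries."""
    # Retrieve the top_k memory keys by inner-product similarity to the query.
    scores = mem_keys @ query                      # (num_mem,)
    idx = np.argsort(scores)[-top_k:]              # indices of best-matching keys
    keys = np.concatenate([local_keys, mem_keys[idx]], axis=0)
    values = np.concatenate([local_values, mem_values[idx]], axis=0)
    # Standard scaled dot-product attention over the combined key/value set.
    weights = softmax(keys @ query / np.sqrt(query.shape[-1]))
    return weights @ values

# Toy usage: 8 local tokens, 1024 memory entries, 16-dimensional heads.
rng = np.random.default_rng(0)
d = 16
out = memory_attention(rng.normal(size=d),
                       rng.normal(size=(8, d)), rng.normal(size=(8, d)),
                       rng.normal(size=(1024, d)), rng.normal(size=(1024, d)))
print(out.shape)  # (16,)
```

The "distraction issue" shows up here when many memory keys from unrelated documents score almost as high as the relevant ones, so the softmax weights get spread over the wrong entries.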

Layman's explanation

Imagine you have a very smart computer program that can read and write text in any language. This program can learn from a lot of text data, such as books, articles, tweets, etc. The more data it learns from, the better it becomes at understanding and generating text.

However, there is a problem. The program can only remember a limited amount of text at a time. For example, it might forget what it read in the first paragraph when it reaches the fifth paragraph. This makes it hard for the program to deal with long texts or texts that have many parts. It would be like trying to solve a puzzle with only a few pieces at a time.

One way to solve this problem is to give the program an extra memory, like a notebook, where it can store some important information from the text. For example, it might write down the names of the characters, the places, the dates, etc. Then, whenever it needs to recall something, it can look at the notebook and find the relevant information.

But there is another problem. The notebook can get very crowded and messy as the program reads more and more texts. Some information might be outdated, irrelevant, or confusing. For example, the program might write down the name “Harry” for different characters from different stories. Then, when it needs to find out who Harry is, it might get confused and pick the wrong one.

To solve this problem, the authors of this paper propose a new technique called Focused Transformer (FoT). This technique helps the program to organize its notebook better and to focus on the most important information. It does this by comparing different pieces of information and finding out which ones are similar and which ones are different. For example, it might compare different Harrys and figure out which one belongs to which story. Then, it can label them differently and avoid confusion.
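The "comparing" described above is, more concretely, a contrastive-style objective. Below is a minimal sketch of the general idea (an InfoNCE-style loss, not the paper's exact training procedure): a query is rewarded for scoring the key from its own document above keys borrowed from other documents, which keeps the notebook entries for different "Harrys" apart.

```python
# Hedged sketch of a contrastive-style objective over keys: the positive key
# (same document as the query) should outscore negative keys from other documents.
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def contrastive_loss(query, positive_key, negative_keys, temperature=0.1):
    """InfoNCE-style loss: cross-entropy with the positive key placed at index 0."""
    keys = np.concatenate([positive_key[None, :], negative_keys], axis=0)
    logits = keys @ query / temperature          # similarity of the query to each key
    probs = softmax(logits)
    return -np.log(probs[0])                     # low loss when the positive wins

# Toy usage: a 16-dim query, one positive key, and 63 negatives from other documents.
rng = np.random.default_rng(0)
d = 16
loss = contrastive_loss(rng.normal(size=d), rng.normal(size=d), rng.normal(size=(63, d)))
print(float(loss))
```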

The authors claim that this technique improves the program's performance on tasks that involve long texts or texts with many parts. They also claim that it can be applied to existing programs that are already very capable, extending how much text they can keep track of at once. They show examples of how their technique works on tasks such as retrieving a hidden passkey from a very long text or answering questions about long documents.

Assisted by Bing

EDIT: Here’s the repo if you’re interested: https://github.com/CStanKonrad/long_llama