Abstract
Processing long contexts remains a challenge for large language models (LLMs) due to the quadratic computational and memory overhead of the self-attention mechanism and the substantial KV cache sizes during generation. We propose a novel approach to address this problem by learning contexts offline through context compression and in-domain parameter-efficient finetuning. Our method enables an LLM to create a concise representation of the original context and efficiently retrieve relevant information to answer questions accurately. We introduce LLoCO, a technique that combines context compression, retrieval, and parameter-efficient finetuning using LoRA. Our approach extends the effective context window of a 4k token LLaMA2-7B model to handle up to 128k tokens. We evaluate our approach on several long-context question-answering datasets, demonstrating that LLoCO significantly outperforms in-context learning while using 30× fewer tokens during inference. LLoCO achieves up to 7.62× speed-up and substantially reduces the cost of long document question answering, making it a promising solution for efficient long context processing. Our code is publicly available at https://github.com/jeffreysijuntan/lloco.
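To make the two-stage flow described in the abstract concrete, here is a minimal sketch of the idea: an offline pass that compresses a long document into short summary representations and stores them, and an online pass that retrieves those summaries and answers with a LoRA-adapted model. This is not the authors' implementation; `SummaryStore`, `compress_context`, and `answer_with_lora` are illustrative stand-ins, and the "summaries" here are plain text placeholders rather than learned embeddings.

```python
# Sketch of the LLoCO-style flow: offline compression + storage, then
# retrieval + LoRA-adapted answering. All names and the placeholder
# compression below are illustrative assumptions, not the paper's code.

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SummaryStore:
    """Toy stand-in for the vector database that holds compressed contexts."""
    entries: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, doc_id: str, summaries: List[str]) -> None:
        self.entries[doc_id] = summaries

    def get(self, doc_id: str) -> List[str]:
        return self.entries[doc_id]


def compress_context(document: str, chunk_size: int = 4096) -> List[str]:
    """Offline step: split the document into chunks and map each chunk to a
    short summary representation (faked here as a chunk prefix)."""
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    return [chunk[:64] for chunk in chunks]


def answer_with_lora(question: str, summaries: List[str]) -> str:
    """Online step: prepend the retrieved compressed context to the question
    and decode with a LoRA-adapted model (decoding stubbed out here)."""
    prompt = "\n".join(summaries) + "\n\nQuestion: " + question
    return f"<decode with a LoRA-adapted LLaMA2-7B over a {len(prompt)}-char prompt>"


if __name__ == "__main__":
    store = SummaryStore()
    store.add("book-1", compress_context("some very long document " * 10_000))
    print(answer_with_lora("Who is the narrator?", store.get("book-1")))
```

The point of the design is that the expensive work (reading and compressing the full document, finetuning the adapter) happens once offline, so each question at inference time only touches the short compressed representation.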
Community
So to do QA on a book:
- Summarise/Compress the book using a separate LLM
- Store it in a vector database
- Generate the answers to all the questions that you want to ask
- Finetune it (see the LoRA sketch after this list)
- Voila. You can now ask it questions...
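For the "finetune it" step, this is roughly what attaching a LoRA adapter to LLaMA2-7B with the Hugging Face peft library looks like. The rank, alpha, and target modules below are placeholder choices rather than the paper's reported configuration, and the QA-pair dataset is assumed to already exist.

```python
# Illustrative only: attach a LoRA adapter to LLaMA2-7B before finetuning on
# (question, answer) pairs. Hyperparameters are placeholders, not the paper's.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"  # gated checkpoint; requires access
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    r=8,                                  # adapter rank (placeholder)
    lora_alpha=16,                        # scaling factor (placeholder)
    target_modules=["q_proj", "v_proj"],  # which projections get adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable
# ...then train with a standard Trainer loop on the generated QA pairs.
```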
It's a bit cumbersome, and for the use case described, it defeats its own purpose (you have to generate the QA pairs! In the real world, these don't exist yet, hence the reason for doing the QA in the first place).
I'm sure what you've built works great in certain circumstances (books like the Bible), but for real-world, on-the-fly use cases (newly released books, legal texts, confidential data, etc.) this is cracking a nut with a sledgehammer, only to find you already had a pocketful of cracked nuts.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Long-Context Language Modeling with Parallel Context Encoding (2024)
- Grounding Language Model with Chunking-Free In-Context Retrieval (2024)
- Training-Free Long-Context Scaling of Large Language Models (2024)
- Improving Retrieval Augmented Open-Domain Question-Answering with Vectorized Contexts (2024)
- BGE Landmark Embedding: A Chunking-Free Embedding Method For Retrieval Augmented Long-Context Large Language Models (2024)