princeton-nlp/FullAttention-Llama-2-7b-6k

license: apache-2.0

Paper: Adapting Language Models to Compress Contexts

Code: https://github.com/princeton-nlp/AutoCompressors

Models:

Llama-2-7b fine-tuned models: AutoCompressor-Llama-2-7b-6k, FullAttention-Llama-2-7b-6k
OPT-2.7b fine-tuned models: AutoCompressor-2.7b-6k, AutoCompressor-2.7b-30k, RMT-2.7b-8k
OPT-1.3b fine-tuned models: AutoCompressor-1.3b-30k, RMT-1.3b-30k

FullAttention-Llama-2-7b-6k is a model fine-tuned from meta-llama/Llama-2-7b-hf and used as a baseline in Adapting Language Models to Compress Contexts. It is fine-tuned on 15B tokens from the RedPajama dataset, using sequences of 6,144 tokens and a RoPE θ value of 80,000.
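To give a sense of what the RoPE θ value changes, here is a minimal sketch of the standard rotary-embedding frequency formula, where dimension pair i rotates at frequency θ^(-2i/d). It assumes the Llama-2-7b head dimension of 128 (hidden size 4096 over 32 heads) and compares the Llama-2 default θ of 10,000 with the 80,000 used here. The function names are illustrative and are not taken from the AutoCompressors code.

```python
import math


def rope_inverse_frequencies(head_dim, theta):
    """Standard RoPE inverse frequencies: theta^(-2i/d) for each dimension pair i."""
    return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim, theta):
    """Number of positions the slowest-rotating pair needs for one full rotation."""
    slowest = rope_inverse_frequencies(head_dim, theta)[-1]
    return 2 * math.pi / slowest


HEAD_DIM = 128  # assumption: Llama-2-7b, hidden size 4096 / 32 attention heads

for theta in (10_000, 80_000):
    # A larger theta stretches the slowest wavelengths, so positional
    # phases change more slowly across a longer sequence.
    print(f"theta={theta}: longest wavelength ~ {longest_wavelength(HEAD_DIM, theta):.0f} positions")
```

A larger θ lowers every frequency except the first, which is the usual motivation for raising it when fine-tuning on sequences longer than the pre-training length.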

To get started, load this model as a LlamaForCausalLM model, or download the AutoCompressors repository (see the Code link above) and load the model as follows:

from auto_compressor_llama import LlamaAutoCompressorModel

model = LlamaAutoCompressorModel.from_pretrained("princeton-nlp/FullAttention-Llama-2-7b-6k")
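The first option, loading the checkpoint as a plain LlamaForCausalLM, might look like the sketch below. It assumes the transformers and torch packages are installed and that the model repository ships tokenizer files. If it does not, the tokenizer of meta-llama/Llama-2-7b-hf would be the natural fallback. The helper name load_full_attention_model and the bfloat16 default are illustrative choices, not part of the original instructions.

```python
MODEL_ID = "princeton-nlp/FullAttention-Llama-2-7b-6k"


def load_full_attention_model(model_id=MODEL_ID, dtype_name="bfloat16"):
    """Load the tokenizer and model through the generic Hugging Face classes.

    Imports happen inside the function so that defining it does not
    require transformers or torch to be installed.
    """
    import torch
    from transformers import AutoTokenizer, LlamaForCausalLM

    # Assumption: tokenizer files are available under the same model id.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = LlamaForCausalLM.from_pretrained(
        model_id,
        torch_dtype=getattr(torch, dtype_name),  # e.g. torch.bfloat16
    )
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


# Example usage (downloads the weights on first call):
# tokenizer, model = load_full_attention_model()
# inputs = tokenizer("The quick brown fox", return_tensors="pt")
# outputs = model(**inputs, labels=inputs["input_ids"])
# print(outputs.loss)  # mean next-token negative log-likelihood
```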

Evaluation

We report the perplexity achieved by our Llama-2-7B models on segments of 2048 tokens, conditioned on different amounts of preceding context. FullAttention-Llama-2-7b-6k attends to the full uncompressed context, whereas AutoCompressor-Llama-2-7b-6k compresses each 2048-token segment into 50 summary vectors.

Context Tokens	0	512	2048	4096	6144
Pre-trained Llama-2-7b	5.52	5.15	4.98	-	-
FullAttention-Llama-2-7b-6k	5.40	5.06	4.88	4.80	4.76
AutoCompressor-Llama-2-7b-6k	5.40	5.16	5.11	5.08	5.07
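For readers unfamiliar with the metric, the following sketch shows the standard definition of perplexity as the exponential of the mean negative log-likelihood per token, restricted to a final evaluation segment so that earlier tokens serve only as context. It is an illustration of the metric, not the evaluation script from the paper. The function names and the toy log-probabilities are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) over the given tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def segment_perplexity(token_log_probs, segment_len=2048):
    """Score only the last `segment_len` tokens; earlier tokens act as context."""
    return perplexity(token_log_probs[-segment_len:])


# Toy example: three context tokens followed by a two-token evaluation segment.
toy_log_probs = [math.log(0.25)] * 3 + [math.log(0.5)] * 2
print(segment_perplexity(toy_log_probs, segment_len=2))
```

Under this definition, a lower value in the table means the model assigns higher probability to the evaluated segment.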

Bibtex

@misc{chevalier2023adapting,
      title={Adapting Language Models to Compress Contexts}, 
      author={Alexis Chevalier and Alexander Wettig and Anirudh Ajith and Danqi Chen},
      year={2023},
      eprint={2305.14788},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}