Multi-token prediction models and baselines

Models accompanying the research paper "Better & Faster Large Language Models via Multi-token Prediction" (https://arxiv.org/abs/2404.19737).

Included are the following four 7B-parameter models trained on code:

baseline model (n=1) trained on 200B tokens of code: 7B_200B_1/
multi-token prediction model (n=4) trained on 200B tokens of code: 7B_200B_4/
baseline model (n=1) trained on 1T tokens of code: 7B_1T_1/
multi-token prediction model (n=4) trained on 1T tokens of code: 7B_1T_4/

Tokenizer: standard Llama 2 SentencePiece tokenizer in tokenizer.model.

Quickstart

Install torch, fairscale, fire and sentencepiece and run

torchrun --nproc_per_node 1 example_completion.py --ckpt_dir 7B_200B_4/ --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 2

replacing 7B_200B_4/ with the checkpoint directory you want to use.
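To run the quickstart command against each of the four checkpoints in turn, the invocation can be assembled programmatically. The following is a minimal sketch: the helper name build_command is hypothetical and not part of this repository, while the flags and default values are copied from the command above.

```python
import shlex

# Checkpoint directories listed at the top of this README.
CHECKPOINT_DIRS = ["7B_200B_1/", "7B_200B_4/", "7B_1T_1/", "7B_1T_4/"]


def build_command(ckpt_dir, tokenizer_path="tokenizer.model",
                  max_seq_len=128, max_batch_size=2):
    """Return the quickstart torchrun invocation as an argument list.

    Hypothetical helper: it only rebuilds the command shown above,
    substituting the checkpoint directory.
    """
    return [
        "torchrun", "--nproc_per_node", "1", "example_completion.py",
        "--ckpt_dir", ckpt_dir,
        "--tokenizer_path", tokenizer_path,
        "--max_seq_len", str(max_seq_len),
        "--max_batch_size", str(max_batch_size),
    ]


# Print one shell-ready command line per checkpoint.
for ckpt_dir in CHECKPOINT_DIRS:
    print(shlex.join(build_command(ckpt_dir)))
```

The returned list can also be passed to subprocess.run directly instead of being printed.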

Format

The PyTorch state_dicts are compatible with the Llama format: the layers of the shared trunk and the next-token prediction head layer are numbered contiguously. Additional prediction heads for tokens further in the future are named extra_heads and can be ignored for standard autoregressive inference.
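As an illustration of ignoring the additional heads, the sketch below separates a state_dict into the Llama-compatible part and the extra heads. It assumes that the keys of the additional heads start with the string extra_heads; the exact key layout, the helper name split_extra_heads, and the toy key names are assumptions for illustration and should be checked against the actual checkpoint.

```python
def split_extra_heads(state_dict, prefix="extra_heads"):
    """Split a state_dict into (llama_compatible, extra_heads) parts.

    Assumption: parameters of the additional prediction heads have
    keys that start with `prefix`. Verify this on a real checkpoint.
    """
    llama_compatible, extra = {}, {}
    for name, tensor in state_dict.items():
        target = extra if name.startswith(prefix) else llama_compatible
        target[name] = tensor
    return llama_compatible, extra


# Toy stand-in for a loaded state_dict; keys and values are illustrative only.
toy_state_dict = {
    "layers.0.attention.wq.weight": "trunk tensor",
    "output.weight": "next-token head tensor",
    "extra_heads.0.attention.wq.weight": "extra head tensor",
}

llama_compatible, extra = split_extra_heads(toy_state_dict)
print(sorted(llama_compatible))
print(sorted(extra))
```

With a real checkpoint, the dictionary would come from torch.load, and the filtered part could then be loaded into a standard Llama model.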

The implementation of forward() in llama/model.py provides an additional argument return_all_heads. If set, the additional prediction heads are called and the logits are returned with shape (batch_size, seq_len, n_future_tokens, vocab_size). Otherwise, the logits have shape (batch_size, seq_len, 1, vocab_size).
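The two output shapes can be summarized in a small helper. This is only a sketch of the shape rule stated above, not code from llama/model.py: the function name expected_logits_shape is hypothetical, and the vocabulary size of 32000 is used as an illustrative value for the Llama 2 tokenizer.

```python
def expected_logits_shape(batch_size, seq_len, vocab_size,
                          n_future_tokens, return_all_heads):
    """Shape of the logits returned by forward(), per the rule above.

    Hypothetical helper for illustration; it does not call the model.
    """
    n_heads = n_future_tokens if return_all_heads else 1
    return (batch_size, seq_len, n_heads, vocab_size)


# Illustrative values: quickstart batch and sequence sizes, n=4 heads,
# and an assumed vocabulary size of 32000.
print(expected_logits_shape(2, 128, 32000, 4, return_all_heads=True))
print(expected_logits_shape(2, 128, 32000, 4, return_all_heads=False))
```

For the baseline models (n=1), both settings give the same shape, since only the next-token head exists.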

Citation

Gloeckle, F., Idrissi, B. Y., Rozière, B., Lopez-Paz, D., & Synnaeve, G. (2024). Better & faster large language models via multi-token prediction. arXiv preprint arXiv:2404.19737.

Bibtex entry:

@article{gloeckle2024better,
  title={Better \& faster large language models via multi-token prediction},
  author={Gloeckle, Fabian and Idrissi, Badr Youbi and Rozi{\`e}re, Baptiste and Lopez-Paz, David and Synnaeve, Gabriel},
  journal={arXiv preprint arXiv:2404.19737},
  year={2024}
}

Feedback and comments

Please report risks as indicated in the Acceptable Use Policy and address bugs and any other comments to the corresponding authors as indicated in the research paper.