---
license: apache-2.0
language:
  - en
  - zh
pipeline_tag: text-generation
tags:
  - TransNormerLLM
---

TransNormerLLM3 -- A Faster and Better LLM

Introduction

This official repository unveils the TransNormerLLM3 model and releases its open-source weights at every 50-billion-token checkpoint during pre-training.

TransNormerLLM evolved from TransNormer and stands out as the first LLM built on a linear-attention architecture. It also distinguishes itself as the first non-Transformer LLM to exceed both traditional Transformers and other efficient architectures (such as RetNet and Mamba) in both speed and performance.
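To illustrate what "linear attention" means here, below is a minimal NumPy sketch (not the released implementation): causal attention is computed as an O(n) recurrence over a running state instead of an O(n²) comparison of every query against every past key. The function name and the scalar `decay` are illustrative assumptions.

```python
import numpy as np

def linear_attention(q, k, v, decay=0.99):
    """Causal linear attention as an O(n) recurrence (illustrative sketch).

    The running state S accumulates outer products k_t^T v_t with
    exponential decay, so each step costs O(d * d_v) regardless of how
    many tokens came before -- this is what makes the cost linear in
    sequence length.
    """
    n, d = q.shape
    S = np.zeros((d, v.shape[1]))            # recurrent state: d x d_v
    out = np.zeros_like(v)
    for t in range(n):
        S = decay * S + np.outer(k[t], v[t])  # fold in the new key/value
        out[t] = q[t] @ S                     # read the state with the query
    return out
```

Mathematically this equals the quadratic form with a causal, exponentially decaying mask, but it never materializes the n×n attention matrix.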

TransNormerLLM3

  • TransNormerLLM3-15B features 14.83 billion parameters. It is structured with 42 layers, 40 attention heads, and an embedding size of 5120.
  • TransNormerLLM3-15B is fully integrated with Lightning Attention-2, which maintains a stable TGS (tokens per GPU per second) when training on arbitrarily long sequences, until hard limits such as GPU memory are reached.
  • The Tiktoken tokenizer is used, with a total vocabulary size of about 100,000.
  • It incorporates Simple GLU as its channel mixer, GLA as its token mixer, and SRMSNorm for normalization.
  • For position encoding, the first layer employs LRPE with exponential decay, whereas subsequent layers use exponential decay alone.
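Two of the components above are easy to sketch. Below is a minimal NumPy sketch of SRMSNorm (RMS normalization without a learnable gain) and a Simple GLU channel mixer (two input projections gated elementwise, then projected back, with no activation function); the function names, weight shapes, and the absence of an activation are assumptions for illustration, not the released code.

```python
import numpy as np

def srmsnorm(x, eps=1e-6):
    """Simple RMSNorm: normalize by the root-mean-square of the last
    axis, with no learnable scale parameter."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def simple_glu(x, W_in, W_gate, W_out):
    """Simple GLU channel mixer (sketch): gate one linear projection of
    x with another, elementwise, then project back to model width."""
    return ((x @ W_in) * (x @ W_gate)) @ W_out
```

Dropping the gain from RMSNorm and the activation from GLU removes parameters and nonlinearities without changing the basic normalize-then-gate structure of the block.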

Pre-training Logbook

--23.12.25-- startup: WeChat - Pre-training Sets Sail <<<>>> Twitter - Pre-training Commences <<<>>> YouTube Recording <<<>>> bilibili Replay
--24.01.02-- first week review: WeChat - First Week Overview <<<>>> Twitter - First Week Review
--24.01.09-- second week review: WeChat - Second Week Overview <<<>>> Twitter - Second Week Review

Released Weights

| Param | Tokens | Hugging Face | ModelScope | Wisemodel |
|-------|--------|--------------|------------|-----------|
| 15B   | 50B    | 🤗           | 🤖         | 🐯        |

Benchmark Results

All models are evaluated under the official settings using the lm-evaluation-harness framework.

| Model               | P  | T    | BoolQ | PIQA  | HS    | WG    | ARC-e | ARC-c | OBQA  | MMLU  | C-Eval |
|---------------------|----|------|-------|-------|-------|-------|-------|-------|-------|-------|--------|
| TransNormerLLM3-15B | 15 | 0.05 | 62.08 | 72.52 | 55.55 | 57.14 | 62.12 | 31.14 | 32.40 | 27.50 | 26.18  |
| TransNormerLLM3-15B | 15 | 0.10 | 63.98 | 74.70 | 61.09 | 61.33 | 65.95 | 34.64 | 35.60 | 25.38 | 27.40  |
| TransNormerLLM3-15B | 15 | 0.15 | 60.34 | 75.08 | 63.99 | 62.04 | 64.56 | 34.90 | 35.20 | 22.64 | 26.60  |

P: parameter size (billion). T: tokens (trillion). BoolQ: acc. PIQA: acc. HellaSwag: acc_norm. WinoGrande: acc. ARC-easy: acc. ARC-challenge: acc_norm. OpenBookQA: acc_norm. MMLU: 5-shot acc. C-Eval: 5-shot acc.

Acknowledgments and Citation

Acknowledgments

Our project is developed based on the following open-source projects:

Citation

If you wish to cite our work, please use the following reference:

@article{qin2023scaling,
  title={Scaling transnormer to 175 billion parameters},
  author={Qin, Zhen and Li, Dong and Sun, Weigao and Sun, Weixuan and Shen, Xuyang and Han, Xiaodong and Wei, Yunshen and Lv, Baohong and Yuan, Fei and Luo, Xiao and others},
  journal={arXiv preprint arXiv:2307.14995},
  year={2023}
}

@misc{qin2024lightning,
  title={Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models},
  author={Zhen Qin and Weigao Sun and Dong Li and Xuyang Shen and Weixuan Sun and Yiran Zhong},
  year={2024},
  eprint={2401.04658},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

- OpenNLPLab @2024 -