TransNormerLLM -- A Faster and Better LLM

Introduction
Released Weights
Benchmark Results
- General Domain
  - Model Results
Inference and Deployment
Fine-tuning the Model
- Dependency Installation
- Training
Community and Ecosystem
Disclaimer, License and Citation

Introduction

We are re-inventing the Large Language Model (LLM). This is the official implementation of TransNormerLLM in link. The open weights of TransNormerLLM are now accessible to individuals, creators, researchers, and businesses of all sizes so that they can experiment, innovate, and scale their ideas responsibly.

Our release contains the TransNormerLLM model implementation, the open-source weights, and starter code for Supervised Fine-tuning (SFT). We show examples of how to load TransNormerLLM models and how to run SFT and inference on them.

TransNormerLLM is the first linear attention-based LLM that outperforms conventional softmax attention-based models in terms of both accuracy and efficiency. It was trained on a high-quality corpus with up to 1.4 trillion tokens.
TransNormerLLM evolves from the previous linear attention architecture TransNormer, with modifications that include LRPE positional embedding, Lightning Attention acceleration, and new gating and normalization mechanisms.
TransNormerLLM achieves performance competitive with models of its size on multiple well-established Chinese, English, and multi-language general and domain-specific benchmarks.
This release includes Base versions with 385M, 1B, and 7B parameters.
All versions are fully open to academic research. Developers may use the models commercially free of charge after applying via email and obtaining official commercial permission.
For more information, see our academic paper TransNormerLLM.

Released Weights

The released versions and their download links are shown below:

	Base Models
385M	🤗 TransNormerLLM-385M
1B	🤗 TransNormerLLM-1B
7B	🤗 TransNormerLLM-7B

Benchmark Results

To validate TransNormerLLM, we tested our 385M, 1B, and 7B models on commonsense reasoning tasks, MMLU, CMMLU, and C-Eval. For comparison, we selected several open-source models as competitors, including Transformer-based models such as OPT, Pythia, BLOOM, GPT-Neo, GPT-J, MPT, Falcon, LLaMA1/2, OpenLLAMA v1/v2, Baichuan 1/2, and ChatGLM 1/2, and the non-Transformer model RWKV. TransNormerLLM remains highly competitive with these models.

Commonsense Reasoning: We report BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC easy and challenge, OpenBookQA and their average. We report 0-shot results for all benchmarks using LM-Eval-Harness. All of our models achieve competitive performance compared to existing state-of-the-art LLMs, showing a strong ability to comprehend and apply commonsense reasoning.

Aggregated Benchmarks: We report the overall results for MMLU, CMMLU, and C-Eval. Official scripts were used to evaluate MMLU, CMMLU, and C-Eval, and all evaluations used a 5-shot setup. Compared to top-tier open-source models available in the industry, our models show comparable performance on both English and Chinese benchmarks.

General Domain

In the general domain, we conducted 5-shot tests on the following datasets:

C-Eval is a comprehensive Chinese evaluation dataset for foundation models, covering 52 disciplines and four levels of difficulty. Our evaluation approach followed that of LM-Evaluation-Harness.
MMLU is an English evaluation dataset comprising 57 tasks, encompassing elementary math, American history, computer science, law, etc. The difficulty ranges from high school level to expert level. It's a mainstream LLM evaluation dataset. We used its official evaluation approach.
CMMLU is a comprehensive Chinese evaluation benchmark covering 67 topics, specifically designed to assess language models' knowledge and reasoning capabilities in a Chinese context. We adopted its official evaluation approach.
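As an illustration of what a 5-shot setup means for multiple-choice benchmarks like these, the sketch below prepends five solved examples to an unanswered test question. It is not the official evaluation script of any of the three benchmarks; the field names (`question`, `options`, `answer`), the helper names, and the toy data are all assumptions made for this example.

```python
from typing import Dict, List

CHOICES = ["A", "B", "C", "D"]


def format_example(item: Dict, include_answer: bool) -> str:
    """Render one multiple-choice question; optionally append its gold answer."""
    lines = [item["question"]]
    for label, option in zip(CHOICES, item["options"]):
        lines.append(f"{label}. {option}")
    answer = f" {item['answer']}" if include_answer else ""
    lines.append("Answer:" + answer)
    return "\n".join(lines)


def build_few_shot_prompt(dev_items: List[Dict], test_item: Dict, k: int = 5) -> str:
    """Prepend k solved examples to the unanswered test question."""
    shots = [format_example(it, include_answer=True) for it in dev_items[:k]]
    return "\n\n".join(shots + [format_example(test_item, include_answer=False)])


# Toy data, invented purely for illustration (hypothetical, not from any benchmark).
dev = [
    {
        "question": f"What is {i} + {i}?",
        "options": [str(2 * i), str(2 * i + 1), str(2 * i + 2), str(2 * i + 3)],
        "answer": "A",
    }
    for i in range(1, 6)
]
test = {"question": "What is 7 + 7?", "options": ["12", "13", "14", "15"], "answer": "C"}

prompt = build_few_shot_prompt(dev, test, k=5)
print(prompt)
```

The model is then typically scored on which answer letter it assigns the highest likelihood after the final "Answer:"; the exact scoring rule depends on the evaluation script used.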

Model Results

Performance Comparison on Commonsense Reasoning and Aggregated Benchmarks. For a fair comparison, we report competing methods' results reproduced by us using their released models. PS: parameter size (billion). T: tokens (trillion). HS: HellaSwag. WG: WinoGrande.

Model	PS	T	BoolQ	PIQA	HS	WG	ARC-e	ARC-c	OBQA	MMLU	CMMLU	C-Eval
GPT-J	6.9	0.3	65.44	75.41	66.25	64.09	66.92	36.60	38.20	25.40	26.47	23.39
OPT	6.7	0.3	66.18	76.22	67.21	65.19	65.66	34.64	37.20	24.57	25.36	25.32
Pythia	6.9	0.3	63.46	75.14	63.92	60.77	67.34	35.41	37.00	24.64	25.56	26.40
BLOOM	7.1	0.35	62.91	72.69	62.33	64.01	65.11	33.45	35.80	26.25	24.97	24.25
RWKV	7.4	-	-	76.06	65.51	61.01	67.80	37.46	40.20	24.96	-	-
MPT	6.9	1.0	73.88	79.43	76.25	68.27	74.79	41.72	42.20	30.80	25.99	24.06
Falcon	7.2	1.5	73.73	79.38	76.30	67.17	74.62	43.60	43.80	27.79	25.73	22.92
Baichuan1	7.0	1.2	70.09	76.01	70.06	64.09	71.72	40.53	38.20	42.30	44.43	42.80
Baichuan2	7.0	2.6	72.72	76.50	72.17	68.35	75.17	42.32	39.60	54.16	57.07	54.00
ChatGLM1	6.7	1.0	74.74	68.88	45.57	52.25	48.78	31.66	36.80	40.63	37.48	40.23
ChatGLM2	7.1	1.4	77.65	69.37	50.51	57.62	59.13	34.30	37.00	45.46	48.80	52.55
OpenLLaMAv1	6.7	1.0	70.43	75.68	69.23	66.69	71.17	38.57	39.00	30.49	25.40	26.09
OpenLLaMAv2	6.7	1.0	72.20	78.84	74.51	65.67	72.39	41.30	41.00	41.29	29.58	30.01
LLaMA1	6.7	1.0	76.50	79.80	76.10	70.10	72.80	47.60	57.20	35.10	25.62	25.72
LLaMA2	6.7	2.0	77.68	78.07	76.02	68.98	76.30	46.33	44.20	45.30	32.96	33.20
Ours	6.8	1.4	75.11	85.47	78.61	66.93	73.11	52.99	61.60	44.90	49.32	45.01
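The table lists per-task scores but no average column. A minimal sketch of how a commonsense average could be derived from the seven commonsense columns shown (BoolQ through OBQA) follows. The choice of an unweighted mean and the helper name are assumptions of this example, and the numbers are copied from the LLaMA2 and Ours rows above.

```python
# Commonsense columns as they appear in the table (assumed to be weighted equally).
COMMONSENSE = ["BoolQ", "PIQA", "HS", "WG", "ARC-e", "ARC-c", "OBQA"]

# Scores copied from the table rows, in the column order above.
rows = {
    "LLaMA2": [77.68, 78.07, 76.02, 68.98, 76.30, 46.33, 44.20],
    "Ours": [75.11, 85.47, 78.61, 66.93, 73.11, 52.99, 61.60],
}


def commonsense_average(scores):
    """Unweighted mean over the commonsense task scores."""
    assert len(scores) == len(COMMONSENSE)
    return sum(scores) / len(scores)


averages = {name: round(commonsense_average(s), 2) for name, s in rows.items()}
for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
```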

Inference and Deployment

The model weights, source code, and configuration needed for inference have been released on Hugging Face. Download links can be found in the Released Weights table above. Below, we demonstrate inference using TransNormerLLM-7B as an example. The program will automatically download the required resources from Hugging Face.

Dependency Installation

📝Note Please configure the following environment before using the model:

pip install triton==2.0.0
pip install einops

Notice

If you encounter errors related to Triton, please set the following environment variable:

export use_triton=False
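When the model is launched from a Python script or notebook rather than a shell, the same variable can be set in-process. The sketch below assumes that the remote model code reads `use_triton` from the environment when it is first loaded, so the assignment is placed before any model import; `triton_enabled` is a hypothetical helper that only shows how such a flag might be parsed.

```python
import os

# Assumption: the model code checks `use_triton` at load time,
# so the variable has to be set before the model is loaded.
os.environ["use_triton"] = "False"


def triton_enabled() -> bool:
    """Hypothetical helper: interpret the flag, defaulting to enabled."""
    return os.environ.get("use_triton", "True").strip().lower() == "true"


print("Triton kernels enabled:", triton_enabled())
```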

Python Code Inference

Demonstration of Base Model Inference

📝Note Please run the model in bfloat16 rather than float16.

>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("OpenNLPLab/TransNormerLLM-7B", trust_remote_code=True)
>>> model = AutoModelForCausalLM.from_pretrained("OpenNLPLab/TransNormerLLM-7B", torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)
>>> # The prompt means "Today is a beautiful day"; the inputs are moved to the model's device.
>>> inputs = tokenizer('今天是美好的一天', return_tensors='pt').to(model.device)
>>> pred = model.generate(**inputs, max_new_tokens=4096, repetition_penalty=1.0)
>>> print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))

In the code snippet above, the model is loaded with device_map='auto', which uses all available GPUs. To restrict the model to specific devices, set CUDA_VISIBLE_DEVICES before launching, for example export CUDA_VISIBLE_DEVICES=0,1 to use GPUs 0 and 1.
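The device restriction can also be applied from inside Python, as sketched below. `restrict_visible_gpus` is a hypothetical helper introduced for this example. The one firm requirement is that the variable is set before CUDA is initialised, so the safest place is before `torch` is imported.

```python
import os


def restrict_visible_gpus(gpu_ids):
    """Expose only the listed GPU indices to CUDA for this process."""
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Must run before the first CUDA initialisation (safest: before importing torch).
visible = restrict_visible_gpus([0, 1])
print("CUDA_VISIBLE_DEVICES =", visible)
```

With this in place, device_map="auto" would only distribute the model across the two listed GPUs, which the process then sees as devices 0 and 1.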

Fine-tuning the Model

Dependency Installation

git clone https://github.com/OpenNLPLab/TransNormerLLM.git
cd TransNormerLLM/fine-tune
pip install -r requirements.txt

To use lightweight fine-tuning methods like LoRA, you must additionally install peft.

Training

Below, we provide an example of fine-tuning TransNormerLLM-1B on a single machine with ZeRO-3.

Training data: alpaca_data.json. This sample data consists of 52,002 entries drawn from alpaca_data.json and reformatted. Its main purpose is to demonstrate how to run SFT on our model; effectiveness is not guaranteed.
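For orientation, the sketch below shows how an Alpaca-style record is commonly turned into a prompt/target pair for SFT. It assumes the widely used Stanford Alpaca schema (`instruction`, `input`, `output`) and prompt template; the reformatted file shipped with this repository and the logic in train.py may differ, and `to_sft_pair` is a hypothetical helper.

```python
# Assumed templates, following the commonly used Stanford Alpaca prompt format.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def to_sft_pair(record):
    """Return (prompt, target) for one record; records with an empty input use the shorter template."""
    template = PROMPT_WITH_INPUT if record.get("input") else PROMPT_NO_INPUT
    return template.format(**record), record["output"]


# Hypothetical records, invented for illustration.
with_input = {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
no_input = {"instruction": "Name a primary color.", "input": "", "output": "Red"}

prompt_a, target_a = to_sft_pair(with_input)
prompt_b, target_b = to_sft_pair(no_input)
print(prompt_a + target_a)
```

During SFT, the loss is usually computed only on the target tokens, with the prompt tokens masked out; whether train.py does this is not stated here.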

torchrun \
    --nproc_per_node=8 \
    train.py \
    --model_name_or_path OpenNLPLab/TransNormerLLM-1B \
    --data_path ./alpaca_data.json \
    --output_dir output \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --bf16 true \
    --adam_beta1 0.9 \
    --adam_beta2 0.95 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5000 \
    --save_total_limit 30 \
    --learning_rate 1e-4 \
    --weight_decay 0.1 \
    --warmup_ratio 0.1 \
    --lr_scheduler_type "cosine" \
    --deepspeed 'configs/zero3.json' \
    --logging_steps 1 \
    --dataloader_num_workers 24 \
    --ddp_find_unused_parameters false \
    --tf32 true
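A quick back-of-the-envelope check of what these flags imply can be sketched as follows. It assumes all 52,002 entries are used for training and that one optimizer step consumes one global batch; the actual step count reported by the trainer may differ slightly depending on how the last partial batch is handled.

```python
import math

# Values taken from the command and data description above.
num_examples = 52_002
nproc_per_node = 8
per_device_train_batch_size = 2
gradient_accumulation_steps = 1
num_train_epochs = 1
warmup_ratio = 0.1
save_steps = 5000

# Sequences consumed per optimizer step across all GPUs.
global_batch = nproc_per_node * per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_examples / global_batch)
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"global batch size: {global_batch}")
print(f"optimizer steps:   {total_steps}")
print(f"warmup steps:      {warmup_steps}")
print(f"step-based checkpoint reached: {total_steps >= save_steps}")
```

Under these assumptions the run finishes before reaching save_steps, so a smaller --save_steps value may be worth considering if intermediate checkpoints are wanted.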

Community and Ecosystem

📢📢📢 We will continuously update the support for TransNormerLLM from the community and ecosystem here 😀😀😀

nanoTransnormer

Disclaimer, License and Citation

Disclaimer

We hereby declare that our team has not developed any applications based on TransNormerLLM models, whether on iOS, Android, the web, or any other platform. We strongly call on all users not to use TransNormerLLM models for any activities that harm national or social security or violate the law. We also ask users not to use TransNormerLLM models for Internet services that have not undergone appropriate security reviews and filings. We hope that all users will abide by this principle and ensure that the development of technology proceeds in a regulated and legal environment.

We have done our best to ensure the compliance of the data used in the model training process. However, despite our considerable efforts, there may still be some unforeseeable issues due to the complexity of the model and data. Therefore, if any problems arise due to the use of TransNormerLLM open-source models, including but not limited to data security issues, public opinion risks, or any risks and problems brought about by the model being misled, abused, spread or improperly exploited, we will not assume any responsibility.

License

Community use of the TransNormerLLM model requires adherence to Apache 2.0 and the Community License for TransNormerLLM Model. The TransNormerLLM model supports commercial use. If you plan to use the TransNormerLLM model or its derivatives for commercial purposes, please ensure that your entity meets the following conditions:

The number of Daily Active Users (DAU) of your or your affiliates' service or product is less than 1 million.
Neither you nor your affiliates are software service providers or cloud service providers.
Neither you nor your affiliates may sublicense or reauthorize the commercial license granted to you to any third party without TransNormerLLM's permission.

If you meet the above conditions, submit the application materials required by the TransNormerLLM Model Community License Agreement to the following contact email: opennlplab@gmail.com. Once approved, TransNormerLLM will grant you a non-exclusive, global, non-transferable, non-sublicensable, revocable commercial copyright license.

Acknowledgments

Our project is developed based on the following open source projects:

Baichuan for the tokenizer.
metaseq for training.
lm-evaluation-harness for evaluation.

Citation

If you wish to cite our work, please use the following reference:

@article{qin2023scaling,
  title={Scaling transnormer to 175 billion parameters},
  author={Qin, Zhen and Li, Dong and Sun, Weigao and Sun, Weixuan and Shen, Xuyang and Han, Xiaodong and Wei, Yunshen and Lv, Baohong and Yuan, Fei and Luo, Xiao and others},
  journal={arXiv preprint arXiv:2307.14995},
  year={2023}
}
