# Introducing Qwen-7B: Open foundation and human-aligned models

Large language models have recently attracted enormous attention. The boom of [ChatGPT](https://openai.com/blog/chatgpt) has accelerated progress toward artificial general intelligence and shows that large language models compress world knowledge into neural networks, and that alignment with human cognition yields powerful conversational agents able to assist users through interaction. The latest version of ChatGPT, based on [GPT-4](https://arxiv.org/abs/2303.08774), demonstrates impressive performance across a wide range of capabilities, such as language understanding, logical reasoning, and planning. Its integration with external tools and models further unlocks agents that can understand instructions, execute code, and use tools to achieve the objectives set by human users. These advances underline the importance of large language models as _the foundation of AI services_.

We are pleased to release Qwen-7B, the 7B-parameter models of our large pretrained model series Qwen (abbreviated from Tongyi Qianwen). This release includes model weights and code for pretrained and human-aligned language models of 7B parameters:

- `Qwen-7B` is the pretrained language model, and `Qwen-7B-Chat` is fine-tuned to align with human intent.
- `Qwen-7B` is pretrained on over 2.2 trillion tokens with a context length of 2048. On the series of benchmarks we tested, Qwen-7B generally performs better than existing open models of similar scale and appears to be on par with some larger models.
- `Qwen-7B-Chat` is fine-tuned on curated data, including not only task-oriented data but also security- and service-oriented data, which appears to be scarce in existing open models.
- Example code for fine-tuning, evaluation, and inference is included, along with guides on long-context inference and tool use (a minimal inference sketch is shown at the end of this introduction).

**Goal of release**: We believe that while the recent waves of LLM releases have deepened our understanding of model behavior under standard regimes, it is yet to be revealed how the techniques that now accompany LLMs, such as 1) quantization and fine-tuning after quantization, 2) training-free long-context inference, and 3) fine-tuning with service-oriented data, including search and tool use, affect the models as a whole. The open release of Qwen-7B marks our first step towards fully understanding the real-world application of such techniques. We hope it will enable the community to analyze and continue to improve the safety of these models, striving to establish responsible development and deployment of LLMs.

> **Disclaimer**:
> We must note that even though the weights and code are released openly and commercial use is not prohibited, Qwen-7B, like other pretrained language models, comes with potential risks influenced by complex factors, including but not limited to over-diversified, inaccurate, or misleading generation.
> Developers and stakeholders should perform their own red teaming and put appropriate security measures in place before deployment, and they must abide by and comply with local governance and regulations.
> In no event shall the authors be held liable for any claim, damages, or other liability arising from the use of the released weights or code.

The remainder of this document describes our pretraining and fine-tuning methodology.
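As a quick start before the methodology details, the following is a minimal sketch of running the released chat model with Hugging Face Transformers. The Hub id `Qwen/Qwen-7B-Chat` and the `chat()` helper loaded via `trust_remote_code` are assumptions based on the released repository; refer to the repository's own examples for authoritative usage.

```python
# Minimal inference sketch. Assumes the Hugging Face Hub id "Qwen/Qwen-7B-Chat"
# and the repository's custom chat() helper exposed via trust_remote_code.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True
).eval()

# One chat turn; `history` carries the conversation across subsequent turns.
response, history = model.chat(
    tokenizer, "Give me a short introduction to large language models.", history=None
)
print(response)
```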
## Pretraining

Qwen-7B is a Transformer-based decoder-only language model with an architecture similar to the [LLaMA](https://github.com/facebookresearch/llama) series of models. It is pretrained on over 2.2 trillion tokens with a context length of 2048, drawn from publicly available data covering general and professional fields, with a focus on English and Chinese.

### Data

**Pretraining data**: Our training data is a mix of publicly available sources, consisting mainly of web documents and code files. The data are multilingual, with most of them in English and Chinese. We employed an ensemble of models to exclude data of low quality or deemed unfit for pretraining, such as NSFW content. For mathematical reasoning, we include RFT data from [gsm8k-ScRel](https://github.com/OFA-Sys/gsm8k-ScRel). The final data underwent global fuzzy deduplication, and the mix of pretraining corpora was optimized through numerous ablation experiments.

**Tokenization**: In contrast to current mainstream open models, whose vocabularies focus on Chinese and English, we use a vocabulary of 151,851 tokens. It prioritizes efficient encoding of Chinese, English, and code, and is also friendlier to other languages, enabling users to directly enhance the capability for some languages without expanding the vocabulary. The tokenizer splits numbers into single digits and uses the [tiktoken](https://github.com/openai/tiktoken) library for efficient tokenization. After tokenization, the data amounts to over 2.2 trillion tokens.

### Model

**Model architecture**: Qwen-7B is built with an architecture similar to LLaMA. The main differences from the standard Transformer are: 1) untied input and output embeddings, 2) rotary positional embeddings, 3) no biases except in the attention QKV projections, 4) RMSNorm instead of LayerNorm, 5) SwiGLU instead of ReLU, and 6) FlashAttention to accelerate training. The model has 32 layers, an embedding dimension of 4096, and 32 attention heads.

**Training details**: The model is trained with the AdamW optimizer, using $\beta_1=0.9$, $\beta_2=0.95$, and $\epsilon=10^{-6}$. The sequence length is 2048 and the batch size is 2048 sequences, so each optimization step processes over 4 million tokens. We use a cosine learning rate schedule with a warm-up of 2000 steps, a peak learning rate of $3 \times 10^{-4}$, and a minimum learning rate of 10% of the peak. We apply a weight decay of 0.1 and gradient clipping at 1.0, and train with `bfloat16` mixed precision.

### Evaluation

We report results of Qwen-7B on standard benchmarks.

#### World knowledge

[C-Eval](https://arxiv.org/abs/2305.08322) is a common benchmark for testing the knowledge of pretrained models in Chinese. It covers 52 subjects in four major directions: humanities, social sciences, STEM, and other specialties. Following standard practice, we use development set samples as the source of few-shot prompts and report the 5-shot accuracy of the Qwen-7B pretrained model on the validation and test sets; a rough sketch of this prompt construction is shown below.
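The following is a minimal sketch of how such a 5-shot multiple-choice prompt can be assembled from development-set examples. The field names and template are illustrative assumptions, not the exact evaluation harness behind the reported numbers.

```python
# Illustrative 5-shot prompt construction for a multiple-choice benchmark such
# as C-Eval; the field names and template are assumptions, not the exact
# harness used for the reported scores.
def build_five_shot_prompt(dev_examples, question, choices):
    prompt = "The following are multiple-choice questions. Answer with A, B, C, or D.\n\n"
    for ex in dev_examples[:5]:  # development-set samples serve as the few-shot exemplars
        prompt += (
            f"Question: {ex['question']}\n"
            f"A. {ex['A']}\nB. {ex['B']}\nC. {ex['C']}\nD. {ex['D']}\n"
            f"Answer: {ex['answer']}\n\n"
        )
    prompt += (
        f"Question: {question}\n"
        f"A. {choices[0]}\nB. {choices[1]}\nC. {choices[2]}\nD. {choices[3]}\n"
        "Answer:"
    )
    return prompt
```

The predicted answer is then read from the model's continuation (or by comparing the likelihoods of the four choice letters), and accuracy is averaged over the evaluation split.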
The accuracy comparison of the Qwen-7B model and other models on the C-Eval validation set is as follows:

| Model       |  Average |
| :---------- | -------: |
| Alpaca-7B   |     28.9 |
| Vicuna-7B   |     31.2 |
| ChatGLM-6B  |     37.1 |
| Baichuan-7B |     42.7 |
| ChatGLM2-6B |     50.9 |
| InternLM-7B |     53.4 |
| ChatGPT     |     53.5 |
| Claude-v1.3 |     55.5 |
| **Qwen-7B** | **60.8** |

The performance of the Qwen-7B pretrained model and other models on the C-Eval test set is shown in the following table:

| Model                   |     Avg. | Avg. (Hard) | STEM | Social Sciences | Humanities | Others |
| :---------------------- | -------: | ----------: | ---: | --------------: | ---------: | -----: |
| ChatGLM-6B              |     38.9 |        29.2 | 33.3 |            48.3 |       41.3 |   38.0 |
| Chinese-Alpaca-Plus-13B |     41.5 |        30.5 | 36.6 |            49.7 |       43.1 |   41.2 |
| Baichuan-7B             |     42.8 |        31.5 | 38.2 |            52.0 |       46.2 |   39.3 |
| WestlakeLM-19B          |     44.6 |        34.9 | 41.6 |            51.0 |       44.3 |   44.5 |
| AndesLM-13B             |     46.0 |        29.7 | 38.1 |            61.0 |       51.0 |   41.9 |
| BatGPT-15B-sirius       |     47.0 |        31.9 | 42.7 |            57.5 |       48.6 |   43.6 |
| ChatGLM2-6B             |     51.7 |        37.1 | 48.6 |            60.5 |       51.3 |   49.8 |
| InternLM-7B             |     52.8 |        37.1 | 48.0 |            67.4 |       55.4 |   45.8 |
| Baichuan-13B            |     53.6 |        36.7 | 47.0 |            66.8 |       57.3 |   49.8 |
| Claude-v1.3             |     54.2 |        39.0 | 51.9 |            61.7 |       52.1 |   53.7 |
| ChatGPT                 |     54.4 |        41.4 | 52.9 |            61.8 |       50.9 |   53.6 |
| **Qwen-7B**             | **59.6** |        41.0 | 52.8 |            74.1 |       63.1 |   55.2 |

As can be seen, Qwen-7B achieves the best performance among existing models of similar scale and even surpasses some larger-scale models.

[MMLU](https://arxiv.org/abs/2009.03300) is currently one of the most recognized benchmarks for evaluating English comprehension, covering 57 subtasks across different academic fields and difficulty levels. The 5-shot MMLU accuracy of Qwen-7B is shown in the following table:

| Model        |  Average | STEM | Social Sciences | Humanities | Others |
| :----------- | -------: | ---: | --------------: | ---------: | -----: |
| LLaMA-7B     |     35.1 | 30.5 |            38.3 |       34.0 |   38.1 |
| Baichuan-7B  |     42.3 | 35.6 |            48.9 |       38.4 |   48.1 |
| LLaMA2-7B    |     45.3 | 36.4 |            51.2 |       42.9 |   52.2 |
| LLaMA-13B    |     46.9 | 35.8 |            53.8 |       45.0 |   53.3 |
| ChatGLM2-6B  |     47.9 | 41.2 |            54.4 |       43.7 |   54.5 |
| InternLM-7B  |     51.0 |    - |               - |          - |      - |
| Baichuan-13B |     51.6 | 41.6 |            60.9 |       47.4 |   58.5 |
| LLaMA2-13B   |     54.8 | 44.1 |            62.6 |       52.8 |   61.1 |
| ChatGLM2-12B |     56.2 | 48.2 |            65.1 |       52.6 |   60.9 |
| **Qwen-7B**  | **56.7** | 47.6 |            65.9 |       51.5 |   64.7 |

In English as well, Qwen-7B surpasses other open pretrained models of similar scale and is competitive with larger versions of other models.
#### Coding

We compared the coding capabilities of pretrained models on [HumanEval](https://github.com/openai/human-eval) (pass@1), with the following results:

| Model        |   Pass@1 |
| :----------- | -------: |
| Baichuan-7B  |      9.2 |
| ChatGLM2-6B  |      9.2 |
| InternLM-7B  |     10.4 |
| LLaMA-7B     |     10.5 |
| LLaMA2-7B    |     12.8 |
| Baichuan-13B |     12.8 |
| LLaMA-13B    |     15.8 |
| MPT-7B       |     18.3 |
| LLaMA2-13B   |     18.3 |
| **Qwen-7B**  | **24.4** |

#### Math

We compared the mathematical capabilities of pretrained models on [GSM8K](https://github.com/openai/grade-school-math) (8-shot accuracy), with the following results:

| Model        | Accuracy |
| :----------- | -------: |
| MPT-7B       |      6.8 |
| Falcon-7B    |      6.8 |
| Baichuan-7B  |      9.7 |
| LLaMA-7B     |     11.0 |
| LLaMA2-7B    |     14.6 |
| LLaMA-13B    |     17.8 |
| Baichuan-13B |     26.6 |
| LLaMA2-13B   |     28.7 |
| InternLM-7B  |     31.2 |
| ChatGLM2-6B  |     32.4 |
| ChatGLM2-12B |     40.9 |
| **Qwen-7B**  | **51.6** |

#### Natural language processing

We compared the translation capabilities of pretrained models on WMT22 zh-en and en-zh (5-shot BLEU), with the following results:

| Model       |  Average |    zh-en |    en-zh |
| :---------- | -------: | -------: | -------: |
| InternLM-7B |     11.8 |      9.0 |     14.5 |
| LLaMA-7B    |     12.7 |     16.7 |      8.7 |
| LLaMA-13B   |     15.8 |     19.5 |     12.0 |
| LLaMA2-7B   |     19.9 |     21.9 |     17.9 |
| Bloom-7B    |     20.3 |     19.1 |     21.4 |
| LLaMA2-13B  |     23.3 |     22.4 |     24.2 |
| PolyLM-13B  |     23.6 |     20.2 |     27.0 |
| Baichuan-7B |     24.6 |     22.6 |     26.6 |
| **Qwen-7B** | **27.5** | **24.3** | **30.6** |

#### Long-context inference

We include support for training-free long-context inference based on NTK-aware interpolation, LogN attention scaling, and local window attention, which extends the usable context from 2048 to over 8192 tokens (a rough sketch of these techniques follows the table below). The table below reports perplexity (PPL) on arXiv data at various sequence lengths:
| Model                             | 1024 | 2048 |  4096 |   8192 |   16384 |
| :-------------------------------- | ---: | ---: | ----: | -----: | ------: |
| Qwen-7B                           | 4.23 | 3.78 | 39.35 | 469.81 | 2645.09 |
| + dynamic_ntk                     | 4.23 | 3.78 |  3.59 |   3.66 |    5.71 |
| + dynamic_ntk + logn              | 4.23 | 3.78 |  3.58 |   3.56 |    4.62 |
| + dynamic_ntk + logn + local_attn | 4.23 | 3.78 |  3.58 |   3.49 |    4.32 |
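As a rough illustration of the ideas behind dynamic NTK-aware interpolation and LogN attention scaling (not the exact released inference code), the sketch below adjusts the RoPE base and the query scale once the context grows beyond the 2048-token training length; the constants and formulas are assumptions for illustration.

```python
import math

# Assumed constants for illustration: pretraining context length, default RoPE
# base, and per-head dimension (4096 hidden size / 32 heads).
TRAIN_LEN = 2048
ROPE_BASE = 10000.0
HEAD_DIM = 128

def dynamic_ntk_rope_base(seq_len: int) -> float:
    """Dynamic NTK-aware interpolation: enlarge the RoPE base once the sequence
    exceeds the training length, stretching rotary frequencies instead of
    extrapolating them. The exact schedule in the released code may differ."""
    if seq_len <= TRAIN_LEN:
        return ROPE_BASE
    scale = seq_len / TRAIN_LEN
    return ROPE_BASE * scale ** (HEAD_DIM / (HEAD_DIM - 2))

def logn_attention_scale(context_len: int) -> float:
    """LogN attention scaling: grow the query scale logarithmically with the
    context length once it passes the training length."""
    return max(1.0, math.log(context_len) / math.log(TRAIN_LEN))
```

Local window attention additionally restricts attention to a fixed-size window around each position; the ablation above suggests it mainly helps at the longest sequence lengths.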