Commit ae21363 by xiangpeng.wxp (parent: 55b5a0d)

add readme

Files changed (1): README.md (+133 -0)
---
language:
- zh
- en
- es
- fr
- pt
- ru
- de
- it
- ar
- ja
- ko
- th
- vi
- id
- nl
- pl
- tr
- he

tags:
- text-generation

license: apache-2.0
---

# Model Card for PolyLM (a polyglot large language model)

## Table of Contents

1. [Model Details](#model-details)
2. [Usage](#usage)
3. [Uses](#uses)
4. [Bias, Risks, and Limitations](#bias-risks-and-limitations)
5. [Next Steps](#next-steps)
6. [Citation](#citation)

# Model Details

## Abstract

> Large language models (LLMs) demonstrate remarkable ability to comprehend, reason, and generate following natural language instructions. However, the development of LLMs has been primarily focused on high-resource languages, such as English, thereby limiting their applicability and research in other languages. Consequently, we present PolyLM, a multilingual LLM trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training. Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning. To assess the model's performance, we collect several existing multilingual tasks, including multilingual understanding, question answering, generation, and translation. Extensive experiments show that PolyLM surpasses other open-source models such as LLaMA and BLOOM on multilingual tasks while maintaining comparable performance in English.
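The curriculum strategy described in the abstract amounts to re-weighting the language mixture between pre-training stages. The snippet below is a purely illustrative sketch of such stage-dependent sampling weights, not PolyLM's actual data pipeline; the stage names, the grouping into `en`/`non_en`, and the helper function are assumptions made here for clarity.

```python
import random

# Hypothetical per-stage sampling weights: stage 1 keeps ~30% non-English data,
# stage 2 raises the non-English share to ~60% (as described in the abstract).
STAGE_WEIGHTS = {
    "stage_1": {"en": 0.70, "non_en": 0.30},
    "stage_2": {"en": 0.40, "non_en": 0.60},
}

def sample_language_group(stage: str) -> str:
    """Pick which data pool the next training document is drawn from."""
    groups, probs = zip(*STAGE_WEIGHTS[stage].items())
    return random.choices(groups, weights=probs, k=1)[0]

# Example: the mixture shifts toward non-English data in the second stage.
counts = {"en": 0, "non_en": 0}
for _ in range(10_000):
    counts[sample_language_group("stage_2")] += 1
print(counts)  # roughly {'en': 4000, 'non_en': 6000}
```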

## Model Description

- **Model type:** Decoder-only language model
- **Language(s) (NLP):** Chinese, English, Spanish, German, French, Portuguese, Russian, Italian, Arabic, Japanese, Korean, Thai, Vietnamese, Indonesian, Polish, Turkish, Dutch, Hebrew
- **License:** Apache 2.0
- **Original Checkpoints:** [Modelscope DAMO PolyLM-13B](https://www.modelscope.cn/models/damo/nlp_polylm_13b_text_generation/summary)
- **Link to paper:** [here](https://arxiv.org/pdf/2307.06018.pdf)
- **Total seen tokens:** 640 billion tokens
- **Version:** Version 1.0 / 12 July 2023

# Usage

Below is an example script showing how to use the model with `transformers`:

<details>
<summary> Click to expand </summary>

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer (the slow tokenizer is required for this model).
tokenizer = AutoTokenizer.from_pretrained("DAMO-NLP-MT/polylm-13b", use_fast=False)

# Load the model and spread it across the available devices.
model = AutoModelForCausalLM.from_pretrained("DAMO-NLP-MT/polylm-13b", device_map="auto")
model.eval()

input_doc = "Beijing is the capital of China. Translate this sentence from English to Chinese."

inputs = tokenizer(input_doc, return_tensors="pt")

# Sample 5 candidate continuations with top-k / nucleus sampling.
generate_ids = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    do_sample=True,
    max_new_tokens=128,
    top_k=10,
    top_p=0.9,
    temperature=0.7,
    repetition_penalty=1.0,
    num_return_sequences=5
)

# Decode the first of the returned sequences.
decoded = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print(f">>> {decoded}")
```

</details>
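Note that `num_return_sequences=5` makes `generate` return five candidate continuations, while the script above prints only the first one. If you want to inspect all of them, a small extension of the same script (reusing `tokenizer` and `generate_ids` from above) could look like this:

```python
# Decode and print every returned candidate, not just the first one.
candidates = tokenizer.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
for i, text in enumerate(candidates):
    print(f">>> candidate {i}: {text}")
```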

# Uses

## Direct Use and Downstream Use

> The primary use is research on language models, including: research on zero-shot NLP tasks and in-context few-shot learning NLP tasks, such as reasoning, and question answering; advancing fairness and safety research, and understanding limitations of current large language models

See the [research paper](https://arxiv.org/pdf/2307.06018.pdf) for further details.

## Out-of-Scope Use

More information needed.

# Bias, Risks, and Limitations

The information in this section is copied from the model's [research paper](https://arxiv.org/pdf/2307.06018.pdf):

> Our contributions are fully methodological: adding the support of multilingualism to LLM during training and SFT phases. It is unavoidable that PolyLM might exhibit several common deficiencies of language models, e.g. hallucination and toxicity. PolyLM should not be used directly in any application, without a prior assessment of safety and fairness concerns specific to the application.

# Next Steps

We are continuously enhancing the capabilities of PolyLM by focusing on the following aspects:

1. Replacement of absolute position embeddings with RoPE, as outlined in the research paper [here](https://arxiv.org/abs/2104.09864) (an illustrative sketch follows this list).
2. Expansion of the context window size to more than 10,000 tokens.
3. Verification of lightweight techniques to quickly enhance multilingual quality, especially for low-resource languages.

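For context on item 1, the following is a minimal, illustrative sketch of rotary position embeddings (RoPE) in the spirit of the RoFormer paper linked above. It is not PolyLM's implementation; the tensor layout and function name are assumptions made here, and the `base` of 10000 follows common practice.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape
    (batch, seq_len, num_heads, head_dim); head_dim must be even."""
    _, seq_len, _, head_dim = x.shape
    # One rotation frequency per pair of dimensions: theta_i = base^(-2i/head_dim).
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    # Rotation angle for every (position, frequency) pair: (seq_len, head_dim // 2).
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)
    cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    # Rotate each 2-D pair (x_even, x_odd) by its position-dependent angle.
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# Example: rotate a toy query projection before computing attention scores.
q = torch.randn(1, 16, 2, 64)  # (batch, seq_len, heads, head_dim)
print(apply_rope(q).shape)     # torch.Size([1, 16, 2, 64])
```

Because the rotation is applied to queries and keys, the attention scores depend on relative position offsets, which is one reason RoPE is commonly paired with longer context windows (item 2).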

# Citation

**BibTeX:**

```bibtex
@misc{wei2023polylm,
      title={PolyLM: An Open Source Polyglot Large Language Model},
      author={Xiangpeng Wei and Haoran Wei and Huan Lin and Tianhao Li and Pei Zhang and Xingzhang Ren and Mei Li and Yu Wan and Zhiwei Cao and Binbin Xie and Tianxiang Hu and Shangjie Li and Binyuan Hui and Bowen Yu and Dayiheng Liu and Baosong Yang and Fei Huang and Jun Xie},
      year={2023},
      eprint={2307.06018},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```