sedrickkeh committed on
Commit
fb5d3c9
1 Parent(s): 51801c5

Create README.md

  1. README.md +191 -0
README.md ADDED
@@ -0,0 +1,191 @@
1
+ ---
2
+ license: apache-2.0
3
+ datasets:
4
+ - tiiuae/falcon-refinedweb
5
+ pipeline_tag: text-generation
6
+ library_name: openlm
7
+ tags:
8
+ - linear
9
+ - mistral
10
+ language:
11
+ - en
12
+ model-index:
13
+ - name: mistral-supra
14
+   results:
15
+   - task:
16
+       type: text-generation
17
+     dataset:
18
+       type: MMLU
19
+       name: MMLU
20
+     metrics:
21
+     - name: accuracy
22
+       type: accuracy
23
+       value: 34.2
24
+       verified: false
25
+   - task:
26
+       type: text-generation
27
+     dataset:
28
+       type: HellaSwag
29
+       name: HellaSwag
30
+     metrics:
31
+     - name: accuracy
32
+       type: accuracy
33
+       value: 77.1
34
+       verified: false
35
+   - task:
36
+       type: text-generation
37
+     dataset:
38
+       type: PIQA
39
+       name: PIQA
40
+     metrics:
41
+     - name: accuracy
42
+       type: accuracy
43
+       value: 80.4
44
+       verified: false
45
+   - task:
46
+       type: text-generation
47
+     dataset:
48
+       type: Winogrande
49
+       name: Winogrande
50
+     metrics:
51
+     - name: accuracy
52
+       type: accuracy
53
+       value: 70.3
54
+       verified: false
55
+   - task:
56
+       type: text-generation
57
+     dataset:
58
+       type: ai2_arc
59
+       name: ARC-E
60
+     metrics:
61
+     - name: accuracy
62
+       type: accuracy
63
+       value: 75.9
64
+       verified: false
65
+   - task:
66
+       type: text-generation
67
+     dataset:
68
+       type: ai2_arc
69
+       name: ARC-C
70
+     metrics:
71
+     - name: accuracy
72
+       type: accuracy
73
+       value: 45.8
74
+       verified: false
75
+ ---
76
+
77
+ # Mistral-SUPRA
78
+ This model was initialized from the weights of the [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) transformer model and uptrained to become a linear RNN.
79
+
80
+ This model accompanies our paper [Linearizing Large Language Models](), which details how we convert a softmax transformer into a linear transformer that, at inference time, can run either as a transformer or as a recurrent model.
81
+
82
+ We uptrain Mistral-7B on 100B tokens of RefinedWeb.
83
+
84
+
85
+ ## Model Details
86
+ - **Developed by**: [Toyota Research Institute](https://www.tri.global/our-work/robotics)
87
+ - **Model Type**: This is an auto-regressive language model initialized from [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) and uptrained into a linear model based on the [SUPRA](https://arxiv.org/abs/2312.00752) architecture.
88
+ - **Dataset**: Initialized from [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1). Uptrained on 100B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb).
89
+ - **Tokenizer**: `mistralai/Mistral-7B-v0.1`
90
+ - **Library**: [OpenLM](https://github.com/mlfoundations/open_lm/)
91
+ - **License**: This model is licensed under [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
92
+
93
+ | Parameters | Hidden Size | Layers | Vocab Size | Sequence Length |
94
+ |------------|-------------|--------| ---------- | --------------- |
95
+ | 7B | 4096 | 32 | 32000 | 2048 |
96
+
97
+ ## Training Details
98
+ - Mistral-SUPRA was trained using AWS SageMaker on 128 H100 80GB GPUs.
99
+ - Training on 100B tokens finished in 1.5 days.
100
+ | **Hyperparameter** | **Value** |
101
+ |--------------------|------------|
102
+ | Precision | `bfloat16` |
103
+ | Optimizer | AdamW |
104
+ | Learning rate | 3e-5 |
105
+ | LR cooldown end | 1e-5 |
106
+ | Warmup steps | 1000 |
107
+ | Batch size | 2M |
108
+ | QK norm | False |
109
+
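As a rough illustration of what the numbers above imply, the sketch below derives the step count and per-GPU throughput and plots out a learning-rate curve. It rests on two assumptions that the card does not state: that "2M" means about 2 million tokens per optimizer step, and that the schedule is a linear warmup followed by a cosine cooldown. The function and constant names are hypothetical.

```python
import math

PEAK_LR = 3e-5        # "Learning rate" from the table
END_LR = 1e-5         # "LR cooldown end" from the table
WARMUP_STEPS = 1000   # "Warmup steps" from the table

# Assumption: a "2M" batch is about 2 million tokens per optimizer step.
TOKENS_PER_STEP = 2_000_000
TOTAL_TOKENS = 100_000_000_000
TOTAL_STEPS = TOTAL_TOKENS // TOKENS_PER_STEP  # 50000 steps under this assumption


def learning_rate(step):
    """Linear warmup, then cosine decay to END_LR (the schedule shape is assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return END_LR + 0.5 * (PEAK_LR - END_LR) * (1.0 + math.cos(math.pi * progress))


# Rough throughput implied by 100B tokens in 1.5 days on 128 GPUs.
TRAINING_SECONDS = 1.5 * 24 * 3600
tokens_per_gpu_per_second = TOTAL_TOKENS / TRAINING_SECONDS / 128

print("total steps:", TOTAL_STEPS)
print("lr at step 0:", learning_rate(0))
print("lr after warmup:", learning_rate(WARMUP_STEPS))
print("lr at the end:", learning_rate(TOTAL_STEPS))
print("tokens/s per GPU:", round(tokens_per_gpu_per_second))
```

If the batch size is actually 2^21 tokens, or the scheduler differs, only the constants and the body of `learning_rate` need to change.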
110
+
111
+ ## Usage
112
+ This model was trained using [OpenLM](https://github.com/mlfoundations/open_lm/). The weights have been converted to be compatible with HuggingFace.
113
+
114
+ To use the model, first install OpenLM from the `tri-ml/linear_open_lm` repository.
115
+ ```bash
116
+ pip install git+https://github.com/tri-ml/linear_open_lm.git
117
+ ```
118
+
119
+ Import the OpenLM classes with
120
+
121
+ ```python
122
+ from open_lm.open_lm_hf import *
123
+ ```
124
+
125
+ The model can then be loaded normally using `AutoTokenizer` and `AutoModelForCausalLM` as follows:
126
+
127
+ ```python
128
+ from open_lm.open_lm_hf import *
129
+ from transformers import AutoTokenizer, AutoModelForCausalLM
130
+ tokenizer = AutoTokenizer.from_pretrained("tri-ml/mistral-supra")
131
+ model = AutoModelForCausalLM.from_pretrained("tri-ml/mistral-supra")
132
+
133
+ inputs = tokenizer(["Machine learning is"], return_tensors="pt")
134
+ gen_kwargs = {"max_new_tokens": 50, "top_p": 0.8, "temperature": 0.8, "do_sample": True, "repetition_penalty": 1.1}
135
+ output = model.generate(inputs['input_ids'], **gen_kwargs)
136
+ output = tokenizer.decode(output[0].tolist(), skip_special_tokens=True)
137
+ print(output)
138
+ # Machine learning is a branch of artificial intelligence (AI) that enables computers to learn from experience without being explicitly programmed. Machine learning is used in a wide range of applications, including spam filtering, image recognition, speech recognition, and computer-based medical diagnosis
139
+ ```
140
+
141
+ The Mistral-SUPRA model can be used both in parallel mode and in recurrent mode. If `use_cache` is set to `False` for `model.generate(...)`, then it will use recurrent mode; otherwise, it will use parallel mode.
142
+ Recurrent mode uses `xformers` and requires both the inputs and the model to be loaded onto a GPU.
143
+
144
+ ```python
145
+ # Recurrent mode
146
+ output = model.to('cuda').generate(inputs['input_ids'].to('cuda'), use_cache=False, **gen_kwargs)
147
+ ```
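To give some intuition for why a linear transformer can run in either mode, here is a toy sketch of causal linear attention computed two ways. It is not the SUPRA implementation: the feature map `phi`, the sum-based normalization, and all function names are illustrative assumptions, and it uses plain Python lists rather than tensors.

```python
# Toy causal linear attention, computed in parallel form and in recurrent form.
# Illustrative only: phi and the normalization are assumptions, not SUPRA's.

def phi(x):
    # Hypothetical positive feature map (ReLU + 1) so the normalizer stays > 0.
    return [max(v, 0.0) + 1.0 for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def parallel_mode(queries, keys, values):
    """Transformer-style: each position attends over all earlier positions."""
    v_dim = len(values[0])
    outputs = []
    for t, q in enumerate(queries):
        weights = [dot(phi(q), phi(keys[s])) for s in range(t + 1)]
        norm = sum(weights)
        outputs.append([
            sum(w * values[s][d] for s, w in enumerate(weights)) / norm
            for d in range(v_dim)
        ])
    return outputs


def recurrent_mode(queries, keys, values):
    """RNN-style: carry a fixed-size state (S, z) from token to token."""
    k_dim, v_dim = len(keys[0]), len(values[0])
    S = [[0.0] * v_dim for _ in range(k_dim)]  # running sum of phi(k) v^T
    z = [0.0] * k_dim                          # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        fk = phi(k)
        for i in range(k_dim):
            z[i] += fk[i]
            for d in range(v_dim):
                S[i][d] += fk[i] * v[d]
        fq = phi(q)
        norm = dot(fq, z)
        outputs.append([
            sum(fq[i] * S[i][d] for i in range(k_dim)) / norm
            for d in range(v_dim)
        ])
    return outputs


queries = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
keys = [[1.0, 0.0], [0.2, 0.7], [-0.5, 1.2]]
values = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]

par = parallel_mode(queries, keys, values)
rec = recurrent_mode(queries, keys, values)
max_diff = max(abs(a - b) for p, r in zip(par, rec) for a, b in zip(p, r))
print("modes agree:", max_diff < 1e-9)
```

The recurrent form keeps only `S` and `z`, whose size does not grow with sequence length, while the parallel form revisits every earlier position. Because the attention weights factor as a dot product of feature maps, both forms produce the same outputs.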
148
+
149
+
150
+ ## Performance Evaluation
151
+ We ran our evaluations with the [Eleuther LM Eval Harness](https://github.com/EleutherAI/lm-evaluation-harness).
152
+
153
+ Below we report the performance of Mistral-SUPRA compared to other similarly sized models.
154
+
155
+ <div class="evalTable">
156
+
157
+ | | HellaSwag | PIQA | Winogrande | ARC-E | ARC-C | MMLU (5-shot) |
158
+ | ----------------- | ------------- | -------- | -------------- | --------- | --------- | ---------------- |
159
+ | Llama2-7B | 76.0 | 79.1 | 69.1 | 76.3 | 46.3 | 45.9 |
160
+ | Gemma-7B | 80.7 | 81.9 | 73.7 | 81.1 | 53.2 | 62.9 |
161
+ | Mistral-7B | 81.0 | 82.1 | 74.0 | 80.9 | 53.8 | 62.4 |
162
+ | RWKV5-1.7T-7B | 73.0 | 78.6 | 72.9 | 75.8 | 45.6 | 34.9 |
163
+ | Mamba-7B | 77.9 | 81.0 | 71.8 | 77.5 | 46.7 | 33.3 |
164
+ | **Mistral-SUPRA** | 77.1 | 80.4 | 70.3 | 75.9 | 45.8 | 34.2 |
165
+
166
+ </div>
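For readers who want the comparison as per-task differences, the short sketch below subtracts one row of the table from another. The scores are copied from the table; the `gap` helper is a hypothetical name used only for this example.

```python
# Scores copied from the evaluation table (MMLU is 5-shot).
scores = {
    "Mistral-7B": {
        "HellaSwag": 81.0, "PIQA": 82.1, "Winogrande": 74.0,
        "ARC-E": 80.9, "ARC-C": 53.8, "MMLU": 62.4,
    },
    "Mamba-7B": {
        "HellaSwag": 77.9, "PIQA": 81.0, "Winogrande": 71.8,
        "ARC-E": 77.5, "ARC-C": 46.7, "MMLU": 33.3,
    },
    "Mistral-SUPRA": {
        "HellaSwag": 77.1, "PIQA": 80.4, "Winogrande": 70.3,
        "ARC-E": 75.9, "ARC-C": 45.8, "MMLU": 34.2,
    },
}


def gap(model, baseline):
    """Per-task score difference (model minus baseline), rounded to one decimal."""
    return {
        task: round(scores[model][task] - scores[baseline][task], 1)
        for task in scores[model]
    }


vs_mistral = gap("Mistral-SUPRA", "Mistral-7B")
vs_mamba = gap("Mistral-SUPRA", "Mamba-7B")

for task in vs_mistral:
    print(f"{task:<11} vs Mistral-7B: {vs_mistral[task]:+.1f}   vs Mamba-7B: {vs_mamba[task]:+.1f}")
```

On these numbers, the largest gap to the Mistral-7B starting point is on MMLU, while the other recurrent models in the table score within a couple of points of Mistral-SUPRA on every task.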
167
+
168
+
169
+ ## How to Cite
170
+ If you use this model, please cite our paper on Linearizing Large Language Models.
171
+ ```
172
+ @article{Mercat2024Linearizing,
173
+ title={Linearizing Large Language Models},
174
+ author={Jean Mercat and Igor Vasiljevic and Sedrick Keh and Kushal Arora and Achal Dave and Adrien Gaidon and Thomas Kollar},
175
+ journal={ArXiv},
176
+ year={2024},
177
+ volume={},
178
+ }
179
+ ```
180
+
181
+ ## Citations
182
+ OpenLM
183
+ ```
184
+ @misc{open_lm,
185
+ author = {Gururangan, Suchin and Wortsman, Mitchell and Gadre, Samir Yitzhak and Dave, Achal and Kilian, Maciej and Shi, Weijia and Mercat, Jean and Smyrnis, Georgios and Ilharco, Gabriel and Jordan, Matt and Heckel, Reinhard and Dimakis, Alex and Farhadi, Ali and Shankar, Vaishaal and Schmidt, Ludwig},
186
+ title = {{open_lm}: a minimal but performative language modeling (LM) repository},
187
+ year = {2023},
188
+ note = {GitHub repository},
189
+ url = {https://github.com/mlfoundations/open_lm/}
190
+ }
191
+ ```