---
language:
- 'no'
- nb
- nn
inference: true
tags:
- mistral
- gpt
- generative
license: apache-2.0
pipeline_tag: text-generation
datasets:
- uonlp/CulturaX
- NbAiLab/NCC
- vikp/starcoder_filtered
---
# **NorMistral-7b-scratch**
<img align="center" src="https://huggingface.co/ltg/norbert3-base/resolve/main/norbert.png" width=12.5%>
NorMistral-7b-scratch is a large Norwegian language model pretrained from scratch on a total of 260 billion subword tokens (using six repetitions of open Norwegian texts).
This model is a part of the NORA.LLM family developed in collaboration between [the Language Technology Group at the University of Oslo](https://huggingface.co/ltg), [the High Performance Language Technologies (HPLT) project](https://hplt-project.org/), [the National Library of Norway](https://huggingface.co/NbAiLab), and [the University of Turku](https://huggingface.co/TurkuNLP).
All the models are pre-trained on the same dataset and with the same tokenizer.
NorMistral-7b-scratch has over 7 billion parameters and is based on [the Mistral architecture](https://huggingface.co/mistralai/Mistral-7B-v0.1).
The NORA.LLM language model family includes (as of now):
- [**NorMistral-7b-warm**](https://huggingface.co/norallm/normistral-7b-warm) -- an LLM initialized from [Mistral-7b-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) and continuously pretrained on Norwegian data;
- [**NorMistral-7b-scratch**](https://huggingface.co/norallm/normistral-7b-scratch) -- a Mistral-based LLM pretrained from scratch on Norwegian data;
- [**NorBLOOM-7b-scratch**](https://huggingface.co/norallm/NorBLOOM-7b-scratch) -- a BLOOM-based LLM pretrained from scratch on Norwegian data.
*Disclaimer: This model is pretrained on raw (mostly web-based) textual data.
It is not finetuned to follow instructions, and it can generate harmful completions after inappropriate user prompts.
It is primarily intended for research purposes.*
_____
## Pretraining corpus
The model is pretrained exclusively on publicly available data. We combine the resources from [the public part of the NCC corpus](https://huggingface.co/datasets/NbAiLab/NCC), from [the cleaned HPLT corpus](https://hplt-project.org/datasets/v1.2), and from [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX).
This resulted in over 34B subword tokens of Norwegian (Bokmål or Nynorsk) in total, which amounts to about 26.7B whitespace-separated tokens.
We also augment the corpus with [Starcoder](https://huggingface.co/datasets/vikp/starcoder_filtered); 20% of the 260B tokens are sampled from this code corpus.
The natural language data is repeated six times to get the pretraining budget of 260B tokens, in accordance with findings from [Muennighoff et al. (2023)](https://neurips.cc/virtual/2023/poster/70706).
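As a quick sanity check, the token budget described above can be recomputed from the rounded figures in this section. This is only a back-of-the-envelope sketch; the variable names are ours, and the inputs are the approximate numbers quoted in the text.

```python
# Back-of-the-envelope check of the 260B-token pretraining budget.
# All inputs are the rounded figures quoted in the text above.
norwegian_tokens = 34e9   # "over 34B" subword tokens of Norwegian text
repetitions = 6           # the natural language data is repeated six times
total_budget = 260e9      # total pretraining budget in subword tokens
code_fraction = 0.20      # share of the budget sampled from the Starcoder corpus

norwegian_seen = norwegian_tokens * repetitions  # Norwegian tokens seen during training
code_seen = code_fraction * total_budget         # code tokens seen during training
estimate = norwegian_seen + code_seen

print(f"Norwegian: {norwegian_seen / 1e9:.0f}B")
print(f"Code:      {code_seen / 1e9:.0f}B")
print(f"Total:     {estimate / 1e9:.0f}B of {total_budget / 1e9:.0f}B")
```

The small gap between the estimate and the full budget is consistent with the Norwegian corpus being somewhat larger than the rounded 34B figure.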
_____
## Model details
**Model Developers:** Language Technology Group at the University of Oslo.
**Variations:** NorMistral is currently published as two 7B variants: one trained entirely from *scratch* and one *warm*-started from the Mistral model.
**Input:** Textual input.
**Output:** Generated text.
**Model Architecture:** NorMistral is an auto-regressive language model that uses an optimized transformer architecture based on the Mistral/Llama language models.
||Training Data|Params|Context Length|Tokens|LR|
|---|---|---|---|---|---|
|NorMistral-7b-warm|NCC+HPLT+CulturaX+Starcoder|7B|2k|260B|1.0 x 10<sup>-4</sup>|
|NorMistral-7b-scratch|NCC+HPLT+CulturaX+Starcoder|7B|2k|260B|3.0 x 10<sup>-4</sup>|
|NorBLOOM-7b-scratch|NCC+HPLT+CulturaX+Starcoder|7B|2k|260B|1.2 x 10<sup>-4</sup>|
**Tokenizer:** Byte-based BPE tokenizer trained on the same Norwegian corpus as this model. The vocabulary size is 32,768 tokens.
**Training FLOPs:** Approximately 1.22e+22 FLOPs, calculated as in [Chowdhery et al. (2022)](https://arxiv.org/abs/2204.02311).
**Model Dates:** The models were pretrained between December 2023 and January 2024.
**Status:** These are only pretrained language models; instruction-finetuned models will follow soon.
**License:** [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)
**Research Paper:** Forthcoming
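The training FLOPs figure above can be roughly reproduced with the per-token estimate from Chowdhery et al. (2022), `6N + 12LHQT`. The sketch below is an approximation under stated assumptions: the parameter count and the layer, head, and head-dimension values are not given in this card and are assumed from the public Mistral-7B-v0.1 configuration, and the 2k context is assumed to mean 2048 tokens.

```python
# Sketch of the PaLM-style training FLOPs estimate (Chowdhery et al., 2022):
#   FLOPs per token ~= 6 * N + 12 * L * H * Q * T
# ASSUMPTIONS (not stated in this model card): the parameter count and the
# layer/head sizes are taken from the public Mistral-7B-v0.1 configuration.
n_params = 7.25e9   # N, assumed; the card only says "over 7 billion"
n_layers = 32       # L, assumed from Mistral-7B-v0.1
n_heads = 32        # H, assumed from Mistral-7B-v0.1
head_dim = 128      # Q, assumed from Mistral-7B-v0.1
seq_len = 2048      # T, assuming the "2k" context length means 2048 tokens
n_tokens = 260e9    # training tokens from the table above

flops_per_token = 6 * n_params + 12 * n_layers * n_heads * head_dim * seq_len
total_flops = flops_per_token * n_tokens
print(f"{total_flops:.2e} FLOPs")
```

Under these assumptions the estimate lands within a few percent of the reported value; the exact figure depends on the true parameter count.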
_____
## Initial evaluation
*Disclaimer: our model evaluation is ongoing and is not claimed to be exhaustive. We provide initial results on standard natural language understanding and generation tasks, and we will extend the evaluation design.
Users should evaluate the model for their particular application scenario, including safety and bias evaluations.*
The perplexity on the heldout [validation set from the Norwegian Colossal Corpus (NCC)](https://huggingface.co/datasets/NbAiLab/NCC) is 7.43 and the final training perplexity is 4.76.
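For readers unfamiliar with the metric, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below shows the computation on made-up numbers; it assumes natural-log token probabilities and per-subword averaging, which this card does not state explicitly.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Hypothetical log-probabilities for a four-token sequence (made-up numbers)
example = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
print(perplexity(example))
```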
Our initial downstream evaluation is conducted on reading comprehension, sentiment analysis and machine translation tasks using open-source peer-reviewed datasets and benchmarks in native Norwegian.
We release [our codebase here](https://github.com/ltgoslo/norallm). We compare against other pretrained generative language models that officially support Norwegian: [NB-GPT-J](https://huggingface.co/NbAiLab/nb-gpt-j-6B), [GPT-Sw3 6.7B](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b), [GPT-Sw3 6.7B v2](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b-v2), and [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b); we also include evaluation of [Mistral-7b-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1).
### Sentiment analysis
[NoReC](https://huggingface.co/datasets/ltg/norec_sentence) ([Øvrelid et al., 2020](https://aclanthology.org/2020.lrec-1.618/)) is a dataset for sentence-level sentiment analysis derived from the Norwegian Review Corpus [(Velldal et al., 2018)](https://aclanthology.org/L18-1661/).
We use the binary formulation of this task (positive vs. negative).
<details>
<summary>Method</summary>
* Evaluation setting: zero-shot and few-shot perplexity-based evaluation.
* Prompt: ```"Tekst: {text}\nSentiment:{label}"```, where the ```label``` is either "positiv" or "negativ".
* Few-shot results are averaged over 5 repetitions.
* Evaluation script: https://github.com/ltgoslo/norallm/blob/main/initial_evaluation/sentiment_analysis.py
* Performance metric: macro-averaged F1-score.
</details>
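
The perplexity-based setting described in the method above can be sketched as follows: fill the prompt template with each candidate label and pick the label that the model finds most likely. This is a minimal illustration, not the linked evaluation script. `classify` and `toy_log_likelihood` are hypothetical names, and the toy scorer merely stands in for the summed token log-probabilities a real language model would provide.

```python
from typing import Callable, Sequence

PROMPT = "Tekst: {text}\nSentiment:{label}"  # template quoted in the method above
LABELS = ("positiv", "negativ")

def classify(text: str,
             log_likelihood: Callable[[str], float],
             labels: Sequence[str] = LABELS) -> str:
    """Return the label whose filled-in prompt gets the highest log-likelihood."""
    return max(labels,
               key=lambda label: log_likelihood(PROMPT.format(text=text, label=label)))

def toy_log_likelihood(prompt: str) -> float:
    """Hypothetical stand-in for a language model scorer (keyword heuristic)."""
    text, label = prompt.rsplit("\nSentiment:", 1)
    looks_positive = "fantastisk" in text
    return 0.0 if looks_positive == (label == "positiv") else -5.0

print(classify("Filmen var fantastisk!", toy_log_likelihood))
```

Whether the actual script scores the whole sequence or only the label tokens, and whether it normalizes by length, is not specified here; consult the evaluation script for the exact procedure.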
<details open>
<summary>Macro-averaged F1-scores on the sentence-level sentiment analysis task (NoReC)</summary>
|Model|0-shot (macro F1)|1-shot (macro F1)|16-shot (macro F1)|
|---|---|---|---|
|NorMistral-7b-warm|60.6|**77.8**|**87.3**|
|NorMistral-7b-scratch|47.3|62.2|80.1|
|NorBLOOM-7b|**75.7**|73.8|65.5|
|NB-GPT-J|48.4|56.5|65.2|
|GPT-Sw3-6.7B|61.5|72.2|76.5|
|GPT-Sw3-6.7B-v2|42.4|69.1|83.4|
|Falcon-7B|53.3|61.6|74.9|
|Mistral-7B-v0.1|70.2|72.9|84.8|
</details>
### Reading comprehension
[NorQuAD](https://huggingface.co/datasets/ltg/norquad) ([Ivanova et al., 2023](https://aclanthology.org/2023.nodalida-1.17/)) is a dataset for extractive question answering in Norwegian designed similarly to [SQuAD (Rajpurkar et al., 2016)](https://aclanthology.org/D16-1264/).
<details>
<summary>Method</summary>
* Evaluation setting: zero-shot and few-shot settings via natural language generation using the greedy decoding strategy.
* Prompt: ```"Tittel: {title}\n\nTekst: {text}\n\nSpørsmål: {question}\n\nSvar:{answer}"```. Based on [Brown et al. (2020)](https://arxiv.org/abs/2005.14165).
* Few-shot results are averaged over 5 repetitions.
* Evaluation script: https://github.com/ltgoslo/norallm/blob/main/initial_evaluation/norquad.py
* Performance metrics: macro-averaged F1-score and exact match (EM).
</details>
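
To make the two metrics concrete, here is a small sketch of exact match and token-overlap F1 for a single prediction. It assumes SQuAD-style scoring with simple lowercasing and whitespace tokenization; the linked evaluation script may normalize answers differently, so treat this as an illustration only.

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the answers match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall (SQuAD-style)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# A prediction with one extra token: no exact match, but partial F1 credit
print(exact_match("i Oslo", "Oslo"), token_f1("i Oslo", "Oslo"))
```

Per-question scores like these are then averaged over the dataset to obtain the reported numbers.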
<details open>
<summary>Performance results on the extractive question answering task (NorQuAD)</summary>
|Model|0-shot (F1/EM)|1-shot (F1/EM)|2-shot (F1/EM)|
|---|---|---|---|
|NorMistral-7b-warm|**48.6**/**24.8**|63.6/40.0|66.5/43.8|
|NorMistral-7b-scratch|34.0/15.7|46.5/25.8|48.5/27.8|
|NorBLOOM-7b|35.0/13.3|47.7/28.0|49.3/30.1|
|NB-GPT-J|24.4/6.8|32.8/11.6|35.0/12.3|
|GPT-Sw3-6.7B|46.5/22.0|55.9/32.0|58.1/34.3|
|GPT-Sw3-6.7B-v2|46.9/22.5|61.1/38.9|66.0/44.5|
|Falcon-7B|15.8/7.0|27.3/13.9|27.4/13.1|
|Mistral-7B-v0.1|46.4/22.4|**64.9**/**41.1**|**71.7**/**49.4**|
</details>
### Machine translation
[Tatoeba](https://huggingface.co/datasets/Helsinki-NLP/tatoeba_mt) [(Tiedemann, 2020)](https://aclanthology.org/2020.wmt-1.139/) is a benchmark for machine translation, which includes hundreds of language pairs. We consider six language pairs (English <-> Bokmål, English <-> Nynorsk, and Bokmål <-> Nynorsk).
<details>
<summary>Method</summary>
* Evaluation setting: zero-shot and few-shot settings via natural language generation using the greedy decoding strategy.
* Prompt: ```"{source_language}: {source_text}\n{target_language}:{target_text}"```, where the ```source_language``` and ```target_language``` are ```Engelsk```, ```Bokmål```, or ```Nynorsk```. Based on [Garcia et al. (2023)](https://arxiv.org/abs/2302.01398).
* Few-shot results are averaged over 5 repetitions.
* Evaluation script: https://github.com/ltgoslo/norallm/blob/main/initial_evaluation/machine_translation.py
* Performance metrics: BLEU ([Papineni et al., 2002](https://aclanthology.org/P02-1040/)) and chrF++ ([Popović, 2015](https://aclanthology.org/W15-3049/)).
</details>
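
The prompt template in the method above extends naturally to few-shot prompting by prepending solved demonstrations. The sketch below shows one way to assemble such a prompt. `build_prompt` is a hypothetical helper, and both the blank-line separator between demonstrations and the leading space before each demonstration target are assumptions rather than details taken from the evaluation script.

```python
TEMPLATE = "{source_language}: {source_text}\n{target_language}:{target_text}"

def build_prompt(source_text, shots=(), source_language="Engelsk",
                 target_language="Bokmål", separator="\n\n"):
    """Build a zero- or few-shot translation prompt.

    shots: iterable of (source, target) pairs used as solved demonstrations.
    separator: assumed delimiter between demonstrations (not from the card).
    """
    blocks = [
        TEMPLATE.format(source_language=source_language, source_text=src,
                        target_language=target_language,
                        target_text=" " + tgt)  # leading space is an assumption
        for src, tgt in shots
    ]
    # The final block leaves the target empty for the model to complete
    blocks.append(TEMPLATE.format(source_language=source_language,
                                  source_text=source_text,
                                  target_language=target_language,
                                  target_text=""))
    return separator.join(blocks)

print(build_prompt("Hello", shots=[("Thank you", "Takk")]))
```

With no demonstrations, the helper produces a prompt consisting of just the `Engelsk:` source line followed by an open `Bokmål:` line.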
<details open>
<summary>English → Norwegian Bokmål</summary>
|Model|0-shot (BLEU/chrF++)|1-shot (BLEU/chrF++)|5-shot (BLEU/chrF++)|
|---|---|---|---|
|NorMistral-7b-warm|**55.8**/**70.7**|**56.7**/**71.5**|57.7/72.4|
|NorMistral-7b-scratch|46.4/62.9|50.4/66.3|52.1/67.6|
|NorBLOOM-7b|37.1/53.6|50.1/65.8|52.0/67.6|
|NB-GPT-J|8.6/39.1|35.9/64.5|47.2/68.7|
|GPT-Sw3-6.7B|21.8/55.2|54.5/69.6|**58.6**/**73.2**|
|GPT-Sw3-6.7B-v2|20.6/53.2|51.2/66.6|58.4/73.0|
|Falcon-7B|19.1/40.1|20.6/41.8|22.1/43.6|
|Mistral-7B-v0.1|32.5/51.9|35.4/55.1|36.3/56.0|
</details>
<details open>
<summary>English → Norwegian Nynorsk</summary>
|Model|0-shot (BLEU/chrF++)|1-shot (BLEU/chrF++)|5-shot (BLEU/chrF++)|
|---|---|---|---|
|NorMistral-7b-warm|**43.6**/**62.0**|**44.2**/**63.2**|44.3/**63.7**|
|NorMistral-7b-scratch|38.0/56.9|39.2/57.9|40.7/59.3|
|NorBLOOM-7b|35.6/54.7|36.6/56.3|38.1/57.4|
|NB-GPT-J|1.7/14.7|6.3/34.1|35.2/60.4|
|GPT-Sw3-6.7B|13.4/44.3|43.6/62.5|**44.5**/63.5|
|GPT-Sw3-6.7B-v2|14.8/45.5|43.7/62.3|44.0/63.6|
|Falcon-7B|6.4/28.6|8.3/30.5|9.3/32.1|
|Mistral-7B-v0.1|11.6/35.7|13.5/38.7|15.0/40.0|
</details>
<details open>
<summary>Norwegian Bokmål → English</summary>
|Model|0-shot (BLEU/chrF++)|1-shot (BLEU/chrF++)|5-shot (BLEU/chrF++)|
|---|---|---|---|
|NorMistral-7b-warm|**56.7**/**70.6**|**57.7**/**71.7**|**58.5**/**72.2**|
|NorMistral-7b-scratch|48.1/62.9|51.5/66.6|52.6/67.6|
|NorBLOOM-7b|46.0/61.5|51.3/66.7|51.7/66.9|
|NB-GPT-J|23.9/55.3|32.3/63.1|48.5/68.7|
|GPT-Sw3-6.7B|47.9/67.8|52.4/70.6|50.0/70.7|
|GPT-Sw3-6.7B-v2|38.8/59.6|49.0/68.6|50.7/70.6|
|Falcon-7B|42.4/58.5|47.3/62.3|48.6/63.3|
|Mistral-7B-v0.1|53.8/68.2|54.6/69.0|56.9/70.7|
</details>
<details open>
<summary>Norwegian Nynorsk → English</summary>
|Model|0-shot (BLEU/chrF++)|1-shot (BLEU/chrF++)|5-shot (BLEU/chrF++)|
|---|---|---|---|
|NorMistral-7b-warm|**55.1**/**68.4**|**55.5**/**69.5**|56.0/69.8|
|NorMistral-7b-scratch|47.1/61.9|49.4/64.2|52.3/66.2|
|NorBLOOM-7b|45.0/59.3|48.3/64.0|49.0/64.7|
|NB-GPT-J|2.9/19.5|10.1/41.0|44.4/66.9|
|GPT-Sw3-6.7B|47.8/66.2|49.1/68.1|49.6/69.4|
|GPT-Sw3-6.7B-v2|46.3/67.5|48.9/69.3|**58.2**/**72.8**|
|Falcon-7B|21.6/40.6|31.7/47.4|36.6/57.1|
|Mistral-7B-v0.1|40.7/57.1|46.2/60.7|49.9/63.8|
</details>
<details open>
<summary>Norwegian Bokmål → Norwegian Nynorsk</summary>
|Model|0-shot (BLEU/chrF++)|1-shot (BLEU/chrF++)|5-shot (BLEU/chrF++)|
|---|---|---|---|
|NorMistral-7b-warm|**75.8**/**87.5**|74.0/**86.9**|75.3/87.5|
|NorMistral-7b-scratch|38.0/56.9|39.2/57.9|40.7/59.3|
|NorBLOOM-7b|71.5/84.4|70.1/84.1|71.9/85.1|
|NB-GPT-J|6.6/35.5|9.6/41.0|26.0/64.7|
|GPT-Sw3-6.7B|63.6/82.8|74.7/86.0|75.8/86.9|
|GPT-Sw3-6.7B-v2|57.5/81.1|**75.3**/86.7|**76.7**/**87.6**|
|Falcon-7B|28.7/59.2|29.8/60.8|32.1/62.3|
|Mistral-7B-v0.1|32.0/62.2|32.9/62.6|35.2/63.9|
</details>
<details open>
<summary>Norwegian Nynorsk → Norwegian Bokmål</summary>
|Model|0-shot (BLEU/chrF++)|1-shot (BLEU/chrF++)|5-shot (BLEU/chrF++)|
|---|---|---|---|
|NorMistral-7b-warm|**88.1**/**93.6**|**89.2**/**94.3**|**89.3**/**94.6**|
|NorMistral-7b-scratch|85.1/91.4|86.6/92.4|87.4/93.0|
|NorBLOOM-7b|78.7/88.5|84.2/90.7|87.4/93.0|
|NB-GPT-J|2.7/18.5|6.9/35.6|52.9/84.3|
|GPT-Sw3-6.7B|652.3/82.4|86.1/92.5|87.8/93.6|
|GPT-Sw3-6.7B-v2|72.0/88.6|86.1/92.5|88.2/93.9|
|Falcon-7B|36.7/61.6|38.3/63.5|45.8/68.1|
|Mistral-7B-v0.1|57.0/74.8|59.9/77.5|62.6/79.1|
</details>
_____
## Hardware and Software
**Training Factors:** The models were pretrained using the Megatron-DeepSpeed library on [the LUMI cluster in Finland](https://lumi-supercomputer.eu/).
**Carbon Footprint:** Pretraining one model took approximately 70k GPU hours of computation on AMD MI250X GPUs (assuming 2 GPUs per one AMD MI250X device), each of which draws 500W.
LUMI is [one of the most eco-efficient data centers in the world](https://www.lumi-supercomputer.eu/sustainable-future/), and its energy consumption is covered 100% with renewable electricity.
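The figures above allow a rough estimate of the energy used to pretrain one model. The text leaves open whether the 500W applies to a full MI250X device or to each of its two counted GPUs, so the sketch below computes both readings. It covers accelerator power only and ignores CPUs, networking, and cooling.

```python
gpu_hours = 70_000     # approximate GPU hours quoted above
gpus_per_device = 2    # one MI250X device is counted as 2 GPUs
power_kw = 0.5         # 500 W expressed in kilowatts

device_hours = gpu_hours / gpus_per_device

# Reading 1: each MI250X device draws 500 W
energy_per_device_reading_mwh = device_hours * power_kw / 1000
# Reading 2: each counted GPU (half a device) draws 500 W
energy_per_gpu_reading_mwh = gpu_hours * power_kw / 1000

print(f"500 W per device: {energy_per_device_reading_mwh:.1f} MWh")
print(f"500 W per GPU:    {energy_per_gpu_reading_mwh:.1f} MWh")
```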
_____
## Example usage
Let's try to use this model for English-to-Norwegian machine translation using simple zero-shot prompting:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# First, load the tokenizer and the language model
tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-7b-scratch")
model = AutoModelForCausalLM.from_pretrained("norallm/normistral-7b-scratch").cuda().eval()

# Now we will define the zero-shot prompt template
prompt = """Engelsk: {0}
Bokmål:"""

# A function that will take care of generating the output
@torch.no_grad()
def generate(text):
    text = prompt.format(text)
    input_ids = tokenizer(text, return_tensors='pt').input_ids.cuda()
    prediction = model.generate(
        input_ids,
        max_new_tokens=64,
        do_sample=False,  # greedy decoding
        eos_token_id=tokenizer('\n').input_ids  # stop at the end of the line
    )
    # Decode only the newly generated tokens, skipping the prompt
    return tokenizer.decode(prediction[0, input_ids.size(1):]).strip()

# Now you can simply call the generate function with an English text you want to translate:
generate("I'm super excited about this Norwegian NORA model! Can it translate these sentences?")
# > this should output: 'Jeg er super spent på denne norske NORA modellen! Kan den oversette disse setningene?'
```
## Example usage on a GPU with ~16GB VRAM (try for yourself [in Google Colab](https://colab.research.google.com/drive/1AQgJ8lN-SNOqkUKj4xpQI5rr0R7V2Xzy?usp=sharing))
Install `bitsandbytes` and `accelerate` if you want to load the model in 8-bit:
```bash
pip install bitsandbytes
pip install accelerate
```
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(
    "norallm/normistral-7b-scratch"
)

# This setup needs about 8 GB of VRAM
# Setting `load_in_8bit=False` -> 15 GB of VRAM
# Using `torch.float32` and `load_in_8bit=False` -> 21 GB of VRAM
model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-7b-scratch",
    device_map='auto',
    load_in_8bit=True,
    torch_dtype=torch.bfloat16
)
```