crodri committed on
Commit 80a7210
1 Parent(s): 8532f15

Initial upload

Quantized version of the foundational model

README.md CHANGED
@@ -1,3 +1,245 @@
  ---
- license: mit
+ language:
+ - en
+ - es
+ - ca
+ license: apache-2.0
+ tags:
+ - aguila
+ - falcon
+ - spanish
+ - catalan
+ metrics:
+ - ppl
+ model-index:
+ - name: aguila_7b
+   results:
+   - task:
+       name: Causal Language Modeling
+       type: text-generation
+     metrics:
+     - name: Perplexity
+       type: ppl
+       value: 8.59
+ pipeline_tag: text-generation
+ widget:
+ - text: |-
+     Respon a la pregunta següent.
+     Pregunta: "Quina és la capital de Suècia?"
+     Resposta: "La capital de Suècia és Estocolm."
+     ----
+     Respon a la pregunta següent.
+     Pregunta: "Quina beguda es consumeix als matins per despertar-se?"
+     Resposta: "La majoria de gent consumeix cafè per despertar-se."
+     ----
+     Respon a la pregunta següent.
+     Pregunta: "Explica com funciona un motor de combustió"
+     Resposta:
+   example_title: Pregunta-Resposta
+ - text: |-
+     Extrae las entidades nombradas del siguiente texto:
+     Texto: "Me llamo Wolfgang y vivo en Berlin"
+     Entidades: Wolfgang:PER, Berlin:LOC
+     ----
+     Extrae las entidades nombradas del siguiente texto:
+     Texto: "Hoy voy a visitar el parc güell tras salir del barcelona supercomputing center"
+     Entidades: parc güell:LOC, barcelona supercomputing center:LOC
+     ----
+     Extrae las entidades nombradas del siguiente texto:
+     Texto: "Maria y Miguel no tienen ningún problema contigo"
+     Entidades: Maria:PER, Miguel:PER
+     ----
+     Extrae las entidades nombradas del siguiente texto:
+     Texto: "Damián se cortó el pelo"
+     Entidades: Damián:PER
+     ----
+     Extrae las entidades nombradas del siguiente texto:
+     Texto: "Lo mejor de Barcelona és el bar de mi amigo Pablo"
+     Entidades: Pablo:PER, Barcelona:LOC
+     ----
+     Extrae las entidades nombradas del siguiente texto:
+     Texto: "Carlos comparte piso con Marc"
+     Entidades:
+   example_title: Entidades-Nombradas
  ---
+ 
+ # Ǎguila-7B
+ 
+ ## Table of Contents
+ <details>
+ <summary>Click to expand</summary>
+ 
+ - [Model description](#model-description)
+ - [Intended uses and limitations](#intended-uses-and-limitations)
+ - [How to use](#how-to-use)
+ - [Limitations and bias](#limitations-and-bias)
+ - [Language adaptation](#language-adaptation)
+ - [Training](#training)
+   - [Training data](#training-data)
+   - [Training procedure](#training-procedure)
+ - [Additional information](#additional-information)
+   - [Author](#author)
+   - [Contact](#contact)
+   - [Copyright](#copyright)
+   - [License](#license)
+   - [Funding](#funding)
+   - [Disclaimer](#disclaimer)
+ 
+ </details>
+ 
+ ## Model description
+ 
+ **Ǎguila-7B** is a transformer-based causal language model for Catalan, Spanish, and English.
+ It is based on the [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b) model and was trained on a 26B-token
+ trilingual corpus collected from publicly available corpora and crawlers.
+ 
+ 
+ ## Intended uses and limitations
+ 
+ The **Ǎguila-7B** model is ready to use only for causal language modeling (text generation).
+ For other downstream tasks, it is intended to be fine-tuned.
+ 
+ ## How to use
+ 
+ Here is how to use this model with the `transformers` text-generation pipeline:
+ 
+ ```python
+ import torch
+ from transformers import AutoTokenizer, pipeline
+ 
+ input_text = "El mercat del barri és fantàstic, hi pots trobar"
+ 
+ model_id = "projecte-aina/aguila-7b"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ 
+ # Build a text-generation pipeline that loads the model in bfloat16
+ # and spreads it across the available devices.
+ generator = pipeline(
+     "text-generation",
+     model=model_id,
+     tokenizer=tokenizer,
+     torch_dtype=torch.bfloat16,
+     trust_remote_code=True,
+     device_map="auto",
+ )
+ 
+ # Sample a continuation of the Catalan prompt.
+ generation = generator(
+     input_text,
+     do_sample=True,
+     top_k=10,
+     eos_token_id=tokenizer.eos_token_id,
+ )
+ 
+ print(f"Result: {generation[0]['generated_text']}")
+ ```
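+ 
+ This commit uploads a quantized version of the foundational model. Purely as a sketch, and assuming the
+ `bitsandbytes` library is installed and the `projecte-aina/aguila-7b` checkpoint is compatible with it,
+ the weights could also be loaded with on-the-fly 4-bit quantization; this is not the exact quantization
+ recipe used for this upload.
+ 
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+ 
+ model_id = "projecte-aina/aguila-7b"  # same repository id as in the example above
+ 
+ # Illustrative 4-bit NF4 settings with bfloat16 compute; not the official recipe.
+ quant_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=torch.bfloat16,
+ )
+ 
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     quantization_config=quant_config,
+     trust_remote_code=True,
+     device_map="auto",
+ )
+ 
+ inputs = tokenizer("El mercat del barri és fantàstic, hi pots trobar", return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, do_sample=True, top_k=10, max_new_tokens=50)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```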
+ 
+ ## Limitations and bias
+ At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model.
+ However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques
+ on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
+ 
+ 
+ ## Language adaptation
+ 
+ We adapted the original [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b) model to Spanish and Catalan by swapping the tokenizer and adjusting the embedding layer.
+ 
+ The adaptation procedure is explained in [this blog post](https://medium.com/@mpamies247/ee1ebc70bc79).
+ 
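+ A minimal sketch of the general idea (not the exact procedure from the blog post) is shown below; the tokenizer
+ path is a placeholder, and the re-mapping of individual token embeddings is omitted:
+ 
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ 
+ # Load the original English-centric checkpoint.
+ base_model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b", trust_remote_code=True)
+ 
+ # Hypothetical path to a newly trained trilingual (ca/es/en) tokenizer.
+ new_tokenizer = AutoTokenizer.from_pretrained("path/to/new-trilingual-tokenizer")
+ 
+ # Resize the embedding matrix (and tied output head) to the new vocabulary size.
+ # The actual adaptation also re-initializes or re-maps individual token embeddings.
+ base_model.resize_token_embeddings(len(new_tokenizer))
+ ```
+ 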
+ ## Training
+ 
+ ### Training data
+ 
+ The training corpus consists of 26B tokens drawn from several corpora gathered from web crawls and public-domain data.
+ 
+ | Dataset             | Language | Words (per epoch) | Epochs       |
+ |---------------------|----------|-------------------|--------------|
+ | Wikipedia           | en       | 2169.97M          | 1.428144485  |
+ | C4_es               | es       | 53709.80M         | 0.1049686196 |
+ | Biomedical          | es       | 455.03M           | 0.7140722425 |
+ | Legal               | es       | 995.70M           | 0.7140722425 |
+ | Wikipedia           | es       | 693.60M           | 1.428144485  |
+ | Gutenberg           | es       | 53.18M            | 0.7140722425 |
+ | C4_ca               | ca       | 2826.00M          | 2.142216727  |
+ | Biomedical          | ca       | 11.80M            | 1.428144485  |
+ | RacoCatalà Noticias | ca       | 17.16M            | 2.142216727  |
+ | RacoCatalà Forums   | ca       | 333.73M           | 2.142216727  |
+ | CaWaC               | ca       | 57.79M            | 2.142216727  |
+ | Wikipedia           | ca       | 228.01M           | 3.570361212  |
+ | Vilaweb             | ca       | 50.34M            | 2.142216727  |
+ 
+ The dataset has the following language distribution:
+ 
+ | Language | Percentage |
+ |----------|------------|
+ | En       | 16.84%     |
+ | Es       | 41.38%     |
+ | Ca       | 41.79%     |
+ 
+ Note: A small amount of English data was kept to avoid catastrophic forgetting.
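+ 
+ The percentages appear to follow, at least approximately, from weighting each corpus by its number of epochs
+ (effective words ≈ per-epoch words × epochs). A quick, purely illustrative check:
+ 
+ ```python
+ # Rough sanity check of the language distribution, using the table above.
+ # Effective words per corpus ≈ per-epoch words (in millions) × epochs.
+ corpora = [
+     ("en", 2169.97, 1.428144485),    # Wikipedia (en)
+     ("es", 53709.80, 0.1049686196),  # C4_es
+     ("es", 455.03, 0.7140722425),    # Biomedical (es)
+     ("es", 995.70, 0.7140722425),    # Legal (es)
+     ("es", 693.60, 1.428144485),     # Wikipedia (es)
+     ("es", 53.18, 0.7140722425),     # Gutenberg (es)
+     ("ca", 2826.00, 2.142216727),    # C4_ca
+     ("ca", 11.80, 1.428144485),      # Biomedical (ca)
+     ("ca", 17.16, 2.142216727),      # RacoCatalà Noticias
+     ("ca", 333.73, 2.142216727),     # RacoCatalà Forums
+     ("ca", 57.79, 2.142216727),      # CaWaC
+     ("ca", 228.01, 3.570361212),     # Wikipedia (ca)
+     ("ca", 50.34, 2.142216727),      # Vilaweb
+ ]
+ 
+ totals = {}
+ for lang, words, epochs in corpora:
+     totals[lang] = totals.get(lang, 0.0) + words * epochs
+ 
+ grand_total = sum(totals.values())
+ for lang, effective in totals.items():
+     print(f"{lang}: {100 * effective / grand_total:.1f}%")
+ # Prints roughly en 16.6%, es 41.3%, ca 42.1%, close to the distribution above.
+ ```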
+ 
+ ### Training procedure
+ 
+ The training corpus was tokenized using a byte-level version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2) with a vocabulary size of 50,257 tokens.
+ After training a new tokenizer and adapting [falcon-7b](https://huggingface.co/tiiuae/falcon-7b)'s embedding layer, the model was
+ further pre-trained in the three target languages: Catalan, Spanish, and English.
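+ 
+ The exact tokenizer-training setup is not documented here. As a minimal sketch, a byte-level BPE tokenizer with the
+ same vocabulary size could be trained with the `tokenizers` library as follows; the corpus file paths are placeholders:
+ 
+ ```python
+ from tokenizers import ByteLevelBPETokenizer
+ 
+ # Placeholder paths: plain-text files sampled from the trilingual training corpus.
+ corpus_files = ["data/ca.txt", "data/es.txt", "data/en.txt"]
+ 
+ tokenizer = ByteLevelBPETokenizer()
+ tokenizer.train(
+     files=corpus_files,
+     vocab_size=50257,                  # matches the vocabulary size stated above
+     special_tokens=["<|endoftext|>"],  # the special token used by this model family
+ )
+ tokenizer.save_model("new_tokenizer")  # writes vocab.json and merges.txt
+ ```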
+ 
+ The training lasted a total of 320 hours on 8 NVIDIA H100 GPUs with 80GB of memory each.
+ 
+ 
+ ### Training hyperparameters
+ 
+ - seed: 42
+ - distributed_type: multi-GPU
+ - num_devices: 8
+ - train_batch_size: 1
+ - eval_batch_size: 1
+ - total_train_batch_size: 8
+ - total_eval_batch_size: 8
+ - optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
+ - learning_rate: 5e-05
+ - lr_scheduler_type: linear
+ - num_epochs: 1.0
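+ 
+ The model card does not state which training framework or launcher was used. Purely as an illustration of how the
+ values above line up, they could be expressed with `transformers.TrainingArguments` like this:
+ 
+ ```python
+ from transformers import TrainingArguments
+ 
+ # Illustrative mapping of the listed hyperparameters; with 8 devices and a
+ # per-device batch size of 1, the total train batch size is 8.
+ training_args = TrainingArguments(
+     output_dir="aguila-7b-continued-pretraining",  # placeholder
+     seed=42,
+     per_device_train_batch_size=1,
+     per_device_eval_batch_size=1,
+     learning_rate=5e-05,
+     lr_scheduler_type="linear",
+     num_train_epochs=1.0,
+     adam_beta1=0.9,
+     adam_beta2=0.999,
+     adam_epsilon=1e-08,
+     bf16=True,  # assumption: bfloat16 training on H100 GPUs
+ )
+ ```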
+ 
+ 
+ ### Framework versions
+ 
+ - PyTorch 2.0.0
+ - Transformers 4.30.2
+ - Datasets 2.13.1
+ - Tokenizers 0.13.3
+ 
+ ## Additional information
+ 
+ ### Author
+ The Language Technologies Unit at the Barcelona Supercomputing Center.
+ 
+ ### Contact
+ For further information, please send an email to <langtech@bsc.es>.
+ 
+ ### Copyright
+ Copyright (c) 2023 by the Language Technologies Unit, Barcelona Supercomputing Center.
+ 
+ ### License
+ [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
+ 
+ ### Funding
+ This work was funded by:
+ - The [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en)) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).
+ - The [Spanish State Secretariat for Digitalization and Artificial Intelligence](https://portal.mineco.gob.es/en-us/digitalizacionIA/Pages/sedia.aspx) within the framework of the [Plan de Impulso de las Tecnologías del Lenguaje](https://plantl.mineco.gob.es/Paginas/index.aspx).
+ 
+ ### Disclaimer
+ 
+ <details>
+ <summary>Click to expand</summary>
+ 
+ The model published in this repository is intended for general-purpose use and is made available to third parties under a permissive Apache License, Version 2.0.
+ 
+ Be aware that the model may have biases and/or other undesirable distortions.
+ 
+ When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it),
+ or become users of the model themselves, they should note that it is their responsibility to mitigate the risks arising from its use and,
+ in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
+ 
+ In no event shall the owner and creator of the model (Barcelona Supercomputing Center)
+ be liable for any results arising from the use made by third parties.
+ 
+ </details>
config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "bos_token": "<|endoftext|>",
+   "eos_token": "<|endoftext|>",
+   "layer_norm_epsilon": null,
+   "unk_token": "<|endoftext|>"
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "bos_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "<|endoftext|>",
+   "unk_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "add_bos_token": false,
+   "add_prefix_space": false,
+   "bos_token": {
+     "__type": "AddedToken",
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "clean_up_tokenization_spaces": true,
+   "eos_token": {
+     "__type": "AddedToken",
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "errors": "replace",
+   "model_max_length": 2048,
+   "pad_token": null,
+   "tokenizer_class": "GPT2Tokenizer",
+   "unk_token": {
+     "__type": "AddedToken",
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
vocabulary.json ADDED
The diff for this file is too large to render. See raw diff