gonzalez-agirre committed on
Commit
620f77d
1 Parent(s): 4f5e331

Update README.md

  1. README.md +214 -1
README.md CHANGED
@@ -1,3 +1,216 @@
1
  ---
2
- license: apache-2.0
3
  ---
1
  ---
2
+ language:
3
+ - en
4
+ - es
5
+ - ca
6
+ license:
7
+ - apache-2.0
8
+ tags:
9
+ - FLOR
10
+ - bloom
11
+ - spanish
12
+ - catalan
13
+ - english
14
+ pipeline_tag: text-generation
15
+ widget:
16
+ - text: |-
17
+ Respon a la pregunta següent.
18
+ Pregunta: "Quina és la capital de Suècia?"
19
+ Resposta: "La capital de Suècia és Estocolm."
20
+ ----
21
+ Respon a la pregunta següent.
22
+ Pregunta: "Quina beguda es consumeix als matins per despertar-se?"
23
+ Resposta: "La majoria de gent consumeix cafè per despertar-se."
24
+ ----
25
+ Respon a la pregunta següent.
26
+ Pregunta: "Explica com funciona un motor de combustió"
27
+ Resposta:
28
+ example_title: Pregunta-Resposta
29
+ - text: |-
30
+ Extrae las entidades nombradas del siguiente texto:
31
+ Texto: "Me llamo Wolfgang y vivo en Berlin"
32
+ Entidades: Wolfgang:PER, Berlin:LOC
33
+ ----
34
+ Extrae las entidades nombradas del siguiente texto:
35
+ Texto: "Hoy voy a visitar el parc güell tras salir del barcelona supercomputing center"
36
+ Entidades: parc güell:LOC, barcelona supercomputing center:LOC
37
+ ----
38
+ Extrae las entidades nombradas del siguiente texto:
39
+ Texto: "Maria y Miguel no tienen ningún problema contigo"
40
+ Entidades: Maria:PER, Miguel:PER
41
+ ----
42
+ Extrae las entidades nombradas del siguiente texto:
43
+ Texto: "Damián se cortó el pelo"
44
+ Entidades: Damián:PER
45
+ ----
46
+ Extrae las entidades nombradas del siguiente texto:
47
+ Texto: "Lo mejor de Barcelona es el bar de mi amigo Pablo"
48
+ Entidades: Pablo:PER, Barcelona:LOC
49
+ ----
50
+ Extrae las entidades nombradas del siguiente texto:
51
+ Texto: "Carlos comparte piso con Marc"
52
+ Entidades:
53
+ example_title: Entidades-Nombradas
54
  ---
55
+
56
+ # FLOR-6.3B
57
+
58
+ ## Table of Contents
59
+ <details>
60
+ <summary>Click to expand</summary>
61
+
62
+ - [Model description](#model-description)
63
+ - [Intended uses and limitations](#intended-uses-and-limitations)
64
+ - [How to use](#how-to-use)
65
+ - [Limitations and bias](#limitations-and-bias)
66
+ - [Training](#training)
67
+ - [Evaluation](#evaluation)
68
+ - [Additional information](#additional-information)
69
+
70
+ </details>
71
+
72
+ ## Model description
73
+
74
+ **FLOR-6.3B** is a 6.3B-parameter transformer-based causal language model for Catalan, Spanish, and English.
75
+ It is the result of a language adaptation technique performed on [BLOOM-7.1B](https://huggingface.co/bigscience/bloom-7b1),
76
+ which involves modifying the model's vocabulary and embedding layer and continuing pre-training on 140B tokens in the target languages.
77
+
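As a rough sanity check on the 7.1B to 6.3B reduction, the sketch below estimates how many parameters disappear when a smaller vocabulary replaces the original one. The function name `params_after_vocab_swap` is hypothetical, and every numeric value is an assumption used for illustration (in particular the size of the adapted vocabulary, which this card does not state). It also assumes the input embeddings are tied to the output projection, so only one vocabulary-sized matrix shrinks.

```python
def params_after_vocab_swap(total_params, hidden_size, old_vocab, new_vocab):
    """Estimate the parameter count after replacing the vocabulary.

    Assumes tied input/output embeddings, so a single
    (vocab_size x hidden_size) matrix changes size.
    """
    removed = (old_vocab - new_vocab) * hidden_size
    return total_params - removed


# All values below are assumptions for illustration, not figures from this card.
BLOOM_TOTAL = 7_069_016_064  # assumed parameter count of the source model
HIDDEN_SIZE = 4096           # assumed hidden dimension of the source model
BLOOM_VOCAB = 250_880        # assumed number of embedding rows in the source model
NEW_VOCAB = 50_257           # hypothetical size of the adapted vocabulary

estimate = params_after_vocab_swap(BLOOM_TOTAL, HIDDEN_SIZE, BLOOM_VOCAB, NEW_VOCAB)
print(f"{estimate / 1e9:.2f}B parameters")
```

Under these assumptions the estimate lands in the same range as the stated 6.3B, which is consistent with the embedding layer accounting for essentially all of the size difference.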
78
+ ## Intended uses and limitations
79
+
80
+ The **FLOR-6.3B** model is ready to use only for causal language modeling.
81
+ It can perform text-generation tasks and be fine-tuned for specific scenarios.
82
+
83
+ ## How to use
84
+ ```python
+ import torch
+ from transformers import pipeline, AutoTokenizer
+
+ input_text = "Sovint em trobo pensant en tot allò que"
+
+ model_id = "projecte-aina/FLOR-6.3B"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ # Build a text-generation pipeline; bfloat16 halves memory use and
+ # device_map="auto" places the weights on the available devices.
+ generator = pipeline(
+     "text-generation",
+     model=model_id,
+     tokenizer=tokenizer,
+     torch_dtype=torch.bfloat16,
+     trust_remote_code=True,
+     device_map="auto",
+ )
+
+ # Sample from the 10 most likely tokens at each step.
+ generation = generator(
+     input_text,
+     do_sample=True,
+     top_k=10,
+     eos_token_id=tokenizer.eos_token_id,
+ )
+
+ print(f"Result: {generation[0]['generated_text']}")
+ ```
109
+
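Because FLOR-6.3B is a base model, tasks are typically posed as few-shot prompts, in the style of the widget examples in the metadata (instruction, labeled input, labeled output, with `----` between examples). The helper below is a minimal sketch of how such a prompt could be assembled; `build_few_shot_prompt` and its parameters are hypothetical names, not part of any library.

```python
def build_few_shot_prompt(instruction, examples, query,
                          input_label="Pregunta", output_label="Resposta",
                          separator="----"):
    """Assemble a few-shot prompt in the style of the widget examples.

    examples: list of (input_text, output_text) pairs used as demonstrations.
    query: the new input; its output label is left open for the model to complete.
    """
    blocks = [
        f'{instruction}\n{input_label}: "{shot_in}"\n{output_label}: "{shot_out}"'
        for shot_in, shot_out in examples
    ]
    # The final block ends right after the output label.
    blocks.append(f'{instruction}\n{input_label}: "{query}"\n{output_label}:')
    return f"\n{separator}\n".join(blocks)


prompt = build_few_shot_prompt(
    "Respon a la pregunta següent.",
    [("Quina és la capital de Suècia?", "La capital de Suècia és Estocolm.")],
    "Explica com funciona un motor de combustió",
)
print(prompt)
```

The resulting string can be passed as `input_text` to the generator above; limiting the generation length or truncating at the first separator is left to the caller.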
110
+ ## Limitations and bias
111
+ At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model.
112
+ However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques
113
+ on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
114
+
115
+
116
+ ## Training
117
+
118
+ ### Language adaptation and training
119
+
120
+ The language adaptation technique used to create FLOR-6.3B requires the vocabulary of the source model
121
+ to be adapted before continuing its pre-training with data in the target languages. Specifically, we proceeded as follows:
122
+ 1) We trained our own BPE tokenizer for Catalan, Spanish, and English, and replaced the original BLOOM tokenizer and vocabulary with it. This shrank BLOOM's embedding layer and thereby reduced the model from 7.1B to 6.3B parameters.
123
+ 2) The embeddings corresponding to tokens that are present in both the original and the target vocabulary (matching tokens) were used for initialization.
124
+ 3) The embeddings from tokens not present in BLOOM's original vocabulary were initialized as the average of all embeddings.
125
+ 4) The model was initialized with the weights from BLOOM-7.1B, and with our adapted tokenizer (step 1) and embeddings (steps 2-3).
126
+ 5) The model was then trained on a corpus that contains a mixture of Catalan, Spanish, and English data.
127
+
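Steps 2 and 3 above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function name `init_adapted_embeddings` and the toy vocabulary are hypothetical, and "the average of all embeddings" is interpreted here as the mean over every row of the original embedding matrix, which is an assumption.

```python
def init_adapted_embeddings(old_vocab, old_embeddings, new_vocab):
    """Initialize embeddings for a new vocabulary from an existing matrix.

    old_vocab: dict mapping token -> row index in old_embeddings
    old_embeddings: list of equal-length float vectors
    new_vocab: list of tokens in the new tokenizer's order
    """
    dim = len(old_embeddings[0])
    # Mean vector over all original rows (assumed meaning of "average of all embeddings").
    mean = [sum(row[i] for row in old_embeddings) / len(old_embeddings)
            for i in range(dim)]

    new_embeddings = []
    for token in new_vocab:
        if token in old_vocab:
            # Matching token: reuse the original embedding (step 2).
            new_embeddings.append(list(old_embeddings[old_vocab[token]]))
        else:
            # Unseen token: fall back to the mean embedding (step 3).
            new_embeddings.append(list(mean))
    return new_embeddings


# Toy example with a 2-dimensional embedding space.
old_vocab = {"la": 0, "the": 1, "xyz": 2}
old_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
adapted = init_adapted_embeddings(old_vocab, old_embeddings, ["la", "però", "the"])
print(adapted)
```

In a real setting the same logic would operate on the model's embedding tensor, and the rest of the network's weights would be loaded unchanged, as described in step 4.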
128
+ ### Training data
129
+
130
+ The training corpus is composed of 140B tokens gathered from web crawls and public domain data.
131
+
132
+ TBC
133
+
134
+ ### Languages
135
+
136
+ The training data contains roughly equal amounts of Catalan, Spanish, and English text.
137
+ The table below shows the final language distribution:
138
+
139
+ |Language|Percentage|
140
+ |--------|----------|
141
+ | Catalan (CA) | 33.39% |
142
+ | Spanish (ES) | 33.32% |
143
+ | English (EN) | 33.29% |
144
+
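For a sense of scale, the percentages can be converted into approximate token counts. This small sketch assumes the percentages are measured in tokens over the full 140B-token corpus; the card does not say whether the unit is tokens, documents, or bytes, so the results are indicative only.

```python
TOTAL_TOKENS = 140_000_000_000  # corpus size stated in the Training data section

# Language shares from the table, in percent.
LANGUAGE_SHARE = {
    "Catalan (CA)": 33.39,
    "Spanish (ES)": 33.32,
    "English (EN)": 33.29,
}

# Assumption: shares are token-based, so tokens = total * share / 100.
tokens_per_language = {
    language: round(TOTAL_TOKENS * share / 100)
    for language, share in LANGUAGE_SHARE.items()
}

for language, tokens in tokens_per_language.items():
    print(f"{language}: ~{tokens / 1e9:.2f}B tokens")
```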
145
+ ### Training hyperparameters
146
+
147
+ TBC
148
+
149
+ ### Framework
150
+ Training was conducted on 16 Cerebras [CS-2 systems](https://www.cerebras.net/product-system/)
151
+ using the [cs-1.9.1](https://github.com/Cerebras/modelzoo/releases/tag/Release_1.9.1) release of their software.
152
+
153
+ ## Evaluation
154
+ FLOR-6.3B has been evaluated in a 5-shot setting, using EleutherAI's *LM Evaluation Harness*.
155
+ The evaluation benchmark includes tasks in Catalan, Spanish, and English, with particular emphasis on Catalan datasets.
156
+
157
+ The tasks were chosen to cover several evaluation areas in order to provide a comprehensive overview of the model's capabilities.
158
+ The baselines used for comparison are multilingual and English open-source 7B models, as well as smaller models of the FLOR family: **TBC**.
159
+
160
+ Our implementation of EleutherAI's *LM Evaluation Harness* can be found [here](https://github.com/langtech-bsc/lm-evaluation-harness/tree/FLOR-eval).
161
+
162
+ The following is a list of evaluation areas and their respective datasets:
163
+ - Reading Comprehension: [Belebele](https://huggingface.co/datasets/facebook/belebele)
164
+ - Question Answering: [XQuAD](https://huggingface.co/datasets/xquad), [CatalanQA](https://huggingface.co/datasets/projecte-aina/catalanqa), [CoQCat](https://huggingface.co/datasets/projecte-aina/CoQCat)
165
+ - Natural Language Inference: [XNLI](https://huggingface.co/datasets/xnli) and its translation to Catalan ([XNLI-ca](https://huggingface.co/datasets/projecte-aina/xnli-ca)), [TE-ca](https://huggingface.co/datasets/projecte-aina/teca)
166
+ - Paraphrase Identification: [PAWS-X](https://huggingface.co/datasets/paws-x) and its translation to Catalan ([PAWS-ca](https://huggingface.co/datasets/projecte-aina/PAWS-ca)), [Parafraseja](https://huggingface.co/datasets/projecte-aina/Parafraseja)
167
+ - Commonsense Reasoning: [COPA](https://people.ict.usc.edu/~gordon/copa.html) and its translation to Catalan ([COPA-ca](https://huggingface.co/datasets/projecte-aina/COPA-ca))
168
+ - Translation: [FLoRes](https://huggingface.co/datasets/flores)
169
+
170
+ ### Reading Comprehension and Question Answering
171
+
172
+ TBC
173
+
174
+ ### Natural Language Inference and Paraphrase Identification
175
+
176
+ TBC
177
+
178
+
179
+ ### Commonsense Reasoning and Translation
180
+
181
+ TBC
182
+
183
+ ## Additional information
184
+
185
+ ### Author
186
+ The Language Technologies Unit from Barcelona Supercomputing Center.
187
+
188
+ ### Contact
189
+ For further information, please send an email to <langtech@bsc.es>.
190
+
191
+ ### Copyright
192
+ Copyright (c) 2023 by Language Technologies Unit, Barcelona Supercomputing Center.
193
+
194
+ ### License
195
+ [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
196
+
197
+ ### Funding
198
+ This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en)) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).
199
+
200
+ ### Disclaimer
201
+
202
+ <details>
203
+ <summary>Click to expand</summary>
204
+
205
+ The model published in this repository is intended for general-purpose use and is available to third parties under a permissive Apache License, Version 2.0.
206
+
207
+ Be aware that the model may have biases and/or any other undesirable distortions.
208
+
209
+ When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it)
210
+ or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and,
211
+ in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
212
+
213
+ In no event shall the owner and creator of the model (Barcelona Supercomputing Center)
214
+ be liable for any results arising from the use made by third parties.
215
+
216
+ </details>