mariagrandury committed
Commit a6354e1
1 Parent(s): f5f761a

Update README.md

Files changed (1)
  1. README.md +7 -7
README.md CHANGED
@@ -13,7 +13,9 @@ datasets:
 library_name: transformers
 ---
 
-**LINCE ZERO** (Llm for Instructions from Natural Corpus en Español) is a state-of-the-art Spanish instruction language model. Developed by **[Clibrain](https://www.clibrain.com/)**, it is a causal decoder-only model with 7B parameters. LINCE-ZERO is based on **[Falcon-7B](https://huggingface.co/tiiuae/falcon-7b/blob/main/README.md)** and has been fine-tuned using an augmented combination of the **[Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca)** and **[Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k)** datasets, both translated into Spanish.
+# Model Card for LINCE-ZERO
+
+**LINCE ZERO** (Llm for Instructions from Natural Corpus en Español) is a state-of-the-art Spanish instruction language model. Developed by [Clibrain](https://www.clibrain.com/), it is a causal decoder-only model with 7B parameters. LINCE-ZERO is based on [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b) and has been fine-tuned using an augmented combination of the [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) and [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) datasets, both translated into Spanish.
 
 The model is released under the Apache 2.0 license.
 
@@ -21,7 +23,6 @@ The model is released under the Apache 2.0 license.
 <img src="https://huggingface.co/clibrain/lince-zero/resolve/main/LINCE-CLIBRAIN-HD.jpg" alt="lince logo"">
 </div>
 
-# Model Card for LINCE-ZERO
 
 # Table of Contents
 
@@ -59,14 +60,13 @@ The model is released under the Apache 2.0 license.
 
 ## Model Description
 
-LINCE-ZERO (Llm for Instructions from Natural Corpus en Español) is a state-of-the-art Spanish instruction language model. Developed by **[Clibrain](https://www.clibrain.com/)**, it is a causal decoder-only model with 7B parameters. LINCE-ZERO is based on **[Falcon-7B**](https://huggingface.co/tiiuae/falcon-7b/blob/main/README.md) and has been fine-tuned using an augmented combination of the **[Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca)** and **[Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k)** datasets, both translated into Spanish.
+LINCE-ZERO (Llm for Instructions from Natural Corpus en Español) is a state-of-the-art Spanish instruction language model. Developed by [Clibrain](https://www.clibrain.com/), it is a causal decoder-only model with 7B parameters. LINCE-ZERO is based on [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b) and has been fine-tuned using an augmented combination of the [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) and [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) datasets, both translated into Spanish.
 
 - **Developed by:** [Clibrain](https://www.clibrain.com/)
 - **Model type:** Language model, instruction model, causal decoder-only
 - **Language(s) (NLP):** es
 - **License:** apache-2.0
 - **Parent Model:** [https://huggingface.co/tiiuae/falcon-7b](https://huggingface.co/tiiuae/falcon-7b)
-- **Resources for more information:** Paper coming soon
 
 ## Model Sources
 
@@ -95,7 +95,7 @@ LINCE-ZERO has limitations associated with both the underlying language model an
 
 Since the model has been fine-tuned on translated versions of the Alpaca and Dolly datasets, it has potentially inherited certain limitations and biases:
 
-- Alpaca: The Alpaca dataset is generated by a language model (`text-davinci-003`) and inevitably contains some errors or biases inherent in that model. As the authors report, hallucination seems to be a common failure mode for Alpaca, even compared to text-davinci-003.
+- Alpaca: The Alpaca dataset is generated by a language model (`text-davinci-003`) and inevitably contains some errors or biases inherent in that model. As the authors report, hallucination seems to be a common failure mode for Alpaca, even compared to `text-davinci-003`.
 - Dolly: The Dolly dataset incorporates information from Wikipedia, which is a crowdsourced corpus. Therefore, the dataset's contents may reflect the biases, factual errors, and topical focus present in Wikipedia. Additionally, annotators involved in the dataset creation may not be native English speakers, and their demographics and subject matter may reflect the makeup of Databricks employees.
 
 ## Recommendations
@@ -108,7 +108,7 @@ If considering LINCE-ZERO for production use, it is crucial to thoroughly evalua
 
 ## Training Data
 
-LINCE-ZERO is based on **[Falcon-7B](https://huggingface.co/tiiuae/falcon-7b)** and has been fine-tuned using an augmented combination of the **[Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca)** and **[Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k)** datasets, both translated into Spanish.
+LINCE-ZERO is based on [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b) and has been fine-tuned using an augmented combination of the [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) and [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) datasets, both translated into Spanish.
 
 Alpaca is a 24.2 MB dataset of 52,002 instructions and demonstrations in English. It was generated by OpenAI's `text-davinci-003` engine using the data generation pipeline from the [Self-Instruct framework](https://github.com/yizhongw/self-instruct) with some modifications. For further details, refer to [Alpaca's Data Card](https://huggingface.co/datasets/tatsu-lab/alpaca).
 
@@ -122,7 +122,7 @@ For detailed information about the model architecture and compute infrastructure
 
 To prepare the training data, both the Alpaca and Dolly datasets, originally in English, were translated into Spanish using …
 
-The data was tokenized using LINCE-ZERO’s tokenizer, which is based on the Falcon-**[7B](https://huggingface.co/tiiuae/falcon-7b)**/**[40B](https://huggingface.co/tiiuae/falcon-40b)** tokenizer.
+The data was tokenized using LINCE-ZERO’s tokenizer, which is based on the Falcon-[7B](https://huggingface.co/tiiuae/falcon-7b)/[40B](https://huggingface.co/tiiuae/falcon-40b) tokenizer.
 
 ### Training Hyperparameters
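The card declares `library_name: transformers`, and the README belongs to the `clibrain/lince-zero` repository (the repo id appears in the logo URL). A minimal usage sketch under those assumptions follows; the prompt wording, dtype and device settings are illustrative and are not taken from the card.

```python
# Sketch: load LINCE-ZERO through the transformers API the card points to.
# Assumptions (not stated in this diff): repo id "clibrain/lince-zero",
# bfloat16 weights, and a plain Spanish instruction used as the prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "clibrain/lince-zero"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to keep the 7B model on one GPU
    device_map="auto",           # let accelerate place the layers
)
# Note: older transformers releases may need trust_remote_code=True for Falcon-based checkpoints.

prompt = "Explica en una frase qué es un modelo de lenguaje instruccional."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```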
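The Training Data section names the two source datasets but leaves the translation tooling elided ("translated into Spanish using …"). The sketch below shows how the English sources might be loaded and merged with the `datasets` library; the column mapping is an assumption, and the translation and augmentation steps are left out because the card does not specify them.

```python
# Sketch: fetch the two English source datasets named in the card and merge them
# into one instruction/context/response table. The Spanish translation and the
# "augmented" part of the pipeline are not described in this excerpt, so they are omitted.
from datasets import concatenate_datasets, load_dataset

alpaca = load_dataset("tatsu-lab/alpaca", split="train")                # 52,002 rows
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")  # ~15k rows

alpaca = alpaca.map(
    lambda ex: {"instruction": ex["instruction"], "context": ex["input"], "response": ex["output"]},
    remove_columns=alpaca.column_names,
)
dolly = dolly.map(
    lambda ex: {"instruction": ex["instruction"], "context": ex["context"], "response": ex["response"]},
    remove_columns=dolly.column_names,
)

combined = concatenate_datasets([alpaca, dolly])
print(combined)
```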
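The preprocessing note states that the data was tokenized with LINCE-ZERO’s Falcon-derived tokenizer. A short sketch of that step, assuming the tokenizer is published with the model weights and that instruction and response are simply concatenated (the card does not show the exact formatting):

```python
# Sketch: tokenize one translated instruction/response pair with the
# Falcon-style tokenizer assumed to ship in the clibrain/lince-zero repo.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("clibrain/lince-zero")

record = {
    "instruction": "Resume el siguiente texto en una frase.",
    "response": "El texto describe un modelo de lenguaje en español.",
}
encoded = tokenizer(
    record["instruction"] + "\n" + record["response"],
    truncation=True,
    max_length=512,
)
print(len(encoded["input_ids"]), encoded["input_ids"][:10])
```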