mrm8488 committed
Commit 76ca8d5
1 Parent(s): bd7f636

Update README.md

Files changed (1)
  1. README.md +4 -38

README.md CHANGED
@@ -40,14 +40,7 @@ The model is released under the Apache 2.0 license.
  - [Recommendations](#recommendations)
  - [Training Details](#training-details)
  - [Training Data](#training-data)
- - [Training Procedure](#training-procedure)
- - [Preprocessing](#preprocessing)
- - [Speeds, Sizes, Times](#speeds-sizes-times)
  - [Evaluation](#evaluation)
- - [Testing Data, Factors & Metrics](#testing-data-factors--metrics)
- - [Testing Data](#testing-data)
- - [Factors](#factors)
- - [Metrics](#metrics)
  - [Results](#results)
  - [Environmental Impact](#environmental-impact)
  - [Technical Specifications](#technical-specifications)
@@ -55,9 +48,9 @@ The model is released under the Apache 2.0 license.
  - [Compute Infrastructure](#compute-infrastructure)
  - [Hardware](#hardware)
  - [Software](#software)
+ - [How to Get Started with the Model](#how-to-get-started-with-the-model)
  - [Citation](#citation)
  - [Contact](#contact)
- - [How to Get Started with the Model](#how-to-get-started-with-the-model)

  # 🐯 Model Details

@@ -111,45 +104,18 @@ If considering LINCE-ZERO for production use, it is crucial to thoroughly evalua

  ## Training Data

- LINCE-ZERO is based on [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b) and has been fine-tuned using an augmented combination of the [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) and [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) datasets, both translated into Spanish.
+ LINCE-ZERO is based on [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b) and has been fine-tuned using an augmented combination of the [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) and [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) datasets, both translated into Spanish with the best quality.

  Alpaca is a 24.2 MB dataset of 52,002 instructions and demonstrations in English. It was generated by OpenAI's `text-davinci-003` engine using the data generation pipeline from the [Self-Instruct framework](https://github.com/yizhongw/self-instruct) with some modifications. For further details, refer to [Alpaca's Data Card](https://huggingface.co/datasets/tatsu-lab/alpaca).

  Dolly is a 13.1 MB dataset of 15,011 instruction-following records in American English. It was generated by thousands of Databricks employees, who were asked to provide reference texts copied from Wikipedia for specific categories. To learn more, consult [Dolly’s Data Card](https://huggingface.co/datasets/databricks/databricks-dolly-15k).

- After combining both translations, the dataset was augmented to a total of 80k examples.
+ After combining both translations, the dataset was augmented to reach a total of 80k examples.
-
- ## Training Procedure
-
- For detailed information about the model architecture and compute infrastructure, please refer to the Technical Specifications section.
-
- ### Preprocessing
-
- To prepare the training data, both the Alpaca and Dolly datasets, originally in English, were translated into Spanish using …
-
- The data was tokenized using LINCE-ZERO’s tokenizer, which is based on the Falcon-[7B](https://huggingface.co/tiiuae/falcon-7b)/[40B](https://huggingface.co/tiiuae/falcon-40b) tokenizer.
-
- ### Training Hyperparameters

- More information needed
-
- ### Speeds, Sizes, Times
-
- More information needed (throughput, start/end time, checkpoint size if relevant, etc.)


  # ✅ Evaluation

- ## Testing Data, Factors & Metrics
-
- ### Testing Data
-
- The model has been tested on X% of the augmented combination of Alpaca (24.2 MB) and Dolly (13.1 MB) translated into Spanish.
-
- ### Metrics
-
- Since LINCE-ZERO is an instruction model, the metrics used to evaluate it are:
-
- - X: <value>
+ This is WIP.

  ### Results

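For illustration, here is a minimal sketch of the combination step the Training Data hunk above describes. The Spanish translations used for LINCE-ZERO are not linked from this commit, so the sketch loads the original English datasets from the Hub as stand-ins, and the column alignment is an assumption for illustration, not the authors' exact pipeline.

```python
# Minimal sketch, not the authors' pipeline: load Alpaca and Dolly and
# concatenate them. The Spanish translations are not published in this
# commit, so the original English datasets stand in here.
from datasets import load_dataset, concatenate_datasets

alpaca = load_dataset("tatsu-lab/alpaca", split="train")                # 52,002 rows
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")  # 15,011 rows

# Assumed schema alignment: map Dolly's columns onto Alpaca's
# instruction/input/output layout and drop the extras.
dolly = dolly.rename_columns({"context": "input", "response": "output"})
dolly = dolly.remove_columns(["category"])
alpaca = alpaca.remove_columns(["text"])

combined = concatenate_datasets([alpaca, dolly])
print(len(combined))  # 67,013 examples before augmentation toward the 80k total
```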
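The removed Preprocessing text states that the data was tokenized with LINCE-ZERO’s tokenizer, which is based on the Falcon-7B/40B tokenizer. As a hedged sketch of that step, the snippet below loads the linked Falcon-7B tokenizer as a stand-in; the sample sentence is ours, not taken from the training data.

```python
# Hedged sketch of the tokenization step described in the removed
# Preprocessing section, using the Falcon-7B tokenizer it references.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

encoded = tokenizer("Escribe un poema sobre la primavera.")
print(encoded["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```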