Gerard-1705
committed
Update Readme.md
Update readme to fix some Spanish paragraphs
README.md
CHANGED
@@ -54,7 +54,7 @@ Future steps:
 - **Fine-tuned from model:** [bertin-project/bertin-roberta-base-spanish](https://huggingface.co/bertin-project/bertin-roberta-base-spanish)
 - **Dataset used:** [somosnlp/spa_climate_detection](https://huggingface.co/datasets/somosnlp/spa_climate_detection)

-###
+### Model resources:

 - **Repository:** [somosnlp/bertin_base_climate_detection_spa](https://huggingface.co/somosnlp/bertin_base_climate_detection_spa/tree/main) <!-- Link to the `main` of the repo where the scripts live, i.e. either the model repo itself on Hugging Face or GitHub. -->
 - **Demo:** [Identification of texts about climate change and sustainability](https://huggingface.co/spaces/somosnlp/Identificacion_de_textos_sobre_sustentabilidad_cambio_climatico)
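The resources listed above include the training dataset; as a quick way to inspect it, here is a minimal sketch using the `datasets` library (the `train` split name is an assumption, not taken from the dataset card):

```python
# Minimal sketch: load and inspect the dataset used for fine-tuning.
# The "train" split name is assumed; check the dataset card for the real splits.
from datasets import load_dataset

ds = load_dataset("somosnlp/spa_climate_detection")
print(ds)              # available splits and their sizes
print(ds["train"][0])  # one example: text plus its label
```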
@@ -75,14 +75,14 @@ Future steps:
 - The use for text classification of unverifiable or unreliable sources and their dissemination, e.g., fake news or disinformation.

 ## Bias, Risks, and Limitations
-
--
--
--
+No specific studies of biases and limitations have been carried out at this point; however, based on previous experience and tests of the model, we note the following points:
+- The model inherits the biases and limitations of the base model it was fine-tuned from; for more details see [BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6403). However, these are less apparent here because of the kind of task the model performs (text classification).
+- Direct biases, such as the predominance of formal, high-register language in the dataset (most texts were extracted from news articles and corporate legal documentation), which can make it harder to identify texts written in informal or colloquial language. To mitigate these biases, diverse opinions on climate change drawn from sources such as social networks were included in the dataset, and the labels were rebalanced.
+- The dataset also brings other limitations: the model loses performance on short texts, because most of the texts in the dataset are relatively long (200-500 words). Again, we tried to mitigate this by including short texts.

 ### Recommendations

--
+- As mentioned above, the model tends to perform worse on short texts, so it is advisable to establish a selection criterion that favors longer texts when the subject matter needs to be identified.

 ## How to Get Started with the Model

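For the "How to Get Started with the Model" section referenced above, a minimal usage sketch with the `transformers` text-classification pipeline; the example sentences are illustrative only, and the label names in the output come from the model's own config rather than being assumed here:

```python
# Minimal usage sketch for somosnlp/bertin_base_climate_detection_spa.
# The generic text-classification pipeline loads the tokenizer and weights
# from the Hub; label names and scores come from the model's own config.
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="somosnlp/bertin_base_climate_detection_spa",
)

texts = [
    "La empresa redujo sus emisiones de CO2 un 20% respecto al año anterior.",
    "El partido terminó con una victoria por tres goles a uno.",
]
for text, pred in zip(texts, clf(texts)):
    print(text, "->", pred)  # e.g. {'label': ..., 'score': ...}
```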
@@ -158,8 +158,8 @@ The following hyperparameters were used during training:
 - num_epochs: 2

 #### Speeds, Sizes, Times
-
-
+The model was trained for 2 epochs, with a total training time of 14.22 minutes ('train_runtime': 853.6759 seconds).
+Additional information: no mixed precision (FP16 or BF16) was used.


 #### Training results:
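The two added lines above pin down the epoch count and the absence of mixed precision; here is a hedged sketch of how those two facts map onto `TrainingArguments` (every other value is a placeholder, not the authors' actual setting):

```python
# Sketch only: num_train_epochs and the disabled mixed precision come from the
# model card; output_dir is a placeholder, and all other arguments are left at
# their defaults rather than reflecting the authors' real configuration.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bertin_base_climate_detection_spa",  # placeholder path
    num_train_epochs=2,  # "num_epochs: 2" in the hyperparameter list
    fp16=False,          # no FP16 mixed precision was used
    bf16=False,          # no BF16 mixed precision was used
)

# Reported 'train_runtime': 853.6759 s, i.e. roughly the 14.22 minutes quoted above.
print(f"{853.6759 / 60:.1f} minutes")  # -> 14.2 minutes
```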
@@ -211,12 +211,12 @@ Recall 0.99
 F1 score 0.951

 ## Environmental Impact
-
-- **
-- **
-- **
-- **
-- **
+Using the tool [ML CO2 IMPACT](https://mlco2.github.io/impact/#co2eq), we estimate the following environmental impact of training:
+- **Type of hardware:** T4
+- **Total hours for iterations and tests:** 4 hours
+- **Cloud provider:** Google Cloud (Colab)
+- **Computational region:** us-east
+- **Carbon footprint:** 0.1 kg CO2


 ## Technical Specifications
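To make the carbon figure above concrete, here is a back-of-the-envelope sketch of the kind of calculation ML CO2 IMPACT performs (energy used times grid carbon intensity); the 70 W figure is the published TDP of an NVIDIA T4, while the 0.37 kg CO2/kWh intensity for a US-east region is an assumed approximate value, so the result is only an order-of-magnitude check:

```python
# Back-of-the-envelope check of the ~0.1 kg CO2 estimate above.
# 70 W is the published TDP of an NVIDIA T4; the grid carbon intensity is an
# assumed approximate value for a US-east region, not taken from the model card.
gpu_power_kw = 70 / 1000   # NVIDIA T4 TDP in kW
hours = 4                  # total hours for iterations and tests
carbon_intensity = 0.37    # kg CO2 per kWh (assumed, region-dependent)

energy_kwh = gpu_power_kw * hours             # 0.28 kWh
footprint_kg = energy_kwh * carbon_intensity  # ~0.10 kg CO2
print(f"{energy_kwh:.2f} kWh -> {footprint_kg:.2f} kg CO2")
```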