usmiva committed on
Commit f8606ec
1 Parent(s): c771548

Update README.md

Files changed (1):
  1. README.md (+17 -4)

README.md CHANGED
@@ -76,6 +76,8 @@ Users (both direct and downstream) should be made aware of the risks, biases and

Use the code below to get started with the model.

```python
from transformers import pipeline, set_seed
gpt_web_bg = pipeline('text-generation', model='/usmiva/gpt_web_bg', max_length=50, num_beams=3, temperature=0.8)
@@ -88,14 +90,24 @@ gpt_web_bg("По професия той е ")
```
[{'generated_text': 'По професия той е строителен работник, който е �'}]

## Training Details

### Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

- [More Information Needed]

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
@@ -169,11 +181,11 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]

### Model Architecture and Objective

- [More Information Needed]

### Compute Infrastructure

- [More Information Needed]

#### Hardware

@@ -186,10 +198,11 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
## Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

- [More Information Needed]

**APA:**
 
Use the code below to get started with the model.

Using pipeline:

```python
from transformers import pipeline, set_seed
gpt_web_bg = pipeline('text-generation', model='/usmiva/gpt_web_bg', max_length=50, num_beams=3, temperature=0.8)
gpt_web_bg("По професия той е ")
```
[{'generated_text': 'По професия той е строителен работник, който е �'}]

## Training Details

### Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
The process of creating a diverse, bias-proof, and ethically fair dataset requires a meticulous and effective approach to cleaning the raw text data extracted from the internet. To address this challenge, we propose a specialized, multi-step procedure organized into the following stages:

- **Deduplication:** Duplicate text sequences, often caused by web scraping, are removed from the dataset, ensuring that each entry contributes unique information to the training data.
- **Topic classification:** To guarantee diverse subject matter and reduce the risk of topic bias, topic classification is employed to categorize text entries based on their content.
- **Sentiment classification:** Categorizing entries by sentiment further enhances dataset diversity, enabling models to better interpret and handle the inherent emotional aspects of human language.
- **Hate-speech detection:** Automatic detection methods for Bulgarian are used to exclude content promoting hate speech from the dataset.
- **Balancing topics and sentiment in the data:** Emphasis is placed on ensuring an adequate balance between topic and sentiment classes, as an imbalanced dataset can lead to biased results. By carefully redistributing instances across topics and sentiment categories, a more representative and inclusive dataset can be assembled, resulting in more robust and adaptable models.
- **Cleaning abusive content:** To further refine the dataset, abusive content, including profanities, vulgar language, and other offensive expressions, is removed from the text using algorithms for abusive-language detection.
- **Minimum sentence threshold:** To ensure that the dataset includes meaningful and coherent text instances, a minimum sentence threshold is imposed, requiring that each entry contains at least five sentences. This condition ensures that models are trained on richer linguistic contexts and promotes more accurate and nuanced text generation.
- **Cleaning non-Bulgarian content:** Text that is not in Bulgarian is removed.
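
Several of the stages above can be sketched in a few lines of Python. The function names, the naive regex sentence splitter, the Cyrillic-ratio language check, and the downsampling strategy below are illustrative assumptions, not the project's actual implementation:

```python
import random
import re


def deduplicate(texts):
    """Drop exact duplicate documents while preserving order."""
    seen = set()
    unique = []
    for text in texts:
        key = text.strip()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


def has_min_sentences(text, threshold=5):
    """Naive sentence split on ., !, ? - a real pipeline would use a proper tokenizer."""
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    return len(sentences) >= threshold


def is_mostly_bulgarian(text, ratio=0.5):
    """Crude language check: the share of Cyrillic letters among all letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    cyrillic = sum(1 for ch in letters if '\u0400' <= ch <= '\u04FF')
    return cyrillic / len(letters) >= ratio


def balance_classes(labeled_texts, seed=42):
    """Downsample each (topic or sentiment) class to the size of the smallest one."""
    by_label = {}
    for text, label in labeled_texts:
        by_label.setdefault(label, []).append(text)
    n = min(len(items) for items in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label, items in by_label.items():
        for text in rng.sample(items, n):
            balanced.append((text, label))
    return balanced


def clean_corpus(texts):
    """Deduplicate, then keep only long-enough, mostly-Bulgarian documents."""
    texts = deduplicate(texts)
    return [t for t in texts if has_min_sentences(t) and is_mostly_bulgarian(t)]
```

The topic, sentiment, hate-speech, and abusive-language stages would each plug in a trained classifier at the corresponding filter step; they are omitted here because they depend on external models.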
### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
 
### Model Architecture and Objective

The model is a BERT model trained from scratch with the masked language modelling objective on a Bulgarian dataset containing data from open-source online media.
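
As a rough illustration of the masked language modelling objective (a hypothetical sketch, not the actual training code): a fraction of input tokens is hidden and the model is optimized to recover them. Real BERT pretraining uses subword tokens and the 80/10/10 mask/random/keep scheme; this sketch shows only the masking idea.

```python
import random

MASK_TOKEN = "[MASK]"
MASK_RATE = 0.15  # fraction of tokens hidden, as in BERT pretraining


def mask_tokens(tokens, seed=0):
    """Hide ~15% of tokens; labels mark what the model must predict back."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < MASK_RATE:
            inputs.append(MASK_TOKEN)
            labels.append(tok)    # scored position: predict the original token
        else:
            inputs.append(tok)
            labels.append(None)   # unscored position: loss is not computed here
    return inputs, labels
```

During training the model sees `inputs` and is optimized to predict each non-`None` entry of `labels` at the corresponding position.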

### Compute Infrastructure

1 NVIDIA V100 GPU

#### Hardware
 
## Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

Transformer-Based Language Models for Bulgarian, Marinova et al. - TBA

**BibTeX:**

TBA

**APA:**