usmiva committed on
Commit 5962cf0
1 Parent(s): f8606ec

Update README.md

Files changed (1):
  1. README.md +20 -11
README.md CHANGED
@@ -11,7 +11,7 @@ pipeline_tag: text-generation
 
 This model is pre-trained with the causal language modelling objective on a private dataset with web scraped content created at the Bulgarian Academy of Sciences under the [ClaDa-BG Project](https://clada-bg.eu/en/).
 
-The dataset is cleaned and balanced with a specialized procedure to avoid cultural, political, racial and other biases. The procedure is described in the paper dedicated to this model- coming soon!
 
 ## Model Details
 
@@ -98,23 +98,32 @@ gpt_web_bg("По професия той е ")
 
 <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
-The process of creating a diverse, bias-proof, and ethically fair dataset requires a meticulous and effective approach to clean the raw text data extracted from the internet. To address this challenge, we propose a specialized, multi-step procedure organized into the following stages:
 
-Deduplication- Duplicate text sequences, often caused by web scraping, are removed from the dataset, thus ensuring that each entry contributes unique information to the training data.
-Topic Classification- To guarantee diverse subject matter and reduce the risk of topic bias, topic classification is employed to categorize text entries based on their content.
-Sentiment Classification- By categorizing entries with sentiment, the dataset diversity is further enhanced, enabling models to better interpret and handle the inherent emotional aspects of human language.
-Hate-Speech Detection- To exclude content promoting hate speech from the dataset, automatic detection methods for Bulgarian are utilized.
-Balancing Topics and Sentiment in the Data- The emphasis is placed on ensuring an adequate balance between topics and sentiment classes, as an imbalanced dataset can lead to biased results. By carefully redistributing instances across topics and sentiment categories, a more representative and inclusive dataset can be assembled, resulting in more robust and adaptable models.
-Cleaning Abusive Content- To further refine the dataset, abusive content, including profanities, vulgar language, and other offensive expressions were cleaned from the text utilizing algorithms for abusive language detection.
-Minimum Sentence Threshold- To ensure that the dataset includes meaningful and coherent text instances, a minimum sentence threshold is imposed, requiring that each entry contains at least five sentences. This condition ensures that models are trained on richer linguistic contexts and promotes more accurate and nuanced text generation.
-Cleaning non Bulgarian content
 ### Training Procedure
 
 <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
 
 #### Preprocessing [optional]
 
-[More Information Needed]
 
 #### Training Hyperparameters
 
 
 This model is pre-trained with the causal language modelling objective on a private dataset with web scraped content created at the Bulgarian Academy of Sciences under the [ClaDa-BG Project](https://clada-bg.eu/en/).
 
+The dataset is cleaned and balanced with a specialized procedure to avoid gender, cultural, political, racial and other biases. The procedure is described in the paper dedicated to this model (coming soon!).
 
 ## Model Details
 
 
 <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
  ### Training Procedure
 
 <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
 
+
+
  #### Preprocessing [optional]
 
+ The process of creating a diverse, bias-free, and ethically fair dataset requires a meticulous and effective approach to cleaning the raw text data extracted from the internet. To address this challenge, we propose a specialized, multi-step procedure organized into the following stages:
+
+ #### Deduplication
+ Duplicate text sequences, often caused by web scraping, are removed from the dataset, thus ensuring that each entry contributes unique information to the training data.
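A minimal sketch of this step, assuming each entry is a plain string and using exact content hashing after whitespace normalization (near-duplicate detection, e.g. MinHash, would need more machinery and is not shown):

```python
import hashlib

def deduplicate(docs):
    """Keep the first occurrence of each document. Hashing a
    whitespace-normalized copy means trivial spacing differences
    do not defeat the duplicate check."""
    seen = set()
    unique = []
    for doc in docs:
        normalized = " ".join(doc.split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```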
+ #### Topic Classification
+ To guarantee diverse subject matter and reduce the risk of topic bias, topic classification is employed to categorize text entries based on their content.
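The classifier used for this step is not named here, so the sketch below substitutes a trivial keyword lookup (the keyword lists are illustrative, not from the actual pipeline) just to show the labelling interface; the real procedure would use a trained Bulgarian topic classifier:

```python
# Illustrative stand-in for a trained topic classifier.
TOPIC_KEYWORDS = {
    "sports": {"мач", "отбор", "гол"},       # match, team, goal
    "politics": {"избори", "парламент"},     # elections, parliament
}

def classify_topic(text):
    """Assign the first topic whose keyword set overlaps the text."""
    words = set(text.lower().split())
    for topic, keywords in TOPIC_KEYWORDS.items():
        if words & keywords:
            return topic
    return "other"
```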
+ #### Sentiment Classification
+ By categorizing entries with sentiment, the dataset diversity is further enhanced, enabling models to better interpret and handle the inherent emotional aspects of human language.
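For the balancing step later on, each entry needs to carry both labels; a small annotation helper makes that bookkeeping explicit (the `topic_fn` and `sentiment_fn` arguments are placeholders for the actual classifiers, which are not specified here):

```python
def annotate(entries, topic_fn, sentiment_fn):
    """Attach topic and sentiment labels to each raw text entry."""
    return [
        {"text": text, "topic": topic_fn(text), "sentiment": sentiment_fn(text)}
        for text in entries
    ]
```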
+ #### Hate-Speech Detection
+ To exclude content promoting hate speech from the dataset, automatic detection methods for Bulgarian are utilized.
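As a rough sketch of the filtering interface only: the placeholder blocklist below is an assumption for illustration — the actual Bulgarian hate-speech detection would be a trained classifier, not a word list:

```python
import re

# Placeholder terms; a real detector for Bulgarian would be a trained model.
HATE_TERMS = {"hateterm1", "hateterm2"}

def contains_hate_speech(text):
    """Flag an entry if any token matches the blocklist."""
    tokens = re.findall(r"\w+", text.lower())
    return any(tok in HATE_TERMS for tok in tokens)

def drop_flagged(entries):
    """Exclude flagged entries from the dataset."""
    return [e for e in entries if not contains_hate_speech(e)]
```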
+ #### Balancing Topics and Sentiment in the Data
+ The emphasis is placed on ensuring an adequate balance between topics and sentiment classes, as an imbalanced dataset can lead to biased results. By carefully redistributing instances across topics and sentiment categories, a more representative and inclusive dataset can be assembled, resulting in more robust and adaptable models.
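One simple way to realize such a redistribution (an assumption here — the paper may describe a different scheme) is to downsample every (topic, sentiment) cell to the size of the smallest one, with a fixed seed for reproducibility:

```python
import random
from collections import defaultdict

def balance(entries, seed=0):
    """Downsample so every (topic, sentiment) pair contributes the
    same number of entries as the smallest pair."""
    groups = defaultdict(list)
    for e in entries:
        groups[(e["topic"], e["sentiment"])].append(e)
    target = min(len(g) for g in groups.values())
    rng = random.Random(seed)
    balanced = []
    for g in groups.values():
        balanced.extend(rng.sample(g, target))
    return balanced
```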
+ #### Cleaning Abusive Content
+ To further refine the dataset, abusive content, including profanities, vulgar language, and other offensive expressions, was cleaned from the text using algorithms for abusive language detection.
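The detection algorithms themselves are not specified; as an illustration only, masking matches from a (hypothetical) profanity list could look like:

```python
import re

# Illustrative placeholder list, not the actual detection algorithm.
PROFANITIES = ["badword"]

def clean_abusive(text):
    """Mask listed profanities, case-insensitively."""
    for word in PROFANITIES:
        text = re.sub(re.escape(word), "***", text, flags=re.IGNORECASE)
    return text
```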
+ #### Minimum Sentence Threshold
+ To ensure that the dataset includes meaningful and coherent text instances, a minimum sentence threshold is imposed, requiring that each entry contains at least five sentences. This condition ensures that models are trained on richer linguistic contexts and promotes more accurate and nuanced text generation.
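A minimal sketch of this filter, using sentence-final punctuation as a rough sentence counter (a real pipeline might use a proper sentence segmenter, which handles abbreviations and similar edge cases):

```python
import re

def meets_threshold(text, min_sentences=5):
    """Keep only entries with at least `min_sentences` sentences,
    counting runs of sentence-final punctuation as boundaries."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(sentences) >= min_sentences
```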
+ #### Cleaning non-Bulgarian content
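This step is listed without detail; one common heuristic (an assumption here, not necessarily what was used) is to keep entries whose alphabetic characters are mostly Cyrillic:

```python
def is_mostly_bulgarian(text, threshold=0.5):
    """Heuristic: treat an entry as Bulgarian if at least `threshold`
    of its alphabetic characters fall in the Cyrillic block. Note this
    does not distinguish Bulgarian from other Cyrillic-script languages;
    a language-identification model would."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    cyrillic = sum(1 for c in letters if "\u0400" <= c <= "\u04FF")
    return cyrillic / len(letters) >= threshold
```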
 
 
  #### Training Hyperparameters