iamdenay/roberta-azerbaijani

@@ -6,9 +6,25 @@ language:
 library_name: transformers
 ---
-Roberta base model trained on Azerbaijani subset of OSCAR corpus.
 ## Usage
 ```python
 from transformers import AutoTokenizer, AutoModelWithLMHead
@@ -23,59 +39,28 @@ model_mask = pipeline('fill-mask', model='iamdenay/roberta-azerbaijani')
 model_mask("Le tweet <mask>.")
 ```
-## Examples
 ```python
-fill_mask("azərtac xəbər <mask> ki")
-```
-```
 [{'sequence': 'azərtac xəbər verir ki',
-  'score': 0.9791690707206726,
   'token': 1053,
-  'token_str': ' verir'},
  {'sequence': 'azərtac xəbər verib ki',
-  'score': 0.004408467561006546,
   'token': 2313,
-  'token_str': ' verib'},
- {'sequence': 'azərtac xəbər yayıb ki',
-  'score': 0.00216124439612031,
-  'token': 6580,
-  'token_str': ' yayıb'},
- {'sequence': 'azərtac xəbər agentliyi ki',
-  'score': 0.0014381826622411609,
-  'token': 14711,
-  'token_str': ' agentliyi'},
- {'sequence': 'azərtac xəbəraz ki',
-  'score': 0.0012858203845098615,
-  'token': 320,
-  'token_str': 'az'}]
 ```
-```python
-fill_mask("Mənə o yumşaq fransız bulkalarından <mask> çox ver")
-```
-```
-[{'sequence': 'Mənə o yumşaq fransız bulkalarından daha çox ver',
-  'score': 0.5982716083526611,
-  'token': 716,
-  'token_str': ' daha'},
- {'sequence': 'Mənə o yumşaq fransız bulkalarından bir çox ver',
-  'score': 0.1061108186841011,
-  'token': 374,
-  'token_str': ' bir'},
- {'sequence': 'Mənə o yumşaq fransız bulkalarından biri çox ver',
-  'score': 0.05577299743890762,
-  'token': 1331,
-  'token_str': ' biri'},
- {'sequence': 'Mənə o yumşaq fransız bulkalarından ən çox ver',
-  'score': 0.029407601803541183,
-  'token': 745,
-  'token_str': ' ən'},
- {'sequence': 'Mənə o yumşaq fransız bulkalarından çox çox ver',
-  'score': 0.011952652595937252,
-  'token': 524,
-  'token_str': ' çox'}]
-```
 ## Config
 ```json

 library_name: transformers
 ---
+RoBERTa base model trained on the Azerbaijani subset of the OSCAR corpus as part of [research](https://peerj.com/articles/cs-1974/) on applying text augmentation to low-resource languages.
+It was developed to improve text classification in Azerbaijani, a low-resource language in NLP, and was further fine-tuned on a labeled news dataset.
+## Training Data
+The model was pre-trained on the Azerbaijani subset of the OSCAR corpus, and fine-tuned on approximately 3 million sentences from Azertag News Agency covering diverse topics such as politics, economy, culture, sports, technology, and health.
+## Citation
+```bibtex
+@article{ziyaden2024augmentation,
+	title        = {Text data augmentation and pre-trained Language Model for enhancing text classification of low-resource languages},
+	author       = {Ziyaden, Atabay and Yelenov, Amir and Hajiyev, Fuad and Rustamov, Samir and Pak, Alexandr},
+	year         = 2024,
+	journal      = {PeerJ Computer Science},
+	doi          = {10.7717/peerj-cs.1974},
+	url          = {https://doi.org/10.7717/peerj-cs.1974}
+}
+```
 ## Usage
 ```python
 from transformers import AutoTokenizer, AutoModelWithLMHead
 model_mask("Le tweet <mask>.")
 ```
+## Output
 ```python
 [{'sequence': 'azərtac xəbər verir ki',
+  'score': 0.9791,
   'token': 1053,
+  'token_str': 'verir'},
  {'sequence': 'azərtac xəbər verib ki',
+  'score': 0.0044,
   'token': 2313,
+  'token_str': 'verib'},
+ ... ]
 ```
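The shortened output above differs from the raw `fill-mask` result in the earlier version of the card: only the first two predictions are kept, scores are cut to four decimals, and the leading space in `token_str` is removed. A minimal sketch of that post-processing follows, assuming the raw predictions are a list of dicts as shown in the old example. `shorten_predictions` is a hypothetical helper, not part of `transformers`, and truncation (rather than rounding) is inferred from `0.9791690707206726` appearing as `0.9791`.

```python
import math


def shorten_predictions(predictions, top_k=2, decimals=4):
    """Keep the first `top_k` predictions, truncate scores, strip token text.

    Hypothetical helper for illustration; not part of the transformers API.
    """
    factor = 10 ** decimals
    shortened = []
    for pred in predictions[:top_k]:
        shortened.append({
            "sequence": pred["sequence"],
            # Truncate instead of rounding, so 0.97916... becomes 0.9791.
            "score": math.floor(pred["score"] * factor) / factor,
            "token": pred["token"],
            # Raw token strings carry a leading space; drop it for display.
            "token_str": pred["token_str"].strip(),
        })
    return shortened


# Raw predictions copied from the earlier version of the model card.
raw = [
    {"sequence": "azərtac xəbər verir ki", "score": 0.9791690707206726,
     "token": 1053, "token_str": " verir"},
    {"sequence": "azərtac xəbər verib ki", "score": 0.004408467561006546,
     "token": 2313, "token_str": " verib"},
    {"sequence": "azərtac xəbər yayıb ki", "score": 0.00216124439612031,
     "token": 6580, "token_str": " yayıb"},
]

for entry in shorten_predictions(raw):
    print(entry)
```

Keeping the helper separate from the pipeline call means the full-precision scores remain available for any downstream thresholding, while the card shows only the compact form.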
+## Limitations
+- Language Specificity: The model is trained exclusively on Azerbaijani and may not generalize well to other languages.
+- Data Bias: The fine-tuning data is sourced from news articles, which may contain biases or specific journalistic styles.
+- Agglutinative Language Challenges: Azerbaijani's agglutinative nature can lead to sparsity in the word space due to numerous morphological variations.
+## Ethical Considerations
+- Content Sensitivity: The dataset may include sensitive topics. Users should ensure compliance with ethical standards when deploying the model.
+- Bias and Fairness: Be aware of potential biases in the training data that could affect model predictions.
 ## Config
 ```json