iamdenay committed on
Commit
bcf51c5
1 Parent(s): 2be973e

Update README.md

Files changed (1)
  1. README.md +32 -47
README.md CHANGED
@@ -6,9 +6,25 @@ language:
library_name: transformers
---

- Roberta base model trained on Azerbaijani subset of OSCAR corpus.
+ RoBERTa base model trained on the Azerbaijani subset of the OSCAR corpus, as part of [research](https://peerj.com/articles/cs-1974/) on the application of text augmentation to low-resource languages.
+ It was developed to enhance text classification for Azerbaijani, a low-resource language in the NLP domain. The model was trained on the Azerbaijani subset of the OSCAR corpus and further fine-tuned on a labeled news dataset.


+ ## Training Data
+ The model was pre-trained on the Azerbaijani subset of the OSCAR corpus and fine-tuned on approximately 3 million sentences from the Azertag News Agency, covering diverse topics such as politics, economy, culture, sports, technology, and health.
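+ 
+ For reference, the pre-training corpus can be inspected with the `datasets` library. A minimal sketch, assuming the standard OSCAR config name for Azerbaijani on the Hub (`unshuffled_deduplicated_az`); adjust to the config actually used:
+ ```python
+ from datasets import load_dataset
+ 
+ # Azerbaijani portion of OSCAR; the config name is an assumption,
+ # not something this card specifies.
+ oscar_az = load_dataset("oscar", "unshuffled_deduplicated_az", split="train")
+ print(oscar_az[0]["text"][:200])  # peek at one raw document
+ ```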
+ 
+ ## Citation
+ ```bibtex
+ @article{ziyaden2024augmentation,
+   title = {Text data augmentation and pre-trained Language Model for enhancing text classification of low-resource languages},
+   author = {Ziyaden, Atabay and Yelenov, Amir and Hajiyev, Fuad and Rustamov, Samir and Pak, Alexandr},
+   year = 2024,
+   journal = {PeerJ Computer Science},
+   doi = {10.7717/peerj-cs.1974},
+   url = {https://doi.org/10.7717/peerj-cs.1974}
+ }
+ ```
+ 

## Usage
```python
from transformers import AutoTokenizer, AutoModelWithLMHead
@@ -23,59 +39,28 @@ model_mask = pipeline('fill-mask', model='iamdenay/roberta-azerbaijani')
model_mask("Le tweet <mask>.")
```
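+ 
+ Note that `AutoModelWithLMHead` is deprecated in recent `transformers` releases. A minimal sketch using the current masked-LM API, with the prompt from the Output section below (the `top_k` argument is assumed to be available in your `transformers` version):
+ ```python
+ from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline
+ 
+ # AutoModelForMaskedLM is the current replacement for AutoModelWithLMHead.
+ tokenizer = AutoTokenizer.from_pretrained("iamdenay/roberta-azerbaijani")
+ model = AutoModelForMaskedLM.from_pretrained("iamdenay/roberta-azerbaijani")
+ 
+ fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer, top_k=5)
+ for p in fill_mask("azərtac xəbər <mask> ki"):
+     print(p["sequence"], round(p["score"], 4))
+ ```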

- ## Examples
+ ## Output
```python
- 
- fill_mask("azərtac xəbər <mask> ki")
- ```
- ```
[{'sequence': 'azərtac xəbər verir ki',
- 'score': 0.9791690707206726,
+ 'score': 0.9791,
'token': 1053,
- 'token_str': ' verir'},
+ 'token_str': 'verir'},
{'sequence': 'azərtac xəbər verib ki',
- 'score': 0.004408467561006546,
+ 'score': 0.0044,
'token': 2313,
- 'token_str': ' verib'},
+ 'token_str': 'verib'},
- {'sequence': 'azərtac xəbər yayıb ki',
- 'score': 0.00216124439612031,
- 'token': 6580,
- 'token_str': ' yayıb'},
- {'sequence': 'azərtac xəbər agentliyi ki',
- 'score': 0.0014381826622411609,
- 'token': 14711,
- 'token_str': ' agentliyi'},
- {'sequence': 'azərtac xəbəraz ki',
- 'score': 0.0012858203845098615,
- 'token': 320,
- 'token_str': 'az'}]
+ ... ]
```

- ```python
- fill_mask("Mənə o yumşaq fransız bulkalarından <mask> çox ver")
- ```
- ```
- [{'sequence': 'Mənə o yumşaq fransız bulkalarından daha çox ver',
- 'score': 0.5982716083526611,
- 'token': 716,
- 'token_str': ' daha'},
- {'sequence': 'Mənə o yumşaq fransız bulkalarından bir çox ver',
- 'score': 0.1061108186841011,
- 'token': 374,
- 'token_str': ' bir'},
- {'sequence': 'Mənə o yumşaq fransız bulkalarından biri çox ver',
- 'score': 0.05577299743890762,
- 'token': 1331,
- 'token_str': ' biri'},
- {'sequence': 'Mənə o yumşaq fransız bulkalarından ən çox ver',
- 'score': 0.029407601803541183,
- 'token': 745,
- 'token_str': ' ən'},
- {'sequence': 'Mənə o yumşaq fransız bulkalarından çox çox ver',
- 'score': 0.011952652595937252,
- 'token': 524,
- 'token_str': ' çox'}]
- ```
+ ## Limitations
+ 
+ - Language Specificity: The model is trained exclusively on Azerbaijani and may not generalize well to other languages.
+ - Data Bias: The fine-tuning data is sourced from news articles, which may carry topical biases and a specific journalistic style.
+ - Agglutinative Morphology: Azerbaijani's agglutinative structure yields many morphological variants per stem, which can make the word space sparse (see the tokenizer sketch after this list).
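+ 
+ A quick way to see the sparsity effect is to tokenize several inflected forms of one stem; each surface form may fragment into different subwords. The word forms below are chosen for illustration only:
+ ```python
+ from transformers import AutoTokenizer
+ 
+ tokenizer = AutoTokenizer.from_pretrained("iamdenay/roberta-azerbaijani")
+ 
+ # Inflected forms of "xəbər" (news): plural, then plural + possessive + ablative.
+ for word in ["xəbər", "xəbərlər", "xəbərlərindən"]:
+     print(word, "->", tokenizer.tokenize(word))
+ ```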
+ 
+ ## Ethical Considerations
+ - Content Sensitivity: The dataset may include sensitive topics. Users should ensure compliance with ethical standards when deploying the model.
+ - Bias and Fairness: Be aware of potential biases in the training data that could affect model predictions.

## Config
```json