Update README.md
README.md
CHANGED
|
@@ -116,6 +116,9 @@ language:
|
|
| 116 |
- yi
|
| 117 |
- zh
|
| 118 |
- zu
|
| 119 |
datasets:
|
| 120 |
- wikimedia/wikipedia
|
| 121 |
- HuggingFaceFW/finetranslations
|
|
@@ -125,9 +128,6 @@ datasets:
|
|
| 125 |
- DerivedFunction/finetranslations-filtered
|
| 126 |
- DerivedFunction/tatoeba-filtered
|
| 127 |
pipeline_tag: token-classification
|
| 128 |
-
model-index:
|
| 129 |
-
- name: polyglot-tagger
|
| 130 |
-
results: []
|
| 131 |
---
|
| 132 |
|
| 133 |
|
|
@@ -138,7 +138,7 @@ Fine-tuned `xlm-roberta-base` for sentence-level language tagging across 100 lan
|
|
| 138 |
The model predicts BIO-style language tags over tokens, which makes it useful for
|
| 139 |
language identification, code-switch detection, and multilingual document analysis.
|
| 140 |
|
| 141 |
-
> Compared to version 2.
|
| 142 |
|
| 143 |
## Model description
|
| 144 |
|
|
@@ -152,7 +152,7 @@ Note that as a general language tagging model, it can potentially get confused f
|
|
| 152 |
The model is trained on sentences with a minimum of four tokens, so it may not accurately classify very short and ambiguous statements. Note that this model is experimental
|
| 153 |
and may produce unexpected results compared to generic text classifiers. It is trained on cleaned text; therefore, "messy" text may unexpectedly produce different results.
|
| 154 |
|
| 155 |
-
> Note that Romanized versions of any language may
|
| 156 |
|
| 157 |
### Training and Evaluation Data
|
| 158 |
|
|
@@ -164,133 +164,132 @@ factors were used to simulate messy text, and to reduce single character bias on
|
|
| 164 |
- Random chance to change the casing of compatible language scripts, such as Latin and Cyrillic.
|
| 165 |
- Low chance of simulating OCR and messy text with character mutation.
|
| 166 |
|
| 167 |
-
To generalize well on both the target language and code switching a
|
| 168 |
- Pure documents 55%: Single language to learn its vocabulary, simulating a short paragraph of a single language.
|
| 169 |
- Homogeneous 25%: Single language + one foreign sentence to learn simple code switching.
|
| 170 |
- Spliced 10%: A foreign sentence is centered between two same-language sentences, with the first sentence's punctuation stripped and the second sentence forced to lowercase.
|
| 171 |
- Mixed 10%: Generic mix of any languages.
|
| 172 |
|
| 173 |
|
| 174 |
-
### Training Data Breakdown
|
| 175 |
-
| lang | train sentences | train tokens | eval sentences | eval tokens | all sentences | all tokens |
|
| 176 |
-
| :--- | ---: | ---: | ---: | ---: | ---: | ---: |
|
| 177 |
-
| en | 342138 (2.14%) | 8515554 (1.58%) | 2925 (3.89%) | 29279 (1.57%) | 345063 (2.14%) | 8544833 (1.58%) |
|
| 178 |
-
| es | 290248 (1.81%) | 8250416 (1.53%) | 1864 (2.48%) | 19826 (1.06%) | 292112 (1.82%) | 8270242 (1.53%) |
|
| 179 |
-
| ru | 289806 (1.81%) | 7911565 (1.47%) | 1963 (2.61%) | 19501 (1.04%) | 291769 (1.81%) | 7931066 (1.47%) |
|
| 180 |
-
| ja | 288677 (1.80%) | 6935500 (1.29%) | 1836 (2.44%) | 20115 (1.08%) | 290513 (1.81%) | 6955615 (1.29%) |
|
| 181 |
-
| fr | 285022 (1.78%) | 8950594 (1.66%) | 1849 (2.46%) | 22785 (1.22%) | 286871 (1.78%) | 8973379 (1.66%) |
|
| 182 |
-
| zh | 282136 (1.76%) | 6585294 (1.22%) | 1780 (2.37%) | 17110 (0.92%) | 283916 (1.76%) | 6602404 (1.22%) |
|
| 183 |
-
| de | 279766 (1.75%) | 7421366 (1.38%) | 1761 (2.34%) | 20090 (1.08%) | 281527 (1.75%) | 7441456 (1.37%) |
|
| 184 |
-
| pt | 277705 (1.73%) | 7518951 (1.39%) | 1789 (2.38%) | 16872 (0.90%) | 279494 (1.74%) | 7535823 (1.39%) |
|
| 185 |
-
| it | 274463 (1.71%) | 7647936 (1.42%) | 1641 (2.18%) | 14284 (0.77%) | 276104 (1.72%) | 7662220 (1.42%) |
|
| 186 |
-
| uk | 244679 (1.53%) | 6480580 (1.20%) | 1187 (1.58%) | 9893 (0.53%) | 245866 (1.53%) | 6490473 (1.20%) |
|
| 187 |
-
| fi | 243388 (1.52%) | 5886670 (1.09%) | 1521 (2.02%) | 14613 (0.78%) | 244909 (1.52%) | 5901283 (1.09%) |
|
| 188 |
-
| ar | 243319 (1.52%) | 5055410 (0.94%) | 1237 (1.65%) | 18277 (0.98%) | 244556 (1.52%) | 5073687 (0.94%) |
|
| 189 |
-
| pl | 239763 (1.50%) | 6747779 (1.25%) | 1162 (1.55%) | 11588 (0.62%) | 240925 (1.50%) | 6759367 (1.25%) |
|
| 190 |
-
| he | 235680 (1.47%) | 6872873 (1.27%) | 842 (1.12%) | 7301 (0.39%) | 236522 (1.47%) | 6880174 (1.27%) |
|
| 191 |
-
| hu | 234082 (1.46%) | 6422364 (1.19%) | 1093 (1.45%) | 10598 (0.57%) | 235175 (1.46%) | 6432962 (1.19%) |
|
| 192 |
-
| tr | 231785 (1.45%) | 5649577 (1.05%) | 1078 (1.43%) | 9102 (0.49%) | 232863 (1.45%) | 5658679 (1.05%) |
|
| 193 |
-
| cs | 229623 (1.43%) | 6157905 (1.14%) | 1010 (1.34%) | 8638 (0.46%) | 230633 (1.43%) | 6166543 (1.14%) |
|
| 194 |
-
| nl | 227744 (1.42%) | 5208953 (0.97%) | 1086 (1.45%) | 9966 (0.53%) | 228830 (1.42%) | 5218919 (0.96%) |
|
| 195 |
-
| lt | 220746 (1.38%) | 5523460 (1.02%) | 953 (1.27%) | 8390 (0.45%) | 221699 (1.38%) | 5531850 (1.02%) |
|
| 196 |
-
| mk | 219568 (1.37%) | 6321823 (1.17%) | 830 (1.10%) | 7363 (0.39%) | 220398 (1.37%) | 6329186 (1.17%) |
|
| 197 |
-
| mr | 218564 (1.36%) | 5737670 (1.06%) | 755 (1.00%) | 6419 (0.34%) | 219319 (1.36%) | 5744089 (1.06%) |
|
| 198 |
-
| eo | 214879 (1.34%) | 5575371 (1.03%) | 806 (1.07%) | 9199 (0.49%) | 215685 (1.34%) | 5584570 (1.03%) |
|
| 199 |
-
| no | 212572 (1.33%) | 6119841 (1.13%) | 1520 (2.02%) | 46805 (2.51%) | 214092 (1.33%) | 6166646 (1.14%) |
|
| 200 |
-
| da | 210913 (1.32%) | 5244616 (0.97%) | 1244 (1.66%) | 11307 (0.61%) | 212157 (1.32%) | 5255923 (0.97%) |
|
| 201 |
-
| tl | 198531 (1.24%) | 5385091 (1.00%) | 980 (1.30%) | 11451 (0.61%) | 199511 (1.24%) | 5396542 (1.00%) |
|
| 202 |
-
| hy | 197579 (1.23%) | 6260512 (1.16%) | 741 (0.99%) | 14164 (0.76%) | 198320 (1.23%) | 6274676 (1.16%) |
|
| 203 |
-
| hi | 196290 (1.23%) | 7473616 (1.39%) | 995 (1.32%) | 43438 (2.33%) | 197285 (1.23%) | 7517054 (1.39%) |
|
| 204 |
-
| ko | 196247 (1.23%) | 6256562 (1.16%) | 1021 (1.36%) | 28154 (1.51%) | 197268 (1.23%) | 6284716 (1.16%) |
|
| 205 |
-
| el | 192981 (1.21%) | 7010969 (1.30%) | 739 (0.98%) | 17209 (0.92%) | 193720 (1.20%) | 7028178 (1.30%) |
|
| 206 |
-
| ro | 185669 (1.16%) | 6171646 (1.14%) | 771 (1.03%) | 21626 (1.16%) | 186440 (1.16%) | 6193272 (1.14%) |
|
| 207 |
-
| fa | 182012 (1.14%) | 5634518 (1.04%) | 724 (0.96%) | 21266 (1.14%) | 182736 (1.14%) | 5655784 (1.04%) |
|
| 208 |
-
| sk | 181257 (1.13%) | 5323589 (0.99%) | 868 (1.16%) | 30324 (1.62%) | 182125 (1.13%) | 5353913 (0.99%) |
|
| 209 |
-
| la | 178735 (1.12%) | 4452161 (0.83%) | 752 (1.00%) | 7803 (0.42%) | 179487 (1.12%) | 4459964 (0.82%) |
|
| 210 |
-
| bg | 178477 (1.11%) | 5772043 (1.07%) | 681 (0.91%) | 18306 (0.98%) | 179158 (1.11%) | 5790349 (1.07%) |
|
| 211 |
-
| is | 174213 (1.09%) | 6024579 (1.12%) | 1005 (1.34%) | 47601 (2.55%) | 175218 (1.09%) | 6072180 (1.12%) |
|
| 212 |
-
| be | 174166 (1.09%) | 6368950 (1.18%) | 880 (1.17%) | 30097 (1.61%) | 175046 (1.09%) | 6399047 (1.18%) |
|
| 213 |
-
| lv | 170404 (1.06%) | 5706364 (1.06%) | 702 (0.93%) | 35541 (1.90%) | 171106 (1.06%) | 5741905 (1.06%) |
|
| 214 |
-
| ckb | 166543 (1.04%) | 7530043 (1.40%) | 591 (0.79%) | 25103 (1.35%) | 167134 (1.04%) | 7555146 (1.40%) |
|
| 215 |
-
| ms | 165285 (1.03%) | 4430830 (0.82%) | 778 (1.04%) | 24919 (1.34%) | 166063 (1.03%) | 4455749 (0.82%) |
|
| 216 |
-
| kk | 163582 (1.02%) | 4925313 (0.91%) | 629 (0.84%) | 17536 (0.94%) | 164211 (1.02%) | 4942849 (0.91%) |
|
| 217 |
-
| ka | 162558 (1.02%) | 5244466 (0.97%) | 527 (0.70%) | 16440 (0.88%) | 163085 (1.01%) | 5260906 (0.97%) |
|
| 218 |
-
| bn | 162058 (1.01%) | 6155732 (1.14%) | 416 (0.55%) | 14535 (0.78%) | 162474 (1.01%) | 6170267 (1.14%) |
|
| 219 |
-
| eu | 160479 (1.00%) | 5375791 (1.00%) | 675 (0.90%) | 34708 (1.86%) | 161154 (1.00%) | 5410499 (1.00%) |
|
| 220 |
-
| as | 160525 (1.00%) | 8228319 (1.53%) | 377 (0.50%) | 20723 (1.11%) | 160902 (1.00%) | 8249042 (1.52%) |
|
| 221 |
-
| mn | 160161 (1.00%) | 5433430 (1.01%) | 645 (0.86%) | 19610 (1.05%) | 160806 (1.00%) | 5453040 (1.01%) |
|
| 222 |
-
| ur | 158216 (0.99%) | 5101965 (0.95%) | 582 (0.77%) | 19433 (1.04%) | 158798 (0.99%) | 5121398 (0.95%) |
|
| 223 |
-
| ky | 157623 (0.98%) | 4996516 (0.93%) | 640 (0.85%) | 18469 (0.99%) | 158263 (0.98%) | 5014985 (0.93%) |
|
| 224 |
-
| ba | 157391 (0.98%) | 7998542 (1.48%) | 591 (0.79%) | 30781 (1.65%) | 157982 (0.98%) | 8029323 (1.48%) |
|
| 225 |
-
| oc | 157265 (0.98%) | 5676181 (1.05%) | 669 (0.89%) | 25130 (1.35%) | 157934 (0.98%) | 5701311 (1.05%) |
|
| 226 |
-
| th | 156439 (0.98%) | 5222662 (0.97%) | 579 (0.77%) | 20033 (1.07%) | 157018 (0.98%) | 5242695 (0.97%) |
|
| 227 |
-
| hr | 156094 (0.97%) | 4838805 (0.90%) | 716 (0.95%) | 31608 (1.69%) | 156810 (0.97%) | 4870413 (0.90%) |
|
| 228 |
-
| af | 155450 (0.97%) | 4536957 (0.84%) | 995 (1.32%) | 29783 (1.60%) | 156445 (0.97%) | 4566740 (0.84%) |
|
| 229 |
-
| ps | 155591 (0.97%) | 4514560 (0.84%) | 537 (0.71%) | 15173 (0.81%) | 156128 (0.97%) | 4529733 (0.84%) |
|
| 230 |
-
| id | 154998 (0.97%) | 4010110 (0.74%) | 737 (0.98%) | 18766 (1.01%) | 155735 (0.97%) | 4028876 (0.74%) |
|
| 231 |
-
| pa | 155053 (0.97%) | 7277866 (1.35%) | 602 (0.80%) | 29831 (1.60%) | 155655 (0.97%) | 7307697 (1.35%) |
|
| 232 |
-
| sw | 154323 (0.96%) | 4695422 (0.87%) | 562 (0.75%) | 20760 (1.11%) | 154885 (0.96%) | 4716182 (0.87%) |
|
| 233 |
-
| tt | 152959 (0.96%) | 5329166 (0.99%) | 642 (0.85%) | 9018 (0.48%) | 153601 (0.95%) | 5338184 (0.99%) |
|
| 234 |
-
| jv | 149846 (0.94%) | 4508978 (0.84%) | 512 (0.68%) | 18744 (1.00%) | 150358 (0.93%) | 4527722 (0.84%) |
|
| 235 |
-
| cy | 147138 (0.92%) | 5344455 (0.99%) | 664 (0.88%) | 27600 (1.48%) | 147802 (0.92%) | 5372055 (0.99%) |
|
| 236 |
-
| ga | 144559 (0.90%) | 5145701 (0.95%) | 681 (0.91%) | 28351 (1.52%) | 145240 (0.90%) | 5174052 (0.96%) |
|
| 237 |
-
| bs | 142871 (0.89%) | 4246487 (0.79%) | 664 (0.88%) | 23293 (1.25%) | 143535 (0.89%) | 4269780 (0.79%) |
|
| 238 |
-
| ca | 142813 (0.89%) | 5286099 (0.98%) | 614 (0.82%) | 20417 (1.09%) | 143427 (0.89%) | 5306516 (0.98%) |
|
| 239 |
-
| kn | 142453 (0.89%) | 14440106 (2.68%) | 636 (0.85%) | 48433 (2.60%) | 143089 (0.89%) | 14488539 (2.68%) |
|
| 240 |
-
| ne | 141617 (0.88%) | 4633072 (0.86%) | 441 (0.59%) | 13357 (0.72%) | 142058 (0.88%) | 4646429 (0.86%) |
|
| 241 |
-
| gl | 140293 (0.88%) | 4370302 (0.81%) | 580 (0.77%) | 17719 (0.95%) | 140873 (0.88%) | 4388021 (0.81%) |
|
| 242 |
-
| ku | 140150 (0.88%) | 4681613 (0.87%) | 539 (0.72%) | 25315 (1.36%) | 140689 (0.87%) | 4706928 (0.87%) |
|
| 243 |
-
| uz | 137971 (0.86%) | 4402565 (0.82%) | 512 (0.68%) | 18239 (0.98%) | 138483 (0.86%) | 4420804 (0.82%) |
|
| 244 |
-
| sl | 137474 (0.86%) | 3689209 (0.68%) | 602 (0.80%) | 15975 (0.86%) | 138076 (0.86%) | 3705184 (0.68%) |
|
| 245 |
-
| sv | 136227 (0.85%) | 3888997 (0.72%) | 951 (1.27%) | 7503 (0.40%) | 137178 (0.85%) | 3896500 (0.72%) |
|
| 246 |
-
| tg | 129047 (0.81%) | 7212798 (1.34%) | 507 (0.67%) | 30575 (1.64%) | 129554 (0.81%) | 7243373 (1.34%) |
|
| 247 |
-
| et | 125334 (0.78%) | 3134904 (0.58%) | 516 (0.69%) | 13566 (0.73%) | 125850 (0.78%) | 3148470 (0.58%) |
|
| 248 |
-
| br | 124276 (0.78%) | 4304633 (0.80%) | 555 (0.74%) | 16741 (0.90%) | 124831 (0.78%) | 4321374 (0.80%) |
|
| 249 |
-
| su | 123580 (0.77%) | 3996363 (0.74%) | 485 (0.65%) | 18936 (1.01%) | 124065 (0.77%) | 4015299 (0.74%) |
|
| 250 |
-
| lb | 123335 (0.77%) | 4218363 (0.78%) | 494 (0.66%) | 18236 (0.98%) | 123829 (0.77%) | 4236599 (0.78%) |
|
| 251 |
-
| mt | 122262 (0.76%) | 6326484 (1.17%) | 448 (0.60%) | 23893 (1.28%) | 122710 (0.76%) | 6350377 (1.17%) |
|
| 252 |
-
| sr | 121328 (0.76%) | 3374324 (0.63%) | 461 (0.61%) | 4054 (0.22%) | 121789 (0.76%) | 3378378 (0.62%) |
|
| 253 |
-
| sq | 114178 (0.71%) | 3917105 (0.73%) | 519 (0.69%) | 17283 (0.93%) | 114697 (0.71%) | 3934388 (0.73%) |
|
| 254 |
-
| or | 105680 (0.66%) | 3746174 (0.69%) | 415 (0.55%) | 13597 (0.73%) | 106095 (0.66%) | 3759771 (0.69%) |
|
| 255 |
-
| ml | 104847 (0.65%) | 10467180 (1.94%) | 432 (0.58%) | 34289 (1.84%) | 105279 (0.65%) | 10501469 (1.94%) |
|
| 256 |
-
| yi | 99296 (0.62%) | 4037607 (0.75%) | 356 (0.47%) | 7656 (0.41%) | 99652 (0.62%) | 4045263 (0.75%) |
|
| 257 |
-
| te | 96687 (0.60%) | 9339722 (1.73%) | 383 (0.51%) | 32688 (1.75%) | 97070 (0.60%) | 9372410 (1.73%) |
|
| 258 |
-
| ta | 89670 (0.56%) | 7564484 (1.40%) | 378 (0.50%) | 25504 (1.37%) | 90048 (0.56%) | 7589988 (1.40%) |
|
| 259 |
-
| mg | 89513 (0.56%) | 3128095 (0.58%) | 343 (0.46%) | 11046 (0.59%) | 89856 (0.56%) | 3139141 (0.58%) |
|
| 260 |
-
| si | 88038 (0.55%) | 5030862 (0.93%) | 345 (0.46%) | 17895 (0.96%) | 88383 (0.55%) | 5048757 (0.93%) |
|
| 261 |
-
| rm | 71273 (0.45%) | 2704067 (0.50%) | 254 (0.34%) | 10060 (0.54%) | 71527 (0.44%) | 2714127 (0.50%) |
|
| 262 |
-
| vi | 70788 (0.44%) | 2476955 (0.46%) | 333 (0.44%) | 3428 (0.18%) | 71121 (0.44%) | 2480383 (0.46%) |
|
| 263 |
-
| gu | 67542 (0.42%) | 7557129 (1.40%) | 299 (0.40%) | 28080 (1.50%) | 67841 (0.42%) | 7585209 (1.40%) |
|
| 264 |
-
| bo | 66467 (0.42%) | 1308971 (0.24%) | 218 (0.29%) | 4357 (0.23%) | 66685 (0.41%) | 1313328 (0.24%) |
|
| 265 |
-
| ug | 61389 (0.38%) | 1398443 (0.26%) | 213 (0.28%) | 4674 (0.25%) | 61602 (0.38%) | 1403117 (0.26%) |
|
| 266 |
-
| dv | 57776 (0.36%) | 1485603 (0.28%) | 205 (0.27%) | 5705 (0.31%) | 57981 (0.36%) | 1491308 (0.28%) |
|
| 267 |
-
| am | 56487 (0.35%) | 2563835 (0.48%) | 227 (0.30%) | 11582 (0.62%) | 56714 (0.35%) | 2575417 (0.48%) |
|
| 268 |
-
| yo | 56313 (0.35%) | 3444596 (0.64%) | 230 (0.31%) | 18061 (0.97%) | 56543 (0.35%) | 3462657 (0.64%) |
|
| 269 |
-
| my | 55960 (0.35%) | 2062131 (0.38%) | 212 (0.28%) | 8300 (0.44%) | 56172 (0.35%) | 2070431 (0.38%) |
|
| 270 |
-
| km | 53890 (0.34%) | 2889004 (0.54%) | 188 (0.25%) | 9842 (0.53%) | 54078 (0.34%) | 2898846 (0.54%) |
|
| 271 |
-
| so | 53862 (0.34%) | 1963263 (0.36%) | 202 (0.27%) | 7961 (0.43%) | 54064 (0.34%) | 1971224 (0.36%) |
|
| 272 |
-
| sd | 52494 (0.33%) | 3075283 (0.57%) | 200 (0.27%) | 10379 (0.56%) | 52694 (0.33%) | 3085662 (0.57%) |
|
| 273 |
-
| zu | 50081 (0.31%) | 2287742 (0.42%) | 187 (0.25%) | 8913 (0.48%) | 50268 (0.31%) | 2296655 (0.42%) |
|
| 274 |
-
| lo | 47982 (0.30%) | 1659406 (0.31%) | 188 (0.25%) | 5980 (0.32%) | 48170 (0.30%) | 1665386 (0.31%) |
|
| 275 |
-
| ti | 45884 (0.29%) | 2804218 (0.52%) | 193 (0.26%) | 11862 (0.64%) | 46077 (0.29%) | 2816080 (0.52%) |
|
| 276 |
-
| ce | 43040 (0.27%) | 2319707 (0.43%) | 181 (0.24%) | 9398 (0.50%) | 43221 (0.27%) | 2329105 (0.43%) |
|
| 277 |
-
| ny | 41897 (0.26%) | 1969591 (0.37%) | 159 (0.21%) | 8122 (0.44%) | 42056 (0.26%) | 1977713 (0.37%) |
|
| 278 |
-
| gd | 35576 (0.22%) | 1240445 (0.23%) | 142 (0.19%) | 3449 (0.18%) | 35718 (0.22%) | 1243894 (0.23%) |
|
| 279 |
-
| xh | 23590 (0.15%) | 877812 (0.16%) | 96 (0.13%) | 3597 (0.19%) | 23686 (0.15%) | 881409 (0.16%) |
|
| 280 |
-
| om | 14738 (0.09%) | 523551 (0.10%) | 55 (0.07%) | 1935 (0.10%) | 14793 (0.09%) | 525486 (0.10%) |
|
| 281 |
-
| sco | 8374 (0.05%) | 224424 (0.04%) | 30 (0.04%) | 1055 (0.06%) | 8404 (0.05%) | 225479 (0.04%) |
|
| 282 |
-
| **total** | 16012306 (100.00%) | 539378202 (100.00%) | 75126 (100.00%) | 1866305 (100.00%) | 16087432 (100.00%) | 541244507 (100.00%) |
|
| 283 |
|
| 284 |
|
| 285 |
|
| 286 |
|
| 287 |
-
This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base).
|
| 288 |
It achieves the following results on the evaluation set:
|
| 289 |
-
- Loss: 0.
|
| 290 |
-
- Precision: 0.
|
| 291 |
-
- Recall: 0.
|
| 292 |
-
- F1: 0.
|
| 293 |
-
- Accuracy: 0.
|
| 294 |
|
| 295 |
## Training procedure
|
| 296 |
|
|
@@ -298,10 +297,10 @@ It achieves the following results on the evaluation set:
|
|
| 298 |
|
| 299 |
The following hyperparameters were used during training:
|
| 300 |
- learning_rate: 5e-05
|
| 301 |
-
- train_batch_size:
|
| 302 |
-
- eval_batch_size:
|
| 303 |
- seed: 42
|
| 304 |
-
- gradient_accumulation_steps:
|
| 305 |
- total_train_batch_size: 144
|
| 306 |
- optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
|
| 307 |
- lr_scheduler_type: linear
|
|
@@ -312,30 +311,33 @@ The following hyperparameters were used during training:
|
|
| 312 |
|
| 313 |
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|
| 314 |
|:-------------:|:------:|:-----:|:---------------:|:---------:|:------:|:------:|:--------:|
|
| 315 |
-
| 0.
|
| 316 |
-
| 0.
|
| 317 |
-
| 0.
|
| 318 |
-
| 0.
|
| 319 |
-
| 0.
|
| 320 |
-
| 0.
|
| 321 |
-
| 0.
|
| 322 |
-
| 0.
|
| 323 |
-
| 0.
|
| 324 |
-
| 0.
|
| 325 |
-
| 0.
|
| 326 |
-
| 0.
|
| 327 |
-
| 0.
|
| 328 |
-
| 0.
|
| 329 |
-
| 0.
|
| 330 |
-
| 0.
|
| 331 |
-
| 0.
|
| 332 |
-
| 0.
|
| 333 |
-
| 0.
|
| 334 |
-
| 0.
|
| 335 |
-
| 0.
|
| 336 |
-
| 0.
|
| 337 |
-
| 0.
|
| 338 |
-
| 0.
|
| 339 |
|
| 340 |
|
| 341 |
### Framework versions
|
|
@@ -343,4 +345,4 @@ The following hyperparameters were used during training:
|
|
| 343 |
- Transformers 5.0.0
|
| 344 |
- Pytorch 2.10.0+cu128
|
| 345 |
- Datasets 4.0.0
|
| 346 |
-
- Tokenizers 0.22.2
|
|
|
|
| 116 |
- yi
|
| 117 |
- zh
|
| 118 |
- zu
|
| 119 |
+
model-index:
|
| 120 |
+
- name: polyglot-tagger
|
| 121 |
+
results: []
|
| 122 |
datasets:
|
| 123 |
- wikimedia/wikipedia
|
| 124 |
- HuggingFaceFW/finetranslations
|
|
|
|
| 128 |
- DerivedFunction/finetranslations-filtered
|
| 129 |
- DerivedFunction/tatoeba-filtered
|
| 130 |
pipeline_tag: token-classification
|
| 131 |
---
|
| 132 |
|
| 133 |
|
|
|
|
| 138 |
The model predicts BIO-style language tags over tokens, which makes it useful for
|
| 139 |
language identification, code-switch detection, and multilingual document analysis.
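
To make the BIO-style output concrete, here is a minimal sketch of collapsing per-token tags into language spans. It assumes labels of the form `B-<lang>`, `I-<lang>`, and `O`; the model's actual label names are not shown in this diff, and `bio_to_spans` is a hypothetical helper, not part of the model or of any library.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[Optional[str], str]]:
    """Collapse per-token BIO tags into (language, text) spans.

    Assumes tags look like "B-en", "I-en", or "O" (an assumption,
    not confirmed by the model card).
    """
    spans: List[Tuple[Optional[str], str]] = []
    current_lang: Optional[str] = None
    current_tokens: List[str] = []
    for token, tag in zip(tokens, tags):
        prefix, _, lang = tag.partition("-")
        # A "B-" tag, or an "I-" tag whose language differs, opens a new span.
        starts_new = prefix == "B" or (prefix == "I" and lang != current_lang)
        if prefix == "O" or starts_new:
            if current_tokens:
                spans.append((current_lang, " ".join(current_tokens)))
            if prefix == "O":
                current_lang, current_tokens = None, []
            else:
                current_lang, current_tokens = lang, [token]
        else:
            current_tokens.append(token)
    if current_tokens:
        spans.append((current_lang, " ".join(current_tokens)))
    return spans


tokens = ["Hello", "world", "bonjour", "le", "monde"]
tags = ["B-en", "I-en", "B-fr", "I-fr", "I-fr"]
print(bio_to_spans(tokens, tags))
```

A span boundary between two different languages is what a code-switch detector would report.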
|
| 140 |
|
| 141 |
+
> Compared to version 2.2, this version's training data attempts to fix the model scoring common grade-school words from major languages as low confidence or assigning them to a minor-language bucket.
|
| 142 |
|
| 143 |
## Model description
|
| 144 |
|
|
|
|
| 152 |
The model is trained on sentences with a minimum of four tokens, so it may not accurately classify very short and ambiguous statements. Note that this model is experimental
|
| 153 |
and may produce unexpected results compared to generic text classifiers. It is trained on cleaned text; therefore, "messy" text may unexpectedly produce different results.
|
| 154 |
|
| 155 |
+
> Note that Romanized versions of any language, such as Romanized Russian and Hindi, may have no representation in the training set.
|
| 156 |
|
| 157 |
### Training and Evaluation Data
|
| 158 |
|
|
|
|
| 164 |
- Random chance to change the casing of compatible language scripts, such as Latin and Cyrillic.
|
| 165 |
- Low chance of simulating OCR and messy text with character mutation.
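
The two augmentations above could be sketched as follows. This is an illustration only: the probabilities, the choice of casing operations, and the mutation rule (swapping a letter for another letter drawn from the same text) are assumptions, since the README does not specify them. `random_case` and `mutate_chars` are hypothetical names.

```python
import random


def random_case(text: str, rng: random.Random, p: float = 0.1) -> str:
    """With probability p, re-case the whole text (upper, lower, or title)."""
    if rng.random() >= p:
        return text
    return rng.choice([str.upper, str.lower, str.title])(text)


def mutate_chars(text: str, rng: random.Random, p: float = 0.02) -> str:
    """Replace each letter, with probability p, by a letter drawn from the same text.

    Drawing from the text's own letters keeps the mutation inside the
    original script; this rule is an assumption, not the author's method.
    """
    pool = [c for c in text if c.isalpha()]
    if not pool:
        return text
    return "".join(
        rng.choice(pool) if c.isalpha() and rng.random() < p else c
        for c in text
    )


rng = random.Random(42)
print(random_case("Привет, мир", rng, p=1.0))
print(mutate_chars("The quick brown fox", rng, p=0.2))
```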
|
| 166 |
|
| 167 |
+
To generalize well on both the target language and code switching, a curriculum is provided:
|
| 168 |
- Pure documents 55%: Single language to learn its vocabulary, simulating a short paragraph of a single language.
|
| 169 |
- Homogeneous 25%: Single language + one foreign sentence to learn simple code switching.
|
| 170 |
- Spliced 10%: A foreign sentence is centered between two same-language sentences, with the first sentence's punctuation stripped and the second sentence forced to lowercase.
|
| 171 |
- Mixed 10%: Generic mix of any languages.
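
As a rough sketch of how the curriculum mix above might be sampled, the snippet below draws a recipe name with the listed weights and builds one "spliced" example. The names `CURRICULUM`, `sample_recipe`, and `build_spliced` are hypothetical, and stripping only trailing `.`, `!`, and `?` is an assumption about what "punctuation stripped" means.

```python
import random

# Weights taken from the list above; the keys are illustrative names.
CURRICULUM = {"pure": 0.55, "homogeneous": 0.25, "spliced": 0.10, "mixed": 0.10}


def sample_recipe(rng: random.Random) -> str:
    """Pick a document recipe according to the curriculum weights."""
    names = list(CURRICULUM)
    weights = list(CURRICULUM.values())
    return rng.choices(names, weights=weights, k=1)[0]


def build_spliced(first: str, foreign: str, second: str) -> str:
    """Place a foreign sentence between two same-language sentences.

    The first sentence loses its trailing punctuation and the second
    is lowercased, so the language switch carries no obvious boundary cue.
    """
    return f"{first.rstrip('.!?')} {foreign} {second.lower()}"


rng = random.Random(0)
print(sample_recipe(rng))
print(build_spliced("I went home.", "Hace frío.", "Then I slept."))
```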
|
| 172 |
|
| 173 |
|
| 174 |
|
| 175 |
|
| 176 |
+
| lang | train sentences | train tokens | eval sentences | eval tokens | all sentences | all tokens |
|
| 177 |
+
| :--- | ---: | ---: | ---: | ---: | ---: | ---: |
|
| 178 |
+
| en | 423264 (2.41%) | 9841704 (1.71%) | 3157 (3.84%) | 35025 (1.79%) | 426421 (2.41%) | 9876729 (1.71%) |
|
| 179 |
+
| es | 359106 (2.04%) | 9729675 (1.69%) | 2201 (2.68%) | 22340 (1.14%) | 361307 (2.04%) | 9752015 (1.69%) |
|
| 180 |
+
| ru | 356083 (2.02%) | 8945224 (1.56%) | 2226 (2.71%) | 21578 (1.10%) | 358309 (2.03%) | 8966802 (1.55%) |
|
| 181 |
+
| fr | 354645 (2.02%) | 10591338 (1.84%) | 2213 (2.69%) | 26148 (1.34%) | 356858 (2.02%) | 10617486 (1.84%) |
|
| 182 |
+
| ja | 352243 (2.00%) | 7945312 (1.38%) | 2219 (2.70%) | 25849 (1.32%) | 354462 (2.01%) | 7971161 (1.38%) |
|
| 183 |
+
| pt | 346042 (1.97%) | 8793674 (1.53%) | 2059 (2.50%) | 20881 (1.07%) | 348101 (1.97%) | 8814555 (1.53%) |
|
| 184 |
+
| de | 344644 (1.96%) | 8847457 (1.54%) | 2151 (2.61%) | 24958 (1.27%) | 346795 (1.96%) | 8872415 (1.54%) |
|
| 185 |
+
| it | 343887 (1.95%) | 8790806 (1.53%) | 2000 (2.43%) | 17342 (0.89%) | 345887 (1.96%) | 8808148 (1.53%) |
|
| 186 |
+
| fi | 299568 (1.70%) | 6905536 (1.20%) | 1576 (1.92%) | 14458 (0.74%) | 301144 (1.70%) | 6919994 (1.20%) |
|
| 187 |
+
| uk | 297565 (1.69%) | 7228987 (1.26%) | 1398 (1.70%) | 11391 (0.58%) | 298963 (1.69%) | 7240378 (1.26%) |
|
| 188 |
+
| zh | 294064 (1.67%) | 7329413 (1.28%) | 1717 (2.09%) | 33433 (1.71%) | 295781 (1.67%) | 7362846 (1.28%) |
|
| 189 |
+
| tr | 289328 (1.64%) | 6606625 (1.15%) | 1384 (1.68%) | 11089 (0.57%) | 290712 (1.64%) | 6617714 (1.15%) |
|
| 190 |
+
| he | 289239 (1.64%) | 7792338 (1.36%) | 1100 (1.34%) | 10342 (0.53%) | 290339 (1.64%) | 7802680 (1.35%) |
|
| 191 |
+
| pl | 288423 (1.64%) | 7707293 (1.34%) | 1305 (1.59%) | 11306 (0.58%) | 289728 (1.64%) | 7718599 (1.34%) |
|
| 192 |
+
| hu | 286880 (1.63%) | 7547282 (1.31%) | 1232 (1.50%) | 11115 (0.57%) | 288112 (1.63%) | 7558397 (1.31%) |
|
| 193 |
+
| nl | 280682 (1.60%) | 6057971 (1.05%) | 1296 (1.58%) | 10453 (0.53%) | 281978 (1.60%) | 6068424 (1.05%) |
|
| 194 |
+
| lt | 273386 (1.55%) | 6507776 (1.13%) | 1165 (1.42%) | 10491 (0.54%) | 274551 (1.55%) | 6518267 (1.13%) |
|
| 195 |
+
| eo | 267885 (1.52%) | 6554654 (1.14%) | 1055 (1.28%) | 13271 (0.68%) | 268940 (1.52%) | 6567925 (1.14%) |
|
| 196 |
+
| ar | 257437 (1.46%) | 5125573 (0.89%) | 1327 (1.61%) | 15180 (0.78%) | 258764 (1.46%) | 5140753 (0.89%) |
|
| 197 |
+
| cs | 240507 (1.37%) | 6406086 (1.11%) | 1082 (1.32%) | 9473 (0.48%) | 241589 (1.37%) | 6415559 (1.11%) |
|
| 198 |
+
| mk | 231103 (1.31%) | 6478376 (1.13%) | 953 (1.16%) | 7713 (0.39%) | 232056 (1.31%) | 6486089 (1.12%) |
|
| 199 |
+
| mr | 228596 (1.30%) | 5886608 (1.02%) | 776 (0.94%) | 6332 (0.32%) | 229372 (1.30%) | 5892940 (1.02%) |
|
| 200 |
+
| no | 223605 (1.27%) | 6137131 (1.07%) | 1396 (1.70%) | 40226 (2.05%) | 225001 (1.27%) | 6177357 (1.07%) |
|
| 201 |
+
| da | 222243 (1.26%) | 5375746 (0.94%) | 1201 (1.46%) | 10373 (0.53%) | 223444 (1.26%) | 5386119 (0.93%) |
|
| 202 |
+
| hy | 207937 (1.18%) | 6345675 (1.10%) | 791 (0.96%) | 9276 (0.47%) | 208728 (1.18%) | 6354951 (1.10%) |
|
| 203 |
+
| tl | 207674 (1.18%) | 5561702 (0.97%) | 1017 (1.24%) | 10926 (0.56%) | 208691 (1.18%) | 5572628 (0.97%) |
|
| 204 |
+
| hi | 206552 (1.17%) | 7796062 (1.36%) | 1079 (1.31%) | 47351 (2.42%) | 207631 (1.17%) | 7843413 (1.36%) |
|
| 205 |
+
| ko | 205625 (1.17%) | 6481034 (1.13%) | 1156 (1.41%) | 32355 (1.65%) | 206781 (1.17%) | 6513389 (1.13%) |
|
| 206 |
+
| el | 202334 (1.15%) | 7105554 (1.24%) | 826 (1.00%) | 13412 (0.68%) | 203160 (1.15%) | 7118966 (1.23%) |
|
| 207 |
+
| ro | 194999 (1.11%) | 6206913 (1.08%) | 820 (1.00%) | 14575 (0.74%) | 195819 (1.11%) | 6221488 (1.08%) |
|
| 208 |
+
| fa | 192050 (1.09%) | 5728246 (1.00%) | 696 (0.85%) | 14765 (0.75%) | 192746 (1.09%) | 5743011 (1.00%) |
|
| 209 |
+
| sk | 189330 (1.08%) | 5318617 (0.93%) | 873 (1.06%) | 20779 (1.06%) | 190203 (1.08%) | 5339396 (0.93%) |
|
| 210 |
+
| la | 188201 (1.07%) | 4591159 (0.80%) | 824 (1.00%) | 8557 (0.44%) | 189025 (1.07%) | 4599716 (0.80%) |
|
| 211 |
+
| bg | 187685 (1.07%) | 5860353 (1.02%) | 762 (0.93%) | 16804 (0.86%) | 188447 (1.07%) | 5877157 (1.02%) |
|
| 212 |
+
| be | 181543 (1.03%) | 6528657 (1.14%) | 869 (1.06%) | 25944 (1.32%) | 182412 (1.03%) | 6554601 (1.14%) |
|
| 213 |
+
| is | 180452 (1.03%) | 6146455 (1.07%) | 959 (1.17%) | 39591 (2.02%) | 181411 (1.03%) | 6186046 (1.07%) |
|
| 214 |
+
| lv | 179142 (1.02%) | 5867897 (1.02%) | 762 (0.93%) | 33481 (1.71%) | 179904 (1.02%) | 5901378 (1.02%) |
|
| 215 |
+
| ckb | 174282 (0.99%) | 7825141 (1.36%) | 667 (0.81%) | 28756 (1.47%) | 174949 (0.99%) | 7853897 (1.36%) |
|
| 216 |
+
| ms | 172573 (0.98%) | 4614764 (0.80%) | 815 (0.99%) | 24769 (1.26%) | 173388 (0.98%) | 4639533 (0.80%) |
|
| 217 |
+
| ka | 170876 (0.97%) | 5505127 (0.96%) | 673 (0.82%) | 20651 (1.05%) | 171549 (0.97%) | 5525778 (0.96%) |
|
| 218 |
+
| kk | 170695 (0.97%) | 5132560 (0.89%) | 676 (0.82%) | 18695 (0.95%) | 171371 (0.97%) | 5151255 (0.89%) |
|
| 219 |
+
| bn | 170721 (0.97%) | 6393448 (1.11%) | 441 (0.54%) | 14727 (0.75%) | 171162 (0.97%) | 6408175 (1.11%) |
|
| 220 |
+
| eu | 168462 (0.96%) | 5737310 (1.00%) | 746 (0.91%) | 37196 (1.90%) | 169208 (0.96%) | 5774506 (1.00%) |
|
| 221 |
+
| as | 168746 (0.96%) | 8564682 (1.49%) | 445 (0.54%) | 24444 (1.25%) | 169191 (0.96%) | 8589126 (1.49%) |
|
| 222 |
+
| mn | 167543 (0.95%) | 5678049 (0.99%) | 703 (0.85%) | 20347 (1.04%) | 168246 (0.95%) | 5698396 (0.99%) |
|
| 223 |
+
| ur | 165992 (0.94%) | 5361179 (0.93%) | 684 (0.83%) | 22622 (1.16%) | 166676 (0.94%) | 5383801 (0.93%) |
|
| 224 |
+
| oc | 165863 (0.94%) | 5735536 (1.00%) | 730 (0.89%) | 18599 (0.95%) | 166593 (0.94%) | 5754135 (1.00%) |
|
| 225 |
+
| ba | 164919 (0.94%) | 8387828 (1.46%) | 699 (0.85%) | 35927 (1.83%) | 165618 (0.94%) | 8423755 (1.46%) |
|
| 226 |
+
| th | 164429 (0.93%) | 5495248 (0.96%) | 649 (0.79%) | 22113 (1.13%) | 165078 (0.93%) | 5517361 (0.96%) |
|
| 227 |
+
| ky | 164374 (0.93%) | 5199548 (0.90%) | 683 (0.83%) | 18956 (0.97%) | 165057 (0.93%) | 5218504 (0.90%) |
|
| 228 |
+
| hr | 163828 (0.93%) | 5183677 (0.90%) | 711 (0.86%) | 33845 (1.73%) | 164539 (0.93%) | 5217522 (0.90%) |
|
| 229 |
+
| ps | 163238 (0.93%) | 4735113 (0.82%) | 674 (0.82%) | 18515 (0.95%) | 163912 (0.93%) | 4753628 (0.82%) |
|
| 230 |
+
| id | 163187 (0.93%) | 4025079 (0.70%) | 723 (0.88%) | 13371 (0.68%) | 163910 (0.93%) | 4038450 (0.70%) |
|
| 231 |
+
| pa | 162180 (0.92%) | 7621059 (1.33%) | 581 (0.71%) | 29036 (1.48%) | 162761 (0.92%) | 7650095 (1.33%) |
|
| 232 |
+
| sw | 161777 (0.92%) | 5013161 (0.87%) | 653 (0.79%) | 26493 (1.35%) | 162430 (0.92%) | 5039654 (0.87%) |
|
| 233 |
+
| af | 160455 (0.91%) | 4676798 (0.81%) | 932 (1.13%) | 27369 (1.40%) | 161387 (0.91%) | 4704167 (0.82%) |
|
| 234 |
+
| jv | 156292 (0.89%) | 4752381 (0.83%) | 576 (0.70%) | 22573 (1.15%) | 156868 (0.89%) | 4774954 (0.83%) |
|
| 235 |
+
| tt | 154833 (0.88%) | 5165763 (0.90%) | 578 (0.70%) | 7298 (0.37%) | 155411 (0.88%) | 5173061 (0.90%) |
|
| 236 |
+
| cy | 153551 (0.87%) | 5656404 (0.98%) | 653 (0.79%) | 29503 (1.51%) | 154204 (0.87%) | 5685907 (0.99%) |
|
| 237 |
+
| ga | 150458 (0.86%) | 5488243 (0.95%) | 680 (0.83%) | 33471 (1.71%) | 151138 (0.86%) | 5521714 (0.96%) |
|
| 238 |
+
| kn | 150184 (0.85%) | 14992479 (2.61%) | 697 (0.85%) | 49288 (2.52%) | 150881 (0.85%) | 15041767 (2.61%) |
|
| 239 |
+
| bs | 150037 (0.85%) | 4582900 (0.80%) | 649 (0.79%) | 25588 (1.31%) | 150686 (0.85%) | 4608488 (0.80%) |
|
| 240 |
+
| ca | 149401 (0.85%) | 5477662 (0.95%) | 629 (0.76%) | 21391 (1.09%) | 150030 (0.85%) | 5499053 (0.95%) |
|
| 241 |
+
| ne | 148716 (0.85%) | 4855198 (0.84%) | 535 (0.65%) | 16246 (0.83%) | 149251 (0.84%) | 4871444 (0.84%) |
|
| 242 |
+
| ku | 147702 (0.84%) | 4973601 (0.87%) | 574 (0.70%) | 28196 (1.44%) | 148276 (0.84%) | 5001797 (0.87%) |
|
| 243 |
+
| gl | 147011 (0.84%) | 4554907 (0.79%) | 658 (0.80%) | 20127 (1.03%) | 147669 (0.84%) | 4575034 (0.79%) |
|
| 244 |
+
| uz | 145433 (0.83%) | 4704898 (0.82%) | 573 (0.70%) | 21862 (1.12%) | 146006 (0.83%) | 4726760 (0.82%) |
|
| 245 |
+
| sl | 144084 (0.82%) | 3851696 (0.67%) | 651 (0.79%) | 18164 (0.93%) | 144735 (0.82%) | 3869860 (0.67%) |
|
| 246 |
+
| sv | 143041 (0.81%) | 4006332 (0.70%) | 905 (1.10%) | 7012 (0.36%) | 143946 (0.81%) | 4013344 (0.70%) |
|
| 247 |
+
| tg | 136703 (0.78%) | 7664329 (1.33%) | 572 (0.70%) | 34220 (1.75%) | 137275 (0.78%) | 7698549 (1.33%) |
|
| 248 |
+
| et | 131007 (0.74%) | 3280590 (0.57%) | 549 (0.67%) | 14021 (0.72%) | 131556 (0.74%) | 3294611 (0.57%) |
|
| 249 |
+
| br | 130223 (0.74%) | 4495403 (0.78%) | 546 (0.66%) | 17304 (0.88%) | 130769 (0.74%) | 4512707 (0.78%) |
|
| 250 |
+
| lb | 129528 (0.74%) | 4421411 (0.77%) | 495 (0.60%) | 17761 (0.91%) | 130023 (0.74%) | 4439172 (0.77%) |
|
| 251 |
+
| su | 129144 (0.73%) | 4215719 (0.73%) | 535 (0.65%) | 21391 (1.09%) | 129679 (0.73%) | 4237110 (0.73%) |
|
| 252 |
+
| mt | 128626 (0.73%) | 6671441 (1.16%) | 508 (0.62%) | 26729 (1.36%) | 129134 (0.73%) | 6698170 (1.16%) |
|
| 253 |
+
| sq | 119431 (0.68%) | 4107917 (0.71%) | 561 (0.68%) | 18633 (0.95%) | 119992 (0.68%) | 4126550 (0.72%) |
|
| 254 |
+
| sr | 117855 (0.67%) | 3160599 (0.55%) | 427 (0.52%) | 3505 (0.18%) | 118282 (0.67%) | 3164104 (0.55%) |
|
| 255 |
+
| or | 110709 (0.63%) | 3922431 (0.68%) | 410 (0.50%) | 13094 (0.67%) | 111119 (0.63%) | 3935525 (0.68%) |
|
| 256 |
+
| ml | 110085 (0.63%) | 10929013 (1.90%) | 464 (0.56%) | 36922 (1.89%) | 110549 (0.63%) | 10965935 (1.90%) |
|
| 257 |
+
| yi | 104494 (0.59%) | 4085563 (0.71%) | 400 (0.49%) | 6005 (0.31%) | 104894 (0.59%) | 4091568 (0.71%) |
|
| 258 |
+
| te | 101076 (0.57%) | 9757033 (1.70%) | 430 (0.52%) | 37897 (1.94%) | 101506 (0.57%) | 9794930 (1.70%) |
|
| 259 |
+
| ta | 94122 (0.53%) | 7917169 (1.38%) | 386 (0.47%) | 26610 (1.36%) | 94508 (0.53%) | 7943779 (1.38%) |
|
| 260 |
+
| mg | 93939 (0.53%) | 3291017 (0.57%) | 391 (0.48%) | 11698 (0.60%) | 94330 (0.53%) | 3302715 (0.57%) |
|
| 261 |
+
| si | 92723 (0.53%) | 5275463 (0.92%) | 364 (0.44%) | 18426 (0.94%) | 93087 (0.53%) | 5293889 (0.92%) |
|
| 262 |
+
| vi | 74916 (0.43%) | 2535825 (0.44%) | 335 (0.41%) | 3396 (0.17%) | 75251 (0.43%) | 2539221 (0.44%) |
|
| 263 |
+
| rm | 74806 (0.43%) | 2826708 (0.49%) | 318 (0.39%) | 12654 (0.65%) | 75124 (0.43%) | 2839362 (0.49%) |
|
| 264 |
+
| gu | 70961 (0.40%) | 7859622 (1.37%) | 335 (0.41%) | 28389 (1.45%) | 71296 (0.40%) | 7888011 (1.37%) |
|
| 265 |
+
| bo | 69565 (0.40%) | 1378245 (0.24%) | 263 (0.32%) | 5407 (0.28%) | 69828 (0.40%) | 1383652 (0.24%) |
|
| 266 |
+
| ug | 64297 (0.37%) | 1427585 (0.25%) | 260 (0.32%) | 4769 (0.24%) | 64557 (0.37%) | 1432354 (0.25%) |
|
| 267 |
+
| dv | 60328 (0.34%) | 1557497 (0.27%) | 215 (0.26%) | 5844 (0.30%) | 60543 (0.34%) | 1563341 (0.27%) |
|
| 268 |
+
| am | 59339 (0.34%) | 2705311 (0.47%) | 235 (0.29%) | 10768 (0.55%) | 59574 (0.34%) | 2716079 (0.47%) |
|
| 269 |
+
| yo | 59246 (0.34%) | 3649130 (0.63%) | 260 (0.32%) | 21157 (1.08%) | 59506 (0.34%) | 3670287 (0.64%) |
|
| 270 |
+
| my | 58575 (0.33%) | 2165089 (0.38%) | 214 (0.26%) | 8142 (0.42%) | 58789 (0.33%) | 2173231 (0.38%) |
|
| 271 |
+
| km | 57081 (0.32%) | 3056236 (0.53%) | 193 (0.23%) | 10606 (0.54%) | 57274 (0.32%) | 3066842 (0.53%) |
|
| 272 |
+
| so | 56160 (0.32%) | 2044409 (0.36%) | 212 (0.26%) | 8847 (0.45%) | 56372 (0.32%) | 2053256 (0.36%) |
|
| 273 |
+
| sd | 55359 (0.31%) | 3226018 (0.56%) | 217 (0.26%) | 10847 (0.55%) | 55576 (0.31%) | 3236865 (0.56%) |
|
| 274 |
+
| zu | 52465 (0.30%) | 2406841 (0.42%) | 203 (0.25%) | 9751 (0.50%) | 52668 (0.30%) | 2416592 (0.42%) |
|
| 275 |
+
| lo | 50641 (0.29%) | 1747495 (0.30%) | 189 (0.23%) | 6221 (0.32%) | 50830 (0.29%) | 1753716 (0.30%) |
|
| 276 |
+
| ti | 47785 (0.27%) | 2895617 (0.50%) | 195 (0.24%) | 12699 (0.65%) | 47980 (0.27%) | 2908316 (0.50%) |
|
| 277 |
+
| ce | 45014 (0.26%) | 2425219 (0.42%) | 188 (0.23%) | 9950 (0.51%) | 45202 (0.26%) | 2435169 (0.42%) |
|
| 278 |
+
| ny | 43552 (0.25%) | 2051132 (0.36%) | 171 (0.21%) | 8286 (0.42%) | 43723 (0.25%) | 2059418 (0.36%) |
|
| 279 |
+
| gd | 36623 (0.21%) | 1273243 (0.22%) | 156 (0.19%) | 3615 (0.18%) | 36779 (0.21%) | 1276858 (0.22%) |
|
| 280 |
+
| xh | 24432 (0.14%) | 911850 (0.16%) | 93 (0.11%) | 3528 (0.18%) | 24525 (0.14%) | 915378 (0.16%) |
|
| 281 |
+
| om | 15372 (0.09%) | 545603 (0.09%) | 77 (0.09%) | 2564 (0.13%) | 15449 (0.09%) | 548167 (0.10%) |
|
| 282 |
+
| sco | 8772 (0.05%) | 233030 (0.04%) | 37 (0.04%) | 828 (0.04%) | 8809 (0.05%) | 233858 (0.04%) |
|
| 283 |
+
| **total** | 17593786 (100.00%) | 574735483 (100.00%) | 82270 (100.00%) | 1958217 (100.00%) | 17676056 (100.00%) | 576693700 (100.00%) |
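The table's percentages and combined columns can be sanity-checked with a few lines of Python. This is a minimal sketch, assuming the first and third numeric columns are train and evaluation sample counts and the fifth is their sum. The figures are copied from the `sr` and **total** rows, and the helper name `share` is illustrative only.

```python
# Figures copied from the "sr" and "total" rows of the table above.
# Column meaning (train / eval / combined sample counts) is an assumption.
TOTAL_TRAIN = 17_593_786
TOTAL_EVAL = 82_270
TOTAL_COMBINED = 17_676_056
SR_TRAIN = 117_855


def share(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100.0 * count / total, 2)


# The combined column should equal train + eval.
combined_ok = TOTAL_TRAIN + TOTAL_EVAL == TOTAL_COMBINED

# Share of Serbian ("sr") samples in the train split.
sr_share = share(SR_TRAIN, TOTAL_TRAIN)

print(combined_ok, sr_share)
```

The same helper can be applied to any other row to confirm the percentage printed next to its count.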
|
| 284 |
|
| 285 |
|
| 286 |
+
This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on the datasets listed in this card's metadata.
|
| 287 |
It achieves the following results on the evaluation set:
|
| 288 |
+
- Loss: 0.0306
|
| 289 |
+
- Precision: 0.9507
|
| 290 |
+
- Recall: 0.9644
|
| 291 |
+
- F1: 0.9575
|
| 292 |
+
- Accuracy: 0.9917
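As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported F1 can be recomputed from the two values above. This is a minimal sketch with a hypothetical helper `f1_score`. The card's own metrics were presumably computed from raw counts by the evaluation library, so small rounding differences are possible in general.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluation results listed above.
f1 = f1_score(0.9507, 0.9644)
print(f"{f1:.4f}")  # prints 0.9575
```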
|
| 293 |
|
| 294 |
## Training procedure
|
| 295 |
|
|
|
|
| 297 |
|
| 298 |
The following hyperparameters were used during training:
|
| 299 |
- learning_rate: 5e-05
|
| 300 |
+
- train_batch_size: 72
|
| 301 |
+
- eval_batch_size: 36
|
| 302 |
- seed: 42
|
| 303 |
+
- gradient_accumulation_steps: 2
|
| 304 |
- total_train_batch_size: 144
|
| 305 |
- optimizer: AdamW, fused PyTorch implementation (`OptimizerNames.ADAMW_TORCH_FUSED`), with betas=(0.9,0.999) and epsilon=1e-08; no additional optimizer arguments
|
| 306 |
- lr_scheduler_type: linear
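The `total_train_batch_size` above follows from the per-device batch size and the gradient accumulation steps. The sketch below shows the arithmetic, assuming a single training device (the device count is not stated in this card). The function name `effective_batch_size` is illustrative.

```python
def effective_batch_size(per_device: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Number of samples contributing to each optimizer step."""
    return per_device * grad_accum_steps * num_devices


# train_batch_size=72 and gradient_accumulation_steps=2 are taken from the list above;
# num_devices=1 is an assumption.
total = effective_batch_size(per_device=72, grad_accum_steps=2)
print(total)  # prints 144
```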
|
|
|
|
| 311 |
|
| 312 |
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|
| 313 |
|:-------------:|:------:|:-----:|:---------------:|:---------:|:------:|:------:|:--------:|
|
| 314 |
+
| 0.0918 | 0.0731 | 2500 | 0.1050 | 0.7984 | 0.8818 | 0.8381 | 0.9735 |
|
| 315 |
+
| 0.0717 | 0.1463 | 5000 | 0.0797 | 0.8393 | 0.9041 | 0.8705 | 0.9782 |
|
| 316 |
+
| 0.0624 | 0.2194 | 7500 | 0.0762 | 0.8664 | 0.9166 | 0.8908 | 0.9804 |
|
| 317 |
+
| 0.0562 | 0.2925 | 10000 | 0.0620 | 0.8758 | 0.9247 | 0.8995 | 0.9830 |
|
| 318 |
+
| 0.0516 | 0.3657 | 12500 | 0.0576 | 0.8844 | 0.9298 | 0.9065 | 0.9845 |
|
| 319 |
+
| 0.0465 | 0.4388 | 15000 | 0.0543 | 0.8993 | 0.9357 | 0.9172 | 0.9857 |
|
| 320 |
+
| 0.0433 | 0.5119 | 17500 | 0.0558 | 0.9005 | 0.9356 | 0.9177 | 0.9856 |
|
| 321 |
+
| 0.0411 | 0.5851 | 20000 | 0.0499 | 0.9012 | 0.9385 | 0.9195 | 0.9867 |
|
| 322 |
+
| 0.0420 | 0.6582 | 22500 | 0.0460 | 0.9167 | 0.9438 | 0.9300 | 0.9873 |
|
| 323 |
+
| 0.0392 | 0.7313 | 25000 | 0.0441 | 0.9149 | 0.9448 | 0.9296 | 0.9878 |
|
| 324 |
+
| 0.0386 | 0.8045 | 27500 | 0.0434 | 0.9200 | 0.9476 | 0.9336 | 0.9885 |
|
| 325 |
+
| 0.0357 | 0.8776 | 30000 | 0.0422 | 0.9235 | 0.9503 | 0.9367 | 0.9886 |
|
| 326 |
+
| 0.0356 | 0.9507 | 32500 | 0.0404 | 0.9272 | 0.9520 | 0.9395 | 0.9890 |
|
| 327 |
+
| 0.0261 | 1.0238 | 35000 | 0.0381 | 0.9293 | 0.9529 | 0.9409 | 0.9898 |
|
| 328 |
+
| 0.0322 | 1.0970 | 37500 | 0.0371 | 0.9346 | 0.9558 | 0.9451 | 0.9899 |
|
| 329 |
+
| 0.0303 | 1.1701 | 40000 | 0.0374 | 0.9375 | 0.9580 | 0.9476 | 0.9903 |
|
| 330 |
+
| 0.0276 | 1.2432 | 42500 | 0.0378 | 0.9355 | 0.9566 | 0.9460 | 0.9901 |
|
| 331 |
+
| 0.0264 | 1.3164 | 45000 | 0.0353 | 0.9373 | 0.9574 | 0.9472 | 0.9904 |
|
| 332 |
+
| 0.0228 | 1.3895 | 47500 | 0.0366 | 0.9398 | 0.9589 | 0.9493 | 0.9903 |
|
| 333 |
+
| 0.0234 | 1.4626 | 50000 | 0.0343 | 0.9430 | 0.9602 | 0.9516 | 0.9907 |
|
| 334 |
+
| 0.0274 | 1.5358 | 52500 | 0.0339 | 0.9396 | 0.9591 | 0.9492 | 0.9906 |
|
| 335 |
+
| 0.0236 | 1.6089 | 55000 | 0.0324 | 0.9438 | 0.9613 | 0.9525 | 0.9913 |
|
| 336 |
+
| 0.0244 | 1.6820 | 57500 | 0.0322 | 0.9478 | 0.9624 | 0.9551 | 0.9914 |
|
| 337 |
+
| 0.0222 | 1.7552 | 60000 | 0.0323 | 0.9483 | 0.9628 | 0.9555 | 0.9914 |
|
| 338 |
+
| 0.0238 | 1.8283 | 62500 | 0.0320 | 0.9480 | 0.9630 | 0.9554 | 0.9913 |
|
| 339 |
+
| 0.0223 | 1.9014 | 65000 | 0.0320 | 0.9485 | 0.9637 | 0.9560 | 0.9913 |
|
| 340 |
+
| 0.0208 | 1.9746 | 67500 | 0.0306 | 0.9507 | 0.9644 | 0.9575 | 0.9917 |
|
| 341 |
|
| 342 |
|
| 343 |
### Framework versions
|
|
|
|
| 345 |
- Transformers 5.0.0
|
| 346 |
- PyTorch 2.10.0+cu128
|
| 347 |
- Datasets 4.0.0
|
| 348 |
+
- Tokenizers 0.22.2
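To compare a local environment against the versions listed above, the standard library's `importlib.metadata` can report what is installed. This is a minimal sketch: the `EXPECTED` mapping and the helper `installed_versions` are illustrative, and the distribution names (`torch` for PyTorch, for example) are assumed to be the usual PyPI names.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the list above; keys are assumed PyPI distribution names.
EXPECTED = {
    "transformers": "5.0.0",
    "torch": "2.10.0+cu128",
    "datasets": "4.0.0",
    "tokenizers": "0.22.2",
}


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, have in installed_versions(EXPECTED).items():
    want = EXPECTED[name]
    status = "ok" if have == want else "differs"
    print(f"{name}: installed={have} expected={want} ({status})")
```

A mismatch does not necessarily mean the model will fail to load; it only flags a difference from the training environment.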
|