omarkamali committed · Commit 0cb2d39 · verified · 1 Parent(s): 247cf96

Upload all models and assets for ace (latest)

Files changed (39)
  1. README.md +95 -95
  2. models/embeddings/aligned/ace_128d.bin +1 -1
  3. models/embeddings/aligned/ace_128d.projection.npy +1 -1
  4. models/embeddings/aligned/ace_32d.bin +1 -1
  5. models/embeddings/aligned/ace_32d.projection.npy +1 -1
  6. models/embeddings/aligned/ace_64d.bin +1 -1
  7. models/embeddings/aligned/ace_64d.projection.npy +1 -1
  8. models/embeddings/monolingual/ace_128d.bin +1 -1
  9. models/embeddings/monolingual/ace_32d.bin +1 -1
  10. models/embeddings/monolingual/ace_64d.bin +1 -1
  11. models/subword_markov/ace_markov_ctx1_subword.parquet +2 -2
  12. models/subword_markov/ace_markov_ctx2_subword.parquet +2 -2
  13. models/subword_markov/ace_markov_ctx3_subword.parquet +2 -2
  14. models/subword_markov/ace_markov_ctx4_subword.parquet +2 -2
  15. models/subword_ngram/ace_2gram_subword.parquet +2 -2
  16. models/subword_ngram/ace_3gram_subword.parquet +2 -2
  17. models/subword_ngram/ace_4gram_subword.parquet +2 -2
  18. models/subword_ngram/ace_5gram_subword.parquet +2 -2
  19. models/tokenizer/ace_tokenizer_16k.model +1 -1
  20. models/tokenizer/ace_tokenizer_32k.model +1 -1
  21. models/tokenizer/ace_tokenizer_64k.model +1 -1
  22. models/tokenizer/ace_tokenizer_8k.model +1 -1
  23. models/word_markov/ace_markov_ctx1_word.parquet +2 -2
  24. models/word_markov/ace_markov_ctx2_word.parquet +2 -2
  25. models/word_markov/ace_markov_ctx3_word.parquet +2 -2
  26. models/word_markov/ace_markov_ctx4_word.parquet +2 -2
  27. models/word_ngram/ace_2gram_word.parquet +2 -2
  28. models/word_ngram/ace_3gram_word.parquet +2 -2
  29. models/word_ngram/ace_4gram_word.parquet +2 -2
  30. models/word_ngram/ace_5gram_word.parquet +2 -2
  31. visualizations/embedding_alignment_quality.png +0 -0
  32. visualizations/embedding_isotropy.png +0 -0
  33. visualizations/embedding_norms.png +0 -0
  34. visualizations/embedding_similarity.png +2 -2
  35. visualizations/embedding_tsne_multilingual.png +2 -2
  36. visualizations/performance_dashboard.png +2 -2
  37. visualizations/position_encoding_comparison.png +2 -2
  38. visualizations/tsne_sentences.png +2 -2
  39. visualizations/tsne_words.png +2 -2
README.md CHANGED
@@ -36,7 +36,7 @@ metrics:
36
  value: 4.925
37
  - name: best_isotropy
38
  type: isotropy
39
- value: 0.5616
40
  - name: vocabulary_size
41
  type: vocab
42
  value: 0
@@ -99,32 +99,32 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
99
 
100
  Below are sample sentences tokenized with each vocabulary size:
101
 
102
- **Sample 1:** `Propinsi Champasak nakeuh saboh propinsi di Laos. Nang nanggroejih nakeuh Pakse.`
103
 
104
  | Vocab | Tokens | Count |
105
  |-------|--------|-------|
106
- | 8k | `▁propinsichamp as ak nakeuh ▁saboh ▁propinsidi ▁laos . ... (+6 more)` | 16 |
107
- | 16k | `▁propinsichamp asaknakeuh ▁saboh ▁propinsidi ▁laos . ▁nang ... (+5 more)` | 15 |
108
- | 32k | `▁propinsichampasaknakeuh ▁saboh ▁propinsidi ▁laos .nangnanggroejih ... (+4 more)` | 14 |
109
- | 64k | `▁propinsichampasaknakeuh ▁saboh ▁propinsidilaos . nangnanggroejih ... (+3 more)` | 13 |
110
 
111
- **Sample 2:** `Mesjid Keumangan nakeuh gampông di Mutiara, Kabupatèn Pidie, Acèh. Lumbôi gampôn...`
112
 
113
  | Vocab | Tokens | Count |
114
  |-------|--------|-------|
115
- | 8k | `▁mesjid ▁keum angannakeuh ▁gampôngdimutiara , kabupatènpidie ... (+14 more)` | 24 |
116
- | 16k | `▁mesjid ▁keumangan ▁nakeuhgampôngdimutiara , kabupatènpidie , ... (+13 more)` | 23 |
117
- | 32k | `▁mesjidkeumangan ▁nakeuh ▁gampôngdimutiara , kabupatènpidie , ... (+13 more)` | 23 |
118
- | 64k | `▁mesjidkeumangan ▁nakeuh ▁gampôngdimutiara , kabupatènpidie , ... (+13 more)` | 23 |
119
 
120
- **Sample 3:** `Jurông Pandé nakeuh gampông di Geulumpang Tiga, Kabupatèn Pidie, Acèh. Lumbôi ga...`
121
 
122
  | Vocab | Tokens | Count |
123
  |-------|--------|-------|
124
- | 8k | `▁jurôngpand é ▁nakeuh ▁gampông ▁digeulumpangtiga ,kabupatèn ... (+17 more)` | 27 |
125
- | 16k | `▁jurôngpandé ▁nakeuh ▁gampông ▁digeulumpangtiga , kabupatènpidie ... (+16 more)` | 26 |
126
- | 32k | `▁jurôngpandé ▁nakeuh ▁gampông ▁digeulumpangtiga , kabupatènpidie ... (+16 more)` | 26 |
127
- | 64k | `▁jurôngpandé ▁nakeuh ▁gampông ▁digeulumpangtiga , kabupatènpidie ... (+16 more)` | 26 |
128
 
129
 
130
  ### Key Findings
@@ -176,7 +176,7 @@ Below are sample sentences tokenized with each vocabulary size:
176
  | 2 | `nyoe bak laman` | 3,694 |
177
  | 3 | `lumbôi gampông nyoe` | 3,567 |
178
  | 4 | `acèh lumbôi gampông` | 3,564 |
179
- | 5 | `lam data peumeurèntah` | 3,499 |
180
 
181
  **4-grams (Word):**
182
 
@@ -184,16 +184,16 @@ Below are sample sentences tokenized with each vocabulary size:
184
  |------|--------|-------|
185
  | 1 | `gunong nyoe bak laman` | 3,694 |
186
  | 2 | `acèh lumbôi gampông nyoe` | 3,564 |
187
- | 3 | `nyoe lam data peumeurèntah` | 3,499 |
188
- | 4 | `lam data peumeurèntah nakeuh` | 3,499 |
189
  | 5 | `gampông nyoe lam data` | 3,499 |
190
 
191
  **5-grams (Word):**
192
 
193
  | Rank | N-gram | Count |
194
  |------|--------|-------|
195
- | 1 | `gampông nyoe lam data peumeurèntah` | 3,499 |
196
- | 2 | `nyoe lam data peumeurèntah nakeuh` | 3,499 |
197
  | 3 | `lumbôi gampông nyoe lam data` | 3,498 |
198
  | 4 | `acèh lumbôi gampông nyoe lam` | 3,495 |
199
  | 5 | `lam data peumeurèntah nakeuh nè` | 3,489 |
@@ -274,27 +274,27 @@ Below are text samples generated from each word-based Markov chain model:
274
 
275
  **Context Size 1:**
276
 
277
- 1. `di da irah bak wikidata data cuaca daerah gunong nyoe bak di acèh indonesia laos nang`
278
- 2. `nakeuh saboh propinsi acèh timu burundi rwanda madagaskar nakeuh gampông lam data peumeurèntah nakeu...`
279
- 3. `bak laman geonames data peumeurèntah nakeuh di gayo lues provinsi acèh barat pulo wèh lam`
280
 
281
  **Context Size 2:**
282
 
283
- 1. `bak laman nasa data matauroe teubiet teunom di da irah bak laman nasa data matauroe teubiet teunom`
284
- 2. `gunong nyoe nakeuh bagian nibak inggréh pangiran maurits dari beulanda natom cit meukirém surat keu ...`
285
- 3. `nyoe bak laman nasa data matauroe teubiet teunom di da irah bak laman sunrisesunset com di acèh`
286
 
287
  **Context Size 3:**
288
 
289
- 1. `gunong nyoe bak laman nasa data matauroe teubiet teunom di da irah bak laman sunrisesunset com di ac...`
290
- 2. `nyoe bak laman nasa data matauroe teubiet teunom di da irah bak laman sunrisesunset com di acèh`
291
- 3. `lumbôi gampông nyoe lam data peumeurèntah nakeuh nè di acèh timu jernih acèh timu`
292
 
293
  **Context Size 4:**
294
 
295
  1. `gunong nyoe bak laman nasa data matauroe teubiet teunom di da irah bak laman sunrisesunset com di ac...`
296
- 2. `acèh lumbôi gampông nyoe lam data peumeurèntah nakeuh nè di acèh barôh acèh barôh`
297
- 3. `gampông nyoe lam data peumeurèntah nakeuh nè di acèh rayek kawan peukan bada acèh rayek ngön nan awa...`
298
 
299
 
300
  ### Generated Text Samples (Subword-based)
@@ -303,27 +303,27 @@ Below are text samples generated from each subword-based Markov chain model:
303
 
304
  **Context Size 1:**
305
 
306
- 1. `_onirastak_lh_ak`
307
- 2. `ansa_pônng_39_n.`
308
- 3. `naneum_l_()._dam`
309
 
310
  **Context Size 2:**
311
 
312
- 1. `euh_aoyatèktiong_`
313
- 2. `_nya_droë:_teukeu`
314
- 3. `an_ak_di_istreng_`
315
 
316
  **Context Size 3:**
317
 
318
- 1. `ng_geukheungui_gam`
319
- 2. `_na_data_pranté_ab`
320
- 3. `_bak_da'irahmada_u`
321
 
322
  **Context Size 4:**
323
 
324
- 1. `euh_gampông_na_di_a`
325
- 2. `bak_encyclopedia_of`
326
- 3. `_di_tunong_nyoë,_bh`
327
 
328
 
329
  ### Key Findings
@@ -428,18 +428,18 @@ Below are text samples generated from each subword-based Markov chain model:
428
 
429
  | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
430
  |-------|-----------|----------|------------------|---------------|----------------|
431
- | **mono_32d** | 32 | 0.5616 🏆 | 0.3940 | N/A | N/A |
432
- | **mono_64d** | 64 | 0.2087 | 0.3984 | N/A | N/A |
433
- | **mono_128d** | 128 | 0.0274 | 0.4044 | N/A | N/A |
434
- | **aligned_32d** | 32 | 0.5616 | 0.4083 | 0.0220 | 0.1860 |
435
- | **aligned_64d** | 64 | 0.2087 | 0.4071 | 0.0460 | 0.2660 |
436
- | **aligned_128d** | 128 | 0.0274 | 0.4087 | 0.0440 | 0.2760 |
437
 
438
  ### Key Findings
439
 
440
- - **Best Isotropy:** mono_32d with 0.5616 (more uniform distribution)
441
- - **Semantic Density:** Average pairwise similarity of 0.4035. Lower values indicate better semantic separation.
442
- - **Alignment Quality:** Aligned models achieve up to 4.6% R@1 in cross-lingual retrieval.
443
  - **Recommendation:** 128d aligned for best cross-lingual performance
444
 
445
  ---
@@ -461,18 +461,18 @@ These are the most productive prefixes and suffixes identified by sampling the v
461
  #### Productive Prefixes
462
  | Prefix | Examples |
463
  |--------|----------|
464
- | `-me` | meusuci, meutapi, meukheuluk |
465
- | `-meu` | meusuci, meutapi, meukheuluk |
466
- | `-ge` | geuôseuha, geuplueng, geupeuhu |
467
- | `-geu` | geuôseuha, geuplueng, geupeuhu |
468
- | `-pe` | peuneujeutneuh, peutinggai, pelelangan |
469
 
470
  #### Productive Suffixes
471
  | Suffix | Examples |
472
  |--------|----------|
473
- | `-ng` | geuplueng, loyang, berperang |
474
- | `-an` | pelelangan, onekotan, kerobokan |
475
- | `-ah` | ketukah, beulasah, beudarah |
476
 
477
  ### 6.3 Bound Stems (Lexical Roots)
478
 
@@ -480,18 +480,18 @@ Bound stems are high-frequency subword units that are semantically cohesive but
480
 
481
  | Stem | Cohesion | Substitutability | Examples |
482
  |------|----------|------------------|----------|
483
- | `eung` | 1.41x | 64 contexts | meung, reung, jeung |
484
- | `uneu` | 1.70x | 28 contexts | runeu, uneun, meuneu |
485
- | `euen` | 1.53x | 38 contexts | meuen, leuen, eueng |
486
- | `euna` | 1.35x | 60 contexts | beuna, keuna, peuna |
487
- | `ubeu` | 1.43x | 22 contexts | ubeut, neubeu, keubeu |
488
- | `umeu` | 1.40x | 23 contexts | jumeu, jeumeu, geumeu |
489
- | `meur` | 1.59x | 15 contexts | meuri, meurô, meurah |
490
- | `neub` | 1.58x | 14 contexts | neuba, neubôk, neubut |
491
- | `teun` | 1.31x | 25 contexts | uteun, ateung, teuntè |
492
- | `beue` | 1.49x | 16 contexts | beuet, tabeue, abeuek |
493
- | `anga` | 1.31x | 23 contexts | langa, panga, manga |
494
- | `eune` | 1.61x | 12 contexts | jeuneh, meuneu, geuneu |
495
 
496
  ### 6.4 Affix Compatibility (Co-occurrence)
497
 
@@ -499,15 +499,15 @@ This table shows which prefixes and suffixes most frequently co-occur on the sam
499
 
500
  | Prefix | Suffix | Frequency | Examples |
501
  |--------|--------|-----------|----------|
502
- | `-pe` | `-an` | 53 words | peureumponan, pertahanan |
503
- | `-ge` | `-ng` | 52 words | geumeugabong, geutamöng |
504
- | `-me` | `-ng` | 33 words | meuulang, meunatang |
505
- | `-pe` | `-ng` | 27 words | peunayông, peudong |
506
- | `-ge` | `-ah` | 21 words | geupeuleumah, geupisah |
507
- | `-me` | `-ah` | 17 words | meubatah, meuseudeukah |
508
- | `-pe` | `-ah` | 15 words | peuneugah, peumerintah |
509
- | `-me` | `-an` | 13 words | meukawan, mediterranian |
510
- | `-ge` | `-an` | 4 words | geurakan, gerakan |
511
 
512
  ### 6.5 Recursive Morpheme Segmentation
513
 
@@ -515,21 +515,21 @@ Using **Recursive Hierarchical Substitutability**, we decompose complex words in
515
 
516
  | Word | Suggested Split | Confidence | Stem |
517
  |------|-----------------|------------|------|
518
- | geumeujuang | **`geu-meu-juang`** | 6.0 | `juang` |
519
  | geulumbang | **`geu-lumba-ng`** | 6.0 | `lumba` |
520
- | geumeunarit | **`geu-meu-narit`** | 6.0 | `narit` |
521
- | geumeuripèe | **`geu-meu-ripèe`** | 6.0 | `ripèe` |
522
  | geumeupakat | **`geu-meu-pakat`** | 6.0 | `pakat` |
523
- | geumeusipheuët | **`geu-meu-sipheuët`** | 6.0 | `sipheuët` |
524
- | geumeuduëk | **`geu-meu-duëk`** | 6.0 | `duëk` |
 
 
 
 
 
 
 
525
  | meubintéh | **`meu-bintéh`** | 4.5 | `bintéh` |
526
- | geutanyöe | **`geu-tanyöe`** | 4.5 | `tanyöe` |
527
- | geupeuriwang | **`geu-pe-uriwa-ng`** | 4.5 | `uriwa` |
528
- | meuadaptasi | **`meu-adaptasi`** | 4.5 | `adaptasi` |
529
- | geumigrasi | **`geu-migrasi`** | 4.5 | `migrasi` |
530
- | geutimbak | **`geu-timbak`** | 4.5 | `timbak` |
531
- | geupageuë | **`geu-pageuë`** | 4.5 | `pageuë` |
532
- | meutugaih | **`meu-tugaih`** | 4.5 | `tugaih` |
533
 
534
  ### 6.6 Linguistic Interpretation
535
 
@@ -763,4 +763,4 @@ MIT License - Free for academic and commercial use.
763
  ---
764
  *Generated by Wikilangs Models Pipeline*
765
 
766
- *Report Date: 2026-01-03 14:04:07*
 
36
  value: 4.925
37
  - name: best_isotropy
38
  type: isotropy
39
+ value: 0.4644
40
  - name: vocabulary_size
41
  type: vocab
42
  value: 0
 
99
 
100
  Below are sample sentences tokenized with each vocabulary size:
101
 
102
+ **Sample 1:** `Jonathan Alberto "John" Leguizamo ) nakeuh sidroe aktor asay Amirika Syarikat.`
103
 
104
  | Vocab | Tokens | Count |
105
  |-------|--------|-------|
106
+ | 8k | `▁jonathanalbert o" john "leg ui zam o ... (+9 more)` | 19 |
107
+ | 16k | `▁jonathanalbert o" john "leg ui zam o ... (+9 more)` | 19 |
108
+ | 32k | `▁jonathanalberto" john "leg uizamo ▁–)nakeuh ... (+6 more)` | 16 |
109
+ | 64k | `▁jonathanalberto" john "leguizamo ▁– )nakeuhsidroe ... (+5 more)` | 15 |
110
 
111
+ **Sample 2:** `Spencer Breslin nakeuh sidroe aktor asay Amirika Utara.`
112
 
113
  | Vocab | Tokens | Count |
114
  |-------|--------|-------|
115
+ | 8k | `▁sp en cerbr es lin nakeuhsidroeaktorasay ... (+3 more)` | 13 |
116
+ | 16k | `▁sp en cerbr es lin nakeuhsidroeaktorasay ... (+3 more)` | 13 |
117
+ | 32k | `▁spencerbr es lin ▁nakeuh ▁sidroeaktorasayamirikautara ... (+1 more)` | 11 |
118
+ | 64k | `▁spencerbreslin ▁nakeuh ▁sidroeaktorasayamirikautara .` | 9 |
119
 
120
+ **Sample 3:** `Pasi Mali nakeuh saboh gampông nyang na lam keucamatan Woyla Barat, Kabupaten Ac...`
121
 
122
  | Vocab | Tokens | Count |
123
  |-------|--------|-------|
124
+ | 8k | `▁pasimali ▁nakeuh ▁saboh ▁gampông ▁nyangnalam ▁keucamatanwoyla ... (+11 more)` | 21 |
125
+ | 16k | `▁pasimali ▁nakeuh ▁saboh ▁gampông ▁nyangnalamkeucamatanwoyla ... (+11 more)` | 21 |
126
+ | 32k | `▁pasimali ▁nakeuh ▁saboh ▁gampông ▁nyangnalamkeucamatanwoyla ... (+11 more)` | 21 |
127
+ | 64k | `▁pasimali ▁nakeuh ▁saboh ▁gampông ▁nyangnalamkeucamatanwoyla ... (+11 more)` | 21 |
128
 
129
 
130
  ### Key Findings
 
176
  | 2 | `nyoe bak laman` | 3,694 |
177
  | 3 | `lumbôi gampông nyoe` | 3,567 |
178
  | 4 | `acèh lumbôi gampông` | 3,564 |
179
+ | 5 | `nyoe lam data` | 3,499 |
180
 
181
  **4-grams (Word):**
182
 
 
184
  |------|--------|-------|
185
  | 1 | `gunong nyoe bak laman` | 3,694 |
186
  | 2 | `acèh lumbôi gampông nyoe` | 3,564 |
187
+ | 3 | `lam data peumeurèntah nakeuh` | 3,499 |
188
+ | 4 | `nyoe lam data peumeurèntah` | 3,499 |
189
  | 5 | `gampông nyoe lam data` | 3,499 |
190
 
191
  **5-grams (Word):**
192
 
193
  | Rank | N-gram | Count |
194
  |------|--------|-------|
195
+ | 1 | `nyoe lam data peumeurèntah nakeuh` | 3,499 |
196
+ | 2 | `gampông nyoe lam data peumeurèntah` | 3,499 |
197
  | 3 | `lumbôi gampông nyoe lam data` | 3,498 |
198
  | 4 | `acèh lumbôi gampông nyoe lam` | 3,495 |
199
  | 5 | `lam data peumeurèntah nakeuh nè` | 3,489 |
 
274
 
275
  **Context Size 1:**
276
 
277
+ 1. `di ateuh keude neulop ii dari mèssana strabô ngön sichuan jinoë sukèë calameae aseuli 苗族 haraih`
278
+ 2. `nakeuh saboh gampông nyoe bak wikidata data peumeurèntah nakeuh saboh spèsiès nibak volume 82 nibak ...`
279
+ 3. `bak laman sunrisesunset com di jeupun shogakkukan seuneubeuet bak laman sunrisesunset com di s...`
280
 
281
  **Context Size 2:**
282
 
283
+ 1. `bak laman nasa data matauroe teubiet teunom di da irah bak laman geonames data gunong nyoe bak`
284
+ 2. `gunong nyoe bak laman nasa data matauroe teubiet teunom di da irah ajyad 500 ngon 700 meté`
285
+ 3. `nyoe bak wikidata data cuaca daerah gunong nyoe bak wikidata data cuaca daerah gunong nyoe bak laman`
286
 
287
  **Context Size 3:**
288
 
289
+ 1. `gunong nyoe bak wikidata data cuaca daerah gunong nyoe bak laman nasa data matauroe teubiet teunom d...`
290
+ 2. `nyoe bak laman geonames data gunong nyoe bak laman geonames data gunong nyoe bak wikidata data cuaca...`
291
+ 3. `lumbôi gampông nyoe lam data peumeurèntah nakeuh nè di pidie pidie`
292
 
293
  **Context Size 4:**
294
 
295
  1. `gunong nyoe bak laman nasa data matauroe teubiet teunom di da irah bak laman sunrisesunset com di ac...`
296
+ 2. `acèh lumbôi gampông nyoe lam data peumeurèntah nakeuh nè di acèh rayek acèh rayek`
297
+ 3. `gampông nyoe lam data peumeurèntah nakeuh nè di acèh seulatan raja acèh seulatan`
298
 
299
 
300
  ### Generated Text Samples (Subword-based)
 
303
 
304
  **Context Size 1:**
305
 
306
+ 1. `_peulopôt_onohoo`
307
+ 2. `acoeuh_dd_teumph`
308
+ 3. `nta'ôn_1,_ba),_b`
309
 
310
  **Context Size 2:**
311
 
312
+ 1. `eurènteuh_nè_deuh`
313
+ 2. `_nakeuneuropinak_`
314
+ 3. `an_acilife_39_nya`
315
 
316
  **Context Size 3:**
317
 
318
+ 1. `ng_di_daerah_cuaca`
319
+ 2. `_najôh,_sha_peunaw`
320
+ 3. `_bagoë_di_kabupatè`
321
 
322
  **Context Size 4:**
323
 
324
+ 1. `euh_babah_la'èn_nya`
325
+ 2. `bak_jijak_ulee_stud`
326
+ 3. `_di_muhammouaneuh'e`
327
 
328
 
329
  ### Key Findings
 
428
 
429
  | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
430
  |-------|-----------|----------|------------------|---------------|----------------|
431
+ | **mono_32d** | 32 | 0.4644 | 0.4250 | N/A | N/A |
432
+ | **mono_64d** | 64 | 0.1432 | 0.4182 | N/A | N/A |
433
+ | **mono_128d** | 128 | 0.0251 | 0.4207 | N/A | N/A |
434
+ | **aligned_32d** | 32 | 0.4644 🏆 | 0.4392 | 0.0240 | 0.1600 |
435
+ | **aligned_64d** | 64 | 0.1432 | 0.4223 | 0.0340 | 0.2120 |
436
+ | **aligned_128d** | 128 | 0.0251 | 0.4223 | 0.0540 | 0.2900 |
437
 
438
  ### Key Findings
439
 
440
+ - **Best Isotropy:** aligned_32d with 0.4644 (more uniform distribution)
441
+ - **Semantic Density:** Average pairwise similarity of 0.4246. Lower values indicate better semantic separation.
442
+ - **Alignment Quality:** Aligned models achieve up to 5.4% R@1 in cross-lingual retrieval.
443
  - **Recommendation:** 128d aligned for best cross-lingual performance
444
 
445
  ---
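
The embedding Key Findings in the updated README can be re-derived from the six table rows in this hunk. The snippet below is a minimal sketch, not the pipeline's own code: the row values are copied from the new side of the diff, and the assumption that "Average pairwise similarity" is the plain mean of the six Semantic Density values is ours (it does reproduce the reported 0.4246).

```python
# Rows copied from the updated embedding table in this diff:
# (model, dimension, isotropy, semantic density, R@1 or None)
rows = [
    ("mono_32d", 32, 0.4644, 0.4250, None),
    ("mono_64d", 64, 0.1432, 0.4182, None),
    ("mono_128d", 128, 0.0251, 0.4207, None),
    ("aligned_32d", 32, 0.4644, 0.4392, 0.0240),
    ("aligned_64d", 64, 0.1432, 0.4223, 0.0340),
    ("aligned_128d", 128, 0.0251, 0.4223, 0.0540),
]

# Best isotropy, keeping every model that reaches it (ties are possible).
best_isotropy = max(r[2] for r in rows)
tied = [r[0] for r in rows if r[2] == best_isotropy]

# Assumption: the reported average is the unweighted mean over all six models.
mean_density = sum(r[3] for r in rows) / len(rows)

# Best R@1 among the aligned models (monolingual rows have no alignment score).
best_r1_model, best_r1 = max(
    ((r[0], r[4]) for r in rows if r[4] is not None),
    key=lambda pair: pair[1],
)

print(f"best isotropy: {best_isotropy:.4f} (models: {', '.join(tied)})")
print(f"mean semantic density: {mean_density:.4f}")
print(f"best R@1: {best_r1:.1%} ({best_r1_model})")
```

Note that `mono_32d` and `aligned_32d` tie on isotropy in the table, so which of the two is marked as best is a tie-breaking choice rather than a measured difference.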
 
461
  #### Productive Prefixes
462
  | Prefix | Examples |
463
  |--------|----------|
464
+ | `-ge` | geudapeuta, geuseutöt, geutanyoe |
465
+ | `-me` | meuubah, meuasai, meupawôt |
466
+ | `-geu` | geudapeuta, geuseutöt, geutanyoe |
467
+ | `-meu` | meuubah, meuasai, meupawôt |
468
+ | `-pe` | perdagangan, peunténg, peuradaban |
469
 
470
  #### Productive Suffixes
471
  | Suffix | Examples |
472
  |--------|----------|
473
+ | `-ng` | lambéng, peunténg, gadông |
474
+ | `-an` | perdagangan, azerbaijan, pikeran |
475
+ | `-ah` | pamarèntah, meuubah, bhah |
476
 
477
  ### 6.3 Bound Stems (Lexical Roots)
478
 
 
480
 
481
  | Stem | Cohesion | Substitutability | Examples |
482
  |------|----------|------------------|----------|
483
+ | `eung` | 1.43x | 64 contexts | reung, jeung, meung |
484
+ | `uneu` | 1.75x | 28 contexts | uneun, runeu, meuneu |
485
+ | `euna` | 1.43x | 60 contexts | keuna, beuna, peuna |
486
+ | `euen` | 1.53x | 38 contexts | leuen, eueng, meuen |
487
+ | `ubeu` | 1.48x | 22 contexts | ubeut, neubeu, keubeu |
488
+ | `umeu` | 1.43x | 23 contexts | jumeu, geumeu, jeumeu |
489
+ | `meur` | 1.61x | 15 contexts | meurô, meuri, meurak |
490
+ | `beue` | 1.55x | 16 contexts | beuet, rabeue, abeuek |
491
+ | `teun` | 1.34x | 25 contexts | uteun, ateung, teuntè |
492
+ | `neub` | 1.61x | 14 contexts | neuba, neubeu, neubôh |
493
+ | `eune` | 1.65x | 12 contexts | meuneu, seuneu, jeuneh |
494
+ | `anga` | 1.33x | 23 contexts | langa, manga, panga |
495
 
496
  ### 6.4 Affix Compatibility (Co-occurrence)
497
 
 
499
 
500
  | Prefix | Suffix | Frequency | Examples |
501
  |--------|--------|-----------|----------|
502
+ | `-ge` | `-ng` | 64 words | geumeujuang, geulumpang |
503
+ | `-pe` | `-an` | 54 words | permulaan, peumeréntahan |
504
+ | `-me` | `-ng` | 27 words | meuteureubang, meugang |
505
+ | `-pe` | `-ng` | 27 words | peuseunang, peujuang |
506
+ | `-me` | `-ah` | 21 words | meriah, meutuwah |
507
+ | `-ge` | `-ah` | 20 words | geuminah, geujajah |
508
+ | `-pe` | `-ah` | 15 words | pemerintah, peumeuréntah |
509
+ | `-me` | `-an` | 14 words | mediterranian, meurakan |
510
+ | `-ge` | `-an` | 6 words | geuritan, geulawan |
511
 
512
  ### 6.5 Recursive Morpheme Segmentation
513
 
 
515
 
516
  | Word | Suggested Split | Confidence | Stem |
517
  |------|-----------------|------------|------|
 
518
  | geulumbang | **`geu-lumba-ng`** | 6.0 | `lumba` |
519
+ | geutanyong | **`geu-tanyo-ng`** | 6.0 | `tanyo` |
 
520
  | geumeupakat | **`geu-meu-pakat`** | 6.0 | `pakat` |
521
+ | geulanggang | **`geu-langga-ng`** | 6.0 | `langga` |
522
+ | gelombang | **`ge-lomba-ng`** | 6.0 | `lomba` |
523
+ | meupangkat | **`meu-pangkat`** | 4.5 | `pangkat` |
524
+ | meuhubôngan | **`meu-hubô-ng-an`** | 4.5 | `hubô` |
525
+ | meujangeun | **`meu-jangeun`** | 4.5 | `jangeun` |
526
+ | meuneunguy | **`meu-neunguy`** | 4.5 | `neunguy` |
527
+ | meusayeuëp | **`meu-sayeuëp`** | 4.5 | `sayeuëp` |
528
+ | meupapeuen | **`meu-papeuen`** | 4.5 | `papeuen` |
529
+ | geupeuleumah | **`geu-pe-uleum-ah`** | 4.5 | `uleum` |
530
  | meubintéh | **`meu-bintéh`** | 4.5 | `bintéh` |
531
+ | meupoliték | **`meu-politék`** | 4.5 | `politék` |
532
+ | meuteukeubi | **`meu-teukeubi`** | 4.5 | `teukeubi` |
 
 
 
 
 
533
 
534
  ### 6.6 Linguistic Interpretation
535
 
 
763
  ---
764
  *Generated by Wikilangs Models Pipeline*
765
 
766
+ *Report Date: 2026-01-03 16:16:20*
models/embeddings/aligned/ace_128d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:5f03f3819979ca174222ea3d23b6d7de0b1f4b570a29fb44a02a8e759136e5c0
3
  size 1030450066
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d883b4bc83c20d273368d8d70573c8c67cae4787e5f02d7374e449b877ebff9a
3
  size 1030450066
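
Each binary in this commit is stored as a Git LFS pointer: three `key value` lines giving the spec version, the SHA-256 object id, and the size in bytes. A minimal sketch of checking a downloaded file against such a pointer is shown below; the helper names `parse_lfs_pointer` and `stream_matches_pointer` are hypothetical, and the demo uses a small in-memory payload instead of the real 1 GB file.

```python
import hashlib
import io


def parse_lfs_pointer(text):
    """Parse Git LFS pointer text into a dict (hypothetical helper)."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "oid_algo": algo,
        "oid": digest,
        "size": int(fields["size"]),
    }


def stream_matches_pointer(stream, pointer, chunk_size=1 << 20):
    """Hash a binary stream in chunks and compare size and SHA-256 to the pointer."""
    hasher = hashlib.sha256()
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        hasher.update(chunk)
        total += len(chunk)
    return total == pointer["size"] and hasher.hexdigest() == pointer["oid"]


# New-side pointer for models/embeddings/aligned/ace_128d.bin, copied from this diff.
pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:d883b4bc83c20d273368d8d70573c8c67cae4787e5f02d7374e449b877ebff9a\n"
    "size 1030450066\n"
)
print(pointer["oid_algo"], pointer["size"])

# Demo on a synthetic payload so the check runs without downloading anything.
payload = b"example payload"
demo_pointer = {
    "version": pointer["version"],
    "oid_algo": "sha256",
    "oid": hashlib.sha256(payload).hexdigest(),
    "size": len(payload),
}
print(stream_matches_pointer(io.BytesIO(payload), demo_pointer))
```

For a real file, the stream would be something like `open(path, "rb")`; reading in chunks keeps memory use flat regardless of file size.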
models/embeddings/aligned/ace_128d.projection.npy CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:88867f0cac00844ea8cc080676475c23d28d323b7517734505a58aaf2d4a6181
3
  size 65664
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b8362df9937d44fbf419131a805835b69c6edfbd9ea683352d7b63c788f1dc1d
3
  size 65664
models/embeddings/aligned/ace_32d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:9e5448de312894c92e7b4bed52b5ecf83aadef491e286bed26ea293aaac01108
3
  size 257688466
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ecfbc856f76ea1e8212443985dcbe9685906da9609756f7089491a5aacd1156a
3
  size 257688466
models/embeddings/aligned/ace_32d.projection.npy CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:52dcedd993e567a2687112252defc4b0171e62f32ccd9169e40c8f9542aa7dd6
3
  size 4224
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0363a4d6ee1f5cd96d46033b58692b835de71ce097cb1ec66805dc60ae45ff9d
3
  size 4224
models/embeddings/aligned/ace_64d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:e85918583ce1dcf80f4ed4c21ef5bf9fbbcfc6a1b4562cef02b8c398a799c834
3
  size 515275666
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:88c2ad986d148a05ac992601ce4ce0f2296e8d009e4f853eb686c870e6386abe
3
  size 515275666
models/embeddings/aligned/ace_64d.projection.npy CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:507f0da0845b6878874d0c64c9cc6d2df612728219f1020eb11838e0d5193067
3
  size 16512
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ca60fb8f7490338e4711224100d81fca8a62db223dc0db37e329e0b72d289735
3
  size 16512
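
The three `*.projection.npy` pointer sizes above are consistent with a square d×d matrix of 4-byte values plus a fixed header. This is only an inference from the sizes, not something the commit states: the 128-byte header and the float32 element type in the sketch below are both assumptions.

```python
# Pointer sizes for the *.projection.npy files, copied from this diff.
observed = {32: 4224, 64: 16512, 128: 65664}

NPY_HEADER_BYTES = 128  # assumption: usual .npy header size for a small 2-D array
FLOAT32_BYTES = 4       # assumption: entries are stored as float32


def expected_size(dim):
    """File size of a dim x dim float32 matrix saved as .npy, under the assumptions above."""
    return NPY_HEADER_BYTES + dim * dim * FLOAT32_BYTES


for dim, size in observed.items():
    print(dim, size, expected_size(dim), size == expected_size(dim))
```

The sizes are unchanged between the old and new pointers, so under this reading only the matrix values changed in this commit, not the shape.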
models/embeddings/monolingual/ace_128d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:5f03f3819979ca174222ea3d23b6d7de0b1f4b570a29fb44a02a8e759136e5c0
3
  size 1030450066
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d883b4bc83c20d273368d8d70573c8c67cae4787e5f02d7374e449b877ebff9a
3
  size 1030450066
models/embeddings/monolingual/ace_32d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:9e5448de312894c92e7b4bed52b5ecf83aadef491e286bed26ea293aaac01108
3
  size 257688466
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ecfbc856f76ea1e8212443985dcbe9685906da9609756f7089491a5aacd1156a
3
  size 257688466
models/embeddings/monolingual/ace_64d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:e85918583ce1dcf80f4ed4c21ef5bf9fbbcfc6a1b4562cef02b8c398a799c834
3
  size 515275666
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:88c2ad986d148a05ac992601ce4ce0f2296e8d009e4f853eb686c870e6386abe
3
  size 515275666
models/subword_markov/ace_markov_ctx1_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:3e072c754c56da9e95c9c200c695cd6b471ee6689a0e1a698d30935c43cdafe4
3
- size 59998
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0843e784a3c44797061e644376193419d686e1b600157cd27459e0c99de46709
3
+ size 59838
models/subword_markov/ace_markov_ctx2_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:e6afbd7f0d3da86c66f7176070cff11e496daaa5c85cd0f88eec18a4b951b764
3
- size 268894
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3f0dda851452bd6c562a837afc9323bd8f1e928f6ea5385ceba00a34e7ebdd3c
3
+ size 268837
models/subword_markov/ace_markov_ctx3_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:99d97f9386bffc3efb8fe0009cd6a6ed6d18de98abe9815278c8893869470442
3
- size 892058
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7e726e8769caf10a2581704e335a780eb9237763d9dbd8cbff310b0ad7e82f4a
3
+ size 884478
models/subword_markov/ace_markov_ctx4_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:5c75772e985795218463a1d4dbea72b935f78d17de04a246cd54921d15bd6127
3
- size 2090712
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:29e9e2ab394a7c3c2ca5d66a115d4203633a37027f0cd58199d3aef932c15860
3
+ size 2085280
models/subword_ngram/ace_2gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:ee94d573e335a2d4be7247c2a12242fe807e58afabb45eeefc19c58537fc242c
3
- size 30922
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:98e4a840f470720f6c704be45037ecbd3f668174a398aaef3fb50dc488917e5e
3
+ size 30958
models/subword_ngram/ace_3gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:6089db142d0a01a8ea615485f0561eafea4b0f9e38c1b90e77d230871c2eb996
3
- size 178914
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d0757d70ef038b857f144dcbeefc6c2800a5bc25c0d422ff3849f15d066f9e67
3
+ size 177929
models/subword_ngram/ace_4gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:23c45b8e33584a590113811267828deedd9f32c90ce26cd12b20b61ad8507758
3
- size 711121
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:891400b4a7bc0c312552dcf98a394200ef1365b0003c570bae7586bac9cfb0c0
3
+ size 708211
models/subword_ngram/ace_5gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:b59e6034705aedd81a216a5c9cafd068958aef9196abf3285e0a0a7b64325b24
3
- size 1335561
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:98b5cbba679bc0f71f1a1f2a36cab1d26f1d71b6a0a26b02f9ef25a397c78493
3
+ size 1332174
models/tokenizer/ace_tokenizer_16k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:72d32d166640b817121ffe9e19b8c9c7a5746b6b0771e6230d16250f372188cf
3
  size 504006
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:da6b854819c4593c73ca58c84646cac110bf58808734ac20b3ea8a7c8b9b1b47
3
  size 504006
models/tokenizer/ace_tokenizer_32k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:165e0dfc018dd109d7f186031ebab0ecf3322a59360785ef69e9cc6df607a025
3
  size 784687
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f6eb04b27a3d5c3e7a7e62769e64e03cd39984f47676f3dd402251423f483a96
3
  size 784687
models/tokenizer/ace_tokenizer_64k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:3edc9a3b51a0f7207be7594e5268c9ae5beb9e82faeccbc7d45b72c6f6c8dd74
3
  size 1329031
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:fd440f36b64ad7821f3862226bcd4974075610c692d67e00fbbaec7a8a537943
3
  size 1329031
models/tokenizer/ace_tokenizer_8k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:5a75923b49929bbad8c27677f13bfd6d07a5e1e8f2cf1c69010042def7150a74
3
  size 371090
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:424d8aa725c78749213aa6208831872c80e7e3f739fade05386b15a1c98530e7
3
  size 371090
models/word_markov/ace_markov_ctx1_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:948199af90e300f25e496a4eedb3192c80a7bfe576a56d57660902469ac200fa
3
- size 1247545
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:eadb9b2f15495f73d09af99664033e1e0c0235f7d8c4ba07cf27cb7b05b17737
3
+ size 1233510
models/word_markov/ace_markov_ctx2_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:626d87fcc6d59b74ebbbe7bd7226acbac283028cc291478986933611b278d2c7
3
- size 2658404
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7b7f2220fb29e9d761771a6a54d74835c3eae5a1a308b46bfe2ccb587428bf51
3
+ size 2653269
models/word_markov/ace_markov_ctx3_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:201982069b944a95eb903056c9f6c77533e31fa1294d47511a801b21679e52b1
3
- size 3581524
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e2af55670b0662e2664ab09f4000207a9d7e3e29a5b5fd0355358cecf0458366
3
+ size 3580475
models/word_markov/ace_markov_ctx4_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:903b3658ca01ecb644fe7d5b4795460954aa13c7a962108a2954743ed32cb7ad
3
- size 4121318
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:45255b0a8a4c6616877dc66cebdaa4940d17126b507ef6ae54f5a61390971710
3
+ size 4142911
models/word_ngram/ace_2gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:99f5102e2d051f8ccdd7f6d43c2adb8284a8ed8f4526019131a5389491b7d8cc
3
- size 98390
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:03d7f009a4a547b72a287c7dbd8ab2222595d886825d272789e1129ba88c3366
3
+ size 98660
models/word_ngram/ace_3gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:1683788d4d759af290fa472f98fdb4e5d19ec68aa26f75e98240d0948bd1b39d
3
- size 130668
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b6b63af10f6cebaff64da46911dcc2220ccc5ba4f5186ac03275baef06874c4f
3
+ size 129641
models/word_ngram/ace_4gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:b21ca904ad127ee2b2b13c4e6d2cfda4fcd5c39cb1025228125b5e656c3d627d
3
- size 212840
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5a3d4aa550cfc8103b9ac592b2298a37a95afa2b34a16ee69ad492ef99a7e419
3
+ size 212032
models/word_ngram/ace_5gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:ff9f326cfa19762caa49d687f8ece3735f17303f69fae9914000d02210df03db
3
- size 170303
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:093108f7f43d792e52a0193cca5efec09a15636f44c561d2c1fc96ae850fc1a1
3
+ size 169986
visualizations/embedding_alignment_quality.png CHANGED
visualizations/embedding_isotropy.png CHANGED
visualizations/embedding_norms.png CHANGED
visualizations/embedding_similarity.png CHANGED

Git LFS Details

  • SHA256: 25306a006adbe95ea669e430ee7f9fcacab0f3ad9a4036a7040e83155d4acb35
  • Pointer size: 131 Bytes
  • Size of remote file: 162 kB

Git LFS Details

  • SHA256: 5b4dd7c3ee62c5b070dc7844b1053b7c8c9e37fb43ab75d5bde762adeeea7cf3
  • Pointer size: 131 Bytes
  • Size of remote file: 169 kB
visualizations/embedding_tsne_multilingual.png CHANGED

Git LFS Details

  • SHA256: 88342ef0f77425ba7a47a175685700c2ddaee0800e541e16377e5606a866441d
  • Pointer size: 131 Bytes
  • Size of remote file: 269 kB

Git LFS Details

  • SHA256: 2ef96a52d5b0317e8ac36013a978747c4571e8a0cd16347727c70eb65edd2e7b
  • Pointer size: 131 Bytes
  • Size of remote file: 255 kB
visualizations/performance_dashboard.png CHANGED

Git LFS Details

  • SHA256: 74e37b8d5651ada87adeacf6ee916ac666b9ffe0a3d5439ce61144dd5532204d
  • Pointer size: 131 Bytes
  • Size of remote file: 370 kB

Git LFS Details

  • SHA256: 7a4410d9d9e128e3e3866e22cea3d21624c68efa66dcdaaf6a398b32263bea6d
  • Pointer size: 131 Bytes
  • Size of remote file: 365 kB
visualizations/position_encoding_comparison.png CHANGED

Git LFS Details

  • SHA256: 00a6475508effacc5c79af4a6fc2b84034830fe49cf3d518837dd1e8f5f1a748
  • Pointer size: 131 Bytes
  • Size of remote file: 117 kB

Git LFS Details

  • SHA256: 2cfe9eeae2da66caeb6af189f5988cc3ea56c6380a7674b59113391bdbb60244
  • Pointer size: 131 Bytes
  • Size of remote file: 117 kB
visualizations/tsne_sentences.png CHANGED

Git LFS Details

  • SHA256: 0f070be7d7fb43a0d2d2e16409f2c922c1e2b7f934cdc5c99edf6b06344d0ff0
  • Pointer size: 131 Bytes
  • Size of remote file: 271 kB

Git LFS Details

  • SHA256: 426e624d7af2fc700e64f678aada5d4d39a61dd82582a65e60f797fea27ff6f0
  • Pointer size: 131 Bytes
  • Size of remote file: 273 kB
visualizations/tsne_words.png CHANGED

Git LFS Details

  • SHA256: 53996555e7ca04f4096f43a98638b1a7ffc63d46b1494ac2788b8029b78e3d28
  • Pointer size: 131 Bytes
  • Size of remote file: 717 kB

Git LFS Details

  • SHA256: fdcd1e6597fb92bca1b08040d6ae7ff0e8ad82c234e3caa2bc87cb5003350692
  • Pointer size: 131 Bytes
  • Size of remote file: 712 kB