Upload all models and assets for ace (latest)
- README.md +95 -95
- models/embeddings/aligned/ace_128d.bin +1 -1
- models/embeddings/aligned/ace_128d.projection.npy +1 -1
- models/embeddings/aligned/ace_32d.bin +1 -1
- models/embeddings/aligned/ace_32d.projection.npy +1 -1
- models/embeddings/aligned/ace_64d.bin +1 -1
- models/embeddings/aligned/ace_64d.projection.npy +1 -1
- models/embeddings/monolingual/ace_128d.bin +1 -1
- models/embeddings/monolingual/ace_32d.bin +1 -1
- models/embeddings/monolingual/ace_64d.bin +1 -1
- models/subword_markov/ace_markov_ctx1_subword.parquet +2 -2
- models/subword_markov/ace_markov_ctx2_subword.parquet +2 -2
- models/subword_markov/ace_markov_ctx3_subword.parquet +2 -2
- models/subword_markov/ace_markov_ctx4_subword.parquet +2 -2
- models/subword_ngram/ace_2gram_subword.parquet +2 -2
- models/subword_ngram/ace_3gram_subword.parquet +2 -2
- models/subword_ngram/ace_4gram_subword.parquet +2 -2
- models/subword_ngram/ace_5gram_subword.parquet +2 -2
- models/tokenizer/ace_tokenizer_16k.model +1 -1
- models/tokenizer/ace_tokenizer_32k.model +1 -1
- models/tokenizer/ace_tokenizer_64k.model +1 -1
- models/tokenizer/ace_tokenizer_8k.model +1 -1
- models/word_markov/ace_markov_ctx1_word.parquet +2 -2
- models/word_markov/ace_markov_ctx2_word.parquet +2 -2
- models/word_markov/ace_markov_ctx3_word.parquet +2 -2
- models/word_markov/ace_markov_ctx4_word.parquet +2 -2
- models/word_ngram/ace_2gram_word.parquet +2 -2
- models/word_ngram/ace_3gram_word.parquet +2 -2
- models/word_ngram/ace_4gram_word.parquet +2 -2
- models/word_ngram/ace_5gram_word.parquet +2 -2
- visualizations/embedding_alignment_quality.png +0 -0
- visualizations/embedding_isotropy.png +0 -0
- visualizations/embedding_norms.png +0 -0
- visualizations/embedding_similarity.png +2 -2
- visualizations/embedding_tsne_multilingual.png +2 -2
- visualizations/performance_dashboard.png +2 -2
- visualizations/position_encoding_comparison.png +2 -2
- visualizations/tsne_sentences.png +2 -2
- visualizations/tsne_words.png +2 -2
README.md
CHANGED
|
@@ -36,7 +36,7 @@ metrics:
|
|
| 36 |
value: 4.925
|
| 37 |
- name: best_isotropy
|
| 38 |
type: isotropy
|
| 39 |
-
value: 0.
|
| 40 |
- name: vocabulary_size
|
| 41 |
type: vocab
|
| 42 |
value: 0
|
|
@@ -99,32 +99,32 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
|
|
| 99 |
|
| 100 |
Below are sample sentences tokenized with each vocabulary size:
|
| 101 |
|
| 102 |
-
**Sample 1:** `
|
| 103 |
|
| 104 |
| Vocab | Tokens | Count |
|
| 105 |
|-------|--------|-------|
|
| 106 |
-
| 8k | `▁
|
| 107 |
-
| 16k | `▁
|
| 108 |
-
| 32k | `▁
|
| 109 |
-
| 64k | `▁
|
| 110 |
|
| 111 |
-
**Sample 2:** `
|
| 112 |
|
| 113 |
| Vocab | Tokens | Count |
|
| 114 |
|-------|--------|-------|
|
| 115 |
-
| 8k | `▁
|
| 116 |
-
| 16k | `▁
|
| 117 |
-
| 32k | `▁
|
| 118 |
-
| 64k | `▁
|
| 119 |
|
| 120 |
-
**Sample 3:** `
|
| 121 |
|
| 122 |
| Vocab | Tokens | Count |
|
| 123 |
|-------|--------|-------|
|
| 124 |
-
| 8k | `▁
|
| 125 |
-
| 16k | `▁
|
| 126 |
-
| 32k | `▁
|
| 127 |
-
| 64k | `▁
|
| 128 |
|
| 129 |
|
| 130 |
### Key Findings
|
|
@@ -176,7 +176,7 @@ Below are sample sentences tokenized with each vocabulary size:
|
|
| 176 |
| 2 | `nyoe bak laman` | 3,694 |
|
| 177 |
| 3 | `lumbôi gampông nyoe` | 3,567 |
|
| 178 |
| 4 | `acèh lumbôi gampông` | 3,564 |
|
| 179 |
-
| 5 | `lam data
|
| 180 |
|
| 181 |
**4-grams (Word):**
|
| 182 |
|
|
@@ -184,16 +184,16 @@ Below are sample sentences tokenized with each vocabulary size:
|
|
| 184 |
|------|--------|-------|
|
| 185 |
| 1 | `gunong nyoe bak laman` | 3,694 |
|
| 186 |
| 2 | `acèh lumbôi gampông nyoe` | 3,564 |
|
| 187 |
-
| 3 | `
|
| 188 |
-
| 4 | `lam data peumeurèntah
|
| 189 |
| 5 | `gampông nyoe lam data` | 3,499 |
|
| 190 |
|
| 191 |
**5-grams (Word):**
|
| 192 |
|
| 193 |
| Rank | N-gram | Count |
|
| 194 |
|------|--------|-------|
|
| 195 |
-
| 1 | `
|
| 196 |
-
| 2 | `nyoe lam data peumeurèntah
|
| 197 |
| 3 | `lumbôi gampông nyoe lam data` | 3,498 |
|
| 198 |
| 4 | `acèh lumbôi gampông nyoe lam` | 3,495 |
|
| 199 |
| 5 | `lam data peumeurèntah nakeuh nè` | 3,489 |
|
|
@@ -274,27 +274,27 @@ Below are text samples generated from each word-based Markov chain model:
|
|
| 274 |
|
| 275 |
**Context Size 1:**
|
| 276 |
|
| 277 |
-
1. `di
|
| 278 |
-
2. `nakeuh saboh
|
| 279 |
-
3. `bak laman
|
| 280 |
|
| 281 |
**Context Size 2:**
|
| 282 |
|
| 283 |
-
1. `bak laman nasa data matauroe teubiet teunom di da irah bak laman
|
| 284 |
-
2. `gunong nyoe
|
| 285 |
-
3. `nyoe bak
|
| 286 |
|
| 287 |
**Context Size 3:**
|
| 288 |
|
| 289 |
-
1. `gunong nyoe bak
|
| 290 |
-
2. `nyoe bak laman
|
| 291 |
-
3. `lumbôi gampông nyoe lam data peumeurèntah nakeuh nè di
|
| 292 |
|
| 293 |
**Context Size 4:**
|
| 294 |
|
| 295 |
1. `gunong nyoe bak laman nasa data matauroe teubiet teunom di da irah bak laman sunrisesunset com di ac...`
|
| 296 |
-
2. `acèh lumbôi gampông nyoe lam data peumeurèntah nakeuh nè di acèh
|
| 297 |
-
3. `gampông nyoe lam data peumeurèntah nakeuh nè di acèh
|
| 298 |
|
| 299 |
|
| 300 |
### Generated Text Samples (Subword-based)
|
|
@@ -303,27 +303,27 @@ Below are text samples generated from each subword-based Markov chain model:
|
|
| 303 |
|
| 304 |
**Context Size 1:**
|
| 305 |
|
| 306 |
-
1. `
|
| 307 |
-
2. `
|
| 308 |
-
3. `
|
| 309 |
|
| 310 |
**Context Size 2:**
|
| 311 |
|
| 312 |
-
1. `
|
| 313 |
-
2. `
|
| 314 |
-
3. `
|
| 315 |
|
| 316 |
**Context Size 3:**
|
| 317 |
|
| 318 |
-
1. `
|
| 319 |
-
2. `
|
| 320 |
-
3. `
|
| 321 |
|
| 322 |
**Context Size 4:**
|
| 323 |
|
| 324 |
-
1. `
|
| 325 |
-
2. `
|
| 326 |
-
3. `
|
| 327 |
|
| 328 |
|
| 329 |
### Key Findings
|
|
@@ -428,18 +428,18 @@ Below are text samples generated from each subword-based Markov chain model:
|
|
| 428 |
|
| 429 |
| Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
|
| 430 |
|-------|-----------|----------|------------------|---------------|----------------|
|
| 431 |
-
| **mono_32d** | 32 | 0.
|
| 432 |
-
| **mono_64d** | 64 | 0.
|
| 433 |
-
| **mono_128d** | 128 | 0.
|
| 434 |
-
| **aligned_32d** | 32 | 0.
|
| 435 |
-
| **aligned_64d** | 64 | 0.
|
| 436 |
-
| **aligned_128d** | 128 | 0.
|
| 437 |
|
| 438 |
### Key Findings
|
| 439 |
|
| 440 |
-
- **Best Isotropy:**
|
| 441 |
-
- **Semantic Density:** Average pairwise similarity of 0.
|
| 442 |
-
- **Alignment Quality:** Aligned models achieve up to 4
|
| 443 |
- **Recommendation:** 128d aligned for best cross-lingual performance
|
| 444 |
|
| 445 |
---
|
|
@@ -461,18 +461,18 @@ These are the most productive prefixes and suffixes identified by sampling the v
|
|
| 461 |
#### Productive Prefixes
|
| 462 |
| Prefix | Examples |
|
| 463 |
|--------|----------|
|
| 464 |
-
| `-
|
| 465 |
-
| `-
|
| 466 |
-
| `-
|
| 467 |
-
| `-
|
| 468 |
-
| `-pe` |
|
| 469 |
|
| 470 |
#### Productive Suffixes
|
| 471 |
| Suffix | Examples |
|
| 472 |
|--------|----------|
|
| 473 |
-
| `-ng` |
|
| 474 |
-
| `-an` |
|
| 475 |
-
| `-ah` |
|
| 476 |
|
| 477 |
### 6.3 Bound Stems (Lexical Roots)
|
| 478 |
|
|
@@ -480,18 +480,18 @@ Bound stems are high-frequency subword units that are semantically cohesive but
|
|
| 480 |
|
| 481 |
| Stem | Cohesion | Substitutability | Examples |
|
| 482 |
|------|----------|------------------|----------|
|
| 483 |
-
| `eung` | 1.
|
| 484 |
-
| `uneu` | 1.
|
| 485 |
-
| `
|
| 486 |
-
| `
|
| 487 |
-
| `ubeu` | 1.
|
| 488 |
-
| `umeu` | 1.
|
| 489 |
-
| `meur` | 1.
|
| 490 |
-
| `
|
| 491 |
-
| `teun` | 1.
|
| 492 |
-
| `
|
| 493 |
-
| `
|
| 494 |
-
| `
|
| 495 |
|
| 496 |
### 6.4 Affix Compatibility (Co-occurrence)
|
| 497 |
|
|
@@ -499,15 +499,15 @@ This table shows which prefixes and suffixes most frequently co-occur on the sam
|
|
| 499 |
|
| 500 |
| Prefix | Suffix | Frequency | Examples |
|
| 501 |
|--------|--------|-----------|----------|
|
| 502 |
-
| `-
|
| 503 |
-
| `-
|
| 504 |
-
| `-me` | `-ng` |
|
| 505 |
-
| `-pe` | `-ng` | 27 words |
|
| 506 |
-
| `-
|
| 507 |
-
| `-
|
| 508 |
-
| `-pe` | `-ah` | 15 words |
|
| 509 |
-
| `-me` | `-an` |
|
| 510 |
-
| `-ge` | `-an` |
|
| 511 |
|
| 512 |
### 6.5 Recursive Morpheme Segmentation
|
| 513 |
|
|
@@ -515,21 +515,21 @@ Using **Recursive Hierarchical Substitutability**, we decompose complex words in
|
|
| 515 |
|
| 516 |
| Word | Suggested Split | Confidence | Stem |
|
| 517 |
|------|-----------------|------------|------|
|
| 518 |
-
| geumeujuang | **`geu-meu-juang`** | 6.0 | `juang` |
|
| 519 |
| geulumbang | **`geu-lumba-ng`** | 6.0 | `lumba` |
|
| 520 |
-
|
|
| 521 |
-
| geumeuripèe | **`geu-meu-ripèe`** | 6.0 | `ripèe` |
|
| 522 |
| geumeupakat | **`geu-meu-pakat`** | 6.0 | `pakat` |
|
| 523 |
-
|
|
| 524 |
-
|
|
| 525 |
| meubintéh | **`meu-bintéh`** | 4.5 | `bintéh` |
|
| 526 |
-
|
|
| 527 |
-
|
|
| 528 |
-
| meuadaptasi | **`meu-adaptasi`** | 4.5 | `adaptasi` |
|
| 529 |
-
| geumigrasi | **`geu-migrasi`** | 4.5 | `migrasi` |
|
| 530 |
-
| geutimbak | **`geu-timbak`** | 4.5 | `timbak` |
|
| 531 |
-
| geupageuë | **`geu-pageuë`** | 4.5 | `pageuë` |
|
| 532 |
-
| meutugaih | **`meu-tugaih`** | 4.5 | `tugaih` |
|
| 533 |
|
| 534 |
### 6.6 Linguistic Interpretation
|
| 535 |
|
|
@@ -763,4 +763,4 @@ MIT License - Free for academic and commercial use.
|
|
| 763 |
---
|
| 764 |
*Generated by Wikilangs Models Pipeline*
|
| 765 |
|
| 766 |
-
*Report Date: 2026-01-03
|
|
|
|
| 36 |
value: 4.925
|
| 37 |
- name: best_isotropy
|
| 38 |
type: isotropy
|
| 39 |
+
value: 0.4644
|
| 40 |
- name: vocabulary_size
|
| 41 |
type: vocab
|
| 42 |
value: 0
|
|
|
|
| 99 |
|
| 100 |
Below are sample sentences tokenized with each vocabulary size:
|
| 101 |
|
| 102 |
+
**Sample 1:** `Jonathan Alberto "John" Leguizamo – ) nakeuh sidroe aktor asay Amirika Syarikat.`
|
| 103 |
|
| 104 |
| Vocab | Tokens | Count |
|
| 105 |
|-------|--------|-------|
|
| 106 |
+
| 8k | `▁jonathan ▁albert o ▁" john " ▁leg ui zam o ... (+9 more)` | 19 |
|
| 107 |
+
| 16k | `▁jonathan ▁albert o ▁" john " ▁leg ui zam o ... (+9 more)` | 19 |
|
| 108 |
+
| 32k | `▁jonathan ▁alberto ▁" john " ▁leg uizamo ▁– ▁) ▁nakeuh ... (+6 more)` | 16 |
|
| 109 |
+
| 64k | `▁jonathan ▁alberto ▁" john " ▁leguizamo ▁– ▁) ▁nakeuh ▁sidroe ... (+5 more)` | 15 |
|
| 110 |
|
| 111 |
+
**Sample 2:** `Spencer Breslin nakeuh sidroe aktor asay Amirika Utara.`
|
| 112 |
|
| 113 |
| Vocab | Tokens | Count |
|
| 114 |
|-------|--------|-------|
|
| 115 |
+
| 8k | `▁sp en cer ▁br es lin ▁nakeuh ▁sidroe ▁aktor ▁asay ... (+3 more)` | 13 |
|
| 116 |
+
| 16k | `▁sp en cer ▁br es lin ▁nakeuh ▁sidroe ▁aktor ▁asay ... (+3 more)` | 13 |
|
| 117 |
+
| 32k | `▁spencer ▁br es lin ▁nakeuh ▁sidroe ▁aktor ▁asay ▁amirika ▁utara ... (+1 more)` | 11 |
|
| 118 |
+
| 64k | `▁spencer ▁breslin ▁nakeuh ▁sidroe ▁aktor ▁asay ▁amirika ▁utara .` | 9 |
|
| 119 |
|
| 120 |
+
**Sample 3:** `Pasi Mali nakeuh saboh gampông nyang na lam keucamatan Woyla Barat, Kabupaten Ac...`
|
| 121 |
|
| 122 |
| Vocab | Tokens | Count |
|
| 123 |
|-------|--------|-------|
|
| 124 |
+
| 8k | `▁pasi ▁mali ▁nakeuh ▁saboh ▁gampông ▁nyang ▁na ▁lam ▁keucamatan ▁woyla ... (+11 more)` | 21 |
|
| 125 |
+
| 16k | `▁pasi ▁mali ▁nakeuh ▁saboh ▁gampông ▁nyang ▁na ▁lam ▁keucamatan ▁woyla ... (+11 more)` | 21 |
|
| 126 |
+
| 32k | `▁pasi ▁mali ▁nakeuh ▁saboh ▁gampông ▁nyang ▁na ▁lam ▁keucamatan ▁woyla ... (+11 more)` | 21 |
|
| 127 |
+
| 64k | `▁pasi ▁mali ▁nakeuh ▁saboh ▁gampông ▁nyang ▁na ▁lam ▁keucamatan ▁woyla ... (+11 more)` | 21 |
|
| 128 |
|
| 129 |
|
| 130 |
### Key Findings
|
|
|
|
| 176 |
| 2 | `nyoe bak laman` | 3,694 |
|
| 177 |
| 3 | `lumbôi gampông nyoe` | 3,567 |
|
| 178 |
| 4 | `acèh lumbôi gampông` | 3,564 |
|
| 179 |
+
| 5 | `nyoe lam data` | 3,499 |
|
| 180 |
|
| 181 |
**4-grams (Word):**
|
| 182 |
|
|
|
|
| 184 |
|------|--------|-------|
|
| 185 |
| 1 | `gunong nyoe bak laman` | 3,694 |
|
| 186 |
| 2 | `acèh lumbôi gampông nyoe` | 3,564 |
|
| 187 |
+
| 3 | `lam data peumeurèntah nakeuh` | 3,499 |
|
| 188 |
+
| 4 | `nyoe lam data peumeurèntah` | 3,499 |
|
| 189 |
| 5 | `gampông nyoe lam data` | 3,499 |
|
| 190 |
|
| 191 |
**5-grams (Word):**
|
| 192 |
|
| 193 |
| Rank | N-gram | Count |
|
| 194 |
|------|--------|-------|
|
| 195 |
+
| 1 | `nyoe lam data peumeurèntah nakeuh` | 3,499 |
|
| 196 |
+
| 2 | `gampông nyoe lam data peumeurèntah` | 3,499 |
|
| 197 |
| 3 | `lumbôi gampông nyoe lam data` | 3,498 |
|
| 198 |
| 4 | `acèh lumbôi gampông nyoe lam` | 3,495 |
|
| 199 |
| 5 | `lam data peumeurèntah nakeuh nè` | 3,489 |
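The ranked word n-gram tables in this hunk can be approximated with a sliding-window count. This is a minimal sketch, assuming lowercased whitespace tokenization. The function name `top_word_ngrams` and the toy corpus are hypothetical and are not taken from the Wikilangs pipeline.

```python
from collections import Counter


def top_word_ngrams(sentences, n, k=5):
    """Return the k most frequent word n-grams as (ngram, count) pairs."""
    counts = Counter()
    for sentence in sentences:
        # Assumption: words are separated by whitespace and case is ignored.
        words = sentence.lower().split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts.most_common(k)


# Toy corpus built from phrases that appear in the tables (not the real data).
corpus = [
    "gunong nyoe bak laman",
    "gunong nyoe bak laman",
    "gampông nyoe lam data",
]
print(top_word_ngrams(corpus, 3, k=2))
```

Sentences shorter than `n` words contribute nothing, so the function returns an empty list when no n-gram of the requested order exists.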
|
|
|
|
| 274 |
|
| 275 |
**Context Size 1:**
|
| 276 |
|
| 277 |
+
1. `di ateuh keude neulop ii dari mèssana strabô ngön sichuan jinoë sukèë calameae aseuli 苗族 haraih`
|
| 278 |
+
2. `nakeuh saboh gampông nyoe bak wikidata data peumeurèntah nakeuh saboh spèsiès nibak volume 82 nibak ...`
|
| 279 |
+
3. `bak laman sunrisesunset com di jeupun lé shogakkukan nè seuneubeuet bak laman sunrisesunset com di s...`
|
| 280 |
|
| 281 |
**Context Size 2:**
|
| 282 |
|
| 283 |
+
1. `bak laman nasa data matauroe teubiet teunom di da irah bak laman geonames data gunong nyoe bak`
|
| 284 |
+
2. `gunong nyoe bak laman nasa data matauroe teubiet teunom di da irah ajyad 500 ngon 700 meté`
|
| 285 |
+
3. `nyoe bak wikidata data cuaca daerah gunong nyoe bak wikidata data cuaca daerah gunong nyoe bak laman`
|
| 286 |
|
| 287 |
**Context Size 3:**
|
| 288 |
|
| 289 |
+
1. `gunong nyoe bak wikidata data cuaca daerah gunong nyoe bak laman nasa data matauroe teubiet teunom d...`
|
| 290 |
+
2. `nyoe bak laman geonames data gunong nyoe bak laman geonames data gunong nyoe bak wikidata data cuaca...`
|
| 291 |
+
3. `lumbôi gampông nyoe lam data peumeurèntah nakeuh nè di pidie pidie`
|
| 292 |
|
| 293 |
**Context Size 4:**
|
| 294 |
|
| 295 |
1. `gunong nyoe bak laman nasa data matauroe teubiet teunom di da irah bak laman sunrisesunset com di ac...`
|
| 296 |
+
2. `acèh lumbôi gampông nyoe lam data peumeurèntah nakeuh nè di acèh rayek acèh rayek`
|
| 297 |
+
3. `gampông nyoe lam data peumeurèntah nakeuh nè di acèh seulatan raja acèh seulatan`
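The generated samples above come from word-level Markov chains with context sizes 1 to 4. The sketch below shows the general technique only. The names `build_chain` and `generate`, the seeding strategy, and the toy corpus are assumptions for illustration, and the stored `.parquet` models may be organized differently.

```python
import random
from collections import defaultdict


def build_chain(sentences, context_size):
    """Map each context (a tuple of words) to the words observed after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - context_size):
            context = tuple(words[i:i + context_size])
            chain[context].append(words[i + context_size])
    return chain


def generate(chain, seed, max_words=20, rng=None):
    """Extend the seed until max_words is reached or the context is unseen."""
    rng = rng or random.Random(0)
    output = list(seed)
    context_size = len(seed)
    while len(output) < max_words:
        successors = chain.get(tuple(output[-context_size:]))
        if not successors:
            break
        # Sampling from the raw successor list weights words by frequency.
        output.append(rng.choice(successors))
    return " ".join(output)


# Toy corpus (hypothetical), using words that occur in the samples above.
corpus = [
    "gunong nyoe bak laman nasa data",
    "gampông nyoe lam data peumeurèntah",
]
chain = build_chain(corpus, 2)
print(generate(chain, ("gunong", "nyoe")))
```

With a larger context size, fewer contexts have more than one successor, which is consistent with the longer verbatim runs in the Context Size 3 and 4 samples.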
|
| 298 |
|
| 299 |
|
| 300 |
### Generated Text Samples (Subword-based)
|
|
|
|
| 303 |
|
| 304 |
**Context Size 1:**
|
| 305 |
|
| 306 |
+
1. `_peulopôt_onohoo`
|
| 307 |
+
2. `acoeuh_dd_teumph`
|
| 308 |
+
3. `nta'ôn_1,_ba),_b`
|
| 309 |
|
| 310 |
**Context Size 2:**
|
| 311 |
|
| 312 |
+
1. `eurènteuh_nè_deuh`
|
| 313 |
+
2. `_nakeuneuropinak_`
|
| 314 |
+
3. `an_acilife_39_nya`
|
| 315 |
|
| 316 |
**Context Size 3:**
|
| 317 |
|
| 318 |
+
1. `ng_di_daerah_cuaca`
|
| 319 |
+
2. `_najôh,_sha_peunaw`
|
| 320 |
+
3. `_bagoë_di_kabupatè`
|
| 321 |
|
| 322 |
**Context Size 4:**
|
| 323 |
|
| 324 |
+
1. `euh_babah_la'èn_nya`
|
| 325 |
+
2. `bak_jijak_ulee_stud`
|
| 326 |
+
3. `_di_muhammouaneuh'e`
|
| 327 |
|
| 328 |
|
| 329 |
### Key Findings
|
|
|
|
| 428 |
|
| 429 |
| Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
|
| 430 |
|-------|-----------|----------|------------------|---------------|----------------|
|
| 431 |
+
| **mono_32d** | 32 | 0.4644 | 0.4250 | N/A | N/A |
|
| 432 |
+
| **mono_64d** | 64 | 0.1432 | 0.4182 | N/A | N/A |
|
| 433 |
+
| **mono_128d** | 128 | 0.0251 | 0.4207 | N/A | N/A |
|
| 434 |
+
| **aligned_32d** | 32 | 0.4644 🏆 | 0.4392 | 0.0240 | 0.1600 |
|
| 435 |
+
| **aligned_64d** | 64 | 0.1432 | 0.4223 | 0.0340 | 0.2120 |
|
| 436 |
+
| **aligned_128d** | 128 | 0.0251 | 0.4223 | 0.0540 | 0.2900 |
|
| 437 |
|
| 438 |
### Key Findings
|
| 439 |
|
| 440 |
+
- **Best Isotropy:** aligned_32d with 0.4644 (more uniform distribution)
|
| 441 |
+
- **Semantic Density:** Average pairwise similarity of 0.4246. Lower values indicate better semantic separation.
|
| 442 |
+
- **Alignment Quality:** Aligned models achieve up to 5.4% R@1 in cross-lingual retrieval.
|
| 443 |
- **Recommendation:** 128d aligned for best cross-lingual performance
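The summary figures in these findings can be checked against the table rows in this hunk. The reported 0.4246 is consistent with an unweighted mean of the six Semantic Density values. Whether the pipeline computes it exactly this way is an assumption, and the dictionary layout below is purely illustrative.

```python
# Values copied from the embedding table: (isotropy, semantic density, R@1).
# None marks the N/A entries of the monolingual models.
rows = {
    "mono_32d": (0.4644, 0.4250, None),
    "mono_64d": (0.1432, 0.4182, None),
    "mono_128d": (0.0251, 0.4207, None),
    "aligned_32d": (0.4644, 0.4392, 0.0240),
    "aligned_64d": (0.1432, 0.4223, 0.0340),
    "aligned_128d": (0.0251, 0.4223, 0.0540),
}

# Assumption: the reported average is an unweighted mean over all six models.
mean_density = sum(density for _, density, _ in rows.values()) / len(rows)
best_isotropy = max(isotropy for isotropy, _, _ in rows.values())
best_r1 = max(r1 for _, _, r1 in rows.values() if r1 is not None)

print(f"mean semantic density: {mean_density:.4f}")
print(f"best isotropy: {best_isotropy:.4f}")
print(f"best R@1: {best_r1:.1%}")
```

Note that isotropy is identical for the monolingual and aligned model of each dimension in the table, so the best value of 0.4644 is shared by `mono_32d` and `aligned_32d`.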
|
| 444 |
|
| 445 |
---
|
|
|
|
| 461 |
#### Productive Prefixes
|
| 462 |
| Prefix | Examples |
|
| 463 |
|--------|----------|
|
| 464 |
+
| `-ge` | geudapeuta, geuseutöt, geutanyoe |
|
| 465 |
+
| `-me` | meuubah, meuasai, meupawôt |
|
| 466 |
+
| `-geu` | geudapeuta, geuseutöt, geutanyoe |
|
| 467 |
+
| `-meu` | meuubah, meuasai, meupawôt |
|
| 468 |
+
| `-pe` | perdagangan, peunténg, peuradaban |
|
| 469 |
|
| 470 |
#### Productive Suffixes
|
| 471 |
| Suffix | Examples |
|
| 472 |
|--------|----------|
|
| 473 |
+
| `-ng` | lambéng, peunténg, gadông |
|
| 474 |
+
| `-an` | perdagangan, azerbaijan, pikeran |
|
| 475 |
+
| `-ah` | pamarèntah, meuubah, bhah |
|
| 476 |
|
| 477 |
### 6.3 Bound Stems (Lexical Roots)
|
| 478 |
|
|
|
|
| 480 |
|
| 481 |
| Stem | Cohesion | Substitutability | Examples |
|
| 482 |
|------|----------|------------------|----------|
|
| 483 |
+
| `eung` | 1.43x | 64 contexts | reung, jeung, meung |
|
| 484 |
+
| `uneu` | 1.75x | 28 contexts | uneun, runeu, meuneu |
|
| 485 |
+
| `euna` | 1.43x | 60 contexts | keuna, beuna, peuna |
|
| 486 |
+
| `euen` | 1.53x | 38 contexts | leuen, eueng, meuen |
|
| 487 |
+
| `ubeu` | 1.48x | 22 contexts | ubeut, neubeu, keubeu |
|
| 488 |
+
| `umeu` | 1.43x | 23 contexts | jumeu, geumeu, jeumeu |
|
| 489 |
+
| `meur` | 1.61x | 15 contexts | meurô, meuri, meurak |
|
| 490 |
+
| `beue` | 1.55x | 16 contexts | beuet, rabeue, abeuek |
|
| 491 |
+
| `teun` | 1.34x | 25 contexts | uteun, ateung, teuntè |
|
| 492 |
+
| `neub` | 1.61x | 14 contexts | neuba, neubeu, neubôh |
|
| 493 |
+
| `eune` | 1.65x | 12 contexts | meuneu, seuneu, jeuneh |
|
| 494 |
+
| `anga` | 1.33x | 23 contexts | langa, manga, panga |
|
| 495 |
|
| 496 |
### 6.4 Affix Compatibility (Co-occurrence)
|
| 497 |
|
|
|
|
| 499 |
|
| 500 |
| Prefix | Suffix | Frequency | Examples |
|
| 501 |
|--------|--------|-----------|----------|
|
| 502 |
+
| `-ge` | `-ng` | 64 words | geumeujuang, geulumpang |
|
| 503 |
+
| `-pe` | `-an` | 54 words | permulaan, peumeréntahan |
|
| 504 |
+
| `-me` | `-ng` | 27 words | meuteureubang, meugang |
|
| 505 |
+
| `-pe` | `-ng` | 27 words | peuseunang, peujuang |
|
| 506 |
+
| `-me` | `-ah` | 21 words | meriah, meutuwah |
|
| 507 |
+
| `-ge` | `-ah` | 20 words | geuminah, geujajah |
|
| 508 |
+
| `-pe` | `-ah` | 15 words | pemerintah, peumeuréntah |
|
| 509 |
+
| `-me` | `-an` | 14 words | mediterranian, meurakan |
|
| 510 |
+
| `-ge` | `-an` | 6 words | geuritan, geulawan |
|
| 511 |
|
| 512 |
### 6.5 Recursive Morpheme Segmentation
|
| 513 |
|
|
|
|
| 515 |
|
| 516 |
| Word | Suggested Split | Confidence | Stem |
|
| 517 |
|------|-----------------|------------|------|
|
|
|
|
| 518 |
| geulumbang | **`geu-lumba-ng`** | 6.0 | `lumba` |
|
| 519 |
+
| geutanyong | **`geu-tanyo-ng`** | 6.0 | `tanyo` |
|
|
|
|
| 520 |
| geumeupakat | **`geu-meu-pakat`** | 6.0 | `pakat` |
|
| 521 |
+
| geulanggang | **`geu-langga-ng`** | 6.0 | `langga` |
|
| 522 |
+
| gelombang | **`ge-lomba-ng`** | 6.0 | `lomba` |
|
| 523 |
+
| meupangkat | **`meu-pangkat`** | 4.5 | `pangkat` |
|
| 524 |
+
| meuhubôngan | **`meu-hubô-ng-an`** | 4.5 | `hubô` |
|
| 525 |
+
| meujangeun | **`meu-jangeun`** | 4.5 | `jangeun` |
|
| 526 |
+
| meuneunguy | **`meu-neunguy`** | 4.5 | `neunguy` |
|
| 527 |
+
| meusayeuëp | **`meu-sayeuëp`** | 4.5 | `sayeuëp` |
|
| 528 |
+
| meupapeuen | **`meu-papeuen`** | 4.5 | `papeuen` |
|
| 529 |
+
| geupeuleumah | **`geu-pe-uleum-ah`** | 4.5 | `uleum` |
|
| 530 |
| meubintéh | **`meu-bintéh`** | 4.5 | `bintéh` |
|
| 531 |
+
| meupoliték | **`meu-politék`** | 4.5 | `politék` |
|
| 532 |
+
| meuteukeubi | **`meu-teukeubi`** | 4.5 | `teukeubi` |
|
| 533 |
|
| 534 |
### 6.6 Linguistic Interpretation
|
| 535 |
|
|
|
|
| 763 |
---
|
| 764 |
*Generated by Wikilangs Models Pipeline*
|
| 765 |
|
| 766 |
+
*Report Date: 2026-01-03 16:16:20*
|
models/embeddings/aligned/ace_128d.bin
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 1030450066
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:d883b4bc83c20d273368d8d70573c8c67cae4787e5f02d7374e449b877ebff9a
|
| 3 |
size 1030450066
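Each binary file in this commit is tracked as a Git LFS pointer with `version`, `oid sha256:` and `size` lines, as shown in the hunk above. The following is a hedged sketch of how a downloaded file could be checked against such a pointer. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical and are not part of Git LFS or of this repository.

```python
import hashlib
import os


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer):
    """Return True if the file at path has the pointer's size and SHA-256."""
    if pointer["algo"] != "sha256":
        return False
    # Comparing sizes first avoids hashing a file that cannot match.
    if os.path.getsize(path) != pointer["size"]:
        return False
    sha = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large model files are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            sha.update(chunk)
    return sha.hexdigest() == pointer["digest"]


# Pointer text copied from the new side of the hunk above.
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:d883b4bc83c20d273368d8d70573c8c67cae4787e5f02d7374e449b877ebff9a
size 1030450066
"""
pointer = parse_lfs_pointer(pointer_text)
print(pointer["algo"], pointer["size"])
```

A call such as `matches_pointer("ace_128d.bin", pointer)` would then confirm a local copy, where the local path is an assumption about where the file was saved.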
|
models/embeddings/aligned/ace_128d.projection.npy
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 65664
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:b8362df9937d44fbf419131a805835b69c6edfbd9ea683352d7b63c788f1dc1d
|
| 3 |
size 65664
|
models/embeddings/aligned/ace_32d.bin
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 257688466
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:ecfbc856f76ea1e8212443985dcbe9685906da9609756f7089491a5aacd1156a
|
| 3 |
size 257688466
|
models/embeddings/aligned/ace_32d.projection.npy
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 4224
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:0363a4d6ee1f5cd96d46033b58692b835de71ce097cb1ec66805dc60ae45ff9d
|
| 3 |
size 4224
|
models/embeddings/aligned/ace_64d.bin
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 515275666
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:88c2ad986d148a05ac992601ce4ce0f2296e8d009e4f853eb686c870e6386abe
|
| 3 |
size 515275666
|
models/embeddings/aligned/ace_64d.projection.npy
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 16512
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:ca60fb8f7490338e4711224100d81fca8a62db223dc0db37e329e0b72d289735
|
| 3 |
size 16512
|
models/embeddings/monolingual/ace_128d.bin
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 1030450066
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:d883b4bc83c20d273368d8d70573c8c67cae4787e5f02d7374e449b877ebff9a
|
| 3 |
size 1030450066
|
models/embeddings/monolingual/ace_32d.bin
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 257688466
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:ecfbc856f76ea1e8212443985dcbe9685906da9609756f7089491a5aacd1156a
|
| 3 |
size 257688466
|
models/embeddings/monolingual/ace_64d.bin
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 515275666
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:88c2ad986d148a05ac992601ce4ce0f2296e8d009e4f853eb686c870e6386abe
|
| 3 |
size 515275666
|
models/subword_markov/ace_markov_ctx1_subword.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:0843e784a3c44797061e644376193419d686e1b600157cd27459e0c99de46709
|
| 3 |
+
size 59838
|
models/subword_markov/ace_markov_ctx2_subword.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:3f0dda851452bd6c562a837afc9323bd8f1e928f6ea5385ceba00a34e7ebdd3c
|
| 3 |
+
size 268837
|
models/subword_markov/ace_markov_ctx3_subword.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:7e726e8769caf10a2581704e335a780eb9237763d9dbd8cbff310b0ad7e82f4a
|
| 3 |
+
size 884478
|
models/subword_markov/ace_markov_ctx4_subword.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:29e9e2ab394a7c3c2ca5d66a115d4203633a37027f0cd58199d3aef932c15860
|
| 3 |
+
size 2085280
|
models/subword_ngram/ace_2gram_subword.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:98e4a840f470720f6c704be45037ecbd3f668174a398aaef3fb50dc488917e5e
|
| 3 |
+
size 30958
|
models/subword_ngram/ace_3gram_subword.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:d0757d70ef038b857f144dcbeefc6c2800a5bc25c0d422ff3849f15d066f9e67
|
| 3 |
+
size 177929
|
models/subword_ngram/ace_4gram_subword.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:891400b4a7bc0c312552dcf98a394200ef1365b0003c570bae7586bac9cfb0c0
|
| 3 |
+
size 708211
|
models/subword_ngram/ace_5gram_subword.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:98b5cbba679bc0f71f1a1f2a36cab1d26f1d71b6a0a26b02f9ef25a397c78493
|
| 3 |
+
size 1332174
|
models/tokenizer/ace_tokenizer_16k.model
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 504006
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:da6b854819c4593c73ca58c84646cac110bf58808734ac20b3ea8a7c8b9b1b47
|
| 3 |
size 504006
|
models/tokenizer/ace_tokenizer_32k.model
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 784687
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:f6eb04b27a3d5c3e7a7e62769e64e03cd39984f47676f3dd402251423f483a96
|
| 3 |
size 784687
|
models/tokenizer/ace_tokenizer_64k.model
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 1329031
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:fd440f36b64ad7821f3862226bcd4974075610c692d67e00fbbaec7a8a537943
|
| 3 |
size 1329031
|
models/tokenizer/ace_tokenizer_8k.model
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 371090
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:424d8aa725c78749213aa6208831872c80e7e3f739fade05386b15a1c98530e7
|
| 3 |
size 371090
|
models/word_markov/ace_markov_ctx1_word.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:eadb9b2f15495f73d09af99664033e1e0c0235f7d8c4ba07cf27cb7b05b17737
|
| 3 |
+
size 1233510
|
models/word_markov/ace_markov_ctx2_word.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:7b7f2220fb29e9d761771a6a54d74835c3eae5a1a308b46bfe2ccb587428bf51
|
| 3 |
+
size 2653269
|
models/word_markov/ace_markov_ctx3_word.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:e2af55670b0662e2664ab09f4000207a9d7e3e29a5b5fd0355358cecf0458366
|
| 3 |
+
size 3580475
|
models/word_markov/ace_markov_ctx4_word.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:45255b0a8a4c6616877dc66cebdaa4940d17126b507ef6ae54f5a61390971710
|
| 3 |
+
size 4142911
|
models/word_ngram/ace_2gram_word.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:03d7f009a4a547b72a287c7dbd8ab2222595d886825d272789e1129ba88c3366
|
| 3 |
+
size 98660
|
models/word_ngram/ace_3gram_word.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:b6b63af10f6cebaff64da46911dcc2220ccc5ba4f5186ac03275baef06874c4f
|
| 3 |
+
size 129641
|
models/word_ngram/ace_4gram_word.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:5a3d4aa550cfc8103b9ac592b2298a37a95afa2b34a16ee69ad492ef99a7e419
|
| 3 |
+
size 212032
|
models/word_ngram/ace_5gram_word.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:093108f7f43d792e52a0193cca5efec09a15636f44c561d2c1fc96ae850fc1a1
|
| 3 |
+
size 169986
|
visualizations/embedding_alignment_quality.png
CHANGED
|
|
visualizations/embedding_isotropy.png
CHANGED
|
|
visualizations/embedding_norms.png
CHANGED
|
|
visualizations/embedding_similarity.png
CHANGED
|
|
visualizations/embedding_tsne_multilingual.png
CHANGED
|
|
visualizations/performance_dashboard.png
CHANGED
|
|
visualizations/position_encoding_comparison.png
CHANGED
|
|
visualizations/tsne_sentences.png
CHANGED
|
|
visualizations/tsne_words.png
CHANGED
|
|