omarkamali committed on
Commit
d99f746
·
verified ·
1 Parent(s): 3d0ac29

Upload all models and assets for am (latest)

Files changed (40)
  1. README.md +61 -61
  2. models/embeddings/aligned/am_128d.bin +1 -1
  3. models/embeddings/aligned/am_128d.projection.npy +1 -1
  4. models/embeddings/aligned/am_32d.bin +1 -1
  5. models/embeddings/aligned/am_32d.projection.npy +1 -1
  6. models/embeddings/aligned/am_64d.bin +1 -1
  7. models/embeddings/aligned/am_64d.projection.npy +1 -1
  8. models/embeddings/monolingual/am_128d.bin +1 -1
  9. models/embeddings/monolingual/am_32d.bin +1 -1
  10. models/embeddings/monolingual/am_64d.bin +1 -1
  11. models/subword_markov/am_markov_ctx1_subword.parquet +2 -2
  12. models/subword_markov/am_markov_ctx2_subword.parquet +2 -2
  13. models/subword_markov/am_markov_ctx3_subword.parquet +2 -2
  14. models/subword_markov/am_markov_ctx4_subword.parquet +2 -2
  15. models/subword_ngram/am_2gram_subword.parquet +2 -2
  16. models/subword_ngram/am_3gram_subword.parquet +2 -2
  17. models/subword_ngram/am_4gram_subword.parquet +2 -2
  18. models/subword_ngram/am_5gram_subword.parquet +2 -2
  19. models/tokenizer/am_tokenizer_16k.model +1 -1
  20. models/tokenizer/am_tokenizer_32k.model +1 -1
  21. models/tokenizer/am_tokenizer_64k.model +1 -1
  22. models/tokenizer/am_tokenizer_8k.model +1 -1
  23. models/word_markov/am_markov_ctx1_word.parquet +2 -2
  24. models/word_markov/am_markov_ctx2_word.parquet +2 -2
  25. models/word_markov/am_markov_ctx3_word.parquet +2 -2
  26. models/word_markov/am_markov_ctx4_word.parquet +2 -2
  27. models/word_ngram/am_2gram_word.parquet +2 -2
  28. models/word_ngram/am_3gram_word.parquet +2 -2
  29. models/word_ngram/am_4gram_word.parquet +2 -2
  30. models/word_ngram/am_5gram_word.parquet +2 -2
  31. visualizations/embedding_alignment_quality.png +0 -0
  32. visualizations/embedding_isotropy.png +0 -0
  33. visualizations/embedding_norms.png +0 -0
  34. visualizations/embedding_similarity.png +2 -2
  35. visualizations/embedding_tsne_multilingual.png +2 -2
  36. visualizations/ngram_entropy.png +0 -0
  37. visualizations/performance_dashboard.png +2 -2
  38. visualizations/position_encoding_comparison.png +2 -2
  39. visualizations/tsne_sentences.png +2 -2
  40. visualizations/tsne_words.png +2 -2
README.md CHANGED
@@ -99,32 +99,32 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
99
 
100
  Below are sample sentences tokenized with each vocabulary size:
101
 
102
- **Sample 1:** `እኔ እውነት እናገራለሁ ሌላውን አስኮንናለሁ የአማርኛ ምሳሌ ነው። ትርጉሙ መደብ: ያልተተረጎመ ምሳሌ`
103
 
104
  | Vocab | Tokens | Count |
105
  |-------|--------|-------|
106
- | 8k | `▁እኔ ▁እውነት ▁እና ገራ ለሁ ▁ሌላ ውን ▁አስ ንና ... (+9 more)` | 19 |
107
- | 16k | `▁እኔ ▁እውነት ▁እና ገራ ለሁ ▁ሌላውን ▁አስ ንና ለሁ ... (+8 more)` | 18 |
108
- | 32k | `▁እኔ ▁እውነት ▁እናገራ ለሁ ▁ሌላውን ▁አስ ንና ለሁ ▁የአማርኛ ... (+7 more)` | 17 |
109
- | 64k | `▁እኔ ▁እውነት ▁እናገራለሁ ▁ሌላውን ▁አስ ኮንና ለሁ ▁የአማርኛ ▁ምሳሌ ▁ነው። ... (+5 more)` | 15 |
110
 
111
- **Sample 2:** `እንኳን ለገንፎ ለሙቅም አልደነግጥ የአማርኛ ምሳሌ ነው። ትርጉሙ መደብ: ያልተተረጎመ ምሳሌ`
112
 
113
  | Vocab | Tokens | Count |
114
  |-------|--------|-------|
115
- | 8k | `▁እንኳን ▁ለ ገን ▁ለ ቅም ▁አል ነግ ... (+9 more)` | 19 |
116
- | 16k | `▁እንኳን ▁ለ ገን ▁ለሙ ቅም ▁አል ደነግ ▁የአማርኛ ... (+7 more)` | 17 |
117
- | 32k | `▁እንኳን ▁ለ ገንፎ ▁ለሙ ቅም ▁አል ደነግጥ ▁የአማርኛ ▁ምሳሌ ▁ነው። ... (+5 more)` | 15 |
118
- | 64k | `▁እንኳን ▁ለገንፎ ▁ለሙ ቅም ▁አል ደነግጥ ▁የአማርኛ ▁ምሳሌ ▁ነው። ▁ትርጉሙ ... (+4 more)` | 14 |
119
 
120
- **Sample 3:** `ሞፈር ረገጠ በአማርኛ ፈሊጣዊ አነጋገር የሆነ ዘይቤ ነው። ትርጉም እራሱን ቻለ። ከቤተሰብ ቁጥጥር ውጭ ሆነ። ምሳሌ ደበበ ዕድሜ...`
121
 
122
  | Vocab | Tokens | Count |
123
  |-------|--------|-------|
124
- | 8k | `▁ሞ ፈር ▁ረ ▁በአማርኛ ▁ፈሊጣዊ ▁አነጋገር ▁የሆነ ▁ዘይቤ ... (+29 more)` | 39 |
125
- | 16k | `▁ሞ ፈር ▁ረገ ▁በአማርኛ ▁ፈሊጣዊ ▁አነጋገር ▁የሆነ ▁ዘይቤ ▁ነው። ... (+24 more)` | 34 |
126
- | 32k | `▁ሞፈር ▁ረገ ▁በአማርኛ ▁ፈሊጣዊ ▁አነጋገር ▁የሆነ ▁ዘይቤ ▁ነው። ▁ትርጉም ... (+21 more)` | 31 |
127
- | 64k | `▁ሞፈር ▁ረገ ▁በአማርኛ ▁ፈሊጣዊ ▁አነጋገር ▁የሆነ ▁ዘይቤ ▁ነው። ▁ትርጉም ... (+21 more)` | 31 |
128
 
129
 
130
  ### Key Findings
@@ -274,27 +274,27 @@ Below are text samples generated from each word-based Markov chain model:
274
 
275
  **Context Size 1:**
276
 
277
- 1. `ነው ወደዚህም የተሳበው ተፈጥሮን ፀባዮች የሚመረኮዝ ፆታዊ ውሳኔ ተላልፏል ውድድሩን ኤርትራውያን የኩራት ምንጭ የበለጠ የተገደበ`
278
- 2. `እና ሳይንሶችን እንዲሁም ከላስታና ከላሊበላ ከፍተኛ የመብት ጥሰቶች የተካሄደውን መፈንቅለ መንግሥት በጌሤም የአሦርም`
279
- 3. `ላይ እንዲገኝ ስለሚያስገድድ ነው ኬንያ ወደሚገኘው ማይ ጎጋ የተባለ የህንድ ጥቃቶች የተጠበቀ እና በችግር ጊዜ የተረጋገጠ`
280
 
281
  **Context Size 2:**
282
 
283
- 1. `ዓ ም የዊስቡር ልጅ ዶማር 300 307 ተከለከለ ታጂኪስታን በነሐሴ ወር 450 ዓ`
284
- 2. `ምሳሌ ነው ጦጣ መጀመሪያ የመቀመጫዬን አለች አሉ ጦጣ ባለቤቱን ታስወጣ የአማርኛ ምሳሌ ነው ትርጉሙ መደብ ያልተተረጎመ ምሳሌ`
285
- 3. `የአማርኛ ምሳሌ ነው ለላሙ መንጃ ለሸማው መቅደጃ የአማርኛ ምሳሌ ነው ትርጉሙ መደብ ተረትና ምሳሌ መደብ ተረትና ምሳሌ`
286
 
287
  **Context Size 3:**
288
 
289
- 1. `የአማርኛ ምሳሌ ነው ትርጉሙ መደብ ያልተተረጎመ ምሳሌ መደብ ተረትና ምሳሌ wiz`
290
- 2. `እ ኤ አ ቦራስ ስዊድን የግሪክ ዘፋኝ ነች አልበሞች protereotita my number one iparhi logos the game of`
291
- 3. `ምሳሌ ነው ትርጉሙ መደብ ያልተተረጎመ ምሳሌ መደብ ተረትና ምሳሌ መደብ ተረትና ምሳሌ መደብ ተረትና ምሳሌ ምሳሌ`
292
 
293
  **Context Size 4:**
294
 
295
- 1. `የአማርኛ ምሳሌ ነው ትርጉሙ መደብ ያልተተረጎመ ምሳሌ መደብ ተረትና ምሳሌ wiz`
296
- 2. `ምሳሌ ነው ትርጉሙ መደብ ተረትና ምሳሌ በሬ ካራጁ ይዉላል`
297
- 3. `ነው ትርጉሙ መደብ ያልተተረጎመ ምሳሌ መደብ ተረትና ምሳሌ መደብ ያልተተረጎመ ምሳሌ`
298
 
299
 
300
  ### Generated Text Samples (Subword-based)
@@ -303,27 +303,27 @@ Below are text samples generated from each subword-based Markov chain model:
303
 
304
  **Context Size 1:**
305
 
306
- 1. `_493_የለቀድ_አት_በተገ`
307
- 2. `ንዳዎች_20_የሳት_ወቀን_`
308
- 3. `ት_ገኙ_ነበትላን_ጆች_በዚ`
309
 
310
  **Context Size 2:**
311
 
312
- 1. `_የጠፈ_እጅጉ_ሙከራ_ተፈጥሮ`
313
- 2. `ት_ከተማ_እቃ_ለማብራዶሮ_ሶ`
314
- 3. `_በሁለተቸት_ስለ_ተመሳር_ከ`
315
 
316
  **Context Size 3:**
317
 
318
- 1. `_እንዲሁም_ዘር።_ከነዚህ_ጊዜ`
319
- 2. `_ነው_፡፡_አየሩ_በኋላም_ብዙ`
320
- 3. `_እና_ጁላይ_ጥይቶቹ_ላይ_(2`
321
 
322
  **Context Size 4:**
323
 
324
- 1. `_እና_የከተማ፡-_ጎንደርና_አገ`
325
- 2. `_ነው።_ባብዛኛው_ህይወት_ውስጥ`
326
- 3. `ነው።_ዓ.ም_ኪዮሺ_ሱጊዩራ_(1`
327
 
328
 
329
  ### Key Findings
@@ -428,18 +428,18 @@ Below are text samples generated from each subword-based Markov chain model:
428
 
429
  | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
430
  |-------|-----------|----------|------------------|---------------|----------------|
431
- | **mono_32d** | 32 | 0.9080 | 0.3255 | N/A | N/A |
432
- | **mono_64d** | 64 | 0.9137 | 0.2344 | N/A | N/A |
433
- | **mono_128d** | 128 | 0.8453 | 0.1726 | N/A | N/A |
434
- | **aligned_32d** | 32 | 0.9080 | 0.3232 | 0.0220 | 0.1700 |
435
- | **aligned_64d** | 64 | 0.9137 🏆 | 0.2323 | 0.0420 | 0.1840 |
436
- | **aligned_128d** | 128 | 0.8453 | 0.1725 | 0.0680 | 0.2480 |
437
 
438
  ### Key Findings
439
 
440
- - **Best Isotropy:** aligned_64d with 0.9137 (more uniform distribution)
441
- - **Semantic Density:** Average pairwise similarity of 0.2434. Lower values indicate better semantic separation.
442
- - **Alignment Quality:** Aligned models achieve up to 6.8% R@1 in cross-lingual retrieval.
443
  - **Recommendation:** 128d aligned for best cross-lingual performance
444
 
445
  ---
@@ -467,18 +467,18 @@ Bound stems are high-frequency subword units that are semantically cohesive but
467
 
468
  | Stem | Cohesion | Substitutability | Examples |
469
  |------|----------|------------------|----------|
470
- | `እንደሚ` | 2.39x | 158 contexts | እንደሚሉ, እንደሚሻ, እንደሚል |
471
- | `ርስቲያ` | 2.46x | 61 contexts | ክርስቲያ, ክርስቲያኗ, ክርስቲያኑ |
472
- | `ትዮጵያ` | 2.23x | 57 contexts | ኢትዮጵያ, እትዮጵያ, ኢትዮጵያና |
473
- | `መንግስ` | 2.21x | 49 contexts | መንግስቱ, መንግስት, መንግስተ |
474
- | `ግዚአብ` | 2.66x | 23 contexts | እግዚአብሔር, እግዚአብሐር, እግዚአብሄር |
475
- | `ኢትዮጵ` | 2.18x | 46 contexts | ኢትዮጵያ, ኢትዮጵያና, የኢትዮጵያ |
476
- | `እንግሊ` | 2.08x | 52 contexts | እንግሊኛ, እንግሊዙ, እንግሊዝ |
477
- | `መንግሥ` | 2.12x | 46 contexts | መንግሥት, መንግሥተ, መንግሥቱ |
478
- | `ጀመሪያ` | 2.29x | 33 contexts | መጀመሪያ, በመጀመሪያ, ለመጀመሪያ |
479
- | `ፈረንሳ` | 2.27x | 34 contexts | ፈረንሳይ, ፈረንሳዊ, የፈረንሳዩ |
480
- | `tion` | 2.77x | 17 contexts | nation, action, section |
481
- | `መጀመሪ` | 2.29x | 31 contexts | መጀመሪአ, መጀመሪያ, የመጀመሪ |
482
 
483
  ### 6.4 Affix Compatibility (Co-occurrence)
484
 
@@ -726,4 +726,4 @@ MIT License - Free for academic and commercial use.
726
  ---
727
  *Generated by Wikilangs Models Pipeline*
728
 
729
- *Report Date: 2026-01-03 14:11:24*
 
99
 
100
  Below are sample sentences tokenized with each vocabulary size:
101
 
102
+ **Sample 1:** `ናውሩ በሰላማዊ ውቅያኖስ የሚገኝ ደሴት አገር ነው። ዋና ከተማ የለውም፣ ትልቁ ከተማ ግን ያሬን ነው።`
103
 
104
  | Vocab | Tokens | Count |
105
  |-------|--------|-------|
106
+ | 8k | `▁ና ▁በሰ ላማዊ ▁ውቅያኖስ ▁የሚገኝ ▁ደሴት ▁አገር ▁ነው። ... (+10 more)` | 20 |
107
+ | 16k | `▁ና ውሩ ▁በሰላማዊ ▁ውቅያኖስ ▁የሚገኝ ▁ደሴት ▁አገር ▁ነው። ▁ዋና ▁ከተማ ... (+8 more)` | 18 |
108
+ | 32k | `▁ናውሩ ▁በሰላማዊ ▁ውቅያኖስ ▁የሚገኝ ▁ደሴት ▁አገር ▁ነው። ▁ዋና ▁ከተማ ▁የለውም፣ ... (+6 more)` | 16 |
109
+ | 64k | `▁ናውሩ ▁በሰላማዊ ▁ውቅያኖስ ▁የሚገኝ ▁ደሴት ▁አገር ▁ነው። ▁ዋና ▁ከተማ ▁የለውም፣ ... (+5 more)` | 15 |
110
 
111
+ **Sample 2:** `አሾካ ከ277 ስከ 240 ዓክልበ. ድረስ የሕንድ አገር ማውርያ መንግሥት ንጉሥ ነበር። በ271 ዓክልበ. ግድም የቡዲስም ተከታይ...`
112
 
113
  | Vocab | Tokens | Count |
114
  |-------|--------|-------|
115
+ | 8k | `▁አ ▁ከ 2 7 7 ▁ስ ... (+42 more)` | 52 |
116
+ | 16k | `▁አ ▁ከ 2 7 7 ▁ስ ... (+39 more)` | 49 |
117
+ | 32k | `▁አሾ ▁ከ 2 7 7 ▁ስ 2 ... (+38 more)` | 48 |
118
+ | 64k | `▁አሾካ ▁ከ 2 7 7 ▁ስከ 2 4 0 ... (+34 more)` | 44 |
119
 
120
+ **Sample 3:** `ኔትፍሊክስ (እንግሊዝኛ: Netflix) በመስመር ላይ ፊልሞችን እና የቴሌቪዥን ፕሮግራሞችን ለመመልከት የሚያስችል የዥረት አገል...`
121
 
122
  | Vocab | Tokens | Count |
123
  |-------|--------|-------|
124
+ | 8k | `▁ኔ ክስ ▁( እንግሊዝኛ : ▁n et ... (+36 more)` | 46 |
125
+ | 16k | `▁ኔ ትፍ ክስ ▁( እንግሊዝኛ : ▁n et fl ... (+29 more)` | 39 |
126
+ | 32k | `▁ኔ ትፍ ሊክስ ▁( እንግሊዝኛ : ▁net fl ix ) ... (+23 more)` | 33 |
127
+ | 64k | `▁ኔ ትፍ ሊክስ ▁( እንግሊዝኛ : ▁net flix ) ▁በመስመር ... (+16 more)` | 26 |
128
 
129
 
130
  ### Key Findings
 
274
 
275
  **Context Size 1:**
276
 
277
+ 1. `ነው ያኽዱን ሊም ዓክልበ የነገሠ የሊፒት እሽታርን እርዳታ የማግኘት መብቱ የተጠበቀ ስለሆነ ፈጽሞ ይበላል ፍሬው ሳይበስል`
278
+ 2. `እና ኢኮኖሚያዊ እና አመለካከቶችን ለመግለጽ ይወዳል የወዳጅሽ የመሠወሪያው ማዕበልም ያማታዋል ዳግመኛም የከበረውን የመልክተኛዎን የቃል ትርጉም ሊያዳብር`
279
+ 3. `ላይ አፈፃፀምን በራስ መተማመን አይችሉም ከሚለው ቃል በሲቪል ደግሞ ለየተለያዩ በአፍሪካ ውስጥ የተረጋገጠ ይመስላል ከዚያም የሶቪየት`
280
 
281
  **Context Size 2:**
282
 
283
+ 1. `ዓ ም በኋላ ለሆኑት ዓመታት ግን በሌላ ቀን ላይ መሆኑን ይገንዘቡ ለእነዚያ ዓመቶች ይህ የቀን መለወጫ መሣርያ`
284
+ 2. `ምሳሌ ነው ትርጉሙ መደብ ያልተተረጎመ ምሳሌ መደብ ተረትና ምሳሌ መደብ ተረትና ምሳሌ መደብ ተረትና ምሳሌ ምናልባትም ከቤ`
285
+ 3. `የአማርኛ ምሳሌ ነው ትርጉሙ ሚስጥር አይደበቅ ይመስላል ትርጉሙ መደብ ተረትና ምሳሌ መደብ ተረትና ምሳሌ መደብ ተረትና ምሳሌ`
286
 
287
  **Context Size 3:**
288
 
289
+ 1. `የአማርኛ ምሳሌ ነው ትርጉሙ መደብ ያልተተረጎመ ምሳሌ መደብ ተረትና ምሳሌ ምግባር ሳይኖር ስም እንደማለት ነዉ`
290
+ 2. `እ ኤ አ የእንግሊዝ ካላንደር ማሻሻያ ተከትሎ የንግሥቲቱን ሞት መመዝገብ የተለመደ ቢሆንም እንግሊዝ መጋቢት 25 ቀን ማለት ነው`
291
+ 3. `ምሳሌ ነው ትርጉሙ የተያያዙ ነገሮችን ለመለየት የሚያገለግል ፈሊጥ መደብ ተረትና ምሳሌ ምሳሌ`
292
 
293
  **Context Size 4:**
294
 
295
+ 1. `የአማርኛ ምሳሌ ነው ትርጉሙ መደብ ተረትና ምሳሌ በሬ ካራጁ ይዉላል`
296
+ 2. `ምሳሌ ነው ትርጉሙ መደብ ያልተተረጎመ ምሳሌ መደብ ተረትና ምሳሌ መደብ ያልተተረጎመ ምሳሌ`
297
+ 3. `ነው ትርጉሙ መደብ ያልተተረጎመ ምሳሌ መደብ ተረትና ምሳሌ ሴት ሁሉን ቻይ ናት`
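The samples above come from word-level Markov chains with context sizes 1 to 4. The pipeline's own generation code is not part of this commit, so the following is only a generic sketch of order-k generation; the function names `build_chain` and `generate` and the toy English corpus are hypothetical.

```python
import random
from collections import defaultdict


def build_chain(tokens, context_size):
    """Map each context (a tuple of `context_size` tokens) to the tokens seen after it."""
    chain = defaultdict(list)
    for i in range(len(tokens) - context_size):
        context = tuple(tokens[i:i + context_size])
        chain[context].append(tokens[i + context_size])
    return chain


def generate(chain, seed, length, rng):
    """Extend `seed` by sampling observed successors until `length` tokens or a dead end."""
    out = list(seed)
    k = len(seed)
    while len(out) < length:
        successors = chain.get(tuple(out[-k:]))
        if not successors:
            break  # this context never occurred in the corpus
        out.append(rng.choice(successors))
    return out


# Toy corpus (hypothetical; not the Wikipedia data used for the real models).
corpus = "the cat sat on the mat and the cat ran to the mat".split()
chain = build_chain(corpus, context_size=2)
sample = generate(chain, seed=("the", "cat"), length=8, rng=random.Random(0))
print(" ".join(sample))
```

With a longer context, each context tuple has fewer observed successors, so the output tends to copy training text more closely. That is consistent with the context-size 3 and 4 samples above, which repeat stock phrases almost verbatim.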
298
 
299
 
300
  ### Generated Text Samples (Subword-based)
 
303
 
304
  **Context Size 1:**
305
 
306
+ 1. `_በይምነዉ፡ቢቢትር_የተፅሀ`
307
+ 2. `ን_እንዋጮችት_crcue_አ`
308
+ 3. `ት__ፈርዕስክሎ_አስ_po`
309
 
310
  **Context Size 2:**
311
 
312
+ 1. `_የኢትዮጵያ_ዘን_ሳይንስ_ተ`
313
+ 2. `ት_ነው።_እንግሥታት_ናይትድ`
314
+ 3. `_በወራ_ህብረ_የሳምን_ወይ_`
315
 
316
  **Context Size 3:**
317
 
318
+ 1. `_እንዲህ፡መልክ_ፈላሁ_ዐንሁ_`
319
+ 2. `_ነው።_እንዲህም፡ዅሉ፡_ደግሞ`
320
+ 3. `_እና_ከተያዙ_እንዲሸከሙአቸው`
321
 
322
  **Context Size 4:**
323
 
324
+ 1. `_እና_ቁሳዊ_ነገሥታት_መሽኛ_ት`
325
+ 2. `_ነው።_ከግብጽ_ዘውድ_ጭነው_ነ`
326
+ 3. `ነው።_ሁሉም_የተነሳ_በኋላም_ያ`
327
 
328
 
329
  ### Key Findings
 
428
 
429
  | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
430
  |-------|-----------|----------|------------------|---------------|----------------|
431
+ | **mono_32d** | 32 | 0.9098 | 0.3240 | N/A | N/A |
432
+ | **mono_64d** | 64 | 0.9137 🏆 | 0.2319 | N/A | N/A |
433
+ | **mono_128d** | 128 | 0.8452 | 0.1755 | N/A | N/A |
434
+ | **aligned_32d** | 32 | 0.9098 | 0.3259 | 0.0200 | 0.1420 |
435
+ | **aligned_64d** | 64 | 0.9137 | 0.2299 | 0.0480 | 0.1860 |
436
+ | **aligned_128d** | 128 | 0.8452 | 0.1764 | 0.0840 | 0.2800 |
437
 
438
  ### Key Findings
439
 
440
+ - **Best Isotropy:** mono_64d with 0.9137 (more uniform distribution)
441
+ - **Semantic Density:** Average pairwise similarity of 0.2439. Lower values indicate better semantic separation.
442
+ - **Alignment Quality:** Aligned models achieve up to 8.4% R@1 in cross-lingual retrieval.
443
  - **Recommendation:** 128d aligned for best cross-lingual performance
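The summary figures in this list can be cross-checked against the updated table. The sketch below assumes that the reported average is the plain mean of the six Semantic Density entries. That is an inference from the numbers, not something the README states. The values are copied from the new side of the diff.

```python
# Semantic Density per model, copied from the updated table.
semantic_density = {
    "mono_32d": 0.3240,
    "mono_64d": 0.2319,
    "mono_128d": 0.1755,
    "aligned_32d": 0.3259,
    "aligned_64d": 0.2299,
    "aligned_128d": 0.1764,
}

# Alignment R@1 for the aligned models, copied from the same table.
r_at_1 = {"aligned_32d": 0.0200, "aligned_64d": 0.0480, "aligned_128d": 0.0840}

# Assumption: the "average pairwise similarity" is the mean of the column.
mean_density = sum(semantic_density.values()) / len(semantic_density)

# The model with the highest R@1 backs the "up to ..." claim.
best_model, best_r1 = max(r_at_1.items(), key=lambda kv: kv[1])

print(f"mean semantic density: {mean_density:.4f}")
print(f"best R@1: {best_model} at {best_r1:.1%}")
```

Under that assumption the mean rounds to 0.2439 and the best R@1 is 8.4% for `aligned_128d`, matching the bullets. The same calculation on the old table values gives 0.2434, which also matches the removed bullet.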
444
 
445
  ---
 
467
 
468
  | Stem | Cohesion | Substitutability | Examples |
469
  |------|----------|------------------|----------|
470
+ | `እንደሚ` | 2.30x | 158 contexts | እንደሚሹ, እንደሚሻ, እንደሚል |
471
+ | `ርስቲያ` | 2.39x | 61 contexts | ክርስቲያ, ከርስቲያን, ክርስቲያን |
472
+ | `ትዮጵያ` | 2.17x | 57 contexts | ኢትዮጵያ, እትዮጵያ, ኢትዮጵያን |
473
+ | `መንግስ` | 2.10x | 49 contexts | መንግስቱ, መንግስተ, መንግስት |
474
+ | `ግዚአብ` | 2.58x | 23 contexts | እግዚአብሐር, እግዚአብሔር, እግዚአብሄር |
475
+ | `ኢትዮጵ` | 2.08x | 46 contexts | ኢትዮጵያ, ኢትዮጵያን, ኢትዮጵያና |
476
+ | `እንግሊ` | 2.00x | 52 contexts | እንግሊዝ, እንግሊዙ, እንግሊኛ |
477
+ | `ፈረንሳ` | 2.23x | 34 contexts | ፈረንሳዊ, ፈረንሳይ, ከፈረንሳዩ |
478
+ | `መንግሥ` | 2.04x | 46 contexts | መንግሥቱ, መንግሥት, መንግሥተ |
479
+ | `tion` | 2.71x | 17 contexts | action, nation, section |
480
+ | `አስተዳ` | 2.21x | 33 contexts | አስተዳደጉ, አስተዳደሪ, አስተዳደጓ |
481
+ | `ግሊዝኛ` | 2.54x | 19 contexts | እንግሊዝኛ, በእንግሊዝኛ, ኢንግሊዝኛው |
482
 
483
  ### 6.4 Affix Compatibility (Co-occurrence)
484
 
 
726
  ---
727
  *Generated by Wikilangs Models Pipeline*
728
 
729
+ *Report Date: 2026-01-03 16:28:42*
models/embeddings/aligned/am_128d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:3964ee0c4f9ca092d9f74907f56f9a3d93b19752347882a64e552d576a095e2b
3
  size 1064306440
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:05365ec12b2c2fb2d8df995aee24119fee6ddb57a678bd862fd40de10e193cb1
3
  size 1064306440
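Each model file in this commit is stored as a Git LFS pointer: a three-line text file giving the spec version, the content hash, and the byte size. A minimal sketch for comparing two pointers follows; the helper name `parse_lfs_pointer` is hypothetical, and the pointer texts are the old and new versions of `models/embeddings/aligned/am_128d.bin` shown above.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into version, hash algorithm, oid, and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "oid": digest,
        "size": int(fields["size"]),
    }


old = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:3964ee0c4f9ca092d9f74907f56f9a3d93b19752347882a64e552d576a095e2b\n"
    "size 1064306440\n"
)
new = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:05365ec12b2c2fb2d8df995aee24119fee6ddb57a678bd862fd40de10e193cb1\n"
    "size 1064306440\n"
)

# Same byte size but a different hash: the file was regenerated, not resized.
print(old["size"] == new["size"], old["oid"] != new["oid"])
```

A changed `oid` with an unchanged `size` is the pattern behind the `+1 -1` entries in the file list, while the `+2 -2` entries changed both lines.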
models/embeddings/aligned/am_128d.projection.npy CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:cfbe39d8562cf38339d4ba377db708ff89f8f76337965c05da5c47a5511cd90d
3
  size 65664
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:deb24c9e18725a3ee6f696f5f490d96e499e66d1aef469fc3b67ca71da6c2d47
3
  size 65664
models/embeddings/aligned/am_32d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:43acd2454c73ec6a90e9b75922f4a4ecc9db024320951b2c88c136aaaf3e57dc
3
  size 266727688
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8345b199adcc25abea3de9a656a6b112dd0b6a39d6361db11f5e425bf8e004bd
3
  size 266727688
models/embeddings/aligned/am_32d.projection.npy CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:b3d0d5ccf51dddf65743a5123bfc0ecd1944f4cfce415736d0e147acf3d55f4b
3
  size 4224
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:634cf159edda5ce6a8152867dd7521b71f1c434d47e85d17f55d1e0d6feb3316
3
  size 4224
models/embeddings/aligned/am_64d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:ad86c60bc61db296fa76aedb6ab90d476fc19f98d61421e743d16270d0805cf5
3
  size 532587272
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:93e5df8ef91db2272bca59309b0c79fd6feaa7e932515206d8059b8330fa5e72
3
  size 532587272
models/embeddings/aligned/am_64d.projection.npy CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:56b5628675cc1634a4f9405dd4c0ad1d8ef1da74827604d4d4f4a7c37742850a
3
  size 16512
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:627287628b6e540e4db14e30e43e8a0a62ea37c30be3ad5ce1ab5d656c55c446
3
  size 16512
models/embeddings/monolingual/am_128d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:3964ee0c4f9ca092d9f74907f56f9a3d93b19752347882a64e552d576a095e2b
3
  size 1064306440
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:05365ec12b2c2fb2d8df995aee24119fee6ddb57a678bd862fd40de10e193cb1
3
  size 1064306440
models/embeddings/monolingual/am_32d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:43acd2454c73ec6a90e9b75922f4a4ecc9db024320951b2c88c136aaaf3e57dc
3
  size 266727688
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8345b199adcc25abea3de9a656a6b112dd0b6a39d6361db11f5e425bf8e004bd
3
  size 266727688
models/embeddings/monolingual/am_64d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:ad86c60bc61db296fa76aedb6ab90d476fc19f98d61421e743d16270d0805cf5
3
  size 532587272
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:93e5df8ef91db2272bca59309b0c79fd6feaa7e932515206d8059b8330fa5e72
3
  size 532587272
models/subword_markov/am_markov_ctx1_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:8839cf6220dd43e6dea9746686fa8a81a9a14dd02c6132ff81405220cf661652
3
- size 350051
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a0819e23adea11e4b2d437cb14f2eb3a8095f9305eca7295c22f45869cc2398c
3
+ size 354543
models/subword_markov/am_markov_ctx2_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:27974fda5f8693d4915e9b657526e84bd0fb36fde9d966f19e785d000aedbe5c
3
- size 2143373
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b6243c7c58a847e7d8a7e59e0d6cf893519ad61e931d6182b4fd19948111d9cf
3
+ size 2169861
models/subword_markov/am_markov_ctx3_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:82c019b23a13a08dd018f89e36328fb55a539c91738572761bc8164c5041b3f4
3
- size 8416653
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ab2bfec0c73c9d92c58f9ffed32659a2ea045ce31cbf4ae474be8c7609c2d247
3
+ size 8421663
models/subword_markov/am_markov_ctx4_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:f07c64385b4c55e182b6ffee304e1810b2d71ef22b3c5a0b251758b633fca59c
3
- size 23309934
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f389ec100097a598046200d415e719f5c015e20313c572c31733c198bd66825b
3
+ size 23300653
models/subword_ngram/am_2gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:5eabd2483a9b43b8f8fd572f71957cc349ed926f2f419138f2f3a188ed78c42f
3
- size 300472
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ad5c361634d826f85a00a861a3aac431b960daae8e30d389a26b10f223079031
3
+ size 300438
models/subword_ngram/am_3gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:3c9b592f96c76cc3101fee1a715cf6c4149dfe51039e391d6098096f9714abe5
3
- size 1885405
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bb1a1174c9a4c80aefbd37753d4553c8a756c0598012bc0df0044d2a23423c4d
3
+ size 1911161
models/subword_ngram/am_4gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:899527071ee5237fc41c2fdb6de104208b69f563db321792d2b61794349f7f99
3
- size 7120084
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a82c126bddc5c2665815030f550a0890e69f0a9cfbd4eaac08c79d884183a4d1
3
+ size 7098220
models/subword_ngram/am_5gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:d6256142f289450d718eafc3693efd234fd5e20489751f38071b275a053df337
3
- size 12113547
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c35770939585fe557d5ccf7666948844ecbbbeab5b5bd7ec5cde2dc5905be254
3
+ size 12087680
models/tokenizer/am_tokenizer_16k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:802c7e23f92ccb7959b0feb6dc8f82635d55d846dbd9f4570913915ce17d5785
3
  size 559625
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a111ad7a5e4c6cde164649095c769c75312c7ad72aae53c5ee2f8ef7838721c4
3
  size 559625
models/tokenizer/am_tokenizer_32k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:15d15bc01b2e176dbce09f1705536a89afa2737570c8c47d3353634f9f68a94a
3
  size 902568
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ab016e2ea38f8d8c7b267cf37f85d2bb7c17bda4ed8c6c8b690fb0324f3335dd
3
  size 902568
models/tokenizer/am_tokenizer_64k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:5022a070bcee7d18664184b36241ade2425fae873f2d51e1b98af97932ce4f68
3
  size 1589838
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:961377d69eb3cec24c9c718382b0b467b9a1afa24c2dc238af1e0fa8ea2122c9
3
  size 1589838
models/tokenizer/am_tokenizer_8k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:5a8188f6c50ce22b57642fe6d5b7e098ba217e95117378d4c365656422f23b18
3
  size 394741
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:dc01c040ea8751cdd811ff0c122044d022da068ea1f778e5d6105fed66d9c6e2
3
  size 394741
models/word_markov/am_markov_ctx1_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:95ccb7912954a141c9b8b7ca95e221a9fccec6dd51f1c444544019f2ac02b83d
3
- size 13813306
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b31d703fb4cebc4b7a0c50356f366a54d6244d97c871ead6e6e93a4fb68fbb1f
3
+ size 13829693
models/word_markov/am_markov_ctx2_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:d1434179ed6804e7277902166d67afa9d340c443633dfac8c4f6574077bc3705
3
- size 30016914
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:cc2ff4ea0c2d68217181f37e11af662feba1f560f09d89557360c878b261cc0e
3
+ size 30071868
models/word_markov/am_markov_ctx3_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:780fe369483db11af3d104778e7d5bb9b9a5f148283b4a56821d1108c225a8f1
3
- size 37857525
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:de52d6ed2d3cf8ba885adb29815096e6f0eaf92d7f4420e09402db1799affe4e
3
+ size 37843385
models/word_markov/am_markov_ctx4_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:e728b8a8a0a36069b78abb0671af76227c43bf867e138b921d8bc54cbf49eefb
3
- size 43426773
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:26c1f4fea34061dd7bf3ac5175c36ba8e8c0eecf6bf70e7a3b3673f18dfef779
3
+ size 43401561
models/word_ngram/am_2gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:ec5b536313f116b0bde636f400cc6da30021220a55d66b04d2d921a46afac00c
3
- size 546977
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3253eb36f46641335616a4d569c3e3db153afd4a90452fd3c6b3a43277097773
3
+ size 548774
models/word_ngram/am_3gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:a4b1a9110621fff9f70cff208a0fd53443188c369db1a7931d916d8df983445d
3
- size 758494
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9e2da18fb13944faacfb0c55bc5dbe5f5aff6172a504ee740427290dda7f2ee6
3
+ size 761215
models/word_ngram/am_4gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:b1d7bbbf331a49536843f3ee8f30a563ca757570bee2a718000a58404f82e7e8
3
- size 2116335
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:21a05c63fd25438fb235f39c907ad834a6f3b4a58c825f0913049f3461b196f7
3
+ size 2121749
models/word_ngram/am_5gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:4df340c03cec3e7a88e1f02a417db7326be9b7576ed70faa5e9eba5a5bad79b6
3
- size 2009853
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:42543fabd233c4bac0f24bbbd665c94db17fdf8859ee492de75dc0a80d2dfc9e
3
+ size 2016123
visualizations/embedding_alignment_quality.png CHANGED
visualizations/embedding_isotropy.png CHANGED
visualizations/embedding_norms.png CHANGED
visualizations/embedding_similarity.png CHANGED

Git LFS Details

  • SHA256: c2768ac617e6ac66d52b8b24dfb9900ad514ad9e2e6faf9ed2c9dd9c0f8d3530
  • Pointer size: 131 Bytes
  • Size of remote file: 135 kB

Git LFS Details

  • SHA256: 1e88f878d3ed3885c3990779b963be8a9df755be9a24bc0093e987cdf497b6c6
  • Pointer size: 131 Bytes
  • Size of remote file: 138 kB
visualizations/embedding_tsne_multilingual.png CHANGED

Git LFS Details

  • SHA256: fa50c2da3033c0bc5ec7dd4f790981d7d9624f6fb5645c508f305fe958f5693d
  • Pointer size: 131 Bytes
  • Size of remote file: 245 kB

Git LFS Details

  • SHA256: e8146e1087cfaeb961c38ceeecb4adc63f87ae7fa2de3a83153b697980ae8e9a
  • Pointer size: 131 Bytes
  • Size of remote file: 222 kB
visualizations/ngram_entropy.png CHANGED
visualizations/performance_dashboard.png CHANGED

Git LFS Details

  • SHA256: 8a8911f1ebc14b1d3a70c713834ef50d25c6c41ee7eb0b95f3964a1ebfbcfaf6
  • Pointer size: 131 Bytes
  • Size of remote file: 375 kB

Git LFS Details

  • SHA256: b4af5bc1eaa421652c992a879e0444c2890f3d3ca6b5e2e18d7dab34d0252d99
  • Pointer size: 131 Bytes
  • Size of remote file: 383 kB
visualizations/position_encoding_comparison.png CHANGED

Git LFS Details

  • SHA256: 7553b32774e38463225e80b2babeffc7a10375b45a219db41a997f42f9b84e70
  • Pointer size: 131 Bytes
  • Size of remote file: 120 kB

Git LFS Details

  • SHA256: 16d0a26e823193e62af734fc6104d904f0f96355d2d8c4a0c91ae75e244ad610
  • Pointer size: 131 Bytes
  • Size of remote file: 118 kB
visualizations/tsne_sentences.png CHANGED

Git LFS Details

  • SHA256: 3021f2203420cc58ff1b9dea95d483b4729c516b9c004adf963d35ed64951199
  • Pointer size: 131 Bytes
  • Size of remote file: 259 kB

Git LFS Details

  • SHA256: c57a33c2360a0890b1a80a8b86ef0e0c910b72a0936123dbb9602a5700f16143
  • Pointer size: 131 Bytes
  • Size of remote file: 263 kB
visualizations/tsne_words.png CHANGED

Git LFS Details

  • SHA256: 05b3299584637cbaafb29dda78525b98fc5a65fbfe4efa6e85f93cffd135eae7
  • Pointer size: 131 Bytes
  • Size of remote file: 663 kB

Git LFS Details

  • SHA256: ba9401b6ebede703d28c3ee445bc94f91f4610e455a293374fa62eb6119fadcc
  • Pointer size: 131 Bytes
  • Size of remote file: 657 kB