omarkamali committed
Commit 873ac1d · verified · 1 parent: a863838

Upload all models and assets for ady (20251001)

This view is limited to 50 files because the commit contains too many changes; see the raw diff for the full change set.

Files changed (50)
  1. .gitattributes +1 -0
  2. README.md +307 -139
  3. models/embeddings/monolingual/ady_128d.bin +2 -2
  4. models/embeddings/monolingual/ady_128d_metadata.json +5 -3
  5. models/embeddings/monolingual/ady_32d.bin +2 -2
  6. models/embeddings/monolingual/ady_32d_metadata.json +5 -3
  7. models/embeddings/monolingual/ady_64d.bin +2 -2
  8. models/embeddings/monolingual/ady_64d_metadata.json +5 -3
  9. models/subword_markov/ady_markov_ctx1_subword.parquet +2 -2
  10. models/subword_markov/ady_markov_ctx1_subword_metadata.json +2 -2
  11. models/subword_markov/ady_markov_ctx2_subword.parquet +2 -2
  12. models/subword_markov/ady_markov_ctx2_subword_metadata.json +2 -2
  13. models/subword_markov/ady_markov_ctx3_subword.parquet +2 -2
  14. models/subword_markov/ady_markov_ctx3_subword_metadata.json +2 -2
  15. models/subword_markov/ady_markov_ctx4_subword.parquet +2 -2
  16. models/subword_markov/ady_markov_ctx4_subword_metadata.json +2 -2
  17. models/subword_ngram/ady_2gram_subword.parquet +2 -2
  18. models/subword_ngram/ady_2gram_subword_metadata.json +2 -2
  19. models/subword_ngram/ady_3gram_subword.parquet +2 -2
  20. models/subword_ngram/ady_3gram_subword_metadata.json +2 -2
  21. models/subword_ngram/ady_4gram_subword.parquet +2 -2
  22. models/subword_ngram/ady_4gram_subword_metadata.json +2 -2
  23. models/tokenizer/ady_tokenizer_16k.model +2 -2
  24. models/tokenizer/ady_tokenizer_16k.vocab +0 -0
  25. models/tokenizer/ady_tokenizer_32k.model +2 -2
  26. models/tokenizer/ady_tokenizer_32k.vocab +0 -0
  27. models/tokenizer/ady_tokenizer_8k.model +2 -2
  28. models/tokenizer/ady_tokenizer_8k.vocab +0 -0
  29. models/vocabulary/ady_vocabulary.parquet +2 -2
  30. models/vocabulary/ady_vocabulary_metadata.json +10 -9
  31. models/word_markov/ady_markov_ctx1_word.parquet +2 -2
  32. models/word_markov/ady_markov_ctx1_word_metadata.json +2 -2
  33. models/word_markov/ady_markov_ctx2_word.parquet +2 -2
  34. models/word_markov/ady_markov_ctx2_word_metadata.json +2 -2
  35. models/word_markov/ady_markov_ctx3_word.parquet +2 -2
  36. models/word_markov/ady_markov_ctx3_word_metadata.json +2 -2
  37. models/word_markov/ady_markov_ctx4_word.parquet +2 -2
  38. models/word_markov/ady_markov_ctx4_word_metadata.json +2 -2
  39. models/word_ngram/ady_2gram_word.parquet +2 -2
  40. models/word_ngram/ady_2gram_word_metadata.json +2 -2
  41. models/word_ngram/ady_3gram_word.parquet +2 -2
  42. models/word_ngram/ady_3gram_word_metadata.json +2 -2
  43. models/word_ngram/ady_4gram_word.parquet +2 -2
  44. models/word_ngram/ady_4gram_word_metadata.json +2 -2
  45. visualizations/embedding_isotropy.png +0 -0
  46. visualizations/embedding_norms.png +0 -0
  47. visualizations/embedding_similarity.png +2 -2
  48. visualizations/markov_branching.png +0 -0
  49. visualizations/markov_contexts.png +0 -0
  50. visualizations/markov_entropy.png +0 -0
.gitattributes CHANGED
@@ -39,3 +39,4 @@ visualizations/position_encoding_comparison.png filter=lfs diff=lfs merge=lfs -text
 visualizations/tsne_sentences.png filter=lfs diff=lfs merge=lfs -text
 visualizations/tsne_words.png filter=lfs diff=lfs merge=lfs -text
 visualizations/zipf_law.png filter=lfs diff=lfs merge=lfs -text
+visualizations/ngram_coverage.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -23,14 +23,14 @@ dataset_info:
 metrics:
 - name: best_compression_ratio
   type: compression
-  value: 4.453
+  value: 4.231
 - name: best_isotropy
   type: isotropy
-  value: 0.6831
+  value: 0.4730
 - name: vocabulary_size
   type: vocab
-  value: 8988
-generated: 2025-12-27
+  value: 0
+generated: 2026-01-03
 ---
 
 # ADY - Wikilangs Models
 
@@ -44,12 +44,13 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 ### Models & Assets
 
 - Tokenizers (8k, 16k, 32k, 64k)
-- N-gram models (2, 3, 4-gram)
-- Markov chains (context of 1, 2, 3 and 4)
+- N-gram models (2, 3, 4, 5-gram)
+- Markov chains (context of 1, 2, 3, 4 and 5)
 - Subword N-gram and Markov chains
-- Embeddings in various sizes and dimensions
+- Embeddings in various sizes and dimensions (aligned and unaligned)
 - Language Vocabulary
 - Language Statistics
+
 ![Performance Dashboard](visualizations/performance_dashboard.png)
 
 ### Analysis and Evaluation
 
@@ -59,7 +60,8 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
 - [4. Vocabulary Analysis](#4-vocabulary-analysis)
 - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
-- [6. Summary & Recommendations](#6-summary--recommendations)
+- [6. Morphological Analysis (Experimental)](#6-morphological-analysis)
+- [7. Summary & Recommendations](#7-summary--recommendations)
 - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
 - [Visualizations Index](#visualizations-index)
 
@@ -68,58 +70,53 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 
 ![Tokenizer Compression](visualizations/tokenizer_compression.png)
 
+![Tokenizer Fertility](visualizations/tokenizer_fertility.png)
+
+![Tokenizer OOV](visualizations/tokenizer_oov.png)
+
+![Total Tokens](visualizations/tokenizer_total_tokens.png)
+
 ### Results
 
 | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
 |------------|-------------|---------------|----------|--------------|
-| **8k** | 3.223x | 3.18 | 0.1016% | 189,909 |
-| **16k** | 3.621x | 3.57 | 0.1142% | 169,055 |
-| **32k** | 4.071x | 4.02 | 0.1284% | 150,370 |
-| **64k** | 4.453x 🏆 | 4.39 | 0.1404% | 137,476 |
+| **8k** | 3.442x | 3.45 | 0.1638% | 134,283 |
+| **16k** | 3.798x | 3.80 | 0.1808% | 121,676 |
+| **32k** | 4.231x 🏆 | 4.24 | 0.2014% | 109,215 |
 
 ### Tokenization Examples
 
 Below are sample sentences tokenized with each vocabulary size:
 
-**Sample 1:** `Киото — Японием и къалэ.
-
-Category:Къалэхэр
-Category:Японие`
+**Sample 1:** `Шъхьафит — Ашэ псыхъо иджабгъу нэпкъы тес Адыгэ къуадж. районым хахьэ. Хым зы пэ...`
 
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁к и от о ▁— ▁японием ▁и ▁къалэ . ▁category ... (+5 more)` | 15 |
-| 16k | `▁ки ото ▁— ▁японием ▁и ▁къалэ . ▁category : къалэхэр ... (+3 more)` | 13 |
-| 32k | `▁ки ото ▁— ▁японием ▁и ▁къалэ . ▁category : къалэхэр ... (+3 more)` | 13 |
-| 64k | `▁киото ▁— ▁японием ▁и ▁къалэ . ▁category : къалэхэр ▁category ... (+2 more)` | 12 |
+| 8k | `▁шъхьафит ▁— ▁ашэ ▁псыхъо ▁иджабгъу ▁нэпкъы ▁тес ▁адыгэ ▁къуадж . ... (+7 more)` | 17 |
+| 16k | `▁шъхьафит ▁— ▁ашэ ▁псыхъо ▁иджабгъу ▁нэпкъы ▁тес ▁адыгэ ▁къуадж . ... (+7 more)` | 17 |
+| 32k | `▁шъхьафит ▁— ▁ашэ ▁псыхъо ▁иджабгъу ▁нэпкъы ▁тес ▁адыгэ ▁къуадж . ... (+7 more)` | 17 |
 
-**Sample 2:** `Ереван () – Армение и къэлэшъхьаI. Нэбгырэ млн 1,06 фэдиз дэс. Къалэм и лIышъхьэ...`
+**Sample 2:** `thumb Америкэ - чӀынэлъэшхухэр Iут зэхэт (Къыблэ Америкэмрэ, Ишъхъэрэмрэ) Тыгъэк...`
 
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁е ре ван ▁() ▁– ▁армение ▁и ▁къэлэшъхьа i . ... (+28 more)` | 38 |
-| 16k | `▁ереван ▁() ▁– ▁армение ▁и ▁къэлэшъхьа i . ▁нэбгырэ ▁млн ... (+25 more)` | 35 |
-| 32k | `▁ереван ▁() ▁– ▁армение ▁и ▁къэлэшъхьа i . ▁нэбгырэ ▁млн ... (+25 more)` | 35 |
-| 64k | `▁ереван ▁() ▁– ▁армение ▁и ▁къэлэшъхьа i . ▁нэбгырэ ▁млн ... (+20 more)` | 30 |
+| 8k | `▁thumb ▁америкэ ▁- ▁чӏы нэлъэ шхухэр ▁i ут ▁зэхэт ▁( ... (+17 more)` | 27 |
+| 16k | `▁thumb ▁америкэ ▁- ▁чӏынэлъэшхухэр ▁i ут ▁зэхэт ▁( къыблэ ▁америкэмрэ ... (+13 more)` | 23 |
+| 32k | `▁thumb ▁америкэ ▁- ▁чӏынэлъэшхухэр ▁i ут ▁зэхэт ▁( къыблэ ▁америкэмрэ ... (+11 more)` | 21 |
 
-**Sample 3:** `thumb
-thumb
-Ишъхъэрэ Америкэ — континент.
-
-ЧIырэу млн 24,7 км² фэдиз еубыты. ЦIы...`
+**Sample 3:** `thumb Мамуныр мэз псэушъхьэхэмэ а щыщ. Мамунхэр чыг дэпшэиэным лъэшэу Мамуным и ...`
 
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁thumb ▁thumb ▁ишъхъэрэ ▁америкэ ▁— ▁континент . ▁ч i ырэу ... (+27 more)` | 37 |
-| 16k | `▁thumb ▁thumb ▁ишъхъэрэ ▁америкэ ▁— ▁континент . ▁ч i ырэу ... (+27 more)` | 37 |
-| 32k | `▁thumb ▁thumb ▁ишъхъэрэ ▁америкэ ▁— ▁континент . ▁ч i ырэу ... (+27 more)` | 37 |
-| 64k | `▁thumb ▁thumb ▁ишъхъэрэ ▁америкэ ▁— ▁континент . ▁ч i ырэу ... (+27 more)` | 37 |
+| 8k | `▁thumb ▁мамун ыр ▁мэз ▁псэушъхьэхэмэ ▁а ▁щыщ . ▁мамун хэр ... (+22 more)` | 32 |
+| 16k | `▁thumb ▁мамуныр ▁мэз ▁псэушъхьэхэмэ ▁а ▁щыщ . ▁мамунхэр ▁ч ыг ... (+14 more)` | 24 |
+| 32k | `▁thumb ▁мамуныр ▁мэз ▁псэушъхьэхэмэ ▁а ▁щыщ . ▁мамунхэр ▁чыг ▁дэпшэиэным ... (+10 more)` | 20 |
 
 
 ### Key Findings
 
-- **Best Compression:** 64k achieves 4.453x compression
-- **Lowest UNK Rate:** 8k with 0.1016% unknown tokens
+- **Best Compression:** 32k achieves 4.231x compression
+- **Lowest UNK Rate:** 8k with 0.1638% unknown tokens
 - **Trade-off:** Larger vocabularies improve compression but increase model size
 - **Recommendation:** 32k vocabulary provides optimal balance for production use
 
@@ -128,57 +125,89 @@ thumb
 
 ![N-gram Perplexity](visualizations/ngram_perplexity.png)
 
+![N-gram Unique](visualizations/ngram_unique.png)
+
 ![N-gram Coverage](visualizations/ngram_coverage.png)
 
 ### Results
 
-| N-gram | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
-|--------|------------|---------|----------------|------------------|-------------------|
-| **2-gram** | 927 🏆 | 9.86 | 1,856 | 38.3% | 83.1% |
-| **2-gram** | 486 🏆 | 8.92 | 2,656 | 53.5% | 95.5% |
-| **3-gram** | 1,521 | 10.57 | 2,744 | 28.3% | 71.0% |
-| **3-gram** | 3,351 | 11.71 | 15,024 | 23.1% | 61.6% |
-| **4-gram** | 4,981 | 12.28 | 7,604 | 14.3% | 42.5% |
-| **4-gram** | 12,700 | 13.63 | 44,900 | 11.7% | 37.6% |
+| N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
+|--------|---------|------------|---------|----------------|------------------|-------------------|
+| **2-gram** | Word | 418 | 8.71 | 593 | 45.3% | 100.0% |
+| **2-gram** | Subword | 399 🏆 | 8.64 | 2,072 | 57.0% | 97.4% |
+| **3-gram** | Word | 706 | 9.46 | 922 | 33.9% | 100.0% |
+| **3-gram** | Subword | 2,788 | 11.44 | 11,614 | 24.5% | 65.1% |
+| **4-gram** | Word | 2,848 | 11.48 | 3,264 | 13.1% | 44.3% |
+| **4-gram** | Subword | 10,651 | 13.38 | 35,316 | 12.4% | 39.6% |
 
 ### Top 5 N-grams by Size
 
-**2-grams:**
+**2-grams (Word):**
 
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `category :` | 662 |
-| 2 | `- рэ` | 638 |
-| 3 | `- м` | 464 |
-| 4 | `рэ илъэсым` | 335 |
-| 5 | `. category` | 276 |
+| 1 | `нэбгырэ млн` | 169 |
+| 2 | `къехъу щэпсэу` | 104 |
+| 3 | `картым тетэу` | 100 |
+| 4 | `м къехъу` | 89 |
+| 5 | `дло м` | 87 |
 
-**3-grams:**
+**3-grams (Word):**
 
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `- рэ илъэсым` | 333 |
-| 2 | `. category :` | 276 |
-| 3 | `category : !` | 179 |
-| 4 | `: ! main` | 179 |
-| 5 | `! main category` | 179 |
+| 1 | `м къехъу щэпсэу` | 76 |
+| 2 | `къехъу щэпсэу хэгэгум` | 70 |
+| 3 | `адыгэ республикэм и` | 48 |
+| 4 | `дло м хахьэ` | 44 |
+| 5 | `м хахьэ хэгъэгу` | 39 |
 
-**4-grams:**
+**4-grams (Word):**
 
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `category : ! main` | 179 |
-| 2 | `: ! main category` | 179 |
-| 3 | `. category : !` | 132 |
-| 4 | `. хэгэгум чiырэу иiэр` | 101 |
-| 5 | `. дло - м` | 87 |
+| 1 | `м къехъу щэпсэу хэгэгум` | 45 |
+| 2 | `дло м хахьэ хэгъэгу` | 39 |
+| 3 | `еуропэм хэт къэралыгъу къэлэ` | 23 |
+| 4 | `америкэм ит къэралыгъу къэлэ` | 19 |
+| 5 | `азием ит къэралыгъу къэлэ` | 18 |
+
+**2-grams (Subword):**
+
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `ъ` | 9,349 |
+| 2 | `э` | 9,255 |
+| 3 | `_` | 8,719 |
+| 4 | `_` | 7,823 |
+| 5 | `р` | 6,778 |
+
+**3-grams (Subword):**
+
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `г ъ э` | 4,967 |
+| 2 | `_ к ъ` | 4,149 |
+| 3 | `э м _` | 3,582 |
+| 4 | `ы г ъ` | 3,357 |
+| 5 | `э р _` | 3,016 |
+
+**4-grams (Subword):**
+
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `ы г ъ э` | 1,903 |
+| 2 | `х э р _` | 1,450 |
+| 3 | `а г ъ э` | 1,351 |
+| 4 | `х э м _` | 1,305 |
+| 5 | `_ к ъ э` | 1,289 |
 
 
 ### Key Findings
 
-- **Best Perplexity:** 2-gram with 486
+- **Best Perplexity:** 2-gram (subword) with 399
 - **Entropy Trend:** Decreases with larger n-grams (more predictable)
-- **Coverage:** Top-1000 patterns cover ~38% of corpus
+- **Coverage:** Top-1000 patterns cover ~40% of corpus
 - **Recommendation:** 4-gram or 5-gram for best predictive performance
 
 ---
215
 
216
  ![Markov Entropy](visualizations/markov_entropy.png)
217
 
218
+ ![Markov Contexts](visualizations/markov_contexts.png)
219
+
220
  ![Markov Branching](visualizations/markov_branching.png)
221
 
222
  ### Results
223
 
224
+ | Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
225
+ |---------|---------|-------------|------------|------------------|-----------------|----------------|
226
+ | **1** | Word | 0.4365 | 1.353 | 2.10 | 22,306 | 56.3% |
227
+ | **1** | Subword | 1.4909 | 2.811 | 10.56 | 410 | 0.0% |
228
+ | **2** | Word | 0.0764 | 1.054 | 1.12 | 46,305 | 92.4% |
229
+ | **2** | Subword | 1.1481 | 2.216 | 5.61 | 4,325 | 0.0% |
230
+ | **3** | Word | 0.0240 | 1.017 | 1.03 | 51,243 | 97.6% |
231
+ | **3** | Subword | 0.7541 | 1.687 | 2.97 | 24,260 | 24.6% |
232
+ | **4** | Word | 0.0128 🏆 | 1.009 | 1.02 | 52,387 | 98.7% |
233
+ | **4** | Subword | 0.4304 | 1.348 | 1.86 | 72,077 | 57.0% |
234
+
235
+ ### Generated Text Samples (Word-based)
236
 
237
+ Below are text samples generated from each word-based Markov chain model:
238
 
239
+ **Context Size 1:**
240
+
241
+ 1. `и 13 мэ ащыщэу адыгэр сыдигъокіи адыгэ къуаж ипшъэ итхьапӏэ иблэгъожъхэм афэгъэхьыгъэ мифхэр къызэра...`
242
+ 2. `адыгэ хьатыкъуай унагъохэр тыркуем и плакат ныбэрынхьэблэ адыгэбзэ жэбзэ къабзэ ежь ныпым зызиушъомб...`
243
+ 3. `м ахахьэ хэгъэгу шавкат мирзияев къэрал лӏышъхьэр кӏокӏо къызбэч кавказ заом ыпэкӏэ щыӏагъэхэмрэ якъ...`
244
+
245
+ **Context Size 2:**
246
+
247
+ 1. `нэбгырэ млн 10 фэдиз тешӏагъэу анатолием ахэр агъэкощыгъэх тхыгъэ зэфэшъхьафхэм мэхьанэу каноничност...`
248
+ 2. `къехъу щэпсэу я 84 хэгэгум 93 030 км я 26 испаныбзэр ащ нэмыкӏэу регионыбзэхэр иӏэх дло м`
249
+ 3. `картым тетэу бразилие къыблэ америкэм ыгу ит германиер аустриер словакиер руманиер украинэр сербиер ...`
250
+
251
+ **Context Size 3:**
252
+
253
+ 1. `м къехъу щэпсэу хэгэгум 2 149 690 км арапыбз сауд арабиер арап къэралыгъомэ ащыщмэ анахь хэгъэгу ащы...`
254
+ 2. `къехъу щэпсэу хэгэгум 140 800 км непали дло м хахьэ хэгъэгу хассанал болкиах географие азием и гъунэ...`
255
+ 3. `адыгэ республикэм и къэралыгъо премие илауреат дунэе адыгэ академием иакадемик къалэу шъачэ поселкэу...`
256
+
257
+ **Context Size 4:**
258
+
259
+ 1. `м къехъу щэпсэу хэгэгум 9 596 960 км китаибзэр дло м хахьэ хэгъэгу эмомали рахмон къэрал тхьэматэр к...`
260
+ 2. `дло м хахьэ хэгъэгу джоко видодо гуадзэр юсуф калла географие океан шъэфымымрэ инд океанымрэ азфагу ...`
261
+ 3. `еуропэм хэт къэралыгъу къэлэ париж нэбгырэ млн 66 м къехъу щэпсэу хэгэгум 9 984 670 км я 2 англыбзэ`
262
+
263
+
264
+ ### Generated Text Samples (Subword-based)
265
+
266
+ Below are text samples generated from each subword-based Markov chain model:
267
 
268
  **Context Size 1:**
269
 
270
+ 1. `_фим_хъэрикъолам`
271
+ 2. `эм_илъу_-м_бэхь_`
272
+ 3. `ышъэпсым_илнине_`
273
 
274
  **Context Size 2:**
275
 
276
+ 1. `гъэгъэ_асэу_ɡʲadə`
277
+ 2. `ъэхьэухэм_епхъухь`
278
+ 3. `э_хэгьэмрэ_щыпӏэ-`
279
 
280
  **Context Size 3:**
281
 
282
+ 1. `гъэ_уахэмрэ,_къыуи`
283
+ 2. `_къалэбилэжъ_зэпхъ`
284
+ 3. `эм_ыгугъэкон_къаук`
285
 
286
  **Context Size 4:**
287
 
288
+ 1. `ыгъэуцохэр_чэзыу-чэ`
289
+ 2. `хэр_нэхъин_динхэр_з`
290
+ 3. `агъэр_гъэп,_англыбз`
291
 
292
 
293
  ### Key Findings
294
 
295
+ - **Best Predictability:** Context-4 (word) with 98.7% predictability
296
  - **Branching Factor:** Decreases with context size (more deterministic)
297
+ - **Memory Trade-off:** Larger contexts require more storage (72,077 contexts)
298
  - **Recommendation:** Context-3 or Context-4 for text generation
299
 
300
  ---
 
310
 
311
  | Metric | Value |
312
  |--------|-------|
313
+ | Vocabulary Size | 7,032 |
314
+ | Total Tokens | 44,503 |
315
+ | Mean Frequency | 6.33 |
316
  | Median Frequency | 3 |
317
+ | Frequency Std Dev | 22.13 |
318
 
319
  ### Most Common Words
320
 
321
  | Rank | Word | Frequency |
322
  |------|------|-----------|
323
+ | 1 | и | 1,013 |
324
+ | 2 | адыгэ | 666 |
325
+ | 3 | м | 489 |
326
+ | 4 | илъэсым | 398 |
327
+ | 5 | ащ | 391 |
328
+ | 6 | я | 309 |
329
+ | 7 | ары | 271 |
330
+ | 8 | нэбгырэ | 247 |
331
+ | 9 | а | 243 |
332
+ | 10 | ыкӏи | 211 |
333
 
334
  ### Least Common Words (from vocabulary)
335
 
336
  | Rank | Word | Frequency |
337
  |------|------|-----------|
338
+ | 1 | рсфср | 2 |
339
+ | 2 | серийнэ | 2 |
340
+ | 3 | ныбжьыкӏэхэри | 2 |
341
+ | 4 | зэратебэнагъэр | 2 |
342
+ | 5 | хираганэ | 2 |
343
+ | 6 | катаканэ | 2 |
344
+ | 7 | сербыбзэм | 2 |
345
+ | 8 | къыздикӏыгъэр | 2 |
346
+ | 9 | тыванбзэ | 2 |
347
+ | 10 | къызыл | 2 |
348
 
349
  ### Zipf's Law Analysis
350
 
351
  | Metric | Value |
352
  |--------|-------|
353
+ | Zipf Coefficient | 0.7821 |
354
+ | R² (Goodness of Fit) | 0.977951 |
355
  | Adherence Quality | **excellent** |
356
 
357
  ### Coverage Analysis
358
 
359
  | Top N Words | Coverage |
360
  |-------------|----------|
361
+ | Top 100 | 29.3% |
362
+ | Top 1,000 | 60.6% |
363
+ | Top 5,000 | 90.9% |
364
  | Top 10,000 | 0.0% |
365
 
366
  ### Key Findings
367
 
368
+ - **Zipf Compliance:** R²=0.9780 indicates excellent adherence to Zipf's law
369
+ - **High Frequency Dominance:** Top 100 words cover 29.3% of corpus
370
+ - **Long Tail:** -2,968 words needed for remaining 100.0% coverage
371
 
372
  ---
373
  ## 5. Word Embeddings Evaluation
 
380
 
381
  ![t-SNE Sentences](visualizations/tsne_sentences.png)
382
 
 
383
 
384
+ ### 5.1 Cross-Lingual Alignment
385
+
386
+ > *Note: Multilingual alignment visualization not available for this language.*
387
+
388
+
389
+ ### 5.2 Model Comparison
390
+
391
+ | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
392
+ |-------|-----------|----------|------------------|---------------|----------------|
393
+ | **mono_32d** | 32 | 0.4730 🏆 | 0.4239 | N/A | N/A |
394
+ | **mono_64d** | 64 | 0.2201 | 0.4040 | N/A | N/A |
395
+ | **mono_128d** | 128 | 0.0372 | 0.3952 | N/A | N/A |
396
 
397
  ### Key Findings
398
 
399
+ - **Best Isotropy:** mono_32d with 0.4730 (more uniform distribution)
400
+ - **Semantic Density:** Average pairwise similarity of 0.4077. Lower values indicate better semantic separation.
401
+ - **Alignment Quality:** No aligned models evaluated in this run.
402
+ - **Recommendation:** 128d aligned for best cross-lingual performance
403
 
404
  ---
405
+ ## 6. Morphological Analysis (Experimental)
406
+
407
+ > ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.
408
+
409
+ This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
410
+
411
+ ### 6.1 Productivity & Complexity
412
+
413
+ | Metric | Value | Interpretation | Recommendation |
414
+ |--------|-------|----------------|----------------|
415
+ | Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
416
+ | Idiomaticity Gap | **-1.000** | Low formulaic content | - |
417
+
418
+ ### 6.2 Affix Inventory (Productive Units)
419
+
420
+ These are the most productive prefixes and suffixes identified by sampling the vocabulary for global substitutability patterns. A unit is considered an affix if stripping it leaves a valid stem that appears in other contexts.
421
+
422
+ #### Productive Prefixes
423
+ | Prefix | Examples |
424
+ |--------|----------|
425
+ | `-къ` | къыщыхъу, къуаджэхэу, къэбарым |
426
+ | `-зэ` | зэман, зэдаштэгъэ, зэпэух |
427
+ | `-къы` | къыщыхъу, къыщыфэфедэщтхэу, къызыхэкӏыгъэр |
428
+
429
+ #### Productive Suffixes
430
+ | Suffix | Examples |
431
+ |--------|----------|
432
+ | `-э` | ятхьэ, урысыбзэ, чылэ |
433
+ | `-м` | такъырым, шапхъэхэм, къэбарым |
434
+ | `-р` | латвиер, сыхьатыр, министр |
435
+ | `-эр` | курдхэр, щыгъынхэр, мэхъошхэр |
436
+ | `-эм` | шапхъэхэм, япэм, урымыбзэм |
437
+ | `-эу` | алфавитэу, илъхэу, игъэкӏотыгъэу |
438
+ | `-хэр` | курдхэр, щыгъынхэр, мэхъошхэр |
439
+ | `-рэ` | къагъэлъагъуэрэ, зыгорэ, цӏэмрэ |
440
+
441
+ ### 6.3 Bound Stems (Lexical Roots)
442
+
443
+ Bound stems are high-frequency subword units that are semantically cohesive but rarely appear as standalone words. These often correspond to the 'core' of a word that requires inflection or derivation to be valid.
444
+
445
+ | Stem | Cohesion | Substitutability | Examples |
446
+ |------|----------|------------------|----------|
447
+ | `тыгъ` | 1.78x | 28 contexts | тыгъэ, итыгъ, тыгъу |
448
+ | `ъагъ` | 2.15x | 14 contexts | пчъагъ, лъагъо, пчъагъэ |
449
+ | `агъэ` | 1.54x | 41 contexts | тхагъэ, благъэ, пчагъэ |
450
+ | `эпкъ` | 1.74x | 25 contexts | нэпкъ, тхэпкъ, лъэпкъ |
451
+ | `къуа` | 2.16x | 10 contexts | къуае, къуажэ, къуадж |
452
+ | `ъхьэ` | 1.78x | 16 contexts | шъхьэ, пшъхьэ, шъхьэм |
453
+ | `дыгэ` | 1.82x | 14 contexts | адыгэ, адыгэм, иадыгэ |
454
+ | `эхэр` | 1.56x | 21 contexts | бэхэр, усэхэр, ынэхэр |
455
+ | `шъхь` | 1.49x | 24 contexts | шъхьэ, пшъхьэ, шъхьэм |
456
+ | `псэу` | 1.57x | 19 contexts | щыпсэу, щэпсэу, сыпсэу |
457
+ | `ыгъо` | 1.56x | 19 contexts | цыгъо, мыгъо, пщыгъо |
458
+ | `гъэх` | 1.65x | 14 contexts | багъэх, хъугъэх, ежагъэх |
459
+
460
+ ### 6.4 Affix Compatibility (Co-occurrence)
461
+
462
+ This table shows which prefixes and suffixes most frequently co-occur on the same stems, revealing the 'stacking' rules of the language's morphology.
463
+
464
+ | Prefix | Suffix | Frequency | Examples |
465
+ |--------|--------|-----------|----------|
466
+ | `-къ` | `-э` | 96 words | къэлэмымкӏэ, къалэмэ |
467
+ | `-къ` | `-р` | 64 words | къо��, къуаджэхэр |
468
+ | `-къ` | `-м` | 56 words | къалэм, къумбылым |
469
+ | `-къ` | `-эр` | 52 words | къуаджэхэр, къэбархэр |
470
+ | `-зэ` | `-р` | 42 words | зэготхэр, зэхэтхэр |
471
+ | `-зэ` | `-м` | 41 words | зэхэзгъэуцуагъэхэм, зэӏукӏэгъум |
472
+ | `-къ` | `-эм` | 36 words | къалэм, къуаджэхэм |
473
+ | `-зэ` | `-эр` | 34 words | зэготхэр, зэхэтхэр |
474
+ | `-къ` | `-эу` | 34 words | къыхэкӏыгъэу, къэгъэлъэгъонэу |
475
+ | `-зэ` | `-э` | 31 words | зэ, зэригъэфэгъэ |
476
+
477
+ ### 6.5 Recursive Morpheme Segmentation
478
+
479
+ Using **Recursive Hierarchical Substitutability**, we decompose complex words into their constituent morphemes. This approach handles nested affixes (e.g., `prefix-prefix-root-suffix`).
480
+
481
+ | Word | Suggested Split | Confidence | Stem |
482
+ |------|-----------------|------------|------|
483
+ | щыпсэухэрэр | **`щыпс-эу-хэр-эр`** | 7.5 | `щыпс` |
484
+ | америкэмрэ | **`америк-эм-рэ`** | 6.0 | `америк` |
485
+ | океанымрэ | **`океан-ым-рэ`** | 6.0 | `океан` |
486
+ | литературэмрэ | **`литератур-эм-рэ`** | 6.0 | `литератур` |
487
+ | бзылъфыгъэмрэ | **`бзылъфыгъ-эм-рэ`** | 6.0 | `бзылъфыгъ` |
488
+ | адыгабзэмрэ | **`адыгабз-эм-рэ`** | 6.0 | `адыгабз` |
489
+ | хыплъыжьымрэ | **`хыплъыжь-ым-рэ`** | 6.0 | `хыплъыжь` |
490
+ | алфавитэу | **`алфавит-эу`** | 4.5 | `алфавит` |
491
+ | цӏыкӏухэр | **`цӏыкӏу-хэр`** | 4.5 | `цӏыкӏу` |
492
+ | исурэтхэр | **`исурэт-хэр`** | 4.5 | `исурэт` |
493
+ | шӏыпӏэхэр | **`шӏыпӏэ-хэр`** | 4.5 | `шӏыпӏэ` |
494
+ | шӏэныгъэм | **`шӏэныгъ-эм`** | 4.5 | `шӏэныгъ` |
495
+ | къыпыщылъ | **`къы-пыщылъ`** | 4.5 | `пыщылъ` |
496
+ | пэблагъэу | **`пэблагъ-эу`** | 4.5 | `пэблагъ` |
497
+ | ишъхъэрэмрэ | **`ишъхъ-эр-эм-рэ`** | 4.5 | `ишъхъ` |
498
+
499
+ ### 6.6 Linguistic Interpretation
500
+
501
+ > **Automated Insight:**
502
+ The language ADY appears to be more isolating or has a highly fixed vocabulary. Word-level models perform nearly as well as subword models, indicating fewer productive morphological processes.
503
+
504
+ ---
505
+ ## 7. Summary & Recommendations
506
 
507
  ![Performance Dashboard](visualizations/performance_dashboard.png)
508
 
 
510
 
511
  | Component | Recommended | Rationale |
512
  |-----------|-------------|-----------|
513
+ | Tokenizer | **32k BPE** | Best compression (4.23x) |
514
+ | N-gram | **2-gram** | Lowest perplexity (399) |
515
+ | Markov | **Context-4** | Highest predictability (98.7%) |
516
  | Embeddings | **100d** | Balanced semantic capture and isotropy |
517
 
518
+
519
  ---
520
  ## Appendix: Metrics Glossary & Interpretation Guide
521
 
 
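Section 6.2 above defines its affix test operationally: a unit counts as an affix if stripping it leaves a valid stem that appears in other contexts. A toy sketch of that substitutability test; it is deliberately simpler than the recursive hierarchical method of section 6.5, and the mini-vocabulary is hypothetical:

```python
# Sketch: find productive suffixes via the substitutability test of
# section 6.2: a suffix counts when stripping it leaves a stem that is
# itself attested (here: as another vocabulary entry). Simplified
# relative to the recursive segmentation of section 6.5.
from collections import Counter

vocab = {"курдхэр", "курд", "щыгъынхэр", "щыгъын", "шапхъэхэм",
         "шапхъэ", "къэбарым", "къэбар"}   # hypothetical mini-vocabulary

suffix_hits = Counter()
for word in vocab:
    for k in (1, 2, 3):                    # candidate suffix lengths
        stem, suffix = word[:-k], word[-k:]
        if len(stem) >= 3 and stem in vocab:
            suffix_hits[suffix] += 1       # the stem survives on its own

for suffix, n in suffix_hits.most_common(5):
    print(f"-{suffix}: {n} supporting word/stem pairs")
```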
@@ -539,7 +705,8 @@ If you use these models in your research, please cite:
   author = {Kamali, Omar},
   title = {Wikilangs: Open NLP Models for Wikipedia Languages},
   year = {2025},
-  publisher = {HuggingFace},
+  doi = {10.5281/zenodo.18073153},
+  publisher = {Zenodo},
   url = {https://huggingface.co/wikilangs}
   institution = {Omneity Labs}
 }
 
@@ -555,7 +722,8 @@ MIT License - Free for academic and commercial use.
 - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
 - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
 - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
+- 🤝 Sponsor: [Featherless AI](https://featherless.ai)
 ---
 *Generated by Wikilangs Models Pipeline*
 
-*Report Date: 2025-12-27 04:34:00*
+*Report Date: 2026-01-03 05:00:02*
models/embeddings/monolingual/ady_128d.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:27f96cb22cce1dccf3f5a68afba58270c9a3d30c33c99e0b985a0160dc08be41
-size 1025913894
+oid sha256:79239fb8f1d5516f7259d49637e01b3851134cf4dc232d72d6bbc383171bd360
+size 1025621365
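
Every binary in this commit is stored in Git as a small LFS pointer in exactly the three-field format shown in these hunks (`version`, `oid sha256:<hex>`, `size <bytes>`); the payload itself lives in LFS storage keyed by the hash. A parsing sketch using the new pointer above:

```python
# Sketch: parse a Git LFS pointer file of the form shown above
# (version / oid sha256:<hex> / size <bytes>).
def parse_lfs_pointer(text: str) -> dict:
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    return {
        "version": fields["version"],
        "sha256": fields["oid"].removeprefix("sha256:"),
        "size_bytes": int(fields["size"]),
    }

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:79239fb8f1d5516f7259d49637e01b3851134cf4dc232d72d6bbc383171bd360
size 1025621365
"""
info = parse_lfs_pointer(pointer)
print(info["sha256"][:12], f"{info['size_bytes'] / 1e9:.2f} GB")
```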
models/embeddings/monolingual/ady_128d_metadata.json CHANGED
@@ -3,11 +3,13 @@
   "dimension": 128,
   "version": "monolingual",
   "training_params": {
-    "dim": 128,
+    "algorithm": "skipgram",
     "min_count": 5,
     "window": 5,
     "negative": 5,
-    "epochs": 5
+    "epochs": 5,
+    "encoding_method": "rope",
+    "dim": 128
   },
-  "vocab_size": 1830
+  "vocab_size": 1551
 }
models/embeddings/monolingual/ady_32d.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:4df91a4ae1b68f41d1ebc698b3b79863f82558368b99039be6879f7219c9dfab
-size 256508454
+oid sha256:35a83ab3b148734fe90d463a14bb03ac25aa53bebe1cd930e6ee275b421f2b9e
+size 256430197
models/embeddings/monolingual/ady_32d_metadata.json CHANGED
@@ -3,11 +3,13 @@
   "dimension": 32,
   "version": "monolingual",
   "training_params": {
-    "dim": 32,
+    "algorithm": "skipgram",
     "min_count": 5,
     "window": 5,
     "negative": 5,
-    "epochs": 5
+    "epochs": 5,
+    "encoding_method": "rope",
+    "dim": 32
   },
-  "vocab_size": 1830
+  "vocab_size": 1551
 }
models/embeddings/monolingual/ady_64d.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e39765ffd2c8eb6ef0e3b1854af129b7b0260b2e70a54156a81a9a099b1fa27b
-size 512976934
+oid sha256:f26b79bc9f27a68c68e56e0b9e83bef6a8d25662bd63c7a490adce062b153229
+size 512827253
models/embeddings/monolingual/ady_64d_metadata.json CHANGED
@@ -3,11 +3,13 @@
   "dimension": 64,
   "version": "monolingual",
   "training_params": {
-    "dim": 64,
+    "algorithm": "skipgram",
     "min_count": 5,
     "window": 5,
     "negative": 5,
-    "epochs": 5
+    "epochs": 5,
+    "encoding_method": "rope",
+    "dim": 64
   },
-  "vocab_size": 1830
+  "vocab_size": 1551
 }
models/subword_markov/ady_markov_ctx1_subword.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:eb658a5696f0194212546a4afabf40b1859c08800bbd6f1ffecda37fddfd3c58
-size 45335
+oid sha256:326d1899bc21c88f5d2ecbb60df90091a3f81877b05f9f3880179e64fc82bc39
+size 35559
models/subword_markov/ady_markov_ctx1_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "context_size": 1,
   "variant": "subword",
   "language": "ady",
-  "unique_contexts": 463,
-  "total_transitions": 611427
+  "unique_contexts": 410,
+  "total_transitions": 461186
 }
models/subword_markov/ady_markov_ctx2_subword.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:b5ebd13fcb32a70f1eeeb3c7ec3df3945466bdb98d3539405ae08a602378e310
-size 217004
+oid sha256:439580e1d85b149b7a4e12434dcacbd8c63ce4f422949f7787d12f51a3b06768
+size 169546
models/subword_markov/ady_markov_ctx2_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "context_size": 2,
   "variant": "subword",
   "language": "ady",
-  "unique_contexts": 5679,
-  "total_transitions": 610662
+  "unique_contexts": 4325,
+  "total_transitions": 460515
 }
models/subword_markov/ady_markov_ctx3_subword.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:b9df0d86e436385c2b2494f0af89b0ca67a6a9d8aa7e5706dadd31de6649a3d8
-size 714083
+oid sha256:ba32cd485f139de6037badf290db15170159de5ebb63242fabe6473e50963984
+size 554556
models/subword_markov/ady_markov_ctx3_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "context_size": 3,
   "variant": "subword",
   "language": "ady",
-  "unique_contexts": 32122,
-  "total_transitions": 609897
+  "unique_contexts": 24260,
+  "total_transitions": 459844
 }
models/subword_markov/ady_markov_ctx4_subword.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:14192c7fe2413ebd6957d9d1e1bf08f20d2cf28008cf6b161e3c9ff26a079c9b
-size 1599436
+oid sha256:a23a4022efd4fb6c5360771eb3945cadabce7d4e08d9a04895e4535a30c12dad
+size 1240010
models/subword_markov/ady_markov_ctx4_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "context_size": 4,
   "variant": "subword",
   "language": "ady",
-  "unique_contexts": 92841,
-  "total_transitions": 609132
+  "unique_contexts": 72077,
+  "total_transitions": 459173
 }
models/subword_ngram/ady_2gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:016367258813db8565f69ab7cb29945baccaac7e94321afc25a4b6c3c9f79f58
-size 32965
+oid sha256:f2744392ac6f30902dc24ef71be21ed3d0027816ff58c51cc117dbe79b60b6eb
+size 26615
models/subword_ngram/ady_2gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "n": 2,
   "variant": "subword",
   "language": "ady",
-  "unique_ngrams": 2656,
-  "total_ngrams": 611427
+  "unique_ngrams": 2072,
+  "total_ngrams": 461186
 }
models/subword_ngram/ady_3gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:19fa61eac0061901eb9506a6a556a9261919e5b319eaf1999aa663587ce22014
-size 187929
+oid sha256:3e27920b4a896f7b33778a4da60561fc4bf2c40fe17414d4721a023a815cc6e6
+size 144946
models/subword_ngram/ady_3gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "n": 3,
   "variant": "subword",
   "language": "ady",
-  "unique_ngrams": 15024,
-  "total_ngrams": 610662
+  "unique_ngrams": 11614,
+  "total_ngrams": 460515
 }
models/subword_ngram/ady_4gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:c90862733e2c9b9b0d0d50264ea47e26c1bdce42f31aa0ab349bff62b10cc5fe
-size 580214
+oid sha256:a1189db5bc7c400c2d162eabf55e0ca7ce6b8403c020f3330fe200b437eab55c
+size 467994
models/subword_ngram/ady_4gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "n": 4,
   "variant": "subword",
   "language": "ady",
-  "unique_ngrams": 44900,
-  "total_ngrams": 609897
+  "unique_ngrams": 35316,
+  "total_ngrams": 459844
 }
models/tokenizer/ady_tokenizer_16k.model CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e6120c55e21bf381032672a3067bcf46760e2c4d359e940e77b9f3f746421b69
-size 564362
+oid sha256:aa7339e704e163ae3ef399bf56ca2b041b58ad1cead7a66d4fef2204b2af8435
+size 582264
models/tokenizer/ady_tokenizer_16k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/tokenizer/ady_tokenizer_32k.model CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:8e432f1beef1ea7d335d31918e5e88f5751d4fc577f2231b239c92518f39dc9c
-size 955536
+oid sha256:dd22631bb665397f8acd6aab07c2522b6ca5fc9532f157143320bbb205558552
+size 924924
models/tokenizer/ady_tokenizer_32k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/tokenizer/ady_tokenizer_8k.model CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:cfc7334d75b921cca2c9e616b995ac20d0728c4f9b3c54f3c5cb0b2ba2b5267c
-size 398334
+oid sha256:8a4cd8210cf36092a1e5cc2950c6a2e334677c7d204476c1e9ac6f464fda33e5
+size 396510
models/tokenizer/ady_tokenizer_8k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/vocabulary/ady_vocabulary.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:733b017356844758ce21db6ab8269f1bff9141c2d41430f547302391bc541d7e
-size 161889
+oid sha256:0394f3c6c2f9871c8756de87f7c0ab00cba32facf111d0c2027f0844c932241b
+size 124070
models/vocabulary/ady_vocabulary_metadata.json CHANGED
@@ -1,16 +1,17 @@
 {
   "language": "ady",
-  "vocabulary_size": 8988,
+  "vocabulary_size": 7032,
+  "variant": "full",
   "statistics": {
-    "type_token_ratio": 0.374012371348373,
+    "type_token_ratio": 0.37416908841901325,
     "coverage": {
-      "top_100": 0.1982014762449319,
-      "top_1000": 0.42253612641646743,
-      "top_5000": 0.6391386838548706,
-      "top_10000": 0.7559387670235991
+      "top_100": 0.21755686942579416,
+      "top_1000": 0.4501954103617597,
+      "top_5000": 0.6754016768547283,
+      "top_10000": 0.7928483147944015
     },
-    "hapax_count": 19793,
-    "hapax_ratio": 0.6877106424377193,
-    "total_documents": 765
+    "hapax_count": 15371,
+    "hapax_ratio": 0.6861134669463911,
+    "total_documents": 671
   }
 }
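
For the statistics block above: type_token_ratio is unique types over total tokens, and hapax_ratio is the share of types seen exactly once; note that hapax_count (15,371) exceeds vocabulary_size (7,032), so the stored vocabulary evidently excludes hapaxes (its least common entries have frequency 2) while the ratios are computed over all observed types. A sketch with a whitespace-tokenized toy corpus, since the pipeline's actual tokenization is not specified:

```python
# Sketch: vocabulary statistics of the kind stored in this metadata.
# Assumes a simple whitespace-tokenized corpus; the pipeline's real
# tokenization is not documented here.
from collections import Counter

tokens = "адыгэ м и адыгэ илъэсым и и".split()   # toy corpus
counts = Counter(tokens)

type_token_ratio = len(counts) / len(tokens)     # unique types / tokens
hapaxes = [w for w, c in counts.items() if c == 1]
hapax_ratio = len(hapaxes) / len(counts)         # share of one-off types

print(f"TTR={type_token_ratio:.3f}, hapax_ratio={hapax_ratio:.3f}")
```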
models/word_markov/ady_markov_ctx1_word.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:0faaa6cb0c52d77ac3f86b442e03b6cdd0646165c5b5467afaa42799100c5c08
-size 1013255
+oid sha256:e529d92df652217192b1a53b43d340b1b2d134ff78b718c2331e8296f5e8a183
+size 772451
models/word_markov/ady_markov_ctx1_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "context_size": 1,
   "variant": "word",
   "language": "ady",
-  "unique_contexts": 28827,
-  "total_transitions": 105741
+  "unique_contexts": 22306,
+  "total_transitions": 59203
 }
models/word_markov/ady_markov_ctx2_word.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:cdf814da92430ed71f7261a45abbc27d6bc4f679afc5556ff21d158586b5709f
-size 1738893
+oid sha256:949fea0fcff222936353244ef6edc60951c50ac50290dc3b857a31ce73658051
+size 1310186
models/word_markov/ady_markov_ctx2_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "context_size": 2,
   "variant": "word",
   "language": "ady",
-  "unique_contexts": 65637,
-  "total_transitions": 104976
+  "unique_contexts": 46305,
+  "total_transitions": 58532
 }
models/word_markov/ady_markov_ctx3_word.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:c430f624230776fb0bed9ceec3edfcc58d221b9f682aa377294e7faf952a7f33
-size 2203636
+oid sha256:0edd29aaf52ac72ad56819e317223cc422f1de2fd93feb950a6fd0c9d7772f20
+size 1585798
models/word_markov/ady_markov_ctx3_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "context_size": 3,
   "variant": "word",
   "language": "ady",
-  "unique_contexts": 80882,
-  "total_transitions": 104212
+  "unique_contexts": 51243,
+  "total_transitions": 57861
 }
models/word_markov/ady_markov_ctx4_word.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:f994f94aaa5faf04bea4da903f474a186f4d719203b671b4c8538fa7e3d6c323
-size 2577702
+oid sha256:d3e2c27ffec6ab82c94cdc5ee7375d9e46705feb8849d5c1a1b44f49a1934653
+size 1845825
models/word_markov/ady_markov_ctx4_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "context_size": 4,
   "variant": "word",
   "language": "ady",
-  "unique_contexts": 86492,
-  "total_transitions": 103449
+  "unique_contexts": 52387,
+  "total_transitions": 57190
 }
models/word_ngram/ady_2gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e36a3e7022b130b28b923c0f0edfaff7cb6f8c269b793337bd6e999b68fc36cd
-size 35981
+oid sha256:844f6778f268cd035e5b1a16fd38a5a6039c763c63cfaee2c5da98c36decbf9d
+size 15361
models/word_ngram/ady_2gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "n": 2,
   "variant": "word",
   "language": "ady",
-  "unique_ngrams": 1856,
-  "total_ngrams": 105741
+  "unique_ngrams": 593,
+  "total_ngrams": 59203
 }
models/word_ngram/ady_3gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:a2ce08e36db31a04439c365d753c2684b44924398908b6f7a2883305f54f1f0c
-size 61177
+oid sha256:2d7eb072c0df1857a28734446befa43d91f7bec4523744835d1f96b7ef4ea52f
+size 26381
models/word_ngram/ady_3gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "n": 3,
   "variant": "word",
   "language": "ady",
-  "unique_ngrams": 2744,
-  "total_ngrams": 104976
+  "unique_ngrams": 922,
+  "total_ngrams": 58532
 }
models/word_ngram/ady_4gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:43176b575286a7924ea5d8764dc0f6dfc283f757937cfa2d648424a65326ca9f
-size 203246
+oid sha256:554ec1d49a6c147c67f6b66d5ec37c9b9ad812cce293118aea436980088ed1b3
+size 106335
models/word_ngram/ady_4gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "n": 4,
   "variant": "word",
   "language": "ady",
-  "unique_ngrams": 7604,
-  "total_ngrams": 104212
+  "unique_ngrams": 3264,
+  "total_ngrams": 57861
 }
visualizations/embedding_isotropy.png CHANGED
visualizations/embedding_norms.png CHANGED
visualizations/embedding_similarity.png CHANGED

Git LFS Details
  • Old: SHA256 1390d31649b7b1e14178b468ce45edbcbc5b56a783f8ab4ab111de4f6915d356 · Pointer size: 131 Bytes · Size of remote file: 152 kB
  • New: SHA256 c9cf7cf290ea2cdd3ae4d69c530470ea951a4ba615d62fe79fc241192ad9656d · Pointer size: 131 Bytes · Size of remote file: 151 kB
visualizations/markov_branching.png CHANGED
visualizations/markov_contexts.png CHANGED
visualizations/markov_entropy.png CHANGED