Upload all models and assets for am (latest)
- README.md +61 -61
- models/embeddings/aligned/am_128d.bin +1 -1
- models/embeddings/aligned/am_128d.projection.npy +1 -1
- models/embeddings/aligned/am_32d.bin +1 -1
- models/embeddings/aligned/am_32d.projection.npy +1 -1
- models/embeddings/aligned/am_64d.bin +1 -1
- models/embeddings/aligned/am_64d.projection.npy +1 -1
- models/embeddings/monolingual/am_128d.bin +1 -1
- models/embeddings/monolingual/am_32d.bin +1 -1
- models/embeddings/monolingual/am_64d.bin +1 -1
- models/subword_markov/am_markov_ctx1_subword.parquet +2 -2
- models/subword_markov/am_markov_ctx2_subword.parquet +2 -2
- models/subword_markov/am_markov_ctx3_subword.parquet +2 -2
- models/subword_markov/am_markov_ctx4_subword.parquet +2 -2
- models/subword_ngram/am_2gram_subword.parquet +2 -2
- models/subword_ngram/am_3gram_subword.parquet +2 -2
- models/subword_ngram/am_4gram_subword.parquet +2 -2
- models/subword_ngram/am_5gram_subword.parquet +2 -2
- models/tokenizer/am_tokenizer_16k.model +1 -1
- models/tokenizer/am_tokenizer_32k.model +1 -1
- models/tokenizer/am_tokenizer_64k.model +1 -1
- models/tokenizer/am_tokenizer_8k.model +1 -1
- models/word_markov/am_markov_ctx1_word.parquet +2 -2
- models/word_markov/am_markov_ctx2_word.parquet +2 -2
- models/word_markov/am_markov_ctx3_word.parquet +2 -2
- models/word_markov/am_markov_ctx4_word.parquet +2 -2
- models/word_ngram/am_2gram_word.parquet +2 -2
- models/word_ngram/am_3gram_word.parquet +2 -2
- models/word_ngram/am_4gram_word.parquet +2 -2
- models/word_ngram/am_5gram_word.parquet +2 -2
- visualizations/embedding_alignment_quality.png +0 -0
- visualizations/embedding_isotropy.png +0 -0
- visualizations/embedding_norms.png +0 -0
- visualizations/embedding_similarity.png +2 -2
- visualizations/embedding_tsne_multilingual.png +2 -2
- visualizations/ngram_entropy.png +0 -0
- visualizations/performance_dashboard.png +2 -2
- visualizations/position_encoding_comparison.png +2 -2
- visualizations/tsne_sentences.png +2 -2
- visualizations/tsne_words.png +2 -2
README.md
CHANGED
@@ -99,32 +99,32 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and

 Below are sample sentences tokenized with each vocabulary size:

-**Sample 1:**

 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k |
-| 16k |
-| 32k |
-| 64k |

-**Sample 2:**

 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k |
-| 16k |
-| 32k |
-| 64k |

-**Sample 3:**

 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k |
-| 16k |
-| 32k |
-| 64k |


 ### Key Findings
@@ -274,27 +274,27 @@ Below are text samples generated from each word-based Markov chain model:

 **Context Size 1:**

-1. `ነው
-2. `እና
-3. `ላይ

 **Context Size 2:**

-1. `ዓ ም
-2. `ምሳሌ ነው
-3. `የአማርኛ ምሳሌ ነው

 **Context Size 3:**

-1. `የአማርኛ ምሳሌ ነው ትርጉሙ መደብ ያልተተረጎመ ምሳሌ መደብ ተረትና ምሳሌ
-2. `እ ኤ አ
-3. `ምሳሌ ነው ትርጉሙ

 **Context Size 4:**

-1. `የአማርኛ ምሳሌ ነው ትርጉሙ መደብ
-2. `ምሳሌ ነው ትርጉሙ መደብ ተረትና ምሳሌ
-3. `ነው ትርጉሙ መደብ ያልተተረጎመ ምሳሌ መደብ ተረትና ምሳሌ


 ### Generated Text Samples (Subword-based)
@@ -303,27 +303,27 @@ Below are text samples generated from each subword-based Markov chain model:

 **Context Size 1:**

-1. `
-2.
-3. `ት_

 **Context Size 2:**

-1. `_
-2. `ት_
-3. `_

 **Context Size 3:**

-1. `_
-2. `_
-3. `_እና_

 **Context Size 4:**

-1. `_
-2. `_ነው።_
-3. `ነው።_


 ### Key Findings
@@ -428,18 +428,18 @@ Below are text samples generated from each subword-based Markov chain model:

 | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
 |-------|-----------|----------|------------------|---------------|----------------|
-| **mono_32d** | 32 | 0.
-| **mono_64d** | 64 | 0.9137 | 0.
-| **mono_128d** | 128 | 0.
-| **aligned_32d** | 32 | 0.
-| **aligned_64d** | 64 | 0.9137
-| **aligned_128d** | 128 | 0.

 ### Key Findings

-- **Best Isotropy:**
-- **Semantic Density:** Average pairwise similarity of 0.
-- **Alignment Quality:** Aligned models achieve up to
 - **Recommendation:** 128d aligned for best cross-lingual performance

 ---
@@ -467,18 +467,18 @@ Bound stems are high-frequency subword units that are semantically cohesive but

 | Stem | Cohesion | Substitutability | Examples |
 |------|----------|------------------|----------|
-| `እንደሚ` | 2.
-| `ርስቲያ` | 2.
-| `ትዮጵያ` | 2.
-| `መንግስ` | 2.
-| `ግዚአብ` | 2.
-| `ኢትዮጵ` | 2.
-| `እንግሊ` | 2.
-|
-|
-|
-|
-|

 ### 6.4 Affix Compatibility (Co-occurrence)
@@ -726,4 +726,4 @@ MIT License - Free for academic and commercial use.
 ---
 *Generated by Wikilangs Models Pipeline*

-*Report Date: 2026-01-03
 Below are sample sentences tokenized with each vocabulary size:

+**Sample 1:** `ናውሩ በሰላማዊ ውቅያኖስ የሚገኝ ደሴት አገር ነው። ዋና ከተማ የለውም፣ ትልቁ ከተማ ግን ያሬን ነው።`

 | Vocab | Tokens | Count |
 |-------|--------|-------|
+| 8k | `▁ና ው ሩ ▁በሰ ላማዊ ▁ውቅያኖስ ▁የሚገኝ ▁ደሴት ▁አገር ▁ነው። ... (+10 more)` | 20 |
+| 16k | `▁ና ውሩ ▁በሰላማዊ ▁ውቅያኖስ ▁የሚገኝ ▁ደሴት ▁አገር ▁ነው። ▁ዋና ▁ከተማ ... (+8 more)` | 18 |
+| 32k | `▁ናውሩ ▁በሰላማዊ ▁ውቅያኖስ ▁የሚገኝ ▁ደሴት ▁አገር ▁ነው። ▁ዋና ▁ከተማ ▁የለውም፣ ... (+6 more)` | 16 |
+| 64k | `▁ናውሩ ▁በሰላማዊ ▁ውቅያኖስ ▁የሚገኝ ▁ደሴት ▁አገር ▁ነው። ▁ዋና ▁ከተማ ▁የለውም፣ ... (+5 more)` | 15 |
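
One way to read the Sample 1 table is as tokens per whitespace-separated word (often called fertility). The sketch below is illustrative only: the sentence and the counts are copied from the table, while the names `SENTENCE`, `TOKEN_COUNTS`, and `fertility` are hypothetical, and splitting on whitespace is an assumption about what counts as a word.

```python
# Sample 1 sentence and per-vocabulary token counts, copied from the table above.
SENTENCE = "ናውሩ በሰላማዊ ውቅያኖስ የሚገኝ ደሴት አገር ነው። ዋና ከተማ የለውም፣ ትልቁ ከተማ ግን ያሬን ነው።"
TOKEN_COUNTS = {"8k": 20, "16k": 18, "32k": 16, "64k": 15}

# Assumption: a "word" is a whitespace-separated chunk (punctuation stays attached).
words = SENTENCE.split()

# Tokens per word for each vocabulary size; 1.0 means one token per word.
fertility = {vocab: count / len(words) for vocab, count in TOKEN_COUNTS.items()}

for vocab, value in fertility.items():
    print(f"{vocab}: {TOKEN_COUNTS[vocab]} tokens / {len(words)} words = {value:.2f}")
```

On this sentence the 64k vocabulary reaches one token per word, while the 8k vocabulary needs about a third more tokens.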

+**Sample 2:** `አሾካ ከ277 ስከ 240 ዓክልበ. ድረስ የሕንድ አገር ማውርያ መንግሥት ንጉሥ ነበር። በ271 ዓክልበ. ግድም የቡዲስም ተከታይ...`

 | Vocab | Tokens | Count |
 |-------|--------|-------|
+| 8k | `▁አ ሾ ካ ▁ከ 2 7 7 ▁ስ ከ ▁ ... (+42 more)` | 52 |
+| 16k | `▁አ ሾ ካ ▁ከ 2 7 7 ▁ስ ከ ▁ ... (+39 more)` | 49 |
+| 32k | `▁አሾ ካ ▁ከ 2 7 7 ▁ስ ከ ▁ 2 ... (+38 more)` | 48 |
+| 64k | `▁አሾካ ▁ከ 2 7 7 ▁ስከ ▁ 2 4 0 ... (+34 more)` | 44 |

+**Sample 3:** `ኔትፍሊክስ (እንግሊዝኛ: Netflix) በመስመር ላይ ፊልሞችን እና የቴሌቪዥን ፕሮግራሞችን ለመመልከት የሚያስችል የዥረት አገል...`

 | Vocab | Tokens | Count |
 |-------|--------|-------|
+| 8k | `▁ኔ ት ፍ ሊ ክስ ▁( እንግሊዝኛ : ▁n et ... (+36 more)` | 46 |
+| 16k | `▁ኔ ትፍ ሊ ክስ ▁( እንግሊዝኛ : ▁n et fl ... (+29 more)` | 39 |
+| 32k | `▁ኔ ትፍ ሊክስ ▁( እንግሊዝኛ : ▁net fl ix ) ... (+23 more)` | 33 |
+| 64k | `▁ኔ ትፍ ሊክስ ▁( እንግሊዝኛ : ▁net flix ) ▁በመስመር ... (+16 more)` | 26 |


 ### Key Findings
 **Context Size 1:**

+1. `ነው ያኽዱን ሊም ዓክልበ የነገሠ የሊፒት እሽታርን እርዳታ የማግኘት መብቱ የተጠበቀ ስለሆነ ፈጽሞ ይበላል ፍሬው ሳይበስል`
+2. `እና ኢኮኖሚያዊ እና አመለካከቶችን ለመግለጽ ይወዳል የወዳጅሽ የመሠወሪያው ማዕበልም ያማታዋል ዳግመኛም የከበረውን የመልክተኛዎን የቃል ትርጉም ሊያዳብር`
+3. `ላይ አፈፃፀምን በራስ መተማመን አይችሉም ከሚለው ቃል በሲቪል ደግሞ ለየተለያዩ በአፍሪካ ውስጥ የተረጋገጠ ይመስላል ከዚያም የሶቪየት`

 **Context Size 2:**

+1. `ዓ ም በኋላ ለሆኑት ዓመታት ግን በሌላ ቀን ላይ መሆኑን ይገንዘቡ ለእነዚያ ዓመቶች ይህ የቀን መለወጫ መሣርያ`
+2. `ምሳሌ ነው ትርጉሙ መደብ ያልተተረጎመ ምሳሌ መደብ ተረትና ምሳሌ መደብ ተረትና ምሳሌ መደብ ተረትና ምሳሌ ምናልባትም ከቤ`
+3. `የአማርኛ ምሳሌ ነው ትርጉሙ ሚስጥር አይደበቅ ይመስላል ትርጉሙ መደብ ተረትና ምሳሌ መደብ ተረትና ምሳሌ መደብ ተረትና ምሳሌ`

 **Context Size 3:**

+1. `የአማርኛ ምሳሌ ነው ትርጉሙ መደብ ያልተተረጎመ ምሳሌ መደብ ተረትና ምሳሌ ምግባር ሳይኖር ስም እንደማለት ነዉ`
+2. `እ ኤ አ የእንግሊዝ ካላንደር ማሻሻያ ተከትሎ የንግሥቲቱን ሞት መመዝገብ የተለመደ ቢሆንም እንግሊዝ መጋቢት 25 ቀን ማለት ነው`
+3. `ምሳሌ ነው ትርጉሙ የተያያዙ ነገሮችን ለመለየት የሚያገለግል ፈሊጥ መደብ ተረትና ምሳሌ ምሳሌ`

 **Context Size 4:**

+1. `የአማርኛ ምሳሌ ነው ትርጉሙ መደብ ተረትና ምሳሌ በሬ ካራጁ ይዉላል`
+2. `ምሳሌ ነው ትርጉሙ መደብ ያልተተረጎመ ምሳሌ መደብ ተረትና ምሳሌ መደብ ያልተተረጎመ ምሳሌ`
+3. `ነው ትርጉሙ መደብ ያልተተረጎመ ምሳሌ መደብ ተረትና ምሳሌ ሴት ሁሉን ቻይ ናት`
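
For readers unfamiliar with how samples like these are produced, the following is a minimal sketch of generation from a Markov chain with a fixed context size. It is not the pipeline's implementation: the schema of the `.parquet` model files is not shown in this diff, so the sketch builds its transition table in memory from a toy token list, and the names `build_chain` and `generate` are hypothetical.

```python
import random
from collections import defaultdict


def build_chain(tokens, context_size):
    """Map each context (tuple of `context_size` tokens) to its observed successors."""
    chain = defaultdict(list)
    for i in range(len(tokens) - context_size):
        context = tuple(tokens[i:i + context_size])
        chain[context].append(tokens[i + context_size])
    return chain


def generate(chain, seed_context, max_new_tokens, rng=None):
    """Extend `seed_context` by sampling successors until none are known."""
    rng = rng or random.Random(0)
    out = list(seed_context)
    k = len(seed_context)
    for _ in range(max_new_tokens):
        successors = chain.get(tuple(out[-k:]))
        if not successors:
            break  # unseen context: stop rather than back off
        # Duplicates in the list make frequent successors proportionally likelier.
        out.append(rng.choice(successors))
    return out


# Toy corpus (placeholder tokens, not taken from the Amharic data).
toy_tokens = "a b c a b c a b c".split()
toy_chain = build_chain(toy_tokens, context_size=2)
sample = generate(toy_chain, ("a", "b"), max_new_tokens=4)
print(" ".join(sample))
```

A larger context size leaves fewer successors per context, which is consistent with the context-4 samples above repeating long stretches of the same phrase.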


 ### Generated Text Samples (Subword-based)
 **Context Size 1:**

+1. `_በይምነዉ፡ቢቢትር_የተፅሀ`
+2. `ን_እንዋጮችት_crcue_አ`
+3. `ት_ው_ፈርዕስክሎ_አስ_po`

 **Context Size 2:**

+1. `_የኢትዮጵያ_ዘን_ሳይንስ_ተ`
+2. `ት_ነው።_እንግሥታት_ናይትድ`
+3. `_በወራ_ህብረ_የሳምን_ወይ_`

 **Context Size 3:**

+1. `_እንዲህ፡መልክ_ፈላሁ_ዐንሁ_`
+2. `_ነው።_እንዲህም፡ዅሉ፡_ደግሞ`
+3. `_እና_ከተያዙ_እንዲሸከሙአቸው`

 **Context Size 4:**

+1. `_እና_ቁሳዊ_ነገሥታት_መሽኛ_ት`
+2. `_ነው።_ከግብጽ_ዘውድ_ጭነው_ነ`
+3. `ነው።_ሁሉም_የተነሳ_በኋላም_ያ`


 ### Key Findings
 | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
 |-------|-----------|----------|------------------|---------------|----------------|
+| **mono_32d** | 32 | 0.9098 | 0.3240 | N/A | N/A |
+| **mono_64d** | 64 | 0.9137 🏆 | 0.2319 | N/A | N/A |
+| **mono_128d** | 128 | 0.8452 | 0.1755 | N/A | N/A |
+| **aligned_32d** | 32 | 0.9098 | 0.3259 | 0.0200 | 0.1420 |
+| **aligned_64d** | 64 | 0.9137 | 0.2299 | 0.0480 | 0.1860 |
+| **aligned_128d** | 128 | 0.8452 | 0.1764 | 0.0840 | 0.2800 |

 ### Key Findings

+- **Best Isotropy:** mono_64d with 0.9137 (more uniform distribution)
+- **Semantic Density:** Average pairwise similarity of 0.2439. Lower values indicate better semantic separation.
+- **Alignment Quality:** Aligned models achieve up to 8.4% R@1 in cross-lingual retrieval.
 - **Recommendation:** 128d aligned for best cross-lingual performance
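
The summary figures in these findings can be rechecked from the table rows. The sketch below assumes the reported semantic density is the unweighted mean over all six models; the diff does not state how it was computed, but this assumption reproduces 0.2439. The `ROWS` dictionary and variable names are hypothetical; the numbers are copied from the table.

```python
from statistics import mean

# (isotropy, semantic density, alignment R@1) per model, copied from the table.
# R@1 is None for monolingual models, which the table reports as N/A.
ROWS = {
    "mono_32d": (0.9098, 0.3240, None),
    "mono_64d": (0.9137, 0.2319, None),
    "mono_128d": (0.8452, 0.1755, None),
    "aligned_32d": (0.9098, 0.3259, 0.0200),
    "aligned_64d": (0.9137, 0.2299, 0.0480),
    "aligned_128d": (0.8452, 0.1764, 0.0840),
}

# Assumption: the headline density is a plain average over all six rows.
avg_density = mean(density for _, density, _ in ROWS.values())

# max() keeps the first model on ties, so mono_64d wins over aligned_64d.
best_isotropy = max(ROWS, key=lambda name: ROWS[name][0])

# Best cross-lingual retrieval among models that report R@1.
aligned = {name: r1 for name, (_, _, r1) in ROWS.items() if r1 is not None}
best_r1_model = max(aligned, key=aligned.get)

print(f"average semantic density: {avg_density:.4f}")
print(f"best isotropy: {best_isotropy} ({ROWS[best_isotropy][0]})")
print(f"best R@1: {best_r1_model} ({aligned[best_r1_model]:.1%})")
```

Note that each aligned model reports the same isotropy as its monolingual counterpart, so the isotropy ranking depends only on dimension here.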

 ---
 | Stem | Cohesion | Substitutability | Examples |
 |------|----------|------------------|----------|
+| `እንደሚ` | 2.30x | 158 contexts | እንደሚሹ, እንደሚሻ, እንደሚል |
+| `ርስቲያ` | 2.39x | 61 contexts | ክርስቲያ, ከርስቲያን, ክርስቲያን |
+| `ትዮጵያ` | 2.17x | 57 contexts | ኢትዮጵያ, እትዮጵያ, ኢትዮጵያን |
+| `መንግስ` | 2.10x | 49 contexts | መንግስቱ, መንግስተ, መንግስት |
+| `ግዚአብ` | 2.58x | 23 contexts | እግዚአብሐር, እግዚአብሔር, እግዚአብሄር |
+| `ኢትዮጵ` | 2.08x | 46 contexts | ኢትዮጵያ, ኢትዮጵያን, ኢትዮጵያና |
+| `እንግሊ` | 2.00x | 52 contexts | እንግሊዝ, እንግሊዙ, እንግሊኛ |
+| `ፈረንሳ` | 2.23x | 34 contexts | ፈረንሳዊ, ፈረንሳይ, ከፈረንሳዩ |
+| `መንግሥ` | 2.04x | 46 contexts | መንግሥቱ, መንግሥት, መንግሥተ |
+| `tion` | 2.71x | 17 contexts | action, nation, section |
+| `አስተዳ` | 2.21x | 33 contexts | አስተዳደጉ, አስተዳደሪ, አስተዳደጓ |
+| `ግሊዝኛ` | 2.54x | 19 contexts | እንግሊዝኛ, በእንግሊዝኛ, ኢንግሊዝኛው |

 ### 6.4 Affix Compatibility (Co-occurrence)
 ---
 *Generated by Wikilangs Models Pipeline*

+*Report Date: 2026-01-03 16:28:42*
models/embeddings/aligned/am_128d.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:05365ec12b2c2fb2d8df995aee24119fee6ddb57a678bd862fd40de10e193cb1
 size 1064306440
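
Each model file in this commit is stored as a Git LFS pointer: three `key value` lines giving the spec version, the object id (hash algorithm and hex digest), and the size in bytes. The sketch below shows one way a downloaded file could be checked against such a pointer. The helper names `parse_lfs_pointer` and `file_matches_pointer` are hypothetical and not part of any Git LFS tooling; the pointer text is the new side of the hunk above.

```python
import hashlib
import io


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def file_matches_pointer(fileobj, pointer, chunk_size=1 << 20):
    """Stream a binary file object and compare its size and digest to the pointer."""
    hasher = hashlib.new(pointer["algo"])
    total = 0
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        hasher.update(chunk)
        total += len(chunk)
    return total == pointer["size"] and hasher.hexdigest() == pointer["digest"]


POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:05365ec12b2c2fb2d8df995aee24119fee6ddb57a678bd862fd40de10e193cb1
size 1064306440
"""
pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])

# Self-check on in-memory data, using a pointer built for a small payload.
payload = b"hello"
toy_pointer = {
    "version": pointer["version"],
    "algo": "sha256",
    "digest": hashlib.sha256(payload).hexdigest(),
    "size": len(payload),
}
print(file_matches_pointer(io.BytesIO(payload), toy_pointer))
```

For a real download, the same check would be applied to a file opened with `open(path, "rb")`; streaming in chunks avoids loading a file of roughly 1 GB into memory at once.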
models/embeddings/aligned/am_128d.projection.npy
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:deb24c9e18725a3ee6f696f5f490d96e499e66d1aef469fc3b67ca71da6c2d47
 size 65664
models/embeddings/aligned/am_32d.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:8345b199adcc25abea3de9a656a6b112dd0b6a39d6361db11f5e425bf8e004bd
 size 266727688
models/embeddings/aligned/am_32d.projection.npy
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:634cf159edda5ce6a8152867dd7521b71f1c434d47e85d17f55d1e0d6feb3316
 size 4224
models/embeddings/aligned/am_64d.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:93e5df8ef91db2272bca59309b0c79fd6feaa7e932515206d8059b8330fa5e72
 size 532587272
models/embeddings/aligned/am_64d.projection.npy
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:627287628b6e540e4db14e30e43e8a0a62ea37c30be3ad5ce1ab5d656c55c446
 size 16512
models/embeddings/monolingual/am_128d.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:05365ec12b2c2fb2d8df995aee24119fee6ddb57a678bd862fd40de10e193cb1
 size 1064306440
models/embeddings/monolingual/am_32d.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:8345b199adcc25abea3de9a656a6b112dd0b6a39d6361db11f5e425bf8e004bd
 size 266727688
models/embeddings/monolingual/am_64d.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:93e5df8ef91db2272bca59309b0c79fd6feaa7e932515206d8059b8330fa5e72
 size 532587272
models/subword_markov/am_markov_ctx1_subword.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:a0819e23adea11e4b2d437cb14f2eb3a8095f9305eca7295c22f45869cc2398c
+size 354543
models/subword_markov/am_markov_ctx2_subword.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:b6243c7c58a847e7d8a7e59e0d6cf893519ad61e931d6182b4fd19948111d9cf
+size 2169861
models/subword_markov/am_markov_ctx3_subword.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:ab2bfec0c73c9d92c58f9ffed32659a2ea045ce31cbf4ae474be8c7609c2d247
+size 8421663
models/subword_markov/am_markov_ctx4_subword.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:f389ec100097a598046200d415e719f5c015e20313c572c31733c198bd66825b
+size 23300653
models/subword_ngram/am_2gram_subword.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:ad5c361634d826f85a00a861a3aac431b960daae8e30d389a26b10f223079031
+size 300438
models/subword_ngram/am_3gram_subword.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:bb1a1174c9a4c80aefbd37753d4553c8a756c0598012bc0df0044d2a23423c4d
+size 1911161
models/subword_ngram/am_4gram_subword.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:a82c126bddc5c2665815030f550a0890e69f0a9cfbd4eaac08c79d884183a4d1
+size 7098220
models/subword_ngram/am_5gram_subword.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:c35770939585fe557d5ccf7666948844ecbbbeab5b5bd7ec5cde2dc5905be254
+size 12087680
models/tokenizer/am_tokenizer_16k.model
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:a111ad7a5e4c6cde164649095c769c75312c7ad72aae53c5ee2f8ef7838721c4
 size 559625
models/tokenizer/am_tokenizer_32k.model
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:ab016e2ea38f8d8c7b267cf37f85d2bb7c17bda4ed8c6c8b690fb0324f3335dd
 size 902568
models/tokenizer/am_tokenizer_64k.model
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:961377d69eb3cec24c9c718382b0b467b9a1afa24c2dc238af1e0fa8ea2122c9
 size 1589838
models/tokenizer/am_tokenizer_8k.model
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:dc01c040ea8751cdd811ff0c122044d022da068ea1f778e5d6105fed66d9c6e2
 size 394741
models/word_markov/am_markov_ctx1_word.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:b31d703fb4cebc4b7a0c50356f366a54d6244d97c871ead6e6e93a4fb68fbb1f
+size 13829693
models/word_markov/am_markov_ctx2_word.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:cc2ff4ea0c2d68217181f37e11af662feba1f560f09d89557360c878b261cc0e
+size 30071868
models/word_markov/am_markov_ctx3_word.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:de52d6ed2d3cf8ba885adb29815096e6f0eaf92d7f4420e09402db1799affe4e
+size 37843385
models/word_markov/am_markov_ctx4_word.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:26c1f4fea34061dd7bf3ac5175c36ba8e8c0eecf6bf70e7a3b3673f18dfef779
+size 43401561
models/word_ngram/am_2gram_word.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:3253eb36f46641335616a4d569c3e3db153afd4a90452fd3c6b3a43277097773
+size 548774
models/word_ngram/am_3gram_word.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:9e2da18fb13944faacfb0c55bc5dbe5f5aff6172a504ee740427290dda7f2ee6
+size 761215
models/word_ngram/am_4gram_word.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:21a05c63fd25438fb235f39c907ad834a6f3b4a58c825f0913049f3461b196f7
+size 2121749
models/word_ngram/am_5gram_word.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:42543fabd233c4bac0f24bbbd665c94db17fdf8859ee492de75dc0a80d2dfc9e
+size 2016123
visualizations/embedding_alignment_quality.png
CHANGED
visualizations/embedding_isotropy.png
CHANGED
visualizations/embedding_norms.png
CHANGED
visualizations/embedding_similarity.png
CHANGED
visualizations/embedding_tsne_multilingual.png
CHANGED
visualizations/ngram_entropy.png
CHANGED
visualizations/performance_dashboard.png
CHANGED
visualizations/position_encoding_comparison.png
CHANGED
visualizations/tsne_sentences.png
CHANGED
visualizations/tsne_words.png
CHANGED