---
license: llama3
---

## STEP 1:

I sampled data from the multilingual (7 Indic languages) [aloobun/dhpileIN](https://huggingface.co/datasets/aloobun/dhpileIN) dataset and [trained](https://github.com/aloobun/IN-L3-Tokenizer/blob/main/train.py) a SentencePiece tokenizer on it.
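
Below is a minimal sketch of what that training step looks like with the `sentencepiece` library; the input file name and every hyperparameter here are illustrative assumptions, not the exact settings used in `train.py`.

```python
import sentencepiece as spm

# Train a SentencePiece model on text sampled from dhpileIN.
# The file name and all hyperparameters below are assumptions for illustration.
spm.SentencePieceTrainer.train(
    input="dhpilein_sample.txt",  # hypothetical dump: one sentence per line
    model_prefix="in_l3",         # writes in_l3.model and in_l3.vocab
    model_type="unigram",         # SentencePiece's default algorithm
    vocab_size=32000,             # assumed target vocabulary size
    character_coverage=0.9995,    # keep near-full coverage of the Indic scripts
)
```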
## STEP 2:

I evaluated the tokenizer's performance on:
- Unicode coverage.
- Token distribution.
- Tokenization complexity across different scripts.
- Encoding and decoding capabilities, and
- Edge cases, e.g., special characters, numbers, etc. (a quick sketch follows this list).
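
As an illustration of the edge-case checks, here is a hedged sketch; the concrete test strings are my own, and the model path assumes the `in_l3` model from STEP 1.

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="in_l3.model")  # assumed model path

# Illustrative edge cases: digits, punctuation, mixed scripts, empty input.
for text in ["123.45", "hello नमस्ते!", "₹1,000", ""]:
    ids = sp.encode(text, out_type=int)
    # A well-behaved tokenizer should decode back to the original string.
    print(repr(text), "->", len(ids), "tokens, round-trip:", sp.decode(ids) == text)
```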
## STEP 2.1:

The first [test](https://github.com/aloobun/IN-L3-Tokenizer/blob/main/test_suite_step_2_1.py) reports detailed results on the tokenizer's Unicode coverage, token distribution visualization, and tokenization complexity across scripts.
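
A rough sketch of that kind of analysis (not the actual test suite; the model path and sample sentences are assumptions):

```python
import sentencepiece as spm
from collections import Counter

sp = spm.SentencePieceProcessor(model_file="in_l3.model")  # assumed model path

# Token length distribution over the full vocabulary.
lengths = Counter(len(sp.id_to_piece(i)) for i in range(sp.get_piece_size()))
print("piece length -> count:", sorted(lengths.items()))

# Tokens per character ("fertility") as a crude per-script complexity measure.
samples = {
    "Hindi": "नमस्ते, मैं भारत से हूँ।",
    "Tamil": "நான் தமிழ்நாட்டைச் சேர்ந்தவன்.",
}
for script, text in samples.items():
    pieces = sp.encode(text, out_type=str)
    print(f"{script}: {len(pieces) / len(text):.2f} tokens/char")
```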
## STEP 2.2:

The second [script](https://github.com/aloobun/IN-L3-Tokenizer/blob/main/test_step_2_2.py) tests the encoding and decoding capabilities.
Example output:
```
Bengali Analysis:
Original Text Length: 48 characters
Token IDs Count: 11
Token Strings: ['▁আমি', '▁বাংলাদেশ', '▁থেকে', '▁এসে', 'ছি', '।', '▁কলকাতা', '▁একটি', '▁সুন্দর', '▁শহর', '।']
Text Reconstruction: True

Hindi Analysis:
Original Text Length: 49 characters
Token IDs Count: 15
Token Strings: ['▁नम', 'स्ते', ',', '▁मैं', '▁भारत', '▁से', '▁हू', 'ँ', '।', '▁दिल्ली', '▁बहुत', '▁बड़ा', '▁शहर', '▁है', '।']
Text Reconstruction: True

Kannada Analysis:
Original Text Length: 53 characters
Token IDs Count: 13
Token Strings: ['▁ನಾನು', '▁ಬೆಂಗಳೂರಿ', 'ನಿಂದ', '▁ಬಂದ', 'ಿದ್ದೇನೆ', '।', '▁ಕನ್ನಡ', '▁ಒಂದು', '▁ಸೋ', 'ಂಪ', 'ಿನ', '▁ಭಾಷೆ', '।']
Text Reconstruction: True

Malayalam Analysis:
Original Text Length: 47 characters
Token IDs Count: 15
Token Strings: ['▁ഞ', 'ാ', 'ൻ', '▁കേരള', 'ത്തി', 'ൽ', '▁നിന്നാണ്', '.', '▁കൊച്ചി', '▁ഒരു', '▁സുന്ദ', 'ര', '▁നഗ', 'രം', '.']
Text Reconstruction: True

Telugu Analysis:
Original Text Length: 53 characters
Token IDs Count: 10
Token Strings: ['▁నేను', '▁తెలంగాణ', '▁నుంచి', '▁వచ్చ', 'ాను', '.', '▁హైదరాబాద్', '▁అద్భుతమైన', '▁నగరం', '.']
Text Reconstruction: True

Tamil Analysis:
Original Text Length: 54 characters
Token IDs Count: 13
Token Strings: ['▁நான்', '▁தமிழ்நா', 'ட்டை', 'ச்', '▁சேர்ந்த', 'வன்', '.', '▁சென்னை', '▁ஒரு', '▁பெரிய', '▁நக', 'ரம்', '.']
Text Reconstruction: True

Gujarati Analysis:
Original Text Length: 50 characters
Token IDs Count: 12
Token Strings: ['▁હું', '▁ગુજરાત', '▁થી', '▁આવ્યો', '▁છું', '।', '▁અમદાવાદ', '▁એક', '▁સુંદર', '▁શહેર', '▁છે', '।']
Text Reconstruction: True
```
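
Each of those per-language blocks boils down to a simple round-trip check. A sketch of it, with the Bengali sentence reconstructed from the token strings shown above and the model path assumed:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="in_l3.model")  # assumed model path

text = "আমি বাংলাদেশ থেকে এসেছি। কলকাতা একটি সুন্দর শহর।"
ids = sp.encode(text, out_type=int)

print("Original Text Length:", len(text), "characters")
print("Token IDs Count:", len(ids))
print("Token Strings:", sp.encode(text, out_type=str))
print("Text Reconstruction:", sp.decode(ids) == text)
```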
## STEP 3:

This [script](https://github.com/aloobun/IN-L3-Tokenizer/blob/main/merge_step_3.py) merges the trained tokenizer into the Llama3 tokenizer, extending its vocabulary (a sketch follows the list below).

The script ensures:
- No duplicate tokens are added.
- Tokens aren't excessively long.
- New tokens are correctly integrated.
- Token mappings stay consistent.
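
Here is a hedged sketch of that merge logic using the Hugging Face `transformers` API; the checkpoint name, length cap, and output directory are assumptions, and `merge_step_3.py` may do this differently.

```python
import sentencepiece as spm
from transformers import AutoTokenizer

llama_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")  # assumed checkpoint
sp = spm.SentencePieceProcessor(model_file="in_l3.model")  # assumed model path

existing = set(llama_tok.get_vocab())  # current vocabulary, for duplicate checks
MAX_TOKEN_LEN = 32                     # assumed cap on token length

new_tokens = []
for i in range(sp.get_piece_size()):
    piece = sp.id_to_piece(i)
    # Skip duplicates and excessively long pieces, per the checks above.
    if piece in existing or len(piece) > MAX_TOKEN_LEN:
        continue
    new_tokens.append(piece)

added = llama_tok.add_tokens(new_tokens)       # extends vocab and id mappings
print(f"Added {added} new tokens")
llama_tok.save_pretrained("llama3-IN-merged")  # hypothetical output directory
```

One wrinkle the sketch glosses over: SentencePiece marks word boundaries with '▁', while Llama3's byte-level BPE uses a different convention, so a faithful merge has to remap boundary markers before adding the pieces.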
I'm still working on improvements and will update this page as soon as I make progress.