---
license: llama3
---

## STEP 1:

I sampled data from the multilingual (7 Indic languages) [aloobun/dhpileIN](https://huggingface.co/datasets/aloobun/dhpileIN) dataset and [trained](https://github.com/aloobun/IN-L3-Tokenizer/blob/main/train.py) a SentencePiece tokenizer on it.
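
The exact settings live in [train.py](https://github.com/aloobun/IN-L3-Tokenizer/blob/main/train.py); below is only a minimal sketch of a comparable SentencePiece training call, where the input file, vocab size, model type, and coverage are my own illustrative assumptions rather than the repo's actual configuration:

```python
import sentencepiece as spm

# Minimal sketch of training a SentencePiece tokenizer on sampled Indic text.
# Input file, vocab size, model type, and coverage are illustrative
# assumptions, not the settings actually used in train.py.
spm.SentencePieceTrainer.train(
    input="dhpileIN_sample.txt",   # one sentence per line, UTF-8
    model_prefix="in_tokenizer",   # writes in_tokenizer.model / in_tokenizer.vocab
    vocab_size=32000,
    model_type="bpe",
    character_coverage=0.9995,     # keep rare Indic characters in the vocab
)
```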

## STEP 2:

I evaluated the tokenizer's performance on:
- Unicode coverage.
- Token distribution.
- Tokenization complexity across different scripts.
- Encoding and decoding capabilities.
- Edge cases, e.g., special characters, numbers, etc.

## STEP 2.1:

The first [test](https://github.com/aloobun/IN-L3-Tokenizer/blob/main/test_suite_step_2_1.py) reports detailed results on unicode coverage, token distribution visualization, and tokenization complexity across scripts.
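
The test suite above is the authoritative version; here is only a rough sketch of the kind of check it performs (a token-length histogram plus per-script token counts over the vocabulary; the model filename and the choice of Unicode ranges are my assumptions):

```python
from collections import Counter
import sentencepiece as spm

# Rough sketch: token-length histogram and per-script token counts over
# the trained vocabulary. Filename and script ranges are assumptions;
# the real checks live in test_suite_step_2_1.py.
sp = spm.SentencePieceProcessor(model_file="in_tokenizer.model")

scripts = {"Devanagari": ("\u0900", "\u097f"), "Bengali": ("\u0980", "\u09ff")}
lengths = Counter()
per_script = Counter()
for tid in range(sp.get_piece_size()):
    piece = sp.id_to_piece(tid).lstrip("\u2581")  # drop the word-boundary marker
    lengths[len(piece)] += 1
    for name, (lo, hi) in scripts.items():
        if any(lo <= ch <= hi for ch in piece):
            per_script[name] += 1

print("token length histogram:", dict(sorted(lengths.items())))
print("tokens per script:", dict(per_script))
```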

## STEP 2.2:

The second [script](https://github.com/aloobun/IN-L3-Tokenizer/blob/main/test_step_2_2.py) tests the encoding and decoding capabilities.

Example output:

```
Bengali Analysis:
Original Text Length: 48 characters
Token IDs Count: 11
Token Strings: ['▁আমি', '▁বাংলাদেশ', '▁থেকে', '▁এসে', 'ছি', '।', '▁কলকাতা', '▁একটি', '▁সুন্দর', '▁শহর', '।']
Text Reconstruction: True

Hindi Analysis:
Original Text Length: 49 characters
Token IDs Count: 15
Token Strings: ['▁नम', 'स्ते', ',', '▁मैं', '▁भारत', '▁से', '▁हू', 'ँ', '।', '▁दिल्ली', '▁बहुत', '▁बड़ा', '▁शहर', '▁है', '।']
Text Reconstruction: True

Kannada Analysis:
Original Text Length: 53 characters
Token IDs Count: 13
Token Strings: ['▁ನಾನು', '▁ಬೆಂಗಳೂರಿ', 'ನಿಂದ', '▁ಬಂದ', 'ಿದ್ದೇನೆ', '।', '▁ಕನ್ನಡ', '▁ಒಂದು', '▁ಸೋ', 'ಂಪ', 'ಿನ', '▁ಭಾಷೆ', '।']
Text Reconstruction: True

Malayalam Analysis:
Original Text Length: 47 characters
Token IDs Count: 15
Token Strings: ['▁ഞ', 'ാ', 'ൻ', '▁കേരള', 'ത്തി', 'ൽ', '▁നിന്നാണ്', '.', '▁കൊച്ചി', '▁ഒരു', '▁സുന്ദ', 'ര', '▁നഗ', 'രം', '.']
Text Reconstruction: True

Telugu Analysis:
Original Text Length: 53 characters
Token IDs Count: 10
Token Strings: ['▁నేను', '▁తెలంగాణ', '▁నుంచి', '▁వచ్చ', 'ాను', '.', '▁హైదరాబాద్', '▁అద్భుతమైన', '▁నగరం', '.']
Text Reconstruction: True

Tamil Analysis:
Original Text Length: 54 characters
Token IDs Count: 13
Token Strings: ['▁நான்', '▁தமிழ்நா', 'ட்டை', 'ச்', '▁சேர்ந்த', 'வன்', '.', '▁சென்னை', '▁ஒரு', '▁பெரிய', '▁நக', 'ரம்', '.']
Text Reconstruction: True

Gujarati Analysis:
Original Text Length: 50 characters
Token IDs Count: 12
Token Strings: ['▁હું', '▁ગુજરાત', '▁થી', '▁આવ્યો', '▁છું', '।', '▁અમદાવાદ', '▁એક', '▁સુંદર', '▁શહેર', '▁છે', '।']
Text Reconstruction: True
```
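
Output like the above comes from a simple round trip: encode to IDs and pieces, decode, and compare against the original text. test_step_2_2.py is the source of truth; this is only a minimal sketch (the sample text is the Hindi sentence reconstructed from the tokens above, and the model filename is an assumption):

```python
import sentencepiece as spm

# Minimal sketch of the encode/decode round trip behind the output above;
# test_step_2_2.py is the source of truth, this only illustrates the idea.
sp = spm.SentencePieceProcessor(model_file="in_tokenizer.model")

samples = {"Hindi": "नमस्ते, मैं भारत से हूँ। दिल्ली बहुत बड़ा शहर है।"}
for lang, text in samples.items():
    ids = sp.encode(text, out_type=int)     # token IDs
    pieces = sp.encode(text, out_type=str)  # human-readable pieces
    print(f"{lang} Analysis:")
    print(f"Original Text Length: {len(text)} characters")
    print(f"Token IDs Count: {len(ids)}")
    print(f"Token Strings: {pieces}")
    print(f"Text Reconstruction: {sp.decode(ids) == text}")
```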

## STEP 3:

This [script](https://github.com/aloobun/IN-L3-Tokenizer/blob/main/merge_step_3.py) merges the new SentencePiece vocabulary into the Llama3 tokenizer, extending it with the Indic tokens.

The script ensures:
- No duplicate tokens are added.
- Tokens aren't excessively long.
- New tokens are correctly integrated.
- Token mappings stay consistent.
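
merge_step_3.py contains the actual merge logic; the following is only a rough sketch of how such an extension can look with the `transformers` API, where the model path, output directory, and the 16-character length cap are my own assumptions:

```python
import sentencepiece as spm
from transformers import AutoTokenizer

# Rough sketch of extending the Llama3 tokenizer with the new pieces;
# model path and length cap are illustrative, see merge_step_3.py for
# the real merge.
llama_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
sp = spm.SentencePieceProcessor(model_file="in_tokenizer.model")

existing = set(llama_tok.get_vocab())
new_tokens = []
for tid in range(sp.get_piece_size()):
    piece = sp.id_to_piece(tid)
    if piece in existing or len(piece) > 16:  # no duplicates, no overlong tokens
        continue
    new_tokens.append(piece)

added = llama_tok.add_tokens(new_tokens)
print(f"added {added} tokens; new vocab size: {len(llama_tok)}")
llama_tok.save_pretrained("llama3-indic-tokenizer")
```

If the extended tokenizer is paired with the model itself, the embedding matrix also has to grow to match, e.g. `model.resize_token_embeddings(len(llama_tok))`.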

I'm still working on improvements and will update as soon as I make progress.