Update README.md
README.md CHANGED
@@ -55,13 +55,13 @@ Our goal for vocabulary expansion is threefold: (1) the number of newly-added to
As seen in the table below, our new vocabulary reduces the compression ratio from 4.29 to 1.57 for Thai - meaning it can now encode 2.7x longer Thai text given the same context length. Meanwhile, English is only compressed by 0.3%, thus preserving its integrity.

-|Language | Llama's ratio | Our ratio | # New tokens
-| --- | --- | --- | --- |
-| Vi | 2.91 | 1.2488 | 2304
-| Zh | 1.99 | 1.1806 | 3456
-| Th | 4.29 | 1.5739 | 1536
-| Id | 1.76 | 1.1408 | 3840
-| En | 1.00 | 0.9976
+|Language | ChatGPT's ratio | Llama's ratio | Our ratio | # New tokens
+| --- | --- | --- | --- | --- |
+| Vi | 4.41 | 2.91 | 1.2488 | 2304
+| Zh | 2.80 | 1.99 | 1.1806 | 3456
+| Th | 9.09 | 4.29 | 1.5739 | 1536
+| Id | 2.00 | 1.76 | 1.1408 | 3840
+| En | 1.00 | 1.00 | 0.9976

### Pre-training Data
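
Note on the figures in this hunk: the "2.7x longer Thai text" and "0.3%" claims in the context paragraph can be rechecked from the table values. The sketch below is only a sanity check, not part of the change. It assumes that a lower ratio means proportionally fewer tokens for the same text, so the gain in text per fixed context is the old ratio divided by the new one. The names `RATIOS`, `context_gain`, and `token_reduction_pct` are hypothetical helpers introduced here for illustration.

```python
# Ratios copied from the updated table: (ChatGPT, Llama, ours).
RATIOS = {
    "Vi": (4.41, 2.91, 1.2488),
    "Zh": (2.80, 1.99, 1.1806),
    "Th": (9.09, 4.29, 1.5739),
    "Id": (2.00, 1.76, 1.1408),
    "En": (1.00, 1.00, 0.9976),
}


def context_gain(old_ratio: float, new_ratio: float) -> float:
    """Factor by which more text fits in a fixed context window.

    Assumption: token count scales linearly with the compression ratio.
    """
    return old_ratio / new_ratio


def token_reduction_pct(old_ratio: float, new_ratio: float) -> float:
    """Percentage of tokens saved when moving from old_ratio to new_ratio."""
    return (1 - new_ratio / old_ratio) * 100


if __name__ == "__main__":
    for lang, (chatgpt, llama, ours) in RATIOS.items():
        print(
            f"{lang}: {context_gain(llama, ours):.2f}x vs Llama, "
            f"{context_gain(chatgpt, ours):.2f}x vs ChatGPT, "
            f"{token_reduction_pct(llama, ours):.2f}% fewer tokens than Llama"
        )
```

Under that assumption, Thai comes out at about 2.73x relative to Llama, consistent with the quoted 2.7x, and the English row works out to about a 0.24% reduction from the table values.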