Understanding vocab.txt

#1
by krishnagarg09 - opened

While looking at vocab.txt, I was wondering why the numbers in the vocabulary are not continuous.
For instance, see the sample below:

...
dice 63328
)@@ 63327
struggled 63326
wraps 63324
Investors 63312
#summer@@ 63305
...

As you can see, after 63305 we have 63312, followed by 63324. What about the numbers in between?

Also, it seems a bit strange that the vocabulary starts at around 3800:
르@@ 3800
utory 3798
...

Any explanations will be really appreciated.

VinAI Research org
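  1. Each number is the frequency count of the corresponding word in the pre-training corpus, not an index, so gaps between consecutive numbers are expected.
  2. Only the top 64k most frequent words are included in the vocab, which is why the counts stop at around 3800.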
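
To make the two points above concrete, here is a minimal Python sketch. It assumes each line of vocab.txt has the form `token count` separated by a single space, as in the sample quoted in the question; `parse_vocab` and the toy corpus are hypothetical names used only for illustration, not part of any library.

```python
from collections import Counter


def parse_vocab(lines):
    """Parse 'token count' lines into (token, count) pairs (assumed format)."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line or line == "...":
            continue  # skip blank lines and elisions
        # Split on the last space so the token is kept whole.
        token, count = line.rsplit(" ", 1)
        entries.append((token, int(count)))
    return entries


# The sample lines quoted in the question.
sample = [
    "dice 63328",
    ")@@ 63327",
    "struggled 63326",
    "wraps 63324",
    "Investors 63312",
    "#summer@@ 63305",
]

entries = parse_vocab(sample)
counts = [count for _, count in entries]

# The numbers are frequencies sorted in descending order, not indices,
# so nothing forces them to be consecutive.
is_sorted_desc = all(a >= b for a, b in zip(counts, counts[1:]))
gaps = [a - b for a, b in zip(counts, counts[1:])]
print(is_sorted_desc)  # True
print(gaps)            # [1, 1, 2, 12, 7]

# Toy illustration of a top-k cutoff: keeping only the k most frequent
# tokens means the smallest count in the file is whatever the k-th token has.
toy_counts = Counter("a a a a b b b c c d".split())
top2 = toy_counts.most_common(2)
min_kept = top2[-1][1]
print(top2)      # [('a', 4), ('b', 3)]
print(min_kept)  # 3
```

In the toy case the cutoff lands at a count of 3 because only two tokens are kept; by the same logic, the real file would end at roughly 3800 because that is the count of the least frequent word that still made the top 64k.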
dqnguyen changed discussion status to closed
