Bloom Tokenization

#269
by Niazi - opened

I am working with a low-resource language that makes up only a small portion of the ROOTS corpus on which Bloom was trained.

When I checked how the language is tokenized, it failed: I went through the language's whole vocabulary, but the Bloom tokenizer was not able to tokenize it.

Is it possible to inject vocabulary, built with the SentencePiece tokenization method, into Bloom's vocabulary and then tune it via a prompt?
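
To make the question concrete, here is a minimal sketch of what I mean by injecting vocabulary. It assumes the Hugging Face `transformers` interface (`add_tokens` on the tokenizer, `resize_token_embeddings` on the model). The helper name `extend_vocab` and its arguments are my own, and the candidate tokens would come from a SentencePiece model trained separately on my language's data. I have not verified that this is the right approach for Bloom.

```python
from typing import Iterable


def extend_vocab(tokenizer, model, candidate_tokens: Iterable[str]) -> int:
    """Add candidate tokens to the tokenizer and grow the model's embeddings.

    Assumption: `tokenizer` and `model` follow the Hugging Face transformers
    interface (tokenizer.add_tokens, len(tokenizer),
    model.resize_token_embeddings).
    Returns the number of tokens that were actually added.
    """
    # Drop duplicate candidates while keeping their original order.
    unique_tokens = list(dict.fromkeys(candidate_tokens))

    # add_tokens skips tokens the tokenizer already knows and
    # returns how many new ones it registered.
    num_added = tokenizer.add_tokens(unique_tokens)

    if num_added > 0:
        # The embedding matrix must match the new vocabulary size.
        # The added rows are newly initialised and still need training.
        model.resize_token_embeddings(len(tokenizer))

    return num_added
```

With a checkpoint such as `bigscience/bloom-560m` loaded through `AutoTokenizer` and `AutoModelForCausalLM`, the call would be `extend_vocab(tokenizer, model, new_tokens)`. The added embedding rows start out untrained, which is why I am asking about tuning afterwards.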

Any suggestions?
