This is a Unigram tokenizer trained on the WikiText dataset. See the train_unigram.py script in this repository for the exact training procedure.
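
For readers new to Unigram tokenization, here is a minimal pure-Python sketch of how a Unigram model picks a segmentation: every vocabulary token has a log-probability, and the tokenizer keeps the split whose tokens have the highest total score (a Viterbi search). The vocabulary `TOY_VOCAB` and the function `unigram_segment` are hypothetical names made up for illustration. They are not this tokenizer's real vocabulary and not the code in train_unigram.py.

```python
import math

# Hypothetical toy vocabulary (token -> log-probability), for illustration only.
# These values are NOT taken from this tokenizer's trained vocabulary.
TOY_VOCAB = {
    "uni": math.log(0.08),
    "un": math.log(0.10),
    "gram": math.log(0.10),
    "i": math.log(0.05),
    "u": math.log(0.02),
    "n": math.log(0.02),
    "g": math.log(0.02),
    "r": math.log(0.02),
    "a": math.log(0.02),
    "m": math.log(0.02),
}


def unigram_segment(text, vocab):
    """Return (tokens, score) for the highest-scoring segmentation of text."""
    n = len(text)
    # best_score[i] is the best log-probability of any segmentation of text[:i].
    best_score = [-math.inf] * (n + 1)
    # best_start[i] is where the last token of that best segmentation begins.
    best_start = [0] * (n + 1)
    best_score[0] = 0.0

    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece not in vocab or best_score[start] == -math.inf:
                continue
            score = best_score[start] + vocab[piece]
            if score > best_score[end]:
                best_score[end] = score
                best_start[end] = start

    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")

    # Follow the back-pointers from the end of the text to recover the tokens.
    tokens = []
    end = n
    while end > 0:
        start = best_start[end]
        tokens.append(text[start:end])
        end = start
    tokens.reverse()
    return tokens, best_score[n]
```

With this toy vocabulary, `unigram_segment("unigram", TOY_VOCAB)` returns the tokens `['uni', 'gram']`, because that split scores higher than alternatives such as `un` + `i` + `gram`. A trained tokenizer works from a much larger learned vocabulary and usually applies normalization and pre-tokenization first.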

