# Summary

A multilingual tokenizer trained on multilingual data using the SentencePiece library and the BPE algorithm.

* Vocab size: 100k
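As an illustration, here is a minimal training sketch using SentencePiece's Python API. Only the model type (BPE) and vocabulary size (100k) come from the summary above; the input file name, model prefix, and character coverage are assumptions, not the actual training configuration.

```python
import sentencepiece as spm

# Train a BPE tokenizer with a 100k vocabulary.
# NOTE: input path, model_prefix, and character_coverage are
# illustrative assumptions; only model_type and vocab_size are
# taken from the summary above.
spm.SentencePieceTrainer.train(
    input="multilingual_corpus.txt",   # assumed: raw text, one sentence per line
    model_prefix="multilingual_tok",   # assumed: writes .model and .vocab files
    model_type="bpe",                  # BPE algorithm, per the summary
    vocab_size=100_000,                # 100k vocabulary, per the summary
    character_coverage=0.9995,         # common multilingual setting (assumed)
)

# Load the trained model and tokenize a sample sentence.
sp = spm.SentencePieceProcessor(model_file="multilingual_tok.model")
print(sp.encode("Hello, world!", out_type=str))
```

A higher `character_coverage` (close to 1.0) is typically used for multilingual corpora so that rare scripts are not dropped from the character set before BPE merges are learned.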