This tokenizer was trained on a small corpus of concatenated ARPAbet pronunciation tokens + punctuation, produced by the Python g2p_en library over the entire synthbot/pony-speech dataset and 240k lines from generics_kb_best (from community-datasets/generics_kb).
e.g. "But one on one, let's clean it."
-> BAH1T WAH1N AA1N WAH1N , LEH1TS KLIY1N IH1T .
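
For reference, here is a minimal sketch of this kind of text-to-ARPAbet conversion with g2p_en. The exact corpus preprocessing may differ; the `to_arpabet` helper and its word-joining behavior are assumptions inferred from the example above.

```python
# Hedged sketch: g2p_en emits one phone per list element, with ' ' entries
# marking word boundaries and punctuation passed through as its own element.
# Joining each word's phones yields strings like "BAH1T WAH1N ...".
from g2p_en import G2p

g2p = G2p()

def to_arpabet(text: str) -> str:
    # Hypothetical helper; the actual corpus preprocessing is not published.
    words, current = [], []
    for phone in g2p(text):            # e.g. ['B', 'AH1', 'T', ' ', 'W', ...]
        if phone == ' ':               # word boundary
            if current:
                words.append(''.join(current))
                current = []
        else:                          # a phone or a punctuation mark
            current.append(phone)
    if current:
        words.append(''.join(current))
    return ' '.join(words)

print(to_arpabet("But one on one, let's clean it."))
# Roughly: BAH1T WAH1N AA1N WAH1N , LEH1TS KLIY1N IH1T .
# (exact output depends on the g2p_en version and its tokenization)
```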
Uses BPE with a vocab size of 1024.
It is trained on the same data as https://huggingface.co/therealvul/tokenizer_g2pen, with the following differences (a configuration sketch follows the list):
- It does not split on whitespace as a separate token
- It uses ByteLevel in the pre-tokenization and decoding steps
- It uses a vocab size of 1024 instead of 384
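
Below is a minimal sketch of how such a tokenizer could be configured and trained with the Hugging Face tokenizers library, assuming a plain-text file of ARPAbet lines like the example above. The file name and trainer settings are assumptions, not the exact training script.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder

# BPE model with ByteLevel pre-tokenization/decoding; spaces are folded
# into tokens rather than emitted as standalone whitespace tokens.
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)
tokenizer.decoder = ByteLevelDecoder()

trainer = BpeTrainer(
    vocab_size=1024,                        # matches the stated vocab size
    initial_alphabet=ByteLevel.alphabet(),  # seed with the byte-level alphabet
)

# "arpabet_corpus.txt" is a hypothetical file of lines like
# "BAH1T WAH1N AA1N WAH1N , LEH1TS KLIY1N IH1T ."
tokenizer.train(["arpabet_corpus.txt"], trainer)
tokenizer.save("tokenizer.json")

# Round-trip check
enc = tokenizer.encode("BAH1T WAH1N AA1N WAH1N , LEH1TS KLIY1N IH1T .")
print(enc.tokens)
print(tokenizer.decode(enc.ids))
```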