---
license: cc-by-4.0
language:
- pl
library_name: transformers
tags:
- tokenizer
- fast-tokenizer
- polish
datasets:
- radlab/legal-mc4-pl
- radlab/wikipedia-pl
- radlab/kgr10
- clarin-knext/msmarco-pl
- clarin-knext/fiqa-pl
- clarin-knext/scifact-pl
- clarin-knext/nfcorpus-pl
---

This is a fast tokenizer for Polish.

Number of documents used to train the tokenizer:
- 25 088 398

Sample usage with transformers:

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('radlab/polish-fast-tokenizer')

# Encode a sample sentence, then decode the ids back to text
tokenizer.decode(tokenizer("Ala ma kota i psa").input_ids)
```
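To see how a sentence is split, it can help to look at the token strings alongside the ids and the decoded text. The following is a minimal sketch, not part of the tokenizer itself: `load_tokenizer` and `inspect_encoding` are hypothetical helper names introduced here for illustration, and the sketch assumes the tokenizer behaves like a standard `transformers` tokenizer (callable, with `convert_ids_to_tokens` and `decode`).

```python
from typing import Any, Dict, List


def load_tokenizer(name: str = "radlab/polish-fast-tokenizer") -> Any:
    """Load the tokenizer from the Hugging Face Hub (needs network access)."""
    # Imported lazily so inspect_encoding can be used without loading the model.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(name)


def inspect_encoding(tokenizer: Any, text: str) -> Dict[str, Any]:
    """Hypothetical helper: collect ids, token strings and decoded text."""
    ids: List[int] = tokenizer(text).input_ids
    return {
        "input_ids": ids,
        # Token strings as stored in the vocabulary
        "tokens": tokenizer.convert_ids_to_tokens(ids),
        # Decoded text with special tokens (if any were added) dropped
        "decoded": tokenizer.decode(ids, skip_special_tokens=True),
    }
```

A call such as `inspect_encoding(load_tokenizer(), "Ala ma kota i psa")` returns a dictionary with the three views of the same input. The exact tokens and ids depend on the trained vocabulary.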