Issue with tokenizer

#1
by marinone94 - opened

Hi,

I was trying out this model and it seems there is an issue with the tokenizer replacing Swedish characters with their unaccented counterparts (ä --> a, å --> a, ö --> o). This looks odd to me, since the vocab file contains words that include those Swedish characters.

Examples:

Input:  försändelse från utlandet
Decoded:  [CLS] forsandelse fran utlandet [SEP]
Input:  Örebro är en fin stad
Decoded:  [CLS] orebro ar en fin stad [SEP]

To reproduce:

from transformers import AutoTokenizer

examples = ["försändelse från utlandet", "Örebro är en fin stad"]
tokenizer = AutoTokenizer.from_pretrained("af-ai-center/bert-base-swedish-uncased")
enc = tokenizer(examples)
dec = tokenizer.batch_decode(enc["input_ids"])
for input_example, decoded_example in zip(examples, dec):
    print("Input: ", input_example)
    print("Decoded: ", decoded_example)
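For what it's worth, this behaviour is consistent with BERT's `BasicTokenizer`, which strips accents by default when `do_lower_case=True` (assuming this checkpoint uses a BERT-style tokenizer). A minimal sketch of that accent stripping, mirroring the logic of `transformers`' `_run_strip_accents`:

```python
import unicodedata

def strip_accents(text: str) -> str:
    # NFD-decompose each character, then drop combining marks
    # (Unicode category "Mn"), so "ä" ("a" + U+0308) becomes "a".
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("försändelse från utlandet"))  # forsandelse fran utlandet
print(strip_accents("Örebro är en fin stad"))      # Orebro ar en fin stad
```

If that is indeed the cause, a possible workaround is loading the tokenizer with `strip_accents=False` (a keyword `BertTokenizer` accepts), e.g. `AutoTokenizer.from_pretrained("af-ai-center/bert-base-swedish-uncased", strip_accents=False)`, though whether the model was trained on accent-stripped text is a separate question.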

Env:

- `transformers` version: 4.18.0
- Platform: macOS-10.16-x86_64-i386-64bit
- Python version: 3.8.6
- Huggingface_hub version: 0.5.1
- PyTorch version (GPU?): 1.9.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
