strange tokens output

#2
by Geo - opened

Hi,
When I use the model's tokenizer in order to tokenize a Greek sentence

tokenizer = load_tokenizer(model_path)
tokenizer.tokenize("Ποιο τρίγωνο λέγεται αμβλυγώνιο?")

I get

['Î',
 'ł',
 'ο',
 'ιο',
 'ĠÏĦÏģίγÏīνο',
 'ĠλÎŃγεÏĦαι',
 'Ġαμβ',
 'λÏħ',
 'γÏİν',
 'ιο',
 '?']

Is this normal? Shouldn't I see tokens or sub-word tokens in Greek?
Also, when I open the vocabulary I don't see any Greek words.
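Looking at the output more closely, I suspect these are byte-level BPE tokens: GPT-2-style tokenizers map every UTF-8 byte to a printable stand-in character, so multi-byte Greek characters show up as Latin-looking mojibake. Here is a minimal sketch of that byte-to-character table (assuming the model uses a GPT-2-style byte-level BPE; `bytes_to_unicode` is my own reimplementation for illustration, not the model's code):

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character,
    the way GPT-2-style byte-level BPE tokenizers do for display/storage."""
    # Bytes that are already printable keep their own character.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # Remaining bytes (control chars, space, soft hyphen, ...) are shifted
    # above U+0100 so every byte gets a unique printable stand-in.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

table = bytes_to_unicode()

# Greek 'Π' is two UTF-8 bytes (0xCE 0xA0) -> rendered as 'Îł'
print("".join(table[b] for b in "Π".encode("utf-8")))  # Îł
# A leading space byte (0x20) is rendered as 'Ġ'
print(table[ord(" ")])  # Ġ
```

If that is what is happening, the tokens above would round-trip back to readable Greek via the tokenizer's `convert_tokens_to_string`, and the strange characters would just be a display artifact rather than missing Greek support.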

I want to fine-tune your model for text generation in the Greek language.
