Strange tokens when tokenizing a Greek sentence

#2
by Geo - opened

Hi,
When I use the model's tokenizer to tokenize a Greek sentence:

tokenizer = load_tokenizer(model_path)
tokenizer.tokenize("Ποιο τρίγωνο λέγεται αμβλυγώνιο?")

I get:

['Î',
 'ł',
 'ο',
 'ιο',
 'ĠÏĦÏģίγÏīνο',
 'ĠλÎŃγεÏĦαι',
 'Ġαμβ',
 'λÏħ',
 'γÏİν',
 'ιο',
 '?']

Is this normal? Shouldn't I see tokens or sub-word tokens in Greek?
Also, when I open the vocabulary, I don't see any Greek words.
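If it helps, those characters look like GPT-2's byte-level BPE encoding rather than corruption: each UTF-8 byte of the Greek text is remapped to a printable character before BPE runs, so non-Latin text looks garbled in the token strings even though no information is lost. A minimal sketch of that mapping (adapted from the standard GPT-2 `bytes_to_unicode` scheme; the helper `to_token_chars` is just for illustration):

```python
def bytes_to_unicode():
    # Bytes that are already printable keep their own character.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # Remaining bytes (control characters, space, etc.) are shifted
    # into the 256+ codepoint range so they stay printable.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

mapping = bytes_to_unicode()

def to_token_chars(text):
    # Show how a piece of text appears at the token level.
    return "".join(mapping[b] for b in text.encode("utf-8"))

print(to_token_chars("Π"))   # 'Îł' — the first two "tokens" above
print(to_token_chars(" "))   # 'Ġ' — the leading-space marker
```

So the tokens decode back to the original Greek; the vocabulary just stores them in this byte-remapped form, which is why you don't see readable Greek words in it.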

I want to fine-tune your model for text generation in the Greek language.

Dear Niko, I would like to ask you some questions about your fine-tuned model gpt2-greek. Do you have any contact info?
