Strange characters appear when tokenizing text

#9
by guidobenb - opened

Hi!
I want to understand why I get this output when I tokenize a text with this model, specifically the "Ġ" prefix on each token:

# "tokenizer" is this model's tokenizer; "example" is a dataset row with pre-split words
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print(tokens)

['', 'ĠGermany', "Ġ'", 's', 'Ġrepresentative', 'Ġto', 'Ġthe', 'ĠEuropean', 'ĠUnion', "Ġ'", 's', 'Ġveterinary', 'Ġcommittee', 'ĠWerner', 'ĠZ', 'wing', 'mann', 'Ġsaid', 'Ġon', 'ĠWednesday', 'Ġconsumers', 'Ġshould', 'Ġbuy', 'Ġsheep', 'meat', 'Ġfrom', 'Ġcountries', 'Ġother', 'Ġthan', 'ĠBritain', 'Ġuntil', 'Ġthe', 'Ġscientific', 'Ġadvice', 'Ġwas', 'Ġclearer', 'Ġ.', '']
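In case it helps narrow things down: the "Ġ" looks like the space marker used by byte-level BPE tokenizers (GPT-2/RoBERTa style). Those tokenizers remap raw bytes that are not printable on their own, including the space byte, to visible Unicode characters by shifting them up by 256, so a leading "Ġ" just means "this token was preceded by a space". A minimal sketch of that mapping, assuming this model uses such a tokenizer:

```python
# Byte-level BPE remaps the space byte (0x20) to a visible character
# by adding 256: chr(32 + 256) == chr(0x120) == "Ġ".
space_marker = chr(ord(" ") + 256)
print(space_marker)  # Ġ

# So "ĠGermany" is just " Germany" with the space made visible.
print("ĠGermany"[0] == space_marker)  # True
```

This is purely a display convention inside the vocabulary; decoding the ids back with `tokenizer.decode` restores the ordinary spaces.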
