outputs with lots of <unk> tokens

#2
by mnwato - opened

I used pipeline-based inference and it was OK, but when I generate manually the output has lots of <unk> tokens.

input: توانا بود هر که
generated (pipeline): توانا بود هر که را که توانا بوده در این روز یاری داد و در راه خدا جهاد کرد و شهید کرد
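
For reference, the pipeline call was roughly along these lines (a sketch; the device index and max_length are illustrative, not necessarily the exact values I used):

    from transformers import pipeline

    # Rough sketch of the pipeline-based inference that worked; the device
    # index and max_length here are illustrative.
    generator = pipeline('text-generation', model='bolbolzaban/gpt2-persian', device=0)
    print(generator('توانا بود هر که', max_length=100)[0]['generated_text'])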
    # model, tokenizer and sent (the prompt above) are already defined
    model_inputs = tokenizer(sent, padding=True, return_tensors='pt').to('cuda')
    generated_ids = model.generate(**model_inputs, max_length=100)
    generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
generated: توانا بود هر که بهکستاربُن امنت<unk>نپ ازرفاهأبت<unk>ه<unk>نناری<unk>ک<unk>ت<unk>ز<unk>ت<unk>نُ<unk>خجَثّهتنوشکب<unk>ت<unk>بابم<unk>هان<unk>ه کُک<unk>نی<unk>کن<unk><unk>ک<unk> <unk>ب<unk>لّت ن<unk>ی<unk> ب<unk><unk>

But generating with this script produces output without <unk> tokens, though the result is different from the pipeline-based inference.

    # passing only input_ids instead of unpacking the whole model_inputs dict
    generated_ids = model.generate(input_ids=model_inputs['input_ids'], max_length=100)
generated: توانا بود هر که به گزارش خبرنگار گروه استان‌های باشگاه خبرنگاران جوان از ساری ، روابط عمومی اداره کل حفاظت محیط زیست مازندران اعلام کرد : ماموران یگان حفاظت محیط زیست آمل در حین گشت و کنترل در منطقه حفاظت شده میانکاله به یک شکارچی غیرمجاز که در حال شکار یک راس کل وحشی بود ، مشکوک شدند.
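
The only difference between the two calls is whatever extra tensors the tokenizer returned besides input_ids; a quick check (same model_inputs as above):

    # model_inputs is a BatchEncoding; any key besides input_ids (typically
    # attention_mask) is also forwarded to generate() when unpacking with
    # **model_inputs, which is the only difference between the two calls.
    print(model_inputs.keys())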

Any help is appreciated

Generally, don’t expect too much from this model; it is small and stupid. You will get the best responses if you:

1) lower the temperature,
2) give a longer input that is correct in terms of grammar and semantics,
3) don’t use any non-Persian characters in your input; as mentioned in the doc, this model was only trained on text with Persian characters,
4) note that this is not the poetry model used on bolbolzaban.com, so don’t expect it to continue Persian poetry,
5) when you get <unk> tokens, simply retry programmatically (see the sketch below).
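
Something along these lines for points 1 and 5 (a rough sketch, assuming your model and tokenizer from above; the temperature and retry count are just illustrative values):

    import torch

    # Rough sketch of "lower the temperature" + "retry when <unk> shows up".
    # Assumes `model` and `tokenizer` are the ones loaded in the question;
    # temperature and max_retries are illustrative, not recommended settings.
    def generate_without_unk(prompt, max_retries=5):
        inputs = tokenizer(prompt, return_tensors='pt').to('cuda')
        for _ in range(max_retries):
            with torch.no_grad():
                out = model.generate(
                    inputs['input_ids'],
                    max_length=100,
                    do_sample=True,
                    temperature=0.7,
                )
            if tokenizer.unk_token_id is not None and tokenizer.unk_token_id in out[0]:
                continue  # <unk> in the output, try again
            return tokenizer.decode(out[0], skip_special_tokens=True)
        return None  # gave up after max_retries attempts

    print(generate_without_unk('توانا بود هر که'))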

Thanks for your response. You are right, but I think the problem I faced is a mistake in the inference method. Although the model is small, it generates correct text (the output produced with the pipeline). It seems to be related to tokenization.
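
One quick check is to round-trip the prompt through the tokenizer (a small sketch, using the same tokenizer as above) and see whether unknown tokens already appear before any generation:

    # Round-trip the prompt: if <unk> already shows up here, the problem is in
    # tokenization itself rather than in generate().
    ids = tokenizer('توانا بود هر که')['input_ids']
    print(tokenizer.convert_ids_to_tokens(ids))
    print(tokenizer.decode(ids))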

Actually, I plan to train a task-specific GPT model in Persian (or maybe fine-tune open LLMs like Llama 2). What do you think about fine-tuning or retraining the 'bolbolzaban/gpt2-persian' model to achieve better accuracy on a specific task?

You are right, the tokenizer could be the issue. I assume you noticed in the example that the tokenizer can be created with AutoTokenizer.from_pretrained('bolbolzaban/gpt2-persian').

Regarding fine-tuning this model, it depends on your task. In the bolbolzaban model, as mentioned in the post and the Medium articles, English characters are replaced with [LAT]. So if your use case has mixed-language text, this is not a good one to fine-tune.
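
If you do fine-tune it on your own data, the text would presumably need the same kind of normalization; a rough sketch of what that could look like (the exact rules used for the original training data are not described here, so treat this regex as an assumption):

    import re

    # Hypothetical preprocessing sketch: collapse runs of Latin characters into
    # the [LAT] placeholder, roughly mirroring the normalization described for
    # the training data. The exact rules used for gpt2-persian may differ.
    def replace_latin(text):
        return re.sub(r'[A-Za-z]+', '[LAT]', text)

    print(replace_latin('این یک test است'))  # -> این یک [LAT] است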

It is so weird because I have used AutoTokenizer.from_pretrained('bolbolzaban/gpt2-persian').