Potentially incorrect tokenizer
Hello,
I was trying to play with your model for a summarization task and the output is not understandable.
When inputting text, I am getting output like this:
[{'summary_text': ' academic us ascend carpet than ask commence '
'concentrate content and raise de total at '}]
This is my code using the text from the huggingface transformers tutorial for your reference.
Thank you,
Dale
from pprint import pprint
from transformers import pipeline
input_text = """
America has changed dramatically during recent years. Not only has the number of
graduates in traditional engineering disciplines such as mechanical, civil,
electrical, chemical, and aeronautical engineering declined, but in most of
the premier American universities engineering curricula now concentrate on
and encourage largely the study of engineering science. As a result, there
are declining offerings in engineering subjects dealing with infrastructure,
the environment, and related issues, and greater concentration on high
technology subjects, largely supporting increasingly complex scientific
developments. While the latter is important, it should not be at the expense
of more traditional engineering.
Rapidly developing economies such as China and India, as well as other
industrial countries in Europe and Asia, continue to encourage and advance
the teaching of engineering. Both China and India, respectively, graduate
six and eight times as many traditional engineers as does the United States.
Other industrial countries at minimum maintain their output, while America
suffers an increasingly serious decline in the number of engineering graduates
and a lack of well-educated engineers.
"""
summarizer = pipeline(
    "summarization",
    model="pszemraj/long-t5-tglobal-base-16384-book-summary",
)

# parameters for text generation out of the model
params = {
    "max_length": 100,
    "min_length": 8,
    "no_repeat_ngram_size": 3,
    "early_stopping": True,
    "repetition_penalty": 3.5,
    "length_penalty": 0.3,
    "encoder_no_repeat_ngram_size": 3,
    "num_beams": 4,
}

pprint(summarizer(input_text, **params))
Hi, what package versions are you using? Try updating with pip install -U transformers
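A quick way to report the relevant versions is with the standard library's importlib.metadata (a minimal sketch, using only the standard library; it assumes nothing beyond the package names mentioned in this thread):

```python
# quick environment check -- prints the installed versions of the
# packages relevant to this issue, or a note if one is missing
import importlib.metadata as md

for pkg in ("transformers", "torch"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")
```

Pasting the output of that snippet alongside a bug report makes version mismatches easy to spot.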
BTW, I don't want to discourage you from using this model, but unless there is a specific reason you want models trained on the Booksum dataset, I would recommend the models I trained more recently for general use. If you continue to have problems, you can try the pegasus-x model instead: https://hf.co/BEE-spoke-data/pegasus-x-base-synthsumm_open-16k
Hello pszemraj,
Thank you for the reply. I am not using your model for a specific reason; I am just playing with a variety of seq2seq models to see how they perform on summarization tasks, and yours landed on the list.
I forgot to mention that I am not running on a GPU. I am just doing toy problems on a laptop to get a feel for the models. This may be the root of the problem.
The version of the packages I think might be relevant are:
torch 2.5.1
transformers 4.47.1
transformers was installed using this command.
pip install -U 'transformers[torch]'
After running the upgrade command, transformers changed to version 4.48.0, but it generated nonsense again.
If this issue is not important to you, I am happy for you to not address it and close this thread. I will take a look at the models you suggested.
I sincerely appreciate your reply and thank you for contributing your work.
Dale
Hello,
After rereading your model card, I added the CUDA check to the pipeline call, and things look better.
device=0 if torch.cuda.is_available() else -1
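For reference, transformers pipelines take device=0 for the first GPU and device=-1 for CPU. A minimal sketch of that selection logic, with the availability flag factored into a hypothetical helper (pick_device is not part of transformers; the pipeline construction is shown commented out so the snippet stands alone):

```python
def pick_device(cuda_available: bool) -> int:
    """Map a CUDA availability flag to a transformers pipeline device id.

    transformers pipelines accept device=0 (first GPU) or device=-1 (CPU).
    """
    return 0 if cuda_available else -1

# With torch installed, the check from the thread would be:
# import torch
# device = pick_device(torch.cuda.is_available())
# summarizer = pipeline("summarization",
#                       model="pszemraj/long-t5-tglobal-base-16384-book-summary",
#                       device=device)
```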
The output now looks like this, rather than the odd output from before.
[{'summary_text': 'parent origin which went ran demandstan union back name '
'rule fact out over run and gut find " cons carpet When cu '
'Stewart more how where when see the up money hair The'}]
I am sorry to have bothered you.
Dale