Potentially incorrect tokenizer
Hello,
I was trying to play with your model for a summarization task and the output is not understandable.
When inputting text, I am getting output like this:
[{'summary_text': ' academic us ascend carpet than ask commence '
'concentrate content and raise de total at '}]
This is my code using the text from the huggingface transformers tutorial for your reference.
Thank you,
Dale
from pprint import pprint
from transformers import pipeline
input_text = """
America has changed dramatically during recent years. Not only has the number of
graduates in traditional engineering disciplines such as mechanical, civil,
electrical, chemical, and aeronautical engineering declined, but in most of
the premier American universities engineering curricula now concentrate on
and encourage largely the study of engineering science. As a result, there
are declining offerings in engineering subjects dealing with infrastructure,
the environment, and related issues, and greater concentration on high
technology subjects, largely supporting increasingly complex scientific
developments. While the latter is important, it should not be at the expense
of more traditional engineering.
Rapidly developing economies such as China and India, as well as other
industrial countries in Europe and Asia, continue to encourage and advance
the teaching of engineering. Both China and India, respectively, graduate
six and eight times as many traditional engineers as does the United States.
Other industrial countries at minimum maintain their output, while America
suffers an increasingly serious decline in the number of engineering graduates
and a lack of well-educated engineers.
"""
summarizer = pipeline(
    "summarization",
    model="pszemraj/long-t5-tglobal-base-16384-book-summary",
)

# parameters for text generation out of the model
params = {
    "max_length": 100,
    "min_length": 8,
    "no_repeat_ngram_size": 3,
    "early_stopping": True,
    "repetition_penalty": 3.5,
    "length_penalty": 0.3,
    "encoder_no_repeat_ngram_size": 3,
    "num_beams": 4,
}

pprint(summarizer(input_text, **params))
Hi, what package versions are you using? Try updating with pip install -U transformers
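A quick way to report the relevant versions is with the standard library's importlib.metadata (a minimal sketch, using only the standard library; it assumes nothing beyond the package names mentioned in this thread):

```python
# quick environment check -- prints the installed versions of the
# packages relevant to this issue, or a note if one is missing
import importlib.metadata as md

for pkg in ("transformers", "torch"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")
```

Pasting the output of that snippet alongside a bug report makes version mismatches easy to spot.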
BTW, I don't want to discourage you from using this model, but unless there is a specific reason you want models trained on the Booksum dataset, I would recommend the models I trained more recently for general use. If you continue to have problems, you can try the pegasus-x model instead: https://hf.co/BEE-spoke-data/pegasus-x-base-synthsumm_open-16k
Hello pszemraj,
Thank you for the reply. I am not using your model for a specific reason; I am just playing with a variety of seq2seq models to see how they perform on summarization tasks, and yours landed on the list.
I forgot to mention that I am not running on a GPU. I am just doing toy problems on a laptop to get a feel for the models. This may be the root of the problem.
The version of the packages I think might be relevant are:
torch 2.5.1
transformers 4.47.1
transformers was installed using this command.
pip install -U 'transformers[torch]'
After running the upgrade command, transformers changed to version 4.48.0, but it generated nonsense again.
If this issue is not important to you, I am happy for you to not address it and close this thread. I will take a look at the models you suggested.
I sincerely appreciate your reply and thank you for contributing your work.
Dale
Hello,
After rereading your model card, I added the CUDA check to the pipeline call, and things look better.
device=0 if torch.cuda.is_available() else -1
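For reference, transformers pipelines take device=0 for the first GPU and device=-1 for CPU. A minimal sketch of that selection logic, with the availability flag factored into a hypothetical helper (pick_device is not part of transformers; the pipeline construction is shown commented out so the snippet stands alone):

```python
def pick_device(cuda_available: bool) -> int:
    """Map a CUDA availability flag to a transformers pipeline device id.

    transformers pipelines accept device=0 (first GPU) or device=-1 (CPU).
    """
    return 0 if cuda_available else -1

# With torch installed, the check from the thread would be:
# import torch
# device = pick_device(torch.cuda.is_available())
# summarizer = pipeline("summarization",
#                       model="pszemraj/long-t5-tglobal-base-16384-book-summary",
#                       device=device)
```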
The output now looks like this, rather than the odd output from before.
[{'summary_text': 'parent origin which went ran demandstan union back name '
'rule fact out over run and gut find " cons carpet When cu '
'Stewart more how where when see the up money hair The'}]
I am sorry to have bothered you.
Dale