Format of the Input article

#5
by SeemaChhatani03 - opened

Hello, I am trying to use this model for research paper summarization.
My dataset has the article as the input and the research paper abstract as the ground truth.
Do I need to add the special tokens to the article and to the ground truth (the abstract)?

Or can I feed the data as-is, and the model tokenizer will take care of the required tokens?

Hello!

You will have to format the input a bit with some tokens.

Format your data so that each new article (or piece of evidence to add) has the <EV> token in front, with each title prefixed by <t> and each abstract prefixed by <abs>. Please put the original summary in the same format. You can concatenate the list of articles and the original summary in any order, as long as they have the correct separator tokens.

What it would look like: <EV> <t> title_of_article_1 <abs> abstract_of_article_1 <EV> <t> title_of_article_2 <abs> abstract_of_article_2 ...
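The format above can be sketched as a small helper. This is a minimal, illustrative sketch: the function name, argument structure, and the choice to append the original summary last (prefixed with <abs>) are assumptions on my part, not part of the model's API.

```python
# Illustrative sketch: build the model input string from a list of
# (title, abstract) pairs plus an existing summary, using the
# <EV> / <t> / <abs> separator tokens described above.
# NOTE: helper name and the summary's position are assumptions.
def build_input(articles, summary):
    parts = []
    for title, abstract in articles:
        parts.append(f"<EV> <t> {title} <abs> {abstract}")
    # The original summary uses the same separator format; here it
    # is simply concatenated at the end.
    parts.append(f"<abs> {summary}")
    return " ".join(parts)

example = build_input(
    [("Title A", "Abstract A"), ("Title B", "Abstract B")],
    "Existing summary text",
)
print(example)
# <EV> <t> Title A <abs> Abstract A <EV> <t> Title B <abs> Abstract B <abs> Existing summary text
```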

Hi, thanks for getting back to me.
I am using the PubMed summarization dataset on Hugging Face: https://huggingface.co/datasets/scientific_papers/viewer/pubmed/train
It has three fields: article, abstract, and section_names. The title of the article is not included in the article field, and it is not provided as a separate field either.
So for this dataset I just need to add the separator tokens to the article and abstract fields, correct?

Also, I am not sure how to add these tokens to the entire dataset. Any suggestions on that?

Thank you in advance.

Oh, I see. Yes, you can just leave out the <t> title part. You can do something like <EV> <abs> abstract text, but I am not really sure what the performance would look like when the title is not given.
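Applying this to the whole dataset could look like the sketch below. With the Hugging Face `datasets` library you would pass the same function to `dataset.map(...)`; here it is shown on plain dicts so the shape of the transformation is clear. The field names `article` and `abstract` match the scientific_papers/pubmed dataset; prefixing the target abstract with <abs> is my assumption.

```python
# Illustrative sketch: prepend the separator tokens to each example.
# With Hugging Face datasets: dataset = dataset.map(add_tokens)
def add_tokens(example):
    # No <t> title part, since pubmed has no separate title field.
    example["article"] = f"<EV> <abs> {example['article']}"
    # Prefixing the target with <abs> is an assumption, mirroring
    # the input-side format.
    example["abstract"] = f"<abs> {example['abstract']}"
    return example

sample = [{"article": "body text 1", "abstract": "summary 1"}]
formatted = [add_tokens(dict(ex)) for ex in sample]
print(formatted[0]["article"])   # <EV> <abs> body text 1
print(formatted[0]["abstract"])  # <abs> summary 1
```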

This model is specifically trained to generate a synthesis summary based on an existing summary and new articles. This is a specific form of multi-document summarization.

What exactly are you trying to achieve? Could you provide some example input and expected output for your task?

Actually, I want to perform text summarization on the research papers dataset provided on Hugging Face: https://huggingface.co/datasets/scientific_papers/viewer/pubmed/train .
My aim is to achieve a better ROUGE score, or at least a comparable one, by fine-tuning summarization models on my dataset.
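As a side note on the metric itself, ROUGE-1 measures unigram overlap between the generated and reference summaries. Below is a toy implementation purely to illustrate what the score measures; for real evaluation you would use a maintained package such as `rouge-score` or the `evaluate` library.

```python
from collections import Counter

# Toy ROUGE-1 F1 (unigram overlap), for illustration only.
def rouge1_f1(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat is on the mat")
print(round(score, 3))  # 0.833
```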

I would suggest using a different summarization model for this. The type of summarization this particular model is trained to do differs from regular multi-document summarization. Perhaps LED (https://huggingface.co/docs/transformers/model_doc/led), which this model is based on, or maybe even T5 could be good candidate models for the PubMed text summarization fine-tuning that you want to do.

Thanks a lot for your feedback.
