Format of the Input article

#5
by SeemaChhatani03 - opened

Hello, I am trying to use this model for research paper summarization.
My dataset has the article as the input and the research paper abstract as the ground truth.
Do I need to add the special tokens to the article and to the ground truth (the abstract)?

Or can I feed the data as-is, and the model tokenizer will take care of the required tokens?

Hello!

You will have to format the input a bit with some tokens.

Format your data so that each new article (or piece of evidence to add) has the <EV> token in front, with each title prefixed by <t> and each abstract prefixed by <abs>. Please put the original summary in the same format. You can concatenate the list of articles and the original summary in any order, as long as they have the correct separator tokens.

What it would look like: <EV> <t> title_of_article_1 <abs> abstract_of_article_1 <EV> <t> title_of_article_2 <abs> abstract_of_article_2 ...
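The format above can be sketched as a small helper. This is a minimal, illustrative sketch: the function name, argument structure, and the choice to append the original summary last (prefixed with <abs>) are assumptions on my part, not part of the model's API.

```python
# Illustrative sketch: build the model input string from a list of
# (title, abstract) pairs plus an existing summary, using the
# <EV> / <t> / <abs> separator tokens described above.
# NOTE: helper name and the summary's position are assumptions.
def build_input(articles, summary):
    parts = []
    for title, abstract in articles:
        parts.append(f"<EV> <t> {title} <abs> {abstract}")
    # The original summary uses the same separator format; here it
    # is simply concatenated at the end.
    parts.append(f"<abs> {summary}")
    return " ".join(parts)

example = build_input(
    [("Title A", "Abstract A"), ("Title B", "Abstract B")],
    "Existing summary text",
)
print(example)
# <EV> <t> Title A <abs> Abstract A <EV> <t> Title B <abs> Abstract B <abs> Existing summary text
```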

Hi, thanks for getting back to me.
I am using the PubMed summarization dataset on Hugging Face: https://huggingface.co/datasets/scientific_papers/viewer/pubmed/train
It has three fields: article, abstract, and section_names. The title of the article is not included in the article field, and it is not provided as a separate field either.
So for this dataset I just need to add the separator tokens to the article and abstract fields, correct?

Also, I am not sure how to add these tokens to the entire dataset. Any suggestions on that?

Thank you in advance.

Oh, I see. Yes, you can just leave out the <t> title part. You can do something like <EV> <abs> abstract text, but I am not really sure what the performance would look like when the title is not given.
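Applying this to the whole dataset could look like the sketch below. With the Hugging Face `datasets` library you would pass the same function to `dataset.map(...)`; here it is shown on plain dicts so the shape of the transformation is clear. The field names `article` and `abstract` match the scientific_papers/pubmed dataset; prefixing the target abstract with <abs> is my assumption.

```python
# Illustrative sketch: prepend the separator tokens to each example.
# With Hugging Face datasets: dataset = dataset.map(add_tokens)
def add_tokens(example):
    # No <t> title part, since pubmed has no separate title field.
    example["article"] = f"<EV> <abs> {example['article']}"
    # Prefixing the target with <abs> is an assumption, mirroring
    # the input-side format.
    example["abstract"] = f"<abs> {example['abstract']}"
    return example

sample = [{"article": "body text 1", "abstract": "summary 1"}]
formatted = [add_tokens(dict(ex)) for ex in sample]
print(formatted[0]["article"])   # <EV> <abs> body text 1
print(formatted[0]["abstract"])  # <abs> summary 1
```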

This model is specifically trained to generate a synthesis summary based on an existing summary and new articles. This is a specific form of multi-document summarization.

What exactly are you trying to achieve? Could you provide some example input and expected output for your task?

Actually, I want to perform text summarization on the research papers dataset provided on Hugging Face: https://huggingface.co/datasets/scientific_papers/viewer/pubmed/train .
My aim is to achieve a better ROUGE score, or at least a comparable one, by fine-tuning summarization models on my dataset.
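As a side note on the metric itself, ROUGE-1 measures unigram overlap between the generated and reference summaries. Below is a toy implementation purely to illustrate what the score measures; for real evaluation you would use a maintained package such as `rouge-score` or the `evaluate` library.

```python
from collections import Counter

# Toy ROUGE-1 F1 (unigram overlap), for illustration only.
def rouge1_f1(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat is on the mat")
print(round(score, 3))  # 0.833
```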

I would suggest using a different summarization model for this. The type of summarization this particular model is trained to do differs from regular multi-document summarization. Perhaps LED (https://huggingface.co/docs/transformers/model_doc/led), which this model is based on, or maybe even T5 could be good candidate models for the PubMed text summarization fine-tuning that you want to do.

Thanks a lot for your feedback.
