language:
- en
pipeline_tag: summarization
A teacher-student abstractive summarization model for news articles, fine-tuned from BART-large with StableBeluga-7B as the teacher.
The dataset consists of 295,174 news articles scraped from a Mexican newspaper, each paired with its summary. For simplicity, the Spanish news articles were translated to English using the Helsinki-NLP/opus-mt-es-en model.
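The translation step could be sketched roughly as follows. This is a minimal illustration, not the authors' actual preprocessing script: the helper names `batched` and `translate_es_to_en` and the batch size are assumptions, and only the model name Helsinki-NLP/opus-mt-es-en comes from this card. Marian models accept a limited number of input tokens, so long articles would likely need to be split into smaller pieces before translation; here they are simply truncated.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_es_to_en(texts, batch_size=8):
    """Translate a list of Spanish strings to English (hypothetical helper)."""
    # Imported lazily so `batched` can be used without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    name = "Helsinki-NLP/opus-mt-es-en"
    tokenizer = MarianTokenizer.from_pretrained(name)
    model = MarianMTModel.from_pretrained(name)

    translations = []
    for batch in batched(texts, batch_size):
        # Pad to the longest text in the batch; truncate over-long inputs.
        encoded = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        generated = model.generate(**encoded)
        translations.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return translations
```

Calling `translate_es_to_en(["Hola mundo."])` would download the model on first use and return a list with one English string.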
The teacher observations (reference summaries) were generated with StableBeluga-7B. These teacher observations were then used to fine-tune a lightweight BART model.
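As an illustration of how a teacher prompt might be assembled, the sketch below follows the `### System:` / `### User:` / `### Assistant:` layout documented for StableBeluga models. The system prompt, the instruction wording, and the function name `build_teacher_prompt` are assumptions; the prompt actually used to create the teacher observations is not stated in this card.

```python
# Hypothetical prompt text; the wording used for this model is not documented here.
DEFAULT_SYSTEM_PROMPT = "You are an assistant that writes concise news summaries."
DEFAULT_INSTRUCTION = "Summarize the following news article."


def build_teacher_prompt(article,
                         system_prompt=DEFAULT_SYSTEM_PROMPT,
                         instruction=DEFAULT_INSTRUCTION):
    """Format an article using the StableBeluga chat layout."""
    return (
        f"### System:\n{system_prompt}\n\n"
        f"### User:\n{instruction}\n\n{article.strip()}\n\n"
        "### Assistant:\n"
    )
```

The resulting string would be tokenized and passed to the teacher model's `generate` method, and the text produced after `### Assistant:` would serve as the teacher observation for that article.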
The objective is to obtain a lightweight model that summarizes as well as StableBeluga-7B while running much faster and using far fewer computing resources.
On a validation dataset, the lightweight BART model produced summaries very similar to the teacher's (0.66 ROUGE-1 and 0.90 cosine similarity), with 3x faster predictions and considerably lower GPU memory usage.
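For readers unfamiliar with the two metrics, here is a simplified pure-Python sketch of each. It is not the evaluation code used for the numbers above: the whitespace tokenizer and the choice of the F1 variant of ROUGE-1 are assumptions, and the cosine similarity function takes generic vectors because this card does not say which embeddings were compared.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between a reference and a candidate summary."""
    ref, cand = Counter(tokenize(reference)), Counter(tokenize(candidate))
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

In this setting the reference would be the StableBeluga-7B summary and the candidate the BART summary, with the vectors for `cosine_similarity` coming from whatever sentence-embedding model is chosen for the comparison.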
How to use:
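# Import dependencies
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the model and tokenizer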
model = AutoModelForSeq2SeqLM.from_pretrained("JordiAb/BART_news_summarizer")
tokenizer = AutoTokenizer.from_pretrained("JordiAb/BART_news_summarizer")
# News article text
article_text = """
Los Angeles Lakers will have more time than anticipated. The four-time NBA Most Valuable Player (MVP) extended his contract for two years and $85 million, keeping him in California until 2023. In 2018, The King had already signed for 153 mdd and, in his second campaign in the quintet, led the championship in the Orlando bubble. With 35 years of life – he turns 36 on December 30 – and 17 campaigns of experience, LeBron is still considered one of the best (or the best) NBA players. You can read: "Mercedes found Lewis Hamilton\'s substitute" James just took the Lakers to his first NBA title since 2010 and was named MVP of the Finals; he led the League in assists per game (10.2) for the first time in his career, while adding 25.3 points and 7.8 rebounds per performance, during the last campaign. James has adapted to life in Hollywood, as he will be part of the sequel to Space Jam, to be released next year.
"""
# Tokenize the text, truncating to the 1024-token input limit of BART-large
inputs = tokenizer(article_text, return_tensors='pt', truncation=True, max_length=1024)
# Generate the summary with beam search (no gradients needed for inference)
with torch.no_grad():
    summary_ids = model.generate(
        inputs['input_ids'],
        num_beams=4,
        max_length=250,
        early_stopping=True
    )
# Decode the generated token IDs back into text
summary = tokenizer.decode(
    summary_ids[0],
    skip_special_tokens=True
)
print(summary)