metadata
license: cc-by-nc-sa-4.0
datasets:
- NorGLM/NO-CNN-DailyMail
language:
- 'no'
pipeline_tag: summarization
Model Card
NorGPT-3B-summarization-peft is trained on top of the NorGPT-3B model using an RLHF strategy on the NO-CNN-DailyMail dataset.
Unlike step 2 of the original RLHF procedure, we trained the reward model by estimating the semantic similarity between each candidate generated text and the human-annotated summary (golden summary) using the NorBERT model. Generated summaries with higher cosine similarity to the golden summary are ranked higher when training the reward model.
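The ranking idea can be illustrated with a minimal sketch. It assumes each summary has already been turned into a fixed-size embedding vector (for example by pooling NorBERT outputs; the pooling method is not specified in this card), and the helper names `rank_candidates` and `preference_pairs` are hypothetical. Turning the ranking into pairwise (chosen, rejected) examples is one common way to feed a reward model and is an assumption here, not a description of the exact training code.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(golden_vec, candidate_vecs):
    """Return (order, scores): candidate indices sorted from most to least similar."""
    scores = [cosine_similarity(golden_vec, c) for c in candidate_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


def preference_pairs(order):
    """Expand a ranking into (chosen, rejected) index pairs (assumed pairwise setup)."""
    return [(order[i], order[j]) for i, j in combinations(range(len(order)), 2)]


# Toy 2-d vectors standing in for real sentence embeddings.
golden = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]
order, scores = rank_candidates(golden, candidates)
pairs = preference_pairs(order)
print(order)  # [1, 2, 0]
print(pairs)  # [(1, 2), (1, 0), (2, 0)]
```

In the toy example, candidate 1 points almost in the same direction as the golden vector, so it is ranked first and appears as the chosen item in every pair it takes part in.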
Prompt format:
Summarise the article:\\n{article} |||\\n{positive_sample}
Inference prompt:
Summarise the article:\\n{article} |||\\n
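As a small illustration of the two templates, the sketch below builds both prompts and recovers the summary from decoded text. The function names are hypothetical, and the separator strings are copied exactly as they are written in this card (including the escaped `\\n`), which is assumed to match what the model saw during training.

```python
# Strings copied verbatim from the templates in this card.
PREFIX = 'Summarise the article:\\n'
SEPARATOR = ' |||\\n'


def build_training_prompt(article, positive_sample):
    """Training format: article followed by the reference (golden) summary."""
    return PREFIX + article + SEPARATOR + positive_sample


def build_inference_prompt(article):
    """Inference format: same as training, but the summary is left for the model to write."""
    return PREFIX + article + SEPARATOR


def extract_summary(decoded_text):
    """Keep only the text after the last separator, i.e. the summary part."""
    return decoded_text.split('|||\\n')[-1]


example = build_training_prompt("Some article text.", "A short summary.")
print(extract_summary(example))  # A short summary.
```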
Training Split
We split the data across the three RLHF steps (step 1 to step 3) as follows:
| | #samples |
|---|---|
| step 1 | 61181 |
| step 2 | 16798 |
| step 3 | 9758 |
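For a quick sense of how the data is distributed, the following sketch adds up the sample counts from the table and computes each step's share. Only the three counts come from this card; the variable names are illustrative.

```python
# Sample counts per RLHF step, taken from the table above.
split_sizes = {"step 1": 61181, "step 2": 16798, "step 3": 9758}

total = sum(split_sizes.values())
# Share of the total for each step, in percent, rounded to one decimal place.
shares = {step: round(100 * n / total, 1) for step, n in split_sizes.items()}

print(total)   # 87737
print(shares)  # {'step 1': 69.7, 'step 2': 19.1, 'step 3': 11.1}
```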
Run the Model
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "NorGLM/NorGPT-3B-rfhl-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    torch_dtype=torch.bfloat16
)
Inference on test set
Load the model as shown above, then evaluate it on the test set of the NO-CNN-DailyMail dataset:
def generate_texts(model, tokenizer, prompts, max_seq_length=200, do_sample=True, top_p=0.95, top_k=10):
    # prompts is a list of news articles
    # Note: decoding below is greedy (do_sample=False), so do_sample, top_p and top_k are currently unused
    results = []
    for prompt in prompts:
        # Skip articles longer than 1024 whitespace-separated words and keep an empty placeholder
        pro_len = len(prompt.split())
        if pro_len > 1024:
            results.append('')
            continue

        prompt = 'Summarise the article:\\n' + prompt + ' |||\\n'
        # Move the inputs to the device the model was loaded on
        model_inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
        output = model.generate(**model_inputs, do_sample=False, max_new_tokens=max_seq_length)
        result = tokenizer.decode(output[0], skip_special_tokens=True)
        # Keep only the text generated after the separator
        result = result.split("|||\\n")[-1]
        results.append(result)
    return results
from datasets import load_dataset
import pandas as pd

print("--LOADING EVAL DATA---")
# A single data file is exposed under the default 'train' split
eval_data = load_dataset("NorGLM/NO-CNN-DailyMail", data_files="test.csv")
prompts = eval_data['train']['article']
positive_samples = eval_data['train']['positive_sample']

print("--MAKING PREDICTIONS---")
model.eval()

output_file = "<output file name>"  # replace with the path of the CSV file to write
with torch.no_grad():
    results = generate_texts(model, tokenizer, prompts)

df = pd.DataFrame({'article':prompts, 'generated_text':results, 'positive_sample':positive_samples})
print("Save results to csv file...")
df.to_csv(output_file)
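Because `generate_texts` stores an empty string for articles above the word limit, it can be useful to check how many rows of the saved file actually contain a summary. The sketch below is one possible way to do that with the standard library; `summarize_results` is a hypothetical helper, and the in-memory CSV only imitates the column layout written by the script above (a row is also counted as skipped if the model itself produced empty text).

```python
import csv
import io


def summarize_results(rows):
    """Count rows with and without generated text in the results file."""
    rows = list(rows)
    skipped = sum(1 for r in rows if not (r.get("generated_text") or "").strip())
    return {"total": len(rows), "skipped": skipped, "generated": len(rows) - skipped}


# Tiny stand-in for the CSV written by the evaluation script (first column is the index).
sample_csv = io.StringIO(
    ",article,generated_text,positive_sample\n"
    "0,short article,a summary,gold summary\n"
    "1,very long article,,gold summary\n"
)
stats = summarize_results(csv.DictReader(sample_csv))
print(stats)  # {'total': 2, 'skipped': 1, 'generated': 1}
```

To run it on real output, pass `csv.DictReader(open(output_file, newline=''))` instead of the in-memory sample.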
Citation Information
If you find our work helpful, please cite our paper:
@article{liu2023nlebench+,
  title={NLEBench+ NorGLM: A Comprehensive Empirical Analysis and Benchmark Dataset for Generative Language Models in Norwegian},
  author={Liu, Peng and Zhang, Lemei and Farup, Terje Nissen and Lauvrak, Even W and Ingvaldsen, Jon Espen and Eide, Simen and Gulla, Jon Atle and Yang, Zhirong},
  journal={arXiv preprint arXiv:2312.01314},
  year={2023}
}