SciBERT Fine-tuned for Scientific Multi-Document Summarization Embeddings

Model description

This model is a fine-tuned version of allenai/scibert_scivocab_uncased for creating embeddings used in scientific multi-document summarization tasks. It has been optimized to generate meaningful representations of scientific text that can be used in downstream summarization processes.

Intended uses & limitations

This model is intended for creating embeddings of scientific documents, specifically for use in multi-document summarization tasks. It does not generate summaries itself. Instead, it produces vector representations of scientific text that can serve as input to summarization models or algorithms.

The model may not perform optimally on non-scientific text or for tasks significantly different from multi-document summarization.
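As a rough illustration of the intended use, the sketch below embeds a few texts and compares them. Several details are assumptions rather than facts from this card: the checkpoint ID is a placeholder (the base SciBERT model stands in because the fine-tuned repository ID is not given here), mean pooling over tokens is one common choice and may not match how the model was trained, and `embed_texts` and `cosine_similarity` are hypothetical helper names.

```python
import math
from typing import List, Sequence

# Assumption: placeholder checkpoint. This card does not state the fine-tuned
# model's repository ID, so the base model is used; swap in the real ID.
MODEL_ID = "allenai/scibert_scivocab_uncased"


def embed_texts(texts: Sequence[str], model_id: str = MODEL_ID) -> List[List[float]]:
    """Return one mean-pooled embedding per input text."""
    # Imported lazily so the similarity helper below has no heavy dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    batch = tokenizer(list(texts), padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, tokens, dim)

    # Mean pooling over non-padding tokens. This is an assumed strategy;
    # the card does not say which pooling was used during fine-tuning.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    return pooled.tolist()


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Calling `embed_texts(["abstract one ...", "abstract two ..."])` downloads the checkpoint on first use and returns a list of vectors. A downstream summarizer could then use `cosine_similarity` to group or rank passages from several documents.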

Training data

This model was trained on the SciTLDR dataset. SciTLDR (Scientific Too Long; Didn't Read) is a dataset of scientific papers and their corresponding TL;DR summaries. It contains around 5,400 TL;DRs covering roughly 3,200 computer science papers, collected primarily from OpenReview. Each paper in the dataset includes:

The paper's title
The abstract
The full text of the paper
Two types of summaries:
1. Author-written TL;DR
2. Expert-written TL;DR

The dataset is designed to support the task of extreme summarization in the scientific domain, where the goal is to create very short, high-level summaries of scientific papers.

For more information about the SciTLDR dataset, you can refer to the official paper and the dataset repository.

Training procedure

The model was trained for up to 15 epochs, with early stopping based on validation loss. The best checkpoint was saved at epoch 15, the final epoch, so early stopping did not end training early.
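For readers unfamiliar with early stopping on validation loss, here is a minimal sketch of the general idea. It is not the training code used for this model: the `EarlyStopping` class, the `run` helper, and the `patience` and `min_delta` values are all hypothetical, since the card does not report them.

```python
from typing import Sequence, Tuple


class EarlyStopping:
    """Track the best validation loss and signal when to stop."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0) -> None:
        self.patience = patience      # assumed value, not from the card
        self.min_delta = min_delta    # assumed value, not from the card
        self.best_loss = float("inf")
        self.best_epoch = 0
        self.bad_epochs = 0

    def update(self, epoch: int, val_loss: float) -> bool:
        """Record one epoch; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: a real loop would save a checkpoint here.
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def run(val_losses: Sequence[float], patience: int = 3) -> Tuple[int, int]:
    """Replay per-epoch validation losses; return (best_epoch, last_epoch)."""
    stopper = EarlyStopping(patience=patience)
    last_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        last_epoch = epoch
        if stopper.update(epoch, loss):
            break
    return stopper.best_epoch, last_epoch
```

If validation loss keeps improving through the last scheduled epoch, the stop condition never fires and the best checkpoint is the final one, which matches a best model saved at epoch 15 of 15.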

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-5 to 1e-7 (cosine annealing)
train_batch_size: 16
eval_batch_size: 16
optimizer: AdamW
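The learning-rate entry above describes a cosine decay from 1e-5 to 1e-7. The sketch below shows the standard closed-form cosine annealing formula in plain Python. It assumes a single decay cycle with no warmup or restarts, and the function name `cosine_annealing_lr` is hypothetical. The card does not say whether the schedule was stepped per batch or per epoch.

```python
import math


def cosine_annealing_lr(step: int, total_steps: int,
                        lr_max: float = 1e-5, lr_min: float = 1e-7) -> float:
    """Learning rate at `step` for one cosine cycle from lr_max to lr_min."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    step = min(max(step, 0), total_steps)  # clamp to the schedule's range
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cosine


if __name__ == "__main__":
    # Example: one value per epoch over 15 epochs (per-epoch stepping assumed).
    for epoch in range(16):
        print(epoch, f"{cosine_annealing_lr(epoch, 15):.3e}")
```

In PyTorch, `torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=..., eta_min=1e-7)` with an AdamW optimizer initialized at `lr=1e-5` follows this same curve.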

Framework versions

Transformers 4.41.2
PyTorch 2.3.0+cu121
Datasets 2.20.0
Tokenizers 0.19.1
CUDA 12.1

Evaluation results

The model achieved the following results:

Training Loss: 0.2272
Validation Loss: 0.8738

Model Limitations and Bias

This model was trained on scientific literature from the SciTLDR dataset, which consists mainly of computer science papers collected from OpenReview. As such, it may not generalize well to other scientific domains or to non-scientific text. Users should be aware of potential biases in the training data, which may be reflected in the generated embeddings. Performance is likely to be strongest on computer science texts and weaker in other scientific fields.

Author

callaghanmt