Fine-tuned SciBERT for Multi-Class Classification of Publications Metadata
This repository provides a fine-tuned version of the allenai/scibert_scivocab_uncased model for multi-class classification of publication metadata into 10 disciplinary categories.
Model Details
- Base Model: `allenai/scibert_scivocab_uncased`
- Task: Multi-class text classification
- Number of Classes: 10
- Labels:
- Biology (fond.)
- Chemistry
- Computer and Information Sciences
- Engineering
- Mathematics
- Medical Research
- Earth, Ecology, Energy, and Applied Biology
- Humanities
- Physical Sciences and Astronomy
- Social Sciences
The model is trained to classify textual metadata of scientific publications (e.g., title, journal name, publisher name, open access) into these categories.
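For example, a single input text can be assembled from the available metadata fields before classification. The exact concatenation used during training is not documented here, so the snippet below is only a hypothetical illustration:

```python
# Hypothetical illustration: build one input string from publication metadata.
# The exact fields and concatenation used during fine-tuning may differ.
metadata = {
    "title": "Quantum algorithms for simulating molecular electronic structure",
    "journal_name": "Journal of Chemical Theory and Computation",
    "publisher_name": "American Chemical Society",
    "open_access": True,
}
text = ". ".join(str(value) for value in metadata.values())
```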
Training Dataset
- Dataset: BSO Publications Indexation
- Number of Examples: 50,000 rows of labeled publication metadata
- Split:
- 70% for training
- 30% for evaluation
- Text Source: Metadata fields such as title and abstract (`text_plain`)
- Labels: Disciplinary categories mapped to integers (`label_int`)
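A minimal sketch of how a dataset with these columns could be loaded and split with the `datasets` library (the file name below is a placeholder; the actual dataset source and format are assumptions):

```python
from datasets import load_dataset

# Placeholder file name: the actual location/format of the BSO dataset is an assumption.
# Columns follow the description above: "text_plain" (metadata text) and "label_int" (integer label).
dataset = load_dataset("csv", data_files="bso_publications_metadata.csv")["train"]

# 70% training / 30% evaluation split, as described above
splits = dataset.train_test_split(test_size=0.3, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]
```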
Training Configuration
- Framework: Hugging Face `transformers` library
- Training Arguments:
  - Learning Rate: 1e-5
  - Batch Size: 8
  - Weight Decay: 0.01
  - Epochs: 8
  - Evaluation Metric: Accuracy
  - Best Model Selection: Based on accuracy on the evaluation dataset
- Hardware: Trained on a single GPU (e.g., NVIDIA Tesla V100)
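A sketch of the corresponding fine-tuning setup with the hyperparameters listed above, assuming the `train_ds` and `eval_ds` splits from the previous sketch (tokenization, metric computation, and output paths are simplified assumptions, not the exact training script):

```python
import numpy as np
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

base_model = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=10)

def tokenize(batch):
    # "text_plain" holds the publication metadata text
    return tokenizer(batch["text_plain"], truncation=True, padding="max_length", max_length=512)

train_ds = train_ds.map(tokenize, batched=True).rename_column("label_int", "labels")
eval_ds = eval_ds.map(tokenize, batched=True).rename_column("label_int", "labels")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

args = TrainingArguments(
    output_dir="scibert-publications-classification",
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    num_train_epochs=8,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    compute_metrics=compute_metrics,
)
trainer.train()
```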
Model Performance
The model achieved the following performance on the evaluation dataset:
- Accuracy: XX.XX% (replace with your evaluation results)
- F1-Score (weighted): XX.XX% (replace with your evaluation results)
Usage
Model Loading
To load and use the model, you can do the following:
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned model and tokenizer
model_name = "Geraldine/scibert-publications-classification"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example input
text = "This paper explores the application of quantum computing in solving complex chemical problems."

# Tokenize the input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=512)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits.argmax(dim=-1).item()
print(f"Predicted category: {predictions}")
```
Category Mapping
The predicted output is an integer corresponding to the following categories:
| Label ID | Category |
|---|---|
| 0 | Medical Research |
| 1 | Biology (fond.) |
| 2 | Earth, Ecology, Energy and applied biology |
| 3 | Physical sciences and Astronomy |
| 4 | Social sciences |
| 5 | Mathematics |
| 6 | Humanities |
| 7 | Computer and information sciences |
| 8 | Chemistry |
| 9 | Engineering |
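The same mapping as a Python dictionary, convenient for turning the integer predicted in the usage example above into a category name:

```python
# Integer label -> category name, exactly as in the table above
ID2LABEL = {
    0: "Medical Research",
    1: "Biology (fond.)",
    2: "Earth, Ecology, Energy and applied biology",
    3: "Physical sciences and Astronomy",
    4: "Social sciences",
    5: "Mathematics",
    6: "Humanities",
    7: "Computer and information sciences",
    8: "Chemistry",
    9: "Engineering",
}

print(f"Predicted category: {ID2LABEL[predictions]}")  # `predictions` from the usage example above
```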
Evaluation Metrics
To compute evaluation metrics on your data, you can use the following code snippet:
```python
from sklearn.metrics import classification_report, accuracy_score

# Example: ground-truth labels and predicted labels
y_true = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]  # Replace with actual labels
y_pred = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]  # Replace with predictions

# Accuracy
acc = accuracy_score(y_true, y_pred)
print(f"Accuracy: {acc}")

# Classification report
print(classification_report(y_true, y_pred, target_names=[
    "Medical Research", "Biology (fond.)", "Earth, Ecology, Energy and applied biology",
    "Physical sciences and Astronomy", "Social sciences", "Mathematics", "Humanities",
    "Computer and information sciences", "Chemistry", "Engineering",
]))
```
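In practice, `y_pred` comes from running the fine-tuned model over your evaluation texts. A minimal sketch using the model and tokenizer loaded in the Usage section (single-example loop for clarity; batching and GPU placement omitted):

```python
import torch

# Hypothetical evaluation data: texts with their true integer labels
texts = [
    "A randomized controlled trial of a novel antihypertensive drug.",
    "Graph neural networks for predicting molecular properties.",
]
y_true = [0, 7]

model.eval()
y_pred = []
with torch.no_grad():
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        logits = model(**inputs).logits
        y_pred.append(logits.argmax(dim=-1).item())
```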
Model Limitations
- Domain-Specific Bias: The model is fine-tuned on publication metadata and may not generalize well to non-academic text.
- Text Length: Inputs are truncated to 512 tokens. Very long texts may lose information.
- Imbalanced Data: If some categories are underrepresented in the dataset, performance for those classes may be lower.
How to Cite
If you use this model in your work, please cite it as follows:
```bibtex
@article{your_citation_key,
  title={Fine-tuned SciBERT for Multi-Class Classification of Publications Metadata},
  author={Your Name(s)},
  year={2024},
  publisher={Hugging Face}
}
```
Acknowledgments
This fine-tuning work builds upon the allenai/scibert_scivocab_uncased model and utilizes the BSO Publications Indexation dataset.
Model Card Authors
Géraldine Geoffroy
Model Card Contact