Fine-tuned SciBERT for Multi-Class Classification of Publications Metadata

This repository provides a fine-tuned version of the allenai/scibert_scivocab_uncased model for multi-class classification of publication metadata into 10 disciplinary categories.

Model Details

  • Base Model: allenai/scibert_scivocab_uncased
  • Task: Multi-class text classification
  • Number of Classes: 10
  • Labels:
    1. Biology (fond.)
    2. Chemistry
    3. Computer and Information Sciences
    4. Engineering
    5. Mathematics
    6. Medical Research
    7. Earth, Ecology, Energy, and Applied Biology
    8. Humanities
    9. Physical Sciences and Astronomy
    10. Social Sciences

The model is trained to classify textual metadata of scientific publications (e.g., title, journal name, publisher name, open access status) into these categories.
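
The exact preprocessing used to build the model input is not shown in this card; the sketch below illustrates one plausible way to concatenate such metadata fields into a single string before tokenization. The field names, the separator, and the build_text helper are assumptions for illustration, not the confirmed pipeline.

# Hypothetical helper: join selected metadata fields into one input string.
# Field names and the " | " separator are assumptions, not the card's actual pipeline.
def build_text(record: dict) -> str:
    parts = [
        record.get("title", ""),
        record.get("journal_name", ""),
        record.get("publisher", ""),
        "open access" if record.get("is_oa") else "closed access",
    ]
    return " | ".join(p for p in parts if p)

example = {
    "title": "Quantum computing for complex chemical problems",
    "journal_name": "Example Journal",
    "publisher": "Example Publisher",
    "is_oa": True,
}
print(build_text(example))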


Training Dataset

  • Dataset: BSO Publications Indexation
  • Number of Examples: 50,000 rows of labeled publication metadata
  • Split:
    • 70% for training
    • 30% for evaluation
  • Text Source: Metadata fields such as title and abstract (text_plain)
  • Labels: Disciplinary categories mapped to integers (label_int).
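
A minimal sketch of how the split above could be reproduced with the datasets library. Only the 70/30 split and the text_plain / label_int column names come from this card; the file name and format are assumptions.

from datasets import load_dataset

# Load the labeled publication metadata (file name and CSV format are assumptions).
dataset = load_dataset("csv", data_files="bso_publications_metadata.csv")["train"]

# 70% training / 30% evaluation, as described above.
splits = dataset.train_test_split(test_size=0.3, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]

print(train_ds.column_names)  # expected to include "text_plain" and "label_int"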

Training Configuration

  • Framework: Hugging Face's transformers library
  • Training Arguments:
    • Learning Rate: 1e-5
    • Batch Size: 8
    • Weight Decay: 0.01
    • Epochs: 8
    • Evaluation Metric: Accuracy
    • Best Model Selection: Based on accuracy on the evaluation dataset
  • Hardware: Trained on a single GPU (e.g., NVIDIA Tesla V100)
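
For reference, a minimal Trainer sketch using the hyperparameters listed above. It assumes the 70/30 splits have already been tokenized (train_ds_tokenized, eval_ds_tokenized, with label_int renamed to labels); this is a sketch of a standard setup, not the exact training script behind this model.

import numpy as np
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=10)

def compute_metrics(eval_pred):
    # Accuracy on the evaluation split, used to select the best checkpoint.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

args = TrainingArguments(
    output_dir="scibert-publications-classification",
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    num_train_epochs=8,
    eval_strategy="epoch",        # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds_tokenized,  # assumed: tokenized 70% split
    eval_dataset=eval_ds_tokenized,    # assumed: tokenized 30% split
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()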

Model Performance

The model achieved the following performance on the evaluation dataset:

  • Accuracy: XX.XX% (replace with your evaluation results)
  • F1-Score (weighted): XX.XX% (replace with your evaluation results)

Usage

Model Loading

To load and use the model, you can do the following:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned model and tokenizer
model_name = "Geraldine/scibert-publications-classification"  # Replace with your model name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Example input
text = "This paper explores the application of quantum computing in solving complex chemical problems."

# Tokenize the input (truncated/padded to the model's 512-token limit)
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=512)

# Get predictions without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits.argmax(dim=-1).item()
print(f"Predicted category: {predictions}")

Category Mapping

The predicted output is an integer corresponding to the following categories:

Label ID   Category
0          Medical Research
1          Biology (fond.)
2          Earth, Ecology, Energy and applied biology
3          Physical sciences and Astronomy
4          Social sciences
5          Mathematics
6          Humanities
7          Computer and information sciences
8          Chemistry
9          Engineering
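
To turn the integer returned by the usage snippet above into a readable category name, the table can be copied into a plain dictionary:

# Integer-to-category mapping, copied from the table above.
ID2LABEL = {
    0: "Medical Research",
    1: "Biology (fond.)",
    2: "Earth, Ecology, Energy and applied biology",
    3: "Physical sciences and Astronomy",
    4: "Social sciences",
    5: "Mathematics",
    6: "Humanities",
    7: "Computer and information sciences",
    8: "Chemistry",
    9: "Engineering",
}

print(f"Predicted category: {ID2LABEL[predictions]}")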

Evaluation Metrics

To compute evaluation metrics on your data, you can use the following code snippet:

from sklearn.metrics import classification_report, accuracy_score

# Example: Ground truth labels and predicted labels
y_true = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]  # Replace with actual labels
y_pred = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]  # Replace with predictions

# Accuracy
acc = accuracy_score(y_true, y_pred)
print(f"Accuracy: {acc}")

# Classification report
print(classification_report(y_true, y_pred, target_names=[
    "Medical Research","Biology (fond.)","Earth, Ecology, Energy and applied biology",
    "Physical sciences and Astronomy","Social sciences","Mathematics","Humanities",
    "Computer and information sciences","Chemistry","Engineering"
]))
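
Since the card also reports a weighted F1-score, it can be computed from the same predictions:

from sklearn.metrics import f1_score

# Weighted F1 averages per-class F1 scores, weighted by each class's support.
f1_weighted = f1_score(y_true, y_pred, average="weighted")
print(f"F1-Score (weighted): {f1_weighted}")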

Model Limitations

  • Domain-Specific Bias: The model is fine-tuned on publication metadata and may not generalize well to non-academic text.
  • Text Length: Inputs are truncated to 512 tokens. Very long texts may lose information.
  • Imbalanced Data: If some categories are underrepresented in the dataset, performance for those classes may be lower.

How to Cite

If you use this model in your work, please cite it as follows:

@article{your_citation_key,
  title={Fine-tuned SciBERT for Multi-Class Classification of Publications Metadata},
  author={Your Name(s)},
  year={2024},
  publisher={Hugging Face}
}

Acknowledgments

This fine-tuning work builds upon the allenai/scibert_scivocab_uncased model and utilizes the BSO Publications Indexation dataset.


Model Card Authors

Géraldine Geoffroy

Model Card Contact

grldn.geoffroy@gmail.com
