Fine-tuned SciBERT for Multi-Class Classification of Publications Metadata

This repository provides a fine-tuned version of the allenai/scibert_scivocab_uncased model for multi-class classification of publication metadata into 10 disciplinary categories.

Model Details

  • Base Model: allenai/scibert_scivocab_uncased
  • Task: Multi-class text classification
  • Number of Classes: 10
  • Labels:
    1. Biology (fond.)
    2. Chemistry
    3. Computer and Information Sciences
    4. Engineering
    5. Mathematics
    6. Medical Research
    7. Earth, Ecology, Energy, and Applied Biology
    8. Humanities
    9. Physical Sciences and Astronomy
    10. Social Sciences

The model is trained to classify textual metadata of scientific publications (e.g., title, journal name, publisher name, open access status) into these categories.
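
The exact preprocessing used to build the model input is not shown in this card; the sketch below illustrates one plausible way to concatenate such metadata fields into a single string before tokenization. The field names, the separator, and the build_text helper are assumptions for illustration, not the confirmed pipeline.

# Hypothetical helper: join selected metadata fields into one input string.
# Field names and the " | " separator are assumptions, not the card's actual pipeline.
def build_text(record: dict) -> str:
    parts = [
        record.get("title", ""),
        record.get("journal_name", ""),
        record.get("publisher", ""),
        "open access" if record.get("is_oa") else "closed access",
    ]
    return " | ".join(p for p in parts if p)

example = {
    "title": "Quantum computing for complex chemical problems",
    "journal_name": "Example Journal",
    "publisher": "Example Publisher",
    "is_oa": True,
}
print(build_text(example))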


Training Dataset

  • Dataset: BSO Publications Indexation
  • Number of Examples: 50,000 rows of labeled publication metadata
  • Split:
    • 70% for training
    • 30% for evaluation
  • Text Source: Metadata fields such as title and abstract (text_plain)
  • Labels: Disciplinary categories mapped to integers (label_int).
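
A minimal sketch of how the split above could be reproduced with the datasets library. Only the 70/30 split and the text_plain / label_int column names come from this card; the file name and format are assumptions.

from datasets import load_dataset

# Load the labeled publication metadata (file name and CSV format are assumptions).
dataset = load_dataset("csv", data_files="bso_publications_metadata.csv")["train"]

# 70% training / 30% evaluation, as described above.
splits = dataset.train_test_split(test_size=0.3, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]

print(train_ds.column_names)  # expected to include "text_plain" and "label_int"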

Training Configuration

  • Framework: Hugging Face's transformers library
  • Training Arguments:
    • Learning Rate: 1e-5
    • Batch Size: 8
    • Weight Decay: 0.01
    • Epochs: 8
    • Evaluation Metric: Accuracy
    • Best Model Selection: Based on accuracy on the evaluation dataset
  • Hardware: Trained on a single GPU (e.g., NVIDIA Tesla V100)
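
For reference, a minimal Trainer sketch using the hyperparameters listed above. It assumes the 70/30 splits have already been tokenized (train_ds_tokenized, eval_ds_tokenized, with label_int renamed to labels); this is a sketch of a standard setup, not the exact training script behind this model.

import numpy as np
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=10)

def compute_metrics(eval_pred):
    # Accuracy on the evaluation split, used to select the best checkpoint.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

args = TrainingArguments(
    output_dir="scibert-publications-classification",
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    num_train_epochs=8,
    eval_strategy="epoch",        # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds_tokenized,  # assumed: tokenized 70% split
    eval_dataset=eval_ds_tokenized,    # assumed: tokenized 30% split
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()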

Model Performance

The model achieved the following performance on the evaluation dataset:

  • Accuracy: XX.XX% (replace with your evaluation results)
  • F1-Score (weighted): XX.XX% (replace with your evaluation results)

Usage

Model Loading

To load and use the model, you can do the following:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned model and tokenizer
model_name = "Geraldine/scibert-publications-classification"  # Replace with your model name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Example input
text = "This paper explores the application of quantum computing in solving complex chemical problems."

# Tokenize the input (truncated/padded to the model's 512-token limit)
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=512)

# Get predictions without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits.argmax(dim=-1).item()
print(f"Predicted category: {predictions}")

Category Mapping

The predicted output is an integer corresponding to the following categories:

Label ID   Category
0          Medical Research
1          Biology (fond.)
2          Earth, Ecology, Energy and applied biology
3          Physical sciences and Astronomy
4          Social sciences
5          Mathematics
6          Humanities
7          Computer and information sciences
8          Chemistry
9          Engineering
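
To turn the integer returned by the usage snippet above into a readable category name, the table can be copied into a plain dictionary:

# Integer-to-category mapping, copied from the table above.
ID2LABEL = {
    0: "Medical Research",
    1: "Biology (fond.)",
    2: "Earth, Ecology, Energy and applied biology",
    3: "Physical sciences and Astronomy",
    4: "Social sciences",
    5: "Mathematics",
    6: "Humanities",
    7: "Computer and information sciences",
    8: "Chemistry",
    9: "Engineering",
}

print(f"Predicted category: {ID2LABEL[predictions]}")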

Evaluation Metrics

To compute evaluation metrics on your data, you can use the following code snippet:

from sklearn.metrics import classification_report, accuracy_score

# Example: Ground truth labels and predicted labels
y_true = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]  # Replace with actual labels
y_pred = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]  # Replace with predictions

# Accuracy
acc = accuracy_score(y_true, y_pred)
print(f"Accuracy: {acc}")

# Classification report
print(classification_report(y_true, y_pred, target_names=[
    "Medical Research","Biology (fond.)","Earth, Ecology, Energy and applied biology",
    "Physical sciences and Astronomy","Social sciences","Mathematics","Humanities",
    "Computer and information sciences","Chemistry","Engineering"
]))
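
Since the card also reports a weighted F1-score, it can be computed from the same predictions:

from sklearn.metrics import f1_score

# Weighted F1 averages per-class F1 scores, weighted by each class's support.
f1_weighted = f1_score(y_true, y_pred, average="weighted")
print(f"F1-Score (weighted): {f1_weighted}")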

Model Limitations

  • Domain-Specific Bias: The model is fine-tuned on publication metadata and may not generalize well to non-academic text.
  • Text Length: Inputs are truncated to 512 tokens. Very long texts may lose information.
  • Imbalanced Data: If some categories are underrepresented in the dataset, performance for those classes may be lower.

How to Cite

If you use this model in your work, please cite it as follows:

@article{your_citation_key,
  title={Fine-tuned SciBERT for Multi-Class Classification of Publications Metadata},
  author={Your Name(s)},
  year={2024},
  publisher={Hugging Face}
}

Acknowledgments

This fine-tuning work builds upon the allenai/scibert_scivocab_uncased model and utilizes the BSO Publications Indexation dataset.


Model Card Authors

Géraldine Geoffroy

Model Card Contact

grldn.geoffroy@gmail.com
