Lit2Vec Subfield Classifier (MLP)
Multi-label classifier for chemistry subfields using dense text embeddings.
Repo: https://huggingface.co/Bocklitz-Lab/lit2vec-subfield-classifier-model
Dataset: https://huggingface.co/datasets/Bocklitz-Lab/lit2vec-subfield-classifier-dataset
Model Summary
This model is a Keras MLP for multi-label classification of chemistry-related scientific texts. It consumes a dense embedding vector and predicts one or more subfields (e.g., Catalysis, Energy Chemistry, Materials Science).
- Input: dense embedding vector (`embedding` from the dataset)
- Output: 18 sigmoid probabilities (one per subfield)
- Task: multi-label text classification (thresholded at 0.5 by default)
Intended Use & Limitations
Intended use
- Subfield tagging for chemistry abstracts/summaries
- Metadata enrichment for literature databases
- Retrieval, filtering, and analytics
Limitations
- Trained only on chemistry texts → may not generalize to other domains
- Requires the same embedding space as the dataset encoder (raw text is not accepted directly)
- Long-tail subfields (few examples) may have lower F1
Labels
ID | Subfield |
---|---|
0 | Catalysis |
1 | Organic Chemistry |
2 | Polymer Chemistry |
3 | Inorganic Chemistry |
4 | Materials Science |
5 | Analytical Chemistry |
6 | Physical Chemistry |
7 | Biochemistry |
8 | Environmental Chemistry |
9 | Energy Chemistry |
10 | Medicinal Chemistry |
11 | Chemical Engineering |
12 | Supramolecular Chemistry |
13 | Radiochemistry & Nuclear Chemistry |
14 | Forensic & Legal Chemistry |
15 | Food Chemistry |
16 | Chemical Education |
17 | Others |
The repo includes `label_mapping.json`.
Training Details
Framework: TensorFlow/Keras
Architecture:
- Input → Dense(256, ReLU) → (BatchNorm) → Dropout(0.3)
- Dense(256, ReLU) → (BatchNorm) → Dropout(0.3)
- Output: Dense(18, sigmoid)
Loss: Weighted Binary Cross-Entropy (per-class weights from train frequency)
Optimizer: Adam (ReduceLROnPlateau)
Callbacks: EarlyStopping (restore best), ReduceLROnPlateau, optional W&B logging
Validation: 5-fold CV on train+val; final training on official splits
Best epoch (val): 11 (from W&B)
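For orientation, here is a minimal Keras sketch of the architecture and weighted loss described above. It is not the exact training script: the embedding dimension `D`, the learning rate, and the `class_weights` vector are placeholders you would have to set from your own pipeline.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

D = 1024                             # placeholder: must match the dataset encoder's dimension
NUM_LABELS = 18
class_weights = np.ones(NUM_LABELS)  # placeholder: derive from train-label frequencies

def weighted_bce(weights):
    """Element-wise binary cross-entropy with a per-class weight vector (shape (18,))."""
    w = tf.constant(weights, dtype=tf.float32)
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, tf.float32)
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
        bce = -(y_true * tf.math.log(y_pred) + (1.0 - y_true) * tf.math.log(1.0 - y_pred))
        return tf.reduce_mean(bce * w, axis=-1)  # weight each label, then average
    return loss

# Input → Dense(256, ReLU) → BatchNorm → Dropout(0.3), twice, then 18 sigmoid outputs
inputs = keras.Input(shape=(D,))
x = layers.Dense(256, activation="relu")(inputs)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.3)(x)
x = layers.Dense(256, activation="relu")(x)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(NUM_LABELS, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

model.compile(
    optimizer=keras.optimizers.Adam(1e-3),  # learning rate is an assumption
    loss=weighted_bce(class_weights),
    metrics=[keras.metrics.BinaryAccuracy(),
             keras.metrics.AUC(curve="PR", multi_label=True, name="pr_auc"),
             keras.metrics.AUC(curve="ROC", multi_label=True, name="roc_auc")],
)

callbacks = [
    keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
    keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=2),
]
```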
Evaluation
Validation (final run):
- PR-AUC: 0.8688
- ROC-AUC: 0.9725
- Binary Accuracy: 0.9597
Test (held-out split, threshold = 0.5):
- Micro F1: 0.81
- Macro F1: 0.75
- Weighted F1: 0.80
- Samples F1: 0.80
Per-label (F1, support):
Subfield | F1 | Support |
---|---|---|
Catalysis | 0.80 | 197 |
Organic Chemistry | 0.70 | 245 |
Polymer Chemistry | 0.72 | 120 |
Inorganic Chemistry | 0.71 | 203 |
Materials Science | 0.80 | 917 |
Analytical Chemistry | 0.71 | 633 |
Physical Chemistry | 0.63 | 240 |
Biochemistry | 0.92 | 2106 |
Environmental Chemistry | 0.79 | 508 |
Energy Chemistry | 0.79 | 166 |
Medicinal Chemistry | 0.82 | 1343 |
Chemical Engineering | 0.53 | 413 |
Supramolecular Chemistry | 0.68 | 34 |
Radiochemistry & Nuclear Chemistry | 0.65 | 20 |
Forensic & Legal Chemistry | 0.70 | 16 |
Food Chemistry | 0.83 | 282 |
Chemical Education | 0.85 | 20 |
Others | 0.83 | 19 |
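The aggregate and per-label F1 scores above can be recomputed from binary predictions with scikit-learn. A sketch, assuming you provide test embeddings `X_test` (shape (N, D)) and multi-hot labels `y_test` (shape (N, 18)); those two arrays are placeholders, not shipped with the repo:

```python
import json
import numpy as np
from huggingface_hub import hf_hub_download
from sklearn.metrics import f1_score
from tensorflow import keras

REPO_ID = "Bocklitz-Lab/lit2vec-subfield-classifier-model"

# Load the classifier and label mapping (same files as in the Usage section below).
model = keras.models.load_model(hf_hub_download(REPO_ID, "mlp_model.h5"), compile=False)
with open(hf_hub_download(REPO_ID, "label_mapping.json"), encoding="utf-8") as f:
    index_to_label = {int(k): v for k, v in json.load(f)["index_to_label"].items()}

# X_test / y_test: placeholder numpy arrays you must supply.
probs = model.predict(X_test, verbose=0)
y_pred = (probs > 0.5).astype(int)  # default decision threshold

for avg in ("micro", "macro", "weighted", "samples"):
    print(avg, round(f1_score(y_test, y_pred, average=avg, zero_division=0), 2))

per_label_f1 = f1_score(y_test, y_pred, average=None, zero_division=0)  # shape (18,)
support = y_test.sum(axis=0)
for i in range(len(per_label_f1)):
    print(index_to_label[i], round(float(per_label_f1[i]), 2), int(support[i]))
```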
Notes
- Strong performance on frequent classes like Biochemistry, Medicinal Chemistry, Food Chemistry.
- Lower F1 on long-tail or heterogeneous labels like Chemical Engineering and Physical Chemistry.
If included in the repo, the plot `f1_vs_freq.png` shows F1 vs. training label frequency.
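If the plot is absent, something similar can be regenerated with matplotlib. The sketch below reuses `per_label_f1`, `support`, and `index_to_label` from the evaluation snippet above; note that it plots test-set support rather than training frequency:

```python
import matplotlib.pyplot as plt

# per_label_f1, support, index_to_label come from the evaluation snippet above.
plt.figure(figsize=(7, 4))
plt.scatter(support, per_label_f1)
for i, name in index_to_label.items():
    plt.annotate(name, (support[i], per_label_f1[i]), fontsize=6)
plt.xscale("log")
plt.xlabel("Label frequency (support, log scale)")
plt.ylabel("F1 (test, threshold 0.5)")
plt.title("F1 vs. label frequency")
plt.tight_layout()
plt.savefig("f1_vs_freq.png", dpi=200)
```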
Usage
The model expects the same embedding space as the dataset's `embedding` column. If you want to apply it to new texts, you must compute embeddings with the same encoder used to create the dataset.
Quick start (load from Hub, inference)
```python
# pip install -r requirements.txt
import json
import numpy as np
from typing import List, Tuple

from huggingface_hub import hf_hub_download
from tensorflow import keras
from sentence_transformers import SentenceTransformer

REPO_ID = "Bocklitz-Lab/lit2vec-subfield-classifier-model"
EMBED_MODEL = "intfloat/e5-large-v2"  # must match what you used to train!
TEXT_PREFIX = {"abstract": "abstract: ", "summary": "summary: "}  # keep consistent with your pipeline
THRESHOLD = 0.5  # decision threshold for multi-label prediction

# ----- Load model + label mapping -----
model_path = hf_hub_download(REPO_ID, filename="mlp_model.h5")
label_map_path = hf_hub_download(REPO_ID, filename="label_mapping.json")

with open(label_map_path, "r", encoding="utf-8") as f:
    mapping = json.load(f)
index_to_label = {int(k): v for k, v in mapping["index_to_label"].items()}

model = keras.models.load_model(model_path, compile=False)  # inference only
encoder = SentenceTransformer(EMBED_MODEL)


def encode_text(text: str, text_type: str = "summary") -> np.ndarray:
    """
    Encode text into a normalized embedding compatible with the classifier.
    text_type: "summary" or "abstract" (affects prefix)
    """
    prefix = TEXT_PREFIX.get(text_type, "")
    emb = encoder.encode([prefix + text], normalize_embeddings=True)  # shape: (1, D)
    return emb.astype("float32")


def predict_labels_from_text(
    text: str, text_type: str = "summary", threshold: float = THRESHOLD
) -> Tuple[List[int], List[str], np.ndarray]:
    """
    Returns (predicted_ids, predicted_labels, probabilities)
    """
    x = encode_text(text, text_type=text_type)  # (1, D)
    probs = model.predict(x, verbose=0)[0]      # (18,)
    pred_ids = [i for i, p in enumerate(probs) if p > threshold]
    pred_labels = [index_to_label[i] for i in pred_ids]
    return pred_ids, pred_labels, probs


# ----- Example -----
if __name__ == "__main__":
    sample_text = (
        "The adsorption capacity of Helix aspera shell for Pb2+, Zn2+ and Ni2+ has been studied. This shell has the potential of adsorbing Pb2+, Zn2+ and Ni2+ from aqueous solution. The adsorption potentials of Helix aspera shell is largely influenced by the ionic character of the ions and occurred according to the order Pb2+ > Ni2+ > Zn2+. The adsorption of Pb(II), Zn(II) and Ni(II) ions from aqueous solutions by Helix aspera shell is thermodynamically feasible and is consistent with the models of Langmuir and Freundlich adsorption isotherms. From the results of the study, the shell of Helix aspera is recommended for use in the removal of Pb2+, Zn2+ and Ni2+ from aqueous solution."
    )
    ids, labels, probs = predict_labels_from_text(sample_text, text_type="abstract", threshold=0.5)
    print("Predicted IDs:", ids)
    print("Predicted Labels:", labels)
    print("Top scores:", sorted(((index_to_label[i], float(p)) for i, p in enumerate(probs)),
                                key=lambda x: x[1], reverse=True)[:5])
```
Batch inference
```python
# Batch inference over precomputed embeddings (same embedding space as the dataset)
X = np.load("embeddings_batch.npy").astype("float32")  # shape (N, D)
probs = model.predict(X, verbose=0)                    # shape (N, 18)
labels_per_row = [[index_to_label[i] for i, p in enumerate(row) if p > 0.5] for row in probs]
```
Tip: If you need to fine-tune the model, recompile it and reuse the weighted BCE:
- For inference only, loading with `compile=False` is sufficient.
- For further training, recompile with `model.compile(loss="binary_crossentropy", ...)` and reintroduce the class-weighted loss you used in the training script.
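A sketch of that fine-tuning path, reusing the hypothetical `weighted_bce` helper and `class_weights` from the Training Details sketch above; `X_train` and `y_train` are placeholder arrays in the classifier's embedding space:

```python
from tensorflow import keras

# model was loaded with compile=False (see quick start); recompile before training.
# weighted_bce and class_weights come from the Training Details sketch (assumptions, not repo code).
model.compile(
    optimizer=keras.optimizers.Adam(1e-4),  # smaller LR for fine-tuning (assumption)
    loss=weighted_bce(class_weights),
    metrics=[keras.metrics.AUC(curve="PR", multi_label=True, name="pr_auc")],
)
model.fit(
    X_train, y_train,  # placeholder embeddings (N, D) and multi-hot labels (N, 18)
    validation_split=0.1,
    epochs=10,
    batch_size=64,
    callbacks=[keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)],
)
```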
Files in this repository
- `mlp_model.h5` – Keras model weights/graph
- `label_mapping.json` – name ↔ id mapping
- `training_history.json` – training curves (optional)
- `f1_vs_freq.png` – F1 vs. frequency plot (optional)
- `README.md` – this model card
Dataset
Lit2Vec Subfield Classifier Dataset
- ~39.9k CC-BY scientific texts with embeddings and subfield labels
- Splits: ~80/10/10 (train/val/test)
- https://huggingface.co/datasets/Bocklitz-Lab/lit2vec-subfield-classifier-dataset
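To pull the embeddings directly from the dataset, a minimal sketch with the `datasets` library; the `embedding` column is documented above, but other column names are not assumed here, so inspect `column_names` first:

```python
import numpy as np
from datasets import load_dataset

ds = load_dataset("Bocklitz-Lab/lit2vec-subfield-classifier-dataset")
print(ds)                           # available splits
print(ds["train"].column_names)     # check the actual label column name

X_train = np.asarray(ds["train"]["embedding"], dtype="float32")  # (N, D) embeddings
print(X_train.shape)
```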
Model Index
```yaml
model-index:
- name: Lit2Vec Subfield Classifier (MLP)
  results:
  - task:
      type: text-classification
      name: Multi-label text classification
    dataset:
      name: Lit2Vec Subfield Classifier Dataset
      type: Bocklitz-Lab/lit2vec-subfield-classifier-dataset
      split: test
    metrics:
    - type: micro_f1
      value: 0.81
    - type: macro_f1
      value: 0.75
    - type: weighted_f1
      value: 0.80
    - type: pr_auc
      value: 0.8688
    - type: roc_auc
      value: 0.9725
```
License
- Model: CC BY 4.0
- Dataset: CC BY 4.0
Citations
Dataset
```bibtex
@dataset{lit2vec_classifier_2025,
  author       = {Mahmoud Amiri and Thomas Bocklitz},
  title        = {Lit2Vec Subfield Classifier Dataset},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/Bocklitz-Lab/lit2vec-subfield-classifier-dataset}},
  note         = {Submitted to Nature Scientific Data}
}
```
Model
```bibtex
@misc{lit2vec_mlp_classifier_2025,
  title        = {Lit2Vec Subfield Classifier Model},
  author       = {Mahmoud Amiri and Thomas Bocklitz},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Bocklitz-Lab/lit2vec-subfield-classifier-model}}
}
```