Lit2Vec Subfield Classifier (MLP)
Multi-label classifier for chemistry subfields using dense text embeddings.
Repo: https://huggingface.co/Bocklitz-Lab/lit2vec-subfield-classifier-model
Dataset: https://huggingface.co/datasets/Bocklitz-Lab/lit2vec-subfield-classifier-dataset
Model Summary
This model is a Keras MLP for multi-label classification of chemistry-related scientific texts. It consumes a dense embedding vector and predicts one or more subfields (e.g., Catalysis, Energy Chemistry, Materials Science).
- Input: dense embedding vector (`embedding` from the dataset)
- Output: 18 sigmoid probabilities (one per subfield)
- Task: multi-label text classification (thresholded at 0.5 by default)
Intended Use & Limitations
Intended use
- Subfield tagging for chemistry abstracts/summaries
- Metadata enrichment for literature databases
- Retrieval, filtering, and analytics
Limitations
- Trained only on chemistry texts → may not generalize to other domains
- Requires the same embedding space as the dataset encoder (raw text is not accepted directly)
- Long-tail subfields (few examples) may have lower F1
Labels
ID | Subfield |
---|---|
0 | Catalysis |
1 | Organic Chemistry |
2 | Polymer Chemistry |
3 | Inorganic Chemistry |
4 | Materials Science |
5 | Analytical Chemistry |
6 | Physical Chemistry |
7 | Biochemistry |
8 | Environmental Chemistry |
9 | Energy Chemistry |
10 | Medicinal Chemistry |
11 | Chemical Engineering |
12 | Supramolecular Chemistry |
13 | Radiochemistry & Nuclear Chemistry |
14 | Forensic & Legal Chemistry |
15 | Food Chemistry |
16 | Chemical Education |
17 | Others |
The repo includes `label_mapping.json`.
Training Details
Framework: TensorFlow/Keras
Architecture:
- Input → Dense(256, ReLU) → (BatchNorm) → Dropout(0.3)
- Dense(256, ReLU) → (BatchNorm) → Dropout(0.3)
- Output: Dense(18, sigmoid)
Loss: Weighted Binary Cross-Entropy (per-class weights from train frequency)
Optimizer: Adam (ReduceLROnPlateau)
Callbacks: EarlyStopping (restore best), ReduceLROnPlateau, optional W&B logging
Validation: 5-fold CV on train+val; final training on official splits
Best epoch (val): 11 (from W&B)
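For orientation, here is a minimal Keras sketch of the architecture and weighted loss described above. It is not the exact training script: the embedding dimension `D`, the learning rate, and the `class_weights` vector are placeholders you would have to set from your own pipeline.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

D = 1024                             # placeholder: must match the dataset encoder's dimension
NUM_LABELS = 18
class_weights = np.ones(NUM_LABELS)  # placeholder: derive from train-label frequencies

def weighted_bce(weights):
    """Element-wise binary cross-entropy with a per-class weight vector (shape (18,))."""
    w = tf.constant(weights, dtype=tf.float32)
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, tf.float32)
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
        bce = -(y_true * tf.math.log(y_pred) + (1.0 - y_true) * tf.math.log(1.0 - y_pred))
        return tf.reduce_mean(bce * w, axis=-1)  # weight each label, then average
    return loss

# Input → Dense(256, ReLU) → BatchNorm → Dropout(0.3), twice, then 18 sigmoid outputs
inputs = keras.Input(shape=(D,))
x = layers.Dense(256, activation="relu")(inputs)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.3)(x)
x = layers.Dense(256, activation="relu")(x)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(NUM_LABELS, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

model.compile(
    optimizer=keras.optimizers.Adam(1e-3),  # learning rate is an assumption
    loss=weighted_bce(class_weights),
    metrics=[keras.metrics.BinaryAccuracy(),
             keras.metrics.AUC(curve="PR", multi_label=True, name="pr_auc"),
             keras.metrics.AUC(curve="ROC", multi_label=True, name="roc_auc")],
)

callbacks = [
    keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
    keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=2),
]
```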
Evaluation
Validation (final run):
- PR-AUC: 0.8688
- ROC-AUC: 0.9725
- Binary Accuracy: 0.9597
Test (held-out split, threshold = 0.5):
- Micro F1: 0.81
- Macro F1: 0.75
- Weighted F1: 0.80
- Samples F1: 0.80
Per-label (F1, support):
Subfield | F1 | Support |
---|---|---|
Catalysis | 0.80 | 197 |
Organic Chemistry | 0.70 | 245 |
Polymer Chemistry | 0.72 | 120 |
Inorganic Chemistry | 0.71 | 203 |
Materials Science | 0.80 | 917 |
Analytical Chemistry | 0.71 | 633 |
Physical Chemistry | 0.63 | 240 |
Biochemistry | 0.92 | 2106 |
Environmental Chemistry | 0.79 | 508 |
Energy Chemistry | 0.79 | 166 |
Medicinal Chemistry | 0.82 | 1343 |
Chemical Engineering | 0.53 | 413 |
Supramolecular Chemistry | 0.68 | 34 |
Radiochemistry & Nuclear Chemistry | 0.65 | 20 |
Forensic & Legal Chemistry | 0.70 | 16 |
Food Chemistry | 0.83 | 282 |
Chemical Education | 0.85 | 20 |
Others | 0.83 | 19 |
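The aggregate and per-label F1 scores above can be recomputed from binary predictions with scikit-learn. A sketch, assuming you provide test embeddings `X_test` (shape (N, D)) and multi-hot labels `y_test` (shape (N, 18)); those two arrays are placeholders, not shipped with the repo:

```python
import json
import numpy as np
from huggingface_hub import hf_hub_download
from sklearn.metrics import f1_score
from tensorflow import keras

REPO_ID = "Bocklitz-Lab/lit2vec-subfield-classifier-model"

# Load the classifier and label mapping (same files as in the Usage section below).
model = keras.models.load_model(hf_hub_download(REPO_ID, "mlp_model.h5"), compile=False)
with open(hf_hub_download(REPO_ID, "label_mapping.json"), encoding="utf-8") as f:
    index_to_label = {int(k): v for k, v in json.load(f)["index_to_label"].items()}

# X_test / y_test: placeholder numpy arrays you must supply.
probs = model.predict(X_test, verbose=0)
y_pred = (probs > 0.5).astype(int)  # default decision threshold

for avg in ("micro", "macro", "weighted", "samples"):
    print(avg, round(f1_score(y_test, y_pred, average=avg, zero_division=0), 2))

per_label_f1 = f1_score(y_test, y_pred, average=None, zero_division=0)  # shape (18,)
support = y_test.sum(axis=0)
for i in range(len(per_label_f1)):
    print(index_to_label[i], round(float(per_label_f1[i]), 2), int(support[i]))
```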
Notes
- Strong performance on frequent classes like Biochemistry, Medicinal Chemistry, Food Chemistry.
- Lower F1 on long-tail or heterogeneous labels like Chemical Engineering and Physical Chemistry.
If included in the repo, the plot `f1_vs_freq.png` shows F1 vs. training label frequency.
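If the plot is absent, something similar can be regenerated with matplotlib. The sketch below reuses `per_label_f1`, `support`, and `index_to_label` from the evaluation snippet above; note that it plots test-set support rather than training frequency:

```python
import matplotlib.pyplot as plt

# per_label_f1, support, index_to_label come from the evaluation snippet above.
plt.figure(figsize=(7, 4))
plt.scatter(support, per_label_f1)
for i, name in index_to_label.items():
    plt.annotate(name, (support[i], per_label_f1[i]), fontsize=6)
plt.xscale("log")
plt.xlabel("Label frequency (support, log scale)")
plt.ylabel("F1 (test, threshold 0.5)")
plt.title("F1 vs. label frequency")
plt.tight_layout()
plt.savefig("f1_vs_freq.png", dpi=200)
```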
Usage
The model expects the same embedding space as the dataset's `embedding` column. If you want to apply it to new texts, you must compute embeddings with the same encoder used to create the dataset.
Quick start (load from Hub, inference)
```python
# pip install -r requirements.txt
import json
import numpy as np
from typing import List, Tuple

from huggingface_hub import hf_hub_download
from tensorflow import keras
from sentence_transformers import SentenceTransformer

REPO_ID = "Bocklitz-Lab/lit2vec-subfield-classifier-model"
EMBED_MODEL = "intfloat/e5-large-v2"  # must match what you used to train!
TEXT_PREFIX = {"abstract": "abstract: ", "summary": "summary: "}  # keep consistent with your pipeline
THRESHOLD = 0.5  # decision threshold for multi-label prediction

# ----- Load model + label mapping -----
model_path = hf_hub_download(REPO_ID, filename="mlp_model.h5")
label_map_path = hf_hub_download(REPO_ID, filename="label_mapping.json")

with open(label_map_path, "r", encoding="utf-8") as f:
    mapping = json.load(f)
index_to_label = {int(k): v for k, v in mapping["index_to_label"].items()}

model = keras.models.load_model(model_path, compile=False)  # inference only
encoder = SentenceTransformer(EMBED_MODEL)


def encode_text(text: str, text_type: str = "summary") -> np.ndarray:
    """
    Encode text into a normalized embedding compatible with the classifier.
    text_type: "summary" or "abstract" (affects prefix)
    """
    prefix = TEXT_PREFIX.get(text_type, "")
    emb = encoder.encode([prefix + text], normalize_embeddings=True)  # shape: (1, D)
    return emb.astype("float32")


def predict_labels_from_text(
    text: str, text_type: str = "summary", threshold: float = THRESHOLD
) -> Tuple[List[int], List[str], np.ndarray]:
    """
    Returns (predicted_ids, predicted_labels, probabilities)
    """
    x = encode_text(text, text_type=text_type)  # (1, D)
    probs = model.predict(x, verbose=0)[0]      # (18,)
    pred_ids = [i for i, p in enumerate(probs) if p > threshold]
    pred_labels = [index_to_label[i] for i in pred_ids]
    return pred_ids, pred_labels, probs


# ----- Example -----
if __name__ == "__main__":
    sample_text = (
        "The adsorption capacity of Helix aspera shell for Pb2+, Zn2+ and Ni2+ has been studied. This shell has the potential of adsorbing Pb2+, Zn2+ and Ni2+ from aqueous solution. The adsorption potentials of Helix aspera shell is largely influenced by the ionic character of the ions and occurred according to the order Pb2+ > Ni2+ > Zn2+. The adsorption of Pb(II), Zn(II) and Ni(II) ions from aqueous solutions by Helix aspera shell is thermodynamically feasible and is consistent with the models of Langmuir and Freundlich adsorption isotherms. From the results of the study, the shell of Helix aspera is recommended for use in the removal of Pb2+, Zn2+ and Ni2+ from aqueous solution."
    )
    ids, labels, probs = predict_labels_from_text(sample_text, text_type="abstract", threshold=0.5)
    print("Predicted IDs:", ids)
    print("Predicted Labels:", labels)
    print("Top scores:", sorted(((index_to_label[i], float(p)) for i, p in enumerate(probs)),
                                key=lambda x: x[1], reverse=True)[:5])
```
Batch inference
```python
# Batch inference over precomputed embeddings (same embedding space as the dataset)
X = np.load("embeddings_batch.npy").astype("float32")  # shape (N, D)
probs = model.predict(X, verbose=0)                    # shape (N, 18)
labels_per_row = [[index_to_label[i] for i, p in enumerate(row) if p > 0.5] for row in probs]
```
Tip: If you need to fine-tune the model, recompile it and reuse the weighted BCE:
- For inference only, loading with `compile=False` is sufficient.
- For further training, recompile with `model.compile(loss="binary_crossentropy", ...)` and reintroduce the class-weighted loss you used in the training script.
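A sketch of that fine-tuning path, reusing the hypothetical `weighted_bce` helper and `class_weights` from the Training Details sketch above; `X_train` and `y_train` are placeholder arrays in the classifier's embedding space:

```python
from tensorflow import keras

# model was loaded with compile=False (see quick start); recompile before training.
# weighted_bce and class_weights come from the Training Details sketch (assumptions, not repo code).
model.compile(
    optimizer=keras.optimizers.Adam(1e-4),  # smaller LR for fine-tuning (assumption)
    loss=weighted_bce(class_weights),
    metrics=[keras.metrics.AUC(curve="PR", multi_label=True, name="pr_auc")],
)
model.fit(
    X_train, y_train,  # placeholder embeddings (N, D) and multi-hot labels (N, 18)
    validation_split=0.1,
    epochs=10,
    batch_size=64,
    callbacks=[keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)],
)
```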
Files in this repository
- `mlp_model.h5` – Keras model weights/graph
- `label_mapping.json` – name ↔ id mapping
- `training_history.json` – training curves (optional)
- `f1_vs_freq.png` – F1 vs. frequency plot (optional)
- `README.md` – this model card
Dataset
Lit2Vec Subfield Classifier Dataset
- ~39.9k CC-BY scientific texts with embeddings and subfield labels
- Splits: ~80/10/10 (train/val/test)
- https://huggingface.co/datasets/Bocklitz-Lab/lit2vec-subfield-classifier-dataset
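To pull the embeddings directly from the dataset, a minimal sketch with the `datasets` library; the `embedding` column is documented above, but other column names are not assumed here, so inspect `column_names` first:

```python
import numpy as np
from datasets import load_dataset

ds = load_dataset("Bocklitz-Lab/lit2vec-subfield-classifier-dataset")
print(ds)                           # available splits
print(ds["train"].column_names)     # check the actual label column name

X_train = np.asarray(ds["train"]["embedding"], dtype="float32")  # (N, D) embeddings
print(X_train.shape)
```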
Model Index
```yaml
model-index:
- name: Lit2Vec Subfield Classifier (MLP)
  results:
  - task:
      type: text-classification
      name: Multi-label text classification
    dataset:
      name: Lit2Vec Subfield Classifier Dataset
      type: Bocklitz-Lab/lit2vec-subfield-classifier-dataset
      split: test
    metrics:
    - type: micro_f1
      value: 0.81
    - type: macro_f1
      value: 0.75
    - type: weighted_f1
      value: 0.80
    - type: pr_auc
      value: 0.8688
    - type: roc_auc
      value: 0.9725
```
License
- Model: CC BY 4.0
- Dataset: CC BY 4.0
Citations
Dataset
```bibtex
@dataset{lit2vec_classifier_2025,
  author       = {Mahmoud Amiri and Thomas Bocklitz},
  title        = {Lit2Vec Subfield Classifier Dataset},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/Bocklitz-Lab/lit2vec-subfield-classifier-dataset}},
  note         = {Submitted to Nature Scientific Data}
}
```
Model
```bibtex
@misc{lit2vec_mlp_classifier_2025,
  title        = {Lit2Vec Subfield Classifier Model},
  author       = {Mahmoud Amiri and Thomas Bocklitz},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Bocklitz-Lab/lit2vec-subfield-classifier-model}}
}
```