sds-ner-compliance

A named entity recognition model for extracting structured compliance fields from Safety Data Sheets (SDS) and GHS chemical documents. Fine-tuned from allenai/scibert_scivocab_uncased on domain-specific annotations covering the mandatory label elements defined in the UN GHS Purple Book and EU CLP Regulation (EC) No 1272/2008.

Designed to serve as the document understanding layer in automated GHS label generation pipelines.

Model Description

Safety Data Sheets are the authoritative source of hazard information for chemical products, but they arrive in inconsistent formats across suppliers, languages, and revision dates. This model extracts the structured fields needed to generate a compliant GHS label without manual data entry.

The model was fine-tuned on Section 2 (Hazard Identification) and Section 3 (Composition/Information on Ingredients) of real-world SDS documents, with labels covering signal words, hazard statements (H-codes), precautionary statements (P-codes), CAS numbers, and product identifiers.

Property	Value
Model type	Token classification (NER)
Base model	`allenai/scibert_scivocab_uncased`
Languages	English, Dutch, German, French
License	Apache 2.0
GHS revision coverage	Rev 9 (UN), CLP amendment 2023/707 (EU)

Intended Use

Primary use cases:

Extracting hazard fields from SDS PDFs for automated GHS label generation
Populating label templates in barcode and labeling software pipelines
Pre-processing step before jurisdiction-specific label validation
Building structured chemical compliance datasets from unstructured SDS libraries

Out-of-scope use cases:

Full SDS compliance validation (the model extracts fields; a rule-based validator should confirm correctness)
Classification of new substances (the model reads existing classifications; it does not perform hazard classification)
High-stakes regulatory decisions without human review

Labels

The model predicts the following entity types using BIO tagging:

Label	Description	GHS / CLP reference
`SIGNAL_WORD`	"Danger" or "Warning"	GHS Purple Book 1.4.10
`HAZARD_STATEMENT`	H-code and statement text (e.g. H225, H302)	GHS Annex 3, Section 1; CLP Annex III
`PRECAUTIONARY_STATEMENT`	P-code and statement text (e.g. P210, P264)	GHS Annex 3, Section 2; CLP Annex IV
`CAS_NUMBER`	CAS Registry Number of substance or ingredient	SDS Section 3
`PRODUCT_NAME`	Product identifier as declared by supplier	SDS Section 1
`SUPPLIER_NAME`	Manufacturer or responsible party name	SDS Section 1
`PICTOGRAM_CODE`	GHS pictogram reference (e.g. GHS02, GHS07)	SDS Section 2
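
With BIO tagging, each entity type in the table yields a `B-` (begin) tag and an `I-` (inside) tag, and all tokens outside an entity share the `O` tag. The sketch below builds that label space for the seven types listed. The id ordering is an assumption made for illustration; the authoritative mapping is whatever `model.config.id2label` contains.

```python
# Entity types from the label table.
ENTITY_TYPES = [
    "SIGNAL_WORD",
    "HAZARD_STATEMENT",
    "PRECAUTIONARY_STATEMENT",
    "CAS_NUMBER",
    "PRODUCT_NAME",
    "SUPPLIER_NAME",
    "PICTOGRAM_CODE",
]


def build_bio_labels(entity_types):
    """Return the BIO tag list: 'O' first, then a B-/I- pair per entity type."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")
        labels.append(f"I-{entity_type}")
    return labels


# Hypothetical id assignment; check model.config.id2label for the real one.
BIO_LABELS = build_bio_labels(ENTITY_TYPES)
LABEL2ID = {label: idx for idx, label in enumerate(BIO_LABELS)}

# "Signal Word: Danger" tagged at the word level.
example_words = ["Signal", "Word", ":", "Danger"]
example_tags = ["O", "O", "O", "B-SIGNAL_WORD"]
```

Seven entity types give 15 tags in total: one `O` plus seven `B-`/`I-` pairs.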

Usage

Basic inference

from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="daanhoekstra/sds-ner-compliance",
    aggregation_strategy="simple"
)

sds_text = """
Section 2: Hazard Identification
Signal Word: Danger
Hazard Statements: H225 Highly flammable liquid and vapour.
H302 Harmful if swallowed. H319 Causes serious eye irritation.
Precautionary Statements: P210 Keep away from heat and open flames.
P264 Wash hands thoroughly after handling.
"""

entities = ner(sds_text)
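for entity in entities:
    # Slice the original text by character offsets (provided by fast tokenizers);
    # entity["word"] is decoded from the uncased vocabulary and comes back lowercased.
    span = sds_text[entity["start"]:entity["end"]]
    print(f"{entity['entity_group']:30s} {span}")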

Expected output:

SIGNAL_WORD                    Danger
HAZARD_STATEMENT               H225 Highly flammable liquid and vapour
HAZARD_STATEMENT               H302 Harmful if swallowed
HAZARD_STATEMENT               H319 Causes serious eye irritation
PRECAUTIONARY_STATEMENT        P210 Keep away from heat and open flames
PRECAUTIONARY_STATEMENT        P264 Wash hands thoroughly after handling

Structured field extraction

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("daanhoekstra/sds-ner-compliance")
model = AutoModelForTokenClassification.from_pretrained("daanhoekstra/sds-ner-compliance")

def extract_label_fields(text: str) -> dict:
    # Input beyond 512 tokens is truncated, so split long SDS text beforehand.
    # Character offsets let us slice the original text, preserving the casing
    # and spacing (e.g. "67-64-1") that decoding uncased WordPiece tokens loses.
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
        return_offsets_mapping=True,  # available with fast tokenizers only
    )
    offsets = inputs.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        outputs = model(**inputs)

    predictions = outputs.logits.argmax(-1)[0].tolist()
    id2label = model.config.id2label

    fields = {
        "signal_word": None,
        "h_statements": [],
        "p_statements": [],
        "cas_numbers": [],
        "product_name": None,
    }

    current_label = None
    span_start = span_end = 0

    for (start, end), pred in zip(offsets, predictions):
        if start == end:  # special tokens ([CLS], [SEP], [PAD]) have empty offsets
            continue
        label = id2label[pred]
        if label.startswith("B-"):
            if current_label:
                _flush(fields, current_label, text[span_start:span_end])
            current_label = label[2:]
            span_start, span_end = start, end
        elif label.startswith("I-") and current_label:
            span_end = end
        else:
            if current_label:
                _flush(fields, current_label, text[span_start:span_end])
            current_label = None

    # Flush an entity that runs to the end of the input.
    if current_label:
        _flush(fields, current_label, text[span_start:span_end])

    return fields


def _flush(fields: dict, label: str, value: str):
    # Store one completed entity span in the matching output field.
    value = value.strip()
    if label == "SIGNAL_WORD":
        fields["signal_word"] = value
    elif label == "HAZARD_STATEMENT":
        fields["h_statements"].append(value)
    elif label == "PRECAUTIONARY_STATEMENT":
        fields["p_statements"].append(value)
    elif label == "CAS_NUMBER":
        fields["cas_numbers"].append(value)
    elif label == "PRODUCT_NAME":
        fields["product_name"] = value
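
Because `extract_label_fields` truncates its input at 512 tokens, a long Section 2 or Section 3 can lose statements near the end. One possible workaround, sketched below, splits the text into overlapping windows on line boundaries, runs the extractor on each, and merges the results. The helper names, the 1500-character budget (a rough stand-in for the token limit), and the two-line overlap are all assumptions for illustration, not part of the published pipeline.

```python
from typing import Callable, Dict, List


def split_into_windows(text: str, max_chars: int = 1500, overlap_lines: int = 2) -> List[str]:
    """Split text on line boundaries into windows of roughly max_chars characters.

    The last `overlap_lines` lines of a window are repeated at the start of
    the next one, so a statement wrapped across lines is less likely to be cut.
    """
    windows, current, size = [], [], 0
    for line in text.splitlines():
        if current and size + len(line) + 1 > max_chars:
            windows.append("\n".join(current))
            current = current[-overlap_lines:] if overlap_lines else []
            size = sum(len(kept) + 1 for kept in current)
        current.append(line)
        size += len(line) + 1
    if current:
        windows.append("\n".join(current))
    return windows


def merge_fields(results: List[Dict]) -> Dict:
    """Merge per-window results: lists are de-duplicated in order, scalars keep the first value found."""
    merged = {
        "signal_word": None,
        "h_statements": [],
        "p_statements": [],
        "cas_numbers": [],
        "product_name": None,
    }
    for fields in results:
        for key, value in fields.items():
            if isinstance(merged.get(key), list):
                for item in value:
                    if item not in merged[key]:
                        merged[key].append(item)
            elif merged.get(key) is None:
                merged[key] = value
    return merged


def extract_long_document(text: str, extract: Callable[[str], Dict], max_chars: int = 1500) -> Dict:
    """Run `extract` (e.g. extract_label_fields) on each window and merge the outputs."""
    return merge_fields([extract(window) for window in split_into_windows(text, max_chars)])
```

A call would look like `extract_long_document(sds_text, extract_label_fields)`. Counting tokens with the tokenizer instead of characters would give a tighter bound on window size.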

Scanned PDFs and image-based SDS

For image-based SDS documents where text is not selectable, use the layout-aware variant of this pipeline built on microsoft/layoutlmv3-base, which processes both the visual layout and OCR text jointly. See the GHS label compliance blog post for a full walkthrough.

Training

Data

The model was fine-tuned on a curated dataset of SDS documents annotated at the token level. Documents were sourced from publicly available SDS libraries and supplier portals, covering a range of chemical product categories including solvents, adhesives, cleaning agents, and industrial gases.

Annotation was performed using Label Studio with a custom NER schema aligned to GHS Rev 9 mandatory label elements. Inter-annotator agreement (Cohen's kappa) was measured at 0.91 on the held-out validation split.

Split	Documents	Tokens
Train	1,840	412,000
Validation	230	51,500
Test	230	51,200
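
For readers unfamiliar with the agreement figure quoted above, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it over token-level labels. How the annotation team actually computed the 0.91 figure is not stated, so treat this as an illustration of the metric only; the two label sequences are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of positions where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up token labels from two annotators for the same six tokens.
annotator_a = ["O", "B-CAS_NUMBER", "I-CAS_NUMBER", "O", "O", "B-SIGNAL_WORD"]
annotator_b = ["O", "B-CAS_NUMBER", "O", "O", "O", "B-SIGNAL_WORD"]
kappa = cohens_kappa(annotator_a, annotator_b)
```

In this toy example the annotators agree on five of six tokens, and the kappa works out to 8/11, roughly 0.73.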

Procedure

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./sds-ner-compliance",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # named eval_strategy in transformers >= 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)
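
Setting `metric_for_best_model="f1"` means the `Trainer` needs a `compute_metrics` function that returns an `f1` key, which the arguments above do not show. The sketch below is one way to compute entity-level precision, recall, and F1 from BIO tag sequences in plain Python. It is an assumption about the kind of metric used, not the author's evaluation code, and the function names are hypothetical.

```python
def bio_spans(tags):
    """Return the set of (entity_type, start, end_exclusive) spans in a BIO tag sequence."""
    spans, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and current == tag[2:]
        if current is not None and not inside:
            spans.add((current, start, i))
            current = None
        if tag.startswith("B-"):
            current, start = tag[2:], i
    return spans


def entity_prf(true_seqs, pred_seqs):
    """Micro-averaged entity-level precision, recall and F1 (exact span match)."""
    tp = fp = fn = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        gold, pred = bio_spans(true_tags), bio_spans(pred_tags)
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# One sentence: the prediction finds the hazard statement but misses the signal word.
gold = [["B-HAZARD_STATEMENT", "I-HAZARD_STATEMENT", "O", "B-SIGNAL_WORD"]]
pred = [["B-HAZARD_STATEMENT", "I-HAZARD_STATEMENT", "O", "O"]]
scores = entity_prf(gold, pred)
```

A `compute_metrics` wrapper would take the argmax of the logits, drop positions whose label id is -100, map ids to tag strings, and return the dictionary from `entity_prf`.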

Results

Evaluated on the held-out test split:

Entity	Precision	Recall	F1
SIGNAL_WORD	0.98	0.97	0.98
HAZARD_STATEMENT	0.94	0.93	0.93
PRECAUTIONARY_STATEMENT	0.91	0.90	0.91
CAS_NUMBER	0.97	0.96	0.97
PRODUCT_NAME	0.89	0.87	0.88
SUPPLIER_NAME	0.86	0.84	0.85
Macro average	0.93	0.91	0.92

Performance is lower on SUPPLIER_NAME and PRODUCT_NAME due to high variability in how suppliers format these fields. A rule-based post-processing step (checking Section 1 headers) improves practical accuracy on these fields.
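
The rule-based post-processing step mentioned above could look something like the following sketch, which fills `product_name` and `supplier_name` from Section 1 header lines only when the model left them empty. The header patterns and the `section1_fallback` helper are hypothetical; real SDS wording varies by supplier and language, so the patterns would need extending.

```python
import re

# Hypothetical header patterns for English Section 1 lines.
PRODUCT_RE = re.compile(
    r"^\s*(?:product\s+(?:name|identifier)|trade\s+name)\s*[:\-]\s*(.+?)\s*$",
    re.IGNORECASE | re.MULTILINE,
)
SUPPLIER_RE = re.compile(
    r"^\s*(?:supplier|manufacturer|company(?:\s+name)?)\s*[:\-]\s*(.+?)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def section1_fallback(sds_text: str, fields: dict) -> dict:
    """Return a copy of `fields` with empty name fields filled from header lines."""
    result = dict(fields)
    for key, pattern in (("product_name", PRODUCT_RE), ("supplier_name", SUPPLIER_RE)):
        if not result.get(key):  # keep the model's value when it produced one
            match = pattern.search(sds_text)
            if match:
                result[key] = match.group(1)
    return result


section1 = (
    "Section 1: Identification\n"
    "Product name: Acetone Technical Grade\n"
    "Supplier: Example Chemicals B.V.\n"
)
filled = section1_fallback(section1, {"product_name": None, "supplier_name": None})
```

Model predictions take precedence here; the regular expressions only act as a fallback.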

Limitations and Bias

Format sensitivity. Performance degrades on SDS documents that deviate significantly from standard GHS section structure. Documents that combine sections or use non-standard headers may produce incomplete extractions.
Language coverage. While the model handles EN, NL, DE, and FR, accuracy is highest on English documents, which dominate the training set. Fine-tuning on a target-language SDS set is recommended for non-English production deployments.
Novel hazard classes. The 2023 EU CLP amendment (Delegated Regulation (EU) 2023/707) introduced new hazard classes for endocrine disruption and for PBT/vPvB and PMT/vPvM substances. Documents using these classes may not extract correctly until the model is updated with annotated examples.
Not a classifier. This model extracts what the SDS says. It does not verify whether the classification is scientifically correct or regulatory-compliant. A downstream validation step against an authoritative H-statement list is required before generating a label.
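
A minimal sketch of the downstream validation step described in the last item follows. It checks the leading H-code of each extracted statement against an allow-list and verifies the CAS Registry Number check digit. The three-code allow-list is a placeholder for illustration only, and the helper names are hypothetical; a real validator would load the full statement list for the GHS revision or CLP annex it targets.

```python
import re

# Placeholder subset; load the full authoritative list in practice.
KNOWN_H_CODES = {"H225", "H302", "H319"}

H_CODE_RE = re.compile(r"\s*((?:EU)?H\d{3}[A-Za-z]{0,2})\b")
CAS_RE = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def check_h_statements(h_statements, known_codes=KNOWN_H_CODES):
    """Split extracted statements into accepted codes and statements needing review."""
    accepted, needs_review = [], []
    for statement in h_statements:
        match = H_CODE_RE.match(statement)
        if match and match.group(1) in known_codes:
            accepted.append(match.group(1))
        else:
            needs_review.append(statement)
    return accepted, needs_review


def cas_checksum_ok(cas: str) -> bool:
    """Verify the CAS check digit: weighted digit sum (rightmost weight 1) modulo 10."""
    match = CAS_RE.match(cas.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    total = sum(int(d) * weight for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


accepted, needs_review = check_h_statements(
    ["H225 Highly flammable liquid and vapour", "H999 Made-up statement"]
)
```

Statements that land in `needs_review` and CAS numbers that fail the checksum are candidates for human review rather than automatic rejection, since OCR errors and supplier typos are both plausible causes.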

Environmental Impact

Fine-tuning was performed on a single A100 GPU for approximately 3 hours.

Hardware: 1x NVIDIA A100 80GB
Training duration: ~3 hours
Estimated CO2 emissions: ~0.4 kg (via ML CO2 Impact calculator)

Citation

If you use this model in your research or production pipeline, please cite:

@misc{daanhoekstra2025sdsner,
  title={SDS NER Compliance: Named Entity Recognition for GHS Safety Data Sheets},
  author={Daan Hoekstra},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/daanhoekstra/sds-ner-compliance}
}

Related Resources

How Language Models Can Accelerate GHS and Supply Chain Label Compliance — accompanying blog post with full pipeline walkthrough
Automating E-Invoicing Compliance with Open-Source NLP — related post on document AI for compliance
ML-driven automated SDS indexing (Scientific Reports, 2024) — peer-reviewed reference for BERT-based SDS extraction
GHS Purple Book (UN, Rev 9) — authoritative label element definitions
