sds-ner-compliance

A named entity recognition model for extracting structured compliance fields from Safety Data Sheets (SDS) and GHS chemical documents. Fine-tuned from allenai/scibert_scivocab_uncased on domain-specific annotations covering the mandatory label elements defined in the UN GHS Purple Book and EU CLP Regulation (EC) No 1272/2008.

Designed to serve as the document understanding layer in automated GHS label generation pipelines.

Model Description

Safety Data Sheets are the authoritative source of hazard information for chemical products, but they arrive in inconsistent formats across suppliers, languages, and revision dates. This model extracts the structured fields needed to generate a compliant GHS label without manual data entry.

The model was fine-tuned on Section 2 (Hazard Identification) and Section 3 (Composition/Information on Ingredients) of real-world SDS documents, with labels covering signal words, hazard statements (H-codes), precautionary statements (P-codes), CAS numbers, and product identifiers.

Property               Value
Model type             Token classification (NER)
Base model             allenai/scibert_scivocab_uncased
Languages              English, Dutch, German, French
License                Apache 2.0
GHS revision coverage  Rev 9 (UN), CLP amendment 2023/707 (EU)

Intended Use

Primary use cases:

  • Extracting hazard fields from SDS PDFs for automated GHS label generation
  • Populating label templates in barcode and labeling software pipelines
  • Pre-processing step before jurisdiction-specific label validation
  • Building structured chemical compliance datasets from unstructured SDS libraries

Out-of-scope use cases:

  • Full SDS compliance validation (the model extracts fields; a rule-based validator should confirm correctness)
  • Classification of new substances (the model reads existing classifications, it does not perform hazard classification)
  • High-stakes regulatory decisions without human review

Labels

The model predicts the following entity types using BIO tagging:

Label                    Description                                     GHS / CLP reference
SIGNAL_WORD              "Danger" or "Warning"                           GHS Purple Book 1.4.10
HAZARD_STATEMENT         H-code and statement text (e.g. H225, H302)     GHS Annex 3 / CLP Annex III
PRECAUTIONARY_STATEMENT  P-code and statement text (e.g. P210, P264)     GHS Annex 3 / CLP Annex IV
CAS_NUMBER               CAS Registry Number of substance or ingredient  SDS Section 3
PRODUCT_NAME             Product identifier as declared by supplier      SDS Section 1
SUPPLIER_NAME            Manufacturer or responsible party name          SDS Section 1
PICTOGRAM_CODE           GHS pictogram reference (e.g. GHS02, GHS07)     SDS Section 2
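
With BIO tagging, each of the seven entity types gets a B- (begin) tag and an I- (inside) tag, and all tokens outside an entity share a single O tag, which gives 15 classes. The sketch below shows how such a label map could be built. The ordering and the helper name build_bio_labels are assumptions for illustration; the authoritative mapping is whatever model.config.id2label reports for the published checkpoint.

```python
ENTITY_TYPES = [
    "SIGNAL_WORD",
    "HAZARD_STATEMENT",
    "PRECAUTIONARY_STATEMENT",
    "CAS_NUMBER",
    "PRODUCT_NAME",
    "SUPPLIER_NAME",
    "PICTOGRAM_CODE",
]


def build_bio_labels(entity_types):
    """Return the BIO label list: one O tag plus B-/I- tags per entity type."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")
        labels.append(f"I-{entity_type}")
    return labels


# Assumed ordering; check model.config.id2label for the real one.
LABELS = build_bio_labels(ENTITY_TYPES)
label2id = {label: idx for idx, label in enumerate(LABELS)}
id2label = {idx: label for label, idx in label2id.items()}
```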

Usage

Basic inference

from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="daanhoekstra/sds-ner-compliance",
    aggregation_strategy="simple"
)

sds_text = """
Section 2: Hazard Identification
Signal Word: Danger
Hazard Statements: H225 Highly flammable liquid and vapour.
H302 Harmful if swallowed. H319 Causes serious eye irritation.
Precautionary Statements: P210 Keep away from heat and open flames.
P264 Wash hands thoroughly after handling.
"""

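entities = ner(sds_text)
for entity in entities:
    # entity["word"] is decoded from the uncased vocabulary and comes back lowercased.
    # Slicing the input by character offsets (available with a fast tokenizer)
    # preserves the original casing.
    span = sds_text[entity["start"]:entity["end"]]
    print(f"{entity['entity_group']:30s} {span}")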

Expected output:

SIGNAL_WORD                    Danger
HAZARD_STATEMENT               H225 Highly flammable liquid and vapour
HAZARD_STATEMENT               H302 Harmful if swallowed
HAZARD_STATEMENT               H319 Causes serious eye irritation
PRECAUTIONARY_STATEMENT        P210 Keep away from heat and open flames
PRECAUTIONARY_STATEMENT        P264 Wash hands thoroughly after handling
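
The pipeline returns a flat list of dictionaries with entity_group, score, word, start, and end keys. The sketch below shows one way to fold that list into per-type fields while setting low-confidence spans aside for human review. The function name group_entities, the 0.5 threshold, and the hand-written sample entities are all assumptions for illustration, not real model output.

```python
from collections import defaultdict


def group_entities(entities, text, min_score=0.5):
    """Group pipeline output by entity type, reading spans from the source text."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] < min_score:
            continue  # skipped here; a real pipeline might queue these for review
        span = text[ent["start"]:ent["end"]].strip()
        if span and span not in grouped[ent["entity_group"]]:
            grouped[ent["entity_group"]].append(span)
    return dict(grouped)


text = "Signal Word: Danger\nH225 Highly flammable liquid and vapour."


def fake_entity(group, snippet, score):
    """Build a stand-in for one pipeline result (hypothetical helper)."""
    start = text.index(snippet)
    return {
        "entity_group": group,
        "score": score,
        "word": snippet.lower(),
        "start": start,
        "end": start + len(snippet),
    }


sample = [
    fake_entity("SIGNAL_WORD", "Danger", 0.99),
    fake_entity("HAZARD_STATEMENT", "H225 Highly flammable liquid and vapour", 0.97),
    fake_entity("PRODUCT_NAME", "Word", 0.21),  # below threshold, dropped
]
fields = group_entities(sample, text)
```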

Structured field extraction

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("daanhoekstra/sds-ner-compliance")
model = AutoModelForTokenClassification.from_pretrained("daanhoekstra/sds-ner-compliance")

def extract_label_fields(text: str) -> dict:
    # Input beyond 512 tokens is truncated, so long SDS documents should be
    # split into sections before calling this function.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)

    # Index the single batch element explicitly; squeeze() would also collapse
    # a sequence dimension of length 1.
    predictions = outputs.logits.argmax(-1)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    id2label = model.config.id2label

    fields = {
        "signal_word": None,
        "h_statements": [],
        "p_statements": [],
        "cas_numbers": [],
        "product_name": None,
    }

    current_label = None
    current_tokens = []

    for token, pred in zip(tokens, predictions):
        label = id2label[pred]
        if token in ["[CLS]", "[SEP]", "[PAD]"]:
            continue
        if label.startswith("B-"):
            # A new entity starts: store the previous one first.
            if current_label and current_tokens:
                _flush(fields, current_label, current_tokens)
            current_label = label[2:]
            current_tokens = [token]
        elif label.startswith("I-") and current_label:
            current_tokens.append(token)
        else:
            if current_label and current_tokens:
                _flush(fields, current_label, current_tokens)
            current_label = None
            current_tokens = []

    # Flush the last open entity; otherwise an entity that runs up to [SEP] is lost.
    if current_label and current_tokens:
        _flush(fields, current_label, current_tokens)

    return fields


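def _flush(fields: dict, label: str, tokens: list):
    # Values are rebuilt from uncased WordPiece tokens, so they come back lowercased
    # and punctuation may be surrounded by spaces (e.g. "64 - 17 - 5").
    value = tokenizer.convert_tokens_to_string(tokens).strip()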
    if label == "SIGNAL_WORD":
        fields["signal_word"] = value
    elif label == "HAZARD_STATEMENT":
        fields["h_statements"].append(value)
    elif label == "PRECAUTIONARY_STATEMENT":
        fields["p_statements"].append(value)
    elif label == "CAS_NUMBER":
        fields["cas_numbers"].append(value)
    elif label == "PRODUCT_NAME":
        fields["product_name"] = value
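
Because values are reassembled from WordPiece tokens, a CAS number can come back with spaces around its hyphens. The sketch below normalizes such a string and verifies the CAS check digit: the last digit equals the weighted sum of the preceding digits modulo 10, with weights counting up from 1 starting at the rightmost digit. The function name normalize_cas is an assumption and is not part of the published model code.

```python
import re


def normalize_cas(raw: str):
    """Return a CAS number as 'N...N-NN-N', or None if the format or check digit fails."""
    compact = re.sub(r"\s+", "", raw)
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", compact)
    if match is None:
        return None
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight 1 applies to the rightmost digit before the check digit.
    total = sum(
        int(digit) * weight
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return compact if total % 10 == check_digit else None
```

For example, ethanol's CAS number 64-17-5 passes because 7*1 + 1*2 + 4*3 + 6*4 = 45, and 45 mod 10 = 5.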

Scanned PDFs and image-based SDS

For image-based SDS documents where text is not selectable, use the layout-aware variant of this pipeline built on microsoft/layoutlmv3-base, which processes both the visual layout and OCR text jointly. See the GHS label compliance blog post for a full walkthrough.

Training

Data

The model was fine-tuned on a curated dataset of SDS documents annotated at the token level. Documents were sourced from publicly available SDS libraries and supplier portals, covering a range of chemical product categories including solvents, adhesives, cleaning agents, and industrial gases.

Annotation was performed using Label Studio with a custom NER schema aligned to GHS Rev 9 mandatory label elements. Inter-annotator agreement (Cohen's kappa) was measured at 0.91 on the held-out validation split.

Split       Documents  Tokens
Train       1,840      412,000
Validation  230        51,500
Test        230        51,200

Procedure

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./sds-ner-compliance",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
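    # Named eval_strategy in newer transformers releases, which deprecate evaluation_strategy.
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    # Requires a compute_metrics function that returns an "f1" key.
    metric_for_best_model="f1",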
)
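
Selecting the best checkpoint by "f1" only works if the Trainer is given a compute_metrics function that returns a value under that key. The source does not show how F1 was computed, so the following is a sketch under assumptions: exact-match, entity-level scoring over BIO tag sequences, micro-averaged across types, written in plain Python. The names bio_spans and entity_f1 are hypothetical; a library such as seqeval is a common alternative.

```python
def bio_spans(tags):
    """Return the set of (type, start, end) spans encoded by a BIO tag sequence."""
    spans = set()
    start, kind = None, None
    # The trailing "O" sentinel closes a span that runs to the end of the sequence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == kind
        if kind is not None and not continues:
            spans.add((kind, start, i))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def entity_f1(gold_sequences, pred_sequences):
    """Micro-averaged precision, recall and F1 over exactly matching spans."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sequences, pred_sequences):
        gold_spans, pred_spans = bio_spans(gold), bio_spans(pred)
        tp += len(gold_spans & pred_spans)
        fp += len(pred_spans - gold_spans)
        fn += len(gold_spans - pred_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

A compute_metrics wrapper would convert the logits and label ids from the evaluation batch into tag sequences, drop ignored positions, and pass both lists to entity_f1.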

Results

Evaluated on the held-out test split:

Entity                   Precision  Recall  F1
SIGNAL_WORD              0.98       0.97    0.98
HAZARD_STATEMENT         0.94       0.93    0.93
PRECAUTIONARY_STATEMENT  0.91       0.90    0.91
CAS_NUMBER               0.97       0.96    0.97
PRODUCT_NAME             0.89       0.87    0.88
SUPPLIER_NAME            0.86       0.84    0.85
Macro average            0.93       0.91    0.92

Scores are lowest for SUPPLIER_NAME and PRODUCT_NAME because suppliers format these fields inconsistently. A rule-based post-processing step that checks Section 1 headers improves practical accuracy on both. The table does not report PICTOGRAM_CODE, so the macro average covers six of the seven entity types.
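
One possible reading of the Section 1 header check is a fallback that fills a missing name field from a labelled line such as "Product name: ...". The sketch below illustrates that idea only. The header patterns, the supplier_name key, and the function name fill_from_section1 are assumptions; real SDS documents use many more header variants, and this is not necessarily the post-processing used to produce the reported results.

```python
import re

# Hypothetical header patterns for illustration.
SECTION1_PATTERNS = {
    "product_name": re.compile(
        r"^\s*(?:product\s+(?:name|identifier)|trade\s+name)\s*[:\-]\s*(.+)$",
        re.IGNORECASE | re.MULTILINE,
    ),
    "supplier_name": re.compile(
        r"^\s*(?:supplier|manufacturer|company(?:\s+name)?)\s*[:\-]\s*(.+)$",
        re.IGNORECASE | re.MULTILINE,
    ),
}


def fill_from_section1(fields: dict, section1_text: str) -> dict:
    """Fill missing name fields from labelled Section 1 lines; keep model output if present."""
    result = dict(fields)
    for key, pattern in SECTION1_PATTERNS.items():
        if result.get(key):
            continue  # the model already produced a value
        match = pattern.search(section1_text)
        if match:
            result[key] = match.group(1).strip()
    return result
```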

Limitations and Bias

  • Format sensitivity. Performance degrades on SDS documents that deviate significantly from standard GHS section structure. Documents that combine sections or use non-standard headers may produce incomplete extractions.
  • Language coverage. While the model handles EN, NL, DE, and FR, accuracy is highest on English documents, which dominate the training set. Fine-tuning on a target-language SDS set is recommended for non-English production deployments.
  • Novel hazard classes. The 2023 EU CLP amendment (Delegated Regulation 2023/707) introduced new hazard classes for endocrine disruption and PBT/vPvB substances. Documents using these classes may not extract correctly until the model is updated with annotated examples.
  • Not a classifier. This model extracts what the SDS says. It does not verify whether the classification is scientifically correct or regulatory-compliant. A downstream validation step against an authoritative H-statement list is required before generating a label.
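
The downstream validation step mentioned in the last point can start as a simple lookup of each extracted code against a reference list. The sketch below uses a tiny illustrative subset built from the codes that appear in this card; a real validator would load the complete lists from GHS Annex 3 or CLP Annexes III and IV. The function name validate_statements and the report layout are assumptions.

```python
import re

# Illustrative subset only, not a complete reference list.
KNOWN_H_CODES = {"H225", "H302", "H319"}
KNOWN_P_CODES = {"P210", "P264"}

# Matches the base code (letter plus three digits); suffix letters are ignored.
CODE_RE = re.compile(r"\b[HP]\d{3}(?!\d)")


def validate_statements(statements, known_codes):
    """Sort extracted statements into valid codes, unknown codes, and text without a code."""
    report = {"valid": [], "unknown": [], "no_code": []}
    for statement in statements:
        # Upper-casing handles lowercased output from an uncased tokenizer.
        codes = CODE_RE.findall(statement.upper())
        if not codes:
            report["no_code"].append(statement)
        for code in codes:
            bucket = "valid" if code in known_codes else "unknown"
            report[bucket].append(code)
    return report
```

Anything that lands in the unknown or no_code buckets would be routed to human review instead of being printed on a label.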

Environmental Impact

Fine-tuning was performed on a single A100 GPU for approximately 3 hours.

  • Hardware: 1x NVIDIA A100 80GB
  • Training duration: ~3 hours
  • Estimated CO2 emissions: ~0.4 kg (via ML CO2 Impact calculator)

Citation

If you use this model in your research or production pipeline, please cite:

@misc{daanhoekstra2025sdsner,
  title={SDS NER Compliance: Named Entity Recognition for GHS Safety Data Sheets},
  author={Daan Hoekstra},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/daanhoekstra/sds-ner-compliance}
}
