sds-ner-compliance
A named entity recognition model for extracting structured compliance fields from Safety Data Sheets (SDS) and GHS chemical documents. Fine-tuned from allenai/scibert_scivocab_uncased on domain-specific annotations covering the mandatory label elements defined in the UN GHS Purple Book and the EU CLP Regulation (EC) No 1272/2008.
Designed to serve as the document understanding layer in automated GHS label generation pipelines.
Model Description
Safety Data Sheets are the authoritative source of hazard information for chemical products, but they arrive in inconsistent formats across suppliers, languages, and revision dates. This model extracts the structured fields needed to generate a compliant GHS label without manual data entry.
The model was fine-tuned on Section 2 (Hazard Identification) and Section 3 (Composition/Information on Ingredients) of real-world SDS documents, with labels covering signal words, hazard statements (H-codes), precautionary statements (P-codes), CAS numbers, and product identifiers.
| Property | Value |
|---|---|
| Model type | Token classification (NER) |
| Base model | allenai/scibert_scivocab_uncased |
| Languages | English, Dutch, German, French |
| License | Apache 2.0 |
| GHS revision coverage | Rev 9 (UN), CLP amendment 2023/707 (EU) |
Intended Use
Primary use cases:
- Extracting hazard fields from SDS PDFs for automated GHS label generation
- Populating label templates in barcode and labeling software pipelines
- Pre-processing step before jurisdiction-specific label validation
- Building structured chemical compliance datasets from unstructured SDS libraries
Out-of-scope use cases:
- Full SDS compliance validation (the model extracts fields; a rule-based validator should confirm correctness)
- Classification of new substances (the model reads existing classifications; it does not perform hazard classification)
- High-stakes regulatory decisions without human review
Labels
The model predicts the following entity types using BIO tagging:
| Label | Description | GHS / CLP reference |
|---|---|---|
| SIGNAL_WORD | "Danger" or "Warning" | GHS Purple Book 1.4.10 |
| HAZARD_STATEMENT | H-code and statement text (e.g. H225, H302) | CLP Annex III |
| PRECAUTIONARY_STATEMENT | P-code and statement text (e.g. P210, P264) | CLP Annex IV |
| CAS_NUMBER | CAS Registry Number of substance or ingredient | SDS Section 3 |
| PRODUCT_NAME | Product identifier as declared by supplier | SDS Section 1 |
| SUPPLIER_NAME | Manufacturer or responsible party name | SDS Section 1 |
| PICTOGRAM_CODE | GHS pictogram reference (e.g. GHS02, GHS07) | SDS Section 2 |
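CAS_NUMBER spans lend themselves to a cheap sanity check after extraction, because every CAS Registry Number ends in a check digit. The sketch below is not part of the model: `is_valid_cas` is a hypothetical helper that applies the standard CAS checksum (the digits before the check digit are weighted 1, 2, 3, ... from the right and summed modulo 10). It strips whitespace first, on the assumption that decoding wordpieces may leave spaces around the hyphens.

```python
import re

# CAS Registry Number layout: 2-7 digits, 2 digits, 1 check digit.
_CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(candidate: str) -> bool:
    """Return True if `candidate` is well formed and its check digit matches."""
    # Remove all whitespace, e.g. "64 - 17 - 5" becomes "64-17-5".
    compact = re.sub(r"\s+", "", candidate)
    match = _CAS_PATTERN.match(compact)
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the body digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit


print(is_valid_cas("64-17-5"))      # ethanol
print(is_valid_cas("7732 - 18 - 5"))  # water, with decoding whitespace
print(is_valid_cas("64-17-6"))      # wrong check digit
```

A span that fails the check is more likely an extraction or OCR error than a real identifier, so it can be routed to human review instead of onto a label.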
Usage
Basic inference
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="daanhoekstra/sds-ner-compliance",
    aggregation_strategy="simple",
)

sds_text = """
Section 2: Hazard Identification
Signal Word: Danger
Hazard Statements: H225 Highly flammable liquid and vapour.
H302 Harmful if swallowed. H319 Causes serious eye irritation.
Precautionary Statements: P210 Keep away from heat and open flames.
P264 Wash hands thoroughly after handling.
"""

entities = ner(sds_text)
for entity in entities:
    # The base model is uncased, so entity["word"] typically comes back
    # lowercased. Slice the input by character offsets to keep the original casing.
    span = sds_text[entity["start"]:entity["end"]]
    print(f"{entity['entity_group']:30s} {span}")
Expected output:
SIGNAL_WORD Danger
HAZARD_STATEMENT H225 Highly flammable liquid and vapour
HAZARD_STATEMENT H302 Harmful if swallowed
HAZARD_STATEMENT H319 Causes serious eye irritation
PRECAUTIONARY_STATEMENT P210 Keep away from heat and open flames
PRECAUTIONARY_STATEMENT P264 Wash hands thoroughly after handling
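Each HAZARD_STATEMENT and PRECAUTIONARY_STATEMENT span carries both the code and the statement text, while label templates usually need them as separate fields. A minimal sketch of that split follows. `split_statement` is a hypothetical helper, not part of the model, and the pattern assumes codes of the form H/P/EUH plus three digits, an optional one- or two-letter suffix, and `+` for combined statements.

```python
import re

_CODE = r"(?:EUH|H|P)\d{3}[A-Za-z]{0,2}\b"
_STATEMENT = re.compile(
    rf"^\s*({_CODE}(?:\s*\+\s*{_CODE})*)\s*[:\-]?\s*(.*?)\s*$",
    re.IGNORECASE | re.DOTALL,
)


def _normalise_code(code: str) -> str:
    # Uppercase only the alphabetic prefix (H, P, EUH); suffix letters are
    # left as extracted.
    return re.sub(r"^[A-Za-z]+", lambda m: m.group(0).upper(), code.strip())


def split_statement(span: str):
    """Split 'H225 Highly flammable ...' into ('H225', 'Highly flammable ...').

    Returns None when the span does not start with a recognisable code.
    """
    match = _STATEMENT.match(span)
    if match is None:
        return None
    code = "+".join(_normalise_code(part) for part in match.group(1).split("+"))
    return code, match.group(2)


print(split_statement("H225 Highly flammable liquid and vapour"))
print(split_statement("p301 + p312 if swallowed"))
print(split_statement("Danger"))
```

Returning `None` rather than guessing keeps malformed spans visible to whatever validation step follows.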
Structured field extraction
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("daanhoekstra/sds-ner-compliance")
model = AutoModelForTokenClassification.from_pretrained("daanhoekstra/sds-ner-compliance")


def _flush(fields: dict, label: str, tokens: list) -> None:
    """Store one completed entity span in the matching output field."""
    # Values are decoded from wordpieces, so they follow the tokenizer's
    # (uncased) normalisation rather than the original casing.
    value = tokenizer.convert_tokens_to_string(tokens).strip()
    if label == "SIGNAL_WORD":
        fields["signal_word"] = value
    elif label == "HAZARD_STATEMENT":
        fields["h_statements"].append(value)
    elif label == "PRECAUTIONARY_STATEMENT":
        fields["p_statements"].append(value)
    elif label == "CAS_NUMBER":
        fields["cas_numbers"].append(value)
    elif label == "PRODUCT_NAME":
        fields["product_name"] = value


def extract_label_fields(text: str) -> dict:
    # Input beyond 512 tokens is truncated, so split long documents beforehand.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = outputs.logits.argmax(-1).squeeze(0).tolist()
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze(0).tolist())
    id2label = model.config.id2label
    fields = {
        "signal_word": None,
        "h_statements": [],
        "p_statements": [],
        "cas_numbers": [],
        "product_name": None,
    }
    current_label = None
    current_tokens = []
    for token, pred in zip(tokens, predictions):
        label = id2label[pred]
        if token in ["[CLS]", "[SEP]", "[PAD]"]:
            continue
        if label.startswith("B-"):
            # A new span starts: store the previous one first.
            if current_label and current_tokens:
                _flush(fields, current_label, current_tokens)
            current_label = label[2:]
            current_tokens = [token]
        elif label.startswith("I-") and current_label == label[2:]:
            # Continue the open span only if the entity type matches.
            current_tokens.append(token)
        else:
            if current_label and current_tokens:
                _flush(fields, current_label, current_tokens)
            current_label = None
            current_tokens = []
    # Store the last open span, which the loop alone would drop.
    if current_label and current_tokens:
        _flush(fields, current_label, current_tokens)
    return fields
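Because `extract_label_fields` truncates its input at 512 tokens, a full SDS will usually need to be split before extraction. The sketch below shows one possible approach under stated assumptions: `chunk_lines` and `merge_fields` are hypothetical helpers, and whitespace-separated words serve as a rough proxy for token count, which is why the default budget is well under 512.

```python
def chunk_lines(text: str, max_words: int = 200, overlap_lines: int = 2) -> list:
    """Group non-empty lines into chunks of roughly `max_words` words each.

    Consecutive chunks share `overlap_lines` lines so that a statement near a
    boundary appears whole in at least one chunk. A single line longer than
    the budget still becomes its own (oversized) chunk.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    chunks, current, count = [], [], 0
    for line in lines:
        n_words = len(line.split())
        if current and count + n_words > max_words:
            chunks.append("\n".join(current))
            current = current[-overlap_lines:] if overlap_lines else []
            count = sum(len(kept.split()) for kept in current)
        current.append(line)
        count += n_words
    if current:
        chunks.append("\n".join(current))
    return chunks


def merge_fields(results: list) -> dict:
    """Combine per-chunk dicts shaped like the output of extract_label_fields."""
    merged = {
        "signal_word": None,
        "h_statements": [],
        "p_statements": [],
        "cas_numbers": [],
        "product_name": None,
    }
    for fields in results:
        for key, value in fields.items():
            if isinstance(value, list):
                # Overlapping chunks yield duplicates; keep the first occurrence.
                for item in value:
                    if item not in merged[key]:
                        merged[key].append(item)
            elif merged.get(key) is None:
                merged[key] = value
    return merged


# Usage with the function defined above (requires the model to be loaded):
# fields = merge_fields([extract_label_fields(c) for c in chunk_lines(sds_text)])
```

Splitting on line boundaries rather than fixed character offsets avoids cutting an H- or P-statement in half, at the cost of uneven chunk sizes.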
Scanned PDFs and image-based SDS
For image-based SDS documents where text is not selectable, use the layout-aware variant of this pipeline built on microsoft/layoutlmv3-base, which processes both the visual layout and OCR text jointly. See the GHS label compliance blog post for a full walkthrough.
Training
Data
The model was fine-tuned on a curated dataset of SDS documents annotated at the token level. Documents were sourced from publicly available SDS libraries and supplier portals, covering a range of chemical product categories including solvents, adhesives, cleaning agents, and industrial gases.
Annotation was performed using Label Studio with a custom NER schema aligned to GHS Rev 9 mandatory label elements. Inter-annotator agreement (Cohen's kappa) was measured at 0.91 on the held-out validation split.
| Split | Documents | Tokens |
|---|---|---|
| Train | 1,840 | 412,000 |
| Validation | 230 | 51,500 |
| Test | 230 | 51,200 |
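The inter-annotator agreement figure quoted in this section is Cohen's kappa, which corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it over token-level tags from two annotators. The tag sequences are invented for illustration and are not taken from the dataset.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: share of tokens where both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


annotator_a = ["O", "B-SIGNAL_WORD", "O", "B-CAS_NUMBER", "O", "O"]
annotator_b = ["O", "B-SIGNAL_WORD", "O", "O", "O", "O"]
print(round(cohens_kappa(annotator_a, annotator_b), 2))  # 0.6
```

On NER data most tokens are `O`, so chance agreement is high and kappa is noticeably lower than raw agreement (5/6 in the toy example).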
Procedure
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./sds-ner-compliance",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",  # expects compute_metrics to return an "f1" key
)
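Selecting the best checkpoint by `f1` presupposes a metric function, which the card does not show. As an illustration only, the sketch below computes entity-level precision, recall, and F1 from BIO tag sequences in plain Python. `bio_spans` and `span_f1` are hypothetical names, and the score pools all entity types (micro average) rather than averaging per type. In practice a library such as seqeval is commonly used for this.

```python
def bio_spans(tags) -> set:
    """Return the set of (label, start, end) spans in one BIO tag sequence."""
    spans, label, start = set(), None, 0
    # The trailing "O" sentinel closes a span that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == label
        if label is not None and not continues:
            spans.add((label, start, i))
            label = None
        if tag.startswith("B-"):
            label, start = tag[2:], i
    return spans


def span_f1(gold_sequences, pred_sequences) -> dict:
    """Exact-match span scores pooled over all sequences and entity types."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sequences, pred_sequences):
        gold_spans, pred_spans = bio_spans(gold), bio_spans(pred)
        tp += len(gold_spans & pred_spans)
        fp += len(pred_spans - gold_spans)
        fn += len(gold_spans - pred_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


gold = [["B-SIGNAL_WORD", "O", "B-HAZARD_STATEMENT", "I-HAZARD_STATEMENT"]]
pred = [["B-SIGNAL_WORD", "O", "B-HAZARD_STATEMENT", "O"]]
print(span_f1(gold, pred))
```

A span counts as correct only when label, start, and end all match, so the truncated hazard statement in the example is both a false positive and a false negative.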
Results
Evaluated on the held-out test split:
| Entity | Precision | Recall | F1 |
|---|---|---|---|
| SIGNAL_WORD | 0.98 | 0.97 | 0.98 |
| HAZARD_STATEMENT | 0.94 | 0.93 | 0.93 |
| PRECAUTIONARY_STATEMENT | 0.91 | 0.90 | 0.91 |
| CAS_NUMBER | 0.97 | 0.96 | 0.97 |
| PRODUCT_NAME | 0.89 | 0.87 | 0.88 |
| SUPPLIER_NAME | 0.86 | 0.84 | 0.85 |
| Macro average | 0.93 | 0.91 | 0.92 |
Performance is lower on SUPPLIER_NAME and PRODUCT_NAME due to high variability in how suppliers format these fields. A rule-based post-processing step (checking Section 1 headers) improves practical accuracy on these fields.
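The card does not specify the rule-based post-processing step. One plausible form is a header lookup that fills PRODUCT_NAME and SUPPLIER_NAME only when the model returned nothing. In the sketch below, `fill_from_section1` is a hypothetical helper, the header wordings are assumptions (real SDS vary by supplier and language), and the sample product and company are made up.

```python
import re

# Assumed English header wordings; extend per supplier and language.
_HEADER_PATTERNS = {
    "product_name": re.compile(
        r"^[ \t]*(?:product[ \t]+(?:name|identifier)|trade[ \t]+name)[ \t]*[:\-][ \t]*(.+?)[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    ),
    "supplier_name": re.compile(
        r"^[ \t]*(?:supplier|manufacturer|company(?:[ \t]+name)?)[ \t]*[:\-][ \t]*(.+?)[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    ),
}


def fill_from_section1(text: str, fields: dict) -> dict:
    """Return a copy of `fields` with empty Section 1 fields filled from headers."""
    filled = dict(fields)
    for key, pattern in _HEADER_PATTERNS.items():
        if filled.get(key):
            continue  # keep the model's prediction when there is one
        match = pattern.search(text)
        if match:
            filled[key] = match.group(1)
    return filled


section1 = """Section 1: Identification
Product name: Acetone Technical Grade
Supplier: Example Chemicals B.V.
"""
print(fill_from_section1(section1, {"product_name": None}))
```

Treating the rules as a fallback rather than an override keeps the model's output authoritative where it is confident and limits the rules to the cases where they add coverage.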
Limitations and Bias
- Format sensitivity. Performance degrades on SDS documents that deviate significantly from standard GHS section structure. Documents that combine sections or use non-standard headers may produce incomplete extractions.
- Language coverage. While the model handles EN, NL, DE, and FR, accuracy is highest on English documents, which dominate the training set. Fine-tuning on a target-language SDS set is recommended for non-English production deployments.
- Novel hazard classes. The 2023 EU CLP amendment (Delegated Regulation 2023/707) introduced new hazard classes for endocrine disruption, PBT/vPvB, and PMT/vPvM substances. Documents using these classes may not extract correctly until the model is updated with annotated examples.
- Not a classifier. This model extracts what the SDS says. It does not verify whether the classification is scientifically correct or regulatory-compliant. A downstream validation step against an authoritative H-statement list is required before generating a label.
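The downstream validation step could look roughly like the sketch below, which takes a dict shaped like the output of `extract_label_fields`. `validate_fields` is a hypothetical helper, and `REFERENCE_H_CODES` holds only the three codes from this card's usage example. A real deployment would load the complete list from an authoritative source.

```python
import re

# Illustrative subset only; NOT an authoritative H-statement list.
REFERENCE_H_CODES = {"H225", "H302", "H319"}
VALID_SIGNAL_WORDS = {"danger", "warning"}
_H_CODE = re.compile(r"\b(?:EUH|H)\d{3}[A-Za-z]{0,2}\b", re.IGNORECASE)


def validate_fields(fields: dict, reference_codes=REFERENCE_H_CODES) -> list:
    """Return human-readable problems; an empty list means no issue was found."""
    problems = []
    # Compare case-insensitively because the uncased model lowercases its output.
    known = {code.upper() for code in reference_codes}

    signal = fields.get("signal_word")
    if signal is not None and signal.strip().lower() not in VALID_SIGNAL_WORDS:
        problems.append(f"unexpected signal word: {signal!r}")

    for statement in fields.get("h_statements", []):
        match = _H_CODE.search(statement)
        if match is None:
            problems.append(f"no H-code found in: {statement!r}")
        elif match.group(0).upper() not in known:
            problems.append(f"unknown H-code: {match.group(0).upper()}")
    return problems


extracted = {
    "signal_word": "danger",
    "h_statements": ["h225 highly flammable liquid and vapour", "h999 made up"],
}
print(validate_fields(extracted))
```

Returning a list of problems instead of raising on the first one lets a reviewer see everything wrong with a document in a single pass.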
Environmental Impact
Fine-tuning was performed on a single A100 GPU for approximately 3 hours.
- Hardware: 1x NVIDIA A100 80GB
- Training duration: ~3 hours
- Estimated CO2 emissions: ~0.4 kg (via ML CO2 Impact calculator)
Citation
If you use this model in your research or production pipeline, please cite:
@misc{daanhoekstra2025sdsner,
title={SDS NER Compliance: Named Entity Recognition for GHS Safety Data Sheets},
author={Daan Hoekstra},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/daanhoekstra/sds-ner-compliance}
}
Related Resources
- How Language Models Can Accelerate GHS and Supply Chain Label Compliance: accompanying blog post with full pipeline walkthrough
- Automating E-Invoicing Compliance with Open-Source NLP: related post on document AI for compliance
- ML-driven automated SDS indexing (Scientific Reports, 2024): peer-reviewed reference for BERT-based SDS extraction
- GHS Purple Book (UN, Rev 9): authoritative label element definitions