incidents_ner_synthetic2

BERT-based token classifier for named-entity recognition (NER) on public-safety, dispatch-style English text (incident narratives, unit callouts, locations, etc.). Trained on local public-safety radio traffic.

  • Architecture: BertForTokenClassification (BERT-base, cased tokenizer vocabulary)
  • Base checkpoint: dslim/bert-base-NER, with the classification head replaced for this label set
  • Training stack: Hugging Face transformers Trainer
  • Variant name: synthetic2, which includes synthetic augmentation alongside real-style incident data (stronger coverage for patterns like unit IDs and templated phrases)

Labels (BIO)

Label         Description (typical use)
O             Outside any entity
ADDRESS       Street address / numbered location
AGENCY        Agency / department name
CONTEXT       Surrounding context
DESC          Free-text description
EVT_TYPE      Incident / event type
LOC           Location name (street without number, place, etc.)
STATUS        Unit / incident status
SUBJECT       Subject (person role)
SUSPECT_DATA  Suspect-related details
UNIT          Radio / resource unit identifiers
VEHICLE       Vehicle references
X_STREET      Cross street

Spans use standard B- / I- prefixes (e.g. B-UNIT, I-UNIT).
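
As an illustration of the tag set this implies, the sketch below expands the entity types from the table into B- / I- tags plus O. The id ordering here is an assumption made for the example; the authoritative mapping is the id2label / label2id stored in config.json.

```python
# Entity types taken from the label table (O is added separately).
ENTITY_TYPES = [
    "ADDRESS", "AGENCY", "CONTEXT", "DESC", "EVT_TYPE", "LOC",
    "STATUS", "SUBJECT", "SUSPECT_DATA", "UNIT", "VEHICLE", "X_STREET",
]

def build_bio_labels(entity_types):
    """Return ["O", "B-X", "I-X", ...] for the given entity types."""
    labels = ["O"]
    for ent in entity_types:
        labels.append(f"B-{ent}")
        labels.append(f"I-{ent}")
    return labels

LABELS = build_bio_labels(ENTITY_TYPES)

# Assumed id order for illustration; the exported config.json may differ.
id2label = dict(enumerate(LABELS))
label2id = {label: idx for idx, label in id2label.items()}

print(len(LABELS))  # 25 tags: 12 entity types x (B-, I-) + O
```

Comparing a list built this way against the exported config is a quick check that a fine-tuned head matches the schema above.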

Intended use

  • Assistive labeling (e.g. Doccano, pre-annotation pipelines)
  • Prototyping NER for CAD / radio / narrative workflows
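
For the pre-annotation use case, a minimal sketch of turning aggregated pipeline output into a Doccano-style JSONL record might look like this. The "text" / "label" field names follow the sequence-labeling JSONL layout Doccano commonly accepts, but that is an assumption to verify against your Doccano version; to_doccano_record and min_score are hypothetical names, and the entities shown are hand-written stand-ins, not model predictions.

```python
import json

def to_doccano_record(text, entities, min_score=0.5):
    """Convert aggregated pipeline output into one Doccano-style record.

    `entities` is a list of dicts with the keys an aggregated
    token-classification pipeline returns: entity_group, score, start, end.
    `min_score` is a hypothetical cutoff, not a tuned value.
    """
    spans = [
        [int(e["start"]), int(e["end"]), e["entity_group"]]
        for e in entities
        if float(e["score"]) >= min_score
    ]
    return {"text": text, "label": sorted(spans)}

# Hand-written stand-in for pipeline output (not real predictions).
text = "5137 copy a traffic stop at Maple Street."
fake_entities = [
    {"entity_group": "UNIT", "score": 0.98, "word": "5137", "start": 0, "end": 4},
    {"entity_group": "LOC", "score": 0.31, "word": "Maple Street", "start": 28, "end": 40},
]

record = to_doccano_record(text, fake_entities)
print(json.dumps(record))  # one JSONL line; the low-score LOC span is dropped
```

Dropping low-confidence spans keeps annotators from having to delete noise, at the cost of more spans to add by hand; the right cutoff depends on your data.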

Limitations

  • Domain-specific: Tuned for a particular schema and dialect; performance drops on unrelated genres or regions.
  • Synthetic mix: Heavier synthetic exposure can improve rare token patterns but may bias the model toward generator artifacts; validate on held-out real audio transcripts or logs.
  • Not for automated enforcement without human review.

Training details (reference)

Default settings from the training script (override if your run differed):

Setting               Typical value
Optimizer / schedule  AdamW, weight decay 0.01
Learning rate         2e-5
Epochs                3
Batch size            8 (per device)
Max sequence length   128
Validation split      15% (last slice of shuffled or ordered JSONL; match your run)
Seed                  42
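
The validation split in the table can be sketched as follows. This is an illustration of "hold out the last 15%", not the logic of train_ner.py itself; split_train_val and its shuffle flag are hypothetical, and whether your run shuffled before slicing is something to confirm.

```python
import random

def split_train_val(records, val_fraction=0.15, seed=42, shuffle=True):
    """Hold out the last `val_fraction` of the records for validation."""
    items = list(records)
    if shuffle:
        # A seeded local RNG keeps the split reproducible across runs.
        random.Random(seed).shuffle(items)
    n_val = round(len(items) * val_fraction)
    cut = len(items) - n_val
    return items[:cut], items[cut:]

train, val = split_train_val(range(100))
print(len(train), len(val))  # 85 15
```

With shuffle=False the held-out slice is simply the tail of the file, which matters if the JSONL is ordered by date or by source.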

Input format for training: JSONL with pre-tokenized tokens and parallel labels per token (IOB strings), as consumed by train_ner.py.
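
A minimal sketch of reading such a file and checking that tokens and labels are parallel is shown below. The field names "tokens" and "labels" are assumptions, since the exact keys expected by train_ner.py are not stated here; the sample record is hand-written for illustration.

```python
import json

def parse_ner_jsonl(lines, tokens_key="tokens", labels_key="labels"):
    """Parse JSONL lines into (tokens, labels) pairs, checking alignment."""
    examples = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        rec = json.loads(line)
        tokens, labels = rec[tokens_key], rec[labels_key]
        if len(tokens) != len(labels):
            raise ValueError(
                f"line {line_no}: {len(tokens)} tokens vs {len(labels)} labels"
            )
        examples.append((tokens, labels))
    return examples

# Illustrative record; the labels are hand-written, not model output.
sample = json.dumps({
    "tokens": ["5137", "copy", "traffic", "stop", "at", "Maple", "Street"],
    "labels": ["B-UNIT", "O", "B-EVT_TYPE", "I-EVT_TYPE", "O", "B-LOC", "I-LOC"],
})

examples = parse_ner_jsonl([sample])
print(examples[0])
```

To read from disk, pass an open file handle (for example `open(path, encoding="utf-8")`) as `lines`.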

How to use

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Placeholder repo id; replace it after upload.
model_id = "YOUR_HF_USERNAME/incidents_ner_v1"

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# "simple" aggregation merges word pieces into entity spans.
ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tok,
    aggregation_strategy="simple",
)

text = "5137, 5188 copy a traffic stop on US 67 at Maple Street."
# Prints a list of dicts with entity_group, score, word, start, and end.
print(ner(text))

For word-aligned labels (like training), tokenize with is_split_into_words=True and align labels to word pieces the same way as in training.
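
One common alignment scheme is sketched below: the first word piece of each word carries the word's label, and special tokens and continuation pieces get -100 so the loss skips them. Whether training used this scheme or labeled every piece is an assumption to check against train_ner.py; align_labels is a hypothetical helper, and the word_ids list is hand-written rather than produced by a tokenizer.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def align_labels(word_ids, word_label_ids, label_all_pieces=False):
    """Map word-level label ids onto word-piece positions.

    `word_ids` has the shape a fast tokenizer's `encoding.word_ids()`
    returns: None for special tokens, else the index of the source word.
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(IGNORE_INDEX)         # special token or padding
        elif wid != previous or label_all_pieces:
            aligned.append(word_label_ids[wid])  # first piece (or every piece)
        else:
            aligned.append(IGNORE_INDEX)         # continuation piece
        previous = wid
    return aligned

# Hand-written word_ids for: [CLS] w0 w1 w1 w2 [SEP]. With a real tokenizer:
#   enc = tok(words, is_split_into_words=True, truncation=True, max_length=128)
#   word_ids = enc.word_ids()
word_ids = [None, 0, 1, 1, 2, None]
print(align_labels(word_ids, [5, 0, 7]))  # [-100, 5, 0, -100, 7, -100]
```

At inference time the same word_ids let you read predictions back at the first piece of each word to recover one label per input token.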

Citation

If you use this model, cite the base model and your dataset. Base model card: dslim/bert-base-NER.

Model index

  • Local export path (this repo): models/incident_ner_v1
  • Config: config.json includes id2label / label2id and transformers version used at export.

Replace YOUR_HF_USERNAME/incidents_ner_v1 with your Hub repo id after publishing. Add evaluation metrics (precision/recall/F1 per entity) in an Evaluation section when you have them.
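
As a starting point for those metrics, the sketch below computes entity-level precision, recall, and F1 per entity type from BIO sequences using strict span matching. It is a simplified illustration (an I- tag without a matching preceding B- is ignored); established libraries such as seqeval handle more edge cases. The sequences are hand-written examples, not results from this model.

```python
from collections import Counter

def bio_spans(labels):
    """Extract (type, start, end_exclusive) spans from one BIO sequence."""
    spans, start, ent = [], None, None
    for i, tag in enumerate(list(labels) + ["O"]):  # sentinel closes last span
        inside = tag.startswith("I-") and ent == tag[2:]
        if start is not None and not inside:
            spans.append((ent, start, i))
            start, ent = None, None
        if tag.startswith("B-"):
            start, ent = i, tag[2:]
    return set(spans)

def per_entity_prf(gold_seqs, pred_seqs):
    """Strict-match precision / recall / F1 for each entity type."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = bio_spans(gold), bio_spans(pred)
        for ent, _, _ in g & p:
            tp[ent] += 1
        for ent, _, _ in p - g:
            fp[ent] += 1
        for ent, _, _ in g - p:
            fn[ent] += 1
    scores = {}
    for ent in sorted(set(tp) | set(fp) | set(fn)):
        prec = tp[ent] / (tp[ent] + fp[ent]) if tp[ent] + fp[ent] else 0.0
        rec = tp[ent] / (tp[ent] + fn[ent]) if tp[ent] + fn[ent] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[ent] = {"precision": prec, "recall": rec, "f1": f1}
    return scores

# Hand-written toy data: UNIT matches exactly, LOC has a boundary error.
gold = [["B-UNIT", "O", "B-LOC", "I-LOC"]]
pred = [["B-UNIT", "O", "B-LOC", "O"]]
print(per_entity_prf(gold, pred))
```

Under strict matching a boundary error counts as both a false positive and a false negative, so the LOC span here scores zero even though it overlaps the gold span.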
