incidents_ner_synthetic2

BERT-based token classifier for named-entity recognition on public-safety / dispatch-style English text (incident narratives, unit callouts, locations, etc.). Trained on local public safety radio traffic.

Architecture: BertForTokenClassification (BERT-base, cased tokenizer vocabulary)
Base checkpoint: dslim/bert-base-NER — classification head replaced for this label set
Training stack: Hugging Face transformers Trainer
Variant name: synthetic2 — includes synthetic augmentation alongside real-style incident data (stronger coverage for patterns like unit IDs and templated phrases)

Labels (BIO)

Label	Description (typical use)
`O`	Outside any entity
`ADDRESS`	Street address / numbered location
`AGENCY`	Agency / department name
`CONTEXT`	Surrounding context
`DESC`	Free-text description
`EVT_TYPE`	Incident / event type
`LOC`	Location name (street without number, place, etc.)
`STATUS`	Unit / incident status
`SUBJECT`	Subject (person role)
`SUSPECT_DATA`	Suspect-related details
`UNIT`	Radio / resource unit identifiers
`VEHICLE`	Vehicle references
`X_STREET`	Cross street

Spans use standard B- / I- prefixes (e.g. B-UNIT, I-UNIT).
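As an illustration of how the BIO scheme expands the table above into a full label set, here is a minimal sketch. The ordering of labels and the helper name `build_label_maps` are assumptions made for this example; the authoritative mapping is the `id2label` / `label2id` stored in the exported config.json.

```python
# Entity types copied from the label table; the ordering is an assumption.
ENTITY_TYPES = [
    "ADDRESS", "AGENCY", "CONTEXT", "DESC", "EVT_TYPE", "LOC",
    "STATUS", "SUBJECT", "SUSPECT_DATA", "UNIT", "VEHICLE", "X_STREET",
]


def build_label_maps(entity_types):
    """Expand entity types into O plus B-/I- labels and build both lookup maps."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")
        labels.append(f"I-{entity_type}")
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return labels, id2label, label2id


labels, id2label, label2id = build_label_maps(ENTITY_TYPES)
print(len(labels))  # 12 entity types * 2 prefixes + "O"
```

When loading the published model, prefer the mapping from its config over a rebuilt one, since the index order here may not match the trained head.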

Intended use

Assistive labeling (e.g. Doccano, pre-annotation pipelines)
Prototyping NER for CAD / radio / narrative workflows

Limitations

Domain-specific: Tuned for a particular schema and dialect; performance drops on unrelated genres or regions.
Synthetic mix: Heavier synthetic exposure can improve rare token patterns but may bias the model toward generator artifacts — validate on held-out real audio transcripts or logs.
Not for automated enforcement without human review.

Training details (reference)

Default settings from the training script (override if your run differed):

Setting	Typical value
Optimizer / schedule	AdamW, weight decay `0.01`
Learning rate	`2e-5`
Epochs	`3`
Batch size	`8` (per device)
Max sequence length	`128`
Validation split	`15%` (last slice of shuffled or ordered JSONL — match your run)
Seed	`42`
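The validation split described in the table can be sketched as follows. This is an assumption about how the split might be done (optional seeded shuffle, then the last 15% held out), not a copy of the training script; the function name `split_train_val` is hypothetical.

```python
import random


def split_train_val(records, val_fraction=0.15, seed=42, shuffle=True):
    """Hold out the last `val_fraction` of the records for validation."""
    items = list(records)
    if shuffle:
        # A dedicated Random instance keeps the split reproducible
        # without touching the global random state.
        random.Random(seed).shuffle(items)
    n_val = int(len(items) * val_fraction)
    cut = len(items) - n_val
    return items[:cut], items[cut:]


train, val = split_train_val(range(100))
print(len(train), len(val))
```

Pass `shuffle=False` to reproduce an ordered split, where the validation set is simply the tail of the file.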

Input format for training: JSONL with pre-tokenized tokens and parallel labels per token (IOB strings), as consumed by train_ner.py.
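A record in this format might look like the sketch below. The field names `tokens` and `labels` are assumptions (check train_ner.py for the keys it actually reads), and the example annotation is hand-written for illustration rather than taken from the dataset.

```python
import json


def parse_record(line):
    """Parse one JSONL line and check that tokens and labels line up."""
    record = json.loads(line)
    tokens = record["tokens"]  # assumed key name
    labels = record["labels"]  # assumed key name
    if len(tokens) != len(labels):
        raise ValueError(
            f"{len(tokens)} tokens but {len(labels)} labels in record"
        )
    for label in labels:
        if label != "O" and not label.startswith(("B-", "I-")):
            raise ValueError(f"not an IOB label: {label!r}")
    return tokens, labels


# Hand-written illustrative line, not real training data.
sample_line = json.dumps({
    "tokens": ["5137", "copy", "traffic", "stop", "on", "Maple", "Street"],
    "labels": ["B-UNIT", "O", "B-EVT_TYPE", "I-EVT_TYPE", "O", "B-LOC", "I-LOC"],
})

tokens, token_labels = parse_record(sample_line)
print(list(zip(tokens, token_labels)))
```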

How to use

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "YOUR_HF_USERNAME/incidents_ner_v1"  # after upload

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tok,
    aggregation_strategy="simple",
)
text = "5137, 5188 copy a traffic stop on US 67 at Maple Street."
print(ner(text))
```
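With an aggregation strategy set, the pipeline returns a list of dicts carrying `entity_group`, `score`, `word`, `start`, and `end`. A small post-processing sketch, assuming you want entities grouped by type and filtered by confidence, is shown here. The helper `group_entities`, the 0.5 threshold, and the placeholder predictions are all hypothetical; the list below is hand-written to show the shape and is not output from this model.

```python
from collections import defaultdict


def group_entities(predictions, min_score=0.5):
    """Group predicted entity text by entity type, dropping low-score spans."""
    grouped = defaultdict(list)
    for pred in predictions:
        if pred["score"] >= min_score:
            grouped[pred["entity_group"]].append(pred["word"])
    return dict(grouped)


# Placeholder predictions in the pipeline's output shape (NOT real model output).
placeholder_predictions = [
    {"entity_group": "UNIT", "score": 0.98, "word": "5137", "start": 0, "end": 4},
    {"entity_group": "UNIT", "score": 0.40, "word": "5188", "start": 6, "end": 10},
    {"entity_group": "LOC", "score": 0.91, "word": "Maple Street", "start": 43, "end": 55},
]

print(group_entities(placeholder_predictions))
```

For assistive labeling, a lower threshold may be preferable so that annotators see more candidate spans to accept or reject.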

For word-aligned labels, as in training, pass pre-split words to the tokenizer with is_split_into_words=True and align the word-level labels to word pieces the same way the training script does.
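One common alignment strategy is sketched below: label only the first word piece of each word and mask the rest with -100, the index PyTorch's cross-entropy loss ignores by default. Whether train_ner.py uses this exact strategy is an assumption, so compare against the script before relying on it. The `word_ids` list is what a fast tokenizer's `encoding.word_ids()` returns; here it is written by hand so the example runs without downloading a model.

```python
IGNORE_INDEX = -100  # skipped by PyTorch's CrossEntropyLoss by default


def align_labels_to_word_pieces(word_ids, word_labels, label2id):
    """Map word-level labels onto word pieces, labeling first pieces only."""
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] and [SEP] have no source word.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous_word:
            # First piece of a new word carries the word's label.
            aligned.append(label2id[word_labels[word_id]])
        else:
            # Continuation pieces are masked out of the loss.
            aligned.append(IGNORE_INDEX)
        previous_word = word_id
    return aligned


# Hand-written stand-in for encoding.word_ids(): word 0 splits into two pieces.
example_word_ids = [None, 0, 0, 1, 2, None]
example_labels = ["B-UNIT", "O", "B-LOC"]
example_label2id = {"O": 0, "B-UNIT": 1, "B-LOC": 2}  # toy mapping for the demo

print(align_labels_to_word_pieces(example_word_ids, example_labels, example_label2id))
```

At inference time the same `word_ids` can be used in reverse, taking the prediction on each word's first piece as the label for the whole word.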

Citation

If you use this model, cite the base model and your dataset. Base model card: dslim/bert-base-NER.

Model index

Local export path (this repo): models/incident_ner_v1
Config: config.json includes id2label / label2id and transformers version used at export.

Replace YOUR_HF_USERNAME/incidents_ner_v1 with your Hub repo id after publishing. Add evaluation metrics (precision/recall/F1 per entity) in an Evaluation section when you have them.
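Per-entity precision, recall, and F1 could be computed at the span level along the lines of this sketch. It uses exact span matching and treats a stray I- tag as the start of a new span, which is one convention among several; a library such as seqeval is the more usual choice. The function names are hypothetical and the toy sequences are hand-made to exercise the code, not results from this model.

```python
from collections import Counter


def extract_spans(labels):
    """Return a set of (entity_type, start, end) spans from a BIO sequence."""
    spans = set()
    start, entity_type = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, label in enumerate(list(labels) + ["O"]):
        continues = label.startswith("I-") and entity_type == label[2:]
        if continues:
            continue
        if entity_type is not None:
            spans.add((entity_type, start, i))
        if label.startswith(("B-", "I-")):
            start, entity_type = i, label[2:]
        else:
            start, entity_type = None, None
    return spans


def per_entity_prf(gold_sequences, pred_sequences):
    """Exact-match span precision/recall/F1 for each entity type."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_sequences, pred_sequences):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        for span in gold_spans & pred_spans:
            tp[span[0]] += 1
        for span in pred_spans - gold_spans:
            fp[span[0]] += 1
        for span in gold_spans - pred_spans:
            fn[span[0]] += 1
    report = {}
    for etype in sorted(set(tp) | set(fp) | set(fn)):
        predicted, actual = tp[etype] + fp[etype], tp[etype] + fn[etype]
        precision = tp[etype] / predicted if predicted else 0.0
        recall = tp[etype] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[etype] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Toy sequences for demonstration only.
toy_gold = [["B-UNIT", "I-UNIT", "O", "B-LOC"]]
toy_pred = [["B-UNIT", "I-UNIT", "O", "O"]]
print(per_entity_prf(toy_gold, toy_pred))
```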
