incidents_ner_synthetic2
BERT-based token classifier for named-entity recognition on public-safety / dispatch-style English text (incident narratives, unit callouts, locations, etc.). Trained on local public safety radio traffic.
- Architecture: `BertForTokenClassification` (BERT-base, cased tokenizer vocabulary)
- Base checkpoint: `dslim/bert-base-NER`, with the classification head replaced for this label set
- Training stack: Hugging Face `transformers` `Trainer`
- Variant name: `synthetic2`, which includes synthetic augmentation alongside real-style incident data (stronger coverage for patterns like unit IDs and templated phrases)
Labels (BIO)
| Label | Description (typical use) |
|---|---|
| O | Outside any entity |
| ADDRESS | Street address / numbered location |
| AGENCY | Agency / department name |
| CONTEXT | Surrounding context |
| DESC | Free-text description |
| EVT_TYPE | Incident / event type |
| LOC | Location name (street without number, place, etc.) |
| STATUS | Unit / incident status |
| SUBJECT | Subject (person role) |
| SUSPECT_DATA | Suspect-related details |
| UNIT | Radio / resource unit identifiers |
| VEHICLE | Vehicle references |
| X_STREET | Cross street |
Spans use standard B- / I- prefixes (e.g. B-UNIT, I-UNIT).
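As a rough illustration of how the table expands into BIO tags, the sketch below builds a label list and the two lookup maps. The ordering of entity types (and therefore every numeric id) is an assumption made here for illustration, and `build_label_maps` is a hypothetical helper; the authoritative mapping is the `id2label` / `label2id` stored in the exported `config.json`.

```python
# Entity types copied from the label table; their order here is an ASSUMPTION.
# The real ids live in config.json (id2label / label2id).
ENTITY_TYPES = [
    "ADDRESS", "AGENCY", "CONTEXT", "DESC", "EVT_TYPE", "LOC",
    "STATUS", "SUBJECT", "SUSPECT_DATA", "UNIT", "VEHICLE", "X_STREET",
]


def build_label_maps(entity_types):
    """Return (labels, label2id, id2label) for a BIO scheme with O at index 0."""
    labels = ["O"]
    for ent in entity_types:
        labels.extend([f"B-{ent}", f"I-{ent}"])
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return labels, label2id, id2label


labels, label2id, id2label = build_label_maps(ENTITY_TYPES)
print(len(labels))  # 25: one O tag plus B-/I- for each of the 12 entity types
```

When loading the published model, prefer `model.config.id2label` over a hand-built map so the ids match the classification head.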
Intended use
- Assistive labeling (e.g. Doccano, pre-annotation pipelines)
- Prototyping NER for CAD / radio / narrative workflows
Limitations
- Domain-specific: Tuned for a particular schema and dialect; performance drops on unrelated genres or regions.
- Synthetic mix: Heavier synthetic exposure can improve rare token patterns but may bias the model toward generator artifacts. Validate on held-out real audio transcripts or logs.
- Not for automated enforcement without human review.
Training details (reference)
Default settings from the training script (override if your run differed):
| Setting | Typical value |
|---|---|
| Optimizer / schedule | AdamW, weight decay 0.01 |
| Learning rate | 2e-5 |
| Epochs | 3 |
| Batch size | 8 (per device) |
| Max sequence length | 128 |
| Validation split | 15% (last slice of shuffled or ordered JSONL; match your run) |
| Seed | 42 |
Input format for training: JSONL with pre-tokenized tokens and parallel labels per token (IOB strings), as consumed by train_ner.py.
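To make the input format concrete, here is a minimal sketch of reading such a file and holding out the last 15% for validation. The field names `tokens` and `labels`, the helper names, and the sample record (including its tags) are assumptions for illustration; check them against what `train_ner.py` actually reads.

```python
import json


def parse_jsonl(lines):
    """Parse JSONL lines into records, checking tokens and labels stay parallel."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        rec = json.loads(line)
        tokens, labels = rec["tokens"], rec["labels"]  # ASSUMED field names
        if len(tokens) != len(labels):
            raise ValueError(
                f"line {line_no}: {len(tokens)} tokens vs {len(labels)} labels"
            )
        records.append(rec)
    return records


def split_last_slice(records, val_fraction=0.15):
    """Hold out the final val_fraction of records as the validation set."""
    n_val = max(1, int(len(records) * val_fraction))
    return records[:-n_val], records[-n_val:]


# Illustrative record only; the tags are made up for this example.
sample = [
    '{"tokens": ["5137", "copy", "traffic", "stop", "on", "Maple", "Street"], '
    '"labels": ["B-UNIT", "O", "B-EVT_TYPE", "I-EVT_TYPE", "O", "B-LOC", "I-LOC"]}',
]

records = parse_jsonl(sample * 20)
train, val = split_last_slice(records)
print(len(train), len(val))  # 17 3
```

Because the split takes the tail of the file, shuffle beforehand (with the seed from the table) if the JSONL is ordered by source or by time and you want a representative validation set.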
How to use
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "YOUR_HF_USERNAME/incidents_ner_v1"  # after upload

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tok,
    aggregation_strategy="simple",
)

text = "5137, 5188 copy a traffic stop on US 67 at Maple Street."
print(ner(text))
```
For word-aligned labels (like training), tokenize with is_split_into_words=True and align labels to word pieces the same way as in training.
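One common way to do that alignment is sketched below. It assumes the first word piece of each word carries the word's label and the remaining pieces and special tokens are masked with `-100`; whether `train_ner.py` uses this exact strategy is not stated here, so treat it as an assumption. `align_labels` is a hypothetical helper that works on the list returned by a fast tokenizer's `word_ids()`.

```python
IGNORE_INDEX = -100  # label id skipped by PyTorch's cross-entropy loss by default


def align_labels(word_ids, word_label_ids, label_all_pieces=False):
    """Map per-word label ids onto word pieces.

    word_ids: one entry per word piece, giving the index of the source word,
              or None for special tokens such as [CLS] / [SEP].
    word_label_ids: one label id per original word.
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(IGNORE_INDEX)  # special token
        elif wid != previous:
            aligned.append(word_label_ids[wid])  # first piece of a word
        else:
            # Continuation piece: mask it unless the caller wants it labeled.
            aligned.append(word_label_ids[wid] if label_all_pieces else IGNORE_INDEX)
        previous = wid
    return aligned


# Hand-written word_ids for three words where the first splits into two pieces.
example_word_ids = [None, 0, 0, 1, 2, None]
example_labels = [19, 0, 11]  # arbitrary ids, for illustration only
print(align_labels(example_word_ids, example_labels))
# [-100, 19, -100, 0, 11, -100]
```

In practice the `word_ids` list would come from something like `tok(words, is_split_into_words=True, truncation=True, max_length=128).word_ids()`, which requires a fast tokenizer.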
Citation
If you use this model, cite the base model and your dataset. Base model card: dslim/bert-base-NER.
Model index
- Local export path (this repo): `models/incident_ner_v1`
- Config: `config.json` includes `id2label` / `label2id` and the `transformers` version used at export.
Replace YOUR_HF_USERNAME/incidents_ner_v1 with your Hub repo id after publishing. Add evaluation metrics (precision/recall/F1 per entity) in an Evaluation section when you have them.
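If you want a dependency-free starting point for those per-entity numbers, the sketch below scores exact span matches between gold and predicted BIO sequences. The function names are hypothetical, and the handling of stray `I-` tags (ignored unless they continue an open span of the same type) is a design choice made here, not something taken from the training script. Libraries such as `seqeval` compute similar span-level metrics.

```python
from collections import Counter


def extract_spans(tags):
    """Return a set of (entity_type, start, end) spans from one BIO sequence."""
    spans, start, ent = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and ent == tag[2:]
        if ent is not None and not continues:
            spans.add((ent, start, i))  # end index is exclusive
            start, ent = None, None
        if tag.startswith("B-"):
            start, ent = i, tag[2:]
        # An I- tag with no open span of that type is treated like O (strict).
    return spans


def per_entity_scores(gold_seqs, pred_seqs):
    """Exact-match precision, recall and F1 for each entity type."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = extract_spans(gold), extract_spans(pred)
        for ent, _, _ in g & p:
            tp[ent] += 1
        for ent, _, _ in p - g:
            fp[ent] += 1
        for ent, _, _ in g - p:
            fn[ent] += 1
    scores = {}
    for ent in sorted(set(tp) | set(fp) | set(fn)):
        n_pred, n_gold = tp[ent] + fp[ent], tp[ent] + fn[ent]
        precision = tp[ent] / n_pred if n_pred else 0.0
        recall = tp[ent] / n_gold if n_gold else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[ent] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


gold = [["B-UNIT", "O", "B-LOC", "I-LOC"]]
pred = [["B-UNIT", "O", "B-LOC", "O"]]
print(per_entity_scores(gold, pred))
```

In this toy pair the UNIT span matches exactly, while the predicted LOC span is one token short and so counts as both a false positive and a false negative.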