flan-t5-large · UCDP Conflict Event Extraction
A fine-tuned Flan-T5-large model for structured conflict event extraction from news articles. Given a news article and its publication date, the model extracts 14 structured fields describing the conflict event (actors, dates, location, and casualty counts) in a single generation call.
Model description
The model is fine-tuned on the UCDP Armed Conflict Dataset corpus (synced on 2026-01-30). It learns to map a raw news article to a flat key-value string covering all fields of a UCDP conflict event record.
- Base model: google/flan-t5-large (780M parameters)
- Task: Sequence-to-sequence structured extraction
- Input: Publication date + news article text
- Output: 14 structured conflict event fields
License
Like the original Flan-T5 model, this model is released under the Apache 2.0 license.
Data
Before fine-tuning, we apply several filters to the UCDP data and keep:
- Documents that represent single, one-day events
- Documents that are longer than 100 characters
- Documents that are shorter than 512 tokens. Longer documents are truncated by the tokenizer.
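The filters above can be sketched as a predicate plus a truncation step. This is a minimal illustration under stated assumptions, not the authors' preprocessing code: the record keys `text`, `start_date`, `end_date`, and `n_events` are hypothetical, "single, one-day event" is read as one linked event whose start and end dates coincide, and whitespace splitting stands in for the model tokenizer.

```python
MIN_CHARS = 100   # minimum document length, from the filter list
MAX_TOKENS = 512  # maximum input length, from the filter list


def keep_document(doc):
    """Return True if a document passes the event and minimum-length filters.

    `doc` is a dict with hypothetical keys: text, start_date, end_date, n_events.
    """
    single_event = doc["n_events"] == 1
    one_day = doc["start_date"] == doc["end_date"]
    long_enough = len(doc["text"]) > MIN_CHARS
    return single_event and one_day and long_enough


def truncate(text, max_tokens=MAX_TOKENS):
    """Cut a document to at most `max_tokens` whitespace-separated tokens.

    Assumption: whitespace tokens replace the real subword tokenizer here,
    so the cut-off point differs from what the model actually sees.
    """
    return " ".join(text.split()[:max_tokens])
```

In practice the truncation is done by the tokenizer call itself (`max_length=512, truncation=True`), so only the predicate would need to run ahead of time.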
An open, though much smaller, version of the training dataset (UCDP-AEC) is available; see also the accompanying paper.
For more information on UCDP dataset creation, see UCDP's methodology section.
Splits
| Split | Size | Publication Date Range |
|---|---|---|
| Train | 109,947 | 1949-08-28 to 2023-12-31 |
| Valid | 10,445 | 2024-01-02 to 2024-12-31 |
| Test | 10,283 | 2025-01-02 to 2026-01-29 |
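The date ranges in the table suggest a chronological split by publication date. A minimal sketch of such an assignment follows; the cut-off dates are inferred from the table, and the function name `assign_split` is hypothetical rather than taken from the training code.

```python
from datetime import date

# Cut-offs inferred from the split table: train ends 2023, valid covers 2024.
TRAIN_END = date(2023, 12, 31)
VALID_END = date(2024, 12, 31)


def assign_split(publication_date):
    """Map a publication date to 'train', 'valid', or 'test'."""
    if publication_date <= TRAIN_END:
        return "train"
    if publication_date <= VALID_END:
        return "valid"
    return "test"
```

Splitting by time rather than at random means the validation and test articles are always published after every training article.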
Extracted fields
| Field | Description |
|---|---|
| side_a_name | Name of the first actor |
| side_b_name | Name of the second actor |
| start_date | Earliest possible date of conflict event (YYYY-MM-DD) |
| end_date | Latest possible date of conflict event (YYYY-MM-DD) |
| location_root_name | Country |
| location_adm1_name | ADM1 region |
| location_adm2_name | ADM2 region |
| location_where_name | Specific location name |
| deaths_side_a | Casualties on side A |
| deaths_side_b | Casualties on side B |
| deaths_civilian | Civilian casualties |
| deaths_unknown | Casualties of unknown affiliation |
| deaths_low | Low estimate of total deaths |
| deaths_high | High estimate of total deaths |
Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("anastaph/t5-ucdp-conflict-extraction")
model = AutoModelForSeq2SeqLM.from_pretrained("anastaph/t5-ucdp-conflict-extraction")

article = """
At least 12 people were killed when government forces clashed with rebel fighters
in the Tigray region of northern Ethiopia on Monday, local officials said.
The fighting broke out near the town of Shire and lasted several hours.
"""

# Input format: publication date, the <extra_id_0> separator, a newline, then the article.
input_text = f"2024-03-18 <extra_id_0>\n{article.strip()}"
inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(**inputs, max_new_tokens=150, num_beams=4, early_stopping=True)

# Keep special tokens when decoding: <extra_id_0> separates the output fields.
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```
Example output:
side A: Ethiopian Armed Forces <extra_id_0> side B: Tigray People's Liberation Front <extra_id_0> start date: 2024-03-18 <extra_id_0> end date: 2024-03-18 <extra_id_0> country: Ethiopia <extra_id_0> ADM1: Tigray <extra_id_0> ADM2: North Western Zone <extra_id_0> where: Shire <extra_id_0> deaths side A: 0 <extra_id_0> deaths side B: 0 <extra_id_0> deaths civilian: 0 <extra_id_0> deaths unknown: 12 <extra_id_0> deaths low: 12 <extra_id_0> deaths high: 12
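The input and output formats shown above can be reproduced with two small helpers. This is a reconstruction from the example, not the training code: the label list is read off the example output, and the names `build_input` and `format_target` are hypothetical.

```python
SEP = "<extra_id_0>"

# Field labels in template order, as they appear in the example output.
LABELS = [
    "side A", "side B", "start date", "end date",
    "country", "ADM1", "ADM2", "where",
    "deaths side A", "deaths side B", "deaths civilian",
    "deaths unknown", "deaths low", "deaths high",
]


def build_input(publication_date, article):
    """Build the model input: date, separator, newline, stripped article."""
    return f"{publication_date} {SEP}\n{article.strip()}"


def format_target(values):
    """Serialise 14 field values into the flat 'label: value' template."""
    if len(values) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} values, got {len(values)}")
    pairs = (f"{label}: {value}" for label, value in zip(LABELS, values))
    return f" {SEP} ".join(pairs)
```

Reusing a sentinel token as the separator keeps field boundaries unambiguous, because the token cannot occur in ordinary article text.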
Parsing the output
The output follows a fixed template whose fields are separated by <extra_id_0> tokens. A simple parser:
```python
import re

# One capture group per field, in template order.
OUTPUT_TEMPLATE_REGEX = re.compile(
    r'side A: (.*?) <extra_id_0> '
    r'side B: (.*?) <extra_id_0> '
    r'start date: (.*?) <extra_id_0> '
    r'end date: (.*?) <extra_id_0> '
    r'country: (.*?) <extra_id_0> '
    r'ADM1: (.*?) <extra_id_0> '
    r'ADM2: (.*?) <extra_id_0> '
    r'where: (.*?) <extra_id_0> '
    r'deaths side A: (.*?) <extra_id_0> '
    r'deaths side B: (.*?) <extra_id_0> '
    r'deaths civilian: (.*?) <extra_id_0> '
    r'deaths unknown: (.*?) <extra_id_0> '
    r'deaths low: (.*?) <extra_id_0> '
    r'deaths high: (.*?)'
)

FIELDS = [
    'side_a_name', 'side_b_name', 'start_date', 'end_date',
    'location_root_name', 'location_adm1_name', 'location_adm2_name',
    'location_where_name', 'deaths_side_a', 'deaths_side_b',
    'deaths_civilian', 'deaths_unknown', 'deaths_low', 'deaths_high',
]

decoded = tokenizer.decode(outputs[0], skip_special_tokens=False)
# T5 generations normally start with <pad> and end with </s>;
# strip both so that fullmatch() sees only the template.
decoded = decoded.replace(tokenizer.pad_token, '').replace(tokenizer.eos_token, '').strip()

match = OUTPUT_TEMPLATE_REGEX.fullmatch(decoded)
if match:
    result = dict(zip(FIELDS, (v.strip() for v in match.groups())))
else:
    result = None  # the output did not follow the template
```
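The parser returns every field as a string. For downstream use, the date and count fields usually need to be converted; a minimal sketch follows. The helper name `to_typed` is hypothetical, and mapping unparsable values to `None` is an assumption about how one might want to handle them, not documented model behaviour.

```python
from datetime import date

DATE_FIELDS = ("start_date", "end_date")
COUNT_FIELDS = (
    "deaths_side_a", "deaths_side_b", "deaths_civilian",
    "deaths_unknown", "deaths_low", "deaths_high",
)


def to_typed(result):
    """Convert date fields to datetime.date and death counts to int.

    Values that cannot be converted become None; other fields stay strings.
    """
    typed = dict(result)
    for field in DATE_FIELDS:
        try:
            typed[field] = date.fromisoformat(result[field])
        except ValueError:
            typed[field] = None
    for field in COUNT_FIELDS:
        try:
            typed[field] = int(result[field])
        except ValueError:
            typed[field] = None
    return typed


# Example input, taken from the example output in this card.
example = {
    "side_a_name": "Ethiopian Armed Forces",
    "side_b_name": "Tigray People's Liberation Front",
    "start_date": "2024-03-18", "end_date": "2024-03-18",
    "location_root_name": "Ethiopia", "location_adm1_name": "Tigray",
    "location_adm2_name": "North Western Zone", "location_where_name": "Shire",
    "deaths_side_a": "0", "deaths_side_b": "0", "deaths_civilian": "0",
    "deaths_unknown": "12", "deaths_low": "12", "deaths_high": "12",
}
typed_example = to_typed(example)
```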
Training details
| Setting | Value |
|---|---|
| Base model | google/flan-t5-large |
| Training dataset | UCDP GED, synced on 2026-01-30 |
| Training set size | 109,947 articles (after filtering) |
| Test set size | 10,283 articles |
| Batch size | 16 |
| Learning rate | 3e-4 with linear warmup (10%) |
| Epochs | 6 (early stopping on validation accuracy) |
| Max input length | 512 tokens |
| Max output length | 150 tokens |
| Hardware | 1× NVIDIA GH200 120GB |
| Model selection | Best validation mean string accuracy |
The learning rate of 3e-4 was selected via a parallel hyperparameter search comparing 1e-5, 3e-5, and 3e-4 on a 20k-sample subset. Training used epoch-level early stopping with a patience of 3: it stops after 3 consecutive epochs without improvement in validation mean string accuracy.
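The early-stopping rule described above can be sketched as follows. This is an illustration of patience-based stopping in general, not the actual training loop; the function name `run_early_stopping` and the validation scores in the usage example are hypothetical.

```python
def run_early_stopping(val_scores, patience=3):
    """Simulate epoch-level early stopping on a sequence of validation scores.

    Returns (best_epoch, best_score, last_epoch_run), with epochs 1-indexed.
    Training stops once `patience` consecutive epochs fail to beat the best score.
    """
    best_epoch, best_score = None, float("-inf")
    epochs_without_improvement = 0
    last_epoch = 0
    for epoch, score in enumerate(val_scores, start=1):
        last_epoch = epoch
        if score > best_score:
            best_epoch, best_score = epoch, score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, best_score, last_epoch


# Hypothetical validation accuracies, one per epoch.
scores = [0.70, 0.75, 0.78, 0.77, 0.78, 0.76, 0.80]
best_epoch, best_score, last_epoch = run_early_stopping(scores, patience=3)
```

With these made-up scores the best checkpoint is epoch 3 and training stops after epoch 6, so the later value of 0.80 is never reached. That is the usual trade-off of a finite patience.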
Evaluation results
Evaluated on 10,283 held-out test articles.
Grouped accuracy
| Group | Accuracy |
|---|---|
| Actor | 90.2% |
| Date | 69.6% |
| Location | 72.8% |
| Deaths | 87.8% |
| Overall | 80.1% |
Per-field accuracy
| Field | Accuracy |
|---|---|
| side_a_name | 91.3% |
| side_b_name | 89.0% |
| start_date | 70.1% |
| end_date | 69.1% |
| location_root_name | 96.2% |
| location_adm1_name | 80.5% |
| location_adm2_name | 65.5% |
| location_where_name | 48.9% |
| deaths_side_a | 95.0% |
| deaths_side_b | 94.4% |
| deaths_civilian | 89.2% |
| deaths_unknown | 87.0% |
| deaths_low | 75.6% |
| deaths_high | 85.4% |
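The grouped figures reported above are consistent with unweighted means of the per-field accuracies, and the overall figure with the mean of the four group values (the mean over all 14 fields would be about 81.2%). This is an observation from the published numbers, not a documented definition; the grouping of fields below is assumed from their names.

```python
# Per-field accuracies (%) copied from the per-field table.
per_field = {
    "side_a_name": 91.3, "side_b_name": 89.0,
    "start_date": 70.1, "end_date": 69.1,
    "location_root_name": 96.2, "location_adm1_name": 80.5,
    "location_adm2_name": 65.5, "location_where_name": 48.9,
    "deaths_side_a": 95.0, "deaths_side_b": 94.4, "deaths_civilian": 89.2,
    "deaths_unknown": 87.0, "deaths_low": 75.6, "deaths_high": 85.4,
}

# Assumed grouping, based on field name prefixes.
groups = {
    "Actor": ["side_a_name", "side_b_name"],
    "Date": ["start_date", "end_date"],
    "Location": [f for f in per_field if f.startswith("location_")],
    "Deaths": [f for f in per_field if f.startswith("deaths_")],
}


def mean(values):
    return sum(values) / len(values)


group_acc = {name: mean([per_field[f] for f in fields]) for name, fields in groups.items()}
overall = mean(list(group_acc.values()))       # mean of the four group means
all_fields_mean = mean(list(per_field.values()))  # mean over all 14 fields

for name, acc in group_acc.items():
    print(f"{name}: {acc:.2f}%")
print(f"Overall (mean of groups): {overall:.2f}%")
```

Averaging over groups gives the two date fields the same weight as the six death-count fields, which is why the overall figure is lower than a plain per-field average.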
RMSE
| Field | RMSE |
|---|---|
| start_date | 127.12 days |
| end_date | 119.37 days |
| deaths_side_a | 1.25 |
| deaths_side_b | 5.49 |
| deaths_civilian | 2.17 |
| deaths_unknown | 2.29 |
| deaths_low | 6.05 |
| deaths_high | 19.09 |
Unparsable predictions (output did not match the expected template): 1 / 10,283 for both date fields, a template adherence rate of >99.99%.
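For reference, exact-match accuracy and date RMSE in days can be computed along the following lines. This is a sketch of the standard definitions, not the evaluation script behind the tables: the function names are hypothetical, and excluding unparsable dates from the RMSE is an assumption.

```python
import math
from datetime import date


def parse_date(value):
    """Parse a YYYY-MM-DD string; return None if it is not a valid date."""
    try:
        return date.fromisoformat(value)
    except ValueError:
        return None


def exact_match_accuracy(preds, golds):
    """Share of predictions that equal the gold string exactly."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def date_rmse_days(preds, golds):
    """RMSE in days over parsable pairs, plus the number of skipped pairs."""
    squared_errors, unparsable = [], 0
    for pred, gold in zip(preds, golds):
        pred_date, gold_date = parse_date(pred), parse_date(gold)
        if pred_date is None or gold_date is None:
            unparsable += 1  # assumption: unparsable pairs are excluded
            continue
        squared_errors.append((pred_date - gold_date).days ** 2)
    if not squared_errors:
        return float("nan"), unparsable
    return math.sqrt(sum(squared_errors) / len(squared_errors)), unparsable


# Made-up predictions and references for illustration.
preds = ["2024-03-18", "2024-03-20", "not a date"]
golds = ["2024-03-18", "2024-03-16", "2024-01-01"]
accuracy = exact_match_accuracy(preds, golds)
rmse, skipped = date_rmse_days(preds, golds)
```

Because errors are squared, RMSE is dominated by the largest misses, so a small number of dates that are off by a long interval can produce a large RMSE even when most predictions are exact.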
Limitations
- The model is trained on conflict events from UCDP-GED and may not generalise well to conflict types or regions underrepresented in this dataset.
- The model is trained on documents that each map to exactly one conflict event. It is therefore not expected to perform well on documents that describe no conflict event or several conflict events.