flan-t5-large · UCDP Conflict Event Extraction

A fine-tuned Flan-T5-large model for structured conflict event extraction from news articles. Given a news article and its publication date, the model extracts 14 structured fields describing the conflict event — actors, dates, location, and casualty counts — in a single forward pass.

Model description

The model is fine-tuned on the UCDP Georeferenced Event Dataset (UCDP GED) corpus (synced on 2026-01-30). It learns to map a raw news article to a flat key-value string covering all fields of a UCDP conflict event record.

Base model: google/flan-t5-large (780M parameters)
Task: Sequence-to-sequence structured extraction
Input: Publication date + news article text
Output: 14 structured conflict event fields

License

Like the original Flan-T5 model, this model is released under the Apache 2.0 license.

Data

Before fine-tuning, we apply several filters to the UCDP data and keep:

Documents that represent single, one-day events
Documents that are longer than 100 characters
Documents that are shorter than 512 tokens; longer documents are truncated to 512 tokens by the tokenizer.
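The filters above can be sketched as a single predicate. This is a minimal illustration, not the actual preprocessing code: the function name `keep_document`, the `n_events` argument, and the reading of "one-day" as `start_date == end_date` are assumptions. The token limit is left out because over-long inputs are handled by tokenizer truncation.

```python
from datetime import date

MIN_CHARS = 100  # character threshold taken from the filter list


def keep_document(text: str, start_date: date, end_date: date, n_events: int) -> bool:
    """Return True if a document passes the single-event, one-day, and length filters.

    Assumptions (not confirmed by the model card): a "single" event means the
    document is linked to exactly one event record, and a "one-day" event means
    the start and end dates are identical.
    """
    is_single_event = n_events == 1
    is_one_day = start_date == end_date
    is_long_enough = len(text) > MIN_CHARS
    return is_single_event and is_one_day and is_long_enough
```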

An open, though much smaller, version of the training dataset (UCDP-AEC) is available here; see also the accompanying paper.

For more information on UCDP dataset creation, see UCDP's methodology section.

Splits

Split	Size	Publication Date Range
Train	109,947	1949-08-28 → 2023-12-31
Valid	10,445	2024-01-02 → 2024-12-31
Test	10,283	2025-01-02 → 2026-01-29
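The date ranges in the table suggest a purely temporal split. A minimal sketch, assuming the cut-offs are the calendar-year boundaries implied by the table (the function name `assign_split` is hypothetical):

```python
from datetime import date

TRAIN_END = date(2023, 12, 31)  # last publication date in the train range
VALID_END = date(2024, 12, 31)  # last publication date in the validation range


def assign_split(publication_date: date) -> str:
    """Assign a document to a split based only on its publication date."""
    if publication_date <= TRAIN_END:
        return "train"
    if publication_date <= VALID_END:
        return "valid"
    return "test"
```

Splitting by time rather than at random means the validation and test articles are always published after every training article.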

Extracted fields

Field	Description
`side_a_name`	Name of the first actor
`side_b_name`	Name of the second actor
`start_date`	Earliest possible date of conflict event (YYYY-MM-DD)
`end_date`	Latest possible date of conflict event (YYYY-MM-DD)
`location_root_name`	Country
`location_adm1_name`	ADM1 region
`location_adm2_name`	ADM2 region
`location_where_name`	Specific location name
`deaths_side_a`	Deaths on side A
`deaths_side_b`	Deaths on side B
`deaths_civilian`	Civilian deaths
`deaths_unknown`	Deaths of unknown affiliation
`deaths_low`	Low estimate of total deaths
`deaths_high`	High estimate of total deaths

Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("anastaph/t5-ucdp-conflict-extraction")
model     = AutoModelForSeq2SeqLM.from_pretrained("anastaph/t5-ucdp-conflict-extraction")

article = """
At least 12 people were killed when government forces clashed with rebel fighters
in the Tigray region of northern Ethiopia on Monday, local officials said.
The fighting broke out near the town of Shire and lasted several hours.
"""

input_text = f"2024-03-18 <extra_id_0>\n{article.strip()}"

inputs  = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(**inputs, max_new_tokens=150, num_beams=4, early_stopping=True)

print(tokenizer.decode(outputs[0], skip_special_tokens=False))

Example output:

side A: Ethiopian Armed Forces <extra_id_0> side B: Tigray People's Liberation Front <extra_id_0> start date: 2024-03-18 <extra_id_0> end date: 2024-03-18 <extra_id_0> country: Ethiopia <extra_id_0> ADM1: Tigray <extra_id_0> ADM2: North Western Zone <extra_id_0> where: Shire <extra_id_0> deaths side A: 0 <extra_id_0> deaths side B: 0 <extra_id_0> deaths civilian: 0 <extra_id_0> deaths unknown: 12 <extra_id_0> deaths low: 12 <extra_id_0> deaths high: 12
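The input string in the usage snippet follows the pattern "publication date, separator token, newline, article text". A small helper makes that format explicit; the name `build_input` is hypothetical and the format is simply copied from the snippet:

```python
from datetime import date

SEPARATOR = "<extra_id_0>"  # same sentinel token the usage snippet places after the date


def build_input(publication_date: date, article: str) -> str:
    """Format a publication date and article text the way the usage snippet does."""
    return f"{publication_date.isoformat()} {SEPARATOR}\n{article.strip()}"


example = build_input(date(2024, 3, 18), "\nAt least 12 people were killed.\n")
```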

Parsing the output

The output follows a fixed template separated by <extra_id_0> tokens. A simple parser:

import re

OUTPUT_TEMPLATE_REGEX = re.compile(
    r'side A: (.*?) <extra_id_0> '
    r'side B: (.*?) <extra_id_0> '
    r'start date: (.*?) <extra_id_0> '
    r'end date: (.*?) <extra_id_0> '
    r'country: (.*?) <extra_id_0> '
    r'ADM1: (.*?) <extra_id_0> '
    r'ADM2: (.*?) <extra_id_0> '
    r'where: (.*?) <extra_id_0> '
    r'deaths side A: (.*?) <extra_id_0> '
    r'deaths side B: (.*?) <extra_id_0> '
    r'deaths civilian: (.*?) <extra_id_0> '
    r'deaths unknown: (.*?) <extra_id_0> '
    r'deaths low: (.*?) <extra_id_0> '
    r'deaths high: (.*?)'
)

FIELDS = [
    'side_a_name', 'side_b_name', 'start_date', 'end_date',
    'location_root_name', 'location_adm1_name', 'location_adm2_name',
    'location_where_name', 'deaths_side_a', 'deaths_side_b',
    'deaths_civilian', 'deaths_unknown', 'deaths_low', 'deaths_high',
]

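decoded = tokenizer.decode(outputs[0], skip_special_tokens=False)
# Keep the <extra_id_0> separators but drop the <pad> start token and </s> end token
for token in (tokenizer.pad_token, tokenizer.eos_token):
    decoded = decoded.replace(token, "")
match = OUTPUT_TEMPLATE_REGEX.fullmatch(decoded.strip())
# result is None when the output does not follow the template
result = dict(zip(FIELDS, (v.strip() for v in match.groups()))) if match else None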
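The parser returns every field as a string. For downstream use it may help to convert dates and death counts to typed values. The following is a sketch under assumptions: the helper name `coerce_types` and the choice to map unparsable values to `None` are illustrative, not part of the model's tooling.

```python
from datetime import date

INT_FIELDS = (
    "deaths_side_a", "deaths_side_b", "deaths_civilian",
    "deaths_unknown", "deaths_low", "deaths_high",
)
DATE_FIELDS = ("start_date", "end_date")


def coerce_types(parsed: dict) -> dict:
    """Convert count fields to int and date fields to date; use None on failure."""
    typed = dict(parsed)
    for name in INT_FIELDS:
        try:
            typed[name] = int(parsed[name])
        except (KeyError, ValueError):
            typed[name] = None
    for name in DATE_FIELDS:
        try:
            typed[name] = date.fromisoformat(parsed[name])
        except (KeyError, ValueError):
            typed[name] = None
    return typed


# Hypothetical parser output, shortened to a few fields for illustration
sample = {"start_date": "2024-03-18", "end_date": "not a date",
          "deaths_low": "12", "deaths_high": "12", "side_a_name": "Ethiopian Armed Forces"}
typed_sample = coerce_types(sample)
```

Fields that are absent from the input, such as `deaths_side_a` in the sample, are also set to `None`, so callers can rely on every typed key being present.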

Training details


Parameter	Value
Base model	google/flan-t5-large
Training dataset	UCDP GED, synced on 2026-01-30
Training set size	109,947 articles (after filtering)
Test set size	10,283 articles
Batch size	16
Learning rate	3e-4 with linear warmup (10%)
Epochs	6 (early stopping on validation accuracy)
Max input length	512 tokens
Max output length	150 tokens
Hardware	1× NVIDIA GH200 120GB
Model selection	Best validation mean string accuracy

The learning rate of 3e-4 was selected via a parallel hyperparameter search comparing 1e-5, 3e-5, and 3e-4 on a 20k-sample subset. Training used epoch-level early stopping with patience of 3 consecutive epochs without improvement in mean string accuracy.
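The stopping rule described above can be illustrated with a short, framework-independent sketch. The function name and the example scores are hypothetical; only the patience of 3 epochs comes from the text.

```python
def early_stopping_epoch(scores, patience=3):
    """Return (best_epoch, last_epoch_run) for a list of per-epoch validation scores.

    Training stops once `patience` consecutive epochs fail to improve on the
    best score seen so far. Epochs are 0-indexed.
    """
    best_score = float("-inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, score in enumerate(scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(scores) - 1


# Made-up validation accuracies, for illustration only
best, stopped_at = early_stopping_epoch([0.70, 0.75, 0.74, 0.73, 0.72, 0.80])
```

In this made-up run the checkpoint from epoch 1 would be kept, and the final score of 0.80 is never reached because training has already stopped at epoch 4.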

Evaluation results

Evaluated on 10,283 held-out test articles.

Grouped accuracy

Group	Accuracy
Actor	90.2%
Date	69.6%
Location	72.8%
Deaths	87.8%
Overall	80.1%

Per-field accuracy

Field	Accuracy
side_a_name	91.3%
side_b_name	89.0%
start_date	70.1%
end_date	69.1%
location_root_name	96.2%
location_adm1_name	80.5%
location_adm2_name	65.5%
location_where_name	48.9%
deaths_side_a	95.0%
deaths_side_b	94.4%
deaths_civilian	89.2%
deaths_unknown	87.0%
deaths_low	75.6%
deaths_high	85.4%
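The grouped figures are numerically consistent with unweighted means of the per-field accuracies, and the overall figure with the unweighted mean of the four groups. The card does not state this aggregation explicitly, so the sketch below is a reading of the published numbers rather than the evaluation code itself; small differences come from the per-field values being rounded to one decimal.

```python
from statistics import mean

# Per-field accuracies (%) copied from the table above
PER_FIELD = {
    "side_a_name": 91.3, "side_b_name": 89.0,
    "start_date": 70.1, "end_date": 69.1,
    "location_root_name": 96.2, "location_adm1_name": 80.5,
    "location_adm2_name": 65.5, "location_where_name": 48.9,
    "deaths_side_a": 95.0, "deaths_side_b": 94.4, "deaths_civilian": 89.2,
    "deaths_unknown": 87.0, "deaths_low": 75.6, "deaths_high": 85.4,
}

# Assumed grouping of fields, inferred from the field names
GROUPS = {
    "Actor": ["side_a_name", "side_b_name"],
    "Date": ["start_date", "end_date"],
    "Location": ["location_root_name", "location_adm1_name",
                 "location_adm2_name", "location_where_name"],
    "Deaths": ["deaths_side_a", "deaths_side_b", "deaths_civilian",
               "deaths_unknown", "deaths_low", "deaths_high"],
}

grouped = {group: mean([PER_FIELD[f] for f in fields]) for group, fields in GROUPS.items()}
overall = mean(list(grouped.values()))  # mean over groups, not over all 14 fields
```

Averaging over groups rather than over fields gives the two date fields the same weight as the six death fields, which is why the overall figure is lower than a plain mean over all 14 fields would be.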

RMSE

Field	RMSE
start_date	127.12 days
end_date	119.37 days
deaths_side_a	1.25
deaths_side_b	5.49
deaths_civilian	2.17
deaths_unknown	2.29
deaths_low	6.05
deaths_high	19.09
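The card does not spell out how RMSE is computed. A minimal sketch, assuming the date error is the signed difference in days between predicted and reference dates and the count error is the plain numeric difference (function names and example values are hypothetical):

```python
from datetime import date
from math import sqrt


def rmse(errors):
    """Root mean squared error of a non-empty sequence of numeric errors."""
    return sqrt(sum(e * e for e in errors) / len(errors))


def date_rmse_days(predicted, reference):
    """RMSE in days between paired lists of predicted and reference dates."""
    return rmse([(p - r).days for p, r in zip(predicted, reference)])


# Made-up predictions: one exact match and one prediction that is 4 days late
pred = [date(2024, 3, 18), date(2024, 3, 20)]
gold = [date(2024, 3, 18), date(2024, 3, 16)]
example_rmse = date_rmse_days(pred, gold)
```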

Unparsable predictions (output did not match the expected template): 1 / 10,283 for both date fields — a template adherence rate of >99.99%.

Limitations

The model is trained on conflict events from UCDP GED and may not generalise well to conflict types or regions that are underrepresented in this dataset.
The model is trained on documents with a one-to-one mapping between document and conflict event. It is therefore unlikely to perform well on documents that describe no conflict event or several conflict events.
