flan-t5-large · UCDP Conflict Event Extraction

A fine-tuned Flan-T5-large model for structured conflict event extraction from news articles. Given a news article and its publication date, the model extracts 14 structured fields describing the conflict event (actors, dates, location, and casualty counts) in a single generation pass.

Model description

The model is fine-tuned on the UCDP Armed Conflict Dataset corpus (synced on 2026-01-30). It learns to map a raw news article to a flat key-value string covering all fields of a UCDP conflict event record.

Base model: google/flan-t5-large (780M parameters)
Task: Sequence-to-sequence structured extraction
Input: Publication date + news article text
Output: 14 structured conflict event fields

License

Like the original Flan-T5 model, this model is released under the Apache 2.0 license.

Data

Before fine-tuning, we apply several filters to the UCDP data and keep:

  • Documents that represent single, one-day events
  • Documents that are longer than 100 characters
  • Documents that are shorter than 512 tokens; longer documents are truncated to 512 tokens by the tokenizer.
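
The filters listed above can be sketched roughly as follows. This is an illustration, not the authors' preprocessing code: the field names `date_start`, `date_end` and `text` are hypothetical, the "single event" condition is not modelled, and truncation is shown on a plain token list rather than with the real tokenizer.

```python
from datetime import date
from typing import Dict, List

MIN_CHARS = 100   # character threshold from the filter list
MAX_TOKENS = 512  # token limit from the filter list


def keep_document(doc: Dict, min_chars: int = MIN_CHARS) -> bool:
    """Return True for one-day events whose text exceeds `min_chars` characters."""
    # Hypothetical field names; the real UCDP columns may be named differently.
    is_one_day = doc["date_start"] == doc["date_end"]
    is_long_enough = len(doc["text"]) > min_chars
    return is_one_day and is_long_enough


def truncate(tokens: List[str], max_tokens: int = MAX_TOKENS) -> List[str]:
    """Mimic tokenizer truncation: keep at most `max_tokens` tokens."""
    return tokens[:max_tokens]


# Toy documents, invented for illustration only.
docs = [
    {"date_start": date(2024, 3, 18), "date_end": date(2024, 3, 18), "text": "x" * 150},
    {"date_start": date(2024, 3, 18), "date_end": date(2024, 3, 20), "text": "x" * 150},
    {"date_start": date(2024, 3, 18), "date_end": date(2024, 3, 18), "text": "too short"},
]
kept = [d for d in docs if keep_document(d)]
print(len(kept))  # 1: only the first document passes both checks
```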

An open, though much smaller, version of the training dataset (UCDP-AEC) is available here; see also the accompanying paper.

For more information on UCDP dataset creation, see UCDP's methodology section.

Splits

Split Size Publication Date Range
Train 109,947 1949-08-28 → 2023-12-31
Valid 10,445 2024-01-02 → 2024-12-31
Test 10,283 2025-01-02 → 2026-01-29
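
The date ranges in the table suggest a chronological split by publication date. A minimal sketch of such a split is shown below. The cutoff dates are inferred from the table, and `assign_split` is a hypothetical helper, not part of the released code.

```python
from datetime import date

# Cutoffs inferred from the table above (assumption, not confirmed by the authors).
TRAIN_END = date(2023, 12, 31)
VALID_END = date(2024, 12, 31)


def assign_split(publication_date: date) -> str:
    """Assign an article to a split based only on its publication date."""
    if publication_date <= TRAIN_END:
        return "train"
    if publication_date <= VALID_END:
        return "valid"
    return "test"


print(assign_split(date(2023, 6, 1)))   # train
print(assign_split(date(2024, 6, 1)))   # valid
print(assign_split(date(2025, 6, 1)))   # test
```

Splitting by time rather than at random means the validation and test articles are all published after the training articles.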

Extracted fields

Field Description
side_a_name Name of the first actor
side_b_name Name of the second actor
start_date Earliest possible date of conflict event (YYYY-MM-DD)
end_date Latest possible date of conflict event (YYYY-MM-DD)
location_root_name Country
location_adm1_name ADM1 region
location_adm2_name ADM2 region
location_where_name Specific location name
deaths_side_a Casualties on side A
deaths_side_b Casualties on side B
deaths_civilian Civilian casualties
deaths_unknown Casualties of unknown affiliation
deaths_low Low estimate of total deaths
deaths_high High estimate of total deaths

Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("anastaph/t5-ucdp-conflict-extraction")
model     = AutoModelForSeq2SeqLM.from_pretrained("anastaph/t5-ucdp-conflict-extraction")

article = """
At least 12 people were killed when government forces clashed with rebel fighters
in the Tigray region of northern Ethiopia on Monday, local officials said.
The fighting broke out near the town of Shire and lasted several hours.
"""

input_text = f"2024-03-18 <extra_id_0>\n{article.strip()}"

inputs  = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(**inputs, max_new_tokens=150, num_beams=4, early_stopping=True)

print(tokenizer.decode(outputs[0], skip_special_tokens=False))

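Example output (the leading <pad> and trailing </s> special tokens that decode() emits with skip_special_tokens=False are omitted here):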

side A: Ethiopian Armed Forces <extra_id_0> side B: Tigray People's Liberation Front <extra_id_0> start date: 2024-03-18 <extra_id_0> end date: 2024-03-18 <extra_id_0> country: Ethiopia <extra_id_0> ADM1: Tigray <extra_id_0> ADM2: North Western Zone <extra_id_0> where: Shire <extra_id_0> deaths side A: 0 <extra_id_0> deaths side B: 0 <extra_id_0> deaths civilian: 0 <extra_id_0> deaths unknown: 12 <extra_id_0> deaths low: 12 <extra_id_0> deaths high: 12
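
The input format used in the snippet above (publication date, an `<extra_id_0>` separator, a newline, then the article text) can be wrapped in a small helper. `build_input` is a hypothetical name introduced here for illustration. It only reproduces the f-string from the usage example.

```python
from datetime import date

SEPARATOR = "<extra_id_0>"  # separator token used in the usage example


def build_input(publication_date: date, article: str) -> str:
    """Format a model input as '<YYYY-MM-DD> <extra_id_0>\\n<article>'."""
    return f"{publication_date.isoformat()} {SEPARATOR}\n{article.strip()}"


example = build_input(date(2024, 3, 18), "  Fighting broke out near Shire.  ")
print(example)
```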

Parsing the output

The output follows a fixed template separated by <extra_id_0> tokens. A simple parser:

import re

OUTPUT_TEMPLATE_REGEX = re.compile(
    r'side A: (.*?) <extra_id_0> '
    r'side B: (.*?) <extra_id_0> '
    r'start date: (.*?) <extra_id_0> '
    r'end date: (.*?) <extra_id_0> '
    r'country: (.*?) <extra_id_0> '
    r'ADM1: (.*?) <extra_id_0> '
    r'ADM2: (.*?) <extra_id_0> '
    r'where: (.*?) <extra_id_0> '
    r'deaths side A: (.*?) <extra_id_0> '
    r'deaths side B: (.*?) <extra_id_0> '
    r'deaths civilian: (.*?) <extra_id_0> '
    r'deaths unknown: (.*?) <extra_id_0> '
    r'deaths low: (.*?) <extra_id_0> '
    r'deaths high: (.*?)'
)

FIELDS = [
    'side_a_name', 'side_b_name', 'start_date', 'end_date',
    'location_root_name', 'location_adm1_name', 'location_adm2_name',
    'location_where_name', 'deaths_side_a', 'deaths_side_b',
    'deaths_civilian', 'deaths_unknown', 'deaths_low', 'deaths_high',
]

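# Keep the <extra_id_0> separators, but drop the <pad> start token and the </s>
# end token, which would otherwise make fullmatch fail.
decoded = tokenizer.decode(outputs[0], skip_special_tokens=False)
decoded = decoded.replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "").strip()
match = OUTPUT_TEMPLATE_REGEX.fullmatch(decoded)
if match:
    result = dict(zip(FIELDS, (v.strip() for v in match.groups())))
else:
    result = None  # the output did not follow the expected template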
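
The parser above returns every field as a string. A possible post-processing step, sketched below, converts dates and death counts to typed values and flags values that cannot be converted. The helper names (`to_date`, `to_count`, `coerce`) are hypothetical, and mapping unconvertible values to `None` is a design choice made for this sketch.

```python
from datetime import date
from typing import Dict, Optional, Union

DATE_FIELDS = ("start_date", "end_date")
COUNT_FIELDS = (
    "deaths_side_a", "deaths_side_b", "deaths_civilian",
    "deaths_unknown", "deaths_low", "deaths_high",
)


def to_date(value: str) -> Optional[date]:
    """Parse a YYYY-MM-DD string; return None if it is not a valid date."""
    try:
        return date.fromisoformat(value.strip())
    except ValueError:
        return None


def to_count(value: str) -> Optional[int]:
    """Parse a non-negative integer; return None otherwise."""
    try:
        number = int(value)
    except ValueError:
        return None
    return number if number >= 0 else None


def coerce(result: Dict[str, str]) -> Dict[str, Union[str, date, int, None]]:
    """Return a copy of `result` with date and count fields converted."""
    typed: Dict[str, Union[str, date, int, None]] = dict(result)
    for field in DATE_FIELDS:
        typed[field] = to_date(result[field])
    for field in COUNT_FIELDS:
        typed[field] = to_count(result[field])
    return typed


# Values taken from the example output in the Usage section.
parsed = {
    "side_a_name": "Ethiopian Armed Forces", "start_date": "2024-03-18",
    "end_date": "2024-03-18", "deaths_side_a": "0", "deaths_side_b": "0",
    "deaths_civilian": "0", "deaths_unknown": "12", "deaths_low": "12",
    "deaths_high": "12",
}
typed = coerce(parsed)
print(typed["start_date"], typed["deaths_unknown"])
```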

Training details

Base model google/flan-t5-large
Training dataset UCDP GED, synced on 2026-01-30
Training set size 109,947 articles (after filtering)
Test set size 10,283 articles
Batch size 16
Learning rate 3e-4 with linear warmup (10%)
Epochs 6 (early stopping on validation accuracy)
Max input length 512 tokens
Max output length 150 tokens
Hardware 1× NVIDIA GH200 120GB
Model selection Best validation mean string accuracy

The learning rate of 3e-4 was selected via a parallel hyperparameter search comparing 1e-5, 3e-5, and 3e-4 on a 20k-sample subset. Training used epoch-level early stopping with a patience of 3, i.e. it stopped after 3 consecutive epochs without improvement in validation mean string accuracy.
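
The two training mechanisms described above, linear warmup over the first 10% of steps and patience-based early stopping, can be sketched in plain Python. This is not the authors' training loop. The learning-rate schedule after warmup is not stated in this card, so the sketch simply holds the peak rate (an assumption), and the validation scores in the example are invented.

```python
def warmup_lr(step: int, total_steps: int,
              peak_lr: float = 3e-4, warmup_frac: float = 0.1) -> float:
    """Linearly ramp the learning rate to `peak_lr` over the warmup steps."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Assumption: hold the peak rate; the post-warmup schedule is not documented.
    return peak_lr


class EarlyStopper:
    """Stop after `patience` consecutive epochs without a new best metric."""

    def __init__(self, patience: int = 3) -> None:
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def update(self, metric: float) -> bool:
        """Record one epoch's validation metric; return True when training should stop."""
        if metric > self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopper(patience=3)
toy_scores = [0.70, 0.75, 0.74, 0.73, 0.72]  # made-up validation accuracies
decisions = [stopper.update(score) for score in toy_scores]
print(decisions)     # [False, False, False, False, True]
print(stopper.best)  # 0.75
```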

Evaluation results

Evaluated on 10,283 held-out test articles.

Grouped accuracy

Group Accuracy
Actor 90.2%
Date 69.6%
Location 72.8%
Deaths 87.8%
Overall 80.1%

Per-field accuracy

Field Accuracy
side_a_name 91.3%
side_b_name 89.0%
start_date 70.1%
end_date 69.1%
location_root_name 96.2%
location_adm1_name 80.5%
location_adm2_name 65.5%
location_where_name 48.9%
deaths_side_a 95.0%
deaths_side_b 94.4%
deaths_civilian 89.2%
deaths_unknown 87.0%
deaths_low 75.6%
deaths_high 85.4%
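
The card does not say how the grouped and overall figures are aggregated. The reported numbers are, however, consistent with each group accuracy being the unweighted mean of its per-field accuracies, and the overall figure being the unweighted mean of the four group accuracies. The check below reproduces this from the rounded values in the tables. It is an observation about the published numbers, not a confirmed definition of the metric.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-field accuracies (%) copied from the table above.
PER_FIELD = {
    "side_a_name": "91.3", "side_b_name": "89.0",
    "start_date": "70.1", "end_date": "69.1",
    "location_root_name": "96.2", "location_adm1_name": "80.5",
    "location_adm2_name": "65.5", "location_where_name": "48.9",
    "deaths_side_a": "95.0", "deaths_side_b": "94.4",
    "deaths_civilian": "89.2", "deaths_unknown": "87.0",
    "deaths_low": "75.6", "deaths_high": "85.4",
}

# Field-to-group assignment assumed from the field name prefixes.
GROUPS = {
    "Actor": ["side_a_name", "side_b_name"],
    "Date": ["start_date", "end_date"],
    "Location": [f for f in PER_FIELD if f.startswith("location_")],
    "Deaths": [f for f in PER_FIELD if f.startswith("deaths_")],
}


def mean_1dp(values):
    """Unweighted mean, rounded half-up to one decimal place."""
    total = sum(values, Decimal(0))
    return (total / len(values)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


group_acc = {
    name: mean_1dp([Decimal(PER_FIELD[f]) for f in fields])
    for name, fields in GROUPS.items()
}
overall = mean_1dp(list(group_acc.values()))

for name, value in group_acc.items():
    print(name, value)      # Actor 90.2, Date 69.6, Location 72.8, Deaths 87.8
print("Overall", overall)   # Overall 80.1
```

Note that averaging all 14 fields directly would give about 81.2%, so the overall figure appears to weight the four groups equally rather than the individual fields.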

RMSE

Field RMSE
start_date 127.12 days
end_date 119.37 days
deaths_side_a 1.25
deaths_side_b 5.49
deaths_civilian 2.17
deaths_unknown 2.29
deaths_low 6.05
deaths_high 19.09

Unparsable predictions (output did not match the expected template): 1 / 10,283 for both date fields, a template adherence rate of >99.99%.
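
For reference, the RMSE figures above can be computed along the following lines: date errors are measured in days and death-count errors in raw counts. This is a minimal sketch with invented toy values. It assumes unparsable predictions have already been excluded, which the card does not specify, and the helper names are hypothetical.

```python
import math
from datetime import date
from typing import List, Sequence


def rmse(errors: Sequence[float]) -> float:
    """Root mean squared error of a sequence of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def day_errors(predicted: Sequence[date], gold: Sequence[date]) -> List[int]:
    """Signed difference in days between predicted and gold dates."""
    return [(p - g).days for p, g in zip(predicted, gold)]


# Toy data, invented for illustration only.
pred_dates = [date(2024, 3, 18), date(2024, 3, 20)]
gold_dates = [date(2024, 3, 18), date(2024, 3, 17)]
pred_deaths = [12, 0]
gold_deaths = [10, 0]

date_rmse = rmse(day_errors(pred_dates, gold_dates))
death_rmse = rmse([p - g for p, g in zip(pred_deaths, gold_deaths)])
print(f"{date_rmse:.2f} days")  # 2.12 days
print(f"{death_rmse:.2f}")      # 1.41
```

Because errors are squared, a few predictions that are far off (for example a wrong year) can dominate the date RMSE even when most dates are exact.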

Limitations

  • The model is trained on conflict events from UCDP GED and may not generalise well to conflict types or regions that are underrepresented in this dataset.
  • The model is trained on documents with a one-to-one mapping between documents and conflict events. It will therefore not perform well on documents that describe no conflict event or several conflict events.