DiSCo Situation-Entity Classifier (18-way)

A roberta-base model pre-trained on the SitEnt corpus (Friedrich, 2016) and then fine-tuned on the DiSCo corpus (Hemmatian, 2022; forthcoming) to label an English Elementary Discourse Unit (roughly a clause) with one of 18 Situation Entity (SE) types. The labels extend Smith's (2003) discourse-mode framework with Grisot's (2018) notion of boundedness (see Hemmatian, 2022, for details). Each SE label decomposes into three content-agnostic attributes used in downstream analyses:

  • Genericity of the main referent: specific / generic
  • Eventivity of the main verb constellation: dynamic / stative
  • Boundedness/Habituality of the eventuality: static / episodic / habitual

This is the SE-classification component of the DiSCo pipeline. Its companion model, BabakScrapes/disco-clause-segmenter, runs first and splits raw text into clauses.

See the demos.

Label set

The model outputs 18 classes. The mapping to (genericity, eventivity, boundedness) is:

id  SE label                          genericity  eventivity  boundedness
0   BOUNDED EVENT (SPECIFIC)          specific    dynamic     episodic
1   BOUNDED EVENT (GENERIC)           generic     dynamic     episodic
2   UNBOUNDED EVENT (SPECIFIC)        specific    dynamic     static
3   UNBOUNDED EVENT (GENERIC)         generic     dynamic     static
4   BASIC STATE                       specific    stative     static
5   COERCED STATE (SPECIFIC)          specific    dynamic     static
6   COERCED STATE (GENERIC)           generic     dynamic     static
7   PERFECT COERCED STATE (SPECIFIC)  specific    dynamic     episodic
8   PERFECT COERCED STATE (GENERIC)   generic     dynamic     episodic
9   GENERIC SENTENCE (DYNAMIC)        generic     dynamic     habitual
10  GENERIC SENTENCE (STATIC)         generic     stative     static
11  GENERIC SENTENCE (HABITUAL)       generic     stative     habitual
12  GENERALIZING SENTENCE (DYNAMIC)   specific    dynamic     habitual
13  GENERALIZING SENTENCE (STATIVE)   specific    stative     habitual
14  QUESTION                          NA          NA          NA
15  IMPERATIVE                        NA          NA          NA
16  NONSENSE                          NA          NA          NA
17  OTHER                             NA          NA          NA

Classes 14–17 carry no attribute decomposition per the definitions of the linguistic attributes and are excluded from attribute-share calculations in the DiSCo analyses.
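The table can be encoded as a lookup so that predicted ids decompose into attributes and the NA classes drop out of share calculations. The following is a minimal sketch, not the DiSCo analysis code: the names SE_ATTRIBUTES, ATTRIBUTE_NAMES, and attribute_shares are illustrative, and normalizing by the number of non-NA clauses is an assumption about how an "attribute share" is defined.

```python
from collections import Counter

# (genericity, eventivity, boundedness) per class id, copied from the table.
# None marks the NA classes (14-17), which have no decomposition.
SE_ATTRIBUTES = {
    0: ("specific", "dynamic", "episodic"),
    1: ("generic", "dynamic", "episodic"),
    2: ("specific", "dynamic", "static"),
    3: ("generic", "dynamic", "static"),
    4: ("specific", "stative", "static"),
    5: ("specific", "dynamic", "static"),
    6: ("generic", "dynamic", "static"),
    7: ("specific", "dynamic", "episodic"),
    8: ("generic", "dynamic", "episodic"),
    9: ("generic", "dynamic", "habitual"),
    10: ("generic", "stative", "static"),
    11: ("generic", "stative", "habitual"),
    12: ("specific", "dynamic", "habitual"),
    13: ("specific", "stative", "habitual"),
    14: None,
    15: None,
    16: None,
    17: None,
}

ATTRIBUTE_NAMES = ("genericity", "eventivity", "boundedness")


def attribute_shares(pred_ids):
    """Return, per attribute, the share of each value among non-NA clauses."""
    counts = {name: Counter() for name in ATTRIBUTE_NAMES}
    n_decomposable = 0
    for pred_id in pred_ids:
        attrs = SE_ATTRIBUTES[pred_id]
        if attrs is None:  # classes 14-17 are skipped (assumed exclusion rule)
            continue
        n_decomposable += 1
        for name, value in zip(ATTRIBUTE_NAMES, attrs):
            counts[name][value] += 1
    if n_decomposable == 0:
        return {name: {} for name in ATTRIBUTE_NAMES}
    return {
        name: {value: c / n_decomposable for value, c in counter.items()}
        for name, counter in counts.items()
    }
```

For example, attribute_shares([0, 4, 9, 14]) ignores the QUESTION clause and computes each share over the remaining three clauses.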

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# The tokenizer is the stock roberta-base one; the classifier weights come from the DiSCo repo.
tok = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
model = AutoModelForSequenceClassification.from_pretrained("BabakScrapes/disco-se-classifier").eval()

# The input is a single clause, not a full sentence or paragraph.
clause = "my friend smoked marijuana daily"
enc = tok(clause, return_tensors="pt", truncation=True, max_length=128)
with torch.inference_mode():
    # Pick the highest-scoring of the 18 classes.
    pred_id = model(**enc).logits.argmax(-1).item()
print(model.config.id2label[pred_id])  # label as stored in the model config
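The snippet above reports only the argmax label. When a ranked list with probabilities is more useful, the logits can be passed through a softmax first. The helper below is a hedged sketch in plain Python; top_k_labels is a hypothetical name, not part of transformers or of the DiSCo code.

```python
import math


def top_k_labels(logits, id2label, k=3):
    """Turn one row of raw logits into the k most probable (label, probability) pairs."""
    peak = max(logits)  # subtract the max before exponentiating for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank class ids from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]
```

With the objects from the usage snippet, a call would look like top_k_labels(model(**enc).logits[0].tolist(), model.config.id2label).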

Training data

Pre-trained on the SitEnt corpus, then fine-tuned on the DiSCo corpus clause annotations: opinionated, mixed-register English text (news from four outlets across the political spectrum, Reddit, and AI-generated text) primarily on the topic of marijuana legalization. This domain is deliberately harder and more varied than the Wikipedia-dominated SitEnt corpus on which prior SE classifiers were trained.

Performance

Training-time validation performance (seed 42, random 10% validation split):

Task                               Accuracy  Macro-F1  Micro-F1
18-way SE label                    .737      .514      .689
Genericity (3-class)               .860      .852      .841
Eventivity (3-class)               .894      .879      .873
Boundedness/habituality (4-class)  .850      .804      .860

The 18-way macro-F1 is depressed by severe class imbalance (the most frequent labels are ~10× more common than the rarest); the per-attribute metrics better reflect what downstream analyses consume.
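The effect of imbalance on macro-F1 can be seen in a toy example. The sketch below uses made-up labels, not DiSCo data, and computes per-class, macro-averaged, and micro-averaged F1 from scratch; f1_scores is an illustrative helper, not the evaluation code behind the table.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (per-class F1 dict, macro-F1, micro-F1) for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    labels = sorted(set(y_true) | set(y_pred))
    per_class = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_class[label] = 2 * tp[label] / denom if denom else 0.0
    # Macro: every class counts equally, however rare it is.
    macro = sum(per_class.values()) / len(labels)
    # Micro: pool the counts, so frequent classes dominate.
    pooled = 2 * sum(tp.values()) + sum(fp.values()) + sum(fn.values())
    micro = 2 * sum(tp.values()) / pooled if pooled else 0.0
    return per_class, macro, micro


# Toy data: one class is 10x more frequent, and the rare class is always missed.
y_true = ["frequent"] * 10 + ["rare"]
y_pred = ["frequent"] * 11
per_class, macro, micro = f1_scores(y_true, y_pred)
```

Here the rare class scores an F1 of 0, which pulls the macro average down to about 0.48 while the micro average stays near 0.91.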

Note on evaluation. These are training-time validation numbers. The released weights and corpus let you re-train and re-evaluate under any split you prefer.
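A seeded random hold-out of the kind described above can be sketched as follows. The exact splitting procedure behind the reported numbers is not specified here, so the function name, the shuffle-then-slice approach, and the rounding of the validation size are all assumptions.

```python
import random


def split_train_val(items, val_fraction=0.10, seed=42):
    """Randomly hold out a fraction of items for validation, reproducibly."""
    rng = random.Random(seed)  # local generator, leaves global random state untouched
    indices = list(range(len(items)))
    rng.shuffle(indices)
    n_val = round(len(items) * val_fraction)
    val_idx = set(indices[:n_val])
    # Preserve the original order within each partition.
    train = [x for i, x in enumerate(items) if i not in val_idx]
    val = [x for i, x in enumerate(items) if i in val_idx]
    return train, val
```

Changing seed or val_fraction gives a different split. For a corpus with rare labels, a stratified split may be preferable, which this sketch does not attempt.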

Limitations

  • English only.
  • Pre-trained on largely encyclopedic text, then tuned on opinionated, mixed-register text about one controversial policy topic, so performance on other genres may differ. Because the features are formal and content-agnostic by design, good cross-genre generalization is nonetheless expected.

Citation

Hemmatian, B. (2022). Taking the High Road: A Big Data Investigation of Natural Discourse in the Emerging U.S. Consensus about Marijuana Legalization. Brown University. Please also cite the DiSCo corpus paper (forthcoming, Behavior Research Methods). See the DiSCo demos.
