DiSCo Situation-Entity Classifier (18-way)

A roberta-base model pre-trained on the SitEnt corpus (Friedrich, 2016) and then fine-tuned on the DiSCo corpus (Hemmatian, 2022; forthcoming) to label an English Elementary Discourse Unit (roughly a clause) with one of 18 Situation Entity (SE) types. The labels extend Smith's (2003) discourse-mode framework with Grisot's (2018) boundedness (see Hemmatian, 2022, for details). Each SE label decomposes into three content-agnostic attributes used in downstream analyses:

Genericity of the main referent: specific / generic
Eventivity of the main verb constellation: dynamic / stative
Boundedness/Habituality of the eventuality: static / episodic / habitual

This is the SE-classification component of the DiSCo pipeline. The companion model that precedes it in the pipeline, BabakScrapes/disco-clause-segmenter, first splits raw text into clauses.

See the demos.

Label set

The model outputs 18 classes. The mapping to (genericity, eventivity, boundedness) is:

id	SE label	genericity	eventivity	boundedness
0	BOUNDED EVENT (SPECIFIC)	specific	dynamic	episodic
1	BOUNDED EVENT (GENERIC)	generic	dynamic	episodic
2	UNBOUNDED EVENT (SPECIFIC)	specific	dynamic	static
3	UNBOUNDED EVENT (GENERIC)	generic	dynamic	static
4	BASIC STATE	specific	stative	static
5	COERCED STATE (SPECIFIC)	specific	dynamic	static
6	COERCED STATE (GENERIC)	generic	dynamic	static
7	PERFECT COERCED STATE (SPECIFIC)	specific	dynamic	episodic
8	PERFECT COERCED STATE (GENERIC)	generic	dynamic	episodic
9	GENERIC SENTENCE (DYNAMIC)	generic	dynamic	habitual
10	GENERIC SENTENCE (STATIC)	generic	stative	static
11	GENERIC SENTENCE (HABITUAL)	generic	stative	habitual
12	GENERALIZING SENTENCE (DYNAMIC)	specific	dynamic	habitual
13	GENERALIZING SENTENCE (STATIVE)	specific	stative	habitual
14	QUESTION	NA	NA	NA
15	IMPERATIVE	NA	NA	NA
16	NONSENSE	NA	NA	NA
17	OTHER	NA	NA	NA

Classes 14–17 carry no attribute decomposition per the definitions of the linguistic attributes and are excluded from attribute-share calculations in the DiSCo analyses.
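As a minimal sketch, the table above can be written as a lookup from class id to attributes and used to compute attribute shares over a set of predictions. The names SE_ATTRIBUTES and attribute_shares are illustrative and not part of the released code, and the share definition (count of a value divided by the number of clauses with a decomposition) is an assumption about how such shares would be computed, not the DiSCo analysis code itself.

```python
from collections import Counter

# (genericity, eventivity, boundedness) per class id, copied from the table above.
# Classes 14-17 (QUESTION, IMPERATIVE, NONSENSE, OTHER) have no decomposition.
SE_ATTRIBUTES = {
    0: ("specific", "dynamic", "episodic"),
    1: ("generic", "dynamic", "episodic"),
    2: ("specific", "dynamic", "static"),
    3: ("generic", "dynamic", "static"),
    4: ("specific", "stative", "static"),
    5: ("specific", "dynamic", "static"),
    6: ("generic", "dynamic", "static"),
    7: ("specific", "dynamic", "episodic"),
    8: ("generic", "dynamic", "episodic"),
    9: ("generic", "dynamic", "habitual"),
    10: ("generic", "stative", "static"),
    11: ("generic", "stative", "habitual"),
    12: ("specific", "dynamic", "habitual"),
    13: ("specific", "stative", "habitual"),
}
ATTRIBUTE_NAMES = ("genericity", "eventivity", "boundedness")


def attribute_shares(pred_ids):
    """Return {attribute: {value: share}} over clauses that have a decomposition."""
    # Drop ids without an entry in the lookup (classes 14-17).
    kept = [SE_ATTRIBUTES[i] for i in pred_ids if i in SE_ATTRIBUTES]
    shares = {}
    for pos, name in enumerate(ATTRIBUTE_NAMES):
        counts = Counter(attrs[pos] for attrs in kept)
        shares[name] = {value: n / len(kept) for value, n in counts.items()}
    return shares


# Hypothetical predictions for six clauses; id 14 (QUESTION) is dropped.
example = attribute_shares([0, 4, 4, 9, 14, 12])
print(example["genericity"])
```

With these six hypothetical predictions, five clauses remain after exclusion, so each share is a fraction of five.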

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the roberta-base tokenizer and the fine-tuned classifier (eval mode disables dropout).
tok = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
model = AutoModelForSequenceClassification.from_pretrained("BabakScrapes/disco-se-classifier").eval()

# Classify a single clause (one Elementary Discourse Unit).
clause = "my friend smoked marijuana daily"
enc = tok(clause, return_tensors="pt", truncation=True, max_length=128)
with torch.inference_mode():
    # Take the highest-scoring of the 18 classes.
    pred_id = model(**enc).logits.argmax(-1).item()
print(model.config.id2label[pred_id])
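The snippet above prints only the single best label. To see how close the runner-up labels are, the logits can be converted to probabilities with a softmax. The following is a dependency-free sketch: top_k_labels is a hypothetical helper, and the three-class logits and label map are toy values used only so the example runs without downloading the model.

```python
import math


def top_k_labels(logits, id2label, k=3):
    """Convert one clause's raw logits to the k most probable (label, prob) pairs."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Toy three-class logits and label map, for illustration only. With the real
# model, pass model(**enc).logits[0].tolist() and model.config.id2label.
toy_id2label = {0: "A", 1: "B", 2: "C"}
top = top_k_labels([2.0, 0.5, 1.0], toy_id2label, k=2)
print(top)
```

Softmax probabilities from a fine-tuned classifier are not necessarily well calibrated, so they are best read as a ranking aid rather than as confidence estimates.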

Training data

Pre-trained on the SitEnt corpus, then fine-tuned on the DiSCo corpus clause annotations: opinionated, mixed-register English text (news from four outlets across the political spectrum, Reddit, and AI-generated text) primarily on the topic of marijuana legalization. This domain is deliberately harder and more varied than the Wikipedia-dominated SitEnt corpus on which prior SE classifiers were trained.

Performance

Training-time validation performance (seed 42, random 10% validation split):

Task	Accuracy	Macro-F1	Micro-F1
18-way SE label	.737	.514	.689
Genericity (3-class)	.860	.852	.841
Eventivity (3-class)	.894	.879	.873
Boundedness/habituality (4-class)	.850	.804	.860

The 18-way macro-F1 is depressed by severe class imbalance (the most frequent labels are ~10× more common than the rarest); the per-attribute metrics better reflect what downstream analyses consume.
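The effect of imbalance on macro-F1 can be seen in a toy two-class example with the same 10:1 frequency ratio. The data below are invented for illustration and are not drawn from the DiSCo corpus; per_class_f1 is a hypothetical helper.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label that occurs in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Toy data: 10 clauses of a frequent class, 1 clause of a rare class.
y_true = ["frequent"] * 10 + ["rare"]
y_pred = ["frequent"] * 11  # the single rare clause is missed

f1 = per_class_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(f1.values()) / len(f1)  # unweighted mean over classes
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f}")
# prints: accuracy=0.909 macro_f1=0.476
```

One missed rare clause barely moves accuracy but sets that class's F1 to zero, and macro-F1 weights every class equally regardless of its frequency.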

Note on evaluation. These are training-time validation numbers. The released weights and corpus let you re-train and re-evaluate under any split you prefer.
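A seeded random 10% hold-out can be sketched as follows. This is not the authors' splitting code: split_train_validation is a hypothetical helper, and the same seed in a different library or shuffling routine will yield a different partition, so it will not reproduce the exact validation set behind the numbers above.

```python
import random


def split_train_validation(n_items, validation_fraction=0.10, seed=42):
    """Return (train_indices, validation_indices) from a seeded shuffle."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # local RNG, leaves global state untouched
    n_val = round(n_items * validation_fraction)
    return sorted(indices[n_val:]), sorted(indices[:n_val])


# Example with a hypothetical corpus of 1000 annotated clauses.
train_idx, val_idx = split_train_validation(1000)
print(len(train_idx), len(val_idx))
# prints: 900 100
```

Splitting by document or by source rather than by clause is a common stricter alternative, since clauses from the same text can otherwise land on both sides of the split.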

Limitations

English only
Pre-trained on largely encyclopedic text, then fine-tuned on opinionated, mixed-register text about a single controversial policy topic, so performance on other genres may differ. Because the features are formal and content-agnostic by design, good cross-genre generalization is nonetheless expected.

Citation

Hemmatian, B. (2022). Taking the High Road: A Big Data Investigation of Natural Discourse in the Emerging U.S. Consensus about Marijuana Legalization. Brown University.

The DiSCo corpus paper is forthcoming in Behavior Research Methods. See the DiSCo demos.
