DiSCo Situation-Entity Classifier (18-way)
A roberta-base model pre-trained on the SitEnt corpus (Friedrich, 2016) and then fine-tuned on the DiSCo corpus (Hemmatian, 2022; forthcoming) to label an English Elementary Discourse Unit (roughly a clause) with one of 18 Situation Entity (SE) types. The labels extend Smith's (2003) discourse-mode framework with Grisot's (2018) boundedness (see Hemmatian, 2022, for details). Each SE label decomposes into three content-agnostic attributes used in downstream analyses:
- Genericity of the main referent: specific/generic
- Eventivity of the main verb constellation: dynamic/stative
- Boundedness/Habituality of the eventuality: static/episodic/habitual
This is the SE-classification component of the DiSCo pipeline. Its upstream companion model, BabakScrapes/disco-clause-segmenter, first splits raw text into clauses.
See the demos.
Label set
The model outputs 18 classes. The mapping to (genericity, eventivity, boundedness) is:
| id | SE label | genericity | eventivity | boundedness |
|---|---|---|---|---|
| 0 | BOUNDED EVENT (SPECIFIC) | specific | dynamic | episodic |
| 1 | BOUNDED EVENT (GENERIC) | generic | dynamic | episodic |
| 2 | UNBOUNDED EVENT (SPECIFIC) | specific | dynamic | static |
| 3 | UNBOUNDED EVENT (GENERIC) | generic | dynamic | static |
| 4 | BASIC STATE | specific | stative | static |
| 5 | COERCED STATE (SPECIFIC) | specific | dynamic | static |
| 6 | COERCED STATE (GENERIC) | generic | dynamic | static |
| 7 | PERFECT COERCED STATE (SPECIFIC) | specific | dynamic | episodic |
| 8 | PERFECT COERCED STATE (GENERIC) | generic | dynamic | episodic |
| 9 | GENERIC SENTENCE (DYNAMIC) | generic | dynamic | habitual |
| 10 | GENERIC SENTENCE (STATIC) | generic | stative | static |
| 11 | GENERIC SENTENCE (HABITUAL) | generic | stative | habitual |
| 12 | GENERALIZING SENTENCE (DYNAMIC) | specific | dynamic | habitual |
| 13 | GENERALIZING SENTENCE (STATIVE) | specific | stative | habitual |
| 14 | QUESTION | NA | NA | NA |
| 15 | IMPERATIVE | NA | NA | NA |
| 16 | NONSENSE | NA | NA | NA |
| 17 | OTHER | NA | NA | NA |
Classes 14–17 carry no attribute decomposition per the definitions of the linguistic attributes and are excluded from attribute-share calculations in the DiSCo analyses.
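The table can be turned into a small lookup for downstream attribute-share calculations. The following is a minimal sketch, not the DiSCo analysis code: `SE_ATTRIBUTES`, `ATTRIBUTE_NAMES`, and `attribute_shares` are hypothetical names, and the sketch assumes that a share is simply the fraction of decomposable clauses (ids 0 to 13) carrying a given attribute value.

```python
from collections import Counter

# (genericity, eventivity, boundedness) per class id, copied from the table above.
# Ids 14-17 have no decomposition and are mapped to None.
SE_ATTRIBUTES = {
    0: ("specific", "dynamic", "episodic"),
    1: ("generic", "dynamic", "episodic"),
    2: ("specific", "dynamic", "static"),
    3: ("generic", "dynamic", "static"),
    4: ("specific", "stative", "static"),
    5: ("specific", "dynamic", "static"),
    6: ("generic", "dynamic", "static"),
    7: ("specific", "dynamic", "episodic"),
    8: ("generic", "dynamic", "episodic"),
    9: ("generic", "dynamic", "habitual"),
    10: ("generic", "stative", "static"),
    11: ("generic", "stative", "habitual"),
    12: ("specific", "dynamic", "habitual"),
    13: ("specific", "stative", "habitual"),
    14: None, 15: None, 16: None, 17: None,
}
ATTRIBUTE_NAMES = ("genericity", "eventivity", "boundedness")


def attribute_shares(pred_ids):
    """Return {attribute: {value: share}} over clauses that have a decomposition."""
    kept = [SE_ATTRIBUTES[i] for i in pred_ids if SE_ATTRIBUTES[i] is not None]
    shares = {}
    for pos, name in enumerate(ATTRIBUTE_NAMES):
        counts = Counter(attrs[pos] for attrs in kept)
        shares[name] = {v: n / len(kept) for v, n in counts.items()} if kept else {}
    return shares


# Hypothetical predictions for five clauses; id 14 (QUESTION) is excluded.
print(attribute_shares([0, 4, 9, 14, 12]))
```

Predicted ids from the model can be passed in directly; clauses labelled QUESTION, IMPERATIVE, NONSENSE, or OTHER do not count toward the denominator.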
Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# The tokenizer is the stock roberta-base one, loaded with add_prefix_space=True.
tok = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
model = AutoModelForSequenceClassification.from_pretrained("BabakScrapes/disco-se-classifier").eval()

# Input is a single clause (Elementary Discourse Unit), not a full document.
clause = "my friend smoked marijuana daily"
enc = tok(clause, return_tensors="pt", truncation=True, max_length=128)

with torch.inference_mode():
    pred_id = model(**enc).logits.argmax(-1).item()

print(model.config.id2label[pred_id])
```
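For more than a handful of clauses, batching the forward pass is usually faster than calling the model once per clause. The sketch below shows one way to do that with the same tokenizer and model settings as the single-clause example; `batched` and `classify_clauses` are hypothetical helpers rather than part of the release, and the batch size of 32 is an arbitrary assumption.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def classify_clauses(clauses, batch_size=32):
    """Return one SE label string per clause (hypothetical helper)."""
    # Imported lazily so `batched` can be used without torch installed.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tok = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        "BabakScrapes/disco-se-classifier"
    ).eval()

    labels = []
    for batch in batched(clauses, batch_size):
        # padding=True pads each batch to its longest clause.
        enc = tok(batch, return_tensors="pt", padding=True,
                  truncation=True, max_length=128)
        with torch.inference_mode():
            pred_ids = model(**enc).logits.argmax(-1).tolist()
        labels.extend(model.config.id2label[i] for i in pred_ids)
    return labels
```

Calling `classify_clauses(["my friend smoked marijuana daily", "legalization is a complex issue"])` would return a list with one label string per input clause, in input order.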
Training data
Pre-trained on the SitEnt corpus, then fine-tuned on the DiSCo corpus clause annotations: opinionated, mixed-register English text (news from four outlets across the political spectrum, Reddit, and AI-generated text) primarily on the topic of marijuana legalization. This domain is deliberately harder and more varied than the Wikipedia-dominated SitEnt corpus on which prior SE classifiers were trained.
Performance
Training-time validation performance (seed 42, random 10% validation):
| Task | Accuracy | Macro-F1 | Micro-F1 |
|---|---|---|---|
| 18-way SE label | .737 | .514 | .689 |
| Genericity (3-class) | .860 | .852 | .841 |
| Eventivity (3-class) | .894 | .879 | .873 |
| Boundedness/habituality (4-class) | .850 | .804 | .860 |
The 18-way macro-F1 is depressed by severe class imbalance (the most frequent labels are ~10× more common than the rarest); the per-attribute metrics better reflect what downstream analyses consume.
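To illustrate why imbalance pulls macro-F1 below accuracy, here is a toy calculation in plain Python. The data is invented for illustration (a 10:1 class ratio with the rare class always missed) and is unrelated to the reported scores; `per_class_f1`, `macro_f1`, and `accuracy` are hypothetical helper names.

```python
def per_class_f1(gold, pred):
    """Return {label: F1} for every label that occurs in gold or pred."""
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1, so rare classes count as much as frequent ones."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)


def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy data: 10 clauses of a frequent class, 1 clause of a rare class that is missed.
gold = ["BASIC STATE"] * 10 + ["IMPERATIVE"]
pred = ["BASIC STATE"] * 11
print(accuracy(gold, pred), macro_f1(gold, pred))
```

In this toy case a single missed rare-class clause leaves accuracy at 10/11 while halving the macro average, because the rare class contributes an F1 of zero with the same weight as the frequent class.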
Note on evaluation. These are training-time validation numbers. The released weights and corpus let you re-train and re-evaluate under any split you prefer.
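A seeded random hold-out of the kind described above can be sketched in a few lines. This is an illustration only: the exact splitting procedure used in training is not specified beyond the seed and the 10% fraction, so `random_split` (a hypothetical helper) will not reproduce the original validation set, and the example data is a made-up stand-in for the corpus annotations.

```python
import random


def random_split(examples, val_fraction=0.1, seed=42):
    """Shuffle a copy of `examples` and return (train, validation)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical stand-in for (clause, label_id) pairs from the DiSCo corpus.
examples = [(f"clause {i}", i % 18) for i in range(200)]
train, val = random_split(examples)
print(len(train), len(val))
```

Given the class imbalance, a stratified split or a split by source document may be worth considering when re-evaluating.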
Limitations
- English only
- Pre-trained on largely encyclopedic text, then fine-tuned on opinionated, mixed-register text about a single controversial policy topic, so performance on other genres may differ. Because the features are formal and content-agnostic by design, good cross-genre generalization is nonetheless expected.
Citation
Hemmatian, B. (2022). Taking the High Road: A Big Data Investigation of Natural Discourse in the Emerging U.S. Consensus about Marijuana Legalization. Brown University. The DiSCo corpus paper is forthcoming in Behavior Research Methods. See the DiSCo demos.