DiSCo Clause Segmenter

A roberta-base token-classification model that splits English text into Elementary Discourse Units (EDUs), roughly clauses. Each token receives one of three labels, and clause spans are recovered by matching the pattern B I* E (one B, zero or more I, then one E):

id	label	meaning
0	B	clause beginning
1	I	inside clause
2	E	clause end
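The B I* E matching described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in the DiSCo release: the function name `recover_clauses` is hypothetical, and the handling of malformed sequences (a new B discards an unfinished clause, a stray E is ignored) is an assumption.

```python
from typing import List, Tuple

B, I, E = 0, 1, 2  # label ids from the table above


def recover_clauses(labels: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end inclusive, for every B I* E run."""
    spans = []
    start = None  # index of the currently open B, if any
    for i, lab in enumerate(labels):
        if lab == B:
            start = i  # assumption: a new B restarts any unfinished clause
        elif lab == E and start is not None:
            spans.append((start, i))
            start = None
        # an I, or an E with no open B, leaves the state unchanged
    return spans


labels = [B, I, I, E, B, I, E]
print(recover_clauses(labels))  # [(0, 3), (4, 6)]
```

Because a B resets the open clause and an E closes it, only I labels can sit between the two ends of a recovered span, which is exactly the B I* E pattern.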

This is the segmentation component of the DiSCo pipeline; feed its clause output to the companion BabakScrapes/disco-se-classifier for Situation Entity typing.

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tok = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained("BabakScrapes/disco-clause-segmenter").eval()

words = "There was bad weather at the airport and so our flight got delayed".split()
enc = tok(words, is_split_into_words=True, return_tensors="pt", truncation=True, max_length=512)
with torch.inference_mode():
    token_preds = model(**enc).logits.argmax(-1)[0].tolist()
word_ids = enc.word_ids(0)  # word index for each token; None for special tokens
# align token_preds back to words (majority vote per word), then cut clauses on B...E

See pipeline.py in the DiSCo release for the full word-alignment + B I* E clause-recovery logic.
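Until you have pipeline.py at hand, the majority-vote alignment step might look roughly like the sketch below. It is an illustration only: `words_from_tokens` is a hypothetical name, `word_ids` is assumed to be the list a fast tokenizer returns from `enc.word_ids(0)`, and breaking ties in favor of the first-seen label is an assumption that may differ from the release.

```python
from collections import Counter
from typing import Dict, List, Optional


def words_from_tokens(token_preds: List[int],
                      word_ids: List[Optional[int]]) -> List[int]:
    """Collapse sub-token predictions to one label per word by majority vote."""
    votes: Dict[int, List[int]] = {}
    for pred, wid in zip(token_preds, word_ids):
        if wid is None:  # special tokens such as <s> and </s>
            continue
        votes.setdefault(wid, []).append(pred)
    # most_common keeps first-seen order among equal counts (tie-break assumption)
    return [Counter(votes[w]).most_common(1)[0][0] for w in sorted(votes)]


# Toy input: three words, the first and third split into two sub-tokens each.
token_preds = [1, 0, 0, 1, 2, 2, 1]
word_ids = [None, 0, 0, 1, 2, 2, None]
print(words_from_tokens(token_preds, word_ids))  # [0, 1, 2]
```

The returned list holds one label per word that survived truncation, so it can be passed directly to a B I* E span matcher.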

Training data

Trained on the public SitEnt corpus (Friedrich et al. 2016/2017), converted to B/I/E token labels, with a deterministic split (seed 42, 90/10 train/validation). The construction recipe (construct_clause_corpus.ipynb) is distributed with the DiSCo code/corpus release, not in this model repo.
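For readers who want a comparable split on their own data, a seeded 90/10 partition can be sketched as below. This is not the recipe from construct_clause_corpus.ipynb, which may split at a different unit (for example by document) or use a different library. The function name `split_90_10` and the shuffle-then-cut approach are assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_90_10(items: Sequence[T], seed: int = 42) -> Tuple[List[T], List[T]]:
    """Shuffle indices with a fixed seed and cut 90% train / 10% validation."""
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)  # private RNG, global state untouched
    cut = len(order) * 9 // 10          # integer arithmetic avoids float rounding
    train = [items[i] for i in order[:cut]]
    val = [items[i] for i in order[cut:]]
    return train, val


train, val = split_90_10([f"sent_{i}" for i in range(100)])
print(len(train), len(val))  # 90 10
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible regardless of other random calls in the same process.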

Performance

In-domain SitEnt held-out evaluation (seed-42 10% partition; 16,515 tokens, 1,848 gold clause spans):

Metric	Value
Token macro-F1	.753 (B .703 / I .917 / E .639)
Token accuracy (= micro-F1)	.861
Gold clauses with ≥50% predicted-overlap coverage	95.5%
Predicted vs gold clause counts	2,149 vs 1,848 (about 16% more predicted than gold; mild over-segmentation)

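The segmenter recovers clause regions reliably. The span-overlap numbers are the operationally relevant summaries because the downstream pipeline consumes whole clauses rather than exact boundary tokens. Macro-F1 is the unweighted mean of the three per-class F1 scores, so it is pulled down by the sparse B and E classes (.703 and .639) even though the frequent I class scores .917.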
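The card does not spell out how the ≥50% overlap coverage is computed, so the following is only one plausible reading: a gold clause counts as covered when a single predicted span overlaps at least half of its positions. The names `overlap` and `coverage_rate` are hypothetical, and spans are assumed to be (start, end) pairs with an inclusive end.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), end inclusive


def overlap(a: Span, b: Span) -> int:
    """Number of positions shared by two inclusive spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def coverage_rate(gold: List[Span], pred: List[Span],
                  threshold: float = 0.5) -> float:
    """Share of gold spans whose best single predicted span covers >= threshold."""
    if not gold:
        return 0.0
    hits = 0
    for g in gold:
        length = g[1] - g[0] + 1
        best = max((overlap(g, p) for p in pred), default=0)
        if best / length >= threshold:
            hits += 1
    return hits / len(gold)


gold = [(0, 6), (7, 12)]
pred = [(0, 3), (4, 6), (7, 12)]  # first gold clause is over-segmented in two
print(coverage_rate(gold, pred))  # 1.0
```

In the toy example the first gold clause is split into two predictions, yet the larger piece still covers 4 of its 7 positions, which shows how a segmenter can over-segment and still score well on this kind of measure.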

Limitations

English only.
Tends to over-segment slightly (2,149 predicted vs 1,848 gold clauses on the held-out split), which is benign for downstream Situation Entity (SE) classification.

Citation

Hemmatian, B. (2022). Taking the High Road: A Big Data Investigation of Natural Discourse in the Emerging U.S. Consensus about Marijuana Legalization. Brown University. Please also cite the DiSCo corpus paper (forthcoming, Behavior Research Methods) and, for the SitEnt corpus, Friedrich, Palmer & Pinkal (2016).
