DiSCo Clause Segmenter

A roberta-base token-classification model that splits English text into Elementary Discourse Units (EDUs), which correspond roughly to clauses. Each token receives one of three labels, and clause spans are recovered by matching the pattern B I* E:

id  label  meaning
0   B      clause beginning
1   I      inside clause
2   E      clause end
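
The B I* E matching can be sketched in a few lines. This is a minimal illustration, not the logic shipped in the DiSCo release: the function name `recover_clauses` is hypothetical, the example labels are hand-written rather than model output, and the strict handling of ill-formed sequences (an E with no open B is dropped, a second B restarts the span) is an assumption.

```python
from typing import List, Tuple

B, I, E = 0, 1, 2  # label ids from the table above


def recover_clauses(labels: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end inclusive, for every strict B I* E run."""
    spans = []
    start = None  # index of the currently open B, if any
    for i, lab in enumerate(labels):
        if lab == B:
            start = i  # a new B restarts the candidate span
        elif lab == E:
            if start is not None:
                spans.append((start, i))
            start = None  # an E always closes whatever was open
        # an I label never opens or closes a span
    return spans


# Hand-written labels for a 13-word sentence: B I I I I I E  B I I I I E
print(recover_clauses([0, 1, 1, 1, 1, 1, 2, 0, 1, 1, 1, 1, 2]))  # [(0, 6), (7, 12)]
```

Under this strict reading a clause needs at least two tokens (one B and one E); a production pipeline may want a more forgiving repair rule.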

This is the segmentation component of the DiSCo pipeline; feed its clause output to the companion BabakScrapes/disco-se-classifier for Situation Entity (SE) typing.

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tok = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained("BabakScrapes/disco-clause-segmenter").eval()

words = "There was bad weather at the airport and so our flight got delayed".split()
enc = tok(words, is_split_into_words=True, return_tensors="pt", truncation=True, max_length=512)
with torch.inference_mode():
    token_preds = model(**enc).logits.argmax(-1)[0].tolist()
# token_preds holds one label id (0=B, 1=I, 2=E) per subword token, special tokens included.
# Align token_preds back to words (majority vote per word), then cut clauses on B...E.

See pipeline.py in the DiSCo release for the full word-alignment + B I* E clause-recovery logic.
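
For readers without the release at hand, the majority-vote alignment step can be approximated as follows. This is a sketch under stated assumptions rather than the code in pipeline.py: `align_to_words` is a hypothetical helper, and it expects the per-token word indices that a fast tokenizer exposes through `enc.word_ids(0)`, where special tokens map to `None`.

```python
from collections import Counter
from typing import List, Optional


def align_to_words(token_preds: List[int], word_ids: List[Optional[int]]) -> List[int]:
    """Collapse subword predictions to one label per word by majority vote."""
    votes = {}
    for pred, wid in zip(token_preds, word_ids):
        if wid is None:  # special tokens such as <s> and </s>
            continue
        votes.setdefault(wid, []).append(pred)
    # On a tied vote, Counter.most_common keeps the label that was seen first.
    return [Counter(votes[w]).most_common(1)[0][0] for w in sorted(votes)]


# Toy input: word 1 is split into two subwords, word 2 into two subwords.
print(align_to_words([1, 0, 1, 1, 2, 2, 1], [None, 0, 1, 1, 2, 2, None]))  # [0, 1, 2]
```

With the Usage snippet, the call would be `align_to_words(token_preds, enc.word_ids(0))`; the resulting word-level labels can then be scanned for B I* E runs.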

Training data

Trained on the public SitEnt corpus (Friedrich et al. 2016/2017), converted to B/I/E token labels, with a deterministic split (seed 42, 90/10 train/validation). The construction recipe (construct_clause_corpus.ipynb) is distributed with the DiSCo code/corpus release, not in this model repo.
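
A seeded 90/10 split of this kind can be reproduced along the following lines. The actual recipe lives in construct_clause_corpus.ipynb and is not shown here, so everything below is an assumption: the helper name `split_90_10`, the use of Python's `random` module, and the unit being split (documents, sentences, or clauses) are illustrative only.

```python
import random
from typing import List, Sequence, Tuple


def split_90_10(items: Sequence, seed: int = 42) -> Tuple[List, List]:
    """Shuffle indices with a fixed seed, then cut 90% train / 10% validation."""
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)  # private RNG, so global state is untouched
    cut = len(order) * 9 // 10
    train = [items[i] for i in order[:cut]]
    val = [items[i] for i in order[cut:]]
    return train, val


train, val = split_90_10([f"doc{i}" for i in range(100)])
print(len(train), len(val))  # 90 10
```

Because the generator is seeded, repeated calls on the same input return the same partition.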

Performance

In-domain SitEnt held-out evaluation (seed-42 10% partition; 16,515 tokens, 1,848 gold clause spans):

Metric                                              Value
Token macro-F1                                      .753 (B .703 / I .917 / E .639)
Token accuracy (= micro-F1)                         .861
Gold clauses with ≥50% predicted-overlap coverage   95.5%
Predicted vs gold clause counts                     2,149 vs 1,848 (slight over-segmentation)

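The segmenter recovers clause regions reliably: 95.5% of gold clauses are at least half covered by predicted spans. The span-overlap numbers are the operationally relevant summaries because the downstream pipeline consumes whole clauses rather than exact boundary tokens. Macro-F1 is pulled down by the sparse B and E classes (.703 and .639), each of which occurs only once per clause, whereas the I class (.917) accounts for most tokens.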
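
To make the span-overlap summary concrete, here is one plausible way to compute a coverage rate. The exact definition behind the reported figure is not given in this card, so the reading used below (a gold clause counts as covered when a single predicted span overlaps at least half of its tokens) is an assumption, as are the names `overlap` and `coverage_rate`.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end inclusive


def overlap(a: Span, b: Span) -> int:
    """Number of token positions shared by two inclusive spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def coverage_rate(gold: List[Span], pred: List[Span], threshold: float = 0.5) -> float:
    """Share of gold spans whose best-overlapping predicted span covers >= threshold of them."""
    if not gold:
        return 0.0
    hits = 0
    for g in gold:
        length = g[1] - g[0] + 1
        best = max((overlap(g, p) for p in pred), default=0)
        if best / length >= threshold:
            hits += 1
    return hits / len(gold)


# Toy case: the first gold clause is over-segmented into two predicted spans,
# yet both gold clauses still count as covered.
print(coverage_rate([(0, 6), (7, 12)], [(0, 3), (4, 6), (7, 12)]))  # 1.0
```

The toy case also shows why mild over-segmentation and high coverage can coexist: splitting a clause in two raises the predicted count without necessarily dropping the gold clause below the threshold.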

Limitations

  • English only.
  • Tends to over-segment slightly (about 16% more predicted than gold clauses on the held-out set), which is benign for downstream SE classification.

Citation

Hemmatian, B. (2022). Taking the High Road: A Big Data Investigation of Natural Discourse in the Emerging U.S. Consensus about Marijuana Legalization. Brown University.
Please also cite the DiSCo corpus paper (forthcoming, Behavior Research Methods).
SitEnt corpus: Friedrich, Palmer & Pinkal (2016).
