DiSCo Clause Segmenter
A roberta-base token-classification model that splits English text into Elementary Discourse Units (EDUs), roughly clauses. Each token is tagged with one of three labels, and clause spans are recovered by matching the pattern B I* E (one B, zero or more I, then one E):
| id | label | meaning |
|---|---|---|
| 0 | B | clause beginning |
| 1 | I | inside clause |
| 2 | E | clause end |
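As an illustration of the B I* E pattern, here is a minimal sketch of span recovery over a sequence of word-level label ids. The function name `recover_clauses` and the handling of malformed sequences (an unclosed B is dropped, stray I or E labels are ignored) are assumptions made for this example; the release's pipeline.py may resolve such cases differently.

```python
from typing import List, Tuple

B, I, E = 0, 1, 2  # label ids from the table above


def recover_clauses(labels: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end inclusive, for each B I* E match."""
    spans = []
    start = None  # index of the currently open B, if any
    for i, lab in enumerate(labels):
        if lab == B:
            # A new B (re)opens a span; an earlier unclosed B is dropped (assumption).
            start = i
        elif lab == E and start is not None:
            spans.append((start, i))
            start = None
        # I extends an open span; I or E outside an open span is ignored.
    return spans


# Hand-written labels for a 13-word sentence (illustrative, not model output).
labels = [B, I, I, I, I, I, E, B, I, I, I, I, E]
print(recover_clauses(labels))  # [(0, 6), (7, 12)]
```

Note that under this scheme a clause needs at least two labeled units (a B and an E), so a one-word clause cannot be represented.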
This is the segmentation component of the DiSCo pipeline; feed its clause output to the companion BabakScrapes/disco-se-classifier for Situation Entity typing.
Usage
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
tok = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained("BabakScrapes/disco-clause-segmenter").eval()
words = "There was bad weather at the airport and so our flight got delayed".split()
enc = tok(words, is_split_into_words=True, return_tensors="pt", truncation=True, max_length=512)
with torch.inference_mode():
    token_preds = model(**enc).logits.argmax(-1)[0].tolist()
# align token_preds back to words (majority vote per word), then cut clauses on B...E
See pipeline.py in the DiSCo release for the full word-alignment + B I* E clause-recovery logic.
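For readers who want to experiment before fetching pipeline.py, the following is a minimal sketch of the majority-vote alignment step mentioned in the comment above. It is not the release's implementation. It assumes a fast tokenizer, in which case `enc.word_ids(batch_index=0)` returns one word index per sub-token (and `None` for special tokens); the helper name `word_level_labels` and the tie-breaking rule are hypothetical choices for this example.

```python
from collections import Counter
from typing import List, Optional


def word_level_labels(word_ids: List[Optional[int]], token_preds: List[int]) -> List[int]:
    """Collapse sub-token predictions to one label per word by majority vote."""
    votes = {}
    for wid, pred in zip(word_ids, token_preds):
        if wid is None:  # special tokens such as <s> and </s>
            continue
        votes.setdefault(wid, []).append(pred)
    # Ties go to the label seen first within the word (Counter keeps insertion order).
    return [Counter(votes[w]).most_common(1)[0][0] for w in sorted(votes)]


# Toy input: word 1 is split into two sub-tokens; in real use, pass
# enc.word_ids(batch_index=0) and token_preds from the snippet above.
word_ids = [None, 0, 1, 1, 2, None]
token_preds = [1, 0, 1, 1, 2, 1]
print(word_level_labels(word_ids, token_preds))  # [0, 1, 2]
```

Words cut off by `truncation=True` receive no label, so the returned list can be shorter than the input word list for long texts.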
Training data
Trained on the public SitEnt corpus (Friedrich et al. 2016/2017), converted to B/I/E token labels, with a deterministic split (seed 42, 90/10 train/validation). The construction recipe (construct_clause_corpus.ipynb) is distributed with the DiSCo code/corpus release, not in this model repo.
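The actual split is defined by the construction notebook in the DiSCo release. Purely as an illustration of what a seeded 90/10 split looks like, here is a sketch using Python's standard library. The unit being split (documents, sentences, or something else), the shuffling method, and the function name `split_90_10` are all assumptions, so this will not reproduce the released partition.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_90_10(items: Sequence[T], seed: int = 42) -> Tuple[List[T], List[T]]:
    """Shuffle indices with a fixed seed, then cut at 90% (illustrative only)."""
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)  # a private RNG keeps global state untouched
    cut = len(order) * 9 // 10  # integer arithmetic avoids float rounding
    train = [items[i] for i in order[:cut]]
    val = [items[i] for i in order[cut:]]
    return train, val


train, val = split_90_10(list(range(100)))
print(len(train), len(val))  # 90 10
```

Calling the function again with the same seed and input returns the same partition, which is what makes the split deterministic.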
Performance
In-domain SitEnt held-out evaluation (seed-42 10% partition; 16,515 tokens, 1,848 gold clause spans):
| Metric | Value |
|---|---|
| Token macro-F1 | .753 (B .703 / I .917 / E .639) |
| Token accuracy (= micro-F1) | .861 |
| Gold clauses with ≥50% predicted-overlap coverage | 95.5% |
| Predicted vs gold clause counts | 2,149 vs 1,848 (slight over-segmentation) |
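Two of the figures in the table can be cross-checked with simple arithmetic: macro-F1 is the unweighted mean of the three per-class F1 scores, and the clause counts give the over-segmentation ratio. The variable names below are illustrative.

```python
# Per-class token F1 scores as reported in the table.
per_class_f1 = {"B": 0.703, "I": 0.917, "E": 0.639}

# Macro-F1 weights every class equally, regardless of how many tokens it has.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(round(macro_f1, 3))  # 0.753

# Ratio of predicted to gold clause spans on the held-out partition.
predicted, gold = 2149, 1848
print(round(predicted / gold, 3))  # 1.163
```

The model therefore predicts about 16% more clause spans than the gold annotation contains.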
The segmenter recovers clause regions reliably; the span-overlap numbers are the operationally relevant summaries because the downstream pipeline consumes whole clauses rather than exact boundary tokens. Macro-F1 is depressed by the sparse B/E classes.
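The exact definition of the overlap-coverage metric is not given here. One plausible reading, sketched below, counts a gold clause as covered when a single predicted span overlaps at least half of its tokens. That reading, the `Span` representation, and the function names are assumptions for this example; the released evaluation code may instead use the union of predicted spans or a different threshold convention.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end inclusive


def overlap(a: Span, b: Span) -> int:
    """Number of token positions shared by two inclusive spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def coverage_rate(gold: List[Span], pred: List[Span], threshold: float = 0.5) -> float:
    """Share of gold spans whose best-matching predicted span covers >= threshold."""
    covered = 0
    for g in gold:
        length = g[1] - g[0] + 1
        best = max((overlap(g, p) for p in pred), default=0)
        if best / length >= threshold:
            covered += 1
    return covered / len(gold) if gold else 0.0


# The first gold clause is split in two by the prediction, yet still counts as covered.
gold = [(0, 6), (7, 12)]
pred = [(0, 3), (4, 6), (7, 12)]
print(coverage_rate(gold, pred))  # 1.0
```

A metric of this kind tolerates moderate over-segmentation, which is consistent with a high coverage figure alongside a lower boundary-level F1.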
Limitations
- English only.
- Tends to over-segment slightly (2,149 predicted vs 1,848 gold clauses on the held-out set, about 16% more), which is benign for downstream SE classification.
Citation
Hemmatian, B. (2022). Taking the High Road: A Big Data Investigation of Natural Discourse in the Emerging U.S. Consensus about Marijuana Legalization. Brown University. See also the DiSCo corpus paper (forthcoming, Behavior Research Methods). The SitEnt corpus is described in Friedrich, Palmer & Pinkal (2016).