language:
- en
tags:
- sentence-similarity
- text-classification
datasets:
- dennlinger/wiki-paragraphs
metrics:
- f1
license: mit
BERT-Wiki-Paragraphs
Authors: Satya Almasian*, Dennis Aumiller*, Lucienne-Sophie Marmé, Michael Gertz
Contact us at <lastname>@informatik.uni-heidelberg.de
Details for the training method can be found in our work Structural Text Segmentation of Legal Documents.
The training procedure follows the same setup, but we replace the legal documents with Wikipedia articles for this model.
Find the associated training data here: wiki-paragraphs
Training is weakly supervised: the model learns to determine whether two paragraphs topically belong together or not.
We utilize automatically generated samples from Wikipedia for training, where paragraphs from within the same section are assumed to be topically coherent.
We use the same articles as Koshorek et al. (2018),
albeit from a 2021 dump of Wikipedia, and split at paragraph boundaries instead of at the sentence level.
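The labeling idea can be illustrated with a minimal sketch. It assumes an article is already available as a list of sections, each a list of paragraph strings, and that consecutive paragraphs across a section boundary serve as negative samples. How negatives were actually sampled is not described here, so that part is an assumption, and `adjacent_pairs` is a hypothetical helper, not part of the released code.

```python
from typing import List, Tuple


def adjacent_pairs(sections: List[List[str]]) -> List[Tuple[str, str, int]]:
    """Label each pair of consecutive paragraphs in one article.

    Label 1: both paragraphs come from the same section.
    Label 0: the pair straddles a section boundary (assumption, see text).
    """
    # Flatten the article while remembering each paragraph's section index.
    flat = [(idx, par) for idx, section in enumerate(sections) for par in section]
    pairs = []
    for (sec_a, par_a), (sec_b, par_b) in zip(flat, flat[1:]):
        pairs.append((par_a, par_b, int(sec_a == sec_b)))
    return pairs


# Toy article with two sections; real input would come from the Wikipedia dump.
article = [
    ["Intro paragraph one.", "Intro paragraph two."],
    ["History paragraph one."],
]
samples = adjacent_pairs(article)
for first, second, label in samples:
    print(label, "|", first, "|", second)
```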
Usage
Preferred usage is through transformers.pipeline:
from transformers import pipeline
pipe = pipeline("text-classification", model="dennlinger/bert-wiki-paragraphs")
pipe("{First paragraph} [SEP] {Second paragraph}")
A predicted "1" means that paragraphs belong to the same topic, a "0" indicates a disconnect.
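To score a whole document, one option is to classify every pair of consecutive paragraphs. The sketch below is an illustration under that assumption: `build_inputs` and `score_document` are hypothetical helper names, and `classify` stands for any callable that maps a list of strings to a list of result dictionaries, which is how a text-classification pipeline behaves when given a list.

```python
from typing import Callable, Dict, List


def build_inputs(paragraphs: List[str]) -> List[str]:
    """Join each pair of consecutive paragraphs with the [SEP] marker."""
    return [f"{a} [SEP] {b}" for a, b in zip(paragraphs, paragraphs[1:])]


def score_document(paragraphs: List[str], classify: Callable) -> List[Dict]:
    """Classify all consecutive paragraph pairs of one document.

    Returns one result per pair, in document order. Documents with fewer
    than two paragraphs yield an empty list without calling the classifier.
    """
    inputs = build_inputs(paragraphs)
    return list(classify(inputs)) if inputs else []
```

With the pipeline from above, this would be called as `score_document(paragraphs, pipe)`. Each result carries the predicted label and its score, and a predicted disconnect between two paragraphs can be read as a candidate topic boundary.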
Training Setup
The model was trained for 3 epochs from bert-base-uncased on paragraph pairs (limited to 512 subwords with the longest_first truncation strategy).
We use a batch size of 24 with 2 iterations of gradient accumulation (effective batch size of 48), a learning rate of 1e-4, and gradient clipping at 5.
Training ran on a single Titan RTX GPU for 3 weeks.
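For reference, the hyperparameters above can be collected in one place. This is only a summary sketch: the values come from this section, while the key names are illustrative and do not correspond to the original training script.

```python
# Values are taken from the Training Setup text; key names are illustrative.
training_config = {
    "base_model": "bert-base-uncased",
    "epochs": 3,
    "max_subwords": 512,
    "truncation": "longest_first",
    "batch_size": 24,
    "gradient_accumulation_steps": 2,
    "learning_rate": 1e-4,
    "gradient_clipping": 5,
}

# Gradients from 2 batches of 24 samples are accumulated before each
# optimizer step, so every update is based on 24 * 2 samples.
effective_batch_size = (
    training_config["batch_size"] * training_config["gradient_accumulation_steps"]
)
print(effective_batch_size)  # 48
```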