Punctuator for Uncased English

The model is fine-tuned based on DistilBertForTokenClassification for adding punctuations to plain text (uncased English)

Usage

from transformers import DistilBertForTokenClassification, DistilBertTokenizerFast

model = DistilBertForTokenClassification.from_pretrained("Qishuai/distilbert_punctuator_en")
tokenizer = DistilBertTokenizerFast.from_pretrained("Qishuai/distilbert_punctuator_en")

Model Overview

Training data

Combination of following three dataset:

BBC news: From BBC news website corresponding to stories in five topical areas from 2004-2005. Reference
News articles: 20000 samples of short news articles scraped from Hindu, Indian times and Guardian between Feb 2017 and Aug 2017 Reference
Ted talks: transcripts of over 4,000 TED talks between 2004 and 2019 Reference

Model Performance

Validation with 500 samples of dataset scraped from https://www.thenews.com.pk website. Reference

Metrics Report:

	precision	recall	f1-score	support
COMMA	0.66	0.55	0.60	7064
EXLAMATIONMARK	1.00	0.00	0.00	5
PERIOD	0.73	0.63	0.68	6573
QUESTIONMARK	0.54	0.41	0.47	17
micro avg	0.69	0.59	0.64	13659
macro avg	0.73	0.40	0.44	13659
weighted avg	0.69	0.59	0.64	13659

Validation with 86 news ted talks of 2020 which are not included in training dataset Reference

Metrics Report:

	precision	recall	f1-score	support
COMMA	0.71	0.56	0.63	10712
EXLAMATIONMARK	0.45	0.07	0.12	75
PERIOD	0.75	0.65	0.70	7921
QUESTIONMARK	0.73	0.67	0.70	827
micro avg	0.73	0.60	0.66	19535
macro avg	0.66	0.49	0.53	19535
weighted avg	0.73	0.60	0.66	19535