Punctuator for Uncased English
The model is fine-tuned from DistilBertForTokenClassification
to add punctuation to plain, uncased English text.
Usage
from transformers import DistilBertForTokenClassification, DistilBertTokenizerFast
model = DistilBertForTokenClassification.from_pretrained("Qishuai/distilbert_punctuator_en")
tokenizer = DistilBertTokenizerFast.from_pretrained("Qishuai/distilbert_punctuator_en")
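The loading code above stops before inference, so here is a minimal sketch of one way to turn the model's token-level predictions back into punctuated text. It rests on assumptions that this card does not confirm: that the tag names in `model.config.id2label` match the class names in the metrics report below (including the `EXLAMATIONMARK` spelling), that any other tag (such as `O`) means "no punctuation", and that the prediction for the last sub-token of a word is the one to keep. The helper names `apply_punctuation` and `punctuate` are hypothetical and are not part of the model or of transformers.

```python
from typing import List

# Assumed mapping from the tag names in the metrics report to characters.
# Any tag not listed here (for example "O") is treated as "no punctuation".
PUNCTUATION = {
    "COMMA": ",",
    "PERIOD": ".",
    "QUESTIONMARK": "?",
    "EXLAMATIONMARK": "!",  # spelled as in the metrics report
}


def apply_punctuation(words: List[str], tags: List[str]) -> str:
    """Append the mark predicted for each word, then join with spaces."""
    pieces = [word + PUNCTUATION.get(tag, "") for word, tag in zip(words, tags)]
    return " ".join(pieces)


def punctuate(text: str, model, tokenizer) -> str:
    """Tag each whitespace-separated word with the model and restore punctuation."""
    import torch  # imported lazily so apply_punctuation works without torch

    words = text.lower().split()
    encoding = tokenizer(
        words, is_split_into_words=True, return_tensors="pt", truncation=True
    )
    with torch.no_grad():
        logits = model(**encoding).logits[0]
    predicted_ids = logits.argmax(dim=-1).tolist()

    # Assumption: keep the prediction of the last sub-token of every word.
    # Words cut off by truncation keep the default "O" tag.
    tags = ["O"] * len(words)
    for position, word_index in enumerate(encoding.word_ids(0)):
        if word_index is not None:
            tags[word_index] = model.config.id2label[predicted_ids[position]]
    return apply_punctuation(words, tags)
```

With `model` and `tokenizer` loaded as shown above, `punctuate("how are you i am fine", model, tokenizer)` returns the input with the predicted marks attached. The output stays lowercase because the model works on uncased text; restoring capitalization would be a separate step.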
Model Overview
Training data
A combination of the following three datasets:
- BBC news: articles from the BBC news website covering stories in five topical areas from 2004-2005. Reference
- News articles: 20,000 samples of short news articles scraped from Hindu, Indian Times and Guardian between Feb 2017 and Aug 2017. Reference
- TED talks: transcripts of over 4,000 TED talks between 2004 and 2019. Reference
Model Performance
Validation on 500 samples from a dataset scraped from the https://www.thenews.com.pk website. Reference
Metrics Report:
                 precision  recall  f1-score  support

COMMA                 0.66    0.55      0.60     7064
EXLAMATIONMARK        1.00    0.00      0.00        5
PERIOD                0.73    0.63      0.68     6573
QUESTIONMARK          0.54    0.41      0.47       17

micro avg             0.69    0.59      0.64    13659
macro avg             0.73    0.40      0.44    13659
weighted avg          0.69    0.59      0.64    13659

Validation on 86 new TED talks from 2020 that are not included in the training dataset. Reference
Metrics Report:
                 precision  recall  f1-score  support

COMMA                 0.71    0.56      0.63    10712
EXLAMATIONMARK        0.45    0.07      0.12       75
PERIOD                0.75    0.65      0.70     7921
QUESTIONMARK          0.73    0.67      0.70      827

micro avg             0.73    0.60      0.66    19535
macro avg             0.66    0.49      0.53    19535
weighted avg          0.73    0.60      0.66    19535
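The macro and weighted averages in these reports differ noticeably, and a short calculation shows why. The sketch below recomputes both from the per-class rows of the first report (the thenews.com.pk validation set). It works from the rounded two-decimal figures printed above, not from the original predictions, and the card does not say which tool produced the reports, so this is only an illustration of the usual definitions. The function names are hypothetical.

```python
# Per-class rows of the first report: (precision, recall, f1, support).
rows = {
    "COMMA": (0.66, 0.55, 0.60, 7064),
    "EXLAMATIONMARK": (1.00, 0.00, 0.00, 5),
    "PERIOD": (0.73, 0.63, 0.68, 6573),
    "QUESTIONMARK": (0.54, 0.41, 0.47, 17),
}


def macro_average(rows, column):
    """Unweighted mean over classes: every class counts equally."""
    values = [row[column] for row in rows.values()]
    return sum(values) / len(values)


def weighted_average(rows, column):
    """Mean over classes weighted by support (number of true instances)."""
    total = sum(row[3] for row in rows.values())
    return sum(row[column] * row[3] for row in rows.values()) / total


# Columns 0, 1, 2 are precision, recall and f1-score.
macro = [round(macro_average(rows, c), 2) for c in range(3)]
weighted = [round(weighted_average(rows, c), 2) for c in range(3)]

print("macro avg   ", macro)     # [0.73, 0.4, 0.44]
print("weighted avg", weighted)  # [0.69, 0.59, 0.64]
```

Both results match the reported rows. The macro recall of 0.40 is pulled down by EXLAMATIONMARK, which has only 5 instances and a recall of 0.00, while the weighted average is dominated by COMMA and PERIOD. The micro average cannot be recovered exactly from rounded per-class figures, since it needs the underlying true-positive and prediction counts.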