🧾 Model Card: T5-Small Sentence Validator

This is the model card for a 🤗 Transformers model fine-tuned for text normalization: restoring punctuation, capitalization, and proper spacing in raw transcripts or unformatted text.


๐Ÿ‘จโ€๐Ÿ’ป Developed by

Pradhap Rajamani

Funded by: Independent academic project, Purdue University (Master's in Computer Science)
Shared by: Pradhap Rajamani


🧠 Model Details

  • Model type: Sequence-to-sequence text generation model (T5-small architecture)
  • Language(s): English 🇺🇸
  • License: Apache 2.0
  • Finetuned from model: t5-small

🔗 Model Sources


🎯 Uses

Direct Use

Automatically adds missing punctuation, capitalization, and spacing to unformatted English text.

Example:

Input: normalize: helloeveryonewelcome to todayssession

Output: Hello everyone, welcome to today's session.

Downstream Use

  • Text cleanup for ASR output and transcripts
  • Preprocessing for NLP tasks (NER, summarization, etc.)
  • Normalization of chatbot or conversation logs

🚫 Out-of-Scope Use

  • Grammar correction beyond punctuation
  • Rewriting semantics or paraphrasing

โš ๏ธ Limitations and Biases

| Limitation | Description |
|---|---|
| Ambiguous punctuation | May insert or omit commas and periods differently from human editors |
| Case sensitivity | Occasionally fails to capitalize proper nouns |
| Domain limitation | Trained primarily on general English text (Wikipedia-style) |

Recommendation:

Always perform a quick review before deploying results in production systems.
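One way to support such a review is an automatic check that the model changed only formatting. The sketch below is an assumption about how a review step might look, not part of the released model; the function names `letters_only` and `only_formatting_changed` are hypothetical. It compares input and output after stripping case, spacing, and punctuation, so any pair that fails has altered wording and deserves a manual look.

```python
def letters_only(text: str) -> str:
    """Lowercase the text and keep only letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def only_formatting_changed(raw: str, normalized: str) -> bool:
    """Return True if the strings differ only in case, spacing, and punctuation."""
    return letters_only(raw) == letters_only(normalized)


# A pair taken from the card's own example: only formatting differs.
print(only_formatting_changed(
    "helloeveryonewelcome to todayssession",
    "Hello everyone, welcome to today's session.",
))  # True

# A made-up pair where a word was rewritten ("good" became "great").
print(only_formatting_changed(
    "itwasagoodday today",
    "It was a great day today.",
))  # False
```

A check like this cannot judge whether a comma is well placed; it only flags outputs that drift into paraphrasing, which the card lists as out of scope.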

🚀 How to Use


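from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned tokenizer and model from the Hugging Face Hub.
model_name = "pradhap1125/t5-small-sentence-validator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# The Direct Use example prefixes the input with "normalize: ", so the same prefix is used here.
text = "normalize: helloeveryonewelcome to todayssession"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
# Decode the generated token IDs back into text, dropping special tokens.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))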
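For transcripts with many lines, the same calls can be applied to a batch. The following is a minimal sketch under two assumptions: that the `normalize: ` prefix from the Direct Use example is the expected task prefix, and that padding a batch does not change the model's behavior. The helpers `build_prompts` and `normalize_batch` are hypothetical names, not part of the model repository.

```python
from typing import List

TASK_PREFIX = "normalize: "  # assumed from the Direct Use example in this card


def build_prompts(texts: List[str]) -> List[str]:
    """Strip surrounding whitespace and prepend the task prefix exactly once."""
    prompts = []
    for text in texts:
        text = text.strip()
        if not text.startswith(TASK_PREFIX):
            text = TASK_PREFIX + text
        prompts.append(text)
    return prompts


def normalize_batch(
    texts: List[str],
    model_name: str = "pradhap1125/t5-small-sentence-validator",
    max_new_tokens: int = 64,
) -> List[str]:
    """Run the model on several inputs at once and return the decoded outputs."""
    # Imported lazily so build_prompts can be used without transformers installed.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    # Pad to the longest prompt so the batch forms a rectangular tensor.
    inputs = tokenizer(
        build_prompts(texts), return_tensors="pt", padding=True, truncation=True
    )
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Calling `normalize_batch(["this isatest ofthe normalizationmodel", "itwasagoodday today"])` would download the model on first use and return one string per input.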

🧩 Training Details

🗂 Dataset

A custom English dataset derived from Wikipedia sentences. Text was noisified by removing punctuation, capitalization, and spacing to simulate ASR output.

| Field | Description |
|---|---|
| Source | English Wikipedia corpus |
| Preprocessing | Removed punctuation, capitalization, and spacing |
| Training Samples | ~90,000 |
| Validation Samples | ~2,5000 |
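The card does not include the preprocessing script, so the following is only an illustrative sketch of how a clean sentence might be turned into a noisy training input. The function name `noisify`, the `drop_space_prob` parameter, and the choice to drop each space independently are all assumptions.

```python
import random
import string
from typing import Optional


def noisify(
    sentence: str,
    drop_space_prob: float = 0.5,
    rng: Optional[random.Random] = None,
) -> str:
    """Remove ASCII punctuation, lowercase the text, and randomly merge adjacent words."""
    if rng is None:
        rng = random.Random()
    # Delete every ASCII punctuation character, including apostrophes.
    no_punct = sentence.translate(str.maketrans("", "", string.punctuation))
    words = no_punct.lower().split()
    if not words:
        return ""
    pieces = [words[0]]
    for word in words[1:]:
        # Keep the space only when the random draw is at or above the drop probability.
        if rng.random() >= drop_space_prob:
            pieces.append(" ")
        pieces.append(word)
    return "".join(pieces)


clean = "Hello everyone, welcome to today's session."
print(noisify(clean, drop_space_prob=0.0))  # hello everyone welcome to todays session
print(noisify(clean, drop_space_prob=1.0))  # helloeveryonewelcometotodayssession
```

Each (noisy, clean) pair would then serve as one source and target example for sequence-to-sequence fine-tuning.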

🧮 Hyperparameters

| Parameter | Value |
|---|---|
| Model | T5-small |
| Epochs | 3 |
| Batch Size | 16 |
| Learning Rate | 3e-4 |
| Weight Decay | 0.01 |
| Optimizer | AdamW |
| Scheduler | Linear decay |
| Loss Function | CrossEntropyLoss |
| Evaluation Metric | Token-level accuracy, loss |
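These settings imply a rough training length and learning-rate curve. The arithmetic below is a sketch that treats the approximate 90,000 training samples as exact and assumes a single GPU, no gradient accumulation, and no warmup, none of which the card states. The function `linear_decay_lr` is a hypothetical helper, not the scheduler used in training.

```python
import math

TRAIN_SAMPLES = 90_000  # "~90,000" in the dataset table, treated as exact here
BATCH_SIZE = 16
EPOCHS = 3
PEAK_LR = 3e-4

steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS


def linear_decay_lr(step: int, total: int = total_steps, peak: float = PEAK_LR) -> float:
    """Learning rate after `step` updates, decaying linearly from peak to zero."""
    return peak * max(0.0, (total - step) / total)


print(steps_per_epoch)               # 5625
print(total_steps)                   # 16875
print(linear_decay_lr(0))            # 0.0003
print(linear_decay_lr(total_steps))  # 0.0
```

Under these assumptions the learning rate is close to 1.5e-4 halfway through training.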

Training Environment

| Setting | Specification |
|---|---|
| Hardware | NVIDIA Tesla T4 GPU |
| Platform | Google Colab |
| Framework | PyTorch |
| Library | Hugging Face Transformers |
| Runtime | ~4 hours |

Evaluation

Metrics

| Metric | Value |
|---|---|
| eval_loss | 0.0744 |
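To give the loss a more intuitive scale, it can be converted to perplexity. This sketch assumes that eval_loss is the mean per-token cross-entropy in nats, which is what PyTorch's CrossEntropyLoss returns by default; the card does not confirm how the value was averaged.

```python
import math

eval_loss = 0.0744  # value reported in the metrics table

# Perplexity is the exponential of the mean per-token cross-entropy.
perplexity = math.exp(eval_loss)
print(f"{perplexity:.4f}")  # 1.0772
```

A perplexity near 1 suggests the model is rarely uncertain about the next token on the validation set, which is plausible for a task where most output characters are copied from the input.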

Qualitative Example

| Input | Output |
|---|---|
| helloeveryonewelcome to todayssession | Hello everyone, welcome to today's session. |
| this isatest ofthe normalizationmodel | This is a test of the normalization model. |
| itwasagoodday today | It was a good day today. |

๐Ÿง‘โ€๐Ÿ’ป Model Card Authors

Pradhap Rajamani

Contact

LinkedIn: https://www.linkedin.com/in/pradhap-rajamani/
GitHub: https://github.com/pradhap1125

Repo

https://github.com/pradhap1125/t5-small-sentence-validator

Model size: 60.5M params (Safetensors, tensor type F32)