🧾 Model Card — T5-Small Sentence Validator

This is the model card of a 🤗 Transformers model fine-tuned for text normalization — restoring punctuation, capitalization, and proper spacing from raw transcripts or unformatted text.

👨‍💻 Developed by

Pradhap Rajamani

Funded by: Independent academic project at Purdue University (Master’s in Computer Science)
Shared by: Pradhap Rajamani

🧠 Model Details

Model type: Sequence-to-sequence text generation model (T5-small architecture)
Language(s): English 🇺🇸
License: Apache 2.0
Finetuned from model: t5-small

🔗 Model Sources

Repository: Hugging Face Hub – pradhap1125/t5-small-sentence-validator
Base Paper: “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”

🎯 Uses

Direct Use

Automatically adds missing punctuation, capitalization, and spacing to unformatted English text.

Example:

Input: normalize: helloeveryonewelcome to todayssession

Output: Hello everyone, welcome to today's session.

Downstream Use

Text cleanup for ASR and transcripts

Preprocessing for NLP tasks (NER, summarization, etc.)

Normalization of chatbot or conversation logs
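For longer transcripts, one possible approach is to split the text into smaller pieces and normalize each piece separately. The sketch below is an illustration under stated assumptions, not part of the released model: `chunk_words`, `normalize_transcript`, the `max_chars` limit of 200, and the `normalize_fn` callable (any function that sends one prefixed string to the model and returns its output) are all hypothetical names. Only the "normalize: " prefix comes from the Direct Use example above.

```python
from typing import Callable, List

TASK_PREFIX = "normalize: "  # prefix taken from the Direct Use example


def chunk_words(text: str, max_chars: int = 200) -> List[str]:
    """Split text on whitespace into chunks of at most max_chars characters.

    A single token longer than max_chars is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if current and len(candidate) > max_chars:
            # Adding this word would overflow the chunk, so start a new one.
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def normalize_transcript(
    text: str,
    normalize_fn: Callable[[str], str],  # hypothetical wrapper around the model
    max_chars: int = 200,                # assumed limit, not a documented value
) -> str:
    """Apply normalize_fn to each prefixed chunk and join the results."""
    outputs = [normalize_fn(TASK_PREFIX + chunk) for chunk in chunk_words(text, max_chars)]
    return " ".join(outputs)
```

Because chunk boundaries fall on existing whitespace, a sentence may be split mid-way, so punctuation near chunk edges deserves extra review.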

🚫 Out-of-Scope Use

Grammar correction beyond punctuation

Rewriting semantics or paraphrasing

⚠️ Limitations and Biases

Limitation	Description
Ambiguous punctuation	May insert or omit commas/periods differently from human editors
Case sensitivity	Occasionally fails to capitalize proper nouns
Domain limitation	Trained primarily on general English text (Wikipedia-style)

Recommendation:

Always perform a quick review before deploying results in production systems.
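Part of that review can be automated. Since the model is only meant to change punctuation, capitalization, and spacing, one simple check is to compare input and output after stripping exactly those features. The following is a minimal sketch; the function names `skeleton` and `needs_review` are hypothetical, and it only handles ASCII punctuation. Pass the raw text without any task prefix.

```python
import string


def skeleton(text: str) -> str:
    """Lowercase the text and drop whitespace and ASCII punctuation."""
    drop = set(string.punctuation) | set(string.whitespace)
    return "".join(ch for ch in text.lower() if ch not in drop)


def needs_review(raw: str, normalized: str) -> bool:
    """Return True when the output differs from the input in more than
    punctuation, casing, or spacing (for example, a changed or dropped word)."""
    return skeleton(raw) != skeleton(normalized)
```

Outputs flagged by `needs_review` changed the underlying characters and are good candidates for manual inspection. Outputs that pass can still contain misplaced punctuation.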

🚀 How to Use


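from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned checkpoint and its tokenizer from the Hugging Face Hub.
model_name = "pradhap1125/t5-small-sentence-validator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# The Direct Use example prepends the "normalize: " task prefix, so it is used here too.
text = "normalize: helloeveryonewelcome to todayssession"
inputs = tokenizer(text, return_tensors="pt")

# Generate the normalized sentence and decode it back to plain text.
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))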

🧩 Training Details

🗂 Dataset

A custom English dataset derived from Wikipedia sentences. Text was noisified by removing punctuation, capitalization, and spacing to simulate ASR output.

Field	Description
Source	English Wikipedia corpus
Preprocessing	Removed punctuation and spacing; altered casing
Training Samples	~90,000
Validation Samples	~2,5000
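The noising step described above could look roughly like the sketch below. This is an illustration, not the script used to build the dataset: the function name `noisify`, the `drop_space_prob` parameter, lowercasing as the casing change, and ASCII-only punctuation removal are all assumptions.

```python
import random
import string


def noisify(sentence: str, drop_space_prob: float = 0.5, seed: int = 0) -> str:
    """Turn a clean sentence into a noisy model input.

    Removes ASCII punctuation, lowercases the text, and deletes each
    space between words with probability drop_space_prob.
    """
    rng = random.Random(seed)  # seeded so the same input gives the same output
    no_punct = sentence.translate(str.maketrans("", "", string.punctuation))
    words = no_punct.lower().split()
    if not words:
        return ""
    pieces = [words[0]]
    for word in words[1:]:
        # Keep the space only when the random draw is at or above the drop probability.
        if rng.random() >= drop_space_prob:
            pieces.append(" ")
        pieces.append(word)
    return "".join(pieces)
```

Each training pair would then use the noisy string as the input and the original sentence as the target.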

🧮 Hyperparameters

Parameter	Value
Model	T5-small
Epochs	3
Batch Size	16
Learning Rate	3e-4
Weight Decay	0.01
Optimizer	AdamW
Scheduler	Linear decay
Loss Function	CrossEntropyLoss
Evaluation Metric	Token-level accuracy, loss
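Token-level accuracy can be computed in several ways. One common approach, sketched here, counts matching token IDs while skipping padded label positions. It assumes predictions and labels are already aligned lists of token IDs and that padding in the labels is marked with -100, the default `ignore_index` of PyTorch's `CrossEntropyLoss`. The function name `token_accuracy` is hypothetical, and the exact metric code used during training is not documented in this card.

```python
from typing import Sequence

IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss ignores by default


def token_accuracy(
    predictions: Sequence[Sequence[int]],
    labels: Sequence[Sequence[int]],
) -> float:
    """Fraction of non-ignored label positions where the prediction matches."""
    correct = 0
    total = 0
    for pred_row, label_row in zip(predictions, labels):
        for pred, label in zip(pred_row, label_row):
            if label == IGNORE_INDEX:
                continue  # padded position: not part of the target sequence
            total += 1
            correct += int(pred == label)
    return correct / total if total else 0.0
```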

Training Environment

Setting	Specification
Hardware	NVIDIA Tesla T4 GPU
Platform	Google Colab
Framework	PyTorch
Library	Hugging Face Transformers
Runtime	~4 hours

Evaluation

Metrics

Metric	Value
eval_loss	0.0744
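If eval_loss is the mean per-token cross-entropy in nats (an assumption, though it matches the CrossEntropyLoss listed in the hyperparameters), it can be converted to a per-token perplexity by exponentiation:

```python
import math

eval_loss = 0.0744  # value reported in the metrics table
perplexity = math.exp(eval_loss)  # assumes mean per-token cross-entropy in nats
print(f"perplexity = {perplexity:.4f}")  # perplexity = 1.0772
```

A perplexity close to 1 means the model assigns high probability to the reference tokens on the validation set. It does not by itself measure punctuation or capitalization accuracy.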

Qualitative Example

Input	Output
helloeveryonewelcome to todayssession	Hello everyone, welcome to today's session.
this isatest ofthe normalizationmodel	This is a test of the normalization model.
itwasagoodday today	It was a good day today.