Model Card: T5-Small Sentence Validator
This is the model card of a 🤗 Transformers model fine-tuned for text normalization: restoring punctuation, capitalization, and proper spacing in raw transcripts or unformatted text.
Developed by
Pradhap Rajamani
Funded by: Independent academic project, Purdue University (Master's in Computer Science)
Shared by: Pradhap Rajamani
Model Details
- Model type: Sequence-to-sequence text generation model (T5-small architecture)
- Language(s): English
- License: Apache 2.0
- Finetuned from model:
t5-small
Model Sources
- Repository: Hugging Face Hub, pradhap1125/t5-small-sentence-validator
- Base Paper: "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
Uses
Direct Use
Automatically adds missing punctuation, capitalization, and spacing to unformatted English text.
Example:
Input: normalize: helloeveryonewelcome to todayssession
Output: Hello everyone, welcome to today's session.
Downstream Use
- Text cleanup for ASR output and transcripts
- Preprocessing for NLP tasks (NER, summarization, etc.)
- Normalization of chatbot or conversation logs
Out-of-Scope Use
- Grammar correction beyond punctuation
- Rewriting semantics or paraphrasing
Limitations and Biases
| Limitation | Description |
|---|---|
| Ambiguous punctuation | May insert or omit commas/periods differently from human editors |
| Case sensitivity | Occasionally fails to capitalize proper nouns |
| Domain limitation | Trained primarily on general English text (Wikipedia-style) |
Recommendation:
Review the model's output before using it in production systems.
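
One inexpensive automated check can complement a manual review. Since the model is only meant to add punctuation, capitalization, and spacing, the letters and digits of the output should match those of the input. The sketch below is an illustration of that idea, not part of the released model; the helper names `content_key` and `preserves_content` are hypothetical.

```python
def content_key(text: str) -> str:
    """Keep only letters and digits, lowercased, so that punctuation,
    casing, and spacing differences are ignored."""
    return "".join(ch.lower() for ch in text if ch.isalnum())


def preserves_content(raw: str, normalized: str) -> bool:
    """Return True when the normalized text contains exactly the same
    alphanumeric characters, in the same order, as the raw input.

    `raw` is assumed to be the text without any task prefix.
    """
    return content_key(raw) == content_key(normalized)


if __name__ == "__main__":
    raw = "helloeveryonewelcome to todayssession"
    out = "Hello everyone, welcome to today's session."
    print(preserves_content(raw, out))
```

Outputs that fail this check have changed, dropped, or added words, which falls under the out-of-scope behaviors listed above and is a reasonable trigger for manual inspection.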
How to Use
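from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned checkpoint and its tokenizer from the Hugging Face Hub
model_name = "pradhap1125/t5-small-sentence-validator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Raw, unformatted input text.
# Note: the Direct Use example prefixes its input with "normalize: ";
# try adding that prefix here if the output is not as expected.
text = "helloeveryonewelcome to todayssession"

# Tokenize, generate up to 64 new tokens, and decode the result
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))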
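
For more than one sentence at a time, a batched variant may be convenient. The following is a minimal sketch, not the author's reference usage: the helper names `with_prefix` and `normalize_batch` are hypothetical, and the `"normalize: "` prefix is an assumption taken from the Direct Use example (pass `prefix=""` if the model was trained without it).

```python
from typing import List

# Assumption: task prefix as shown in the Direct Use example.
TASK_PREFIX = "normalize: "


def with_prefix(texts: List[str], prefix: str = TASK_PREFIX) -> List[str]:
    """Prepend the task prefix to each text unless it is already present."""
    return [t if t.startswith(prefix) else prefix + t for t in texts]


def normalize_batch(
    texts: List[str],
    model_name: str = "pradhap1125/t5-small-sentence-validator",
    prefix: str = TASK_PREFIX,
    max_new_tokens: int = 64,
) -> List[str]:
    """Run the model on a list of raw texts and return the decoded outputs."""
    # Imported lazily so the helper above can be used without these packages.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    model.eval()

    # Pad to the longest input in the batch so it can be stacked into one tensor.
    inputs = tokenizer(
        with_prefix(texts, prefix),
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Calling `normalize_batch(["itwasagoodday today", "this isatest ofthe normalizationmodel"])` downloads the checkpoint on first use and returns one output string per input.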
Training Details
Dataset
A custom English dataset derived from Wikipedia sentences. Text was noisified by removing punctuation, capitalization, and spacing to simulate ASR output.
| Field | Description |
|---|---|
| Source | English Wikipedia corpus |
| Preprocessing | Removed punctuation, altered casing, and removed spacing |
| Training Samples | ~90,000 |
| Validation Samples | ~2,5000 |
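
The preprocessing described above can be illustrated with a short sketch. This is not the author's actual data pipeline: the function names, the choice to lowercase everything, the random merging of adjacent words, and the `"normalize: "` prefix on the input field are all assumptions made for illustration.

```python
import random
import string
from typing import Dict, Optional


def noisify(sentence: str, drop_space_prob: float = 0.5,
            seed: Optional[int] = None) -> str:
    """Simulate unformatted text: strip punctuation, lowercase, and merge
    adjacent words by dropping the space between them with the given
    probability."""
    rng = random.Random(seed)
    # Remove ASCII punctuation and lowercase (assumed form of "altered casing").
    table = str.maketrans("", "", string.punctuation)
    words = sentence.translate(table).lower().split()
    if not words:
        return ""
    merged = [words[0]]
    for word in words[1:]:
        if rng.random() < drop_space_prob:
            merged[-1] += word      # drop the space before this word
        else:
            merged.append(word)     # keep the space
    return " ".join(merged)


def make_pair(sentence: str, seed: Optional[int] = None) -> Dict[str, str]:
    """Build one (noisy input, clean target) training example."""
    return {
        "input": "normalize: " + noisify(sentence, seed=seed),  # assumed prefix
        "target": sentence,
    }


if __name__ == "__main__":
    print(make_pair("Hello everyone, welcome to today's session.", seed=0))
```

Fixing `seed` makes the noise reproducible, which helps when regenerating the same training and validation splits.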
Hyperparameters
| Parameter | Value |
|---|---|
| Model | T5-small |
| Epochs | 3 |
| Batch Size | 16 |
| Learning Rate | 3e-4 |
| Weight Decay | 0.01 |
| Optimizer | AdamW |
| Scheduler | Linear decay |
| Loss Function | CrossEntropyLoss |
| Evaluation Metric | Token-level accuracy, loss |
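
To make the schedule concrete, the sketch below estimates the number of optimizer steps and the learning rate at a given step under linear decay. It rests on assumptions not stated in the card: exactly 90,000 training samples (the card says approximately), no gradient accumulation, a single device, and no warmup. The function names are hypothetical.

```python
import math

# Values from the hyperparameter table; the sample count is approximate.
TRAIN_SAMPLES = 90_000
BATCH_SIZE = 16
EPOCHS = 3
PEAK_LR = 3e-4


def total_steps(samples: int = TRAIN_SAMPLES, batch_size: int = BATCH_SIZE,
                epochs: int = EPOCHS) -> int:
    """Optimizer steps over the whole run (assumes no gradient accumulation)."""
    return math.ceil(samples / batch_size) * epochs


def linear_decay_lr(step: int, total: int, peak_lr: float = PEAK_LR,
                    warmup_steps: int = 0) -> float:
    """Learning rate at `step`: optional linear warmup, then linear decay to 0."""
    if warmup_steps and step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total - step)
    return peak_lr * remaining / max(1, total - warmup_steps)


if __name__ == "__main__":
    n = total_steps()
    print(n)                            # steps for 3 epochs at batch size 16
    print(linear_decay_lr(0, n))        # learning rate at the first step
    print(linear_decay_lr(n // 2, n))   # roughly half the peak rate
    print(linear_decay_lr(n, n))        # decayed to zero at the end
```

Under these assumptions the run has 5,625 steps per epoch and 16,875 steps in total.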
Training Environment
| Setting | Specification |
|---|---|
| Hardware | NVIDIA Tesla T4 GPU |
| Platform | Google Colab |
| Framework | PyTorch |
| Library | Hugging Face Transformers |
| Runtime | ~4 hours |
Evaluation
Metrics
| Metric | Value |
|---|---|
| eval_loss | 0.0744 |
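
For intuition, the evaluation loss can be converted to perplexity. The sketch below assumes that `eval_loss` is the mean per-token cross-entropy in nats, which is the usual convention for this loss but is not stated explicitly in the card.

```python
import math

EVAL_LOSS = 0.0744  # value reported in the metrics table


def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(mean_cross_entropy)


if __name__ == "__main__":
    print(f"{perplexity(EVAL_LOSS):.3f}")
```

A perplexity close to 1 (here about 1.08) means that, on the validation data, the model is rarely uncertain about the next target token.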
Qualitative Example
| Input | Output |
|---|---|
| helloeveryonewelcome to todayssession | Hello everyone, welcome to today's session. |
| this isatest ofthe normalizationmodel | This is a test of the normalization model. |
| itwasagoodday today | It was a good day today. |
Model Card Authors
Pradhap Rajamani
Contact
- LinkedIn: https://www.linkedin.com/in/pradhap-rajamani/
- GitHub: https://github.com/pradhap1125