---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- AAC
- assistive-technology
- spoken
datasets:
- jfleg
- daily_dialog
---
# t5-small-spoken-typo
This model is a fine-tuned version of T5-small, adapted for correcting typographical errors and missing spaces in text. It has been trained on a combination of spoken corpora, including DailyDialog and BNC, with a focus on short utterances common in conversational English.
## Task
The primary task of this model is **Text Correction**, with a focus on:
- **Sentence Correction**: Enhancing readability by correcting sentences with missing spaces or typographical errors.
- **Text Normalization**: Standardizing text by converting informal or irregular forms into more grammatically correct formats, mainly for sentences typed without spaces.
This model is intended to support processing of user-generated content where informal language, abbreviations, and typos are prevalent, improving text clarity for further processing or human reading.
## Usage
```python
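from happytransformer import HappyTextToText, TTSettings

# NOTE: this ID is a general grammar-correction checkpoint; to run
# t5-small-spoken-typo, substitute that model's Hub repository ID.
happy_tt = HappyTextToText("T5", "vennify/t5-base-grammar-correction")
# Beam search with 5 beams; allow outputs as short as one token
args = TTSettings(num_beams=5, min_length=1)
# Add the prefix "grammar: " before each input.
# The input below is compressed (no spaces) and contains a typo.
result = happy_tt.generate_text("grammar: Hihowareyoudoingtaday?.", args=args)
print(result.text)  # Prints the corrected sentence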
```
# Model Details
## Model Description
The `t5-small-spoken-typo` model is specifically designed to tackle the challenges of text correction within user-generated content, particularly in short, conversation-like sentences. It restores missing spaces, removes unnecessary punctuation, corrects typos (typos are deliberately introduced into the training data so the model learns to fix them), and normalizes text by replacing informal contractions and abbreviations with their full forms.
It has been trained on:
- BNC 2014 Spoken
- [Daily Dialog](https://huggingface.co/datasets/daily_dialog)
Typos were then injected from a range of sources:
- **Typo lists, Birkbeck, etc.**: These datasets contain lists of commonly misspelled words, making them invaluable for training models to recognize and correct spelling errors.
- Find these resources [here](https://www.dcs.bbk.ac.uk/~ROGER/corpora.html).
- **TOEFL Spell**: A dataset of spelling annotations for English language learner essays written for TOEFL exams.
- Find this [here](https://github.com/EducationalTestingService/TOEFL-Spell/tree/master)
Finally, compressed versions of the sentences (i.e. with spaces removed) were created, both for the correct sentences and for the ones with typos.
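The typo-injection and compression steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the `MISSPELLINGS` table, the function names, and the `rate` and `seed` parameters are all assumptions made for the example, and a real misspelling list such as Birkbeck would need to be parsed into this shape first.

```python
import random

# Hypothetical misspelling table: correct word -> known misspellings.
# Invented for illustration; not taken from the Birkbeck or TOEFL-Spell data.
MISSPELLINGS = {
    "today": ["taday", "todya"],
    "doing": ["doign"],
    "you": ["yuo"],
}


def inject_typos(sentence, table, rate, rng):
    """Swap each word found in `table` for a misspelling with probability `rate`."""
    words = []
    for word in sentence.split():
        options = table.get(word.lower())
        if options and rng.random() < rate:
            words.append(rng.choice(options))
        else:
            words.append(word)
    return " ".join(words)


def compress(sentence):
    """Remove all spaces from the sentence."""
    return sentence.replace(" ", "")


def make_pairs(sentence, table, rate=0.5, seed=0):
    """Return (noisy_input, clean_target) pairs for the clean and typo'd variants."""
    rng = random.Random(seed)
    typod = inject_typos(sentence, table, rate, rng)
    return [(compress(sentence), sentence), (compress(typod), sentence)]
```

Words with attached punctuation (for example `you?`) are not matched by this simple lookup, and a replaced word loses its original capitalisation; a fuller version would need to handle both.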
Next, we would like to make use of C4 200M, or at least a subset of it.
## Developed by:
- **Name**: Will Wade
- **Affiliation**: Research & Innovation Manager, Occupational Therapist, Ace Centre, UK
- **Contact Info**: wwade@acecentre.org.uk
## Model type:
- Language model fine-tuned for text correction tasks.
## Language(s) (NLP):
- English (`en`)
## License:
- apache-2.0
## Parent Model:
- The model is fine-tuned from `t5-small`.
## Resources for more information:
- [GitHub Repo](https://github.com/willwade/dailyDialogCorrections/)
# Uses
## Direct Use
This model can be directly applied for correcting text in various applications, including but not limited to, enhancing the quality of user-generated content, preprocessing text for NLP tasks, and supporting assistive technologies.
## Out-of-Scope Use
The model might not perform well on text significantly longer than the training examples (2-5 words), on highly formal documents, or on languages other than English. Use in sensitive contexts should be approached with caution due to potential biases. **Our typical use case is AAC users, i.e. people who use technology to communicate face to face with others.**
# Bias, Risks, and Limitations
The model may inherit biases present in its training data, potentially reflecting or amplifying societal stereotypes. Given its training on conversational English, it may not generalize well to formal text or other dialects and languages.
## Recommendations
Users are encouraged to critically assess the model's output, especially when used in sensitive or impactful contexts. Further fine-tuning with diverse and representative datasets could mitigate some limitations.
# Training Details
## Training Data
The model was trained on a curated subset of the DailyDialog and BNC 2014 Spoken corpora, focusing on sentences 2-5 words in length, with typos introduced and spaces removed to make the model robust on text correction tasks. You can see the pre-processing code [here](https://github.com/willwade/dailyDialogCorrections/tree/main).
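As a rough illustration of the length criterion, the snippet below keeps only sentences of 2-5 words. It is a sketch under the assumption that "words" means whitespace-separated tokens; the function names are hypothetical and the linked repository may count words differently.

```python
def is_short_utterance(sentence, min_words=2, max_words=5):
    """True when the sentence has between min_words and max_words whitespace-separated tokens."""
    return min_words <= len(sentence.split()) <= max_words


def filter_short(sentences, min_words=2, max_words=5):
    """Keep only the sentences that fall inside the word-count window."""
    return [s for s in sentences if is_short_utterance(s, min_words, max_words)]
```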
## Training Procedure
### Preprocessing
Sentences were stripped of apostrophes and commas, spaces were removed, and typos were introduced programmatically to simulate common errors in user-generated content.
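A minimal sketch of the punctuation stripping and space removal might look like the following. The function names are assumptions made for illustration, and the handling of curly apostrophes is a guess at what real-world input needs rather than something the model card specifies. Typo introduction is left out of this sketch.

```python
def strip_punctuation(sentence):
    """Remove apostrophes (straight and curly) and commas."""
    for ch in ("'", "\u2019", ","):
        sentence = sentence.replace(ch, "")
    return sentence


def remove_spaces(sentence):
    """Drop all whitespace so the words run together."""
    return "".join(sentence.split())


def to_model_input(sentence):
    """Apply both steps to turn a clean sentence into a compressed input string."""
    return remove_spaces(strip_punctuation(sentence))
```

For example, `to_model_input("Well, I don't know")` yields `WellIdontknow`.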
### Speeds, Sizes, Times
- Training was conducted on Google Colab and took approximately 11 hours to complete.
# Evaluation
## Testing Data, Factors & Metrics
### Testing Data
Evaluation was performed on a held-out test set derived from the same corpora and similar sentences, ensuring a diverse range of sentence structures and error types were represented.
### Metrics
Performance was measured using the accuracy of space insertion and typo correction alongside qualitative assessments of text normalisation.
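The card does not define these accuracy measures precisely, so the following is only one plausible reading: sentence-level exact match for overall correction, and a separate check that the predicted word boundaries fall at the same character offsets as in the reference. Both function names and both definitions are assumptions.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions identical to their reference (ignoring outer whitespace)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def boundaries(sentence):
    """Character offsets (spaces excluded) at which a word boundary occurs."""
    offsets, pos = [], 0
    for word in sentence.split()[:-1]:
        pos += len(word)
        offsets.append(pos)
    return offsets


def space_insertion_accuracy(predictions, references):
    """Fraction of predictions whose word boundaries match the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(boundaries(p) == boundaries(r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Keeping the two measures separate shows whether a failure comes from a misplaced space or from an uncorrected typo: a prediction such as `how are yuo` has the right boundaries but is not an exact match.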
## Results
The model demonstrates high efficacy in correcting short, erroneous sentences, with particular strength in handling real-world, conversational text.
# Environmental Impact
The training was conducted with an emphasis on efficiency and minimising carbon emissions. Users leveraging cloud compute resources are encouraged to consider the environmental impact of large-scale model training and inference.
# Technical Specifications
## Model Architecture and Objective
The model follows the T5 architecture, fine-tuned for the specific task of text correction with a focus on typo correction and space insertion.
## Compute Infrastructure
- **Hardware**: T4 GPU (Google Colab)
- **Software**: PyTorch 1.8.1 with Transformers 4.8.2
# Citation
**BibTeX:**
```bibtex
@misc{t5_small_spoken_typo_2021,
title={T5-small Spoken Typo Corrector},
author={Will Wade},
year={2021},
howpublished={\url{https://huggingface.co/your-username/t5-small-spoken-typo}},
}
```