---
library_name: transformers
base_model: deberta-v3-xsmall-quality-pretrain
tags:
- generated_from_trainer
model-index:
- name: deberta-v3-xsmall-quality
  results: []
license: mit
datasets:
- agentlans/text-quality
- allenai/c4
- HuggingFaceFW/fineweb-edu
- monology/pile-uncopyrighted
- agentlans/common-crawl-sample
- agentlans/wikipedia-paragraphs
language:
- en
pipeline_tag: text-classification
---

# English Text Quality Classifier

The **deberta-v3-xsmall-quality** model evaluates text quality using a composite score that combines the results of multiple classifiers. This method provides a more thorough assessment than traditional educational metrics, making it suitable for a wide range of NLP and AI applications.

## Intended Uses & Limitations

**Intended Uses**:

- Quality assessment of text across various domains.
- Enhancing NLP applications by providing a robust measure of text quality.
- Supporting research and development in AI by offering insights into text quality metrics.

**Limitations**:

- The model's performance may vary depending on the specific characteristics of the input text.
- The model is a black box, so it is hard to explain why one text is scored as higher quality than another.
- Consider the context in which the model is applied, as different domains may have unique quality requirements.
- The model may still be biased towards non-fiction and educational genres.

## Training and Evaluation Data

The model was trained on the [agentlans/text-quality](https://huggingface.co/datasets/agentlans/text-quality) dataset comprising **100,000 sentences** sourced from five distinct datasets, with **20,000 sentences** drawn from each of the following:

1. **allenai/c4**
2. **HuggingFaceFW/fineweb-edu**
3. **monology/pile-uncopyrighted**
4. **agentlans/common-crawl-sample**
5. **agentlans/wikipedia-paragraphs**

This diverse dataset enables the model to generalize well across different text types and domains. 90% of the rows were used for training and the remaining 10% for evaluation.

## How to use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "agentlans/deberta-v3-xsmall-quality"

# Put the model on the GPU if available, otherwise the CPU
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)


def quality(text):
    """Processes the text using the model and returns its logits.
    In this case, the logit is interpreted as the combined quality score for that text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits.squeeze().cpu()
    return logits.tolist()


# Example usage
text = [
    "Congratulations! You've won a $1,000 gift card! Click here to claim your prize now!!!",
    "Page 1 2 3 4 5 Next Last>>",
    "Urgent: Your account has been compromised! Click this link to verify your identity and secure your account immediately!!!",
    "Today marks a significant milestone in our journey towards sustainability! 🌍✨ We’re excited to announce our partnership with local organizations to plant 10,000 trees in our community this fall. "
    "Join us in making a positive impact on our environment!",
    "In recent years, the impact of climate change has become increasingly evident, affecting ecosystems and human livelihoods across the globe.",
]

result = quality(text)
print([round(x, 2) for x in result])
# Estimated quality for each text: [-0.89, -0.76, -0.7, 0.3, 1.64]
```
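For larger corpora, the same scorer can be applied in batches and the resulting scores used to filter out low-quality text. The sketch below is illustrative only: it reuses the `quality` helper defined above, and the `score_corpus` name, the batch size of 32, and the cutoff of 0.0 are arbitrary choices rather than recommendations that come with the model.

```python
# A minimal batch-scoring sketch. It assumes the `quality` helper defined above;
# the batch size and the 0.0 cutoff are arbitrary, illustrative choices.
def score_corpus(texts, batch_size=32):
    scores = []
    for i in range(0, len(texts), batch_size):
        out = quality(texts[i : i + batch_size])
        # quality() returns a single float for one text and a list for several
        scores.extend(out if isinstance(out, list) else [out])
    return scores


corpus = ["First candidate paragraph...", "Second candidate paragraph..."]
scores = score_corpus(corpus)
kept = [t for t, s in zip(corpus, scores) if s > 0.0]  # keep the higher-scoring texts
```

Because the score is a continuous regression output, any cutoff should be chosen by inspecting scores on a sample of the target corpus.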
## Training Procedure

This section lists the training hyperparameters, results, and framework versions.

### Training Hyperparameters

The following hyperparameters were used during training:

- **Learning Rate**: 5e-05
- **Training Batch Size**: 8
- **Evaluation Batch Size**: 8
- **Seed**: 42
- **Optimizer**: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- **Learning Rate Scheduler Type**: Linear
- **Number of Epochs**: 3.0

### Training Results

- **Loss**: 0.0924
- **MSE**: 0.0924
- **Num Input Tokens Seen**: 34,560,000

### Framework Versions

The model was developed using the following frameworks and libraries:

- Transformers 4.45.1
- Pytorch 2.4.1+cu121
- Datasets 3.0.1
- Tokenizers 0.20.0
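For reference, the hyperparameters above map roughly onto a standard `Trainer` fine-tuning run. The sketch below is not the original training script: the full hub path of the base model, the dataset column names (`text`, `label`), the 90/10 split call, and the single-label regression setup are assumptions made for illustration.

```python
# A minimal fine-tuning sketch consistent with the hyperparameters listed above.
# NOT the original training script: the base-model hub path, the "text"/"label"
# column names, and the regression setup are assumptions.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

base = "agentlans/deberta-v3-xsmall-quality-pretrain"  # assumed full path of the base model in the card metadata
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(
    base, num_labels=1, problem_type="regression"  # single continuous quality score
)

# Assumed to expose "text" and float "label" columns in a "train" split
dataset = load_dataset("agentlans/text-quality")


def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)


tokenized = dataset["train"].map(tokenize, batched=True)
split = tokenized.train_test_split(test_size=0.1, seed=42)  # 90/10 split as described above

args = TrainingArguments(
    output_dir="deberta-v3-xsmall-quality",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3.0,
    lr_scheduler_type="linear",
    seed=42,
    eval_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```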