grounded-ai/phi3-toxicity-judge

Generated from Trainer


Jlonge4 committed on May 31

Commit c3829ba (1 parent: 69e911e)

Update README.md

Files changed (1): README.md (+28 -15)

@@ -11,26 +11,43 @@ model-index:
   results: []
 ---
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-# outputs
-This model is a fine-tuned version of [microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) on the None dataset.
-## Model description
-More information needed
-## Intended uses & limitations
-More information needed
-## Training and evaluation data
-More information needed
-## Training procedure
 ### Training hyperparameters
@@ -47,10 +64,6 @@ The following hyperparameters were used during training:
 - training_steps: 110
 - mixed_precision_training: Native AMP
-### Training results
 ### Framework versions
 - PEFT 0.11.1

   results: []
 ---
+## Toxicity Classification Performance
+Our merged model performs competitively on the toxicity classification task: its F1 score of 0.87 matches GPT-3.5 Turbo and exceeds GPT-4 Turbo and Gemini Pro (0.83 each), while GPT-4 scores higher (0.91).
+### Classification Metrics
+```
+              precision    recall  f1-score   support
+           0       0.85      0.90      0.87       175
+           1       0.89      0.85      0.87       175
+    accuracy                           0.87       350
+   macro avg       0.87      0.87      0.87       350
+weighted avg       0.87      0.87      0.87       350
+```
+Our model reaches a precision of 0.85 for the toxic class (row 0 in the report) and 0.89 for the non-toxic class (row 1), with an overall accuracy of 0.87 on 350 evaluation examples split evenly between the two classes. The F1-score is 0.87 for both classes, so performance is balanced rather than favoring one class.
+### Comparison with Other Models
+| Model             | Precision | Recall | F1     |
+|-------------------|----------:|-------:|-------:|
+| Our Merged Model  | 0.85      | 0.90   | 0.87   |
+| GPT-4             | 0.91      | 0.91   | 0.91   |
+| GPT-4 Turbo       | 0.89      | 0.77   | 0.83   |
+| Gemini Pro        | 0.81      | 0.84   | 0.83   |
+| GPT-3.5 Turbo     | 0.93      | 0.83   | 0.87   |
+| Palm              | -         | -      | -      |
+| Claude V2         | -         | -      | -      |
+[1] Scores from arize/phoenix
+Compared to the other language models in the table, our merged model performs competitively at a much smaller size, with a precision of 0.85, a recall of 0.90, and an F1 score of 0.87.
+We will continue to refine the merged model to improve its performance on model-based toxicity evaluation tasks.
+Citations: [1] https://docs.arize.com/phoenix/evaluation/how-to-evals/running-pre-tested-evals/retrieval-rag-relevance
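As a sanity check on the figures in the added README section, the F1 scores can be recomputed from the listed precision and recall values using F1 = 2PR / (P + R). The sketch below is not part of the commit: the helper names `f1_from_pr` and `max_f1_gap` and the 0.01 tolerance are assumptions made for illustration. The tolerance is loose because the published inputs are already rounded to two decimals.

```python
# Recompute F1 from the rounded precision/recall values quoted in the README diff.
# All numbers are copied from the tables above; the helper names are hypothetical.

def f1_from_pr(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# name -> (precision, recall, F1 as reported in the README)
REPORTED = {
    "merged model, class 0": (0.85, 0.90, 0.87),
    "merged model, class 1": (0.89, 0.85, 0.87),
    "GPT-4": (0.91, 0.91, 0.91),
    "GPT-4 Turbo": (0.89, 0.77, 0.83),
    "Gemini Pro": (0.81, 0.84, 0.83),
    "GPT-3.5 Turbo": (0.93, 0.83, 0.87),
}


def max_f1_gap(rows: dict) -> float:
    """Largest absolute gap between a recomputed and a reported F1."""
    return max(abs(f1_from_pr(p, r) - f1) for p, r, f1 in rows.values())


if __name__ == "__main__":
    for name, (p, r, f1) in REPORTED.items():
        print(f"{name:24s} recomputed={f1_from_pr(p, r):.3f} reported={f1:.2f}")
    print(f"largest gap: {max_f1_gap(REPORTED):.4f}")
```

With the rounded inputs, every recomputed value lands within 0.01 of the reported F1; for example, class 0 gives about 0.874 against a reported 0.87.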
 ### Training hyperparameters
 - training_steps: 110
 - mixed_precision_training: Native AMP
 ### Framework versions
 - PEFT 0.11.1