Jlonge4 commited on
Commit
c3829ba
1 Parent(s): 69e911e

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +28 -15
README.md CHANGED
@@ -11,26 +11,43 @@ model-index:
11
  results: []
12
  ---
13
 
14
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
15
- should probably proofread and complete it, then remove this comment. -->
16
 
17
- # outputs
18
 
19
- This model is a fine-tuned version of [microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) on the None dataset.
20
 
21
- ## Model description
 
22
 
23
- More information needed
 
24
 
25
- ## Intended uses & limitations
 
 
 
26
 
27
- More information needed
28
 
29
- ## Training and evaluation data
30
 
31
- More information needed
 
 
 
 
 
 
 
 
 
32
 
33
- ## Training procedure
 
 
 
 
34
 
35
  ### Training hyperparameters
36
 
@@ -47,10 +64,6 @@ The following hyperparameters were used during training:
47
  - training_steps: 110
48
  - mixed_precision_training: Native AMP
49
 
50
- ### Training results
51
-
52
-
53
-
54
  ### Framework versions
55
 
56
  - PEFT 0.11.1
 
11
  results: []
12
  ---
13
 
14
+ ## Toxicity Classification Performance
 
15
 
16
+ Our merged model demonstrates exceptional performance on the toxicity classification task, outperforming several state-of-the-art language models.
17
 
18
+ ### Classification Metrics
19
 
20
+ ```
21
+ precision recall f1-score support
22
 
23
+ 0 0.85 0.90 0.87 175
24
+ 1 0.89 0.85 0.87 175
25
 
26
+ accuracy 0.87 350
27
+ macro avg 0.87 0.87 0.87 350
28
+ weighted avg 0.87 0.87 0.87 350
29
+ ```
30
 
31
+ Our model achieves an impressive precision of 0.85 for the toxic class and 0.89 for the non-toxic class, with a high overall accuracy of 0.87. The balanced F1-scores of 0.87 for both classes demonstrate the model's ability to handle this binary classification task effectively.
32
 
33
+ ### Comparison with Other Models
34
 
35
+ | Model | Precision | Recall | F1 |
36
+ |-------------------|----------:|-------:|-------:|
37
+ | Our Merged Model | 0.85 | 0.90 | 0.87 |
38
+ | GPT-4 | 0.91 | 0.91 | 0.91 |
39
+ | GPT-4 Turbo | 0.89 | 0.77 | 0.83 |
40
+ | Gemini Pro | 0.81 | 0.84 | 0.83 |
41
+ | GPT-3.5 Turbo | 0.93 | 0.83 | 0.87 |
42
+ | Palm | - | - | - |
43
+ | Claude V2 | - | - | - |
44
+ [1] Scores from arize/phoenix
45
 
46
+ Compared to other language models, our merged model demonstrates competitive performance at a much smaller size, with a precision score of 0.85 and an F1 score of 0.87.
47
+
48
+ We will continue to refine and improve our merged model to achieve even better performance on model based toxicity evaluation tasks.
49
+
50
+ Citations: [1] https://docs.arize.com/phoenix/evaluation/how-to-evals/running-pre-tested-evals/retrieval-rag-relevance
51
 
52
  ### Training hyperparameters
53
 
 
64
  - training_steps: 110
65
  - mixed_precision_training: Native AMP
66
 
 
 
 
 
67
  ### Framework versions
68
 
69
  - PEFT 0.11.1