---
library_name: transformers
license: mit
base_model: agentlans/snowflake-arctic-embed-xs-zyda-2
tags:
- generated_from_trainer
- text-classification
- grammar-classification
metrics:
- accuracy
model-index:
- name: agentlans/snowflake-arctic-xs-grammar-classifier
  results:
  - task:
      type: text-classification
      name: Grammar Classification
    dataset:
      name: agentlans/grammar-classification
      type: agentlans/grammar-classification
    metrics:
    - type: accuracy
      value: 0.8724
      name: Accuracy
datasets:
- agentlans/grammar-classification
- liweili/c4_200m
language:
- en
pipeline_tag: text-classification
---

# snowflake-arctic-xs-grammar-classifier

This model is a fine-tuned version of [agentlans/snowflake-arctic-embed-xs-zyda-2](https://huggingface.co/agentlans/snowflake-arctic-embed-xs-zyda-2) for grammar classification. It achieves an accuracy of 0.8724 on the evaluation set.

## Model description

The snowflake-arctic-xs-grammar-classifier classifies English sentences as grammatically correct or incorrect. It is based on the snowflake-arctic-embed-xs-zyda-2 embedding model and was fine-tuned on a grammar classification dataset derived from C4 (the Colossal Clean Crawled Corpus).

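For lower-level use, the model can also be loaded with the generic `AutoTokenizer` / `AutoModelForSequenceClassification` classes. The sketch below is illustrative: it reads the predicted label from the model's `id2label` config rather than hard-coding label names.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "agentlans/snowflake-arctic-xs-grammar-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

text = "I absolutely loved this movie!"
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to probabilities and look up the label name in the config.
probs = logits.softmax(dim=-1).squeeze()
pred_id = int(probs.argmax())
print(model.config.id2label[pred_id], float(probs[pred_id]))
```
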
## Intended uses & limitations

This model is intended for classifying the grammatical correctness of English sentences. It can be used in various applications such as writing assistance tools, educational software, or content moderation systems.

### Usage example

```python
from transformers import pipeline
import torch

device = 0 if torch.cuda.is_available() else -1
classifier = pipeline(
    "text-classification",
    model="agentlans/snowflake-arctic-xs-grammar-classifier",
    device=device,
)

text = "I absolutely loved this movie!"
result = classifier(text)
print(result)  # [{'label': 'grammatical', 'score': 0.8963921666145325}]
```

### Example Classifications

| Status | Text | Explanation |
|:------:|------|-------------|
| ✔️ | I absolutely loved this movie! | Grammatically correct, clear sentence structure |
| ❌ | How do I shot web? | Grammatically incorrect, improper verb usage |
| ✔️ | Beware the Jabberwock, my son! | Poetic language, grammatically sound |
| ✔️ | Colourless green ideas sleep furiously. | Grammatically correct, though semantically nonsensical |
| ❌ | Has anyone really been far even as decided to use even go want to do look more like? | Completely incoherent and grammatically incorrect |

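The table can be reproduced with the pipeline from the usage example, since the text-classification pipeline accepts a list of sentences and scores them in one call; the labels and scores are whatever the model returns and are not hard-coded here.

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="agentlans/snowflake-arctic-xs-grammar-classifier",
)

examples = [
    "I absolutely loved this movie!",
    "How do I shot web?",
    "Beware the Jabberwock, my son!",
    "Colourless green ideas sleep furiously.",
    "Has anyone really been far even as decided to use even go want to do look more like?",
]

# One prediction per input sentence, in the same order as the inputs.
for sentence, prediction in zip(examples, classifier(examples)):
    print(f"{prediction['label']:>13}  {prediction['score']:.3f}  {sentence}")
```
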
### Limitations

The model's performance is limited by the quality and diversity of its training data. It may not perform well on specialized or domain-specific text, or on languages other than English. Additionally, it may struggle with complex grammatical structures or nuanced language use.

## Training and evaluation data

The model was trained on the [agentlans/grammar-classification](https://huggingface.co/datasets/agentlans/grammar-classification) dataset, which contains 600,000 examples for binary classification of grammatical correctness in English. This dataset is derived from a subset of the C4_200M Synthetic Dataset for Grammatical Error Correction.

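The training data can be pulled directly from the Hub with the `datasets` library. A minimal sketch follows; the split and column names are not asserted here and should be checked against the dataset card.

```python
from datasets import load_dataset

# Load the grammar-classification dataset from the Hugging Face Hub.
dataset = load_dataset("agentlans/grammar-classification")

# Inspect the available splits and columns before relying on specific names.
print(dataset)
print(dataset["train"][0])  # assumes a "train" split exists
```
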
## Training procedure

### Training hyperparameters

- Learning rate: 5e-05
- Batch size: 128
- Number of epochs: 10
- Optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-08
- Learning rate scheduler: linear

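For reference, a minimal `TrainingArguments` sketch using the hyperparameters listed above. This is a reconstruction for illustration, not the original training script; anything not listed above (output path, evaluation strategy, whether the batch size is per device) is an assumption.

```python
from transformers import TrainingArguments

# Hyperparameters from the list above; everything else is an assumed default.
training_args = TrainingArguments(
    output_dir="snowflake-arctic-xs-grammar-classifier",  # assumed output path
    learning_rate=5e-05,
    per_device_train_batch_size=128,  # assumed to be per device
    per_device_eval_batch_size=128,
    num_train_epochs=10,
    lr_scheduler_type="linear",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    eval_strategy="epoch",  # assumed: the results table shows one evaluation per epoch
)
```
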
<details>
<summary>📊 Detailed Training Results</summary>

| Training Loss | Epoch | Step | Validation Loss | Accuracy | Input Tokens Seen |
|:-------------:|:-----:|:-----:|:---------------:|:--------:|:-----------------:|
| 0.5192 | 1.0 | 3750 | 0.4722 | 0.7738 | 61&thinsp;440&thinsp;000 |
| 0.4875 | 2.0 | 7500 | 0.4521 | 0.7881 | 122&thinsp;880&thinsp;000 |
| 0.4590 | 3.0 | 11250 | 0.3895 | 0.8227 | 184&thinsp;320&thinsp;000 |
| 0.4351 | 4.0 | 15000 | 0.3981 | 0.8197 | 245&thinsp;760&thinsp;000 |
| 0.4157 | 5.0 | 18750 | 0.3690 | 0.8337 | 307&thinsp;200&thinsp;000 |
| 0.3955 | 6.0 | 22500 | 0.3260 | 0.8585 | 368&thinsp;640&thinsp;000 |
| 0.3788 | 7.0 | 26250 | 0.3267 | 0.8566 | 430&thinsp;080&thinsp;000 |
| 0.3616 | 8.0 | 30000 | 0.3192 | 0.8621 | 491&thinsp;520&thinsp;000 |
| 0.3459 | 9.0 | 33750 | 0.3017 | 0.8707 | 552&thinsp;960&thinsp;000 |
| 0.3382 | 10.0 | 37500 | 0.2971 | 0.8724 | 614&thinsp;400&thinsp;000 |

</details>

### Framework versions

- Transformers: 4.46.3
- PyTorch: 2.5.1+cu124
- Datasets: 3.2.0
- Tokenizers: 0.20.3