---
library_name: transformers
license: mit
base_model: agentlans/snowflake-arctic-embed-xs-zyda-2
tags:
- generated_from_trainer
- text-classification
- grammar-classification
metrics:
- accuracy
model-index:
- name: agentlans/snowflake-arctic-xs-grammar-classifier
  results:
  - task:
      type: text-classification
      name: Grammar Classification
    dataset:
      name: agentlans/grammar-classification
      type: agentlans/grammar-classification
    metrics:
    - type: accuracy
      value: 0.8724
      name: Accuracy
datasets:
- agentlans/grammar-classification
- liweili/c4_200m
language:
- en
pipeline_tag: text-classification
---

# snowflake-arctic-xs-grammar-classifier

This model is a fine-tuned version of [agentlans/snowflake-arctic-embed-xs-zyda-2](https://huggingface.co/agentlans/snowflake-arctic-embed-xs-zyda-2) for grammar classification. It achieves an accuracy of 0.8724 on the evaluation set.

## Model description

The snowflake-arctic-xs-grammar-classifier classifies the grammatical correctness of English sentences. It is based on the snowflake-arctic-embed-xs-zyda-2 model and was fine-tuned on a grammar-classification dataset derived from C4 (the Colossal Clean Crawled Corpus).

## Intended uses & limitations

This model is intended for classifying the grammatical correctness of English sentences. It can be used in applications such as writing-assistance tools, educational software, and content-moderation systems.

### Usage example

```python
from transformers import pipeline
import torch

# Use the first GPU if available, otherwise fall back to CPU.
device = 0 if torch.cuda.is_available() else -1
classifier = pipeline(
    "text-classification",
    model="agentlans/snowflake-arctic-xs-grammar-classifier",
    device=device,
)

text = "I absolutely loved this movie!"
result = classifier(text)
print(result)  # [{'label': 'grammatical', 'score': 0.8963921666145325}]
```

### Example classifications

| Status | Text | Explanation |
|:------:|------|-------------|
| ✔️ | I absolutely loved this movie! | Grammatically correct, clear sentence structure |
| ❌ | How do I shot web? | Grammatically incorrect, improper verb usage |
| ✔️ | Beware the Jabberwock, my son! | Poetic language, grammatically sound |
| ✔️ | Colourless green ideas sleep furiously. | Grammatically correct, though semantically nonsensical |
| ❌ | Has anyone really been far even as decided to use even go want to do look more like? | Completely incoherent and grammatically incorrect |

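When using the classifier downstream, its `[{'label': ..., 'score': ...}]` output can be reduced to a boolean verdict. The helper below is a hypothetical convenience wrapper, not part of this model's API; it assumes the output format shown in the usage example above, where the positive class is labeled `grammatical`.

```python
# Hypothetical helper (not part of the model card's API): reduce the
# pipeline's [{'label': ..., 'score': ...}] output to a boolean verdict.
def is_grammatical(results, threshold=0.5):
    """Return True if the top prediction is 'grammatical' with enough confidence."""
    top = results[0]
    return top["label"] == "grammatical" and top["score"] >= threshold

print(is_grammatical([{"label": "grammatical", "score": 0.896}]))  # True
```

Raising `threshold` trades recall for precision; a cutoff well above 0.5 keeps only confidently grammatical sentences.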
### Limitations

The model's performance is limited by the quality and diversity of its training data. It may not perform well on specialized or domain-specific text, or on languages other than English, and it may struggle with complex grammatical structures or nuanced language use.

## Training and evaluation data

The model was trained on the [agentlans/grammar-classification](https://huggingface.co/datasets/agentlans/grammar-classification) dataset, which contains 600,000 examples for binary classification of grammatical correctness in English. The dataset is derived from a subset of the C4_200M Synthetic Dataset for Grammatical Error Correction.

## Training procedure

### Training hyperparameters

- Learning rate: 5e-05
- Batch size: 128
- Number of epochs: 10
- Optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-08
- Learning rate scheduler: linear

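For reference, the hyperparameters above map roughly onto `transformers.TrainingArguments` as sketched below. This is a hedged reconstruction, not the published training script: the argument names come from the transformers API, but the pairing with this particular run (including the output directory name) is an assumption.

```python
from transformers import TrainingArguments

# Sketch only: how the listed hyperparameters might be expressed as
# TrainingArguments. The actual training script is not published here.
args = TrainingArguments(
    output_dir="snowflake-arctic-xs-grammar-classifier",  # hypothetical
    learning_rate=5e-5,
    per_device_train_batch_size=128,
    num_train_epochs=10,
    lr_scheduler_type="linear",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```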
<details>
<summary>📊 Detailed Training Results</summary>

| Training Loss | Epoch | Step | Validation Loss | Accuracy | Input Tokens Seen |
|:-------------:|:-----:|:-----:|:---------------:|:--------:|:-----------------:|
| 0.5192 | 1.0 | 3750 | 0.4722 | 0.7738 | 61,440,000 |
| 0.4875 | 2.0 | 7500 | 0.4521 | 0.7881 | 122,880,000 |
| 0.4590 | 3.0 | 11250 | 0.3895 | 0.8227 | 184,320,000 |
| 0.4351 | 4.0 | 15000 | 0.3981 | 0.8197 | 245,760,000 |
| 0.4157 | 5.0 | 18750 | 0.3690 | 0.8337 | 307,200,000 |
| 0.3955 | 6.0 | 22500 | 0.3260 | 0.8585 | 368,640,000 |
| 0.3788 | 7.0 | 26250 | 0.3267 | 0.8566 | 430,080,000 |
| 0.3616 | 8.0 | 30000 | 0.3192 | 0.8621 | 491,520,000 |
| 0.3459 | 9.0 | 33750 | 0.3017 | 0.8707 | 552,960,000 |
| 0.3382 | 10.0 | 37500 | 0.2971 | 0.8724 | 614,400,000 |

</details>

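As a sanity check, the step and token counts in the training log are internally consistent: 3,750 steps per epoch at batch size 128 implies 480,000 training examples per epoch, and the token count implies a fixed sequence length of 128 tokens per example.

```python
# Derived from the training log above: examples per epoch and the
# implied fixed sequence length in tokens.
batch_size = 128
steps_per_epoch = 3750
tokens_per_epoch = 61_440_000

examples_per_epoch = steps_per_epoch * batch_size            # 480,000
tokens_per_example = tokens_per_epoch // examples_per_epoch  # 128

print(examples_per_epoch, tokens_per_example)  # 480000 128
```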
### Framework versions

- Transformers: 4.46.3
- PyTorch: 2.5.1+cu124
- Datasets: 3.2.0
- Tokenizers: 0.20.3
|