emotion-clf-refined -- Emotion Classifier Trained on SDVM-Refined Data
Emotion classification model trained on SDVM-refined training data. Demonstrates measurable accuracy improvement from data quality refinement. Part of the SDVM before/after comparison suite.
Cross-Evaluation Results (2x2 Matrix)
Both models were evaluated on both the original and the SDVM-refined test data (30 samples). The refined-trained model scores higher on both splits, which suggests that SDVM data refinement improves model quality -- not just on refined inputs, but across the board.
| Model \ Test Data | Original Test (Accuracy) | Refined Test (Accuracy) |
|---|---|---|
| Original-trained (emotion-clf-original) | 40.00% | 43.33% |
| Refined-trained (this model) | 43.33% | 46.67% |

| Model \ Test Data | Original Test (Macro F1) | Refined Test (Macro F1) |
|---|---|---|
| Original-trained (emotion-clf-original) | 0.3881 | 0.4281 |
| Refined-trained (this model) | 0.3952 | 0.4481 |
Key takeaways:
- This model wins on both test splits -- 43.33% on original test, 46.67% on refined test
- Both models improve on refined test data -- cleaning input helps even the original-trained model
- Best result: this model + refined test = 46.67% -- a 16.7% relative improvement over the 40.00% baseline (original-trained model on original test)
- SDVM refinement is not style-overfitting -- this model generalizes better to original data too (+3.33pp over baseline)
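The arithmetic behind these takeaways can be checked with a short sketch. The correct-prediction counts below are not stated in the card; they are derived from the reported percentages and the 30-sample test set, and the function names are only illustrative.

```python
# Correct predictions out of 30, derived from the reported accuracies
# (an inference from the percentages, not figures given in the card).
N_TEST = 30
correct = {
    ("original-trained", "original test"): 12,  # 40.00%
    ("original-trained", "refined test"): 13,   # 43.33%
    ("refined-trained", "original test"): 13,   # 43.33%
    ("refined-trained", "refined test"): 14,    # 46.67%
}

def accuracy(model: str, split: str) -> float:
    """Fraction of the 30 test samples classified correctly."""
    return correct[(model, split)] / N_TEST

def pp_delta(new: float, base: float) -> float:
    """Absolute difference in percentage points."""
    return (new - base) * 100

def relative_delta(new: float, base: float) -> float:
    """Relative change with respect to the baseline, in percent."""
    return (new - base) / base * 100

baseline = accuracy("original-trained", "original test")
best = accuracy("refined-trained", "refined test")
cross = accuracy("refined-trained", "original test")

print(f"{relative_delta(best, baseline):.1f}% relative")  # 16.7% relative
print(f"{pp_delta(cross, baseline):.2f} pp")              # 3.33 pp
```

With 30 test samples, each additional correct prediction moves accuracy by 3.33 percentage points, so every gap in the matrix corresponds to one or two samples.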
Model Details
| Property | Value |
|---|---|
| Architecture | TF-IDF (1-2 gram, 10K features) + Logistic Regression |
| Reference | NLP with Transformers Ch. 2 baseline |
| Training samples | 90 (15 per class x 6 classes) |
| Test samples | 30 (5 per class) |
| Classes | joy, sadness, anger, fear, surprise, love |
| Training data | SDVM-refined text |
| Refinement | SDVM proprietary refinement model |
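For orientation, the architecture row can be sketched in scikit-learn as follows. Only the 1-2 gram range and the 10K feature cap come from the table; `max_iter`, the helper name `build_pipeline`, and the toy sentences are assumptions for illustration and may differ from what train_compare.py actually does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def build_pipeline():
    """TF-IDF over unigrams and bigrams (capped at 10K features) + logistic regression."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), max_features=10_000),
        LogisticRegression(max_iter=1000),  # max_iter is an assumed setting
    )

# Toy data for illustration only; not the SDVM-refined training set.
toy_texts = [
    "I just had the best day ever.",
    "Life is so good right now.",
    "I feel really low today.",
    "Everything seems hopeless and sad.",
]
toy_labels = ["joy", "joy", "sadness", "sadness"]

pipe = build_pipeline()
pipe.fit(toy_texts, toy_labels)
print(pipe.classes_)                         # label order used by predict_proba
print(pipe.predict(["What a good day!"]))
```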
Performance vs. Baseline
| Metric | Original-trained | This model (refined) | Delta |
|---|---|---|---|
| Accuracy (original test) | 40.00% | 43.33% | +8.3% relative |
| Accuracy (refined test) | 43.33% | 46.67% | +7.7% relative |
| Macro F1 (original test) | 0.3881 | 0.3952 | +1.8% relative |
| Macro F1 (refined test) | 0.4281 | 0.4481 | +4.7% relative |
Per-Class F1 (Original Test)
| Emotion | Original-trained F1 | Refined-trained F1 | Delta |
|---|---|---|---|
| joy | 0.4000 | 0.5714 | +17pp |
| sadness | 0.2500 | 0.2222 | -3pp |
| anger | 0.3333 | 0.0000 | -33pp* |
| fear | 0.6154 | 0.8000 | +18pp |
| surprise | 0.4444 | 0.4444 | 0 |
| love | 0.2857 | 0.3333 | +5pp |
*anger regression: SDVM normalization removed ALL-CAPS and expletive patterns that TF-IDF relied on as discriminative anger signals. Mitigation: class-specific refinement policies for high-intensity classes.
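As a consistency check, the macro F1 scores reported for the original test split are the unweighted means of the per-class F1 values in this table. A minimal sketch, with the values copied from the table above:

```python
# Per-class F1 on the original test split: (original-trained, refined-trained).
per_class_f1 = {
    "joy": (0.4000, 0.5714),
    "sadness": (0.2500, 0.2222),
    "anger": (0.3333, 0.0000),
    "fear": (0.6154, 0.8000),
    "surprise": (0.4444, 0.4444),
    "love": (0.2857, 0.3333),
}

def macro_f1(scores):
    """Macro F1 is the unweighted mean of the per-class F1 scores."""
    scores = list(scores)
    return sum(scores) / len(scores)

original_macro = macro_f1(o for o, _ in per_class_f1.values())
refined_macro = macro_f1(r for _, r in per_class_f1.values())

print(f"{original_macro:.4f}")  # 0.3881
print(f"{refined_macro:.4f}")   # 0.3952
```

Because the mean is unweighted, the anger class falling to 0.0000 offsets most of the gains on joy and fear, which is why the overall macro F1 gain on this split is small (about 0.71 points absolute, or 1.8% relative).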
Refinement Examples (Training Data)
| Label | Before (original) | After (SDVM-refined) |
|---|---|---|
| joy | omg i just got the job i cant believe it im literally shaking rn | Oh my goodness, I just got the job! I can't believe it -- I'm literally shaking right now. |
| joy | just had the best day ever with my fav people honestly life is so good | I just had the best day ever with my favorite people. Honestly, life is so good. |
| joy | ur never gonna believe it i won tickets to the concert im SCREAMING | You're never going to believe it -- I won tickets to the concert! I'm screaming! |
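Pattern: SDVM restores apostrophes in contractions (cant to can't, im to I'm), adds missing punctuation, capitalizes sentences, normalizes ALL-CAPS words (SCREAMING to screaming), and expands shorthand (rn to right now, ur to you're, fav to favorite, gonna to going to).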
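To make the pattern concrete, here is a toy rule-based normalizer. This is not how SDVM works: SDVM uses a proprietary refinement model, whereas this sketch substitutes a small hand-written lookup table, and the `SHORTHAND` table and function names are hypothetical. It covers only a few of the transformations seen in the examples.

```python
import re

# Hypothetical lookup table; a real refinement model handles far more cases.
SHORTHAND = {
    "rn": "right now",
    "ur": "you're",
    "fav": "favorite",
    "cant": "can't",
    "im": "I'm",
    "i": "I",
}

def expand_shorthand(text: str) -> str:
    """Replace known shorthand tokens; leave every other word untouched."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        return SHORTHAND.get(word.lower(), word)
    return re.sub(r"[A-Za-z']+", repl, text)

def normalize(text: str) -> str:
    """Expand shorthand, capitalize the first letter, ensure end punctuation."""
    text = expand_shorthand(text.strip())
    if not text:
        return text
    text = text[0].upper() + text[1:]
    if text[-1] not in ".!?":
        text += "."
    return text

print(normalize("i cant believe it im so happy rn"))
# I can't believe it I'm so happy right now.
```

Note that the rules cannot insert sentence-internal punctuation or rephrase "omg", which is where a learned refinement model goes beyond simple substitution.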
Usage
import joblib
from huggingface_hub import hf_hub_download

# Download the serialized scikit-learn pipeline from the Hub.
# Only load joblib/pickle files from sources you trust.
model_path = hf_hub_download(repo_id="SDVM/emotion-clf-refined", filename="model.joblib")
pipe = joblib.load(model_path)

texts = ["I can't believe I got the job! I'm so happy right now.", "Feeling really low today, I don't know why."]
predictions = pipe.predict(texts)
print(predictions)  # e.g. ['joy' 'sadness']

# Class probabilities, one row per text; columns follow the order of pipe.classes_
probas = pipe.predict_proba(texts)
classes = pipe.classes_
Tip: This model performs best on grammatically clean, well-punctuated text. For informal input, run it through SDVM first.
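The `probas` and `classes` values from the usage snippet can be combined to rank emotions per input. The helper below is a hypothetical convenience function, and the probabilities are made-up stand-ins so the block runs without downloading the model.

```python
import numpy as np

def top_k_emotions(probas, classes, k=3):
    """Return the k most probable (label, probability) pairs for each input row."""
    probas = np.asarray(probas)
    results = []
    for row in probas:
        order = np.argsort(row)[::-1][:k]  # indices sorted by descending probability
        results.append([(str(classes[i]), float(row[i])) for i in order])
    return results

# Stand-in values; with the real model use pipe.predict_proba(texts) and pipe.classes_.
classes = np.array(["anger", "fear", "joy", "love", "sadness", "surprise"])
probas = np.array([[0.05, 0.05, 0.60, 0.15, 0.05, 0.10]])

print(top_k_emotions(probas, classes, k=2))
# [[('joy', 0.6), ('love', 0.15)]]
```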
Reproduce
The full training pipeline is included in train_compare.py. To reproduce:
pip install sdvm scikit-learn
export SDVM_API_KEY="your-key-here"
python train_compare.py
The refinement script used to create the SDVM/dair-ai-emotion dataset is available alongside it as refine_emotion.py.
About SDVM
SDVM (Synthetic Data Vending Machine) refines NLP training datasets using proprietary AI models, improving grammar, spelling, and fluency while preserving labels and meaning. On this emotion classification task, it demonstrated a 16.7% relative accuracy improvement (original-trained model on original test vs. refined-trained model on refined test).
pip install sdvm
from sdvm import Refinery, RawText
refinery = Refinery(api_key="sdvm_your_key")
results = refinery.run([RawText(text="i cant believe it im so happy rn")])
print(results[0].text)
# "I can't believe it -- I'm so happy right now."
Evaluation results
- accuracy on SDVM/dair-ai-emotion (refined), self-reported: 0.467
- f1 on SDVM/dair-ai-emotion (refined), self-reported: 0.448