Mistral-Small-3.1-24B-GoEmotions

A multi-label emotion classifier fine-tuned on the GoEmotions dataset. The model detects up to 28 fine-grained emotions simultaneously in short English text. It is built on top of Mistral Small 3.1 24B Instruct, adapted with LoRA, and trained to classify pre-computed Universal Sentence Encoder embeddings through a lightweight projection head. 26 out of 28 emotion labels achieve F1 > 0.50 on the held-out English test set.

Note: six rare-label categories (relief, embarrassment, nervousness, pride, remorse, grief) have fewer than 100 test samples each and show inflated F1 scores; they are not recommended for production use without further evaluation.


Architecture

The model uses a three-stage pipeline that decouples text encoding from the Mistral backbone:

Raw text
   │
   ▼
Universal Sentence Encoder v4  (TF-Hub)
   │  512-dim embedding
   ▼
Projection MLP
   Linear(512 → 2560) + GELU + Linear(2560 → 5120)
   │  5120-dim projected embedding
   ▼
Mistral-Small-3.1-24B  (4-bit quantised, LoRA-adapted)
   single-token sequence, last hidden state pooled
   │  5120-dim contextual representation
   ▼
Classifier head
   Dropout(0.1) + Linear(5120 → 28)
   │  28 logits  →  sigmoid  →  threshold @ 0.50
   ▼
Multi-hot prediction  (one or more emotions per input)
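
The final stage of the diagram (28 logits, sigmoid, threshold) can be sketched in plain Python. This is an illustration of the decoding step only, not the code shipped in infer.py; the function names and the three-label example are hypothetical.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def decode_logits(logits, labels, threshold=0.50):
    """Map raw logits to a {label: probability} dict of predicted emotions."""
    if len(logits) != len(labels):
        raise ValueError("expected exactly one logit per label")
    predicted = {}
    for label, logit in zip(labels, logits):
        prob = sigmoid(logit)  # each label is scored independently
        if prob >= threshold:
            predicted[label] = round(prob, 4)
    return predicted

# Hypothetical three-label example; the real head emits 28 logits.
example = decode_logits([2.0, -1.0, 0.3], ["joy", "anger", "surprise"])
```

Because every label gets its own sigmoid rather than a shared softmax, several emotions can clear the threshold for the same input.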

Focal Loss

Training uses per-label binary focal loss (gamma = 2) to address class imbalance across the 28 GoEmotions categories. Labels with naturally lower support or that were empirically harder to learn are assigned a higher focal alpha (0.75); the remaining labels use alpha = 0.25.

Higher-alpha labels (alpha = 0.75): admiration, amusement, anger, annoyance, confusion, desire, disappointment, disgust, excitement, fear, joy, love, optimism, sadness, surprise

Lower-alpha labels (alpha = 0.25): approval, caring, curiosity, embarrassment, gratitude, grief, nervousness, neutral, pride, realization, relief, remorse
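
As a rough sketch of what a per-label binary focal loss looks like, the snippet below uses the common alpha-balanced form, in which alpha weights the positive term and 1 - alpha the negative term. Whether the training script applies alpha in exactly this way is an assumption, and the function name is hypothetical.

```python
import math

def binary_focal_loss(prob, target, alpha, gamma=2.0, eps=1e-7):
    """Alpha-balanced binary focal loss for a single label."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    if target == 1:
        # Positive example: down-weight when p is already high.
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    # Negative example: down-weight when p is already low.
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)

# Alpha values from the two groups above, shown for two labels only.
ALPHA = {"joy": 0.75, "neutral": 0.25}

easy = binary_focal_loss(0.95, 1, ALPHA["joy"])  # confident and correct
hard = binary_focal_loss(0.30, 1, ALPHA["joy"])  # positive that was missed
```

With gamma = 2, the confidently correct example contributes far less to the loss than the missed positive, which is the mechanism that shifts training effort towards hard examples.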


Training Details

Base Model & Quantisation

Setting Value
Base model unsloth/Mistral-Small-3.1-24B-Instruct-2503
Quantisation 4-bit (BnB NF4) via Unsloth
Precision bfloat16

LoRA Configuration

Setting Value
Rank (r) 16
Alpha 32
Dropout 0.0
Bias none
Target modules q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Trainable parameters ~116 M of 13.2 B (0.88 %)
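
For readers who want to reproduce the adapter setup, the table can be written as a dictionary of keyword arguments. The key names follow `peft.LoraConfig`; whether the original training script built its adapter through PEFT directly or through an Unsloth helper is not stated here, so treat this as a sketch.

```python
# Values are copied from the LoRA Configuration table above.
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.0,
    "bias": "none",
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

# With standard (non-rank-stabilised) LoRA, the update is scaled by alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]

# Sanity check on the reported trainable share: ~116 M of 13.2 B parameters.
trainable_pct = 116e6 / 13.2e9 * 100

# Assumption: with peft installed, the dict could be passed as
# peft.LoraConfig(**lora_kwargs); the actual training code may differ.
```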

Dataset

Split Samples
Train 89,201
Validation 7,891
Test 7,891

The training data is a pre-computed embedding cache of the augmented GoEmotions corpus. Each sample is a 512-dimensional Universal Sentence Encoder v4 embedding paired with a 28-dimensional multi-hot label vector.
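
The label half of each sample can be illustrated with a small encoder that turns emotion names into a 28-dimensional multi-hot vector. The label order below follows the usual GoEmotions listing and is an assumption; the order the released model actually uses is the one stored under emotion_labels in config.json.

```python
# Assumed label order (standard GoEmotions listing); check config.json.
GOEMOTIONS_LABELS = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
]

def to_multi_hot(active, labels=GOEMOTIONS_LABELS):
    """Encode a collection of emotion names as a list of 0/1 flags."""
    active = set(active)
    unknown = active - set(labels)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in active else 0 for name in labels]

# One training sample pairs a 512-dim embedding with this label vector.
label_vector = to_multi_hot(["joy", "excitement"])
```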

Hyperparameters

Hyperparameter Value
Epochs 15
Per-device train batch 8
Gradient accumulation steps 4
Effective batch size 32
Learning rate 1e-4
LR scheduler Cosine
Warmup ratio 0.05
Optimizer AdamW 8-bit
Weight decay 0.01
Best model criterion Macro F1 (validation)
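
To make the schedule concrete, the sketch below derives the effective batch size and an approximate step count from the tables above, then evaluates a linear-warmup, cosine-decay learning rate. The exact rounding of steps and warmup depends on the trainer, so the step counts are estimates and `lr_at` is a hypothetical helper, not the scheduler used in training.

```python
import math

PEAK_LR = 1e-4
WARMUP_RATIO = 0.05

def lr_at(step, total_steps, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Linear warmup to peak_lr, then cosine decay towards zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

effective_batch = 8 * 4  # per-device batch x gradient accumulation steps

# Approximate counts; a trainer may round partial batches differently.
steps_per_epoch = math.ceil(89_201 / effective_batch)
total_steps = 15 * steps_per_epoch
```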

Compute

Item Details
Hardware 1× NVIDIA RTX 6000 Ada Generation (48 GB VRAM)
Framework Unsloth 2026.5.2, Transformers 4.57.6, PEFT 0.18.1, PyTorch 2.10.0+cu126
Training time ~18 hours
Final training loss 0.0048

Evaluation Results

Evaluated on the held-out English GoEmotions test set (7,891 samples). Metrics are computed at threshold = 0.50.

Per-Label F1 (test set, English)

Emotion F1
relief 1.0000
embarrassment 0.9877
nervousness 0.9773
pride 0.9677
fear 0.9390
remorse 0.9353
desire 0.9288
grief 0.8980
caring 0.8942
disgust 0.8856
realization 0.8812
gratitude 0.8804
excitement 0.8767
sadness 0.8713
surprise 0.8505
disappointment 0.8473
confusion 0.7899
optimism 0.7882
joy 0.7794
love 0.7656
amusement 0.7611
anger 0.7395
curiosity 0.7319
admiration 0.6844
disapproval 0.6453
annoyance 0.6148
approval ≤ 0.50
neutral ≤ 0.50

26 of 28 labels achieve F1 > 0.50. The two weakest labels, approval and neutral, are structurally challenging: approval overlaps heavily with admiration and positive sentiment in general, while neutral is the absence of any emotion and is therefore poorly separated from all other classes.
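
For reference, per-label F1 and its unweighted mean (macro F1, the model-selection criterion) can be computed from multi-hot rows as sketched below. This is a dependency-free illustration with hypothetical function names; scikit-learn's `f1_score` with `average="macro"` is the usual library route, and the evaluation script itself may differ.

```python
def f1_per_label(y_true, y_pred):
    """Per-label F1 for equally shaped lists of multi-hot rows."""
    scores = []
    for j in range(len(y_true[0])):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        denom = 2 * tp + fp + fn
        # A label with no positives and no predictions scores 0.0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    scores = f1_per_label(y_true, y_pred)
    return sum(scores) / len(scores)

# Toy data: three samples, two labels.
y_true = [[1, 0], [1, 1], [0, 1]]
y_pred = [[1, 0], [0, 1], [0, 1]]
per_label = f1_per_label(y_true, y_pred)
```

Because macro F1 weights every label equally, a rare label with a handful of test positives moves the average as much as a frequent one.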

Low-Support Labels: Overfitting Risk

Warning: The six labels listed below have fewer than 100 positive examples in the test set. Their very high F1 scores are likely inflated by the small sample size and should not be taken as evidence of reliable generalisation. These labels are not recommended for production inference without additional out-of-distribution evaluation.

Emotion         Test-set F1   Est. test support   Recommendation
relief          1.0000        < 30                Do not use in production
embarrassment   0.9877        < 60                Do not use in production
nervousness     0.9773        < 60                Do not use in production
pride           0.9677        < 30                Do not use in production
remorse         0.9353        < 80                Do not use in production
grief           0.8980        < 30                Do not use in production
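
One way to see why such small supports are a problem is to put a confidence interval around a simple proportion such as recall. The sketch below uses the Wilson score interval; F1 is not a plain proportion, so this is only a rough indicator, and the count of 30 positives is an illustrative value taken from the "< 30" bound in the table, not a measured support.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# Illustrative case: every one of 30 positives is recovered.
low, high = wilson_interval(30, 30)
```

Even with a perfect 30 out of 30, the lower bound sits near 0.89, so a perfect score on a label this rare says little about how it behaves on new data.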

If you need to detect any of these emotions, consider:

  • Collecting and annotating a domain-specific test set with at least 200 positive examples before drawing conclusions.
  • Raising the prediction threshold for these labels to reduce false positives.
  • Treating model outputs for these labels as low-confidence signals only.

To suppress these labels from predictions entirely, filter the output dictionary:

UNRELIABLE_LABELS = {"relief", "embarrassment", "nervousness", "pride", "remorse", "grief"}

predictions = clf.predict(texts)
safe_predictions = [
    {k: v for k, v in pred.items() if k not in UNRELIABLE_LABELS}
    for pred in predictions
]
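
A softer alternative to dropping the six labels is to keep them only when they clear a stricter cut-off. The sketch below post-filters one prediction dictionary; the function name, the 0.90 cut-off, and the example probabilities are illustrative assumptions, not values tuned on this model.

```python
UNRELIABLE_LABELS = {
    "relief", "embarrassment", "nervousness", "pride", "remorse", "grief",
}

def apply_strict_threshold(pred, strict_labels=UNRELIABLE_LABELS,
                           strict_threshold=0.90):
    """Keep a strict label only if its probability clears the higher bar."""
    return {
        label: prob
        for label, prob in pred.items()
        if label not in strict_labels or prob >= strict_threshold
    }

# Hypothetical prediction dict in the format returned by predict().
pred = {"nervousness": 0.9512, "fear": 0.6834, "relief": 0.62}
filtered = apply_strict_threshold(pred)
```

Here "relief" is dropped because 0.62 is below the stricter cut-off, while "nervousness" and "fear" are kept.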

Note: Multilingual evaluation (Italian test set) was ongoing at time of release and results will be added when available.


Requirements

unsloth>=2026.5.2
peft>=0.18.1
transformers>=4.57.6
torch>=2.10.0
tensorflow>=2.0        # for TF-Hub encoder
tensorflow-hub>=0.12
scikit-learn
numpy

A CUDA-capable GPU with at least 48 GB VRAM is required to load the base model in 4-bit quantisation. Inference on CPU is not practical due to the model size.


Inference Guide

The repository ships a self-contained inference helper infer.py that handles all loading and prediction in a single EmotionClassifier class.

Step 1: Download the model

Clone or download the full repository directory (it must contain config.json, focal_config.json, head_weights.pt, and the lora_adapter/ folder).
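
Before loading a 24B model, it can save time to confirm that the download is complete. The following sketch checks for the four required entries and reads config.json, verifying the keys named in the repository file layout. The helper names are hypothetical, and the assumption that num_labels equals the length of emotion_labels is mine rather than something infer.py is known to enforce.

```python
import json
from pathlib import Path

REQUIRED_ENTRIES = [
    "config.json", "focal_config.json", "head_weights.pt", "lora_adapter",
]
CONFIG_KEYS = {
    "model_name", "hidden_size", "embed_dim",
    "num_labels", "emotion_labels", "threshold",
}

def missing_model_files(model_dir):
    """Return the required files or folders that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_ENTRIES if not (root / name).exists()]

def load_model_config(model_dir):
    """Read config.json and check that the documented keys are present."""
    path = Path(model_dir) / "config.json"
    config = json.loads(path.read_text(encoding="utf-8"))
    absent = CONFIG_KEYS - set(config)
    if absent:
        raise KeyError(f"config.json is missing keys: {sorted(absent)}")
    # Assumption: emotion_labels is a list and num_labels is its length.
    if config["num_labels"] != len(config["emotion_labels"]):
        raise ValueError("num_labels does not match len(emotion_labels)")
    return config
```

Calling `missing_model_files(MODEL_DIR)` and getting an empty list back is a cheap signal that the directory is ready to pass to EmotionClassifier.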

Step 2: Set the TF-Hub cache directory (optional but recommended)

The Universal Sentence Encoder is downloaded from TF-Hub on first use. Set an environment variable to control where it is cached:

# Windows PowerShell
$env:TFHUB_CACHE_DIR = "C:\path\to\tfhub_cache"

# Linux / macOS
export TFHUB_CACHE_DIR=/path/to/tfhub_cache
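
The same variable can also be set from Python, which is convenient in notebooks. The sketch below is a hypothetical helper; it should run before the encoder is loaded so that the download lands in the chosen directory.

```python
import os
from pathlib import Path

def configure_tfhub_cache(cache_dir):
    """Create cache_dir if needed and point TFHUB_CACHE_DIR at it."""
    path = Path(cache_dir).expanduser()
    path.mkdir(parents=True, exist_ok=True)
    os.environ["TFHUB_CACHE_DIR"] = str(path)
    return path

# Example (hypothetical path): configure_tfhub_cache("~/tfhub_cache")
```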

Step 3: Run inference

from infer import EmotionClassifier

MODEL_DIR = r"C:\path\to\Mistral-Small-3.1-24B-goemotions_v18"

clf = EmotionClassifier(MODEL_DIR)

texts = [
    "I can't believe how amazing that was!",
    "This is absolutely outrageous and I'm furious.",
    "I feel a bit nervous about the presentation tomorrow.",
]

predictions = clf.predict(texts)

for text, pred in zip(texts, predictions):
    print(f"\nText   : {text}")
    print(f"Emotions: {pred}")

Example output:

Text   : I can't believe how amazing that was!
Emotions: {'admiration': 0.9123, 'excitement': 0.8741, 'surprise': 0.7056}

Text   : This is absolutely outrageous and I'm furious.
Emotions: {'anger': 0.9388, 'annoyance': 0.8112, 'disapproval': 0.7045}

Text   : I feel a bit nervous about the presentation tomorrow.
Emotions: {'nervousness': 0.9512, 'fear': 0.6834}

Adjusting the prediction threshold

The default threshold is 0.50. You can lower it to capture more emotions (at the cost of more false positives) or raise it to return only high-confidence predictions:

# More sensitive: returns emotions with probability >= 0.35
predictions = clf.predict(texts, threshold=0.35)

# More conservative: only high-confidence emotions
predictions = clf.predict(texts, threshold=0.70)

Batch inference

predict() accepts any list of strings and processes them as a single batch through both the USE encoder and the Mistral backbone. For large inputs, consider splitting into sub-batches of ~64 texts depending on available VRAM.
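
A minimal way to do that splitting is a wrapper that feeds predict() one chunk at a time and concatenates the results. The wrapper name is hypothetical, and it assumes, as in the example output above, that predict() returns one result per input text in the same order.

```python
def predict_in_batches(clf, texts, batch_size=64, **predict_kwargs):
    """Run clf.predict over texts in chunks of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        # Extra keyword arguments (e.g. threshold=0.35) are passed through.
        results.extend(clf.predict(chunk, **predict_kwargs))
    return results
```

Usage mirrors the direct call, for example `predict_in_batches(clf, texts, batch_size=32, threshold=0.70)`; lower the batch size if memory runs out.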


Repository File Layout

Mistral-Small-3.1-24B-goemotions_v18/
├── README.md                  ← this file
├── config.json                ← architecture & encoder configuration
│                                 (model_name, hidden_size, embed_dim,
│                                  num_labels, emotion_labels, threshold)
├── focal_config.json          ← focal loss alpha values per label
├── head_weights.pt            ← projection MLP + classifier head weights
│                                 (keys: "projection", "classifier")
├── infer.py                   ← self-contained inference class
└── lora_adapter/
    ├── adapter_config.json    ← PEFT LoRA configuration
    └── adapter_model.safetensors  ← LoRA delta weights

Limitations

  • Language: The model was trained and evaluated exclusively on English text. Performance on other languages is unknown. An Italian evaluation is currently in progress.
  • Input length: The Universal Sentence Encoder v4 has an effective input range of roughly 1–512 tokens. Very long inputs are truncated internally by the encoder before reaching the Mistral backbone.
  • Threshold sensitivity: The default threshold of 0.50 was selected to balance precision and recall on the English test set. Depending on the application, a different threshold may be more appropriate (see the inference guide above).
  • Overfitting on rare labels: Six labels (relief, embarrassment, nervousness, pride, remorse, grief) each have fewer than 100 positive examples in the test set. Their F1 scores (0.90–1.00) are likely inflated by this small sample size and are not reliable for production use. See the Low-Support Labels section for details and a code snippet to suppress these labels at inference time.
  • Label imbalance: High-frequency, semantically overlapping labels (approval, neutral, annoyance, admiration) are the hardest to classify correctly and show the lowest F1 scores.
  • Compute requirements: Loading the 4-bit quantised 24B-parameter model requires approximately 48 GB of GPU VRAM. The model cannot be used on consumer GPUs without additional quantisation.
  • Data distribution: GoEmotions consists of Reddit comments in English. The model may not generalise well to formal text, non-English dialects, or social media platforms with different writing conventions.

Citation

If you use this model, please cite the GoEmotions dataset:

@inproceedings{demszky2020goemotions,
  title     = {GoEmotions: A Dataset of Fine-Grained Emotions},
  author    = {Demszky, Dorottya and Movshovitz-Attias, Dana and Ko, Jeongwook
               and Cowen, Alan and Nemade, Gaurav and Ravi, Sujith},
  booktitle = {Proceedings of the 58th Annual Meeting of the Association
               for Computational Linguistics},
  year      = {2020},
  pages     = {4040--4054},
}

And the Mistral Small 3.1 base model:

@misc{mistral2025small31,
  title  = {Mistral Small 3.1},
  author = {Mistral AI},
  year   = {2025},
  url    = {https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503},
}