Mistral-Small-3.1-24B-GoEmotions
A multi-label emotion classifier fine-tuned on the GoEmotions dataset. The model detects up to 28 fine-grained emotions simultaneously in short English text. It is built on top of Mistral Small 3.1 24B Instruct, adapted with LoRA, and trained to classify pre-computed Universal Sentence Encoder embeddings through a lightweight projection head. 26 out of 28 emotion labels achieve F1 > 0.50 on the held-out English test set.
Note: six rare-label categories (relief, embarrassment, nervousness, pride, remorse, grief) have fewer than 100 test samples each and show inflated F1 scores; they are not recommended for production use without further evaluation.
Architecture
The model uses a three-stage pipeline that decouples text encoding from the Mistral backbone:
Raw text
    │
    ▼
Universal Sentence Encoder v4 (TF-Hub)
    │  512-dim embedding
    ▼
Projection MLP
    Linear(512 → 2560) + GELU + Linear(2560 → 5120)
    │  5120-dim projected embedding
    ▼
Mistral-Small-3.1-24B (4-bit quantised, LoRA-adapted)
    single-token sequence, last hidden state pooled
    │  5120-dim contextual representation
    ▼
Classifier head
    Dropout(0.1) + Linear(5120 → 28)
    │  28 logits → sigmoid → threshold @ 0.50
    ▼
Multi-hot prediction (one or more emotions per input)
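The final stage of the pipeline (logits, sigmoid, threshold) can be sketched in plain Python. This is a minimal illustration, not the code in infer.py: the function names, the three-label subset, and the logit values are assumptions made for the example.

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def decode(logits, labels, threshold=0.50):
    """Turn one row of logits into {label: probability}, keeping labels at or above the threshold."""
    probs = [sigmoid(v) for v in logits]
    return {lab: round(p, 4) for lab, p in zip(labels, probs) if p >= threshold}

# Illustrative subset of labels and made-up logits (the real head emits 28 logits).
labels = ["joy", "anger", "fear"]
logits = [2.0, -1.0, 0.0]
decoded = decode(logits, labels)
print(decoded)
```

Because each label is thresholded independently, zero, one, or several emotions can be returned for a single input.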
Focal Loss
Training uses per-label binary focal loss (gamma = 2) to address class imbalance across the 28 GoEmotions categories. Labels that have lower support or proved empirically harder to learn are assigned a higher focal alpha (0.75); the remaining labels use alpha = 0.25.
Higher-alpha labels (alpha = 0.75): admiration, amusement, anger, annoyance, confusion, desire, disappointment, disgust, excitement, fear, joy, love, optimism, sadness, surprise
Lower-alpha labels (alpha = 0.25): approval, caring, curiosity, embarrassment, gratitude, grief, nervousness, neutral, pride, realization, relief, remorse
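For reference, the per-label loss can be sketched as follows. This assumes the standard binary focal loss formulation (alpha weights the positive class, 1 - alpha the negative class); the training script may differ in details such as reduction or clamping, and the function name is illustrative.

```python
import math

def binary_focal_loss(p: float, y: int, alpha: float, gamma: float = 2.0) -> float:
    """Focal loss for one label: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p = min(max(p, 1e-7), 1.0 - 1e-7)           # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p               # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha   # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confident correct positive contributes far less than a missed positive.
easy = binary_focal_loss(p=0.9, y=1, alpha=0.75)
hard = binary_focal_loss(p=0.1, y=1, alpha=0.75)
print(f"easy positive: {easy:.5f}, hard positive: {hard:.5f}")
```

The (1 - p_t)^gamma factor is what down-weights well-classified examples, so the gradient is dominated by the hard ones.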
Training Details
Base Model & Quantisation
| Setting | Value |
|---|---|
| Base model | unsloth/Mistral-Small-3.1-24B-Instruct-2503 |
| Quantisation | 4-bit (BnB NF4) via Unsloth |
| Precision | bfloat16 |
LoRA Configuration
| Setting | Value |
|---|---|
| Rank (r) | 16 |
| Alpha | 32 |
| Dropout | 0.0 |
| Bias | none |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | ~116 M of 13.2 B (0.88 %) |
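To make the rank and alpha settings concrete, the sketch below computes the LoRA scaling factor and the adapter size for a single linear layer. It assumes the standard LoRA scaling of alpha / r; the 5120 x 5120 layer shape is purely illustrative and does not correspond to a specific module of the base model.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two low-rank factors: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

R, ALPHA = 16, 32            # values from the table above
scaling = ALPHA / R          # the low-rank update is multiplied by alpha / r

d_in = d_out = 5120          # illustrative layer shape (assumption)
full_params = d_in * d_out
lora_params = lora_param_count(d_in, d_out, R)
print(f"scaling={scaling}, LoRA={lora_params:,}, full={full_params:,}, "
      f"ratio={lora_params / full_params:.4%}")
```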
Dataset
| Split | Samples |
|---|---|
| Train | 89,201 |
| Validation | 7,891 |
| Test | 7,891 |
The training data is a pre-computed embedding cache of the augmented GoEmotions corpus. Each sample is a 512-dimensional Universal Sentence Encoder v4 embedding paired with a 28-dimensional multi-hot label vector.
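The multi-hot label format can be illustrated with a short sketch. The label order below is the commonly used GoEmotions ordering and is an assumption here; the authoritative order for this model is the emotion_labels list in config.json.

```python
# Assumed label order; check config.json ("emotion_labels") for the order the model uses.
GOEMOTIONS_LABELS = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
]

def to_multi_hot(active, labels=GOEMOTIONS_LABELS):
    """Encode a collection of active label names as a 0/1 vector aligned with `labels`."""
    active = set(active)
    unknown = active - set(labels)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1.0 if lab in active else 0.0 for lab in labels]

vec = to_multi_hot(["joy", "love"])
print(len(vec), sum(vec))
```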
Hyperparameters
| Hyperparameter | Value |
|---|---|
| Epochs | 15 |
| Per-device train batch | 8 |
| Gradient accumulation steps | 4 |
| Effective batch size | 32 |
| Learning rate | 1e-4 |
| LR scheduler | Cosine |
| Warmup ratio | 0.05 |
| Optimizer | AdamW 8-bit |
| Weight decay | 0.01 |
| Best model criterion | Macro F1 (validation) |
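The schedule implied by these settings can be sketched as below. It assumes linear warmup followed by cosine decay to zero, and it counts a trailing partial batch as a full optimizer step; the trainer's exact step count may differ slightly.

```python
import math

def lr_at(step: int, total_steps: int, base_lr: float = 1e-4, warmup_ratio: float = 0.05) -> float:
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

effective_batch = 8 * 4                                  # per-device batch x accumulation steps
steps_per_epoch = math.ceil(89_201 / effective_batch)    # training samples from the Dataset table
total_steps = 15 * steps_per_epoch
warmup_steps = int(total_steps * 0.05)

for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at(step, total_steps):.2e}")
```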
Compute
| Item | Details |
|---|---|
| Hardware | 1× NVIDIA RTX 6000 Ada Generation (48 GB VRAM) |
| Framework | Unsloth 2026.5.2, Transformers 4.57.6, PEFT 0.18.1, PyTorch 2.10.0+cu126 |
| Training time | ~18 hours |
| Final training loss | 0.0048 |
Evaluation Results
Evaluated on the held-out English GoEmotions test set (7,891 samples). Metrics are computed at threshold = 0.50.
Per-Label F1 (test set, English)
| Emotion | F1 |
|---|---|
| relief | 1.0000 |
| embarrassment | 0.9877 |
| nervousness | 0.9773 |
| pride | 0.9677 |
| fear | 0.9390 |
| remorse | 0.9353 |
| desire | 0.9288 |
| grief | 0.8980 |
| caring | 0.8942 |
| disgust | 0.8856 |
| realization | 0.8812 |
| gratitude | 0.8804 |
| excitement | 0.8767 |
| sadness | 0.8713 |
| surprise | 0.8505 |
| disappointment | 0.8473 |
| confusion | 0.7899 |
| optimism | 0.7882 |
| joy | 0.7794 |
| love | 0.7656 |
| amusement | 0.7611 |
| anger | 0.7395 |
| curiosity | 0.7319 |
| admiration | 0.6844 |
| disapproval | 0.6453 |
| annoyance | 0.6148 |
| approval | ≤ 0.50 |
| neutral | ≤ 0.50 |
26 of 28 labels achieve F1 > 0.50. The two weakest labels, approval and neutral, are structurally challenging: approval overlaps heavily with admiration and positive sentiment in general, while neutral is the absence of any emotion and therefore poorly separated from all other classes.
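Per-label F1 and the macro F1 used for model selection can be computed from multi-hot predictions as sketched below. This is a plain-Python illustration on a toy two-label example, not the evaluation script behind the table; a label with no positives and no predictions is scored 0.0 here by assumption.

```python
def f1_per_label(y_true, y_pred):
    """Per-label F1 from rows of 0/1 targets and 0/1 predictions."""
    scores = []
    for j in range(len(y_true[0])):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    scores = f1_per_label(y_true, y_pred)
    return sum(scores) / len(scores)

# Toy data: 4 samples, 2 labels.
y_true = [[1, 0], [1, 1], [0, 1], [0, 0]]
y_pred = [[1, 0], [0, 1], [0, 1], [0, 1]]
print(f1_per_label(y_true, y_pred), macro_f1(y_true, y_pred))
```

Macro averaging gives every label equal weight regardless of support, so a handful of examples for a rare label can move the aggregate as much as thousands of examples for a frequent one.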
Low-Support Labels: Overfitting Risk
Warning: The six labels listed below have fewer than 100 positive examples in the test set. Their very high F1 scores are likely inflated by the small sample size and should not be taken as evidence of reliable generalisation. These labels are not recommended for production inference without additional out-of-distribution evaluation.
| Emotion | Test-set F1 | Est. test support | Recommendation |
|---|---|---|---|
| relief | 1.0000 | < 30 | Do not use in production |
| embarrassment | 0.9877 | < 60 | Do not use in production |
| nervousness | 0.9773 | < 60 | Do not use in production |
| pride | 0.9677 | < 30 | Do not use in production |
| remorse | 0.9353 | < 80 | Do not use in production |
| grief | 0.8980 | < 30 | Do not use in production |
If you need to detect any of these emotions, consider:
- Collecting and annotating a domain-specific test set with at least 200 positive examples before drawing conclusions.
- Raising the prediction threshold for these labels to reduce false positives.
- Treating model outputs for these labels as low-confidence signals only.
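One way to apply a stricter threshold only to the low-support labels is to post-filter the prediction dictionaries. The sketch below assumes predictions are dictionaries of label to probability, as in the example output of the inference guide; the 0.80 cut-off and the helper name are hypothetical and should be tuned on your own validation data.

```python
UNRELIABLE_LABELS = {"relief", "embarrassment", "nervousness", "pride", "remorse", "grief"}

def apply_label_thresholds(pred, default=0.50, strict=0.80, strict_labels=UNRELIABLE_LABELS):
    """Keep a label only if its probability clears the threshold that applies to it."""
    return {
        label: prob
        for label, prob in pred.items()
        if prob >= (strict if label in strict_labels else default)
    }

# Made-up probabilities for one input.
pred = {"nervousness": 0.62, "fear": 0.68, "pride": 0.91}
filtered = apply_label_thresholds(pred)
print(filtered)
```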
To suppress these labels from predictions entirely, filter the output dictionary:
UNRELIABLE_LABELS = {"relief", "embarrassment", "nervousness", "pride", "remorse", "grief"}
predictions = clf.predict(texts)
safe_predictions = [
    {k: v for k, v in pred.items() if k not in UNRELIABLE_LABELS}
    for pred in predictions
]
Note: Multilingual evaluation (Italian test set) was ongoing at time of release and results will be added when available.
Requirements
unsloth>=2026.5.2
peft>=0.18.1
transformers>=4.57.6
torch>=2.10.0
tensorflow>=2.0 # for TF-Hub encoder
tensorflow-hub>=0.12
scikit-learn
numpy
A CUDA-capable GPU with at least 48 GB VRAM is required to load the base model in 4-bit quantisation. Inference on CPU is not practical due to the model size.
Inference Guide
The repository ships a self-contained inference helper infer.py that handles all loading and prediction in a single EmotionClassifier class.
Step 1: Download the model
Clone or download the full repository directory (it must contain config.json, focal_config.json, head_weights.pt, and the lora_adapter/ folder).
Step 2: Set the TF-Hub cache directory (optional but recommended)
The Universal Sentence Encoder is downloaded from TF-Hub on first use. Set an environment variable to control where it is cached:
# Windows PowerShell
$env:TFHUB_CACHE_DIR = "C:\path\to\tfhub_cache"
# Linux / macOS
export TFHUB_CACHE_DIR=/path/to/tfhub_cache
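The same variable can also be set from Python, as long as this happens before the encoder is first loaded. A minimal sketch follows; the helper name and the temporary-directory location are illustrative choices, not part of the repository.

```python
import os
import tempfile
from pathlib import Path

def configure_tfhub_cache(cache_dir) -> str:
    """Create the cache directory if needed and point TFHUB_CACHE_DIR at it."""
    path = Path(cache_dir).expanduser()
    path.mkdir(parents=True, exist_ok=True)
    os.environ["TFHUB_CACHE_DIR"] = str(path)
    return os.environ["TFHUB_CACHE_DIR"]

# Illustrative location; pick a persistent directory for real use.
cache_path = configure_tfhub_cache(Path(tempfile.gettempdir()) / "tfhub_cache")
print(cache_path)
```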
Step 3: Run inference
from infer import EmotionClassifier

# Path to the downloaded repository directory
MODEL_DIR = r"C:\path\to\Mistral-Small-3.1-24B-goemotions_v18"
clf = EmotionClassifier(MODEL_DIR)

texts = [
    "I can't believe how amazing that was!",
    "This is absolutely outrageous and I'm furious.",
    "I feel a bit nervous about the presentation tomorrow.",
]
predictions = clf.predict(texts)

for text, pred in zip(texts, predictions):
    print(f"\nText : {text}")
    print(f"Emotions: {pred}")
Example output:
Text : I can't believe how amazing that was!
Emotions: {'admiration': 0.9123, 'excitement': 0.8741, 'surprise': 0.7056}
Text : This is absolutely outrageous and I'm furious.
Emotions: {'anger': 0.9388, 'annoyance': 0.8112, 'disapproval': 0.7045}
Text : I feel a bit nervous about the presentation tomorrow.
Emotions: {'nervousness': 0.9512, 'fear': 0.6834}
Adjusting the prediction threshold
The default threshold is 0.50. You can lower it to capture more emotions (at the cost of more false positives) or raise it to return only high-confidence predictions:
# More sensitive: returns emotions with probability >= 0.35
predictions = clf.predict(texts, threshold=0.35)
# More conservative: only high-confidence emotions
predictions = clf.predict(texts, threshold=0.70)
Batch inference
predict() accepts any list of strings and processes them as a single batch through both the USE encoder and the Mistral backbone. For large inputs, consider splitting into sub-batches of ~64 texts depending on available VRAM.
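Splitting into sub-batches can be done with a small wrapper around any predict-style callable. The sketch below uses a stand-in function so that it runs without a GPU; predict_in_batches and fake_predict are hypothetical names, and with the real classifier you would pass clf.predict instead.

```python
def predict_in_batches(predict_fn, texts, batch_size=64, **kwargs):
    """Call predict_fn on consecutive slices of `texts` and concatenate the results."""
    results = []
    for start in range(0, len(texts), batch_size):
        results.extend(predict_fn(texts[start:start + batch_size], **kwargs))
    return results

def fake_predict(batch, threshold=0.50):
    """Stand-in for clf.predict: returns one dictionary per input text."""
    return [{"neutral": 1.0} for _ in batch]

texts = [f"text {i}" for i in range(150)]
out = predict_in_batches(fake_predict, texts, batch_size=64, threshold=0.50)
print(len(out))
```

Results come back in input order, so they can be zipped with the original texts exactly as in the single-batch example.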
Repository File Layout
Mistral-Small-3.1-24B-goemotions_v18/
├── README.md                         ← this file
├── config.json                       ← architecture & encoder configuration
│                                        (model_name, hidden_size, embed_dim,
│                                         num_labels, emotion_labels, threshold)
├── focal_config.json                 ← focal loss alpha values per label
├── head_weights.pt                   ← projection MLP + classifier head weights
│                                        (keys: "projection", "classifier")
├── infer.py                          ← self-contained inference class
└── lora_adapter/
    ├── adapter_config.json           ← PEFT LoRA configuration
    └── adapter_model.safetensors     ← LoRA delta weights
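Before loading the model it can help to verify that a download is complete. The following sketch checks for the files listed above and reads config.json; the helper names are hypothetical, and the demo runs against a throwaway directory containing only a minimal made-up config.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_FILES = [
    "config.json",
    "focal_config.json",
    "head_weights.pt",
    "lora_adapter/adapter_config.json",
    "lora_adapter/adapter_model.safetensors",
]

def missing_files(model_dir):
    """Return the required files that are absent from model_dir."""
    root = Path(model_dir)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]

def load_config(model_dir):
    """Read config.json and check that the label list matches num_labels."""
    with open(Path(model_dir) / "config.json", encoding="utf-8") as fh:
        cfg = json.load(fh)
    if len(cfg["emotion_labels"]) != cfg["num_labels"]:
        raise ValueError("emotion_labels does not match num_labels")
    return cfg

# Demo: a temporary directory that holds only a minimal, made-up config.json.
with tempfile.TemporaryDirectory() as tmp:
    demo_cfg = {"num_labels": 2, "emotion_labels": ["joy", "anger"], "threshold": 0.5}
    (Path(tmp) / "config.json").write_text(json.dumps(demo_cfg), encoding="utf-8")
    still_missing = missing_files(tmp)
    loaded = load_config(tmp)

print(still_missing)
```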
Limitations
- Language: The model was trained and evaluated exclusively on English text. Performance on other languages is unknown. An Italian evaluation is currently in progress.
- Input length: The Universal Sentence Encoder v4 has an effective input range of roughly 1 to 512 tokens. Very long inputs are truncated internally by the encoder before reaching the Mistral backbone.
- Threshold sensitivity: The default threshold of 0.50 was selected to balance precision and recall on the English test set. Depending on the application, a different threshold may be more appropriate (see the inference guide above).
- Overfitting on rare labels: Six labels (relief, embarrassment, nervousness, pride, remorse, grief) each have fewer than 100 positive examples in the test set. Their F1 scores (0.90 to 1.00) are likely inflated by this small sample size and are not reliable for production use. See the low-support labels section above for details and a code snippet to suppress these labels at inference time.
- Label imbalance: High-frequency, semantically overlapping labels (approval, neutral, annoyance, admiration) are the hardest to classify correctly and show the lowest F1 scores.
- Compute requirements: Loading the 4-bit quantised 24B-parameter model requires approximately 48 GB of GPU VRAM. The model cannot be used on consumer GPUs without additional quantisation.
- Data distribution: GoEmotions consists of Reddit comments in English. The model may not generalise well to formal text, non-English dialects, or social media platforms with different writing conventions.
Citation
If you use this model, please cite the GoEmotions dataset:
@inproceedings{demszky2020goemotions,
  title     = {GoEmotions: A Dataset of Fine-Grained Emotions},
  author    = {Demszky, Dorottya and Movshovitz-Attias, Dana and Ko, Jeongwoo
               and Cowen, Alan and Nemade, Gaurav and Ravi, Sujith},
  booktitle = {Proceedings of the 58th Annual Meeting of the Association
               for Computational Linguistics},
  year      = {2020},
  pages     = {4040--4054},
}
And the Mistral Small 3.1 base model:
@misc{mistral2025small31,
  title  = {Mistral Small 3.1},
  author = {Mistral AI},
  year   = {2025},
  url    = {https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503},
}