FaceEmo-Set: A Balanced and Diverse Dataset for Facial Emotion Recognition

Model Description

This repository contains the FaceEmo-Set models, which use a Vision Transformer (ViT) architecture to recognize seven basic facial emotions: anger, disgust, fear, happiness, neutral, sadness, and surprise.

We provide two model variants:

  • FaceEmo-Set ViT Model: Trained exclusively on FaceEmo-Set (25,200 balanced images)
  • Combined ViT Model: Trained on FaceEmo-Set + FER2013 + RAF-DB + RAVDESS for enhanced cross-dataset generalization

Key Features

  • Balanced representation across all seven emotion categories
  • Improved performance on classes that are under-represented in existing datasets (disgust, fear, sadness)
  • Cross-dataset generalization validated on AffectNet and FER2013
  • Vision Transformer architecture (ViT-Base/16) with full fine-tuning

Model Performance

FaceEmo-Set Model Performance

Test Dataset   Overall Accuracy   Angry   Disgust   Fear   Happy   Neutral   Sad    Surprise
AffectNet      60.84%             0.50    0.75      0.53   0.87    0.60      0.44   0.58
FER2013        56.19%             0.42    0.70      0.24   0.83    0.62      0.62   0.71

Combined Model Performance

Test Dataset   Overall Accuracy   Angry   Disgust   Fear   Happy   Neutral   Sad    Surprise
AffectNet      65.27%             0.62    0.79      0.63   0.93    0.39      0.53   0.69
FER2013        69.32%             0.56    0.64      0.55   0.93    0.54      0.65   0.82

Notable Achievement: The FaceEmo-Set model reaches 0.75 recall for disgust on AffectNet, compared with 0.04 recall for a model trained on the imbalanced FER2013 dataset.
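
The disgust figure quoted here matches the AffectNet row of the table above, which suggests the per-class columns report recall. As a minimal sketch of how overall accuracy and per-class recall can be computed from true and predicted labels: the label lists below are toy data, not results from this repository, and `per_class_recall` and `accuracy` are hypothetical helper names.

```python
from collections import Counter

EMOTIONS = ["anger", "disgust", "fear", "happiness", "neutral", "sadness", "surprise"]


def per_class_recall(y_true, y_pred, labels=EMOTIONS):
    """Return {label: recall}, where recall = correct predictions / true samples of that label."""
    support = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    # Labels with no true samples get None rather than a misleading 0.0.
    return {lab: (correct[lab] / support[lab] if support[lab] else None) for lab in labels}


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels for illustration only (assumption, not repository data).
y_true = ["disgust", "disgust", "disgust", "disgust", "happiness", "happiness", "fear", "fear"]
y_pred = ["disgust", "disgust", "disgust", "anger", "happiness", "happiness", "fear", "surprise"]

recall = per_class_recall(y_true, y_pred)
print(recall["disgust"])          # 0.75 (3 of 4 disgust samples recovered)
print(accuracy(y_true, y_pred))   # 0.75 (6 of 8 samples correct)
```

Reporting recall per class alongside overall accuracy matters for imbalanced test sets, where a model can score well overall while missing a rare class almost entirely.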

Dataset Description

FaceEmo-Set is a curated dataset of 25,200 validated images (selected from 28,979 collected), designed to address critical limitations in existing facial emotion recognition (FER) datasets:

Design Principles

  1. Class Balance: Near-uniform distribution (roughly 3,300 to 4,000 samples per emotion, max ratio 1.2:1)
  2. Source Diversity: Multi-source integration (movies, TV shows, GIFs, internet images, AI-generated content, established datasets)
  3. Quality Variance: Deliberate inclusion of varying resolutions, lighting, and acquisition conditions

Data Sources

  • Dynamic media (movies, TV shows, GIFs)
  • Static internet images
  • AI-generated content
  • Samples from FER2013, CREMA-D, and RAVDESS (200, 100, 100 per emotion respectively)

Validation Protocol

  • Multi-annotator validation (3 annotators per image)
  • Majority agreement required (≥2/3 consensus)
  • 13% rejection rate (3,779 ambiguous/low-quality images excluded)
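
The protocol above can be sketched as a simple majority vote. This is a minimal illustration, not the authors' annotation tooling: `majority_label` is a hypothetical helper, and the votes are toy data. The rejection-rate arithmetic uses the image counts stated in this card.

```python
from collections import Counter


def majority_label(votes, min_agree=2):
    """Return the label chosen by at least `min_agree` annotators, else None (image rejected)."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else None


# Toy annotator votes (assumption, for illustration only).
print(majority_label(["fear", "fear", "surprise"]))     # fear
print(majority_label(["fear", "sadness", "surprise"]))  # None

# Rejection rate from the counts given in this card.
collected, kept = 28_979, 25_200
rejected = collected - kept
rejection_rate = rejected / collected
print(rejected, f"{rejection_rate:.1%}")  # 3779 13.0%
```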

Emotion Distribution

Emotion     Training   Validation   Total
Anger       3,372      596          3,968
Disgust     2,985      527          3,512
Fear        2,807      496          3,303
Happiness   3,231      571          3,802
Neutral     3,130      553          3,683
Sadness     3,086      545          3,631
Surprise    2,805      496          3,301
Total       21,416     3,784        25,200
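
As a quick consistency check, the balance ratio and the train/validation split can be recomputed from the counts in this table. The sketch below copies the numbers above; the variable names are illustrative.

```python
# Per-emotion (training, validation) counts copied from the Emotion Distribution table.
counts = {
    "Anger": (3372, 596),
    "Disgust": (2985, 527),
    "Fear": (2807, 496),
    "Happiness": (3231, 571),
    "Neutral": (3130, 553),
    "Sadness": (3086, 545),
    "Surprise": (2805, 496),
}

totals = {emotion: train + val for emotion, (train, val) in counts.items()}
grand_total = sum(totals.values())

# Ratio between the largest and smallest class.
balance_ratio = max(totals.values()) / min(totals.values())

# Share of all images held out for validation.
val_fraction = sum(val for _, val in counts.values()) / grand_total

print(grand_total)              # 25200
print(round(balance_ratio, 2))  # 1.2
print(round(val_fraction, 3))   # 0.15
```

The largest class (Anger, 3,968) and the smallest (Surprise, 3,301) give the 1.2:1 ratio, and the validation split works out to about 15% of the data.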

Visual Overview

Dataset Creation Pipeline

[Figure: FaceEmo-Set creation pipeline]

Confusion Matrices

[Figure: Confusion matrix, FaceEmo-Set model on FER2013]

[Figure: Confusion matrix, FaceEmo-Set model on AffectNet]

Available Files

Model Weights

  • FaceEmo-Set_ViT_model_weights.pth - FaceEmo-Set standalone model
  • comb_data_ViT_model_weights.pth - Combined multi-dataset model

Predictions and Results

  • FaceEmo-Set_train_FER2013_test_predictions.csv
  • FaceEmo-Set_train_AffectNet_test_predictions.csv
  • comb_data_train_FER2013_test_predictions.csv
  • comb_data_train_AffectNet_test_predictions.csv

Visualizations

  • FaceEmoset_Creation_pipeline.png - Dataset construction pipeline
  • FaceEmo-Set_train_FER2013_test_confusion_m.png - Confusion matrix (FER2013)
  • FaceEmo-Set_train_AffectNet_test_confusion_m.png - Confusion matrix (AffectNet)

Dataset Features

Note: Raw image data cannot be released due to copyright restrictions on source materials.

Quick Start

Installation

pip install torch torchvision transformers pillow huggingface_hub pandas ipython

Single Image Inference

import warnings

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoImageProcessor, ViTConfig, ViTForImageClassification
from transformers.utils import logging as hf_logging
from PIL import Image
from IPython.display import display, Markdown

warnings.filterwarnings("ignore")
hf_logging.set_verbosity_error()

REPO_ID = "jihedjabnoun/faceemo-set"

MODEL_TYPE = "combined"   # "faceemo" or "combined"

FILES = {
    "faceemo": ("FaceEmo-Set_ViT_model_weights.pth", "FaceEmo-Set_ViT_model"),
    "combined": ("comb_data_ViT_model_weights.pth", "Combined_ViT_model")
}

MODEL_FILE, MODEL_NAME = FILES[MODEL_TYPE]

EMOTIONS = ["anger","disgust","fear","happiness","neutral","sadness","surprise"]
IMAGE_PATH = "9.png"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

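# Preprocessing (resize to 224x224, normalization) is taken from the base ViT checkpoint.
processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k", use_fast=True)

# Build the architecture from the base config with a 7-class head;
# the fine-tuned weights are loaded from this repository below.
config = ViTConfig.from_pretrained("google/vit-base-patch16-224-in21k", num_labels=len(EMOTIONS))
model = ViTForImageClassification(config)

weights_path = hf_hub_download(repo_id=REPO_ID, filename=MODEL_FILE)

# weights_only=True restricts unpickling to tensors and plain containers.
state = torch.load(weights_path, map_location="cpu", weights_only=True)
# Strip the "module." prefix that torch.nn.DataParallel adds to parameter names, if present.
if any(k.startswith("module.") for k in state):
    state = {k.replace("module.", "", 1): v for k, v in state.items()}

model.load_state_dict(state, strict=True)
model.to(device).eval()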

print(f"✅ {MODEL_NAME} loaded successfully")

image = Image.open(IMAGE_PATH).convert("RGB")
display(image)

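# The processor resizes and normalizes the image and returns a batch of size 1.
inputs = processor(images=image, return_tensors="pt").to(device)

with torch.no_grad():
    # Softmax over the class dimension; [0] selects the single image in the batch.
    probs = torch.softmax(model(**inputs).logits, dim=1)[0]

# Three most probable emotions and their probabilities.
top = torch.topk(probs, 3)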

lines = []
for i, (idx, p) in enumerate(zip(top.indices.tolist(), top.values.tolist()), 1):
    lines.append(f"{i}. **{EMOTIONS[idx]}** — `{p:.2%}`")

display(Markdown("### Prediction (Top-3)\n" + "\n".join(lines)))

[Figure: Prediction output example]

Batch Inference

import warnings

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoImageProcessor, ViTConfig, ViTForImageClassification
from transformers.utils import logging as hf_logging
from PIL import Image
from torch.utils.data import Dataset, DataLoader
import pandas as pd

warnings.filterwarnings("ignore")
hf_logging.set_verbosity_error()

REPO_ID = "jihedjabnoun/faceemo-set"

MODEL_TYPE = "faceemo"   # "faceemo" or "combined"

FILES = {
    "faceemo": ("FaceEmo-Set_ViT_model_weights.pth", "FaceEmo-Set_ViT_model"),
    "combined": ("comb_data_ViT_model_weights.pth", "Combined_ViT_model")
}

MODEL_FILE, MODEL_NAME = FILES[MODEL_TYPE]

EMOTIONS = ["anger","disgust","fear","happiness","neutral","sadness","surprise"]
image_paths = ["9.png", "C.png"]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

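# Preprocessing (resize to 224x224, normalization) is taken from the base ViT checkpoint.
processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k", use_fast=True)

# Build the architecture from the base config with a 7-class head;
# the fine-tuned weights are loaded from this repository below.
config = ViTConfig.from_pretrained("google/vit-base-patch16-224-in21k", num_labels=len(EMOTIONS))
model = ViTForImageClassification(config)

weights_path = hf_hub_download(repo_id=REPO_ID, filename=MODEL_FILE)

# weights_only=True restricts unpickling to tensors and plain containers.
state = torch.load(weights_path, map_location="cpu", weights_only=True)
# Strip the "module." prefix that torch.nn.DataParallel adds to parameter names, if present.
if any(k.startswith("module.") for k in state):
    state = {k.replace("module.", "", 1): v for k, v in state.items()}

model.load_state_dict(state, strict=True)
model.to(device).eval()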

print(f"✅ {MODEL_NAME} loaded successfully")

class ImgListDataset(Dataset):
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

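    def __getitem__(self, idx):
        # The processor already resizes to 224x224, so no manual resize is needed;
        # this keeps preprocessing identical to the single-image example.
        image = Image.open(self.paths[idx]).convert("RGB")
        pixel = processor(images=image, return_tensors="pt")["pixel_values"].squeeze(0)
        return pixel, self.paths[idx]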

loader = DataLoader(ImgListDataset(image_paths), batch_size=32, shuffle=False)

rows = []
with torch.no_grad():
    for pixels, paths in loader:
        pixels = pixels.to(device)
        probs = torch.softmax(model(pixels).logits, dim=1)
        top = torch.topk(probs, 3, dim=1)

        for path, idxs, vals in zip(paths, top.indices.cpu().tolist(), top.values.cpu().tolist()):
            rows.append({
                "image": path,
                "top1": f"{EMOTIONS[idxs[0]]} ({vals[0]:.2%})",
                "top2": f"{EMOTIONS[idxs[1]]} ({vals[1]:.2%})",
                "top3": f"{EMOTIONS[idxs[2]]} ({vals[2]:.2%})",
            })

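# print() works in both scripts and notebooks; display() is only defined inside IPython.
print(pd.DataFrame(rows).to_string(index=False))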

Model Architecture

  • Base Model: Vision Transformer (ViT-Base/16)
  • Pre-training: ImageNet-21k
  • Input Size: 224×224 pixels
  • Patch Size: 16×16 pixels
  • Fine-tuning: Full parameter fine-tuning
  • Optimizer: Adam (learning rate: 5×10⁻⁵)
  • Loss Function: Cross-Entropy Loss
  • Batch Size: 32
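
To make the input and patch sizes concrete, the arithmetic below shows how a 224×224 RGB image turns into the token sequence a ViT-Base/16 encoder processes. It assumes the standard ViT layout with one prepended classification token; the variable names are illustrative.

```python
image_size = 224   # input height and width in pixels
patch_size = 16    # each patch is 16x16 pixels
channels = 3       # RGB

patches_per_side = image_size // patch_size       # 224 / 16 = 14
num_patches = patches_per_side ** 2               # 14 * 14 = 196
seq_len = num_patches + 1                         # plus one [CLS] token = 197
patch_dim = patch_size * patch_size * channels    # flattened patch length = 768

print(patches_per_side, num_patches, seq_len, patch_dim)  # 14 196 197 768
```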

Use Cases

  • Human-Computer Interaction: Emotion-aware interfaces
  • Mental Health Monitoring: Depression and anxiety screening
  • Customer Service: Sentiment analysis in video calls
  • Education: Student engagement monitoring
  • Entertainment: Audience reaction analysis
  • Security: Suspicious behavior detection

Limitations

  • Static image analysis only (no temporal modeling)
  • Seven basic emotions (no compound emotions or intensity levels)
  • May inherit biases from source datasets
  • Performance varies on extreme poses or occlusions
  • Annotation inconsistencies possible across integrated datasets
  • Requires face detection preprocessing: Images should contain cropped face regions, not full scenes
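
Because the models expect cropped faces, a detector's bounding box has to be turned into a crop before inference. The sketch below shows one possible way to do that. The face detector itself is not part of this repository, and `square_crop_box` and the 20% margin are assumptions for illustration.

```python
def square_crop_box(bbox, image_size, margin=0.2):
    """Expand a face box (left, top, right, bottom) into a padded square around its center.

    The result is clamped to the image bounds, so it may be non-square near the edges.
    """
    left, top, right, bottom = bbox
    width, height = image_size
    # Use the longer side of the face box, enlarged by the margin.
    side = max(right - left, bottom - top) * (1 + margin)
    cx, cy = (left + right) / 2, (top + bottom) / 2
    half = side / 2
    return (
        max(0, int(round(cx - half))),
        max(0, int(round(cy - half))),
        min(width, int(round(cx + half))),
        min(height, int(round(cy + half))),
    )


# Hypothetical detector output on a 640x480 image.
box = square_crop_box((100, 80, 200, 220), image_size=(640, 480))
print(box)  # (66, 66, 234, 234)
```

With Pillow, the crop itself would then be `Image.open(path).convert("RGB").crop(box)`, and the result can be passed to the processor as in the Quick Start examples.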

Citation

@inproceedings{jabnoun2026improving,
  title={Improving Cross-Dataset Generalization in Facial Emotion Recognition Through FaceEmo-Set: A Balanced and Diverse Dataset},
  author={Jabnoun, Jihed and Maraoui, Mohsen and Zrigui, Mounir},
  booktitle={Asian Conference on Intelligent Information and Database Systems},
  pages={355--369},
  year={2026},
  organization={Springer}
}

License

MIT License

Acknowledgments

This work was conducted at the Research Laboratory in Algebra, Numbers Theory and Intelligent Systems, University of Monastir, Tunisia.

Contact

Related Resources
