
# Multimodal Sentiment Model with Augmentation

A multimodal sentiment analysis model based on DeBERTa-v3-Large (CMU-MOSEI).

## Model Description

์ด ๋ชจ๋ธ์€ CMU-MOSEI ๋ฐ์ดํ„ฐ์…‹์—์„œ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ๊ฐ์„ฑ ๋ถ„์„์„ ์œ„ํ•ด ํ•™์Šต๋˜์—ˆ์Šต๋‹ˆ๋‹ค. IITP "๋‚˜๋น„ํšจ๊ณผ" ์—ฐ๊ตฌ ํ”„๋กœ์ ํŠธ์˜ ์ผํ™˜์œผ๋กœ ๊ฐœ๋ฐœ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

## Architecture

- Text Encoder: DeBERTa-v3-Large (microsoft/deberta-v3-large)
- Audio Encoder: Transformer Encoder (2 layers)
- Video Encoder: Transformer Encoder (2 layers)
- Fusion: Cross-modal attention + multi-head self-attention (see the sketch after this list)
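
The fusion code itself is not shown on this card, so the following is a minimal PyTorch sketch of the design described above, assuming text acts as the query against audio/video keys with residual connections. The class name `CrossModalFusion` and the exact layer layout are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative fusion: text queries attend to audio/video, then
    self-attention runs over the fused sequence. Hyperparameters mirror
    the card (hidden_size=512, num_heads=8, dropout=0.2)."""

    def __init__(self, hidden_size=512, num_heads=8, dropout=0.2):
        super().__init__()
        # Audio/video branches are 2-layer Transformer encoders, as listed above
        # (nn.TransformerEncoder deep-copies the layer, so sharing it is safe)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, dropout=dropout, batch_first=True)
        self.audio_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.video_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Cross-modal attention: text as query, each other modality as key/value
        self.text_to_audio = nn.MultiheadAttention(
            hidden_size, num_heads, dropout=dropout, batch_first=True)
        self.text_to_video = nn.MultiheadAttention(
            hidden_size, num_heads, dropout=dropout, batch_first=True)
        # Multi-head self-attention over the fused representation
        self.self_attn = nn.MultiheadAttention(
            hidden_size, num_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden_size)
        self.norm2 = nn.LayerNorm(hidden_size)

    def forward(self, text, audio, video):
        # All inputs: (batch, seq_len, hidden_size), already projected to 512-d
        audio = self.audio_encoder(audio)
        video = self.video_encoder(video)
        ta, _ = self.text_to_audio(text, audio, audio)
        tv, _ = self.text_to_video(text, video, video)
        fused = self.norm1(text + ta + tv)  # residual combination (assumption)
        out, _ = self.self_attn(fused, fused, fused)
        return self.norm2(fused + out)

# Quick shape check with random pre-projected features
t, a, v = torch.randn(2, 50, 512), torch.randn(2, 500, 512), torch.randn(2, 500, 512)
print(CrossModalFusion()(t, a, v).shape)  # torch.Size([2, 50, 512])
```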

## Key Features

- Cross-modal attention between text, audio, and video
- Mixup augmentation for the audio/video modalities (sketched after this list)
- Multi-task learning with auxiliary classifiers (T, A, V branches)
- First 20 layers of DeBERTa frozen for efficient training (also sketched below)
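
A minimal sketch of the mixup augmentation, assuming it mixes audio/video feature sequences (and regression labels) within a batch, using the alpha/prob values from the training details below; the function name and the choice to leave text unmixed are assumptions:

```python
import torch

def mixup_features(audio, video, labels, alpha=0.4, prob=0.5):
    """Mix batch elements of the non-text modalities with a Beta-sampled
    coefficient; regression labels mix linearly with the same coefficient."""
    if torch.rand(1).item() > prob:        # apply mixup only with probability `prob`
        return audio, video, labels
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(audio.size(0))   # random pairing within the batch
    audio = lam * audio + (1 - lam) * audio[perm]
    video = lam * video + (1 - lam) * video[perm]
    labels = lam * labels + (1 - lam) * labels[perm]
    return audio, video, labels
```

The layer freezing can be reproduced roughly as follows (DeBERTa-v3-Large has 24 encoder layers; whether the embeddings are also frozen is an assumption):

```python
from transformers import AutoModel

deberta = AutoModel.from_pretrained('microsoft/deberta-v3-large')
for p in deberta.embeddings.parameters():   # freezing embeddings is an assumption
    p.requires_grad = False
for layer in deberta.encoder.layer[:20]:    # freeze encoder layers 0-19
    for p in layer.parameters():
        p.requires_grad = False
```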

## Performance

| Metric     | Score  |
|------------|--------|
| Mult_acc_7 | 56.17% |
| Mult_acc_5 | 57.83% |
| Has0_acc_2 | ~84%   |
| MAE        | -      |
| Corr       | -      |

### Comparison with Baselines

| Model       | Mult_acc_7 |
|-------------|------------|
| MulT (2020) | 50.7%      |
| MMML (2023) | 54.95%     |
| Ours        | 56.17%     |

## Training Details

- Dataset: CMU-MOSEI (unaligned_50.pkl)
- Batch Size: 16
- Learning Rate: 2e-5 (non-DeBERTa parameters), 5e-6 (DeBERTa); see the optimizer sketch after this list
- Epochs: 50 (early stopping patience: 15)
- Optimizer: AdamW
- Scheduler: Cosine with warmup
- Mixup: alpha=0.4, prob=0.5
- Loss weights: cls=0.7, aux=0.1
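
The training script is not included on this card, but the two learning rates and the cosine warmup schedule above can be wired up roughly as follows; the stand-in model, `steps_per_epoch`, and the 10% warmup ratio are placeholders, not values from the card:

```python
import torch
import torch.nn as nn
from transformers import get_cosine_schedule_with_warmup

# Hypothetical stand-in for the real model: one "deberta" submodule plus a
# head, just so the parameter-group split below has something to act on.
model = nn.ModuleDict({'deberta': nn.Linear(8, 8), 'head': nn.Linear(8, 8)})

deberta_params = [p for n, p in model.named_parameters() if n.startswith('deberta')]
other_params = [p for n, p in model.named_parameters() if not n.startswith('deberta')]
optimizer = torch.optim.AdamW([
    {'params': deberta_params, 'lr': 5e-6},  # DeBERTa learning rate
    {'params': other_params, 'lr': 2e-5},    # everything else
])

steps_per_epoch = 100                        # placeholder value
num_training_steps = 50 * steps_per_epoch    # 50 epochs
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # warmup ratio is an assumption
    num_training_steps=num_training_steps,
)

# Weighted multi-task loss; the exact auxiliary combination is an assumption:
# total = 0.7 * cls_loss + 0.1 * (aux_t_loss + aux_a_loss + aux_v_loss)
```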

## Usage

```python
import torch
from transformers import AutoTokenizer

# Model class from this repo's training script
from train_deberta_multimodal import DeBERTaMultimodalModel

# Load checkpoint on CPU; weights_only=False because the checkpoint
# stores the training args object, not just tensors
checkpoint = torch.load('best_model.pt', map_location='cpu', weights_only=False)
args = checkpoint['args']

# Initialize the model with the same hyperparameters used for training
model = DeBERTaMultimodalModel(
    model_name='microsoft/deberta-v3-large',
    hidden_size=512,
    num_heads=8,
    dropout=0.2,
    freeze_deberta_layers=20
)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-large')
```

### Input Format

- Text: Raw text string (tokenized by the DeBERTa tokenizer)
- Audio: COVAREP features (74-dim, 500 timesteps)
- Video: OpenFace features (35-dim, 500 timesteps)
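
Continuing from the Usage snippet, the smoke test below builds dummy inputs with exactly these shapes and runs one forward pass; the forward-argument names (`input_ids`, `attention_mask`, `audio`, `video`) and the max text length are assumptions, since the card does not show the model's signature:

```python
import torch

# Dummy inputs matching the shapes above (batch of 1)
text = "This movie was surprisingly good."
enc = tokenizer(text, return_tensors='pt', padding='max_length',
                truncation=True, max_length=128)  # max_length is an assumption
audio = torch.randn(1, 500, 74)  # COVAREP: 500 timesteps x 74 dims
video = torch.randn(1, 500, 35)  # OpenFace: 500 timesteps x 35 dims

with torch.no_grad():
    # NOTE: keyword names are assumed; check the model's forward() signature
    output = model(input_ids=enc['input_ids'],
                   attention_mask=enc['attention_mask'],
                   audio=audio, video=video)
print(output)
```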