metadata
datasets:
  - Souha-BH/DetectingRiskyHealthBehaviorsInTikTokVideos

Model Card for Multimodal Risk Behavior Detection Model

Model Overview

The Multimodal Risk Behavior Detection Model detects risky health behaviors in TikTok videos. It uses both the caption text and the video frames to classify whether a video portrays a risky health behavior, such as smoking, alcohol consumption, or unhealthy eating habits. The model integrates two pre-trained architectures, BERT for text feature extraction and ResNet50 for video frame analysis, and combines their outputs to make a prediction.

Training Data

The model was trained on the "Detecting Risky Health Behaviors in TikTok Videos" dataset: https://huggingface.co/datasets/Souha-BH/DetectingRiskyHealthBehaviorsInTikTokVideos. The dataset includes video metadata, captions, and video clips, and each video is labeled as either risky or non-risky. The data was collected using the Apify TikTok Hashtag Scraper and annotated for risky health behaviors. The model extracts text features from the text column (captions) and visual features from the corresponding video files.

  • Training-Validation-Test Split: The dataset was split into training, validation, and test sets using the "Split" column.
      • Training set: Used to train the model's parameters.
      • Validation set: Used to tune hyperparameters and avoid overfitting.
      • Test set: Used to evaluate the final performance of the model.
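
As a rough illustration of splitting on the "Split" column, the sketch below groups metadata rows by that column's value. The split names ("train", "val", "test") and the "video_file" and "label" fields are hypothetical placeholders; only the "Split" and "text" column names come from this card.

```python
from collections import defaultdict


def group_by_split(rows, split_column="Split"):
    """Group metadata rows (dicts) by the value of their split column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[split_column]].append(row)
    return dict(groups)


# Toy rows standing in for the dataset metadata.
# Split names, "video_file", and "label" are assumptions for illustration.
rows = [
    {"text": "caption a", "video_file": "a.mp4", "label": 1, "Split": "train"},
    {"text": "caption b", "video_file": "b.mp4", "label": 0, "Split": "train"},
    {"text": "caption c", "video_file": "c.mp4", "label": 1, "Split": "val"},
    {"text": "caption d", "video_file": "d.mp4", "label": 0, "Split": "test"},
]

splits = group_by_split(rows)
for name, subset in splits.items():
    print(name, len(subset))
```

Grouping on whatever values the column contains avoids hard-coding split names that this card does not specify.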

Model Architecture

The Multimodal Risk Behavior Detection Model follows a multimodal approach that integrates both textual and visual modalities.

  • Textual Features: Extracted using BERT (bert-base-uncased), with tokenized video captions passed through BERT's transformer layers.
  • Visual Features: Extracted using ResNet50, where frames from each TikTok video are resized and processed to generate high-level visual embeddings.
  • Feature Fusion: The embeddings from BERT and ResNet50 are concatenated and passed through a series of fully connected layers with ReLU activations and dropout regularization to prevent overfitting.
  • Classification Layer: The final layer is a single-unit sigmoid layer that outputs a probability between 0 and 1, with 0.5 as the threshold for classification.
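
The fusion and classification steps above can be sketched in PyTorch as follows. This is a minimal sketch, not the released implementation: the class name `FusionHead`, the hidden sizes, the dropout rate, and the mean-pooling of frame embeddings are all assumptions. The input widths are the standard ones for these backbones (768 for bert-base-uncased, 2048 for ResNet50 pooled features), and random tensors stand in for real embeddings.

```python
import torch
from torch import nn


class FusionHead(nn.Module):
    """Concatenate text and video embeddings, then classify (hypothetical sketch)."""

    def __init__(self, text_dim=768, visual_dim=2048, hidden_dim=512, dropout=0.3):
        super().__init__()
        # Hidden sizes and dropout rate are assumptions; the card does not state them.
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + visual_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim // 2, 1),
            nn.Sigmoid(),  # single-unit probability output
        )

    def forward(self, text_emb, frame_embs):
        # frame_embs has shape (batch, num_frames, visual_dim).
        # Averaging over frames is an assumption about how frames are pooled.
        video_emb = frame_embs.mean(dim=1)
        fused = torch.cat([text_emb, video_emb], dim=1)
        return self.classifier(fused).squeeze(1)


head = FusionHead().eval()
text_emb = torch.randn(4, 768)         # stand-in for BERT caption embeddings
frame_embs = torch.randn(4, 10, 2048)  # stand-in for ResNet50 frame embeddings
with torch.no_grad():
    probs = head(text_emb, frame_embs)
predicted_risky = probs > 0.5          # 0.5 decision threshold
print(probs.shape, predicted_risky.tolist())
```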

Training Procedure

  • Loss Function: Binary Cross-Entropy Loss (BCE) was used to compute the error between predicted probabilities and true labels.
  • Optimizer: Adam optimizer with a learning rate of 2e-5.
  • Batch Size: 4 video samples per batch.
  • Epochs: The model was trained for 5 epochs.
  • Video Frame Limit: 10 frames were sampled from each video to reduce computational overhead.
  • Augmentation and Normalization: Frames were resized to 224x224 pixels and normalized using the ImageNet per-channel mean and standard deviation.
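
The frame limit and normalization steps can be illustrated with a short NumPy sketch. Evenly spaced sampling is an assumption, since this card does not say how the 10 frames were chosen, and the helper names are hypothetical. Resizing to 224x224 is assumed to have happened before `normalize_frame` is called. The mean and standard deviation values are the standard ImageNet statistics.

```python
import numpy as np

# Standard ImageNet per-channel statistics (RGB order).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def sample_frame_indices(total_frames, num_frames=10):
    """Pick up to num_frames evenly spaced frame indices (assumed strategy)."""
    if total_frames <= 0:
        return []
    if total_frames <= num_frames:
        return list(range(total_frames))
    return np.linspace(0, total_frames - 1, num_frames).round().astype(int).tolist()


def normalize_frame(frame_rgb_uint8):
    """Scale an HxWx3 uint8 RGB frame to [0, 1], normalize, return 3xHxW float32."""
    scaled = frame_rgb_uint8.astype(np.float32) / 255.0
    normalized = (scaled - IMAGENET_MEAN) / IMAGENET_STD
    return normalized.transpose(2, 0, 1)


indices = sample_frame_indices(total_frames=300)
frame = np.zeros((224, 224, 3), dtype=np.uint8)  # placeholder for a resized frame
tensor_like = normalize_frame(frame)
print(indices)
print(tensor_like.shape, tensor_like.dtype)
```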

Evaluation Metrics

The model was evaluated on the test set using the following metrics:

  • Accuracy: Measures the proportion of correct predictions.
  • Precision: Measures how many of the predicted "risky" videos were actually risky.
  • Recall: Measures how many of the actual risky videos were correctly identified.
  • F1 Score: The harmonic mean of precision and recall, balancing both metrics.
  • ROC-AUC: Measures the area under the ROC curve, showing the model's ability to distinguish between risky and non-risky videos.
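
For reference, the metrics listed above can be computed from labels and predicted probabilities as in the sketch below. The function names and the toy data are illustrative only, and the evaluation code actually used for this model may differ (for example, it may rely on a metrics library). ROC-AUC is computed here as the fraction of (risky, non-risky) pairs in which the risky video receives the higher score, with ties counted as half.

```python
def classification_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, precision, recall, and F1 for binary labels (1 = risky)."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true, y_prob):
    """Pairwise ROC-AUC: P(score of a positive > score of a negative)."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, not taken from the test set.
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.7, 0.2, 0.1]
metrics = classification_metrics(y_true, y_prob)
auc = roc_auc(y_true, y_prob)
print(metrics, auc)
```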

Model Performance

After training for 5 epochs, the model's performance on the test set was as follows:

  • Accuracy: 63.33%
  • Precision: 55.00%
  • Recall: 84.62%
  • F1 Score: 66.67%
  • ROC-AUC: 65.84%
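
The reported accuracy, precision, recall, and F1 are mutually consistent, which the arithmetic below checks. The confusion counts used here are hypothetical: they are not reported in this card and are simply one set of integers that reproduces the four percentages after rounding. Other counts, such as multiples of these, would match equally well.

```python
# Hypothetical confusion counts (NOT reported in the model card),
# chosen only because they reproduce the reported percentages.
tp, fp, fn, tn = 11, 9, 2, 8
total = tp + fp + fn + tn

accuracy = (tp + tn) / total    # 19 / 30
precision = tp / (tp + fp)      # 11 / 20
recall = tp / (tp + fn)         # 11 / 13
f1 = 2 * precision * recall / (precision + recall)

# Percentages as stated in the Model Performance section.
reported = {"accuracy": 63.33, "precision": 55.00, "recall": 84.62, "f1": 66.67}
computed = {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

for name, value in computed.items():
    print(f"{name}: computed {value * 100:.2f}%, reported {reported[name]:.2f}%")
```

If counts of this size reflect the real test set, it is small, and each metric would shift noticeably with a handful of additional videos.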

Usage

  • Input: A TikTok video and its corresponding caption.
  • Output: A probability score indicating the likelihood that the video depicts a risky health behavior.
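
Turning the output probability into a label can be sketched as below, using the 0.5 threshold described in the Model Architecture section. The function name is hypothetical, and treating a probability of exactly 0.5 as non-risky is an assumption, since this card does not specify how ties are handled.

```python
def label_from_probability(probability, threshold=0.5):
    """Map the model's output probability to a class label (hypothetical helper)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # Assumption: only scores strictly above the threshold count as risky.
    return "risky" if probability > threshold else "non-risky"


print(label_from_probability(0.82))
print(label_from_probability(0.31))
```

Lowering the threshold would trade precision for recall, and raising it would do the opposite.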

Limitations

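  • Data Balance: If the dataset is imbalanced (more non-risky videos than risky ones), the model may struggle with precision. On the test set, precision (55.00%) is already well below recall (84.62%), so about 45% of the videos flagged as risky are not actually risky.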
  • Contextual Understanding: The model relies heavily on textual captions. If a caption does not explicitly describe risky behavior, the model may underperform.
  • Limited Frame Sampling: Only 10 frames per video are processed, which may miss important content, especially for longer videos.