---
datasets:
- Souha-BH/DetectingRiskyHealthBehaviorsInTikTokVideos
---
|
# Model Card for Multimodal Risk Behavior Detection Model |
|
|
|
## Model Overview |
|
The Multimodal Risk Behavior Detection Model detects risky health behaviors in TikTok videos.

By leveraging both visual and textual features of a video, the model classifies whether it portrays risky health behaviors such as smoking, alcohol consumption, or unhealthy eating habits.
|
The model integrates two pre-trained architectures: BERT for text feature extraction and ResNet50 for video frame analysis, combining their outputs to make predictions. |
|
|
|
## Training Data |
|
The model was trained on the [Detecting Risky Health Behaviors in TikTok Videos](https://huggingface.co/datasets/Souha-BH/DetectingRiskyHealthBehaviorsInTikTokVideos) dataset.
|
This dataset includes video metadata, captions, and video clips, which are labeled as either risky or non-risky. The data was collected using the Apify TikTok Hashtag Scraper and annotated for risky health behaviors. |
|
The model uses the text column (captions) and the corresponding video files from the dataset to extract text and visual features. |
|
|
|
- Training-Validation-Test Split: The dataset was split into training, validation, and test sets using the "Split" column. |
|
- Training set: Used to train the model's parameters. |
|
- Validation set: Used to tune hyperparameters and avoid overfitting. |
|
- Test set: Used to evaluate the final performance of the model. |
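The split logic above can be sketched with a toy metadata table. The column names `text` and `label`, and the `Split` values `train`/`val`/`test`, are assumptions about the dataset schema for illustration:

```python
import pandas as pd

# Toy stand-in for the dataset's metadata table; the real data comes from
# the Hugging Face Hub, and the "Split" values shown here are assumed.
df = pd.DataFrame({
    "text":  ["caption a", "caption b", "caption c", "caption d"],
    "label": [1, 0, 1, 0],
    "Split": ["train", "train", "val", "test"],
})

train_df = df[df["Split"] == "train"]  # fit model parameters
val_df = df[df["Split"] == "val"]      # tune hyperparameters
test_df = df[df["Split"] == "test"]    # final evaluation

print(len(train_df), len(val_df), len(test_df))  # 2 1 1
```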
|
|
|
## Model Architecture |
|
The Multimodal Risk Behavior Detection Model follows a multimodal approach that integrates both textual and visual modalities. |
|
|
|
- Textual Features: Extracted using BERT (bert-base-uncased), with tokenized video captions passed through BERT's transformer layers. |
|
- Visual Features: Extracted using ResNet50, where frames from each TikTok video are resized and processed to generate high-level visual embeddings. |
|
- Feature Fusion: The embeddings from BERT and ResNet50 are concatenated and passed through a series of fully connected layers with ReLU activations and dropout regularization to prevent overfitting. |
|
- Classification Layer: The final layer is a single-unit sigmoid layer that outputs a probability between 0 and 1, with 0.5 as the threshold for classification. |
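A minimal PyTorch sketch of the fusion head described above, assuming 768-d BERT pooled embeddings and 2048-d ResNet50 features; the hidden width and dropout rate are illustrative assumptions, not the trained model's exact values:

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Concatenates text and visual embeddings, then classifies.

    Dimensions: 768 (BERT pooled output) + 2048 (ResNet50 global pool).
    The hidden width and dropout probability are assumed values.
    """

    def __init__(self, text_dim=768, vis_dim=2048, hidden=512, p_drop=0.3):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + vis_dim, hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden, 1),  # single output unit
            nn.Sigmoid(),          # probability in [0, 1]
        )

    def forward(self, text_emb, vis_emb):
        fused = torch.cat([text_emb, vis_emb], dim=1)
        return self.head(fused)

model = FusionClassifier().eval()
with torch.no_grad():
    probs = model(torch.randn(4, 768), torch.randn(4, 2048))
print(probs.shape)  # torch.Size([4, 1])
```

A video is labeled risky when the output probability exceeds the 0.5 threshold.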
|
|
|
### Training Procedure |
|
|
|
- Loss Function: Binary Cross-Entropy Loss (BCE) was used to compute the error between predicted probabilities and true labels. |
|
- Optimizer: Adam optimizer with a learning rate of 2e-5. |
|
- Batch Size: 4 video samples per batch. |
|
- Epochs: The model was trained for 5 epochs. |
|
- Video Frame Limit: Up to 10 frames were sampled from each video to reduce computational overhead.
|
- Preprocessing and Normalization: Frames were resized to 224x224 and normalized using ImageNet's mean and standard deviation.
|
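A minimal training-step sketch with the stated hyperparameters (BCE loss, Adam at 2e-5, batch size 4, 5 epochs); the fusion network is replaced by a small stand-in head over random fused features so the sketch is self-contained:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the fusion network: linear head + sigmoid over 2816-d
# fused features (768 text + 2048 visual).
model = nn.Sequential(nn.Linear(2816, 1), nn.Sigmoid())
criterion = nn.BCELoss()                                   # binary cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)  # stated learning rate

features = torch.randn(4, 2816)  # one batch of 4 samples
labels = torch.tensor([[1.], [0.], [1.], [0.]])

for epoch in range(5):           # 5 epochs, as stated
    optimizer.zero_grad()
    probs = model(features)
    loss = criterion(probs, labels)
    loss.backward()
    optimizer.step()

print(round(loss.item(), 4))
```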
|
### Evaluation Metrics |
|
The model was evaluated on the test set using the following metrics: |
|
|
|
- Accuracy: Measures the proportion of correct predictions. |
|
- Precision: Measures how many of the predicted "risky" videos were actually risky. |
|
- Recall: Measures how many of the actual risky videos were correctly identified. |
|
- F1 Score: The harmonic mean of precision and recall, balancing both metrics. |
|
- ROC-AUC: Measures the area under the ROC curve, showing the model's ability to distinguish between risky and non-risky videos. |
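All five metrics can be computed with scikit-learn from the model's probabilities; the labels and probabilities below are hypothetical, for illustration only:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical ground-truth labels and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3]
y_pred = [int(p >= 0.5) for p in y_prob]  # apply the 0.5 threshold

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print("roc_auc  :", roc_auc_score(y_true, y_prob))  # uses raw probabilities
```

Note that ROC-AUC is computed from the raw probabilities, while the thresholded predictions feed the other four metrics.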
|
|
|
### Model Performance |
|
After training for 5 epochs, the model's performance on the test set was as follows: |
|
|
|
- Accuracy: 63.33% |
|
- Precision: 55.00% |
|
- Recall: 84.62% |
|
- F1 Score: 66.67% |
|
- ROC-AUC: 65.84% |
|
|
|
### Usage |
|
|
|
- Input: A TikTok video and its corresponding caption. |
|
- Output: A probability score indicating the likelihood that the video depicts a risky health behavior. |
|
|
|
### Limitations |
|
- Data Balance: If the dataset is imbalanced (more non-risky videos than risky ones), the model may struggle with precision, which may help explain the relatively low test precision (55.00%).
|
- Contextual Understanding: The model relies heavily on textual captions. If a caption does not explicitly describe risky behavior, the model may underperform. |
|
- Limited Frame Sampling: Only 10 frames per video are processed, which may miss important content, especially for longer videos. |