---
language:
- ko
- en
tags:
- transformer
- video
- audio
- homecam
- multimodal
- senior
- yolo
- mediapipe
---
# Model Card for `Silver-Multimodal`
<!-- Provide a quick summary of what the model is/does. -->
## Model Details
- The Silver-Multimodal model integrates both audio and video modalities for real-time situation classification.
- This architecture allows it to process diverse inputs simultaneously and identify scenarios like daily activities, violence, and fall events with high precision.
- The model leverages a Transformer-based architecture to combine features extracted from audio (MFCC) and video (MediaPipe keypoints), enabling robust multimodal learning.
- Key Highlights:
- Multimodal Integration: Combines YOLO, MediaPipe, and MFCC features for comprehensive situation understanding.
- Middle Fusion: The extracted features are fused and passed through the Transformer model for context-aware classification.
- Output Classes:
- 0 Daily Activities: Normal indoor movements like walking or sitting.
- 1 Violence: Aggressive behaviors or physical conflicts.
- 2 Fall Down: Sudden fall or collapse.
![Multimodal Model](./pics/multimodal-overview.png)
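For downstream code, the predicted index can be mapped back to these labels. The snippet below is only a minimal illustration; the constant and helper names are not part of the released code.

```python
import numpy as np

# Map a softmax output index back to a situation label (names are illustrative).
CLASS_NAMES = ["Daily Activities", "Violence", "Fall Down"]

def predicted_label(probabilities):
    """Return the human-readable class for one softmax output vector."""
    return CLASS_NAMES[int(np.argmax(probabilities))]
```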
### Model Description
<!-- Provide a longer summary of what this model is. -->
- **Developed during:** NIPA-Google (2024.10.23-2024.11.08), Kosa Hackathon (2024.12.09)
- **Model type:** Multimodal Transformer Model
- **API used:** Keras
- **Dataset:** [HuggingFace Silver-Multimodal-Dataset](https://huggingface.co/datasets/SilverAvocado/Silver-Multimodal-Dataset)
- **Code:** [GitHub Silver Model Code](https://github.com/silverAvocado/silver-model-code)
- **Language(s) (NLP):** Korean, English
## Training Details
### Dataset Preparation
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
- **HuggingFace:** [HuggingFace Silver-Multimodal-Dataset](https://huggingface.co/datasets/SilverAvocado/Silver-Multimodal-Dataset)
- **Description:**
- The dataset is designed to support the development of machine learning models for detecting daily activities, violence, and fall down scenarios from combined audio and video sources.
- The preprocessing pipeline leverages audio feature extraction, human keypoint detection, and relative positional encoding to generate a unified representation for training and inference (a minimal sketch follows this list).
- Classes:
- 0: Daily - Normal indoor activities
- 1: Violence - Aggressive behaviors
- 2: Fall Down - Sudden falls or collapses
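The exact preprocessing scripts live in the GitHub repository linked under Model Description; the snippet below is only a minimal sketch of the three steps named above (audio feature extraction, keypoint detection, relative positional encoding). The sample rate, MFCC count, and joint selection are assumptions, not the values used to build the dataset.

```python
# Illustrative preprocessing sketch; parameters and helper names are assumptions.
import cv2
import librosa
import mediapipe as mp
import numpy as np

def extract_mfcc(audio_path, sr=16000, n_mfcc=13):
    """Audio feature extraction: one MFCC vector per audio frame."""
    y, _ = librosa.load(audio_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, T)
    return mfcc.T                                           # shape (T, n_mfcc)

def extract_keypoints(frame_bgr, pose):
    """Human keypoint detection: (x, y, z) per MediaPipe Pose landmark for one frame."""
    result = pose.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if result.pose_landmarks is None:
        return np.zeros((33, 3))                            # MediaPipe Pose emits 33 landmarks
    return np.array([[lm.x, lm.y, lm.z] for lm in result.pose_landmarks.landmark])

def relative_coordinates(people_keypoints, joint_ids):
    """Relative positional encoding: pairwise joint offsets between detected people."""
    offsets = []
    for i in range(len(people_keypoints)):
        for j in range(i + 1, len(people_keypoints)):
            offsets.append(people_keypoints[i][joint_ids] - people_keypoints[j][joint_ids])
    return np.concatenate(offsets, axis=0) if offsets else np.zeros((0, 3))

pose = mp.solutions.pose.Pose(static_image_mode=False)
```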
### Model Details
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
- **Model Structure:**
![Multimodal Model Structure](./pics/model-structure.png)
- Input Shape and Division
1. Input Shape:
- The input shape for each branch is (N, 100, 750), where:
- N: Batch size (number of sequences in a batch).
- 100: Temporal dimension (time steps).
- 750: Feature dimension, representing extracted features for each input modality.
2. Why Four Inputs?
- The model processes four distinct inputs, each corresponding to a specific set of features derived from video keypoints. Here’s how they are divided:
- Input 1, Input 2, Input 3:
- For each detected individual (up to 3 people), the model extracts 30 keypoints using MediaPipe.
- Each keypoint contains 3 features (x, y, z), resulting in 30 x 3 = 90 features per frame.
- Input 4:
- Represents relative positional coordinates calculated from the 10 most important key joints (e.g., shoulders, elbows, knees) for all 3 individuals.
- These relative coordinates capture spatial relationships among individuals, crucial for contextual understanding.
- Detailed Explanation of Architecture (see the Keras sketch after the performance summary below)
1. Positional Encoding:
- Adds temporal position information to the input embeddings, allowing the transformer to consider the sequence order.
2. Multi-Head Attention:
- Captures interdependencies and relationships across the temporal dimension within each input.
- Ensures the model focuses on the most relevant frames or segments of the sequence.
3. Dropout:
- Applies dropout regularization to prevent overfitting and improve generalization.
4. LayerNormalization:
- Normalizes the output of each layer to stabilize training and accelerate convergence.
5. Dense Layers:
- Extracts higher-level features after the attention mechanism.
- The first dense layer processes features from attention, followed by another dropout and dense layer to refine features further.
6. AttentionPooling1D:
- Combines outputs from all four inputs into a unified representation.
- Aggregates temporal features using an attention mechanism, emphasizing the most important segments across modalities.
7. Final Dense Layers:
- The combined representation is passed through dense layers and a softmax activation function for final classification into target classes:
- 0: Daily Activities
- 1: Violence
- 2: Fall Down
- **Model Performance:**
![Confusion Matrix](./pics/confusion-matrix.png)
- Confusion Matrix Insights:
- Class 0 (Daily): 100% accuracy with no misclassifications.
- Class 1 (Violence): 96.96% accuracy with minimal false positives or false negatives.
- Class 2 (Fall Down): 98.67% accuracy, highlighting the model’s robustness in detecting falls.
- The overall accuracy is 98.37%, indicating the model’s reliability for real-time applications.
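The snippet below is a minimal Keras sketch of the per-branch Transformer block, fusion, and attention pooling described under Model Structure above. It is not the released training code: layer sizes, head counts, and dropout rates are assumptions, and the `PositionalEncoding` / `AttentionPooling1D` classes here are simplified stand-ins for the custom layers shipped with the GitHub model code.

```python
# Minimal Keras sketch of the architecture described above; hyperparameters are assumed.
import tensorflow as tf
from tensorflow.keras import layers, Model

class PositionalEncoding(layers.Layer):
    """Adds sinusoidal position information along the temporal axis."""
    def call(self, x):
        seq_len, dim = tf.shape(x)[1], x.shape[-1]
        pos = tf.cast(tf.range(seq_len)[:, None], tf.float32)
        i = tf.cast(tf.range(dim)[None, :], tf.float32)
        angles = pos / tf.pow(10000.0, (2.0 * (i // 2)) / float(dim))
        enc = tf.where(tf.cast(tf.range(dim) % 2, tf.bool), tf.cos(angles), tf.sin(angles))
        return x + enc[None, ...]

class AttentionPooling1D(layers.Layer):
    """Collapses the time axis using learned attention weights."""
    def build(self, input_shape):
        self.score = layers.Dense(1)
    def call(self, x):
        weights = tf.nn.softmax(self.score(x), axis=1)  # (batch, time, 1)
        return tf.reduce_sum(weights * x, axis=1)       # (batch, features)

def branch(inp, num_heads=4, key_dim=64, dropout=0.2):
    """Positional encoding -> multi-head attention -> dropout/norm -> dense refinement."""
    x = PositionalEncoding()(inp)
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(x, x)
    x = layers.LayerNormalization()(layers.Dropout(dropout)(attn) + x)
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dropout(dropout)(x)
    return layers.Dense(128, activation="relu")(x)

inputs = [layers.Input(shape=(100, 750)) for _ in range(4)]     # 3 people + relative coordinates
merged = layers.Concatenate()([branch(inp) for inp in inputs])  # middle fusion of the 4 branches
pooled = AttentionPooling1D()(merged)                           # attention over the time axis
x = layers.Dense(128, activation="relu")(pooled)
outputs = layers.Dense(3, activation="softmax")(x)              # Daily / Violence / Fall Down
model = Model(inputs, outputs)
```

The per-branch refinement followed by concatenation and a single attention pooling step mirrors the middle-fusion behaviour described in Key Highlights: each input stream is contextualized on its own before the fused sequence is summarized for classification.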
## Model Usage
- `Silver Assistant` Project
- [GitHub SilverAvocado](https://github.com/silverAvocado)
## Load Model For Inference
```python
import numpy as np
from huggingface_hub import hf_hub_download
from sklearn.metrics import accuracy_score
from tensorflow.keras.models import load_model

# Download the model weights from the Hugging Face Hub
MODEL_PATH = "silver_assistant_transformer.keras"
model_path = hf_hub_download(repo_id="SilverAvocado/Silver-Multimodal", filename=MODEL_PATH)

# Load the model together with its custom layers
# (PositionalEncoding and AttentionPooling1D are defined in the GitHub model code)
model = load_model(
    model_path,
    custom_objects={
        "PositionalEncoding": PositionalEncoding,
        "AttentionPooling1D": AttentionPooling1D,
    },
)

# Run inference on the four test inputs and report accuracy
y_pred = np.argmax(model.predict([X_test1, X_test2, X_test3, X_test4]), axis=1)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy:.4f}")
```
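To reproduce a per-class breakdown like the confusion matrix reported under Model Performance, the same predictions can be passed to scikit-learn's reporting utilities (variable names follow the snippet above):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Per-class view over Daily / Violence / Fall Down predictions
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["Daily", "Violence", "Fall Down"]))
```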
## Conclusion
- The Silver-Multimodal model demonstrates exceptional capabilities in multimodal learning for situation classification.
- Its ability to effectively integrate audio and video modalities ensures:
1. High Accuracy: Consistent performance across all classes.
2. Real-World Applicability: Suitable for applications like healthcare monitoring, safety systems, and smart homes.
3. Scalable Architecture: Transformer-based design allows future enhancements and additional modality integration.
- This model sets a new benchmark for multimodal AI systems, empowering safety-critical projects like `Silver Assistant` with state-of-the-art situation awareness.