---
language:
  - ko
  - en
tags:
  - transformer
  - video
  - audio
  - homecam 
  - multimodal
  - senior
  - yolo
  - mediapipe
---

# Model Card for `Silver-Multimodal`

<!-- Provide a quick summary of what the model is/does. -->

## Model Details

- The Silver-Multimodal model integrates both audio and video modalities for real-time situation classification. 
- This architecture allows it to process diverse inputs simultaneously and identify scenarios like daily activities, violence, and fall events with high precision. 
- The model leverages a Transformer-based architecture to combine features extracted from audio (MFCC) and video (MediaPipe keypoints), enabling robust multimodal learning.

- Key Highlights:
	- Multimodal Integration: Combines YOLO, MediaPipe, and MFCC features for comprehensive situation understanding.
	- Middle Fusion: The extracted features are fused and passed through the Transformer model for context-aware classification (a minimal fusion sketch follows the overview figure below).
	- Output Classes:
	    - 0 Daily Activities: Normal indoor movements like walking or sitting.
	    - 1 Violence: Aggressive behaviors or physical conflicts.
	    - 2 Fall Down: Sudden fall or collapse.

![Multimodal Model](./pics/multimodal-overview.png)
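
As a conceptual illustration of the middle-fusion highlight above, the hedged sketch below simply concatenates per-time-step audio (MFCC) and video (keypoint) feature vectors into one representation that a downstream Transformer would consume. The array shapes and feature sizes are illustrative assumptions, not the released pipeline.

```python
import numpy as np

# Illustrative per-clip features (shapes are assumptions, not the real pipeline).
mfcc_features = np.random.rand(100, 40)       # 100 time steps x 40 MFCC coefficients
keypoint_features = np.random.rand(100, 90)   # 100 time steps x (30 keypoints * 3 coords)

# Middle fusion: merge modality features per time step *before* classification,
# rather than classifying each modality separately and combining the decisions.
fused = np.concatenate([mfcc_features, keypoint_features], axis=-1)
print(fused.shape)  # (100, 130) fused feature sequence
```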

### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Activity with:** NIPA-Google (2024.10.23-2024.11.08), Kosa Hackathon (2024.12.9)
- **Model type:** Multimodal Transformer Model
- **API used:** Keras
- **Dataset:** [HuggingFace Silver-Multimodal-Dataset](https://huggingface.co/datasets/SilverAvocado/Silver-Multimodal-Dataset)
- **Code:** [GitHub Silver Model Code](https://github.com/silverAvocado/silver-model-code)
- **Language(s) (NLP):** Korean, English

## Training Details
### Dataset Preparation

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

- **HuggingFace:** [HuggingFace Silver-Multimodal-Dataset](https://huggingface.co/datasets/SilverAvocado/Silver-Multimodal-Dataset)
- **Description:**
    - The dataset is designed to support the development of machine learning models for detecting daily activities, violence, and fall-down scenarios from combined audio and video sources.
    - The preprocessing pipeline leverages audio feature extraction, human keypoint detection, and relative positional encoding to generate a unified representation for training and inference (a minimal sketch of these steps follows this list).
    - Classes:
        - 0: Daily - Normal indoor activities
        - 1: Violence - Aggressive behaviors
        - 2: Fall Down - Sudden falls or collapses
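
To make those steps concrete, here is a hedged sketch of the two feature extractors named above (MFCC for audio, MediaPipe Pose for video). File paths, the sampling rate, and MFCC settings are illustrative, and person detection (e.g. YOLO) is omitted for brevity; the exact preprocessing lives in the dataset repository linked above.

```python
import cv2
import librosa
import mediapipe as mp
import numpy as np

# Audio branch: MFCC features from the clip's audio track (illustrative settings).
waveform, sr = librosa.load("clip_audio.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=40)        # (40, n_frames)

# Video branch: per-frame pose keypoints via MediaPipe (single person shown here).
pose = mp.solutions.pose.Pose(static_image_mode=False)
cap = cv2.VideoCapture("clip_video.mp4")
keypoints = []
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if result.pose_landmarks:
        # MediaPipe Pose returns 33 (x, y, z) landmarks per person; the dataset
        # card describes keeping 30 keypoints per detected individual.
        keypoints.append([(lm.x, lm.y, lm.z) for lm in result.pose_landmarks.landmark])
cap.release()
pose.close()
keypoints = np.asarray(keypoints)                                # (n_frames, 33, 3)
```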

### Model Details

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

- **Model Structure:**
    ![Multimodal Model Structure](./pics/model-structure.png)

    - Input Shape and Division
	    1.	Input Shape:
            - The input shape for each branch is (N, 100, 750), where:
	            - N: Batch size (number of sequences in a batch).
	            - 100: Temporal dimension (time steps).
	            - 750: Feature dimension, representing extracted features for each input modality.
        2.	Why Four Inputs?
            - The model processes four distinct inputs, each corresponding to a specific set of features derived from video keypoints. Here’s how they are divided:
            - Input 1, Input 2, Input 3:
                - For each detected individual (up to 3 people), the model extracts 30 keypoints using MediaPipe.
                - Each keypoint contains 3 features (x, y, z), resulting in 30 x 3 = 90 features per frame.
            - Input 4:
                - Represents relative positional coordinates calculated from the 10 most important key joints (e.g., shoulders, elbows, knees) for all 3 individuals.
                - These relative coordinates capture spatial relationships among individuals, crucial for contextual understanding.

    - Detailed Explanation of Architecture (a hedged Keras sketch follows this list)
	    1.	Positional Encoding:
	        - Adds temporal position information to the input embeddings, allowing the transformer to consider the sequence order.
	    2.	Multi-Head Attention:
	        - Captures interdependencies and relationships across the temporal dimension within each input.
	        - Ensures the model focuses on the most relevant frames or segments of the sequence.
	    3.	Dropout:
	        - Applies dropout regularization to prevent overfitting and improve generalization.
	    4.	LayerNormalization:
	        - Normalizes the output of each layer to stabilize training and accelerate convergence.
	    5.	Dense Layers:
	        - Extracts higher-level features after the attention mechanism.
	        - The first dense layer processes features from attention, followed by another dropout and dense layer to refine features further.
	    6.	AttentionPooling1D:
	        - Combines outputs from all four inputs into a unified representation.
	        - Aggregates temporal features using an attention mechanism, emphasizing the most important segments across modalities.
	    7.	Final Dense Layers:
	        - The combined representation is passed through dense layers and a softmax activation function for final classification into target classes:
	            - 0: Daily Activities
	            - 1: Violence
	            - 2: Fall Down
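
To make the stack above concrete, here is a hedged Keras sketch of the four-branch design: four (100, 750) inputs, each passed through positional encoding, multi-head attention with dropout and layer normalization, dense refinement, and temporal pooling, then concatenated and classified with a softmax head. All layer widths, head counts, and dropout rates are illustrative assumptions; standard sinusoidal encoding and global average pooling stand in for the project's custom `PositionalEncoding` and `AttentionPooling1D` layers.

```python
import numpy as np
from tensorflow.keras import layers, Model

TIME_STEPS, FEAT_DIM, NUM_CLASSES = 100, 750, 3

def sinusoidal_positional_encoding(length, depth):
    """Standard sinusoidal encoding (assumed stand-in for PositionalEncoding)."""
    positions = np.arange(length)[:, np.newaxis]              # (length, 1)
    dims = np.arange(depth)[np.newaxis, :]                    # (1, depth)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / np.float32(depth))
    angles = positions * angle_rates                          # (length, depth)
    pe = np.zeros((length, depth), dtype=np.float32)
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

def transformer_branch(inp, d_model=128, num_heads=4, dropout=0.2):
    # Project the 750-d features, add positional information, self-attend with a
    # residual connection and LayerNormalization, refine with dense layers, then
    # pool over time (global average pooling stands in for AttentionPooling1D).
    x = layers.Dense(d_model)(inp)
    x = x + sinusoidal_positional_encoding(TIME_STEPS, d_model)
    attn = layers.MultiHeadAttention(num_heads=num_heads,
                                     key_dim=d_model // num_heads)(x, x)
    x = layers.LayerNormalization()(x + layers.Dropout(dropout)(attn))
    x = layers.Dense(d_model, activation="relu")(x)
    x = layers.Dropout(dropout)(x)
    return layers.GlobalAveragePooling1D()(x)

# Four branches: three per-person keypoint streams plus the relative-position stream.
inputs = [layers.Input(shape=(TIME_STEPS, FEAT_DIM)) for _ in range(4)]
merged = layers.Concatenate()([transformer_branch(i) for i in inputs])
merged = layers.Dense(128, activation="relu")(merged)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(merged)

model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

The released model replaces the pooling stand-in with a learned `AttentionPooling1D` that weights time steps before aggregation, which is why the custom layers must be registered when loading the checkpoint (see the loading snippet in the usage section below).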

- **Model Performance:**
    ![Confusion Matrix](./pics/confusion-matrix.png)
    
    - Confusion Matrix Insights:
	    - Class 0 (Daily): 100% accuracy with no misclassifications.
	    - Class 1 (Violence): 96.96% accuracy with minimal false positives or false negatives.
	    - Class 2 (Fall Down): 98.67% accuracy, highlighting the model’s robustness in detecting falls.
	    - The overall accuracy is 98.37%, indicating the model’s reliability for real-time applications.
    

## Model Usage
- `Silver Assistant` Project
    - [GitHub SilverAvocado](https://github.com/silverAvocado)

## Load Model For Inference
```python
# Download the model from the Hugging Face Hub.
# PositionalEncoding and AttentionPooling1D are the project's custom layers;
# import them from the model code (GitHub link above) before loading.
import numpy as np
from huggingface_hub import hf_hub_download
from sklearn.metrics import accuracy_score
from tensorflow.keras.models import load_model

MODEL_PATH = "silver_assistant_transformer.keras"
model_path = hf_hub_download(repo_id="SilverAvocado/Silver-Multimodal", filename=MODEL_PATH)

# Load the model, registering the custom layers
model = load_model(
    model_path,
    custom_objects={
        "PositionalEncoding": PositionalEncoding,
        "AttentionPooling1D": AttentionPooling1D
    }
)

# Evaluate on the four preprocessed test inputs
y_pred = np.argmax(model.predict([X_test1, X_test2, X_test3, X_test4]), axis=1)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy:.4f}")
```
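
If the preprocessed test arrays are not at hand, the hypothetical snippet below runs a quick shape check with random inputs of the documented shape (N, 100, 750) and maps the predicted index back to a class label; the label map and array names are illustrative only.

```python
import numpy as np

LABELS = {0: "Daily", 1: "Violence", 2: "Fall Down"}

# Random stand-ins for the four preprocessed inputs, each shaped (N, 100, 750).
dummy_inputs = [np.random.rand(1, 100, 750).astype("float32") for _ in range(4)]

probs = model.predict(dummy_inputs)      # (1, 3) softmax scores
pred = int(np.argmax(probs, axis=1)[0])
print(f"Predicted class: {pred} ({LABELS[pred]}), score {probs[0, pred]:.2f}")
```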

## Conclusion
- The Silver-Multimodal model demonstrates exceptional capabilities in multimodal learning for situation classification. 
- Its ability to effectively integrate audio and video modalities ensures:
	1.	High Accuracy: Consistent performance across all classes.
	2.	Real-World Applicability: Suitable for applications like healthcare monitoring, safety systems, and smart homes.
	3.	Scalable Architecture: Transformer-based design allows future enhancements and additional modality integration.

- This model sets a new benchmark for multimodal AI systems, empowering safety-critical projects like `Silver Assistant` with state-of-the-art situation awareness.