HyunaZ committed
Commit 4f040b4 • Parent: 47f0159

Update README.md

Files changed (1): README.md (+116 −3)
---
license: apache-2.0
language:
- ko
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- speech
- audio
---

# hubert-emotion

## Model Details

HuBERT (Hidden-Unit BERT) is a speech representation learning model proposed by Facebook. Unlike conventional speech recognition models, it learns speech representations directly from the raw waveform through self-supervised learning.

This model uses https://huggingface.co/team-lucid/hubert-base-korean as its base model.
24
+ ## How to Get Started with the Model
25
+
26
+ ### Pytorch
27
+
28
+ ```py
29
+ import torch
30
+ import librosa
31
+ from transformers import AutoFeatureExtractor, AutoConfig
32
+ import whisper
33
+ from pytorch_lightning import Trainer
34
+ import pytorch_lightning as pl
35
+ from torch import nn
36
+ from transformers import HubertForSequenceClassification
37
+
38
+ class MyLitModel(pl.LightningModule):
39
+ def __init__(self, audio_model_name, num_label2s, n_layers=1, projector=True, classifier=True, dropout=0.07, lr_decay=1):
40
+ super(MyLitModel, self).__init__()
41
+ self.config = AutoConfig.from_pretrained(audio_model_name)
42
+ self.config.output_hidden_states = True
43
+ self.audio_model = HubertForSequenceClassification.from_pretrained(audio_model_name, config=self.config)
44
+ self.label2_classifier = nn.Linear(self.audio_model.config.hidden_size, num_label2s)
45
+ self.intensity_regressor = nn.Linear(self.audio_model.config.hidden_size, 1)
46
+
47
+ def forward(self, audio_values, audio_attn_mask=None):
48
+ outputs = self.audio_model(input_values=audio_values, attention_mask=audio_attn_mask)
49
+ label2_logits = self.label2_classifier(outputs.hidden_states[-1][:, 0, :])
50
+ intensity_preds = self.intensity_regressor(outputs.hidden_states[-1][:, 0, :]).squeeze(-1)
51
+ return label2_logits, intensity_preds
52
+
53
+ # ๋ชจ๋ธ ๊ด€๋ จ ์„ค์ •
54
+ audio_model_name = "team-lucid/hubert-base-korean"
55
+ NUM_LABELS = 7
56
+ SAMPLING_RATE = 16000
57
+
58
+ # Hubert ๋ชจ๋ธ ๋กœ๋“œ
59
+ pretrained_model_path = "" # ๋ชจ๋ธ ์ฒดํฌํฌ์ธํŠธ
60
+ hubert_model = MyLitModel.load_from_checkpoint(
61
+ pretrained_model_path,
62
+ audio_model_name=audio_model_name,
63
+ num_label2s=NUM_LABELS,
64
+ )
65
+ hubert_model.eval()
66
+ hubert_model.to("cuda" if torch.cuda.is_available() else "cpu")
67
+
68
+ # Feature extractor ๋กœ๋“œ
69
+ feature_extractor = AutoFeatureExtractor.from_pretrained(audio_model_name)
70
+
71
+ # ์Œ์„ฑ ํŒŒ์ผ ์ฒ˜๋ฆฌ
72
+ audio_path = "" # ์ฒ˜๋ฆฌํ•  ์Œ์„ฑ ํŒŒ์ผ ๊ฒฝ๋กœ
73
+ audio_np, _ = librosa.load(audio_path, sr=SAMPLING_RATE, mono=True)
74
+ inputs = feature_extractor(raw_speech=audio_np, return_tensors="pt", sampling_rate=SAMPLING_RATE)
75
+ audio_values = inputs["input_values"].to(hubert_model.device)
76
+ audio_attn_mask = inputs.get("attention_mask", None)
77
+ if audio_attn_mask is not None:
78
+ audio_attn_mask = audio_attn_mask.to(hubert_model.device)
79
+
80
+ # ๊ฐ์ • ๋ถ„์„
81
+ with torch.no_grad():
82
+ if audio_attn_mask is None:
83
+ label2_logits, intensity_preds = hubert_model(audio_values)
84
+ else:
85
+ label2_logits, intensity_preds = hubert_model(audio_values, audio_attn_mask)
86
+
87
+ emotion_label = torch.argmax(label2_logits, dim=-1).item()
88
+ emotion_intensity = intensity_preds.item()
89
+
90
+ print(f"Emotion Label: {emotion_label}, Emotion Intensity: {emotion_intensity}")
91
+
92
+
93
+
94
+
95
+ ```
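
The printed `emotion_label` is a class index from 0 to 6 corresponding to the seven emotions listed under Training Procedure below. The card does not state the index order, so the mapping here is only an illustrative sketch; verify it against the label encoding used at training time:

```py
# Hypothetical index-to-emotion mapping; the actual order depends on how the
# labels were encoded during training and is not stated on this card.
ID2EMOTION = {0: "happiness", 1: "anger", 2: "disgust", 3: "fear",
              4: "neutral", 5: "sadness", 6: "surprise"}
print(ID2EMOTION[emotion_label], round(emotion_intensity, 2))
```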

## Training Details

### Training Data

The model was trained on the conversational speech dataset for emotion classification from AI Hub (https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=263), using 1,000 samples per label, 7,000 samples in total.

### Training Procedure

The model is designed as a multi-task model that jointly learns the seven emotion classes (happiness, anger, disgust, fear, neutral, sadness, surprise) and the intensity of each emotion (0-2).

#### Training Hyperparameters

| Hyperparameter      | Base |
|:--------------------|-----:|
| Learning Rate       | 1e-5 |
| Learning Rate Decay | 0.8  |
| Batch Size          | 8    |
| Weight Decay        | 0.01 |
| Epochs              | 30   |
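
The learning-rate decay of 0.8 matches the `lr_decay` argument in the class above and suggests layer-wise learning-rate decay over the HuBERT encoder. A hedged sketch of a `configure_optimizers` that wires these values together under that reading:

```py
def configure_optimizers(self):
    # Layer-wise LR decay is an assumption inferred from the lr_decay argument:
    # encoder layers further from the top get base_lr * 0.8 ** depth.
    base_lr, decay, wd = 1e-5, 0.8, 0.01
    groups = [{"params": head.parameters(), "lr": base_lr}
              for head in (self.label2_classifier, self.intensity_regressor)]
    for depth, layer in enumerate(reversed(self.audio_model.hubert.encoder.layers)):
        groups.append({"params": layer.parameters(), "lr": base_lr * decay ** depth})
    return torch.optim.AdamW(groups, lr=base_lr, weight_decay=wd)
```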