---
license: apache-2.0
language:
- ko
library_name: transformers
pipeline_tag: audio-classification
tags:
- speech
- audio
---
|
|
|
# hubert-base-korean
|
|
|
## Model Details
|
|
|
HuBERT (Hidden-Unit BERT) is a speech representation learning model proposed by Facebook.

Unlike conventional speech recognition models, HuBERT learns directly from the raw waveform using a self-supervised learning approach.
|
|
|
This model uses https://huggingface.co/team-lucid/hubert-base-korean as its base model.
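For reference, here is a minimal sketch (not part of this model's fine-tuned pipeline) showing how the base model turns a raw waveform into frame-level representations; the file name `sample.wav` is a placeholder:

```py
import torch
import librosa
from transformers import AutoFeatureExtractor, HubertModel

# Load the base model and its matching feature extractor
feature_extractor = AutoFeatureExtractor.from_pretrained("team-lucid/hubert-base-korean")
model = HubertModel.from_pretrained("team-lucid/hubert-base-korean")
model.eval()

# HuBERT consumes the raw 16 kHz waveform directly
waveform, _ = librosa.load("sample.wav", sr=16000, mono=True)  # placeholder path
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    features = model(**inputs).last_hidden_state  # (batch, frames, hidden_size)
```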
|
|
|
|
|
## How to Get Started with the Model
|
|
|
### PyTorch
|
|
|
```py
import torch
import librosa
import pytorch_lightning as pl
from torch import nn
from transformers import AutoConfig, AutoFeatureExtractor, HubertForSequenceClassification


class MyLitModel(pl.LightningModule):
    # Extra hyperparameters are kept for checkpoint compatibility
    def __init__(self, audio_model_name, num_label2s, n_layers=1, projector=True,
                 classifier=True, dropout=0.07, lr_decay=1):
        super().__init__()
        self.config = AutoConfig.from_pretrained(audio_model_name)
        self.config.output_hidden_states = True  # expose hidden states for the custom heads
        self.audio_model = HubertForSequenceClassification.from_pretrained(
            audio_model_name, config=self.config
        )
        # Two task heads on top of the encoder: emotion label and emotion intensity
        self.label2_classifier = nn.Linear(self.audio_model.config.hidden_size, num_label2s)
        self.intensity_regressor = nn.Linear(self.audio_model.config.hidden_size, 1)

    def forward(self, audio_values, audio_attn_mask=None):
        outputs = self.audio_model(input_values=audio_values, attention_mask=audio_attn_mask)
        # Use the first frame of the last hidden layer as a pooled representation
        pooled = outputs.hidden_states[-1][:, 0, :]
        label2_logits = self.label2_classifier(pooled)
        intensity_preds = self.intensity_regressor(pooled).squeeze(-1)
        return label2_logits, intensity_preds


# Model configuration
audio_model_name = "team-lucid/hubert-base-korean"
NUM_LABELS = 7
SAMPLING_RATE = 16000

# Load the fine-tuned HuBERT model from a Lightning checkpoint
pretrained_model_path = ""  # path to the model checkpoint
hubert_model = MyLitModel.load_from_checkpoint(
    pretrained_model_path,
    audio_model_name=audio_model_name,
    num_label2s=NUM_LABELS,
)
hubert_model.eval()
hubert_model.to("cuda" if torch.cuda.is_available() else "cpu")

# Load the feature extractor
feature_extractor = AutoFeatureExtractor.from_pretrained(audio_model_name)

# Process the audio file
audio_path = ""  # path to the audio file to process
audio_np, _ = librosa.load(audio_path, sr=SAMPLING_RATE, mono=True)
inputs = feature_extractor(raw_speech=audio_np, return_tensors="pt", sampling_rate=SAMPLING_RATE)
audio_values = inputs["input_values"].to(hubert_model.device)
audio_attn_mask = inputs.get("attention_mask", None)
if audio_attn_mask is not None:
    audio_attn_mask = audio_attn_mask.to(hubert_model.device)

# Run emotion analysis (forward handles a missing attention mask)
with torch.no_grad():
    label2_logits, intensity_preds = hubert_model(audio_values, audio_attn_mask)

emotion_label = torch.argmax(label2_logits, dim=-1).item()
emotion_intensity = intensity_preds.item()

print(f"Emotion Label: {emotion_label}, Emotion Intensity: {emotion_intensity}")
```
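Note that `pretrained_model_path` must point to a PyTorch Lightning checkpoint (`.ckpt`) produced by fine-tuning `MyLitModel`; the base model on the Hub does not include the emotion heads.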
|
|
|
## Training Details
|
|
|
### Training Data
|
|
|
This model was trained on the conversational speech dataset for emotion classification from AI Hub
(https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=263),

using 1,000 samples per label for a total of 7,000 samples.
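The per-label sampling could be done along these lines; the `metadata.csv` file and its columns are illustrative assumptions, not the actual AI Hub distribution format:

```py
import pandas as pd

# Hypothetical metadata file: one row per clip with "path" and "label" columns
meta = pd.read_csv("metadata.csv")

# Take 1,000 clips per emotion label -> 7,000 clips total
subset = (
    meta.groupby("label", group_keys=False)
        .apply(lambda g: g.sample(n=1000, random_state=42))
)
subset.to_csv("train_subset.csv", index=False)
```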
|
|
|
|
|
### Training Procedure
|
|
|
The model is designed as a multi-task model that simultaneously learns the seven emotions (happiness, anger, disgust, fear, neutral, sadness, surprise) and the intensity of each emotion (0-2).
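The card does not publish the training loop, but a multi-task objective of this kind is typically a sum of a classification loss and a regression loss. Below is a minimal sketch of a `training_step` for the `MyLitModel` class above, with an assumed batch layout and an assumed equal task weighting:

```py
import torch.nn.functional as F

# Inside MyLitModel (from the example above); the batch layout is an assumption
def training_step(self, batch, batch_idx):
    audio_values, audio_attn_mask, label2s, intensities = batch
    label2_logits, intensity_preds = self(audio_values, audio_attn_mask)

    # Classification over the 7 emotion labels
    cls_loss = F.cross_entropy(label2_logits, label2s)
    # Regression on the 0-2 intensity scale
    reg_loss = F.mse_loss(intensity_preds, intensities.float())

    # Equal task weighting is an assumption; the card does not state it
    return cls_loss + reg_loss
```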
|
|
|
#### Training Hyperparameters
|
|
|
| Hyperparameter      | Base    |
|:--------------------|---------|
| Learning Rate       | 1e-5    |
| Learning Rate Decay | 0.8     |
| Batch Size          | 8       |
| Weight Decay        | 0.01    |
| Epochs              | 30      |
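
The `lr_decay` argument of `MyLitModel` together with the Learning Rate Decay entry above suggests layer-wise learning-rate decay. A hedged sketch of a matching `configure_optimizers`, assuming the 0.8 factor is applied per encoder layer from top to bottom:

```py
import torch

# Inside MyLitModel; layer-wise LR decay is an assumption based on the table above
def configure_optimizers(self):
    lr, decay = 1e-5, 0.8
    layers = self.audio_model.hubert.encoder.layers
    groups = [
        # The top encoder layer keeps the full rate; each layer below is scaled by 0.8
        {"params": layer.parameters(), "lr": lr * decay ** (len(layers) - 1 - i)}
        for i, layer in enumerate(layers)
    ]
    # Task heads train at the full learning rate
    groups.append({"params": self.label2_classifier.parameters(), "lr": lr})
    groups.append({"params": self.intensity_regressor.parameters(), "lr": lr})
    return torch.optim.AdamW(groups, lr=lr, weight_decay=0.01)
```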