---
license: apache-2.0
language:
- ko
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- speech
- audio
---
# hubert-large-korean
## Model Details
HuBERT (Hidden-Unit BERT) is a speech representation learning model proposed by Facebook.
Unlike conventional speech recognition models, HuBERT uses self-supervised learning to learn directly from the raw waveform of the speech signal.
This model was trained on Cloud TPUs provided through Google's TPU Research Cloud (TRC).
### Model Description
<table>
<tr>
<td colspan="2"></td>
<td>Base</td>
<td>Large</td>
</tr>
<tr>
<td rowspan="3">CNN Encoder</td>
<td>strides</td>
<td colspan="2">5, 2, 2, 2, 2, 2, 2</td>
</tr>
<tr>
<td>kernel width</td>
<td colspan="2">10, 3, 3, 3, 3, 2, 2</td>
</tr>
<tr>
<td>channel</td>
<td colspan="2">512</td>
</tr>
<tr>
<td rowspan="4">Transformer Encoder</td>
            <td>layers</td>
<td>12</td>
<td>24</td>
</tr>
<tr>
<td>embedding dim</td>
<td>768</td>
<td>1024</td>
</tr>
<tr>
<td>inner FFN dim</td>
<td>3072</td>
<td>4096</td>
</tr>
<tr>
<td>attention heads</td>
<td>8</td>
<td>16</td>
</tr>
<tr>
<td>Projection</td>
<td>dim</td>
<td>256</td>
<td>768</td>
</tr>
<tr>
<td colspan="2">Params</td>
<td>95M</td>
        <td>317M</td>
</tr>
</table>
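
The Large column above corresponds to this checkpoint. As a quick sanity check (a minimal sketch, assuming the checkpoint uses the standard `transformers` `HubertConfig`), you can print the relevant fields without downloading the weights:

```py
from transformers import AutoConfig

# Loads only the configuration file, not the model weights
config = AutoConfig.from_pretrained("team-lucid/hubert-large-korean")

print(config.num_hidden_layers)    # 24   - Transformer encoder layers
print(config.hidden_size)          # 1024 - embedding dim
print(config.intermediate_size)    # 4096 - inner FFN dim
print(config.num_attention_heads)  # 16   - attention heads
```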
## How to Get Started with the Model
### PyTorch
```py
import torch
from transformers import HubertModel

model = HubertModel.from_pretrained("team-lucid/hubert-large-korean")
model.eval()

# HuBERT expects 16 kHz mono audio; this is 1 second of dummy input
wav = torch.ones(1, 16000)
with torch.no_grad():
    outputs = model(wav)

print(f"Input: {wav.shape}")                         # [1, 16000]
print(f"Output: {outputs.last_hidden_state.shape}")  # [1, 49, 1024]
```
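
The model returns frame-level features (the CNN encoder downsamples 16 kHz audio by 320x, so roughly 50 frames per second). For utterance-level downstream tasks, one common approach (a sketch, not a prescribed recipe) is to mean-pool over the time axis:

```py
import torch
from transformers import HubertModel

model = HubertModel.from_pretrained("team-lucid/hubert-large-korean")
model.eval()

wav = torch.randn(1, 16000)  # replace with real 16 kHz audio
with torch.no_grad():
    hidden = model(wav).last_hidden_state  # [1, 49, 1024]

# Mean-pool the frame features into a single utterance embedding
embedding = hidden.mean(dim=1)  # [1, 1024]
```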
### JAX/Flax
```py
import jax.numpy as jnp
from transformers import FlaxAutoModel

# trust_remote_code is required because the Flax implementation ships with the repo
model = FlaxAutoModel.from_pretrained("team-lucid/hubert-large-korean", trust_remote_code=True)

wav = jnp.ones((1, 16000))
outputs = model(wav)

print(f"Input: {wav.shape}")                         # [1, 16000]
print(f"Output: {outputs.last_hidden_state.shape}")  # [1, 49, 1024]
```
## Training Details
### Training Data
This model was trained on roughly 4,000 hours of speech extracted from
[Free Conversation Speech (General Male/Female)](https://www.aihub.or.kr/aihubdata/data/view.do?dataSetSn=109), [Multi-Speaker Speech Synthesis Data](https://www.aihub.or.kr/aihubdata/data/view.do?dataSetSn=542), and [Broadcast Content Conversational Speech Recognition Data](https://www.aihub.or.kr/aihubdata/data/view.do?dataSetSn=463),
datasets built by the National Information Society Agency (NIA) with funding from the Ministry of Science and ICT of Korea.
### Training Procedure
Following the [original paper](https://arxiv.org/pdf/2106.07447.pdf), a Base model was first trained on MFCC-based cluster targets; k-means with 500 clusters was then run on its features to produce new targets, on which both the Base and Large models were trained.
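
The second-iteration targets come from clustering intermediate features of the first-iteration model. A minimal sketch of that step, assuming scikit-learn's `MiniBatchKMeans` (the checkpoint path, layer choice, and batching here are illustrative, not the authors' exact pipeline):

```py
import torch
from sklearn.cluster import MiniBatchKMeans
from transformers import HubertModel

# First-iteration Base model (hypothetical local checkpoint path)
model = HubertModel.from_pretrained("./hubert-base-iter1")
model.eval()

kmeans = MiniBatchKMeans(n_clusters=500, n_init="auto")

def frame_features(wav: torch.Tensor, layer: int = 6) -> torch.Tensor:
    """Extract frame-level features from an intermediate Transformer layer."""
    with torch.no_grad():
        hidden_states = model(wav, output_hidden_states=True).hidden_states
    return hidden_states[layer].squeeze(0)  # [frames, hidden]

# Fit incrementally over the corpus (one dummy batch shown here)
feats = frame_features(torch.randn(1, 16000))
kmeans.partial_fit(feats.numpy())

# Cluster IDs serve as pseudo-labels for the next training iteration
targets = kmeans.predict(feats.numpy())  # [frames], values in 0..499
```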
#### Training Hyperparameters

| Hyperparameter      |    Base |   Large |
|:--------------------|--------:|--------:|
| Warmup Steps        |  32,000 |  32,000 |
| Learning Rate       |    5e-4 |  1.5e-3 |
| Batch Size          |     128 |     128 |
| Weight Decay        |    0.01 |    0.01 |
| Max Steps           | 400,000 | 400,000 |
| Learning Rate Decay |     0.1 |     0.1 |
| Adam \\(\beta_1\\)  |     0.9 |     0.9 |
| Adam \\(\beta_2\\)  |    0.99 |    0.99 |
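
For reference, these settings map onto a standard PyTorch/`transformers` training setup. A minimal sketch for the Large model; the `AdamW` optimizer class and the linear warmup-then-decay schedule are assumptions, since the card does not name them:

```py
import torch
from transformers import HubertModel, get_linear_schedule_with_warmup

model = HubertModel.from_pretrained("team-lucid/hubert-large-korean")

# Large-model hyperparameters from the table above
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1.5e-3,
    betas=(0.9, 0.99),
    weight_decay=0.01,
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=32_000,
    num_training_steps=400_000,
)
```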