update readme
README.md
CHANGED
---
thumbnail: https://github.com/rinnakk/japanese-pretrained-models/blob/master/rinna.png
language: ja
license: apache-2.0
datasets: reazon-research/reazonspeech
inference: false
tags:
- hubert
- speech
---

# `rinna/japanese-hubert-base`

![rinna-icon](./rinna.png)

# Overview

This is a Japanese HuBERT Base model trained by [rinna Co., Ltd.](https://rinna.co.jp/).

* **Model summary**

    The model architecture is the same as the [original HuBERT Base model](https://huggingface.co/facebook/hubert-base-ls960), which contains 12 transformer layers with 12 attention heads.
    The model was trained using code from the [official repository](https://github.com/facebookresearch/fairseq/tree/main/examples/hubert), and the detailed training configuration can be found in the same repository and the [original paper](https://ieeexplore.ieee.org/document/9585401).

* **Training**

    The model was trained on approximately 19,000 hours of the following Japanese speech corpus, ReazonSpeech v1.
    - [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech)

* **Contributors**

    - [Yukiya Hono](https://huggingface.co/yky-h)
    - [Kentaro Mitsui](https://huggingface.co/Kentaro321)
    - [Kei Sawada](https://huggingface.co/keisawada)
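As a quick sanity check on the architecture summary above, the encoder's weight count can be estimated from the stated sizes alone. This is a back-of-the-envelope sketch under an assumption this card does not spell out (standard transformer blocks with a 4x feed-forward expansion); it ignores biases, layer norms, positional convolutions, and the convolutional feature extractor, which bring the full model to roughly 95M parameters.

```python
# Rough weight count for a 12-layer transformer encoder with hidden size 768,
# ignoring biases, layer norms, and the convolutional front end (assumption:
# standard transformer blocks with a 4x feed-forward expansion).
hidden = 768
layers = 12

attention = 4 * hidden * hidden           # Q, K, V, and output projections
feed_forward = 2 * hidden * (4 * hidden)  # two linear maps with 4x expansion
per_layer = attention + feed_forward

total = layers * per_layer
print(f"~{total / 1e6:.1f}M encoder weights")  # ~84.9M
```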

---

# How to use the model

```python
import soundfile as sf
from transformers import AutoFeatureExtractor, AutoModel

model_name = "rinna/japanese-hubert-base"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

# audio_file: path to a 16 kHz mono audio file
raw_speech_16kHz, sr = sf.read(audio_file)
inputs = feature_extractor(
    raw_speech_16kHz,
    return_tensors="pt",
    sampling_rate=sr,
)
outputs = model(**inputs)

print(f"Input:  {inputs.input_values.size()}")  # [1, #samples]
print(f"Output: {outputs.last_hidden_state.size()}")  # [1, #frames, 768]
```
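The `#frames` in the output shape follows from the model's convolutional downsampling of the waveform. As a rough sketch, assuming the standard wav2vec 2.0 / HuBERT Base feature-extractor kernels and strides (an assumption; they are not spelled out in this card), the frame count can be computed from the input length:

```python
def num_output_frames(num_samples: int) -> int:
    """Estimate HuBERT output frames for a raw 16 kHz input of num_samples."""
    # Assumed HuBERT Base convolutional front end: 7 conv layers, no padding.
    kernels = (10, 3, 3, 3, 3, 2, 2)
    strides = (5, 2, 2, 2, 2, 2, 2)
    length = num_samples
    for k, s in zip(kernels, strides):
        length = (length - k) // s + 1  # standard conv output-length formula
    return length

print(num_output_frames(16000))  # 1 s of 16 kHz audio -> 49 frames (~50 Hz)
```

With these strides the model emits one 768-dimensional vector per ~20 ms of audio.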

A fairseq checkpoint file is also available [here](https://huggingface.co/rinna/japanese-hubert-base/tree/main/fairseq).
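For downstream tasks, the frame-level features are often pooled into a single utterance-level vector, e.g. by averaging over time (a common pattern, not something this card prescribes). A minimal NumPy sketch using a dummy array shaped like `last_hidden_state`:

```python
import numpy as np

# Dummy stand-in for outputs.last_hidden_state: [batch, #frames, 768]
hidden_states = np.random.randn(1, 49, 768)

# Average over the time (frame) axis -> one 768-dim embedding per utterance
utterance_embedding = hidden_states.mean(axis=1)
print(utterance_embedding.shape)  # (1, 768)
```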

---

# How to cite

```bibtex
@misc{rinna-japanese-hubert-base,
    title={rinna/japanese-hubert-base},
    author={Hono, Yukiya and Mitsui, Kentaro and Sawada, Kei},
    url={https://huggingface.co/rinna/japanese-hubert-base}
}
```

---

# Citations

```bibtex
@article{hsu2021hubert,
    author={Hsu, Wei-Ning and Bolte, Benjamin and Tsai, Yao-Hung Hubert and Lakhotia, Kushal and Salakhutdinov, Ruslan and Mohamed, Abdelrahman},
    journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
    title={HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
    doi={10.1109/TASLP.2021.3122291}
}
```

---

# License

[The Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0)