yky-h committed on
Commit f76572f
1 Parent(s): 42c69ee

update readme

Files changed (1)
  1. README.md +57 -27
README.md CHANGED
@@ -1,55 +1,81 @@
---
language: ja
- datasets:
- - reazon-research/reazonspeech
tags:
- hubert
- speech
- license: apache-2.0
---

- # japanese-hubert-base

![rinna-icon](./rinna.png)

- This is a Japanese HuBERT (Hidden Unit Bidirectional Encoder Representations from Transformers) model trained by [rinna Co., Ltd.](https://rinna.co.jp/)

- This model was traind using a large-scale Japanese audio dataset, [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech) corpus.

- ## How to use the model

- ```python
- import torch
- from transformers import HubertModel

- model = HubertModel.from_pretrained("rinna/japanese-hubert-base")
- model.eval()

- wav_input_16khz = torch.randn(1, 10000)
- outputs = model(wav_input_16khz)
- print(f"Input: {wav_input_16khz.size()}") # [1, 10000]
- print(f"Output: {outputs.last_hidden_state.size()}") # [1, 31, 768]
- ```

- ## Model summary

- The model architecture is the same as the [original HuBERT base model](https://huggingface.co/facebook/hubert-base-ls960), which contains 12 transformer layers with 8 attention heads.
- The model was trained using code from the [official repository](https://github.com/facebookresearch/fairseq/tree/main/examples/hubert), and the detailed training configuration can be found in the same repository and the [original paper](https://ieeexplore.ieee.org/document/9585401).

- A fairseq checkpoint file can also be available [here](https://huggingface.co/rinna/japanese-hubert-base/tree/main/fairseq).

- ## Training

- The model was trained on approximately 19,000 hours of [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech) corpus.

- ## License

- [The Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0)

- ## Citation
```bibtex
- @article{hubert2021hsu,
author={Hsu, Wei-Ning and Bolte, Benjamin and Tsai, Yao-Hung Hubert and Lakhotia, Kushal and Salakhutdinov, Ruslan and Mohamed, Abdelrahman},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
title={HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
@@ -60,3 +86,7 @@ The model was trained on approximately 19,000 hours of [ReazonSpeech](https://hu
doi={10.1109/TASLP.2021.3122291}
}
```
---
+ thumbnail: https://github.com/rinnakk/japanese-pretrained-models/blob/master/rinna.png
language: ja
+ license: apache-2.0
+ datasets: reazon-research/reazonspeech
+ inference: false
tags:
- hubert
- speech
---

+ # `rinna/japanese-hubert-base`

![rinna-icon](./rinna.png)

+ # Overview

+ This is a Japanese HuBERT Base model trained by [rinna Co., Ltd.](https://rinna.co.jp/)

+ * **Model summary**

+ The model architecture is the same as the [original HuBERT Base model](https://huggingface.co/facebook/hubert-base-ls960), which contains 12 transformer layers with 12 attention heads.
+ The model was trained using code from the [official repository](https://github.com/facebookresearch/fairseq/tree/main/examples/hubert), and the detailed training configuration can be found in the same repository and the [original paper](https://ieeexplore.ieee.org/document/9585401).

+ * **Training**

+ The model was trained on approximately 19,000 hours of the following Japanese speech corpus, ReazonSpeech v1:
+ - [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech)

+ * **Contributors**

+ - [Yukiya Hono](https://huggingface.co/yky-h)
+ - [Kentaro Mitsui](https://huggingface.co/Kentaro321)
+ - [Kei Sawada](https://huggingface.co/keisawada)
+
+ ---
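[Editor's note on the figures above: the 12-layer, 12-head, 768-dimensional geometry stated in the model summary matches the HuBERT Base defaults shipped with `transformers`. A minimal offline sanity check, assuming only that `transformers` is installed (this inspects library defaults, not this model's hosted `config.json`):]

```python
# Sanity check: the default HubertConfig in transformers carries the
# HuBERT Base geometry quoted in the model summary. Library defaults
# only; nothing is downloaded from the Hub.
from transformers import HubertConfig

config = HubertConfig()
print(config.num_hidden_layers)    # 12 transformer layers
print(config.num_attention_heads)  # 12 attention heads
print(config.hidden_size)          # 768-dimensional hidden states
```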
+ # How to use the model

+ ```python
+ import soundfile as sf
+ from transformers import AutoFeatureExtractor, AutoModel

+ model_name = "rinna/japanese-hubert-base"
+ feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
+ model = AutoModel.from_pretrained(model_name)
+ model.eval()

+ audio_file = "sample.wav"  # placeholder: path to a 16 kHz speech file
+ raw_speech_16kHz, sr = sf.read(audio_file)
+ inputs = feature_extractor(
+     raw_speech_16kHz,
+     return_tensors="pt",
+     sampling_rate=sr,
+ )
+ outputs = model(**inputs)

+ print(f"Input: {inputs.input_values.size()}")  # [1, #samples]
+ print(f"Output: {outputs.last_hidden_state.size()}")  # [1, #frames, 768]
+ ```

+ A fairseq checkpoint file is also available [here](https://huggingface.co/rinna/japanese-hubert-base/tree/main/fairseq).
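[Editor's note: the `#frames` dimension of `last_hidden_state` is set by the convolutional feature encoder, which in the standard HuBERT/wav2vec 2.0 Base front end (an assumption here, taken from the original paper rather than this card) downsamples 16 kHz audio to roughly one frame per 20 ms. A minimal sketch of that length calculation:]

```python
# Estimate the number of HuBERT output frames for a given number of
# 16 kHz input samples, assuming the standard Base CNN front end:
# kernel sizes (10, 3, 3, 3, 3, 2, 2) with strides (5, 2, 2, 2, 2, 2, 2).
def num_output_frames(num_samples: int) -> int:
    for kernel, stride in zip((10, 3, 3, 3, 3, 2, 2), (5, 2, 2, 2, 2, 2, 2)):
        # Each 1-D conv layer: floor((length - kernel) / stride) + 1
        num_samples = (num_samples - kernel) // stride + 1
    return num_samples

print(num_output_frames(16000))  # 1 s of 16 kHz audio -> 49 frames (~50 Hz)
print(num_output_frames(10000))  # 10,000 samples -> 31 frames
```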
+ ---
+
+ # How to cite
```bibtex
+ @misc{rinna-japanese-hubert-base,
+ title={rinna/japanese-hubert-base},
+ author={Hono, Yukiya and Mitsui, Kentaro and Sawada, Kei},
+ url={https://huggingface.co/rinna/japanese-hubert-base}
+ }
+ ```

+ ---
+
+ # Citations
+ ```bibtex
+ @article{hsu2021hubert,
author={Hsu, Wei-Ning and Bolte, Benjamin and Tsai, Yao-Hung Hubert and Lakhotia, Kushal and Salakhutdinov, Ruslan and Mohamed, Abdelrahman},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
title={HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
doi={10.1109/TASLP.2021.3122291}
}
```
+ ---
+
+ # License
+ [The Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0)