update readme
README.md
CHANGED
---
thumbnail: https://github.com/rinnakk/japanese-pretrained-models/blob/master/rinna.png
language: ja
license: apache-2.0
datasets: reazon-research/reazonspeech
inference: false
tags:
- hubert
- speech
---

# `rinna/japanese-hubert-base`

![rinna-icon](./rinna.png)

# Overview

This is a Japanese HuBERT Base model trained by [rinna Co., Ltd.](https://rinna.co.jp/).

* **Model summary**

    The model architecture is the same as the [original HuBERT Base model](https://huggingface.co/facebook/hubert-base-ls960), which contains 12 transformer layers with 12 attention heads.
    The model was trained using code from the [official repository](https://github.com/facebookresearch/fairseq/tree/main/examples/hubert), and the detailed training configuration can be found in the same repository and the [original paper](https://ieeexplore.ieee.org/document/9585401).

* **Training**

    The model was trained on approximately 19,000 hours of the following Japanese speech corpus, ReazonSpeech v1.
    - [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech)

* **Contributors**

    - [Yukiya Hono](https://huggingface.co/yky-h)
    - [Kentaro Mitsui](https://huggingface.co/Kentaro321)
    - [Kei Sawada](https://huggingface.co/keisawada)
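As a quick sanity check on the architecture summary above, the encoder's weight count can be estimated from the stated sizes alone. This is a back-of-the-envelope sketch under an assumption this card does not spell out (standard transformer blocks with a 4x feed-forward expansion); it ignores biases, layer norms, positional convolutions, and the convolutional feature extractor, which bring the full model to roughly 95M parameters.

```python
# Rough weight count for a 12-layer transformer encoder with hidden size 768,
# ignoring biases, layer norms, and the convolutional front end (assumption:
# standard transformer blocks with a 4x feed-forward expansion).
hidden = 768
layers = 12

attention = 4 * hidden * hidden           # Q, K, V, and output projections
feed_forward = 2 * hidden * (4 * hidden)  # two linear maps with 4x expansion
per_layer = attention + feed_forward

total = layers * per_layer
print(f"~{total / 1e6:.1f}M encoder weights")  # ~84.9M
```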

---

# How to use the model

```python
import soundfile as sf
from transformers import AutoFeatureExtractor, AutoModel

model_name = "rinna/japanese-hubert-base"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

# audio_file: path to a 16 kHz mono audio file
raw_speech_16kHz, sr = sf.read(audio_file)
inputs = feature_extractor(
    raw_speech_16kHz,
    return_tensors="pt",
    sampling_rate=sr,
)
outputs = model(**inputs)

print(f"Input:  {inputs.input_values.size()}")  # [1, #samples]
print(f"Output: {outputs.last_hidden_state.size()}")  # [1, #frames, 768]
```
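The `#frames` in the output shape follows from the model's convolutional downsampling of the waveform. As a rough sketch, assuming the standard wav2vec 2.0 / HuBERT Base feature-extractor kernels and strides (an assumption; they are not spelled out in this card), the frame count can be computed from the input length:

```python
def num_output_frames(num_samples: int) -> int:
    """Estimate HuBERT output frames for a raw 16 kHz input of num_samples."""
    # Assumed HuBERT Base convolutional front end: 7 conv layers, no padding.
    kernels = (10, 3, 3, 3, 3, 2, 2)
    strides = (5, 2, 2, 2, 2, 2, 2)
    length = num_samples
    for k, s in zip(kernels, strides):
        length = (length - k) // s + 1  # standard conv output-length formula
    return length

print(num_output_frames(16000))  # 1 s of 16 kHz audio -> 49 frames (~50 Hz)
```

With these strides the model emits one 768-dimensional vector per ~20 ms of audio.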

A fairseq checkpoint file is also available [here](https://huggingface.co/rinna/japanese-hubert-base/tree/main/fairseq).
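For downstream tasks, the frame-level features are often pooled into a single utterance-level vector, e.g. by averaging over time (a common pattern, not something this card prescribes). A minimal NumPy sketch using a dummy array shaped like `last_hidden_state`:

```python
import numpy as np

# Dummy stand-in for outputs.last_hidden_state: [batch, #frames, 768]
hidden_states = np.random.randn(1, 49, 768)

# Average over the time (frame) axis -> one 768-dim embedding per utterance
utterance_embedding = hidden_states.mean(axis=1)
print(utterance_embedding.shape)  # (1, 768)
```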

---

# How to cite

```bibtex
@misc{rinna-japanese-hubert-base,
    title={rinna/japanese-hubert-base},
    author={Hono, Yukiya and Mitsui, Kentaro and Sawada, Kei},
    url={https://huggingface.co/rinna/japanese-hubert-base}
}
```

---

# Citations

```bibtex
@article{hsu2021hubert,
    author={Hsu, Wei-Ning and Bolte, Benjamin and Tsai, Yao-Hung Hubert and Lakhotia, Kushal and Salakhutdinov, Ruslan and Mohamed, Abdelrahman},
    journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
    title={HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
    doi={10.1109/TASLP.2021.3122291}
}
```

---

# License

[The Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0)