metadata
language: ja
datasets:
  - reazon-research/reazonspeech
tags:
  - hubert
  - speech
license: apache-2.0

japanese-hubert-base


This is a Japanese HuBERT (Hidden Unit Bidirectional Encoder Representations from Transformers) model trained by rinna Co., Ltd.

This model was trained on the ReazonSpeech corpus, a large-scale Japanese audio dataset.

How to use the model

import torch
from transformers import HubertModel

model = HubertModel.from_pretrained("rinna/japanese-hubert-base")
model.eval()

# Dummy input: a batch of one raw waveform sampled at 16 kHz (10,000 samples ≈ 0.6 s).
wav_input_16khz = torch.randn(1, 10000)
with torch.no_grad():
    outputs = model(wav_input_16khz)
print(f"Input:   {wav_input_16khz.size()}")  # [1, 10000]
print(f"Output:  {outputs.last_hidden_state.size()}")  # [1, 31, 768]

Model summary

The model architecture is the same as the original HuBERT base model, which contains 12 transformer layers with 8 attention heads. The model was trained using code from the official repository, and the detailed training configuration can be found in the same repository and the original paper.
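
As a quick way to confirm the architecture and to access representations from individual transformer layers, the Transformers config and the output_hidden_states flag can be used. This is a minimal sketch and not part of the original card:

import torch
from transformers import HubertModel

model = HubertModel.from_pretrained("rinna/japanese-hubert-base")
model.eval()

# The config records the architecture: number of transformer layers, attention heads, hidden size.
print(model.config.num_hidden_layers, model.config.num_attention_heads, model.config.hidden_size)

# output_hidden_states=True returns one tensor per layer in addition to the final output:
# index 0 is the convolutional feature-encoder output, indices 1..N are the transformer layers.
wav_input_16khz = torch.randn(1, 16000)
with torch.no_grad():
    outputs = model(wav_input_16khz, output_hidden_states=True)
print(len(outputs.hidden_states))  # num_hidden_layers + 1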

A fairseq checkpoint file is also available here.
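
If you prefer the fairseq checkpoint, it can typically be loaded with fairseq's checkpoint utilities, following the standard HuBERT feature-extraction workflow in the fairseq repository. The sketch below is an assumption rather than part of the original card, and the checkpoint file name is a placeholder:

import torch
from fairseq import checkpoint_utils

# Path to the downloaded fairseq checkpoint (placeholder file name).
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(["japanese-hubert-base.pt"])
model = models[0]
model.eval()

wav_input_16khz = torch.randn(1, 10000)
with torch.no_grad():
    # extract_features returns the transformer representations and a padding mask.
    features, padding_mask = model.extract_features(source=wav_input_16khz, padding_mask=None)
print(features.size())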

Training

The model was trained on approximately 19,000 hours of audio from the ReazonSpeech corpus.

License

The Apache 2.0 license

Citation

@article{hubert2021hsu,
  author={Hsu, Wei-Ning and Bolte, Benjamin and Tsai, Yao-Hung Hubert and Lakhotia, Kushal and Salakhutdinov, Ruslan and Mohamed, Abdelrahman},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  title={HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
  year={2021},
  volume={29},
  number={},
  pages={3451-3460},
  doi={10.1109/TASLP.2021.3122291}
}