hubert-base-jtube

This repo provides model weights for the hubert-base model trained on the JTubeSpeech corpus. Scroll down for model usage.

FAQ

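Q. What does this model do?
A. It embeds speech into latent representations. It can be used for recognition-type tasks such as speech recognition (transcription).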

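Q. Is a spoken language model basically the speech version of ChatGPT?
A. Transformers come in two main types: encoder and decoder. Put simply, an encoder is for recognition (it maps the original data to latent variables), and a decoder is for generation (it reconstructs the original data). The HuBERT model released here is an encoder (for recognition), which is different from decoder-type (generative) models such as ChatGPT.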

Q. So it cannot create voices?
A. Correct. It is a model for recognizing speech, not for generating it, so it cannot be used for generation.

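Q. Do you plan to release a decoder-type (generative) model in the future?
A. No. Releasing a generative model could infringe on individuals' rights, so we have no plans to do so. Rather, we believe the challenge for speech technologists is to develop technology that protects individuals' rights over their voices. (This spoken language model is a first step toward that goal.)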

Dataset

We extracted approximately 2,720 hours of Japanese speech from the single-speaker subset of the JTubeSpeech corpus. The training data includes approximately 6,000,000 utterances from a total of about 55,000 speakers.

How to use

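import soundfile as sf
import torch
from datasets import load_dataset
from transformers import AutoFeatureExtractor, HubertModel

model_name = "sarulab-speech/hubert-base-jtube"
processor = AutoFeatureExtractor.from_pretrained(model_name)
model = HubertModel.from_pretrained(model_name)


def map_to_array(batch):
    # Read the waveform from disk; the returned sampling rate is discarded
    # because the audio is expected to be 16 kHz already.
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch


# Small dummy dataset, used here only to demonstrate the API.
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.map(map_to_array)

# Normalize one utterance and add a batch dimension (batch size 1).
input_values = processor(ds["speech"][0], return_tensors="pt", sampling_rate=16_000).input_values

# Inference only, so gradient tracking is disabled.
with torch.no_grad():
    hidden_states = model(input_values).last_hidden_state  # (batch, frames, hidden_size)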
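To make the shape of `hidden_states` easier to interpret, the following is a minimal sketch of how many frames the encoder emits for a waveform of a given length. It assumes this checkpoint uses the standard HuBERT base convolutional front end (kernel sizes 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2); that is an assumption, and it can be checked against `model.config.conv_kernel` and `model.config.conv_stride`. The helper name `hubert_num_frames` is hypothetical and not part of `transformers`.

```python
# Assumed values: the standard HuBERT base feature-encoder layout.
# Verify with model.config.conv_kernel / model.config.conv_stride.
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)


def hubert_num_frames(num_samples: int) -> int:
    """Number of output frames for a waveform with `num_samples` samples.

    Applies the unpadded 1-D convolution length formula
    floor((length - kernel) / stride) + 1 once per layer.
    """
    length = num_samples
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        if length < kernel:
            return 0  # input is shorter than the receptive field
        length = (length - kernel) // stride + 1
    return length


# One second of 16 kHz audio yields 49 frames (about one frame per 20 ms).
print(hubert_num_frames(16_000))  # 49
```

Under these assumptions, `hidden_states.shape[1]` for an unpadded input should equal `hubert_num_frames(input_values.shape[1])`, and the last dimension is `model.config.hidden_size`.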

Contributors

Acknowledgements

This work was supported by the AIST KAKUSEI project (FY2023).