---
language: ja
license: apache-2.0
tags:
- speech
- speaker-diarization
datasets:
- callhome
---

# Fine-tuned XLSR-53 large model for speech diarization in Japanese phone calls

A two-speaker diarization model fine-tuned from [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on the Japanese phone-call corpus [CallHome](https://media.talkbank.org/ca/CallHome/jpn/).

## Usage

The model can be used directly as follows.

```python
import numpy as np
import torch
from pydub import AudioSegment
from transformers import Wav2Vec2ForAudioFrameClassification, Wav2Vec2FeatureExtractor


def _make_timegrid(sound_duration: float, total_len: int):
    """Build per-frame start/end times that span the whole recording."""
    start_timegrid = np.linspace(0, sound_duration, total_len + 1)
    dt = start_timegrid[1] - start_timegrid[0]
    end_timegrid = start_timegrid + dt
    return start_timegrid[:total_len], end_timegrid[:total_len]


feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1,
    sampling_rate=16_000,
    padding_value=0.0,
    do_normalize=True,
    return_attention_mask=True,
)
model = Wav2Vec2ForAudioFrameClassification.from_pretrained(
    "Ivydata/wav2vec2-large-speech-diarization-jp"
)

filepath = "/path/to/file.wav"
sound = AudioSegment.from_file(filepath)
sound = sound.set_frame_rate(16_000)  # the model expects 16 kHz audio
sound = sound.set_channels(1)  # downmix to mono; interleaved stereo samples would corrupt the input
sound_duration = sound.duration_seconds

feature = feature_extractor(
    np.array(sound.get_array_of_samples()),
    sampling_rate=16_000,
).input_values[0]
input_values = torch.tensor(feature, dtype=torch.float32).unsqueeze(0)

with torch.no_grad():
    logits = model(input_values).logits  # (batch, frames, num_labels)
pred = logits.argmax(dim=-1).squeeze(0)  # per-frame speaker label

start_timegrid, end_timegrid = _make_timegrid(sound_duration, len(pred))

print("sec speaker_label")
for p, start_time in zip(pred, start_timegrid):
    print(f"{start_time:.4f} {p.item()}")
```

A minimal sketch for merging these frame-level predictions into contiguous speaker segments is given at the end of this card.

## Training

The model was trained on the Japanese phone-call corpus [CallHome](https://media.talkbank.org/ca/CallHome/jpn/).

## License

[The Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0)
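
## Merging frames into segments

The usage example above prints one speaker label per frame. As a complement, here is a minimal sketch that groups consecutive frames with the same label into `(start, end, speaker)` segments. The helper `merge_frames` is illustrative and not part of the model's API; it reuses `pred`, `start_timegrid`, and `end_timegrid` from the usage example.

```python
def merge_frames(pred, start_timegrid, end_timegrid):
    """Group consecutive frames with the same predicted label into segments.

    Illustrative helper, not part of the model's API. Returns a list of
    (start_sec, end_sec, speaker_label) tuples.
    """
    segments = []
    seg_start = float(start_timegrid[0])
    seg_label = int(pred[0])
    for i in range(1, len(pred)):
        label = int(pred[i])
        if label != seg_label:
            # close the current segment where the new one begins
            segments.append((seg_start, float(start_timegrid[i]), seg_label))
            seg_start = float(start_timegrid[i])
            seg_label = label
    segments.append((seg_start, float(end_timegrid[-1]), seg_label))
    return segments


for start, end, label in merge_frames(pred, start_timegrid, end_timegrid):
    print(f"{start:.4f} - {end:.4f} speaker {label}")
```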