---
language: ko
datasets:
- aihub 자유대화 음성(노인남녀)
tags:
- automatic-speech-recognition
license: apache-2.0
---

# wav2vec2-xlsr-korean-senior

This model is [fleek/wav2vec-large-xlsr-korean](https://huggingface.co/fleek/wav2vec-large-xlsr-korean), further fine-tuned on the [AIhub 자유대화 음성(노인남녀)](https://aihub.or.kr/aidata/30704) dataset (free-conversation speech from elderly men and women).

- Total train data size: 808,642
- Total validation data size: 159,970

When using this model, make sure that your speech input is sampled at 16 kHz.

The script used for training can be found here: https://github.com/hyyoka/wav2vec2-korean-senior

### Inference

```py
import re

import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC


def clean_up(transcription):
    # Keep only spaces and Hangul (jamo and syllables); drop everything else.
    hangul = re.compile('[^ ㄱ-ㅣ가-힣]+')
    result = hangul.sub('', transcription)
    return result


model_name = "hyyoka/wav2vec2-xlsr-korean-senior"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

wav_file = "sample.wav"  # placeholder: path to your own 16 kHz audio file
speech_array, sampling_rate = torchaudio.load(wav_file)

# Use the first channel; the processor expects 16 kHz input.
feat = processor(speech_array[0].numpy(),
                 sampling_rate=16000,
                 padding=True,
                 max_length=800000,
                 truncation=True,
                 return_attention_mask=True,
                 return_tensors="pt",
                 pad_token_id=49)
inputs = {'input_values': feat['input_values'],
          'attention_mask': feat['attention_mask']}

# Inference only, so gradients are not needed.
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

logits = outputs.logits
predicted_ids = logits.argmax(dim=-1)  # greedy CTC decoding
transcription = processor.decode(predicted_ids[0])
stt_result = clean_up(transcription)
```
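
Since the model expects 16 kHz input, it can help to check a file's sampling rate before running inference. The following is a minimal sketch using only Python's standard `wave` module. The helper names `wav_sample_rate` and `is_16khz` are hypothetical and are not part of the training repository, and the check assumes an uncompressed PCM WAV file.

```python
import wave

EXPECTED_RATE = 16000  # sampling rate (Hz) this model card asks for


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()


def is_16khz(path):
    """Return True if the WAV file at `path` is sampled at 16 kHz."""
    return wav_sample_rate(path) == EXPECTED_RATE
```

If `is_16khz` returns `False`, resample the audio before passing it to the processor, for example with `torchaudio.functional.resample(waveform, orig_freq, new_freq)`.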