---
language:
- "zh"
thumbnail: "Mandarin-wav2vec2.0 fine-tuned on AISHELL-1 dataset"
tags:
- automatic-speech-recognition
- speech
- wav2vec2.0
- audio
datasets:
- AISHELL-1
metrics:
- cer
---
The Mandarin-wav2vec2.0 model is pre-trained on 1000 hours of the AISHELL-2 dataset. Details of the pre-training can be found at https://github.com/kehanlu/mandarin-wav2vec2. This model is fine-tuned on 178 hours of the AISHELL-1 dataset and serves as the baseline model in the paper "A context-aware knowledge transferring strategy for CTC-based ASR" ([preprint](https://arxiv.org/abs/2210.06244)).
## Results on AISHELL-1
|CER (%)|dev|test|
| - | - | - |
|vanilla w2v2-CTC | 4.85 | 5.13|
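CER is the character error rate of the decoded hypotheses against the reference transcripts. As a minimal sketch (not part of the original evaluation pipeline), CER can be computed with a plain character-level Levenshtein distance; the strings below are illustrative placeholders, not actual model outputs:

```python
def edit_distance(ref, hyp):
    """Character-level Levenshtein distance (space-optimized DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,        # deletion
                dp[j - 1] + 1,    # insertion
                prev + (r != h),  # substitution (free if characters match)
            )
    return dp[-1]


def cer(ref, hyp):
    """CER = edit distance / number of reference characters."""
    return edit_distance(ref, hyp) / len(ref)


# Two substitutions over 12 reference characters -> CER ~ 16.7%
print(cer("广州市房地产中介协会分析", "广州市房地产终结协会分析"))
```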
## Usage
**Note:** the model was fine-tuned with the ESPnet toolkit and then converted to a Hugging Face model for easy use.
```python
import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
class ExtendedWav2Vec2ForCTC(Wav2Vec2ForCTC):
    """
    In ESPnet there is a LayerNorm layer between the encoder output and the
    CTC classification head, so it is added on top of the lm_head here as well.
    """
    def __init__(self, config):
        super().__init__(config)
        self.lm_head = torch.nn.Sequential(
            torch.nn.LayerNorm(config.hidden_size),
            self.lm_head
        )


model = ExtendedWav2Vec2ForCTC.from_pretrained("kehanlu/mandarin-wav2vec2-aishell1")
processor = Wav2Vec2Processor.from_pretrained("kehanlu/mandarin-wav2vec2-aishell1")
model.eval()

# Load a 16 kHz AISHELL-1 utterance and run greedy CTC decoding.
audio_input, sample_rate = sf.read("/path/to/data_aishell/wav/dev/S0724/BAC009S0724W0121.wav")
inputs = processor(audio_input, sampling_rate=sample_rate, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription[0])
# 广州市房地产中介协会分析
```
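The snippet above assumes a 16 kHz AISHELL-1 WAV file, which matches the sampling rate the model was trained on. If your recordings use a different sampling rate, a minimal sketch of resampling with torchaudio before decoding is shown below; the file path is a placeholder, and `model` and `processor` are the objects loaded above:

```python
import torch
import torchaudio

# Load an arbitrary WAV file (hypothetical path) as [channels, frames].
waveform, orig_sr = torchaudio.load("/path/to/your_audio.wav")
waveform = waveform.mean(dim=0)  # downmix to mono if needed

# Resample to the 16 kHz rate expected by the model.
if orig_sr != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=orig_sr, new_freq=16000)

inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(processor.batch_decode(torch.argmax(logits, dim=-1))[0])
```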
## Licence
The pre-training corpus, AISHELL-2, is supported by the AISHELL Foundation. The resulting model also follows the licence of AISHELL-2: it is free to use for academic purposes and must not be used for any commercial purpose without permission from the AISHELL Foundation (https://www.aishelltech.com/aishell_2).
```
@ARTICLE{aishell2,
author = {{Du}, J. and {Na}, X. and {Liu}, X. and {Bu}, H.},
title = "{AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale}",
journal = {ArXiv},
eprint = {1808.10583},
primaryClass = "cs.CL",
year = 2018,
month = Aug,
}
```
If you find this work useful, please cite:
```
@article{lu2022context,
title={A context-aware knowledge transferring strategy for CTC-based ASR},
author={Lu, Ke-Han and Chen, Kuan-Yu},
journal={arXiv preprint arXiv:2210.06244},
year={2022}
}
```