	---
	base_model: openai/whisper-large-v3
	library_name: transformers
	license: apache-2.0
	pipeline_tag: automatic-speech-recognition
	tags:
	- audio
	- automatic-speech-recognition
	- whisper
	- hf-asr-leaderboard
	---

	# Model Card for Lite-Whisper large-v3

	Lite-Whisper is a compressed version of OpenAI Whisper, produced with LiteASR. See our [GitHub repository](https://github.com/efeslab/LiteASR) and [paper](https://arxiv.org/abs/2502.20583) for details.

	Here's a code snippet to get started:
	```python
	import librosa
	import torch
	from transformers import AutoProcessor, AutoModel

	device = "cuda:0"
	dtype = torch.float16

	# load the compressed Whisper model
	model = AutoModel.from_pretrained(
	    "efficient-speech/lite-whisper-large-v3",  # the checkpoint this card describes
	    trust_remote_code=True,  # the model class ships with the checkpoint repository
	)
	model.to(dtype).to(device)

	# we use the same processor as the original model
	processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")

	# set the path to your audio file
	path = "path/to/audio.wav"
	audio, _ = librosa.load(path, sr=16000)

	input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
	input_features = input_features.to(dtype).to(device)

	predicted_ids = model.generate(input_features)
	transcription = processor.batch_decode(
	    predicted_ids,
	    skip_special_tokens=True,
	)[0]

	print(transcription)
	```
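
The snippet above transcribes a single clip. Whisper's feature extractor pads or truncates its input to a 30-second window by default, so longer recordings need to be split first. Below is a minimal sketch of one naive way to do that; `split_into_windows` is a hypothetical helper, not part of LiteASR or `transformers`, and it assumes non-overlapping windows are acceptable.

```python
def split_into_windows(audio, sampling_rate=16000, window_seconds=30):
    """Split a 1-D sequence of samples into consecutive, non-overlapping windows.

    Hypothetical helper for illustration only. `audio` can be a NumPy array
    (as returned by librosa.load) or any sequence that supports len() and slicing.
    """
    window = int(sampling_rate * window_seconds)  # samples per window
    if window <= 0:
        raise ValueError("window must cover at least one sample")
    # The last window may be shorter than the others.
    return [audio[start:start + window] for start in range(0, len(audio), window)]
```

Each window can then be passed through `processor` and `model.generate` as shown above, and the decoded strings joined. Cutting at fixed offsets can split a word across two windows, so treat this as a starting point rather than a production chunking strategy.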

	## Benchmark Results

	The table below reports the average word error rate (WER, lower is better) on the [ESB datasets](https://huggingface.co/datasets/hf-audio/esb-datasets-test-only-sorted), along with the encoder and decoder parameter counts of each model:

	| Model | Average WER (↓) | Encoder Size | Decoder Size |
	|-------|----------------|--------------|--------------|
	| [whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 10.1 | 635M | 907M |
	| [lite-whisper-large-v3-acc](https://huggingface.co/efficient-speech/lite-whisper-large-v3-acc) | 10.1 | 429M | 907M |
	| [lite-whisper-large-v3](https://huggingface.co/efficient-speech/lite-whisper-large-v3) | 10.2 | 377M | 907M |
	| [lite-whisper-large-v3-fast](https://huggingface.co/efficient-speech/lite-whisper-large-v3-fast) | 11.3 | 308M | 907M |
	|   |   |   |   |
	| [whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) | 10.1 | 635M | 172M |
	| [lite-whisper-large-v3-turbo-acc](https://huggingface.co/efficient-speech/lite-whisper-large-v3-turbo-acc) | 10.2 | 421M | 172M |
	| [lite-whisper-large-v3-turbo](https://huggingface.co/efficient-speech/lite-whisper-large-v3-turbo) | 12.6 | 374M | 172M |
	| [lite-whisper-large-v3-turbo-fast](https://huggingface.co/efficient-speech/lite-whisper-large-v3-turbo-fast) | 20.1 | 313M | 172M |
	|   |   |   |   |
	| [whisper-medium](https://huggingface.co/openai/whisper-medium) | 14.8 | 306M | 457M |
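
For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition; `word_error_rate` is a hypothetical helper, not the evaluation code behind the table. Leaderboard evaluations typically normalize casing and punctuation before scoring, which this sketch does not do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper for illustration; it splits on whitespace only and
    applies no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(round(100 * word_error_rate("the cat sat on the mat", "the cat sit on mat"), 1))
```

Multiplying by 100 gives the percentage form used in the table.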