distil-whisper-german

This is a German speech recognition model based on the distil-whisper technique. The model has 756M parameters and a size of 1.51GB in bfloat16 format.
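
The stated size follows from the parameter count, since bfloat16 stores each parameter in 2 bytes. A minimal sketch of that arithmetic, assuming decimal gigabytes (1 GB = 10^9 bytes) and ignoring file metadata; the helper name is illustrative only:

```python
# Bytes used per parameter for common tensor types.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}

def checkpoint_size_gb(num_params: float, dtype: str = "bfloat16") -> float:
    """Approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

size_bf16 = checkpoint_size_gb(756e6)             # 756M parameters, as stated above
size_fp32 = checkpoint_size_gb(756e6, "float32")  # same weights stored in float32
print(f"bfloat16: {size_bf16:.2f} GB, float32: {size_fp32:.2f} GB")
```

Loading the same weights in float32 roughly doubles the memory needed, which is relevant when running on CPU.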

As a follow-up to Whisper large v3 german, we created a distilled version for faster inference with minimal quality loss.

Intended uses & limitations

The model is intended for German speech recognition. It can be used as a local transcription service or as part of a larger speech recognition pipeline. Although it has only about half the parameters of the large model, quality remains very good and is sufficient for most tasks. Latency is low enough for real-time applications when using optimization toolkits such as TensorRT.

Dataset

The training data is a filtered subset of Common Voice, Multilingual LibriSpeech, and some internal data. The data was filtered and double-checked for quality and correctness. The text was normalized, especially for casing and punctuation.
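
The exact normalization rules are not described here. Purely as an illustration, the sketch below shows one common way to normalize casing and punctuation in German transcripts, for example before comparing a transcript against a reference. The function name and the rule set (lowercasing, dropping punctuation) are assumptions, not the authors' pipeline:

```python
import re
import unicodedata

def normalize_transcript(text: str) -> str:
    """Hypothetical normalizer: lowercase, drop punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)   # compose umlauts into single code points
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)        # punctuation becomes a space; \w keeps ä, ö, ü, ß
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace

print(normalize_transcript("Guten Morgen,  Herr Müller!"))
```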

Model family

Model                            Parameters   Link
Whisper large v3 german          1.54B        link
Distil-whisper large v3 german   756M         link
tiny whisper                     37.8M        link

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 3e-05
  • total_train_batch_size: 512
  • num_epochs: 5.0
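
Only the total batch size is listed. In data-parallel training, the total is typically the product of the per-device batch size, the number of devices, and the gradient accumulation steps. A small sketch of that relationship; the per-device batch size of 32 and the 8 devices are assumed values for illustration, not the setup used for this model:

```python
def grad_accum_steps(total_batch: int, per_device_batch: int, num_devices: int) -> int:
    """Gradient accumulation steps needed to reach total_batch."""
    per_step = per_device_batch * num_devices
    if total_batch % per_step != 0:
        raise ValueError("total_batch must be a multiple of per_device_batch * num_devices")
    return total_batch // per_step

# Hypothetical hardware setup: 8 devices, 32 samples per device per step.
steps = grad_accum_steps(total_batch=512, per_device_batch=32, num_devices=8)
print(steps)
```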

Framework versions

  • Transformers 4.39.3
  • Pytorch 2.3.0a0+ebedce2
  • Datasets 2.18.0
  • Tokenizers 0.15.2

How to use

import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Use the GPU with half precision when available, otherwise fall back to CPU and float32.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "primeline/distil-whisper-large-v3-german"

# Load the model weights and move them to the selected device.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

# The processor bundles the tokenizer and the audio feature extractor.
processor = AutoProcessor.from_pretrained(model_id)

# Build a chunked pipeline so that audio longer than 30 seconds can be transcribed.
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

# Demo audio from the LibriSpeech long-form set (English); swap in your own German audio.
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
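
Because the pipeline is created with return_timestamps=True, the result also carries a "chunks" list in which each entry has a "text" string and a "timestamp" pair of start and end seconds. A minimal sketch for printing those segments; the helper name is illustrative, and the example dictionary is a made-up stand-in for a real pipeline result:

```python
def format_chunks(result: dict) -> list[str]:
    """Turn pipeline output into '[start - end] text' lines."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        # The end time can be None when the final segment is cut off.
        end_str = f"{end:.2f}s" if end is not None else "end"
        lines.append(f"[{start:.2f}s - {end_str}] {chunk['text'].strip()}")
    return lines

# Stand-in for `result = pipe(sample)`; the values are invented for illustration.
example = {
    "text": " Guten Morgen. Wie geht es Ihnen?",
    "chunks": [
        {"timestamp": (0.0, 1.5), "text": " Guten Morgen."},
        {"timestamp": (1.5, None), "text": " Wie geht es Ihnen?"},
    ],
}
for line in format_chunks(example):
    print(line)
```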

About us

primeline AI

Your partner for AI infrastructure in Germany
Experience the powerful AI infrastructure that drives your ambitions in Deep Learning, Machine Learning & High-Performance Computing. Optimized for AI training and inference.

Model author: Florian Zimmermeister
