---
datasets:
- homebrewltd/instruction-speech-whispervq-v2
language:
- en
- vi
license: cc-by-nc-sa-4.0
tags:
- sound language model
- audio-text-to-text
- torchtune
- whisperspeech
---
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65713d70f56f9538679e5a56/11T2v8rzhkK3OLWIl0c62.png)
## Ichigo Whisper
Ichigo Whisper is a compact (22M-parameter), open-source speech tokenizer for the Whisper-medium model, designed to improve multilingual performance with minimal impact on the model's original English capabilities. Unlike models that output continuous embeddings, Ichigo Whisper compresses speech into discrete tokens, making it more compatible with large language models (LLMs) for immediate speech understanding.
This speech tokenizer has been trained on ~400 hours of English data and ~1000 hours of Vietnamese data.
Ichigo Whisper is a key component of the [Ichigo v0.5 family]().
For more details, please refer to our official [blog post]().
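The discrete-token interface described above can be illustrated with a toy vector-quantization step: each continuous frame embedding is replaced by the index of its nearest codebook entry. The codebook size of 2561 matches the evaluation tables below, but the 64-dimensional features, random data, and function names here are placeholders for illustration, not the real WhisperVQ implementation.

```python
import numpy as np

def quantize(embeddings, codebook):
    # Nearest-neighbor lookup: squared L2 distance from every frame
    # to every codebook vector, then the argmin index per frame.
    d = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(2561, 64))  # 2561 entries, as in the evaluation tables
frames = rng.normal(size=(100, 64))     # stand-in for continuous Whisper features
tokens = quantize(frames, codebook)     # 100 discrete token ids in [0, 2561)
```

An LLM can then consume `tokens` like ordinary text tokens, which is what makes the discrete representation convenient for speech understanding.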
### Model Summary
**Developed by:** Homebrew Research.
**Model Architecture:** WhisperVQ
**Model type:** Quantizer of Whisper
**Language(s):** English and Vietnamese
**License:** CC-BY-NC-SA-4.0
### Resources
**Demo:** [Ichigo Whisper demo](https://ichigo-whisper.homebrew.ltd/)
**Blog:** [Blog post]()
<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/65713d70f56f9538679e5a56/nlmKWkr3QL0LyKjNXeY9p.qt"></video>
## Intended Use
**Intended Use Cases** This model is primarily intended for research applications. This version aims to further improve Whisper's performance on low-resource spoken languages.
**Out-of-scope** The use of Ichigo Whisper in any manner that violates applicable laws or regulations is strictly prohibited.
## How to Get Started
For inference, please refer to the official [Ichigo Whisper repository](https://github.com/janhq/WhisperSpeech/tree/main/ichigo-whisper).
```bash
python demo/inference.py --input path/to/your/audio.wav
```
## Training Specs
### Hardware Specifications
| **Component** | **Details** |
|---------------------------|-------------------------|
| **GPUs** | 8 × NVIDIA A6000 |
### Training Time
| **Phase** | **Duration** |
|---------------------------|-------------------------|
| **Phase 1** | 75 hours (50 epochs) |
| **Phase 2** | 29 hours (20 epochs) |
| **Total Training** | 104 hours |
### Phase 1: With KL Loss
| **Parameter** | **Value** |
|---------------------------|----------------------------------------------------------------|
| **Initialization Method** | WhisperVQ-Large-v3 (7 languages) embeddings with duplication |
| **Epochs** | 50 |
| **Global Batch Size** | 336 |
| **Learning Rate** | 1e-3 |
| **Learning Scheduler** | Linear warm-up with Cosine decay |
| **Optimizer** | AdamW |
| **Warmup Steps**          | 500                                                            |
| **Weight Decay** | 0.001 |
| **Max Audio Length** | 30 seconds (padded audio) |
### Phase 2: Without KL Loss
| **Parameter** | **Value** |
|---------------------------|----------------------------------------------------------------|
| **Initialization Method** | Phase 1 checkpoint |
| **Epochs** | 20 |
| **Global Batch Size** | 336 |
| **Learning Rate** | 1e-3 |
| **Learning Scheduler** | Linear warm-up with Cosine decay |
| **Optimizer** | AdamW |
| **Warmup Steps**          | 500                                                            |
| **Weight Decay** | 0.001 |
| **Max Audio Length** | 30 seconds (padded audio) |
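Both phases use the same learning-rate schedule: linear warm-up followed by cosine decay. A minimal sketch, assuming the 500 in the tables denotes warm-up steps and that the decay floor is zero (the exact floor is not specified):

```python
import math

def lr_at(step, total_steps, peak_lr=1e-3, warmup_steps=500):
    # Linear warm-up from 0 to peak_lr over the first warmup_steps,
    # then cosine decay from peak_lr down to 0 by total_steps.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
```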
## Evaluation
1. Vietnamese
| Model Name | Codebook Size | Test Dataset | Test Samples | WER (%) |
|------------|---------------|--------------|--------------|-----|
| **IchigoWhisper** | 2561 | viVoice | 10000 | **11.68** |
| Whisper Medium | - | viVoice | 10000 | 18.30 |
2. English
| Model Name | Codebook Size | Test Dataset | Test Samples | WER (%) |
|------------|---------------|--------------|--------------|-----|
| **IchigoWhisper** | 2561 | LibriTTS-R | 4689 | **11.89** |
| Whisper Medium | - | LibriTTS-R | 4689 | 13.06 |
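The WER figures above are word error rates: the word-level edit distance between hypothesis and reference, divided by the reference length. A minimal self-contained implementation (real evaluations typically also normalize text first, which is not shown here):

```python
def wer(reference: str, hypothesis: str) -> float:
    # Levenshtein distance over word sequences, as a percentage
    # of the reference length.
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return 100.0 * d[len(r)][len(h)] / len(r)
```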
## Citation Information
**BibTeX:**
```
@misc{ichigowhisper2024,
  title={IchigoWhisper},
  author={Homebrew Research},
  year={2024},
  month={December},
  url={https://huggingface.co/homebrewltd/Ichigo-whisper}
}
```
## Acknowledgement
- **[WhisperSpeech](https://github.com/collabora/WhisperSpeech)**
- **[Whisper](https://github.com/openai/whisper)**
- **[Vivoice](https://huggingface.co/datasets/capleaf/viVoice)**
- **LibriTTS-R**