---
datasets:
- homebrewltd/instruction-speech-whispervq-v2
language:
- en
- vi
license: cc-by-nc-sa-4.0
tags:
- sound language model
- audio-text-to-text
- torchtune
- whisperspeech
---
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65713d70f56f9538679e5a56/11T2v8rzhkK3OLWIl0c62.png)
## Ichigo Whisper
Ichigo Whisper is a compact (22M-parameter), open-source speech tokenizer for the Whisper-medium model, designed to improve multilingual performance with minimal impact on the model's original English capabilities. Unlike models that output continuous embeddings, Ichigo Whisper compresses speech into discrete tokens, making it more compatible with large language models (LLMs) for immediate speech understanding.
This speech tokenizer has been trained on ~400 hours of English data and ~1000 hours of Vietnamese data.
Ichigo Whisper is a key component of the [Ichigo v0.5 family]().
For more details, please refer to our official [blog post]().
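The discrete-token interface described above can be illustrated with a toy vector-quantization step: each continuous frame embedding is replaced by the index of its nearest codebook entry. The codebook size of 2561 matches the evaluation tables below, but the 64-dimensional features, random data, and function names here are placeholders for illustration, not the real WhisperVQ implementation.

```python
import numpy as np

def quantize(embeddings, codebook):
    # Nearest-neighbor lookup: squared L2 distance from every frame
    # to every codebook vector, then the argmin index per frame.
    d = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(2561, 64))  # 2561 entries, as in the evaluation tables
frames = rng.normal(size=(100, 64))     # stand-in for continuous Whisper features
tokens = quantize(frames, codebook)     # 100 discrete token ids in [0, 2561)
```

An LLM can then consume `tokens` like ordinary text tokens, which is what makes the discrete representation convenient for speech understanding.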
### Model Summary
**Developed by:** Homebrew Research.
**Model Architecture:** WhisperVQ
**Model type:** Quantizer of Whisper
**Language(s):** English and Vietnamese
**License:** CC-BY-NC-SA-4.0
### Resources
**Demo:** [Ichigo Whisper demo](https://ichigo-whisper.homebrew.ltd/)
**Blog:** [Blog post]()
<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/65713d70f56f9538679e5a56/nlmKWkr3QL0LyKjNXeY9p.qt"></video>
## Intended Use
**Intended Use Cases** This model is primarily intended for research applications. This version aims to further improve Whisper's performance on low-resource spoken languages.
**Out-of-scope** The use of Ichigo Whisper in any manner that violates applicable laws or regulations is strictly prohibited.
## How to Get Started
For inference, please refer to the official [Ichigo Whisper repository](https://github.com/janhq/WhisperSpeech/tree/main/ichigo-whisper).
```bash
python demo/inference.py --input path/to/your/audio.wav
```
## Training Specs
### Hardware Specifications
| **Component** | **Details** |
|---------------------------|-------------------------|
| **GPUs** | 8 × NVIDIA A6000 |
### Training Time
| **Phase** | **Duration** |
|---------------------------|-------------------------|
| **Phase 1** | 75 hours (50 epochs) |
| **Phase 2** | 29 hours (20 epochs) |
| **Total Training** | 104 hours |
### Phase 1: With KL Loss
| **Parameter** | **Value** |
|---------------------------|----------------------------------------------------------------|
| **Initialization Method** | WhisperVQ-Large-v3 (7 languages) embeddings with duplication |
| **Epochs** | 50 |
| **Global Batch Size** | 336 |
| **Learning Rate** | 1e-3 |
| **Learning Scheduler** | Linear warm-up with Cosine decay |
| **Optimizer** | AdamW |
| **Warmup Steps**          | 500                                                            |
| **Weight Decay** | 0.001 |
| **Max Audio Length** | 30 seconds (padded audio) |
### Phase 2: Without KL Loss
| **Parameter** | **Value** |
|---------------------------|----------------------------------------------------------------|
| **Initialization Method** | Phase 1 checkpoint |
| **Epochs** | 20 |
| **Global Batch Size** | 336 |
| **Learning Rate** | 1e-3 |
| **Learning Scheduler** | Linear warm-up with Cosine decay |
| **Optimizer** | AdamW |
| **Warmup Steps**          | 500                                                            |
| **Weight Decay** | 0.001 |
| **Max Audio Length** | 30 seconds (padded audio) |
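Both phases use the same learning-rate schedule: linear warm-up followed by cosine decay. A minimal sketch, assuming the 500 in the tables denotes warm-up steps and that the decay floor is zero (the exact floor is not specified):

```python
import math

def lr_at(step, total_steps, peak_lr=1e-3, warmup_steps=500):
    # Linear warm-up from 0 to peak_lr over the first warmup_steps,
    # then cosine decay from peak_lr down to 0 by total_steps.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
```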
## Evaluation
1. Vietnamese
| Model Name | Codebook Size | Test Dataset | Test Samples | WER (%) |
|------------|---------------|--------------|--------------|-----|
| **IchigoWhisper** | 2561 | viVoice | 10000 | **11.68** |
| Whisper Medium | - | viVoice | 10000 | 18.30 |
2. English
| Model Name | Codebook Size | Test Dataset | Test Samples | WER (%) |
|------------|---------------|--------------|--------------|-----|
| **IchigoWhisper** | 2561 | LibriTTS-R | 4689 | **11.89** |
| Whisper Medium | - | LibriTTS-R | 4689 | 13.06 |
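The WER figures above are word error rates: the word-level edit distance between hypothesis and reference, divided by the reference length. A minimal self-contained implementation (real evaluations typically also normalize text first, which is not shown here):

```python
def wer(reference: str, hypothesis: str) -> float:
    # Levenshtein distance over word sequences, as a percentage
    # of the reference length.
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return 100.0 * d[len(r)][len(h)] / len(r)
```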
## Citation Information
**BibTeX:**
```
@misc{ichigowhisper2024,
  title={IchigoWhisper},
  author={Homebrew Research},
  year={2024},
  month={December},
  url={https://huggingface.co/homebrewltd/Ichigo-whisper}
}
```
## Acknowledgement
- **[WhisperSpeech](https://github.com/collabora/WhisperSpeech)**
- **[Whisper](https://github.com/openai/whisper)**
- **[Vivoice](https://huggingface.co/datasets/capleaf/viVoice)**
- **LibriTTS-R**