Commit 4ae7556 (verified) by Splend1dchan · 1 parent: a55574d

Update README.md

Files changed (1): README.md (+62 −44)
README.md CHANGED
 
# Twister

**Twister** 是一個針對繁體中文以及中英交錯情境進行優化的語音辨識模型。**Twister** 基於 Whisper-large-v2 之上訓練而成,其中文部分完全採用合成語音資料進行訓練。

**Twister** is an advanced ASR model fine-tuned from [Whisper-large-v2](https://github.com/openai/whisper) with TTS-synthesized data, specially optimized for Taiwanese Mandarin and Mandarin-English code-switching scenarios.

---
## Performance
 
| Dataset\Model             | WLV2-Oracle ↓  | WLV2-Auto ↓ | WLV3-Auto ↓ | COOL-Whisper ↓ | Twister (Ours) ↓    |
|---------------------------|----------------|-------------|-------------|----------------|---------------------|
| ASCEND-OVERALL*           | 21.14 (AUTO)   | 21.14       | 23.22       | 19.71          | **17.74** (-16.08%) |
| - ASCEND-EN               | 27.20 (EN)     | 27.36       | 27.21       | 29.39          | **26.64** (-2.63%)  |
| - ASCEND-ZH               | **13.75** (ZH) | 17.49       | 17.41       | 18.90          | 16.04 (-8.29%)      |
| - ASCEND-MIX*             | 21.01 (AUTO)   | 21.01       | 25.13       | 17.34          | **16.38** (-22.01%) |
| CommonVoice16-zh-TW       | 9.02 (ZH)      | 9.84        | 8.95        | 11.86          | **7.97** (-19.00%)  |
| CSZS-zh-en*               | 29.49 (AUTO)   | 29.49       | 26.43       | 20.90          | **13.01** (-55.88%) |
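The percentages in parentheses appear to be relative error-rate reductions measured against the WLV2-Auto column; the quick check below reproduces them (ASCEND-MIX comes out at 22.04%, a few hundredths off the table's 22.01%):

```python
# Relative reduction of Twister vs. the WLV2-Auto column, as quoted in the table above.
rows = {
    "ASCEND-OVERALL":      (21.14, 17.74),
    "ASCEND-EN":           (27.36, 26.64),
    "ASCEND-ZH":           (17.49, 16.04),
    "ASCEND-MIX":          (21.01, 16.38),
    "CommonVoice16-zh-TW": ( 9.84,  7.97),
    "CSZS-zh-en":          (29.49, 13.01),
}
for name, (wlv2_auto, twister) in rows.items():
    print(f"{name}: -{(wlv2_auto - twister) / wlv2_auto:.2%}")
```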
 
 
## 🔧 Usage Example

To run the model on `input_audio.wav`:
 
```python
import torchaudio
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration, AutomaticSpeechRecognitionPipeline

# 1. Load audio
audio_path = "./input_audio.wav"
waveform, sample_rate = torchaudio.load(audio_path)

# 2. Preprocess: downmix to mono and resample to the 16 kHz expected by Whisper models
waveform = waveform.mean(dim=0)
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

# 3. Load the model and processor, then build the ASR pipeline
processor = WhisperProcessor.from_pretrained("Mediatek-Research/Twister")
model = WhisperForConditionalGeneration.from_pretrained("Mediatek-Research/Twister")
asr_pipeline = AutomaticSpeechRecognitionPipeline(
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
)

# 4. Inference
output = asr_pipeline(waveform.numpy())
print("Result:", output["text"])
```
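If you want to pin the decoding language and task rather than rely on Whisper's automatic language detection, the pipeline also accepts `generate_kwargs`. A minimal sketch; whether forcing `language="zh"` helps depends on your audio:

```python
from transformers import pipeline

# Build the same ASR pipeline via the factory helper and force Mandarin transcription.
asr = pipeline("automatic-speech-recognition", model="Mediatek-Research/Twister")
output = asr(
    "input_audio.wav",
    generate_kwargs={"language": "zh", "task": "transcribe"},
)
print("Result:", output["text"])
```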

You can obtain a wav file for testing by loading one from a benchmark:

```python
from datasets import load_dataset
import torch
import torchaudio

ds = load_dataset("ky552/ML2021_ASR_ST", split="test")
sample = ds[0]["audio"]

audio_array = sample["array"]
sampling_rate = sample["sampling_rate"]

# torchaudio.save expects a (channels, samples) float tensor
waveform = torch.tensor(audio_array, dtype=torch.float32).unsqueeze(0)

torchaudio.save("input_audio.wav", waveform, sampling_rate)

# Decoding results on this clip:
# Whisper: 使用這個方式的時候
# Twister: 使用這個 function 的時候 (correct)
```
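To reproduce the kind of side-by-side comparison shown in the comments above, a minimal sketch that transcribes the same clip with the public `openai/whisper-large-v2` baseline and with the Twister model ID used in the example above:

```python
from transformers import pipeline

# Transcribe the same clip with the baseline and with Twister for comparison.
for name, model_id in [
    ("Whisper", "openai/whisper-large-v2"),
    ("Twister", "Mediatek-Research/Twister"),
]:
    asr = pipeline("automatic-speech-recognition", model=model_id)
    print(f"{name}: {asr('input_audio.wav')['text']}")
```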
---
## Training Data

Twister 的訓練採樣自以下數據集:

The training data of Twister is sampled from the following publicly available sources:

| Dataset Name | Type | Language | Total Hours | License |
|--------------|------|----------|-------------|---------|
| ODC Synth | Synth. | Mandarin | 10,000 | Open Data Commons License Attribution + Apache 2.0* |
| [CommonVoice17-EN](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | Real | English | 1,738 | Creative Commons Zero |
| [NTUML2021](https://huggingface.co/datasets/ky552/ML2021_ASR_ST) | Real | Code-switching | 11 | MIT License |

\*ODC Synth is generated using text from [FineWeb2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) (ODC License) and the TTS model [BreezyVoice](https://huggingface.co/MediaTek-Research/BreezyVoice) (Apache 2.0 License).
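The footnote above sketches how ODC Synth was built: FineWeb2 text read aloud by a TTS model. A minimal sketch of that idea, assuming a generic TTS interface: the `synthesize` helper below is a placeholder rather than BreezyVoice's actual API, and the `cmn_Hani` FineWeb2 config name is an assumption.

```python
import numpy as np
import soundfile as sf
from datasets import load_dataset

def synthesize(text: str, sampling_rate: int = 16000):
    """Placeholder TTS call; the real pipeline would invoke BreezyVoice here."""
    return np.zeros(sampling_rate, dtype=np.float32), sampling_rate  # 1 s of silence

# Pair FineWeb2 sentences with synthesized audio to form (audio, transcript) training examples.
texts = load_dataset("HuggingFaceFW/fineweb-2", name="cmn_Hani", split="train", streaming=True)
for i, row in enumerate(texts.take(3)):
    waveform, sr = synthesize(row["text"])
    sf.write(f"odc_synth_{i:06d}.wav", waveform, sr)
```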
---

## 📜 Citation

If you find this model useful, please cite our work:

**Cheng-Kang Chou\***, **Chan-Jan Hsu\***, Ho-Lam Chung, Liang-Hsuan Tseng, Hsi-Chun Cheng, Yu-Kuan Fu, Kuan-Po Huang, Hung-yi Lee
[*A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data*](https://arxiv.org/pdf/2506.11130)

\*Equal contribution

```bibtex
@article{chou2025selfrefiningframeworkenhancingasr,
  title={A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data},
  author={Cheng Kang Chou and Chan-Jan Hsu and Ho-Lam Chung and Liang-Hsuan Tseng and Hsi-Chun Cheng and Yu-Kuan Fu and Kuan Po Huang and Hung-Yi Lee},
  journal={arXiv preprint arXiv:2506.11130},
  year={2025}
}
```