TEA-ASR-1-mini · Taiwan Everyday Audio 🍵
TEA-ASR is an open, drop-in speech-recognition model purpose-built for Taiwan Mandarin. It turns real speech into natural Traditional Chinese with authentic Taiwan vocabulary, and it stays robust through the everyday Mandarin–English code-switching common in Taiwan. Adapted from the state-of-the-art Qwen3-ASR foundation and merged into a single self-contained checkpoint, TEA-ASR loads and runs exactly like stock Qwen3-ASR, with no converters and no post-processing. On the four public benchmarks we evaluate, the flagship TEA-ASR-1 matches or surpasses both a dedicated Taiwan specialist (Breeze-ASR-25) and a large multilingual model (Whisper-large-v3) on every set; TEA-ASR-1-mini does so on three of the four and trails the specialist only on CSZS (see Limitations).
TEA-ASR-1-mini is the 780M compact model (best accuracy-per-parameter).
A companion TEA-ASR-1 shares the identical recipe — see JacobLinCool/TEA-ASR-1.
Key features
- 🎯 Built for Taiwan Mandarin — Traditional script and Taiwan-style word choice, produced by the model itself.
- 🔀 Code-switch robust — handles natural zh-en mixing instead of translating Mandarin into English.
- 🧩 Drop-in Qwen3-ASR compatible — same loading and inference API as the base model; nothing else to install or call.
- 🪶 Lightweight adaptation — a small decoder LoRA on a frozen audio encoder, trained on a few hours of public audio, then merged for deployment.
Quick start
```bash
pip install qwen-asr
```

```python
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained("JacobLinCool/TEA-ASR-1-mini")
result = model.transcribe(audio="utterance.wav", language="Chinese")[0]
print(result.text)  # -> Traditional Chinese with Taiwan lexicon
```
Set language="Chinese" for Taiwan speech (recommended). You can also pass a context= string of hotwords
(names, jargon) for contextual biasing, exactly as with the base Qwen3-ASR.
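As a rough illustration of the `context=` option mentioned above, the sketch below wraps the quick-start call in a small helper. The helper names `build_context` and `transcribe_with_hotwords` are hypothetical, and joining hotwords with spaces is an assumption about a reasonable context format, not something the model card specifies.

```python
from typing import Iterable


def build_context(hotwords: Iterable[str]) -> str:
    """Join hotwords into one context string, dropping blanks and duplicates.

    Assumption: a space-separated list is an acceptable context format.
    """
    seen = []
    for word in hotwords:
        word = word.strip()
        if word and word not in seen:
            seen.append(word)
    return " ".join(seen)


def transcribe_with_hotwords(model, audio_path: str, hotwords: Iterable[str]) -> str:
    """Transcribe one file with language="Chinese" and a hotword context.

    `model` is expected to expose the same transcribe(audio=, language=,
    context=) call used in the quick start.
    """
    results = model.transcribe(
        audio=audio_path,
        language="Chinese",
        context=build_context(hotwords),
    )
    return results[0].text
```

With the model loaded as in the quick start, a call might look like `transcribe_with_hotwords(model, "utterance.wav", ["台積電", "LoRA"])`.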
Benchmark results
Mixed Error Rate (MER%, lower is better), all numbers from a single self-measured run under one protocol (see Evaluation). Columns: the two TEA-ASR models, the original (unadapted) Qwen3-ASR bases, and two references — Breeze-ASR-25 (a Taiwan-specialist ASR) and Whisper-large-v3. Bold = this model.
| Benchmark | TEA-ASR-1 | **TEA-ASR-1-mini** | Qwen3-ASR-1.7B | Qwen3-ASR-0.6B | Breeze-ASR-25 | Whisper-large-v3 |
|---|---|---|---|---|---|---|
| CommonVoice 19 (zh-TW) | 3.64 | **5.14** | 3.90 | 5.79 | 8.03 | 10.17 |
| ASCEND (zh-en) | 10.59 | **12.49** | 10.57 | 12.54 | 17.53 | 19.61 |
| CSZS (zh-en) | 10.98 | **13.21** | 11.03 | 16.03 | 12.18 | 23.24 |
| NTUML2021 | 6.80 | **7.37** | 10.12 | 11.03 | 7.50 | 9.68 |
How to read this. TEA-ASR-1-mini is the efficient model on this page. Across the suite, TEA-ASR-1 posts the best error rate on three benchmarks and is effectively tied for best on ASCEND (10.59 vs 10.57 for Qwen3-ASR-1.7B), ahead of the Taiwan-specialist Breeze-ASR-25 and far ahead of Whisper-large-v3. TEA-ASR-1-mini delivers most of that quality at well under half the parameters (780M vs 2B). Against the unadapted Qwen3-ASR base, the gain in this content-folded recognition metric is largest on in-domain lectures (NTUML2021, where TEA-ASR-1-mini goes from 11.03 to 7.37); on the other sets recognition is on par or better. Importantly, the metric folds away script differences (see Evaluation), so it does not reflect the decisive practical change: TEA-ASR emits Traditional script and Taiwan vocabulary natively, whereas the base produces Simplified script.
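To make the comparison against the unadapted base concrete, the short sketch below computes the relative MER reduction of TEA-ASR-1-mini over Qwen3-ASR-0.6B using the values from the benchmark table. The function name `relative_reduction` is only illustrative.

```python
# MER% values copied from the benchmark table above.
BASE_0_6B = {
    "CommonVoice 19 (zh-TW)": 5.79,
    "ASCEND (zh-en)": 12.54,
    "CSZS (zh-en)": 16.03,
    "NTUML2021": 11.03,
}
TEA_MINI = {
    "CommonVoice 19 (zh-TW)": 5.14,
    "ASCEND (zh-en)": 12.49,
    "CSZS (zh-en)": 13.21,
    "NTUML2021": 7.37,
}


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error rate that was removed."""
    return (before - after) / before


reductions = {
    name: relative_reduction(BASE_0_6B[name], TEA_MINI[name])
    for name in BASE_0_6B
}

# Print benchmarks from largest to smallest relative gain.
for name, value in sorted(reductions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.1%}")
```

Relative reduction is used here rather than the absolute difference so that benchmarks with different baseline error rates can be compared on one scale.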
Speed & memory
Measured on an NVIDIA RTX 5090 (32 GB) with bf16, batch size 1, 50 utterances, and greedy decoding. xRT = audio seconds processed per wall-clock second (higher is faster); RTF = wall-clock / audio (lower is faster), so RTF = 1 / xRT; peak VRAM is the maximum allocated during inference.
| Model | Params | xRT ↑ | RTF ↓ | Peak VRAM (GB) ↓ |
|---|---|---|---|---|
| TEA-ASR-1 | 2B | 11.0 | 0.091 | 4.16 |
| TEA-ASR-1-mini | 780M | 8.1 | 0.124 | 1.65 |
| Breeze-ASR-25 | 1.54B | 5.5 | 0.182 | 4.41 |
| Whisper-large-v3 | 1.54B | 4.7 | 0.214 | 4.41 |
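The two speed columns are reciprocals of each other. A minimal sketch of how they could be measured follows; it assumes a hypothetical `transcribe` callable and a list of clips with known durations, and it is not the benchmarking script behind the table.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_speed(
    transcribe: Callable[[str], object],
    clips: Iterable[Tuple[str, float]],
) -> Tuple[float, float]:
    """Return (xRT, RTF) over clips given as (path, duration_seconds).

    xRT = audio seconds per wall-clock second (higher is faster).
    RTF = wall-clock seconds per audio second (lower is faster).
    """
    audio_seconds = 0.0
    start = time.perf_counter()
    for path, duration in clips:
        transcribe(path)  # batch size 1: one utterance at a time
        audio_seconds += duration
    wall_seconds = time.perf_counter() - start
    return audio_seconds / wall_seconds, wall_seconds / audio_seconds
```

On a GPU, accurate timing usually also requires synchronizing the device before reading the clock, and peak memory could be read with something like PyTorch's `torch.cuda.max_memory_allocated()`; both are left out of this sketch.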
Figures
[Figure] Accuracy across the four public benchmarks (content-fold MER%, lower is better).
[Figure] Speed and memory (single GPU, bf16, batch 1).
[Figure] Ablation, tokenizer × finetune. Content MER isolates the finetune gain (the script fold hides tokenizer effects); raw MER isolates the tokenizer-first localization that makes the output Traditional + Taiwan-lexicon.
Evaluation
- Metric — Mixed Error Rate (MER). Character Error Rate for Chinese and Word Error Rate for the English tokens, computed jointly per utterance and micro-averaged.
- Content fold (applied uniformly to every dataset and every system). Before scoring, both the reference and the hypothesis are normalized to a common form: converted to Simplified Chinese with OpenCC (`t2s`), lowercased, and stripped of punctuation. This isolates recognition from script style, so a Simplified-output model (e.g. the base) and a Traditional-output model (TEA-ASR) are compared fairly on content. (TEA-ASR's actual output is Traditional; the fold is only for scoring.)
- Decoding. TEA-ASR and Qwen3-ASR are decoded with `language=Chinese`; Whisper-large-v3 and Breeze-ASR-25 use their own automatic language detection. All systems are scored with the same code on the same public splits; we do not import numbers reported elsewhere.
| Dataset | What it tests | Eval split (n) |
|---|---|---|
| CommonVoice 19 (zh-TW) | Read Taiwan-Mandarin speech | full test (5013) |
| ASCEND | Spontaneous Mandarin–English code-switch conversation | full test (1315) |
| CSZS (zh-en) | Zero-resource code-switch benchmark | full test (3176) |
| NTUML2021 | Mandarin lecture speech (university ML course) | test[:2000] |
- No train/test leakage. Fine-tuning used only the training pools, disjoint from every evaluation split: the NTUML2021 train split, the ASCEND train split, and a CommonVoice slice drawn from `validated_without_test` (CommonVoice's official non-test pool, disjoint from its test split). Evaluation therefore runs on the full, untouched CommonVoice / ASCEND / NTUML2021 test splits; CSZS is a separate dataset not used in training at all. Every number above is leak-free.
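The metric and content fold described in this section can be sketched in a few lines of Python. This is an approximation under stated assumptions, not the scoring code used for the table: the tokenizer treats each character in the basic CJK block as one token and every other whitespace-delimited run as one word, punctuation is simply deleted, and the Traditional-to-Simplified step is passed in as a callable (`to_simplified`, identity by default) so the sketch runs without OpenCC installed.

```python
import re
import string
from typing import Callable, Iterable, List, Tuple

# Basic CJK Unified Ideographs block; each such character is one token.
_CJK = r"\u4e00-\u9fff"
_TOKEN_RE = re.compile(rf"[{_CJK}]|[^\s{_CJK}]+")
_PUNCT = string.punctuation + "，。！？、；：「」『』（）《》"
_STRIP_PUNCT = str.maketrans("", "", _PUNCT)


def content_fold(text: str, to_simplified: Callable[[str], str] = lambda s: s) -> str:
    """Normalize script, case, and punctuation before scoring."""
    return to_simplified(text).lower().translate(_STRIP_PUNCT)


def tokenize(text: str) -> List[str]:
    """Chinese characters become single tokens; other runs become words."""
    return _TOKEN_RE.findall(text)


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance over token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def mixed_error_rate(
    pairs: Iterable[Tuple[str, str]],
    to_simplified: Callable[[str], str] = lambda s: s,
) -> float:
    """Micro-averaged MER% over (reference, hypothesis) pairs."""
    errors = total = 0
    for ref, hyp in pairs:
        ref_tokens = tokenize(content_fold(ref, to_simplified))
        hyp_tokens = tokenize(content_fold(hyp, to_simplified))
        errors += edit_distance(ref_tokens, hyp_tokens)
        total += len(ref_tokens)
    return 100.0 * errors / total if total else 0.0
```

To mirror the fold described above, `to_simplified` would be the conversion function of an OpenCC `t2s` converter, assuming the OpenCC Python bindings are available. Micro-averaging means errors and reference tokens are summed over all utterances before dividing, so long utterances weigh more than short ones.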
How it was built
- Base: `Qwen/Qwen3-ASR-0.6B` (frozen AuT audio encoder + Qwen3 decoder).
- Adaptation: a rank-16 decoder-only LoRA trained on a few hours of public audio (CommonVoice zh-TW, ASCEND, NTUML2021), with general + code-switch replay to preserve the base model's broad and bilingual ability. The audio encoder is left frozen.
- Localization: Traditional-script + Taiwan-lexicon output is rendered through the model's own tokenizer (the surface mapping is baked once at build time), so the Traditional output comes straight from the tokenizer decode with no post-processing at inference.
- Packaging: the adapter is merged into the base and the localized tokenizer is shipped with it, so the release is a single drop-in checkpoint that loads like stock Qwen3-ASR.
- Decoding tip: pass `language="Chinese"` for Taiwan speech; this also prevents translation-style outputs on dense code-switch.
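The "merged for deployment" step in the list above follows the usual LoRA identity, in which a low-rank update is folded into the frozen weight so that no adapter is needed at inference. The toy sketch below shows that arithmetic on plain Python lists. The `alpha / rank` scaling is the standard LoRA convention and is an assumption here (the model card states only the rank), and `merge_lora` is a hypothetical helper, not the build script for this checkpoint.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight: Matrix, lora_b: Matrix, lora_a: Matrix,
               alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * B @ A.

    weight: (out, in), lora_b: (out, rank), lora_a: (rank, in).
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example with rank 1; the released model uses rank 16 on decoder weights.
merged = merge_lora(
    weight=[[1.0, 0.0], [0.0, 1.0]],
    lora_b=[[1.0], [2.0]],
    lora_a=[[3.0, 4.0]],
    alpha=2.0,
    rank=1,
)
```

After merging, the layer is an ordinary dense weight again, which is consistent with the checkpoint loading like stock Qwen3-ASR. Real implementations would do this with tensor libraries rather than nested lists.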
Limitations
- Dense synthetic code-switch (CSZS): the smaller TEA-ASR-1-mini trails the Taiwan specialist on this set; the flagship TEA-ASR-1 leads it. For heavy code-switch, prefer TEA-ASR-1.
- Scope: validated on the Qwen3-ASR family (0.6B and 1.7B); the released models load via the `qwen-asr` package, exactly like the base.
Citation
```bibtex
@misc{teaasr2026,
  title  = {Tokenizer-First Adaptation of Mandarin ASR to Taiwan Mandarin},
  author = {TEA-ASR contributors},
  year   = {2026},
  note   = {TEA-ASR (Taiwan Everyday Audio); adapted from Qwen3-ASR}
}
```
Built on Qwen3-ASR (Apache-2.0). The TEA-ASR adaptation and this checkpoint are released under the MIT License; the underlying Qwen3-ASR weights remain subject to the Apache-2.0 license and its attribution/NOTICE terms.

