whisper-38TPS-large-v3-turbo
Adds an interpolation layer (scale factor 1 / 1.3, linear mode) to bring the model to 38 TPS. This model will later be used to introduce VQ for the projection layer.
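The 38 TPS figure follows from the scale factor: Whisper's encoder emits 1500 frames per 30-second window (50 frames per second), and 50 / 1.3 is about 38.5. The sketch below is a minimal pure-Python illustration of that arithmetic and of linear resampling along the time axis. It is not the model's actual layer: the function names, the half-sample-centre index mapping, and the use of a 1-D list in place of encoder features are all assumptions made for illustration.

```python
import math

# Whisper's encoder emits 1500 frames for a 30 s window, i.e. 50 frames/s.
WINDOW_SECONDS = 30
ENCODER_FRAMES = 1500
SCALE_FACTOR = 1 / 1.3  # scale factor stated in this model card


def output_length(n_frames, scale=SCALE_FACTOR):
    """Number of frames left after resampling (floor, as an assumption)."""
    return math.floor(n_frames * scale)


def linear_resample(frames, scale=SCALE_FACTOR):
    """Linearly resample a 1-D sequence along time (hypothetical helper).

    Uses a half-sample-centre mapping from output to input positions,
    which is an assumption, not something the model card specifies.
    """
    n_in = len(frames)
    n_out = output_length(n_in, scale)
    out = []
    for i in range(n_out):
        # Position of output frame i expressed in input-frame coordinates.
        src = (i + 0.5) / scale - 0.5
        src = min(max(src, 0.0), n_in - 1)
        lo = int(math.floor(src))
        hi = min(lo + 1, n_in - 1)
        w = src - lo
        out.append(frames[lo] * (1 - w) + frames[hi] * w)
    return out


n_out = output_length(ENCODER_FRAMES)
tokens_per_second = n_out / WINDOW_SECONDS
resampled = linear_resample([float(t) for t in range(ENCODER_FRAMES)])
```

In a real model the same operation would be applied independently to every feature channel of the encoder output rather than to a single list.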
WandB at https://wandb.ai/huseinzol05/whisperconv-37tps
Training dataset
- malaysia-ai/common_voice_17_0
- mesolitica/Malaysian-STT-Whisper-Stage2/malaysian_multiturn_chat_assistants_segments
- mesolitica/Malaysian-STT-Whisper-Stage2/malaysian_multiturn_chat_assistants_manglish_segments
Evaluation
Evaluated on malaysia-ai/common_voice_17_0/test, covering up to 115 languages, under the following conditions:
- Lower case.
- Remove punctuation.
- Provide the language tag in the decoder input IDs:
<|startoftranscript|><|{lang}|><|transcribe|><|notimestamps|>.
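The conditions above can be sketched in plain Python as follows. This is an illustrative approximation, not the evaluation script itself: the helper names are hypothetical, punctuation is taken to mean Unicode punctuation categories, and CER is computed as character-level edit distance divided by reference length. The exact normalisation and CER implementation used for the numbers below are not stated here.

```python
import unicodedata


def normalize(text):
    """Lower-case and drop punctuation (Unicode 'P*' categories, an assumption)."""
    text = text.lower()
    kept = [ch for ch in text if not unicodedata.category(ch).startswith("P")]
    # Collapse whitespace left behind by removed punctuation.
    return " ".join("".join(kept).split())


def decoder_prompt(lang):
    """Build the language-tagged decoder prefix shown above."""
    return f"<|startoftranscript|><|{lang}|><|transcribe|><|notimestamps|>"


def edit_distance(ref, hyp):
    """Character-level Levenshtein distance using a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate after normalisation."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)
```

For example, `cer("Hello, world!", "hello world")` is 0 under this normalisation, because the two strings differ only in case and punctuation.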
lang: gl, samples: 9949, CER: 0.038022646505003505
lang: en, samples: 16379, CER: 0.06152338036752953
lang: ar, samples: 10458, CER: 0.20554948380790689
lang: kab, samples: 14972, CER: 0.27582742742168737
lang: ml, samples: 703, CER: 0.4534987099731459
lang: kk, samples: 514, CER: 0.14656822533502237
lang: ltg, samples: 2904, CER: 0.20097263599391613
lang: fr, samples: 16145, CER: 0.04488389251043107
lang: de, samples: 16170, CER: 0.024508649217920696
lang: fi, samples: 1554, CER: 0.04564682077988523
lang: pt, samples: 9432, CER: 0.03775697459390274
lang: ia, samples: 1816, CER: 0.048942064572492235
lang: eu, samples: 13621, CER: 0.04257874896499848
lang: ro, samples: 3896, CER: 0.04464553583764197
lang: sw, samples: 12086, CER: 0.13462171972111703
lang: sv-SE, samples: 5247, CER: 0.05644495253179642
lang: ta, samples: 8263, CER: 0.12015692184372433
lang: et, samples: 2653, CER: 0.08418725106887591
lang: lg, samples: 11902, CER: 0.16394520477766272
lang: it, samples: 15154, CER: 0.022206968121195512
lang: mhr, samples: 15107, CER: 0.11759627706458757
lang: sr, samples: 1539, CER: 0.12054745929850534
lang: mr, samples: 1437, CER: 0.17201526189909722
lang: ka, samples: 12608, CER: 0.09759112968055164
lang: es, samples: 15848, CER: 0.02079860813120504
lang: be, samples: 15878, CER: 0.028204188639431513
lang: lt, samples: 4753, CER: 0.08361403994497943
lang: ca, samples: 16389, CER: 0.034603051793827375
lang: eo, samples: 14773, CER: 0.038797289403201284
lang: tr, samples: 11235, CER: 0.06036704523833737
lang: hu, samples: 11435, CER: 0.03949698885801047
lang: ja, samples: 6033, CER: 0.4220936026828759
lang: br, samples: 2202, CER: 0.35878086034863677
lang: ne-NP, samples: 217, CER: 0.3291459262210471
lang: uz, samples: 12006, CER: 0.12374728709149391
lang: ru, samples: 10184, CER: 0.02797243735802649
lang: dv, samples: 2213, CER: 0.23492100705076932
lang: tt, samples: 4953, CER: 0.13729422476882677
lang: rw, samples: 14797, CER: 0.18145367587835692
lang: bn, samples: 9327, CER: 0.18277559280921965
lang: ug, samples: 6108, CER: 0.13144227833835373
lang: rm-sursilv, samples: 1361, CER: 0.18689765164456176
lang: bg, samples: 3201, CER: 0.055955241908113074
lang: ab, samples: 9108, CER: 0.19054594912915496
lang: uk, samples: 9915, CER: 0.051784101043250555
lang: mt, samples: 1662, CER: 0.21771389762160198
lang: fa, samples: 10292, CER: 0.16831622647092573
lang: pl, samples: 9186, CER: 0.04033527459592553
lang: bas, samples: 541, CER: 0.35723102972073434
lang: nl, samples: 11255, CER: 0.022585953833447428
lang: zh-CN, samples: 10335, CER: 0.2931312734758128
lang: tok, samples: 2175, CER: 0.03662570548031443
lang: ur, samples: 4052, CER: 0.13198011579433647
lang: sk, samples: 2593, CER: 0.11906174726636401
lang: oc, samples: 254, CER: 0.24591277076643198
lang: yue, samples: 2585, CER: 0.2164728975826135
lang: mrj, samples: 7102, CER: 0.16832338715131967
lang: fy-NL, samples: 3167, CER: 0.15728785235456794
lang: cs, samples: 9055, CER: 0.036099521557020384
lang: th, samples: 10982, CER: 0.2047811972945032
lang: ckb, samples: 5262, CER: 0.18515629283718374
lang: mn, samples: 1896, CER: 0.3506058387282826
lang: ky, samples: 1604, CER: 0.16262879996086715
lang: skr, samples: 1006, CER: 0.36649834414968757
lang: hy-AM, samples: 4281, CER: 0.1225414613097752
lang: sl, samples: 1242, CER: 0.0834131147698269
lang: vi, samples: 1077, CER: 0.08876174396167676
lang: hi, samples: 3151, CER: 0.11898319714865897
lang: nan-tw, samples: 2317, CER: 0.5474943411562636
lang: id, samples: 3633, CER: 0.03180116282736414
lang: cy, samples: 5371, CER: 0.17257875329649836
lang: yo, samples: 999, CER: 0.455950415432927
lang: sah, samples: 1455, CER: 0.18888490602403937
lang: mk, samples: 1097, CER: 0.09206708244914664
lang: cv, samples: 1288, CER: 0.235723839280149
lang: myv, samples: 479, CER: 0.1592703126884194
lang: da, samples: 2405, CER: 0.06542541215856146
lang: lv, samples: 6738, CER: 0.08540597002397939
lang: kmr, samples: 3900, CER: 0.19240419880492615
lang: tk, samples: 545, CER: 0.33610008208878533
lang: nn-NO, samples: 370, CER: 0.13261241419957523
lang: ha, samples: 661, CER: 0.2573926198205386
lang: he, samples: 260, CER: 0.4051793430769439
lang: dyu, samples: 59, CER: 0.304191650031961
lang: gn, samples: 855, CER: 0.33838399989471013
lang: lij, samples: 694, CER: 0.2589637626026028
lang: hsb, samples: 444, CER: 0.19484668772406566
lang: pa-IN, samples: 487, CER: 0.26281109809350234
lang: el, samples: 1696, CER: 0.0802963573687271
lang: zgh, samples: 159, CER: 1.0
lang: as, samples: 551, CER: 0.3544747289612597
lang: sq, samples: 472, CER: 0.205909598829979
lang: ko, samples: 338, CER: 0.1756772082099313
lang: ga-IE, samples: 517, CER: 0.49812764585095354
lang: cnh, samples: 763, CER: 0.3273461347554693
lang: sat, samples: 147, CER: 0.44685714375234686
lang: rm-vallader, samples: 462, CER: 0.175400127063989
lang: or, samples: 670, CER: 1.0
lang: mdf, samples: 104, CER: 0.26337437776761086
lang: af, samples: 62, CER: 0.16694385500004474
lang: ig, samples: 4, CER: 0.49499782040104623
lang: sc, samples: 232, CER: 0.27876921441533403
lang: tig, samples: 169, CER: 0.7535851634053247
lang: te, samples: 49, CER: 0.43039390871972943
lang: ps, samples: 199, CER: 0.30951384676098237
lang: am, samples: 205, CER: 0.8482531487830595
lang: ast, samples: 162, CER: 0.12322345297299651
lang: os, samples: 50, CER: 0.7190250069381621
lang: lo, samples: 33, CER: 1.0
lang: az, samples: 33, CER: 0.11273205088291703
lang: ti, samples: 4, CER: 1.0
lang: vot, samples: 6, CER: 0.2898256634669678
lang: nhi, samples: 5, CER: 0.37620444072056974
lang: yi, samples: 6, CER: 1.0
lang: tw, samples: 9, CER: 0.46826636272155564
average CER: 0.2364527160297919
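The results above do not say whether the average is taken over languages or weighted by sample count. The sketch below parses lines in the format used above and computes both variants so they can be compared. The function names are hypothetical and only three of the lines are reproduced as sample input.

```python
import re

# Matches lines such as "lang: gl, samples: 9949, CER: 0.038..."
LINE_RE = re.compile(r"lang: ([^,\s]+), samples: (\d+), CER: ([0-9.]+)")


def parse_results(lines):
    """Return (lang, samples, cer) tuples for every line that matches."""
    rows = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            rows.append((m.group(1), int(m.group(2)), float(m.group(3))))
    return rows


def macro_average(rows):
    """Unweighted mean CER: every language counts equally."""
    return sum(c for _, _, c in rows) / len(rows)


def weighted_average(rows):
    """Mean CER weighted by the number of samples per language."""
    total = sum(n for _, n, _ in rows)
    return sum(n * c for _, n, c in rows) / total


sample_lines = [
    "lang: gl, samples: 9949, CER: 0.038022646505003505",
    "lang: en, samples: 16379, CER: 0.06152338036752953",
    "lang: ar, samples: 10458, CER: 0.20554948380790689",
]
rows = parse_results(sample_lines)
```

Running both functions over all lines shows which convention reproduces the reported average.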
Source code
Source code at https://github.com/mesolitica/malaya-speech/tree/master/session/whisper-conv-38tps