whisper-25TPS-large-v3-turbo

Adds a pooling layer with stride 2 to bring the output rate down to 25 TPS (tokens per second). This model will be used later to introduce VQ (vector quantization) for the projection layer.
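
As a rough illustration of where the 25 TPS figure comes from: the standard Whisper encoder emits 1500 frames for a 30-second window, i.e. 50 frames per second, so pooling over time with stride 2 halves that to 25. The sketch below is plain Python rather than the actual model code. It assumes average pooling with kernel size 2 (the card only says "pooling layer"), and the feature dimension of 4 is a toy value chosen for readability.

```python
# Standard Whisper encoder: 1500 output frames per 30-second window (50 per second).
ENCODER_FRAMES = 1500
WINDOW_SECONDS = 30


def avg_pool_1d(frames, kernel_size=2, stride=2):
    """Average-pool a list of feature vectors along the time axis.

    Assumption: average pooling with kernel_size=2; the pooling type used
    by the real model is not stated in this card.
    """
    pooled = []
    for start in range(0, len(frames) - kernel_size + 1, stride):
        window = frames[start:start + kernel_size]
        dim = len(window[0])
        pooled.append([sum(vec[d] for vec in window) / kernel_size for d in range(dim)])
    return pooled


# Toy encoder output: 1500 frames, each a hypothetical 4-dimensional vector.
frames = [[float(t)] * 4 for t in range(ENCODER_FRAMES)]
pooled = avg_pool_1d(frames)

tps_before = len(frames) / WINDOW_SECONDS
tps_after = len(pooled) / WINDOW_SECONDS
print(f"frames: {len(frames)} -> {len(pooled)}, TPS: {tps_before} -> {tps_after}")
```

Halving the sequence length this way also halves the number of positions the decoder cross-attends to.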

WandB at https://wandb.ai/huseinzol05/whisperconv?nw=nwuserhuseinzol05

Training dataset

  1. malaysia-ai/common_voice_17_0
  2. mesolitica/Malaysian-STT-Whisper-Stage2/malaysian_multiturn_chat_assistants_segments
  3. mesolitica/Malaysian-STT-Whisper-Stage2/malaysian_multiturn_chat_assistants_manglish_segments

Evaluation

Evaluated on the malaysia-ai/common_voice_17_0 test split across 115 languages, under the following conditions:

  1. Lower case.
  2. Remove punctuation.
  3. Provide language tagging for decoder input ids, <|startoftranscript|><|{lang}|><|transcribe|><|notimestamps|>.
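
The three conditions above can be sketched in plain Python as follows. This is an approximation, not the evaluation script itself: the helper names `normalize`, `decoder_prefix`, and `cer` are hypothetical, punctuation is assumed to mean any Unicode punctuation character, and CER is computed as character-level Levenshtein distance divided by the reference length. How Common Voice locale codes such as `sv-SE` map to Whisper language tags is not stated in this card and is left out.

```python
import unicodedata


def normalize(text):
    """Lower-case, drop Unicode punctuation, and collapse whitespace (assumed normalizer)."""
    text = text.lower()
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return " ".join(text.split())


def decoder_prefix(lang):
    """Build the language-tagged decoder prompt described in condition 3."""
    return f"<|startoftranscript|><|{lang}|><|transcribe|><|notimestamps|>"


def edit_distance(ref, hyp):
    """Character-level Levenshtein distance using a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate on normalized text."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


print(decoder_prefix("en"))
print(normalize("Hello, World!"))
print(cer("Hello, world!", "hello word"))
```

In the last call the normalized reference has 11 characters and the hypothesis is one deletion away, so the CER is 1/11.
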
lang: gl, samples: 9949, CER: 0.042740443121340566
lang: en, samples: 16379, CER: 0.060986384009768274
lang: ar, samples: 10458, CER: 0.22266123579844427
lang: kab, samples: 14972, CER: 0.3244665236586341
lang: ml, samples: 703, CER: 0.42335890521056685
lang: kk, samples: 514, CER: 0.17043440799796145
lang: ltg, samples: 2904, CER: 0.23117590536047175
lang: fr, samples: 16145, CER: 0.048485631588568376
lang: de, samples: 16170, CER: 0.026314971778193794
lang: fi, samples: 1554, CER: 0.05055169332273527
lang: pt, samples: 9432, CER: 0.04087286366709751
lang: ia, samples: 1816, CER: 0.05992562427372291
lang: eu, samples: 13621, CER: 0.0512883172324828
lang: ro, samples: 3896, CER: 0.05076617371579273
lang: sw, samples: 12086, CER: 0.1507494503501684
lang: sv-SE, samples: 5247, CER: 0.061493613079958064
lang: ta, samples: 8263, CER: 0.13906399211712145
lang: et, samples: 2653, CER: 0.0940406805612152
lang: lg, samples: 11902, CER: 0.1739333269639051
lang: it, samples: 15154, CER: 0.023851543887980154
lang: mhr, samples: 15107, CER: 0.11669897389006022
lang: sr, samples: 1539, CER: 0.1768132282095298
lang: mr, samples: 1437, CER: 0.19218859045543932
lang: ka, samples: 12608, CER: 0.1249808202311083
lang: es, samples: 15848, CER: 0.021626970071659344
lang: be, samples: 15878, CER: 0.033842475848291816
lang: lt, samples: 4753, CER: 0.09010879518888047
lang: ca, samples: 16389, CER: 0.03981665496291331
lang: eo, samples: 14773, CER: 0.045954869771101005
lang: tr, samples: 11235, CER: 0.05675522154877889
lang: hu, samples: 11435, CER: 0.04471908954804673
lang: ja, samples: 6033, CER: 0.36979170394583843
lang: br, samples: 2202, CER: 0.3698327173154907
lang: ne-NP, samples: 217, CER: 0.6311209237066987
lang: uz, samples: 12006, CER: 0.14525655980276672
lang: ru, samples: 10184, CER: 0.030237044051551122
lang: dv, samples: 2213, CER: 0.4396708739245023
lang: tt, samples: 4953, CER: 0.15263373726156979
lang: rw, samples: 14797, CER: 0.2062194796231522
lang: bn, samples: 9327, CER: 0.23033462368544436
lang: ug, samples: 6108, CER: 0.15066705416170634
lang: rm-sursilv, samples: 1361, CER: 0.22034708167584185
lang: bg, samples: 3201, CER: 0.06147538075948558
lang: ab, samples: 9108, CER: 0.23437651516119973
lang: uk, samples: 9915, CER: 0.05952618613972175
lang: mt, samples: 1662, CER: 0.24826981610474486
lang: fa, samples: 10292, CER: 0.18625054813201342
lang: pl, samples: 9186, CER: 0.042953875408088954
lang: bas, samples: 541, CER: 0.4306410343558038
lang: nl, samples: 11255, CER: 0.025745352458888874
lang: zh-CN, samples: 10335, CER: 0.24028746839684906
lang: tok, samples: 2175, CER: 0.05378416879373438
lang: ur, samples: 4052, CER: 0.1335251795057381
lang: sk, samples: 2593, CER: 0.1293953251895765
lang: oc, samples: 254, CER: 0.25430326530844893
lang: yue, samples: 2585, CER: 0.24481328379140146
lang: mrj, samples: 7102, CER: 0.17758088754553472
lang: fy-NL, samples: 3167, CER: 0.18638765694302617
lang: cs, samples: 9055, CER: 0.03927483627794959
lang: th, samples: 10982, CER: 0.21474513392414912
lang: ckb, samples: 5262, CER: 0.21311040529692965
lang: mn, samples: 1896, CER: 0.40997841020559816
lang: ky, samples: 1604, CER: 0.19988688313695105
lang: skr, samples: 1006, CER: 0.433359625436932
lang: hy-AM, samples: 4281, CER: 0.15392287780406108
lang: sl, samples: 1242, CER: 0.09513225423150144
lang: vi, samples: 1077, CER: 0.098204374854713
lang: hi, samples: 3151, CER: 0.13696196488161588
lang: nan-tw, samples: 2317, CER: 0.5831691586562167
lang: id, samples: 3633, CER: 0.03486822347502954
lang: cy, samples: 5371, CER: 0.18579147648223834
lang: yo, samples: 999, CER: 0.5552374946139923
lang: sah, samples: 1455, CER: 0.22751567126188318
lang: mk, samples: 1097, CER: 0.09952169250027201
lang: cv, samples: 1288, CER: 0.2576255518218807
lang: myv, samples: 479, CER: 0.18588405953220014
lang: da, samples: 2405, CER: 0.06941392064863901
lang: lv, samples: 6738, CER: 0.09661537865671035
lang: kmr, samples: 3900, CER: 0.2301259104014993
lang: tk, samples: 545, CER: 0.36230278052919873
lang: nn-NO, samples: 370, CER: 0.14840933723876001
lang: ha, samples: 661, CER: 0.2931043936843371
lang: he, samples: 260, CER: 0.39141508380891815
lang: dyu, samples: 59, CER: 0.61439655320722
lang: gn, samples: 855, CER: 0.37302107024653286
lang: lij, samples: 694, CER: 0.29544299446782624
lang: hsb, samples: 444, CER: 0.22357315869461994
lang: pa-IN, samples: 487, CER: 0.4293346665112184
lang: el, samples: 1696, CER: 0.09513654126618179
lang: zgh, samples: 159, CER: 1.0
lang: as, samples: 551, CER: 0.35565880640786546
lang: sq, samples: 472, CER: 0.24829234420755228
lang: ko, samples: 338, CER: 0.27138642455096107
lang: ga-IE, samples: 517, CER: 0.4374031144524405
lang: cnh, samples: 763, CER: 0.4734102220225452
lang: sat, samples: 147, CER: 1.0
lang: rm-vallader, samples: 462, CER: 0.19970230364362926
lang: or, samples: 670, CER: 1.0
lang: mdf, samples: 104, CER: 0.3026470206470614
lang: af, samples: 62, CER: 0.1641456986700316
lang: ig, samples: 4, CER: 0.5614456190061029
lang: sc, samples: 232, CER: 0.3428693546598864
lang: tig, samples: 169, CER: 0.9600079749203083
lang: te, samples: 49, CER: 0.48187292278194327
lang: ps, samples: 199, CER: 0.352738986222659
lang: am, samples: 205, CER: 0.8622565254904031
lang: ast, samples: 162, CER: 0.13468806515316542
lang: os, samples: 50, CER: 0.4827762620005156
lang: lo, samples: 33, CER: 1.0
lang: az, samples: 33, CER: 0.12338463797528132
lang: ti, samples: 4, CER: 1.0
lang: vot, samples: 6, CER: 0.4186810281632936
lang: nhi, samples: 5, CER: 0.4055536936182097
lang: yi, samples: 6, CER: 0.9078696446408636
lang: tw, samples: 9, CER: 0.4916804976841617

Average CER (unweighted mean over the 115 languages): 0.2675743317499601
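
The reported average appears to be an unweighted mean of the per-language CERs: summing the 115 listed values and dividing by 115 reproduces it to about four decimal places. The sketch below shows how such lines could be parsed and averaged, and contrasts this with a sample-weighted mean. The function names are hypothetical and only three of the listed lines are included, so the printed numbers are not the full-table figures.

```python
import re

# Matches lines of the form "lang: gl, samples: 9949, CER: 0.0427..."
LINE_RE = re.compile(r"lang: (?P<lang>[\w-]+), samples: (?P<samples>\d+), CER: (?P<cer>[\d.]+)")


def parse_results(text):
    """Return a list of (lang, samples, cer) tuples from the result lines."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            rows.append((m["lang"], int(m["samples"]), float(m["cer"])))
    return rows


def macro_average(rows):
    """Unweighted mean: every language counts equally."""
    return sum(c for _, _, c in rows) / len(rows)


def weighted_average(rows):
    """Sample-weighted mean: languages with more test clips count more."""
    total = sum(n for _, n, _ in rows)
    return sum(n * c for _, n, c in rows) / total


# Three lines copied from the results above, for illustration only.
sample = """lang: gl, samples: 9949, CER: 0.042740443121340566
lang: en, samples: 16379, CER: 0.060986384009768274
lang: zgh, samples: 159, CER: 1.0"""

rows = parse_results(sample)
print("macro:", macro_average(rows))
print("weighted:", weighted_average(rows))
```

Because low-resource languages with very high CER count as much as high-resource ones in an unweighted mean, the macro figure sits well above a sample-weighted one.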

Source code

Source code at https://github.com/mesolitica/malaya-speech/tree/master/session/whisper-conv

Model size: 0.8B params, tensor type F32 (Safetensors)