whisper-38TPS-VQ-32k-large-v3-turbo
Add an interpolation layer (scale factor 1/1.3, linear mode) to reach 38 TPS with a VQ embedding size of 32768.
This model introduces VQ on top of mesolitica/whisper-38TPS-large-v3-turbo.
WandB at https://wandb.ai/huseinzol05/whisperconv-vq-37tps?nw=nwuserhuseinzol05
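The two steps described above (linear interpolation of the encoder output, then vector quantization) can be sketched as follows. This is a minimal illustration, not the model's actual implementation: the random codebook, the 1280 hidden size (large-v3-turbo's encoder width), and the 50 frames/s encoder rate are assumptions.

```python
import torch
import torch.nn.functional as F

# Whisper's encoder emits 1500 frames per 30 s window, i.e. 50 frames/s.
# Scaling the time axis by 1/1.3 with linear interpolation gives
# 50 / 1.3 ~= 38.46 tokens per second, hence "38 TPS".
hidden = torch.randn(1, 1500, 1280)  # (batch, time, hidden) from the encoder

# F.interpolate's linear mode operates on (batch, channels, time).
down = F.interpolate(hidden.transpose(1, 2), scale_factor=1 / 1.3, mode='linear')
down = down.transpose(1, 2)
print(down.shape)  # torch.Size([1, 1153, 1280])

# Vector quantization: replace each downsampled frame by the index of its
# nearest codebook entry, yielding discrete token ids in [0, 32768).
# The random codebook below is purely illustrative.
codebook = torch.randn(32768, 1280)
dists = torch.cdist(down, codebook.unsqueeze(0))  # (1, 1153, 32768)
token_ids = dists.argmin(dim=-1)
print(token_ids.shape)  # torch.Size([1, 1153])
```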
Training dataset
- malaysia-ai/common_voice_17_0
- mesolitica/Malaysian-STT-Whisper-Stage2/malaysian_multiturn_chat_assistants_segments
- mesolitica/Malaysian-STT-Whisper-Stage2/malaysian_multiturn_chat_assistants_manglish_segments
How to get audio tokens
from transformers import AutoFeatureExtractor, AutoModel
import librosa

model_id = "mesolitica/whisper-38TPS-VQ-32k-large-v3-turbo"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True, torch_dtype='auto').cuda()
encoder = model.model.get_encoder()

y, sr = librosa.load('common_voice_ba_26517811.mp3', sr=feature_extractor.sampling_rate)
features = feature_extractor([y], return_tensors='pt', return_attention_mask=True)
for k in features.keys():
    features[k] = features[k].cuda()

encoded = encoder(**features)
# Print the discrete token ids, keeping only the valid (unmasked) positions.
print(encoded[1][0, encoded[2][0] == 1])
tensor([30019, 16591, 25658, 26770, 18729, 11918, 27695, 8797, 8797, 27695,
3713, 4070, 31486, 10838, 29572, 17799, 10532, 30455, 27432, 11923,
5474, 5474, 8369, 22489, 19089, 11508, 29421, 23174, 22103, 32428,
24292, 10034, 29611, 22995, 8371, 7246, 7246, 7246, 18944, 32239,
32239, 32239, 5305, 5305, 18107, 18107, 18107, 17816, 17816, 15308,
31477, 31477, 31477, 31477, 29400, 32234, 19476, 12665, 27116, 27116,
27116, 27077, 2226, 2226, 14469, 9391, 9401, 5440, 11090, 7858,
7858, 9655, 535, 15933, 19437, 31405, 26886, 26886, 1099, 25014,
25014, 25014, 26876, 26876, 31252, 12830, 12125, 3158, 8791, 8791,
8791, 6250, 184, 184, 184, 20886, 1253, 25801, 11358, 2875,
19004, 20452, 20108, 260, 23872, 21176, 2646, 6819, 6819, 28491,
19185, 28226, 776, 776, 23908, 19632, 12109, 7945, 7945, 18838,
20878, 12554, 12554, 29472, 13465, 7392, 7392, 7392, 19392, 26456,
26456, 30045, 26470, 7751, 8246, 1812, 28528, 15703, 6675, 28935,
28935, 30123, 30123, 27261, 25220, 24163, 11258, 11258, 24163, 21332,
21332, 21332, 2981, 17763, 1719, 31918, 24147, 24147, 8292, 22857,
23017, 625, 20466, 5160, 31824, 31824, 14302, 14125, 9496, 2987,
21650, 9496, 21650, 14561, 13358, 10482, 6400, 32446, 5707],
device='cuda:0')
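As a sanity check on the advertised rate: the tensor above contains 179 token ids, so under the nominal 50/1.3 ≈ 38.46 tokens-per-second rate the clip would be roughly 4.65 s of speech. The exact clip duration is not stated in this card, so this is only an estimate.

```python
# 17 rows of 10 ids plus a final row of 9 ids in the output above.
n_tokens = 179
tps = 50 / 1.3  # encoder frame rate scaled by the interpolation factor
print(round(n_tokens / tps, 2))  # 4.65
```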
How to decode
from transformers import AutoFeatureExtractor, AutoModel, AutoTokenizer
import librosa

model_id = "mesolitica/whisper-38TPS-VQ-32k-large-v3-turbo"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True, torch_dtype='auto').cuda()

y, sr = librosa.load('common_voice_ba_26517811.mp3', sr=feature_extractor.sampling_rate)
# Prompt the decoder with the language tag for the clip.
input_ids = tokenizer(
    '<|startoftranscript|><|ru|><|transcribe|><|notimestamps|>',
    add_special_tokens=False, return_tensors='pt')['input_ids']
features = feature_extractor([y], return_tensors='pt', return_attention_mask=True)
features['decoder_input_ids'] = input_ids
for k in features.keys():
    features[k] = features[k].cuda()

generate_kwargs = dict(
    **features,
    max_new_tokens=1024,
)
generation_output = model.generate(**generate_kwargs)
tokenizer.decode(generation_output[0])
Output,
<|startoftranscript|><|ru|><|transcribe|><|notimestamps|> Кубы саралды да был халга көтөн бит авшаблылы сусобы.<|endoftext|>
Evaluation
Evaluated on malaysia-ai/common_voice_17_0/test across up to 115 languages with the following conditions:
- Lowercase the text.
- Remove punctuation.
- Provide a language tag in the decoder input ids,
<|startoftranscript|><|{lang}|><|transcribe|><|notimestamps|>.
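A minimal sketch of how CER under these normalization conditions could be computed. The actual evaluation script lives in the repository linked below; the normalization regex and edit-distance implementation here are illustrative assumptions.

```python
import re

def normalize(text):
    # Lowercase and strip punctuation, per the evaluation conditions above.
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())

def cer(ref, hyp):
    # Character error rate: Levenshtein distance over characters / len(ref).
    ref, hyp = normalize(ref), normalize(hyp)
    d = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, hc in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (rc != hc))
    return d[len(hyp)] / max(len(ref), 1)

print(cer("Hello, World!", "hello world"))  # 0.0 after normalization
```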
lang: gl, samples: 9949, CER: 0.12746936406182688
lang: en, samples: 16379, CER: 0.13930802727944014
lang: ar, samples: 10458, CER: 0.4498251119693873
lang: kab, samples: 14972, CER: 0.48581894130967107
lang: ml, samples: 703, CER: 0.59715752797816
lang: kk, samples: 514, CER: 0.3686262317099611
lang: ltg, samples: 2904, CER: 0.3704451547056763
lang: fr, samples: 16145, CER: 0.1215426359959612
lang: de, samples: 16170, CER: 0.11970717111409096
lang: fi, samples: 1554, CER: 0.2981875307547178
lang: pt, samples: 9432, CER: 0.11991829325198566
lang: ia, samples: 1816, CER: 0.1234243307747447
lang: eu, samples: 13621, CER: 0.23408967336845624
lang: ro, samples: 3896, CER: 0.17291169198520565
lang: sw, samples: 12086, CER: 0.32794310064905186
lang: sv-SE, samples: 5247, CER: 0.23135083544102092
lang: ta, samples: 8263, CER: 0.4032804119000507
lang: et, samples: 2653, CER: 0.4244129183063203
lang: lg, samples: 11902, CER: 0.3398706560192189
lang: it, samples: 15154, CER: 0.1029483592408615
lang: mhr, samples: 15107, CER: 0.27873678475896824
lang: sr, samples: 1539, CER: 0.26248795278898246
lang: mr, samples: 1437, CER: 0.49891638502764163
lang: ka, samples: 12608, CER: 0.38903551694026145
lang: es, samples: 15848, CER: 0.074388150036706
lang: be, samples: 15878, CER: 0.1609634481903754
lang: lt, samples: 4753, CER: 0.2793459350677913
lang: ca, samples: 16389, CER: 0.10062952786076083
lang: eo, samples: 14773, CER: 0.13245656160734767
lang: tr, samples: 11235, CER: 0.24140118163354476
lang: hu, samples: 11435, CER: 0.2643090005095542
lang: ja, samples: 6033, CER: 0.8114595146900297
lang: br, samples: 2202, CER: 0.4657352936895148
lang: ne-NP, samples: 217, CER: 0.5571899093662568
lang: uz, samples: 12006, CER: 0.3621147820370711
lang: ru, samples: 10184, CER: 0.1876265121020025
lang: dv, samples: 2213, CER: 0.5850910934308908
lang: tt, samples: 4953, CER: 0.3535362079507922
lang: rw, samples: 14797, CER: 0.38581967349184976
lang: bn, samples: 9327, CER: 0.4986031294538938
lang: ug, samples: 6108, CER: 0.4020168137292696
lang: rm-sursilv, samples: 1361, CER: 0.34519570712670294
lang: bg, samples: 3201, CER: 0.25531019050842363
lang: ab, samples: 9108, CER: 0.4204321114541483
lang: uk, samples: 9915, CER: 0.21183776686832398
lang: mt, samples: 1662, CER: 0.43251963255565967
lang: fa, samples: 10292, CER: 0.3302326632713642
lang: pl, samples: 9186, CER: 0.2275658623491296
lang: bas, samples: 541, CER: 0.4256158944105182
lang: nl, samples: 11255, CER: 0.1560031992405498
lang: zh-CN, samples: 10335, CER: 0.7779944072493119
lang: tok, samples: 2175, CER: 0.1419295769904799
lang: ur, samples: 4052, CER: 0.3348359222212008
lang: sk, samples: 2593, CER: 0.2649395612011684
lang: oc, samples: 254, CER: 0.3589146405361618
lang: yue, samples: 2585, CER: 0.6774984222800331
lang: mrj, samples: 7102, CER: 0.35130887320730586
lang: fy-NL, samples: 3167, CER: 0.35231730479328544
lang: cs, samples: 9055, CER: 0.2251860845993416
lang: th, samples: 10982, CER: 0.5751500448982126
lang: ckb, samples: 5262, CER: 0.36440709164347096
lang: mn, samples: 1896, CER: 0.565680904485207
lang: ky, samples: 1604, CER: 0.428841554197287
lang: skr, samples: 1006, CER: 0.4678370147092392
lang: hy-AM, samples: 4281, CER: 0.4428607927893869
lang: sl, samples: 1242, CER: 0.23621835602663363
lang: vi, samples: 1077, CER: 0.39162283697260614
lang: hi, samples: 3151, CER: 0.3573449577147329
lang: nan-tw, samples: 2317, CER: 0.6440427148483474
lang: id, samples: 3633, CER: 0.11620038242874417
lang: cy, samples: 5371, CER: 0.4182510985929321
lang: yo, samples: 999, CER: 0.5972995868517021
lang: sah, samples: 1455, CER: 0.5176889487638698
lang: mk, samples: 1097, CER: 0.2997277491985316
lang: cv, samples: 1288, CER: 0.46207433986232854
lang: myv, samples: 479, CER: 0.38905552834264207
lang: da, samples: 2405, CER: 0.2632929572419957
lang: lv, samples: 6738, CER: 0.27549100298771156
lang: kmr, samples: 3900, CER: 0.3602596624320287
lang: tk, samples: 545, CER: 0.5672014310278388
lang: nn-NO, samples: 370, CER: 0.331428480826169
lang: ha, samples: 661, CER: 0.3763624199771743
lang: he, samples: 260, CER: 0.602466931740592
lang: dyu, samples: 59, CER: 0.5820750451529207
lang: gn, samples: 855, CER: 0.4866577861736295
lang: lij, samples: 694, CER: 0.37367314556900244
lang: hsb, samples: 444, CER: 0.4541521051073199
lang: pa-IN, samples: 487, CER: 0.5421673548546602
lang: el, samples: 1696, CER: 0.28102174399264546
lang: zgh, samples: 159, CER: 1.0
lang: as, samples: 551, CER: 0.5518150494581233
lang: sq, samples: 472, CER: 0.41118946650207994
lang: ko, samples: 338, CER: 0.8700089013555891
lang: ga-IE, samples: 517, CER: 0.5381003025454884
lang: cnh, samples: 763, CER: 0.44010903184681227
lang: sat, samples: 147, CER: 0.5332040869830634
lang: rm-vallader, samples: 462, CER: 0.35263989992992273
lang: or, samples: 670, CER: 0.9245558006844555
lang: mdf, samples: 104, CER: 0.4676185684354689
lang: af, samples: 62, CER: 0.39465688937555804
lang: ig, samples: 4, CER: 0.7426642872711422
lang: sc, samples: 232, CER: 0.41218399452470394
lang: tig, samples: 169, CER: 0.7765322344204647
lang: te, samples: 49, CER: 0.6286027325934781
lang: ps, samples: 199, CER: 0.4519742491508148
lang: am, samples: 205, CER: 0.816901029784296
lang: ast, samples: 162, CER: 0.22597461573546815
lang: os, samples: 50, CER: 0.6138043201690925
lang: lo, samples: 33, CER: 1.0
lang: az, samples: 33, CER: 0.43241566316230967
lang: ti, samples: 4, CER: 0.9958581349206349
lang: vot, samples: 6, CER: 0.5608543344527327
lang: nhi, samples: 5, CER: 0.5143715961457896
lang: yi, samples: 6, CER: 0.7985098631011839
lang: tw, samples: 9, CER: 0.49049342547470587
average CER: 0.41613331835195966
Source code
Source code at https://github.com/mesolitica/malaya-speech/tree/master/session/whisper-conv-38tps