---
tags:
- espnet
- audio
- automatic-speech-recognition
language: de
datasets:
- speechcatcher
license: mit
---

## Speechcatcher ESPnet streaming ASR model XL for German ASR

### `speechcatcher/speechcatcher_german_espnet_streaming_transformer_26k_train_size_xl_raw_de_bpe1024`

This model was trained by bmilde using the speechcatcher recipe in [espnet](https://github.com/speechcatcher-asr/espnet/tree/egs2-speechcatcher-de).

### Demo: How to use the model

Global installation:

```bash
sudo apt-get install portaudio19-dev python3.10-dev ffmpeg
# on mac:
# brew install portaudio ffmpeg

pip3 install git+https://github.com/speechcatcher-asr/speechcatcher

speechcatcher -m de_streaming_transformer_xl mediafile.mp4
# or with a microphone:
speechcatcher -m de_streaming_transformer_xl -l
```

Virtual environment:

```bash
virtualenv -p python3.10 speechcatcher_env
source speechcatcher_env/bin/activate

pip3 install git+https://github.com/speechcatcher-asr/speechcatcher

speechcatcher -m de_streaming_transformer_xl mediafile.mp4
# or with a microphone:
speechcatcher -m de_streaming_transformer_xl -l
```

# RESULTS

Tuda-de-raw: 2.76% CER (without LM)

Tuda-de-raw: 9.65% WER (without LM)

Note: Tuda-de-raw results are based on the raw tuda-de test utterances, without the normalization step. They may not be directly comparable to regular tuda-de results.

# Speechcatcher training

Speechcatcher models are trained using Whisper large as a teacher model:

![Speechcatcher Teacher/student training](https://github.com/speechcatcher-asr/speechcatcher/raw/main/speechcatcher_training.svg)

See [speechcatcher-data](https://github.com/speechcatcher-asr/speechcatcher-data) for code and more information on replicating the training process.
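For readers unfamiliar with the metrics in the RESULTS section, the following is a minimal sketch of how character error rate (CER) and word error rate (WER) are commonly computed: the edit distance between reference and hypothesis, divided by the reference length. The function names are hypothetical, and this is not the scoring script used to produce the numbers above; in particular, this sketch counts spaces as characters and applies no text normalization, which an actual evaluation pipeline may handle differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    # prev[j] holds the distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    reference = "das ist ein Test"
    hypothesis = "das ist kein Test"
    print(f"WER: {wer(reference, hypothesis):.2%}")  # WER: 25.00%
    print(f"CER: {cer(reference, hypothesis):.2%}")  # CER: 6.25%
```

One wrong word out of four gives a WER of 25%, while the single inserted character gives a CER of 6.25%, which illustrates why CER is usually much lower than WER on the same output.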
# Sponsors

Speechcatcher was graciously funded by Media Tech Lab by Media Lab Bayern (@media-tech-lab).

# Citing

```BibTex
@misc{milde2023speechcatcher,
  author = {Milde, Benjamin},
  title = {Speechcatcher},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/speechcatcher-asr/speechcatcher}},
}
```

## ASR config
```
config: conf/train_asr_streaming_transformer_size_l.yaml
print_config: false
log_level: INFO
dry_run: false
iterator_type: sequence
output_dir: exp/asr_train_asr_streaming_transformer_size_l_raw_de_bpe1024
ngpu: 1
seed: 0
num_workers: 1
num_att_plot: 0
dist_backend: nccl
dist_init_method: env://
dist_world_size: 4
dist_rank: 0
local_rank: 0
dist_master_addr: localhost
dist_master_port: 55625
dist_launcher: null
multiprocessing_distributed: true
unused_parameters: false
sharded_ddp: false
cudnn_enabled: true
cudnn_benchmark: false
cudnn_deterministic: true
collect_stats: false
write_collected_feats: false
max_epoch: 20
patience: 3
val_scheduler_criterion:
- valid
- acc
early_stopping_criterion:
- valid
- acc
- max
best_model_criterion:
-   - valid
    - acc
    - max
keep_nbest_models: 10
nbest_averaging_interval: 0
grad_clip: 5
grad_clip_type: 2.0
grad_noise: false
accum_grad: 1
no_forward_run: false
resume: true
train_dtype: float32
use_amp: false
log_interval: null
use_matplotlib: true
use_tensorboard: true
create_graph_in_tensorboard: false
use_wandb: false
wandb_project: null
wandb_id: null
wandb_entity: null
wandb_name: null
wandb_model_log_interval: -1
detect_anomaly: false
pretrain_path: null
init_param: []
ignore_init_mismatch: false
freeze_param: []
num_iters_per_epoch: null
batch_size: 64
valid_batch_size: null
batch_bins: 1000000
valid_batch_bins: null
train_shape_file:
- exp/asr_stats_raw_de_bpe1024/train/speech_shape
- exp/asr_stats_raw_de_bpe1024/train/text_shape.bpe
valid_shape_file:
- exp/asr_stats_raw_de_bpe1024/valid/speech_shape
- exp/asr_stats_raw_de_bpe1024/valid/text_shape.bpe
batch_type: folded
valid_batch_type: null
fold_length:
- 80000
- 150
sort_in_batch: descending
sort_batch: descending
multiple_iterator: false
chunk_length: 500
chunk_shift_ratio: 0.5
num_cache_chunks: 1024
train_data_path_and_name_and_type:
-   - dump/raw/train/wav.scp
    - speech
    - sound
-   - dump/raw/train/text
    - text
    - text
valid_data_path_and_name_and_type:
-   - dump/raw/dev/wav.scp
    - speech
    - sound
-   - dump/raw/dev/text
    - text
    - text
allow_variable_data_keys: false
max_cache_size: 0.0
max_cache_fd: 32
valid_max_cache_size: null
exclude_weight_decay: false
exclude_weight_decay_conf: {}
optim: adam
optim_conf:
    lr: 0.001
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 25000
token_list:
- <blank> - <unk> - ',' - . - t - ▁ - e - en - s - n - ▁ich - ▁das - ▁und - ▁die - er - ▁ist - ▁auch - ▁so - st - ▁der - ▁nicht - ▁es - ▁ein - r - ▁in - f - ▁dann - ▁ja - d - ▁da - g - h - m - o - u - b - ▁wir - ▁zu - ▁du - ▁ge - ▁Und - i - a - ▁mit - ▁den - in - ▁man - l - ▁auf - ▁dass - sch - ▁jetzt - '?' - ge - ▁was - ▁er - ▁Ja - ▁hat - '-' - p - ▁war - ▁eine - ▁F - ▁aber - ▁mal - ▁oder - y - ▁noch - te - ung - ▁haben - ▁Ich - ▁be - ▁Das - ▁wie - ä - ▁an - ▁habe - k - ▁von - ▁sich - ▁K - al - ▁wenn - la - ▁schon - ig - ra - lich - re - de - ch - ▁für - it - ▁Also - w - ▁A - es - ▁sind - ▁ver - le - or - ▁sie - ▁B - ü - ▁also - ▁ganz - ▁T - ▁im - ▁dem - ter - an - ck - ▁St - ▁aus - ▁G - ▁kann - ▁bei - ▁halt - ▁H - el - ▁immer - z - ▁einfach - ▁P - ö - ▁S - ▁weil - ▁mir - se - ▁f - ut - ten - ▁wo - ▁Sch - us - ▁vor - ur - ▁sehr - ri - kt - ing - ▁E - il - ▁gut - ▁mich - ▁Aber - 'on' - und - cht - ▁als - den - ar - ie - um - ▁uns - ste - ▁Da - hr - ▁über - be - ▁einen - ▁Be - ▁ihr - is - ▁wieder - ▁glaube - ▁Ge - at - ▁irgendwie - li - ▁nur - we - ro - ▁bisschen - he - ▁mehr - ▁M - tz - ▁muss - gen - ▁sagen - ben - ▁wirklich - ▁alle - nd - ▁wird - ▁gibt - ▁um - ▁m - ▁natürlich - ▁viel - me - nt - et - ▁diese - ▁U - '0' - ▁sein - ▁nach - ▁hier - ▁meine - ern - lo - ion - ▁eigentlich - ▁O - ▁machen - ▁bin - ▁So - ll - ▁hast - ▁weiß - ▁Re - c - ▁I - ▁sch - ▁C - ▁vielleicht - iert - ach - ▁b - ne - x - ze - rei - ru - ma - ▁zum - ▁finde - ß - ▁N - ▁Die - rt - ich - ▁Ma - uch - ▁eben - rü - ▁Ver - ein - ▁In - R - ieren - ▁Ha - ssen - ft - chen - am - di - der - hl - ▁Es - ▁gesagt - zu - ▁ne - ▁An - ▁k - ▁1 - ▁am - hn - ▁gerade - pp - her - ▁alles - nen - ▁geht
- ▁genau - ha - ▁Jahr - ▁re - ▁werden - ▁w - ▁Z - isch - ▁p - ▁Er - ke - ▁Wir - au - mm - ik - ▁mein - ▁dir - ▁einem - un - ▁würde - ▁We - ▁zwei - v - ▁doch - ▁keine - ▁erst - na - and - ▁gar - ▁hin - ▁durch - ▁V - kommen - ell - ul - end - ▁können - j - fe - ▁richtig - ff - ▁Me - ▁andere - lie - '...' - wi - ol - art - ▁Leute - ▁Zeit - ▁Ein - ran - ner - ▁ab - nk - ation - ▁viele - ▁g - S - rie - ▁ob - im - ver - ür - rk - ▁einer - men - ▁ent - iv - lei - ▁gemacht - sp - ▁hatte - ▁weiter - sten - che - ang - all - ir - hör - ▁Was - aus - ier - ▁Ne - ▁Li - ▁hab - ass - L - igen - zi - ungen - ▁Spiel - ▁will - ▁unter - ag - ▁macht - ber - ▁Sp - zen - ▁denn - ken - ▁des - ▁Ka - lle - id - sen - ▁dich - ▁st - ▁Du - ▁kommt - spiel - ▁Fall - ▁Man - ▁Se - ▁W - ▁dieser - ▁Ko - ga - ▁De - ▁groß - ▁Le - ▁schön - ▁La - ▁jeden - ▁D - ▁Genau - gt - ▁dieses - ungs - ▁J - pro - ▁Co - ▁Beispiel - ▁heißt - ▁s - ist - rä - ho - ▁damit - ▁Wo - ▁unsere - ▁le - ert - '5' - ni - tt - gel - ▁her - ve - ▁sondern - mp - reich - ▁Sa - '''' - ▁lang - ▁rein - ▁neu - ▁sagt - ▁tatsächlich - ▁kein - är - nehmen - ▁bis - elt - ad - teil - ▁euch - ta - ▁a - ▁anderen - ▁raus - op - ▁Der - ige - arbeit - ▁Film - ▁Ba - ▁heute - ▁wäre - ▁nochmal - ▁ange - ▁Sie - ick - ▁of - ler - ▁un - ische - weise - lä - kl - ▁Na - iß - wa - ▁wer - ▁Ding - ▁okay - ▁Ra - halt - ▁we - ▁Pa - ▁Thema - heit - ▁ko - ▁Dann - ▁diesen - schaft - ▁möchte - ▁hätte - lu - ▁Al - bar - ▁Tag - mo - ▁Wie - ▁waren - ▁sp - ▁wurde - ▁Auf - ce - ▁Frage - ▁kannst - wo - ▁Mi - ▁deine - ▁To - mi - ▁dazu - äng - ▁bist - ischen - ▁Mo - ▁ihn - 'no' - zieh - ▁Ab - ▁kommen - ▁Menschen - anz - ▁Wenn - ▁ha - ▁Vor - ▁Ro - stell - ▁Zu - ▁je - rau - eln - ab - hin - ka - schau - ▁Pro - ger - P - ▁Bo - ▁gerne - ko - nis - ▁drei - ▁gleich - ld - ▁klar - ack - ▁Aus - ün - ▁nie - A - ▁tr - ▁seine - ▁Mit - geben - ▁soll - '4' - ▁diesem - lau - ▁müssen - ▁kleine - ▁kurz - mmer - ment - stellen - ▁Wa - ▁Podcast - ▁Wi - ▁the - ▁Woche - ▁guck - ▁quasi - 
▁Ho - mal - ▁sei - ▁Po - krieg - aff - ▁nächste - itz - ▁20 - tag - '9' - ▁Ende - richt - uck - ör - ▁2 - dem - mpf - vi - 'off' - ▁Leben - ▁wichtig - ▁gesehen - ▁gehen - ress - ▁sag - M - ▁echt - ▁etwas - stand - zähl - führ - T - ▁wenig - ▁zusammen - ▁paar - ▁Di - ▁einmal - bo - ▁sehen - ▁Sachen - ▁Kon - bi - ▁dabei - gend - pass - ic - ▁könnte - ▁Weil - zeit - ▁denke - F - ▁Folge - man - ▁wollte - kauf - ▁weg - ▁3 - ▁selbst - '1' - hol - co - ▁wollen - bau - '2' - B - ▁wahrscheinlich - ank - ▁Mal - ▁letzten - fahren - ▁vom - ▁Do - hi - ▁eher - D - ▁selber - ord - ▁super - ▁musst - ▁drauf - ▁jemand - '8' - ▁gegen - ▁überhaupt - ▁The - ▁Okay - ▁beim - ▁sage - pa - ▁dafür - vor - ▁Frau - ▁hatten - ▁drin - '6' - ▁sozusagen - iz - ▁fand - ▁Tra - folg - ▁Nach - ▁tun - ▁dein - ität - C - ▁Oder - ▁zurück - ▁Nein - po - ▁cool - ▁sowas - ▁sieht - gehen - schi - ▁Gott - ▁schnell - form - ▁ihm - ▁besser - ▁gab - wä - ▁äh - ▁Kinder - änder - ▁sollte - ▁Jo - ▁voll - ▁War - ▁kenne - ▁zwar - ▁total - ▁welche - ▁passiert - ▁Hand - fall - ▁irgendwann - ▁Problem - war - qu - fühl - ▁Wer - ▁wissen - ▁dort - ▁jeder - ca - ▁deswegen - sprech - ▁davon - ▁damals - trag - ▁nämlich - ▁Punkt - ▁Welt - ▁abge - '7' - log - ▁sogar - ▁kam - legen - ▁Moment - igkeit - ▁konnte - ▁komm - ▁gewesen - ▁anders - ▁Bi - K - ▁eigene - ▁liebe - ▁Teil - ▁Lo - ▁toll - ▁Arbeit - ▁Seite - genommen - ▁to - ▁alt - ▁trotzdem - ▁gehört - ▁Jetzt - ▁mache - ▁Dr - ▁relativ - sicht - ▁steht - ▁Auto - ▁darüber - nehm - ▁irgendwas - ▁ohne - ▁Geld - ▁Euro - ieß - suche - ▁vier - einander - ▁Grund - ▁Gefühl - gestellt - ▁sa - ativ - G - ▁darauf - I - ▁All - ▁Anfang - ▁darf - ▁Freund - ▁direkt - ▁irgendwo - ▁letzte - ▁schlecht - ▁manchmal - ▁Bild - ▁Geschichte - ▁interessant - E - ▁komplett - ▁Ahnung - bringen - nutz - bild - ▁frag - V - ▁Kind - ▁meisten - ▁gehabt - ▁gedacht - ▁erstmal - ▁fast - ▁stimmt - '3' - laufen - ▁bestimmt - zahl - ▁Über - kommt - gegangen - setzen - ▁funktioniert - ▁spielen - ▁Person - ▁Sinn - 
▁dachte - ▁fünf - ▁hoch - bereit - ▁brauche - ▁zwischen - ▁Spaß - ▁spannend - ▁ehrlich - ▁krass - ▁schreib - ▁zumindest - zeug - ▁Musik - W - fahr - ▁solche - ▁Deutschland - ▁gespielt - geschrieben - Ä - ▁später - Y - O - H - '!' - U - N - Q - Ö - X - Z - J - '%' - Ü - é - « - » - '&' - à - à - ş - q - ¤ - Ÿ - € - è - ı - ç - ú - ë - ¶ - á - ć - — - õ - ğ - í - ° - ô - _ - ó - / - å - $ - ́ - û - › - ê - ‹ - '"' - ñ - Ş - č - ) - É - μ - ø - š - о - ł - ù - ã - ā - © - а - ':' - е - œ - и - н - â - î - т - ń - р - к - 你 - æ - „ - Č - с - ♪ - д - Š - в - ï - İ - л - À - у - ь - я - м - ę - ś - ž - п - '=' - ō - ř - Æ - ш - з - ы - ū - ș - Ø - '~' - ì - ò - ο - ч - г - ý - ̄ - ц - Х - ż - З - б - ¡ - Н - ă - ̃ - К - ж - ไ - ồ - ♫ - ر - х - ン - Ç - § - ⁄ - + - '*' - Å - і - Á - ī - џ - ู - ; - '>' - Î - ą - Đ - Ȗ - Ε - έ - δ - ι - λ - ς - τ - υ - ύ - О - Т - و - ک - ں - ด - ม - ่ - ṣ - “ - ♥ - き - つ - ぶ - ら - チ - ッ - ホ - ロ - 中 - 以 - 佢 - 利 - 厲 - 句 - 可 - 吃 - 国 - 士 - 好 - 安 - 害 - 度 - 手 - 晃 - 法 - Ć - ě - Б - ج - 救 - ά - – - ダ - 制 - <sos/eos>
init: null
input_size: null
ctc_conf:
    dropout_rate: 0.0
    ctc_type: builtin
    reduce: true
    ignore_nan_grad: null
    zero_infinity: true
joint_net_conf: null
use_preprocessor: true
token_type: bpe
bpemodel: data/de_token_list/bpe_unigram1024/bpe.model
non_linguistic_symbols: null
cleaner: null
g2p: null
speech_volume_normalize: null
rir_scp: null
rir_apply_prob: 1.0
noise_scp: null
noise_apply_prob: 1.0
noise_db_range: '13_15'
short_noise_thres: 0.5
frontend: default
frontend_conf:
    n_fft: 512
    win_length: 400
    hop_length: 160
    fs: 16k
specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range:
    - 0
    - 30
    num_freq_mask: 2
    apply_time_mask: true
    time_mask_width_range:
    - 0
    - 40
    num_time_mask: 2
normalize: global_mvn
normalize_conf:
    stats_file: exp/asr_stats_raw_de_bpe1024/train/feats_stats.npz
model: espnet
model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1
    length_normalized_loss: false
preencoder: null
preencoder_conf: {}
encoder: contextual_block_transformer
encoder_conf:
    output_size: 256
    attention_heads: 8
    linear_units: 2048
    num_blocks: 22
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.0
    input_layer: conv2d
    normalize_before: true
    block_size: 40
    hop_size: 16
    look_ahead: 16
    init_average: true
    ctx_pos_enc: true
postencoder: null
postencoder_conf: {}
decoder: transformer
decoder_conf:
    attention_heads: 8
    linear_units: 2048
    num_blocks: 12
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.0
    src_attention_dropout_rate: 0.0
preprocessor: default
preprocessor_conf: {}
required:
- output_dir
- token_list
version: '202211'
distributed: true
```
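To give a rough feel for what the frontend and streaming encoder settings in the config mean in terms of time, the sketch below converts them to milliseconds and seconds. The values for `fs`, `win_length`, `hop_length`, `block_size`, `hop_size` and `look_ahead` are copied from the config. Two things are assumptions and not stated in the config: that the `conv2d` input layer subsamples the time axis by a factor of 4, and that the three encoder block settings are counted in subsampled encoder frames. The function name is hypothetical, and how ESPnet slices blocks at inference time may differ in detail.

```python
# Values copied from the ASR config above.
SAMPLE_RATE_HZ = 16_000  # frontend_conf.fs: 16k
WIN_LENGTH = 400         # frontend_conf.win_length, in samples
HOP_LENGTH = 160         # frontend_conf.hop_length, in samples
BLOCK_SIZE = 40          # encoder_conf.block_size
HOP_SIZE = 16            # encoder_conf.hop_size
LOOK_AHEAD = 16          # encoder_conf.look_ahead

# ASSUMPTION (not in the config): input_layer conv2d reduces the frame rate by 4,
# and the block settings above are counted in these subsampled frames.
SUBSAMPLING_FACTOR = 4


def streaming_timing():
    """Convert the frontend and encoder block settings into durations."""
    window_ms = 1000 * WIN_LENGTH / SAMPLE_RATE_HZ
    frame_shift_ms = 1000 * HOP_LENGTH / SAMPLE_RATE_HZ
    encoder_frame_ms = frame_shift_ms * SUBSAMPLING_FACTOR
    return {
        "window_ms": window_ms,
        "frame_shift_ms": frame_shift_ms,
        "encoder_frame_ms": encoder_frame_ms,
        "block_s": BLOCK_SIZE * encoder_frame_ms / 1000,
        "hop_s": HOP_SIZE * encoder_frame_ms / 1000,
        "look_ahead_s": LOOK_AHEAD * encoder_frame_ms / 1000,
    }


if __name__ == "__main__":
    for name, value in streaming_timing().items():
        print(f"{name}: {value:g}")
```

Under these assumptions, each encoder block spans about 1.6 s of audio, advances by about 0.64 s per step, and looks about 0.64 s into the future, on top of a 25 ms analysis window with a 10 ms frame shift.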