Update README.md
README.md CHANGED
@@ -9,7 +9,7 @@ language: multilingual
 datasets:
 - owsm_v3.2_ctc
 base_model:
-- espnet/owsm_ctc_v3.
+- espnet/owsm_ctc_v3.2_ft_1B
 license: cc-by-4.0
 ---

@@ -27,4 +27,129 @@ espnet_model_zoo

**The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1

### Example script for batched inference

`Speech2TextGreedySearch` now provides a unified batched inference method `batch_decode`. It performs CTC greedy decoding for a batch of short-form or long-form audios. If an audio is shorter than 30s, it will be padded to 30s; otherwise it will be split into overlapping segments (the same approach as the "long-form ASR/ST" method below).

```python
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device="cuda",
    use_flash_attn=False,  # set to True for better efficiency if flash attn is installed and dtype is float16 or bfloat16
    lang_sym='<eng>',
    task_sym='<asr>',
)

res = s2t.batch_decode(
    "audio.wav",  # a single audio (path or 1-D array/tensor) as input
    batch_size=16,
    context_len_in_secs=4,
)  # res is a single str, i.e., the predicted text without special tokens

res = s2t.batch_decode(
    ["audio1.wav", "audio2.wav", "audio3.wav"],  # a list of audios as input
    batch_size=16,
    context_len_in_secs=4,
)  # res is a list of str

# Please check the code of `batch_decode` for all supported inputs
```
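`batch_decode` also accepts audio that is already loaded in memory (the "1-D array/tensor" input mentioned in the comments above). A minimal sketch, reusing the `s2t` object created above and assuming a file `audio.wav` that may not be sampled at 16kHz, so it is resampled first:

```python
import librosa

# Load and resample to 16kHz, the rate the model expects (see the note in the next section).
speech, _ = librosa.load("audio.wav", sr=16000)  # 1-D numpy array

res = s2t.batch_decode(
    speech,  # pre-loaded 1-D array instead of a file path
    batch_size=16,
    context_len_in_secs=4,
)
print(res)
```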

### Example script for short-form ASR/ST/LID

Our models are trained on 16kHz audio with a fixed duration of 30s. When using the pre-trained model, please ensure the input speech is 16kHz and pad or truncate it to 30s.

```python
import librosa
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

# NOTE: OWSM-CTC is trained on 16kHz audio with a fixed 30s duration.
# Please ensure the input has the correct sample rate; otherwise resample it to 16kHz before feeding it to the model.
speech, rate = librosa.load("xxx.wav", sr=16000)
speech = librosa.util.fix_length(speech, size=(16000 * 30))

res = s2t(speech)[0]
print(res)
```
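The same interface covers the other short-form tasks by switching the language and task tokens. The sketch below is an assumption about the token set rather than a documented example: `<deu>` is taken to be the source-language symbol and `<st_eng>` the "translate to English" task symbol; check the model's vocabulary before relying on these names. It reuses the padded `speech` array from the block above.

```python
# Hypothetical speech-translation setup: German speech -> English text.
# <deu> and <st_eng> are assumed token names; verify them against the model's token list.
s2t_st = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<deu>',     # language spoken in the input audio
    task_sym='<st_eng>',  # assumed task token for translation into English
)
print(s2t_st(speech)[0])
```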

### Example script for long-form ASR/ST

```python
import soundfile as sf
import torch
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

context_len_in_secs = 4  # left and right context when doing buffered inference
batch_size = 32  # depends on the GPU memory
s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device='cuda' if torch.cuda.is_available() else 'cpu',
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

speech, rate = sf.read(
    "xxx.wav"
)

text = s2t.decode_long_batched_buffered(
    speech,
    batch_size=batch_size,
    context_len_in_secs=context_len_in_secs,
)
print(text)
```
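`sf.read` returns audio at the file's native sample rate, while the model expects 16kHz input (see the note in the short-form section). A minimal sketch of a resampling guard to place before `decode_long_batched_buffered`, assuming librosa is available and the audio is mono:

```python
import librosa
import soundfile as sf

speech, rate = sf.read("xxx.wav")  # assumes a mono recording
if rate != 16000:
    # Resample to the 16kHz rate the model was trained on.
    speech = librosa.resample(speech, orig_sr=rate, target_sr=16000)
    rate = 16000
```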

### Example of CTC forced alignment using `ctc-segmentation`

CTC segmentation can be efficiently applied to audio of an arbitrary length.

```python
import soundfile as sf
from espnet2.bin.s2t_ctc_align import CTCSegmentation
from espnet_model_zoo.downloader import ModelDownloader

# Download the model first
d = ModelDownloader()
downloaded = d.download_and_unpack("espnet/owsm_ctc_v3.2_ft_1B")

aligner = CTCSegmentation(
    **downloaded,
    fs=16000,
    ngpu=1,
    batch_size=32,  # batched parallel decoding; reduce it if your GPU memory is smaller
    kaldi_style_text=True,
    time_stamps="auto",  # "auto" can be more accurate than "fixed" when converting token indices to timestamps
    lang_sym="<eng>",
    task_sym="<asr>",
    context_len_in_secs=2,  # left and right context in buffered decoding
)

speech, rate = sf.read(
    "./test_utils/ctc_align_test.wav"
)
print(f"speech duration: {len(speech) / rate : .2f} seconds")
text = """
utt1 THE SALE OF THE HOTELS
utt2 IS PART OF HOLIDAY'S STRATEGY
utt3 TO SELL OFF ASSETS
utt4 AND CONCENTRATE ON PROPERTY MANAGEMENT
"""

segments = aligner(speech, text)
print(segments)
```
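With `kaldi_style_text=True`, `print(segments)` shows one line per utterance with its timing information (the exact columns depend on the `CTCSegmentation` configuration). A small, hypothetical follow-up that simply saves that output to a file:

```python
# Write the printed alignment to a file for later use.
with open("segments.txt", "w") as f:
    f.write(str(segments))
```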