openai/whisper-large-v3 · Word timestamps and "return

Nov 17, 2023

When setting up a Whisper pipeline like this:

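from transformers import pipeline

# self.model, self.processor, self.torch_dtype and self.device are set elsewhere in my class
pipe = pipeline(
    "automatic-speech-recognition",
    model=self.model,
    tokenizer=self.processor.tokenizer,
    feature_extractor=self.processor.feature_extractor,
    torch_dtype=self.torch_dtype,
    device=self.device,
)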

and then calling it like this:

pipe(
    file_path,
    chunk_length_s=30,
    batch_size=1,
    return_timestamps="word",
    return_language=True,
)

yields the output:

{'text': "...", 'chunks': [{'text': ' My', 'timestamp': (0.0, 0.42)}, {'text': ' name', 'timestamp': (0.42, 0.56)}, {'text': ' is', 'timestamp': (0.56, 0.76)}, {'text': ' Sojdri,', 'timestamp': (0.76, 1.38)}, {'text': " I'm", 'timestamp': (1.38, 1.68)}, {'text': ' 12', 'timestamp': (1.68, 1.9)}, {'text': ' years', 'timestamp': (1.9, 2.14)}, {'text': ' old,', 'timestamp': (2.14, 2.68)}, {'text': ' I', 'timestamp': (2.68, 2.82)}, {'text': ' love', 'timestamp': (2.82, 3.04)}, {'text': ' my', 'timestamp': (3.04, 3.3)}, {'text': ' mom,', 'timestamp': (3.3, 3.8)}, {'text': ' my', 'timestamp': (3.8, 3.9)}, {'text': ' dad,', 'timestamp': (3.9, 4.3)}, {'text': ' my', 'timestamp': (4.3, 4.32)}, {'text': ' older', 'timestamp': (4.32, 4.6)}, {'text': ' brother,', 'timestamp': (4.6, 5.12)}, {'text': ' Ryan,', 'timestamp': (5.12, 5.44)}, {'text': " who's", 'timestamp': (5.44, 5.54)}, {'text': ' 16', 'timestamp': (5.54, 5.82)}, {'text': ' years', 'timestamp': (5.82, 6.08)}, {'text': ' old.', 'timestamp': (6.08, 6.62)}, {'text': ' My', 'timestamp': (6.68, 6.86)}, {'text': ' favorite', 'timestamp': (6.86, 7.16)}, {'text': ' subject', 'timestamp': (7.16, 7.66)}, {'text': ' is', 'timestamp': (7.66, 8.16)}, {'text': ' history,', 'timestamp': (8.16, 9.28)}, {'text': ' and', 'timestamp': (9.28, 9.44)}, {'text': ' my', 'timestamp': (9.44, 9.62)}, {'text': ' favorite', 'timestamp': (9.62, 9.82)}, {'text': ' sport', 'timestamp': (9.82, 10.3)}, {'text': ' is', 'timestamp': (10.3, 10.6)}, {'text': ' hockey.', 'timestamp': (10.6, 12.38)}]}

where the language is missing from every chunk.

But if I do:

pipe(
    file_path,
    chunk_length_s=30,
    batch_size=1,
    return_timestamps=True,
    return_language=True,
)

I get:

{'text': "...", 'chunks': [{'language': 'english', 'timestamp': (0.0, 6.48), 'text': " My name is Sojdri, I'm 12 years old, I love my mom, my dad, my older brother, Ryan, who's 16 years old."}, {'language': 'english', 'timestamp': (6.48, 11.52), 'text': ' My favorite subject is history, and my favorite sport is hockey.'}]}

in which the language is preserved.

Is this a bug, or is there something that prevents return_timestamps="word" and return_language=True from working at the same time?
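
In the meantime, one stopgap I can think of is to run the pipeline twice (once with return_timestamps="word", once with return_timestamps=True) and copy the language from the segment-level chunks onto the word-level chunks by timestamp. Below is a minimal sketch of that merge step only. attach_language is a hypothetical helper of my own, not part of transformers, and it assumes the chunk dictionaries have the shape shown in the two outputs above and that a word belongs to the segment containing its start time.

```python
from typing import Any, Dict, List

Chunk = Dict[str, Any]


def attach_language(word_chunks: List[Chunk], segment_chunks: List[Chunk]) -> List[Chunk]:
    """Return copies of the word chunks with a 'language' key added.

    Hypothetical helper (not a transformers API). A word is matched to the
    first segment whose time span contains the word's start time. Words that
    fall in no segment get language=None.
    """
    labelled = []
    for word in word_chunks:
        word_start = word["timestamp"][0]
        language = None
        for segment in segment_chunks:
            seg_start, seg_end = segment["timestamp"]
            # Assumption: a segment's end may be None if no final timestamp
            # was predicted; treat that segment as open-ended.
            if seg_start <= word_start and (seg_end is None or word_start < seg_end):
                language = segment.get("language")
                break
        labelled.append({**word, "language": language})
    return labelled


# Two word chunks and the two segment chunks taken from the outputs above.
words = [
    {"text": " old.", "timestamp": (6.08, 6.62)},
    {"text": " My", "timestamp": (6.68, 6.86)},
]
segments = [
    {"language": "english", "timestamp": (0.0, 6.48), "text": "..."},
    {"language": "english", "timestamp": (6.48, 11.52), "text": "..."},
]

for chunk in attach_language(words, segments):
    print(chunk)
```

This doubles the inference cost, and the word and segment boundaries do not line up exactly (for example ' old.' ends at 6.62 while the first segment ends at 6.48), which is why the sketch matches on the word's start time rather than requiring the whole word to fit inside a segment.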