## 1. Setup & Installation

In [1]:
%%writefile requirements.txt
git+https://github.com/openai/whisper.git@8cf36f3508c9acd341a45eb2364239a3d81458b9

Overwriting requirements.txt


In [None]:
!pip install -r requirements.txt --upgrade
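
Both `whisper` (when given a file path) and the `ffmpeg_read` helper used in the handler below shell out to the `ffmpeg` binary, which `pip` does not install. A minimal sketch of a pre-flight check follows; the helper name `ffmpeg_available` is illustrative and not part of any library, and the suggested install command assumes a Debian-like system.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


if not ffmpeg_available():
    # Assumption: Debian/Ubuntu image; use your platform's package manager otherwise.
    print("ffmpeg not found; install it first, e.g. `apt-get install -y ffmpeg`")
```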

## 2. Test model

In [3]:
!wget https://cdn-media.huggingface.co/speech_samples/sample1.flac

--2022-09-23 20:32:18--  https://cdn-media.huggingface.co/speech_samples/sample1.flac
Resolving cdn-media.huggingface.co (cdn-media.huggingface.co)... 13.32.151.62, 13.32.151.23, 13.32.151.60, ...
Connecting to cdn-media.huggingface.co (cdn-media.huggingface.co)|13.32.151.62|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 282378 (276K) [audio/flac]
Saving to: ‘sample1.flac’


2022-09-23 20:32:18 (78.7 MB/s) - ‘sample1.flac’ saved [282378/282378]



In [9]:
import whisper

model = whisper.load_model("large")
result = model.transcribe("sample1.flac")
print(result["text"])

100%|█████████████████████████████████████| 2.87G/2.87G [01:11<00:00, 42.9MiB/s]


Detected language: english
 going along slushy country roads and speaking to damp audiences in drafty school rooms day after day for a fortnight. he'll have to put in an appearance at some place of worship on sunday morning and he can come to us immediately afterwards.
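
Besides `"text"`, the dictionary returned by `transcribe` also carries a `"language"` entry and a `"segments"` list whose items include `"start"`, `"end"` (in seconds) and `"text"`. The sketch below formats such segments as timestamped lines. `format_segments` is a hypothetical helper, and the `example` dictionary is a hand-written stand-in with made-up values, not real model output.

```python
from typing import Any, Dict, List


def format_segments(result: Dict[str, Any]) -> List[str]:
    """Render each segment as '[start -> end] text' with times in seconds."""
    lines = []
    for seg in result.get("segments", []):
        lines.append(f"[{seg['start']:6.2f} -> {seg['end']:6.2f}] {seg['text'].strip()}")
    return lines


# Hand-written stand-in for a `model.transcribe(...)` result (values are made up).
example = {
    "text": " hello world. goodbye.",
    "language": "en",
    "segments": [
        {"start": 0.0, "end": 1.5, "text": " hello world."},
        {"start": 1.5, "end": 2.25, "text": " goodbye."},
    ],
}
for line in format_segments(example):
    print(line)
```

With a real result, `format_segments(result)` can be called directly on the value returned by `model.transcribe("sample1.flac")`.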


## 3. Create Custom Handler for Inference Endpoints


In [5]:
%%writefile handler.py
from typing import Dict
from transformers.pipelines.audio_utils import ffmpeg_read
import whisper
import torch

SAMPLE_RATE = 16000



class EndpointHandler():
    def __init__(self, path=""):
        # load the model
        self.model = whisper.load_model("medium")


    def __call__(self, data: Dict[str, bytes]) -> Dict[str, str]:
        """
        Args:
            data (:obj:`dict`):
                request payload; the raw audio file bytes are expected
                under the "inputs" key
        Return:
            A :obj:`dict` with the transcription under the "text" key
        """
        # process input: decode the audio bytes into a mono float32 waveform at 16 kHz
        inputs = data.pop("inputs", data)
        audio_nparray = ffmpeg_read(inputs, SAMPLE_RATE)
        audio_tensor = torch.from_numpy(audio_nparray)

        # run inference pipeline
        result = self.model.transcribe(audio_tensor)

        # postprocess the prediction
        return {"text": result["text"]}

Overwriting handler.py
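
The handler reads its payload with `data.pop("inputs", data)`: if the request is wrapped as `{"inputs": ...}` it takes that value, otherwise it falls back to the whole object. The sketch below isolates just that lookup. `extract_inputs` is a hypothetical helper that mirrors the line in `handler.py`, and the byte strings are placeholders rather than real audio.

```python
from typing import Any, Dict


def extract_inputs(data: Dict[str, Any]) -> Any:
    """Mirror the handler's lookup: use data["inputs"] if present, else `data` itself."""
    return data.pop("inputs", data)


wrapped = {"inputs": b"fake-audio-bytes"}
print(extract_inputs(wrapped))  # the value stored under "inputs"
print(wrapped)                  # "inputs" has been removed from the caller's dict

unwrapped = {"other": 1}
print(extract_inputs(unwrapped) is unwrapped)  # falls back to the dict itself
```

Note that `pop` mutates the caller's dictionary, and that in the fallback case the handler would hand a dict to `ffmpeg_read`, which expects bytes, so requests should carry the audio under `"inputs"`.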


Test the custom handler locally:

In [6]:
from handler import EndpointHandler

# init handler
my_handler = EndpointHandler(path=".")

In [None]:
# read the audio file as raw bytes, the same payload the handler receives
with open("sample1.flac", "rb") as f:
    request = {"inputs": f.read()}


# test the handler
pred = my_handler(request)

  audio_tensor= torch.from_numpy(audio_nparray)


Detected language: english


In [3]:
pred

{'text': " going along slushy country roads and speaking to damp audiences in draughty school rooms day after day for a fortnight. He'll have to put in an appearance at some place of worship on Sunday morning, and he can come to us immediately afterwards."}

In [4]:
import json

json.dumps(pred)

'{"text": " going along slushy country roads and speaking to damp audiences in draughty school rooms day after day for a fortnight. He\'ll have to put in an appearance at some place of worship on Sunday morning, and he can come to us immediately afterwards."}'
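
Once the handler is deployed, a client would typically send the audio file as the raw request body. The sketch below only builds such a request and does not send it. It assumes the endpoint accepts raw audio bytes with an audio `Content-Type` and a bearer token; the URL, the token and the helper name `build_transcription_request` are placeholders, not values from this notebook.

```python
import urllib.request


def build_transcription_request(url: str, token: str, audio: bytes,
                                content_type: str = "audio/flac") -> urllib.request.Request:
    """Build (but do not send) a POST request carrying raw audio bytes."""
    return urllib.request.Request(
        url,
        data=audio,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": content_type,
        },
        method="POST",
    )


# Placeholder values; replace with a real endpoint URL, access token and audio bytes.
req = build_transcription_request(
    "https://example.invalid/whisper", "hf_xxx", b"fake-audio-bytes"
)
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req).read()
```

In practice the `audio` argument would be the bytes read from a file such as `sample1.flac`, exactly as in the local test above.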