shideqin committed on
Commit
63f0318
1 Parent(s): f988cc5
Files changed (3)
  1. LICENSE +203 -0
  2. README.md +468 -1
  3. app.py +251 -0
LICENSE ADDED
@@ -0,0 +1,203 @@
1
+ Copyright 2023- The HuggingFace Inc. team and The OpenAI Authors. All rights reserved.
2
+
3
+ Apache License
4
+ Version 2.0, January 2004
5
+ http://www.apache.org/licenses/
6
+
7
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
8
+
9
+ 1. Definitions.
10
+
11
+ "License" shall mean the terms and conditions for use, reproduction,
12
+ and distribution as defined by Sections 1 through 9 of this document.
13
+
14
+ "Licensor" shall mean the copyright owner or entity authorized by
15
+ the copyright owner that is granting the License.
16
+
17
+ "Legal Entity" shall mean the union of the acting entity and all
18
+ other entities that control, are controlled by, or are under common
19
+ control with that entity. For the purposes of this definition,
20
+ "control" means (i) the power, direct or indirect, to cause the
21
+ direction or management of such entity, whether by contract or
22
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
23
+ outstanding shares, or (iii) beneficial ownership of such entity.
24
+
25
+ "You" (or "Your") shall mean an individual or Legal Entity
26
+ exercising permissions granted by this License.
27
+
28
+ "Source" form shall mean the preferred form for making modifications,
29
+ including but not limited to software source code, documentation
30
+ source, and configuration files.
31
+
32
+ "Object" form shall mean any form resulting from mechanical
33
+ transformation or translation of a Source form, including but
34
+ not limited to compiled object code, generated documentation,
35
+ and conversions to other media types.
36
+
37
+ "Work" shall mean the work of authorship, whether in Source or
38
+ Object form, made available under the License, as indicated by a
39
+ copyright notice that is included in or attached to the work
40
+ (an example is provided in the Appendix below).
41
+
42
+ "Derivative Works" shall mean any work, whether in Source or Object
43
+ form, that is based on (or derived from) the Work and for which the
44
+ editorial revisions, annotations, elaborations, or other modifications
45
+ represent, as a whole, an original work of authorship. For the purposes
46
+ of this License, Derivative Works shall not include works that remain
47
+ separable from, or merely link (or bind by name) to the interfaces of,
48
+ the Work and Derivative Works thereof.
49
+
50
+ "Contribution" shall mean any work of authorship, including
51
+ the original version of the Work and any modifications or additions
52
+ to that Work or Derivative Works thereof, that is intentionally
53
+ submitted to Licensor for inclusion in the Work by the copyright owner
54
+ or by an individual or Legal Entity authorized to submit on behalf of
55
+ the copyright owner. For the purposes of this definition, "submitted"
56
+ means any form of electronic, verbal, or written communication sent
57
+ to the Licensor or its representatives, including but not limited to
58
+ communication on electronic mailing lists, source code control systems,
59
+ and issue tracking systems that are managed by, or on behalf of, the
60
+ Licensor for the purpose of discussing and improving the Work, but
61
+ excluding communication that is conspicuously marked or otherwise
62
+ designated in writing by the copyright owner as "Not a Contribution."
63
+
64
+ "Contributor" shall mean Licensor and any individual or Legal Entity
65
+ on behalf of whom a Contribution has been received by Licensor and
66
+ subsequently incorporated within the Work.
67
+
68
+ 2. Grant of Copyright License. Subject to the terms and conditions of
69
+ this License, each Contributor hereby grants to You a perpetual,
70
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
71
+ copyright license to reproduce, prepare Derivative Works of,
72
+ publicly display, publicly perform, sublicense, and distribute the
73
+ Work and such Derivative Works in Source or Object form.
74
+
75
+ 3. Grant of Patent License. Subject to the terms and conditions of
76
+ this License, each Contributor hereby grants to You a perpetual,
77
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
78
+ (except as stated in this section) patent license to make, have made,
79
+ use, offer to sell, sell, import, and otherwise transfer the Work,
80
+ where such license applies only to those patent claims licensable
81
+ by such Contributor that are necessarily infringed by their
82
+ Contribution(s) alone or by combination of their Contribution(s)
83
+ with the Work to which such Contribution(s) was submitted. If You
84
+ institute patent litigation against any entity (including a
85
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
86
+ or a Contribution incorporated within the Work constitutes direct
87
+ or contributory patent infringement, then any patent licenses
88
+ granted to You under this License for that Work shall terminate
89
+ as of the date such litigation is filed.
90
+
91
+ 4. Redistribution. You may reproduce and distribute copies of the
92
+ Work or Derivative Works thereof in any medium, with or without
93
+ modifications, and in Source or Object form, provided that You
94
+ meet the following conditions:
95
+
96
+ (a) You must give any other recipients of the Work or
97
+ Derivative Works a copy of this License; and
98
+
99
+ (b) You must cause any modified files to carry prominent notices
100
+ stating that You changed the files; and
101
+
102
+ (c) You must retain, in the Source form of any Derivative Works
103
+ that You distribute, all copyright, patent, trademark, and
104
+ attribution notices from the Source form of the Work,
105
+ excluding those notices that do not pertain to any part of
106
+ the Derivative Works; and
107
+
108
+ (d) If the Work includes a "NOTICE" text file as part of its
109
+ distribution, then any Derivative Works that You distribute must
110
+ include a readable copy of the attribution notices contained
111
+ within such NOTICE file, excluding those notices that do not
112
+ pertain to any part of the Derivative Works, in at least one
113
+ of the following places: within a NOTICE text file distributed
114
+ as part of the Derivative Works; within the Source form or
115
+ documentation, if provided along with the Derivative Works; or,
116
+ within a display generated by the Derivative Works, if and
117
+ wherever such third-party notices normally appear. The contents
118
+ of the NOTICE file are for informational purposes only and
119
+ do not modify the License. You may add Your own attribution
120
+ notices within Derivative Works that You distribute, alongside
121
+ or as an addendum to the NOTICE text from the Work, provided
122
+ that such additional attribution notices cannot be construed
123
+ as modifying the License.
124
+
125
+ You may add Your own copyright statement to Your modifications and
126
+ may provide additional or different license terms and conditions
127
+ for use, reproduction, or distribution of Your modifications, or
128
+ for any such Derivative Works as a whole, provided Your use,
129
+ reproduction, and distribution of the Work otherwise complies with
130
+ the conditions stated in this License.
131
+
132
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
133
+ any Contribution intentionally submitted for inclusion in the Work
134
+ by You to the Licensor shall be under the terms and conditions of
135
+ this License, without any additional terms or conditions.
136
+ Notwithstanding the above, nothing herein shall supersede or modify
137
+ the terms of any separate license agreement you may have executed
138
+ with Licensor regarding such Contributions.
139
+
140
+ 6. Trademarks. This License does not grant permission to use the trade
141
+ names, trademarks, service marks, or product names of the Licensor,
142
+ except as required for reasonable and customary use in describing the
143
+ origin of the Work and reproducing the content of the NOTICE file.
144
+
145
+ 7. Disclaimer of Warranty. Unless required by applicable law or
146
+ agreed to in writing, Licensor provides the Work (and each
147
+ Contributor provides its Contributions) on an "AS IS" BASIS,
148
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
149
+ implied, including, without limitation, any warranties or conditions
150
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
151
+ PARTICULAR PURPOSE. You are solely responsible for determining the
152
+ appropriateness of using or redistributing the Work and assume any
153
+ risks associated with Your exercise of permissions under this License.
154
+
155
+ 8. Limitation of Liability. In no event and under no legal theory,
156
+ whether in tort (including negligence), contract, or otherwise,
157
+ unless required by applicable law (such as deliberate and grossly
158
+ negligent acts) or agreed to in writing, shall any Contributor be
159
+ liable to You for damages, including any direct, indirect, special,
160
+ incidental, or consequential damages of any character arising as a
161
+ result of this License or out of the use or inability to use the
162
+ Work (including but not limited to damages for loss of goodwill,
163
+ work stoppage, computer failure or malfunction, or any and all
164
+ other commercial damages or losses), even if such Contributor
165
+ has been advised of the possibility of such damages.
166
+
167
+ 9. Accepting Warranty or Additional Liability. While redistributing
168
+ the Work or Derivative Works thereof, You may choose to offer,
169
+ and charge a fee for, acceptance of support, warranty, indemnity,
170
+ or other liability obligations and/or rights consistent with this
171
+ License. However, in accepting such obligations, You may act only
172
+ on Your own behalf and on Your sole responsibility, not on behalf
173
+ of any other Contributor, and only if You agree to indemnify,
174
+ defend, and hold each Contributor harmless for any liability
175
+ incurred by, or claims asserted against, such Contributor by reason
176
+ of your accepting any such warranty or additional liability.
177
+
178
+ END OF TERMS AND CONDITIONS
179
+
180
+ APPENDIX: How to apply the Apache License to your work.
181
+
182
+ To apply the Apache License to your work, attach the following
183
+ boilerplate notice, with the fields enclosed by brackets "[]"
184
+ replaced with your own identifying information. (Don't include
185
+ the brackets!) The text should be enclosed in the appropriate
186
+ comment syntax for the file format. We also recommend that a
187
+ file or class name and description of purpose be included on the
188
+ same "printed page" as the copyright notice for easier
189
+ identification within third-party archives.
190
+
191
+ Copyright [yyyy] [name of copyright owner]
192
+
193
+ Licensed under the Apache License, Version 2.0 (the "License");
194
+ you may not use this file except in compliance with the License.
195
+ You may obtain a copy of the License at
196
+
197
+ http://www.apache.org/licenses/LICENSE-2.0
198
+
199
+ Unless required by applicable law or agreed to in writing, software
200
+ distributed under the License is distributed on an "AS IS" BASIS,
201
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
202
+ See the License for the specific language governing permissions and
203
+ limitations under the License.
README.md CHANGED
@@ -9,4 +9,471 @@ app_file: app.py
9
  pinned: false
10
  ---
11
 
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
12
+ # Whisper JAX
13
+
14
+ This repository contains optimised JAX code for OpenAI's [Whisper Model](https://arxiv.org/abs/2212.04356), largely built
15
+ on the 🤗 Hugging Face Transformers Whisper implementation. Compared to OpenAI's PyTorch code, Whisper JAX runs over **70x**
16
+ faster, making it the fastest Whisper implementation available.
17
+
18
+ The JAX code is compatible with CPU, GPU and TPU, and can be run standalone (see [Pipeline Usage](#pipeline-usage)) or
19
+ as an inference endpoint (see [Creating an Endpoint](#creating-an-endpoint)).
20
+
21
+ For a quick-start guide to running Whisper JAX on a Cloud TPU, refer to the following Kaggle notebook, where we transcribe 30 mins of audio in approx 30 sec:
22
+
23
+ [![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://www.kaggle.com/code/sgandhi99/whisper-jax-tpu)
24
+
25
+ The Whisper JAX model is also running as a demo on the Hugging Face Hub:
26
+
27
+ [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/sanchit-gandhi/whisper-jax)
28
+
29
+ ## Installation
30
+
31
+ Whisper JAX was tested using Python 3.9 and JAX version 0.4.5. Installation assumes that you already have the latest
32
+ version of the JAX package installed on your device. You can do so using the official JAX installation guide: https://github.com/google/jax#installation
33
+
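+ As a quick sanity check once JAX is installed, you can confirm which accelerator devices JAX can see before running Whisper JAX (a minimal snippet, independent of this package):
+ ```python
+ import jax
+
+ # lists the available devices, e.g. [CpuDevice(id=0)], a GPU, or eight TPU cores
+ print(jax.devices())
+ print(jax.default_backend())  # "cpu", "gpu" or "tpu"
+ ```
+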
34
+ Once the appropriate version of JAX has been installed, Whisper JAX can be installed through pip:
35
+ ```
36
+ pip install git+https://github.com/sanchit-gandhi/whisper-jax.git
37
+ ```
38
+
39
+ To update the Whisper JAX package to the latest version, simply run:
40
+ ```
41
+ pip install --upgrade --no-deps --force-reinstall git+https://github.com/sanchit-gandhi/whisper-jax.git
42
+ ```
43
+
44
+ ## Pipeline Usage
45
+
46
+ The recommended way of running Whisper JAX is through the [`FlaxWhisperPipline`](https://github.com/sanchit-gandhi/whisper-jax/blob/main/whisper_jax/pipeline.py#L57) abstraction class. This class handles all
47
+ the necessary pre- and post-processing, as well as wrapping the generate method for data parallelism across accelerator devices.
48
+
49
+ Whisper JAX makes use of JAX's [`pmap`](https://jax.readthedocs.io/en/latest/_autosummary/jax.pmap.html) function for data parallelism across GPU/TPU devices. This function is _Just In Time (JIT)_
50
+ compiled the first time it is called. Thereafter, the compiled function is _cached_, so subsequent calls run extremely fast:
51
+
52
+ ```python
53
+ from whisper_jax import FlaxWhisperPipline
54
+
55
+ # instantiate pipeline
56
+ pipeline = FlaxWhisperPipline("openai/whisper-large-v2")
57
+
58
+ # JIT compile the forward call - slow, but we only do it once
59
+ text = pipeline("audio.mp3")
60
+
61
+ # use the cached function thereafter - super fast!!
62
+ text = pipeline("audio.mp3")
63
+ ```
64
+
65
+ ### Half-Precision
66
+
67
+ The model computation can be run in half-precision by passing the dtype argument when instantiating the pipeline. This will
68
+ speed up the computation quite considerably by storing intermediate tensors in half-precision. There is no change to the precision
69
+ of the model weights.
70
+
71
+ For most GPUs, the dtype should be set to `jnp.float16`. For A100 GPUs or TPUs, the dtype should be set to `jnp.bfloat16`:
72
+ ```python
73
+ from whisper_jax import FlaxWhisperPipline
74
+ import jax.numpy as jnp
75
+
76
+ # instantiate pipeline in bfloat16
77
+ pipeline = FlaxWhisperPipline("openai/whisper-large-v2", dtype=jnp.bfloat16)
78
+ ```
79
+
80
+ ### Batching
81
+ Whisper JAX also provides the option of _batching_ a single audio input across accelerator devices. The audio is first
82
+ chunked into 30 second segments, and then chunks dispatched to the model to be transcribed in parallel. The resulting
83
+ transcriptions are stitched back together at the boundaries to give a single, uniform transcription. In practice, batching
84
+ provides a 10x speed-up compared to transcribing the audio samples sequentially, with a WER penalty of less than 1%[^1], provided the batch size is large enough.
85
+
86
+ To enable batching, pass the `batch_size` parameter when you instantiate the pipeline:
87
+
88
+ ```python
89
+ from whisper_jax import FlaxWhisperPipline
90
+
91
+ # instantiate pipeline with batching
92
+ pipeline = FlaxWhisperPipline("openai/whisper-large-v2", batch_size=16)
93
+ ```
94
+
95
+ ### Task
96
+
97
+ By default, the pipeline transcribes the audio file in the language it was spoken in. For speech translation, set the
98
+ `task` argument to `"translate"`:
99
+
100
+ ```python
101
+ # translate
102
+ text = pipeline("audio.mp3", task="translate")
103
+ ```
104
+
105
+ ### Timestamps
106
+
107
+ The [`FlaxWhisperPipline`](https://github.com/sanchit-gandhi/whisper-jax/blob/main/whisper_jax/pipeline.py#L57) also supports timestamp prediction. Note that enabling timestamps will require a second JIT compilation of the
108
+ forward call, this time including the timestamp outputs:
109
+
110
+ ```python
111
+ # transcribe and return timestamps
112
+ outputs = pipeline("audio.mp3", task="transcribe", return_timestamps=True)
113
+ text = outputs["text"] # transcription
114
+ chunks = outputs["chunks"] # transcription + timestamps
115
+ ```
116
+
117
+ ### Putting it all together
118
+ In the following code snippet, we instantiate the model in bfloat16 precision with batching enabled, and transcribe the audio file
119
+ returning the timestamps:
120
+
121
+ ```python
122
+ from whisper_jax import FlaxWhisperPipline
123
+ import jax.numpy as jnp
124
+
125
+ # instantiate pipeline with bfloat16 and enable batching
126
+ pipeline = FlaxWhisperPipline("openai/whisper-large-v2", dtype=jnp.bfloat16, batch_size=16)
127
+
128
+ # transcribe and return timestamps
129
+ outputs = pipeline("audio.mp3", task="transcribe", return_timestamps=True)
130
+ ```
131
+
132
+ ## Model Usage
133
+
134
+ The Whisper JAX model can be used at a more granular level in much the same way as the original Hugging Face
135
+ Transformers implementation. This requires the Whisper processor to be loaded separately from the model to handle the
136
+ pre- and post-processing, and the generate function to be wrapped using `pmap` by hand:
137
+
138
+ ```python
139
+ import jax.numpy as jnp
140
+ from datasets import load_dataset
141
+ from flax.jax_utils import replicate
142
+ from flax.training.common_utils import shard
143
+ from jax import device_get, pmap
144
+ from transformers import WhisperProcessor
145
+
146
+ from whisper_jax import FlaxWhisperForConditionalGeneration
147
+
148
+ # load the processor and model
149
+ processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
150
+ model, params = FlaxWhisperForConditionalGeneration.from_pretrained(
151
+ "openai/whisper-large-v2", dtype=jnp.bfloat16, _do_init=False,
152
+ )
153
+
154
+ def generate_fn(input_features):
155
+ pred_ids = model.generate(
156
+ input_features, task="transcribe", return_timestamps=False, max_length=model.config.max_length, params=params,
157
+ )
158
+ return pred_ids.sequences
159
+
160
+ # pmap the generate function for data parallelism
161
+ p_generate = pmap(generate_fn, "input_features")
162
+ # replicate the parameters across devices
163
+ params = replicate(params)
164
+
165
+ # load a dummy sample from the LibriSpeech dataset
166
+ ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
167
+ sample = ds[0]["audio"]
168
+
169
+ # pre-process: convert the audio array to log-mel input features
170
+ input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="np").input_features
171
+ # replicate the input features across devices for DP
172
+ input_features = shard(input_features)
173
+
174
+ # run the forward pass (JIT compiled the first time it is called)
175
+ pred_ids = p_generate(input_features)
176
+ output_ids = device_get(pred_ids.reshape(-1, model.config.max_length))
177
+
178
+ # post-process: convert token ids to text string
179
+ transcription = processor.batch_decode(output_ids, skip_special_tokens=True)
180
+ ```
181
+
182
+ ## Available Models and Languages
183
+ All Whisper models on the Hugging Face Hub with Flax weights are compatible with Whisper JAX. This includes, but is not limited to,
184
+ the official OpenAI Whisper checkpoints:
185
+
186
+ | Size | Parameters | English-only | Multilingual |
187
+ |----------|------------|------------------------------------------------------|-----------------------------------------------------|
188
+ | tiny | 39 M | [✓](https://huggingface.co/openai/whisper-tiny.en) | [✓](https://huggingface.co/openai/whisper-tiny) |
189
+ | base | 74 M | [✓](https://huggingface.co/openai/whisper-base.en) | [✓](https://huggingface.co/openai/whisper-base) |
190
+ | small | 244 M | [✓](https://huggingface.co/openai/whisper-small.en) | [✓](https://huggingface.co/openai/whisper-small) |
191
+ | medium | 769 M | [✓](https://huggingface.co/openai/whisper-medium.en) | [✓](https://huggingface.co/openai/whisper-medium) |
192
+ | large | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large) |
193
+ | large-v2 | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large-v2) |
194
+
195
+ To use a fine-tuned Whisper checkpoint in Whisper JAX, first convert the PyTorch weights to Flax.
196
+ This is straightforward through use of the `from_pt` argument, which will convert the PyTorch state dict to a frozen Flax
197
+ parameter dictionary on the fly. You can then push the converted Flax weights to the Hub to be used directly in Flax
198
+ the next time they are required. Note that converting weights from PyTorch to Flax requires both PyTorch and Flax to be installed.
199
+
200
+ For example, to convert the fine-tuned checkpoint [`sanchit-gandhi/whisper-small-hi`](https://huggingface.co/sanchit-gandhi/whisper-small-hi) from the blog post [Fine-Tuning Whisper](https://huggingface.co/blog/fine-tune-whisper):
201
+ ```python
202
+ from whisper_jax import FlaxWhisperForConditionalGeneration, FlaxWhisperPipline
203
+ import jax.numpy as jnp
204
+
205
+ checkpoint_id = "sanchit-gandhi/whisper-small-hi"
206
+ # convert PyTorch weights to Flax
207
+ model = FlaxWhisperForConditionalGeneration.from_pretrained(checkpoint_id, from_pt=True)
208
+ # push converted weights to the Hub
209
+ model.push_to_hub(checkpoint_id)
210
+
211
+ # now we can load the Flax weights directly as required
212
+ pipeline = FlaxWhisperPipline(checkpoint_id, dtype=jnp.bfloat16, batch_size=16)
213
+ ```
214
+
215
+ ## Advanced Usage
216
+ More advanced users may wish to explore different parallelisation techniques. The Whisper JAX code is
217
+ built on top of the [T5x codebase](https://github.com/google-research/t5x), meaning it can be run with model, activation, and data parallelism using the T5x
218
+ partitioning convention. To use T5x partitioning, the logical axis rules and number of model partitions must be defined.
219
+ For more details, the user is referred to the official T5x partitioning guide: https://github.com/google-research/t5x/blob/main/docs/usage/partitioning.md
220
+
221
+ ### Pipeline
222
+ The following code snippet demonstrates how data parallelism can be achieved using the pipeline `shard_params` method in
223
+ an entirely equivalent way to `pmap`:
224
+
225
+ ```python
226
+ from whisper_jax import FlaxWhisperPipline
227
+ import jax.numpy as jnp
228
+
229
+ # 2D parameter and activation partitioning for DP
230
+ logical_axis_rules_dp = (
231
+ ("batch", "data"),
232
+ ("mlp", None),
233
+ ("heads", None),
234
+ ("vocab", None),
235
+ ("embed", None),
236
+ ("embed", None),
237
+ ("joined_kv", None),
238
+ ("kv", None),
239
+ ("length", None),
240
+ ("num_mel", None),
241
+ ("channels", None),
242
+ )
243
+
244
+ pipeline = FlaxWhisperPipline("openai/whisper-large-v2", dtype=jnp.bfloat16, batch_size=16)
245
+ pipeline.shard_params(num_mp_partitions=1, logical_axis_rules=logical_axis_rules_dp)
246
+ ```
247
+
248
+ ### Model
249
+ It is also possible to use the Whisper JAX model with T5x partitioning by defining a T5x inference state and T5x partitioner:
250
+
251
+ ```python
252
+ import jax
253
+ import jax.numpy as jnp
254
+ from flax.core.frozen_dict import freeze
255
+ from jax.sharding import PartitionSpec as P
256
+
257
+ from whisper_jax import FlaxWhisperForConditionalGeneration, InferenceState, PjitPartitioner
258
+
259
+
260
+ # 2D parameter and activation partitioning for DP
261
+ logical_axis_rules_dp = [
262
+ ("batch", "data"),
263
+ ("mlp", None),
264
+ ("heads", None),
265
+ ("vocab", None),
266
+ ("embed", None),
267
+ ("embed", None),
268
+ ("joined_kv", None),
269
+ ("kv", None),
270
+ ("length", None),
271
+ ("num_mel", None),
272
+ ("channels", None),
273
+ ]
274
+
275
+ model, params = FlaxWhisperForConditionalGeneration.from_pretrained(
276
+ "openai/whisper-large-v2",
277
+ _do_init=False,
278
+ dtype=jnp.bfloat16,
279
+ )
280
+
281
+
282
+ def init_fn():
283
+ input_shape = (1, 80, 3000)
284
+
285
+ input_features = jnp.zeros(input_shape, dtype="f4")
286
+ input_features = input_features.at[(..., -1)].set(model.config.eos_token_id)
287
+
288
+ decoder_input_ids = jnp.zeros((input_shape[0], 1), dtype="i4")
289
+ decoder_attention_mask = jnp.ones_like(decoder_input_ids)
290
+
291
+ batch_size, sequence_length = decoder_input_ids.shape
292
+ decoder_position_ids = jnp.broadcast_to(jnp.arange(sequence_length)[None, :], (batch_size, sequence_length))
293
+
294
+ rng = jax.random.PRNGKey(0)
295
+ init_params = model.module.init(
296
+ rng,
297
+ input_features=input_features,
298
+ decoder_input_ids=decoder_input_ids,
299
+ decoder_attention_mask=decoder_attention_mask,
300
+ decoder_position_ids=decoder_position_ids,
301
+ return_dict=False,
302
+ )
303
+ return init_params
304
+
305
+
306
+ # Axis names metadata
307
+ param_axes = jax.eval_shape(init_fn)["params_axes"]
308
+
309
+ # Create InferenceState, since the partitioner expects it
310
+ state = InferenceState(
311
+ step=jnp.array(0),
312
+ params=freeze(model.params_shape_tree),
313
+ params_axes=freeze(param_axes),
314
+ flax_mutables=None,
315
+ flax_mutables_axes=param_axes,
316
+ )
317
+
318
+ # Define the pjit partitioner with 1 model partition
319
+ partitioner = PjitPartitioner(
320
+ num_partitions=1,
321
+ logical_axis_rules=logical_axis_rules_dp,
322
+ )
323
+
324
+ mesh_axes = partitioner.get_mesh_axes(state)
325
+ params_spec = mesh_axes.params
326
+
327
+ p_shard_params = partitioner.partition(model.to_bf16, (params_spec,), params_spec)
328
+
329
+
330
+ def generate(params, input_features):
331
+ output_ids = model.generate(input_features, params=params, max_length=model.config.max_length).sequences
332
+ return output_ids
333
+
334
+
335
+ p_generate = partitioner.partition(
336
+ generate,
337
+ in_axis_resources=(params_spec, P("data")),
338
+ out_axis_resources=P("data"),
339
+ )
340
+
341
+ # This will auto-magically run in mesh context
342
+ params = p_shard_params(freeze(params))
343
+
344
+ # you can now run the forward pass with:
345
+ # pred_ids = p_generate(input_features)
346
+ ```
347
+
348
+ ## Benchmarks
349
+
350
+ We compare Whisper JAX to the official [OpenAI implementation](https://github.com/openai/whisper) and the [🤗 Transformers
351
+ implementation](https://huggingface.co/docs/transformers/model_doc/whisper). We benchmark the models on audio samples of
352
+ increasing length and report the average inference time in seconds over 10 repeat runs. For all three systems, we pass a
353
+ pre-loaded audio file to the model and measure the time for the forward pass. Leaving the task of loading the audio file
354
+ to the systems adds an equal offset to all the benchmark times, so the actual time for loading **and** transcribing an
355
+ audio file will be higher than the reported numbers.
356
+
357
+ OpenAI and Transformers both run in PyTorch on GPU. Whisper JAX runs in JAX on GPU and TPU. OpenAI transcribes the audio
358
+ sequentially in the order it is spoken. Both Transformers and Whisper JAX use a batching algorithm, where chunks of audio
359
+ are batched together and transcribed in parallel (see section [Batching](#batching)).
360
+
361
+ **Table 1:** Average inference time in seconds for audio files of increasing length. GPU device is a single A100 40GB GPU.
362
+ TPU device is a single TPU v4-8.
363
+
364
+ <div align="center">
365
+
366
+ | | OpenAI | Transformers | Whisper JAX | Whisper JAX |
367
+ |-----------|---------|--------------|-------------|-------------|
368
+ | | | | | |
369
+ | Framework | PyTorch | PyTorch | JAX | JAX |
370
+ | Backend | GPU | GPU | GPU | TPU |
371
+ | | | | | |
372
+ | 1 min | 13.8 | 4.54 | 1.72 | 0.45 |
373
+ | 10 min | 108.3 | 20.2 | 9.38 | 2.01 |
374
+ | 1 hour | 1001.0 | 126.1 | 75.3 | 13.8 |
375
+ | | | | | |
376
+
377
+ </div>
378
+
379
+ ## Creating an Endpoint
380
+
381
+ The Whisper JAX model is running as a demo on the Hugging Face Hub:
382
+
383
+ [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/sanchit-gandhi/whisper-jax)
384
+
385
+ However, at peak times there may be a queue of users that limits how quickly your audio input is transcribed. In this case,
386
+ you may benefit from running the model yourself, such that you have unrestricted access to the Whisper JAX model.
387
+
388
+ If you are just interested in running the model in a standalone Python script, refer to the Kaggle notebook Whisper JAX TPU:
389
+
390
+ [![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://www.kaggle.com/code/sgandhi99/whisper-jax-tpu)
391
+
392
+ Otherwise, we provide all the necessary code for creating an inference endpoint. To obtain this code, first clone the
393
+ repository on the GPU/TPU on which you want to host the endpoint:
394
+ ```
395
+ git clone https://github.com/sanchit-gandhi/whisper-jax
396
+ ```
397
+
398
+ And then install Whisper JAX from source, with the required additional endpoint dependencies:
399
+ ```
400
+ cd whisper-jax
401
+ pip install -e .["endpoint"]
402
+ ```
403
+
404
+ We recommend that you set up an endpoint in the same zone/region as the one you are based in. This reduces the communication
405
+ time between your local machine and the remote one, which can significantly reduce the overall request time.
406
+
407
+ The Python script [`fastapi_app.py`](app/fastapi_app.py) contains the code to launch a FastAPI app with the Whisper large-v2 model.
408
+ By default, it uses a batch size of 16 and bfloat16 half-precision. You should update these parameters depending on your
409
+ GPU/TPU device (as explained in the sections on [Half-precision](#half-precision) and [Batching](#batching)).
410
+
411
+ You can launch the FastAPI app through Uvicorn using the bash script [`launch_app.sh`](app/launch_app.sh):
412
+ ```
413
+ bash launch_app.sh
414
+ ```
415
+
416
+ This will open port 8000 for the FastAPI app. To direct network requests to the FastAPI app, we use ngrok to launch a
417
+ server on the corresponding port:
418
+ ```
419
+ ngrok http --subdomain=whisper-jax 8000
420
+ ```
421
+
422
+ We can now send JSON requests to our endpoint through ngrok. The function `transcribe_audio` loads an audio file, encodes it
423
+ in bytes, sends it to our endpoint, and returns the transcription:
424
+
425
+ ```python
426
+ import base64
427
+ from transformers.pipelines.audio_utils import ffmpeg_read
428
+ import requests
429
+
430
+ API_URL = "https://whisper-jax.ngrok.io/generate/" # make sure this URL matches your ngrok subdomain
431
+
432
+
433
+ def query(payload):
434
+ """Send json payload to ngrok API URL and return response."""
435
+ response = requests.post(API_URL, json=payload)
436
+ return response.json(), response.status_code
437
+
438
+
439
+ def transcribe_audio(audio_file, task="transcribe", return_timestamps=False):
440
+ with open(audio_file, "rb") as f:
441
+ inputs = f.read()
442
+ inputs = ffmpeg_read(inputs, sampling_rate=16000)
443
+ # encode to bytes to make json compatible
444
+ inputs = {"array": base64.b64encode(inputs.tobytes()).decode(), "sampling_rate": 16000}
445
+ # format as a json payload and send query
446
+ payload = {"inputs": inputs, "task": task, "return_timestamps": return_timestamps}
447
+ data, status_code = query(payload)
448
+
449
+ if status_code == 200:
450
+ output = {"text": data["text"], "chunks": data.get("chunks", None)}
451
+ else:
452
+ output = data["detail"]
453
+
454
+ return output
455
+
456
+ # transcribe an audio file using our endpoint
457
+ output = transcribe_audio("audio.mp3")
458
+ ```
459
+
460
+ Note that this code snippet sends a base64 byte encoding of the audio file to the remote machine over [`requests`](https://requests.readthedocs.io).
461
+ In some cases, transferring the audio request from the local machine to the remote one can take longer than actually
462
+ transcribing it. Therefore, you may wish to explore more efficient methods of sending requests, such as parallel
463
+ requests/transcription (see the function `transcribe_chunked_audio` in [app.py](app/app.py)), as sketched below.
464
+
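+ As a rough illustration of the idea (a sketch, not the repository's implementation, which lives in `transcribe_chunked_audio`), the audio could be split into fixed 30 second chunks on the client and the requests issued concurrently with a thread pool. Note that this naive version transcribes each chunk independently, without the overlap-and-stitch logic used by the app:
+
+ ```python
+ import base64
+ from concurrent.futures import ThreadPoolExecutor
+
+ import requests
+ from transformers.pipelines.audio_utils import ffmpeg_read
+
+ API_URL = "https://whisper-jax.ngrok.io/generate/"  # make sure this URL matches your ngrok subdomain
+ SAMPLING_RATE = 16000
+ CHUNK_LENGTH_S = 30  # length of audio sent per request
+
+
+ def query_chunk(chunk, task="transcribe"):
+     # base64-encode the raw audio chunk to make it json compatible
+     inputs = {"array": base64.b64encode(chunk.tobytes()).decode(), "sampling_rate": SAMPLING_RATE}
+     payload = {"inputs": inputs, "task": task, "return_timestamps": False}
+     response = requests.post(API_URL, json=payload)
+     return response.json()["text"]
+
+
+ def transcribe_parallel(audio_file, task="transcribe", max_workers=8):
+     with open(audio_file, "rb") as f:
+         audio = ffmpeg_read(f.read(), sampling_rate=SAMPLING_RATE)
+     # naive split into non-overlapping 30 second chunks
+     chunk_len = CHUNK_LENGTH_S * SAMPLING_RATE
+     chunks = [audio[i : i + chunk_len] for i in range(0, len(audio), chunk_len)]
+     # send the chunk requests in parallel and join the transcriptions in order
+     with ThreadPoolExecutor(max_workers=max_workers) as executor:
+         texts = list(executor.map(lambda chunk: query_chunk(chunk, task), chunks))
+     return " ".join(texts)
+
+
+ # transcription = transcribe_parallel("audio.mp3")
+ ```
+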
465
+ Finally, we can create a Gradio demo for the frontend, the code for which resides in [`app.py`](app/app.py). You can launch this
466
+ application by providing the ngrok subdomain:
467
+ ```
468
+ API_URL=https://whisper-jax.ngrok.io/generate/ API_URL_FROM_FEATURES=https://whisper-jax.ngrok.io/generate_from_features/ python app.py
469
+ ```
470
+
471
+ This will launch a Gradio demo with the same interface as the official Whisper JAX demo.
472
+
473
+ ## Acknowledgements
474
+
475
+ * 🤗 Hugging Face Transformers for the base Whisper implementation, particularly to [andyehrenberg](https://github.com/andyehrenberg) for the [Flax Whisper PR](https://github.com/huggingface/transformers/pull/20479) and [ArthurZucker](https://github.com/ArthurZucker) for the batching algorithm
476
+ * Gradio for their easy-to-use package for building ML demos, and [pcuenca](https://github.com/pcuenca) for the help in hooking the demo up to the TPU
477
+ * Google's [TPU Research Cloud (TRC)](https://sites.research.google/trc/about/) programme for Cloud TPUs
478
+
479
+ [^1]: See WER results from Colab: https://colab.research.google.com/drive/1rS1L4YSJqKUH_3YxIQHBI982zso23wor?usp=sharing
app.py ADDED
@@ -0,0 +1,251 @@
1
+ import base64
2
+ import math
3
+ import os
4
+ import time
5
+ from multiprocessing import Pool
6
+
7
+ import gradio as gr
8
+ import numpy as np
9
+ import pytube
10
+ import requests
11
+ from processing_whisper import WhisperPrePostProcessor
12
+ from transformers.models.whisper.tokenization_whisper import TO_LANGUAGE_CODE
13
+ from transformers.pipelines.audio_utils import ffmpeg_read
14
+
15
+
16
+ title = "Whisper JAX: The Fastest Whisper API ⚡️"
17
+
18
+ description = """Whisper JAX is an optimised implementation of the [Whisper model](https://huggingface.co/openai/whisper-large-v2) by OpenAI. It runs on JAX with a TPU v4-8 in the backend. Compared to PyTorch on an A100 GPU, it is over [**70x faster**](https://github.com/sanchit-gandhi/whisper-jax#benchmarks), making it the fastest Whisper API available.
19
+
20
+ Note that at peak times, you may find yourself in the queue for this demo. When you submit a request, your queue position will be shown in the top right-hand side of the demo pane. Once you reach the front of the queue, your audio file will be transcribed, with the progress displayed through a progress bar.
21
+
22
+ To skip the queue, you may wish to create your own inference endpoint, details for which can be found in the [Whisper JAX repository](https://github.com/sanchit-gandhi/whisper-jax#creating-an-endpoint).
23
+ """
24
+
25
+ article = "Whisper large-v2 model by OpenAI. Backend running JAX on a TPU v4-8 through the generous support of the [TRC](https://sites.research.google/trc/about/) programme. Whisper JAX [code](https://github.com/sanchit-gandhi/whisper-jax) and Gradio demo by 🤗 Hugging Face."
26
+
27
+ API_URL = os.getenv("API_URL")
28
+ API_URL_FROM_FEATURES = os.getenv("API_URL_FROM_FEATURES")
29
+ language_names = sorted(TO_LANGUAGE_CODE.keys())
30
+ CHUNK_LENGTH_S = 30
31
+ BATCH_SIZE = 16
32
+ NUM_PROC = 16
33
+ FILE_LIMIT_MB = 1000
34
+
35
+
36
+ def query(payload):
37
+ response = requests.post(API_URL, json=payload)
38
+ return response.json(), response.status_code
39
+
40
+
41
+ def inference(inputs, task=None, return_timestamps=False):
42
+ payload = {"inputs": inputs, "task": task, "return_timestamps": return_timestamps}
43
+
44
+ data, status_code = query(payload)
45
+
46
+ if status_code != 200:
47
+ # error with our request - return the details to the user
48
+ raise gr.Error(data["detail"])
49
+
50
+ text = data["text"]
51
+ timestamps = data.get("chunks")
52
+ if timestamps is not None:
53
+ timestamps = [
54
+ f"[{format_timestamp(chunk['timestamp'][0])} -> {format_timestamp(chunk['timestamp'][1])}] {chunk['text']}"
55
+ for chunk in timestamps
56
+ ]
57
+ text = "\n".join(str(feature) for feature in timestamps)
58
+ return text
59
+
60
+
61
+ def chunked_query(payload):
62
+ response = requests.post(API_URL_FROM_FEATURES, json=payload)
63
+ return response.json(), response.status_code
64
+
65
+
66
+ def forward(batch, task=None, return_timestamps=False):
67
+ feature_shape = batch["input_features"].shape
68
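+ # base64-encode the input features so the batch is json serialisable; the original array shape is passed alongside as feature_shape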
+ batch["input_features"] = base64.b64encode(batch["input_features"].tobytes()).decode()
69
+ outputs, status_code = chunked_query(
70
+ {"batch": batch, "task": task, "return_timestamps": return_timestamps, "feature_shape": feature_shape}
71
+ )
72
+ if status_code != 200:
73
+ # error with our request - return the details to the user
74
+ raise gr.Error(outputs["detail"])
75
+ outputs["tokens"] = np.asarray(outputs["tokens"])
76
+ return outputs
77
+
78
+
79
+ def identity(batch):
80
+ return batch
81
+
82
+
83
+ # Copied from https://github.com/openai/whisper/blob/c09a7ae299c4c34c5839a76380ae407e7d785914/whisper/utils.py#L50
84
+ def format_timestamp(seconds: float, always_include_hours: bool = False, decimal_marker: str = "."):
85
+ if seconds is not None:
86
+ milliseconds = round(seconds * 1000.0)
87
+
88
+ hours = milliseconds // 3_600_000
89
+ milliseconds -= hours * 3_600_000
90
+
91
+ minutes = milliseconds // 60_000
92
+ milliseconds -= minutes * 60_000
93
+
94
+ seconds = milliseconds // 1_000
95
+ milliseconds -= seconds * 1_000
96
+
97
+ hours_marker = f"{hours:02d}:" if always_include_hours or hours > 0 else ""
98
+ return f"{hours_marker}{minutes:02d}:{seconds:02d}{decimal_marker}{milliseconds:03d}"
99
+ else:
100
+ # we have a malformed timestamp so just return it as is
101
+ return seconds
102
+
103
+
104
+ if __name__ == "__main__":
105
+ processor = WhisperPrePostProcessor.from_pretrained("openai/whisper-large-v2")
106
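+ # chunking parameters: each 30s chunk shares a stride of chunk_length/6 with its neighbours on either side, so consecutive chunks start `step` samples apart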
+ stride_length_s = CHUNK_LENGTH_S / 6
107
+ chunk_len = round(CHUNK_LENGTH_S * processor.feature_extractor.sampling_rate)
108
+ stride_left = stride_right = round(stride_length_s * processor.feature_extractor.sampling_rate)
109
+ step = chunk_len - stride_left - stride_right
110
+ pool = Pool(NUM_PROC)
111
+
112
+ def tqdm_generate(inputs: dict, task: str, return_timestamps: bool, progress: gr.Progress):
113
+ inputs_len = inputs["array"].shape[0]
114
+ all_chunk_start_idx = np.arange(0, inputs_len, step)
115
+ num_samples = len(all_chunk_start_idx)
116
+ num_batches = math.ceil(num_samples / BATCH_SIZE)
117
+ dummy_batches = list(
118
+ range(num_batches)
119
+ ) # Gradio progress bar not compatible with generator, see https://github.com/gradio-app/gradio/issues/3841
120
+
121
+ dataloader = processor.preprocess_batch(inputs, chunk_length_s=CHUNK_LENGTH_S, batch_size=BATCH_SIZE)
122
+ progress(0, desc="Pre-processing audio file...")
123
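+ # pool.map consumes the whole generator up-front, returning a list of pre-processed batches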
+ dataloader = pool.map(identity, dataloader)
124
+
125
+ model_outputs = []
126
+ start_time = time.time()
127
+ # iterate over our chunked audio samples
128
+ for batch, _ in zip(dataloader, progress.tqdm(dummy_batches, desc="Transcribing...")):
129
+ model_outputs.append(forward(batch, task=task, return_timestamps=return_timestamps))
130
+ runtime = time.time() - start_time
131
+
132
+ post_processed = processor.postprocess(model_outputs, return_timestamps=return_timestamps)
133
+ text = post_processed["text"]
134
+ timestamps = post_processed.get("chunks")
135
+ if timestamps is not None:
136
+ timestamps = [
137
+ f"[{format_timestamp(chunk['timestamp'][0])} -> {format_timestamp(chunk['timestamp'][1])}] {chunk['text']}"
138
+ for chunk in timestamps
139
+ ]
140
+ text = "\n".join(str(feature) for feature in timestamps)
141
+ return text, runtime
142
+
143
+ def transcribe_chunked_audio(inputs, task, return_timestamps, progress=gr.Progress()):
144
+ progress(0, desc="Loading audio file...")
145
+ if inputs is None:
146
+ raise gr.Error("No audio file submitted! Please upload an audio file before submitting your request.")
147
+ file_size_mb = os.stat(inputs).st_size / (1024 * 1024)
148
+ if file_size_mb > FILE_LIMIT_MB:
149
+ raise gr.Error(
150
+ f"File size exceeds file size limit. Got file of size {file_size_mb:.2f}MB for a limit of {FILE_LIMIT_MB}MB."
151
+ )
152
+
153
+ with open(inputs, "rb") as f:
154
+ inputs = f.read()
155
+
156
+ inputs = ffmpeg_read(inputs, processor.feature_extractor.sampling_rate)
157
+ inputs = {"array": inputs, "sampling_rate": processor.feature_extractor.sampling_rate}
158
+ text, runtime = tqdm_generate(inputs, task=task, return_timestamps=return_timestamps, progress=progress)
159
+ return text, runtime
160
+
161
+ def _return_yt_html_embed(yt_url):
162
+ video_id = yt_url.split("?v=")[-1]
163
+ HTML_str = (
164
+ f'<center> <iframe width="500" height="320" src="https://www.youtube.com/embed/{video_id}"> </iframe>'
165
+ " </center>"
166
+ )
167
+ return HTML_str
168
+
169
+ def transcribe_youtube(yt_url, task, return_timestamps, progress=gr.Progress(), max_filesize=75.0):
170
+ progress(0, desc="Loading audio file...")
171
+ html_embed_str = _return_yt_html_embed(yt_url)
172
+ try:
173
+ yt = pytube.YouTube(yt_url)
174
+ stream = yt.streams.filter(only_audio=True)[0]
175
+ except KeyError:
176
+ raise gr.Error("An error occurred while loading the YouTube video. Please try again.")
177
+
178
+ if stream.filesize_mb > max_filesize:
179
+ raise gr.Error(f"Maximum YouTube file size is {max_filesize}MB, got {stream.filesize_mb:.2f}MB.")
180
+
181
+ stream.download(filename="audio.mp3")
182
+
183
+ with open("audio.mp3", "rb") as f:
184
+ inputs = f.read()
185
+
186
+ inputs = ffmpeg_read(inputs, processor.feature_extractor.sampling_rate)
187
+ inputs = {"array": inputs, "sampling_rate": processor.feature_extractor.sampling_rate}
188
+ text, runtime = tqdm_generate(inputs, task=task, return_timestamps=return_timestamps, progress=progress)
189
+ return html_embed_str, text, runtime
190
+
191
+ microphone_chunked = gr.Interface(
192
+ fn=transcribe_chunked_audio,
193
+ inputs=[
194
+ gr.inputs.Audio(source="microphone", optional=True, type="filepath"),
195
+ gr.inputs.Radio(["transcribe", "translate"], label="Task", default="transcribe"),
196
+ gr.inputs.Checkbox(default=False, label="Return timestamps"),
197
+ ],
198
+ outputs=[
199
+ gr.outputs.Textbox(label="Transcription").style(show_copy_button=True),
200
+ gr.outputs.Textbox(label="Transcription Time (s)"),
201
+ ],
202
+ allow_flagging="never",
203
+ title=title,
204
+ description=description,
205
+ article=article,
206
+ )
207
+
208
+ audio_chunked = gr.Interface(
209
+ fn=transcribe_chunked_audio,
210
+ inputs=[
211
+ gr.inputs.Audio(source="upload", optional=True, label="Audio file", type="filepath"),
212
+ gr.inputs.Radio(["transcribe", "translate"], label="Task", default="transcribe"),
213
+ gr.inputs.Checkbox(default=False, label="Return timestamps"),
214
+ ],
215
+ outputs=[
216
+ gr.outputs.Textbox(label="Transcription").style(show_copy_button=True),
217
+ gr.outputs.Textbox(label="Transcription Time (s)"),
218
+ ],
219
+ allow_flagging="never",
220
+ title=title,
221
+ description=description,
222
+ article=article,
223
+ )
224
+
225
+ youtube = gr.Interface(
226
+ fn=transcribe_youtube,
227
+ inputs=[
228
+ gr.inputs.Textbox(lines=1, placeholder="Paste the URL to a YouTube video here", label="YouTube URL"),
229
+ gr.inputs.Radio(["transcribe", "translate"], label="Task", default="transcribe"),
230
+ gr.inputs.Checkbox(default=False, label="Return timestamps"),
231
+ ],
232
+ outputs=[
233
+ gr.outputs.HTML(label="Video"),
234
+ gr.outputs.Textbox(label="Transcription").style(show_copy_button=True),
235
+ gr.outputs.Textbox(label="Transcription Time (s)"),
236
+ ],
237
+ allow_flagging="never",
238
+ title=title,
239
+ examples=[["https://www.youtube.com/watch?v=m8u-18Q0s7I", "transcribe", False]],
240
+ cache_examples=False,
241
+ description=description,
242
+ article=article,
243
+ )
244
+
245
+ demo = gr.Blocks()
246
+
247
+ with demo:
248
+ gr.TabbedInterface([microphone_chunked, audio_chunked, youtube], ["Microphone", "Audio File", "YouTube"])
249
+
250
+ demo.queue(concurrency_count=3, max_size=5)
251
+ demo.launch(show_api=False, max_threads=10)