Spaces:

vishred18
/

Comparative-Analysis-of-Speech-Synthesis-Models

Build error

Comparative-Analysis-of-Speech-Synthesis-Models

File size: 14,079 Bytes

d5ee97c

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "view-in-github"
   },
   "source": [
    "<a href=\"https://colab.research.google.com/github/sayakpaul/TensorFlowTTS/blob/master/notebooks/TensorFlowTTS_FastSpeech_with_TFLite.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sSxhJVHSDGGc"
   },
   "source": [
    "##### Copyright 2020 The TensorFlow Authors. All Rights Reserved.\n",
    "\n",
    "Licensed under the Apache License, Version 2.0 (the \"License\");"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Q0ySjtvCD6nJ"
   },
   "outputs": [],
   "source": [
    "#@title Licensed under the Apache License, Version 2.0 (the \"License\"); { display-mode: \"form\" }\n",
    "# you may not use this file except in compliance with the License.\n",
    "# You may obtain a copy of the License at\n",
    "#\n",
    "# https://www.apache.org/licenses/LICENSE-2.0\n",
    "#\n",
    "# Unless required by applicable law or agreed to in writing, software\n",
    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "# See the License for the specific language governing permissions and\n",
    "# limitations under the License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Do75nZO17R_g"
   },
   "source": [
    "Authors : [jaeyoo@](https://github.com/jaeyoo), [khanhlvg@](https://github.com/khanhlvg), [abattery@](https://github.com/abattery), [thaink@](https://github.com/thaink) (Google Research) (refactored by [sayakpaul](https://github.com/sayakpaul) (PyImageSearch))\n",
    "\n",
    "Created : 2020-07-03 KST\n",
    "\n",
    "Last updated : 2020-07-04 KST\n",
    "\n",
    "-----\n",
    "Change logs\n",
    "* 2020-07-04 KST : Update notebook with the latest repo.\n",
    " * https://github.com/TensorSpeech/TensorflowTTS/pull/84 merged.\n",
    "* 2020-07-03 KST : First implementation (outputs : `fastspeech_quant.tflite`)\n",
    " * varied-length input tensor, varied-length output tensor\n",
    " * Inference on tflite works well.\n",
    "* 2020-12-22 IST: Notebook runs end-to-end on Colab.\n",
    "-----\n",
    "\n",
    "**Status** : successfully converted (`fastspeech_quant.tflite`)\n",
    "\n",
    "**Disclaimer** \n",
    "- This colab doesn't care about the latency, so it compressed the model with quantization. (112 MB -> 28 MB)\n",
    "- The TFLite file doesn't have LJSpeechProcessor. So you need to run it before feeding input vectors.\n",
    "- `tf-nightly>=2.4.0-dev20200630`\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "p5aF0cRBv57s"
   },
   "source": [
    "# Generate voice with FastSpeech"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "VJTsCmhciNfz"
   },
   "outputs": [],
   "source": [
    "!git clone https://github.com/TensorSpeech/TensorFlowTTS.git\n",
    "!cd TensorFlowTTS\n",
    "!pip install /content/TensorFlowTTS/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "-vKfQMu7PiVL"
   },
   "outputs": [],
   "source": [
    "!pip install -q tf-nightly"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "vFZ9aWOnP3y_"
   },
   "source": [
    "**Another runtime restart is required.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "EHHcYEUyon5W",
    "outputId": "f89c5c36-a33a-48c2-9fb6-11ced05d4eea"
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import yaml\n",
    "import tensorflow as tf\n",
    "\n",
    "import sys\n",
    "sys.path.append('/content/TensorFlowTTS')\n",
    "\n",
    "from tensorflow_tts.inference import AutoProcessor\n",
    "from tensorflow_tts.inference import AutoConfig\n",
    "from tensorflow_tts.inference import TFAutoModel\n",
    "\n",
    "from IPython.display import Audio\n",
    "print(tf.__version__) # check if >= 2.4.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "nBr1A7MBSm6u"
   },
   "outputs": [],
   "source": [
    "# initialize melgan model\n",
    "melgan = TFAutoModel.from_pretrained(\"tensorspeech/tts-melgan-ljspeech-en\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "VG0PwedpqFhd"
   },
   "outputs": [],
   "source": [
    "# initialize FastSpeech model.\n",
    "fastspeech = TFAutoModel.from_pretraned(\"tensorspeech/tts-fastspeech-ljspeech-en\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "7cPCQoxam3vp",
    "outputId": "3dcede9a-045c-4599-dea8-0a6c1e38d942"
   },
   "outputs": [],
   "source": [
    "input_text = \"Recent research at Harvard has shown meditating\\\n",
    "for as little as 8 weeks, can actually increase the grey matter in the \\\n",
    "parts of the brain responsible for emotional regulation, and learning.\"\n",
    "\n",
    "processor = AutoProcessor.from_pretrained(\"tensorspeech/tts-fastspeech-ljspeech-en\")\n",
    "input_ids = processor.text_to_sequence(input_text.lower())\n",
    "\n",
    "mel_before, mel_after, duration_outputs, _, _ = fastspeech.inference(\n",
    "    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),\n",
    "    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),\n",
    "    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),\n",
    "    f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),\n",
    "    energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),\n",
    ")\n",
    "\n",
    "audio_before = melgan(mel_before)[0, :, 0]\n",
    "audio_after = melgan(mel_after)[0, :, 0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 75
    },
    "id": "SxVxtZI5sDF-",
    "outputId": "b9bbeff9-2b25-4ef2-9979-e6f41859beb3"
   },
   "outputs": [],
   "source": [
    "Audio(data=audio_before, rate=22050)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 75
    },
    "id": "i5gV4y9RpLBA",
    "outputId": "fdcbb41f-ad7b-4320-86f2-452385adda1e"
   },
   "outputs": [],
   "source": [
    "Audio(data=audio_after, rate=22050)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "38xzKgqgwbLl"
   },
   "source": [
    "# Convert to TFLite"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "j3eBgJr1CfqF"
   },
   "outputs": [],
   "source": [
    "# Concrete Function\n",
    "fastspeech_concrete_function = fastspeech.inference_tflite.get_concrete_function()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "d9CUR0UD8O9w"
   },
   "outputs": [],
   "source": [
    "converter = tf.lite.TFLiteConverter.from_concrete_functions(\n",
    "    [fastspeech_concrete_function]\n",
    ")\n",
    "converter.optimizations = [tf.lite.Optimize.DEFAULT]\n",
    "converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS,\n",
    "                                       tf.lite.OpsSet.SELECT_TF_OPS]\n",
    "tflite_model = converter.convert()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "IjLkV0wlIVq1",
    "outputId": "bc6fc0aa-31ab-4188-a851-6135abd8a5cd"
   },
   "outputs": [],
   "source": [
    "# Save the TF Lite model.\n",
    "with open('fastspeech_quant.tflite', 'wb') as f:\n",
    "  f.write(tflite_model)\n",
    "\n",
    "print('Model size is %f MBs.' % (len(tflite_model) / 1024 / 1024.0) )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "gLoUH69hJkIK"
   },
   "outputs": [],
   "source": [
    "## Download the TF Lite model\n",
    "#from google.colab import files\n",
    "#files.download('fastspeech_quant.tflite') "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1WqL_NEbtL5K"
   },
   "source": [
    "# Inference from TFLite"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "JjNnqWlItLXi"
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import tensorflow as tf\n",
    "\n",
    "# Load the TFLite model and allocate tensors.\n",
    "interpreter = tf.lite.Interpreter(model_path='fastspeech_quant.tflite')\n",
    "\n",
    "# Get input and output tensors.\n",
    "input_details = interpreter.get_input_details()\n",
    "output_details = interpreter.get_output_details()\n",
    "\n",
    "# Prepare input data.\n",
    "def prepare_input(input_ids):\n",
    "  input_ids = tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0)\n",
    "  return (input_ids,\n",
    "          tf.convert_to_tensor([0], tf.int32),\n",
    "          tf.convert_to_tensor([1.0], dtype=tf.float32),\n",
    "          tf.convert_to_tensor([1.0], dtype=tf.float32),\n",
    "          tf.convert_to_tensor([1.0], dtype=tf.float32))\n",
    "\n",
    "# Test the model on random input data.\n",
    "def infer(input_text):\n",
    "  processor = AutoProcessor.from_pretrained(pretrained_path=\"ljspeech_mapper.json\")\n",
    "  input_ids = processor.text_to_sequence(input_text.lower())\n",
    "  interpreter.resize_tensor_input(input_details[0]['index'], \n",
    "                                  [1, len(input_ids)])\n",
    "  interpreter.resize_tensor_input(input_details[1]['index'], \n",
    "                                  [1])\n",
    "  interpreter.resize_tensor_input(input_details[2]['index'], \n",
    "                                  [1])\n",
    "  interpreter.resize_tensor_input(input_details[3]['index'], \n",
    "                                  [1])\n",
    "  interpreter.resize_tensor_input(input_details[4]['index'], \n",
    "                                  [1])\n",
    "  interpreter.allocate_tensors()\n",
    "  input_data = prepare_input(input_ids)\n",
    "  for i, detail in enumerate(input_details):\n",
    "    input_shape = detail['shape_signature']\n",
    "    interpreter.set_tensor(detail['index'], input_data[i])\n",
    "\n",
    "  interpreter.invoke()\n",
    "\n",
    "  # The function `get_tensor()` returns a copy of the tensor data.\n",
    "  # Use `tensor()` in order to get a pointer to the tensor.\n",
    "  return (interpreter.get_tensor(output_details[0]['index']),\n",
    "          interpreter.get_tensor(output_details[1]['index']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "dRgCO2UfdrBe"
   },
   "outputs": [],
   "source": [
    "input_text = \"Recent research at Harvard has shown meditating\\\n",
    "for as little as 8 weeks, can actually increase the grey matter in the \\\n",
    "parts of the brain responsible for emotional regulation, and learning.\"\n",
    "\n",
    "decoder_output_tflite, mel_output_tflite = infer(input_text)\n",
    "audio_before_tflite = melgan(decoder_output_tflite)[0, :, 0]\n",
    "audio_after_tflite = melgan(mel_output_tflite)[0, :, 0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 75
    },
    "id": "vajrYnWAX31f",
    "outputId": "6ec5fcf3-c66b-4dac-f749-e7149beb81eb"
   },
   "outputs": [],
   "source": [
    "Audio(data=audio_before_tflite, rate=22050)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 75
    },
    "id": "-eJ5QGc5X_Tc",
    "outputId": "85654556-89ee-4b3e-9100-f60390af26f2",
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "Audio(data=audio_after_tflite, rate=22050)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "iht1FDZUd0Ig"
   },
   "outputs": [],
   "source": [
    "input_text = \"I love TensorFlow Lite converted FastSpeech with quantization. \\\n",
    "The converted model file is of 28.6 Mega bytes.\"\n",
    "\n",
    "decoder_output_tflite, mel_output_tflite = infer(input_text)\n",
    "audio_before_tflite = melgan(decoder_output_tflite)[0, :, 0]\n",
    "audio_after_tflite = melgan(mel_output_tflite)[0, :, 0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 75
    },
    "id": "ZJVtr-D3d6rr",
    "outputId": "138a0cc3-d367-4a0c-872a-47ca25e12f69"
   },
   "outputs": [],
   "source": [
    "Audio(data=audio_before_tflite, rate=22050)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 75
    },
    "id": "mBU2Zdl1d8ZI",
    "outputId": "ce188fbf-5d86-474f-84ee-b22baac6f417"
   },
   "outputs": [],
   "source": [
    "Audio(data=audio_after_tflite, rate=22050)"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "include_colab_link": true,
   "name": "TensorFlowTTS - FastSpeech with TFLite",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}