{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "view-in-github" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": { "id": "sSxhJVHSDGGc" }, "source": [ "##### Copyright 2020 The TensorFlow Authors. All Rights Reserved.\n", "\n", "Licensed under the Apache License, Version 2.0 (the \"License\");" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Q0ySjtvCD6nJ" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\"); { display-mode: \"form\" }\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "Do75nZO17R_g" }, "source": [ "Authors : [jaeyoo@](https://github.com/jaeyoo), [khanhlvg@](https://github.com/khanhlvg), [abattery@](https://github.com/abattery), [thaink@](https://github.com/thaink) (Google Research) (refactored by [sayakpaul](https://github.com/sayakpaul) (PyImageSearch))\n", "\n", "Created : 2020-07-03 KST\n", "\n", "Last updated : 2020-07-04 KST\n", "\n", "-----\n", "Change logs\n", "* 2020-07-04 KST : Update notebook with the latest repo.\n", " * https://github.com/TensorSpeech/TensorflowTTS/pull/84 merged.\n", "* 2020-07-03 KST : First implementation (outputs : `fastspeech_quant.tflite`)\n", " * varied-length input tensor, varied-length output tensor\n", " * Inference on tflite works well.\n", "* 2020-12-22 IST: Notebook runs end-to-end on Colab.\n", "-----\n", "\n", "**Status** : successfully converted (`fastspeech_quant.tflite`)\n", "\n", "**Disclaimer** \n", "- This colab doesn't care about the latency, so it compressed the model with quantization. (112 MB -> 28 MB)\n", "- The TFLite file doesn't have LJSpeechProcessor. So you need to run it before feeding input vectors.\n", "- `tf-nightly>=2.4.0-dev20200630`\n" ] }, { "cell_type": "markdown", "metadata": { "id": "p5aF0cRBv57s" }, "source": [ "# Generate voice with FastSpeech" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "VJTsCmhciNfz" }, "outputs": [], "source": [ "!git clone https://github.com/TensorSpeech/TensorFlowTTS.git\n", "!cd TensorFlowTTS\n", "!pip install /content/TensorFlowTTS/" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "-vKfQMu7PiVL" }, "outputs": [], "source": [ "!pip install -q tf-nightly" ] }, { "cell_type": "markdown", "metadata": { "id": "vFZ9aWOnP3y_" }, "source": [ "**Another runtime restart is required.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "EHHcYEUyon5W", "outputId": "f89c5c36-a33a-48c2-9fb6-11ced05d4eea" }, "outputs": [], "source": [ "import numpy as np\n", "import yaml\n", "import tensorflow as tf\n", "\n", "import sys\n", "sys.path.append('/content/TensorFlowTTS')\n", "\n", "from tensorflow_tts.inference import AutoProcessor\n", "from tensorflow_tts.inference import AutoConfig\n", "from tensorflow_tts.inference import TFAutoModel\n", "\n", "from IPython.display import Audio\n", "print(tf.__version__) # check if >= 2.4.0" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "nBr1A7MBSm6u" }, "outputs": [], "source": [ "# initialize melgan model\n", "melgan = TFAutoModel.from_pretrained(\"tensorspeech/tts-melgan-ljspeech-en\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "VG0PwedpqFhd" }, "outputs": [], "source": [ "# initialize FastSpeech model.\n", "fastspeech = TFAutoModel.from_pretraned(\"tensorspeech/tts-fastspeech-ljspeech-en\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "7cPCQoxam3vp", "outputId": "3dcede9a-045c-4599-dea8-0a6c1e38d942" }, "outputs": [], "source": [ "input_text = \"Recent research at Harvard has shown meditating\\\n", "for as little as 8 weeks, can actually increase the grey matter in the \\\n", "parts of the brain responsible for emotional regulation, and learning.\"\n", "\n", "processor = AutoProcessor.from_pretrained(\"tensorspeech/tts-fastspeech-ljspeech-en\")\n", "input_ids = processor.text_to_sequence(input_text.lower())\n", "\n", "mel_before, mel_after, duration_outputs, _, _ = fastspeech.inference(\n", " input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),\n", " speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),\n", " speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),\n", " f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),\n", " energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),\n", ")\n", "\n", "audio_before = melgan(mel_before)[0, :, 0]\n", "audio_after = melgan(mel_after)[0, :, 0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 75 }, "id": "SxVxtZI5sDF-", "outputId": "b9bbeff9-2b25-4ef2-9979-e6f41859beb3" }, "outputs": [], "source": [ "Audio(data=audio_before, rate=22050)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 75 }, "id": "i5gV4y9RpLBA", "outputId": "fdcbb41f-ad7b-4320-86f2-452385adda1e" }, "outputs": [], "source": [ "Audio(data=audio_after, rate=22050)" ] }, { "cell_type": "markdown", "metadata": { "id": "38xzKgqgwbLl" }, "source": [ "# Convert to TFLite" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "j3eBgJr1CfqF" }, "outputs": [], "source": [ "# Concrete Function\n", "fastspeech_concrete_function = fastspeech.inference_tflite.get_concrete_function()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "d9CUR0UD8O9w" }, "outputs": [], "source": [ "converter = tf.lite.TFLiteConverter.from_concrete_functions(\n", " [fastspeech_concrete_function]\n", ")\n", "converter.optimizations = [tf.lite.Optimize.DEFAULT]\n", "converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS,\n", " tf.lite.OpsSet.SELECT_TF_OPS]\n", "tflite_model = converter.convert()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "IjLkV0wlIVq1", "outputId": "bc6fc0aa-31ab-4188-a851-6135abd8a5cd" }, "outputs": [], "source": [ "# Save the TF Lite model.\n", "with open('fastspeech_quant.tflite', 'wb') as f:\n", " f.write(tflite_model)\n", "\n", "print('Model size is %f MBs.' % (len(tflite_model) / 1024 / 1024.0) )" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "gLoUH69hJkIK" }, "outputs": [], "source": [ "## Download the TF Lite model\n", "#from google.colab import files\n", "#files.download('fastspeech_quant.tflite') " ] }, { "cell_type": "markdown", "metadata": { "id": "1WqL_NEbtL5K" }, "source": [ "# Inference from TFLite" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JjNnqWlItLXi" }, "outputs": [], "source": [ "import numpy as np\n", "import tensorflow as tf\n", "\n", "# Load the TFLite model and allocate tensors.\n", "interpreter = tf.lite.Interpreter(model_path='fastspeech_quant.tflite')\n", "\n", "# Get input and output tensors.\n", "input_details = interpreter.get_input_details()\n", "output_details = interpreter.get_output_details()\n", "\n", "# Prepare input data.\n", "def prepare_input(input_ids):\n", " input_ids = tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0)\n", " return (input_ids,\n", " tf.convert_to_tensor([0], tf.int32),\n", " tf.convert_to_tensor([1.0], dtype=tf.float32),\n", " tf.convert_to_tensor([1.0], dtype=tf.float32),\n", " tf.convert_to_tensor([1.0], dtype=tf.float32))\n", "\n", "# Test the model on random input data.\n", "def infer(input_text):\n", " processor = AutoProcessor.from_pretrained(pretrained_path=\"ljspeech_mapper.json\")\n", " input_ids = processor.text_to_sequence(input_text.lower())\n", " interpreter.resize_tensor_input(input_details[0]['index'], \n", " [1, len(input_ids)])\n", " interpreter.resize_tensor_input(input_details[1]['index'], \n", " [1])\n", " interpreter.resize_tensor_input(input_details[2]['index'], \n", " [1])\n", " interpreter.resize_tensor_input(input_details[3]['index'], \n", " [1])\n", " interpreter.resize_tensor_input(input_details[4]['index'], \n", " [1])\n", " interpreter.allocate_tensors()\n", " input_data = prepare_input(input_ids)\n", " for i, detail in enumerate(input_details):\n", " input_shape = detail['shape_signature']\n", " interpreter.set_tensor(detail['index'], input_data[i])\n", "\n", " interpreter.invoke()\n", "\n", " # The function `get_tensor()` returns a copy of the tensor data.\n", " # Use `tensor()` in order to get a pointer to the tensor.\n", " return (interpreter.get_tensor(output_details[0]['index']),\n", " interpreter.get_tensor(output_details[1]['index']))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dRgCO2UfdrBe" }, "outputs": [], "source": [ "input_text = \"Recent research at Harvard has shown meditating\\\n", "for as little as 8 weeks, can actually increase the grey matter in the \\\n", "parts of the brain responsible for emotional regulation, and learning.\"\n", "\n", "decoder_output_tflite, mel_output_tflite = infer(input_text)\n", "audio_before_tflite = melgan(decoder_output_tflite)[0, :, 0]\n", "audio_after_tflite = melgan(mel_output_tflite)[0, :, 0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 75 }, "id": "vajrYnWAX31f", "outputId": "6ec5fcf3-c66b-4dac-f749-e7149beb81eb" }, "outputs": [], "source": [ "Audio(data=audio_before_tflite, rate=22050)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 75 }, "id": "-eJ5QGc5X_Tc", "outputId": "85654556-89ee-4b3e-9100-f60390af26f2", "scrolled": true }, "outputs": [], "source": [ "Audio(data=audio_after_tflite, rate=22050)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iht1FDZUd0Ig" }, "outputs": [], "source": [ "input_text = \"I love TensorFlow Lite converted FastSpeech with quantization. \\\n", "The converted model file is of 28.6 Mega bytes.\"\n", "\n", "decoder_output_tflite, mel_output_tflite = infer(input_text)\n", "audio_before_tflite = melgan(decoder_output_tflite)[0, :, 0]\n", "audio_after_tflite = melgan(mel_output_tflite)[0, :, 0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 75 }, "id": "ZJVtr-D3d6rr", "outputId": "138a0cc3-d367-4a0c-872a-47ca25e12f69" }, "outputs": [], "source": [ "Audio(data=audio_before_tflite, rate=22050)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 75 }, "id": "mBU2Zdl1d8ZI", "outputId": "ce188fbf-5d86-474f-84ee-b22baac6f417" }, "outputs": [], "source": [ "Audio(data=audio_after_tflite, rate=22050)" ] } ], "metadata": { "colab": { "collapsed_sections": [], "include_colab_link": true, "name": "TensorFlowTTS - FastSpeech with TFLite", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 1 }