---
language:
- en
library_name: nemo
datasets:
- ljspeech
thumbnail: null
tags:
- text-to-speech
- speech
- audio
- Transformer
- pytorch
- NeMo
- Riva
license: cc-by-4.0
---

# NVIDIA FastPitch (en-US)

<style>
img {
 display: inline;
}
</style>

| [![Model architecture](https://img.shields.io/badge/Model_Arch-FastPitch--Transformer-lightgrey#model-badge)](#model-architecture)
| [![Model size](https://img.shields.io/badge/Params-45M-lightgrey#model-badge)](#model-architecture)
| [![Language](https://img.shields.io/badge/Language-en--US-lightgrey#model-badge)](#datasets)
| [![Riva Compatible](https://img.shields.io/badge/NVIDIA%20Riva-compatible-brightgreen#model-badge)](#deployment-with-nvidia-riva) |

FastPitch [1] is a fully parallel transformer architecture with prosody control over pitch and individual phoneme duration. It also uses an unsupervised speech-text aligner [2]. See the [model architecture](#model-architecture) section for complete details.

It is also compatible with NVIDIA Riva for [production-grade server deployments](#deployment-with-nvidia-riva).

## Usage

The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

To train, fine-tune, or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed the latest PyTorch version.

```bash
pip install nemo_toolkit['all']
```

### Automatically instantiate the model

```python
# Load FastPitch
from nemo.collections.tts.models import FastPitchModel
spec_generator = FastPitchModel.from_pretrained("tts_en_fastpitch")

# Load vocoder
from nemo.collections.tts.models import HifiGanModel
model = HifiGanModel.from_pretrained(model_name="tts_hifigan")
```

### Generate audio

```python
import soundfile as sf
parsed = spec_generator.parse("You can type your sentence here to get nemo to produce speech.")
spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
audio = model.convert_spectrogram_to_audio(spec=spectrogram)
```

### Save the generated audio file

```python
# Save the audio to disk in a file called speech.wav
# (detach from the graph and take the first item in the batch)
sf.write("speech.wav", audio.to('cpu').detach().numpy()[0], 22050)
```

### Input

This model accepts batches of text.

### Output

This model generates mel spectrograms.

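As a rough sanity check on the output, the duration of the synthesized audio can be derived from the spectrogram's frame count. Note the hop length of 256 samples is an assumption typical of LJSpeech FastPitch configs, not a value stated by this card; verify it against your config. A minimal sketch:

```python
# Estimate synthesized audio duration from mel-spectrogram frames.
# hop_length=256 is an assumed value (typical for LJSpeech FastPitch
# configs); sample_rate=22050 matches the dataset used here.
def frames_to_seconds(n_frames: int, hop_length: int = 256, sample_rate: int = 22050) -> float:
    return n_frames * hop_length / sample_rate

# For example, an 861-frame spectrogram is roughly 10 seconds of audio.
duration = frames_to_seconds(861)
```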
## Model Architecture

FastPitch is a fully parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The model predicts pitch contours during inference. By altering these predictions, the generated speech can be made more expressive, better match the semantics of the utterance, and ultimately be more engaging to the listener. Because it is fully parallel, FastPitch achieves a much higher real-time factor than Tacotron 2 when synthesizing mel spectrograms for a typical utterance. It also uses an unsupervised speech-text aligner [2].

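The pitch-contour editing described above amounts to simple array arithmetic on the predicted contour before decoding. The sketch below is illustrative only and independent of NeMo's actual API; it assumes the contour is a NumPy array of frequencies in Hz, with 0.0 marking unvoiced frames.

```python
import numpy as np

def shift_semitones(pitch_hz: np.ndarray, semitones: float) -> np.ndarray:
    """Scale a pitch contour by a number of semitones (12 semitones = one octave)."""
    # Multiplication preserves 0.0 entries, so unvoiced frames stay unvoiced.
    return pitch_hz * (2.0 ** (semitones / 12.0))

contour = np.array([110.0, 220.0, 0.0])   # Hz; 0.0 marks an unvoiced frame
shifted = shift_semitones(contour, 12.0)  # one octave up -> [220., 440., 0.]
```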
## Training

The NeMo toolkit [3] was used to train the model for 1000 epochs. The model was trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/tts/fastpitch.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/tts/conf/fastpitch_align_v1.05.yaml).

### Datasets

This model was trained on LJSpeech sampled at 22,050 Hz, and has been tested on generating female English speech with an American accent.

## Performance

No performance information is available at this time.

## Limitations

This checkpoint only works well with vocoders that were trained on 22,050 Hz data. Otherwise, the generated audio may sound scratchy or choppy.

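If your downstream pipeline expects a different sampling rate, resample the generated audio rather than relabeling it. The snippet below is a hedged illustration using plain linear interpolation with NumPy; in practice a dedicated resampler (e.g., from torchaudio or librosa, both assumptions here) gives better quality.

```python
import numpy as np

def resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Crude linear-interpolation resampler; fine for a quick check, not production."""
    duration = len(audio) / orig_sr
    n_out = int(round(duration * target_sr))
    t_out = np.linspace(0.0, duration, n_out, endpoint=False)
    t_in = np.arange(len(audio)) / orig_sr
    return np.interp(t_out, t_in, audio)

x = np.zeros(44100)                   # one second of silence at 44.1 kHz
y = resample_linear(x, 44100, 22050)  # one second at 22,050 Hz
```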
## Deployment with NVIDIA Riva

For the best real-time accuracy, latency, and throughput, deploy the model with [NVIDIA Riva](https://developer.nvidia.com/riva), an accelerated speech AI SDK deployable on-prem, in all clouds, multi-cloud, hybrid, at the edge, and embedded.

Additionally, Riva provides:

* World-class out-of-the-box accuracy for the most common languages, with model checkpoints trained on proprietary data with hundreds of thousands of GPU-compute hours
* Best-in-class accuracy with run-time word boosting (e.g., brand and product names) and customization of acoustic model, language model, and inverse text normalization
* Streaming speech recognition, Kubernetes-compatible scaling, and enterprise-grade support

Check out the [Riva live demo](https://developer.nvidia.com/riva#demos).

## References

- [1] [FastPitch: Parallel Text-to-speech with Pitch Prediction](https://arxiv.org/abs/2006.06873)
- [2] [One TTS Alignment To Rule Them All](https://arxiv.org/abs/2108.10447)
- [3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)