# Generative Spoken Language Modeling

* [Paper](https://arxiv.org/abs/2102.01192)
* [Demo](https://speechbot.github.io/gslm/index.html)

We build and evaluate generative speech2speech systems using [Log Mel Filterbank](https://pytorch.org/audio/stable/compliance.kaldi.html#fbank), [Modified CPC](https://github.com/facebookresearch/CPC_audio), [HuBERT Base](https://github.com/pytorch/fairseq/tree/main/examples/hubert), and [Wav2Vec 2.0 Large](https://github.com/pytorch/fairseq/tree/main/examples/wav2vec). Our system is composed of three components: *speech2unit*, *ulm*, and *unit2speech*. We describe the models and usage of each component in its respective sub-directory; see the links below.

## Speech to Unit Model (speech2unit)

The speech-to-unit model quantizes raw speech into learned discrete speech units. [More details](speech2unit) A hedged sketch of this quantization idea appears at the end of this page.

## Unit Language Model (ulm)

The unit language model is a generative language model trained on discrete speech units. [More details](ulm)

## Unit to Speech Model (unit2speech)

The unit-to-speech model synthesizes speech from discrete speech units. [More details](unit2speech)

## Metrics

We show how to compute ASR-based metrics as well as the zero-shot metrics proposed in our paper [here](metrics).

## Tools

We share two tools: one to resynthesize a given spoken utterance, and one to generate novel spoken language given a spoken prompt. [More details](tools)
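
## Example: quantizing speech into discrete units

To make the first stage of the pipeline concrete, here is a minimal, illustrative sketch of the quantization idea behind speech2unit: extract frame-level log Mel filterbank features with torchaudio, fit k-means over them, and map each frame to its nearest cluster id. This is a sketch under stated assumptions, not the actual pipeline: the file path and in-memory k-means fit are placeholders, and the real speech2unit flow uses the pretrained feature extractors and quantizer checkpoints documented in [speech2unit](speech2unit).

```python
# Illustrative sketch only: the real speech2unit pipeline uses pretrained
# feature extractors and quantizer checkpoints (see the speech2unit docs).
import torchaudio.compliance.kaldi as kaldi
import torchaudio
from sklearn.cluster import KMeans

# Load an utterance; "utterance.wav" is a placeholder path.
waveform, sample_rate = torchaudio.load("utterance.wav")

# Frame-level log Mel filterbank features, shape (num_frames, num_mel_bins).
feats = kaldi.fbank(
    waveform,
    num_mel_bins=80,
    sample_frequency=sample_rate,
)

# Fit k-means on the features and assign each frame a discrete unit id.
# 100 clusters is one of the unit-vocabulary sizes explored in the paper.
kmeans = KMeans(n_clusters=100, n_init=10).fit(feats.numpy())
units = kmeans.predict(feats.numpy())  # one integer unit per frame

print(units[:20])  # e.g. a short prefix of the discrete unit sequence
```

The resulting unit sequence is the interface between the components: the ulm is trained to generate such sequences, and unit2speech maps them back to a waveform.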