SaShiMi-796

This repository contains pre-trained models for SaShiMi-796, a from-scratch PyTorch replication of the paper "It's Raw! Audio Generation with State-Space Models". It was developed as part of the METU CENG 796 Deep Generative Models course in Spring 2023.

See the following links for more information:

The models and the dataset in this repository are downloaded and extracted automatically by the download_data.sh script in the GitHub repository.

In addition, there is a zip file containing the Youtube Mix dataset. The only difference in our version of the dataset is that it is split into train, validation, and test sets as described in the dataset's README file. We had to upload our own copy because the dataset cannot be downloaded from the official repository using wget due to authorization issues.

Please note that the Youtube Mix dataset is not our own work (it comes from the original Youtube video), and hence it is not covered by the same license as the models. The dataset is provided for academic and research purposes only, and it should be used as such in order to constitute fair use under US copyright law. We take no responsibility for any copyright infringement committed by users who download and use this dataset.

Reproduction Results

With an 8-layer SaShiMi model, we achieved an NLL of 1.325 (in base 2, i.e. in bits) after 160 epochs. For comparison, the result reported in the paper is 1.294. Although our NLL is slightly higher (lower is better), the model in the paper was trained for longer: 600K steps according to page 19 of the paper, which would be about 400 epochs in our setup. We believe it is reasonable to expect that our model can reach the same or a better NLL with longer training and/or better hyperparameter choices. Furthermore, our generated samples are similar to the ones provided by the authors. Therefore, we consider the paper successfully reproduced.
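
The training-budget comparison above can be made explicit with a little arithmetic. The following is a minimal sketch, not code from the repository: the variable names and the bits_to_nats helper are illustrative, and the steps-per-epoch figure is only inferred from the approximate "600K steps is about 400 epochs" equivalence, so it should be read as a rough estimate.

```python
import math

# Figures quoted in the text above.
PAPER_STEPS = 600_000        # training steps reported in the paper (page 19)
PAPER_EPOCHS_EQUIV = 400     # approximate equivalent number of epochs in our setup
OUR_EPOCHS = 160             # epochs our model was trained for
OUR_NLL_BITS = 1.325         # our NLL, base 2
PAPER_NLL_BITS = 1.294       # NLL reported in the paper, base 2


def bits_to_nats(nll_bits: float) -> float:
    """Convert an NLL from base 2 (bits) to base e (nats)."""
    return nll_bits * math.log(2)


# Rough steps per epoch implied by the 600K steps ~ 400 epochs equivalence.
steps_per_epoch = PAPER_STEPS / PAPER_EPOCHS_EQUIV

# Approximate number of steps our 160-epoch run corresponds to.
our_steps = OUR_EPOCHS * steps_per_epoch

# Fraction of the paper's training budget that we used.
fraction = our_steps / PAPER_STEPS

# Remaining gap to the paper's result, in bits.
gap_bits = OUR_NLL_BITS - PAPER_NLL_BITS

print(f"steps per epoch (approx.): {steps_per_epoch:.0f}")
print(f"our training steps (approx.): {our_steps:.0f}")
print(f"fraction of the paper's budget: {fraction:.0%}")
print(f"NLL gap: {gap_bits:.3f} bits")
print(f"our NLL in nats: {bits_to_nats(OUR_NLL_BITS):.4f}")
```

Under these assumptions, our run used roughly 40% of the paper's training budget, which is the basis for expecting the remaining gap of about 0.031 bits to shrink with longer training.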