Browse files
- .DS_Store +0 -0
- LICENSE +201 -0
- MANIFEST.in +1 -0
- audio_style_transfer/__init__.py +0 -0
- audio_style_transfer/__version__.py +3 -0
- audio_style_transfer/models/__init__.py +0 -0
- audio_style_transfer/models/nsynth.py +393 -0
- audio_style_transfer/models/timedomain.py +354 -0
- audio_style_transfer/models/uylanov.py +205 -0
- audio_style_transfer/utils.py +199 -0
- environment.yml +117 -0
- nips_2017.sty +339 -0
- paper.pdf +0 -0
- paper.tex +224 -0
- search.py +127 -0
- setup.py +116 -0
- style-transfer.bib +128 -0
.DS_Store
ADDED
Binary file (6.15 kB)
LICENSE
ADDED
@@ -0,0 +1,201 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

1. Definitions.

"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.

"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.

"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.

"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.

"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.

"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.

"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).

"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.

"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."

"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.

2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.

3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.

4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:

(a) You must give any other recipients of the Work or Derivative Works a copy of this License; and

(b) You must cause any modified files to carry prominent notices stating that You changed the files; and

(c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and

(d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.

You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.

5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.

6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.

7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.

8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.

9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.

END OF TERMS AND CONDITIONS

APPENDIX: How to apply the Apache License to your work.

To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives.

Copyright [yyyy] [name of copyright owner]

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
MANIFEST.in
ADDED
@@ -0,0 +1 @@
include README.md LICENSE
audio_style_transfer/__init__.py
ADDED
File without changes
audio_style_transfer/__version__.py
ADDED
@@ -0,0 +1,3 @@
VERSION = (1, 0, 0)

__version__ = '.'.join(map(str, VERSION))
audio_style_transfer/models/__init__.py
ADDED
File without changes
audio_style_transfer/models/nsynth.py
ADDED
@@ -0,0 +1,393 @@
"""NSynth & WaveNet Audio Style Transfer."""
import os
import glob
import librosa
import argparse
import numpy as np
import tensorflow as tf
from magenta.models.nsynth.wavenet import masked
from magenta.models.nsynth.utils import mu_law, inv_mu_law_numpy
from audio_style_transfer import utils


def compute_wavenet_encoder_features(content, style):
    ae_hop_length = 512
    ae_bottleneck_width = 16
    ae_num_stages = 10
    ae_num_layers = 30
    ae_filter_length = 3
    ae_width = 128
    # Encode the source with 8-bit Mu-Law.
    n_frames = content.shape[0]
    n_samples = content.shape[1]
    content_tf = np.ascontiguousarray(content)
    style_tf = np.ascontiguousarray(style)
    g = tf.Graph()
    content_features = []
    style_features = []
    layers = []
    with g.as_default(), g.device('/cpu:0'), tf.Session() as sess:
        x = tf.placeholder('float32', [n_frames, n_samples], name="x")
        x_quantized = mu_law(x)
        x_scaled = tf.cast(x_quantized, tf.float32) / 128.0
        x_scaled = tf.expand_dims(x_scaled, 2)
        en = masked.conv1d(
            x_scaled,
            causal=False,
            num_filters=ae_width,
            filter_length=ae_filter_length,
            name='ae_startconv')
        for num_layer in range(ae_num_layers):
            dilation = 2**(num_layer % ae_num_stages)
            d = tf.nn.relu(en)
            d = masked.conv1d(
                d,
                causal=False,
                num_filters=ae_width,
                filter_length=ae_filter_length,
                dilation=dilation,
                name='ae_dilatedconv_%d' % (num_layer + 1))
            d = tf.nn.relu(d)
            en += masked.conv1d(
                d,
                num_filters=ae_width,
                filter_length=1,
                name='ae_res_%d' % (num_layer + 1))
            layers.append(en)
        en = masked.conv1d(
            en,
            num_filters=ae_bottleneck_width,
            filter_length=1,
            name='ae_bottleneck')
        en = masked.pool1d(en, ae_hop_length, name='ae_pool', mode='avg')
        saver = tf.train.Saver()
        saver.restore(sess, './model.ckpt-200000')
        content_features = sess.run(layers, feed_dict={x: content_tf})
        styles = sess.run(layers, feed_dict={x: style_tf})
        for i, style_feature in enumerate(styles):
            n_features = np.prod(layers[i].shape.as_list()[-1])
            features = np.reshape(style_feature, (-1, n_features))
            style_gram = np.matmul(features.T, features) / (n_samples *
                                                            n_frames)
            style_features.append(style_gram)
    return content_features, style_features


def compute_wavenet_encoder_stylization(n_samples,
                                        n_frames,
                                        content_features,
                                        style_features,
                                        alpha=1e-4,
                                        learning_rate=1e-3,
                                        iterations=100):
    ae_style_layers = [1, 5]
    ae_num_layers = 30
    ae_num_stages = 10
    ae_filter_length = 3
    ae_width = 128
    layers = []
    with tf.Graph().as_default() as g, g.device('/cpu:0'), tf.Session() as sess:
        x = tf.placeholder(
            name="x", shape=(n_frames, n_samples, 1), dtype=tf.float32)
        en = masked.conv1d(
            x,
            causal=False,
            num_filters=ae_width,
            filter_length=ae_filter_length,
            name='ae_startconv')
        for num_layer in range(ae_num_layers):
            dilation = 2**(num_layer % ae_num_stages)
            d = tf.nn.relu(en)
            d = masked.conv1d(
                d,
                causal=False,
                num_filters=ae_width,
                filter_length=ae_filter_length,
                dilation=dilation,
                name='ae_dilatedconv_%d' % (num_layer + 1))
            d = tf.nn.relu(d)
            en += masked.conv1d(
                d,
                num_filters=ae_width,
                filter_length=1,
                name='ae_res_%d' % (num_layer + 1))
            layer_i = tf.identity(en, name='layer_{}'.format(num_layer))
            layers.append(layer_i)
        saver = tf.train.Saver()
        saver.restore(sess, './model.ckpt-200000')
        sess.run(tf.initialize_all_variables())
        frozen_graph_def = tf.graph_util.convert_variables_to_constants(
            sess, sess.graph_def, [en.name.replace(':0', '')] +
            ['layer_{}'.format(i) for i in range(ae_num_layers)])
    with tf.Graph().as_default() as g, g.device('/cpu:0'), tf.Session() as sess:
        x = tf.Variable(
            np.random.randn(n_frames, n_samples, 1).astype(np.float32))
        tf.import_graph_def(frozen_graph_def, input_map={'x:0': x})
        content_loss = np.float32(0.0)
        style_loss = np.float32(0.0)
        for num_layer in ae_style_layers:
            layer_i = g.get_tensor_by_name(name='import/layer_%d:0' %
                                           (num_layer))
            content_loss = content_loss + alpha * 2 * tf.nn.l2_loss(
                layer_i - content_features[num_layer])
            n_features = layer_i.shape.as_list()[-1]
            features = tf.reshape(layer_i, (-1, n_features))
            gram = tf.matmul(tf.transpose(features), features) / (n_frames *
                                                                  n_samples)
            style_loss = style_loss + 2 * tf.nn.l2_loss(gram - style_features[
                num_layer])
        loss = content_loss + style_loss
        # Optimization
        print('Started optimization.')
        opt = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
        var_list = tf.trainable_variables()
        print(var_list)
        sess.run(tf.initialize_all_variables())
        for i in range(iterations):
            s, c, layer, _ = sess.run([style_loss, content_loss, loss, opt])
            print(i, '- Style:', s, 'Content:', c, end='\r')
        result = x.eval()
        result = inv_mu_law_numpy(result[..., 0] / result.max() * 128.0)
    return result


def compute_wavenet_decoder_features(content, style):
    num_stages = 10
    num_layers = 30
    filter_length = 3
    width = 512
    skip_width = 256
    # Encode the source with 8-bit Mu-Law.
    n_frames = content.shape[0]
    n_samples = content.shape[1]
    content_tf = np.ascontiguousarray(content)
    style_tf = np.ascontiguousarray(style)
    g = tf.Graph()
    content_features = []
    style_features = []
    layers = []
    with g.as_default(), g.device('/cpu:0'), tf.Session() as sess:
        x = tf.placeholder('float32', [n_frames, n_samples], name="x")
        x_quantized = mu_law(x)
        x_scaled = tf.cast(x_quantized, tf.float32) / 128.0
        x_scaled = tf.expand_dims(x_scaled, 2)
        layer = x_scaled
        layer = masked.conv1d(
            layer, num_filters=width, filter_length=filter_length, name='startconv')

        # Set up skip connections.
        s = masked.conv1d(
            layer, num_filters=skip_width, filter_length=1, name='skip_start')

        # Residual blocks with skip connections.
        for i in range(num_layers):
            dilation = 2**(i % num_stages)
            d = masked.conv1d(
                layer,
                num_filters=2 * width,
                filter_length=filter_length,
                dilation=dilation,
                name='dilatedconv_%d' % (i + 1))
            assert d.get_shape().as_list()[2] % 2 == 0
            m = d.get_shape().as_list()[2] // 2
            d_sigmoid = tf.sigmoid(d[:, :, :m])
            d_tanh = tf.tanh(d[:, :, m:])
            d = d_sigmoid * d_tanh

            layer += masked.conv1d(
                d, num_filters=width, filter_length=1, name='res_%d' % (i + 1))
            s += masked.conv1d(
                d,
                num_filters=skip_width,
                filter_length=1,
                name='skip_%d' % (i + 1))
            layers.append(s)

        saver = tf.train.Saver()
        saver.restore(sess, './model.ckpt-200000')
        content_features = sess.run(layers, feed_dict={x: content_tf})
        styles = sess.run(layers, feed_dict={x: style_tf})
        for i, style_feature in enumerate(styles):
            n_features = np.prod(layers[i].shape.as_list()[-1])
            features = np.reshape(style_feature, (-1, n_features))
            style_gram = np.matmul(features.T, features) / (n_samples *
                                                            n_frames)
            style_features.append(style_gram)
    return content_features, style_features


def compute_wavenet_decoder_stylization(n_samples,
                                        n_frames,
                                        content_features,
                                        style_features,
                                        alpha=1e-4,
                                        learning_rate=1e-3,
                                        iterations=100):

    style_layers = [1, 5]
    num_stages = 10
    num_layers = 30
    filter_length = 3
    width = 512
    skip_width = 256
    layers = []
    with tf.Graph().as_default() as g, g.device('/cpu:0'), tf.Session() as sess:
        x = tf.placeholder(
            name="x", shape=(n_frames, n_samples, 1), dtype=tf.float32)
        layer = x
        layer = masked.conv1d(
            layer, num_filters=width, filter_length=filter_length, name='startconv')

        # Set up skip connections.
        s = masked.conv1d(
            layer, num_filters=skip_width, filter_length=1, name='skip_start')

        # Residual blocks with skip connections.
        for i in range(num_layers):
            dilation = 2**(i % num_stages)
            d = masked.conv1d(
                layer,
                num_filters=2 * width,
                filter_length=filter_length,
                dilation=dilation,
                name='dilatedconv_%d' % (i + 1))
            assert d.get_shape().as_list()[2] % 2 == 0
            m = d.get_shape().as_list()[2] // 2
            d_sigmoid = tf.sigmoid(d[:, :, :m])
            d_tanh = tf.tanh(d[:, :, m:])
            d = d_sigmoid * d_tanh

            layer += masked.conv1d(
                d, num_filters=width, filter_length=1, name='res_%d' % (i + 1))
            s += masked.conv1d(
                d,
                num_filters=skip_width,
                filter_length=1,
                name='skip_%d' % (i + 1))
            layer_i = tf.identity(s, name='layer_{}'.format(i))
            layers.append(layer_i)
        saver = tf.train.Saver()
        saver.restore(sess, './model.ckpt-200000')
        sess.run(tf.initialize_all_variables())
        frozen_graph_def = tf.graph_util.convert_variables_to_constants(
            sess, sess.graph_def, [s.name.replace(':0', '')] +
            ['layer_{}'.format(i) for i in range(num_layers)])

    with tf.Graph().as_default() as g, g.device('/cpu:0'), tf.Session() as sess:
        x = tf.Variable(
            np.random.randn(n_frames, n_samples, 1).astype(np.float32))
        tf.import_graph_def(frozen_graph_def, input_map={'x:0': x})
        content_loss = np.float32(0.0)
        style_loss = np.float32(0.0)
        for num_layer in style_layers:
            layer_i = g.get_tensor_by_name(name='import/layer_%d:0' %
                                           (num_layer))
            content_loss = content_loss + alpha * 2 * tf.nn.l2_loss(
                layer_i - content_features[num_layer])
            n_features = layer_i.shape.as_list()[-1]
            features = tf.reshape(layer_i, (-1, n_features))
            gram = tf.matmul(tf.transpose(features), features) / (n_frames *
                                                                  n_samples)
            style_loss = style_loss + 2 * tf.nn.l2_loss(gram - style_features[
                num_layer])
        loss = content_loss + style_loss
        # Optimization
        print('Started optimization.')
        opt = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
        var_list = tf.trainable_variables()
        print(var_list)
        sess.run(tf.initialize_all_variables())
        for i in range(iterations):
            s, c, _ = sess.run([style_loss, content_loss, opt])
            print(i, '- Style:', s, 'Content:', c, end='\r')
        result = x.eval()
        result = inv_mu_law_numpy(result[..., 0] / result.max() * 128.0)

    return result


def run(content_fname,
        style_fname,
        output_path,
        model,
        iterations=100,
        sr=16000,
        hop_size=512,
        frame_size=2048,
        alpha=1e-3):

    content, fs = librosa.load(content_fname, sr=sr)
    style, fs = librosa.load(style_fname, sr=sr)
    n_samples = (min(content.shape[0], style.shape[0]) // 512) * 512
    content = utils.chop(content[:n_samples], hop_size, frame_size)
    style = utils.chop(style[:n_samples], hop_size, frame_size)

    if model == 'encoder':
        content_features, style_features = compute_wavenet_encoder_features(
            content=content, style=style)
        result = compute_wavenet_encoder_stylization(
            n_frames=content_features[0].shape[0],
            n_samples=frame_size,
            alpha=alpha,
            content_features=content_features,
            style_features=style_features,
            iterations=iterations)
    elif model == 'decoder':
        content_features, style_features = compute_wavenet_decoder_features(
            content=content, style=style)
        result = compute_wavenet_decoder_stylization(
            n_frames=content_features[0].shape[0],
            n_samples=frame_size,
            alpha=alpha,
            content_features=content_features,
            style_features=style_features,
            iterations=iterations)
    else:
        raise ValueError('Unsupported model type: {}.'.format(model))

    x = utils.unchop(result, hop_size, frame_size)
    librosa.output.write_wav('prelimiter.wav', x, sr)

    limited = utils.limiter(x)
    output_fname = '{}/{}+{}.wav'.format(output_path,
                                         content_fname.split('/')[-1],
                                         style_fname.split('/')[-1])
    librosa.output.write_wav(output_fname, limited, sr=sr)


def batch(content_path, style_path, output_path, model):
    content_files = glob.glob('{}/*.wav'.format(content_path))
    style_files = glob.glob('{}/*.wav'.format(style_path))
    for content_fname in content_files:
        for style_fname in style_files:
            output_fname = '{}/{}+{}.wav'.format(output_path,
                                                 content_fname.split('/')[-1],
                                                 style_fname.split('/')[-1])
            if os.path.exists(output_fname):
                continue
            run(content_fname, style_fname, output_fname, model)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '-s', '--style', help='style file(s) location', required=True)
    parser.add_argument(
        '-c', '--content', help='content file(s) location', required=True)
    parser.add_argument('-o', '--output', help='output path', required=True)
    parser.add_argument(
        '-m',
        '--model',
        help='model type: [encoder], or decoder',
        default='encoder')
    parser.add_argument(
        '-t',
        '--type',
        help='mode for training [single] (point to files) or batch (point to path)',
        default='single')

    args = vars(parser.parse_args())
    if args['type'] == 'single':
        run(args['content'], args['style'], args['output'], args['model'])
    else:
        batch(args['content'], args['style'], args['output'], args['model'])
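A minimal usage sketch for this module, assuming a TensorFlow 1.x environment with the magenta package installed and the NSynth WaveNet checkpoint saved as ./model.ckpt-200000 in the working directory (the path hard-coded above); the file names are placeholders, not part of the repository:

# Hypothetical invocation of nsynth.run(); paths are placeholders.
from audio_style_transfer.models import nsynth

nsynth.run(content_fname='content.wav',
           style_fname='style.wav',
           output_path='results',
           model='encoder',   # or 'decoder' to match on the WaveNet decoder activations
           iterations=100)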
audio_style_transfer/models/timedomain.py
ADDED
@@ -0,0 +1,354 @@
"""NIPS2017 "Time Domain Neural Audio Style Transfer" code repository
Parag K. Mital
"""
import tensorflow as tf
import librosa
import numpy as np
from scipy.signal import hann
from audio_style_transfer import utils
import argparse
import glob
import os


def chop(signal, hop_size=256, frame_size=512):
    n_hops = len(signal) // hop_size
    s = []
    hann_win = hann(frame_size)
    for hop_i in range(n_hops):
        frame = signal[(hop_i * hop_size):(hop_i * hop_size + frame_size)]
        frame = np.pad(frame, (0, frame_size - len(frame)), 'constant')
        frame *= hann_win
        s.append(frame)
    s = np.array(s)
    return s


def unchop(frames, hop_size=256, frame_size=512):
    signal = np.zeros((frames.shape[0] * hop_size + frame_size,))
    for hop_i, frame in enumerate(frames):
        signal[(hop_i * hop_size):(hop_i * hop_size + frame_size)] += frame
    return signal


def dft_np(signal, hop_size=256, fft_size=512):
    s = chop(signal, hop_size, fft_size)
    N = s.shape[-1]
    k = np.reshape(
        np.linspace(0.0, 2 * np.pi / N * (N // 2), N // 2), [1, N // 2])
    x = np.reshape(np.linspace(0.0, N - 1, N), [N, 1])
    freqs = np.dot(x, k)
    real = np.dot(s, np.cos(freqs)) * (2.0 / N)
    imag = np.dot(s, np.sin(freqs)) * (2.0 / N)
    return real, imag


def idft_np(re, im, hop_size=256, fft_size=512):
    N = re.shape[1] * 2
    k = np.reshape(
        np.linspace(0.0, 2 * np.pi / N * (N // 2), N // 2), [N // 2, 1])
    x = np.reshape(np.linspace(0.0, N - 1, N), [1, N])
    freqs = np.dot(k, x)
    signal = np.zeros((re.shape[0] * hop_size + fft_size,))
    recon = np.dot(re, np.cos(freqs)) + np.dot(im, np.sin(freqs))
    for hop_i, frame in enumerate(recon):
        signal[(hop_i * hop_size):(hop_i * hop_size + fft_size)] += frame
    return signal


def unwrap(x):
    return np.unwrap(x).astype(np.float32)


def instance_norm(x, epsilon=1e-5):
    """Instance Normalization.

    See Ulyanov, D., Vedaldi, A., & Lempitsky, V. (2016).
    Instance Normalization: The Missing Ingredient for Fast Stylization,
    Retrieved from http://arxiv.org/abs/1607.08022

    Parameters
    ----------
    x : TYPE
        Description
    epsilon : float, optional
        Description
    """
    with tf.variable_scope('instance_norm'):
        mean, var = tf.nn.moments(x, [1, 2], keep_dims=True)
        scale = tf.get_variable(
            name='scale',
            shape=[x.get_shape()[-1]],
            initializer=tf.truncated_normal_initializer(mean=1.0, stddev=0.02))
        offset = tf.get_variable(
            name='offset',
            shape=[x.get_shape()[-1]],
            initializer=tf.constant_initializer(0.0))
        out = scale * tf.div(x - mean, tf.sqrt(var + epsilon)) + offset
        return out


def compute_inputs(x, freqs, n_fft, n_frames, input_features, norm=False):
    if norm:
        norm_fn = instance_norm
    else:
        def norm_fn(x):
            return x
    freqs_tf = tf.constant(freqs, name="freqs", dtype='float32')
    inputs = {}
    with tf.variable_scope('real'):
        inputs['real'] = norm_fn(tf.reshape(
            tf.matmul(x, tf.cos(freqs_tf)), [1, 1, n_frames, n_fft // 2]))
    with tf.variable_scope('imag'):
        inputs['imag'] = norm_fn(tf.reshape(
            tf.matmul(x, tf.sin(freqs_tf)), [1, 1, n_frames, n_fft // 2]))
    with tf.variable_scope('mags'):
        inputs['mags'] = norm_fn(tf.reshape(
            tf.sqrt(
                tf.maximum(1e-15, inputs['real'] * inputs['real'] + inputs[
                    'imag'] * inputs['imag'])), [1, 1, n_frames, n_fft // 2]))
    with tf.variable_scope('phase'):
        inputs['phase'] = norm_fn(tf.atan2(inputs['imag'], inputs['real']))
    with tf.variable_scope('unwrapped'):
        inputs['unwrapped'] = tf.py_func(
            unwrap, [inputs['phase']], tf.float32)
    with tf.variable_scope('unwrapped_difference'):
        inputs['unwrapped_difference'] = (tf.slice(
            inputs['unwrapped'],
            [0, 0, 0, 1], [-1, -1, -1, n_fft // 2 - 1]) -
            tf.slice(
                inputs['unwrapped'],
                [0, 0, 0, 0], [-1, -1, -1, n_fft // 2 - 1]))
    if 'unwrapped_difference' in input_features:
        for k, v in input_features:
            if k is not 'unwrapped_difference':
                inputs[k] = tf.slice(
                    v, [0, 0, 0, 0], [-1, -1, -1, n_fft // 2 - 1])
    net = tf.concat([inputs[i] for i in input_features], 1)
    return inputs, net


def compute_features(content,
                     style,
                     input_features,
                     norm=False,
                     stride=1,
                     n_layers=1,
                     n_filters=4096,
                     n_fft=1024,
                     k_h=1,
                     k_w=11):
    n_frames = content.shape[0]
    n_samples = content.shape[1]
    content_tf = np.ascontiguousarray(content)
    style_tf = np.ascontiguousarray(style)
    g = tf.Graph()
    kernels = []
    content_features = []
    style_features = []
    config_proto = tf.ConfigProto()
    config_proto.gpu_options.allow_growth = True
    with g.as_default(), g.device('/cpu:0'), tf.Session(config=config_proto) as sess:
        x = tf.placeholder('float32', [n_frames, n_samples], name="x")
        p = np.reshape(
            np.linspace(0.0, n_samples - 1, n_samples), [n_samples, 1])
        k = np.reshape(
            np.linspace(0.0, 2 * np.pi / n_fft * (n_fft // 2), n_fft // 2),
            [1, n_fft // 2])
        freqs = np.dot(p, k)
        inputs, net = compute_inputs(x, freqs, n_fft, n_frames, input_features, norm)
        sess.run(tf.initialize_all_variables())
        content_feature = net.eval(feed_dict={x: content_tf})
        content_features.append(content_feature)
        style_feature = inputs['mags'].eval(feed_dict={x: style_tf})
        features = np.reshape(style_feature, (-1, n_fft // 2))
        style_gram = np.matmul(features.T, features) / (n_frames)
        style_features.append(style_gram)
        for layer_i in range(n_layers):
            if layer_i == 0:
                std = np.sqrt(2) * np.sqrt(2.0 / (
                    (n_fft / 2 + n_filters) * k_w))
                kernel = np.random.randn(k_h, k_w, n_fft // 2, n_filters) * std
            else:
                std = np.sqrt(2) * np.sqrt(2.0 / (
                    (n_filters + n_filters) * k_w))
                kernel = np.random.randn(1, k_w, n_filters, n_filters) * std
            kernels.append(kernel)
            kernel_tf = tf.constant(
                kernel, name="kernel{}".format(layer_i), dtype='float32')
            conv = tf.nn.conv2d(
                net,
                kernel_tf,
                strides=[1, stride, stride, 1],
                padding="VALID",
                name="conv{}".format(layer_i))
            net = tf.nn.relu(conv)
            content_feature = net.eval(feed_dict={x: content_tf})
            content_features.append(content_feature)
            style_feature = net.eval(feed_dict={x: style_tf})
            features = np.reshape(style_feature, (-1, n_filters))
            style_gram = np.matmul(features.T, features) / (n_frames)
            style_features.append(style_gram)
    return content_features, style_features, kernels, freqs


def compute_stylization(kernels,
                        n_samples,
                        n_frames,
                        content_features,
                        style_gram,
                        freqs,
                        input_features,
                        norm=False,
                        stride=1,
                        n_layers=1,
                        n_fft=1024,
                        alpha=1e-4,
                        learning_rate=1e-3,
                        iterations=100,
                        optimizer='bfgs'):
    result = None
    with tf.Graph().as_default():
        x = tf.Variable(
            np.random.randn(n_frames, n_samples).astype(np.float32) * 1e-3,
            name="x")
        inputs, net = compute_inputs(x, freqs, n_fft, n_frames, input_features, norm)
        content_loss = alpha * 2 * tf.nn.l2_loss(net - content_features[0])
        feats = tf.reshape(inputs['mags'], (-1, n_fft // 2))
        gram = tf.matmul(tf.transpose(feats), feats) / (n_frames)
        style_loss = 2 * tf.nn.l2_loss(gram - style_gram[0])
        for layer_i in range(n_layers):
            kernel_tf = tf.constant(
                kernels[layer_i],
                name="kernel{}".format(layer_i),
                dtype='float32')
            conv = tf.nn.conv2d(
                net,
                kernel_tf,
                strides=[1, stride, stride, 1],
                padding="VALID",
                name="conv{}".format(layer_i))
            net = tf.nn.relu(conv)
            content_loss = content_loss + \
                alpha * 2 * tf.nn.l2_loss(net - content_features[layer_i + 1])
            _, height, width, number = map(lambda i: i.value, net.get_shape())
            feats = tf.reshape(net, (-1, number))
            gram = tf.matmul(tf.transpose(feats), feats) / (n_frames)
            style_loss = style_loss + 2 * tf.nn.l2_loss(gram - style_gram[
                layer_i + 1])
        loss = content_loss + style_loss
        if optimizer == 'bfgs':
            opt = tf.contrib.opt.ScipyOptimizerInterface(
                loss, method='L-BFGS-B', options={'maxiter': iterations})
            # Optimization
            with tf.Session() as sess:
                sess.run(tf.initialize_all_variables())
                print('Started optimization.')
                opt.minimize(sess)
                result = x.eval()
        else:
            opt = tf.train.AdamOptimizer(
                learning_rate=learning_rate).minimize(loss)
            # Optimization
            with tf.Session() as sess:
                sess.run(tf.initialize_all_variables())
                print('Started optimization.')
                for i in range(iterations):
                    s, c, l, _ = sess.run([style_loss, content_loss, loss, opt])
                    print('Style:', s, 'Content:', c, end='\r')
                result = x.eval()
    return result


def run(content_fname,
        style_fname,
        output_fname,
        norm=False,
        input_features=['real', 'imag', 'mags'],
        n_fft=4096,
        n_layers=1,
        n_filters=4096,
        hop_length=256,
        alpha=0.05,
        k_w=15,
        k_h=3,
        optimizer='bfgs',
        stride=1,
        iterations=300,
        sr=22050):

    frame_size = n_fft // 2

    audio, fs = librosa.load(content_fname, sr=sr)
    content = chop(audio, hop_size=hop_length, frame_size=frame_size)
    audio, fs = librosa.load(style_fname, sr=sr)
    style = chop(audio, hop_size=hop_length, frame_size=frame_size)

    n_frames = min(content.shape[0], style.shape[0])
    n_samples = min(content.shape[1], style.shape[1])
    content = content[:n_frames, :n_samples]
    style = style[:n_frames, :n_samples]

    content_features, style_gram, kernels, freqs = compute_features(
        content=content,
        style=style,
        input_features=input_features,
        norm=norm,
        stride=stride,
        n_fft=n_fft,
        n_layers=n_layers,
        n_filters=n_filters,
        k_w=k_w,
        k_h=k_h)

    result = compute_stylization(
        kernels=kernels,
        freqs=freqs,
        input_features=input_features,
        norm=norm,
        n_samples=n_samples,
        n_frames=n_frames,
        n_fft=n_fft,
        content_features=content_features,
        style_gram=style_gram,
        stride=stride,
        n_layers=n_layers,
        alpha=alpha,
        optimizer=optimizer,
        iterations=iterations)

    s = unchop(result, hop_size=hop_length, frame_size=frame_size)
    librosa.output.write_wav(output_fname, s, sr=sr)
    s = utils.limiter(s)
    librosa.output.write_wav(output_fname + '.limiter.wav', s, sr=sr)


def batch(content_path, style_path, output_path, model):
    content_files = glob.glob('{}/*.wav'.format(content_path))
    style_files = glob.glob('{}/*.wav'.format(style_path))
    for content_fname in content_files:
        for style_fname in style_files:
            output_fname = '{}/{}+{}.wav'.format(output_path,
                                                 content_fname.split('/')[-1],
                                                 style_fname.split('/')[-1])
            if os.path.exists(output_fname):
                continue
            run(content_fname, style_fname, output_fname, model)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-s', '--style', help='style file', required=True)
    parser.add_argument('-c', '--content', help='content file', required=True)
    parser.add_argument('-o', '--output', help='output file', required=True)
    parser.add_argument(
        '-m',
        '--mode',
        help='mode for training [single] or batch',
        default='single')

    args = vars(parser.parse_args())
    if args['mode'] == 'single':
        run(args['content'], args['style'], args['output'])
    else:
        batch(args['content'], args['style'], args['output'])
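A minimal usage sketch for the time-domain model, assuming a TensorFlow 1.x environment (the module relies on tf.contrib.opt for the L-BFGS path); file names are placeholders and the keyword defaults mirror the argparse/run() defaults defined above:

# Hypothetical invocation of timedomain.run(); paths are placeholders.
from audio_style_transfer.models import timedomain

timedomain.run(content_fname='content.wav',
               style_fname='style.wav',
               output_fname='stylized.wav',
               n_fft=4096,
               input_features=['real', 'imag', 'mags'],
               optimizer='bfgs',   # or 'adam' to use the iterative Adam loop
               iterations=300)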
audio_style_transfer/models/uylanov.py
ADDED
@@ -0,0 +1,205 @@
"""NIPS2017 "Time Domain Neural Audio Style Transfer" code repository
Parag K. Mital
"""
import tensorflow as tf
import librosa
import numpy as np
import argparse
import glob
import os
from audio_style_transfer import utils


def read_audio_spectum(filename, n_fft=2048, hop_length=512, sr=22050):
    x, sr = librosa.load(filename, sr=sr)
    S = librosa.stft(x, n_fft, hop_length)
    S = np.log1p(np.abs(S)).T
    return S, sr


def compute_features(content,
                     style,
                     stride=1,
                     n_layers=1,
                     n_filters=4096,
                     k_h=1,
                     k_w=11):
    n_frames = content.shape[0]
    n_samples = content.shape[1]
    content_tf = np.ascontiguousarray(content)
    style_tf = np.ascontiguousarray(style)
    g = tf.Graph()
    kernels = []
    layers = []
    content_features = []
    style_features = []
    with g.as_default(), g.device('/cpu:0'), tf.Session():
        x = tf.placeholder('float32', [None, n_samples], name="x")
        net = tf.reshape(x, [1, 1, -1, n_samples])
        for layer_i in range(n_layers):
            if layer_i == 0:
                std = np.sqrt(2) * np.sqrt(2.0 / ((n_frames + n_filters) * k_w))
                kernel = np.random.randn(k_h, k_w, n_samples, n_filters) * std
            else:
                std = np.sqrt(2) * np.sqrt(2.0 / (
                    (n_filters + n_filters) * k_w))
                kernel = np.random.randn(k_h, k_w, n_filters, n_filters) * std
            kernels.append(kernel)
            kernel_tf = tf.constant(
                kernel, name="kernel{}".format(layer_i), dtype='float32')
            conv = tf.nn.conv2d(
                net,
                kernel_tf,
                strides=[1, stride, stride, 1],
                padding="VALID",
                name="conv{}".format(layer_i))
            net = tf.nn.relu(conv)
            layers.append(net)
            content_feature = net.eval(feed_dict={x: content_tf})
            content_features.append(content_feature)
            style_feature = net.eval(feed_dict={x: style_tf})
            features = np.reshape(style_feature, (-1, n_filters))
            style_gram = np.matmul(features.T, features) / n_frames
            style_features.append(style_gram)
    return content_features, style_features, kernels


def compute_stylization(kernels,
                        n_samples,
                        n_frames,
                        content_features,
                        style_features,
                        stride=1,
                        n_layers=1,
                        alpha=1e-4,
                        learning_rate=1e-3,
                        iterations=100):
    result = None
    with tf.Graph().as_default():
        x = tf.Variable(
            np.random.randn(1, 1, n_frames, n_samples).astype(np.float32) *
            1e-3,
            name="x")
        net = x
        content_loss = 0
        style_loss = 0
        for layer_i in range(n_layers):
            kernel_tf = tf.constant(
                kernels[layer_i],
                name="kernel{}".format(layer_i),
                dtype='float32')
            conv = tf.nn.conv2d(
                net,
                kernel_tf,
                strides=[1, stride, stride, 1],
                padding="VALID",
                name="conv{}".format(layer_i))
            net = tf.nn.relu(conv)
            content_loss = content_loss + \
                alpha * 2 * tf.nn.l2_loss(net - content_features[layer_i])
            _, height, width, number = map(lambda i: i.value, net.get_shape())
            feats = tf.reshape(net, (-1, number))
            gram = tf.matmul(tf.transpose(feats), feats) / n_frames
            style_loss = style_loss + 2 * tf.nn.l2_loss(gram - style_features[
                layer_i])
        loss = content_loss + style_loss
        opt = tf.contrib.opt.ScipyOptimizerInterface(
            loss, method='L-BFGS-B', options={'maxiter': iterations})
        # Optimization
        with tf.Session() as sess:
            sess.run(tf.initialize_all_variables())
            print('Started optimization.')
            opt.minimize(sess)
            print('Final loss:', loss.eval())
            result = x.eval()
    return result


def run(content_fname,
        style_fname,
        output_fname,
        n_fft=2048,
        hop_length=256,
        alpha=0.02,
        n_layers=1,
        n_filters=8192,
        k_w=15,
        stride=1,
        iterations=300,
        phase_iterations=500,
        sr=22050,
        signal_length=1,  # second
        block_length=1024):

    content, sr = read_audio_spectum(
        content_fname, n_fft=n_fft, hop_length=hop_length, sr=sr)
    style, sr = read_audio_spectum(
        style_fname, n_fft=n_fft, hop_length=hop_length, sr=sr)

    n_frames = min(content.shape[0], style.shape[0])
    n_samples = content.shape[1]
    content = content[:n_frames, :]
    style = style[:n_frames, :]

    content_features, style_features, kernels = compute_features(
        content=content,
        style=style,
        stride=stride,
        n_layers=n_layers,
        n_filters=n_filters,
        k_w=k_w)

    result = compute_stylization(
        kernels=kernels,
        n_samples=n_samples,
        n_frames=n_frames,
        content_features=content_features,
        style_features=style_features,
        stride=stride,
        n_layers=n_layers,
        alpha=alpha,
        iterations=iterations)

    mags = np.zeros_like(content.T)
    mags[:, :n_frames] = np.exp(result[0, 0].T) - 1

    p = 2 * np.pi * np.random.random_sample(mags.shape) - np.pi
    for i in range(phase_iterations):
        S = mags * np.exp(1j * p)
        x = librosa.istft(S, hop_length)
        p = np.angle(librosa.stft(x, n_fft, hop_length))

    librosa.output.write_wav('prelimiter.wav', x, sr)
    limited = utils.limiter(x)
    librosa.output.write_wav(output_fname, limited, sr)


def batch(content_path, style_path, output_path):
    content_files = glob.glob('{}/*.wav'.format(content_path))
    style_files = glob.glob('{}/*.wav'.format(style_path))
    for content_filename in content_files:
        for style_filename in style_files:
            output_filename = '{}/{}+{}.wav'.format(
                output_path,
                content_filename.split('/')[-1], style_filename.split('/')[-1])
            if os.path.exists(output_filename):
                continue
            run(content_filename, style_filename, output_filename)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-s', '--style', help='style file', required=True)
    parser.add_argument('-c', '--content', help='content file', required=True)
    parser.add_argument('-o', '--output', help='output file', required=True)
    parser.add_argument(
        '-m',
        '--mode',
        help='mode for training [single] or batch',
        default='single')

    args = vars(parser.parse_args())
    if args['mode'] == 'single':
        run(args['content'], args['style'], args['output'])
    else:
        batch(args['content'], args['style'], args['output'])
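A minimal usage sketch for this spectrogram-domain variant, which recovers phase with the iterative Griffin-Lim-style loop in run() above; assumes a TensorFlow 1.x environment, and the file names are placeholders:

# Hypothetical invocation of uylanov.run(); paths are placeholders.
from audio_style_transfer.models import uylanov

uylanov.run(content_fname='content.wav',
            style_fname='style.wav',
            output_fname='stylized.wav',
            iterations=300,        # L-BFGS steps on the log-magnitude spectrogram
            phase_iterations=500)  # phase-reconstruction iterations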
audio_style_transfer/utils.py
ADDED
@@ -0,0 +1,199 @@
"""NIPS2017 "Time Domain Neural Audio Style Transfer" code repository
Parag K. Mital
"""
import glob
import numpy as np
from scipy.signal import hann
import librosa
import matplotlib
import matplotlib.pyplot as plt
import os


def limiter(signal,
            delay=40,
            threshold=0.9,
            release_coeff=0.9995,
            attack_coeff=0.9):

    delay_index = 0
    envelope = 0
    gain = 1
    delay = delay
    delay_line = np.zeros(delay)
    release_coeff = release_coeff
    attack_coeff = attack_coeff
    threshold = threshold

    for idx, sample in enumerate(signal):
        delay_line[delay_index] = sample
        delay_index = (delay_index + 1) % delay

        # calculate an envelope of the signal
        envelope = max(np.abs(sample), envelope * release_coeff)

        if envelope > threshold:
            target_gain = threshold / envelope
        else:
            target_gain = 1.0

        # have gain go towards a desired limiter gain
        gain = (gain * attack_coeff + target_gain * (1 - attack_coeff))

        # limit the delayed signal
        signal[idx] = delay_line[delay_index] * gain
    return signal


def chop(signal, hop_size=256, frame_size=512):
    n_hops = len(signal) // hop_size
    frames = []
    hann_win = hann(frame_size)
    for hop_i in range(n_hops):
        frame = signal[(hop_i * hop_size):(hop_i * hop_size + frame_size)]
        frame = np.pad(frame, (0, frame_size - len(frame)), 'constant')
        frame *= hann_win
        frames.append(frame)
    frames = np.array(frames)
    return frames


def unchop(frames, hop_size=256, frame_size=512):
    signal = np.zeros((frames.shape[0] * hop_size + frame_size,))
    for hop_i, frame in enumerate(frames):
        signal[(hop_i * hop_size):(hop_i * hop_size + frame_size)] += frame
    return signal


def matrix_dft(V):
    N = len(V)
    w = np.exp(-2j * np.pi / N)
    col = np.vander([w], N, True)
    W = np.vander(col.flatten(), N, True) / np.sqrt(N)
    return np.dot(W, V)


def dft_np(signal, hop_size=256, fft_size=512):
    s = chop(signal, hop_size, fft_size)
    N = s.shape[-1]
    k = np.reshape(
        np.linspace(0.0, 2 * np.pi / N * (N // 2), N // 2), [1, N // 2])
    x = np.reshape(np.linspace(0.0, N - 1, N), [N, 1])
    freqs = np.dot(x, k)
    real = np.dot(s, np.cos(freqs)) * (2.0 / N)
    imag = np.dot(s, np.sin(freqs)) * (2.0 / N)
    return real, imag


def idft_np(re, im, hop_size=256, fft_size=512):
    N = re.shape[1] * 2
    k = np.reshape(
        np.linspace(0.0, 2 * np.pi / N * (N // 2), N // 2), [N // 2, 1])
    x = np.reshape(np.linspace(0.0, N - 1, N), [1, N])
    freqs = np.dot(k, x)
    signal = np.zeros((re.shape[0] * hop_size + fft_size,))
    recon = np.dot(re, np.cos(freqs)) + np.dot(im, np.sin(freqs))
    for hop_i, frame in enumerate(recon):
        signal[(hop_i * hop_size):(hop_i * hop_size + fft_size)] += frame
    return signal


def rainbowgram(path,
                ax,
                peak=70.0,
                use_cqt=False,
                n_fft=1024,
                hop_length=256,
                sr=22050,
                over_sample=4,
                res_factor=0.8,
                octaves=5,
                notes_per_octave=10):
    audio = librosa.load(path, sr=sr)[0]
    if use_cqt:
        C = librosa.cqt(audio,
                        sr=sr,
                        hop_length=hop_length,
                        bins_per_octave=int(notes_per_octave * over_sample),
                        n_bins=int(octaves * notes_per_octave * over_sample),
                        filter_scale=res_factor,
                        fmin=librosa.note_to_hz('C2'))
    else:
        C = librosa.stft(
            audio,
            n_fft=n_fft,
            win_length=n_fft,
            hop_length=hop_length,
            center=True)
    mag, phase = librosa.core.magphase(C)
    phase_angle = np.angle(phase)
    phase_unwrapped = np.unwrap(phase_angle)
    dphase = phase_unwrapped[:, 1:] - phase_unwrapped[:, :-1]
    dphase = np.concatenate([phase_unwrapped[:, 0:1], dphase], axis=1) / np.pi
    mag = (librosa.logamplitude(
        mag**2, amin=1e-13, top_db=peak, ref_power=np.max) / peak) + 1
    cdict = {
        'red': ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
        'green': ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
        'blue': ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
        'alpha': ((0.0, 1.0, 1.0), (1.0, 0.0, 0.0))
    }
    my_mask = matplotlib.colors.LinearSegmentedColormap('MyMask', cdict)
|
142 |
+
plt.register_cmap(cmap=my_mask)
|
143 |
+
ax.matshow(dphase[::-1, :], cmap=plt.cm.rainbow)
|
144 |
+
ax.matshow(mag[::-1, :], cmap=my_mask)
|
145 |
+
|
146 |
+
|
147 |
+
def rainbowgrams(list_of_paths,
|
148 |
+
saveto=None,
|
149 |
+
rows=2,
|
150 |
+
cols=4,
|
151 |
+
col_labels=[],
|
152 |
+
row_labels=[],
|
153 |
+
use_cqt=True,
|
154 |
+
figsize=(15, 20),
|
155 |
+
peak=70.0):
|
156 |
+
"""Build a CQT rowsXcols.
|
157 |
+
"""
|
158 |
+
N = len(list_of_paths)
|
159 |
+
assert N == rows * cols
|
160 |
+
fig, axes = plt.subplots(
|
161 |
+
rows, cols, sharex=True, sharey=True, figsize=figsize)
|
162 |
+
fig.subplots_adjust(left=0.05, right=0.95, wspace=0.05, hspace=0.1)
|
163 |
+
# fig = plt.figure(figsize=(18, N * 1.25))
|
164 |
+
for i, path in enumerate(list_of_paths):
|
165 |
+
row = int(i / cols)
|
166 |
+
col = i % cols
|
167 |
+
if rows == 1 and cols == 1:
|
168 |
+
ax = axes
|
169 |
+
elif rows == 1:
|
170 |
+
ax = axes[col]
|
171 |
+
elif cols == 1:
|
172 |
+
ax = axes[row]
|
173 |
+
else:
|
174 |
+
ax = axes[row, col]
|
175 |
+
rainbowgram(path, ax, peak, use_cqt)
|
176 |
+
ax.set_axis_bgcolor('white')
|
177 |
+
ax.set_xticks([])
|
178 |
+
ax.set_yticks([])
|
179 |
+
if col == 0 and row_labels:
|
180 |
+
ax.set_ylabel(row_labels[row])
|
181 |
+
if row == rows - 1 and col_labels:
|
182 |
+
ax.set_xlabel(col_labels[col])
|
183 |
+
if saveto is not None:
|
184 |
+
fig.savefig(filename='{}.png'.format(saveto))
|
185 |
+
|
186 |
+
|
187 |
+
def plot_rainbowgrams():
|
188 |
+
for root in ['target', 'corpus', 'results']:
|
189 |
+
files = glob.glob('{}/**/*.wav'.format(root), recursive=True)
|
190 |
+
for f in files:
|
191 |
+
fname = '{}.png'.format(f)
|
192 |
+
if not os.path.exists(fname):
|
193 |
+
rainbowgrams(
|
194 |
+
[f],
|
195 |
+
saveto=fname,
|
196 |
+
figsize=(20, 5),
|
197 |
+
rows=1,
|
198 |
+
cols=1)
|
199 |
+
plt.close('all')
|
environment.yml
ADDED
@@ -0,0 +1,117 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
name: tdnast
|
2 |
+
channels:
|
3 |
+
- defaults
|
4 |
+
dependencies:
|
5 |
+
- _libgcc_mutex=0.1=main
|
6 |
+
- _openmp_mutex=4.5=1_gnu
|
7 |
+
- _tflow_select=2.1.0=gpu
|
8 |
+
- absl-py=0.15.0=pyhd3eb1b0_0
|
9 |
+
- astor=0.8.1=py37h06a4308_0
|
10 |
+
- blas=1.0=mkl
|
11 |
+
- brotli=1.0.9=he6710b0_2
|
12 |
+
- c-ares=1.18.1=h7f8727e_0
|
13 |
+
- ca-certificates=2022.3.29=h06a4308_0
|
14 |
+
- cached-property=1.5.2=py_0
|
15 |
+
- certifi=2021.10.8=py37h06a4308_2
|
16 |
+
- cudatoolkit=10.0.130=0
|
17 |
+
- cudnn=7.6.5=cuda10.0_0
|
18 |
+
- cupti=10.0.130=0
|
19 |
+
- cycler=0.11.0=pyhd3eb1b0_0
|
20 |
+
- dbus=1.13.18=hb2f20db_0
|
21 |
+
- expat=2.4.4=h295c915_0
|
22 |
+
- fontconfig=2.13.1=h6c09931_0
|
23 |
+
- fonttools=4.25.0=pyhd3eb1b0_0
|
24 |
+
- freetype=2.11.0=h70c0345_0
|
25 |
+
- gast=0.2.2=py37_0
|
26 |
+
- giflib=5.2.1=h7b6447c_0
|
27 |
+
- glib=2.69.1=h4ff587b_1
|
28 |
+
- google-pasta=0.2.0=pyhd3eb1b0_0
|
29 |
+
- grpcio=1.42.0=py37hce63b2e_0
|
30 |
+
- gst-plugins-base=1.14.0=h8213a91_2
|
31 |
+
- gstreamer=1.14.0=h28cd5cc_2
|
32 |
+
- h5py=3.6.0=py37ha0f2276_0
|
33 |
+
- hdf5=1.10.6=hb1b8bf9_0
|
34 |
+
- icu=58.2=he6710b0_3
|
35 |
+
- importlib-metadata=4.11.3=py37h06a4308_0
|
36 |
+
- intel-openmp=2021.4.0=h06a4308_3561
|
37 |
+
- jpeg=9d=h7f8727e_0
|
38 |
+
- keras-applications=1.0.8=py_1
|
39 |
+
- keras-preprocessing=1.1.2=pyhd3eb1b0_0
|
40 |
+
- kiwisolver=1.3.2=py37h295c915_0
|
41 |
+
- lcms2=2.12=h3be6417_0
|
42 |
+
- ld_impl_linux-64=2.35.1=h7274673_9
|
43 |
+
- libffi=3.3=he6710b0_2
|
44 |
+
- libgcc-ng=9.3.0=h5101ec6_17
|
45 |
+
- libgfortran-ng=7.5.0=ha8ba4b0_17
|
46 |
+
- libgfortran4=7.5.0=ha8ba4b0_17
|
47 |
+
- libgomp=9.3.0=h5101ec6_17
|
48 |
+
- libpng=1.6.37=hbc83047_0
|
49 |
+
- libprotobuf=3.19.1=h4ff587b_0
|
50 |
+
- libstdcxx-ng=9.3.0=hd4cf53a_17
|
51 |
+
- libtiff=4.2.0=h85742a9_0
|
52 |
+
- libuuid=1.0.3=h7f8727e_2
|
53 |
+
- libwebp=1.2.2=h55f646e_0
|
54 |
+
- libwebp-base=1.2.2=h7f8727e_0
|
55 |
+
- libxcb=1.14=h7b6447c_0
|
56 |
+
- libxml2=2.9.12=h03d6c58_0
|
57 |
+
- lz4-c=1.9.3=h295c915_1
|
58 |
+
- markdown=3.3.4=py37h06a4308_0
|
59 |
+
- matplotlib=3.5.1=py37h06a4308_1
|
60 |
+
- matplotlib-base=3.5.1=py37ha18d171_1
|
61 |
+
- mkl=2021.4.0=h06a4308_640
|
62 |
+
- mkl-service=2.4.0=py37h7f8727e_0
|
63 |
+
- mkl_fft=1.3.1=py37hd3c417c_0
|
64 |
+
- mkl_random=1.2.2=py37h51133e4_0
|
65 |
+
- munkres=1.1.4=py_0
|
66 |
+
- ncurses=6.3=h7f8727e_2
|
67 |
+
- numpy=1.21.2=py37h20f2e39_0
|
68 |
+
- numpy-base=1.21.2=py37h79a1101_0
|
69 |
+
- openssl=1.1.1n=h7f8727e_0
|
70 |
+
- opt_einsum=3.3.0=pyhd3eb1b0_1
|
71 |
+
- packaging=21.3=pyhd3eb1b0_0
|
72 |
+
- pcre=8.45=h295c915_0
|
73 |
+
- pillow=9.0.1=py37h22f2fdc_0
|
74 |
+
- pip=21.2.2=py37h06a4308_0
|
75 |
+
- protobuf=3.19.1=py37h295c915_0
|
76 |
+
- pyparsing=3.0.4=pyhd3eb1b0_0
|
77 |
+
- pyqt=5.9.2=py37h05f1152_2
|
78 |
+
- python=3.7.13=h12debd9_0
|
79 |
+
- python-dateutil=2.8.2=pyhd3eb1b0_0
|
80 |
+
- qt=5.9.7=h5867ecd_1
|
81 |
+
- readline=8.1.2=h7f8727e_1
|
82 |
+
- scipy=1.7.3=py37hc147768_0
|
83 |
+
- setuptools=58.0.4=py37h06a4308_0
|
84 |
+
- sip=4.19.8=py37hf484d3e_0
|
85 |
+
- six=1.16.0=pyhd3eb1b0_1
|
86 |
+
- sqlite=3.38.2=hc218d9a_0
|
87 |
+
- tensorboard=1.15.0=pyhb230dea_0
|
88 |
+
- tensorflow=1.15.0=gpu_py37h0f0df58_0
|
89 |
+
- tensorflow-base=1.15.0=gpu_py37h9dcbed7_0
|
90 |
+
- tensorflow-estimator=1.15.1=pyh2649769_0
|
91 |
+
- tensorflow-gpu=1.15.0=h0d30ee6_0
|
92 |
+
- termcolor=1.1.0=py37h06a4308_1
|
93 |
+
- tk=8.6.11=h1ccaba5_0
|
94 |
+
- tornado=6.1=py37h27cfd23_0
|
95 |
+
- typing_extensions=4.1.1=pyh06a4308_0
|
96 |
+
- webencodings=0.5.1=py37_1
|
97 |
+
- werkzeug=0.16.1=py_0
|
98 |
+
- wheel=0.37.1=pyhd3eb1b0_0
|
99 |
+
- wrapt=1.13.3=py37h7f8727e_2
|
100 |
+
- xz=5.2.5=h7b6447c_0
|
101 |
+
- zipp=3.7.0=pyhd3eb1b0_0
|
102 |
+
- zlib=1.2.11=h7f8727e_4
|
103 |
+
- zstd=1.4.9=haebb681_0
|
104 |
+
- pip:
|
105 |
+
- audioread==2.1.9
|
106 |
+
- cffi==1.15.0
|
107 |
+
- decorator==5.1.1
|
108 |
+
- joblib==1.1.0
|
109 |
+
- librosa==0.7.2
|
110 |
+
- llvmlite==0.31.0
|
111 |
+
- numba==0.48.0
|
112 |
+
- pycparser==2.21
|
113 |
+
- resampy==0.2.2
|
114 |
+
- scikit-learn==1.0.2
|
115 |
+
- soundfile==0.10.3.post1
|
116 |
+
- threadpoolctl==3.1.0
|
117 |
+
prefix: /home/pkmital/anaconda3/envs/tdnast
|
nips_2017.sty
ADDED
@@ -0,0 +1,339 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
% partial rewrite of the LaTeX2e package for submissions to the
|
2 |
+
% Conference on Neural Information Processing Systems (NIPS):
|
3 |
+
%
|
4 |
+
% - uses more LaTeX conventions
|
5 |
+
% - line numbers at submission time replaced with aligned numbers from
|
6 |
+
% lineno package
|
7 |
+
% - \nipsfinalcopy replaced with [final] package option
|
8 |
+
% - automatically loads times package for authors
|
9 |
+
% - loads natbib automatically; this can be suppressed with the
|
10 |
+
% [nonatbib] package option
|
11 |
+
% - adds foot line to first page identifying the conference
|
12 |
+
%
|
13 |
+
% Roman Garnett (garnett@wustl.edu) and the many authors of
|
14 |
+
% nips15submit_e.sty, including MK and drstrip@sandia
|
15 |
+
%
|
16 |
+
% last revision: March 2017
|
17 |
+
|
18 |
+
\NeedsTeXFormat{LaTeX2e}
|
19 |
+
\ProvidesPackage{nips_2017}[2017/03/20 NIPS 2017 submission/camera-ready style file]
|
20 |
+
|
21 |
+
% declare final option, which creates camera-ready copy
|
22 |
+
\newif\if@nipsfinal\@nipsfinalfalse
|
23 |
+
\DeclareOption{final}{
|
24 |
+
\@nipsfinaltrue
|
25 |
+
}
|
26 |
+
|
27 |
+
% declare nonatbib option, which does not load natbib in case of
|
28 |
+
% package clash (users can pass options to natbib via
|
29 |
+
% \PassOptionsToPackage)
|
30 |
+
\newif\if@natbib\@natbibtrue
|
31 |
+
\DeclareOption{nonatbib}{
|
32 |
+
\@natbibfalse
|
33 |
+
}
|
34 |
+
|
35 |
+
\ProcessOptions\relax
|
36 |
+
|
37 |
+
% fonts
|
38 |
+
\renewcommand{\rmdefault}{ptm}
|
39 |
+
\renewcommand{\sfdefault}{phv}
|
40 |
+
|
41 |
+
% change this every year for notice string at bottom
|
42 |
+
\newcommand{\@nipsordinal}{31st}
|
43 |
+
\newcommand{\@nipsyear}{2017}
|
44 |
+
\newcommand{\@nipslocation}{Long Beach, CA, USA}
|
45 |
+
|
46 |
+
% handle tweaks for camera-ready copy vs. submission copy
|
47 |
+
\if@nipsfinal
|
48 |
+
\newcommand{\@noticestring}{%
|
49 |
+
\@nipsordinal\/ Conference on Neural Information Processing Systems
|
50 |
+
(NIPS \@nipsyear), \@nipslocation.%
|
51 |
+
}
|
52 |
+
\else
|
53 |
+
\newcommand{\@noticestring}{%
|
54 |
+
Submitted to \@nipsordinal\/ Conference on Neural Information
|
55 |
+
Processing Systems (NIPS \@nipsyear). Do not distribute.%
|
56 |
+
}
|
57 |
+
|
58 |
+
% line numbers for submission
|
59 |
+
\RequirePackage{lineno}
|
60 |
+
\linenumbers
|
61 |
+
|
62 |
+
% fix incompatibilities between lineno and amsmath, if required, by
|
63 |
+
% transparently wrapping linenomath environments around amsmath
|
64 |
+
% environments
|
65 |
+
\AtBeginDocument{%
|
66 |
+
\@ifpackageloaded{amsmath}{%
|
67 |
+
\newcommand*\patchAmsMathEnvironmentForLineno[1]{%
|
68 |
+
\expandafter\let\csname old#1\expandafter\endcsname\csname #1\endcsname
|
69 |
+
\expandafter\let\csname oldend#1\expandafter\endcsname\csname end#1\endcsname
|
70 |
+
\renewenvironment{#1}%
|
71 |
+
{\linenomath\csname old#1\endcsname}%
|
72 |
+
{\csname oldend#1\endcsname\endlinenomath}%
|
73 |
+
}%
|
74 |
+
\newcommand*\patchBothAmsMathEnvironmentsForLineno[1]{%
|
75 |
+
\patchAmsMathEnvironmentForLineno{#1}%
|
76 |
+
\patchAmsMathEnvironmentForLineno{#1*}%
|
77 |
+
}%
|
78 |
+
\patchBothAmsMathEnvironmentsForLineno{equation}%
|
79 |
+
\patchBothAmsMathEnvironmentsForLineno{align}%
|
80 |
+
\patchBothAmsMathEnvironmentsForLineno{flalign}%
|
81 |
+
\patchBothAmsMathEnvironmentsForLineno{alignat}%
|
82 |
+
\patchBothAmsMathEnvironmentsForLineno{gather}%
|
83 |
+
\patchBothAmsMathEnvironmentsForLineno{multline}%
|
84 |
+
}{}
|
85 |
+
}
|
86 |
+
\fi
|
87 |
+
|
88 |
+
% load natbib unless told otherwise
|
89 |
+
\if@natbib
|
90 |
+
\RequirePackage{natbib}
|
91 |
+
\fi
|
92 |
+
|
93 |
+
% set page geometry
|
94 |
+
\usepackage[verbose=true,letterpaper]{geometry}
|
95 |
+
\AtBeginDocument{
|
96 |
+
\newgeometry{
|
97 |
+
textheight=9in,
|
98 |
+
textwidth=5.5in,
|
99 |
+
top=1in,
|
100 |
+
headheight=12pt,
|
101 |
+
headsep=25pt,
|
102 |
+
footskip=30pt
|
103 |
+
}
|
104 |
+
\@ifpackageloaded{fullpage}
|
105 |
+
{\PackageWarning{nips_2016}{fullpage package not allowed! Overwriting formatting.}}
|
106 |
+
{}
|
107 |
+
}
|
108 |
+
|
109 |
+
\widowpenalty=10000
|
110 |
+
\clubpenalty=10000
|
111 |
+
\flushbottom
|
112 |
+
\sloppy
|
113 |
+
|
114 |
+
% font sizes with reduced leading
|
115 |
+
\renewcommand{\normalsize}{%
|
116 |
+
\@setfontsize\normalsize\@xpt\@xipt
|
117 |
+
\abovedisplayskip 7\p@ \@plus 2\p@ \@minus 5\p@
|
118 |
+
\abovedisplayshortskip \z@ \@plus 3\p@
|
119 |
+
\belowdisplayskip \abovedisplayskip
|
120 |
+
\belowdisplayshortskip 4\p@ \@plus 3\p@ \@minus 3\p@
|
121 |
+
}
|
122 |
+
\normalsize
|
123 |
+
\renewcommand{\small}{%
|
124 |
+
\@setfontsize\small\@ixpt\@xpt
|
125 |
+
\abovedisplayskip 6\p@ \@plus 1.5\p@ \@minus 4\p@
|
126 |
+
\abovedisplayshortskip \z@ \@plus 2\p@
|
127 |
+
\belowdisplayskip \abovedisplayskip
|
128 |
+
\belowdisplayshortskip 3\p@ \@plus 2\p@ \@minus 2\p@
|
129 |
+
}
|
130 |
+
\renewcommand{\footnotesize}{\@setfontsize\footnotesize\@ixpt\@xpt}
|
131 |
+
\renewcommand{\scriptsize}{\@setfontsize\scriptsize\@viipt\@viiipt}
|
132 |
+
\renewcommand{\tiny}{\@setfontsize\tiny\@vipt\@viipt}
|
133 |
+
\renewcommand{\large}{\@setfontsize\large\@xiipt{14}}
|
134 |
+
\renewcommand{\Large}{\@setfontsize\Large\@xivpt{16}}
|
135 |
+
\renewcommand{\LARGE}{\@setfontsize\LARGE\@xviipt{20}}
|
136 |
+
\renewcommand{\huge}{\@setfontsize\huge\@xxpt{23}}
|
137 |
+
\renewcommand{\Huge}{\@setfontsize\Huge\@xxvpt{28}}
|
138 |
+
|
139 |
+
% sections with less space
|
140 |
+
\providecommand{\section}{}
|
141 |
+
\renewcommand{\section}{%
|
142 |
+
\@startsection{section}{1}{\z@}%
|
143 |
+
{-2.0ex \@plus -0.5ex \@minus -0.2ex}%
|
144 |
+
{ 1.5ex \@plus 0.3ex \@minus 0.2ex}%
|
145 |
+
{\large\bf\raggedright}%
|
146 |
+
}
|
147 |
+
\providecommand{\subsection}{}
|
148 |
+
\renewcommand{\subsection}{%
|
149 |
+
\@startsection{subsection}{2}{\z@}%
|
150 |
+
{-1.8ex \@plus -0.5ex \@minus -0.2ex}%
|
151 |
+
{ 0.8ex \@plus 0.2ex}%
|
152 |
+
{\normalsize\bf\raggedright}%
|
153 |
+
}
|
154 |
+
\providecommand{\subsubsection}{}
|
155 |
+
\renewcommand{\subsubsection}{%
|
156 |
+
\@startsection{subsubsection}{3}{\z@}%
|
157 |
+
{-1.5ex \@plus -0.5ex \@minus -0.2ex}%
|
158 |
+
{ 0.5ex \@plus 0.2ex}%
|
159 |
+
{\normalsize\bf\raggedright}%
|
160 |
+
}
|
161 |
+
\providecommand{\paragraph}{}
|
162 |
+
\renewcommand{\paragraph}{%
|
163 |
+
\@startsection{paragraph}{4}{\z@}%
|
164 |
+
{1.5ex \@plus 0.5ex \@minus 0.2ex}%
|
165 |
+
{-1em}%
|
166 |
+
{\normalsize\bf}%
|
167 |
+
}
|
168 |
+
\providecommand{\subparagraph}{}
|
169 |
+
\renewcommand{\subparagraph}{%
|
170 |
+
\@startsection{subparagraph}{5}{\z@}%
|
171 |
+
{1.5ex \@plus 0.5ex \@minus 0.2ex}%
|
172 |
+
{-1em}%
|
173 |
+
{\normalsize\bf}%
|
174 |
+
}
|
175 |
+
\providecommand{\subsubsubsection}{}
|
176 |
+
\renewcommand{\subsubsubsection}{%
|
177 |
+
\vskip5pt{\noindent\normalsize\rm\raggedright}%
|
178 |
+
}
|
179 |
+
|
180 |
+
% float placement
|
181 |
+
\renewcommand{\topfraction }{0.85}
|
182 |
+
\renewcommand{\bottomfraction }{0.4}
|
183 |
+
\renewcommand{\textfraction }{0.1}
|
184 |
+
\renewcommand{\floatpagefraction}{0.7}
|
185 |
+
|
186 |
+
\newlength{\@nipsabovecaptionskip}\setlength{\@nipsabovecaptionskip}{7\p@}
|
187 |
+
\newlength{\@nipsbelowcaptionskip}\setlength{\@nipsbelowcaptionskip}{\z@}
|
188 |
+
|
189 |
+
\setlength{\abovecaptionskip}{\@nipsabovecaptionskip}
|
190 |
+
\setlength{\belowcaptionskip}{\@nipsbelowcaptionskip}
|
191 |
+
|
192 |
+
% swap above/belowcaptionskip lengths for tables
|
193 |
+
\renewenvironment{table}
|
194 |
+
{\setlength{\abovecaptionskip}{\@nipsbelowcaptionskip}%
|
195 |
+
\setlength{\belowcaptionskip}{\@nipsabovecaptionskip}%
|
196 |
+
\@float{table}}
|
197 |
+
{\end@float}
|
198 |
+
|
199 |
+
% footnote formatting
|
200 |
+
\setlength{\footnotesep }{6.65\p@}
|
201 |
+
\setlength{\skip\footins}{9\p@ \@plus 4\p@ \@minus 2\p@}
|
202 |
+
\renewcommand{\footnoterule}{\kern-3\p@ \hrule width 12pc \kern 2.6\p@}
|
203 |
+
\setcounter{footnote}{0}
|
204 |
+
|
205 |
+
% paragraph formatting
|
206 |
+
\setlength{\parindent}{\z@}
|
207 |
+
\setlength{\parskip }{5.5\p@}
|
208 |
+
|
209 |
+
% list formatting
|
210 |
+
\setlength{\topsep }{4\p@ \@plus 1\p@ \@minus 2\p@}
|
211 |
+
\setlength{\partopsep }{1\p@ \@plus 0.5\p@ \@minus 0.5\p@}
|
212 |
+
\setlength{\itemsep }{2\p@ \@plus 1\p@ \@minus 0.5\p@}
|
213 |
+
\setlength{\parsep }{2\p@ \@plus 1\p@ \@minus 0.5\p@}
|
214 |
+
\setlength{\leftmargin }{3pc}
|
215 |
+
\setlength{\leftmargini }{\leftmargin}
|
216 |
+
\setlength{\leftmarginii }{2em}
|
217 |
+
\setlength{\leftmarginiii}{1.5em}
|
218 |
+
\setlength{\leftmarginiv }{1.0em}
|
219 |
+
\setlength{\leftmarginv }{0.5em}
|
220 |
+
\def\@listi {\leftmargin\leftmargini}
|
221 |
+
\def\@listii {\leftmargin\leftmarginii
|
222 |
+
\labelwidth\leftmarginii
|
223 |
+
\advance\labelwidth-\labelsep
|
224 |
+
\topsep 2\p@ \@plus 1\p@ \@minus 0.5\p@
|
225 |
+
\parsep 1\p@ \@plus 0.5\p@ \@minus 0.5\p@
|
226 |
+
\itemsep \parsep}
|
227 |
+
\def\@listiii{\leftmargin\leftmarginiii
|
228 |
+
\labelwidth\leftmarginiii
|
229 |
+
\advance\labelwidth-\labelsep
|
230 |
+
\topsep 1\p@ \@plus 0.5\p@ \@minus 0.5\p@
|
231 |
+
\parsep \z@
|
232 |
+
\partopsep 0.5\p@ \@plus 0\p@ \@minus 0.5\p@
|
233 |
+
\itemsep \topsep}
|
234 |
+
\def\@listiv {\leftmargin\leftmarginiv
|
235 |
+
\labelwidth\leftmarginiv
|
236 |
+
\advance\labelwidth-\labelsep}
|
237 |
+
\def\@listv {\leftmargin\leftmarginv
|
238 |
+
\labelwidth\leftmarginv
|
239 |
+
\advance\labelwidth-\labelsep}
|
240 |
+
\def\@listvi {\leftmargin\leftmarginvi
|
241 |
+
\labelwidth\leftmarginvi
|
242 |
+
\advance\labelwidth-\labelsep}
|
243 |
+
|
244 |
+
% create title
|
245 |
+
\providecommand{\maketitle}{}
|
246 |
+
\renewcommand{\maketitle}{%
|
247 |
+
\par
|
248 |
+
\begingroup
|
249 |
+
\renewcommand{\thefootnote}{\fnsymbol{footnote}}
|
250 |
+
% for perfect author name centering
|
251 |
+
\renewcommand{\@makefnmark}{\hbox to \z@{$^{\@thefnmark}$\hss}}
|
252 |
+
% The footnote-mark was overlapping the footnote-text,
|
253 |
+
% added the following to fix this problem (MK)
|
254 |
+
\long\def\@makefntext##1{%
|
255 |
+
\parindent 1em\noindent
|
256 |
+
\hbox to 1.8em{\hss $\m@th ^{\@thefnmark}$}##1
|
257 |
+
}
|
258 |
+
\thispagestyle{empty}
|
259 |
+
\@maketitle
|
260 |
+
\@thanks
|
261 |
+
\@notice
|
262 |
+
\endgroup
|
263 |
+
\let\maketitle\relax
|
264 |
+
\let\thanks\relax
|
265 |
+
}
|
266 |
+
|
267 |
+
% rules for title box at top of first page
|
268 |
+
\newcommand{\@toptitlebar}{
|
269 |
+
\hrule height 4\p@
|
270 |
+
\vskip 0.25in
|
271 |
+
\vskip -\parskip%
|
272 |
+
}
|
273 |
+
\newcommand{\@bottomtitlebar}{
|
274 |
+
\vskip 0.29in
|
275 |
+
\vskip -\parskip
|
276 |
+
\hrule height 1\p@
|
277 |
+
\vskip 0.09in%
|
278 |
+
}
|
279 |
+
|
280 |
+
% create title (includes both anonymized and non-anonymized versions)
|
281 |
+
\providecommand{\@maketitle}{}
|
282 |
+
\renewcommand{\@maketitle}{%
|
283 |
+
\vbox{%
|
284 |
+
\hsize\textwidth
|
285 |
+
\linewidth\hsize
|
286 |
+
\vskip 0.1in
|
287 |
+
\@toptitlebar
|
288 |
+
\centering
|
289 |
+
{\LARGE\bf \@title\par}
|
290 |
+
\@bottomtitlebar
|
291 |
+
\if@nipsfinal
|
292 |
+
\def\And{%
|
293 |
+
\end{tabular}\hfil\linebreak[0]\hfil%
|
294 |
+
\begin{tabular}[t]{c}\bf\rule{\z@}{24\p@}\ignorespaces%
|
295 |
+
}
|
296 |
+
\def\AND{%
|
297 |
+
\end{tabular}\hfil\linebreak[4]\hfil%
|
298 |
+
\begin{tabular}[t]{c}\bf\rule{\z@}{24\p@}\ignorespaces%
|
299 |
+
}
|
300 |
+
\begin{tabular}[t]{c}\bf\rule{\z@}{24\p@}\@author\end{tabular}%
|
301 |
+
\else
|
302 |
+
\begin{tabular}[t]{c}\bf\rule{\z@}{24\p@}
|
303 |
+
Anonymous Author(s) \\
|
304 |
+
Affiliation \\
|
305 |
+
Address \\
|
306 |
+
\texttt{email} \\
|
307 |
+
\end{tabular}%
|
308 |
+
\fi
|
309 |
+
\vskip 0.3in \@minus 0.1in
|
310 |
+
}
|
311 |
+
}
|
312 |
+
|
313 |
+
% add conference notice to bottom of first page
|
314 |
+
\newcommand{\ftype@noticebox}{8}
|
315 |
+
\newcommand{\@notice}{%
|
316 |
+
% give a bit of extra room back to authors on first page
|
317 |
+
\enlargethispage{2\baselineskip}%
|
318 |
+
\@float{noticebox}[b]%
|
319 |
+
\footnotesize\@noticestring%
|
320 |
+
\end@float%
|
321 |
+
}
|
322 |
+
|
323 |
+
% abstract styling
|
324 |
+
\renewenvironment{abstract}%
|
325 |
+
{%
|
326 |
+
\vskip 0.075in%
|
327 |
+
\centerline%
|
328 |
+
{\large\bf Abstract}%
|
329 |
+
\vspace{0.5ex}%
|
330 |
+
\begin{quote}%
|
331 |
+
}
|
332 |
+
{
|
333 |
+
\par%
|
334 |
+
\end{quote}%
|
335 |
+
\vskip 1ex%
|
336 |
+
}
|
337 |
+
|
338 |
+
\endinput
|
339 |
+
|
paper.pdf
ADDED
Binary file (322 kB). View file
|
|
paper.tex
ADDED
@@ -0,0 +1,224 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
\documentclass{article}
|
2 |
+
|
3 |
+
% if you need to pass options to natbib, use, e.g.:
|
4 |
+
% \PassOptionsToPackage{numbers, compress}{natbib}
|
5 |
+
% before loading nips_2017
|
6 |
+
%
|
7 |
+
% to avoid loading the natbib package, add option nonatbib:
|
8 |
+
% \usepackage[nonatbib]{nips_2017}
|
9 |
+
|
10 |
+
%\usepackage{nips_2017}
|
11 |
+
|
12 |
+
% to compile a camera-ready version, add the [final] option, e.g.:
|
13 |
+
\usepackage[final,nonatbib]{nips_2017}
|
14 |
+
|
15 |
+
\usepackage[utf8]{inputenc} % allow utf-8 input
|
16 |
+
\usepackage[T1]{fontenc} % use 8-bit T1 fonts
|
17 |
+
\usepackage{hyperref} % hyperlinks
|
18 |
+
\usepackage{url} % simple URL typesetting
|
19 |
+
\usepackage{booktabs} % professional-quality tables
|
20 |
+
\usepackage{amsfonts} % blackboard math symbols
|
21 |
+
\usepackage{nicefrac} % compact symbols for 1/2, etc.
|
22 |
+
\usepackage{microtype} % microtypography
|
23 |
+
\usepackage{graphicx}
|
24 |
+
\usepackage{caption}
|
25 |
+
\usepackage{subcaption}
|
26 |
+
|
27 |
+
\title{Time Domain Neural Audio Style Transfer}
|
28 |
+
|
29 |
+
% The \author macro works with any number of authors. There are two
|
30 |
+
% commands used to separate the names and addresses of multiple
|
31 |
+
% authors: \And and \AND.
|
32 |
+
%
|
33 |
+
% Using \And between authors leaves it to LaTeX to determine where to
|
34 |
+
% break the lines. Using \AND forces a line break at that point. So,
|
35 |
+
% if LaTeX puts 3 of 4 authors names on the first line, and the last
|
36 |
+
% on the second line, try using \AND instead of \And before the third
|
37 |
+
% author name.
|
38 |
+
|
39 |
+
\author{
|
40 |
+
Parag K. Mital\\
|
41 |
+
Kadenze, Inc.\thanks{http://kadenze.com}\\
|
42 |
+
\texttt{parag@kadenze.com} \\
|
43 |
+
%% examples of more authors
|
44 |
+
%% \And
|
45 |
+
%% Coauthor \\
|
46 |
+
%% Affiliation \\
|
47 |
+
%% Address \\
|
48 |
+
%% \texttt{email} \\
|
49 |
+
%% \AND
|
50 |
+
%% Coauthor \\
|
51 |
+
%% Affiliation \\
|
52 |
+
%% Address \\
|
53 |
+
%% \texttt{email} \\
|
54 |
+
%% \And
|
55 |
+
%% Coauthor \\
|
56 |
+
%% Affiliation \\
|
57 |
+
%% Address \\
|
58 |
+
%% \texttt{email} \\
|
59 |
+
%% \And
|
60 |
+
%% Coauthor \\
|
61 |
+
%% Affiliation \\
|
62 |
+
%% Address \\
|
63 |
+
%% \texttt{email} \\
|
64 |
+
}
|
65 |
+
|
66 |
+
\begin{document}
|
67 |
+
% \nipsfinalcopy is no longer used
|
68 |
+
|
69 |
+
\maketitle
|
70 |
+
|
71 |
+
\begin{abstract}
|
72 |
+
A recently published method for audio style transfer has shown how to extend the process of image style transfer to audio. This method synthesizes audio "content" and "style" independently using the magnitudes of a short time Fourier transform, shallow convolutional networks with randomly initialized filters, and iterative phase reconstruction with Griffin-Lim. In this work, we explore whether it is possible to directly optimize a time domain audio signal, removing the process of phase reconstruction and opening up possibilities for real-time applications and higher quality syntheses. We explore a variety of style transfer processes on neural networks that operate directly on time domain audio signals and demonstrate one such network capable of audio stylization.
|
73 |
+
\end{abstract}
|
74 |
+
|
75 |
+
\section{Introduction}
|
76 |
+
|
77 |
+
% Style transfer \cite{} is a method for optimizing a randomly initialized image to have the appearance of the content and style of two separate images. It works by finding the raw activations of a so-called "content" image and optimizing a noise image to resemble the same activations while for "style", it looks at the kernel activations of any given layer and optimizes for these. The original work by Gatys et al. demonstrated this technique using activations from pre-trained VGG deep convolutional networks, though recent techniques in texture synthesis \cite{} show that similar results are possible with randomly initialized shallow convolutional networks.
|
78 |
+
|
79 |
+
Audio style transfer \cite{Ulyanov2016} attempts to extend the technique of image style transfer \cite{Gatys} to the domain of audio, allowing "content" and "style" to be independently manipulated. Ulyanov et al. demonstrates the process using the magnitudes of a short time Fourier transform representation of an audio signal as the input to a shallow untrained neural network, following similar work in image style transfer \cite{Ulyanov2016b}, storing the activations of the content and Gram activations of the style. A noisy input short time magnitude spectra is then optimized such that its activations through the same network resemble the target content and style magnitude's activations. The optimized magnitudes are then inverted back to an audio signal using an iterative Griffin-lim phase reconstruction process \cite{Griffin1984}.
|
80 |
+
|
81 |
+
Using phase reconstruction ultimately means the stylization process is not modeling the audio signal's fine temporal characteristics contained in its phase information. For instance, if a particular content or style audio source were to contain information about vibrato or the spatial movement or position of the audio source, this would likely be lost in a magnitude-only representation. Further, by relying on phase reconstruction, some error during the phase reconstruction is likely to happen, and developing real-time applications are also more difficult \cite{Wyse2017}, though not impossible \cite{Prusa2017}. In any case, any networks which discard phase information, such as \cite{Wyse2017}, which build on Ulyanov's approach, or recent audio networks such as \cite{Hershey2016} will still require phase reconstruction for stylization/synthesis applications.
|
82 |
+
|
83 |
+
Rather than approach stylization/synthesis via phase reconstruction, this work attempts to directly optimize a raw audio signal. Recent work in Neural Audio Synthesis has shown it is possible to take as input a raw audio signal and apply blending of musical notes in a neural embedding space on a trained WaveNet autoencoder \cite{Engel2017}. Though their work is capable of synthesizing raw audio from its embedding space, there is no separation of content and style using this approach, and thus they cannot be independently manipulated. However, to date, it is not clear whether this network's encoder or decoder could be used for audio stylization using the approach of Ulyanov/Gatys.
|
84 |
+
|
85 |
+
To understand better whether it is possible to perform audio stylization in the time domain, we investigate a variety of networks which take a time domain audio signal as input to their network: using the real and imaginary components of a Discrete Fourier Transform (DFT); using the magnitude and unwrapped phase differential components of a DFT; using combinations of real, imaginary, magnitude, and phase components; using the activations of a pre-trained WaveNet decoder \cite{Oord2016b,Engel2017}; and using the activations of a pre-trained NSynth encoder \cite{Engel2017}. We then apply audio stylization similarly to Ulyanov using a variety of parameters and report our results.
|
86 |
+
|
87 |
+
% \section{Related Work}
|
88 |
+
|
89 |
+
% There have been a few investigations of audio style transfer employing magnitude representations, such as Ulyanov's original work and a follow-up work employing VGG \cite{Wyse2017}. These models discard the phase information in favor of phase reconstruction. As well, there have been further developments in neural networks capable of large scale audio classification such as \cite{Hershey2016}, though these are trained on magnitude representations and would also require phase reconstruction as part of a stylization process. Perhaps most closely aligned is the work of NSynth \cite{Engel2017}, whose work is capable of taking as input a raw audio signal and allows for applications such as the blending of musical notes in a neural embedding space. Though their work is capable of synthesizing raw audio from its embedding space, there is no separation of content and style, and thus they cannot be independently manipulated.
|
90 |
+
|
91 |
+
% Speech synthesis techniques
|
92 |
+
%TacoTron demonstrated a technique using ...
|
93 |
+
%In a similar vein, WaveNet, ...
|
94 |
+
%NSynth incorporates a WaveNet decoder and includes an additional encoder, allowing one to encode a time domain audio signal using the encoding part of the network with 16 channels at 125x compression, and use these as biases during the WaveNet decoding. The embedding space is capable of linearly mixing instruments in its embedding space, though has yet to be explored as a network for audio stylization where content and style are indepenently manipulated.
|
95 |
+
|
96 |
+
% SampleRNN
|
97 |
+
|
98 |
+
% Soundnet
|
99 |
+
|
100 |
+
% VGG (Lonce Wyse, https://arxiv.org/pdf/1706.09559.pdf);
|
101 |
+
|
102 |
+
% Zdenek Pruska
|
103 |
+
|
104 |
+
% Other networks exploring audio include VGGish, built on the AudioSet dataset. This network, like Ulyanov's original implementation, however does not operate on the raw time domain signal and would require phase reconstruction. However, it does afford a potentially richer representation than a shallow convolutional network, as its embedding space was trained with the knowledge of many semantic classes of sounds.
|
105 |
+
|
106 |
+
% CycleGAN (https://gauthamzz.github.io/2017/09/23/AudioStyleTransfer/)
|
107 |
+
|
108 |
+
|
109 |
+
\section{Experiments}
|
110 |
+
|
111 |
+
We explore a variety of computational graphs which use as their first operation a discrete Fourier transform in order to project an audio signal into its real and imaginary components. We then explore manipulations on these components, including directly applying convolutional layers, or undergoing an additional transformation of the typical magnitude and phase components, as well as combinations of each these components. For representing phase, we also explored using the original phase, the phase differential, and the unwrapped phase differentials. From here, we apply the same techniques for stylization as described in \cite{Ulyanov2016}, except we no longer have to optimize a noisy magnitude input, and can instead optimize a time domain signal. We also explore combinations of using content/style layers following the initial projections and after fully connected layers.
|
112 |
+
|
113 |
+
We also explore two pre-trained networks: a pre-trained WaveNet decoder, and the encoder portion of an NSynth network as provided by Magenta \cite{Engel2017}, and look at the activations of each of these networks at different layers, much like the original image style networks did with VGG. We also include Ulyanov's original network as a baseline, and report our results as seen through spectrograms and through listening. Our code is also available online\footnote{https://github.com/pkmital/neural-audio-style-transfer}\footnote{Further details are described in the Supplementary Materials}.
|
114 |
+
|
115 |
+
\section{Results}
|
116 |
+
|
117 |
+
Only one network was capable of producing meaningful audio reconstruction through a stylization process where both the style and content appeared to be retained: including the real, imaginary, and magnitude information as concatenated features in height and using a kernel size 3 height convolutional filter. This process also includes a content layer which includes the concatenated features before any linear layer, and a style layer which is simply the magnitudes, and then uses a content and style layer following each nonlinearity. This network produces distinctly different stylizations to Ulyanov's original network, despite having similar parameters, often including quicker and busier temporal changes in content and style. The stylization also tends to produce what seems like higher fidelity syntheses, especially in lower frequencies, despite having the same sample rate. Lastly, this approach also tends to produce much less noise than Ulyanov's approach, most likely due to errors in the phase reconstruction/lack of phase representation.
|
118 |
+
|
119 |
+
Every other combination of input manipulations we tried tended towards a white noise signal and did not appear to drop in loss. The only other network that appeared to produce something recognizable, though with considerable noise was using the magnitude and unwrapped phase differential information with a kernel size 2 height convolutional filter. We could not manage to stylize any meaningful sounding synthesis using the activations in a WaveNet decoder or NSynth encoder.
|
120 |
+
|
121 |
+
% VGGish, AudioSet; VGG equivalent for audio, but uses a log-mel spectrogram.
|
122 |
+
|
123 |
+
\section{Discussion and Conclusion}
|
124 |
+
|
125 |
+
This work explores neural audio style transfer of a time domain audio signal. Of these networks, only two produced any meaningful results: the magnitude and unwrapped phase network, which produced distinctly noisier syntheses, and the real, imaginary, and magnitude network which was capable of resembling both the content and style sources in a similar quality to Ulyanov's original approach, though with interesting differences. It was especially surprising that we were unable to stylize with NSynth's encoder or decoder, though this is perhaps to due to the limited number of combinations of layers and possible activations we explored, and is worth exploring more in the future.
|
126 |
+
|
127 |
+
% Style transfer, like deep dream and its predecessor works in visualizing gradient activations, through exploration have the potential to enable us to understand representations created by neural networks. Through synthesis, and exploring the representations at each level of a neural network, we can start to gain insights into what sorts of representations if any are created by a network. However, to date, very few explorations of audio networks for the purpose of dreaming or stylization have been done.
|
128 |
+
|
129 |
+
%End to end learning, http://www.mirlab.org/conference_papers/International_Conference/ICASSP\%202014/papers/p7014-dieleman.pdf - spectrums still do better than raw audio.
|
130 |
+
|
131 |
+
\small
|
132 |
+
% Generated by IEEEtran.bst, version: 1.14 (2015/08/26)
|
133 |
+
\begin{thebibliography}{1}
|
134 |
+
\providecommand{\url}[1]{#1}
|
135 |
+
\csname url@samestyle\endcsname
|
136 |
+
\providecommand{\newblock}{\relax}
|
137 |
+
\providecommand{\bibinfo}[2]{#2}
|
138 |
+
\providecommand{\BIBentrySTDinterwordspacing}{\spaceskip=0pt\relax}
|
139 |
+
\providecommand{\BIBentryALTinterwordstretchfactor}{4}
|
140 |
+
\providecommand{\BIBentryALTinterwordspacing}{\spaceskip=\fontdimen2\font plus
|
141 |
+
\BIBentryALTinterwordstretchfactor\fontdimen3\font minus
|
142 |
+
\fontdimen4\font\relax}
|
143 |
+
\providecommand{\BIBforeignlanguage}[2]{{%
|
144 |
+
\expandafter\ifx\csname l@#1\endcsname\relax
|
145 |
+
\typeout{** WARNING: IEEEtran.bst: No hyphenation pattern has been}%
|
146 |
+
\typeout{** loaded for the language `#1'. Using the pattern for}%
|
147 |
+
\typeout{** the default language instead.}%
|
148 |
+
\else
|
149 |
+
\language=\csname l@#1\endcsname
|
150 |
+
\fi
|
151 |
+
#2}}
|
152 |
+
\providecommand{\BIBdecl}{\relax}
|
153 |
+
\BIBdecl
|
154 |
+
|
155 |
+
\bibitem{Ulyanov2016}
|
156 |
+
D.~Ulyanov and V.~Lebedev, ``{Audio texture synthesis and style transfer},''
|
157 |
+
2016.
|
158 |
+
|
159 |
+
\bibitem{Gatys}
|
160 |
+
L.~A. Gatys, A.~S. Ecker, M.~Bethge, and C.~V. Sep, ``{A Neural Algorithm of
|
161 |
+
Artistic Style},'' \emph{Arxiv}, p. 211839, 2015.
|
162 |
+
|
163 |
+
\bibitem{Ulyanov2016b}
|
164 |
+
\BIBentryALTinterwordspacing
|
165 |
+
D.~Ulyanov, V.~Lebedev, A.~Vedaldi, and V.~Lempitsky, ``{Texture Networks:
|
166 |
+
Feed-forward Synthesis of Textures and Stylized Images},'' 2016. [Online].
|
167 |
+
Available: \url{http://arxiv.org/abs/1603.03417}
|
168 |
+
\BIBentrySTDinterwordspacing
|
169 |
+
|
170 |
+
\bibitem{Griffin1984}
|
171 |
+
D.~W. Griffin and J.~S. Lim, ``{Signal Estimation from Modified Short-Time
|
172 |
+
Fourier Transform},'' \emph{IEEE Transactions on Acoustics, Speech, and
|
173 |
+
Signal Processing}, vol.~32, no.~2, pp. 236--243, 1984.
|
174 |
+
|
175 |
+
\bibitem{Wyse2017}
|
176 |
+
\BIBentryALTinterwordspacing
|
177 |
+
L.~Wyse, ``{Audio Spectrogram Representations for Processing with Convolutional
|
178 |
+
Neural Networks},'' in \emph{Proceedings of the First International Workshop
|
179 |
+
on Deep Learning and Music joint with IJCNN}, vol.~1, no.~1, 2017, pp.
|
180 |
+
37--41. [Online]. Available: \url{http://arxiv.org/abs/1706.09559}
|
181 |
+
\BIBentrySTDinterwordspacing
|
182 |
+
|
183 |
+
\bibitem{Prusa2017}
|
184 |
+
Z.~Prů{\v{s}}a and P.~Rajmic, ``{Toward High-Quality Real-Time Signal
|
185 |
+
Reconstruction from STFT Magnitude},'' \emph{IEEE Signal Processing Letters},
|
186 |
+
vol.~24, no.~6, pp. 892--896, 2017.
|
187 |
+
|
188 |
+
\bibitem{Hershey2016}
|
189 |
+
\BIBentryALTinterwordspacing
|
190 |
+
S.~Hershey, S.~Chaudhuri, D.~P.~W. Ellis, J.~F. Gemmeke, A.~Jansen, C.~Moore,
|
191 |
+
M.~Plakal, D.~Platt, R.~A. Saurous, B.~Seybold, M.~Slaney, R.~J. Weiss,
|
192 |
+
K.~Wilson, R.~C. Moore, M.~Plakal, D.~Platt, R.~A. Saurous, B.~Seybold,
|
193 |
+
M.~Slaney, R.~J. Weiss, and K.~Wilson, ``{CNN Architectures for Large-Scale
|
194 |
+
Audio Classification},'' \emph{International Conference on Acoustics, Speech
|
195 |
+
and Signal Processing (ICASSP)}, pp. 4--8, 2016. [Online]. Available:
|
196 |
+
\url{http://arxiv.org/abs/1609.09430}
|
197 |
+
\BIBentrySTDinterwordspacing
|
198 |
+
|
199 |
+
\bibitem{Engel2017}
|
200 |
+
\BIBentryALTinterwordspacing
|
201 |
+
J.~Engel, C.~Resnick, A.~Roberts, S.~Dieleman, D.~Eck, K.~Simonyan, and
|
202 |
+
M.~Norouzi, ``{Neural Audio Synthesis of Musical Notes with WaveNet
|
203 |
+
Autoencoders},'' in \emph{Proceedings of the 34th International Conference on
|
204 |
+
Machine Learning}, 2017. [Online]. Available:
|
205 |
+
\url{http://arxiv.org/abs/1704.01279}
|
206 |
+
\BIBentrySTDinterwordspacing
|
207 |
+
|
208 |
+
\bibitem{Oord2016b}
|
209 |
+
\BIBentryALTinterwordspacing
|
210 |
+
A.~van~den Oord, S.~Dieleman, H.~Zen, K.~Simonyan, O.~Vinyals, A.~Graves,
|
211 |
+
N.~Kalchbrenner, A.~Senior, and K.~Kavukcuoglu, ``{WaveNet: A Generative
|
212 |
+
Model for Raw Audio},'' \emph{arxiv}, pp. 1--15, 2016. [Online]. Available:
|
213 |
+
\url{http://arxiv.org/abs/1609.03499}
|
214 |
+
\BIBentrySTDinterwordspacing
|
215 |
+
|
216 |
+
\end{thebibliography}
|
217 |
+
|
218 |
+
\begin{figure}
|
219 |
+
\centering
|
220 |
+
\includegraphics[width=1\linewidth]{synthesis}
|
221 |
+
\caption{Example synthesis optimizing audio directly with both the source content and style audible.}
|
222 |
+
\end{figure}
|
223 |
+
\end{document}
|
224 |
+
|
search.py
ADDED
@@ -0,0 +1,127 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
"""NIPS2017 "Time Domain Neural Audio Style Transfer" code repository
|
2 |
+
Parag K. Mital
|
3 |
+
"""
|
4 |
+
import os
|
5 |
+
import glob
|
6 |
+
import numpy as np
|
7 |
+
from audio_style_transfer.models import timedomain, uylanov
|
8 |
+
|
9 |
+
|
10 |
+
def get_path(model, output_path, content_filename, style_filename):
|
11 |
+
output_dir = os.path.join(output_path, model)
|
12 |
+
if not os.path.exists(output_dir):
|
13 |
+
os.makedirs(output_dir)
|
14 |
+
output_filename = '{}/{}/{}+{}'.format(output_path, model,
|
15 |
+
content_filename.split('/')[-1],
|
16 |
+
style_filename.split('/')[-1])
|
17 |
+
return output_filename
|
18 |
+
|
19 |
+
|
20 |
+
def params():
|
21 |
+
n_fft = [2048, 4096, 8196]
|
22 |
+
n_layers = [1, 2, 4]
|
23 |
+
n_filters = [128, 2048, 4096]
|
24 |
+
hop_length = [128, 256, 512]
|
25 |
+
alpha = [0.1, 0.01, 0.005]
|
26 |
+
k_w = [4, 8, 12]
|
27 |
+
norm = [True, False]
|
28 |
+
input_features = [['mags'], ['mags', 'phase'], ['real', 'imag'], ['real', 'imag', 'mags']]
|
29 |
+
return locals()
|
30 |
+
|
31 |
+
|
32 |
+
def batch(content_path, style_path, output_path, run_timedomain=True, run_uylanov=False):
|
33 |
+
content_files = glob.glob('{}/*.wav'.format(content_path))
|
34 |
+
style_files = glob.glob('{}/*.wav'.format(style_path))
|
35 |
+
content_filename = np.random.choice(content_files)
|
36 |
+
style_filename = np.random.choice(style_files)
|
37 |
+
alpha = np.random.choice(params()['alpha'])
|
38 |
+
n_fft = np.random.choice(params()['n_fft'])
|
39 |
+
n_layers = np.random.choice(params()['n_layers'])
|
40 |
+
n_filters = np.random.choice(params()['n_filters'])
|
41 |
+
hop_length = np.random.choice(params()['hop_length'])
|
42 |
+
norm = np.random.choice(params()['norm'])
|
43 |
+
k_w = np.random.choice(params()['k_w'])
|
44 |
+
|
45 |
+
# Run the Time Domain Model
|
46 |
+
if run_timedomain:
|
47 |
+
for f in params()['input_features']:
|
48 |
+
fname = get_path('timedomain/input_features={}'.format(",".join(f)),
|
49 |
+
output_path, content_filename, style_filename)
|
50 |
+
output_filename = ('{},n_fft={},n_layers={},n_filters={},norm={},'
|
51 |
+
'hop_length={},alpha={},k_w={}.wav'.format(
|
52 |
+
fname, n_fft, n_layers, n_filters, norm,
|
53 |
+
hop_length, alpha, k_w))
|
54 |
+
print(output_filename)
|
55 |
+
if not os.path.exists(output_filename):
|
56 |
+
timedomain.run(content_fname=content_filename,
|
57 |
+
style_fname=style_filename,
|
58 |
+
output_fname=output_filename,
|
59 |
+
n_fft=n_fft,
|
60 |
+
n_layers=n_layers,
|
61 |
+
n_filters=n_filters,
|
62 |
+
hop_length=hop_length,
|
63 |
+
alpha=alpha,
|
64 |
+
norm=norm,
|
65 |
+
k_w=k_w)
|
66 |
+
|
67 |
+
if run_uylanov:
|
68 |
+
# Run Original Uylanov Model
|
69 |
+
fname = get_path('uylanov', output_path, content_filename, style_filename)
|
70 |
+
output_filename = ('{},n_fft={},n_layers={},n_filters={},'
|
71 |
+
'hop_length={},alpha={},k_w={}.wav'.format(
|
72 |
+
fname, n_fft, n_layers, n_filters, hop_length, alpha,
|
73 |
+
k_w))
|
74 |
+
print(output_filename)
|
75 |
+
if not os.path.exists(output_filename):
|
76 |
+
uylanov.run(content_filename,
|
77 |
+
style_filename,
|
78 |
+
output_filename,
|
79 |
+
n_fft=n_fft,
|
80 |
+
n_layers=n_layers,
|
81 |
+
n_filters=n_filters,
|
82 |
+
hop_length=hop_length,
|
83 |
+
alpha=alpha,
|
84 |
+
k_w=k_w)
|
85 |
+
|
86 |
+
# These only produce noise so they are commented
|
87 |
+
# # Run NSynth Encoder Model
|
88 |
+
# output_filename = get_path('nsynth-encoder', output_path, content_filename,
|
89 |
+
# style_filename)
|
90 |
+
# output_filename = ('{},n_fft={},n_layers={},n_filters={},'
|
91 |
+
# 'hop_length={},alpha={},k_w={}.wav'.format(
|
92 |
+
# fname, n_fft, n_layers, n_filters, hop_length, alpha, k_w))
|
93 |
+
# if not os.path.exists(output_filename):
|
94 |
+
# nsynth.run(content_filename,
|
95 |
+
# style_filename,
|
96 |
+
# output_filename,
|
97 |
+
# model='encoder',
|
98 |
+
# n_fft=n_fft,
|
99 |
+
# n_layers=n_layers,
|
100 |
+
# n_filters=n_filters,
|
101 |
+
# hop_length=hop_length,
|
102 |
+
# alpha=alpha,
|
103 |
+
# k_w=k_w)
|
104 |
+
# # Run NSynth Decoder Model
|
105 |
+
# output_filename = get_path('wavenet-decoder', output_path, content_filename,
|
106 |
+
# style_filename)
|
107 |
+
# output_filename = ('{},n_fft={},n_layers={},n_filters={},'
|
108 |
+
# 'hop_length={},alpha={},k_w={}.wav'.format(
|
109 |
+
# fname, n_fft, n_layers, n_filters, hop_length, alpha, k_w))
|
110 |
+
# if not os.path.exists(output_filename):
|
111 |
+
# nsynth.run(content_filename,
|
112 |
+
# style_filename,
|
113 |
+
# output_filename,
|
114 |
+
# model='decoder',
|
115 |
+
# n_fft=n_fft,
|
116 |
+
# n_layers=n_layers,
|
117 |
+
# n_filters=n_filters,
|
118 |
+
# hop_length=hop_length,
|
119 |
+
# alpha=alpha,
|
120 |
+
# k_w=k_w)
|
121 |
+
|
122 |
+
|
123 |
+
if __name__ == '__main__':
|
124 |
+
content_path = './target'
|
125 |
+
style_path = './corpus'
|
126 |
+
output_path = './results'
|
127 |
+
batch(content_path, style_path, output_path)
|
setup.py
ADDED
@@ -0,0 +1,116 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
#!/usr/bin/env python
|
2 |
+
# -*- coding: utf-8 -*-
|
3 |
+
|
4 |
+
# Note: To use the 'upload' functionality of this file, you must:
|
5 |
+
# $ pip install twine
|
6 |
+
|
7 |
+
import io
|
8 |
+
import os
|
9 |
+
import sys
|
10 |
+
from shutil import rmtree
|
11 |
+
|
12 |
+
from setuptools import find_packages, setup, Command
|
13 |
+
|
14 |
+
# Package meta-data.
|
15 |
+
NAME = 'audio_style_transfer'
|
16 |
+
DESCRIPTION = 'Exploring Audio Style Transfer'
|
17 |
+
URL = 'https://github.com/pkmital/time-domain-neural-audio-style-transfer'
|
18 |
+
EMAIL = 'parag@pkmital.com'
|
19 |
+
AUTHOR = 'Parag Mital'
|
20 |
+
|
21 |
+
# What packages are required for this module to be executed?
|
22 |
+
REQUIRED = [
|
23 |
+
# 'tensorflow-gpu<2.0.0', 'librosa<0.8.0',
|
24 |
+
# 'magenta'
|
25 |
+
]
|
26 |
+
|
27 |
+
# The rest you shouldn't have to touch too much :)
|
28 |
+
# ------------------------------------------------
|
29 |
+
# Except, perhaps the License and Trove Classifiers!
|
30 |
+
# If you do change the License, remember to change the Trove Classifier for that!
|
31 |
+
|
32 |
+
here = os.path.abspath(os.path.dirname(__file__))
|
33 |
+
|
34 |
+
# Import the README and use it as the long-description.
|
35 |
+
# Note: this will only work if 'README.rst' is present in your MANIFEST.in file!
|
36 |
+
with io.open(os.path.join(here, 'README.md'), encoding='utf-8') as f:
|
37 |
+
long_description = '\n' + f.read()
|
38 |
+
|
39 |
+
# Load the package's __version__.py module as a dictionary.
|
40 |
+
about = {}
|
41 |
+
with open(os.path.join(here, NAME, '__version__.py')) as f:
|
42 |
+
exec(f.read(), about)
|
43 |
+
|
44 |
+
|
45 |
+
class UploadCommand(Command):
|
46 |
+
"""Support setup.py upload."""
|
47 |
+
|
48 |
+
description = 'Build and publish the package.'
|
49 |
+
user_options = []
|
50 |
+
|
51 |
+
@staticmethod
|
52 |
+
def status(s):
|
53 |
+
"""Prints things in bold."""
|
54 |
+
print('\033[1m{0}\033[0m'.format(s))
|
55 |
+
|
56 |
+
def initialize_options(self):
|
57 |
+
pass
|
58 |
+
|
59 |
+
def finalize_options(self):
|
60 |
+
pass
|
61 |
+
|
62 |
+
def run(self):
|
63 |
+
try:
|
64 |
+
self.status('Removing previous builds…')
|
65 |
+
rmtree(os.path.join(here, 'dist'))
|
66 |
+
except OSError:
|
67 |
+
pass
|
68 |
+
|
69 |
+
self.status('Building Source and Wheel (universal) distribution…')
|
70 |
+
os.system('{0} setup.py sdist bdist_wheel --universal'.format(sys.executable))
|
71 |
+
|
72 |
+
self.status('Uploading the package to PyPi via Twine…')
|
73 |
+
os.system('twine upload dist/*')
|
74 |
+
|
75 |
+
sys.exit()
|
76 |
+
|
77 |
+
|
78 |
+
# Where the magic happens:
|
79 |
+
setup(
|
80 |
+
name=NAME,
|
81 |
+
version=about['__version__'],
|
82 |
+
description=DESCRIPTION,
|
83 |
+
long_description=long_description,
|
84 |
+
author=AUTHOR,
|
85 |
+
author_email=EMAIL,
|
86 |
+
url=URL,
|
87 |
+
packages=find_packages(exclude=('tests',)),
|
88 |
+
# If your package is a single module, use this instead of 'packages':
|
89 |
+
# py_modules=['mypackage'],
|
90 |
+
|
91 |
+
# entry_points={
|
92 |
+
# 'console_scripts': ['mycli=mymodule:cli'],
|
93 |
+
# },
|
94 |
+
install_requires=REQUIRED,
|
95 |
+
include_package_data=True,
|
96 |
+
license='MIT',
|
97 |
+
classifiers=[
|
98 |
+
# Trove classifiers
|
99 |
+
# Full list: https://pypi.python.org/pypi?%3Aaction=list_classifiers
|
100 |
+
'License :: OSI Approved :: MIT License',
|
101 |
+
'Programming Language :: Python',
|
102 |
+
'Programming Language :: Python :: 2.6',
|
103 |
+
'Programming Language :: Python :: 2.7',
|
104 |
+
'Programming Language :: Python :: 3',
|
105 |
+
'Programming Language :: Python :: 3.3',
|
106 |
+
'Programming Language :: Python :: 3.4',
|
107 |
+
'Programming Language :: Python :: 3.5',
|
108 |
+
'Programming Language :: Python :: 3.6',
|
109 |
+
'Programming Language :: Python :: Implementation :: CPython',
|
110 |
+
'Programming Language :: Python :: Implementation :: PyPy'
|
111 |
+
],
|
112 |
+
# $ setup.py publish support.
|
113 |
+
cmdclass={
|
114 |
+
'upload': UploadCommand,
|
115 |
+
},
|
116 |
+
)
|
style-transfer.bib
ADDED
@@ -0,0 +1,128 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
sc{Ulyanov2016,
|
2 |
+
author = {Ulyanov, Dmitry and Lebedev, Vadim},
|
3 |
+
title = {{Audio texture synthesis and style transfer}},
|
4 |
+
urldate = {November 3, 2017},
|
5 |
+
year = {2016}
|
6 |
+
}
|
7 |
+
@inproceedings{Wyse2017,
|
8 |
+
abstract = {One of the decisions that arise when designing a neural network for any application is how the data should be represented in order to be presented to, and possibly generated by, a neural network. For audio, the choice is less obvious than it seems to be for visual images, and a variety of representations have been used for different applications including the raw digitized sample stream, hand-crafted features, machine discovered features, MFCCs and variants that include deltas, and a variety of spectral representations. This paper reviews some of these representations and issues that arise, focusing particularly on spectrograms for generating audio using neural networks for style transfer.},
|
9 |
+
archivePrefix = {arXiv},
|
10 |
+
arxivId = {1706.09559},
|
11 |
+
author = {Wyse, L.},
|
12 |
+
booktitle = {Proceedings of the First International Workshop on Deep Learning and Music joint with IJCNN},
|
13 |
+
eprint = {1706.09559},
|
14 |
+
file = {:Users/pkmital/Documents/PDFs/Wyse/Wyse - 2017 - Audio Spectrogram Representations for Processing with Convolutional Neural Networks.pdf:pdf},
|
15 |
+
keywords = {data representation,sound synthesis,spectrograms,style transfer},
|
16 |
+
number = {1},
|
17 |
+
pages = {37--41},
|
18 |
+
title = {{Audio Spectrogram Representations for Processing with Convolutional Neural Networks}},
|
19 |
+
url = {http://arxiv.org/abs/1706.09559},
|
20 |
+
volume = {1},
|
21 |
+
year = {2017}
|
22 |
+
}
|
+@article{Ustyuzhaninov2016,
+  abstract = {Here we demonstrate that the feature space of random shallow convolutional neural networks (CNNs) can serve as a surprisingly good model of natural textures. Patches from the same texture are consistently classified as being more similar then patches from different textures. Samples synthesized from the model capture spatial correlations on scales much larger then the receptive field size, and sometimes even rival or surpass the perceptual quality of state of the art texture models (but show less variability). The current state of the art in parametric texture synthesis relies on the multi-layer feature space of deep CNNs that were trained on natural images. Our finding suggests that such optimized multi-layer feature spaces are not imperative for texture modeling. Instead, much simpler shallow and convolutional networks can serve as the basis for novel texture synthesis algorithms.},
+  archivePrefix = {arXiv},
+  arxivId = {1606.00021},
+  author = {Ustyuzhaninov, Ivan and Brendel, Wieland and Gatys, Leon A. and Bethge, Matthias},
+  eprint = {1606.00021},
+  file = {:Users/pkmital/Documents/PDFs/Ustyuzhaninov et al/Ustyuzhaninov et al. - 2016 - Texture Synthesis Using Shallow Convolutional Networks with Random Filters.pdf:pdf},
+  journal = {Arxiv},
+  pages = {1--9},
+  title = {{Texture Synthesis Using Shallow Convolutional Networks with Random Filters}},
+  url = {http://arxiv.org/abs/1606.00021},
+  year = {2016}
+}
+@article{Gatys,
+  archivePrefix = {arXiv},
+  arxivId = {arXiv:1508.06576v2},
+  author = {Gatys, Leon A and Ecker, Alexander S and Bethge, Matthias},
+  eprint = {arXiv:1508.06576v2},
+  file = {:Users/pkmital/Documents/PDFs/Gatys et al/Gatys et al. - 2015 - A Neural Algorithm of Artistic Style.pdf:pdf},
+  journal = {Arxiv},
+  pages = {211839},
+  title = {{A Neural Algorithm of Artistic Style}},
+  year = {2015}
+}
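For reference, the content and style losses introduced in the Gatys et al. paper cited above reduce to the functions sketched here in NumPy; the feature maps are assumed to already be flattened to (channels, positions), and the 1/(4 N^2 M^2) normalization follows the paper.

import numpy as np


def gram_matrix(features):
    # features: (n_channels, n_positions) activations of one layer.
    return features @ features.T


def style_loss(f_synth, f_style):
    # Squared Frobenius distance between Gram matrices.
    n, m = f_synth.shape
    diff = gram_matrix(f_synth) - gram_matrix(f_style)
    return np.sum(diff ** 2) / (4.0 * n ** 2 * m ** 2)


def content_loss(f_synth, f_content):
    return 0.5 * np.sum((f_synth - f_content) ** 2)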
+@article{Prusa2017,
+  author = {Prů{\v{s}}a, Zden{\v{e}}k and Rajmic, Pavel},
+  doi = {10.1109/LSP.2017.2696970},
+  file = {:Users/pkmital/Documents/PDFs/Prů{\v{s}}a, Rajmic/Prů{\v{s}}a, Rajmic - 2017 - Toward High-Quality Real-Time Signal Reconstruction from STFT Magnitude.pdf:pdf},
+  issn = {10709908},
+  journal = {IEEE Signal Processing Letters},
+  keywords = {Phase reconstruction,real-time,short-time Fourier transform (STFT),spectrogram,time-frequency},
+  mendeley-groups = {nips-2017-audio-style},
+  number = {6},
+  pages = {892--896},
+  title = {{Toward High-Quality Real-Time Signal Reconstruction from STFT Magnitude}},
+  volume = {24},
+  year = {2017}
+}
+@article{Griffin1984,
+  author = {Griffin, Daniel W and Lim, Jae S},
+  file = {:Users/pkmital/Documents/PDFs/Griffin, Lim/Griffin, Lim - 1984 - Signal Estimation from Modified Short-Time Fourier Transform.pdf:pdf},
+  journal = {IEEE Transactions on Acoustics, Speech, and Signal Processing},
+  mendeley-groups = {nips-2017-audio-style},
+  number = {2},
+  pages = {236--243},
+  title = {{Signal Estimation from Modified Short-Time Fourier Transform}},
+  volume = {32},
+  year = {1984}
+}
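Griffin-Lim, cited above, is the standard way to recover a waveform from a magnitude-only spectrogram after it has been modified by style transfer. A minimal sketch using librosa's built-in implementation follows; the input path and STFT parameters are placeholders and the iteration count is a typical choice, not this repository's setting.

import numpy as np
import librosa

n_fft, hop = 2048, 512
y, sr = librosa.load('stylized.wav', sr=22050)  # placeholder input
mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
# Iteratively estimate a phase consistent with the (possibly modified) magnitudes.
y_rec = librosa.griffinlim(mag, n_iter=60, hop_length=hop, win_length=n_fft)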
+@inproceedings{Engel2017,
+  abstract = {Generative models in vision have seen rapid progress due to algorithmic improvements and the availability of high-quality image datasets. In this paper, we offer contributions in both these areas to enable similar progress in audio modeling. First, we detail a powerful new WaveNet-style autoencoder model that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform. Second, we introduce NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets. Using NSynth, we demonstrate improved qualitative and quantitative performance of the WaveNet autoencoder over a well-tuned spectral autoencoder baseline. Finally, we show that the model learns a manifold of embeddings that allows for morphing between instruments, meaningfully interpolating in timbre to create new types of sounds that are realistic and expressive.},
+  archivePrefix = {arXiv},
+  arxivId = {1704.01279},
+  author = {Engel, Jesse and Resnick, Cinjon and Roberts, Adam and Dieleman, Sander and Eck, Douglas and Simonyan, Karen and Norouzi, Mohammad},
+  booktitle = {Proceedings of the 34th International Conference on Machine Learning},
+  eprint = {1704.01279},
+  file = {:Users/pkmital/Documents/PDFs/Engel et al/Engel et al. - 2017 - Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders(2).pdf:pdf},
+  mendeley-groups = {nips-2017-audio-style},
+  title = {{Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders}},
+  url = {http://arxiv.org/abs/1704.01279},
+  year = {2017}
+}
+@article{Oord2016b,
+  abstract = {This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.},
+  archivePrefix = {arXiv},
+  arxivId = {1609.03499},
+  author = {van den Oord, Aaron and Dieleman, Sander and Zen, Heiga and Simonyan, Karen and Vinyals, Oriol and Graves, Alex and Kalchbrenner, Nal and Senior, Andrew and Kavukcuoglu, Koray},
+  eprint = {1609.03499},
+  file = {:Users/pkmital/Documents/PDFs/Oord et al/Oord et al. - 2016 - WaveNet A Generative Model for Raw Audio.pdf:pdf},
+  journal = {arxiv},
+  mendeley-groups = {Neural Audio},
+  pages = {1--15},
+  title = {{WaveNet: A Generative Model for Raw Audio}},
+  url = {http://arxiv.org/abs/1609.03499},
+  year = {2016}
+}
+@article{Hershey2016,
+  abstract = {Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on our audio classification task, and larger training and label sets help up to a point. A model using embeddings from these classifiers does much better than raw features on the Audio Set [5] Acoustic Event Detection (AED) classification task.},
+  archivePrefix = {arXiv},
+  arxivId = {1609.09430},
+  author = {Hershey, Shawn and Chaudhuri, Sourish and Ellis, Daniel P. W. and Gemmeke, Jort F. and Jansen, Aren and Moore, R. Channing and Plakal, Manoj and Platt, Devin and Saurous, Rif A. and Seybold, Bryan and Slaney, Malcolm and Weiss, Ron J. and Wilson, Kevin},
+  eprint = {1609.09430},
+  file = {:Users/pkmital/Documents/PDFs/Hershey et al/Hershey et al. - 2016 - CNN Architectures for Large-Scale Audio Classification.pdf:pdf},
+  isbn = {9781509041176},
+  journal = {International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+  mendeley-groups = {Embodied Cognition,nips-2017-audio-style},
+  pages = {4--8},
+  title = {{CNN Architectures for Large-Scale Audio Classification}},
+  url = {http://arxiv.org/abs/1609.09430},
+  year = {2016}
+}
+@article{Ulyanov2016b,
+  abstract = {Gatys et al. recently demonstrated that deep networks can generate beautiful textures and stylized images from a single texture example. However, their methods requires a slow and memory-consuming optimization process. We propose here an alternative approach that moves the computational burden to a learning stage. Given a single example of a texture, our approach trains compact feed-forward convolutional networks to generate multiple samples of the same texture of arbitrary size and to transfer artistic style from a given image to any other image. The resulting networks are remarkably light-weight and can generate textures of quality comparable to Gatys{\~{}}et{\~{}}al., but hundreds of times faster. More generally, our approach highlights the power and flexibility of generative feed-forward models trained with complex and expressive loss functions.},
+  archivePrefix = {arXiv},
+  arxivId = {1603.03417},
+  author = {Ulyanov, Dmitry and Lebedev, Vadim and Vedaldi, Andrea and Lempitsky, Victor},
+  eprint = {1603.03417},
+  file = {:Users/pkmital/Documents/PDFs/Ulyanov et al/Ulyanov et al. - 2016 - Texture Networks Feed-forward Synthesis of Textures and Stylized Images.pdf:pdf},
+  isbn = {9781510829008},
+  issn = {1938-7228},
+  mendeley-groups = {nips-2017-audio-style},
+  title = {{Texture Networks: Feed-forward Synthesis of Textures and Stylized Images}},
+  url = {http://arxiv.org/abs/1603.03417},
+  year = {2016}
+}
+