nakas committed on
Commit
2c448c3
1 Parent(s): 028a5c0

github fork

.DS_Store ADDED
Binary file (6.15 kB)
 
LICENSE ADDED
@@ -0,0 +1,201 @@
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
MANIFEST.in ADDED
@@ -0,0 +1 @@
+ include README.md LICENSE
audio_style_transfer/__init__.py ADDED
File without changes
audio_style_transfer/__version__.py ADDED
@@ -0,0 +1,3 @@
+ VERSION = (1, 0, 0)
+
+ __version__ = '.'.join(map(str, VERSION))
audio_style_transfer/models/__init__.py ADDED
File without changes
audio_style_transfer/models/nsynth.py ADDED
@@ -0,0 +1,393 @@
+ """NSynth & WaveNet Audio Style Transfer."""
+ import os
+ import glob
+ import librosa
+ import argparse
+ import numpy as np
+ import tensorflow as tf
+ from magenta.models.nsynth.wavenet import masked
+ from magenta.models.nsynth.utils import mu_law, inv_mu_law_numpy
+ from audio_style_transfer import utils
+
+
+ def compute_wavenet_encoder_features(content, style):
+     ae_hop_length = 512
+     ae_bottleneck_width = 16
+     ae_num_stages = 10
+     ae_num_layers = 30
+     ae_filter_length = 3
+     ae_width = 128
+     # Encode the source with 8-bit Mu-Law.
+     n_frames = content.shape[0]
+     n_samples = content.shape[1]
+     content_tf = np.ascontiguousarray(content)
+     style_tf = np.ascontiguousarray(style)
+     g = tf.Graph()
+     content_features = []
+     style_features = []
+     layers = []
+     with g.as_default(), g.device('/cpu:0'), tf.Session() as sess:
+         x = tf.placeholder('float32', [n_frames, n_samples], name="x")
+         x_quantized = mu_law(x)
+         x_scaled = tf.cast(x_quantized, tf.float32) / 128.0
+         x_scaled = tf.expand_dims(x_scaled, 2)
+         en = masked.conv1d(
+             x_scaled,
+             causal=False,
+             num_filters=ae_width,
+             filter_length=ae_filter_length,
+             name='ae_startconv')
+         for num_layer in range(ae_num_layers):
+             dilation = 2**(num_layer % ae_num_stages)
+             d = tf.nn.relu(en)
+             d = masked.conv1d(
+                 d,
+                 causal=False,
+                 num_filters=ae_width,
+                 filter_length=ae_filter_length,
+                 dilation=dilation,
+                 name='ae_dilatedconv_%d' % (num_layer + 1))
+             d = tf.nn.relu(d)
+             en += masked.conv1d(
+                 d,
+                 num_filters=ae_width,
+                 filter_length=1,
+                 name='ae_res_%d' % (num_layer + 1))
+             layers.append(en)
+         en = masked.conv1d(
+             en,
+             num_filters=ae_bottleneck_width,
+             filter_length=1,
+             name='ae_bottleneck')
+         en = masked.pool1d(en, ae_hop_length, name='ae_pool', mode='avg')
+         saver = tf.train.Saver()
+         saver.restore(sess, './model.ckpt-200000')
+         content_features = sess.run(layers, feed_dict={x: content_tf})
+         styles = sess.run(layers, feed_dict={x: style_tf})
+         for i, style_feature in enumerate(styles):
+             n_features = np.prod(layers[i].shape.as_list()[-1])
+             features = np.reshape(style_feature, (-1, n_features))
+             style_gram = np.matmul(features.T, features) / (n_samples * n_frames)
+             style_features.append(style_gram)
+     return content_features, style_features
+
+
+ def compute_wavenet_encoder_stylization(n_samples,
+                                         n_frames,
+                                         content_features,
+                                         style_features,
+                                         alpha=1e-4,
+                                         learning_rate=1e-3,
+                                         iterations=100):
+     ae_style_layers = [1, 5]
+     ae_num_layers = 30
+     ae_num_stages = 10
+     ae_filter_length = 3
+     ae_width = 128
+     layers = []
+     with tf.Graph().as_default() as g, g.device('/cpu:0'), tf.Session() as sess:
+         x = tf.placeholder(
+             name="x", shape=(n_frames, n_samples, 1), dtype=tf.float32)
+         en = masked.conv1d(
+             x,
+             causal=False,
+             num_filters=ae_width,
+             filter_length=ae_filter_length,
+             name='ae_startconv')
+         for num_layer in range(ae_num_layers):
+             dilation = 2**(num_layer % ae_num_stages)
+             d = tf.nn.relu(en)
+             d = masked.conv1d(
+                 d,
+                 causal=False,
+                 num_filters=ae_width,
+                 filter_length=ae_filter_length,
+                 dilation=dilation,
+                 name='ae_dilatedconv_%d' % (num_layer + 1))
+             d = tf.nn.relu(d)
+             en += masked.conv1d(
+                 d,
+                 num_filters=ae_width,
+                 filter_length=1,
+                 name='ae_res_%d' % (num_layer + 1))
+             layer_i = tf.identity(en, name='layer_{}'.format(num_layer))
+             layers.append(layer_i)
+         saver = tf.train.Saver()
+         saver.restore(sess, './model.ckpt-200000')
+         sess.run(tf.initialize_all_variables())
+         frozen_graph_def = tf.graph_util.convert_variables_to_constants(
+             sess, sess.graph_def, [en.name.replace(':0', '')] +
+             ['layer_{}'.format(i) for i in range(ae_num_layers)])
+     with tf.Graph().as_default() as g, g.device('/cpu:0'), tf.Session() as sess:
+         x = tf.Variable(
+             np.random.randn(n_frames, n_samples, 1).astype(np.float32))
+         tf.import_graph_def(frozen_graph_def, input_map={'x:0': x})
+         content_loss = np.float32(0.0)
+         style_loss = np.float32(0.0)
+         for num_layer in ae_style_layers:
+             layer_i = g.get_tensor_by_name(name='import/layer_%d:0' % num_layer)
+             content_loss = content_loss + alpha * 2 * tf.nn.l2_loss(
+                 layer_i - content_features[num_layer])
+             n_features = layer_i.shape.as_list()[-1]
+             features = tf.reshape(layer_i, (-1, n_features))
+             gram = tf.matmul(tf.transpose(features), features) / (n_frames * n_samples)
+             style_loss = style_loss + 2 * tf.nn.l2_loss(
+                 gram - style_features[num_layer])
+         loss = content_loss + style_loss
+         # Optimization
+         print('Started optimization.')
+         opt = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
+         var_list = tf.trainable_variables()
+         print(var_list)
+         sess.run(tf.initialize_all_variables())
+         for i in range(iterations):
+             s, c, l, _ = sess.run([style_loss, content_loss, loss, opt])
+             print(i, '- Style:', s, 'Content:', c, end='\r')
+         result = x.eval()
+     result = inv_mu_law_numpy(result[..., 0] / result.max() * 128.0)
+     return result
+
+
+ def compute_wavenet_decoder_features(content, style):
+     num_stages = 10
+     num_layers = 30
+     filter_length = 3
+     width = 512
+     skip_width = 256
+     # Encode the source with 8-bit Mu-Law.
+     n_frames = content.shape[0]
+     n_samples = content.shape[1]
+     content_tf = np.ascontiguousarray(content)
+     style_tf = np.ascontiguousarray(style)
+     g = tf.Graph()
+     content_features = []
+     style_features = []
+     layers = []
+     with g.as_default(), g.device('/cpu:0'), tf.Session() as sess:
+         x = tf.placeholder('float32', [n_frames, n_samples], name="x")
+         x_quantized = mu_law(x)
+         x_scaled = tf.cast(x_quantized, tf.float32) / 128.0
+         x_scaled = tf.expand_dims(x_scaled, 2)
+         layer = x_scaled
+         layer = masked.conv1d(
+             layer, num_filters=width, filter_length=filter_length, name='startconv')
+
+         # Set up skip connections.
+         s = masked.conv1d(
+             layer, num_filters=skip_width, filter_length=1, name='skip_start')
+
+         # Residual blocks with skip connections.
+         for i in range(num_layers):
+             dilation = 2**(i % num_stages)
+             d = masked.conv1d(
+                 layer,
+                 num_filters=2 * width,
+                 filter_length=filter_length,
+                 dilation=dilation,
+                 name='dilatedconv_%d' % (i + 1))
+             assert d.get_shape().as_list()[2] % 2 == 0
+             m = d.get_shape().as_list()[2] // 2
+             d_sigmoid = tf.sigmoid(d[:, :, :m])
+             d_tanh = tf.tanh(d[:, :, m:])
+             d = d_sigmoid * d_tanh
+
+             layer += masked.conv1d(
+                 d, num_filters=width, filter_length=1, name='res_%d' % (i + 1))
+             s += masked.conv1d(
+                 d,
+                 num_filters=skip_width,
+                 filter_length=1,
+                 name='skip_%d' % (i + 1))
+             layers.append(s)
+
+         saver = tf.train.Saver()
+         saver.restore(sess, './model.ckpt-200000')
+         content_features = sess.run(layers, feed_dict={x: content_tf})
+         styles = sess.run(layers, feed_dict={x: style_tf})
+         for i, style_feature in enumerate(styles):
+             n_features = np.prod(layers[i].shape.as_list()[-1])
+             features = np.reshape(style_feature, (-1, n_features))
+             style_gram = np.matmul(features.T, features) / (n_samples * n_frames)
+             style_features.append(style_gram)
+     return content_features, style_features
+
+
+ def compute_wavenet_decoder_stylization(n_samples,
+                                         n_frames,
+                                         content_features,
+                                         style_features,
+                                         alpha=1e-4,
+                                         learning_rate=1e-3,
+                                         iterations=100):
+     style_layers = [1, 5]
+     num_stages = 10
+     num_layers = 30
+     filter_length = 3
+     width = 512
+     skip_width = 256
+     layers = []
+     with tf.Graph().as_default() as g, g.device('/cpu:0'), tf.Session() as sess:
+         x = tf.placeholder(
+             name="x", shape=(n_frames, n_samples, 1), dtype=tf.float32)
+         layer = x
+         layer = masked.conv1d(
+             layer, num_filters=width, filter_length=filter_length, name='startconv')
+
+         # Set up skip connections.
+         s = masked.conv1d(
+             layer, num_filters=skip_width, filter_length=1, name='skip_start')
+
+         # Residual blocks with skip connections.
+         for i in range(num_layers):
+             dilation = 2**(i % num_stages)
+             d = masked.conv1d(
+                 layer,
+                 num_filters=2 * width,
+                 filter_length=filter_length,
+                 dilation=dilation,
+                 name='dilatedconv_%d' % (i + 1))
+             assert d.get_shape().as_list()[2] % 2 == 0
+             m = d.get_shape().as_list()[2] // 2
+             d_sigmoid = tf.sigmoid(d[:, :, :m])
+             d_tanh = tf.tanh(d[:, :, m:])
+             d = d_sigmoid * d_tanh
+
+             layer += masked.conv1d(
+                 d, num_filters=width, filter_length=1, name='res_%d' % (i + 1))
+             s += masked.conv1d(
+                 d,
+                 num_filters=skip_width,
+                 filter_length=1,
+                 name='skip_%d' % (i + 1))
+             layer_i = tf.identity(s, name='layer_{}'.format(i))
+             layers.append(layer_i)
+         saver = tf.train.Saver()
+         saver.restore(sess, './model.ckpt-200000')
+         sess.run(tf.initialize_all_variables())
+         frozen_graph_def = tf.graph_util.convert_variables_to_constants(
+             sess, sess.graph_def, [s.name.replace(':0', '')] +
+             ['layer_{}'.format(i) for i in range(num_layers)])
+
+     with tf.Graph().as_default() as g, g.device('/cpu:0'), tf.Session() as sess:
+         x = tf.Variable(
+             np.random.randn(n_frames, n_samples, 1).astype(np.float32))
+         tf.import_graph_def(frozen_graph_def, input_map={'x:0': x})
+         content_loss = np.float32(0.0)
+         style_loss = np.float32(0.0)
+         for num_layer in style_layers:
+             layer_i = g.get_tensor_by_name(name='import/layer_%d:0' % num_layer)
+             content_loss = content_loss + alpha * 2 * tf.nn.l2_loss(
+                 layer_i - content_features[num_layer])
+             n_features = layer_i.shape.as_list()[-1]
+             features = tf.reshape(layer_i, (-1, n_features))
+             gram = tf.matmul(tf.transpose(features), features) / (n_frames * n_samples)
+             style_loss = style_loss + 2 * tf.nn.l2_loss(
+                 gram - style_features[num_layer])
+         loss = content_loss + style_loss
+         # Optimization
+         print('Started optimization.')
+         opt = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
+         var_list = tf.trainable_variables()
+         print(var_list)
+         sess.run(tf.initialize_all_variables())
+         for i in range(iterations):
+             s, c, _ = sess.run([style_loss, content_loss, opt])
+             print(i, '- Style:', s, 'Content:', c, end='\r')
+         result = x.eval()
+         result = inv_mu_law_numpy(result[..., 0] / result.max() * 128.0)
+
+     return result
+
+
+ def run(content_fname,
+         style_fname,
+         output_path,
+         model,
+         iterations=100,
+         sr=16000,
+         hop_size=512,
+         frame_size=2048,
+         alpha=1e-3):
+
+     content, fs = librosa.load(content_fname, sr=sr)
+     style, fs = librosa.load(style_fname, sr=sr)
+     n_samples = (min(content.shape[0], style.shape[0]) // 512) * 512
+     content = utils.chop(content[:n_samples], hop_size, frame_size)
+     style = utils.chop(style[:n_samples], hop_size, frame_size)
+
+     if model == 'encoder':
+         content_features, style_features = compute_wavenet_encoder_features(
+             content=content, style=style)
+         result = compute_wavenet_encoder_stylization(
+             n_frames=content_features[0].shape[0],
+             n_samples=frame_size,
+             alpha=alpha,
+             content_features=content_features,
+             style_features=style_features,
+             iterations=iterations)
+     elif model == 'decoder':
+         content_features, style_features = compute_wavenet_decoder_features(
+             content=content, style=style)
+         result = compute_wavenet_decoder_stylization(
+             n_frames=content_features[0].shape[0],
+             n_samples=frame_size,
+             alpha=alpha,
+             content_features=content_features,
+             style_features=style_features,
+             iterations=iterations)
+     else:
+         raise ValueError('Unsupported model type: {}.'.format(model))
+
+     x = utils.unchop(result, hop_size, frame_size)
+     librosa.output.write_wav('prelimiter.wav', x, sr)
+
+     limited = utils.limiter(x)
+     output_fname = '{}/{}+{}.wav'.format(output_path,
+                                          content_fname.split('/')[-1],
+                                          style_fname.split('/')[-1])
+     librosa.output.write_wav(output_fname, limited, sr=sr)
+
+
+ def batch(content_path, style_path, output_path, model):
+     content_files = glob.glob('{}/*.wav'.format(content_path))
+     style_files = glob.glob('{}/*.wav'.format(style_path))
+     for content_fname in content_files:
+         for style_fname in style_files:
+             output_fname = '{}/{}+{}.wav'.format(output_path,
+                                                  content_fname.split('/')[-1],
+                                                  style_fname.split('/')[-1])
+             if os.path.exists(output_fname):
+                 continue
+             run(content_fname, style_fname, output_path, model)
+
+
+ if __name__ == '__main__':
+     parser = argparse.ArgumentParser()
+     parser.add_argument(
+         '-s', '--style', help='style file(s) location', required=True)
+     parser.add_argument(
+         '-c', '--content', help='content file(s) location', required=True)
+     parser.add_argument('-o', '--output', help='output path', required=True)
+     parser.add_argument(
+         '-m',
+         '--model',
+         help='model type: [encoder], or decoder',
+         default='encoder')
+     parser.add_argument(
+         '-t',
+         '--type',
+         help='mode for training [single] (point to files) or batch (point to path)',
+         default='single')
+
+     args = vars(parser.parse_args())
+     if args['type'] == 'single':
+         run(args['content'], args['style'], args['output'], args['model'])
+     else:
+         batch(args['content'], args['style'], args['output'], args['model'])
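
The style term in both feature functions above is a Gram matrix over layer activations, normalized by the frame and sample counts. A minimal NumPy sketch of that computation, detached from TensorFlow and magenta (the shapes here are hypothetical, chosen only for illustration):

```python
import numpy as np

def style_gram(activations, n_frames, n_samples):
    """Channel-by-channel correlations of a layer's activations.

    activations: array of shape (n_frames, time, channels), as produced
    by one WaveNet layer per frame in the functions above.
    """
    n_features = activations.shape[-1]
    # Flatten frames and time into rows, keeping channels as columns.
    features = np.reshape(activations, (-1, n_features))
    # Same normalization as the code above: divide by n_samples * n_frames.
    return np.matmul(features.T, features) / (n_samples * n_frames)

acts = np.random.randn(4, 16, 8)  # 4 frames, 16 timesteps, 8 channels
gram = style_gram(acts, n_frames=4, n_samples=16)
```

The resulting (channels x channels) matrix discards temporal position, which is what lets the style loss match texture rather than alignment.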
audio_style_transfer/models/timedomain.py ADDED
@@ -0,0 +1,354 @@
+ """NIPS2017 "Time Domain Neural Audio Style Transfer" code repository
+ Parag K. Mital
+ """
+ import tensorflow as tf
+ import librosa
+ import numpy as np
+ from scipy.signal import hann
+ from audio_style_transfer import utils
+ import argparse
+ import glob
+ import os
+
+
+ def chop(signal, hop_size=256, frame_size=512):
+     n_hops = len(signal) // hop_size
+     s = []
+     hann_win = hann(frame_size)
+     for hop_i in range(n_hops):
+         frame = signal[(hop_i * hop_size):(hop_i * hop_size + frame_size)]
+         frame = np.pad(frame, (0, frame_size - len(frame)), 'constant')
+         frame *= hann_win
+         s.append(frame)
+     s = np.array(s)
+     return s
+
+
+ def unchop(frames, hop_size=256, frame_size=512):
+     signal = np.zeros((frames.shape[0] * hop_size + frame_size,))
+     for hop_i, frame in enumerate(frames):
+         signal[(hop_i * hop_size):(hop_i * hop_size + frame_size)] += frame
+     return signal
+
+
+ def dft_np(signal, hop_size=256, fft_size=512):
+     s = chop(signal, hop_size, fft_size)
+     N = s.shape[-1]
+     k = np.reshape(
+         np.linspace(0.0, 2 * np.pi / N * (N // 2), N // 2), [1, N // 2])
+     x = np.reshape(np.linspace(0.0, N - 1, N), [N, 1])
+     freqs = np.dot(x, k)
+     real = np.dot(s, np.cos(freqs)) * (2.0 / N)
+     imag = np.dot(s, np.sin(freqs)) * (2.0 / N)
+     return real, imag
+
+
+ def idft_np(re, im, hop_size=256, fft_size=512):
+     N = re.shape[1] * 2
+     k = np.reshape(
+         np.linspace(0.0, 2 * np.pi / N * (N // 2), N // 2), [N // 2, 1])
+     x = np.reshape(np.linspace(0.0, N - 1, N), [1, N])
+     freqs = np.dot(k, x)
+     signal = np.zeros((re.shape[0] * hop_size + fft_size,))
+     recon = np.dot(re, np.cos(freqs)) + np.dot(im, np.sin(freqs))
+     for hop_i, frame in enumerate(recon):
+         signal[(hop_i * hop_size):(hop_i * hop_size + fft_size)] += frame
+     return signal
+
+
+ def unwrap(x):
+     return np.unwrap(x).astype(np.float32)
+
+
+ def instance_norm(x, epsilon=1e-5):
+     """Instance Normalization.
+
+     See Ulyanov, D., Vedaldi, A., & Lempitsky, V. (2016).
+     Instance Normalization: The Missing Ingredient for Fast Stylization.
+     Retrieved from http://arxiv.org/abs/1607.08022
+
+     Parameters
+     ----------
+     x : tf.Tensor
+         Input tensor to normalize.
+     epsilon : float, optional
+         Small constant added to the variance for numerical stability.
+     """
+     with tf.variable_scope('instance_norm'):
+         mean, var = tf.nn.moments(x, [1, 2], keep_dims=True)
+         scale = tf.get_variable(
+             name='scale',
+             shape=[x.get_shape()[-1]],
+             initializer=tf.truncated_normal_initializer(mean=1.0, stddev=0.02))
+         offset = tf.get_variable(
+             name='offset',
+             shape=[x.get_shape()[-1]],
+             initializer=tf.constant_initializer(0.0))
+         out = scale * tf.div(x - mean, tf.sqrt(var + epsilon)) + offset
+         return out
+
+
+ def compute_inputs(x, freqs, n_fft, n_frames, input_features, norm=False):
+     if norm:
+         norm_fn = instance_norm
+     else:
+         def norm_fn(x):
+             return x
+     freqs_tf = tf.constant(freqs, name="freqs", dtype='float32')
+     inputs = {}
+     with tf.variable_scope('real'):
+         inputs['real'] = norm_fn(tf.reshape(
+             tf.matmul(x, tf.cos(freqs_tf)), [1, 1, n_frames, n_fft // 2]))
+     with tf.variable_scope('imag'):
+         inputs['imag'] = norm_fn(tf.reshape(
+             tf.matmul(x, tf.sin(freqs_tf)), [1, 1, n_frames, n_fft // 2]))
+     with tf.variable_scope('mags'):
+         inputs['mags'] = norm_fn(tf.reshape(
+             tf.sqrt(
+                 tf.maximum(1e-15, inputs['real'] * inputs['real'] +
+                            inputs['imag'] * inputs['imag'])),
+             [1, 1, n_frames, n_fft // 2]))
+     with tf.variable_scope('phase'):
+         inputs['phase'] = norm_fn(tf.atan2(inputs['imag'], inputs['real']))
+     with tf.variable_scope('unwrapped'):
+         inputs['unwrapped'] = tf.py_func(
+             unwrap, [inputs['phase']], tf.float32)
+     with tf.variable_scope('unwrapped_difference'):
+         inputs['unwrapped_difference'] = (tf.slice(
+             inputs['unwrapped'],
+             [0, 0, 0, 1], [-1, -1, -1, n_fft // 2 - 1]) -
+             tf.slice(
+                 inputs['unwrapped'],
+                 [0, 0, 0, 0], [-1, -1, -1, n_fft // 2 - 1]))
+     if 'unwrapped_difference' in input_features:
+         for k, v in inputs.items():
+             if k != 'unwrapped_difference':
+                 inputs[k] = tf.slice(
+                     v, [0, 0, 0, 0], [-1, -1, -1, n_fft // 2 - 1])
+     net = tf.concat([inputs[i] for i in input_features], 1)
+     return inputs, net
+
+
+ def compute_features(content,
+                      style,
+                      input_features,
+                      norm=False,
+                      stride=1,
+                      n_layers=1,
+                      n_filters=4096,
+                      n_fft=1024,
+                      k_h=1,
+                      k_w=11):
+     n_frames = content.shape[0]
+     n_samples = content.shape[1]
+     content_tf = np.ascontiguousarray(content)
+     style_tf = np.ascontiguousarray(style)
+     g = tf.Graph()
+     kernels = []
+     content_features = []
+     style_features = []
+     config_proto = tf.ConfigProto()
+     config_proto.gpu_options.allow_growth = True
+     with g.as_default(), g.device('/cpu:0'), tf.Session(config=config_proto) as sess:
+         x = tf.placeholder('float32', [n_frames, n_samples], name="x")
+         p = np.reshape(
+             np.linspace(0.0, n_samples - 1, n_samples), [n_samples, 1])
+         k = np.reshape(
+             np.linspace(0.0, 2 * np.pi / n_fft * (n_fft // 2), n_fft // 2),
+             [1, n_fft // 2])
+         freqs = np.dot(p, k)
+         inputs, net = compute_inputs(x, freqs, n_fft, n_frames, input_features, norm)
+         sess.run(tf.initialize_all_variables())
+         content_feature = net.eval(feed_dict={x: content_tf})
+         content_features.append(content_feature)
+         style_feature = inputs['mags'].eval(feed_dict={x: style_tf})
+         features = np.reshape(style_feature, (-1, n_fft // 2))
+         style_gram = np.matmul(features.T, features) / (n_frames)
+         style_features.append(style_gram)
+         for layer_i in range(n_layers):
+             if layer_i == 0:
+                 std = np.sqrt(2) * np.sqrt(2.0 / (
+                     (n_fft / 2 + n_filters) * k_w))
+                 kernel = np.random.randn(k_h, k_w, n_fft // 2, n_filters) * std
172
+ else:
173
+ std = np.sqrt(2) * np.sqrt(2.0 / (
174
+ (n_filters + n_filters) * k_w))
175
+ kernel = np.random.randn(1, k_w, n_filters, n_filters) * std
176
+ kernels.append(kernel)
177
+ kernel_tf = tf.constant(
178
+ kernel, name="kernel{}".format(layer_i), dtype='float32')
179
+ conv = tf.nn.conv2d(
180
+ net,
181
+ kernel_tf,
182
+ strides=[1, stride, stride, 1],
183
+ padding="VALID",
184
+ name="conv{}".format(layer_i))
185
+ net = tf.nn.relu(conv)
186
+ content_feature = net.eval(feed_dict={x: content_tf})
187
+ content_features.append(content_feature)
188
+ style_feature = net.eval(feed_dict={x: style_tf})
189
+ features = np.reshape(style_feature, (-1, n_filters))
190
+ style_gram = np.matmul(features.T, features) / (n_frames)
191
+ style_features.append(style_gram)
192
+ return content_features, style_features, kernels, freqs
193
+
194
+
195
+ def compute_stylization(kernels,
196
+ n_samples,
197
+ n_frames,
198
+ content_features,
199
+ style_gram,
200
+ freqs,
201
+ input_features,
202
+ norm=False,
203
+ stride=1,
204
+ n_layers=1,
205
+ n_fft=1024,
206
+ alpha=1e-4,
207
+ learning_rate=1e-3,
208
+ iterations=100,
209
+ optimizer='bfgs'):
210
+ result = None
211
+ with tf.Graph().as_default():
212
+ x = tf.Variable(
213
+ np.random.randn(n_frames, n_samples).astype(np.float32) * 1e-3,
214
+ name="x")
215
+ inputs, net = compute_inputs(x, freqs, n_fft, n_frames, input_features, norm)
216
+ content_loss = alpha * 2 * tf.nn.l2_loss(net - content_features[0])
217
+ feats = tf.reshape(inputs['mags'], (-1, n_fft // 2))
218
+ gram = tf.matmul(tf.transpose(feats), feats) / (n_frames)
219
+ style_loss = 2 * tf.nn.l2_loss(gram - style_gram[0])
220
+ for layer_i in range(n_layers):
221
+ kernel_tf = tf.constant(
222
+ kernels[layer_i],
223
+ name="kernel{}".format(layer_i),
224
+ dtype='float32')
225
+ conv = tf.nn.conv2d(
226
+ net,
227
+ kernel_tf,
228
+ strides=[1, stride, stride, 1],
229
+ padding="VALID",
230
+ name="conv{}".format(layer_i))
231
+ net = tf.nn.relu(conv)
232
+ content_loss = content_loss + \
233
+ alpha * 2 * tf.nn.l2_loss(net - content_features[layer_i + 1])
234
+ _, height, width, number = map(lambda i: i.value, net.get_shape())
235
+ feats = tf.reshape(net, (-1, number))
236
+ gram = tf.matmul(tf.transpose(feats), feats) / (n_frames)
237
+ style_loss = style_loss + 2 * tf.nn.l2_loss(gram - style_gram[
238
+ layer_i + 1])
239
+ loss = content_loss + style_loss
240
+ if optimizer == 'bfgs':
241
+ opt = tf.contrib.opt.ScipyOptimizerInterface(
242
+ loss, method='L-BFGS-B', options={'maxiter': iterations})
243
+ # Optimization
244
+ with tf.Session() as sess:
245
+ sess.run(tf.initialize_all_variables())
246
+ print('Started optimization.')
247
+ opt.minimize(sess)
248
+ result = x.eval()
249
+ else:
250
+ opt = tf.train.AdamOptimizer(
251
+ learning_rate=learning_rate).minimize(loss)
252
+ # Optimization
253
+ with tf.Session() as sess:
254
+ sess.run(tf.initialize_all_variables())
255
+ print('Started optimization.')
256
+ for i in range(iterations):
257
+ s, c, l, _ = sess.run([style_loss, content_loss, loss, opt])
258
+ print('Style:', s, 'Content:', c, end='\r')
259
+ result = x.eval()
260
+ return result
261
+
262
+
263
+ def run(content_fname,
264
+ style_fname,
265
+ output_fname,
266
+ norm=False,
267
+ input_features=['real', 'imag', 'mags'],
268
+ n_fft=4096,
269
+ n_layers=1,
270
+ n_filters=4096,
271
+ hop_length=256,
272
+ alpha=0.05,
273
+ k_w=15,
274
+ k_h=3,
275
+ optimizer='bfgs',
276
+ stride=1,
277
+ iterations=300,
278
+ sr=22050):
279
+
280
+ frame_size = n_fft // 2
281
+
282
+ audio, fs = librosa.load(content_fname, sr=sr)
283
+ content = chop(audio, hop_size=hop_length, frame_size=frame_size)
284
+ audio, fs = librosa.load(style_fname, sr=sr)
285
+ style = chop(audio, hop_size=hop_length, frame_size=frame_size)
286
+
287
+ n_frames = min(content.shape[0], style.shape[0])
288
+ n_samples = min(content.shape[1], style.shape[1])
289
+ content = content[:n_frames, :n_samples]
290
+ style = style[:n_frames, :n_samples]
291
+
292
+ content_features, style_gram, kernels, freqs = compute_features(
293
+ content=content,
294
+ style=style,
295
+ input_features=input_features,
296
+ norm=norm,
297
+ stride=stride,
298
+ n_fft=n_fft,
299
+ n_layers=n_layers,
300
+ n_filters=n_filters,
301
+ k_w=k_w,
302
+ k_h=k_h)
303
+
304
+ result = compute_stylization(
305
+ kernels=kernels,
306
+ freqs=freqs,
307
+ input_features=input_features,
308
+ norm=norm,
309
+ n_samples=n_samples,
310
+ n_frames=n_frames,
311
+ n_fft=n_fft,
312
+ content_features=content_features,
313
+ style_gram=style_gram,
314
+ stride=stride,
315
+ n_layers=n_layers,
316
+ alpha=alpha,
317
+ optimizer=optimizer,
318
+ iterations=iterations)
319
+
320
+ s = unchop(result, hop_size=hop_length, frame_size=frame_size)
321
+ librosa.output.write_wav(output_fname, s, sr=sr)
322
+ s = utils.limiter(s)
323
+ librosa.output.write_wav(output_fname + '.limiter.wav', s, sr=sr)
324
+
325
+
326
+ def batch(content_path, style_path, output_path, model):
327
+ content_files = glob.glob('{}/*.wav'.format(content_path))
328
+ style_files = glob.glob('{}/*.wav'.format(style_path))
329
+ for content_fname in content_files:
330
+ for style_fname in style_files:
331
+ output_fname = '{}/{}+{}.wav'.format(output_path,
332
+ content_fname.split('/')[-1],
333
+ style_fname.split('/')[-1])
334
+ if os.path.exists(output_fname):
335
+ continue
336
+ run(content_fname, style_fname, output_fname, model)
337
+
338
+
339
+ if __name__ == '__main__':
340
+ parser = argparse.ArgumentParser()
341
+ parser.add_argument('-s', '--style', help='style file', required=True)
342
+ parser.add_argument('-c', '--content', help='content file', required=True)
343
+ parser.add_argument('-o', '--output', help='output file', required=True)
344
+ parser.add_argument(
345
+ '-m',
346
+ '--mode',
347
+ help='mode for training [single] or batch',
348
+ default='single')
349
+
350
+ args = vars(parser.parse_args())
351
+ if args['mode'] == 'single':
352
+ run(args['content'], args['style'], args['output'])
353
+ else:
354
+ batch(args['content'], args['style'], args['output'])
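The style term above summarizes each feature map by its Gram matrix (`features.T @ features / n_frames`), which captures channel-to-channel correlations while discarding temporal order. A minimal NumPy sketch of that statistic outside TensorFlow (array shapes and names here are illustrative, not taken from the code above):

```python
import numpy as np

def gram_matrix(features, n_frames):
    """Channel-by-channel correlation of a (time, channels) feature map.

    Mirrors the style statistic used above: flatten all positions,
    then accumulate outer products and normalize by the frame count.
    """
    feats = np.reshape(features, (-1, features.shape[-1]))
    return np.matmul(feats.T, feats) / n_frames

# Hypothetical feature maps for a style target and a synthesis candidate.
rng = np.random.RandomState(0)
style_feats = rng.randn(128, 64)    # 128 frames, 64 channels
synth_feats = rng.randn(128, 64)

style_gram = gram_matrix(style_feats, n_frames=128)
synth_gram = gram_matrix(synth_feats, n_frames=128)

# The style loss is the squared Frobenius distance between Gram matrices,
# analogous to `2 * tf.nn.l2_loss(gram - style_gram)` in the graph above.
style_loss = np.sum((synth_gram - style_gram) ** 2)
```

Because the Gram matrix averages over time, two signals with the same spectral texture but different event ordering produce similar style losses, which is what makes it a useful "texture" statistic.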
audio_style_transfer/models/uylanov.py ADDED
@@ -0,0 +1,205 @@
+ """NIPS2017 "Time Domain Neural Audio Style Transfer" code repository
+ Parag K. Mital
+ """
+ import tensorflow as tf
+ import librosa
+ import numpy as np
+ import argparse
+ import glob
+ import os
+ from audio_style_transfer import utils
+
+
+ def read_audio_spectum(filename, n_fft=2048, hop_length=512, sr=22050):
+     x, sr = librosa.load(filename, sr=sr)
+     S = librosa.stft(x, n_fft, hop_length)
+     S = np.log1p(np.abs(S)).T
+     return S, sr
+
+
+ def compute_features(content,
+                      style,
+                      stride=1,
+                      n_layers=1,
+                      n_filters=4096,
+                      k_h=1,
+                      k_w=11):
+     n_frames = content.shape[0]
+     n_samples = content.shape[1]
+     content_tf = np.ascontiguousarray(content)
+     style_tf = np.ascontiguousarray(style)
+     g = tf.Graph()
+     kernels = []
+     layers = []
+     content_features = []
+     style_features = []
+     with g.as_default(), g.device('/cpu:0'), tf.Session():
+         x = tf.placeholder('float32', [None, n_samples], name="x")
+         net = tf.reshape(x, [1, 1, -1, n_samples])
+         for layer_i in range(n_layers):
+             if layer_i == 0:
+                 std = np.sqrt(2) * np.sqrt(2.0 / (
+                     (n_frames + n_filters) * k_w))
+                 kernel = np.random.randn(k_h, k_w, n_samples, n_filters) * std
+             else:
+                 std = np.sqrt(2) * np.sqrt(2.0 / (
+                     (n_filters + n_filters) * k_w))
+                 kernel = np.random.randn(k_h, k_w, n_filters, n_filters) * std
+             kernels.append(kernel)
+             kernel_tf = tf.constant(
+                 kernel, name="kernel{}".format(layer_i), dtype='float32')
+             conv = tf.nn.conv2d(
+                 net,
+                 kernel_tf,
+                 strides=[1, stride, stride, 1],
+                 padding="VALID",
+                 name="conv{}".format(layer_i))
+             net = tf.nn.relu(conv)
+             layers.append(net)
+             content_feature = net.eval(feed_dict={x: content_tf})
+             content_features.append(content_feature)
+             style_feature = net.eval(feed_dict={x: style_tf})
+             features = np.reshape(style_feature, (-1, n_filters))
+             style_gram = np.matmul(features.T, features) / n_frames
+             style_features.append(style_gram)
+     return content_features, style_features, kernels
+
+
+ def compute_stylization(kernels,
+                         n_samples,
+                         n_frames,
+                         content_features,
+                         style_features,
+                         stride=1,
+                         n_layers=1,
+                         alpha=1e-4,
+                         learning_rate=1e-3,
+                         iterations=100):
+     result = None
+     with tf.Graph().as_default():
+         x = tf.Variable(
+             np.random.randn(1, 1, n_frames, n_samples).astype(np.float32) *
+             1e-3,
+             name="x")
+         net = x
+         content_loss = 0
+         style_loss = 0
+         for layer_i in range(n_layers):
+             kernel_tf = tf.constant(
+                 kernels[layer_i],
+                 name="kernel{}".format(layer_i),
+                 dtype='float32')
+             conv = tf.nn.conv2d(
+                 net,
+                 kernel_tf,
+                 strides=[1, stride, stride, 1],
+                 padding="VALID",
+                 name="conv{}".format(layer_i))
+             net = tf.nn.relu(conv)
+             content_loss = content_loss + \
+                 alpha * 2 * tf.nn.l2_loss(net - content_features[layer_i])
+             _, height, width, number = map(lambda i: i.value, net.get_shape())
+             feats = tf.reshape(net, (-1, number))
+             gram = tf.matmul(tf.transpose(feats), feats) / n_frames
+             style_loss = style_loss + 2 * tf.nn.l2_loss(
+                 gram - style_features[layer_i])
+         loss = content_loss + style_loss
+         opt = tf.contrib.opt.ScipyOptimizerInterface(
+             loss, method='L-BFGS-B', options={'maxiter': iterations})
+         # Optimization
+         with tf.Session() as sess:
+             sess.run(tf.initialize_all_variables())
+             print('Started optimization.')
+             opt.minimize(sess)
+             print('Final loss:', loss.eval())
+             result = x.eval()
+     return result
+
+
+ def run(content_fname,
+         style_fname,
+         output_fname,
+         n_fft=2048,
+         hop_length=256,
+         alpha=0.02,
+         n_layers=1,
+         n_filters=8192,
+         k_w=15,
+         stride=1,
+         iterations=300,
+         phase_iterations=500,
+         sr=22050,
+         signal_length=1,  # seconds
+         block_length=1024):
+
+     content, sr = read_audio_spectum(
+         content_fname, n_fft=n_fft, hop_length=hop_length, sr=sr)
+     style, sr = read_audio_spectum(
+         style_fname, n_fft=n_fft, hop_length=hop_length, sr=sr)
+
+     n_frames = min(content.shape[0], style.shape[0])
+     n_samples = content.shape[1]
+     content = content[:n_frames, :]
+     style = style[:n_frames, :]
+
+     content_features, style_features, kernels = compute_features(
+         content=content,
+         style=style,
+         stride=stride,
+         n_layers=n_layers,
+         n_filters=n_filters,
+         k_w=k_w)
+
+     result = compute_stylization(
+         kernels=kernels,
+         n_samples=n_samples,
+         n_frames=n_frames,
+         content_features=content_features,
+         style_features=style_features,
+         stride=stride,
+         n_layers=n_layers,
+         alpha=alpha,
+         iterations=iterations)
+
+     mags = np.zeros_like(content.T)
+     mags[:, :n_frames] = np.exp(result[0, 0].T) - 1
+
+     # Griffin-Lim: recover a phase consistent with the stylized magnitudes.
+     p = 2 * np.pi * np.random.random_sample(mags.shape) - np.pi
+     for i in range(phase_iterations):
+         S = mags * np.exp(1j * p)
+         x = librosa.istft(S, hop_length)
+         p = np.angle(librosa.stft(x, n_fft, hop_length))
+
+     librosa.output.write_wav('prelimiter.wav', x, sr)
+     limited = utils.limiter(x)
+     librosa.output.write_wav(output_fname, limited, sr)
+
+
+ def batch(content_path, style_path, output_path):
+     content_files = glob.glob('{}/*.wav'.format(content_path))
+     style_files = glob.glob('{}/*.wav'.format(style_path))
+     for content_filename in content_files:
+         for style_filename in style_files:
+             output_filename = '{}/{}+{}.wav'.format(
+                 output_path,
+                 content_filename.split('/')[-1],
+                 style_filename.split('/')[-1])
+             if os.path.exists(output_filename):
+                 continue
+             run(content_filename, style_filename, output_filename)
+
+
+ if __name__ == '__main__':
+     parser = argparse.ArgumentParser()
+     parser.add_argument('-s', '--style', help='style file', required=True)
+     parser.add_argument('-c', '--content', help='content file', required=True)
+     parser.add_argument('-o', '--output', help='output file', required=True)
+     parser.add_argument(
+         '-m',
+         '--mode',
+         help='mode for training [single] or batch',
+         default='single')
+
+     args = vars(parser.parse_args())
+     if args['mode'] == 'single':
+         run(args['content'], args['style'], args['output'])
+     else:
+         batch(args['content'], args['style'], args['output'])
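The `phase_iterations` loop in `run` above is Griffin-Lim phase reconstruction: it alternates between imposing the target magnitudes and adopting the phase of the resynthesized signal. A self-contained NumPy sketch of the same idea, using a hand-rolled Hann-windowed STFT in place of librosa (all names and parameter values here are illustrative assumptions, not taken from the code above):

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Hann-windowed short-time Fourier transform of a 1-d signal."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(S, n_fft=256, hop=64):
    """Overlap-add inverse of `stft`, with window-squared normalization."""
    win = np.hanning(n_fft)
    frames = np.fft.irfft(S, n=n_fft, axis=1) * win
    x = np.zeros((len(frames) - 1) * hop + n_fft)
    norm = np.zeros_like(x)
    for i, frame in enumerate(frames):
        x[i * hop:i * hop + n_fft] += frame
        norm[i * hop:i * hop + n_fft] += win ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(mags, n_fft=256, hop=64, iterations=50, seed=0):
    """Estimate a phase consistent with `mags` by alternating projections."""
    rng = np.random.RandomState(seed)
    # Start from random phase, as the loop above does.
    p = 2 * np.pi * rng.random_sample(mags.shape) - np.pi
    for _ in range(iterations):
        x = istft(mags * np.exp(1j * p), n_fft, hop)
        p = np.angle(stft(x, n_fft, hop))
    return istft(mags * np.exp(1j * p), n_fft, hop)
```

Each iteration projects onto the set of signals with the target magnitude spectrum and then onto the set of consistent STFTs, so the magnitude mismatch shrinks as the loop runs; the original code does the same with `librosa.stft`/`librosa.istft`.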
audio_style_transfer/utils.py ADDED
@@ -0,0 +1,199 @@
+ """NIPS2017 "Time Domain Neural Audio Style Transfer" code repository
+ Parag K. Mital
+ """
+ import glob
+ import numpy as np
+ from scipy.signal import hann
+ import librosa
+ import matplotlib
+ import matplotlib.pyplot as plt
+ import os
+
+
+ def limiter(signal,
+             delay=40,
+             threshold=0.9,
+             release_coeff=0.9995,
+             attack_coeff=0.9):
+     delay_index = 0
+     envelope = 0
+     gain = 1
+     delay_line = np.zeros(delay)
+
+     for idx, sample in enumerate(signal):
+         delay_line[delay_index] = sample
+         delay_index = (delay_index + 1) % delay
+
+         # calculate an envelope of the signal
+         envelope = max(np.abs(sample), envelope * release_coeff)
+
+         if envelope > threshold:
+             target_gain = threshold / envelope
+         else:
+             target_gain = 1.0
+
+         # have gain go towards a desired limiter gain
+         gain = (gain * attack_coeff + target_gain * (1 - attack_coeff))
+
+         # limit the delayed signal
+         signal[idx] = delay_line[delay_index] * gain
+     return signal
+
+
+ def chop(signal, hop_size=256, frame_size=512):
+     n_hops = len(signal) // hop_size
+     frames = []
+     hann_win = hann(frame_size)
+     for hop_i in range(n_hops):
+         frame = signal[(hop_i * hop_size):(hop_i * hop_size + frame_size)]
+         frame = np.pad(frame, (0, frame_size - len(frame)), 'constant')
+         frame *= hann_win
+         frames.append(frame)
+     frames = np.array(frames)
+     return frames
+
+
+ def unchop(frames, hop_size=256, frame_size=512):
+     signal = np.zeros((frames.shape[0] * hop_size + frame_size,))
+     for hop_i, frame in enumerate(frames):
+         signal[(hop_i * hop_size):(hop_i * hop_size + frame_size)] += frame
+     return signal
+
+
+ def matrix_dft(V):
+     N = len(V)
+     w = np.exp(-2j * np.pi / N)
+     col = np.vander([w], N, True)
+     W = np.vander(col.flatten(), N, True) / np.sqrt(N)
+     return np.dot(W, V)
+
+
+ def dft_np(signal, hop_size=256, fft_size=512):
+     s = chop(signal, hop_size, fft_size)
+     N = s.shape[-1]
+     k = np.reshape(
+         np.linspace(0.0, 2 * np.pi / N * (N // 2), N // 2), [1, N // 2])
+     x = np.reshape(np.linspace(0.0, N - 1, N), [N, 1])
+     freqs = np.dot(x, k)
+     real = np.dot(s, np.cos(freqs)) * (2.0 / N)
+     imag = np.dot(s, np.sin(freqs)) * (2.0 / N)
+     return real, imag
+
+
+ def idft_np(re, im, hop_size=256, fft_size=512):
+     N = re.shape[1] * 2
+     k = np.reshape(
+         np.linspace(0.0, 2 * np.pi / N * (N // 2), N // 2), [N // 2, 1])
+     x = np.reshape(np.linspace(0.0, N - 1, N), [1, N])
+     freqs = np.dot(k, x)
+     signal = np.zeros((re.shape[0] * hop_size + fft_size,))
+     recon = np.dot(re, np.cos(freqs)) + np.dot(im, np.sin(freqs))
+     for hop_i, frame in enumerate(recon):
+         signal[(hop_i * hop_size):(hop_i * hop_size + fft_size)] += frame
+     return signal
+
+
+ def rainbowgram(path,
+                 ax,
+                 peak=70.0,
+                 use_cqt=False,
+                 n_fft=1024,
+                 hop_length=256,
+                 sr=22050,
+                 over_sample=4,
+                 res_factor=0.8,
+                 octaves=5,
+                 notes_per_octave=10):
+     audio = librosa.load(path, sr=sr)[0]
+     if use_cqt:
+         C = librosa.cqt(audio,
+                         sr=sr,
+                         hop_length=hop_length,
+                         bins_per_octave=int(notes_per_octave * over_sample),
+                         n_bins=int(octaves * notes_per_octave * over_sample),
+                         filter_scale=res_factor,
+                         fmin=librosa.note_to_hz('C2'))
+     else:
+         C = librosa.stft(
+             audio,
+             n_fft=n_fft,
+             win_length=n_fft,
+             hop_length=hop_length,
+             center=True)
+     mag, phase = librosa.core.magphase(C)
+     phase_angle = np.angle(phase)
+     phase_unwrapped = np.unwrap(phase_angle)
+     dphase = phase_unwrapped[:, 1:] - phase_unwrapped[:, :-1]
+     dphase = np.concatenate([phase_unwrapped[:, 0:1], dphase], axis=1) / np.pi
+     mag = (librosa.power_to_db(
+         mag**2, amin=1e-13, top_db=peak, ref=np.max) / peak) + 1
+     cdict = {
+         'red': ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
+         'green': ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
+         'blue': ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
+         'alpha': ((0.0, 1.0, 1.0), (1.0, 0.0, 0.0))
+     }
+     my_mask = matplotlib.colors.LinearSegmentedColormap('MyMask', cdict)
+     plt.register_cmap(cmap=my_mask)
+     ax.matshow(dphase[::-1, :], cmap=plt.cm.rainbow)
+     ax.matshow(mag[::-1, :], cmap=my_mask)
+
+
+ def rainbowgrams(list_of_paths,
+                  saveto=None,
+                  rows=2,
+                  cols=4,
+                  col_labels=[],
+                  row_labels=[],
+                  use_cqt=True,
+                  figsize=(15, 20),
+                  peak=70.0):
+     """Build a rows x cols grid of rainbowgrams."""
+     N = len(list_of_paths)
+     assert N == rows * cols
+     fig, axes = plt.subplots(
+         rows, cols, sharex=True, sharey=True, figsize=figsize)
+     fig.subplots_adjust(left=0.05, right=0.95, wspace=0.05, hspace=0.1)
+     for i, path in enumerate(list_of_paths):
+         row = i // cols
+         col = i % cols
+         if rows == 1 and cols == 1:
+             ax = axes
+         elif rows == 1:
+             ax = axes[col]
+         elif cols == 1:
+             ax = axes[row]
+         else:
+             ax = axes[row, col]
+         rainbowgram(path, ax, peak, use_cqt)
+         ax.set_facecolor('white')
+         ax.set_xticks([])
+         ax.set_yticks([])
+         if col == 0 and row_labels:
+             ax.set_ylabel(row_labels[row])
+         if row == rows - 1 and col_labels:
+             ax.set_xlabel(col_labels[col])
+     if saveto is not None:
+         fig.savefig('{}.png'.format(saveto))
+
+
+ def plot_rainbowgrams():
+     for root in ['target', 'corpus', 'results']:
+         files = glob.glob('{}/**/*.wav'.format(root), recursive=True)
+         for f in files:
+             fname = '{}.png'.format(f)
+             if not os.path.exists(fname):
+                 rainbowgrams(
+                     [f],
+                     saveto=fname,
+                     figsize=(20, 5),
+                     rows=1,
+                     cols=1)
+                 plt.close('all')
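`matrix_dft` above builds the DFT as an explicit Vandermonde matrix of powers of the N-th root of unity; up to its `1/sqrt(N)` normalization it agrees with `np.fft.fft`. A quick sketch checking that equivalence (mirroring the function above; the input vector is arbitrary):

```python
import numpy as np

def matrix_dft(V):
    # Same construction as in utils.py: W[j, k] = w**(j*k) for the
    # N-th root of unity w, normalized by sqrt(N) (a unitary DFT).
    N = len(V)
    w = np.exp(-2j * np.pi / N)
    col = np.vander([w], N, True)                        # [w**0, ..., w**(N-1)]
    W = np.vander(col.flatten(), N, True) / np.sqrt(N)
    return np.dot(W, V)

rng = np.random.RandomState(0)
v = rng.randn(16)

# Multiplying by sqrt(N) recovers NumPy's unnormalized FFT convention.
out = matrix_dft(v)
```

Materializing the N x N matrix costs O(N^2) memory and time versus the FFT's O(N log N), so this form is only practical for the small frame sizes used here, but it makes the transform differentiable-friendly and easy to inspect.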
environment.yml ADDED
@@ -0,0 +1,117 @@
+ name: tdnast
+ channels:
+   - defaults
+ dependencies:
+   - _libgcc_mutex=0.1=main
+   - _openmp_mutex=4.5=1_gnu
+   - _tflow_select=2.1.0=gpu
+   - absl-py=0.15.0=pyhd3eb1b0_0
+   - astor=0.8.1=py37h06a4308_0
+   - blas=1.0=mkl
+   - brotli=1.0.9=he6710b0_2
+   - c-ares=1.18.1=h7f8727e_0
+   - ca-certificates=2022.3.29=h06a4308_0
+   - cached-property=1.5.2=py_0
+   - certifi=2021.10.8=py37h06a4308_2
+   - cudatoolkit=10.0.130=0
+   - cudnn=7.6.5=cuda10.0_0
+   - cupti=10.0.130=0
+   - cycler=0.11.0=pyhd3eb1b0_0
+   - dbus=1.13.18=hb2f20db_0
+   - expat=2.4.4=h295c915_0
+   - fontconfig=2.13.1=h6c09931_0
+   - fonttools=4.25.0=pyhd3eb1b0_0
+   - freetype=2.11.0=h70c0345_0
+   - gast=0.2.2=py37_0
+   - giflib=5.2.1=h7b6447c_0
+   - glib=2.69.1=h4ff587b_1
+   - google-pasta=0.2.0=pyhd3eb1b0_0
+   - grpcio=1.42.0=py37hce63b2e_0
+   - gst-plugins-base=1.14.0=h8213a91_2
+   - gstreamer=1.14.0=h28cd5cc_2
+   - h5py=3.6.0=py37ha0f2276_0
+   - hdf5=1.10.6=hb1b8bf9_0
+   - icu=58.2=he6710b0_3
+   - importlib-metadata=4.11.3=py37h06a4308_0
+   - intel-openmp=2021.4.0=h06a4308_3561
+   - jpeg=9d=h7f8727e_0
+   - keras-applications=1.0.8=py_1
+   - keras-preprocessing=1.1.2=pyhd3eb1b0_0
+   - kiwisolver=1.3.2=py37h295c915_0
+   - lcms2=2.12=h3be6417_0
+   - ld_impl_linux-64=2.35.1=h7274673_9
+   - libffi=3.3=he6710b0_2
+   - libgcc-ng=9.3.0=h5101ec6_17
+   - libgfortran-ng=7.5.0=ha8ba4b0_17
+   - libgfortran4=7.5.0=ha8ba4b0_17
+   - libgomp=9.3.0=h5101ec6_17
+   - libpng=1.6.37=hbc83047_0
+   - libprotobuf=3.19.1=h4ff587b_0
+   - libstdcxx-ng=9.3.0=hd4cf53a_17
+   - libtiff=4.2.0=h85742a9_0
+   - libuuid=1.0.3=h7f8727e_2
+   - libwebp=1.2.2=h55f646e_0
+   - libwebp-base=1.2.2=h7f8727e_0
+   - libxcb=1.14=h7b6447c_0
+   - libxml2=2.9.12=h03d6c58_0
+   - lz4-c=1.9.3=h295c915_1
+   - markdown=3.3.4=py37h06a4308_0
+   - matplotlib=3.5.1=py37h06a4308_1
+   - matplotlib-base=3.5.1=py37ha18d171_1
+   - mkl=2021.4.0=h06a4308_640
+   - mkl-service=2.4.0=py37h7f8727e_0
+   - mkl_fft=1.3.1=py37hd3c417c_0
+   - mkl_random=1.2.2=py37h51133e4_0
+   - munkres=1.1.4=py_0
+   - ncurses=6.3=h7f8727e_2
+   - numpy=1.21.2=py37h20f2e39_0
+   - numpy-base=1.21.2=py37h79a1101_0
+   - openssl=1.1.1n=h7f8727e_0
+   - opt_einsum=3.3.0=pyhd3eb1b0_1
+   - packaging=21.3=pyhd3eb1b0_0
+   - pcre=8.45=h295c915_0
+   - pillow=9.0.1=py37h22f2fdc_0
+   - pip=21.2.2=py37h06a4308_0
+   - protobuf=3.19.1=py37h295c915_0
+   - pyparsing=3.0.4=pyhd3eb1b0_0
+   - pyqt=5.9.2=py37h05f1152_2
+   - python=3.7.13=h12debd9_0
+   - python-dateutil=2.8.2=pyhd3eb1b0_0
+   - qt=5.9.7=h5867ecd_1
+   - readline=8.1.2=h7f8727e_1
+   - scipy=1.7.3=py37hc147768_0
+   - setuptools=58.0.4=py37h06a4308_0
+   - sip=4.19.8=py37hf484d3e_0
+   - six=1.16.0=pyhd3eb1b0_1
+   - sqlite=3.38.2=hc218d9a_0
+   - tensorboard=1.15.0=pyhb230dea_0
+   - tensorflow=1.15.0=gpu_py37h0f0df58_0
+   - tensorflow-base=1.15.0=gpu_py37h9dcbed7_0
+   - tensorflow-estimator=1.15.1=pyh2649769_0
+   - tensorflow-gpu=1.15.0=h0d30ee6_0
+   - termcolor=1.1.0=py37h06a4308_1
+   - tk=8.6.11=h1ccaba5_0
+   - tornado=6.1=py37h27cfd23_0
+   - typing_extensions=4.1.1=pyh06a4308_0
+   - webencodings=0.5.1=py37_1
+   - werkzeug=0.16.1=py_0
+   - wheel=0.37.1=pyhd3eb1b0_0
+   - wrapt=1.13.3=py37h7f8727e_2
+   - xz=5.2.5=h7b6447c_0
+   - zipp=3.7.0=pyhd3eb1b0_0
+   - zlib=1.2.11=h7f8727e_4
+   - zstd=1.4.9=haebb681_0
+   - pip:
+     - audioread==2.1.9
+     - cffi==1.15.0
+     - decorator==5.1.1
+     - joblib==1.1.0
+     - librosa==0.7.2
+     - llvmlite==0.31.0
+     - numba==0.48.0
+     - pycparser==2.21
+     - resampy==0.2.2
+     - scikit-learn==1.0.2
+     - soundfile==0.10.3.post1
+     - threadpoolctl==3.1.0
+ prefix: /home/pkmital/anaconda3/envs/tdnast
nips_2017.sty ADDED
@@ -0,0 +1,339 @@
+ % partial rewrite of the LaTeX2e package for submissions to the
+ % Conference on Neural Information Processing Systems (NIPS):
+ %
+ % - uses more LaTeX conventions
+ % - line numbers at submission time replaced with aligned numbers from
+ %   lineno package
+ % - \nipsfinalcopy replaced with [final] package option
+ % - automatically loads times package for authors
+ % - loads natbib automatically; this can be suppressed with the
+ %   [nonatbib] package option
+ % - adds foot line to first page identifying the conference
+ %
+ % Roman Garnett (garnett@wustl.edu) and the many authors of
+ % nips15submit_e.sty, including MK and drstrip@sandia
+ %
+ % last revision: March 2017
+
+ \NeedsTeXFormat{LaTeX2e}
+ \ProvidesPackage{nips_2017}[2017/03/20 NIPS 2017 submission/camera-ready style file]
+
+ % declare final option, which creates camera-ready copy
+ \newif\if@nipsfinal\@nipsfinalfalse
+ \DeclareOption{final}{
+   \@nipsfinaltrue
+ }
+
+ % declare nonatbib option, which does not load natbib in case of
+ % package clash (users can pass options to natbib via
+ % \PassOptionsToPackage)
+ \newif\if@natbib\@natbibtrue
+ \DeclareOption{nonatbib}{
+   \@natbibfalse
+ }
+
+ \ProcessOptions\relax
+
+ % fonts
+ \renewcommand{\rmdefault}{ptm}
+ \renewcommand{\sfdefault}{phv}
+
+ % change this every year for notice string at bottom
+ \newcommand{\@nipsordinal}{31st}
+ \newcommand{\@nipsyear}{2017}
+ \newcommand{\@nipslocation}{Long Beach, CA, USA}
+
+ % handle tweaks for camera-ready copy vs. submission copy
+ \if@nipsfinal
+   \newcommand{\@noticestring}{%
+     \@nipsordinal\/ Conference on Neural Information Processing Systems
+     (NIPS \@nipsyear), \@nipslocation.%
+   }
+ \else
+   \newcommand{\@noticestring}{%
+     Submitted to \@nipsordinal\/ Conference on Neural Information
+     Processing Systems (NIPS \@nipsyear). Do not distribute.%
+   }
+
+   % line numbers for submission
+   \RequirePackage{lineno}
+   \linenumbers
+
+   % fix incompatibilities between lineno and amsmath, if required, by
+   % transparently wrapping linenomath environments around amsmath
+   % environments
+   \AtBeginDocument{%
+     \@ifpackageloaded{amsmath}{%
+       \newcommand*\patchAmsMathEnvironmentForLineno[1]{%
+         \expandafter\let\csname old#1\expandafter\endcsname\csname #1\endcsname
+         \expandafter\let\csname oldend#1\expandafter\endcsname\csname end#1\endcsname
+         \renewenvironment{#1}%
+           {\linenomath\csname old#1\endcsname}%
+           {\csname oldend#1\endcsname\endlinenomath}%
+       }%
+       \newcommand*\patchBothAmsMathEnvironmentsForLineno[1]{%
+         \patchAmsMathEnvironmentForLineno{#1}%
+         \patchAmsMathEnvironmentForLineno{#1*}%
+       }%
+       \patchBothAmsMathEnvironmentsForLineno{equation}%
+       \patchBothAmsMathEnvironmentsForLineno{align}%
+       \patchBothAmsMathEnvironmentsForLineno{flalign}%
+       \patchBothAmsMathEnvironmentsForLineno{alignat}%
+       \patchBothAmsMathEnvironmentsForLineno{gather}%
+       \patchBothAmsMathEnvironmentsForLineno{multline}%
+     }{}
+   }
+ \fi
+
+ % load natbib unless told otherwise
+ \if@natbib
+   \RequirePackage{natbib}
+ \fi
+
+ % set page geometry
+ \usepackage[verbose=true,letterpaper]{geometry}
+ \AtBeginDocument{
+   \newgeometry{
+     textheight=9in,
+     textwidth=5.5in,
+     top=1in,
+     headheight=12pt,
+     headsep=25pt,
+     footskip=30pt
+   }
+   \@ifpackageloaded{fullpage}
+     {\PackageWarning{nips_2016}{fullpage package not allowed! Overwriting formatting.}}
+     {}
+ }
+
+ \widowpenalty=10000
+ \clubpenalty=10000
+ \flushbottom
+ \sloppy
+
+ % font sizes with reduced leading
+ \renewcommand{\normalsize}{%
+   \@setfontsize\normalsize\@xpt\@xipt
+   \abovedisplayskip      7\p@ \@plus 2\p@ \@minus 5\p@
+   \abovedisplayshortskip \z@ \@plus 3\p@
+   \belowdisplayskip      \abovedisplayskip
+   \belowdisplayshortskip 4\p@ \@plus 3\p@ \@minus 3\p@
+ }
+ \normalsize
+ \renewcommand{\small}{%
+   \@setfontsize\small\@ixpt\@xpt
+   \abovedisplayskip      6\p@ \@plus 1.5\p@ \@minus 4\p@
+   \abovedisplayshortskip \z@ \@plus 2\p@
+   \belowdisplayskip      \abovedisplayskip
+   \belowdisplayshortskip 3\p@ \@plus 2\p@ \@minus 2\p@
+ }
+ \renewcommand{\footnotesize}{\@setfontsize\footnotesize\@ixpt\@xpt}
+ \renewcommand{\scriptsize}{\@setfontsize\scriptsize\@viipt\@viiipt}
+ \renewcommand{\tiny}{\@setfontsize\tiny\@vipt\@viipt}
+ \renewcommand{\large}{\@setfontsize\large\@xiipt{14}}
+ \renewcommand{\Large}{\@setfontsize\Large\@xivpt{16}}
+ \renewcommand{\LARGE}{\@setfontsize\LARGE\@xviipt{20}}
+ \renewcommand{\huge}{\@setfontsize\huge\@xxpt{23}}
+ \renewcommand{\Huge}{\@setfontsize\Huge\@xxvpt{28}}
+
+ % sections with less space
+ \providecommand{\section}{}
+ \renewcommand{\section}{%
+   \@startsection{section}{1}{\z@}%
+                 {-2.0ex \@plus -0.5ex \@minus -0.2ex}%
+                 { 1.5ex \@plus  0.3ex \@minus  0.2ex}%
+                 {\large\bf\raggedright}%
+ }
+ \providecommand{\subsection}{}
+ \renewcommand{\subsection}{%
+   \@startsection{subsection}{2}{\z@}%
+                 {-1.8ex \@plus -0.5ex \@minus -0.2ex}%
+                 { 0.8ex \@plus  0.2ex}%
+                 {\normalsize\bf\raggedright}%
+ }
+ \providecommand{\subsubsection}{}
+ \renewcommand{\subsubsection}{%
+   \@startsection{subsubsection}{3}{\z@}%
+                 {-1.5ex \@plus -0.5ex \@minus -0.2ex}%
+                 { 0.5ex \@plus  0.2ex}%
+                 {\normalsize\bf\raggedright}%
+ }
+ \providecommand{\paragraph}{}
+ \renewcommand{\paragraph}{%
+   \@startsection{paragraph}{4}{\z@}%
+                 {1.5ex \@plus 0.5ex \@minus 0.2ex}%
+                 {-1em}%
+                 {\normalsize\bf}%
+ }
+ \providecommand{\subparagraph}{}
+ \renewcommand{\subparagraph}{%
+   \@startsection{subparagraph}{5}{\z@}%
+                 {1.5ex \@plus 0.5ex \@minus 0.2ex}%
+                 {-1em}%
+                 {\normalsize\bf}%
+ }
+ \providecommand{\subsubsubsection}{}
+ \renewcommand{\subsubsubsection}{%
+   \vskip5pt{\noindent\normalsize\rm\raggedright}%
+ }
+
+ % float placement
+ \renewcommand{\topfraction      }{0.85}
+ \renewcommand{\bottomfraction   }{0.4}
+ \renewcommand{\textfraction     }{0.1}
+ \renewcommand{\floatpagefraction}{0.7}
+
+ \newlength{\@nipsabovecaptionskip}\setlength{\@nipsabovecaptionskip}{7\p@}
+ \newlength{\@nipsbelowcaptionskip}\setlength{\@nipsbelowcaptionskip}{\z@}
+
+ \setlength{\abovecaptionskip}{\@nipsabovecaptionskip}
+ \setlength{\belowcaptionskip}{\@nipsbelowcaptionskip}
+
+ % swap above/belowcaptionskip lengths for tables
+ \renewenvironment{table}
+   {\setlength{\abovecaptionskip}{\@nipsbelowcaptionskip}%
+    \setlength{\belowcaptionskip}{\@nipsabovecaptionskip}%
+    \@float{table}}
+   {\end@float}
+
+ % footnote formatting
+ \setlength{\footnotesep }{6.65\p@}
+ \setlength{\skip\footins}{9\p@ \@plus 4\p@ \@minus 2\p@}
+ \renewcommand{\footnoterule}{\kern-3\p@ \hrule width 12pc \kern 2.6\p@}
+ \setcounter{footnote}{0}
+
+ % paragraph formatting
+ \setlength{\parindent}{\z@}
+ \setlength{\parskip  }{5.5\p@}
+
+ % list formatting
+ \setlength{\topsep    }{4\p@ \@plus 1\p@   \@minus 2\p@}
+ \setlength{\partopsep }{1\p@ \@plus 0.5\p@ \@minus 0.5\p@}
+ \setlength{\itemsep   }{2\p@ \@plus 1\p@   \@minus 0.5\p@}
+ \setlength{\parsep    }{2\p@ \@plus 1\p@   \@minus 0.5\p@}
214
+ \setlength{\leftmargin }{3pc}
215
+ \setlength{\leftmargini }{\leftmargin}
216
+ \setlength{\leftmarginii }{2em}
217
+ \setlength{\leftmarginiii}{1.5em}
218
+ \setlength{\leftmarginiv }{1.0em}
219
+ \setlength{\leftmarginv }{0.5em}
220
+ \def\@listi {\leftmargin\leftmargini}
221
+ \def\@listii {\leftmargin\leftmarginii
222
+ \labelwidth\leftmarginii
223
+ \advance\labelwidth-\labelsep
224
+ \topsep 2\p@ \@plus 1\p@ \@minus 0.5\p@
225
+ \parsep 1\p@ \@plus 0.5\p@ \@minus 0.5\p@
226
+ \itemsep \parsep}
227
+ \def\@listiii{\leftmargin\leftmarginiii
228
+ \labelwidth\leftmarginiii
229
+ \advance\labelwidth-\labelsep
230
+ \topsep 1\p@ \@plus 0.5\p@ \@minus 0.5\p@
231
+ \parsep \z@
232
+ \partopsep 0.5\p@ \@plus 0\p@ \@minus 0.5\p@
233
+ \itemsep \topsep}
234
+ \def\@listiv {\leftmargin\leftmarginiv
235
+ \labelwidth\leftmarginiv
236
+ \advance\labelwidth-\labelsep}
237
+ \def\@listv {\leftmargin\leftmarginv
238
+ \labelwidth\leftmarginv
239
+ \advance\labelwidth-\labelsep}
240
+ \def\@listvi {\leftmargin\leftmarginvi
241
+ \labelwidth\leftmarginvi
242
+ \advance\labelwidth-\labelsep}
243
+
244
+ % create title
245
+ \providecommand{\maketitle}{}
246
+ \renewcommand{\maketitle}{%
247
+ \par
248
+ \begingroup
249
+ \renewcommand{\thefootnote}{\fnsymbol{footnote}}
250
+ % for perfect author name centering
251
+ \renewcommand{\@makefnmark}{\hbox to \z@{$^{\@thefnmark}$\hss}}
252
+ % The footnote-mark was overlapping the footnote-text,
253
+ % added the following to fix this problem (MK)
254
+ \long\def\@makefntext##1{%
255
+ \parindent 1em\noindent
256
+ \hbox to 1.8em{\hss $\m@th ^{\@thefnmark}$}##1
257
+ }
258
+ \thispagestyle{empty}
259
+ \@maketitle
260
+ \@thanks
261
+ \@notice
262
+ \endgroup
263
+ \let\maketitle\relax
264
+ \let\thanks\relax
265
+ }
266
+
267
+ % rules for title box at top of first page
268
+ \newcommand{\@toptitlebar}{
269
+ \hrule height 4\p@
270
+ \vskip 0.25in
271
+ \vskip -\parskip%
272
+ }
273
+ \newcommand{\@bottomtitlebar}{
274
+ \vskip 0.29in
275
+ \vskip -\parskip
276
+ \hrule height 1\p@
277
+ \vskip 0.09in%
278
+ }
279
+
280
+ % create title (includes both anonymized and non-anonymized versions)
281
+ \providecommand{\@maketitle}{}
282
+ \renewcommand{\@maketitle}{%
283
+ \vbox{%
284
+ \hsize\textwidth
285
+ \linewidth\hsize
286
+ \vskip 0.1in
287
+ \@toptitlebar
288
+ \centering
289
+ {\LARGE\bf \@title\par}
290
+ \@bottomtitlebar
291
+ \if@nipsfinal
292
+ \def\And{%
293
+ \end{tabular}\hfil\linebreak[0]\hfil%
294
+ \begin{tabular}[t]{c}\bf\rule{\z@}{24\p@}\ignorespaces%
295
+ }
296
+ \def\AND{%
297
+ \end{tabular}\hfil\linebreak[4]\hfil%
298
+ \begin{tabular}[t]{c}\bf\rule{\z@}{24\p@}\ignorespaces%
299
+ }
300
+ \begin{tabular}[t]{c}\bf\rule{\z@}{24\p@}\@author\end{tabular}%
301
+ \else
302
+ \begin{tabular}[t]{c}\bf\rule{\z@}{24\p@}
303
+ Anonymous Author(s) \\
304
+ Affiliation \\
305
+ Address \\
306
+ \texttt{email} \\
307
+ \end{tabular}%
308
+ \fi
309
+ \vskip 0.3in \@minus 0.1in
310
+ }
311
+ }
312
+
313
+ % add conference notice to bottom of first page
314
+ \newcommand{\ftype@noticebox}{8}
315
+ \newcommand{\@notice}{%
316
+ % give a bit of extra room back to authors on first page
317
+ \enlargethispage{2\baselineskip}%
318
+ \@float{noticebox}[b]%
319
+ \footnotesize\@noticestring%
320
+ \end@float%
321
+ }
322
+
323
+ % abstract styling
324
+ \renewenvironment{abstract}%
325
+ {%
326
+ \vskip 0.075in%
327
+ \centerline%
328
+ {\large\bf Abstract}%
329
+ \vspace{0.5ex}%
330
+ \begin{quote}%
331
+ }
332
+ {
333
+ \par%
334
+ \end{quote}%
335
+ \vskip 1ex%
336
+ }
337
+
338
+ \endinput
339
+
paper.pdf ADDED
Binary file (322 kB). View file
 
paper.tex ADDED
@@ -0,0 +1,224 @@
+ \documentclass{article}
+
+ % if you need to pass options to natbib, use, e.g.:
+ % \PassOptionsToPackage{numbers, compress}{natbib}
+ % before loading nips_2017
+ %
+ % to avoid loading the natbib package, add option nonatbib:
+ % \usepackage[nonatbib]{nips_2017}
+
+ %\usepackage{nips_2017}
+
+ % to compile a camera-ready version, add the [final] option, e.g.:
+ \usepackage[final,nonatbib]{nips_2017}
+
+ \usepackage[utf8]{inputenc} % allow utf-8 input
+ \usepackage[T1]{fontenc}    % use 8-bit T1 fonts
+ \usepackage{hyperref}       % hyperlinks
+ \usepackage{url}            % simple URL typesetting
+ \usepackage{booktabs}       % professional-quality tables
+ \usepackage{amsfonts}       % blackboard math symbols
+ \usepackage{nicefrac}       % compact symbols for 1/2, etc.
+ \usepackage{microtype}      % microtypography
+ \usepackage{graphicx}
+ \usepackage{caption}
+ \usepackage{subcaption}
+
+ \title{Time Domain Neural Audio Style Transfer}
+
+ % The \author macro works with any number of authors. There are two
+ % commands used to separate the names and addresses of multiple
+ % authors: \And and \AND.
+ %
+ % Using \And between authors leaves it to LaTeX to determine where to
+ % break the lines. Using \AND forces a line break at that point. So,
+ % if LaTeX puts 3 of 4 authors names on the first line, and the last
+ % on the second line, try using \AND instead of \And before the third
+ % author name.
+
+ \author{
+ Parag K. Mital\\
+ Kadenze, Inc.\thanks{http://kadenze.com}\\
+ \texttt{parag@kadenze.com} \\
+ %% examples of more authors
+ %% \And
+ %% Coauthor \\
+ %% Affiliation \\
+ %% Address \\
+ %% \texttt{email} \\
+ %% \AND
+ %% Coauthor \\
+ %% Affiliation \\
+ %% Address \\
+ %% \texttt{email} \\
+ %% \And
+ %% Coauthor \\
+ %% Affiliation \\
+ %% Address \\
+ %% \texttt{email} \\
+ %% \And
+ %% Coauthor \\
+ %% Affiliation \\
+ %% Address \\
+ %% \texttt{email} \\
+ }
+
+ \begin{document}
+ % \nipsfinalcopy is no longer used
+
+ \maketitle
+
+ \begin{abstract}
+ A recently published method for audio style transfer has shown how to extend the process of image style transfer to audio. This method synthesizes audio ``content'' and ``style'' independently using the magnitudes of a short-time Fourier transform, shallow convolutional networks with randomly initialized filters, and iterative phase reconstruction with Griffin-Lim. In this work, we explore whether it is possible to directly optimize a time-domain audio signal, removing the process of phase reconstruction and opening up possibilities for real-time applications and higher-quality syntheses. We explore a variety of style transfer processes on neural networks that operate directly on time-domain audio signals and demonstrate one such network capable of audio stylization.
+ \end{abstract}
+
+ \section{Introduction}
+
+ % Style transfer \cite{} is a method for optimizing a randomly initialized image to have the appearance of the content and style of two separate images. It works by finding the raw activations of a so-called "content" image and optimizing a noise image to resemble the same activations while for "style", it looks at the kernel activations of any given layer and optimizes for these. The original work by Gatys et al. demonstrated this technique using activations from pre-trained VGG deep convolutional networks, though recent techniques in texture synthesis \cite{} show that similar results are possible with randomly initialized shallow convolutional networks.
+
+ Audio style transfer \cite{Ulyanov2016} attempts to extend the technique of image style transfer \cite{Gatys} to the domain of audio, allowing ``content'' and ``style'' to be independently manipulated. Ulyanov et al.\ demonstrate the process using the magnitudes of a short-time Fourier transform representation of an audio signal as the input to a shallow untrained neural network, following similar work in image style transfer \cite{Ulyanov2016b}, storing the activations of the content and the Gram activations of the style. A noisy input short-time magnitude spectrum is then optimized such that its activations through the same network resemble the target content and style activations. The optimized magnitudes are then inverted back to an audio signal using an iterative Griffin-Lim phase reconstruction process \cite{Griffin1984}.
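For reference, the Griffin-Lim step this pipeline relies on alternates between inverting to the time domain and re-imposing the target magnitudes. A minimal NumPy/SciPy sketch (not the implementation referenced above; the STFT parameters and random phase initialization are illustrative assumptions):

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mags, n_iter=50, nperseg=1024):
    """Estimate a time-domain signal whose STFT magnitudes match `mags`.

    `mags` is an (n_bins, n_frames) array of target magnitudes, e.g. from
    scipy.signal.stft. Starts from random phase, then alternates between
    inverting to the time domain and re-imposing the magnitude constraint.
    """
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mags.shape))
    spec = mags * phase
    for _ in range(n_iter):
        _, x = istft(spec, nperseg=nperseg)        # back to time domain
        _, _, spec = stft(x, nperseg=nperseg)      # forward transform again
        spec = mags * np.exp(1j * np.angle(spec))  # keep only the estimated phase
    _, x = istft(spec, nperseg=nperseg)
    return x
```

Because each iteration projects onto the magnitude constraint, any mismatch between the target magnitudes and a realizable signal shows up as residual reconstruction error, which is the source of the noise discussed below.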
+
+ Using phase reconstruction ultimately means the stylization process does not model the fine temporal characteristics contained in the audio signal's phase information. For instance, if a particular content or style audio source were to contain information about vibrato or the spatial movement or position of the audio source, this would likely be lost in a magnitude-only representation. Further, by relying on phase reconstruction, some error is likely to be introduced, and developing real-time applications is also more difficult \cite{Wyse2017}, though not impossible \cite{Prusa2017}. In any case, networks which discard phase information, such as \cite{Wyse2017}, which builds on Ulyanov's approach, or recent audio networks such as \cite{Hershey2016}, will still require phase reconstruction for stylization/synthesis applications.
+
+ Rather than approach stylization/synthesis via phase reconstruction, this work attempts to directly optimize a raw audio signal. Recent work in neural audio synthesis has shown it is possible to take a raw audio signal as input and blend musical notes in the neural embedding space of a trained WaveNet autoencoder \cite{Engel2017}. Though this work is capable of synthesizing raw audio from its embedding space, there is no separation of content and style, and thus they cannot be independently manipulated. To date, it is also not clear whether this network's encoder or decoder could be used for audio stylization following the approach of Ulyanov/Gatys.
+
+ To better understand whether it is possible to perform audio stylization in the time domain, we investigate a variety of networks which take a time-domain audio signal as input: using the real and imaginary components of a Discrete Fourier Transform (DFT); using the magnitude and unwrapped phase differential components of a DFT; using combinations of real, imaginary, magnitude, and phase components; using the activations of a pre-trained WaveNet decoder \cite{Oord2016b,Engel2017}; and using the activations of a pre-trained NSynth encoder \cite{Engel2017}. We then apply audio stylization similarly to Ulyanov with a variety of parameters and report our results.
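The input representations enumerated above can all be derived from a single DFT of windowed audio. A rough NumPy sketch of the candidate features (illustrative only; `frames` is an assumed (n_frames, frame_length) array of windowed samples, not a structure from the paper's code):

```python
import numpy as np

def dft_features(frames):
    """Candidate time-domain network inputs derived from a per-frame DFT.

    Returns the real and imaginary components, the magnitudes, and the
    unwrapped phase differential between consecutive frames.
    """
    spec = np.fft.rfft(frames, axis=1)         # (n_frames, n_bins) complex
    real, imag = spec.real, spec.imag
    mags = np.abs(spec)
    phase = np.unwrap(np.angle(spec), axis=0)  # unwrap phase along time
    # differential across frames; first frame's differential is zero
    dphase = np.diff(phase, axis=0, prepend=phase[:1])
    return real, imag, mags, dphase
```

Any subset of these arrays can then be concatenated to form the network input, which is how the feature combinations listed above are assembled.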
+
+ % \section{Related Work}
+
+ % There have been a few investigations of audio style transfer employing magnitude representations, such as Ulyanov's original work and a follow-up work employing VGG \cite{Wyse2017}. These models discard the phase information in favor of phase reconstruction. As well, there have been further developments in neural networks capable of large scale audio classification such as \cite{Hershey2016}, though these are trained on magnitude representations and would also require phase reconstruction as part of a stylization process. Perhaps most closely aligned is the work of NSynth \cite{Engel2017}, whose work is capable of taking as input a raw audio signal and allows for applications such as the blending of musical notes in a neural embedding space. Though their work is capable of synthesizing raw audio from its embedding space, there is no separation of content and style, and thus they cannot be independently manipulated.
+
+ % Speech synthesis techniques
+ %TacoTron demonstrated a technique using ...
+ %In a similar vein, WaveNet, ...
+ %NSynth incorporates a WaveNet decoder and includes an additional encoder, allowing one to encode a time domain audio signal using the encoding part of the network with 16 channels at 125x compression, and use these as biases during the WaveNet decoding. The embedding space is capable of linearly mixing instruments in its embedding space, though has yet to be explored as a network for audio stylization where content and style are independently manipulated.
+
+ % SampleRNN
+
+ % Soundnet
+
+ % VGG (Lonce Wyse, https://arxiv.org/pdf/1706.09559.pdf);
+
+ % Zdenek Pruska
+
+ % Other networks exploring audio include VGGish, built on the AudioSet dataset. This network, like Ulyanov's original implementation, however does not operate on the raw time domain signal and would require phase reconstruction. However, it does afford a potentially richer representation than a shallow convolutional network, as its embedding space was trained with the knowledge of many semantic classes of sounds.
+
+ % CycleGAN (https://gauthamzz.github.io/2017/09/23/AudioStyleTransfer/)
+
+
+ \section{Experiments}
+
+ We explore a variety of computational graphs whose first operation is a discrete Fourier transform, projecting an audio signal onto its real and imaginary components. We then explore manipulations of these components, including directly applying convolutional layers, or first transforming to the typical magnitude and phase components, as well as combinations of each of these components. For representing phase, we also explored using the original phase, the phase differential, and the unwrapped phase differentials. From here, we apply the same techniques for stylization as described in \cite{Ulyanov2016}, except we no longer have to optimize a noisy magnitude input, and can instead optimize a time-domain signal. We also explore combinations of content/style layers placed after the initial projections and after fully connected layers.
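The stylization objective itself follows the content/style decomposition of Gatys and Ulyanov: raw activations define content, and Gram matrices of activations define style. A simplified NumPy version of the loss (the `acts_*` arrays are hypothetical stand-ins for network activations of shape (time, channels); the repository's actual optimization is written in TensorFlow):

```python
import numpy as np

def gram(acts):
    """Channel-by-channel Gram matrix of activations of shape (time, channels)."""
    return acts.T @ acts / acts.shape[0]

def transfer_loss(acts_opt, acts_content, acts_style, alpha=0.01):
    """Weighted sum of a content loss on raw activations and a style loss
    on Gram matrices; `alpha` trades content against style."""
    content = np.mean((acts_opt - acts_content) ** 2)
    style = np.mean((gram(acts_opt) - gram(acts_style)) ** 2)
    return alpha * content + style
```

Because the Gram matrix averages over time, the style term captures which channels co-activate regardless of when, while the content term preserves the temporal layout of activations.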
+
+ We also explore two pre-trained networks: a pre-trained WaveNet decoder and the encoder portion of an NSynth network as provided by Magenta \cite{Engel2017}, and look at the activations of each of these networks at different layers, much like the original image style networks did with VGG. We also include Ulyanov's original network as a baseline, and report our results as seen through spectrograms and through listening. Our code is available online\footnote{\url{https://github.com/pkmital/neural-audio-style-transfer}}\footnote{Further details are described in the Supplementary Materials}.
+
+ \section{Results}
+
+ Only one network was capable of producing meaningful audio through a stylization process in which both the style and content appeared to be retained: the network concatenating the real, imaginary, and magnitude information as features along the height dimension and convolving them with filters of kernel height 3. This configuration uses the concatenated features before any linear layer as a content layer, the magnitudes alone as a style layer, and additional content and style layers following each nonlinearity. It produces distinctly different stylizations from Ulyanov's original network, despite having similar parameters, often with quicker and busier temporal changes in content and style. The stylization also tends to produce what seems like higher-fidelity syntheses, especially in lower frequencies, despite the same sample rate. Lastly, this approach tends to produce much less noise than Ulyanov's approach, most likely because it avoids the errors of phase reconstruction and the lack of a phase representation.
+
+ Every other combination of input manipulations we tried tended towards a white noise signal and did not appear to drop in loss. The only other network that appeared to produce something recognizable, though with considerable noise, used the magnitude and unwrapped phase differential information with convolutional filters of kernel height 2. We could not manage to stylize any meaningful-sounding synthesis using the activations of a WaveNet decoder or NSynth encoder.
+
+ % VGGish, AudioSet; VGG equivalent for audio, but uses a log-mel spectrogram.
+
+ \section{Discussion and Conclusion}
+
+ This work explores neural audio style transfer of a time-domain audio signal. Of the networks explored, only two produced meaningful results: the magnitude and unwrapped phase network, which produced distinctly noisier syntheses, and the real, imaginary, and magnitude network, which was capable of resembling both the content and style sources at a quality similar to Ulyanov's original approach, though with interesting differences. It was especially surprising that we were unable to stylize with NSynth's encoder or decoder, though this is perhaps due to the limited number of combinations of layers and activations we explored, and is worth exploring more in the future.
+
+ % Style transfer, like deep dream and its predecessor works in visualizing gradient activations, through exploration have the potential to enable us to understand representations created by neural networks. Through synthesis, and exploring the representations at each level of a neural network, we can start to gain insights into what sorts of representations if any are created by a network. However, to date, very few explorations of audio networks for the purpose of dreaming or stylization have been done.
+
+ %End to end learning, http://www.mirlab.org/conference_papers/International_Conference/ICASSP\%202014/papers/p7014-dieleman.pdf - spectrums still do better than raw audio.
+
+ \small
+ % Generated by IEEEtran.bst, version: 1.14 (2015/08/26)
+ \begin{thebibliography}{1}
+ \providecommand{\url}[1]{#1}
+ \csname url@samestyle\endcsname
+ \providecommand{\newblock}{\relax}
+ \providecommand{\bibinfo}[2]{#2}
+ \providecommand{\BIBentrySTDinterwordspacing}{\spaceskip=0pt\relax}
+ \providecommand{\BIBentryALTinterwordstretchfactor}{4}
+ \providecommand{\BIBentryALTinterwordspacing}{\spaceskip=\fontdimen2\font plus
+ \BIBentryALTinterwordstretchfactor\fontdimen3\font minus
+ \fontdimen4\font\relax}
+ \providecommand{\BIBforeignlanguage}[2]{{%
+ \expandafter\ifx\csname l@#1\endcsname\relax
+ \typeout{** WARNING: IEEEtran.bst: No hyphenation pattern has been}%
+ \typeout{** loaded for the language `#1'. Using the pattern for}%
+ \typeout{** the default language instead.}%
+ \else
+ \language=\csname l@#1\endcsname
+ \fi
+ #2}}
+ \providecommand{\BIBdecl}{\relax}
+ \BIBdecl
+
+ \bibitem{Ulyanov2016}
+ D.~Ulyanov and V.~Lebedev, ``{Audio texture synthesis and style transfer},''
+ 2016.
+
+ \bibitem{Gatys}
+ L.~A. Gatys, A.~S. Ecker, and M.~Bethge, ``{A Neural Algorithm of
+ Artistic Style},'' \emph{arXiv preprint arXiv:1508.06576}, 2015.
+
+ \bibitem{Ulyanov2016b}
+ \BIBentryALTinterwordspacing
+ D.~Ulyanov, V.~Lebedev, A.~Vedaldi, and V.~Lempitsky, ``{Texture Networks:
+ Feed-forward Synthesis of Textures and Stylized Images},'' 2016. [Online].
+ Available: \url{http://arxiv.org/abs/1603.03417}
+ \BIBentrySTDinterwordspacing
+
+ \bibitem{Griffin1984}
+ D.~W. Griffin and J.~S. Lim, ``{Signal Estimation from Modified Short-Time
+ Fourier Transform},'' \emph{IEEE Transactions on Acoustics, Speech, and
+ Signal Processing}, vol.~32, no.~2, pp. 236--243, 1984.
+
+ \bibitem{Wyse2017}
+ \BIBentryALTinterwordspacing
+ L.~Wyse, ``{Audio Spectrogram Representations for Processing with Convolutional
+ Neural Networks},'' in \emph{Proceedings of the First International Workshop
+ on Deep Learning and Music joint with IJCNN}, vol.~1, no.~1, 2017, pp.
+ 37--41. [Online]. Available: \url{http://arxiv.org/abs/1706.09559}
+ \BIBentrySTDinterwordspacing
+
+ \bibitem{Prusa2017}
+ Z.~Prů{\v{s}}a and P.~Rajmic, ``{Toward High-Quality Real-Time Signal
+ Reconstruction from STFT Magnitude},'' \emph{IEEE Signal Processing Letters},
+ vol.~24, no.~6, pp. 892--896, 2017.
+
+ \bibitem{Hershey2016}
+ \BIBentryALTinterwordspacing
+ S.~Hershey, S.~Chaudhuri, D.~P.~W. Ellis, J.~F. Gemmeke, A.~Jansen,
+ R.~C. Moore, M.~Plakal, D.~Platt, R.~A. Saurous, B.~Seybold, M.~Slaney,
+ R.~J. Weiss, and K.~Wilson, ``{CNN Architectures for Large-Scale
+ Audio Classification},'' \emph{International Conference on Acoustics, Speech
+ and Signal Processing (ICASSP)}, pp. 4--8, 2016. [Online]. Available:
+ \url{http://arxiv.org/abs/1609.09430}
+ \BIBentrySTDinterwordspacing
+
+ \bibitem{Engel2017}
+ \BIBentryALTinterwordspacing
+ J.~Engel, C.~Resnick, A.~Roberts, S.~Dieleman, D.~Eck, K.~Simonyan, and
+ M.~Norouzi, ``{Neural Audio Synthesis of Musical Notes with WaveNet
+ Autoencoders},'' in \emph{Proceedings of the 34th International Conference on
+ Machine Learning}, 2017. [Online]. Available:
+ \url{http://arxiv.org/abs/1704.01279}
+ \BIBentrySTDinterwordspacing
+
+ \bibitem{Oord2016b}
+ \BIBentryALTinterwordspacing
+ A.~van~den Oord, S.~Dieleman, H.~Zen, K.~Simonyan, O.~Vinyals, A.~Graves,
+ N.~Kalchbrenner, A.~Senior, and K.~Kavukcuoglu, ``{WaveNet: A Generative
+ Model for Raw Audio},'' \emph{arxiv}, pp. 1--15, 2016. [Online]. Available:
+ \url{http://arxiv.org/abs/1609.03499}
+ \BIBentrySTDinterwordspacing
+
+ \end{thebibliography}
+
+ \begin{figure}
+ \centering
+ \includegraphics[width=1\linewidth]{synthesis}
+ \caption{Example synthesis optimizing audio directly with both the source content and style audible.}
+ \end{figure}
+ \end{document}
+
search.py ADDED
@@ -0,0 +1,127 @@
+ """NIPS2017 "Time Domain Neural Audio Style Transfer" code repository
+ Parag K. Mital
+ """
+ import os
+ import glob
+ import numpy as np
+ from audio_style_transfer.models import timedomain, uylanov
+
+
+ def get_path(model, output_path, content_filename, style_filename):
+     output_dir = os.path.join(output_path, model)
+     if not os.path.exists(output_dir):
+         os.makedirs(output_dir)
+     output_filename = '{}/{}/{}+{}'.format(output_path, model,
+                                            content_filename.split('/')[-1],
+                                            style_filename.split('/')[-1])
+     return output_filename
+
+
+ def params():
+     n_fft = [2048, 4096, 8196]
+     n_layers = [1, 2, 4]
+     n_filters = [128, 2048, 4096]
+     hop_length = [128, 256, 512]
+     alpha = [0.1, 0.01, 0.005]
+     k_w = [4, 8, 12]
+     norm = [True, False]
+     input_features = [['mags'], ['mags', 'phase'], ['real', 'imag'], ['real', 'imag', 'mags']]
+     return locals()
+
+
+ def batch(content_path, style_path, output_path, run_timedomain=True, run_uylanov=False):
+     content_files = glob.glob('{}/*.wav'.format(content_path))
+     style_files = glob.glob('{}/*.wav'.format(style_path))
+     content_filename = np.random.choice(content_files)
+     style_filename = np.random.choice(style_files)
+     alpha = np.random.choice(params()['alpha'])
+     n_fft = np.random.choice(params()['n_fft'])
+     n_layers = np.random.choice(params()['n_layers'])
+     n_filters = np.random.choice(params()['n_filters'])
+     hop_length = np.random.choice(params()['hop_length'])
+     norm = np.random.choice(params()['norm'])
+     k_w = np.random.choice(params()['k_w'])
+
+     # Run the Time Domain Model
+     if run_timedomain:
+         for f in params()['input_features']:
+             fname = get_path('timedomain/input_features={}'.format(",".join(f)),
+                              output_path, content_filename, style_filename)
+             output_filename = ('{},n_fft={},n_layers={},n_filters={},norm={},'
+                                'hop_length={},alpha={},k_w={}.wav'.format(
+                                    fname, n_fft, n_layers, n_filters, norm,
+                                    hop_length, alpha, k_w))
+             print(output_filename)
+             if not os.path.exists(output_filename):
+                 timedomain.run(content_fname=content_filename,
+                                style_fname=style_filename,
+                                output_fname=output_filename,
+                                n_fft=n_fft,
+                                n_layers=n_layers,
+                                n_filters=n_filters,
+                                hop_length=hop_length,
+                                alpha=alpha,
+                                norm=norm,
+                                k_w=k_w)
+
+     if run_uylanov:
+         # Run Original Uylanov Model
+         fname = get_path('uylanov', output_path, content_filename, style_filename)
+         output_filename = ('{},n_fft={},n_layers={},n_filters={},'
+                            'hop_length={},alpha={},k_w={}.wav'.format(
+                                fname, n_fft, n_layers, n_filters, hop_length,
+                                alpha, k_w))
+         print(output_filename)
+         if not os.path.exists(output_filename):
+             uylanov.run(content_filename,
+                         style_filename,
+                         output_filename,
+                         n_fft=n_fft,
+                         n_layers=n_layers,
+                         n_filters=n_filters,
+                         hop_length=hop_length,
+                         alpha=alpha,
+                         k_w=k_w)
+
+     # These only produce noise so they are commented
+     # # Run NSynth Encoder Model
+     # output_filename = get_path('nsynth-encoder', output_path, content_filename,
+     #                            style_filename)
+     # output_filename = ('{},n_fft={},n_layers={},n_filters={},'
+     #                    'hop_length={},alpha={},k_w={}.wav'.format(
+     #                        fname, n_fft, n_layers, n_filters, hop_length, alpha, k_w))
+     # if not os.path.exists(output_filename):
+     #     nsynth.run(content_filename,
+     #                style_filename,
+     #                output_filename,
+     #                model='encoder',
+     #                n_fft=n_fft,
+     #                n_layers=n_layers,
+     #                n_filters=n_filters,
+     #                hop_length=hop_length,
+     #                alpha=alpha,
+     #                k_w=k_w)
+     # # Run NSynth Decoder Model
+     # output_filename = get_path('wavenet-decoder', output_path, content_filename,
+     #                            style_filename)
+     # output_filename = ('{},n_fft={},n_layers={},n_filters={},'
+     #                    'hop_length={},alpha={},k_w={}.wav'.format(
+     #                        fname, n_fft, n_layers, n_filters, hop_length, alpha, k_w))
+     # if not os.path.exists(output_filename):
+     #     nsynth.run(content_filename,
+     #                style_filename,
+     #                output_filename,
+     #                model='decoder',
+     #                n_fft=n_fft,
+     #                n_layers=n_layers,
+     #                n_filters=n_filters,
+     #                hop_length=hop_length,
+     #                alpha=alpha,
+     #                k_w=k_w)
+
+
+ if __name__ == '__main__':
+     content_path = './target'
+     style_path = './corpus'
+     output_path = './results'
+     batch(content_path, style_path, output_path)
setup.py ADDED
@@ -0,0 +1,116 @@
+ #!/usr/bin/env python
+ # -*- coding: utf-8 -*-
+
+ # Note: To use the 'upload' functionality of this file, you must:
+ #   $ pip install twine
+
+ import io
+ import os
+ import sys
+ from shutil import rmtree
+
+ from setuptools import find_packages, setup, Command
+
+ # Package meta-data.
+ NAME = 'audio_style_transfer'
+ DESCRIPTION = 'Exploring Audio Style Transfer'
+ URL = 'https://github.com/pkmital/time-domain-neural-audio-style-transfer'
+ EMAIL = 'parag@pkmital.com'
+ AUTHOR = 'Parag Mital'
+
+ # What packages are required for this module to be executed?
+ REQUIRED = [
+     # 'tensorflow-gpu<2.0.0', 'librosa<0.8.0',
+     # 'magenta'
+ ]
+
+ # The rest you shouldn't have to touch too much :)
+ # ------------------------------------------------
+ # Except, perhaps the License and Trove Classifiers!
+ # If you do change the License, remember to change the Trove Classifier for that!
+
+ here = os.path.abspath(os.path.dirname(__file__))
+
+ # Import the README and use it as the long-description.
+ # Note: this will only work if 'README.md' is present in your MANIFEST.in file!
+ with io.open(os.path.join(here, 'README.md'), encoding='utf-8') as f:
+     long_description = '\n' + f.read()
+
+ # Load the package's __version__.py module as a dictionary.
+ about = {}
+ with open(os.path.join(here, NAME, '__version__.py')) as f:
+     exec(f.read(), about)
+
+
+ class UploadCommand(Command):
+     """Support setup.py upload."""
+
+     description = 'Build and publish the package.'
+     user_options = []
+
+     @staticmethod
+     def status(s):
+         """Prints things in bold."""
+         print('\033[1m{0}\033[0m'.format(s))
+
+     def initialize_options(self):
+         pass
+
+     def finalize_options(self):
+         pass
+
+     def run(self):
+         try:
+             self.status('Removing previous builds…')
+             rmtree(os.path.join(here, 'dist'))
+         except OSError:
+             pass
+
+         self.status('Building Source and Wheel (universal) distribution…')
+         os.system('{0} setup.py sdist bdist_wheel --universal'.format(sys.executable))
+
+         self.status('Uploading the package to PyPi via Twine…')
+         os.system('twine upload dist/*')
+
+         sys.exit()
+
+
+ # Where the magic happens:
+ setup(
+     name=NAME,
+     version=about['__version__'],
+     description=DESCRIPTION,
+     long_description=long_description,
+     author=AUTHOR,
+     author_email=EMAIL,
+     url=URL,
+     packages=find_packages(exclude=('tests',)),
+     # If your package is a single module, use this instead of 'packages':
+     # py_modules=['mypackage'],
+
+     # entry_points={
+     #     'console_scripts': ['mycli=mymodule:cli'],
+     # },
+     install_requires=REQUIRED,
+     include_package_data=True,
+     license='MIT',
+     classifiers=[
+         # Trove classifiers
+         # Full list: https://pypi.python.org/pypi?%3Aaction=list_classifiers
+         'License :: OSI Approved :: MIT License',
+         'Programming Language :: Python',
+         'Programming Language :: Python :: 2.6',
+         'Programming Language :: Python :: 2.7',
+         'Programming Language :: Python :: 3',
+         'Programming Language :: Python :: 3.3',
+         'Programming Language :: Python :: 3.4',
+         'Programming Language :: Python :: 3.5',
+         'Programming Language :: Python :: 3.6',
+         'Programming Language :: Python :: Implementation :: CPython',
+         'Programming Language :: Python :: Implementation :: PyPy'
+     ],
+     # $ setup.py publish support.
+     cmdclass={
+         'upload': UploadCommand,
+     },
+ )
style-transfer.bib ADDED
@@ -0,0 +1,128 @@
+ @misc{Ulyanov2016,
+ author = {Ulyanov, Dmitry and Lebedev, Vadim},
+ title = {{Audio texture synthesis and style transfer}},
+ urldate = {November 3, 2017},
+ year = {2016}
+ }
+ @inproceedings{Wyse2017,
+ abstract = {One of the decisions that arise when designing a neural network for any application is how the data should be represented in order to be presented to, and possibly generated by, a neural network. For audio, the choice is less obvious than it seems to be for visual images, and a variety of representations have been used for different applications including the raw digitized sample stream, hand-crafted features, machine discovered features, MFCCs and variants that include deltas, and a variety of spectral representations. This paper reviews some of these representations and issues that arise, focusing particularly on spectrograms for generating audio using neural networks for style transfer.},
+ archivePrefix = {arXiv},
+ arxivId = {1706.09559},
+ author = {Wyse, L.},
+ booktitle = {Proceedings of the First International Workshop on Deep Learning and Music joint with IJCNN},
+ eprint = {1706.09559},
+ file = {:Users/pkmital/Documents/PDFs/Wyse/Wyse - 2017 - Audio Spectrogram Representations for Processing with Convolutional Neural Networks.pdf:pdf},
+ keywords = {data representation,sound synthesis,spectrograms,style transfer},
+ number = {1},
+ pages = {37--41},
+ title = {{Audio Spectrogram Representations for Processing with Convolutional Neural Networks}},
+ url = {http://arxiv.org/abs/1706.09559},
+ volume = {1},
+ year = {2017}
+ }
+ @article{Ustyuzhaninov2016,
+ abstract = {Here we demonstrate that the feature space of random shallow convolutional neural networks (CNNs) can serve as a surprisingly good model of natural textures. Patches from the same texture are consistently classified as being more similar than patches from different textures. Samples synthesized from the model capture spatial correlations on scales much larger than the receptive field size, and sometimes even rival or surpass the perceptual quality of state of the art texture models (but show less variability). The current state of the art in parametric texture synthesis relies on the multi-layer feature space of deep CNNs that were trained on natural images. Our finding suggests that such optimized multi-layer feature spaces are not imperative for texture modeling. Instead, much simpler shallow and convolutional networks can serve as the basis for novel texture synthesis algorithms.},
+ archivePrefix = {arXiv},
+ arxivId = {1606.00021},
+ author = {Ustyuzhaninov, Ivan and Brendel, Wieland and Gatys, Leon A. and Bethge, Matthias},
+ eprint = {1606.00021},
+ file = {:Users/pkmital/Documents/PDFs/Ustyuzhaninov et al/Ustyuzhaninov et al. - 2016 - Texture Synthesis Using Shallow Convolutional Networks with Random Filters.pdf:pdf},
+ journal = {Arxiv},
+ pages = {1--9},
+ title = {{Texture Synthesis Using Shallow Convolutional Networks with Random Filters}},
+ url = {http://arxiv.org/abs/1606.00021},
+ year = {2016}
+ }
+ @article{Gatys,
+ archivePrefix = {arXiv},
+ arxivId = {arXiv:1508.06576v2},
+ author = {Gatys, Leon A. and Ecker, Alexander S. and Bethge, Matthias},
+ eprint = {arXiv:1508.06576v2},
+ file = {:Users/pkmital/Documents/PDFs/Gatys et al/Gatys et al. - 2015 - A Neural Algorithm of Artistic Style.pdf:pdf},
+ journal = {Arxiv},
+ pages = {211839},
+ title = {{A Neural Algorithm of Artistic Style}},
+ year = {2015}
+ }
+ @article{Prusa2017,
+ author = {Prů{\v{s}}a, Zden{\v{e}}k and Rajmic, Pavel},
+ doi = {10.1109/LSP.2017.2696970},
+ file = {:Users/pkmital/Documents/PDFs/Prů{\v{s}}a, Rajmic/Prů{\v{s}}a, Rajmic - 2017 - Toward High-Quality Real-Time Signal Reconstruction from STFT Magnitude.pdf:pdf},
+ issn = {10709908},
+ journal = {IEEE Signal Processing Letters},
+ keywords = {Phase reconstruction,real-time,short-time Fourier transform (STFT),spectrogram,time-frequency},
+ mendeley-groups = {nips-2017-audio-style},
+ number = {6},
+ pages = {892--896},
+ title = {{Toward High-Quality Real-Time Signal Reconstruction from STFT Magnitude}},
+ volume = {24},
+ year = {2017}
+ }
+ @article{Griffin1984,
+ author = {Griffin, Daniel W. and Lim, Jae S.},
+ file = {:Users/pkmital/Documents/PDFs/Griffin, Lim/Griffin, Lim - 1984 - Signal Estimation from Modified Short-Time Fourier Transform.pdf:pdf},
+ journal = {IEEE Transactions on Acoustics, Speech, and Signal Processing},
+ mendeley-groups = {nips-2017-audio-style},
+ number = {2},
+ pages = {236--243},
+ title = {{Signal Estimation from Modified Short-Time Fourier Transform}},
+ volume = {32},
+ year = {1984}
+ }
+ @inproceedings{Engel2017,
+ abstract = {Generative models in vision have seen rapid progress due to algorithmic improvements and the availability of high-quality image datasets. In this paper, we offer contributions in both these areas to enable similar progress in audio modeling. First, we detail a powerful new WaveNet-style autoencoder model that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform. Second, we introduce NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets. Using NSynth, we demonstrate improved qualitative and quantitative performance of the WaveNet autoencoder over a well-tuned spectral autoencoder baseline. Finally, we show that the model learns a manifold of embeddings that allows for morphing between instruments, meaningfully interpolating in timbre to create new types of sounds that are realistic and expressive.},
+ archivePrefix = {arXiv},
+ arxivId = {1704.01279},
+ author = {Engel, Jesse and Resnick, Cinjon and Roberts, Adam and Dieleman, Sander and Eck, Douglas and Simonyan, Karen and Norouzi, Mohammad},
+ booktitle = {Proceedings of the 34th International Conference on Machine Learning},
+ eprint = {1704.01279},
+ file = {:Users/pkmital/Documents/PDFs/Engel et al/Engel et al. - 2017 - Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders(2).pdf:pdf},
+ mendeley-groups = {nips-2017-audio-style},
+ title = {{Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders}},
+ url = {http://arxiv.org/abs/1704.01279},
+ year = {2017}
+ }
+ @article{Oord2016b,
+ abstract = {This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.},
+ archivePrefix = {arXiv},
+ arxivId = {1609.03499},
+ author = {van den Oord, Aaron and Dieleman, Sander and Zen, Heiga and Simonyan, Karen and Vinyals, Oriol and Graves, Alex and Kalchbrenner, Nal and Senior, Andrew and Kavukcuoglu, Koray},
+ eprint = {1609.03499},
+ file = {:Users/pkmital/Documents/PDFs/Oord et al/Oord et al. - 2016 - WaveNet A Generative Model for Raw Audio.pdf:pdf},
+ journal = {arxiv},
+ mendeley-groups = {Neural Audio},
+ pages = {1--15},
+ title = {{WaveNet: A Generative Model for Raw Audio}},
+ url = {http://arxiv.org/abs/1609.03499},
+ year = {2016}
+ }
+ @article{Hershey2016,
+ abstract = {Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on our audio classification task, and larger training and label sets help up to a point. A model using embeddings from these classifiers does much better than raw features on the Audio Set [5] Acoustic Event Detection (AED) classification task.},
+ archivePrefix = {arXiv},
+ arxivId = {1609.09430},
+ author = {Hershey, Shawn and Chaudhuri, Sourish and Ellis, Daniel P. W. and Gemmeke, Jort F. and Jansen, Aren and Moore, R. Channing and Plakal, Manoj and Platt, Devin and Saurous, Rif A. and Seybold, Bryan and Slaney, Malcolm and Weiss, Ron J. and Wilson, Kevin},
+ eprint = {1609.09430},
+ file = {:Users/pkmital/Documents/PDFs/Hershey et al/Hershey et al. - 2016 - CNN Architectures for Large-Scale Audio Classification.pdf:pdf},
+ isbn = {9781509041176},
+ journal = {International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+ mendeley-groups = {Embodied Cognition,nips-2017-audio-style},
+ pages = {4--8},
+ title = {{CNN Architectures for Large-Scale Audio Classification}},
+ url = {http://arxiv.org/abs/1609.09430},
+ year = {2016}
+ }
+ @article{Ulyanov2016b,
+ abstract = {Gatys et al. recently demonstrated that deep networks can generate beautiful textures and stylized images from a single texture example. However, their methods requires a slow and memory-consuming optimization process. We propose here an alternative approach that moves the computational burden to a learning stage. Given a single example of a texture, our approach trains compact feed-forward convolutional networks to generate multiple samples of the same texture of arbitrary size and to transfer artistic style from a given image to any other image. The resulting networks are remarkably light-weight and can generate textures of quality comparable to Gatys{\~{}}et{\~{}}al., but hundreds of times faster. More generally, our approach highlights the power and flexibility of generative feed-forward models trained with complex and expressive loss functions.},
+ archivePrefix = {arXiv},
+ arxivId = {1603.03417},
+ author = {Ulyanov, Dmitry and Lebedev, Vadim and Vedaldi, Andrea and Lempitsky, Victor},
+ eprint = {1603.03417},
+ file = {:Users/pkmital/Documents/PDFs/Ulyanov et al/Ulyanov et al. - 2016 - Texture Networks Feed-forward Synthesis of Textures and Stylized Images.pdf:pdf},
+ isbn = {9781510829008},
+ issn = {1938-7228},
+ mendeley-groups = {nips-2017-audio-style},
+ title = {{Texture Networks: Feed-forward Synthesis of Textures and Stylized Images}},
+ url = {http://arxiv.org/abs/1603.03417},
+ year = {2016}
+ }
+