{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# AudioGen\n",
    "Welcome to AudioGen's demo jupyter notebook. Here you will find a series of self-contained examples of how to use AudioGen in different settings.\n",
    "\n",
    "First, we start by initializing AudioGen. For now, we provide only a medium sized model for AudioGen: `facebook/audiogen-medium` - 1.5B transformer decoder. \n",
    "\n",
    "**Important note:** This variant is different from the original AudioGen model presented at [\"AudioGen: Textually-guided audio generation\"](https://arxiv.org/abs/2209.15352) as the model architecture is similar to MusicGen with a smaller frame rate and multiple streams of tokens, allowing to reduce generation time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from audiocraft.models import AudioGen\n",
    "\n",
    "model = AudioGen.get_pretrained('facebook/audiogen-medium')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, let us configure the generation parameters. Specifically, you can control the following:\n",
    "* `use_sampling` (bool, optional): use sampling if True, else do argmax decoding. Defaults to True.\n",
    "* `top_k` (int, optional): top_k used for sampling. Defaults to 250.\n",
    "* `top_p` (float, optional): top_p used for sampling, when set to 0 top_k is used. Defaults to 0.0.\n",
    "* `temperature` (float, optional): softmax temperature parameter. Defaults to 1.0.\n",
    "* `duration` (float, optional): duration of the generated waveform. Defaults to 10.0.\n",
    "* `cfg_coef` (float, optional): coefficient used for classifier free guidance. Defaults to 3.0.\n",
    "\n",
    "When left unchanged, AudioGen will revert to its default parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.set_generation_params(\n",
    "    use_sampling=True,\n",
    "    top_k=250,\n",
    "    duration=5\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we can go ahead and start generating sound using one of the following modes:\n",
    "* Audio continuation using `model.generate_continuation`\n",
    "* Text-conditional samples using `model.generate`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Audio Continuation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "import torchaudio\n",
    "import torch\n",
    "from audiocraft.utils.notebook import display_audio\n",
    "\n",
    "def get_bip_bip(bip_duration=0.125, frequency=440,\n",
    "                duration=0.5, sample_rate=16000, device=\"cuda\"):\n",
    "    \"\"\"Generates a series of bip bip at the given frequency.\"\"\"\n",
    "    t = torch.arange(\n",
    "        int(duration * sample_rate), device=\"cuda\", dtype=torch.float) / sample_rate\n",
    "    wav = torch.cos(2 * math.pi * 440 * t)[None]\n",
    "    tp = (t % (2 * bip_duration)) / (2 * bip_duration)\n",
    "    envelope = (tp >= 0.5).float()\n",
    "    return wav * envelope"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Here we use a synthetic signal to prompt the generated audio.\n",
    "res = model.generate_continuation(\n",
    "    get_bip_bip(0.125).expand(2, -1, -1), \n",
    "    16000, ['Whistling with wind blowing', \n",
    "            'Typing on a typewriter'], \n",
    "    progress=True)\n",
    "display_audio(res, 16000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# You can also use any audio from a file. Make sure to trim the file if it is too long!\n",
    "prompt_waveform, prompt_sr = torchaudio.load(\"../assets/sirens_and_a_humming_engine_approach_and_pass.mp3\")\n",
    "prompt_duration = 2\n",
    "prompt_waveform = prompt_waveform[..., :int(prompt_duration * prompt_sr)]\n",
    "output = model.generate_continuation(prompt_waveform, prompt_sample_rate=prompt_sr, progress=True)\n",
    "display_audio(output, sample_rate=16000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Text-conditional Generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from audiocraft.utils.notebook import display_audio\n",
    "\n",
    "output = model.generate(\n",
    "    descriptions=[\n",
    "        'Subway train blowing its horn',\n",
    "        'A cat meowing',\n",
    "    ],\n",
    "    progress=True\n",
    ")\n",
    "display_audio(output, sample_rate=16000)"
   ]
  },
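  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small addition to the demo, the cell below sketches how the generated batch could be written to disk. It assumes `audiocraft.data.audio.audio_write` is available in your AudioCraft install and that `output` still holds the result of `model.generate` above; the file stem names are arbitrary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: save each generated sample as a WAV file (assumes audio_write is available).\n",
    "from audiocraft.data.audio import audio_write\n",
    "\n",
    "for idx, one_wav in enumerate(output):\n",
    "    # one_wav has shape [channels, time]; audio_write appends the .wav extension itself.\n",
    "    audio_write(f'audiogen_sample_{idx}', one_wav.cpu(), model.sample_rate, strategy=\"loudness\")"
   ]
  },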
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}