Spaces:

facebook
/

MelodyFlow

Running on Zero

File size: 7,591 Bytes

9d0d223

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# MAGNeT\n",
    "Welcome to MAGNeT's demo jupyter notebook. \n",
    "Here you will find a self-contained example of how to use MAGNeT for music/sound-effect generation.\n",
    "\n",
    "First, we start by initializing MAGNeT for music generation, you can choose a model from the following selection:\n",
    "1. facebook/magnet-small-10secs - a 300M non-autoregressive transformer capable of generating 10-second music conditioned on text.\n",
    "2. facebook/magnet-medium-10secs - 1.5B parameters, 10 seconds music samples.\n",
    "3. facebook/magnet-small-30secs - 300M parameters, 30 seconds music samples.\n",
    "4. facebook/magnet-medium-30secs - 1.5B parameters, 30 seconds music samples.\n",
    "\n",
    "We will use the `facebook/magnet-small-10secs` variant for the purpose of this demonstration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from audiocraft.models import MAGNeT\n",
    "\n",
    "model = MAGNeT.get_pretrained('facebook/magnet-small-10secs')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, let us configure the generation parameters. Specifically, you can control the following:\n",
    "* `use_sampling` (bool, optional): use sampling if True, else do argmax decoding. Defaults to True.\n",
    "* `top_k` (int, optional): top_k used for sampling. Defaults to 0.\n",
    "* `top_p` (float, optional): top_p used for sampling, when set to 0 top_k is used. Defaults to 0.9.\n",
    "* `temperature` (float, optional): Initial softmax temperature parameter. Defaults to 3.0.\n",
    "* `max_clsfg_coef` (float, optional): Initial coefficient used for classifier free guidance. Defaults to 10.0.\n",
    "* `min_clsfg_coef` (float, optional): Final coefficient used for classifier free guidance. Defaults to 1.0.\n",
    "* `decoding_steps` (list of n_q ints, optional): The number of iterative decoding steps, for each of the n_q RVQ codebooks.\n",
    "* `span_arrangement` (str, optional): Use either non-overlapping spans ('nonoverlap') or overlapping spans ('stride1') \n",
    "                                      in the masking scheme. \n",
    "\n",
    "When left unchanged, MAGNeT will revert to its default parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.set_generation_params(\n",
    "    use_sampling=True,\n",
    "    top_k=0,\n",
    "    top_p=0.9,\n",
    "    temperature=3.0,\n",
    "    max_cfg_coef=10.0,\n",
    "    min_cfg_coef=1.0,\n",
    "    decoding_steps=[int(20 * model.lm.cfg.dataset.segment_duration // 10),  10, 10, 10],\n",
    "    span_arrangement='stride1'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we can go ahead and start generating music given textual prompts."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Text-conditional Generation - Music"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from audiocraft.utils.notebook import display_audio\n",
    "\n",
    "###### Text-to-music prompts - examples ######\n",
    "text = \"80s electronic track with melodic synthesizers, catchy beat and groovy bass\"\n",
    "# text = \"80s electronic track with melodic synthesizers, catchy beat and groovy bass. 170 bpm\"\n",
    "# text = \"Earthy tones, environmentally conscious, ukulele-infused, harmonic, breezy, easygoing, organic instrumentation, gentle grooves\"\n",
    "# text = \"Funky groove with electric piano playing blue chords rhythmically\"\n",
    "# text = \"Rock with saturated guitars, a heavy bass line and crazy drum break and fills.\"\n",
    "# text = \"A grand orchestral arrangement with thunderous percussion, epic brass fanfares, and soaring strings, creating a cinematic atmosphere fit for a heroic battle\"\n",
    "                   \n",
    "N_VARIATIONS = 3\n",
    "descriptions = [text for _ in range(N_VARIATIONS)]\n",
    "\n",
    "print(f\"text prompt: {text}\\n\")\n",
    "output = model.generate(descriptions=descriptions, progress=True, return_tokens=True)\n",
    "display_audio(output[0], sample_rate=model.compression_model.sample_rate)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Text-conditional Generation - Sound Effects"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Besides music, MAGNeT models can generate sound effects given textual prompts. \n",
    "First, let's load an Audio-MAGNeT model, out of the following collection: \n",
    "1. facebook/audio-magnet-small - a 300M non-autoregressive transformer capable of generating 10 second sound effects conditioned on text.\n",
    "2. facebook/audio-magnet-medium - 10 second sound effect generation, 1.5B parameters.\n",
    "\n",
    "We will use the `facebook/audio-magnet-small` variant for the purpose of this demonstration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from audiocraft.models import MAGNeT\n",
    "\n",
    "model = MAGNeT.get_pretrained('facebook/audio-magnet-small')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The recommended parameters for sound generation are a bit different than the defaults in MAGNeT, let's initialize it: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.set_generation_params(\n",
    "    use_sampling=True,\n",
    "    top_k=0,\n",
    "    top_p=0.8,\n",
    "    temperature=3.5,\n",
    "    max_cfg_coef=20.0,\n",
    "    min_cfg_coef=1.0,\n",
    "    decoding_steps=[int(20 * model.lm.cfg.dataset.segment_duration // 10),  10, 10, 10],\n",
    "    span_arrangement='stride1'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we can go ahead and start generating sounds given textual prompts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from audiocraft.utils.notebook import display_audio\n",
    "               \n",
    "###### Text-to-audio prompts - examples ######\n",
    "text = \"Seagulls squawking as ocean waves crash while wind blows heavily into a microphone.\"\n",
    "# text = \"A toilet flushing as music is playing and a man is singing in the distance.\"\n",
    "\n",
    "N_VARIATIONS = 3\n",
    "descriptions = [text for _ in range(N_VARIATIONS)]\n",
    "\n",
    "print(f\"text prompt: {text}\\n\")\n",
    "output = model.generate(descriptions=descriptions, progress=True, return_tokens=True)\n",
    "display_audio(output[0], sample_rate=model.compression_model.sample_rate)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.11"
  },
  "vscode": {
   "interpreter": {
    "hash": "b02c911f9b3627d505ea4a19966a915ef21f28afb50dbf6b2115072d27c69103"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}