{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 3: Prune the fine-tuned teacher model to create a pruned student\n",
    "In this step, we will explore two methods to prune the fine-tuned teacher model: depth pruning and width pruning. Refer to the [README.md](./README.md) to decide which pruning technique you would like to explore. We use [NeMo Run](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/quickstart.html) to run the pruning recipe. For usage details of the pruning recipe or the alternative command-line script, refer to the [pruning docs](https://docs.nvidia.com/nemo-framework/user-guide/latest/model-optimization/pruning/pruning.html).\n",
    "\n",
    "Let's define the common recipe setup for depth or width pruning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nemo_run as run\n",
    "from nemo.collections import llm\n",
    "from nemo.collections.llm.modelopt import PruningConfig\n",
    "from nemo.collections.llm.modelopt.recipes import prune_recipe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set path(s) if different:\n",
    "ROOT_DIR = \"/workspace\"\n",
    "TEACHER_MODEL_PATH = f\"{ROOT_DIR}/Llama-3.1-8B-nemo-ft/checkpoints/best\"\n",
    "SEQ_LENGTH = 8192\n",
    "DATA_PATH = f\"{ROOT_DIR}/wikitext-data\"\n",
    "DATA_PATHS = {\n",
    "    \"train\": [1.0, f\"{DATA_PATH}/wikitext_tokenized_train_text_document\"],\n",
    "    \"validation\": [f\"{DATA_PATH}/wikitext_tokenized_test_text_document\"],\n",
    "    \"test\": [f\"{DATA_PATH}/wikitext_tokenized_val_text_document\"],\n",
    "}\n",
    "INDEX_MAPPING_DIR = f\"{DATA_PATH}/index_mappings\"\n",
    "\n",
    "# Change these to accommodate resources:\n",
    "DEVICES = 8\n",
    "NODES = 1\n",
    "TENSOR_PARALLEL_SIZE = 1  # Pruning only supports tensor parallelism 1\n",
    "PIPELINE_PARALLEL_SIZE = DEVICES\n",
    "MICRO_BATCH_SIZE = 4\n",
    "\n",
    "# Reduce this number to speed up pruning, at the cost of a possibly slightly worse pruned model\n",
    "# Not used if pruning_config.drop_layers is set\n",
    "NUM_TRAIN_SAMPLES = 1024\n",
    "\n",
    "\n",
    "def configure_recipe(pruning_config: PruningConfig, save_path: str):\n",
    "    # Define the recipe\n",
    "    recipe = prune_recipe(\n",
    "        nemo_checkpoint=TEACHER_MODEL_PATH,\n",
    "        save_path=save_path,\n",
    "    )\n",
    "\n",
    "    # Override the dataset (the recipe defaults to the SQuAD dataset)\n",
    "    recipe.data = run.Config(\n",
    "        llm.PreTrainingDataModule,\n",
    "        paths=DATA_PATHS,\n",
    "        index_mapping_dir=INDEX_MAPPING_DIR,\n",
    "        seq_length=SEQ_LENGTH,\n",
    "        micro_batch_size=MICRO_BATCH_SIZE,\n",
    "        global_batch_size=MICRO_BATCH_SIZE,  # Global batch size has no effect on pruning\n",
    "    )\n",
    "\n",
    "    recipe.devices = DEVICES\n",
    "    recipe.num_nodes = NODES\n",
    "    recipe.tp_size = TENSOR_PARALLEL_SIZE\n",
    "    recipe.pp_size = PIPELINE_PARALLEL_SIZE\n",
    "    recipe.legacy_ckpt = True  # For compatibility with newer versions of TransformerEngine\n",
    "    recipe.num_train_samples = NUM_TRAIN_SAMPLES\n",
    "\n",
    "    for k, v in pruning_config.__dict__.items():\n",
    "        if v is not None:\n",
    "            setattr(recipe.pruning_config, k, v)\n",
    "\n",
    "    return recipe"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Step 3a: Using depth-pruning\n",
    "To depth-prune, we will drop layers 16-31 (keeping layers 1-15 and 32) from the fine-tuned teacher model. Per the [blog](https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/) and [tech report](https://arxiv.org/pdf/2408.11796), removing a contiguous block of layers that ends at the second-to-last layer (layers 16 through 31) yields the best overall results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pruning_config = PruningConfig(\n",
    "    # To drop specific layers\n",
    "    drop_layers=[16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31],\n",
    "    # To drop layers automatically based on the cosine similarity\n",
    "    # target_num_layers=16,\n",
    ")\n",
    "save_path = f\"{ROOT_DIR}/Llama-3.1-8B-nemo-ft-depth-pruned\"\n",
    "\n",
    "recipe = configure_recipe(pruning_config=pruning_config, save_path=save_path)\n",
    "print(recipe)\n",
    "\n",
    "env_vars = {\n",
    "    \"TORCH_NCCL_AVOID_RECORD_STREAMS\": \"1\",  # Disable caching NCCL communication buffer memory\n",
    "    \"NCCL_NVLS_ENABLE\": \"0\",  # Disable NVLink SHARP to save memory\n",
    "}\n",
    "executor = run.LocalExecutor(ntasks_per_node=recipe.devices, launcher=\"torchrun\", env_vars=env_vars)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "run.run(recipe, executor=executor, name=\"depth_pruning\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Running the cell above saves the depth-pruned model to your workspace at `/workspace/Llama-3.1-8B-nemo-ft-depth-pruned`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Step 3b: Using width-pruning\n",
    "To width-prune, we will reduce `ffn_hidden_size` from 14336 to 9216 and `hidden_size` from 4096 to 3072. We can also reduce `num_attention_heads` and `num_query_groups` if needed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pruning_config = PruningConfig(\n",
    "    target_ffn_hidden_size=9216,\n",
    "    target_hidden_size=3072,\n",
    "    target_num_attention_heads=None,\n",
    "    target_num_query_groups=None,\n",
    ")\n",
    "save_path = f\"{ROOT_DIR}/Llama-3.1-8B-nemo-ft-width-pruned\"\n",
    "\n",
    "recipe = configure_recipe(pruning_config=pruning_config, save_path=save_path)\n",
    "print(recipe)\n",
    "\n",
    "env_vars = {\n",
    "    \"TORCH_NCCL_AVOID_RECORD_STREAMS\": \"1\",  # Disable caching NCCL communication buffer memory\n",
    "    \"NCCL_NVLS_ENABLE\": \"0\",  # Disable NVLink SHARP to save memory\n",
    "}\n",
    "executor = run.LocalExecutor(ntasks_per_node=recipe.devices, launcher=\"torchrun\", env_vars=env_vars)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "run.run(recipe, executor=executor, name=\"width_pruning\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Running the cell above saves the width-pruned model to your workspace at `/workspace/Llama-3.1-8B-nemo-ft-width-pruned`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have the depth-pruned and width-pruned models, we can distill knowledge from the fine-tuned teacher model into each of them in the next step."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}