File size: 8,000 Bytes

b386992

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "7c9d27250020aba6",
   "metadata": {},
   "source": [
    "# Exporting Llama 3.2 Model into Embedding Model To ONNX and TensorRT\n",
    "\n",
    "## Goal\n",
    "\n",
    "Once the [finetuning the LLaMA 3.2 Model into an Embedding Model](https://github.com/NVIDIA/NeMo/blob/main/tutorials/llm/embedding/llama_embedding.ipynb) is completed, you need to export the model to ONNX and TensorRT for fast inference. Please follow the steps below in order to generate ONNX and TensorRT models.\n",
    "\n",
    "**Note:** Please make sure to run the last cell (Convert the Model to HuggingFace Transformer format section) in the [finetuning tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/llm/embedding/llama_embedding.ipynb) in order to generate the checkpoint used in this tutorial. And please make sure to mount it to **/opt/checkpoints/llama-3.2-nv-embedqa-1b-v2/** or change the path of the checkpoint accordingly."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "87846682e01e1a50",
   "metadata": {},
   "source": [
    "#### Launch the NeMo Framework container as follows: \n",
    "\n",
    "Depending on the number of gpus, `--gpus` might need to adjust accordingly:\n",
    "```\n",
    "docker run -it -p 8080:8080 -p 8088:8088 --rm --gpus '\"device=0,1\"' --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:25.02\n",
    "```\n",
    "\n",
    "#### Launch Jupyter Notebook as follows: \n",
    "```\n",
    "jupyter notebook --allow-root --ip 0.0.0.0 --port 8088 --no-browser --NotebookApp.token=''\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "656bf98e-bcce-417e-ba29-cdcce7ec1cba",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install onnxruntime-gpu"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "523f0670-319d-4983-b4cc-4e8bd379b29d",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from pathlib import Path\n",
    "import torch\n",
    "from typing import Literal, Optional, Union\n",
    "from nemo.collections.llm.gpt.model import get_llama_bidirectional_hf_model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d12cfd71-225b-4874-9fa9-c45a6d6dc99f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Paths\n",
    "hf_model_path = \"/opt/checkpoints/llama-3.2-nv-embedqa-1b-v2/\" # Path of the embedding model.\n",
    "\n",
    "# HF model parameters\n",
    "pooling_mode = \"avg\" # Pooling method in the embedding model.\n",
    "normalize = False\n",
    "\n",
    "# ONNX params\n",
    "opset = 17 # ONNX version number\n",
    "onnx_export_path = \"/opt/checkpoints/llama_embedding_onnx/\" # Path for the ONNX file.\n",
    "export_dtype = \"fp32\" # ONNX export data precision.\n",
    "use_dimension_arg = True # Whether dimension was used in the model forward function or not.\n",
    "\n",
    "# TRT params\n",
    "trt_model_path = Path(\"/opt/checkpoints/llama_embedding_trt/\") # Path for the TensorRT .plan file.\n",
    "override_layers_to_fp32 = [\"/model/norm/\", \"/pooling_module\", \"/ReduceL2\", \"/Div\", ] # Model specific layers to override the precision to fp32.\n",
    "override_layernorm_precision_to_fp32 = True # Model specific operation wheter to override layernorm precision or not.\n",
    "profiling_verbosity = \"layer_names_only\"\n",
    "export_to_trt = True # Export ONNX model to TensorRT or not.\n",
    "# Generate version compatible TensorRT engine or not. This option might provide slower inference time. \n",
    "# If you know the TensorRT engine versions match (where the engine was generated versus where it's used), set this to False.\n",
    "# Please check here https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/advanced.html#version-compatibility for more information.\n",
    "trt_version_compatible = True "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c539a33a-fea9-4168-a179-c277120767fd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Base Llama model needs to be adapted to turn it into an embedding model.\n",
    "model, tokenizer = get_llama_bidirectional_hf_model(\n",
    "    model_name_or_path=hf_model_path,\n",
    "    normalize=normalize,\n",
    "    pooling_mode=pooling_mode,\n",
    "    trust_remote_code=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "95cd98f4-1cd4-4c0b-8b92-7bb79991de19",
   "metadata": {},
   "outputs": [],
   "source": [
    "from nemo.export.onnx_llm_exporter import OnnxLLMExporter\n",
    "\n",
    "if use_dimension_arg:\n",
    "    input_names = [\"input_ids\", \"attention_mask\", \"dimensions\"] # ONNX specific arguments, input names in this case.\n",
    "    dynamic_axes_input = {\"input_ids\": {0: \"batch_size\", 1: \"seq_length\"},\n",
    "                            \"attention_mask\": {0: \"batch_size\", 1: \"seq_length\"}, \"dimensions\": {0: \"batch_size\"}}\n",
    "else:\n",
    "    input_names = [\"input_ids\", \"attention_mask\"]\n",
    "    dynamic_axes_input = {\"input_ids\": {0: \"batch_size\", 1: \"seq_length\"},\n",
    "                            \"attention_mask\": {0: \"batch_size\", 1: \"seq_length\"}}\n",
    "\n",
    "output_names = [\"embeddings\"] # ONNX specific arguments, output names in this case.\n",
    "dynamic_axes_output = {\"embeddings\": {0: \"batch_size\", 1: \"embedding_dim\"}}\n",
    "\n",
    "onnx_exporter = OnnxLLMExporter(\n",
    "    onnx_model_dir=onnx_export_path, \n",
    "    model=model,\n",
    "    tokenizer=tokenizer,\n",
    ")\n",
    "\n",
    "onnx_exporter.export(    \n",
    "    input_names=input_names,\n",
    "    output_names=output_names,\n",
    "    opset=opset,\n",
    "    dynamic_axes_input=dynamic_axes_input,\n",
    "    dynamic_axes_output=dynamic_axes_output,\n",
    "    export_dtype=\"fp32\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f1aab9b9-97d0-485c-8d86-dbd21b9a6a33",
   "metadata": {},
   "outputs": [],
   "source": [
    "if export_to_trt:\n",
    "    if use_dimension_arg:\n",
    "        input_profiles = [{\"input_ids\": [[1, 3], [16, 128], [64, 256]], \"attention_mask\": [[1, 3], [16, 128], [64, 256]],\n",
    "                            \"dimensions\": [[1], [16], [64]]}]\n",
    "    else:\n",
    "        input_profiles = [{\"input_ids\": [[1, 3], [16, 128], [64, 256]], \"attention_mask\": [[1, 3], [16, 128], [64, 256]]}]\n",
    "\n",
    "    trt_builder_flags = None\n",
    "    if trt_version_compatible:\n",
    "        import tensorrt as trt\n",
    "        trt_builder_flags=[trt.BuilderFlag.VERSION_COMPATIBLE]\n",
    "    \n",
    "    onnx_exporter.export_onnx_to_trt(\n",
    "        trt_model_dir=trt_model_path,\n",
    "        profiles=input_profiles,\n",
    "        override_layernorm_precision_to_fp32=override_layernorm_precision_to_fp32,\n",
    "        override_layers_to_fp32=override_layers_to_fp32,\n",
    "        profiling_verbosity=profiling_verbosity,\n",
    "        trt_builder_flags=trt_builder_flags,\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "051200b7-6eba-44db-b223-059f1dfb60bd",
   "metadata": {},
   "outputs": [],
   "source": [
    "prompt = [\"hello\", \"world\"]\n",
    "dimensions = [2, 4] if use_dimension_arg else None\n",
    "\n",
    "onnx_exporter.forward(prompt, dimensions)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}