Omar Solano committed · Commit 37cbdf5
1 Parent(s): afc6d39
add gpt-4o-mini
Files changed:
- scripts/create_db.ipynb +527 -0
- scripts/custom_retriever.py +65 -0
- scripts/gradio-ui.py +112 -38
- scripts/tutor_prompts.py +22 -12
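The commit message says the change adds gpt-4o-mini as a model option, with the bulk of the wiring in scripts/gradio-ui.py and scripts/tutor_prompts.py, whose diffs are not shown in this section. As a hedged sketch only (the function and variable names below are illustrative, not taken from this repository), a model switch like this typically reduces to passing the new model name to the OpenAI chat client:

# Illustrative sketch, not code from this commit: routing a question to gpt-4o-mini.
from openai import OpenAI

MODEL_NAME = "gpt-4o-mini"          # the model this commit adds (per the commit message)
client = OpenAI()                   # reads OPENAI_API_KEY from the environment

def answer(question: str) -> str:
    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[
            {"role": "system", "content": "You are a helpful AI tutor."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content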
scripts/create_db.ipynb
CHANGED
@@ -268,6 +268,517 @@
268    " document_dict = pickle.load(f)"
269    ]
270   },
271   {
272    "cell_type": "code",
273    "execution_count": 4,
@@ -347,6 +858,22 @@
347    ")"
348    ]
349   },
350   {
351    "cell_type": "code",
352    "execution_count": null,
268    " document_dict = pickle.load(f)"
269    ]
270   },
271 + {
272 +  "cell_type": "code",
273 +  "execution_count": 6,
274 +  "metadata": {},
275 +  "outputs": [
276 +   {
277 +    "name": "stdout",
278 +    "output_type": "stream",
279 +    "text": [
280 + "The LLM sees this: \n",
281 + " url: https://huggingface.co/docs/transformers/deepspeed\n",
282 + "\n",
283 |
+
"DeepSpeed is a PyTorch optimization library that makes distributed training memory-efficient and fast. At itβs core is the Zero Redundancy Optimizer (ZeRO) which enables training large models at scale. ZeRO works in several stages:\n",
|
284 |
+
"ZeRO-1, optimizer state partioning across GPUs ZeRO-2, gradient partitioning across GPUs ZeRO-3, parameteter partitioning across GPUs\n",
|
285 |
+
"In GPU-limited environments, ZeRO also enables offloading optimizer memory and computation from the GPU to the CPU to fit and train really large models on a single GPU. DeepSpeed is integrated with the Transformers Trainer class for all ZeRO stages and offloading. All you need to do is provide a config file or you can use a provided template. For inference, Transformers support ZeRO-3 and offloading since it allows loading huge models.\n",
|
286 |
+
"This guide will walk you through how to deploy DeepSpeed training, the features you can enable, how to setup the config files for different ZeRO stages, offloading, inference, and using DeepSpeed without the Trainer .\n",
|
287 |
+
"\n",
|
288 |
+
"DeepSpeed is available to install from PyPI or Transformers (for more detailed installation options, take a look at the DeepSpeed installation details or the GitHub README ).\n",
|
289 |
+
"If youβre having difficulties installing DeepSpeed, check the DeepSpeed CUDA installation guide. While DeepSpeed has a pip installable PyPI package, it is highly recommended to install it from source to best match your hardware and to support certain features, like 1-bit Adam, which arenβt available in the PyPI distribution.\n",
|
290 |
+
"PyPI Transformers\n",
|
291 |
+
"Copied pip install deepspeed\n",
|
292 |
+
"\n",
|
293 |
+
"Before you begin, it is a good idea to check whether you have enough GPU and CPU memory to fit your model. DeepSpeed provides a tool for estimating the required CPU/GPU memory. For example, to estimate the memory requirements for the bigscience/T0_3B model on a single GPU:\n",
|
294 |
+
"Copied $ python -c 'from transformers import AutoModel; \\\n",
|
295 |
+
"from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live; \\\n",
|
296 |
+
"model = AutoModel.from_pretrained(\"bigscience/T0_3B\"); \\\n",
|
297 |
+
"estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)' [...]\n",
|
298 |
+
"Estimated memory needed for params, optim states and gradients for a:\n",
|
299 |
+
"HW: Setup with 1 node, 1 GPU per node.\n",
|
300 |
+
"SW: Model with 2783M total params, 65M largest layer params.\n",
|
301 |
+
" per CPU | per GPU | Options\n",
|
302 |
+
" 70.00GB | 0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1\n",
|
303 |
+
" 70.00GB | 0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0\n",
|
304 |
+
" 62.23GB | 5.43GB | offload_param=none, offload_optimizer=cpu , zero_init=1\n",
|
305 |
+
" 62.23GB | 5.43GB | offload_param=none, offload_optimizer=cpu , zero_init=0\n",
|
306 |
+
" 0.37GB | 46.91GB | offload_param=none, offload_optimizer=none, zero_init=1\n",
|
307 |
+
" 15.56GB | 46.91GB | offload_param=none, offload_optimizer=none, zero_init=0\n",
|
308 |
+
"This means you either need a single 80GB GPU without CPU offload or a 8GB GPU and a ~60GB CPU to offload to (these are just the memory requirements for the parameters, optimizer states and gradients, and youβll need a bit more for the CUDA kernels and activations). You should also consider the tradeoff between cost and speed because itβll be cheaper to rent or buy a smaller GPU but itβll take longer to train your model.\n",
|
309 |
+
"If you have enough GPU memory make sure you disable CPU/NVMe offload to make everything faster.\n",
|
310 |
+
"\n",
|
311 |
+
"After youβve installed DeepSpeed and have a better idea of your memory requirements, the next step is selecting a ZeRO stage to use. In order of fastest and most memory-efficient:\n",
|
312 |
+
"| Fastest | Memory efficient |\n",
|
313 |
+
"|------------------|--------------------|\n",
|
314 |
+
"| ZeRO-1 | ZeRO-3 + offload |\n",
|
315 |
+
"| ZeRO-2 | ZeRO-3 |\n",
|
316 |
+
"| ZeRO-2 + offload | ZeRO-2 + offload |\n",
|
317 |
+
"| ZeRO-3 | ZeRO-2 |\n",
|
318 |
+
"| ZeRO-3 + offload | ZeRO-1 |\n",
|
319 |
+
"To find what works best for you, start with the fastest approach and if you run out of memory, try the next stage which is slower but more memory efficient. Feel free to work in whichever direction you prefer (starting with the most memory efficient or fastest) to discover the appropriate balance between speed and memory usage.\n",
|
320 |
+
"A general process you can use is (start with batch size of 1):\n",
|
321 |
+
"enable gradient checkpointing try ZeRO-2 try ZeRO-2 and offload the optimizer try ZeRO-3 try ZeRO-3 and offload parameters to the CPU try ZeRO-3 and offload parameters and the optimizer to the CPU try lowering various default values like a narrower search beam if youβre using the generate() method try mixed half-precision (fp16 on older GPU architectures and bf16 on Ampere) over full-precision weights add more hardware if possible or enable Infinity to offload parameters and the optimizer to a NVMe once youβre not running out of memory, measure effective throughput and then try to increase the batch size as large as you can to maximize GPU efficiency lastly, try to optimize your training setup by disabling some offload features or use a faster ZeRO stage and increasing/decreasing the batch size to find the best tradeoff between speed and memory usage\n",
|
322 |
+
"\n",
|
323 |
+
"DeepSpeed works with the Trainer class by way of a config file containing all the parameters for configuring how you want setup your training run. When you execute your training script, DeepSpeed logs the configuration it received from Trainer to the console so you can see exactly what configuration was used.\n",
|
324 |
+
"Find a complete list of DeepSpeed configuration options on the DeepSpeed Configuration JSON reference. You can also find more practical examples of various DeepSpeed configuration examples on the DeepSpeedExamples repository or the main DeepSpeed repository. To quickly find specific examples, you can: Copied git clone https://github.com/microsoft/DeepSpeedExamples cd DeepSpeedExamples\n",
|
325 |
+
"find . -name '*json' # find examples with the Lamb optimizer grep -i Lamb $(find . -name '*json' )\n",
|
326 |
+
"The DeepSpeed configuration file is passed as a path to a JSON file if youβre training from the command line interface or as a nested dict object if youβre using the Trainer in a notebook setting.\n",
|
327 |
+
"path to file nested dict\n",
|
328 |
+
"Copied TrainingArguments(..., deepspeed= \"path/to/deepspeed_config.json\" )\n",
|
329 |
+
"\n",
|
330 |
+
"There are three types of configuration parameters:\n",
|
331 |
+
"Some of the configuration parameters are shared by Trainer and DeepSpeed, and it can be difficult to identify errors when there are conflicting definitions. To make it easier, these shared configuration parameters are configured from the Trainer command line arguments. Some configuration parameters that are automatically derived from the model configuration so you donβt need to manually adjust these values. The Trainer uses a configuration value auto to determine set the most correct or efficient value. You could set your own configuration parameters explicitly, but you must take care to ensure the Trainer arguments and DeepSpeed configuration parameters agree. Mismatches may cause the training to fail in very difficult to detect ways! Some configuration parameters specific to DeepSpeed only which need to be manually set based on your training needs.\n",
|
332 |
+
"You could also modify the DeepSpeed configuration and edit TrainingArguments from it:\n",
|
333 |
+
"Create or load a DeepSpeed configuration to used as the main configuration Create a TrainingArguments object based on these DeepSpeed configuration values\n",
|
334 |
+
"Some values, such as scheduler.params.total_num_steps are calculated by the Trainer during training.\n",
|
335 |
+
"\n",
|
336 |
+
"There are three configurations, each corresponding to a different ZeRO stage. Stage 1 is not as interesting for scalability, and this guide focuses on stages 2 and 3. The zero_optimization configuration contains all the options for what to enable and how to configure them. For a more detailed explanation of each parameter, take a look at the DeepSpeed Configuration JSON reference.\n",
|
337 |
+
"DeepSpeed doesnβt validate parameter names and any typos fallback on the parameter's default setting. You can watch the DeepSpeed engine startup log messages to see what values it is going to use.\n",
|
338 |
+
"The following configurations must be setup with DeepSpeed because the Trainer doesnβt provide equivalent command line arguments.\n",
|
339 |
+
"ZeRO-1 ZeRO-2 ZeRO-3\n",
|
340 |
+
"ZeRO-1 shards the optimizer states across GPUs, and you can expect a tiny speed up. The ZeRO-1 config can be setup like this: Copied { \"zero_optimization\": { \"stage\": 1 }\n",
|
341 |
+
"}\n",
|
342 |
+
"\n",
|
343 |
+
"ZeRO-Infinity allows offloading model states to the CPU and/or NVMe to save even more memory. Smart partitioning and tiling algorithms allow each GPU to send and receive very small amounts of data during offloading such that a modern NVMe can fit an even larger total memory pool than is available to your training process. ZeRO-Infinity requires ZeRO-3.\n",
|
344 |
+
"Depending on the CPU and/or NVMe memory available, you can offload both the optimizer states and parameters , just one of them, or none. You should also make sure the nvme_path is pointing to an NVMe device, because while it still works with a normal hard drive or solid state drive, itβll be significantly slower. With a modern NVMe, you can expect peak transfer speeds of ~3.5GB/s for read and ~3GB/s for write operations. Lastly, run a benchmark on your training setup to determine the optimal aio configuration.\n",
|
345 |
+
"The example ZeRO-3/Infinity configuration file below sets most of the parameter values to auto , but you could also manually add these values.\n",
|
346 |
+
"Copied { \"fp16\": { \"enabled\": \"auto\" , \"loss_scale\": 0 , \"loss_scale_window\": 1000 , \"initial_scale_power\": 16 , \"hysteresis\": 2 , \"min_loss_scale\": 1 }, \"optimizer\": { \"type\": \"AdamW\" , \"params\": { \"lr\": \"auto\" , \"betas\": \"auto\" , \"eps\": \"auto\" , \"weight_decay\": \"auto\" }\n",
|
347 |
+
" }, \"scheduler\": { \"type\": \"WarmupLR\" , \"params\": { \"warmup_min_lr\": \"auto\" , \"warmup_max_lr\": \"auto\" , \"warmup_num_steps\": \"auto\" }\n",
|
348 |
+
" }, \"zero_optimization\": { \"stage\": 3 , \"offload_optimizer\": { \"device\": \"nvme\" , \"nvme_path\": \"/local_nvme\" , \"pin_memory\": true , \"buffer_count\": 4 , \"fast_init\": false }, \"offload_param\": { \"device\": \"nvme\" , \"nvme_path\": \"/local_nvme\" , \"pin_memory\": true , \"buffer_count\": 5 , \"buffer_size\": 1e8 , \"max_in_cpu\": 1e9 }, \"aio\": { \"block_size\": 262144 , \"queue_depth\": 32 , \"thread_count\": 1 , \"single_submit\": false , \"overlap_events\": true }, \"overlap_comm\": true , \"contiguous_gradients\": true , \"sub_group_size\": 1e9 , \"reduce_bucket_size\": \"auto\" , \"stage3_prefetch_bucket_size\": \"auto\" , \"stage3_param_persistence_threshold\": \"auto\" , \"stage3_max_live_parameters\": 1e9 , \"stage3_max_reuse_distance\": 1e9 , \"stage3_gather_16bit_weights_on_model_save\": true }, \"gradient_accumulation_steps\": \"auto\" , \"gradient_clipping\": \"auto\" , \"steps_per_print\": 2000 , \"train_batch_size\": \"auto\" , \"train_micro_batch_size_per_gpu\": \"auto\" , \"wall_clock_breakdown\": false }\n",
|
349 |
+
"\n",
|
350 |
+
"There are a number of important parameters to specify in the DeepSpeed configuration file which are briefly described in this section.\n",
|
351 |
+
"\n",
|
352 |
+
"Activation and gradient checkpointing trades speed for more GPU memory which allows you to overcome scenarios where your GPU is out of memory or to increase your batch size for better performance. To enable this feature:\n",
|
353 |
+
"For a Hugging Face model, set model.gradient_checkpointing_enable() or --gradient_checkpointing in the Trainer . For a non-Hugging Face model, use the DeepSpeed Activation Checkpointing API . You could also replace the Transformers modeling code and replace torch.utils.checkpoint with the DeepSpeed API. This approach is more flexible because you can offload the forward activations to the CPU memory instead of recalculating them.\n",
|
354 |
+
"\n",
|
355 |
+
"DeepSpeed and Transformers optimizer and scheduler can be mixed and matched as long as you donβt enable offload_optimizer . When offload_optimizer is enabled, you could use a non-DeepSpeed optimizer (except for LAMB) as long as it has both a CPU and GPU implementation.\n",
|
356 |
+
"The optimizer and scheduler parameters for the config file can be set from the command line to avoid hard to find errors. For example, if the learning rate is set to a different value in another place you can override it from the command line. Aside from the optimizer and scheduler parameters, youβll need to ensure your Trainer command line arguments match the DeepSpeed configuration.\n",
|
357 |
+
"optimizer scheduler\n",
|
358 |
+
"DeepSpeed offers several optimizers (Adam, AdamW, OneBitAdam, and LAMB) but you can also import other optimizers from PyTorch. If you donβt configure the optimizer in the config, the Trainer automatically selects AdamW and either uses the supplied values or the default values for the following parameters from the command line: lr , adam_beta1 , adam_beta2 , adam_epsilon , weight_decay . You can set the parameters to \"auto\" or manually input your own desired values. Copied { \"optimizer\": { \"type\": \"AdamW\" , \"params\": { \"lr\": \"auto\" , \"betas\": \"auto\" , \"eps\": \"auto\" , \"weight_decay\": \"auto\" }\n",
|
359 |
+
" }\n",
|
360 |
+
"} You can also use an unsupported optimizer by adding the following to the top level configuration. Copied { \"zero_allow_untested_optimizer\": true } From DeepSpeed==0.8.3 on, if you want to use offload, youβll also need to the following to the top level configuration because offload works best with DeepSpeedβs CPU Adam optimizer. Copied { \"zero_force_ds_cpu_optimizer\": false }\n",
|
361 |
+
"\n",
|
362 |
+
"Deepspeed supports fp32, fp16, and bf16 mixed precision.\n",
|
363 |
+
"fp32 fp16 bf16\n",
|
364 |
+
"If your model doesnβt work well with mixed precision, for example if it wasnβt pretrained in mixed precision, you may encounter overflow or underflow issues which can cause NaN loss. For these cases, you should use full fp32 precision by explicitly disabling the default fp16 mode. Copied { \"fp16\": { \"enabled\": false }\n",
|
365 |
+
"} For Ampere GPUs and PyTorch > 1.7, it automatically switches to the more efficient tf32 format for some operations but the results are still in fp32. You can control it from the Trainer by setting --tf32 to enable it, and --tf32 0 or --no_tf32 to disable it.\n",
|
366 |
+
"\n",
|
367 |
+
"The batch size can be auto-configured or explicitly set. If you choose to use the \"auto\" option, Trainer sets train_micro_batch_size_per_gpu to the value of args. per_device_train_batch_size and train_batch_size to args.world_size * args.per_device_train_batch_size * args.gradient_accumulation_steps .\n",
|
368 |
+
"Copied { \"train_micro_batch_size_per_gpu\": \"auto\" , \"train_batch_size\": \"auto\" }\n",
|
369 |
+
"\n",
|
370 |
+
"Gradient accumulation can be auto-configured or explicitly set. If you choose to use the \"auto\" option, Trainer sets it to the value of args.gradient_accumulation_steps .\n",
|
371 |
+
"Copied { \"gradient_accumulation_steps\": \"auto\" }\n",
|
372 |
+
"\n",
|
373 |
+
"Gradient clipping can be auto-configured or explicitly set. If you choose to use the \"auto\" option, Trainer sets it to the value of args.max_grad_norm .\n",
|
374 |
+
"Copied { \"gradient_clipping\": \"auto\" }\n",
|
375 |
+
"\n",
|
376 |
+
"For communication collectives like reduction, gathering and scattering operations, a separate data type is used.\n",
|
377 |
+
"All gather and scatter operations are performed in the same data type the data is in. For example, if youβre training with bf16, the data is also gathered in bf16 because gathering is a non-lossy operation.\n",
|
378 |
+
"Reduce operations are lossy, for example when gradients are averaged across multiple GPUs. When the communication is done in fp16 or bf16, it is more likely to be lossy because adding multiple numbers in low precision isnβt exact. This is especially the case with bf16 which has a lower precision than fp16. For this reason, fp16 is the default for reduction operations because the loss is minimal when averaging gradients.\n",
|
379 |
+
"You can choose the communication data type by setting the communication_data_type parameter in the config file. For example, choosing fp32 adds a small amount of overhead but ensures the reduction operation is accumulated in fp32 and when it is ready, it is downcasted to whichever half-precision dtype youβre training in.\n",
|
380 |
+
"Copied { \"communication_data_type\": \"fp32\" }\n",
|
381 |
+
"\n",
|
382 |
+
"DeepSpeed can be deployed by different launchers such as torchrun , the deepspeed launcher, or Accelerate . To deploy, add --deepspeed ds_config.json to the Trainer command line. Itβs recommended to use DeepSpeedβs add_config_arguments utility to add any necessary command line arguments to your code.\n",
|
383 |
+
"This guide will show you how to deploy DeepSpeed with the deepspeed launcher for different training setups. You can check out this post for more practical usage examples.\n",
|
384 |
+
"multi-GPU single-GPU\n",
|
385 |
+
"To deploy DeepSpeed on multiple GPUs, add the --num_gpus parameter. If you want to use all available GPUs, you donβt need to add --num_gpus . The example below uses 2 GPUs. Copied deepspeed --num_gpus=2 examples/pytorch/translation/run_translation.py \\\n",
|
386 |
+
"--deepspeed tests/deepspeed/ds_config_zero3.json \\\n",
|
387 |
+
"--model_name_or_path google-t5/t5-small --per_device_train_batch_size 1 \\\n",
|
388 |
+
"--output_dir output_dir --overwrite_output_dir --fp16 \\\n",
|
389 |
+
"--do_train --max_train_samples 500 --num_train_epochs 1 \\\n",
|
390 |
+
"--dataset_name wmt16 --dataset_config \"ro-en\" \\\n",
|
391 |
+
"--source_lang en --target_lang ro\n",
|
392 |
+
"\n",
|
393 |
+
"A node is one or more GPUs for running a workload. A more powerful setup is a multi-node setup which can be launched with the deepspeed launcher. For this guide, letβs assume there are two nodes with 8 GPUs each. The first node can be accessed ssh hostname1 and the second node with ssh hostname2 . Both nodes must be able to communicate with each other locally over ssh without a password.\n",
|
394 |
+
"By default, DeepSpeed expects your multi-node environment to use a shared storage. If this is not the case and each node can only see the local filesystem, you need to adjust the config file to include a checkpoint to allow loading without access to a shared filesystem:\n",
|
395 |
+
"Copied { \"checkpoint\": { \"use_node_local_storage\": true }\n",
|
396 |
+
"}\n",
|
397 |
+
"You could also use the Trainer βs --save_on_each_node argument to automatically add the above checkpoint to your config.\n",
|
398 |
+
"torchrun deepspeed\n",
|
399 |
+
"For torchrun , you have to ssh to each node and run the following command on both of them. The launcher waits until both nodes are synchronized before launching the training. Copied torchrun --nproc_per_node=8 --nnode=2 --node_rank=0 --master_addr=hostname1 \\\n",
|
400 |
+
"--master_port=9901 your_program.py <normal cl args> --deepspeed ds_config.json\n",
|
401 |
+
"\n",
|
402 |
+
"In a SLURM environment, youβll need to adapt your SLURM script to your specific SLURM environment. An example SLURM script may look like:\n",
|
403 |
+
"Copied #SBATCH --job-name=test-nodes # name #SBATCH --nodes=2 # nodes #SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist per node! #SBATCH --cpus-per-task=10 # number of cores per tasks #SBATCH --gres=gpu:8 # number of gpus #SBATCH --time 20:00:00 # maximum execution time (HH:MM:SS) #SBATCH --output=%x-%j.out # output file name export GPUS_PER_NODE=8 export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1) export MASTER_PORT=9901\n",
|
404 |
+
"\n",
|
405 |
+
"srun --jobid $SLURM_JOBID bash -c 'python -m torch.distributed.run \\\n",
|
406 |
+
" --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES --node_rank $SLURM_PROCID \\\n",
|
407 |
+
" --master_addr $MASTER_ADDR --master_port $MASTER_PORT \\\n",
|
408 |
+
"your_program.py <normal cl args> --deepspeed ds_config.json'\n",
|
409 |
+
"Then you can schedule your multi-node deployment with the following command which launches training simultaneously on all nodes.\n",
|
410 |
+
"Copied sbatch launch.slurm\n",
|
411 |
+
"\n",
|
412 |
+
"The deepspeed launcher doesnβt support deployment from a notebook so youβll need to emulate the distributed environment. However, this only works for 1 GPU. If you want to use more than 1 GPU, you must use a multi-process environment for DeepSpeed to work. This means you have to use the deepspeed launcher which canβt be emulated as shown here.\n",
|
413 |
+
"Copied # DeepSpeed requires a distributed environment even when only one process is used. # This emulates a launcher in the notebook import os\n",
|
414 |
+
"\n",
|
415 |
+
"os.environ[ \"MASTER_ADDR\" ] = \"localhost\" os.environ[ \"MASTER_PORT\" ] = \"9994\" # modify if RuntimeError: Address already in use os.environ[ \"RANK\" ] = \"0\" os.environ[ \"LOCAL_RANK\" ] = \"0\" os.environ[ \"WORLD_SIZE\" ] = \"1\" # Now proceed as normal, plus pass the DeepSpeed config file training_args = TrainingArguments(..., deepspeed= \"ds_config_zero3.json\" )\n",
|
416 |
+
"trainer = Trainer(...)\n",
|
417 |
+
"trainer.train()\n",
|
418 |
+
"If you want to create the config file on the fly in the notebook in the current directory, you could have a dedicated cell.\n",
|
419 |
+
"Copied %%bash\n",
|
420 |
+
"cat << 'EOT' > ds_config_zero3.json\n",
|
421 |
+
"{ \"fp16\" : { \"enabled\" : \"auto\" , \"loss_scale\" : 0 , \"loss_scale_window\" : 1000 , \"initial_scale_power\" : 16 , \"hysteresis\" : 2 , \"min_loss_scale\" : 1 }, \"optimizer\" : { \"type\" : \"AdamW\" , \"params\" : { \"lr\" : \"auto\" , \"betas\" : \"auto\" , \"eps\" : \"auto\" , \"weight_decay\" : \"auto\" }\n",
|
422 |
+
" }, \"scheduler\" : { \"type\" : \"WarmupLR\" , \"params\" : { \"warmup_min_lr\" : \"auto\" , \"warmup_max_lr\" : \"auto\" , \"warmup_num_steps\" : \"auto\" }\n",
|
423 |
+
" }, \"zero_optimization\" : { \"stage\" : 3 , \"offload_optimizer\" : { \"device\" : \"cpu\" , \"pin_memory\" : true\n",
|
424 |
+
" }, \"offload_param\" : { \"device\" : \"cpu\" , \"pin_memory\" : true\n",
|
425 |
+
" }, \"overlap_comm\" : true, \"contiguous_gradients\" : true, \"sub_group_size\" : 1e9 , \"reduce_bucket_size\" : \"auto\" , \"stage3_prefetch_bucket_size\" : \"auto\" , \"stage3_param_persistence_threshold\" : \"auto\" , \"stage3_max_live_parameters\" : 1e9 , \"stage3_max_reuse_distance\" : 1e9 , \"stage3_gather_16bit_weights_on_model_save\" : true\n",
|
426 |
+
" }, \"gradient_accumulation_steps\" : \"auto\" , \"gradient_clipping\" : \"auto\" , \"steps_per_print\" : 2000 , \"train_batch_size\" : \"auto\" , \"train_micro_batch_size_per_gpu\" : \"auto\" , \"wall_clock_breakdown\" : false\n",
|
427 |
+
"}\n",
|
428 |
+
"EOT\n",
|
429 |
+
"If the training script is in a file and not in a notebook cell, you can launch deepspeed normally from the shell in a notebook cell. For example, to launch run_translation.py :\n",
|
430 |
+
"Copied !git clone https://github.com/huggingface/transformers\n",
|
431 |
+
"!cd transformers; deepspeed examples/pytorch/translation/run_translation.py ...\n",
|
432 |
+
"You could also use %%bash magic and write multi-line code to run the shell program, but you wonβt be able to view the logs until training is complete. With %%bash magic, you donβt need to emulate a distributed environment.\n",
|
433 |
+
"Copied %%bash\n",
|
434 |
+
"\n",
|
435 |
+
"git clone https://github.com/huggingface/transformers\n",
|
436 |
+
"cd transformers\n",
|
437 |
+
"deepspeed examples/pytorch/translation/run_translation.py ...\n",
|
438 |
+
"\n",
|
439 |
+
"DeepSpeed stores the main full precision fp32 weights in custom checkpoint optimizer files (the glob pattern looks like global_step*/*optim_states.pt ) and are saved under the normal checkpoint.\n",
|
440 |
+
"fp16 fp32\n",
|
441 |
+
"A model trained with ZeRO-2 saves the pytorch_model.bin weights in fp16. To save the model weights in fp16 for a model trained with ZeRO-3, you need to set \"stage3_gather_16bit_weights_on_model_save\": true because the model weights are partitioned across multiple GPUs. Otherwise, the Trainer wonβt save the weights in fp16 and it wonβt create a pytorch_model.bin file. This is because DeepSpeedβs state_dict contains a placeholder instead of the real weights and you wonβt be able to load them. Copied { \"zero_optimization\": { \"stage3_gather_16bit_weights_on_model_save\": true }\n",
|
442 |
+
"}\n",
|
443 |
+
"\n",
|
444 |
+
"ZeRO Inference places the model weights in CPU or NVMe memory to avoid burdening the GPU which makes it possible to run inference with huge models on a GPU. Inference doesnβt require any large additional amounts of memory for the optimizer states and gradients so you can fit much larger batches and/or sequence lengths on the same hardware.\n",
|
445 |
+
"ZeRO Inference shares the same configuration file as ZeRO-3 , and ZeRO-2 and ZeRO-1 configs wonβt work because they donβt provide any benefits for inference.\n",
|
446 |
+
"To run ZeRO Inference, pass your usual training arguments to the TrainingArguments class and add the --do_eval argument.\n",
|
447 |
+
"Copied deepspeed --num_gpus=2 your_program.py <normal cl args> --do_eval --deepspeed ds_config.json\n",
|
448 |
+
"\n",
|
449 |
+
"DeepSpeed also works with Transformers without the Trainer class. This is handled by the HfDeepSpeedConfig which only takes care of gathering ZeRO-3 parameters and splitting a model across multiple GPUs when you call from_pretrained() .\n",
|
450 |
+
"If you want everything automatically taken care of for you, try using DeepSpeed with the Trainer ! Youβll need to follow the DeepSpeed documentation , and manually configure the parameter values in the config file (you canβt use the \"auto\" value).\n",
|
451 |
+
"To efficiently deploy ZeRO-3, you must instantiate the HfDeepSpeedConfig object before the model and keep that object alive:\n",
|
452 |
+
"pretrained model non-pretrained model\n",
|
453 |
+
"Copied from transformers.integrations import HfDeepSpeedConfig from transformers import AutoModel import deepspeed\n",
|
454 |
+
"\n",
|
455 |
+
"ds_config = {...} # deepspeed config object or path to the file # must run before instantiating the model to detect zero 3 dschf = HfDeepSpeedConfig(ds_config) # keep this object alive model = AutoModel.from_pretrained( \"openai-community/gpt2\" )\n",
|
456 |
+
"engine = deepspeed.initialize(model=model, config_params=ds_config, ...)\n",
|
457 |
+
"\n",
|
458 |
+
"To run ZeRO Inference without the Trainer in cases where you canβt fit a model onto a single GPU, try using additional GPUs or/and offloading to CPU memory. The important nuance to understand here is that the way ZeRO is designed, you can process different inputs on different GPUs in parallel.\n",
|
459 |
+
"Make sure to:\n",
|
460 |
+
"disable CPU offload if you have enough GPU memory (since it slows things down). enable bf16 if you have an Ampere or newer GPU to make things faster. If you donβt have one of these GPUs, you may enable fp16 as long as you donβt use a model pretrained in bf16 (T5 models) because it may lead to an overflow error.\n",
|
461 |
+
"Take a look at the following script to get a better idea of how to run ZeRO Inference without the Trainer on a model that wonβt fit on a single GPU.\n",
|
462 |
+
"Copied #!/usr/bin/env python # This script demonstrates how to use Deepspeed ZeRO in an inference mode when one can't fit a model # into a single GPU # # 1. Use 1 GPU with CPU offload # 2. Or use multiple GPUs instead # # First you need to install deepspeed: pip install deepspeed # # Here we use a 3B \"bigscience/T0_3B\" model which needs about 15GB GPU RAM - so 1 largish or 2 # small GPUs can handle it. or 1 small GPU and a lot of CPU memory. # # To use a larger model like \"bigscience/T0\" which needs about 50GB, unless you have an 80GB GPU - # you will need 2-4 gpus. And then you can adapt the script to handle more gpus if you want to # process multiple inputs at once. # # The provided deepspeed config also activates CPU memory offloading, so chances are that if you # have a lot of available CPU memory and you don't mind a slowdown you should be able to load a # model that doesn't normally fit into a single GPU. If you have enough GPU memory the program will # run faster if you don't want offload to CPU - so disable that section then. # # To deploy on 1 gpu: # # deepspeed --num_gpus 1 t0.py # or: # python -m torch.distributed.run --nproc_per_node=1 t0.py # # To deploy on 2 gpus: # # deepspeed --num_gpus 2 t0.py # or: # python -m torch.distributed.run --nproc_per_node=2 t0.py from transformers import AutoTokenizer, AutoConfig, AutoModelForSeq2SeqLM from transformers.integrations import HfDeepSpeedConfig import deepspeed import os import torch\n",
|
463 |
+
"\n",
|
464 |
+
"os.environ[ \"TOKENIZERS_PARALLELISM\" ] = \"false\" # To avoid warnings about parallelism in tokenizers # distributed setup local_rank = int (os.getenv( \"LOCAL_RANK\" , \"0\" ))\n",
|
465 |
+
"world_size = int (os.getenv( \"WORLD_SIZE\" , \"1\" ))\n",
|
466 |
+
"torch.cuda.set_device(local_rank)\n",
|
467 |
+
"deepspeed.init_distributed()\n",
|
468 |
+
"\n",
|
469 |
+
"model_name = \"bigscience/T0_3B\" config = AutoConfig.from_pretrained(model_name)\n",
|
470 |
+
"model_hidden_size = config.d_model # batch size has to be divisible by world_size, but can be bigger than world_size train_batch_size = 1 * world_size # ds_config notes # # - enable bf16 if you use Ampere or higher GPU - this will run in mixed precision and will be # faster. # # - for older GPUs you can enable fp16, but it'll only work for non-bf16 pretrained models - e.g. # all official t5 models are bf16-pretrained # # - set offload_param.device to \"none\" or completely remove the `offload_param` section if you don't # - want CPU offload # # - if using `offload_param` you can manually finetune stage3_param_persistence_threshold to control # - which params should remain on gpus - the larger the value the smaller the offload size # # For in-depth info on Deepspeed config see # https://huggingface.co/docs/transformers/main/main_classes/deepspeed # keeping the same format as json for consistency, except it uses lower case for true/false # fmt: off ds_config = { \"fp16\" : { \"enabled\" : False }, \"bf16\" : { \"enabled\" : False }, \"zero_optimization\" : { \"stage\" : 3 , \"offload_param\" : { \"device\" : \"cpu\" , \"pin_memory\" : True }, \"overlap_comm\" : True , \"contiguous_gradients\" : True , \"reduce_bucket_size\" : model_hidden_size * model_hidden_size, \"stage3_prefetch_bucket_size\" : 0.9 * model_hidden_size * model_hidden_size, \"stage3_param_persistence_threshold\" : 10 * model_hidden_size\n",
|
471 |
+
" }, \"steps_per_print\" : 2000 , \"train_batch_size\" : train_batch_size, \"train_micro_batch_size_per_gpu\" : 1 , \"wall_clock_breakdown\" : False } # fmt: on # next line instructs transformers to partition the model directly over multiple gpus using # deepspeed.zero.Init when model's `from_pretrained` method is called. # # **it has to be run before loading the model AutoModelForSeq2SeqLM.from_pretrained(model_name)** # # otherwise the model will first be loaded normally and only partitioned at forward time which is # less efficient and when there is little CPU RAM may fail dschf = HfDeepSpeedConfig(ds_config) # keep this object alive # now a model can be loaded. model = AutoModelForSeq2SeqLM.from_pretrained(model_name) # initialise Deepspeed ZeRO and store only the engine object ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[ 0 ]\n",
|
472 |
+
"ds_engine.module. eval () # inference # Deepspeed ZeRO can process unrelated inputs on each GPU. So for 2 gpus you process 2 inputs at once. # If you use more GPUs adjust for more. # And of course if you have just one input to process you then need to pass the same string to both gpus # If you use only one GPU, then you will have only rank 0. rank = torch.distributed.get_rank() if rank == 0 :\n",
|
473 |
+
" text_in = \"Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy\" elif rank == 1 :\n",
|
474 |
+
" text_in = \"Is this review positive or negative? Review: this is the worst restaurant ever\" tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
|
475 |
+
"inputs = tokenizer.encode(text_in, return_tensors= \"pt\" ).to(device=local_rank) with torch.no_grad():\n",
|
476 |
+
" outputs = ds_engine.module.generate(inputs, synced_gpus= True )\n",
|
477 |
+
"text_out = tokenizer.decode(outputs[ 0 ], skip_special_tokens= True ) print ( f\"rank {rank} :\\n in= {text_in} \\n out= {text_out} \" )\n",
|
478 |
+
"Save the script as t0.py and launch it:\n",
|
479 |
+
"Copied $ deepspeed --num_gpus 2 t0.py\n",
|
480 |
+
"rank0: in =Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy\n",
|
481 |
+
" out=Positive\n",
|
482 |
+
"rank1: in =Is this review positive or negative? Review: this is the worst restaurant ever\n",
|
483 |
+
" out=negative\n",
|
484 |
+
"This is a very basic example and youβll want to adapt it to your use case.\n",
|
485 |
+
"\n",
|
486 |
+
"Using multiple GPUs with ZeRO-3 for generation requires synchronizing the GPUs by setting synced_gpus=True in the generate() method. Otherwise, if one GPU is finished generating before another one, the whole system hangs because the remaining GPUs havenβt received the weight shard from the GPU that finished first.\n",
|
487 |
+
"For Transformers>=4.28, if synced_gpus is automatically set to True if multiple GPUs are detected during generation.\n",
|
488 |
+
"\n",
|
489 |
+
"When you encounter an issue, you should consider whether DeepSpeed is the cause of the problem because often it isnβt (unless itβs super obviously and you can see DeepSpeed modules in the exception)! The first step should be to retry your setup without DeepSpeed, and if the problem persists, then you can report the issue. If the issue is a core DeepSpeed problem and unrelated to the Transformers integration, open an Issue on the DeepSpeed repository .\n",
|
490 |
+
"For issues related to the Transformers integration, please provide the following information:\n",
|
491 |
+
"the full DeepSpeed config file the command line arguments of the Trainer , or TrainingArguments arguments if youβre scripting the Trainer setup yourself (donβt dump the TrainingArguments which has dozens of irrelevant entries) the outputs of:\n",
|
492 |
+
"Copied python -c 'import torch; print(f\"torch: {torch.__version__}\")' python -c 'import transformers; print(f\"transformers: {transformers.__version__}\")' python -c 'import deepspeed; print(f\"deepspeed: {deepspeed.__version__}\")'\n",
|
493 |
+
"a link to a Google Colab notebook to reproduce the issue if impossible, a standard and non-custom dataset we can use and also try to use an existing example to reproduce the issue with\n",
|
494 |
+
"The following sections provide a guide for resolving two of the most common issues.\n",
|
495 |
+
"\n",
|
496 |
+
"When the DeepSpeed process is killed during launch without a traceback, that usually means the program tried to allocate more CPU memory than your system has or your process tried to allocate more CPU memory than allowed leading the OS kernel to terminate the process. In this case, check whether your configuration file has either offload_optimizer , offload_param or both configured to offload to the CPU.\n",
|
497 |
+
"If you have NVMe and ZeRO-3 setup, experiment with offloading to the NVMe ( estimate the memory requirements for your model).\n",
|
498 |
+
"\n",
|
499 |
+
"NaN loss often occurs when a model is pretrained in bf16 and then you try to use it with fp16 (especially relevant for TPU trained models). To resolve this, use fp32 or bf16 if your hardware supports it (TPU, Ampere GPUs or newer).\n",
|
500 |
+
"The other issue may be related to using fp16. For example, if this is your fp16 configuration:\n",
|
501 |
+
"Copied { \"fp16\": { \"enabled\": \"auto\" , \"loss_scale\": 0 , \"loss_scale_window\": 1000 , \"initial_scale_power\": 16 , \"hysteresis\": 2 , \"min_loss_scale\": 1 }\n",
|
502 |
+
"}\n",
|
503 |
+
"You might see the following OVERFLOW! messages in the logs:\n",
|
504 |
+
"Copied 0%| | 0/189 [00:00<?, ?it/s]\n",
|
505 |
+
" [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 262144\n",
|
506 |
+
" 1%|β | 1/189 [00:00<01:26, 2.17it/s]\n",
|
507 |
+
" [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072.0\n",
|
508 |
+
" 1%|ββ\n",
|
509 |
+
" [...]\n",
|
510 |
+
" [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1\n",
|
511 |
+
" 14%|βββββββββββββββββ | 27/189 [00:14<01:13, 2.21it/s]\n",
|
512 |
+
" [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1\n",
|
513 |
+
" 15%|ββββββββββββββββββ | 28/189 [00:14<01:13, 2.18it/s]\n",
|
514 |
+
" [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1\n",
|
515 |
+
" 15%|ββββββββββββββββββ | 29/189 [00:15<01:13, 2.18it/s]\n",
|
516 |
+
" [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1\n",
|
517 |
+
"[...]\n",
|
518 |
+
"This means the DeepSpeed loss scaler is unable to find a scaling coefficient to overcome loss overflow. To fix it, try a higher initial_scale_power value (32 usually works).\n",
|
519 |
+
"\n",
|
520 |
+
"DeepSpeed ZeRO is a powerful technology for training and loading very large models for inference with limited GPU resources, making it more accessible to everyone. To learn more about DeepSpeed, feel free to read the blog posts , documentation , and GitHub repository .\n",
|
521 |
+
"The following papers are also a great resource for learning more about ZeRO:\n",
|
522 |
+
"ZeRO: Memory Optimizations Toward Training Trillion Parameter Models ZeRO-Offload: Democratizing Billion-Scale Model Training ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning\n",
|
523 |
+
"< > Update on GitHub\n",
|
524 |
+
"HTML_TAG_END\n",
|
525 |
+
"The Embedding model sees this: \n",
|
526 |
+
" DeepSpeed is a PyTorch optimization library that makes distributed training memory-efficient and fast. At itβs core is the Zero Redundancy Optimizer (ZeRO) which enables training large models at scale. ZeRO works in several stages:\n",
|
527 |
+
"ZeRO-1, optimizer state partioning across GPUs ZeRO-2, gradient partitioning across GPUs ZeRO-3, parameteter partitioning across GPUs\n",
|
528 |
+
"In GPU-limited environments, ZeRO also enables offloading optimizer memory and computation from the GPU to the CPU to fit and train really large models on a single GPU. DeepSpeed is integrated with the Transformers Trainer class for all ZeRO stages and offloading. All you need to do is provide a config file or you can use a provided template. For inference, Transformers support ZeRO-3 and offloading since it allows loading huge models.\n",
|
529 |
+
"This guide will walk you through how to deploy DeepSpeed training, the features you can enable, how to setup the config files for different ZeRO stages, offloading, inference, and using DeepSpeed without the Trainer .\n",
|
530 |
+
"\n",
|
531 |
+
"DeepSpeed is available to install from PyPI or Transformers (for more detailed installation options, take a look at the DeepSpeed installation details or the GitHub README ).\n",
|
532 |
+
"If youβre having difficulties installing DeepSpeed, check the DeepSpeed CUDA installation guide. While DeepSpeed has a pip installable PyPI package, it is highly recommended to install it from source to best match your hardware and to support certain features, like 1-bit Adam, which arenβt available in the PyPI distribution.\n",
|
533 |
+
"PyPI Transformers\n",
|
534 |
+
"Copied pip install deepspeed\n",
|
535 |
+
"\n",
|
536 |
+
"Before you begin, it is a good idea to check whether you have enough GPU and CPU memory to fit your model. DeepSpeed provides a tool for estimating the required CPU/GPU memory. For example, to estimate the memory requirements for the bigscience/T0_3B model on a single GPU:\n",
|
537 |
+
"Copied $ python -c 'from transformers import AutoModel; \\\n",
|
538 |
+
"from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live; \\\n",
|
539 |
+
"model = AutoModel.from_pretrained(\"bigscience/T0_3B\"); \\\n",
|
540 |
+
"estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)' [...]\n",
|
541 |
+
"Estimated memory needed for params, optim states and gradients for a:\n",
|
542 |
+
"HW: Setup with 1 node, 1 GPU per node.\n",
|
543 |
+
"SW: Model with 2783M total params, 65M largest layer params.\n",
|
544 |
+
" per CPU | per GPU | Options\n",
|
545 |
+
" 70.00GB | 0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1\n",
|
546 |
+
" 70.00GB | 0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0\n",
|
547 |
+
" 62.23GB | 5.43GB | offload_param=none, offload_optimizer=cpu , zero_init=1\n",
|
548 |
+
" 62.23GB | 5.43GB | offload_param=none, offload_optimizer=cpu , zero_init=0\n",
|
549 |
+
" 0.37GB | 46.91GB | offload_param=none, offload_optimizer=none, zero_init=1\n",
|
550 |
+
" 15.56GB | 46.91GB | offload_param=none, offload_optimizer=none, zero_init=0\n",
|
551 |
+
"This means you either need a single 80GB GPU without CPU offload or a 8GB GPU and a ~60GB CPU to offload to (these are just the memory requirements for the parameters, optimizer states and gradients, and youβll need a bit more for the CUDA kernels and activations). You should also consider the tradeoff between cost and speed because itβll be cheaper to rent or buy a smaller GPU but itβll take longer to train your model.\n",
|
552 |
+
"If you have enough GPU memory make sure you disable CPU/NVMe offload to make everything faster.\n",
|
553 |
+
"\n",
|
554 |
+
"After youβve installed DeepSpeed and have a better idea of your memory requirements, the next step is selecting a ZeRO stage to use. In order of fastest and most memory-efficient:\n",
|
555 |
+
"| Fastest | Memory efficient |\n",
|
556 |
+
"|------------------|--------------------|\n",
|
557 |
+
"| ZeRO-1 | ZeRO-3 + offload |\n",
|
558 |
+
"| ZeRO-2 | ZeRO-3 |\n",
|
559 |
+
"| ZeRO-2 + offload | ZeRO-2 + offload |\n",
|
560 |
+
"| ZeRO-3 | ZeRO-2 |\n",
|
561 |
+
"| ZeRO-3 + offload | ZeRO-1 |\n",
|
562 |
+
"To find what works best for you, start with the fastest approach and if you run out of memory, try the next stage which is slower but more memory efficient. Feel free to work in whichever direction you prefer (starting with the most memory efficient or fastest) to discover the appropriate balance between speed and memory usage.\n",
|
563 |
+
"A general process you can use is (start with batch size of 1):\n",
|
564 |
+
"enable gradient checkpointing try ZeRO-2 try ZeRO-2 and offload the optimizer try ZeRO-3 try ZeRO-3 and offload parameters to the CPU try ZeRO-3 and offload parameters and the optimizer to the CPU try lowering various default values like a narrower search beam if youβre using the generate() method try mixed half-precision (fp16 on older GPU architectures and bf16 on Ampere) over full-precision weights add more hardware if possible or enable Infinity to offload parameters and the optimizer to a NVMe once youβre not running out of memory, measure effective throughput and then try to increase the batch size as large as you can to maximize GPU efficiency lastly, try to optimize your training setup by disabling some offload features or use a faster ZeRO stage and increasing/decreasing the batch size to find the best tradeoff between speed and memory usage\n",
|
565 |
+
"\n",
|
566 |
+
"DeepSpeed works with the Trainer class by way of a config file containing all the parameters for configuring how you want setup your training run. When you execute your training script, DeepSpeed logs the configuration it received from Trainer to the console so you can see exactly what configuration was used.\n",
|
567 |
+
"Find a complete list of DeepSpeed configuration options on the DeepSpeed Configuration JSON reference. You can also find more practical examples of various DeepSpeed configuration examples on the DeepSpeedExamples repository or the main DeepSpeed repository. To quickly find specific examples, you can: Copied git clone https://github.com/microsoft/DeepSpeedExamples cd DeepSpeedExamples\n",
|
568 |
+
"find . -name '*json' # find examples with the Lamb optimizer grep -i Lamb $(find . -name '*json' )\n",
|
569 |
+
"The DeepSpeed configuration file is passed as a path to a JSON file if youβre training from the command line interface or as a nested dict object if youβre using the Trainer in a notebook setting.\n",
|
570 |
+
"path to file nested dict\n",
|
571 |
+
"Copied TrainingArguments(..., deepspeed= \"path/to/deepspeed_config.json\" )\n",
|
572 |
+
"\n",
|
573 |
+
"There are three types of configuration parameters:\n",
|
574 |
+
"Some of the configuration parameters are shared by Trainer and DeepSpeed, and it can be difficult to identify errors when there are conflicting definitions. To make it easier, these shared configuration parameters are configured from the Trainer command line arguments. Some configuration parameters that are automatically derived from the model configuration so you donβt need to manually adjust these values. The Trainer uses a configuration value auto to determine set the most correct or efficient value. You could set your own configuration parameters explicitly, but you must take care to ensure the Trainer arguments and DeepSpeed configuration parameters agree. Mismatches may cause the training to fail in very difficult to detect ways! Some configuration parameters specific to DeepSpeed only which need to be manually set based on your training needs.\n",
|
575 |
+
"You could also modify the DeepSpeed configuration and edit TrainingArguments from it:\n",
|
576 |
+
"Create or load a DeepSpeed configuration to used as the main configuration Create a TrainingArguments object based on these DeepSpeed configuration values\n",
|
577 |
+
"Some values, such as scheduler.params.total_num_steps are calculated by the Trainer during training.\n",
|
578 |
+
"\n",
|
579 |
+
"There are three configurations, each corresponding to a different ZeRO stage. Stage 1 is not as interesting for scalability, and this guide focuses on stages 2 and 3. The zero_optimization configuration contains all the options for what to enable and how to configure them. For a more detailed explanation of each parameter, take a look at the DeepSpeed Configuration JSON reference.\n",
|
580 |
+
"DeepSpeed doesnβt validate parameter names and any typos fallback on the parameter's default setting. You can watch the DeepSpeed engine startup log messages to see what values it is going to use.\n",
|
581 |
+
"The following configurations must be setup with DeepSpeed because the Trainer doesnβt provide equivalent command line arguments.\n",
|
582 |
+
"ZeRO-1 ZeRO-2 ZeRO-3\n",
|
583 |
+
"ZeRO-1 shards the optimizer states across GPUs, and you can expect a tiny speed up. The ZeRO-1 config can be setup like this: Copied { \"zero_optimization\": { \"stage\": 1 }\n",
|
584 |
+
"}\n",
|
585 |
+
"\n",
|
586 |
+
"ZeRO-Infinity allows offloading model states to the CPU and/or NVMe to save even more memory. Smart partitioning and tiling algorithms allow each GPU to send and receive very small amounts of data during offloading such that a modern NVMe can fit an even larger total memory pool than is available to your training process. ZeRO-Infinity requires ZeRO-3.\n",
|
587 |
+
"Depending on the CPU and/or NVMe memory available, you can offload both the optimizer states and parameters , just one of them, or none. You should also make sure the nvme_path is pointing to an NVMe device, because while it still works with a normal hard drive or solid state drive, itβll be significantly slower. With a modern NVMe, you can expect peak transfer speeds of ~3.5GB/s for read and ~3GB/s for write operations. Lastly, run a benchmark on your training setup to determine the optimal aio configuration.\n",
|
588 |
+
"The example ZeRO-3/Infinity configuration file below sets most of the parameter values to auto , but you could also manually add these values.\n",
|
589 |
+
"Copied { \"fp16\": { \"enabled\": \"auto\" , \"loss_scale\": 0 , \"loss_scale_window\": 1000 , \"initial_scale_power\": 16 , \"hysteresis\": 2 , \"min_loss_scale\": 1 }, \"optimizer\": { \"type\": \"AdamW\" , \"params\": { \"lr\": \"auto\" , \"betas\": \"auto\" , \"eps\": \"auto\" , \"weight_decay\": \"auto\" }\n",
|
590 |
+
" }, \"scheduler\": { \"type\": \"WarmupLR\" , \"params\": { \"warmup_min_lr\": \"auto\" , \"warmup_max_lr\": \"auto\" , \"warmup_num_steps\": \"auto\" }\n",
|
591 |
+
" }, \"zero_optimization\": { \"stage\": 3 , \"offload_optimizer\": { \"device\": \"nvme\" , \"nvme_path\": \"/local_nvme\" , \"pin_memory\": true , \"buffer_count\": 4 , \"fast_init\": false }, \"offload_param\": { \"device\": \"nvme\" , \"nvme_path\": \"/local_nvme\" , \"pin_memory\": true , \"buffer_count\": 5 , \"buffer_size\": 1e8 , \"max_in_cpu\": 1e9 }, \"aio\": { \"block_size\": 262144 , \"queue_depth\": 32 , \"thread_count\": 1 , \"single_submit\": false , \"overlap_events\": true }, \"overlap_comm\": true , \"contiguous_gradients\": true , \"sub_group_size\": 1e9 , \"reduce_bucket_size\": \"auto\" , \"stage3_prefetch_bucket_size\": \"auto\" , \"stage3_param_persistence_threshold\": \"auto\" , \"stage3_max_live_parameters\": 1e9 , \"stage3_max_reuse_distance\": 1e9 , \"stage3_gather_16bit_weights_on_model_save\": true }, \"gradient_accumulation_steps\": \"auto\" , \"gradient_clipping\": \"auto\" , \"steps_per_print\": 2000 , \"train_batch_size\": \"auto\" , \"train_micro_batch_size_per_gpu\": \"auto\" , \"wall_clock_breakdown\": false }\n",
|
592 |
+
"\n",
|
593 |
+
"There are a number of important parameters to specify in the DeepSpeed configuration file which are briefly described in this section.\n",
|
594 |
+
"\n",
|
595 |
+
"Activation and gradient checkpointing trades speed for more GPU memory which allows you to overcome scenarios where your GPU is out of memory or to increase your batch size for better performance. To enable this feature:\n",
|
596 |
+
"For a Hugging Face model, set model.gradient_checkpointing_enable() or --gradient_checkpointing in the Trainer . For a non-Hugging Face model, use the DeepSpeed Activation Checkpointing API . You could also replace the Transformers modeling code and replace torch.utils.checkpoint with the DeepSpeed API. This approach is more flexible because you can offload the forward activations to the CPU memory instead of recalculating them.\n",
|
597 |
+
"\n",
|
598 |
+
"DeepSpeed and Transformers optimizer and scheduler can be mixed and matched as long as you donβt enable offload_optimizer . When offload_optimizer is enabled, you could use a non-DeepSpeed optimizer (except for LAMB) as long as it has both a CPU and GPU implementation.\n",
|
599 |
+
"The optimizer and scheduler parameters for the config file can be set from the command line to avoid hard to find errors. For example, if the learning rate is set to a different value in another place you can override it from the command line. Aside from the optimizer and scheduler parameters, youβll need to ensure your Trainer command line arguments match the DeepSpeed configuration.\n",
|
600 |
+
"optimizer scheduler\n",
|
601 |
+
"DeepSpeed offers several optimizers (Adam, AdamW, OneBitAdam, and LAMB) but you can also import other optimizers from PyTorch. If you donβt configure the optimizer in the config, the Trainer automatically selects AdamW and either uses the supplied values or the default values for the following parameters from the command line: lr , adam_beta1 , adam_beta2 , adam_epsilon , weight_decay . You can set the parameters to \"auto\" or manually input your own desired values. Copied { \"optimizer\": { \"type\": \"AdamW\" , \"params\": { \"lr\": \"auto\" , \"betas\": \"auto\" , \"eps\": \"auto\" , \"weight_decay\": \"auto\" }\n",
|
602 |
+
" }\n",
|
603 |
+
"} You can also use an unsupported optimizer by adding the following to the top level configuration. Copied { \"zero_allow_untested_optimizer\": true } From DeepSpeed==0.8.3 on, if you want to use offload, youβll also need to the following to the top level configuration because offload works best with DeepSpeedβs CPU Adam optimizer. Copied { \"zero_force_ds_cpu_optimizer\": false }\n",
|
604 |
+
"\n",
|
605 |
+
"Deepspeed supports fp32, fp16, and bf16 mixed precision.\n",
|
606 |
+
"fp32 fp16 bf16\n",
|
607 |
+
"If your model doesnβt work well with mixed precision, for example if it wasnβt pretrained in mixed precision, you may encounter overflow or underflow issues which can cause NaN loss. For these cases, you should use full fp32 precision by explicitly disabling the default fp16 mode. Copied { \"fp16\": { \"enabled\": false }\n",
|
608 |
+
"} For Ampere GPUs and PyTorch > 1.7, it automatically switches to the more efficient tf32 format for some operations but the results are still in fp32. You can control it from the Trainer by setting --tf32 to enable it, and --tf32 0 or --no_tf32 to disable it.\n",
|
609 |
+
"\n",
|
610 |
+
"The batch size can be auto-configured or explicitly set. If you choose to use the \"auto\" option, Trainer sets train_micro_batch_size_per_gpu to the value of args.per_device_train_batch_size and train_batch_size to args.world_size * args.per_device_train_batch_size * args.gradient_accumulation_steps .\n",
|
611 |
+
"Copied { \"train_micro_batch_size_per_gpu\": \"auto\" , \"train_batch_size\": \"auto\" }\n",
|
612 |
+
"\n",
|
613 |
+
"Gradient accumulation can be auto-configured or explicitly set. If you choose to use the \"auto\" option, Trainer sets it to the value of args.gradient_accumulation_steps .\n",
|
614 |
+
"Copied { \"gradient_accumulation_steps\": \"auto\" }\n",
|
615 |
+
"\n",
|
616 |
+
"Gradient clipping can be auto-configured or explicitly set. If you choose to use the \"auto\" option, Trainer sets it to the value of args.max_grad_norm .\n",
|
617 |
+
"Copied { \"gradient_clipping\": \"auto\" }\n",
|
618 |
+
"\n",
|
619 |
+
"For communication collectives like reduction, gathering and scattering operations, a separate data type is used.\n",
|
620 |
+
"All gather and scatter operations are performed in the same data type the data is in. For example, if you're training with bf16, the data is also gathered in bf16 because gathering is a non-lossy operation.\n",
|
621 |
+
"Reduce operations are lossy, for example when gradients are averaged across multiple GPUs. When the communication is done in fp16 or bf16, it is more likely to be lossy because adding multiple numbers in low precision isn't exact. This is especially the case with bf16 which has a lower precision than fp16. For this reason, fp16 is the default for reduction operations because the loss is minimal when averaging gradients.\n",
|
622 |
+
"You can choose the communication data type by setting the communication_data_type parameter in the config file. For example, choosing fp32 adds a small amount of overhead but ensures the reduction operation is accumulated in fp32 and when it is ready, it is downcast to whichever half-precision dtype you're training in.\n",
|
623 |
+
"Copied { \"communication_data_type\": \"fp32\" }\n",
|
624 |
+
"\n",
|
625 |
+
"DeepSpeed can be deployed by different launchers such as torchrun , the deepspeed launcher, or Accelerate . To deploy, add --deepspeed ds_config.json to the Trainer command line. It's recommended to use DeepSpeed's add_config_arguments utility to add any necessary command line arguments to your code.\n",
|
626 |
+
"This guide will show you how to deploy DeepSpeed with the deepspeed launcher for different training setups. You can check out this post for more practical usage examples.\n",
|
627 |
+
"multi-GPU single-GPU\n",
|
628 |
+
"To deploy DeepSpeed on multiple GPUs, add the --num_gpus parameter. If you want to use all available GPUs, you don't need to add --num_gpus . The example below uses 2 GPUs. Copied deepspeed --num_gpus=2 examples/pytorch/translation/run_translation.py \\\n",
|
629 |
+
"--deepspeed tests/deepspeed/ds_config_zero3.json \\\n",
|
630 |
+
"--model_name_or_path google-t5/t5-small --per_device_train_batch_size 1 \\\n",
|
631 |
+
"--output_dir output_dir --overwrite_output_dir --fp16 \\\n",
|
632 |
+
"--do_train --max_train_samples 500 --num_train_epochs 1 \\\n",
|
633 |
+
"--dataset_name wmt16 --dataset_config \"ro-en\" \\\n",
|
634 |
+
"--source_lang en --target_lang ro\n",
|
635 |
+
"\n",
|
636 |
+
"A node is one or more GPUs for running a workload. A more powerful setup is a multi-node setup which can be launched with the deepspeed launcher. For this guide, let's assume there are two nodes with 8 GPUs each. The first node can be accessed with ssh hostname1 and the second node with ssh hostname2 . Both nodes must be able to communicate with each other locally over ssh without a password.\n",
|
637 |
+
"By default, DeepSpeed expects your multi-node environment to use a shared storage. If this is not the case and each node can only see the local filesystem, you need to adjust the config file to include a checkpoint to allow loading without access to a shared filesystem:\n",
|
638 |
+
"Copied { \"checkpoint\": { \"use_node_local_storage\": true }\n",
|
639 |
+
"}\n",
|
640 |
+
"You could also use the Trainer's --save_on_each_node argument to automatically add the above checkpoint to your config.\n",
|
641 |
+
"torchrun deepspeed\n",
|
642 |
+
"For torchrun , you have to ssh to each node and run the following command on both of them. The launcher waits until both nodes are synchronized before launching the training. Copied torchrun --nproc_per_node=8 --nnode=2 --node_rank=0 --master_addr=hostname1 \\\n",
|
643 |
+
"--master_port=9901 your_program.py <normal cl args> --deepspeed ds_config.json\n",
|
644 |
+
"\n",
|
645 |
+
"In a SLURM environment, you'll need to adapt your SLURM script to your specific SLURM environment. An example SLURM script may look like:\n",
|
646 |
+
"Copied #SBATCH --job-name=test-nodes # name #SBATCH --nodes=2 # nodes #SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist per node! #SBATCH --cpus-per-task=10 # number of cores per tasks #SBATCH --gres=gpu:8 # number of gpus #SBATCH --time 20:00:00 # maximum execution time (HH:MM:SS) #SBATCH --output=%x-%j.out # output file name export GPUS_PER_NODE=8 export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1) export MASTER_PORT=9901\n",
|
647 |
+
"\n",
|
648 |
+
"srun --jobid $SLURM_JOBID bash -c 'python -m torch.distributed.run \\\n",
|
649 |
+
" --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES --node_rank $SLURM_PROCID \\\n",
|
650 |
+
" --master_addr $MASTER_ADDR --master_port $MASTER_PORT \\\n",
|
651 |
+
"your_program.py <normal cl args> --deepspeed ds_config.json'\n",
|
652 |
+
"Then you can schedule your multi-node deployment with the following command which launches training simultaneously on all nodes.\n",
|
653 |
+
"Copied sbatch launch.slurm\n",
|
654 |
+
"\n",
|
655 |
+
"The deepspeed launcher doesn't support deployment from a notebook so you'll need to emulate the distributed environment. However, this only works for 1 GPU. If you want to use more than 1 GPU, you must use a multi-process environment for DeepSpeed to work. This means you have to use the deepspeed launcher which can't be emulated as shown here.\n",
|
656 |
+
"Copied # DeepSpeed requires a distributed environment even when only one process is used. # This emulates a launcher in the notebook import os\n",
|
657 |
+
"\n",
|
658 |
+
"os.environ[ \"MASTER_ADDR\" ] = \"localhost\" os.environ[ \"MASTER_PORT\" ] = \"9994\" # modify if RuntimeError: Address already in use os.environ[ \"RANK\" ] = \"0\" os.environ[ \"LOCAL_RANK\" ] = \"0\" os.environ[ \"WORLD_SIZE\" ] = \"1\" # Now proceed as normal, plus pass the DeepSpeed config file training_args = TrainingArguments(..., deepspeed= \"ds_config_zero3.json\" )\n",
|
659 |
+
"trainer = Trainer(...)\n",
|
660 |
+
"trainer.train()\n",
|
661 |
+
"If you want to create the config file on the fly in the notebook in the current directory, you could have a dedicated cell.\n",
|
662 |
+
"Copied %%bash\n",
|
663 |
+
"cat << 'EOT' > ds_config_zero3.json\n",
|
664 |
+
"{ \"fp16\" : { \"enabled\" : \"auto\" , \"loss_scale\" : 0 , \"loss_scale_window\" : 1000 , \"initial_scale_power\" : 16 , \"hysteresis\" : 2 , \"min_loss_scale\" : 1 }, \"optimizer\" : { \"type\" : \"AdamW\" , \"params\" : { \"lr\" : \"auto\" , \"betas\" : \"auto\" , \"eps\" : \"auto\" , \"weight_decay\" : \"auto\" }\n",
|
665 |
+
" }, \"scheduler\" : { \"type\" : \"WarmupLR\" , \"params\" : { \"warmup_min_lr\" : \"auto\" , \"warmup_max_lr\" : \"auto\" , \"warmup_num_steps\" : \"auto\" }\n",
|
666 |
+
" }, \"zero_optimization\" : { \"stage\" : 3 , \"offload_optimizer\" : { \"device\" : \"cpu\" , \"pin_memory\" : true\n",
|
667 |
+
" }, \"offload_param\" : { \"device\" : \"cpu\" , \"pin_memory\" : true\n",
|
668 |
+
" }, \"overlap_comm\" : true, \"contiguous_gradients\" : true, \"sub_group_size\" : 1e9 , \"reduce_bucket_size\" : \"auto\" , \"stage3_prefetch_bucket_size\" : \"auto\" , \"stage3_param_persistence_threshold\" : \"auto\" , \"stage3_max_live_parameters\" : 1e9 , \"stage3_max_reuse_distance\" : 1e9 , \"stage3_gather_16bit_weights_on_model_save\" : true\n",
|
669 |
+
" }, \"gradient_accumulation_steps\" : \"auto\" , \"gradient_clipping\" : \"auto\" , \"steps_per_print\" : 2000 , \"train_batch_size\" : \"auto\" , \"train_micro_batch_size_per_gpu\" : \"auto\" , \"wall_clock_breakdown\" : false\n",
|
670 |
+
"}\n",
|
671 |
+
"EOT\n",
|
672 |
+
"If the training script is in a file and not in a notebook cell, you can launch deepspeed normally from the shell in a notebook cell. For example, to launch run_translation.py :\n",
|
673 |
+
"Copied !git clone https://github.com/huggingface/transformers\n",
|
674 |
+
"!cd transformers; deepspeed examples/pytorch/translation/run_translation.py ...\n",
|
675 |
+
"You could also use %%bash magic and write multi-line code to run the shell program, but you won't be able to view the logs until training is complete. With %%bash magic, you don't need to emulate a distributed environment.\n",
|
676 |
+
"Copied %%bash\n",
|
677 |
+
"\n",
|
678 |
+
"git clone https://github.com/huggingface/transformers\n",
|
679 |
+
"cd transformers\n",
|
680 |
+
"deepspeed examples/pytorch/translation/run_translation.py ...\n",
|
681 |
+
"\n",
|
682 |
+
"DeepSpeed stores the main full precision fp32 weights in custom checkpoint optimizer files (the glob pattern looks like global_step*/*optim_states.pt ), which are saved under the normal checkpoint.\n",
|
683 |
+
"fp16 fp32\n",
|
684 |
+
"A model trained with ZeRO-2 saves the pytorch_model.bin weights in fp16. To save the model weights in fp16 for a model trained with ZeRO-3, you need to set \"stage3_gather_16bit_weights_on_model_save\": true because the model weights are partitioned across multiple GPUs. Otherwise, the Trainer won't save the weights in fp16 and it won't create a pytorch_model.bin file. This is because DeepSpeed's state_dict contains a placeholder instead of the real weights and you won't be able to load them. Copied { \"zero_optimization\": { \"stage3_gather_16bit_weights_on_model_save\": true }\n",
|
685 |
+
"}\n",
|
686 |
+
"\n",
|
687 |
+
"ZeRO Inference places the model weights in CPU or NVMe memory to avoid burdening the GPU which makes it possible to run inference with huge models on a GPU. Inference doesn't require any large additional amounts of memory for the optimizer states and gradients so you can fit much larger batches and/or sequence lengths on the same hardware.\n",
|
688 |
+
"ZeRO Inference shares the same configuration file as ZeRO-3 , and ZeRO-2 and ZeRO-1 configs won't work because they don't provide any benefits for inference.\n",
|
689 |
+
"To run ZeRO Inference, pass your usual training arguments to the TrainingArguments class and add the --do_eval argument.\n",
|
690 |
+
"Copied deepspeed --num_gpus=2 your_program.py <normal cl args> --do_eval --deepspeed ds_config.json\n",
|
691 |
+
"\n",
|
692 |
+
"DeepSpeed also works with Transformers without the Trainer class. This is handled by the HfDeepSpeedConfig which only takes care of gathering ZeRO-3 parameters and splitting a model across multiple GPUs when you call from_pretrained() .\n",
|
693 |
+
"If you want everything automatically taken care of for you, try using DeepSpeed with the Trainer ! You'll need to follow the DeepSpeed documentation , and manually configure the parameter values in the config file (you can't use the \"auto\" value).\n",
|
694 |
+
"To efficiently deploy ZeRO-3, you must instantiate the HfDeepSpeedConfig object before the model and keep that object alive:\n",
|
695 |
+
"pretrained model non-pretrained model\n",
|
696 |
+
"Copied from transformers.integrations import HfDeepSpeedConfig from transformers import AutoModel import deepspeed\n",
|
697 |
+
"\n",
|
698 |
+
"ds_config = {...} # deepspeed config object or path to the file # must run before instantiating the model to detect zero 3 dschf = HfDeepSpeedConfig(ds_config) # keep this object alive model = AutoModel.from_pretrained( \"openai-community/gpt2\" )\n",
|
699 |
+
"engine = deepspeed.initialize(model=model, config_params=ds_config, ...)\n",
|
700 |
+
"\n",
|
701 |
+
"To run ZeRO Inference without the Trainer in cases where you can't fit a model onto a single GPU, try using additional GPUs and/or offloading to CPU memory. The important nuance to understand here is that the way ZeRO is designed, you can process different inputs on different GPUs in parallel.\n",
|
702 |
+
"Make sure to:\n",
|
703 |
+
"disable CPU offload if you have enough GPU memory (since it slows things down). enable bf16 if you have an Ampere or newer GPU to make things faster. If you don't have one of these GPUs, you may enable fp16 as long as you don't use a model pretrained in bf16 (T5 models) because it may lead to an overflow error.\n",
|
704 |
+
"Take a look at the following script to get a better idea of how to run ZeRO Inference without the Trainer on a model that won't fit on a single GPU.\n",
|
705 |
+
"Copied #!/usr/bin/env python # This script demonstrates how to use Deepspeed ZeRO in an inference mode when one can't fit a model # into a single GPU # # 1. Use 1 GPU with CPU offload # 2. Or use multiple GPUs instead # # First you need to install deepspeed: pip install deepspeed # # Here we use a 3B \"bigscience/T0_3B\" model which needs about 15GB GPU RAM - so 1 largish or 2 # small GPUs can handle it. or 1 small GPU and a lot of CPU memory. # # To use a larger model like \"bigscience/T0\" which needs about 50GB, unless you have an 80GB GPU - # you will need 2-4 gpus. And then you can adapt the script to handle more gpus if you want to # process multiple inputs at once. # # The provided deepspeed config also activates CPU memory offloading, so chances are that if you # have a lot of available CPU memory and you don't mind a slowdown you should be able to load a # model that doesn't normally fit into a single GPU. If you have enough GPU memory the program will # run faster if you don't want offload to CPU - so disable that section then. # # To deploy on 1 gpu: # # deepspeed --num_gpus 1 t0.py # or: # python -m torch.distributed.run --nproc_per_node=1 t0.py # # To deploy on 2 gpus: # # deepspeed --num_gpus 2 t0.py # or: # python -m torch.distributed.run --nproc_per_node=2 t0.py from transformers import AutoTokenizer, AutoConfig, AutoModelForSeq2SeqLM from transformers.integrations import HfDeepSpeedConfig import deepspeed import os import torch\n",
|
706 |
+
"\n",
|
707 |
+
"os.environ[ \"TOKENIZERS_PARALLELISM\" ] = \"false\" # To avoid warnings about parallelism in tokenizers # distributed setup local_rank = int (os.getenv( \"LOCAL_RANK\" , \"0\" ))\n",
|
708 |
+
"world_size = int (os.getenv( \"WORLD_SIZE\" , \"1\" ))\n",
|
709 |
+
"torch.cuda.set_device(local_rank)\n",
|
710 |
+
"deepspeed.init_distributed()\n",
|
711 |
+
"\n",
|
712 |
+
"model_name = \"bigscience/T0_3B\" config = AutoConfig.from_pretrained(model_name)\n",
|
713 |
+
"model_hidden_size = config.d_model # batch size has to be divisible by world_size, but can be bigger than world_size train_batch_size = 1 * world_size # ds_config notes # # - enable bf16 if you use Ampere or higher GPU - this will run in mixed precision and will be # faster. # # - for older GPUs you can enable fp16, but it'll only work for non-bf16 pretrained models - e.g. # all official t5 models are bf16-pretrained # # - set offload_param.device to \"none\" or completely remove the `offload_param` section if you don't # - want CPU offload # # - if using `offload_param` you can manually finetune stage3_param_persistence_threshold to control # - which params should remain on gpus - the larger the value the smaller the offload size # # For in-depth info on Deepspeed config see # https://huggingface.co/docs/transformers/main/main_classes/deepspeed # keeping the same format as json for consistency, except it uses lower case for true/false # fmt: off ds_config = { \"fp16\" : { \"enabled\" : False }, \"bf16\" : { \"enabled\" : False }, \"zero_optimization\" : { \"stage\" : 3 , \"offload_param\" : { \"device\" : \"cpu\" , \"pin_memory\" : True }, \"overlap_comm\" : True , \"contiguous_gradients\" : True , \"reduce_bucket_size\" : model_hidden_size * model_hidden_size, \"stage3_prefetch_bucket_size\" : 0.9 * model_hidden_size * model_hidden_size, \"stage3_param_persistence_threshold\" : 10 * model_hidden_size\n",
|
714 |
+
" }, \"steps_per_print\" : 2000 , \"train_batch_size\" : train_batch_size, \"train_micro_batch_size_per_gpu\" : 1 , \"wall_clock_breakdown\" : False } # fmt: on # next line instructs transformers to partition the model directly over multiple gpus using # deepspeed.zero.Init when model's `from_pretrained` method is called. # # **it has to be run before loading the model AutoModelForSeq2SeqLM.from_pretrained(model_name)** # # otherwise the model will first be loaded normally and only partitioned at forward time which is # less efficient and when there is little CPU RAM may fail dschf = HfDeepSpeedConfig(ds_config) # keep this object alive # now a model can be loaded. model = AutoModelForSeq2SeqLM.from_pretrained(model_name) # initialise Deepspeed ZeRO and store only the engine object ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[ 0 ]\n",
|
715 |
+
"ds_engine.module. eval () # inference # Deepspeed ZeRO can process unrelated inputs on each GPU. So for 2 gpus you process 2 inputs at once. # If you use more GPUs adjust for more. # And of course if you have just one input to process you then need to pass the same string to both gpus # If you use only one GPU, then you will have only rank 0. rank = torch.distributed.get_rank() if rank == 0 :\n",
|
716 |
+
" text_in = \"Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy\" elif rank == 1 :\n",
|
717 |
+
" text_in = \"Is this review positive or negative? Review: this is the worst restaurant ever\" tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
|
718 |
+
"inputs = tokenizer.encode(text_in, return_tensors= \"pt\" ).to(device=local_rank) with torch.no_grad():\n",
|
719 |
+
" outputs = ds_engine.module.generate(inputs, synced_gpus= True )\n",
|
720 |
+
"text_out = tokenizer.decode(outputs[ 0 ], skip_special_tokens= True ) print ( f\"rank {rank} :\\n in= {text_in} \\n out= {text_out} \" )\n",
|
721 |
+
"Save the script as t0.py and launch it:\n",
|
722 |
+
"Copied $ deepspeed --num_gpus 2 t0.py\n",
|
723 |
+
"rank0: in =Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy\n",
|
724 |
+
" out=Positive\n",
|
725 |
+
"rank1: in =Is this review positive or negative? Review: this is the worst restaurant ever\n",
|
726 |
+
" out=negative\n",
|
727 |
+
"This is a very basic example and you'll want to adapt it to your use case.\n",
|
728 |
+
"\n",
|
729 |
+
"Using multiple GPUs with ZeRO-3 for generation requires synchronizing the GPUs by setting synced_gpus=True in the generate() method. Otherwise, if one GPU is finished generating before another one, the whole system hangs because the remaining GPUs haven't received the weight shard from the GPU that finished first.\n",
|
730 |
+
"For Transformers>=4.28, synced_gpus is automatically set to True if multiple GPUs are detected during generation.\n",
|
731 |
+
"\n",
|
732 |
+
"When you encounter an issue, you should consider whether DeepSpeed is the cause of the problem because often it isn't (unless it's super obvious and you can see DeepSpeed modules in the exception)! The first step should be to retry your setup without DeepSpeed, and if the problem persists, then you can report the issue. If the issue is a core DeepSpeed problem and unrelated to the Transformers integration, open an Issue on the DeepSpeed repository .\n",
|
733 |
+
"For issues related to the Transformers integration, please provide the following information:\n",
|
734 |
+
"the full DeepSpeed config file the command line arguments of the Trainer , or TrainingArguments arguments if you're scripting the Trainer setup yourself (don't dump the TrainingArguments which has dozens of irrelevant entries) the outputs of:\n",
|
735 |
+
"Copied python -c 'import torch; print(f\"torch: {torch.__version__}\")' python -c 'import transformers; print(f\"transformers: {transformers.__version__}\")' python -c 'import deepspeed; print(f\"deepspeed: {deepspeed.__version__}\")'\n",
|
736 |
+
"a link to a Google Colab notebook to reproduce the issue if impossible, a standard and non-custom dataset we can use and also try to use an existing example to reproduce the issue with\n",
|
737 |
+
"The following sections provide a guide for resolving two of the most common issues.\n",
|
738 |
+
"\n",
|
739 |
+
"When the DeepSpeed process is killed during launch without a traceback, that usually means the program tried to allocate more CPU memory than your system has or your process tried to allocate more CPU memory than allowed leading the OS kernel to terminate the process. In this case, check whether your configuration file has either offload_optimizer , offload_param or both configured to offload to the CPU.\n",
|
740 |
+
"If you have NVMe and ZeRO-3 setup, experiment with offloading to the NVMe ( estimate the memory requirements for your model).\n",
|
741 |
+
"\n",
|
742 |
+
"NaN loss often occurs when a model is pretrained in bf16 and then you try to use it with fp16 (especially relevant for TPU trained models). To resolve this, use fp32 or bf16 if your hardware supports it (TPU, Ampere GPUs or newer).\n",
|
743 |
+
"The other issue may be related to using fp16. For example, if this is your fp16 configuration:\n",
|
744 |
+
"Copied { \"fp16\": { \"enabled\": \"auto\" , \"loss_scale\": 0 , \"loss_scale_window\": 1000 , \"initial_scale_power\": 16 , \"hysteresis\": 2 , \"min_loss_scale\": 1 }\n",
|
745 |
+
"}\n",
|
746 |
+
"You might see the following OVERFLOW! messages in the logs:\n",
|
747 |
+
"Copied 0%| | 0/189 [00:00<?, ?it/s]\n",
|
748 |
+
" [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 262144\n",
|
749 |
+
" 1%|β | 1/189 [00:00<01:26, 2.17it/s]\n",
|
750 |
+
" [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072.0\n",
|
751 |
+
" 1%|ββ\n",
|
752 |
+
" [...]\n",
|
753 |
+
" [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1\n",
|
754 |
+
" 14%|βββββββββββββββββ | 27/189 [00:14<01:13, 2.21it/s]\n",
|
755 |
+
" [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1\n",
|
756 |
+
" 15%|ββββββββββββββββββ | 28/189 [00:14<01:13, 2.18it/s]\n",
|
757 |
+
" [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1\n",
|
758 |
+
" 15%|ββββββββββββββββββ | 29/189 [00:15<01:13, 2.18it/s]\n",
|
759 |
+
" [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1\n",
|
760 |
+
"[...]\n",
|
761 |
+
"This means the DeepSpeed loss scaler is unable to find a scaling coefficient to overcome loss overflow. To fix it, try a higher initial_scale_power value (32 usually works).\n",
|
762 |
+
"\n",
|
763 |
+
"DeepSpeed ZeRO is a powerful technology for training and loading very large models for inference with limited GPU resources, making it more accessible to everyone. To learn more about DeepSpeed, feel free to read the blog posts , documentation , and GitHub repository .\n",
|
764 |
+
"The following papers are also a great resource for learning more about ZeRO:\n",
|
765 |
+
"ZeRO: Memory Optimizations Toward Training Trillion Parameter Models ZeRO-Offload: Democratizing Billion-Scale Model Training ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning\n",
|
766 |
+
"< > Update on GitHub\n",
|
767 |
+
"HTML_TAG_END\n"
|
768 |
+
]
|
769 |
+
}
|
770 |
+
],
|
771 |
+
"source": [
|
772 |
+
"print(\n",
|
773 |
+
" \"The LLM sees this: \\n\",\n",
|
774 |
+
" documents[0].get_content(metadata_mode=MetadataMode.LLM),\n",
|
775 |
+
")\n",
|
776 |
+
"print(\n",
|
777 |
+
" \"The Embedding model sees this: \\n\",\n",
|
778 |
+
" documents[0].get_content(metadata_mode=MetadataMode.EMBED),\n",
|
779 |
+
")"
|
780 |
+
]
|
781 |
+
},
|
782 |
{
|
783 |
"cell_type": "code",
|
784 |
"execution_count": 4,
|
|
|
858 |
")"
|
859 |
]
|
860 |
},
|
861 |
+
{
|
862 |
+
"cell_type": "code",
|
863 |
+
"execution_count": null,
|
864 |
+
"metadata": {},
|
865 |
+
"outputs": [],
|
866 |
+
"source": [
|
867 |
+
"print(\n",
|
868 |
+
" \"The LLM sees this: \\n\",\n",
|
869 |
+
" document.get_content(metadata_mode=MetadataMode.LLM),\n",
|
870 |
+
")\n",
|
871 |
+
"print(\n",
|
872 |
+
" \"The Embedding model sees this: \\n\",\n",
|
873 |
+
" document.get_content(metadata_mode=MetadataMode.EMBED),\n",
|
874 |
+
")"
|
875 |
+
]
|
876 |
+
},
|
877 |
{
|
878 |
"cell_type": "code",
|
879 |
"execution_count": null,
|
scripts/custom_retriever.py
ADDED
@@ -0,0 +1,65 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import logging
|
2 |
+
from typing import List
|
3 |
+
|
4 |
+
from llama_index.core import QueryBundle
|
5 |
+
from llama_index.core.retrievers import BaseRetriever, VectorIndexRetriever
|
6 |
+
from llama_index.core.schema import NodeWithScore, TextNode
|
7 |
+
|
8 |
+
logger = logging.getLogger(__name__)
|
9 |
+
logging.basicConfig(level=logging.INFO)
|
10 |
+
|
11 |
+
|
12 |
+
class CustomRetriever(BaseRetriever):
|
13 |
+
    """Custom retriever that performs semantic (vector) search and, for nodes flagged with retrieve_doc, replaces the retrieved chunk with its full source document."""
|
14 |
+
|
15 |
+
def __init__(
|
16 |
+
self,
|
17 |
+
vector_retriever: VectorIndexRetriever,
|
18 |
+
document_dict: dict,
|
19 |
+
) -> None:
|
20 |
+
"""Init params."""
|
21 |
+
|
22 |
+
self._vector_retriever = vector_retriever
|
23 |
+
self._document_dict = document_dict
|
24 |
+
super().__init__()
|
25 |
+
|
26 |
+
def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
|
27 |
+
"""Retrieve nodes given query."""
|
28 |
+
|
29 |
+
logger.info(f"Retrieving nodes for query: {query_bundle.query_str}")
|
30 |
+
|
31 |
+
nodes = self._vector_retriever.retrieve(query_bundle)
|
32 |
+
|
33 |
+
# Filter out nodes with the same ref_doc_id
|
34 |
+
def filter_nodes_by_unique_doc_id(nodes):
|
35 |
+
unique_nodes = {}
|
36 |
+
for node in nodes:
|
37 |
+
doc_id = node.node.ref_doc_id
|
38 |
+
if doc_id is not None and doc_id not in unique_nodes:
|
39 |
+
unique_nodes[doc_id] = node
|
40 |
+
return list(unique_nodes.values())
|
41 |
+
|
42 |
+
nodes = filter_nodes_by_unique_doc_id(nodes)
|
43 |
+
print(f"number of nodes after filtering: {len(nodes)}")
|
44 |
+
|
45 |
+
nodes_context = []
|
46 |
+
for node in nodes:
|
47 |
+
# print("Node ID\t", node.node_id)
|
48 |
+
# print("Title\t", node.metadata["title"])
|
49 |
+
# print("Text\t", node.text)
|
50 |
+
# print("Score\t", node.score)
|
51 |
+
# print("Metadata\t", node.metadata)
|
52 |
+
# print("-_" * 20)
|
53 |
+
if node.metadata["retrieve_doc"] == True:
|
54 |
+
# print("This node will be replaced by the document")
|
55 |
+
doc = self._document_dict[node.node.ref_doc_id]
|
56 |
+
# print(doc.text)
|
57 |
+
new_node = NodeWithScore(
|
58 |
+
node=TextNode(text=doc.text, metadata=node.metadata),
|
59 |
+
score=node.score,
|
60 |
+
)
|
61 |
+
nodes_context.append(new_node)
|
62 |
+
else:
|
63 |
+
nodes_context.append(node)
|
64 |
+
|
65 |
+
return nodes_context
|
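A rough usage sketch for the retriever above, mirroring how gradio-ui.py wires it up. The Chroma path and pickle path follow the repository layout shown elsewhere in this commit; the collection name and query string are assumptions for illustration only.

import pickle

import chromadb
from custom_retriever import CustomRetriever
from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore

# Open the persisted Chroma database built by create_db.ipynb
db = chromadb.PersistentClient(path="scripts/ai-tutor-vector-db")
chroma_collection = db.get_or_create_collection("ai-tutor-vector-db")  # collection name is an assumption
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

# Same retriever settings as in gradio-ui.py
vector_retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
    use_async=True,
    embed_model=OpenAIEmbedding(model="text-embedding-3-large", mode="similarity"),
)

with open("scripts/ai-tutor-vector-db/document_dict.pkl", "rb") as f:
    document_dict = pickle.load(f)  # maps ref_doc_id -> full source Document

retriever = CustomRetriever(vector_retriever, document_dict)
nodes = retriever.retrieve("What is retrieval-augmented generation?")  # returns List[NodeWithScore]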
scripts/gradio-ui.py
CHANGED
@@ -7,25 +7,57 @@ from typing import Optional
|
|
7 |
|
8 |
import chromadb
|
9 |
import gradio as gr
|
|
|
10 |
from dotenv import load_dotenv
|
11 |
from llama_index.agent.openai import OpenAIAgent
|
12 |
from llama_index.core import VectorStoreIndex, get_response_synthesizer
|
|
|
|
|
|
|
|
|
|
|
|
|
13 |
from llama_index.core.data_structs import Node
|
|
|
14 |
from llama_index.core.node_parser import SentenceSplitter
|
|
|
|
|
15 |
from llama_index.core.schema import BaseNode, MetadataMode, NodeWithScore, TextNode
|
|
|
|
|
|
|
|
|
|
|
|
|
16 |
from llama_index.embeddings.openai import OpenAIEmbedding
|
17 |
from llama_index.llms.gemini import Gemini
|
18 |
from llama_index.llms.openai import OpenAI
|
|
|
19 |
from llama_index.vector_stores.chroma import ChromaVectorStore
|
20 |
from tutor_prompts import (
|
21 |
TEXT_QA_TEMPLATE,
|
22 |
QueryValidation,
|
23 |
system_message_openai_agent,
|
24 |
system_message_validation,
|
|
|
25 |
)
|
26 |
|
27 |
load_dotenv(".env")
|
28 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
29 |
# from utils import init_mongo_db
|
30 |
|
31 |
logger = logging.getLogger(__name__)
|
@@ -35,7 +67,7 @@ logging.getLogger("httpx").setLevel(logging.WARNING)
|
|
35 |
|
36 |
# # These variables are used to intercept API calls
|
37 |
# # launch mitmweb
|
38 |
-
# cert_file = "/Users/omar/
|
39 |
# os.environ["REQUESTS_CA_BUNDLE"] = cert_file
|
40 |
# os.environ["SSL_CERT_FILE"] = cert_file
|
41 |
# os.environ["HTTPS_PROXY"] = "http://127.0.0.1:8080"
|
@@ -100,17 +132,21 @@ index = VectorStoreIndex.from_vector_store(
|
|
100 |
show_progress=True,
|
101 |
use_async=True,
|
102 |
)
|
103 |
-
|
104 |
-
|
105 |
similarity_top_k=10,
|
106 |
use_async=True,
|
107 |
embed_model=OpenAIEmbedding(model="text-embedding-3-large", mode="similarity"),
|
108 |
)
|
109 |
|
|
|
|
|
110 |
|
111 |
with open("scripts/ai-tutor-vector-db/document_dict.pkl", "rb") as f:
|
112 |
document_dict = pickle.load(f)
|
113 |
|
|
|
|
|
114 |
|
115 |
def format_sources(completion) -> str:
|
116 |
if len(completion.source_nodes) == 0:
|
@@ -151,7 +187,9 @@ def add_sources(answer_str, completion):
|
|
151 |
if formatted_sources == "":
|
152 |
yield answer_str
|
153 |
|
154 |
-
|
|
|
|
|
155 |
yield answer_str
|
156 |
|
157 |
|
@@ -165,38 +203,6 @@ def generate_completion(
|
|
165 |
print(f"query: {query}")
|
166 |
print(model)
|
167 |
print(sources)
|
168 |
-
nodes = retriever.retrieve(query)
|
169 |
-
|
170 |
-
# Filter out nodes with the same ref_doc_id
|
171 |
-
def filter_nodes_by_unique_doc_id(nodes):
|
172 |
-
unique_nodes = {}
|
173 |
-
for node in nodes:
|
174 |
-
doc_id = node.node.ref_doc_id
|
175 |
-
if doc_id is not None and doc_id not in unique_nodes:
|
176 |
-
unique_nodes[doc_id] = node
|
177 |
-
return list(unique_nodes.values())
|
178 |
-
|
179 |
-
nodes = filter_nodes_by_unique_doc_id(nodes)
|
180 |
-
print(f"number of nodes after filtering: {len(nodes)}")
|
181 |
-
|
182 |
-
nodes_context = []
|
183 |
-
for node in nodes:
|
184 |
-
print("Node ID\t", node.node_id)
|
185 |
-
print("Title\t", node.metadata["title"])
|
186 |
-
print("Text\t", node.text)
|
187 |
-
print("Score\t", node.score)
|
188 |
-
print("Metadata\t", node.metadata)
|
189 |
-
print("-_" * 20)
|
190 |
-
if node.metadata["retrieve_doc"] == True:
|
191 |
-
print("This node will be replaced by the document")
|
192 |
-
doc = document_dict[node.node.ref_doc_id]
|
193 |
-
print(doc.text)
|
194 |
-
new_node = NodeWithScore(
|
195 |
-
node=TextNode(text=doc.text, metadata=node.metadata), score=node.score
|
196 |
-
)
|
197 |
-
nodes_context.append(new_node)
|
198 |
-
else:
|
199 |
-
nodes_context.append(node)
|
200 |
|
201 |
if model == "gemini-1.5-flash" or model == "gemini-1.5-pro":
|
202 |
llm = Gemini(
|
@@ -215,7 +221,73 @@ def generate_completion(
|
|
215 |
streaming=True,
|
216 |
)
|
217 |
|
218 |
-
completion = response_synthesizer.synthesize(query, nodes=nodes_context)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
219 |
|
220 |
answer_str = ""
|
221 |
for token in completion.response_gen:
|
@@ -246,9 +318,11 @@ model = gr.Dropdown(
|
|
246 |
"gemini-1.5-pro",
|
247 |
"gemini-1.5-flash",
|
248 |
"gpt-3.5-turbo",
|
|
|
|
|
249 |
],
|
250 |
label="Model",
|
251 |
-
value="
|
252 |
interactive=True,
|
253 |
)
|
254 |
|
|
|
7 |
|
8 |
import chromadb
|
9 |
import gradio as gr
|
10 |
+
from custom_retriever import CustomRetriever
|
11 |
from dotenv import load_dotenv
|
12 |
from llama_index.agent.openai import OpenAIAgent
|
13 |
from llama_index.core import VectorStoreIndex, get_response_synthesizer
|
14 |
+
from llama_index.core.agent import AgentRunner, ReActAgent
|
15 |
+
from llama_index.core.chat_engine import (
|
16 |
+
CondensePlusContextChatEngine,
|
17 |
+
CondenseQuestionChatEngine,
|
18 |
+
ContextChatEngine,
|
19 |
+
)
|
20 |
from llama_index.core.data_structs import Node
|
21 |
+
from llama_index.core.memory import ChatMemoryBuffer
|
22 |
from llama_index.core.node_parser import SentenceSplitter
|
23 |
+
from llama_index.core.query_engine import RetrieverQueryEngine
|
24 |
+
from llama_index.core.retrievers import VectorIndexRetriever
|
25 |
from llama_index.core.schema import BaseNode, MetadataMode, NodeWithScore, TextNode
|
26 |
+
from llama_index.core.tools import (
|
27 |
+
FunctionTool,
|
28 |
+
QueryEngineTool,
|
29 |
+
RetrieverTool,
|
30 |
+
ToolMetadata,
|
31 |
+
)
|
32 |
from llama_index.embeddings.openai import OpenAIEmbedding
|
33 |
from llama_index.llms.gemini import Gemini
|
34 |
from llama_index.llms.openai import OpenAI
|
35 |
+
from llama_index.llms.openai.utils import GPT4_MODELS
|
36 |
from llama_index.vector_stores.chroma import ChromaVectorStore
|
37 |
from tutor_prompts import (
|
38 |
TEXT_QA_TEMPLATE,
|
39 |
QueryValidation,
|
40 |
system_message_openai_agent,
|
41 |
system_message_validation,
|
42 |
+
system_prompt,
|
43 |
)
|
44 |
|
45 |
load_dotenv(".env")
|
46 |
|
47 |
+
GPT4_MODELS.update(
|
48 |
+
{
|
49 |
+
"gpt-4-1106-preview": 128000,
|
50 |
+
"gpt-4-0125-preview": 128000,
|
51 |
+
"gpt-4-turbo-preview": 128000,
|
52 |
+
"gpt-4-turbo-2024-04-09": 128000,
|
53 |
+
"gpt-4-turbo": 128000,
|
54 |
+
"gpt-4o": 128000,
|
55 |
+
"gpt-4o-2024-05-13": 128000,
|
56 |
+
"gpt-4o-mini": 128000,
|
57 |
+
# Add any other models you need
|
58 |
+
}
|
59 |
+
)
|
60 |
+
|
61 |
# from utils import init_mongo_db
|
62 |
|
63 |
logger = logging.getLogger(__name__)
|
|
|
67 |
|
68 |
# # These variables are used to intercept API calls
|
69 |
# # launch mitmweb
|
70 |
+
# cert_file = "/Users/omar/Documents/mitmproxy-ca-cert.pem"
|
71 |
# os.environ["REQUESTS_CA_BUNDLE"] = cert_file
|
72 |
# os.environ["SSL_CERT_FILE"] = cert_file
|
73 |
# os.environ["HTTPS_PROXY"] = "http://127.0.0.1:8080"
|
|
|
132 |
show_progress=True,
|
133 |
use_async=True,
|
134 |
)
|
135 |
+
vector_retriever = VectorIndexRetriever(
|
136 |
+
index=index,
|
137 |
similarity_top_k=10,
|
138 |
use_async=True,
|
139 |
embed_model=OpenAIEmbedding(model="text-embedding-3-large", mode="similarity"),
|
140 |
)
|
141 |
|
142 |
+
memory = ChatMemoryBuffer.from_defaults(token_limit=150000)
|
143 |
+
|
144 |
|
145 |
with open("scripts/ai-tutor-vector-db/document_dict.pkl", "rb") as f:
|
146 |
document_dict = pickle.load(f)
|
147 |
|
148 |
+
custom_retriever = CustomRetriever(vector_retriever, document_dict)
|
149 |
+
|
150 |
|
151 |
def format_sources(completion) -> str:
|
152 |
if len(completion.source_nodes) == 0:
|
|
|
187 |
if formatted_sources == "":
|
188 |
yield answer_str
|
189 |
|
190 |
+
if formatted_sources != "":
|
191 |
+
answer_str += "\n\n" + formatted_sources
|
192 |
+
|
193 |
yield answer_str
|
194 |
|
195 |
|
|
|
203 |
print(f"query: {query}")
|
204 |
print(model)
|
205 |
print(sources)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
206 |
|
207 |
if model == "gemini-1.5-flash" or model == "gemini-1.5-pro":
|
208 |
llm = Gemini(
|
|
|
221 |
streaming=True,
|
222 |
)
|
223 |
|
224 |
+
# completion = response_synthesizer.synthesize(query, nodes=nodes_context)
|
225 |
+
custom_query_engine = RetrieverQueryEngine(
|
226 |
+
retriever=custom_retriever,
|
227 |
+
response_synthesizer=response_synthesizer,
|
228 |
+
)
|
229 |
+
|
230 |
+
# agent = CondensePlusContextChatEngine.from_defaults(
|
231 |
+
# agent = CondenseQuestionChatEngine.from_defaults(
|
232 |
+
|
233 |
+
# agent = ContextChatEngine.from_defaults(
|
234 |
+
# retriever=custom_retriever,
|
235 |
+
# context_template=system_prompt,
|
236 |
+
# llm=llm,
|
237 |
+
# memory=memory,
|
238 |
+
# verbose=True,
|
239 |
+
# )
|
240 |
+
|
241 |
+
query_engine_tools = [
|
242 |
+
RetrieverTool(
|
243 |
+
retriever=custom_retriever,
|
244 |
+
metadata=ToolMetadata(
|
245 |
+
name="AI_information",
|
246 |
+
description="""Only use this tool if necessary. The 'AI_information' tool is a comprehensive repository for information in artificial intelligence (AI). When utilizing this tool, the input should be the user's question rewritten as a statement. The input can also be adapted to focus on specific aspects or further details of the current topic under discussion. This dynamic input approach allows for a tailored exploration of AI subjects, ensuring that responses are relevant and informative. Employ this tool to fetch nuanced information on topics such as model training, fine-tuning, and LLM augmentation, thereby facilitating a rich, context-aware dialogue. """,
|
247 |
+
),
|
248 |
+
)
|
249 |
+
]
|
250 |
+
# query_engine_tools = [
|
251 |
+
# QueryEngineTool(
|
252 |
+
# query_engine=custom_query_engine,
|
253 |
+
# metadata=ToolMetadata(
|
254 |
+
# name="AI_information",
|
255 |
+
# description="""Only use this tool if necessary. The 'AI_information' tool is a comprehensive repository for information in artificial intelligence (AI). When utilizing this tool, the input should be the user's question rewritten as a statement. The input can also be adapted to focus on specific aspects or further details of the current topic under discussion. This dynamic input approach allows for a tailored exploration of AI subjects, ensuring that responses are relevant and informative. Employ this tool to fetch nuanced information on topics such as model training, fine-tuning, and LLM augmentation, thereby facilitating a rich, context-aware dialogue. """,
|
256 |
+
# ),
|
257 |
+
# )
|
258 |
+
# ]
|
259 |
+
|
260 |
+
if model == "gemini-1.5-flash" or model == "gemini-1.5-pro":
|
261 |
+
# agent = AgentRunner.from_llm(
|
262 |
+
# llm=llm,
|
263 |
+
# tools=query_engine_tools,
|
264 |
+
# verbose=True,
|
265 |
+
# memory=memory,
|
266 |
+
# # system_prompt=system_message_openai_agent,
|
267 |
+
# )
|
268 |
+
agent = ReActAgent.from_tools(
|
269 |
+
llm=llm,
|
270 |
+
memory=memory,
|
271 |
+
tools=query_engine_tools,
|
272 |
+
verbose=True,
|
273 |
+
# system_prompt=system_message_openai_agent,
|
274 |
+
)
|
275 |
+
prompts = agent._get_prompt_modules()
|
276 |
+
print(prompts.values())
|
277 |
+
else:
|
278 |
+
agent = OpenAIAgent.from_tools(
|
279 |
+
llm=llm,
|
280 |
+
memory=memory,
|
281 |
+
tools=query_engine_tools,
|
282 |
+
verbose=True,
|
283 |
+
system_prompt=system_message_openai_agent,
|
284 |
+
)
|
285 |
+
|
286 |
+
# completion = custom_query_engine.query(query)
|
287 |
+
completion = agent.stream_chat(query)
|
288 |
+
|
289 |
+
# completion = agent.chat(query)
|
290 |
+
# return str(completion)
|
291 |
|
292 |
answer_str = ""
|
293 |
for token in completion.response_gen:
|
|
|
318 |
"gemini-1.5-pro",
|
319 |
"gemini-1.5-flash",
|
320 |
"gpt-3.5-turbo",
|
321 |
+
"gpt-4o-mini",
|
322 |
+
"gpt-4o",
|
323 |
],
|
324 |
label="Model",
|
325 |
+
value="gpt-4o-mini",
|
326 |
interactive=True,
|
327 |
)
|
328 |
|
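Pieced together from the hunks above, the non-Gemini completion path in gradio-ui.py now reduces to roughly the following sketch. The model name, temperature, tool description and query are illustrative; custom_retriever and system_message_openai_agent are the objects defined in the files of this commit.

from llama_index.agent.openai import OpenAIAgent
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core.tools import RetrieverTool, ToolMetadata
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini", temperature=0)  # illustrative settings
memory = ChatMemoryBuffer.from_defaults(token_limit=150000)

tools = [
    RetrieverTool(
        retriever=custom_retriever,  # CustomRetriever from custom_retriever.py
        metadata=ToolMetadata(name="AI_information", description="Knowledge base about AI."),
    )
]

agent = OpenAIAgent.from_tools(
    llm=llm,
    memory=memory,
    tools=tools,
    system_prompt=system_message_openai_agent,
    verbose=True,
)

completion = agent.stream_chat("What is fine-tuning?")
answer_str = ""
for token in completion.response_gen:  # stream tokens back to the Gradio UI
    answer_str += token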
scripts/tutor_prompts.py
CHANGED
@@ -14,27 +14,27 @@ default_user_prompt = (
|
|
14 |
system_prompt = (
|
15 |
"You are an AI teacher, answering questions from students of an applied artificial intelligence course on Large Language Models (LLMs or LLM). "
|
16 |
"Your answers are aimed to teach students, so they should be complete, clear, and easy to understand. "
|
17 |
-
"Topics covered include training models, fine-tuning models, giving 'memory' to LLMs, prompting, hallucinations and bias, vector databases, transformer architectures, embeddings, RAG frameworks, Langchain, Llama-Index, LLMs interact with tool use, AI agents, reinforcement learning with human feedback. Understand the questions with this context."
|
18 |
"You are provided information in Hugging Face's documentation and a RAG course. "
|
19 |
-
"Only some information might be relevant to the question, so ignore the irrelevant part and use the relevant part to answer the question."
|
20 |
"Formulate your answer with the information given to you below. DO NOT use additional information, even if you know the answer. "
|
21 |
-
"If the answer is somewhere in the documentation below, answer the question, depending on the question and the variety of relevant information in the documentation, give complete and helpful answers."
|
22 |
-
"If code is provided in the information, share it with the students. It's important to provide complete code blocks."
|
23 |
"Here is the information you can use, the order is not important: \n\n"
|
24 |
"---------------------\n"
|
25 |
"{context_str}\n"
|
26 |
"---------------------\n\n"
|
27 |
"REMEMBER:\n"
|
28 |
-
"You are an AI teacher, answering questions from students of an applied artificial intelligence course on Large Language Models (LLMs or llm). Topics covered include training models, fine tuning models, giving memory to LLMs, prompting, hallucinations and bias, vector databases, transformer architectures, embeddings, RAG frameworks, Langchain, making LLMs interact with tool use, AI agents, reinforcement learning with human feedback. Questions should be understood with this context."
|
29 |
"Your answers are aimed to teach students, so they should be complete, clear, and easy to understand. "
|
30 |
"You are provided information found in Hugging Face's documentation and a RAG course. "
|
31 |
-
"Here are the rules you must follow
|
32 |
"* Only respond with information inside the documentation. DO NOT provide additional information, even if you know the answer. "
|
33 |
"* If the answer is in the documentation, answer the question (depending on the question and the variety of relevant information in the documentation). Your answer needs to give a clear and complete explanation as if you were a teacher. "
|
34 |
"* Do not refer to the documentation directly, but use the information provided within it to answer questions. "
|
35 |
-
"* Do not reference any links, urls or hyperlinks in your answers.\n"
|
36 |
-
"* If code is provided in the information, share it with the students. It's important to provide complete code blocks so they can execute it.\n"
|
37 |
-
"* Make sure to format your answers in Markdown format, including code block and snippets.\n"
|
38 |
"Now answer the following question: \n"
|
39 |
)
|
40 |
|
@@ -82,11 +82,15 @@ class QueryValidation(BaseModel):
|
|
82 |
)
|
83 |
|
84 |
|
85 |
-
system_message_openai_agent = """You are
|
86 |
|
87 |
-
|
88 |
|
89 |
-
AI_information
|
|
|
|
|
|
|
|
|
90 |
|
91 |
Your responses are exclusively based on the output provided by the AI_information tool. Refrain from incorporating external knowledge or information not directly obtained from the tool's responses.
|
92 |
|
@@ -95,4 +99,10 @@ When the conversation deepens or shifts focus within a topic, adapt your inquiri
|
|
95 |
Provide comprehensive answers, ideally structured in up to ten paragraphs, drawing from the variety of relevant details furnished by the tool. The depth and breadth of your responses should align with the scope and specificity of the information retrieved.
|
96 |
|
97 |
Should the AI_information tool's repository lack information on the queried topic, politely inform the user that the question transcends the bounds of your current knowledge base, citing the absence of relevant content in the tool's documentation.
|
|
|
|
|
|
|
|
|
|
|
|
|
98 |
"""
|
|
|
14 |
system_prompt = (
|
15 |
"You are an AI teacher, answering questions from students of an applied artificial intelligence course on Large Language Models (LLMs or LLM). "
|
16 |
"Your answers are aimed to teach students, so they should be complete, clear, and easy to understand. "
|
17 |
+
"Topics covered include training models, fine-tuning models, giving 'memory' to LLMs, prompting, hallucinations and bias, vector databases, transformer architectures, embeddings, RAG frameworks, Langchain, Llama-Index, making LLMs interact with tool use, AI agents, reinforcement learning with human feedback. Understand the questions with this context. "
|
18 |
"You are provided information in Hugging Face's documentation and a RAG course. "
|
19 |
+
"Only some information might be relevant to the question, so ignore the irrelevant part and use the relevant part to answer the question. "
|
20 |
"Formulate your answer with the information given to you below. DO NOT use additional information, even if you know the answer. "
|
21 |
+
"If the answer is somewhere in the documentation below, answer the question. Depending on the question and the variety of relevant information in the documentation, give complete and helpful answers. "
|
22 |
+
"If code is provided in the information, share it with the students. It's important to provide complete code blocks. "
|
23 |
"Here is the information you can use, the order is not important: \n\n"
|
24 |
"---------------------\n"
|
25 |
"{context_str}\n"
|
26 |
"---------------------\n\n"
|
27 |
"REMEMBER:\n"
|
28 |
+
"You are an AI teacher, answering questions from students of an applied artificial intelligence course on Large Language Models (LLMs or llm). Topics covered include training models, fine tuning models, giving memory to LLMs, prompting, hallucinations and bias, vector databases, transformer architectures, embeddings, RAG frameworks, Langchain, making LLMs interact with tool use, AI agents, reinforcement learning with human feedback. Questions should be understood with this context. "
|
29 |
"Your answers are aimed to teach students, so they should be complete, clear, and easy to understand. "
|
30 |
"You are provided information found in Hugging Face's documentation and a RAG course. "
|
31 |
+
"Here are the rules you must follow: \n"
|
32 |
"* Only respond with information inside the documentation. DO NOT provide additional information, even if you know the answer. "
|
33 |
"* If the answer is in the documentation, answer the question (depending on the question and the variety of relevant information in the documentation). Your answer needs to give a clear and complete explanation as if you were a teacher. "
|
34 |
"* Do not refer to the documentation directly, but use the information provided within it to answer questions. "
|
35 |
+
"* Do not reference any links, urls or hyperlinks in your answers.\n "
|
36 |
+
"* If code is provided in the information, share it with the students. It's important to provide complete code blocks so they can execute it.\n "
|
37 |
+
"* Make sure to format your answers in Markdown format, including code block and snippets.\n "
|
38 |
"Now answer the following question: \n"
|
39 |
)
|
40 |
|
|
|
82 |
)
|
83 |
|
84 |
|
85 |
+
system_message_openai_agent = """You are an AI teacher, answering questions from students of an applied artificial intelligence course on Large Language Models (LLMs or llm). Topics covered include training models, fine tuning models, giving memory to LLMs, prompting, hallucinations and bias, vector databases, transformer architectures, embeddings, RAG frameworks, Langchain, making LLMs interact with tool use, AI agents, reinforcement learning with human feedback. Questions should be understood with this context.
|
86 |
|
87 |
+
Your answers are aimed to teach students, so they should be complete, clear, and easy to understand.
|
88 |
|
89 |
+
Utilize the AI_information tool to gather insights pertinent to the field of AI. This function accepts a string (user question rewritten as a statement) and returns informative content regarding the domain of AI.
|
90 |
+
|
91 |
+
Only some information returned by the tool might be relevant to the question, so ignore the irrelevant part and use the relevant part to answer the question.
|
92 |
+
|
93 |
+
AI_information: A tool for acquiring knowledge about AI. Directly forward the user's question, or a refined version focusing on the current discussion topic, to this tool.
|
94 |
|
95 |
Your responses are exclusively based on the output provided by the AI_information tool. Refrain from incorporating external knowledge or information not directly obtained from the tool's responses.
|
96 |
|
|
|
99 |
Provide comprehensive answers, ideally structured in up to ten paragraphs, drawing from the variety of relevant details furnished by the tool. The depth and breadth of your responses should align with the scope and specificity of the information retrieved.
|
100 |
|
101 |
Should the AI_information tool's repository lack information on the queried topic, politely inform the user that the question transcends the bounds of your current knowledge base, citing the absence of relevant content in the tool's documentation.
|
102 |
+
|
103 |
+
Do not refer to the documentation directly, but use the information provided within it to answer questions.
|
104 |
+
|
105 |
+
If code is provided in the information, share it with the students. It's important to provide complete code blocks so they can execute it.
|
106 |
+
|
107 |
+
Make sure to format your answers in Markdown format, including code block and snippets.
|
108 |
"""
|
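For reference, a toy sketch of how the {context_str} placeholder in system_prompt is filled at query time. In the app, llama-index performs this substitution when the string is used as a context/QA template; the chunks and question below are made-up placeholders.

from tutor_prompts import system_prompt

# Placeholder chunks; in the app these come from the nodes returned by CustomRetriever.
retrieved_chunks = "Chunk about vector databases...\n\nChunk about embeddings..."
question = "What is a vector database?"
final_prompt = system_prompt.format(context_str=retrieved_chunks) + question
print(final_prompt)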