# LLM Inference

This folder contains helper scripts for exporting and generating outputs with your Composer-trained LLMs.

Table of Contents:

- [Converting a Composer checkpoint to an HF checkpoint folder](#converting-a-composer-checkpoint-to-an-hf-checkpoint-folder)
- [Interactive Generation with HF models](#interactive-generation-with-hf-models)
- [Interactive Chat with HF models](#interactive-chat-with-hf-models)
- [Converting an HF model to ONNX](#converting-an-hf-model-to-onnx)
- [Converting an HF MPT to FasterTransformer](#converting-an-hf-mpt-to-fastertransformer)
- [Converting a Composer MPT to FasterTransformer](#converting-a-composer-mpt-to-fastertransformer)
- [Running MPT with FasterTransformer](#running-mpt-with-fastertransformer)

## Converting a Composer checkpoint to an HF checkpoint folder

The LLMs trained with this codebase are all HuggingFace (HF) `PreTrainedModel`s, which we wrap with a `HuggingFaceModel` wrapper class to make them compatible with Composer. See the [docs](https://docs.mosaicml.com/projects/composer/en/latest/api_reference/generated/composer.models.HuggingFaceModel.html#huggingfacemodel) and an [example](https://docs.mosaicml.com/projects/composer/en/latest/examples/pretrain_finetune_huggingface.html) for more details.

At the end of your training runs, you will see a collection of Composer `Trainer` checkpoints such as `ep0-ba2000-rank0.pt`. These checkpoints contain the entire training state, including the model, tokenizer, optimizer state, schedulers, timestamp, metrics, etc. Though these Composer checkpoints are useful during training, at inference time we usually just want the model, tokenizer, and metadata. To extract these pieces, we provide a script `convert_composer_to_hf.py` that converts a Composer checkpoint directly to a standard HF checkpoint folder. For example:

```bash
python convert_composer_to_hf.py --composer_path ep0-ba2000-rank0.pt --hf_output_path my_hf_model/ --output_precision bf16
```

This will produce a folder like:

```
my_hf_model/
  config.json
  merges.txt
  pytorch_model.bin
  special_tokens_map.json
  tokenizer.json
  tokenizer_config.json
  vocab.json
  modeling_code.py
```

which can be loaded with standard HF utilities like `AutoModelForCausalLM.from_pretrained('my_hf_model')`. You can also pass object store URIs for both `--composer_path` and `--hf_output_path` to easily convert checkpoints stored in S3, OCI, etc.
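As a minimal sketch, the exported folder can be loaded and smoke-tested with the standard `transformers` API (the prompt below is illustrative, and `trust_remote_code=True` is only needed for models such as MPT that ship custom modeling code):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the checkpoint folder produced by convert_composer_to_hf.py above.
tokenizer = AutoTokenizer.from_pretrained('my_hf_model')
model = AutoModelForCausalLM.from_pretrained('my_hf_model', trust_remote_code=True)

# Run a quick smoke-test generation.
inputs = tokenizer('MosaicML is', return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```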
## Interactive Generation with HF models

To make it easy to inspect the generations produced by your HF model, we include a script `hf_generate.py` that allows you to run custom prompts through your HF model, like so:

```bash
python hf_generate.py \
    --name_or_path gpt2 \
    --temperature 1.0 \
    --top_p 0.95 \
    --top_k 50 \
    --seed 1 \
    --max_new_tokens 256 \
    --prompts \
      "The answer to life, the universe, and happiness is" \
      "MosaicML is an ML training efficiency startup that is known for" \
      "Here's a quick recipe for baking chocolate chip cookies: Start by" \
      "The best 5 cities to visit in Europe are"
```

which will produce output:

```bash
Loading HF model...
n_params=124439808
Loading HF tokenizer...
/mnt/workdisk/llm-foundry/scripts/inference/hf_generate.py:89: UserWarning: pad_token_id is not set for the tokenizer. Using eos_token_id as pad_token_id.
  warnings.warn(
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Generate kwargs: {'max_new_tokens': 256, 'temperature': 1.0, 'top_p': 0.95, 'top_k': 50, 'use_cache': True, 'do_sample': True, 'eos_token_id': 50256}
Moving model and inputs to device=cuda and dtype=torch.bfloat16...
Tokenizing prompts...
NOT using autocast...
Warming up...
Generating responses...
####################################################################################################
The answer to life, the universe, and happiness is to love...
####################################################################################################
MosaicML is an ML training efficiency startup that is known for designing and developing applications to improve training and performance efficiency...
####################################################################################################
Here's a quick recipe for baking chocolate chip cookies: Start by making an apple crumble by yourself or bake in the microwave for 40 minutes to melt and get melted...
####################################################################################################
The best 5 cities to visit in Europe are the one in Spain (Spain) and the one in Holland (Belgium)...
####################################################################################################
bs=4, input_tokens=array([11, 14, 13, 9]), output_tokens=array([256, 256, 256, 41])
total_input_tokens=47, total_output_tokens=809
encode_latency=9.56ms, gen_latency=2759.02ms, decode_latency=1.72ms, total_latency=2770.31ms
latency_per_output_token=3.42ms/tok
output_tok_per_sec=292.03tok/sec
```

The argument for `--name_or_path` can be either the name of a model on the HF Hub, such as `gpt2` or `facebook/opt-350m`, or the path to an HF checkpoint folder, such as the `my_hf_model/` folder we exported above.

The script will use HuggingFace's `device_map=auto` feature to automatically load the model on any available GPUs, or fall back to CPU. [See the docs here!](https://huggingface.co/docs/accelerate/usage_guides/big_modeling) You can also directly specify `--device_map auto`, `--device_map balanced`, etc. You can also target a specific **single** device using `--device cuda:0`, `--device cpu`, etc.

For MPT models specifically, you can pass args like `--attn_impl triton` and `--max_seq_len 4096` to speed up generation or to alter the maximum sequence length at inference time (thanks to ALiBi).

## Interactive Chat with HF models

Chat models need to pass conversation history back to the model for multi-turn conversations. To make that easier, we include `hf_chat.py`. Chat models usually require an introductory/system prompt, as well as a wrapper around user and model messages, to fit the training format. Default values work with our ChatML-trained models, but you can set other important values like generation kwargs:

```bash
# using an MPT/ChatML style model
python hf_chat.py -n mosaicml/mpt-7b-chat-v2 \
    --max_new_tokens=2048 \
    --temperature 0.3 \
    --top_k 0 \
    --model_dtype bf16 \
    --trust_remote_code
```

```bash
# using an MPT/ChatML style model on > 1 GPU
python hf_chat.py -n mosaicml/mpt-7b-chat-v2 \
    --max_new_tokens=1024 \
    --temperature 0.3 \
    --top_k 0 \
    --model_dtype bf16 \
    --trust_remote_code \
    --device_map auto
```

The script also works with other model styles. Here is an example of using it with a Vicuna-style model:

```bash
python hf_chat.py -n eachadea/vicuna-7b-1.1 \
    --system_prompt="A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions." \
    --user_msg_fmt="USER: {}\n" \
    --assistant_msg_fmt="ASSISTANT: {}\n" \
    --max_new_tokens=512
```

The `system_prompt` is the message that gives the bot context for the conversation and can be used to make the bot take on different personalities.

In the REPL you see while using `hf_chat.py`, you can enter text to interact with the model (hit return TWICE to send; this allows you to input text with single newlines). You can also enter the following commands:

- `clear`: clear the conversation history and start a new conversation (does not change the system prompt)
- `system`: change the system prompt
- `history`: see the conversation history
- `quit`: exit
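Under the hood, multi-turn chat works by flattening the system prompt and the accumulated user/assistant turns into a single prompt string before each generation. As a rough sketch, a ChatML-style history (the convention used by our chat models) can be assembled as shown below; the helper function and markers here are illustrative, not the exact implementation in `hf_chat.py`:

```python
def format_chatml(system_prompt: str, turns: list[tuple[str, str]]) -> str:
    """Flatten a system prompt and (role, message) turns into a ChatML-style prompt."""
    prompt = f'<|im_start|>system\n{system_prompt}<|im_end|>\n'
    for role, message in turns:  # role is 'user' or 'assistant'
        prompt += f'<|im_start|>{role}\n{message}<|im_end|>\n'
    # Leave the final assistant turn open so the model generates the next reply.
    return prompt + '<|im_start|>assistant\n'

print(format_chatml(
    'A conversation between a user and a helpful assistant.',
    [('user', 'What is MosaicML known for?')],
))
```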
## Converting an HF model to ONNX

We include a script `convert_hf_to_onnx.py` that demonstrates how to convert your HF model to ONNX format. For more details and examples of exporting and working with HuggingFace models with ONNX, see the HuggingFace documentation on ONNX export. Here are a couple of examples of using the script:

```bash
# 1) Local export
python convert_hf_to_onnx.py --pretrained_model_name_or_path local/path/to/huggingface/folder --output_folder local/folder

# 2) Remote export
python convert_hf_to_onnx.py --pretrained_model_name_or_path local/path/to/huggingface/folder --output_folder s3://bucket/remote/folder

# 3) Verify the exported model
python convert_hf_to_onnx.py --pretrained_model_name_or_path local/path/to/huggingface/folder --output_folder local/folder --verify_export

# 4) Change the batch size or max sequence length
python convert_hf_to_onnx.py --pretrained_model_name_or_path local/path/to/huggingface/folder --output_folder local/folder --export_batch_size 1 --max_seq_len 32000
```

Please open a GitHub issue if you discover any problems!
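Beyond `--verify_export`, you can also sanity-check an export yourself with `onnxruntime`. The sketch below makes assumptions that may not match your export: it assumes the output folder contains a `model.onnx` file with `input_ids` and `attention_mask` inputs, so inspect `session.get_inputs()` first if your export uses different names:

```python
import onnxruntime as ort
from transformers import AutoTokenizer

# The tokenizer comes from the original HF folder; the ONNX file name is an assumption.
tokenizer = AutoTokenizer.from_pretrained('local/path/to/huggingface/folder')
session = ort.InferenceSession('local/folder/model.onnx')
print('model inputs:', [inp.name for inp in session.get_inputs()])

inputs = tokenizer('MosaicML is', return_tensors='np')
logits = session.run(None, {'input_ids': inputs['input_ids'], 'attention_mask': inputs['attention_mask']})[0]
print('logits shape:', logits.shape)  # (batch, seq_len, vocab_size)
```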
## Converting an HF MPT to FasterTransformer

We include a script `convert_hf_mpt_to_ft.py` that converts HuggingFace MPT checkpoints to the [FasterTransformer](https://github.com/NVIDIA/FasterTransformer) format. This makes the checkpoints compatible with the FasterTransformer library, which can be used to run transformer models on GPUs. You can either pre-download the model to a local directory or directly provide the HF Hub model name when converting the HF MPT checkpoint to the FasterTransformer format.

### Download and Convert

```bash
# The script handles the download
python convert_hf_mpt_to_ft.py -i mosaicml/mpt-7b -o mpt-ft-7b --infer_gpu_num 1
```

### Pre-Download the Model and Convert

```bash
apt update
apt install git-lfs
git lfs install
git clone https://huggingface.co/mosaicml/mpt-7b

# This will convert the MPT checkpoint in the mpt-7b dir and save the converted checkpoint to the mpt-ft-7b dir
python convert_hf_mpt_to_ft.py -i mpt-7b -o mpt-ft-7b --infer_gpu_num 1
```

You can set `infer_gpu_num` to > 1 to prepare an FT checkpoint for multi-GPU inference. Please open a GitHub issue if you discover any problems!

## Converting a Composer MPT to FasterTransformer

We include a script `convert_composer_mpt_to_ft.py` that directly converts a Composer MPT checkpoint to the FasterTransformer format. You can either provide a path to a local Composer checkpoint or a URI to a file stored in a cloud supported by Composer (e.g. `s3://`). Simply run:

```bash
python convert_composer_mpt_to_ft.py -i <path_to_composer_checkpoint> -o mpt-ft-7b --infer_gpu_num 1
```

## Running MPT with FasterTransformer

This step assumes that you have already converted an MPT checkpoint to FT format by following the instructions in [Converting an HF MPT to FasterTransformer](#converting-an-hf-mpt-to-fastertransformer). It also assumes that you have:

1. Built FasterTransformer for PyTorch by following the instructions [here](https://github.com/NVIDIA/FasterTransformer/blob/main/docs/gpt_guide.md#build-the-project)
2. A PyTorch install that supports [MPI as a distributed communication backend](https://pytorch.org/docs/stable/distributed.html#backends-that-come-with-pytorch). You need to build and install PyTorch from source to include MPI as a backend.

Once the above steps are complete, you can run MPT using the following commands:

```bash
# For running on a single GPU and benchmarking
PYTHONPATH=/mnt/work/FasterTransformer python scripts/inference/run_mpt_with_ft.py --ckpt_path mpt-ft-7b/1-gpu \
    --lib_path /mnt/work/FasterTransformer/build/lib/libth_transformer.so --time

# Run with -h to see various generation arguments
PYTHONPATH=/mnt/work/FasterTransformer python scripts/inference/run_mpt_with_ft.py -h

# Run on 2 GPUs. You need to create an FT checkpoint for 2 GPUs first.
# --allow-run-as-root is only needed if you are running as root
PYTHONPATH=/mnt/work/FasterTransformer mpirun -n 2 --allow-run-as-root \
    python scripts/inference/run_mpt_with_ft.py \
    --ckpt_path mpt-ft-7b/2-gpu --lib_path /mnt/work/FasterTransformer/build/lib/libth_transformer.so --time

# Add prompts in a text file and generate text
echo "Write 3 reasons why you should train an AI model on domain specific data set." > prompts.txt
PYTHONPATH=/mnt/work/FasterTransformer python scripts/inference/run_mpt_with_ft.py \
    --ckpt_path mpt-ft-7b/1-gpu --lib_path /mnt/work/FasterTransformer/build/lib/libth_transformer.so \
    --sample_input_file prompts.txt --sample_output_file output.txt
```