language:
- code
license: llama2
model_creator: Meta
model_name: CodeLlama 13B Instruct
inference: false
base_model:
- meta-llama/CodeLlama-13b-Instruct-hf
pipeline_tag: text-generation
tags:
- llama-2
- tensorrt-llm
- code-llama
prompt_template: >
[INST] Write code to solve the following coding problem that obeys the
constraints and passes the example test cases. Please wrap your code answer
using ```:
{prompt}
[/INST]
quantized_by: TheBloke
CodeLlama 13B Instruct - GPTQ - TensorRT-LLM - RTX4090
- Model creator: Meta
- Original model: CodeLlama 13B Instruct
- Quantized model: TheBloke CodeLlama 13B Instruct - GPTQ
Description
This repo contains TensorRT-LLM GPTQ model files for Meta's CodeLlama 13B Instruct, built for a single RTX 4090 card using tensorrt_llm version 0.15.0.dev2024101500. It is a 4-bit quantized version based on the main branch of TheBloke's CodeLlama 13B Instruct - GPTQ model.
TensorRT commands
To build this model, the following commands were run from the base folder of the TensorRT-LLM repository (see installation instructions in the repository for more information):
python examples/llama/convert_checkpoint.py \
--model_dir ./CodeLlama-13b-Instruct-hf \
--output_dir ./CodeLlama-13b-Instruct-hf_checkpoint \
--dtype float16 \
--quant_ckpt_path ./CodeLlama-13B-Instruct-GPTQ/model.safetensors \
--use_weight_only \
--weight_only_precision int4_gptq \
--per_group
And then:
trtllm-build \
--checkpoint_dir ./CodeLlama-13b-Instruct-hf_checkpoint \
--output_dir ./CodeLlama-13B-Instruct-GPTQ_TensorRT \
--gemm_plugin float16 \
--max_input_len 8192 \
--max_seq_len 8192
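Since the engine is built with `--max_input_len 8192` and `--max_seq_len 8192`, the prompt and the generated tokens of one request should together fit within 8192 tokens (this reading of `--max_seq_len` as a combined limit is my understanding of the build options, not something stated in this card). The sketch below shows that arithmetic. `max_new_tokens_allowed` is a hypothetical helper, not part of TensorRT-LLM, and the prompt length is assumed to have been counted with the model's own tokenizer.

```python
MAX_INPUT_LEN = 8192  # value passed to --max_input_len above
MAX_SEQ_LEN = 8192    # value passed to --max_seq_len above


def max_new_tokens_allowed(prompt_tokens: int, requested: int) -> int:
    """Return how many tokens can still be generated for a prompt of the given length.

    Hypothetical helper: assumes max_seq_len bounds prompt + generated tokens.
    """
    if prompt_tokens < 0 or requested < 0:
        raise ValueError("token counts must be non-negative")
    if prompt_tokens > MAX_INPUT_LEN:
        raise ValueError(
            f"prompt has {prompt_tokens} tokens, limit is {MAX_INPUT_LEN}"
        )
    # Cap the requested generation length by what is left of the sequence budget.
    return min(requested, MAX_SEQ_LEN - prompt_tokens)
```

Under that assumption, a request with `max_tokens=512` keeps its full generation budget as long as the prompt is at most 7680 tokens long.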
Prompt template: CodeLlama
[INST] <<SYS>>
Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer using ```:
<</SYS>>
{prompt}
[/INST]
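The template above can also be filled in programmatically. The following is a minimal sketch: `build_prompt` and `DEFAULT_SYSTEM_PROMPT` are hypothetical names introduced here for illustration, and the exact whitespace mirrors the Python usage example in this card rather than any official specification.

```python
FENCE = "`" * 3  # three backticks, spelled this way so the example is easy to embed

DEFAULT_SYSTEM_PROMPT = (
    "Write code to solve the following coding problem that obeys the "
    "constraints and passes the example test cases. "
    "Please wrap your code answer using " + FENCE + ":"
)


def build_prompt(user_prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    """Wrap a user prompt in the [INST] / <<SYS>> template shown above."""
    return (
        "[INST] <<SYS>>\n"
        + system_prompt
        + "\n<</SYS>>\n\n"
        + user_prompt
        + " [/INST] "
    )


if __name__ == "__main__":
    print(build_prompt("Write a function that reverses a string."))
```

Keeping the template in one function makes it harder for the system and user parts to drift apart when several prompts are generated in a batch.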
How to use this model from Python code
Using TensorRT-LLM API
Install the necessary packages
pip3 install tensorrt_llm==0.15.0.dev2024101500 -U --pre --extra-index-url https://pypi.nvidia.com
Note that this command should not be run from inside a virtual environment (or, if a venv is used, it should be run twice: once outside the venv and then again inside it).
Use the TensorRT-LLM API
from tensorrt_llm import LLM, SamplingParams

# System part of the CodeLlama prompt template
system_prompt = \
    "[INST] <<SYS>>\n" +\
    "Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer using ```:" +\
    "\n<</SYS>>\n\n"

# User part of the prompt, closed with the [/INST] tag
user_prompt = \
    "<Your user prompt>" +\
    " [/INST] "

prompts = [
    system_prompt + user_prompt,
]

# Sampling settings used for generation
sampling_params = SamplingParams(max_tokens=512, temperature=1.31, top_p=0.14, top_k=49, repetition_penalty=1.17)

# Load the engine built with trtllm-build and generate completions
llm = LLM(model="./CodeLlama-13B-Instruct-GPTQ_TensorRT")
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
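Because the prompt asks the model to wrap its answer in triple backticks, it can be convenient to pull the code out of the generated text afterwards. The following is one possible sketch. `extract_code` is a hypothetical helper, not part of TensorRT-LLM, and it assumes the model followed the instruction; if no fenced block is found, the text is returned with surrounding whitespace stripped.

```python
import re

FENCE = "`" * 3  # three backticks

# FENCE, an optional language tag, a newline, then the code up to the next FENCE.
_CODE_BLOCK = re.compile(
    re.escape(FENCE) + r"[^\n`]*\n(.*?)" + re.escape(FENCE),
    re.DOTALL,
)


def extract_code(generated_text: str) -> str:
    """Return the first fenced code block, or the stripped text if there is none."""
    match = _CODE_BLOCK.search(generated_text)
    if match is None:
        return generated_text.strip()
    return match.group(1).strip()
```

It could be applied to `generated_text` inside the loop above, for example `code = extract_code(generated_text)`.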
Using Oobabooga's Text Generation WebUI
Follow the instructions in this pull request, but use version 0.15.0.dev2024101500 of tensorrt_llm instead of 0.10.0: https://github.com/oobabooga/text-generation-webui/pull/5715