edumunozsala committed on
Commit 2b701bb
1 Parent(s): ed45863

Upload README.md

Add a README file

Files changed (1):
  1. README.md +93 -0

README.md ADDED

---
tags:
- llama-2
- gptq
- quantization
- code
model-index:
- name: Llama-2-7b-4bit-GPTQ-python-coder
  results: []
license: gpl-3.0
language:
- code
datasets:
- iamtarun/python_code_instructions_18k_alpaca
pipeline_tag: text-generation
library_name: transformers
---

# Llama 2 7B 4-bit GPTQ Python Coder 👩‍💻

This model is the **GPTQ quantization of my Llama 2 7B 4-bit Python Coder**. The base model is available [here](https://huggingface.co/edumunozsala/llama-2-7b-int4-python-code-20k).

The quantization parameters for the GPTQ algorithm are listed below; a code sketch using these values follows the list.
- Bits: 4
- Group size: 128
- Calibration dataset: C4
- Descending activation order (`desc_act`): False
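
As an illustration only (this is not the exact script used to produce this checkpoint), the parameters above map onto the `GPTQConfig` class in `transformers` roughly as follows. The base model id is the Python Coder linked above; running the quantization requires a GPU and the `optimum` and `auto-gptq` packages.

```py
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

base_model_id = "edumunozsala/llama-2-7b-int4-python-code-20k"

tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# GPTQ settings mirroring the parameters listed above (illustrative, not the original script)
gptq_config = GPTQConfig(
    bits=4,              # 4-bit quantization
    group_size=128,      # group size 128
    dataset="c4",        # C4 calibration dataset
    desc_act=False,      # do not reorder columns by descending activation
    tokenizer=tokenizer,
)

# Quantization runs while loading the full-precision weights
model = AutoModelForCausalLM.from_pretrained(
    base_model_id, quantization_config=gptq_config, device_map="auto"
)

# Save the quantized model and tokenizer locally
model.save_pretrained("llama-2-7b-int4-GPTQ-python-code-20k")
tokenizer.save_pretrained("llama-2-7b-int4-GPTQ-python-code-20k")
```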

## Model Description

[Llama 2 7B 4-bit Python Coder](https://huggingface.co/edumunozsala/llama-2-7b-int4-python-code-20k) is a fine-tuned version of the Llama 2 7B model, trained with QLoRA in 4-bit precision using the [PEFT](https://github.com/huggingface/peft) library and bitsandbytes.
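
For background, below is a minimal sketch of what a QLoRA setup with PEFT and bitsandbytes looks like. The base checkpoint name and every hyperparameter here are placeholders chosen for illustration, not the configuration actually used to train the Python Coder.

```py
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit NF4 via bitsandbytes (placeholder settings)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # gated base checkpoint, assumed here for illustration
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach a LoRA adapter on top of the frozen 4-bit weights (values are illustrative)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```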

## Quantization

A quick definition, extracted from a great Medium article by Benjamin Marie, ["GPTQ or bitsandbytes: Which Quantization Method to Use for LLMs — Examples with Llama 2"](https://medium.com/towards-data-science/gptq-or-bitsandbytes-which-quantization-method-to-use-for-llms-examples-with-llama-2-f79bc03046dc) (for Medium subscribers only):

*"GPTQ (Frantar et al., 2023) was first applied to models ready to deploy. In other words, once the model is fully fine-tuned, GPTQ will be applied to reduce its size. GPTQ can lower the weight precision to 4-bit or 3-bit.
In practice, GPTQ is mainly used for 4-bit quantization. 3-bit has been shown very unstable (Dettmers and Zettlemoyer, 2023). It quantizes without loading the entire model into memory. Instead, GPTQ loads and quantizes the LLM module by module.
Quantization also requires a small sample of data for calibration which can take more than one hour on a consumer GPU."*
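
To make the size reduction concrete, one quick check is the memory footprint of the quantized checkpoint; the snippet below is only a sketch, and the exact figure depends on your environment.

```py
from transformers import AutoModelForCausalLM

# Load the 4-bit GPTQ checkpoint and report its in-memory size
model = AutoModelForCausalLM.from_pretrained(
    "edumunozsala/llama-2-7b-int4-GPTQ-python-code-20k", device_map="auto"
)
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```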

### Example of usage

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "edumunozsala/llama-2-7b-int4-GPTQ-python-code-20k"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Loading a GPTQ checkpoint through transformers requires the optimum and auto-gptq packages
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

instruction = "Write a Python function to display the first and last elements of a list."
input_text = ""

# Build the instruction prompt in the same format used for fine-tuning
prompt = f"""### Instruction:
Use the Task below and the Input given to write the Response, which is a programming code that can solve the Task.

### Task:
{instruction}

### Input:
{input_text}

### Response:
"""

input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.to(model.device)

with torch.inference_mode():
    outputs = model.generate(input_ids=input_ids, max_new_tokens=128, do_sample=True, top_p=0.9, temperature=0.3)

print(f"Prompt:\n{prompt}\n")
# Strip the prompt from the decoded output to show only the generated code
print(f"Generated response:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}")
```
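
Alternatively, the same checkpoint can be driven through the `pipeline` API; this is a common pattern rather than something from the original card, and the shortened prompt below is only for illustration (the full template above is preferred).

```py
from transformers import pipeline

model_id = "edumunozsala/llama-2-7b-int4-GPTQ-python-code-20k"

# Build a text-generation pipeline directly from the quantized checkpoint
generator = pipeline("text-generation", model=model_id, device_map="auto")

prompt = "### Instruction:\nWrite a Python function to reverse a string.\n\n### Response:\n"
output = generator(prompt, max_new_tokens=128, do_sample=True, top_p=0.9, temperature=0.3)
print(output[0]["generated_text"][len(prompt):])
```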

### Citation

```
@misc{edumunozsala_2023,
  author    = {Eduardo Muñoz},
  title     = {llama-2-7b-int4-GPTQ-python-coder},
  year      = 2023,
  url       = {https://huggingface.co/edumunozsala/llama-2-7b-int4-GPTQ-python-code-20k},
  publisher = {Hugging Face}
}
```