Abhinav Kulkarni

	---
	license: cc-by-nc-sa-4.0
	language:
	- en
	library_name: transformers
	pipeline_tag: text-generation
	tags:
	- Orca
	- AWQ
	inference: false
	---

	# orca_mini_v2_13b (4-bit 128g AWQ Quantized)
	An uncensored LLaMA-13b model built in collaboration with [Eric Hartford](https://huggingface.co/ehartford). It was trained on explain-tuned datasets created from the instructions and inputs of the WizardLM, Alpaca, and Dolly-V2 datasets, applying the dataset construction approaches of the Orca research paper.

	This is a 4-bit AWQ-quantized version of the model with a group size of 128. For more information about AWQ quantization, see the [llm-awq repository](https://github.com/mit-han-lab/llm-awq).
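
To make "4-bit, group size 128" concrete, here is a minimal pure-Python sketch of group-wise quantization with a zero point. It shows plain round-to-nearest quantization only and does not include the activation-aware scaling that AWQ adds on top. The function names (`quantize_group`, `dequantize_group`, `quantize_row`) are hypothetical and are not part of the `llm-awq` API.

```python
def quantize_group(weights, n_bits=4):
    """Quantize one group of floats to unsigned n_bits integers with a zero point."""
    q_max = 2 ** n_bits - 1                      # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    # One scale per group; the small floor avoids division by zero for constant groups.
    scale = max(w_max - w_min, 1e-5) / q_max
    # Zero point is the integer code that represents 0.0; clamp it so it fits in n_bits.
    zero_point = min(q_max, max(0, -round(w_min / scale)))
    codes = [min(q_max, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate float weights."""
    return [(c - zero_point) * scale for c in codes]


def quantize_row(row, group_size=128, n_bits=4):
    """Split a weight row into consecutive groups and quantize each independently."""
    return [
        quantize_group(row[start:start + group_size], n_bits)
        for start in range(0, len(row), group_size)
    ]


if __name__ == "__main__":
    import random

    random.seed(0)
    row = [random.uniform(-1.0, 1.0) for _ in range(256)]  # toy weight row
    for index, (codes, scale, zero_point) in enumerate(quantize_row(row)):
        restored = dequantize_group(codes, scale, zero_point)
        original = row[index * 128:(index + 1) * 128]
        worst = max(abs(a - b) for a, b in zip(original, restored))
        print(f"group {index}: scale={scale:.4f} zero_point={zero_point} max_error={worst:.4f}")
```

Each group of 128 weights gets its own scale and zero point, so an outlier in one group does not degrade the precision of the others.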

	## Model Date

	July 8, 2023

	## Model License

	Please refer to the original Orca Mini v2 model license ([link](https://huggingface.co/psmathur/orca_mini_v2_13b)).

	Please refer to the AWQ quantization license ([link](https://github.com/mit-han-lab/llm-awq/blob/main/LICENSE)).

	## CUDA Version

	This model was successfully tested on CUDA driver v530.30.02 and runtime v11.7 with Python v3.10.11. Please note that AWQ requires NVIDIA GPUs with compute capability of `8.0` or higher.

	For Docker users, the `nvcr.io/nvidia/pytorch:23.06-py3` image has also been verified to work. It ships CUDA runtime v12.1 but otherwise matches the configuration above.
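
The compute capability requirement above can be checked before installing anything else. The following is a small sketch, assuming PyTorch is installed. The helper names `meets_requirement` and `check_local_gpus` are made up for this example, while `torch.cuda.get_device_capability` and `torch.cuda.get_device_name` are standard PyTorch calls.

```python
REQUIRED_CAPABILITY = (8, 0)  # minimum (major, minor) stated in this README


def meets_requirement(capability, required=REQUIRED_CAPABILITY):
    """Return True if a (major, minor) compute capability satisfies the minimum."""
    return tuple(capability) >= tuple(required)


def check_local_gpus():
    """List (name, capability, ok) per visible GPU; empty if PyTorch or CUDA is missing."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    results = []
    for index in range(torch.cuda.device_count()):
        capability = torch.cuda.get_device_capability(index)  # e.g. (8, 6)
        name = torch.cuda.get_device_name(index)
        results.append((name, capability, meets_requirement(capability)))
    return results


if __name__ == "__main__":
    gpus = check_local_gpus()
    if not gpus:
        print("No CUDA GPU visible (or PyTorch is not installed).")
    for name, capability, ok in gpus:
        status = "OK" if ok else "below the required 8.0"
        print(f"{name}: compute capability {capability[0]}.{capability[1]} ({status})")
```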

	## How to Use

	```bash
	git clone https://github.com/mit-han-lab/llm-awq \
	&& cd llm-awq \
	&& git checkout ce4a6bb1c238c014a06672cb74f6865573494d66 \
	&& pip install -e . \
	&& cd awq/kernels \
	&& python setup.py install
	```

	```python
	import time
	import torch
	from awq.quantize.quantizer import real_quantize_model_weight
	from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer, TextStreamer
	from accelerate import init_empty_weights, load_checkpoint_and_dispatch
	from huggingface_hub import snapshot_download

	model_name = "abhinavkulkarni/psmathur-orca_mini_v2_13b-w4-g128-awq"

	# Config
	config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)

	# Tokenizer: prefer the one named in the config, fall back to the one in this repo
	try:
	    tokenizer = AutoTokenizer.from_pretrained(config.tokenizer_name, trust_remote_code=True)
	except Exception:
	    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False, trust_remote_code=True)
	streamer = TextStreamer(tokenizer, skip_special_tokens=True)

	# Model: 4-bit weights, zero-point quantization, group size 128
	w_bit = 4
	q_config = {
	    "zero_point": True,
	    "q_group_size": 128,
	}

	# Download the quantized checkpoint from the Hub
	load_quant = snapshot_download(model_name)

	# Build the model skeleton without allocating real weights
	with init_empty_weights():
	    model = AutoModelForCausalLM.from_config(
	        config=config,
	        torch_dtype=torch.float16,
	        trust_remote_code=True,
	    )

	# Set up the quantized layers only (init_only=True); weights come from the checkpoint
	real_quantize_model_weight(model, w_bit=w_bit, q_config=q_config, init_only=True)
	model.tie_weights()

	# Load the quantized weights and spread them across the available GPUs
	model = load_checkpoint_and_dispatch(model, load_quant, device_map="balanced")

	# Inference
	prompt = '''What is the difference between nuclear fusion and fission?
	###Response:'''

	input_ids = tokenizer(prompt, return_tensors='pt').input_ids.cuda()
	t1 = time.time()
	output = model.generate(
	    inputs=input_ids,
	    temperature=0.7,
	    max_new_tokens=512,
	    top_p=0.15,
	    top_k=0,
	    repetition_penalty=1.1,
	    eos_token_id=tokenizer.eos_token_id,
	    streamer=streamer)
	t2 = time.time()

	# The output contains the prompt followed by the generated tokens
	num_tokens = output.shape[1] - input_ids.shape[1]
	print("*" * 80)
	print(f"Generated {num_tokens/(t2-t1):.2f} token/s; {(t2-t1)*1000/num_tokens:.2f} ms/token")
	```

	## Evaluation

	This evaluation was done using [LM-Eval](https://github.com/EleutherAI/lm-evaluation-harness).

	[orca_mini_v2_13b](https://huggingface.co/psmathur/orca_mini_v2_13b)

	| Task     | Version | Metric          |   Value |   | Stderr |
	|----------|--------:|-----------------|--------:|---|--------|
	| wikitext |       1 | word_perplexity | 23.8997 |   |        |
	|          |         | byte_perplexity |  1.8104 |   |        |
	|          |         | bits_per_byte   |  0.8563 |   |        |

	[orca_mini_v2_13b (4-bit 128-group AWQ)](https://huggingface.co/abhinavkulkarni/psmathur-orca_mini_v2_13b-w4-g128-awq)

	| Task     | Version | Metric          |   Value |   | Stderr |
	|----------|--------:|-----------------|--------:|---|--------|
	| wikitext |       1 | word_perplexity | 27.4695 |   |        |
	|          |         | byte_perplexity |  1.8581 |   |        |
	|          |         | bits_per_byte   |  0.8938 |   |        |
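
To put the two tables side by side, the short sketch below computes the relative change for each metric and checks that `bits_per_byte` equals the base-2 logarithm of `byte_perplexity`. The numbers are copied from the tables above; the helper names are illustrative only and are not part of LM-Eval.

```python
import math

# Values copied from the two wikitext tables above.
baseline = {"word_perplexity": 23.8997, "byte_perplexity": 1.8104, "bits_per_byte": 0.8563}
quantized = {"word_perplexity": 27.4695, "byte_perplexity": 1.8581, "bits_per_byte": 0.8938}


def relative_increase(before, after):
    """Percentage increase going from `before` to `after`."""
    return (after - before) / before * 100.0


def bits_per_byte_from_perplexity(byte_perplexity):
    """bits_per_byte is the base-2 logarithm of byte_perplexity."""
    return math.log2(byte_perplexity)


if __name__ == "__main__":
    for metric in baseline:
        change = relative_increase(baseline[metric], quantized[metric])
        print(f"{metric}: {baseline[metric]} -> {quantized[metric]} ({change:+.2f}%)")
    for label, scores in (("baseline", baseline), ("quantized", quantized)):
        derived = bits_per_byte_from_perplexity(scores["byte_perplexity"])
        print(f"{label}: log2(byte_perplexity) = {derived:.4f}")
```

On these figures, 4-bit quantization raises word perplexity on wikitext by roughly 15% relative to the unquantized model.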

	## Acknowledgements

	If you find `orca_mini_v2_13b` useful in your research or applications, please cite it using the following BibTeX:

	```
	@misc{orca_mini_v2_13b,
	author = {Pankaj Mathur},
	title = {orca_mini_v2_13b: An explain tuned LLaMA-13b model on uncensored wizardlm, alpaca, \& dolly datasets},
	year = {2023},
	publisher = {GitHub, HuggingFace},
	journal = {GitHub repository, HuggingFace repository},
	howpublished = {\url{https://huggingface.co/psmathur/orca_mini_v2_13b}},
	}
	```
	```
	@software{touvron2023llama,
	title={LLaMA: Open and Efficient Foundation Language Models},
	author={Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth{\'e}e and Rozi{\`e}re, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and Rodriguez, Aurelien and Joulin, Armand and Grave, Edouard and Lample, Guillaume},
	journal={arXiv preprint arXiv:2302.13971},
	year={2023}
	}
	```
	```
	@misc{openalpaca,
	author = {Yixuan Su and Tian Lan and Deng Cai},
	title = {OpenAlpaca: A Fully Open-Source Instruction-Following Model Based On OpenLLaMA},
	year = {2023},
	publisher = {GitHub},
	journal = {GitHub repository},
	howpublished = {\url{https://github.com/yxuansu/OpenAlpaca}},
	}
	```
	```
	@misc{alpaca,
	author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto },
	title = {Stanford Alpaca: An Instruction-following LLaMA model},
	year = {2023},
	publisher = {GitHub},
	journal = {GitHub repository},
	howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}},
	}
	```
	```
	@online{DatabricksBlog2023DollyV2,
	author = {Mike Conover and Matt Hayes and Ankit Mathur and Jianwei Xie and Jun Wan and Sam Shah and Ali Ghodsi and Patrick Wendell and Matei Zaharia and Reynold Xin},
	title = {Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM},
	year = {2023},
	url = {https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm},
	urldate = {2023-06-30}
	}
	```
	```
	@misc{xu2023wizardlm,
	title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},
	author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
	year={2023},
	eprint={2304.12244},
	archivePrefix={arXiv},
	primaryClass={cs.CL}
	}
	```

	The model was quantized with the AWQ technique. If you find AWQ useful or relevant to your research, please cite the paper:

	```
	@article{lin2023awq,
	title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
	author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
	journal={arXiv},
	year={2023}
	}
	```