---
license: llama2
inference: false
pipeline_tag: text-generation
tags:
- not-for-all-audiences
language:
- en
---

# GGMLs of Pygmalion Vicuna 1.1 7B

<!-- header start -->
<div style="width: 100%;">
<img src="https://huggingface.co/spaces/shadowsword/misc/resolve/main/huggingface_shadowsword_ggml.png" alt="Shadowsword GGML Reuploads" style="width: 100%; min-width: 400px; display: block; margin: auto;">
</div>
<!-- header end -->

A GGML re-upload by Shadowsword.

https://huggingface.co/TehVenom/Pygmalion-Vicuna-1.1-7b

These are ggmlv3 files, produced with TheBloke's make-ggml.py script as committed to the Hugging Face repo.

```bash
example$ python3 ./make-ggml.py --model /home/inpw/Pygmalion-1.1-7b --outname Pygmalion-Vicuna-1.1-7b --outdir /home/inpw/Pygmalion-Vicuna-1.1-7b --keep_fp16 --quants ...
```

It has been mentioned that Pygmalion LLMs are no longer allowed on Google Colab!

`USE_POLICY.md` is included to comply with the license agreement and related legalities.

## Provided GGML Quants

| Quant Method | Use Case |
| ---- | ---- |
| Q2_K | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
| Q3_K_S | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors. |
| Q3_K_M | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K. |
| Q3_K_L | New k-quant method. Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K. |
| Q4_0 | Original quant method, 4-bit. |
| Q4_1 | Original quant method, 4-bit. Higher accuracy than Q4_0 but not as high as Q5_0, with quicker inference than the Q5 models. |
| Q4_K_S | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors. |
| Q4_K_M | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K. |
| Q5_0 | Original quant method, 5-bit. Higher accuracy, higher resource usage, and slower inference. |
| Q5_1 | Original quant method, 5-bit. Even higher accuracy and resource usage, and even slower inference. |
| Q5_K_S | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors. |
| Q5_K_M | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K. |
| Q6_K | New k-quant method. Uses GGML_TYPE_Q8_K for all tensors; 6-bit quantization. |
| fp16 | Unquantized fp16 GGML converted from the safetensors weights; can be used as the source for further quantization (see the sketch below the table). |
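
As a concrete illustration of that last row, the fp16 file can be re-quantized locally with llama.cpp's `quantize` tool. This is a minimal sketch, assuming a llama.cpp checkout from the ggmlv3 era; the file names are hypothetical:

```bash
# Re-quantize the fp16 GGML file to Q5_K_M (file names are hypothetical)
./quantize Pygmalion-Vicuna-1.1-7b.ggmlv3.fp16.bin \
  Pygmalion-Vicuna-1.1-7b.ggmlv3.q5_K_M.bin q5_K_M
```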

Thanks to TheBloke for the information on quant use cases.

| RAM/VRAM | Parameters | GPU Offload (2K ctx, Q4_0, 6GB RTX 2060) |
| ---- | ---- | ---- |
| 4GB | 3B | |
| 8GB | 7B | 32 Layers |
| 16GB | 13B | 18 Layers |
| 32GB | 30B | 8 Layers |
| 64GB | 65B | |
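
For reference, layer offload is controlled with llama.cpp's `-ngl` flag. A minimal sketch matching the 7B row above, assuming a GPU-enabled llama.cpp build from the ggmlv3 era; the file name is hypothetical:

```bash
# Offload 32 of the 7B model's layers to the GPU at 2K context
./main -m Pygmalion-Vicuna-1.1-7b.ggmlv3.q4_0.bin \
  -c 2048 \
  -ngl 32 \
  -p "Your prompt here"
```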

Original Card:

# Pygmalion Vicuna 1.1 7B

The LLaMA-based Pygmalion-7b model:

https://huggingface.co/PygmalionAI/pygmalion-7b

Merged alongside lmsys's Vicuna v1.1 deltas:

https://huggingface.co/lmsys/vicuna-13b-delta-v1.1
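
Vicuna v1.1 ships as deltas against the base LLaMA weights, so the delta has to be applied with FastChat before any merging. A minimal sketch, assuming a FastChat release from that period; the paths are hypothetical, and the 7B delta repo is the size-matched counterpart of the 13B link above:

```bash
# Apply the Vicuna v1.1 delta to base LLaMA-7B (paths are hypothetical)
python3 -m fastchat.model.apply_delta \
  --base-model-path /path/to/llama-7b \
  --target-model-path /path/to/vicuna-7b-v1.1 \
  --delta-path lmsys/vicuna-7b-delta-v1.1
```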

This merge was done using a weighted average merge strategy, and the end result is a model composed of:

Pygmalion-7b [60%] + LLaMA Vicuna v1.1 [40%]

This was done by request, and the end result is intended to lean heavily towards Pygmalion's chatting + RP tendencies, while inheriting some of Vicuna's Assistant / Instruct / Helpful properties.

Due to the influence of Pygmalion, this model will very likely generate content that is considered NSFW.

The exact prompting format is unknown, so try Pygmalion's prompt style first, then a mix of the two to see what brings the most interesting results.
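
As a starting point, Pygmalion models are usually prompted with a persona block followed by a `<START>` marker and chat turns. A rough sketch of that style, with placeholders taken from the Pygmalion model cards rather than anything specific to this merge:

```
[CHARACTER]'s Persona: [A few sentences about the character you want the model to play]
<START>
You: [Your input message here]
[CHARACTER]:
```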