---
license: llama2
inference: false
pipeline_tag: text-generation
tags:
- not-for-all-audiences
language:
- en
---

# GGMLs of Pygmalion Vicuna 1.1 7B

<!-- header start -->
<div style="width: 100%;">
<img src="https://huggingface.co/spaces/shadowsword/misc/resolve/main/huggingface_shadowsword_ggml.png" alt="Shadowsword GGML Reuploads" style="width: 100%; min-width: 400px; display: block; margin: auto;">
</div>
<!-- header end -->

A GGML re-upload by Shadowsword.

https://huggingface.co/TehVenom/Pygmalion-Vicuna-1.1-7b

These are ggmlv3 files, produced with TheBloke's make-ggml.py script as committed to the Hugging Face repo.

```bash
example$ python3 ./make-ggml.py --model /home/inpw/Pygmalion-1.1-7b --outname Pygmalion-Vicuna-1.1-7b --outdir /home/inpw/Pygmalion-Vicuna-1.1-7b --keep_fp16 --quants ...
```

It has been mentioned that Pygmalion LLMs are no longer allowed on Google Colab!

`USE_POLICY.md` is included to comply with the license agreement and related legalities.

## Provided GGML Quants

| Quant Method | Use Case |
| ---- | ---- |
| Q2_K | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
| Q3_K_S | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors. |
| Q3_K_M | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K. |
| Q3_K_L | New k-quant method. Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K. |
| Q4_0 | Original quant method, 4-bit. |
| Q4_1 | Original quant method, 4-bit. Higher accuracy than Q4_0 but not as high as Q5_0, with quicker inference than the Q5 models. |
| Q4_K_S | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors. |
| Q4_K_M | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K. |
| Q5_0 | Original quant method, 5-bit. Higher accuracy, higher resource usage, and slower inference. |
| Q5_1 | Original quant method, 5-bit. Even higher accuracy and resource usage, and even slower inference. |
| Q5_K_S | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors. |
| Q5_K_M | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K. |
| Q6_K | New k-quant method. Uses GGML_TYPE_Q8_K for all tensors; 6-bit quantization. |
| fp16 | Unquantized fp16 GGML converted from the safetensors weights; can be used as the source for further quantization (see the sketch below the table). |
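
As a concrete illustration of that last row, the fp16 file can be re-quantized locally with llama.cpp's `quantize` tool. This is a minimal sketch, assuming a llama.cpp checkout from the ggmlv3 era; the file names are hypothetical:

```bash
# Re-quantize the fp16 GGML file to Q5_K_M (file names are hypothetical)
./quantize Pygmalion-Vicuna-1.1-7b.ggmlv3.fp16.bin \
  Pygmalion-Vicuna-1.1-7b.ggmlv3.q5_K_M.bin q5_K_M
```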

Thanks to TheBloke for the information on quant use cases.

| RAM/VRAM | Parameters | GPU Offload (2K ctx, Q4_0, 6GB RTX 2060) |
| ---- | ---- | ---- |
| 4GB | 3B | |
| 8GB | 7B | 32 Layers |
| 16GB | 13B | 18 Layers |
| 32GB | 30B | 8 Layers |
| 64GB | 65B | |
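
For reference, layer offload is controlled with llama.cpp's `-ngl` flag. A minimal sketch matching the 7B row above, assuming a GPU-enabled llama.cpp build from the ggmlv3 era; the file name is hypothetical:

```bash
# Offload 32 of the 7B model's layers to the GPU at 2K context
./main -m Pygmalion-Vicuna-1.1-7b.ggmlv3.q4_0.bin \
  -c 2048 \
  -ngl 32 \
  -p "Your prompt here"
```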

Original Card:

# Pygmalion Vicuna 1.1 7B

The LLaMA-based Pygmalion-7b model:

https://huggingface.co/PygmalionAI/pygmalion-7b

Merged alongside lmsys's Vicuna v1.1 deltas:

https://huggingface.co/lmsys/vicuna-13b-delta-v1.1
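
Vicuna v1.1 ships as deltas against the base LLaMA weights, so the delta has to be applied with FastChat before any merging. A minimal sketch, assuming a FastChat release from that period; the paths are hypothetical, and the 7B delta repo is the size-matched counterpart of the 13B link above:

```bash
# Apply the Vicuna v1.1 delta to base LLaMA-7B (paths are hypothetical)
python3 -m fastchat.model.apply_delta \
  --base-model-path /path/to/llama-7b \
  --target-model-path /path/to/vicuna-7b-v1.1 \
  --delta-path lmsys/vicuna-7b-delta-v1.1
```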

This merge was done using a weighted average merge strategy, and the end result is a model composed of:

Pygmalion-7b [60%] + LLaMA Vicuna v1.1 [40%]

This was done by request, and the end result is intended to lean heavily towards Pygmalion's chatting + RP tendencies, while inheriting some of Vicuna's Assistant / Instruct / Helpful properties.

Due to the influence of Pygmalion, this model will very likely generate content that is considered NSFW.

The exact prompting format is unknown, so try Pygmalion's prompt style first, then a mix of the two to see what brings the most interesting results.
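
As a starting point, Pygmalion models are usually prompted with a persona block followed by a `<START>` marker and chat turns. A rough sketch of that style, with placeholders taken from the Pygmalion model cards rather than anything specific to this merge:

```
[CHARACTER]'s Persona: [A few sentences about the character you want the model to play]
<START>
You: [Your input message here]
[CHARACTER]:
```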