File size: 6,712 Bytes
f640524
 
0b71789
 
 
 
 
 
 
 
 
 
 
 
 
f640524
0b71789
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
8efa373
 
 
 
 
b7ce965
 
 
 
 
3a4ac1e
0b71789
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
735431d
0b71789
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
---
license: llama2
tags:
- llama
- llama-2
- facebook
- meta
- text-generation-inference
- quantized
- gguf
- 32k-context
- togethercomputer
language:
- en
pipeline_tag: text-generation
---

# LLaMA-2-7B-32K-Instruct_GGUF #

[Together Computer, Inc.](https://together.ai/) has released
[Llama-2-7B-32K-Instruct](https://huggingface.co/togethercomputer/Llama-2-7B-32K-Instruct), a model based on
[Meta AI](https://ai.meta.com)'s [LLaMA-2-7B](https://huggingface.co/meta-llama/Llama-2-7b),
but fine-tuned for context lengths up to 32K using "Position Interpolation" and "Rotary Position Embeddings"
(RoPE).

While the current version of [llama.cpp](https://github.com/ggerganov/llama.cpp) already supports such large
context lengths, it requires quantized files in the new GGUF format - and that's where this repo comes in:
it contains the following quantizations of the original weights from Together's fined-tuned model

* [Q2_K](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q2_K.gguf)
* [Q3_K_S](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q3_K_S.gguf),
[Q3_K_M](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q3_K_M.gguf) (aka Q3_K) and
[Q3_K_L](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q3_K_L.gguf)
* [Q4_0](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q4_0.gguf),
[Q4_1](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q4_1.gguf),
[Q4_K_S](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q4_K_S.gguf) and
[Q4_K_M](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q4_K_M.gguf) (aka Q4_K)
* [Q5_0](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q5_0.gguf),
[Q5_1](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q5_1.gguf),
[Q5_K_S](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q5_K_S.gguf) and
[Q5_K_M](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q5_K_M.gguf) (aka Q5_K)
* [Q6_K](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q6_K.gguf),
* [Q8_0](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q8_0.gguf) and
* [F16](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-f16.gguf) (unquantized)

> Nota bene: while RoPE makes inferences with large contexts possible, you still need an awful lot of RAM
> when doing so. And since "32K" does not mean that you always have to use a context size of 32768 (only that
> the model was fine-tuned for that size), it is recommended that you keep your context as small as possible

> If you need quantizations for Together Computer's
> [Llama-2-7B-32K](https://huggingface.co/togethercomputer/Llama-2-7B-32K)
> model, then look for
> [LLaMA-2-7B-32K_GGUF](https://huggingface.co/rozek/LLaMA-2-7B-32K_GGUF)

## How Quantization was done ##

Since the author does not want arbitrary Python stuff to loiter on his computer, the quantization was done
using [Docker](https://www.docker.com/).

Assuming that you have the [Docker Desktop](https://www.docker.com/products/docker-desktop/) installed on
your system and also have a basic knowledge of how to use it, you may just follow the instructions shown
below in order to generate your own quantizations:

> Nota bene: you will need 30+x GB of free disk space, at least - depending on your quantization

1. create a new folder called `llama.cpp_in_Docker`<br>this folder will later be mounted into the Docker
container and store the quantization results
2. download the weights for the fine-tuned LLaMA-2 model from
[Hugging Face](https://huggingface.co/togethercomputer/LLaMA-2-7B-32K-Instruct) into a subfolder of
`llama.cpp_in_Docker` (let's call the new folder `LLaMA-2-7B-32K-Instruct`)
4. within the <u>Docker Desktop</u>, search for and download a `basic-python` image - just use one of
the most popular ones
5. from a <u>terminal session on your host computer</u> (i.e., not a Docker container!), start a new container
for the downloaded image which mounts the folder we created before:<br>
```
docker run --rm \
  -v ./llama.cpp_in_Docker:/llama.cpp \
  -t basic-python /bin/bash
```

(you may have to adjust the path to your local folder)

5. back in the <u>Docker Desktop</u>, open the "Terminal" tab of the started container and enter the
following commands (one after the other - copying the complete list and pasting it into the terminal
as a whole does not always seems to work properly):<br>
```
apt update
apt-get install software-properties-common -y
apt-get update
apt-get install g++ git make -y
cd /llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```
6. now open the "Files" tab and navigate to the file `/llama.cpp/llama.cpp/Makefile`, right-click on it and
choose "Edit file"
7. search for `aarch64`, and - in the line found (which looks like `ifneq ($(filter aarch64%,$(UNAME_M)),)`) - 
change `ifneq` to `ifeq`
8. save your change using the disk icon in the upper right corner of the editor pane and open the "Terminal"
tab again
9. now enter the following commands:<br>
```
make
python3 -m pip install -r requirements.txt
python3 convert.py ../LLaMA-2-7B-32K-Instruct
```
10. you are now ready to run the actual quantization, e.g., using<br>
```
./quantize ../LLaMA-2-7B-32K-Instruct/ggml-model-f16.gguf \
   ../LLaMA-2-7B-32K-Instruct/LLaMA-2-7B-32K-Instruct-Q4_0.gguf Q4_0
```
11. run any quantizations you need and stop the container when finished (the container will automatically
be deleted but the generated files will remain available on your host computer)
12. the `basic-python` image may also be deleted (manually) unless you plan to use it again in the near future

You are now free to move the quanitization results to where you need them and run inferences with context
lengths up to 32K (depending on the amount of memory you will have available - long contexts need a
lot of RAM)

## License ##

Concerning the license(s):

* the [original model](https://ai.meta.com/llama/) (from Meta AI) was released under a rather [permissive
license](https://ai.meta.com/llama/license/)
* the fine tuned model from Together Computer uses the
[same license](https://huggingface.co/togethercomputer/LLaMA-2-7B-32K-Instruct/blob/main/README.md)
* as a consequence, this repo does so as well