TheBloke committed on
Commit 7a7d4a6
1 Parent(s): c72b7be

Update README.md

Files changed (1)
  1. README.md +22 -23
README.md CHANGED
@@ -35,13 +35,22 @@ tags:
 
 This repo contains GGML format model files for [Meta's Llama 2 70B](https://huggingface.co/meta-llama/Llama-2-70b).
 
- GGML files are for CPU + GPU inference using [llama.cpp](https://github.com/ggerganov/llama.cpp) and libraries and UIs which support this format, such as:
- * [KoboldCpp](https://github.com/LostRuins/koboldcpp), a powerful GGML web UI with full GPU acceleration out of the box. Especially good for storytelling.
- * [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui), a great web UI with GPU acceleration via the c_transformers backend.
- * [LM Studio](https://lmstudio.ai/), a fully featured local GUI. Supports full GPU accel on macOS. Also supports Windows, without GPU accel.
- * [text-generation-webui](https://github.com/oobabooga/text-generation-webui), the most popular web UI. Requires extra steps to enable GPU accel via the llama.cpp backend.
- * [ctransformers](https://github.com/marella/ctransformers), a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server.
- * [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server.
 
  ## Repositories available
 
@@ -58,15 +67,11 @@ GGML files are for CPU + GPU inference using [llama.cpp](https://github.com/gger
 <!-- compatibility_ggml start -->
 ## Compatibility
 
- ### Original llama.cpp quant methods: `q4_0, q4_1, q5_0, q5_1, q8_0`
-
- These are guaranteed to be compatible with any UIs, tools and libraries released since late May. They may be phased out soon, as they are largely superseded by the new k-quant methods.
-
- ### New k-quant methods: `q2_K, q3_K_S, q3_K_M, q3_K_L, q4_K_S, q4_K_M, q5_K_S, q6_K`
 
- These new quantisation methods are compatible with llama.cpp as of June 6th, commit `2d43387`.
 
- They are now also compatible with recent releases of text-generation-webui, KoboldCpp, llama-cpp-python, ctransformers, rustformers and most others. For compatibility with other tools and libraries, please check their documentation.
 
 ## Explanation of the new k-quant methods
 <details>
@@ -106,17 +111,11 @@ Refer to the Provided Files table below to see what files use which methods, and
 I use the following command line; adjust for your tastes and needs:
 
 ```
- ./main -t 10 -ngl 32 -m llama-2-70b.ggmlv3.q4_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\n### Response:"
 ```
- Change `-t 10` to the number of physical CPU cores you have. For example if your system has 8 cores/16 threads, use `-t 8`.
-
- Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.
-
- If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`.
-
- ## How to run in `text-generation-webui`
 
- Further instructions here: [text-generation-webui/docs/llama.cpp-models.md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/llama.cpp-models.md).
 
 <!-- footer start -->
 ## Discord
 
 
 This repo contains GGML format model files for [Meta's Llama 2 70B](https://huggingface.co/meta-llama/Llama-2-70b).
 
+ ## Only compatible with latest llama.cpp
+
+ To use these files you need:
+
+ 1. llama.cpp as of [commit `e76d630`](https://github.com/ggerganov/llama.cpp/commit/e76d630df17e235e6b9ef416c45996765d2e36fb) or later.
+    - For users who don't want to compile from source, you can use the binaries from [release master-e76d630](https://github.com/ggerganov/llama.cpp/releases/tag/master-e76d630).
+ 2. To add the new command line parameter `-gqa 8`.
+
+ Example command:
+ ```
+ /workspace/git/llama.cpp/main -m llama-2-70b-chat/ggml/llama-2-70b-chat.ggmlv3.q4_0.bin -gqa 8 -t 13 -p "[INST] <<SYS>>You are a helpful assistant<</SYS>>Write a story about llamas[/INST]"
+ ```
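In that command, `-gqa 8` tells llama.cpp the grouped-query attention factor used by the 70B model (the GGML files do not record this themselves), and the prompt follows the Llama-2-Chat format. A minimal sketch of that template is shown below; the `{system_message}` and `{prompt}` placeholders are illustrative, not literal text:
```
[INST] <<SYS>>
{system_message}
<</SYS>>

{prompt} [/INST]
```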
+
+ There is no CUDA support at this time, but it should be coming soon.
+
+ There is no support in third-party UIs or Python libraries (llama-cpp-python, ctransformers) yet. That will come in due course.
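If you are compiling from source, a minimal sketch of getting a llama.cpp build at or after that commit (assuming a Linux or macOS machine with `git`, `make` and a C/C++ toolchain installed) looks like this:
```
# Clone llama.cpp and check out the commit that added Llama 2 70B support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout e76d630
# Build the CPU-only binaries (no CUDA support for 70B yet, as noted above)
make
```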
 
 ## Repositories available
 
 <!-- compatibility_ggml start -->
 ## Compatibility
 
+ ### Only compatible with llama.cpp as of commit `e76d630`
 
+ Compatible with llama.cpp as of [commit `e76d630`](https://github.com/ggerganov/llama.cpp/commit/e76d630df17e235e6b9ef416c45996765d2e36fb) or later.
 
+ For a pre-compiled release, use [release master-e76d630](https://github.com/ggerganov/llama.cpp/releases/tag/master-e76d630) or later.
 
 ## Explanation of the new k-quant methods
 <details>
 
 I use the following command line; adjust for your tastes and needs:
 
 ```
+ ./main -m llama-2-70b.ggmlv3.q4_0.bin -gqa 8 -t 13 -p "Llamas are"
 ```
+ Change `-t 13` to the number of physical CPU cores you have. For example if your system has 8 cores/16 threads, use `-t 8`.
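If you're not sure how many physical cores you have, a quick way to check (assuming Linux with `lscpu`, or macOS with `sysctl`) is:
```
# Linux: physical cores = "Core(s) per socket" x "Socket(s)"
lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket'

# macOS: print the physical core count directly
sysctl -n hw.physicalcpu
```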
 
+ No GPU support is possible yet, but it is coming soon.
 
 <!-- footer start -->
 ## Discord