BootsofLagrangian committed on
Commit 454150c
1 Parent(s): d49bb29

update gguf readme

Files changed (1)
  1. README.md +133 -37
README.md CHANGED
@@ -30,69 +30,165 @@ For details, check out [our project page](https://yonsei-mir.github.io/AkaLLaMA-
 
  ### Model Description
 
- This is the model card of a 🤗 transformers model that has been pushed on the Hub.
 
  - **Developed by:** [Yonsei MIRLab](https://mirlab.yonsei.ac.kr/)
  - **Language(s) (NLP):** Korean, English
  - **License:** llama3
  - **Finetuned from model:** [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct)
 
- ## How to use
 
- This repo provides full model weight files for AkaLlama-70B-v0.1.
 
- # Use with transformers
 
- See the snippet below for usage with Transformers:
 
- ```python
- import transformers
- import torch
 
- model_id = "mirlab/AkaLlama-llama3-70b-v0.1-GGUF"
 
- pipeline = transformers.pipeline(
-     "text-generation",
-     model=model_id,
-     model_kwargs={"torch_dtype": torch.bfloat16},
-     device="auto",
  )
 
- system_prompt = """당신은 연세대학교 멀티모달 연구실 (MIR lab) 이 만든 대규모 언어 모델인 AkaLlama (아카라마) 입니다.
  다음 지침을 따르세요:
  1. 사용자가 별도로 요청하지 않는 한 항상 한글로 소통하세요.
  2. 유해하거나 비윤리적, 차별적, 위험하거나 불법적인 내용이 답변에 포함되어서는 안 됩니다.
  3. 질문이 말이 되지 않거나 사실에 부합하지 않는 경우 정답 대신 그 이유를 설명하세요. 질문에 대한 답을 모른다면 거짓 정보를 공유하지 마세요.
- 4. 안전이나 윤리에 위배되지 않는 한 사용자의 모든 질문에 완전하고 포괄적으로 답변하세요."""
 
- messages = [
-     {"role": "system", "content": system_prompt},
-     {"role": "user", "content": "네 이름은 뭐야?"},
- ]
 
- prompt = pipeline.tokenizer.apply_chat_template(
-     messages,
-     tokenize=False,
-     add_generation_prompt=True
  )
 
- terminators = [
-     pipeline.tokenizer.eos_token_id,
-     pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
- ]
-
- outputs = pipeline(
-     prompt,
-     max_new_tokens=256,
-     eos_token_id=terminators,
-     do_sample=True,
-     temperature=0.6,
-     top_p=0.9,
  )
- print(outputs[0]["generated_text"][len(prompt):])
  # 내 이름은 AkaLlama입니다! 나는 언어 모델로, 사용자와 대화하는 데 도움을 주기 위해 만들어졌습니다. 나는 다양한 주제에 대한 질문에 답하고, 새로운 아이디어를 제공하며, 문제를 해결하는 데 도움이 될 수 있습니다. 사용자가 원하는 정보나 도움을 받도록 최선을 다할 것입니다!
  ```
 
  ## Evaluation
 
  | Model | #Parameter | Quantized? | LogicKor |
 
 
  ### Model Description
 
+ This is the model card of a GGUF model that has been pushed to the Hub.
 
  - **Developed by:** [Yonsei MIRLab](https://mirlab.yonsei.ac.kr/)
  - **Language(s) (NLP):** Korean, English
  - **License:** llama3
  - **Finetuned from model:** [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct)
+ - **Quantized from model:** [mirlab/AkaLlama-llama3-70b-v0.1](https://huggingface.co/mirlab/AkaLlama-llama3-70b-v0.1)
 
+ ### About GGUF
 
+ GGUF is a new format introduced by the llama.cpp team on August 21st, 2023. It is a replacement for GGML, which is no longer supported by llama.cpp.
 
+ Here is an incomplete list of clients and libraries that are known to support GGUF:
 
+ * [llama.cpp](https://github.com/ggerganov/llama.cpp). The source project for GGUF. Offers a CLI and a server option.
+ * [text-generation-webui](https://github.com/oobabooga/text-generation-webui), the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
+ * [KoboldCpp](https://github.com/LostRuins/koboldcpp), a fully featured web UI, with GPU acceleration across all platforms and GPU architectures. Especially good for storytelling.
+ * [GPT4All](https://gpt4all.io/index.html), a free and open source locally running GUI, supporting Windows, Linux and macOS with full GPU acceleration.
+ * [LM Studio](https://lmstudio.ai/), an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration. Linux available, in beta as of 27/11/2023.
+ * [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui), a great web UI with many interesting and unique features, including a full model library for easy model selection.
+ * [Faraday.dev](https://faraday.dev/), an attractive and easy-to-use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
+ * [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), a Python library with GPU acceleration, LangChain support, and an OpenAI-compatible API server.
+ * [candle](https://github.com/huggingface/candle), a Rust ML framework with a focus on performance, including GPU support, and ease of use.
+ * [ctransformers](https://github.com/marella/ctransformers), a Python library with GPU acceleration, LangChain support, and an OpenAI-compatible API server. Note that, as of the time of writing (November 27th, 2023), ctransformers has not been updated in a long time and does not support many recent models.
 
+ ## How to use
+
+ This repo provides GGUF weight files for AkaLlama-70B-v0.1.
+
+ # Use with llama-cpp-python
+
+ See the snippet below for usage with [llama-cpp-python](https://github.com/abetlen/llama-cpp-python):
 
+ ```python
+ from llama_cpp import Llama
+
+ # Set n_gpu_layers to the number of layers to offload to GPU. Set it to 0 if no GPU acceleration is available on your system.
+ llm = Llama(
+     model_path="./AkaLlama-llama3-70b-v0.1.Q4_K_M.gguf",  # Download the model file first
+     n_ctx=8192,      # The max sequence length to use - note that longer sequence lengths require much more resources
+     n_threads=8,     # The number of CPU threads to use, tailor to your system and the resulting performance
+     n_gpu_layers=81  # The number of layers to offload to GPU, if you have GPU acceleration available
  )
 
+ # Simple inference example
+ output = llm(
+     """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
+
+ 당신은 연세대학교 멀티모달 연구실 (MIR lab) 이 만든 대규모 언어 모델인 AkaLlama (아카라마) 입니다.
  다음 지침을 따르세요:
  1. 사용자가 별도로 요청하지 않는 한 항상 한글로 소통하세요.
  2. 유해하거나 비윤리적, 차별적, 위험하거나 불법적인 내용이 답변에 포함되어서는 안 됩니다.
  3. 질문이 말이 되지 않거나 사실에 부합하지 않는 경우 정답 대신 그 이유를 설명하세요. 질문에 대한 답을 모른다면 거짓 정보를 공유하지 마세요.
+ 4. 안전이나 윤리에 위배되지 않는 한 사용자의 모든 질문에 완전하고 포괄적으로 답변하세요.<|eot_id|><|start_header_id|>user<|end_header_id|>
 
+ {prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
 
+ """,                 # Prompt in the llama-3 chat format
+     max_tokens=512,  # Generate up to 512 tokens
+     stop=["<|eot_id|>", "<|end_of_text|>"],  # Example stop tokens - not necessarily correct for this specific model! Please check before using.
+     echo=True        # Whether to echo the prompt
  )
 
+ # Chat Completion API
+
+ llm = Llama(model_path="./AkaLlama-llama3-70b-v0.1.Q4_K_M.gguf", chat_format="llama-3")  # Set chat_format according to the model you are using
+ llm.create_chat_completion(
+     messages = [
+         {"role": "system", "content": """당신은 연세대학교 멀티모달 연구실 (MIR lab) 이 만든 대규모 언어 모델인 AkaLlama (아카라마) 입니다.
+ 다음 지침을 따르세요:
+ 1. 사용자가 별도로 요청하지 않는 한 항상 한글로 소통하세요.
+ 2. 유해하거나 비윤리적, 차별적, 위험하거나 불법적인 내용이 답변에 포함되어서는 안 됩니다.
+ 3. 질문이 말이 되지 않거나 사실에 부합하지 않는 경우 정답 대신 그 이유를 설명하세요. 질문에 대한 답을 모른다면 거짓 정보를 공유하지 마세요.
+ 4. 안전이나 윤리에 위배되지 않는 한 사용자의 모든 질문에 완전하고 포괄적으로 답변하세요."""},
+         {
+             "role": "user",
+             "content": "네 이름은 뭐야?"
+         }
+     ]
  )
+
  # 내 이름은 AkaLlama입니다! 나는 언어 모델로, 사용자와 대화하는 데 도움을 주기 위해 만들어졌습니다. 나는 다양한 주제에 대한 질문에 답하고, 새로운 아이디어를 제공하며, 문제를 해결하는 데 도움이 될 수 있습니다. 사용자가 원하는 정보나 도움을 받도록 최선을 다할 것입니다!
  ```
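+
+ The snippet above expects the GGUF file to already be on disk. One way to fetch it programmatically (a sketch; `huggingface_hub` is an extra dependency, not something this README requires - you can equally download via the links in the Provided files table below) is:
+
+ ```python
+ from huggingface_hub import hf_hub_download
+
+ # Download the Q4_K_M file from this repo into the local Hugging Face cache
+ # and get back its local path; pass that path to Llama(model_path=...).
+ model_path = hf_hub_download(
+     repo_id="mirlab/AkaLlama-llama3-70b-v0.1-GGUF",
+     filename="AkaLlama-llama3-70b-v0.1.Q4_K_M.gguf",
+ )
+ ```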
 
+ ## Compatibility
+
+ These quantised GGUFv2 files are compatible with llama.cpp from August 27th 2023 onwards, as of commit [d0cee0d](https://github.com/ggerganov/llama.cpp/commit/d0cee0d36d5be95a0d9088b674dbb27354107221).
+
+ They are also compatible with many third-party UIs and libraries - please see the list at the top of this README.
+
+ ## Explanation of quantisation methods
+
+ <details>
+ <summary>Click to see details</summary>
+
+ The new methods available are:
+
+ * GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
+ * GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
+ * GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
+ * GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
+ * GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.
+
+ Refer to the Provided files table below to see what files use which methods, and how.
+ </details>
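+
+ As a rough sanity check on those figures (assuming the standard llama.cpp Q4_K block layout, with an fp16 scale and min per super-block): a Q4_K super-block holds 8 × 32 = 256 weights, stored as 256 × 4 bits of quants plus 8 × (6 + 6) bits of block scales and mins plus 2 × 16 bits of super-block scale and min, i.e. 1152 bits in total, and 1152 / 256 = 4.5 bpw.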
+
+ ## Provided files
+
+ | Name | Quant method | Bits | Size | Max RAM required | Use case |
+ | ---- | ---- | ---- | ---- | ---- | ----- |
+ | [AkaLlama-llama3-70b-v0.1.Q2_K.gguf](https://huggingface.co/mirlab/AkaLlama-llama3-70b-v0.1-GGUF/blob/main/AkaLlama-llama3-70b-v0.1.Q2_K.gguf) | Q2_K | 2 | 26.4 GB | 28.9 GB | smallest, significant quality loss - not recommended for most purposes |
+ | [AkaLlama-llama3-70b-v0.1.Q3_K_S.gguf](https://huggingface.co/mirlab/AkaLlama-llama3-70b-v0.1-GGUF/blob/main/AkaLlama-llama3-70b-v0.1.Q3_K_S.gguf) | Q3_K_S | 3 | 30.9 GB | 33.4 GB | very small, high quality loss |
+ | [AkaLlama-llama3-70b-v0.1.Q3_K_M.gguf](https://huggingface.co/mirlab/AkaLlama-llama3-70b-v0.1-GGUF/blob/main/AkaLlama-llama3-70b-v0.1.Q3_K_M.gguf) | Q3_K_M | 3 | 34.3 GB | 36.8 GB | very small, high quality loss |
+ | [AkaLlama-llama3-70b-v0.1.Q3_K_L.gguf](https://huggingface.co/mirlab/AkaLlama-llama3-70b-v0.1-GGUF/blob/main/AkaLlama-llama3-70b-v0.1.Q3_K_L.gguf) | Q3_K_L | 3 | 37.1 GB | 39.6 GB | small, substantial quality loss |
+ | [AkaLlama-llama3-70b-v0.1.Q4_K_S.gguf](https://huggingface.co/mirlab/AkaLlama-llama3-70b-v0.1-GGUF/blob/main/AkaLlama-llama3-70b-v0.1.Q4_K_S.gguf) | Q4_K_S | 4 | 40.3 GB | 42.8 GB | small, greater quality loss |
+ | [AkaLlama-llama3-70b-v0.1.Q4_K_M.gguf](https://huggingface.co/mirlab/AkaLlama-llama3-70b-v0.1-GGUF/blob/main/AkaLlama-llama3-70b-v0.1.Q4_K_M.gguf) | Q4_K_M | 4 | 42.5 GB | 45.0 GB | medium, balanced quality - recommended |
+ | [AkaLlama-llama3-70b-v0.1.Q5_K_S.gguf](https://huggingface.co/mirlab/AkaLlama-llama3-70b-v0.1-GGUF/blob/main/AkaLlama-llama3-70b-v0.1.Q5_K_S.gguf) | Q5_K_S | 5 | 48.7 GB | 50.2 GB | large, low quality loss - recommended |
+ | [AkaLlama-llama3-70b-v0.1.Q5_K_M.gguf](https://huggingface.co/mirlab/AkaLlama-llama3-70b-v0.1-GGUF/blob/main/AkaLlama-llama3-70b-v0.1.Q5_K_M.gguf) | Q5_K_M | 5 | 50.0 GB | 52.5 GB | large, very low quality loss - recommended |
+ | AkaLlama-llama3-70b-v0.1.Q6_K.gguf (split files, see below) | Q6_K | 6 | 54.4 GB | 59.9 GB | very large, extremely low quality loss |
+ | AkaLlama-llama3-70b-v0.1.Q8_0.gguf (split files, see below) | Q8_0 | 8 | 70.0 GB | 72.5 GB | very large, extremely low quality loss - not recommended |
+
+ **Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
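+
+ For example, with llama-cpp-python you can trade RAM for VRAM by changing `n_gpu_layers` (a sketch; 40 is an arbitrary illustrative value, not a tuned recommendation):
+
+ ```python
+ from llama_cpp import Llama
+
+ # Offload only part of the model: the offloaded layers live in VRAM,
+ # the remaining layers stay in system RAM.
+ llm = Llama(
+     model_path="./AkaLlama-llama3-70b-v0.1.Q4_K_M.gguf",
+     n_gpu_layers=40,  # fewer layers -> less VRAM used, more system RAM used
+ )
+ ```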
+
+ ### Q6_K and Q8_0 files are split and require joining
+
+ **Note:** HF does not support uploading files larger than 50 GB, so the Q6_K and Q8_0 files have been uploaded as split files.
+
+ ### q6_K
+ Please download:
+ * `AkaLlama-llama3-70b-v0.1.Q6_K.00001-of-00002.gguf`
+ * `AkaLlama-llama3-70b-v0.1.Q6_K.00002-of-00002.gguf`
+
+ ### q8_0
+ Please download:
+ * `AkaLlama-llama3-70b-v0.1.Q8_0.00001-of-00002.gguf`
+ * `AkaLlama-llama3-70b-v0.1.Q8_0.00002-of-00002.gguf`
+
+ To join the files, do the following:
+
+ Linux and macOS:
+ ```
+ cat AkaLlama-llama3-70b-v0.1.Q6_K.*-of-00002.gguf > AkaLlama-llama3-70b-v0.1.Q6_K.gguf && rm AkaLlama-llama3-70b-v0.1.Q6_K.*-of-00002.gguf
+ cat AkaLlama-llama3-70b-v0.1.Q8_0.*-of-00002.gguf > AkaLlama-llama3-70b-v0.1.Q8_0.gguf && rm AkaLlama-llama3-70b-v0.1.Q8_0.*-of-00002.gguf
+ ```
+ Windows command line:
+ ```
+ COPY /B AkaLlama-llama3-70b-v0.1.Q6_K.00001-of-00002.gguf + AkaLlama-llama3-70b-v0.1.Q6_K.00002-of-00002.gguf AkaLlama-llama3-70b-v0.1.Q6_K.gguf
+ del AkaLlama-llama3-70b-v0.1.Q6_K.00001-of-00002.gguf AkaLlama-llama3-70b-v0.1.Q6_K.00002-of-00002.gguf
+
+ COPY /B AkaLlama-llama3-70b-v0.1.Q8_0.00001-of-00002.gguf + AkaLlama-llama3-70b-v0.1.Q8_0.00002-of-00002.gguf AkaLlama-llama3-70b-v0.1.Q8_0.gguf
+ del AkaLlama-llama3-70b-v0.1.Q8_0.00001-of-00002.gguf AkaLlama-llama3-70b-v0.1.Q8_0.00002-of-00002.gguf
+ ```
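+
+ The join can also be done from Python instead of the shell (a sketch, equivalent to the `cat`/`COPY /B` commands above; it assumes both parts sit in the current directory):
+
+ ```python
+ import glob
+ import shutil
+
+ # Concatenate the split Q6_K parts, in order, into a single GGUF file.
+ parts = sorted(glob.glob("AkaLlama-llama3-70b-v0.1.Q6_K.*-of-00002.gguf"))
+ with open("AkaLlama-llama3-70b-v0.1.Q6_K.gguf", "wb") as joined:
+     for part in parts:
+         with open(part, "rb") as chunk:
+             # Stream the bytes so the 50+ GB parts are never loaded into memory at once.
+             shutil.copyfileobj(chunk, joined)
+ ```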
+
  ## Evaluation
 
  | Model | #Parameter | Quantized? | LogicKor |