pek111 committed
Commit a953725
1 Parent(s): fe31026

Update README.md

Files changed (1): README.md (+131 -33)
README.md CHANGED
@@ -103,56 +103,154 @@ Refer to the Provided Files table below to see what files use which methods, and
  | [tc-instruct-dpo.Q8_0.gguf](https://huggingface.co/pek111/TC-instruct-DPO-GGUF/blob/main/tc-instruct-dpo.Q8_0.gguf) | Q8_0 | 8 | 7.19 GB | very large, extremely low quality loss - not recommended |
  | [tc-instruct-dpo.QF16.gguf](https://huggingface.co/pek111/TC-instruct-DPO-GGUF/blob/main/tc-instruct-dpo.QF16.gguf) | QF16 | 16 | 13.53 GB | largest, lowest quality loss - highly not recommended |
 
- # Inference Code
-
- Here is example code using Hugging Face Transformers to run inference with the model (note: in 4-bit it requires around 5 GB of VRAM).
-
- Note: To use function calling, see the GitHub repo above.
-
- ```python
- # Requires the pytorch, transformers, bitsandbytes, sentencepiece, protobuf, and flash-attn packages
-
- import torch
- from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, GenerationConfig
- import time
-
- base_model_id = "tanamettpk/TC-instruct-DPO"
-
- input_text = """
  ### Instruction:
  ด่าฉันด้วยคำหยาบคายหน่อย

  ### Response:
  """

- model = AutoModelForCausalLM.from_pretrained(
-     base_model_id,
-     low_cpu_mem_usage=True,
-     return_dict=True,
-     quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # assumed 4-bit config, matching the ~5 GB VRAM note above
-     device_map={"": 0},
  )
- tokenizer = AutoTokenizer.from_pretrained(base_model_id)

- generation_config = GenerationConfig(
-     do_sample=True,
-     top_k=1,
-     temperature=0.5,
-     max_new_tokens=300,
-     repetition_penalty=1.1,
-     pad_token_id=tokenizer.eos_token_id)

- # Tokenize input
- inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

- # Generate outputs
- st_time = time.time()
- outputs = model.generate(**inputs, generation_config=generation_config)

- # Decode and print response
- response = tokenizer.decode(outputs[0], skip_special_tokens=True)
- print(f"Response time: {time.time() - st_time} seconds")
- print(response)
- ```

+ ## How to download GGUF files
+
+ **Note for manual downloaders:** You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file.
+
+ The following clients/libraries will automatically download models for you, providing a list of available models to choose from:
+
+ - LM Studio
+ - LoLLMS Web UI
+ - Faraday.dev
+
+ ### In `text-generation-webui`
+
+ Under Download Model, you can enter the model repo: pek111/TC-instruct-DPO-GGUF and below it, a specific filename to download, such as: tc-instruct-dpo.Q4_K_M.gguf.
+
+ Then click Download.
+
+ ### On the command line, including multiple files at once
+
+ I recommend using the `huggingface-hub` Python library:
+
+ ```shell
+ pip3 install 'huggingface-hub>=0.17.1'
+ ```
+
+ Then you can download any individual model file to the current directory, at high speed, with a command like this:
+
+ ```shell
+ huggingface-cli download pek111/TC-instruct-DPO-GGUF tc-instruct-dpo.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
+ ```
+
+ <details>
+ <summary>More advanced huggingface-cli download usage</summary>
+
+ You can also download multiple files at once with a pattern:
+
+ ```shell
+ huggingface-cli download pek111/TC-instruct-DPO-GGUF --local-dir . --local-dir-use-symlinks False --include='*Q4_K*gguf'
+ ```
+
+ For more documentation on downloading with `huggingface-cli`, please see: [HF -> Hub Python Library -> Download files -> Download from the CLI](https://huggingface.co/docs/huggingface_hub/guides/download#download-from-the-cli).
+
+ To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer`:
+
+ ```shell
+ pip3 install hf_transfer
+ ```
+
+ And set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`:
+
+ ```shell
+ HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download pek111/TC-instruct-DPO-GGUF tc-instruct-dpo.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
+ ```
+
+ Windows CLI users: Use `set HF_HUB_ENABLE_HF_TRANSFER=1` (Command Prompt) or `$env:HF_HUB_ENABLE_HF_TRANSFER=1` (PowerShell) before running the download command.
+ </details>
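If you prefer to download from Python rather than the shell, here is a minimal sketch using the `huggingface_hub` API (same repo and filename as the CLI example above; downloading into the current directory is an assumption):

```python
from huggingface_hub import hf_hub_download

# Download a single GGUF file into the current directory and print the resulting path.
local_path = hf_hub_download(
    repo_id="pek111/TC-instruct-DPO-GGUF",
    filename="tc-instruct-dpo.Q4_K_M.gguf",
    local_dir=".",
)
print(local_path)
```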
+
+ ## Example `llama.cpp` command
+
+ Make sure you are using `llama.cpp` from commit [d0cee0d36d5be95a0d9088b674dbb27354107221](https://github.com/ggerganov/llama.cpp/commit/d0cee0d36d5be95a0d9088b674dbb27354107221) or later.
+
+ ```shell
+ ./main -ngl 32 -m tc-instruct-dpo.Q4_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "{prompt}"
+ ```
+
+ Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.
+
+ Change `-c 4096` to the desired sequence length. For extended-sequence models - e.g. 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically.
+
+ If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`.
+
+ For other parameters and how to use them, please refer to [the llama.cpp documentation](https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md).
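If you are scripting these runs, a rough Python sketch that shells out to the same `./main` invocation shown above (the binary path, model filename, and prompt template are taken from this card; adjust them to your setup):

```python
import subprocess

# Invoke the llama.cpp CLI with the same flags as the example command above.
prompt = "### Instruction:\nด่าฉันด้วยคำหยาบคายหน่อย\n\n### Response:\n"
cmd = [
    "./main", "-ngl", "32",
    "-m", "tc-instruct-dpo.Q4_K_M.gguf",
    "--color", "-c", "4096",
    "--temp", "0.7", "--repeat_penalty", "1.1",
    "-n", "-1", "-p", prompt,
]
subprocess.run(cmd, check=True)
```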
+
+ ## How to run in `text-generation-webui`
+
+ Further instructions here: [text-generation-webui/docs/llama.cpp.md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/llama.cpp.md).
+
+ ## How to run from Python code
+
+ You can use GGUF models from Python using the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) or [ctransformers](https://github.com/marella/ctransformers) libraries.
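The subsections below cover llama-cpp-python. As a rough, untested sketch, `ctransformers` can load the same file like this (`model_type="mistral"` is an assumption based on the Mistral-family Typhoon base model):

```python
from ctransformers import AutoModelForCausalLM

# Assumption: the base model is Mistral-family, hence model_type="mistral".
llm = AutoModelForCausalLM.from_pretrained(
    "pek111/TC-instruct-DPO-GGUF",
    model_file="tc-instruct-dpo.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=35,  # set to 0 for CPU-only
)

print(llm("### Instruction:\nด่าฉันด้วยคำหยาบคายหน่อย\n\n### Response:\n", max_new_tokens=300))
```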
+
+ ### How to load this model from Python using llama-cpp-python
+
+ #### First install the package
+
+ ```shell
+ # Base llama-cpp-python with no GPU acceleration
+ pip install llama-cpp-python
+ # With NVIDIA CUDA acceleration
+ CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
+ # Or with OpenBLAS acceleration
+ CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
+ # Or with CLBlast acceleration
+ CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python
+ # Or with AMD ROCm GPU acceleration (Linux only)
+ CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
+ # Or with Metal GPU acceleration (macOS only)
+ CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
+
+ # On Windows, set the CMAKE_ARGS variable in PowerShell before installing; e.g. for NVIDIA CUDA:
+ $env:CMAKE_ARGS = "-DLLAMA_CUDA=on"
+ pip install llama_cpp_python --verbose
+
+ # If the build reports BLAS = 0, try reinstalling with these commands instead (Windows + CUDA)
+ # Command Prompt:
+ set CMAKE_ARGS="-DLLAMA_CUDA=on"
+ set FORCE_CMAKE=1
+ # PowerShell:
+ $env:CMAKE_ARGS = "-DLLAMA_CUDA=on"
+ $env:FORCE_CMAKE = 1
+ python -m pip install "llama_cpp_python>=0.2.26" --verbose --force-reinstall --no-cache-dir
+ ```
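After installing, a quick sanity check that the package imports and (optionally) that GPU offload is available. Treat the `llama_supports_gpu_offload` helper as an assumption for older llama-cpp-python versions; it is present in recent releases:

```python
import llama_cpp

# Confirm the installed version.
print(llama_cpp.__version__)
# Should print True on builds compiled with CUDA/Metal/ROCm support.
print(llama_cpp.llama_supports_gpu_offload())
```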
+
+ #### Simple example code to load one of these GGUF models
+
+ ```python
+ import llama_cpp
+
+ llm_cpp = llama_cpp.Llama(
+     model_path="tc-instruct-dpo.Q4_K_M.gguf",  # Path to the model
+     n_threads=10,      # CPU cores
+     n_batch=512,       # Should be between 1 and n_ctx; consider the amount of VRAM in your GPU
+     n_gpu_layers=35,   # Change this value based on your model and your GPU VRAM pool
+     n_ctx=4096,        # Max context length
+ )
+
+ prompt = """
  ### Instruction:
  ด่าฉันด้วยคำหยาบคายหน่อย

  ### Response:
  """

+ response = llm_cpp(
+     prompt=prompt,
+     max_tokens=1024,
+     temperature=0.5,
+     top_k=1,
+     repeat_penalty=1.1,
+     echo=True,
  )

+ print(response)
+ ```
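For token-by-token output, the same call also supports streaming; a minimal sketch reusing `llm_cpp` and `prompt` from the example above:

```python
# Stream the completion chunk by chunk instead of waiting for the full response.
for chunk in llm_cpp(
    prompt=prompt,
    max_tokens=1024,
    temperature=0.5,
    top_k=1,
    repeat_penalty=1.1,
    stream=True,
):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()
```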
 
 
 
 
 
+ ## How to use with LangChain
+
+ Here are guides on using llama-cpp-python or ctransformers with LangChain; a short llama-cpp-python sketch follows the links:
+
+ * [LangChain + llama-cpp-python](https://python.langchain.com/docs/integrations/llms/llamacpp)
+ * [LangChain + ctransformers](https://python.langchain.com/docs/integrations/providers/ctransformers)
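As a minimal sketch of the llama-cpp-python route (assuming a recent `langchain-community` package and the GGUF file downloaded as above):

```python
from langchain_community.llms import LlamaCpp

# Wrap the local GGUF file as a LangChain LLM.
llm = LlamaCpp(
    model_path="tc-instruct-dpo.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=35,   # set to 0 for CPU-only
    temperature=0.5,
    max_tokens=300,
)

print(llm.invoke("### Instruction:\nด่าฉันด้วยคำหยาบคายหน่อย\n\n### Response:\n"))
```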
 
 
 
  # Original model card: tanamettpk's TC Instruct DPO - Typhoon 7B