Text Generation · Transformers · GGUF · PyTorch · Safetensors · mistral
quantized · 2-bit · 3-bit · 4-bit precision · 5-bit · 6-bit · 8-bit precision
llama · en
dataset:HuggingFaceH4/ultrafeedback_binarized · dataset:allenai/tulu-v2-sft-mixture
arxiv:2305.18290 · arxiv:2311.10702
Inference Endpoints · has_space · text-generation-inference
38d3509ce766513f2f506d576c12c8ee6150c9208cc80644b969f9b9a2f7d5d5
README.md CHANGED
````diff
@@ -10,6 +10,7 @@ tags:
 - GGUF
 - transformers
 - pytorch
+- safetensors
 - llama
 - text-generation
 - en
````
````diff
@@ -98,7 +99,7 @@ pip3 install huggingface-hub
 Then you can download any individual model file to the current directory, at high speed, with a command like this:
 
 ```shell
-huggingface-cli download MaziyarPanahi/tulu-2-dpo-13b-GGUF tulu-2-dpo-13b
+huggingface-cli download MaziyarPanahi/tulu-2-dpo-13b-GGUF tulu-2-dpo-13b.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
 ```
 </details>
 <details>
````
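For reference, the same file can also be fetched from Python with the `huggingface_hub` package installed above; a minimal sketch (the repo and file names are taken from the command in this hunk, everything else is illustrative):

```python
# Minimal sketch: Python equivalent of the huggingface-cli command above.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="MaziyarPanahi/tulu-2-dpo-13b-GGUF",
    filename="tulu-2-dpo-13b.Q4_K_M.gguf",  # any .gguf file from the repo works
    local_dir=".",                          # save into the current directory
)
print(model_path)  # local path of the downloaded file
```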
````diff
@@ -121,7 +122,7 @@ pip3 install hf_transfer
 And set environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`:
 
 ```shell
-HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download MaziyarPanahi/tulu-2-dpo-13b-GGUF tulu-2-dpo-13b
+HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download MaziyarPanahi/tulu-2-dpo-13b-GGUF tulu-2-dpo-13b.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
 ```
 
 Windows Command Line users: You can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command.
````
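The `hf_transfer` accelerator can also be enabled from Python; since `huggingface_hub` reads the flag when it is imported, set the variable before the import. A small sketch, reusing the download call from the previous note:

```python
# Sketch: enable hf_transfer before importing huggingface_hub.
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="MaziyarPanahi/tulu-2-dpo-13b-GGUF",
    filename="tulu-2-dpo-13b.Q4_K_M.gguf",
    local_dir=".",
)
```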
````diff
@@ -132,7 +133,7 @@ Windows Command Line users: You can set the environment variable by running `set
 Make sure you are using `llama.cpp` from commit [d0cee0d](https://github.com/ggerganov/llama.cpp/commit/d0cee0d36d5be95a0d9088b674dbb27354107221) or later.
 
 ```shell
-./main -ngl 35 -m tulu-2-dpo-13b
+./main -ngl 35 -m tulu-2-dpo-13b.Q4_K_M.gguf --color -c 32768 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<|im_start|>system
 {system_message}<|im_end|>
 <|im_start|>user
 {prompt}<|im_end|>
````
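The `-p` argument in the command above embeds a ChatML-style template with `{system_message}` and `{prompt}` placeholders. A hypothetical sketch of filling it in from Python, e.g. when scripting `llama.cpp` invocations; the message contents are made up, and the trailing assistant turn is assumed since the hunk is cut off before it:

```python
# Sketch: build the prompt string passed via the -p flag above.
system_message = "You are a helpful assistant."   # illustrative
prompt = "Summarise what a GGUF file is."         # illustrative

full_prompt = (
    f"<|im_start|>system\n{system_message}<|im_end|>\n"
    f"<|im_start|>user\n{prompt}<|im_end|>\n"
    "<|im_start|>assistant\n"  # assumed continuation of the template
)
print(full_prompt)
```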
````diff
@@ -149,7 +150,7 @@ For other parameters and how to use them, please refer to [the llama.cpp documen
 
 ## How to run in `text-generation-webui`
 
-Further instructions can be found in the text-generation-webui documentation, here: [text-generation-webui/docs/04 ‐ Model Tab.md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/04%20
+Further instructions can be found in the text-generation-webui documentation, here: [text-generation-webui/docs/04 ‐ Model Tab.md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/04%20-%20Model%20Tab.md#llamacpp).
 
 ## How to run from Python code
 
````
````diff
@@ -157,7 +158,7 @@ You can use GGUF models from Python using the [llama-cpp-python](https://github.
 
 ### How to load this model in Python code, using llama-cpp-python
 
-For full documentation, please see: [llama-cpp-python docs](https://
+For full documentation, please see: [llama-cpp-python docs](https://github.com/abetlen/llama-cpp-python/).
 
 #### First install the package
 
````
````diff
@@ -189,7 +190,7 @@ from llama_cpp import Llama
 
 # Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
 llm = Llama(
-  model_path="./tulu-2-dpo-13b
+  model_path="./tulu-2-dpo-13b.Q4_K_M.gguf",  # Download the model file first
   n_ctx=32768, # The max sequence length to use - note that longer sequence lengths require much more resources
   n_threads=8, # The number of CPU threads to use, tailor to your system and the resulting performance
   n_gpu_layers=35 # The number of layers to offload to GPU, if you have GPU acceleration available
````
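Once constructed as above, the `llm` object can be called directly for a plain completion (the card's own `output = llm(...)` example continues past this hunk). A minimal sketch; the prompt reuses the template from the `llama.cpp` section, and the `max_tokens`/`stop` values are illustrative:

```python
# Sketch: simple text completion with the llm constructed above.
output = llm(
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nExplain GGUF quantization in one sentence.<|im_end|>\n"
    "<|im_start|>assistant\n",
    max_tokens=256,        # illustrative generation length
    stop=["<|im_end|>"],   # stop when the model closes its turn
    echo=False,            # return only the completion, not the prompt
)
print(output["choices"][0]["text"])
```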
````diff
@@ -209,7 +210,7 @@ output = llm(
 
 # Chat Completion API
 
-llm = Llama(model_path="./tulu-2-dpo-13b
+llm = Llama(model_path="./tulu-2-dpo-13b.Q4_K_M.gguf", chat_format="llama-2")  # Set chat_format according to the model you are using
 llm.create_chat_completion(
     messages = [
         {"role": "system", "content": "You are a story writing assistant."},
````
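`create_chat_completion` returns an OpenAI-style dictionary; a short sketch of completing the call above and reading the assistant reply (the user message is illustrative):

```python
# Sketch: finish the chat-completion call above and print the reply.
# `llm` is the Llama instance constructed in the preceding hunk.
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a story writing assistant."},
        {"role": "user", "content": "Write a short story about llamas."},  # illustrative
    ]
)
print(response["choices"][0]["message"]["content"])
```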