---
language:
- en
- ko
pipeline_tag: text-generation
inference: false
tags:
- facebook
- meta
- pytorch
- llama
- llama-2
- llama-2-chat
license: apache-2.0
library_name: peft
---
# komt-Llama-2-13b-hf-ggml
https://github.com/davidkim205/komt
This repository provides GGML quantizations of [korean Llama 2 13B](https://huggingface.co/davidkim205/komt-Llama-2-13b-hf), produced with [llama.cpp](https://github.com/ggerganov/llama.cpp) (4-bit and other quantization levels; see the quantization methods listed under Model Details).
Because these files use the same GGML format as TheBloke's releases, they work with the libraries and UIs listed below.
The following content references [TheBloke/Llama-2-13B-chat-GGML](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML#metas-llama-2-13b-chat-ggml).
GGML files are for CPU + GPU inference using [llama.cpp](https://github.com/ggerganov/llama.cpp) and libraries and UIs which support this format, such as:
* [KoboldCpp](https://github.com/LostRuins/koboldcpp), a powerful GGML web UI with full GPU acceleration out of the box. Especially good for storytelling.
* [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui), a great web UI with GPU acceleration via the c_transformers backend.
* [LM Studio](https://lmstudio.ai/), a fully featured local GUI. Supports full GPU accel on macOS. Also supports Windows, without GPU accel.
* [text-generation-webui](https://github.com/oobabooga/text-generation-webui), the most popular web UI. Requires extra steps to enable GPU accel via llama.cpp backend.
* [ctransformers](https://github.com/marella/ctransformers), a Python library with LangChain support and OpenAI-compatible AI server (a short loading sketch follows this list).
* [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), a Python library with OpenAI-compatible API server.
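
As a quick illustration, the sketch below loads one of the GGML files from this repository with ctransformers. It is an untested sketch, not part of the original instructions: GGML (pre-GGUF) files require an older ctransformers release, the `model_file` name is taken from the usage log further down and may differ from the file you actually download, and the generation parameters are illustrative.

```
# Hedged sketch: load a GGML quantization of this model with ctransformers.
# Assumes an older ctransformers release that still reads GGML (not GGUF) and
# that the repo contains a file named ggml-model-q8_0.bin (as in the usage
# log below); adjust model_file to the quantization you downloaded.
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "davidkim205/komt-Llama-2-13b-hf-ggml",
    model_file="ggml-model-q8_0.bin",
    model_type="llama",
)

# The prompt follows the template documented in the Prompt Template section below.
prompt = "### instruction: 자동차 종합(정기)검사 의무기간은 얼마인가요?\n\n### Response:"
print(llm(prompt, max_new_tokens=256, temperature=0.8))
```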
## Model Details
* **Model Developers** : davidkim (Changyeon Kim)
* **Repository** : https://github.com/davidkim205/komt
* **Quantization methods** : q4_0, q4_1, q5_0, q5_1, q2_k, q3_k, q3_k_m, q3_k_l, q4_k, q4_k_s, q4_k_m, q5_k, q5_k_s, q5_k_m, q8_0
## Prompt Template
```
### instruction: {prompt}
### Response:
```
Example (the prompt asks how long the mandatory period for comprehensive/periodic vehicle inspection is):
```
### instruction: 자동차 종합(정기)검사 의무기간은 얼마인가요?
### Response:
```
response:
```
### instruction: 자동차 종합(정기)검사 의무기간은 얼마인가요?
### Response:자동차 종합(정기)검사는 2년 1991년 7월 1일에 고시된 '자동차 보험료 조정기준'에서 취리로부터 제정된 기준 전 경량 이상차를 제외한 자동차 모든 승용자동차는 2년마다 필요하다. 이 법은 차량에 관계없이 2년마다 정기검사를 해야한다고 규정한다.
```
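
For reference, a minimal Python helper that wraps a question in this template might look like the following; the function name is illustrative and is not part of the komt repository.

```
# Illustrative helper (not from the komt repo): build a prompt in the
# "### instruction: ... ### Response:" format shown above.
def build_prompt(instruction: str) -> str:
    return f"### instruction: {instruction}\n\n### Response:"

# Example: the vehicle-inspection question used above.
print(build_prompt("자동차 종합(정기)검사 의무기간은 얼마인가요?"))
```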
## Usage
When using the original [llama.cpp](https://github.com/ggerganov/llama.cpp) (the example prompt asks which company distributed the Harry Potter film series):
```
make -j && ./main -m models/komt-Llama-2-13b-hf-ggml/ggml-model-q8_0.bin -p "### instruction: 영화 해리포터 시리즈 배급사가 어디야\n\n### Response:"
```
When using the modified llama.cpp for Korean multi-task (recommended):
Refer to https://github.com/davidkim205/komt/tree/main/llama.cpp
```
make -j && ./main -m models/komt-Llama-2-13b-hf-ggml/ggml-model-q8_0.bin -p "영화 해리포터 시리즈 배급사가 어디야"
```
response:
```
$ make -j && ./main -m models/komt-Llama-2-13b-hf-ggml/ggml-model-q8_0.bin -p "영화 해리포터 시리즈 배급사가 어디야"
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS
I LDFLAGS:
I CC: cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX: g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
make: Nothing to be done for 'default'.
main: build = 6 (01a61bf)
main: seed = 1692190774
llama.cpp: loading model from models/komt-Llama-2-13b-hf-ggml/ggml-model-q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 6912
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: mem required = 13152.13 MB (+ 400.00 MB per state)
llama_new_context_with_model: kv self size = 400.00 MB
llama_new_context_with_model: compute buffer total size = 75.35 MB
system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
### instruction: 영화 해리포터 시리즈 배급사가 어디야
### Response:워너 브라더스
해리포터(Harry Potter)는 J. K. 롤링이 쓴 판타지 소설이다. 1997년부터 2007년까지 총 7권으로 발표되었고, 전 세계적으로 많은 인기를 끌었다. 영국에서는 블룸버그(Bloomsbury), 미국에서는 워너 브라더스(Warner Brothers)가 각각 출판하였다. 현재 전 세계적으로 2억 4,000만 부 이상의 판매고를 올리고 있으며, 전 세계 대부분의 문학가들에게 영향을 주었다. ### check_end_of_text [end of text]
llama_print_timings: load time = 801.73 ms
llama_print_timings: sample time = 108.54 ms / 308 runs ( 0.35 ms per token, 2837.66 tokens per second)
llama_print_timings: prompt eval time = 2651.47 ms / 43 tokens ( 61.66 ms per token, 16.22 tokens per second)
llama_print_timings: eval time = 120629.25 ms / 307 runs ( 392.93 ms per token, 2.54 tokens per second)
llama_print_timings: total time = 123440.86 ms
```
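
As a Python alternative to the CLI above, the same q8_0 file can be run with llama-cpp-python. This is an untested sketch: GGML files load only in older llama-cpp-python releases (before the switch to GGUF), and the model path and sampling parameters simply mirror the `./main` invocation above. Because llama-cpp-python wraps the standard llama.cpp, the prompt is wrapped in the template rather than passed raw.

```
# Hedged sketch: run the q8_0 GGML file with an older, GGML-capable
# llama-cpp-python release. Path and sampling settings mirror ./main above.
from llama_cpp import Llama

llm = Llama(
    model_path="models/komt-Llama-2-13b-hf-ggml/ggml-model-q8_0.bin",
    n_ctx=512,
)

prompt = "### instruction: 영화 해리포터 시리즈 배급사가 어디야\n\n### Response:"
out = llm(prompt, max_tokens=256, temperature=0.8, top_p=0.95, repeat_penalty=1.1)
print(out["choices"][0]["text"].strip())
```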