---
language:
- en
- ko
pipeline_tag: text-generation
inference: false
tags:
- facebook
- meta
- pytorch
- llama
- llama-2
- llama-2-chat
license: apache-2.0
library_name: peft
---
# komt-Llama-2-13b-hf-ggml

https://github.com/davidkim205/komt

This repository contains quantized versions of [korean Llama 2 13B](https://huggingface.co/davidkim205/komt-Llama-2-13b-hf), converted to GGML with [llama.cpp](https://github.com/ggerganov/llama.cpp) at 4-bit and other quantization levels (see Model Details below).
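
The conversion follows llama.cpp's standard GGML-era workflow: convert the Hugging Face checkpoint to an f16 GGML file, then quantize it. A minimal sketch driving the two llama.cpp tools from Python (paths are assumptions; run from a built llama.cpp checkout that still targets GGML):

```python
# Sketch: HF checkpoint -> f16 GGML -> quantized GGML, using the GGML-era
# llama.cpp tools (convert.py and ./quantize). All paths are assumptions.
import subprocess

MODEL_DIR = "models/komt-Llama-2-13b-hf"        # Hugging Face checkpoint
F16_FILE = f"{MODEL_DIR}/ggml-model-f16.bin"    # written by convert.py
OUT_FILE = f"{MODEL_DIR}/ggml-model-q4_0.bin"   # 4-bit result

# 1) Convert the checkpoint to an f16 GGML file.
subprocess.run(["python", "convert.py", MODEL_DIR], check=True)

# 2) Quantize it; any method listed under Model Details works the same way.
subprocess.run(["./quantize", F16_FILE, OUT_FILE, "q4_0"], check=True)
```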


Because our files use the same GGML format as TheBloke's releases, they work with the libraries and UIs listed below. The following section is adapted from [TheBloke/Llama-2-13B-chat-GGML](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML#metas-llama-2-13b-chat-ggml).

GGML files are for CPU + GPU inference using [llama.cpp](https://github.com/ggerganov/llama.cpp) and libraries and UIs which support this format, such as:
* [KoboldCpp](https://github.com/LostRuins/koboldcpp), a powerful GGML web UI with full GPU acceleration out of the box. Especially good for storytelling.
* [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui), a great web UI with GPU acceleration via the ctransformers backend.
* [LM Studio](https://lmstudio.ai/), a fully featured local GUI. Supports full GPU accel on macOS. Also supports Windows, without GPU accel.
* [text-generation-webui](https://github.com/oobabooga/text-generation-webui), the most popular web UI. Requires extra steps to enable GPU accel via llama.cpp backend.
* [ctransformers](https://github.com/marella/ctransformers), a Python library with LangChain support and an OpenAI-compatible AI server.
* [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), a Python library with an OpenAI-compatible API server.
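
As a quick compatibility check, the files can be loaded directly with the Python libraries above. A minimal sketch using ctransformers (the local file path is an assumption; the prompt follows the template described under Prompt Template below):

```python
# Minimal ctransformers sketch; the GGML file path is an assumption.
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "models/komt-Llama-2-13b-hf-ggml/ggml-model-q4_0.bin",
    model_type="llama",
)
prompt = "### instruction: ์ž๋™์ฐจ ์ข…ํ•ฉ(์ •๊ธฐ)๊ฒ€์‚ฌ ์˜๋ฌด๊ธฐ๊ฐ„์€ ์–ผ๋งˆ์ธ๊ฐ€์š”?\n\n### Response: "
print(llm(prompt, max_new_tokens=256))
```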


## Model Details

* **Model Developers**: davidkim (Changyeon Kim)
* **Repository**: https://github.com/davidkim205/komt
* **Quantization methods**: q4_0, q4_1, q5_0, q5_1, q2_k, q3_k, q3_k_m, q3_k_l, q4_k, q4_k_s, q4_k_m, q5_k, q5_k_s, q5_k_m, q8_0

## Prompt Template
```
### instruction: {prompt}

### Response: 
```
Examples (the Korean instruction asks about the mandatory interval for comprehensive periodic vehicle inspections):
```
### instruction: ์ž๋™์ฐจ ์ข…ํ•ฉ(์ •๊ธฐ)๊ฒ€์‚ฌ ์˜๋ฌด๊ธฐ๊ฐ„์€ ์–ผ๋งˆ์ธ๊ฐ€์š”?

### Response:

```
response:
``` 
### instruction: ์ž๋™์ฐจ ์ข…ํ•ฉ(์ •๊ธฐ)๊ฒ€์‚ฌ ์˜๋ฌด๊ธฐ๊ฐ„์€ ์–ผ๋งˆ์ธ๊ฐ€์š”?

### Response:์ž๋™์ฐจ ์ข…ํ•ฉ(์ •๊ธฐ)๊ฒ€์‚ฌ๋Š” 2๋…„
1991๋…„ 7์›” 1์ผ์— ๊ณ ์‹œ๋œ '์ž๋™์ฐจ ๋ณดํ—˜๋ฃŒ ์กฐ์ •๊ธฐ์ค€'์—์„œ ์ทจ๋ฆฌ๋กœ๋ถ€ํ„ฐ ์ œ์ •๋œ ๊ธฐ์ค€ ์ƒ ๊ฒฝ๋Ÿ‰ ์‚ด์ˆ˜์ฐจ๋ฅผ ์ œ์™ธํ•œ ์ž๋™์ฐจ ๋ชจ๋“  ์Šน์šฉ์ž๋™์ฐจ๋Š” 2๋…„๋งˆ๋‹ค ํ•„์š”ํ•˜๋‹ค. ์ด ๋ฒ•์€ ์ฐจ๋Ÿ‰์— ๊ด€๊ณ„์—†์ด 2๋…„๋งˆ๋‹ค ์ •๊ธฐ๊ฒ€์‚ฌ๋ฅผ ํ•ด์•ผํ•œ๋‹ค๊ณ  ๊ทœ์ œํ–ˆ๋‹ค.
```
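
Since the template is plain string formatting, a small helper keeps prompts consistent. `build_prompt` below is a hypothetical convenience, not part of the repository:

```python
# Hypothetical helper wrapping an instruction in the komt prompt template.
def build_prompt(instruction: str) -> str:
    return f"### instruction: {instruction}\n\n### Response: "

# The example instruction asks about the mandatory vehicle-inspection interval.
print(build_prompt("์ž๋™์ฐจ ์ข…ํ•ฉ(์ •๊ธฐ)๊ฒ€์‚ฌ ์˜๋ฌด๊ธฐ๊ฐ„์€ ์–ผ๋งˆ์ธ๊ฐ€์š”?"))
```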


## Usage

When using the original [llama.cpp](https://github.com/ggerganov/llama.cpp) (the Korean prompt asks which company distributes the Harry Potter film series):
``` 
make -j && ./main -m models/komt-Llama-2-13b-hf-ggml/ggml-model-q8_0.bin -p "### instruction: ์˜ํ™” ํ•ด๋ฆฌํฌํ„ฐ ์‹œ๋ฆฌ์ฆˆ ๋ฐฐ๊ธ‰์‚ฌ๊ฐ€ ์–ด๋””์•ผ\n\n### Response:"
```
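
The same call can also be made from Python with [llama-cpp-python](https://github.com/abetlen/llama-cpp-python). A minimal sketch, assuming a GGML-era release (newer versions read only GGUF files) and the paths used above:

```python
# llama-cpp-python equivalent of the CLI call above; paths and sampling
# values mirror the log further down and are assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="models/komt-Llama-2-13b-hf-ggml/ggml-model-q8_0.bin",
    n_ctx=512,     # context size, as reported in the log
    n_threads=8,   # as in the log's system_info line
)
prompt = "### instruction: ์˜ํ™” ํ•ด๋ฆฌํฌํ„ฐ ์‹œ๋ฆฌ์ฆˆ ๋ฐฐ๊ธ‰์‚ฌ๊ฐ€ ์–ด๋””์•ผ\n\n### Response:"
out = llm(prompt, max_tokens=256, temperature=0.8, top_p=0.95, repeat_penalty=1.1)
print(out["choices"][0]["text"])
```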
When using the modified llama.cpp for Korean multi-task (recommended), see https://github.com/davidkim205/komt/tree/main/llama.cpp:
``` 
make -j && ./main -m models/komt-Llama-2-13b-hf-ggml/ggml-model-q8_0.bin -p "์˜ํ™” ํ•ด๋ฆฌํฌํ„ฐ ์‹œ๋ฆฌ์ฆˆ ๋ฐฐ๊ธ‰์‚ฌ๊ฐ€ ์–ด๋””์•ผ"
```
response:
``` 
 $ make -j && ./main -m models/komt-Llama-2-13b-hf-ggml/ggml-model-q8_0.bin -p "์˜ํ™” ํ•ด๋ฆฌํฌํ„ฐ ์‹œ๋ฆฌ์ฆˆ ๋ฐฐ๊ธ‰์‚ฌ๊ฐ€ ์–ด๋””์•ผ"
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS
I LDFLAGS:
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

make: Nothing to be done for 'default'.
main: build = 6 (01a61bf)
main: seed  = 1692190774
llama.cpp: loading model from models/komt-Llama-2-13b-hf-ggml/ggml-model-q8_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 6912
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_head_kv  = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.11 MB
llama_model_load_internal: mem required  = 13152.13 MB (+  400.00 MB per state)
llama_new_context_with_model: kv self size  =  400.00 MB
llama_new_context_with_model: compute buffer total size =   75.35 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


 ### instruction: ์˜ํ™” ํ•ด๋ฆฌํฌํ„ฐ ์‹œ๋ฆฌ์ฆˆ ๋ฐฐ๊ธ‰์‚ฌ๊ฐ€ ์–ด๋””์•ผ

### Response:์›Œ๋„ˆ ๋ธŒ๋ผ๋”์Šค
ํ•ด๋ฆฌํฌํ„ฐ(Harry Potter)๋Š” J. K. ๋กค๋ง์ด ์“ด ํŒํƒ€์ง€ ์†Œ์„ค์ด๋‹ค. 1997๋…„๋ถ€ํ„ฐ 2007๋…„๊นŒ์ง€ ์ด 7๊ถŒ์œผ๋กœ ๋ฐœํ–‰๋˜์—ˆ๊ณ , ์ „ ์„ธ๊ณ„์ ์œผ๋กœ ๋งŽ์€ ์ธ๊ธฐ๋ฅผ ๋Œ์—ˆ๋‹ค. ์˜๊ตญ์—์„œ๋Š” ๋ธ”๋ฃธ๋ฒ„๊ทธ(Bloomsbury), ๋ฏธ๊ตญ์—์„œ๋Š” ์›Œ๋„ˆ ๋ธŒ๋ผ๋”์Šค(Warner Brothers)๊ฐ€ ๊ฐ๊ฐ ์ถœํŒํ•˜์˜€๋‹ค. ํ˜„์žฌ ์ „ ์„ธ๊ณ„์ ์œผ๋กœ 2์–ต 4,000๋งŒ ๋ถ€ ์ด์ƒ์˜ ํŒ๋งค๊ณ ๋ฅผ ์˜ฌ๋ฆฌ๊ณ  ์žˆ์œผ๋ฉฐ, ์ „ ์„ธ๊ณ„ ๋Œ€๋ถ€๋ถ„์˜ ๋ฌธํ•™๊ฐ€๋“ค์—๊ฒŒ ์˜ํ–ฅ์„ ์ฃผ์—ˆ๋‹ค. ### check_end_of_text [end of text]

llama_print_timings:        load time =   801.73 ms
llama_print_timings:      sample time =   108.54 ms /   308 runs   (    0.35 ms per token,  2837.66 tokens per second)
llama_print_timings: prompt eval time =  2651.47 ms /    43 tokens (   61.66 ms per token,    16.22 tokens per second)
llama_print_timings:        eval time = 120629.25 ms /   307 runs   (  392.93 ms per token,     2.54 tokens per second)
llama_print_timings:       total time = 123440.86 ms

```