---
inference: false
language:
  - ja
---
# weblab-10b-instruction-sft-GPTQ 

The original model is [weblab-10b-instruction-sft](https://huggingface.co/matsuo-lab/weblab-10b-instruction-sft), a Japanese-centric multilingual GPT-NeoX model with 10 billion parameters.

This model is a quantized (reduced-size) version of the original model (21.42 GB).

There are currently two well-known quantized versions of the original model.  
(1) GPTQ version (this model, 6.3 GB)  
It is smaller and runs faster, but its inference quality may be slightly worse than that of the original model.  
At least one GPU is currently required due to a limitation of the Accelerate library,  
so this model cannot be run on the free tier of Hugging Face Spaces.  
You need the AutoGPTQ library to use this model.  

(2) llama.cpp version (gguf) ([matsuolab-weblab-10b-instruction-sft-gguf](https://huggingface.co/mmnga/matsuolab-weblab-10b-instruction-sft-gguf), 6.03 GB)  
Created by mmnga.  
You can run the gguf model with llama.cpp on a CPU-only machine.  
However, the gguf model may be somewhat slower than the GPTQ model, especially on long text.
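
To put the file sizes quoted above in perspective, the short sketch below computes each quantized variant's size as a percentage of the original checkpoint. The sizes are the ones stated in this card; the function name `relative_size` is hypothetical and used only for illustration.

```python
# Sizes in GB as quoted in this model card.
ORIGINAL_GB = 21.42
variants = {"GPTQ": 6.3, "gguf": 6.03}


def relative_size(size_gb, original_gb=ORIGINAL_GB):
    """Return size_gb as a percentage of original_gb, rounded to one decimal."""
    return round(100 * size_gb / original_gb, 1)


for name, size_gb in variants.items():
    print(f"{name}: {relative_size(size_gb)}% of the original size")
```

Both variants come out at a little under 30% of the original size.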


### Sample code
 
```
pip install auto-gptq
```

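```
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Hugging Face repository and base file name of the 4-bit quantized weights.
quantized_model_dir = "dahara1/weblab-10b-instruction-sft-GPTQ"
model_basename = "gptq_model-4bit-128g"

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)

# Load the quantized weights onto the first GPU.
model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    model_basename=model_basename,
    use_safetensors=True,
    device="cuda:0",
)

# Instruction (Japanese): "Please name five Studio Ghibli works."
prompt_text = "スタジオジブリの作品を5つ教えてください"
# Template (Japanese): "Below is an instruction that describes a task.
# Write a response that appropriately fulfills the request.", followed by
# the "### 指示:" (instruction) and "### 応答:" (response) sections.
prompt_template = f'以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n### 指示:\n{prompt_text}\n\n### 応答:'

# Tokenize the prompt, move it to the GPU, and sample up to 100 new tokens.
tokens = tokenizer(prompt_template, return_tensors="pt").to("cuda:0").input_ids
output = model.generate(input_ids=tokens, max_new_tokens=100, do_sample=True, temperature=0.8)
# The decoded text contains the prompt followed by the generated response.
print(tokenizer.decode(output[0]))
```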
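
The prompt template in the sample code can be wrapped in a small helper so that different instructions reuse the same format, and so that the response can be separated from the echoed prompt after decoding. This is a minimal sketch: the names `build_prompt` and `extract_response` are hypothetical and not part of AutoGPTQ or transformers, and the extraction step assumes the decoded text still contains the `### 応答:` marker.

```python
# Template copied from the sample code; {instruction} is filled in per request.
PROMPT_TEMPLATE = (
    "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。"
    "\n\n### 指示:\n{instruction}\n\n### 応答:"
)
RESPONSE_MARKER = "### 応答:"


def build_prompt(instruction: str) -> str:
    """Insert a user instruction into the instruction/response template."""
    return PROMPT_TEMPLATE.format(instruction=instruction.strip())


def extract_response(decoded: str) -> str:
    """Return the text after the first response marker.

    If the marker is absent, the whole decoded string is returned (stripped).
    """
    return decoded.split(RESPONSE_MARKER, 1)[-1].strip()


prompt = build_prompt("スタジオジブリの作品を5つ教えてください")
print(prompt)
```

In the sample code, `build_prompt(prompt_text)` would take the place of the inline f-string, and `extract_response(tokenizer.decode(output[0]))` would return only the generated part.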

### Other documents
https://github.com/PanQiWei/AutoGPTQ/blob/main/docs/tutorial/01-Quick-Start.md

### Benchmark

The results below are preliminary; any blank cells are still being measured.  
Scores may also change after further tuning.

* **Japanese benchmark**

    - *We used [Stability-AI/lm-evaluation-harness + gptq patch](https://github.com/webbigdata-jp/lm-evaluation-harness) for evaluation.*
    - *The 4-task average accuracy is based on results of JCommonsenseQA-1.1, JNLI-1.1, MARC-ja-1.1, and JSQuAD-1.1.*
    - *The model is loaded with gptq_use_triton=True, and evaluation uses template version 0.3 with few-shot in-context learning.*
    - *The number of few-shot examples is 3, 3, 3, 2 (in the task order listed above).*
   
    | Model | Average | JCommonsenseQA | JNLI | MARC-ja | JSQuAD |
    | :-- | :-- | :-- | :-- | :-- | :-- |
    | weblab-10b-instruction-sft | 78.78 | 74.35 | 65.65 | 96.06 | 79.04 |
    | weblab-10b | 66.38 | 65.86 | 54.19 | 84.49 | 60.98 |
    | *weblab-10b-instruction-sft-GPTQ* | 69.72 | 74.53 | 41.70 | 89.95 | 72.69 |
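
The Average column is consistent with a simple unweighted mean of the four task scores. The sketch below recomputes it from the table values. It assumes half-up rounding to two decimals, which is not stated in the source, and the function name `four_task_average` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# Task scores from the table: JCommonsenseQA, JNLI, MARC-ja, JSQuAD.
# Kept as strings so Decimal parses them exactly.
scores = {
    "weblab-10b-instruction-sft": ["74.35", "65.65", "96.06", "79.04"],
    "weblab-10b": ["65.86", "54.19", "84.49", "60.98"],
    "weblab-10b-instruction-sft-GPTQ": ["74.53", "41.70", "89.95", "72.69"],
}


def four_task_average(task_scores):
    """Unweighted mean of the task scores, rounded half-up to two decimals."""
    total = sum(Decimal(s) for s in task_scores)
    mean = total / len(task_scores)
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


for model_name, task_scores in scores.items():
    print(f"{model_name}: {four_task_average(task_scores)}")
```

Decimal arithmetic is used instead of floats so that values ending in 5 at the third decimal place round predictably.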