---
inference: false
language:
- ja
---
# weblab-10b-instruction-sft-GPTQ
The original model is [weblab-10b-instruction-sft](https://huggingface.co/matsuo-lab/weblab-10b-instruction-sft), a Japanese-centric multilingual GPT-NeoX model with 10 billion parameters.
This model is a quantized (miniaturized) version of the original model.
There are currently two well-known quantization methods.
(1) GPTQ (this model)
The model is smaller and runs faster than the original, but inference quality may be slightly worse.
You need the AutoGPTQ library to use this model.
(2) llama.cpp ([matsuolab-weblab-10b-instruction-sft-gguf](https://huggingface.co/mmnga/matsuolab-weblab-10b-instruction-sft-gguf)), created by mmnga.
It runs on a CPU-only machine, but it is somewhat slow, especially with long text.
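
As a rough illustration of why the quantized model is smaller, the sketch below estimates weight storage for 16-bit weights versus 4-bit weights with a group size of 128 (the values suggested by the `gptq_model-4bit-128g` file name). The per-group overhead (one 16-bit scale and one 4-bit zero point per group) and the helper name `weight_gib` are assumptions made for this estimate; the real checkpoint will differ, since GPTQ typically leaves some layers such as the embeddings unquantized.

```python
# Rough weight-memory estimate. All constants here are illustrative assumptions,
# not measurements of the actual checkpoint.

def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Return the approximate weight storage in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

N_PARAMS = 10e9   # "10 billion parameters" from the model description
GROUP_SIZE = 128  # taken from the "128g" suffix of the file name

# Assumed overhead: one 16-bit scale and one 4-bit zero point per group of weights.
gptq_bits = 4 + (16 + 4) / GROUP_SIZE

fp16_gib = weight_gib(N_PARAMS, 16)
gptq_gib = weight_gib(N_PARAMS, gptq_bits)

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights : {gptq_gib:.1f} GiB")
```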
### sample code
At least one GPU is currently required due to a limitation of the Accelerate library,
so this model cannot be run on the free tier of Hugging Face Spaces.
Try it on [Google Colab (under development)](https://github.com/webbigdata-jp/python_sample/blob/main/weblab_10b_instruction_sft_GPTQ_sample.ipynb).
```
pip install auto-gptq
```
```
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

quantized_model_dir = "dahara1/weblab-10b-instruction-sft-GPTQ"
model_basename = "gptq_model-4bit-128g"

# Load the tokenizer and the quantized weights onto the first GPU.
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)
model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    model_basename=model_basename,
    use_safetensors=True,
    device="cuda:0")

# Instruction: "Please name five Studio Ghibli works."
prompt_text = "スタジオジブリの作品を5つ教えてください"
# Template: "Below is an instruction that describes a task.
# Write a response that appropriately fulfills the request."
prompt_template = f'以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n### 指示:\n{prompt_text}\n\n### 応答:'

# Tokenize on the GPU, sample up to 100 new tokens, and print the decoded text.
tokens = tokenizer(prompt_template, return_tensors="pt").to("cuda:0").input_ids
output = model.generate(input_ids=tokens, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0]))
```
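
The decoded output above contains the full prompt followed by the generated text. A minimal sketch of separating the two is shown below. The helper names `build_prompt` and `extract_response` are hypothetical, and the end-of-text token `<|endoftext|>` is an assumption based on the usual GPT-NeoX tokenizer; check `tokenizer.eos_token` for the actual value.

```python
RESPONSE_MARKER = "### 応答:"

def build_prompt(instruction: str) -> str:
    """Wrap an instruction in the prompt template used in the sample code."""
    return ("以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。"
            f"\n\n### 指示:\n{instruction}\n\n{RESPONSE_MARKER}")

def extract_response(decoded: str, eos_token: str = "<|endoftext|>") -> str:
    """Return the text after the response marker, cut at the end-of-text token.

    The default eos_token is an assumption; pass tokenizer.eos_token if it differs.
    If the marker is missing, the whole decoded string is used.
    """
    _, sep, tail = decoded.partition(RESPONSE_MARKER)
    text = tail if sep else decoded
    return text.split(eos_token)[0].strip()

# Hypothetical decoded string standing in for tokenizer.decode(output[0]).
sample_decoded = build_prompt("スタジオジブリの作品を5つ教えてください") + " (generated text)<|endoftext|>"
print(extract_response(sample_decoded))
```

With the sample code, this would be called as `extract_response(tokenizer.decode(output[0]))`.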
### Other documents
https://github.com/PanQiWei/AutoGPTQ/blob/main/docs/tutorial/01-Quick-Start.md
### Benchmark
The results below are preliminary. Cells marked "-" are still being measured.
Scores may also change after further tuning.
* **Japanese benchmark**
- *We used [Stability-AI/lm-evaluation-harness + gptq patch](https://github.com/webbigdata-jp/lm-evaluation-harness) for evaluation.*
- *The 4-task average accuracy is based on results of JCommonsenseQA-1.1, JNLI-1.1, MARC-ja-1.1, and JSQuAD-1.1.*
- *Model loading is performed with gptq_use_triton=True, and evaluation is performed with template version 0.3 using few-shot in-context learning.*
- *The numbers of few-shot examples are 3, 3, 3, and 2, respectively.*
| Model | Average | JCommonsenseQA | JNLI | MARC-ja | JSQuAD |
| :-- | :-- | :-- | :-- | :-- | :-- |
| weblab-10b-instruction-sft | 78.78 | 74.35 | 65.65 | 96.06 | 79.04 |
| weblab-10b | 66.38 | 65.86 | 54.19 | 84.49 | 60.98 |
| *weblab-10b-instruction-sft-GPTQ* | - | 74.53 | 41.70 | - | 72.69 |
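
The Average column is the plain mean of the four task scores. The sketch below recomputes it from the table values, treating a "-" cell as missing so that no average is reported for an incomplete row. The names `SCORES` and `four_task_average` are hypothetical and used only for this illustration.

```python
# Scores copied from the table above, in the order
# JCommonsenseQA, JNLI, MARC-ja, JSQuAD. None marks a cell still being measured.
SCORES = {
    "weblab-10b-instruction-sft": (74.35, 65.65, 96.06, 79.04),
    "weblab-10b": (65.86, 54.19, 84.49, 60.98),
    "weblab-10b-instruction-sft-GPTQ": (74.53, 41.70, None, 72.69),
}

def four_task_average(scores):
    """Return the mean of the task scores, or None if any score is missing."""
    if any(s is None for s in scores):
        return None
    return sum(scores) / len(scores)

for model, scores in SCORES.items():
    avg = four_task_average(scores)
    print(model, "-" if avg is None else f"{avg:.2f}")
```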