Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)


Jellyfish-8B - GGUF
- Model creator: https://huggingface.co/NECOUDBFM/
- Original model: https://huggingface.co/NECOUDBFM/Jellyfish-8B/


| Name | Quant method | Size |
| ---- | ---- | ---- |
| [Jellyfish-8B.Q2_K.gguf](https://huggingface.co/RichardErkhov/NECOUDBFM_-_Jellyfish-8B-gguf/blob/main/Jellyfish-8B.Q2_K.gguf) | Q2_K | 2.96GB |
| [Jellyfish-8B.IQ3_XS.gguf](https://huggingface.co/RichardErkhov/NECOUDBFM_-_Jellyfish-8B-gguf/blob/main/Jellyfish-8B.IQ3_XS.gguf) | IQ3_XS | 3.28GB |
| [Jellyfish-8B.IQ3_S.gguf](https://huggingface.co/RichardErkhov/NECOUDBFM_-_Jellyfish-8B-gguf/blob/main/Jellyfish-8B.IQ3_S.gguf) | IQ3_S | 3.43GB |
| [Jellyfish-8B.Q3_K_S.gguf](https://huggingface.co/RichardErkhov/NECOUDBFM_-_Jellyfish-8B-gguf/blob/main/Jellyfish-8B.Q3_K_S.gguf) | Q3_K_S | 3.41GB |
| [Jellyfish-8B.IQ3_M.gguf](https://huggingface.co/RichardErkhov/NECOUDBFM_-_Jellyfish-8B-gguf/blob/main/Jellyfish-8B.IQ3_M.gguf) | IQ3_M | 3.52GB |
| [Jellyfish-8B.Q3_K.gguf](https://huggingface.co/RichardErkhov/NECOUDBFM_-_Jellyfish-8B-gguf/blob/main/Jellyfish-8B.Q3_K.gguf) | Q3_K | 3.74GB |
| [Jellyfish-8B.Q3_K_M.gguf](https://huggingface.co/RichardErkhov/NECOUDBFM_-_Jellyfish-8B-gguf/blob/main/Jellyfish-8B.Q3_K_M.gguf) | Q3_K_M | 3.74GB |
| [Jellyfish-8B.Q3_K_L.gguf](https://huggingface.co/RichardErkhov/NECOUDBFM_-_Jellyfish-8B-gguf/blob/main/Jellyfish-8B.Q3_K_L.gguf) | Q3_K_L | 4.03GB |
| [Jellyfish-8B.IQ4_XS.gguf](https://huggingface.co/RichardErkhov/NECOUDBFM_-_Jellyfish-8B-gguf/blob/main/Jellyfish-8B.IQ4_XS.gguf) | IQ4_XS | 4.18GB |
| [Jellyfish-8B.Q4_0.gguf](https://huggingface.co/RichardErkhov/NECOUDBFM_-_Jellyfish-8B-gguf/blob/main/Jellyfish-8B.Q4_0.gguf) | Q4_0 | 4.34GB |
| [Jellyfish-8B.IQ4_NL.gguf](https://huggingface.co/RichardErkhov/NECOUDBFM_-_Jellyfish-8B-gguf/blob/main/Jellyfish-8B.IQ4_NL.gguf) | IQ4_NL | 4.38GB |
| [Jellyfish-8B.Q4_K_S.gguf](https://huggingface.co/RichardErkhov/NECOUDBFM_-_Jellyfish-8B-gguf/blob/main/Jellyfish-8B.Q4_K_S.gguf) | Q4_K_S | 4.37GB |
| [Jellyfish-8B.Q4_K.gguf](https://huggingface.co/RichardErkhov/NECOUDBFM_-_Jellyfish-8B-gguf/blob/main/Jellyfish-8B.Q4_K.gguf) | Q4_K | 4.58GB |
| [Jellyfish-8B.Q4_K_M.gguf](https://huggingface.co/RichardErkhov/NECOUDBFM_-_Jellyfish-8B-gguf/blob/main/Jellyfish-8B.Q4_K_M.gguf) | Q4_K_M | 4.58GB |
| [Jellyfish-8B.Q4_1.gguf](https://huggingface.co/RichardErkhov/NECOUDBFM_-_Jellyfish-8B-gguf/blob/main/Jellyfish-8B.Q4_1.gguf) | Q4_1 | 4.78GB |
| [Jellyfish-8B.Q5_0.gguf](https://huggingface.co/RichardErkhov/NECOUDBFM_-_Jellyfish-8B-gguf/blob/main/Jellyfish-8B.Q5_0.gguf) | Q5_0 | 5.21GB |
| [Jellyfish-8B.Q5_K_S.gguf](https://huggingface.co/RichardErkhov/NECOUDBFM_-_Jellyfish-8B-gguf/blob/main/Jellyfish-8B.Q5_K_S.gguf) | Q5_K_S | 5.21GB |
| [Jellyfish-8B.Q5_K.gguf](https://huggingface.co/RichardErkhov/NECOUDBFM_-_Jellyfish-8B-gguf/blob/main/Jellyfish-8B.Q5_K.gguf) | Q5_K | 5.34GB |
| [Jellyfish-8B.Q5_K_M.gguf](https://huggingface.co/RichardErkhov/NECOUDBFM_-_Jellyfish-8B-gguf/blob/main/Jellyfish-8B.Q5_K_M.gguf) | Q5_K_M | 5.34GB |
| [Jellyfish-8B.Q5_1.gguf](https://huggingface.co/RichardErkhov/NECOUDBFM_-_Jellyfish-8B-gguf/blob/main/Jellyfish-8B.Q5_1.gguf) | Q5_1 | 5.65GB |
| [Jellyfish-8B.Q6_K.gguf](https://huggingface.co/RichardErkhov/NECOUDBFM_-_Jellyfish-8B-gguf/blob/main/Jellyfish-8B.Q6_K.gguf) | Q6_K | 6.14GB |
| [Jellyfish-8B.Q8_0.gguf](https://huggingface.co/RichardErkhov/NECOUDBFM_-_Jellyfish-8B-gguf/blob/main/Jellyfish-8B.Q8_0.gguf) | Q8_0 | 7.95GB |

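To run one of these files locally, a minimal sketch (assumptions: the `huggingface_hub` and `llama-cpp-python` packages are installed, and the Q4_K_M file is an arbitrary pick from the table above):

```python
# Download one quant (Q4_K_M picked arbitrarily) and run it with llama-cpp-python.
# Assumes `pip install huggingface_hub llama-cpp-python`; swap the filename for
# whichever quant from the table fits your hardware.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="RichardErkhov/NECOUDBFM_-_Jellyfish-8B-gguf",
    filename="Jellyfish-8B.Q4_K_M.gguf",
)

llm = Llama(model_path=model_path, n_ctx=4096)
out = llm("Hello, world.", max_tokens=128)
print(out["choices"][0]["text"])
```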
Original model description:
---
license: cc-by-nc-4.0
language:
- en
---
# Jellyfish-8B
<!-- Provide a quick summary of what the model is/does. -->
<!--
<img src="https://i.imgur.com/d8Bl04i.png" alt="PicToModel" width="330"/>
-->
<img src="https://i.imgur.com/E1vqCIw.png" alt="PicToModel" width="330"/>

Jellyfish models with other sizes are available here:
[Jellyfish-7B](https://huggingface.co/NECOUDBFM/Jellyfish-7B)
[Jellyfish-13B](https://huggingface.co/NECOUDBFM/Jellyfish-13B)

## Model Details
Jellyfish-8B is a large language model with 8 billion parameters.
We fine-tuned the [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model using a subset of the [Jellyfish-Instruct](https://huggingface.co/datasets/NECOUDBFM/Jellyfish-Instruct) dataset.

<!-- Jellyfish-7B vs GPT-3.5-turbo winning rate by GPT-4 evaluation is 56.36%. -->

More details about the model can be found in the [Jellyfish paper](https://arxiv.org/abs/2312.01678).

- **Developed by:** Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada
- **Contact:** dongyuyang@nec.com
- **Funded by:** NEC Corporation, Osaka University
- **Language(s) (NLP):** English
- **License:** Non-Commercial Creative Commons license (CC BY-NC 4.0)
- **Finetuned from model:** [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)

## Citation

If you find our work useful, please give us credit by citing:

```
@article{zhang2023jellyfish,
  title={Jellyfish: A Large Language Model for Data Preprocessing},
  author={Zhang, Haochen and Dong, Yuyang and Xiao, Chuan and Oyamada, Masafumi},
  journal={arXiv preprint arXiv:2312.01678},
  year={2023}
}
```

## Performance on seen tasks

| Task | Type | Dataset | Non-LLM SoTA<sup>1</sup> | GPT-3.5<sup>2</sup> | GPT-4<sup>2</sup> | GPT-4o | Table-GPT | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
|-----------------|--------|-------------------|-----------------|--------|--------|--------|-----------|--------------|--------------|---------------|
| Error Detection | Seen | Adult | *99.10* | 99.10 | 92.01 | 83.58 | -- | 77.40 | 73.74 | **99.33** |
| Error Detection | Seen | Hospital | 94.40 | **97.80** | 90.74 | 44.76 | -- | 94.51 | 93.40 | *95.59* |
| Error Detection | Unseen | Flights | 81.00 | -- | **83.48** | 66.01 | -- | 69.15 | 66.21 | *82.52* |
| Error Detection | Unseen | Rayyan | 79.00 | -- | *81.95* | 68.53 | -- | 75.07 | 81.06 | **90.65** |
| Data Imputation | Seen | Buy | 96.50 | 98.50 | **100** | **100** | -- | 98.46 | 98.46 | **100** |
| Data Imputation | Seen | Restaurant | 77.20 | 88.40 | **97.67** | 90.70 | -- | 89.53 | 87.21 | 89.53 |
| Data Imputation | Unseen | Flipkart | 68.00 | -- | **89.94** | 83.20 | -- | 87.14 | *87.48* | 81.68 |
| Data Imputation | Unseen | Phone | 86.70 | -- | **90.79** | 86.78 | -- | 86.52 | 85.68 | *87.21* |
| Schema Matching | Seen | MIMIC-III | 20.00 | -- | 40.00 | 29.41 | -- | **53.33** | *45.45* | 40.00 |
| Schema Matching | Seen | Synthea | 38.50 | 45.20 | **66.67** | 6.56 | -- | 55.56 | 47.06 | 56.00 |
| Schema Matching | Unseen | CMS | *50.00* | -- | 19.35 | 22.22 | -- | 42.86 | 38.10 | **59.29** |
| Entity Matching | Seen | Amazon-Google | 75.58 | 63.50 | 74.21 | 70.91 | 70.10 | **81.69** | *81.42* | 81.34 |
| Entity Matching | Seen | Beer | 94.37 | **100** | **100** | 90.32 | 96.30 | **100.00** | **100.00** | 96.77 |
| Entity Matching | Seen | DBLP-ACM | **98.99** | 96.60 | 97.44 | 95.87 | 93.80 | 98.65 | 98.77 | *98.98* |
| Entity Matching | Seen | DBLP-GoogleScholar | *95.70* | 83.80 | 91.87 | 90.45 | 92.40 | 94.88 | 95.03 | **98.51** |
| Entity Matching | Seen | Fodors-Zagats | **100** | **100** | **100** | 93.62 | **100** | **100** | **100** | **100** |
| Entity Matching | Seen | iTunes-Amazon | 97.06 | *98.20* | **100** | 98.18 | 94.30 | 96.30 | 96.30 | 98.11 |
| Entity Matching | Unseen | Abt-Buy | 89.33 | -- | **92.77** | 78.73 | -- | 86.06 | 88.84 | *89.58* |
| Entity Matching | Unseen | Walmart-Amazon | 86.89 | 87.00 | **90.27** | 79.19 | 82.40 | 84.91 | 85.24 | *89.42* |
| Avg | | | 80.44 | -- | *84.17* | 72.58 | -- | 82.74 | 81.55 | **86.02** |

_For GPT-3.5 and GPT-4, we used the few-shot approach on all datasets. For Jellyfish models, the few-shot approach is disabled on seen datasets and enabled on unseen datasets._
_Accuracy is the metric for data imputation; the F1 score is used for all other tasks._

1. [HoloDetect](https://arxiv.org/abs/1904.02285) for Error Detection seen datasets,
   [RAHA](https://dl.acm.org/doi/10.1145/3299869.3324956) for Error Detection unseen datasets,
   [IPM](https://ieeexplore.ieee.org/document/9458712) for Data Imputation,
   [SMAT](https://www.researchgate.net/publication/353920530_SMAT_An_Attention-Based_Deep_Learning_Solution_to_the_Automation_of_Schema_Matching) for Schema Matching,
   [Ditto](https://arxiv.org/abs/2004.00584) for Entity Matching
2. [Large Language Models as Data Preprocessors](https://arxiv.org/abs/2308.16361)

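For reference, the F1 score and accuracy used above can be computed from binary Yes/No predictions in a few lines of Python (a generic metric sketch, not code from the Jellyfish repository):

```python
def f1_score(preds, labels):
    """F1 over binary answers encoded as 1 (Yes) / 0 (No)."""
    tp = sum(1 for p, l in zip(preds, labels) if p == 1 and l == 1)
    fp = sum(1 for p, l in zip(preds, labels) if p == 1 and l == 0)
    fn = sum(1 for p, l in zip(preds, labels) if p == 0 and l == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def accuracy(preds, labels):
    """Fraction of exact matches, as used for data imputation."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)
```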
## Performance on unseen tasks

### Column Type Annotation

| Dataset | RoBERTa (159 shots)<sup>1</sup> | GPT-3.5<sup>1</sup> | GPT-4 | GPT-4o | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
|--------|-----------------|--------|--------|--------|--------------|--------------|---------------|
| SOTAB | 79.20 | 89.47 | 91.55 | 65.05 | 83.00 | 76.33 | 82.00 |

_Few-shot is disabled for Jellyfish models._

1. Results from [Column Type Annotation using ChatGPT](https://arxiv.org/abs/2306.00745)

### Attribute Value Extraction

| Dataset | Stable Beluga 2 70B<sup>1</sup> | SOLAR 70B<sup>1</sup> | GPT-3.5<sup>1</sup> | GPT-4<sup>1</sup> | GPT-4o | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| AE-110k | 52.10 | 49.20 | 61.30 | 55.50 | 55.77 | 56.09 | 59.55 | 58.12 |
| OA-Mine | 50.80 | 55.20 | 62.70 | 68.90 | 60.20 | 51.98 | 59.22 | 55.96 |

_Few-shot is disabled for Jellyfish models._

1. Results from [Product Attribute Value Extraction using Large Language Models](https://arxiv.org/abs/2310.12537)

## Prompt Template
```
<|start_header_id|>system<|end_header_id|>{system message}<|eot_id|>
<|start_header_id|>user<|end_header_id|>{prompt}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
```

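The template above can be filled with a small helper (a sketch; the placeholder names follow the template, and the function name is ours):

```python
def build_prompt(system_message: str, user_message: str) -> str:
    """Assemble the Llama-3-style prompt shown in the template above."""
    return (
        f"<|start_header_id|>system<|end_header_id|>{system_message}<|eot_id|>\n"
        f"<|start_header_id|>user<|end_header_id|>{user_message}<|eot_id|>\n"
        f"<|start_header_id|>assistant<|end_header_id|>"
    )

prompt = build_prompt(
    "You are an AI assistant that follows instruction extremely well.",
    "Is there an error in the value of city? Choose your answer from: [Yes, No].",
)
```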
## Training Details

### Training Method

We used LoRA to speed up the training process, targeting the q_proj, k_proj, v_proj, and o_proj modules.

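Such a LoRA setup can be expressed with the PEFT library roughly as follows (a configuration sketch: only the target modules come from this card; the rank, alpha, and dropout values are illustrative assumptions, not the authors' exact hyperparameters):

```python
# Sketch of a LoRA configuration targeting the modules named above.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,               # rank: assumed value
    lora_alpha=32,      # scaling: assumed value
    lora_dropout=0.05,  # dropout: assumed value
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # from this card
    task_type="CAUSAL_LM",
)
```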
## Uses

To accelerate inference, we strongly recommend running Jellyfish with [vLLM](https://github.com/vllm-project/vllm).
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Python Script
We provide two simple Python code examples for inference using the Jellyfish model.

#### Using Transformers and Torch Modules
<div style="height: auto; max-height: 400px; overflow-y: scroll;">

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch

if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

# Model will be automatically downloaded from HuggingFace model hub if not cached.
# Model files will be cached in "~/.cache/huggingface/hub/models--NECOUDBFM--Jellyfish/" by default.
# You can also download the model manually and replace the model name with the path to the model files.
model = AutoModelForCausalLM.from_pretrained(
    "NECOUDBFM/Jellyfish",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NECOUDBFM/Jellyfish")

system_message = "You are an AI assistant that follows instruction extremely well. Help as much as you can."

# You need to define the user_message variable based on the task and the data you want to test on.
user_message = "Hello, world."

prompt = f"<|start_header_id|>system<|end_header_id|>{system_message}<|eot_id|>\n<|start_header_id|>user<|end_header_id|>{user_message}<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>"
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to(device)

# You can modify the sampling parameters according to your needs.
generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.35,
    top_p=0.9,
)

with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=1024,
        pad_token_id=tokenizer.eos_token_id,
        repetition_penalty=1.15,
    )

output = generation_output.sequences
response = tokenizer.decode(
    output[:, input_ids.shape[-1]:][0], skip_special_tokens=True
).strip()

print(response)
```
</div>

#### Using vLLM
<div style="height: auto; max-height: 400px; overflow-y: scroll;">

```python
from vllm import LLM, SamplingParams

# To use vLLM for inference, you need to download the model files either using HuggingFace model hub or manually.
# You should modify the path to the model according to your local environment.
path_to_model = "/workspace/models/Jellyfish"

model = LLM(model=path_to_model)

# You can modify the sampling parameters according to your needs.
# Caution: The stop parameter should not be changed.
sampling_params = SamplingParams(
    temperature=0.35,
    top_p=0.9,
    max_tokens=1024,
    stop=["<|eot_id|>"],
)

system_message = "You are an AI assistant that follows instruction extremely well. Help as much as you can."

# You need to define the user_message variable based on the task and the data you want to test on.
user_message = "Hello, world."

prompt = f"<|start_header_id|>system<|end_header_id|>{system_message}<|eot_id|>\n<|start_header_id|>user<|end_header_id|>{user_message}<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>"
outputs = model.generate(prompt, sampling_params)
response = outputs[0].outputs[0].text.strip()
print(response)
```
</div>

## Prompts

We provide the prompts used for both fine-tuning and inference.
You can structure your data according to these prompts.

### System Message
```
You are an AI assistant that follows instruction extremely well.
User will give you a question. Your task is to answer as faithfully as you can.
```

### For Error Detection
_There are two forms of the error detection task.
In the first form, a complete record row is provided, and the task is to determine if a specific value is erroneous.
In the second form, only the value of a specific attribute is given, and the decision about its correctness is based solely on the attribute's name and value.
The prompt examples below pertain to these two forms, respectively._
```
Your task is to determine if there is an error in the value of a specific attribute within the whole record provided.
The attributes may include {attribute 1}, {attribute 2}, ...
Errors may include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense given the context of the whole record.
Record [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
```
```
Your task is to determine if there is an error in the value of a specific attribute.
The attributes may belong to a {keyword} record and could be one of the following: {attribute 1}, {attribute 2}, ...
Errors can include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense for that attribute.
Note: Missing values (N/A or "nan") are not considered errors.
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
```
### For Data Imputation
```
You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
Your task is to deduce or infer the value of {attribute X} using the available information in the record.
You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
Answer only the value of {attribute X}.
```

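To generate such prompts from structured data, records can be serialized into the `[attribute: value, ...]` form used throughout these templates. A minimal helper, shown here filling the Data Imputation template above (the function names are ours):

```python
def serialize_record(record: dict) -> str:
    """Render a record as "[attr 1: value 1, attr 2: value 2, ...]"."""
    return "[" + ", ".join(f"{k}: {v}" for k, v in record.items()) + "]"

def imputation_prompt(keyword: str, record: dict, missing_attr: str) -> str:
    """Fill the Data Imputation template with a concrete record."""
    return (
        f"You are presented with a {keyword} record that is missing a specific attribute: {missing_attr}.\n"
        f"Your task is to deduce or infer the value of {missing_attr} using the available information in the record.\n"
        f"Record: {serialize_record(record)}\n"
        f"Based on the provided record, what would you infer is the value for the missing attribute {missing_attr}?\n"
        f"Answer only the value of {missing_attr}."
    )

prompt = imputation_prompt(
    "product",
    {"name": "iPhone 13", "brand": "Apple"},
    "category",
)
```

The same `serialize_record` helper also covers the Record A / Record B lines of the Entity Matching template below.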
### For Schema Matching
```
Your task is to determine if the two attributes (columns) are semantically equivalent in the context of merging two tables.
Each attribute will be provided by its name and a brief description.
Your goal is to assess if they refer to the same information based on these names and descriptions provided.
Attribute A is [name: {value of name}, description: {value of description}].
Attribute B is [name: {value of name}, description: {value of description}].
Are Attribute A and Attribute B semantically equivalent? Choose your answer from: [Yes, No].
```

### For Entity Matching
```
You are tasked with determining whether two records listed below are the same based on the information provided.
Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.
Note that missing values (N/A or "nan") should not be used as a basis for your decision.
Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Are record A and record B the same entity? Choose your answer from: [Yes, No].
```

### For Column Type Annotation

We follow the prompt in [Column Type Annotation using ChatGPT](https://arxiv.org/abs/2306.00745) (text+inst+2-step).

### For Attribute Value Extraction

We follow the prompt in [Product Attribute Value Extraction using Large Language Models](https://arxiv.org/abs/2310.12537) (textual, w/o examples).