---
license: cc-by-nc-4.0
language:
- en
---
# Jellyfish-8B
<img src="https://i.imgur.com/E1vqCIw.png" alt="PicToModel" width="330"/>

Jellyfish models of other sizes are available here:  
[Jellyfish-7B](https://huggingface.co/NECOUDBFM/Jellyfish-7B)  
[Jellyfish-13B](https://huggingface.co/NECOUDBFM/Jellyfish-13B)

## Model Details
Jellyfish-8B is a large language model with 8 billion parameters.  
We fine-tuned the [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model using a subset of the [Jellyfish-Instruct](https://huggingface.co/datasets/NECOUDBFM/Jellyfish-Instruct) dataset.

<!-- Jellyfish-7B vs GPT-3.5-turbo wining rate by GPT4 evaluation is 56.36%. -->

More details about the model can be found in the [Jellyfish paper](https://arxiv.org/abs/2312.01678).

- **Developed by:** Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada  
- **Contact:** dongyuyang@nec.com  
- **Funded by:** NEC Corporation, Osaka University  
- **Language(s) (NLP):** English  
- **License:** Non-Commercial Creative Commons license (CC BY-NC-4.0)  
- **Finetuned from model:** [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) 

## Citation

If you find our work useful, please give us credit by citing:

```
@article{zhang2023jellyfish,
  title={Jellyfish: A Large Language Model for Data Preprocessing},
  author={Zhang, Haochen and Dong, Yuyang and Xiao, Chuan and Oyamada, Masafumi},
  journal={arXiv preprint arXiv:2312.01678},
  year={2023}
}
```

## Performance on seen tasks

| Task            | Type   | Dataset           | Non-LLM SoTA<sup>1</sup> | GPT-3.5<sup>2</sup> | GPT-4<sup>2</sup>  | GPT-4o | Table-GPT | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
|-----------------|--------|-------------------|-----------------|--------|--------|--------|-----------|--------------|--------------|---------------|
| Error Detection | Seen   | Adult             | *99.10*         | 99.10  | 92.01  | 83.58  | --        | 77.40        | 73.74        | **99.33**     |
| Error Detection | Seen   | Hospital          | 94.40           | **97.80** | 90.74  | 44.76  | --        | 94.51        | 93.40        | *95.59*       |
| Error Detection | Unseen | Flights           | 81.00           | --     | **83.48** | 66.01  | --        | 69.15        | 66.21        | *82.52*       |
| Error Detection | Unseen | Rayyan            | 79.00           | --     | *81.95* | 68.53  | --        | 75.07        | 81.06        | **90.65**     |
| Data Imputation | Seen   | Buy               | 96.50           | 98.50  | **100** | **100** | --        | 98.46        | 98.46        | **100**       |
| Data Imputation | Seen   | Restaurant        | 77.20           | 88.40  | **97.67** | 90.70  | --        | 89.53        | 87.21        | 89.53         |
| Data Imputation | Unseen | Flipkart          | 68.00           | --     | **89.94** | 83.20  | --        | 87.14        | *87.48*      | 81.68         |
| Data Imputation | Unseen | Phone             | 86.70           | --     | **90.79** | 86.78  | --        | 86.52        | 85.68        | *87.21*       |
| Schema Matching | Seen   | MIMIC-III         | 20.00           | --     | 40.00   | 29.41  | --        | **53.33**    | *45.45*      | 40.00         |
| Schema Matching | Seen   | Synthea           | 38.50           | 45.20  | **66.67** | 6.56   | --        | 55.56        | 47.06        | 56.00         |
| Schema Matching | Unseen | CMS               | *50.00*         | --     | 19.35   | 22.22  | --        | 42.86        | 38.10        | **59.29**     |
| Entity Matching | Seen   | Amazon-Google     | 75.58           | 63.50  | 74.21  | 70.91  | 70.10     | **81.69**    | *81.42*      | 81.34         |
| Entity Matching | Seen   | Beer              | 94.37           | **100** | **100** | 90.32  | 96.30     | **100.00**   | **100.00**   | 96.77         |
| Entity Matching | Seen   | DBLP-ACM          | **98.99**       | 96.60  | 97.44  | 95.87  | 93.80     | 98.65        | 98.77        | *98.98*       |
| Entity Matching | Seen   | DBLP-GoogleScholar| *95.70*         | 83.80  | 91.87  | 90.45  | 92.40     | 94.88        | 95.03        | **98.51**     |
| Entity Matching | Seen   | Fodors-Zagats     | **100**         | **100** | **100** | 93.62  | **100**   | **100**      | **100**      | **100**       |
| Entity Matching | Seen   | iTunes-Amazon     | 97.06           | *98.20*| **100** | 98.18  | 94.30     | 96.30        | 96.30        | 98.11         |
| Entity Matching | Unseen | Abt-Buy           | 89.33           | --     | **92.77** | 78.73  | --        | 86.06        | 88.84        | *89.58*       |
| Entity Matching | Unseen | Walmart-Amazon    | 86.89           | 87.00  | **90.27** | 79.19  | 82.40     | 84.91        | 85.24        | *89.42*       |
| Avg             |        |                   | 80.44           | -      | *84.17* | 72.58  | -         | 82.74        | 81.55        | **86.02**     |

_For GPT-3.5 and GPT-4, we used the few-shot approach on all datasets. For Jellyfish models, the few-shot approach is disabled on seen datasets and enabled on unseen datasets._   
_We use accuracy as the metric for data imputation and the F1 score for the other tasks._ 

1.  
  [HoloDetect](https://arxiv.org/abs/1904.02285) for Error Detection seen datasets  
  [RAHA](https://dl.acm.org/doi/10.1145/3299869.3324956) for Error Detection unseen datasets  
  [IPM](https://ieeexplore.ieee.org/document/9458712) for Data Imputation  
  [SMAT](https://www.researchgate.net/publication/353920530_SMAT_An_Attention-Based_Deep_Learning_Solution_to_the_Automation_of_Schema_Matching) for Schema Matching  
  [Ditto](https://arxiv.org/abs/2004.00584) for Entity Matching  
2.  
  [Large Language Models as Data Preprocessors](https://arxiv.org/abs/2308.16361)
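
The two metrics named under the table above can be sketched in plain Python as follows. This is only an illustration, not the evaluation script behind the reported numbers: the function names are hypothetical, accuracy is assumed to be exact string match, and F1 is assumed to treat `Yes` as the positive class. Both functions expect non-empty, equally long lists.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that exactly match the gold labels."""
    correct = sum(p == g for p, g in zip(predictions, labels))
    return correct / len(labels)


def f1_score(predictions, labels, positive="Yes"):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and g == positive for p, g in pairs)
    fp = sum(p == positive and g != positive for p, g in pairs)
    fn = sum(p != positive and g == positive for p, g in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy Yes/No answers, invented for illustration only.
predictions = ["Yes", "Yes", "No", "No", "Yes"]
labels = ["Yes", "No", "No", "Yes", "Yes"]
print(f"accuracy: {100 * accuracy(predictions, labels):.2f}")
print(f"F1:       {100 * f1_score(predictions, labels):.2f}")
```

The scores are multiplied by 100 here so that they are on the same scale as the table.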

## Performance on unseen tasks

### Column Type Annotation

| Dataset           | RoBERTa (159 shots)<sup>1</sup> | GPT-3.5<sup>1</sup> | GPT-4  | GPT-4o | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
|--------|-----------------|--------|--------|--------|--------------|--------------|---------------|
| SOTAB | 79.20 | 89.47 | 91.55 | 65.05 | 83 | 76.33 | 82 |

_Few-shot is disabled for Jellyfish models._   

1. Results from [Column Type Annotation using ChatGPT](https://arxiv.org/abs/2306.00745)

### Attribute Value Extraction

| Dataset |Stable Beluga 2 70B<sup>1</sup> | SOLAR 70B<sup>1</sup> | GPT-3.5<sup>1</sup> | GPT-4 <sup>1</sup>|  GPT-4o | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
| ---- | ---- | ---- | ---- | ---- | ---- | ----| ----| ----|
| AE-110k | 52.10 | 49.20 | 61.30 | 55.50 | 55.77 | 56.09 |59.55 | 58.12 |
| OA-Mine | 50.80 | 55.20 | 62.70 | 68.90 | 60.20 | 51.98 | 59.22 | 55.96 |

_Few-shot is disabled for Jellyfish models._   

1. Results from [Product Attribute Value Extraction using Large Language Models](https://arxiv.org/abs/2310.12537)

## Prompt Template
```
<|start_header_id|>system<|end_header_id|>{system message}<|eot_id|>
<|start_header_id|>user<|end_header_id|>{prompt}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
```
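
As a minimal sketch, the template above can be filled with a small helper. The function name `build_prompt` is hypothetical and not part of any Jellyfish or Transformers API; the newline between turns follows the template as printed.

```python
def build_prompt(system_message: str, user_message: str) -> str:
    """Fill the system, user, and assistant turns of the template above."""
    return (
        f"<|start_header_id|>system<|end_header_id|>{system_message}<|eot_id|>\n"
        f"<|start_header_id|>user<|end_header_id|>{user_message}<|eot_id|>\n"
        "<|start_header_id|>assistant<|end_header_id|>"
    )


prompt = build_prompt(
    "You are an AI assistant that follows instruction extremely well.",
    "Hello, world.",
)
print(prompt)
```

The prompt ends with the open assistant header so that the model's generation becomes the assistant turn.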

## Training Details

### Training Method

We used LoRA to speed up the training process, targeting the q_proj, k_proj, v_proj, and o_proj modules.
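
To give a rough sense of why LoRA reduces the training cost, the sketch below counts the adapter weights added to the four targeted projections. The layer shapes are taken from the published Meta-Llama-3-8B configuration (hidden size 4096, 8 key/value heads of dimension 128, 32 decoder layers). The rank `16` is a hypothetical value chosen only for illustration; the rank actually used for Jellyfish-8B is not stated here.

```python
# (in_features, out_features) of the targeted projections in one decoder layer.
# k_proj and v_proj are narrower because of grouped-query attention
# (8 key/value heads x 128 dimensions = 1024).
PROJECTION_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
NUM_LAYERS = 32


def lora_param_count(rank: int) -> int:
    """LoRA adds two low-rank matrices, A (rank x in) and B (out x rank),
    to each adapted weight, i.e. rank * (in + out) trainable values."""
    per_layer = sum(
        rank * (n_in + n_out) for n_in, n_out in PROJECTION_SHAPES.values()
    )
    return per_layer * NUM_LAYERS


rank = 16  # hypothetical rank, for illustration only
print(f"{lora_param_count(rank):,} trainable LoRA parameters at rank {rank}")
```

At this assumed rank the adapters hold well under 1% of the 8 billion base parameters, which stay frozen during fine-tuning.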

## Uses

To accelerate inference, we strongly recommend running Jellyfish with [vLLM](https://github.com/vllm-project/vllm).

### Python Script
We provide two simple Python code examples for inference using the Jellyfish model.  

#### Using Transformers and Torch Modules
<div style="height: auto; max-height: 400px; overflow-y: scroll;">
  
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch

if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

# Model will be automatically downloaded from HuggingFace model hub if not cached.
# Model files will be cached in "~/.cache/huggingface/hub/models--NECOUDBFM--Jellyfish-8B/" by default.
# You can also download the model manually and replace the model name with the path to the model files.
model = AutoModelForCausalLM.from_pretrained(
    "NECOUDBFM/Jellyfish-8B",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NECOUDBFM/Jellyfish-8B")

system_message = "You are an AI assistant that follows instruction extremely well. Help as much as you can."

# You need to define the user_message variable based on the task and the data you want to test on.
user_message = "Hello, world."

prompt = f"<|start_header_id|>system<|end_header_id|>{system_message}<|eot_id|>\n<|start_header_id|>user<|end_header_id|>{user_message}<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>"
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to(device)

# You can modify the sampling parameters according to your needs.
generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.35,
    top_p=0.9,
)

with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=1024,
        pad_token_id=tokenizer.eos_token_id,
        repetition_penalty=1.15,
    )

output = generation_output[0]
response = tokenizer.decode(
    output[:, input_ids.shape[-1] :][0], skip_special_tokens=True
).strip()

print(response)

```
</div>

#### Using vLLM
<div style="height: auto; max-height: 400px; overflow-y: scroll;">
  
```python
from vllm import LLM, SamplingParams

# To use vllm for inference, you need to download the model files either using HuggingFace model hub or manually.
# You should modify the path to the model according to your local environment.
path_to_model = (
    "/workspace/models/Jellyfish"
)

model = LLM(model=path_to_model)

# You can modify the sampling parameters according to your needs.
# Caution: The stop parameter should not be changed.
sampling_params = SamplingParams(
    temperature=0.35,
    top_p=0.9,
    max_tokens=1024,
    stop=["<|eot_id|>"],
)

system_message = "You are an AI assistant that follows instruction extremely well. Help as much as you can."

# You need to define the user_message variable based on the task and the data you want to test on.
user_message = "Hello, world."

prompt = f"<|start_header_id|>system<|end_header_id|>{system_message}<|eot_id|>\n<|start_header_id|>user<|end_header_id|>{user_message}<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>"
outputs = model.generate(prompt, sampling_params)
response = outputs[0].outputs[0].text.strip()
print(response)

```
</div>

## Prompts

We provide the prompts used for both fine-tuning and inference.
You can structure your data according to these prompts.

### System Message
```
You are an AI assistant that follows instruction extremely well.
User will give you a question. Your task is to answer as faithfully as you can.
```

### For Error Detection
_There are two forms of the error detection task.
In the first form, a complete record row is provided, and the task is to determine if a specific value is erroneous.
In the second form, only the value of a specific attribute is given, and the decision about its correctness is based solely on the attribute's name and value.
The subsequent prompt examples pertain to these two forms, respectively._
```
Your task is to determine if there is an error in the value of a specific attribute within the whole record provided.
The attributes may include {attribute 1}, {attribute 2}, ...
Errors may include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense given the context of the whole record.
Record [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
```
```
Your task is to determine if there is an error in the value of a specific attribute.
The attributes may belong to a {keyword} record and could be one of the following: {attribute 1}, {attribute 2}, ...
Errors can include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense for that attribute.  
Note: Missing values (N/A or "nan") are not considered errors.
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
```
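
As an illustration of the first form (whole record provided), the template can be filled from a Python dictionary. This is a sketch under assumptions: the function name is hypothetical, the record is assumed to be a flat mapping from attribute name to value, and the list of attribute names is written out in full in place of the `...` in the template.

```python
def build_error_detection_prompt(record: dict, target: str) -> str:
    """Fill the record-level error detection template for one attribute."""
    attributes = ", ".join(record)
    record_text = ", ".join(f"{name}: {value}" for name, value in record.items())
    return "\n".join([
        "Your task is to determine if there is an error in the value of a "
        "specific attribute within the whole record provided.",
        f"The attributes may include {attributes}",
        "Errors may include, but are not limited to, spelling errors, "
        "inconsistencies, or values that don't make sense given the context "
        "of the whole record.",
        f"Record [{record_text}]",
        f"Attribute for Verification: [{target}: {record[target]}]",
        f"Question: Is there an error in the value of {target}? "
        "Choose your answer from: [Yes, No].",
    ])


# Toy record, invented for illustration only.
record = {"name": "alice", "age": "230", "city": "Osaka"}
print(build_error_detection_prompt(record, "age"))
```

The returned string is the user prompt; it still has to be wrapped in the prompt template together with the system message before being sent to the model.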
### For Data Imputation
```
You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
Your task is to deduce or infer the value of {attribute X} using the available information in the record.  
You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.  
Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]  
Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?  
Answer only the value of {attribute X}.
```
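
The data imputation template can be filled in the same way. The sketch below assumes a flat dictionary holding the known attributes; the function name and the example values are hypothetical, and the trailing spaces at the ends of the template lines are dropped.

```python
def build_data_imputation_prompt(record: dict, missing: str, keyword: str) -> str:
    """Fill the data imputation template for one missing attribute."""
    known = ", ".join(record)
    record_text = ", ".join(f"{name}: {value}" for name, value in record.items())
    return "\n".join([
        f"You are presented with a {keyword} record that is missing a "
        f"specific attribute: {missing}.",
        f"Your task is to deduce or infer the value of {missing} using the "
        "available information in the record.",
        f"You may be provided with fields like {known} to help you in the inference.",
        f"Record: [{record_text}]",
        "Based on the provided record, what would you infer is the value for "
        f"the missing attribute {missing}?",
        f"Answer only the value of {missing}.",
    ])


# Toy record, invented for illustration only.
record = {"name": "Bravia 55", "description": "55-inch 4K TV"}
print(build_data_imputation_prompt(record, missing="manufacturer", keyword="product"))
```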

### For Schema Matching
```
Your task is to determine if the two attributes (columns) are semantically equivalent in the context of merging two tables.
Each attribute will be provided by its name and a brief description.
Your goal is to assess if they refer to the same information based on these names and descriptions provided.
Attribute A is [name: {value of name}, description: {value of description}].
Attribute B is [name: {value of name}, description: {value of description}].
Are Attribute A and Attribute B semantically equivalent? Choose your answer from: [Yes, No].
```
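
A minimal sketch of filling the schema matching template, assuming each attribute is given as a dictionary with `name` and `description` keys. The function name and this input layout are hypothetical.

```python
def build_schema_matching_prompt(attr_a: dict, attr_b: dict) -> str:
    """Fill the schema matching template for a pair of attributes."""
    def describe(attr: dict) -> str:
        return f"[name: {attr['name']}, description: {attr['description']}]"

    return "\n".join([
        "Your task is to determine if the two attributes (columns) are "
        "semantically equivalent in the context of merging two tables.",
        "Each attribute will be provided by its name and a brief description.",
        "Your goal is to assess if they refer to the same information based "
        "on these names and descriptions provided.",
        f"Attribute A is {describe(attr_a)}.",
        f"Attribute B is {describe(attr_b)}.",
        "Are Attribute A and Attribute B semantically equivalent? "
        "Choose your answer from: [Yes, No].",
    ])


# Toy attributes, invented for illustration only.
attr_a = {"name": "dob", "description": "date of birth of the patient"}
attr_b = {"name": "birthdate", "description": "the day the person was born"}
print(build_schema_matching_prompt(attr_a, attr_b))
```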

### For Entity Matching
```
You are tasked with determining whether two records listed below are the same based on the information provided.  
Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.  
Note that missing values (N/A or "nan") should not be used as a basis for your decision.
Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]  
Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]  
Are record A and record B the same entity? Choose your answer from: [Yes, No].  
```
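
The sketch below fills the entity matching template from two dictionaries and maps the model's reply back to a boolean. Both function names are hypothetical. It assumes the two records share the same attribute names, that the quotes around `nan` are plain double quotes, and that the reply begins with `Yes` or `No` as the prompt requests.

```python
def build_entity_matching_prompt(record_a: dict, record_b: dict) -> str:
    """Fill the entity matching template for a pair of records."""
    def render(record: dict) -> str:
        return ", ".join(f"{name}: {value}" for name, value in record.items())

    attributes = ", ".join(record_a)
    return "\n".join([
        "You are tasked with determining whether two records listed below "
        "are the same based on the information provided.",
        f"Carefully compare the {attributes} for each record before making "
        "your decision.",
        'Note that missing values (N/A or "nan") should not be used as a '
        "basis for your decision.",
        f"Record A: [{render(record_a)}]",
        f"Record B: [{render(record_b)}]",
        "Are record A and record B the same entity? "
        "Choose your answer from: [Yes, No].",
    ])


def parse_yes_no(response: str) -> bool:
    """Return True for a reply starting with Yes, False for No."""
    words = response.strip().lower().split()
    first = words[0].strip(".,!:") if words else ""
    if first == "yes":
        return True
    if first == "no":
        return False
    raise ValueError(f"Unexpected answer: {response!r}")


# Toy records, invented for illustration only.
record_a = {"title": "iPhone 13", "price": "799"}
record_b = {"title": "Apple iPhone 13", "price": "nan"}
print(build_entity_matching_prompt(record_a, record_b))
```

The same `parse_yes_no` helper can be applied to error detection and schema matching replies, since those prompts also ask for an answer from `[Yes, No]`.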

### For Column Type Annotation

We follow the prompt in [Column Type Annotation using ChatGPT](https://arxiv.org/abs/2306.00745) (text+inst+2-step).  

### For Attribute Value Extraction

We follow the prompt in [Product Attribute Value Extraction using Large Language Models](https://arxiv.org/abs/2310.12537) (textual, w/o examples).