
Jellyfish-8B


Model Details

Jellyfish-8B is a large language model with 8 billion parameters.
We fine-tuned the Meta-Llama-3-8B-Instruct model on datasets pertinent to data preprocessing tasks. The training data is a subset of [Jellyfish-Instruct](https://huggingface.co/datasets/NECOUDBFM/Jellyfish-Instruct).
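
As a quick way to inspect that training data, here is a minimal sketch using the Hugging Face `datasets` library; the `"train"` split name is an assumption, so check the dataset card for the actual splits and fields.

```python
# Minimal sketch: inspect the Jellyfish-Instruct data with the `datasets`
# library. The "train" split name is an assumption; see the dataset card
# for the actual splits and record fields.
from datasets import load_dataset

ds = load_dataset("NECOUDBFM/Jellyfish-Instruct", split="train")
print(ds)      # number of rows and column names
print(ds[0])   # one instruction-tuning record
```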

More details about the model can be found in the Jellyfish paper.

  • Developed by: Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada
  • Contact: dongyuyang@nec.com
  • Funded by: NEC Corporation, Osaka University
  • Language(s) (NLP): English
  • License: Non-Commercial Creative Commons license (CC BY-NC-4.0)
  • Finetuned from model: Meta-Llama-3-8B-Instruct

Citation

If you find our work useful, please give us credit by citing:

```bibtex
@article{zhang2023jellyfish,
  title={Jellyfish: A Large Language Model for Data Preprocessing},
  author={Zhang, Haochen and Dong, Yuyang and Xiao, Chuan and Oyamada, Masafumi},
  journal={arXiv preprint arXiv:2312.01678},
  year={2023}
}
```

Performance on seen tasks

| Task | Type | Dataset | Non-LLM SoTA¹ | GPT-3.5² | GPT-4² | GPT-4o | Table-GPT | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
|---|---|---|---|---|---|---|---|---|---|---|
| Error Detection | Seen | Adult | 99.10 | 99.10 | 92.01 | 83.58 | -- | 77.40 | 73.74 | 99.33 |
| Error Detection | Seen | Hospital | 94.40 | 97.80 | 90.74 | 44.76 | -- | 94.51 | 93.40 | 95.59 |
| Error Detection | Unseen | Flights | 81.00 | -- | 83.48 | 66.01 | -- | 69.15 | 66.21 | 82.52 |
| Error Detection | Unseen | Rayyan | 79.00 | -- | 81.95 | 68.53 | -- | 75.07 | 81.06 | 90.65 |
| Data Imputation | Seen | Buy | 96.50 | 98.50 | 100 | 100 | -- | 98.46 | 98.46 | 100 |
| Data Imputation | Seen | Restaurant | 77.20 | 88.40 | 97.67 | 90.70 | -- | 89.53 | 87.21 | 89.53 |
| Data Imputation | Unseen | Flipkart | 68.00 | -- | 89.94 | 83.20 | -- | 87.14 | 87.48 | 81.68 |
| Data Imputation | Unseen | Phone | 86.70 | -- | 90.79 | 86.78 | -- | 86.52 | 85.68 | 87.21 |
| Schema Matching | Seen | MIMIC-III | 20.00 | -- | 40.00 | 29.41 | -- | 53.33 | 45.45 | 40.00 |
| Schema Matching | Seen | Synthea | 38.50 | 45.20 | 66.67 | 6.56 | -- | 55.56 | 47.06 | 56.00 |
| Schema Matching | Unseen | CMS | 50.00 | -- | 19.35 | 22.22 | -- | 42.86 | 38.10 | 59.29 |
| Entity Matching | Seen | Amazon-Google | 75.58 | 63.50 | 74.21 | 70.91 | 70.10 | 81.69 | 81.42 | 81.34 |
| Entity Matching | Seen | Beer | 94.37 | 100 | 100 | 90.32 | 96.30 | 100.00 | 100.00 | 96.77 |
| Entity Matching | Seen | DBLP-ACM | 98.99 | 96.60 | 97.44 | 95.87 | 93.80 | 98.65 | 98.77 | 98.98 |
| Entity Matching | Seen | DBLP-GoogleScholar | 95.70 | 83.80 | 91.87 | 90.45 | 92.40 | 94.88 | 95.03 | 98.51 |
| Entity Matching | Seen | Fodors-Zagats | 100 | 100 | 100 | 93.62 | 100 | 100 | 100 | 100 |
| Entity Matching | Seen | iTunes-Amazon | 97.06 | 98.20 | 100 | 98.18 | 94.30 | 96.30 | 96.30 | 98.11 |
| Entity Matching | Unseen | Abt-Buy | 89.33 | -- | 92.77 | 78.73 | -- | 86.06 | 88.84 | 89.58 |
| Entity Matching | Unseen | Walmart-Amazon | 86.89 | 87.00 | 90.27 | 79.19 | 82.40 | 84.91 | 85.24 | 89.42 |
| Avg | | | 80.44 | -- | 84.17 | 72.58 | -- | 82.74 | 81.55 | 86.02 |

For GPT-3.5 and GPT-4, we used the few-shot approach on all datasets. For the Jellyfish models, few-shot prompting is disabled on seen datasets and enabled on unseen datasets. An illustrative zero-shot prompt is sketched after the notes below.
We use accuracy as the metric for data imputation and the F1 score for all other tasks.

  1. Non-LLM SoTA methods: Ditto for Entity Matching, SMAT for Schema Matching, HoloDetect for Error Detection (seen datasets), RAHA for Error Detection (unseen datasets), and IPM for Data Imputation.
  2. Results from Large Language Models as Data Preprocessors.
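
To make the zero-shot setting concrete, below is a purely illustrative entity-matching instance rendered as a prompt string. The system message, record fields, and task wording here are assumptions for demonstration only; the exact prompts behind the reported numbers are those described in the Jellyfish paper.

```python
# Purely illustrative entity-matching instance (zero-shot, as used on seen
# datasets). The wording is an assumption for demonstration; the actual
# prompts are described in the Jellyfish paper.
system_message = "You are an AI assistant that specializes in data preprocessing tasks."

user_prompt = (
    "Determine whether the two product records below refer to the same real-world entity.\n"
    'Product A: [name: "instant immersion spanish deluxe 2.0", price: "36.11"]\n'
    'Product B: [name: "instant immers spanish dlux 2", price: "36.11"]\n'
    "Answer with 'Yes' or 'No' only."
)
```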

Performance on unseen tasks

Column Type Annotation

| Dataset | RoBERTa (159 shots)¹ | GPT-3.5¹ | GPT-4 | GPT-4o | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
|---|---|---|---|---|---|---|---|
| SOTAB | 79.20 | 89.47 | 91.55 | 65.05 | 83 | 76.33 | 82 |

Few-shot is disabled for Jellyfish models.

  1. Results from Column Type Annotation using ChatGPT

Attribute Value Extraction

| Dataset | Stable Beluga 2 70B¹ | SOLAR 70B¹ | GPT-3.5¹ | GPT-4¹ | GPT-4o | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
|---|---|---|---|---|---|---|---|---|
| AE-110k | 52.10 | 49.20 | 61.30 | 55.50 | 55.77 | 56.09 | 59.55 | 58.12 |
| OA-Mine | 50.80 | 55.20 | 62.70 | 68.90 | 60.20 | 51.98 | 59.22 | 55.96 |

Few-shot is disabled for Jellyfish models.

  1. Results from Product Attribute Value Extraction using Large Language Models

Prompt Template

```
<|start_header_id|>system<|end_header_id|>{system message}<|eot_id|>
<|start_header_id|>user<|end_header_id|>{prompt}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
```
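
For convenience, here is a minimal inference sketch with the Hugging Face `transformers` library. The repository ID `NECOUDBFM/Jellyfish-8B` and the system message are assumptions based on this card; `tokenizer.apply_chat_template` renders the Llama-3 prompt format shown above.

```python
# Minimal inference sketch, assuming the repo ID "NECOUDBFM/Jellyfish-8B"
# (inferred from this card) and a generic data-preprocessing system message.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NECOUDBFM/Jellyfish-8B"  # assumed repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": "You are an AI assistant that specializes in data preprocessing tasks."},
    {"role": "user", "content": (
        "Do the two records refer to the same entity? Answer 'Yes' or 'No'.\n"
        'Record A: [name: "apple iphone 13"]\n'
        'Record B: [name: "iphone 13 (apple)"]'
    )},
]
# apply_chat_template renders the Llama-3 prompt format shown above.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```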