Update README.md
README.md
CHANGED
@@ -11,9 +11,9 @@ language:
## Model Details
Jellyfish-13B is a large language model equipped with 13 billion parameters. It's tailored specifically for data preprocessing tasks, including entity matching, data imputation, error detection, and schema matching.

-We fine-tuned [Open-Orca/OpenOrca-Platypus2-13B](https://huggingface.co/Open-Orca/OpenOrca-Platypus2-13B) using the datasets pertinent to data preprocessing tasks.
-Its performance is competitive, rivaling previous state-of-the-art algorithms and LLMs such as OpenAI's GPT 3.5 and GPT 4
-It is notable that as a 13B model, Jellyfish allows for cost-effective local execution without compromising data security.
+We fine-tuned the [Open-Orca/OpenOrca-Platypus2-13B](https://huggingface.co/Open-Orca/OpenOrca-Platypus2-13B) model using the datasets pertinent to data preprocessing tasks.
+Its performance is competitive, rivaling previous state-of-the-art algorithms and LLMs such as OpenAI's GPT 3.5 and GPT 4 ([as demonstrated in our earlier studies](https://arxiv.org/abs/2308.16361)).
+It is notable that, as a 13B model, Jellyfish allows for cost-effective local execution without compromising data security.

| Task | Dataset | Non-LLM SoTA<sup>1</sup> | GPT-3.5<sup>2</sup> | GPT-4<sup>2</sup> | Jellyfish-13B | Jellyfish-13B-Reasoning |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- |
@@ -30,7 +30,7 @@ It is notable that as a 13B model, Jellyfish allows for cost-effective local exe
| Error Detection | Adult | 94.40 | 92.01 | 92.01 | 96.62 | 90.13 |
| Schema Matching | Synthea | 38.50 | 57.14 | 66.67 | 36.36 | 30.77 |

-_Accuracy as the metric for data imputation
+_Accuracy as the metric for data imputation and the F1 score for other tasks._
_For GPT-3.5 and GPT-4, we used the few-shot approach, while for Jellyfish and Jellyfish-Reasoning, the zero-shot approach was employed._
1.
[Ditto](https://arxiv.org/abs/2004.00584) for Entity Matching
@@ -71,8 +71,8 @@ On the other hand, Jellyfish-13B-Reasoning is more user-oriented, with responses

### Training Data
We utilized the training and validation sets from the paper [Can Foundation Models Wrangle Your Data?](https://arxiv.org/abs/2205.09911) to fine-tune Jellyfish.
-The original datasets
-We revised this data and constructed an instruction tuning dataset suitable for fine-tuning LLM, mirroring the style of [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca).
+The original datasets are from [HazyResearch/fm_data_tasks](https://github.com/HazyResearch/fm_data_tasks).
+We revised this data and constructed an instruction-tuning dataset suitable for fine-tuning LLMs, mirroring the style of the [OpenOrca dataset](https://huggingface.co/datasets/Open-Orca/OpenOrca).

### Training Method

@@ -82,7 +82,7 @@ We used LoRA to speed up the training process, targeting the q_proj and v_proj m

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
We provide the prompts used for both the model's fine-tuning and inference.
-You can structure your data
+You can structure your data according to these prompts.
However, we encourage experimenting with different prompts to potentially achieve optimal generation quality.

### JellyFish-13B
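For readers unfamiliar with the OpenOrca style mentioned in the Training Data hunk, an instruction-tuning record typically pairs a system prompt, a task question, and the expected response. The Python sketch below shows a hypothetical entity-matching record in that shape; the field names, prompt wording, and label are illustrative and are not taken from the released Jellyfish training data.

```python
# Hypothetical OpenOrca-style instruction-tuning record for entity matching.
# Field names and wording are illustrative only, not the actual Jellyfish data.
record = {
    "system_prompt": "You are an AI assistant that answers data preprocessing questions.",
    "question": (
        'Product A is [name: "instant immersion spanish deluxe 2.0", price: "49.99"]. '
        'Product B is [name: "instant immers spanish dlux 2", price: "36.11"]. '
        "Are Product A and Product B the same product? Answer yes or no."
    ),
    "response": "Yes",
}

print(record["question"])
```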
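The Training Method hunk context notes that LoRA was used to speed up training, targeting the q_proj and v_proj modules. A minimal PEFT-style configuration for that setup might look like the sketch below; the rank, alpha, and dropout values are placeholders, not the values used to train Jellyfish-13B.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base model named in the model card.
base = AutoModelForCausalLM.from_pretrained("Open-Orca/OpenOrca-Platypus2-13B")

lora_config = LoraConfig(
    r=16,                                 # placeholder rank
    lora_alpha=32,                        # placeholder scaling factor
    lora_dropout=0.05,                    # placeholder dropout
    target_modules=["q_proj", "v_proj"],  # modules named in the model card
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```

Restricting the adapters to the attention projections keeps the number of trainable parameters small, which is consistent with the speed-up motivation stated in the card.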
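For the inference prompts discussed in the last hunk, a minimal zero-shot generation sketch with the Hugging Face transformers library is shown below. The repository id and the error-detection prompt are assumptions for illustration; substitute the actual model repository and the prompts provided in the model card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id; replace with the actual Jellyfish-13B repo.
model_id = "NECOUDBFM/Jellyfish-13B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Hypothetical error-detection prompt, loosely following the style described
# in the model card; adapt it to the released prompts.
prompt = (
    "You are tasked with determining whether a value in a record is erroneous.\n"
    'Record: [age: "250", workclass: "Private", income: "<=50K"]\n'
    "Attribute to check: age\n"
    "Is the value of this attribute erroneous? Answer yes or no."
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Greedy decoding is only one choice here; since the card encourages experimenting with prompts, decoding settings can be tuned in the same spirit.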