Update README.md
README.md
CHANGED
@@ -11,9 +11,9 @@ language:
## Model Details
Jellyfish-13B is a large language model equipped with 13 billion parameters. It's tailored specifically for data preprocessing tasks, including entity matching, data imputation, error detection, and schema matching.

-We fine-tuned [Open-Orca/OpenOrca-Platypus2-13B](https://huggingface.co/Open-Orca/OpenOrca-Platypus2-13B) using the datasets pertinent to data preprocessing tasks.
-Its performance is competitive, rivaling previous state-of-the-art algorithms and LLMs such as OpenAI's GPT 3.5 and GPT 4
-It is notable that as a 13B model, Jellyfish allows for cost-effective local execution without compromising data security.
+We fine-tuned the [Open-Orca/OpenOrca-Platypus2-13B](https://huggingface.co/Open-Orca/OpenOrca-Platypus2-13B) model using the datasets pertinent to data preprocessing tasks.
+Its performance is competitive, rivaling previous state-of-the-art algorithms and LLMs such as OpenAI's GPT 3.5 and GPT 4 ([as demonstrated in our earlier studies](https://arxiv.org/abs/2308.16361)).
+It is notable that, as a 13B model, Jellyfish allows for cost-effective local execution without compromising data security.

| Task | Dataset | Non-LLM SoTA<sup>1</sup> | GPT-3.5<sup>2</sup> | GPT-4<sup>2</sup> | Jellyfish-13B | Jellyfish-13B-Reasoning |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- |
@@ -30,7 +30,7 @@ It is notable that as a 13B model, Jellyfish allows for cost-effective local exe
| Error Detection | Adult | 94.40 | 92.01 | 92.01 | 96.62 | 90.13 |
| Schema Matching | Synthea | 38.50 | 57.14 | 66.67 | 36.36 | 30.77 |

-_Accuracy as the metric for data imputation
+_Accuracy as the metric for data imputation and the F1 score for other tasks._
_For GPT-3.5 and GPT-4, we used the few-shot approach, while for Jellyfish and Jellyfish-Reasoning, the zero-shot approach was employed._
1.
[Ditto](https://arxiv.org/abs/2004.00584) for Entity Matching
@@ -71,8 +71,8 @@ On the other hand, Jellyfish-13B-Reasoning is more user-oriented, with responses

### Training Data
We utilized the training and validation sets from the paper [Can Foundation Models Wrangle Your Data?](https://arxiv.org/abs/2205.09911) to fine-tune Jellyfish.
-The original datasets
-We revised this data and constructed an instruction tuning dataset suitable for fine-tuning LLM, mirroring the style of [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca).
+The original datasets are from [HazyResearch/fm_data_tasks](https://github.com/HazyResearch/fm_data_tasks).
+We revised this data and constructed an instruction-tuning dataset suitable for fine-tuning LLMs, mirroring the style of the [OpenOrca dataset](https://huggingface.co/datasets/Open-Orca/OpenOrca).

### Training Method

@@ -82,7 +82,7 @@ We used LoRA to speed up the training process, targeting the q_proj and v_proj m

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
We provide the prompts used for both the model's fine-tuning and inference.
-You can structure your data
+You can structure your data according to these prompts.
However, we encourage experimenting with different prompts to potentially achieve optimal generation quality.

### JellyFish-13B
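For readers unfamiliar with the OpenOrca style mentioned in the Training Data hunk, an instruction-tuning record typically pairs a system prompt, a task question, and the expected response. The Python sketch below shows a hypothetical entity-matching record in that shape; the field names, prompt wording, and label are illustrative and are not taken from the released Jellyfish training data.

```python
# Hypothetical OpenOrca-style instruction-tuning record for entity matching.
# Field names and wording are illustrative only, not the actual Jellyfish data.
record = {
    "system_prompt": "You are an AI assistant that answers data preprocessing questions.",
    "question": (
        'Product A is [name: "instant immersion spanish deluxe 2.0", price: "49.99"]. '
        'Product B is [name: "instant immers spanish dlux 2", price: "36.11"]. '
        "Are Product A and Product B the same product? Answer yes or no."
    ),
    "response": "Yes",
}

print(record["question"])
```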
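The Training Method hunk context notes that LoRA was used to speed up training, targeting the q_proj and v_proj modules. A minimal PEFT-style configuration for that setup might look like the sketch below; the rank, alpha, and dropout values are placeholders, not the values used to train Jellyfish-13B.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base model named in the model card.
base = AutoModelForCausalLM.from_pretrained("Open-Orca/OpenOrca-Platypus2-13B")

lora_config = LoraConfig(
    r=16,                                 # placeholder rank
    lora_alpha=32,                        # placeholder scaling factor
    lora_dropout=0.05,                    # placeholder dropout
    target_modules=["q_proj", "v_proj"],  # modules named in the model card
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```

Restricting the adapters to the attention projections keeps the number of trainable parameters small, which is consistent with the speed-up motivation stated in the card.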
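For the inference prompts discussed in the last hunk, a minimal zero-shot generation sketch with the Hugging Face transformers library is shown below. The repository id and the error-detection prompt are assumptions for illustration; substitute the actual model repository and the prompts provided in the model card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id; replace with the actual Jellyfish-13B repo.
model_id = "NECOUDBFM/Jellyfish-13B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Hypothetical error-detection prompt, loosely following the style described
# in the model card; adapt it to the released prompts.
prompt = (
    "You are tasked with determining whether a value in a record is erroneous.\n"
    'Record: [age: "250", workclass: "Private", income: "<=50K"]\n'
    "Attribute to check: age\n"
    "Is the value of this attribute erroneous? Answer yes or no."
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Greedy decoding is only one choice here; since the card encourages experimenting with prompts, decoding settings can be tuned in the same spirit.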