SFT dataset details

#4
by cyzhh - opened

In Figure 2: Main Pipeline of Model Training, the chemical corpus ChemData is listed with 7M Q&A, but in Appendix C (hyperparameters) ChemData is described as having 70.2 million entries. Could you clarify which number is correct?

Thanks for your comment!
ChemData and ChemLLM are still a work in progress.
There may be some typos in our preprint.
The correct size is the one shown in Table 4: Statics of Instruction Datasets.
This number may change in the future depending on our data processing strategy settings.

Thanks for your help!
Also, I want to ask why you test on GSM8K and use the GSM8K training dataset. They wouldn't influence performance on chemistry, would they?

There are tasks for property prediction and arithmetic computation on molecular structures, and we want to explore how chemical computing tasks influence the mathematical ability of the model.
We use GSM8K only as an evaluation benchmark; it is not included in our training dataset.

I find it very interesting that the 7 million chemical SFT dataset improves the already high GSM8K score by 6 points. I hope the reasons for this can be analyzed later.

Some training data was collected or crawled from Hugging Face and other internet sources with limited analysis.
We're still working on improving and cleaning this data to produce more insights into the application of LLMs in chemistry.
Because of this, our training data volume, analytics, and evaluation results will all be affected by future changes in our model training or data processing strategies.
Thanks!

I'm very interested in your work. Thanks!

Thanks for your insightful comments!

cyzhh changed discussion status to closed