Can you explain the purpose of merged_all.json?

#4
by nlpguy - opened

To me, the axolotl config already looks like it includes all the relevant data sources. After looking at previous Einstein models, I suspect that merged_all.json still contains data from those, in addition to being merged with all the other datasets. But is it still relevant? Wouldn't it be more efficient to exclude it from the training process?

merged_all.json is merged data from many alpaca-format datasets. The other datasets in the data folder are mainly in sharegpt format, so merged_all.json doesn't contain any of the other data that's in the data folder.
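For readers unfamiliar with the two formats, here is a minimal sketch of how several alpaca-format files could be concatenated into a single merged_all.json, with a comment contrasting the sharegpt record shape. This is not the actual script used for this repo, and the file names are hypothetical:

```python
import json
from pathlib import Path

# Hypothetical file names for illustration; the real files in the repo's
# data folder may be named and filtered differently.
alpaca_files = ["camelai_physics.json", "camelai_chemistry.json", "mathinstruct.json"]

merged = []
for name in alpaca_files:
    path = Path("data") / name
    if not path.exists():
        continue  # skip files that aren't present locally
    with path.open() as f:
        records = json.load(f)
    # Alpaca-format records are flat: one instruction/input/output triple each.
    for rec in records:
        merged.append({
            "instruction": rec.get("instruction", ""),
            "input": rec.get("input", ""),
            "output": rec.get("output", ""),
        })

# ShareGPT-format records (kept as separate files, not merged here) instead
# hold a list of conversation turns, e.g.:
# {"conversations": [{"from": "human", "value": "..."},
#                    {"from": "gpt",   "value": "..."}]}

with open("merged_all.json", "w") as f:
    json.dump(merged, f, ensure_ascii=False, indent=2)

print(f"Wrote {len(merged)} alpaca-format records to merged_all.json")
```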

Oh ok, thanks for the info. Does it simply contain all the other datasets mentioned in the README datasets list but not in the axolotl config?

Owner

Yes, you got it right!

Note that I filtered some of them :)

Cool, thanks for the info, and thank you for this new version of Einstein :)

nlpguy changed discussion status to closed
Owner

@nlpguy, if you are more interested in the datasets I use, you can have a look at this link:

https://huggingface.co/datasets/Weyaxi/sci-datasets/tree/main

It may be slightly outdated for 1-2 datasets, but that's the main repository I use.
