Questions on pretraining and SQL formats

#2
by eramlak

Hi, thanks for the contribution. Does the training include samples from BigQuery SQL and other SQL dialects? Also, can you elaborate on your two-step approach of pretraining and instruct fine-tuning? What is the dataset for pretraining: is it just SQL statements without questions, trained with next-token prediction?

NumbersStation org

Thanks for your interest in our work!

For the pretraining step, we use the SQL subset from The Stack (https://huggingface.co/datasets/bigcode/the-stack), containing around 1M training samples. We use the raw SQL data with next-token prediction for continued pretraining.
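
For reference, here is a minimal sketch of what that continued pretraining step could look like with the Hugging Face `datasets`/`transformers` stack. The base checkpoint, sequence length, and training arguments are placeholders for illustration, not the exact NSQL recipe:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder base checkpoint; NSQL starts from a CodeGen-style model.
model_name = "Salesforce/codegen-2B-multi"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # CodeGen tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load the SQL subset of The Stack (raw SQL files).
sql_data = load_dataset("bigcode/the-stack", data_dir="data/sql", split="train")

def tokenize(batch):
    return tokenizer(batch["content"], truncation=True, max_length=2048)

tokenized = sql_data.map(tokenize, batched=True, remove_columns=sql_data.column_names)

# mlm=False gives the standard causal-LM (next-token prediction) objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="nsql-continued-pretrain", per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```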
For the instruct fine-tuning step, we collect text-to-SQL pairs from more than 20 public sources across the web, ranging from standard datasets such as WikiSQL to medical datasets such as MIMIC-III, for a total of around 300,000 text-to-SQL pairs.
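
As a rough illustration of the instruct fine-tuning data, here is a hypothetical sketch of how a single text-to-SQL pair might be serialized into one training sequence. The prompt template, comment markers, and example schema below are assumptions for illustration, not the released NSQL format:

```python
# Hypothetical serialization of one text-to-SQL pair for instruct fine-tuning.
# The schema / question / SQL layout is an assumed format, not the official one.
def build_example(schema: str, question: str, sql: str) -> str:
    return (
        f"{schema}\n\n"
        "-- Using valid SQL, answer the following question for the tables provided above.\n"
        f"-- {question}\n"
        f"{sql}"
    )

example = build_example(
    schema="CREATE TABLE stadium (stadium_id number, name text, capacity number)",
    question="How many stadiums have a capacity above 60000?",
    sql="SELECT COUNT(*) FROM stadium WHERE capacity > 60000",
)
print(example)
```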

You can find more information in our blog post (https://www.numbersstation.ai/post/introducing-nsql-open-source-sql-copilot-foundation-models).

Have you seen any catastrophic interference from the pretraining step? Was the model pretrained only on the SQL data, or did you mix in other datasets that were used for Salesforce CodeGen? Do you plan to open-source the training code?

NumbersStation org

The goal of the pretraining step is to let the model learn more about SQL, and it did improve text-to-SQL capability (you can find the analysis in our blog). We pretrained only on the SQL data, without mixing in any of the other data used for Salesforce CodeGen pretraining. We'll release the instruct fine-tuning data soon. Stay tuned!
