motherduckdb/DuckDB-NSQL-7B-v0.1

@@ -23,7 +23,7 @@ In this repository we are introducing a new member of NSQL, DuckDB-NSQL. It's ba
 ## Training Data
-The general SQL queries are the SQL subset from [The Stack](https://huggingface.co/datasets/bigcode/the-stack), containing 1M training samples. The samples were transpiled to DuckDB SQL using [sqlglot](https://github.com/tobymao/sqlglot). The labeled text-to-SQL pairs come from [NSText2SQL](https://huggingface.co/datasets/NumbersStation/NSText2SQL), which were also transpiled to DuckDB SQL, and from 200k synthetically generated DuckDB SQL queries based on the DuckDB v0.9.2 documentation.
 ## Evaluation Data
@@ -31,7 +31,7 @@ We evaluate our models on a DuckDB-specific benchmark that contains 75 text-to-S
 ## Training Procedure
-DuckDB-NSQL was trained using cross-entropy loss to maximize the likelihood of sequential inputs. For finetuning on text-to-SQL pairs, we only compute the loss over the SQL portion of the pair. The model was trained using 80GB A100s, leveraging data and model parallelism. We pre-trained for 3 epochs and fine-tuned for 10 epochs.
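
Computing the loss only over the SQL portion of a pair can be sketched as below. This is an illustrative sketch, not the authors' training code: the function names and toy token IDs are hypothetical, and the ignore index of -100 is an assumed convention (it matches the default `ignore_index` of PyTorch's cross-entropy loss). The next-token shift used by causal language models is left out for brevity.

```python
import math

IGNORE_INDEX = -100  # assumption: label value that the loss skips


def build_labels(prompt_ids, sql_ids):
    """Concatenate prompt and SQL tokens; mask the prompt in the labels."""
    input_ids = list(prompt_ids) + list(sql_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(sql_ids)
    return input_ids, labels


def masked_cross_entropy(token_log_probs, labels):
    """Mean negative log-likelihood over unmasked positions only.

    token_log_probs[i] is the log-probability the model assigned to the
    target token at position i (hypothetical values in this sketch).
    """
    kept = [-lp for lp, y in zip(token_log_probs, labels) if y != IGNORE_INDEX]
    if not kept:
        raise ValueError("no unmasked label positions")
    return sum(kept) / len(kept)


# Toy example with made-up token IDs: 3 prompt tokens, 2 SQL tokens.
input_ids, labels = build_labels([11, 12, 13], [21, 22])
log_probs = [math.log(0.1)] * 3 + [math.log(0.5)] * 2
loss = masked_cross_entropy(log_probs, labels)
```

In this toy case the three prompt positions contribute nothing, so the loss depends only on the probabilities assigned to the two SQL tokens.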
 ## Intended Use and Limitations
@@ -45,8 +45,8 @@ Example 1:
 ```python
 import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM
-tokenizer = AutoTokenizer.from_pretrained("motherduckdb/nsql-duckdb-7B")
-model = AutoModelForCausalLM.from_pretrained("motherduckdb/nsql-duckdb-7B", torch_dtype=torch.bfloat16)
 text = """### Instruction:
 Your task is to generate valid duckdb SQL to answer the following question.
@@ -70,8 +70,8 @@ Example 2:
 ```python
 import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM
-tokenizer = AutoTokenizer.from_pretrained("motherduckdb/nsql-duckdb-7B")
-model = AutoModelForCausalLM.from_pretrained("motherduckdb/nsql-duckdb-7B", torch_dtype=torch.bfloat16)
 text = """### Instruction:
 Your task is to generate valid duckdb SQL to answer the following question, given a duckdb database schema.
@@ -108,8 +108,8 @@ Example 3:
 ```python
 import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM
-tokenizer = AutoTokenizer.from_pretrained("motherduckdb/nsql-duckdb-7B")
-model = AutoModelForCausalLM.from_pretrained("motherduckdb/nsql-duckdb-7B", torch_dtype=torch.bfloat16)
 text = """### Instruction:
 Your task is to generate valid duckdb SQL to answer the following question, given a duckdb database schema.

 ## Training Data
+The training data consists of 200k synthetically generated text-to-SQL pairs, created with Mixtral 7B Instruct V1 and guided by the DuckDB v0.9.2 documentation, plus text-to-SQL pairs from [NSText2SQL](https://huggingface.co/datasets/NumbersStation/NSText2SQL) that were transpiled to DuckDB SQL using [sqlglot](https://github.com/tobymao/sqlglot).
 ## Evaluation Data
 ## Training Procedure
+DuckDB-NSQL was trained using cross-entropy loss to maximize the likelihood of sequential inputs. For finetuning on text-to-SQL pairs, we only compute the loss over the SQL portion of the pair. The model was trained using 80GB A100s, leveraging data and model parallelism. We fine-tuned for 10 epochs.
 ## Intended Use and Limitations
 ```python
 import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM
+tokenizer = AutoTokenizer.from_pretrained("motherduckdb/DuckDB-NSQL-7B-v0.1")
+model = AutoModelForCausalLM.from_pretrained("motherduckdb/DuckDB-NSQL-7B-v0.1", torch_dtype=torch.bfloat16)
 text = """### Instruction:
 Your task is to generate valid duckdb SQL to answer the following question.
 ```python
 import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM
+tokenizer = AutoTokenizer.from_pretrained("motherduckdb/DuckDB-NSQL-7B-v0.1")
+model = AutoModelForCausalLM.from_pretrained("motherduckdb/DuckDB-NSQL-7B-v0.1", torch_dtype=torch.bfloat16)
 text = """### Instruction:
 Your task is to generate valid duckdb SQL to answer the following question, given a duckdb database schema.
 ```python
 import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM
+tokenizer = AutoTokenizer.from_pretrained("motherduckdb/DuckDB-NSQL-7B-v0.1")
+model = AutoModelForCausalLM.from_pretrained("motherduckdb/DuckDB-NSQL-7B-v0.1", torch_dtype=torch.bfloat16)
 text = """### Instruction:
 Your task is to generate valid duckdb SQL to answer the following question, given a duckdb database schema.