tdoehmen committed on
Commit
816bafc
1 Parent(s): 3b92bae

Upload README.md

Files changed (1)
  1. README.md +8 -8
README.md CHANGED
@@ -23,7 +23,7 @@ In this repository we are introducing a new member of NSQL, DuckDB-NSQL. It's ba
23
 
24
  ## Training Data
25
 
26
- The general SQL queries are the SQL subset from [The Stack](https://huggingface.co/datasets/bigcode/the-stack), containing 1M training samples. The samples were transpiled to DuckDB SQL using [sqlglot](https://github.com/tobymao/sqlglot). The labeled text-to-SQL pairs come from [NSText2SQL](https://huggingface.co/datasets/NumbersStation/NSText2SQL), also transpiled to DuckDB SQL, and from 200k synthetically generated DuckDB SQL queries based on the DuckDB v0.9.2 documentation.
27
 
28
  ## Evaluation Data
29
 
@@ -31,7 +31,7 @@ We evaluate our models on a DuckDB-specific benchmark that contains 75 text-to-S
31
 
32
  ## Training Procedure
33
 
34
- DuckDB-NSQL was trained using cross-entropy loss to maximize the likelihood of sequential inputs. For finetuning on text-to-SQL pairs, we only compute the loss over the SQL portion of the pair. The model was trained on 80GB A100s, leveraging data and model parallelism. We pre-trained for 3 epochs and fine-tuned for 10 epochs.
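The statement that the loss is computed only over the SQL portion of each pair can be illustrated with a small sketch. This is not the authors' training code: it is a minimal pure-Python illustration that assumes the common convention of marking prompt positions with the label `-100` (the default `ignore_index` of PyTorch's `CrossEntropyLoss`) so that they do not contribute to the loss. The helper names `build_labels` and `masked_cross_entropy` are hypothetical.

```python
import math

# Assumption: -100 marks positions excluded from the loss
# (this matches the default ignore_index of PyTorch's CrossEntropyLoss).
IGNORE_INDEX = -100


def build_labels(prompt_ids, sql_ids):
    """Concatenate prompt and SQL token ids; mask the prompt part of the labels."""
    input_ids = list(prompt_ids) + list(sql_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(sql_ids)
    return input_ids, labels


def masked_cross_entropy(logits, labels):
    """Mean next-token cross-entropy over unmasked label positions.

    logits[t] scores the token at position t + 1, so it is compared
    with labels[t + 1]. Masked targets are skipped entirely.
    """
    total, count = 0.0, 0
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(step_logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in step_logits))
        total += log_z - step_logits[target]
        count += 1
    return total / count if count else 0.0


# Toy example with a vocabulary of 4 tokens: 3 prompt tokens, 2 SQL tokens.
input_ids, labels = build_labels(prompt_ids=[0, 1, 2], sql_ids=[3, 1])
uniform_logits = [[0.0] * 4 for _ in input_ids]
loss = masked_cross_entropy(uniform_logits, labels)
```

With uniform logits, each of the two SQL positions contributes `log(4)` and the prompt positions contribute nothing, so changing the logits that predict prompt tokens leaves the loss unchanged.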
35
 
36
  ## Intended Use and Limitations
37
 
@@ -45,8 +45,8 @@ Example 1:
45
  ```python
46
  import torch
47
  from transformers import AutoTokenizer, AutoModelForCausalLM
48
- tokenizer = AutoTokenizer.from_pretrained("motherduckdb/nsql-duckdb-7B")
49
- model = AutoModelForCausalLM.from_pretrained("motherduckdb/nsql-duckdb-7B", torch_dtype=torch.bfloat16)
50
 
51
  text = """### Instruction:
52
  Your task is to generate valid duckdb SQL to answer the following question.
@@ -70,8 +70,8 @@ Example 2:
70
  ```python
71
  import torch
72
  from transformers import AutoTokenizer, AutoModelForCausalLM
73
- tokenizer = AutoTokenizer.from_pretrained("motherduckdb/nsql-duckdb-7B")
74
- model = AutoModelForCausalLM.from_pretrained("motherduckdb/nsql-duckdb-7B", torch_dtype=torch.bfloat16)
75
 
76
  text = """### Instruction:
77
  Your task is to generate valid duckdb SQL to answer the following question, given a duckdb database schema.
@@ -108,8 +108,8 @@ Example 3:
108
  ```python
109
  import torch
110
  from transformers import AutoTokenizer, AutoModelForCausalLM
111
- tokenizer = AutoTokenizer.from_pretrained("motherduckdb/nsql-duckdb-7B")
112
- model = AutoModelForCausalLM.from_pretrained("motherduckdb/nsql-duckdb-7B", torch_dtype=torch.bfloat16)
113
 
114
  text = """### Instruction:
115
  Your task is to generate valid duckdb SQL to answer the following question, given a duckdb database schema.
 
23
 
24
  ## Training Data
25
 
26
+ The training data consists of 200k synthetically generated text-to-SQL pairs, produced with Mixtral 7B Instruct V1 and guided by the DuckDB v0.9.2 documentation, and of text-to-SQL pairs from [NSText2SQL](https://huggingface.co/datasets/NumbersStation/NSText2SQL) that were transpiled to DuckDB SQL using [sqlglot](https://github.com/tobymao/sqlglot).
27
 
28
  ## Evaluation Data
29
 
 
31
 
32
  ## Training Procedure
33
 
34
+ DuckDB-NSQL was trained using cross-entropy loss to maximize the likelihood of sequential inputs. For finetuning on text-to-SQL pairs, we only compute the loss over the SQL portion of the pair. The model was trained on 80GB A100s, leveraging data and model parallelism. We fine-tuned for 10 epochs.
35
 
36
  ## Intended Use and Limitations
37
 
 
45
  ```python
46
  import torch
47
  from transformers import AutoTokenizer, AutoModelForCausalLM
48
+ tokenizer = AutoTokenizer.from_pretrained("motherduckdb/DuckDB-NSQL-7B-v0.1")
49
+ model = AutoModelForCausalLM.from_pretrained("motherduckdb/DuckDB-NSQL-7B-v0.1", torch_dtype=torch.bfloat16)
50
 
51
  text = """### Instruction:
52
  Your task is to generate valid duckdb SQL to answer the following question.
 
70
  ```python
71
  import torch
72
  from transformers import AutoTokenizer, AutoModelForCausalLM
73
+ tokenizer = AutoTokenizer.from_pretrained("motherduckdb/DuckDB-NSQL-7B-v0.1")
74
+ model = AutoModelForCausalLM.from_pretrained("motherduckdb/DuckDB-NSQL-7B-v0.1", torch_dtype=torch.bfloat16)
75
 
76
  text = """### Instruction:
77
  Your task is to generate valid duckdb SQL to answer the following question, given a duckdb database schema.
 
108
  ```python
109
  import torch
110
  from transformers import AutoTokenizer, AutoModelForCausalLM
111
+ tokenizer = AutoTokenizer.from_pretrained("motherduckdb/DuckDB-NSQL-7B-v0.1")
112
+ model = AutoModelForCausalLM.from_pretrained("motherduckdb/DuckDB-NSQL-7B-v0.1", torch_dtype=torch.bfloat16)
113
 
114
  text = """### Instruction:
115
  Your task is to generate valid duckdb SQL to answer the following question, given a duckdb database schema.