tdoehmen committed on
Commit
3b92bae
1 Parent(s): d9b8722

Upload README.md

  1. README.md +143 -0
README.md CHANGED
@@ -1,3 +1,146 @@
1
  ---
2
  license: llama2
3
  ---
1
  ---
2
  license: llama2
3
+ inference:
4
+ parameters:
5
+ do_sample: false
6
+ max_length: 200
7
+ widget:
8
+ - text: "### Instruction:\nYour task is to generate valid duckdb SQL to answer the following question.\n\n### Input:\n\n### Question:\ncreate a new table called tmp from test.csv\n\n### Response (use duckdb shorthand if possible):"
9
+ example_title: "read test.csv"
10
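+ - text: "### Instruction:\nYour task is to generate valid duckdb SQL to answer the following question, given a duckdb database schema.\n\n### Input:\nHere is the database schema that the SQL query will run on:\nCREATE TABLE taxi (\n VendorID bigint,\n tpep_pickup_datetime timestamp,\n tpep_dropoff_datetime timestamp,\n passenger_count double,\n trip_distance double,\n fare_amount double,\n extra double,\n tip_amount double,\n tolls_amount double,\n improvement_surcharge double,\n total_amount double,\n);\n\n### Question:\nget all columns ending with _amount from taxi table\n\n### Response (use duckdb shorthand if possible):"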
11
+ example_title: "get _amount columns"
12
+ - text: "### Instruction:\nYour task is to generate valid duckdb SQL to answer the following question, given a duckdb database schema.\n\n### Input:\nHere is the database schema that the SQL query will run on:\nCREATE TABLE rideshare (\n hvfhs_license_num varchar,\n dispatching_base_num varchar,\n originating_base_num varchar,\n request_datetime timestamp,\n on_scene_datetime timestamp,\n pickup_datetime timestamp,\n dropoff_datetime timestamp,\n trip_miles double,\n trip_time bigint,\n\n);\n\n### Question:\nget longest trip in december 2022\n\n### Response (use duckdb shorthand if possible):"
13
+ example_title: "taxi trips"
14
  ---
15
+
16
+ # DuckDB-NSQL-7B
17
+
18
+ ## Model Description
19
+
20
+ NSQL is a family of autoregressive open-source large foundation models (FMs) designed specifically for SQL generation tasks.
21
+
22
+ This repository introduces a new member of the NSQL family, DuckDB-NSQL. It is based on Meta's original [Llama-2 7B model](https://huggingface.co/meta-llama/Llama-2-7b), further pre-trained on a dataset of general SQL queries, and then fine-tuned on a dataset of DuckDB text-to-SQL pairs.
23
+
24
+ ## Training Data
25
+
26
+ The general SQL queries are the SQL subset of [The Stack](https://huggingface.co/datasets/bigcode/the-stack), containing 1M training samples. These samples were transpiled to DuckDB SQL using [sqlglot](https://github.com/tobymao/sqlglot). The labeled text-to-SQL pairs come from [NSText2SQL](https://huggingface.co/datasets/NumbersStation/NSText2SQL), also transpiled to DuckDB SQL, and from 200k synthetically generated DuckDB SQL queries based on the DuckDB v0.9.2 documentation.
27
+
28
+ ## Evaluation Data
29
+
30
+ We evaluate our models on a DuckDB-specific benchmark that contains 75 text-to-SQL pairs. The benchmark is available [here](https://github.com/NumbersStationAI/DuckDB-NSQL/).
31
+
32
+ ## Training Procedure
33
+
34
+ DuckDB-NSQL was trained using cross-entropy loss to maximize the likelihood of sequential inputs. For fine-tuning on text-to-SQL pairs, we compute the loss only over the SQL portion of each pair. The model was trained on 80GB A100 GPUs, leveraging data and model parallelism. We pre-trained for 3 epochs and fine-tuned for 10 epochs.
35
+
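Computing the loss only over the SQL portion is commonly done by masking the prompt tokens in the label sequence. The sketch below illustrates that idea with plain Python lists. It is not the authors' training code: the helper name `build_labels` and the toy token IDs are hypothetical, and the use of `-100` assumes a PyTorch-style cross-entropy loss, whose default `ignore_index` is `-100`.

```python
# Label value that PyTorch's cross-entropy loss skips by default (ignore_index).
IGNORE_INDEX = -100


def build_labels(prompt_ids, sql_ids):
    """Return (input_ids, labels) for one text-to-SQL training pair.

    Hypothetical helper: the prompt tokens are kept as model input but are
    masked in the labels, so only the SQL tokens contribute to the loss.
    """
    input_ids = list(prompt_ids) + list(sql_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(sql_ids)
    return input_ids, labels


# Toy token IDs standing in for a tokenized prompt and its SQL answer.
input_ids, labels = build_labels([11, 12, 13], [21, 22])
print(input_ids)
print(labels)
```

With labels built this way, the model still conditions on the full prompt, but gradient signal comes only from predicting the SQL tokens.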
36
+ ## Intended Use and Limitations
37
+
38
+ The model was designed for text-to-SQL generation from a given table schema and a natural language prompt. It works best with the prompt format defined below.
39
+ In contrast to existing text-to-SQL models, SQL generation is not constrained to `SELECT` statements: the model can generate any valid DuckDB SQL statement, including statements for official DuckDB extensions.
40
+
41
+ ## How to Use
42
+
43
+ Example 1:
44
+
45
+ ```python
46
+ import torch
47
+ from transformers import AutoTokenizer, AutoModelForCausalLM
48
+ tokenizer = AutoTokenizer.from_pretrained("motherduckdb/nsql-duckdb-7B")
49
+ model = AutoModelForCausalLM.from_pretrained("motherduckdb/nsql-duckdb-7B", torch_dtype=torch.bfloat16)
50
+
51
+ text = """### Instruction:
52
+ Your task is to generate valid duckdb SQL to answer the following question.
53
+
54
+ ### Input:
55
+
56
+ ### Question:
57
+ create a new table called tmp from test.csv
58
+
59
+ ### Response (use duckdb shorthand if possible):
60
+ """
61
+
62
+ input_ids = tokenizer(text, return_tensors="pt").input_ids
63
+
64
+ generated_ids = model.generate(input_ids, max_length=500)
65
+ print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
66
+ ```
67
+
68
+ Example 2:
69
+
70
+ ```python
71
+ import torch
72
+ from transformers import AutoTokenizer, AutoModelForCausalLM
73
+ tokenizer = AutoTokenizer.from_pretrained("motherduckdb/nsql-duckdb-7B")
74
+ model = AutoModelForCausalLM.from_pretrained("motherduckdb/nsql-duckdb-7B", torch_dtype=torch.bfloat16)
75
+
76
+ text = """### Instruction:
77
+ Your task is to generate valid duckdb SQL to answer the following question, given a duckdb database schema.
78
+
79
+ ### Input:
80
+ Here is the database schema that the SQL query will run on:
81
+ CREATE TABLE taxi (
82
+ VendorID bigint,
83
+ tpep_pickup_datetime timestamp,
84
+ tpep_dropoff_datetime timestamp,
85
+ passenger_count double,
86
+ trip_distance double,
87
+ fare_amount double,
88
+ extra double,
89
+ tip_amount double,
90
+ tolls_amount double,
91
+ improvement_surcharge double,
92
+ total_amount double,
93
+ );
94
+
95
+ ### Question:
96
+ get all columns ending with _amount from taxi table
97
+
98
+ ### Response (use duckdb shorthand if possible):"""
99
+
100
+ input_ids = tokenizer(text, return_tensors="pt").input_ids
101
+
102
+ generated_ids = model.generate(input_ids, max_length=500)
103
+ print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
104
+ ```
105
+
106
+ Example 3:
107
+
108
+ ```python
109
+ import torch
110
+ from transformers import AutoTokenizer, AutoModelForCausalLM
111
+ tokenizer = AutoTokenizer.from_pretrained("motherduckdb/nsql-duckdb-7B")
112
+ model = AutoModelForCausalLM.from_pretrained("motherduckdb/nsql-duckdb-7B", torch_dtype=torch.bfloat16)
113
+
114
+ text = """### Instruction:
115
+ Your task is to generate valid duckdb SQL to answer the following question, given a duckdb database schema.
116
+
117
+ ### Input:
118
+ Here is the database schema that the SQL query will run on:
119
+ CREATE TABLE rideshare (
120
+ hvfhs_license_num varchar,
121
+ dispatching_base_num varchar,
122
+ originating_base_num varchar,
123
+ request_datetime timestamp,
124
+ on_scene_datetime timestamp,
125
+ pickup_datetime timestamp,
126
+ dropoff_datetime timestamp,
127
+ trip_miles double,
128
+ trip_time bigint,
129
+
130
+ );
131
+
132
+ ### Question:
133
+ get longest trip in december 2022
134
+
135
+ ### Response (use duckdb shorthand if possible):
136
+ """
137
+
138
+ input_ids = tokenizer(text, return_tensors="pt").input_ids
139
+
140
+ generated_ids = model.generate(input_ids, max_length=500)
141
+ print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
142
+ ```
143
+
144
+
145
+
146
+ For more information (e.g., running the model against your local database), see the examples in [this repository](https://github.com/NumbersStationAI/DuckDB-NSQL).
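The three examples above share one prompt layout, and `tokenizer.decode` returns the prompt together with the generated SQL. The following sketch shows one way to build that prompt from a question and an optional schema, and to keep only the text after the response marker. The helper names `build_prompt` and `extract_sql` are hypothetical and not part of any library, and the SQL string in the usage lines is a hand-written stand-in, not an actual model generation.

```python
RESPONSE_MARKER = "### Response (use duckdb shorthand if possible):"


def build_prompt(question, schema=""):
    """Assemble a prompt in the layout used by the examples above.

    Hypothetical helper: `schema` is a string of CREATE TABLE statements,
    or empty when the question needs no schema.
    """
    if schema:
        instruction = (
            "Your task is to generate valid duckdb SQL to answer the "
            "following question, given a duckdb database schema."
        )
        input_block = (
            "Here is the database schema that the SQL query will run on:\n"
            + schema.strip()
            + "\n"
        )
    else:
        instruction = (
            "Your task is to generate valid duckdb SQL to answer the "
            "following question."
        )
        input_block = ""
    return (
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{input_block}\n"
        f"### Question:\n{question.strip()}\n\n"
        f"{RESPONSE_MARKER}\n"
    )


def extract_sql(decoded):
    """Return the text after the last response marker (the generated SQL).

    If the marker is missing, the whole decoded text is returned, stripped.
    """
    _, _, tail = decoded.rpartition(RESPONSE_MARKER)
    return tail.strip()


prompt = build_prompt("create a new table called tmp from test.csv")
# Stand-in for the decoded model output: prompt followed by some SQL.
decoded = prompt + "CREATE TABLE tmp AS FROM 'test.csv';"
print(extract_sql(decoded))
```

In the examples above, `prompt` would take the place of `text`, and `extract_sql` would be applied to the string returned by `tokenizer.decode`.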