RichardErkhov committed on
Commit
47a1729
·
verified ·
1 Parent(s): 77bcec1

uploaded readme

Files changed (1)
  1. README.md +176 -0
README.md ADDED
@@ -0,0 +1,176 @@
1
+ Quantization made by Richard Erkhov.
2
+
3
+ [GitHub](https://github.com/RichardErkhov)
4
+
5
+ [Discord](https://discord.gg/pvy7H8DZMG)
6
+
7
+ [Request more models](https://github.com/RichardErkhov/quant_request)
8
+
9
+
10
+ natural-sql-7b - bnb 4bits
11
+ - Model creator: https://huggingface.co/chatdb/
12
+ - Original model: https://huggingface.co/chatdb/natural-sql-7b/
13
+
14
+
15
+
16
+
17
+ Original model description:
18
+ ---
19
+ base_model: deepseek-ai/deepseek-coder-6.7b-instruct
20
+ tags:
21
+ - instruct
22
+ - finetune
23
+ library_name: transformers
24
+ license: cc-by-sa-4.0
25
+ pipeline_tag: text-generation
26
+ ---
27
+
28
+ # **Natural-SQL-7B by ChatDB**
29
+ ## Natural-SQL-7B is a model with very strong performance on Text-to-SQL instructions. It has an excellent understanding of complex questions and outperforms models of the same size in its space.
30
+
31
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/648a374f00f7a3374ee64b99/hafdsfrFCqrVbATIzV_EN.png" width="600">
32
+
33
+ [ChatDB.ai](https://chatdb.ai) | [Notebook](https://github.com/cfahlgren1/natural-sql/blob/main/natural-sql-7b.ipynb) | [Twitter](https://twitter.com/calebfahlgren)
34
+
35
+ # **Benchmarks**
36
+ ### *Results on novel datasets (not seen in training), evaluated via SQL-Eval*
37
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/648a374f00f7a3374ee64b99/5ynfoKPzI3_-WasQQt7qR.png" width="800">
38
+
39
+ <em>Big thanks to the [defog](https://huggingface.co/defog) team for open-sourcing [sql-eval](https://github.com/defog-ai/sql-eval)</em>👏
40
+
41
+ Natural-SQL can also handle complex, compound questions that other models typically struggle with. A more detailed write-up, including a small test, is available [here](https://chatdb.ai/post/naturalsql-vs-sqlcoder-for-text-to-sql).
42
+ # Usage
43
+
44
+ Make sure you have the correct version of the transformers library installed:
45
+
46
+ ```sh
47
+ pip install transformers==4.35.2
48
+ ```
49
+
50
+ ### Loading the Model
51
+
52
+ Use the following Python code to load the model:
53
+
54
+ ```python
55
+ import torch
56
+ from transformers import AutoModelForCausalLM, AutoTokenizer
57
+ tokenizer = AutoTokenizer.from_pretrained("chatdb/natural-sql-7b")
58
+ model = AutoModelForCausalLM.from_pretrained(
59
+ "chatdb/natural-sql-7b",
60
+ device_map="auto",
61
+ torch_dtype=torch.float16,
62
+ )
63
+ ```
64
+
65
+ ### **License**
66
+
67
+ The model weights are licensed under `CC BY-SA 4.0`, with extra guidelines for responsible use expanded from the original model's [Deepseek](https://github.com/deepseek-ai/deepseek-coder/blob/main/LICENSE-MODEL) license.
68
+ You're free to use and adapt the model, even commercially.
69
+ If you alter the weights, such as through fine-tuning, you must publicly share your changes under the same `CC BY-SA 4.0` license.
70
+
71
+
72
+ ### Generating SQL
73
+
74
+ ```python
75
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # prompt: the filled-in Prompt Template
76
+ generated_ids = model.generate(
77
+ **inputs,
78
+ num_return_sequences=1,
79
+ eos_token_id=tokenizer.eos_token_id,
80
+ pad_token_id=tokenizer.eos_token_id,
81
+ max_new_tokens=400,
82
+ do_sample=False,
83
+ num_beams=1,
84
+ )
85
+
86
+ outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
87
+ print(outputs[0].split("```sql")[-1])
88
+ ```
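
The final `split(...)[-1]` step above keeps whatever follows the last SQL fence marker in the decoded text. A minimal sketch of that post-processing as a reusable function is shown here; `extract_sql` is a hypothetical helper name, and the trimming of a trailing closing fence is an assumption about what the model may emit, not documented behavior.

```python
# Hypothetical helper (not part of transformers or the model card).
FENCE = "`" * 3  # three backticks, built this way to keep the snippet fence-safe


def extract_sql(generated_text: str) -> str:
    """Return the text after the last sql fence marker, without a trailing fence."""
    # Same idea as outputs[0].split(...)[-1]: keep what follows the last marker.
    tail = generated_text.split(FENCE + "sql")[-1]
    # Assumption: if the model emits a closing fence, drop it and anything after.
    tail = tail.split(FENCE)[0]
    return tail.strip()


sample = "Here is the SQL query:\n" + FENCE + "sql\nSELECT 1;\n" + FENCE + "\n"
print(extract_sql(sample))
```

If the decoded text contains no marker at all, the function returns the whole text stripped of surrounding whitespace, which matches how `split` behaves when the separator is absent.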
89
+ # Prompt Template
90
+
91
+ ```
92
+ # Task
93
+ Generate a SQL query to answer the following question: `{natural language question}`
94
+
95
+ ### PostgreSQL Database Schema
96
+ The query will run on a database with the following schema:
97
+
98
+ <SQL Table DDL Statements>
99
+
100
+ # SQL
101
+ Here is the SQL query that answers the question: `{natural language question}`
102
+ '''sql
103
+ ```
104
+
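
The template above can be filled in programmatically. The following is a sketch under stated assumptions: `build_prompt` and `PROMPT_TEMPLATE` are hypothetical names, and the `'''sql` line in the template is assumed to stand in for a triple-backtick `sql` marker, since the generation snippet splits the output on that marker.

```python
# Sketch only: `build_prompt` is a hypothetical helper, not part of the model's API.
FENCE = "`" * 3  # three backticks; assumed to be what '''sql stands for in the template

PROMPT_TEMPLATE = """# Task
Generate a SQL query to answer the following question: `{question}`

### PostgreSQL Database Schema
The query will run on a database with the following schema:

{schema}

# SQL
Here is the SQL query that answers the question: `{question}`
{fence}sql
"""


def build_prompt(question: str, schema: str) -> str:
    """Fill the template with a question and the table DDL statements."""
    return PROMPT_TEMPLATE.format(
        question=question.strip(), schema=schema.strip(), fence=FENCE
    )


prompt = build_prompt(
    "Show me the day with the most users joining",
    "CREATE TABLE users (user_id SERIAL PRIMARY KEY, created_at TIMESTAMP);",
)
print(prompt)
```

The question appears twice, as in the template, and the prompt ends on the opening SQL marker so that the model continues directly with the query.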
105
+
106
+ # Example SQL Output
107
+
108
+ ### Example Schemas
109
+
110
+ ```sql
111
+ CREATE TABLE users (
112
+ user_id SERIAL PRIMARY KEY,
113
+ username VARCHAR(50) NOT NULL,
114
+ email VARCHAR(100) NOT NULL,
115
+ password_hash TEXT NOT NULL,
116
+ created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
117
+ );
118
+ CREATE TABLE projects (
119
+ project_id SERIAL PRIMARY KEY,
120
+ project_name VARCHAR(100) NOT NULL,
121
+ description TEXT,
122
+ start_date DATE,
123
+ end_date DATE,
124
+ owner_id INTEGER REFERENCES users(user_id)
125
+ );
126
+ CREATE TABLE tasks (
127
+ task_id SERIAL PRIMARY KEY,
128
+ task_name VARCHAR(100) NOT NULL,
129
+ description TEXT,
130
+ due_date DATE,
131
+ status VARCHAR(50),
132
+ project_id INTEGER REFERENCES projects(project_id)
133
+ );
134
+ CREATE TABLE taskassignments (
135
+ assignment_id SERIAL PRIMARY KEY,
136
+ task_id INTEGER REFERENCES tasks(task_id),
137
+ user_id INTEGER REFERENCES users(user_id),
138
+ assigned_date DATE NOT NULL DEFAULT CURRENT_TIMESTAMP
139
+ );
140
+ CREATE TABLE comments (
141
+ comment_id SERIAL PRIMARY KEY,
142
+ content TEXT NOT NULL,
143
+ created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
144
+ task_id INTEGER REFERENCES tasks(task_id),
145
+ user_id INTEGER REFERENCES users(user_id)
146
+ );
147
+ ```
148
+ ### Example SQL Outputs
149
+
150
+ **Question**: **Show me the day with the most users joining**
151
+ ```sql
152
+ SELECT created_at::DATE AS day, COUNT(*) AS user_count
153
+ FROM users
154
+ GROUP BY day
155
+ ORDER BY user_count DESC
156
+ LIMIT 1;
157
+ ```
158
+ **Question**: **Show me the project that has a task with the most comments**
159
+ ```sql
160
+ SELECT p.project_name, t.task_name, COUNT(c.comment_id) AS comment_count
161
+ FROM projects p
162
+ JOIN tasks t ON p.project_id = t.project_id
163
+ JOIN comments c ON t.task_id = c.task_id
164
+ GROUP BY p.project_name, t.task_name
165
+ ORDER BY comment_count DESC
166
+ LIMIT 1;
167
+ ```
168
+
169
+ **Question**: **What is the ratio of users with gmail addresses vs without?**
170
+ ```sql
171
+ SELECT
172
+ SUM(CASE WHEN email ILIKE '%@gmail.com%' THEN 1 ELSE 0 END)::FLOAT / NULLIF(SUM(CASE WHEN email NOT ILIKE '%@gmail.com%' THEN 1 ELSE 0 END), 0) AS gmail_ratio
173
+ FROM
174
+ users;
175
+ ```
176
+
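
To make the arithmetic in the last query concrete, here is a small Python sketch of what it computes. `gmail_ratio` is a hypothetical function written for illustration, and the sample addresses are made up; it mirrors the case-insensitive match of `ILIKE` and the way `NULLIF(..., 0)` turns a zero divisor into `NULL`.

```python
from typing import List, Optional


def gmail_ratio(emails: List[str]) -> Optional[float]:
    """Mirror the query: gmail count / non-gmail count, None if the divisor is 0."""
    # ILIKE '%@gmail.com%' is a case-insensitive substring match.
    gmail = sum(1 for e in emails if "@gmail.com" in e.lower())
    other = len(emails) - gmail
    # NULLIF(other, 0) yields NULL for a zero divisor, so the whole ratio is NULL.
    return gmail / other if other else None


# Illustrative data only.
emails = ["a@gmail.com", "b@GMAIL.com", "c@example.com", "d@example.org"]
print(gmail_ratio(emails))
```

Returning `None` when there are no non-Gmail users corresponds to the SQL result being `NULL` rather than a division-by-zero error.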