Update README.md

5fae20c verified 10 months ago

4.27 kB

	---
	base_model: deepseek-ai/deepseek-coder-6.7b-instruct
	tags:
	- instruct
	- finetune
	model-index:
	- name: NaturalQuery-6.7B-v0.1
	results: []
	license: other
	license_name: deepseek
	language:
	- en
	datasets:
	- cfahlgren1/wiki-sql-codellama-expanded
	- cfahlgren1/natural-sql
	---

	# NaturalQuery-6.7B-v0.1

	<img src="https://cdn-uploads.huggingface.co/production/uploads/648a374f00f7a3374ee64b99/hafdsfrFCqrVbATIzV_EN.png" width="600">

	NaturalQuery is a LLM that can translate natural language queries to SQL based on your schema. It is finetuned on 8k text to PostgreSQL Natural Language <> SQL pairs.

	Future Improvements:

	- Much larger training set
	- More complex schemas, questions, and queries
	- Reward modeling via DPO
	- Benchmarking

	# Usage

	Make sure you have the correct version of the transformers library installed:

	```sh
	pip install transformers==4.35.2
	```

	### Loading the Model

	Use the following Python code to load the model:

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer
	tokenizer = AutoTokenizer.from_pretrained("cfahlgren1/NaturalSQL-6.7B-v0")
	model = AutoModelForCausalLM.from_pretrained(
	"cfahlgren1/NaturalSQL-6.7B-v0",
	device_map="auto",
	torch_dtype=torch.float16,
	)
	```

	### Generating Text

	To generate text, use the following Python code. [Here](https://gist.github.com/cfahlgren1/ba17f01cf688c4229686dc3dfb4d4549) is a full notebook with the SQL table prompt format to use.

	```python
	messages=[
	{ 'role': 'user', 'content': prompt}
	]

	inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

	# 32023 is the id of <\|EOT\|> token
	outputs = model.generate(inputs, max_new_tokens=512, do_sample=False, top_k=50, top_p=0.95, num_return_sequences=1, eos_token_id=32023)

	print(tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True))

	```


	# SQL Generation Template

	```
	### Task

	Generate a SQL query to answer the following question: `{natural language question}`

	### Database Schema

	The query will run on a database with the following schema:

	'''
	<SQL Table DDL Statements>
	'''

	### Answer
	Here is the SQL query that answers the question: `{natural language question}`
	'''sql
	```

	# Example SQL Output

	### Example Schemas

	```sql
	CREATE TABLE users (
	user_id SERIAL PRIMARY KEY,
	username VARCHAR(50) NOT NULL,
	email VARCHAR(100) NOT NULL,
	password_hash TEXT NOT NULL,
	created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
	);

	CREATE TABLE projects (
	project_id SERIAL PRIMARY KEY,
	project_name VARCHAR(100) NOT NULL,
	description TEXT,
	start_date DATE,
	end_date DATE,
	owner_id INTEGER REFERENCES users(user_id)
	);

	CREATE TABLE tasks (
	task_id SERIAL PRIMARY KEY,
	task_name VARCHAR(100) NOT NULL,
	description TEXT,
	due_date DATE,
	status VARCHAR(50),
	project_id INTEGER REFERENCES projects(project_id)
	);

	CREATE TABLE taskassignments (
	assignment_id SERIAL PRIMARY KEY,
	task_id INTEGER REFERENCES tasks(task_id),
	user_id INTEGER REFERENCES users(user_id),
	assigned_date DATE NOT NULL DEFAULT CURRENT_TIMESTAMP
	);

	CREATE TABLE comments (
	comment_id SERIAL PRIMARY KEY,
	content TEXT NOT NULL,
	created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
	task_id INTEGER REFERENCES tasks(task_id),
	user_id INTEGER REFERENCES users(user_id)
	);
	```

	Question: Show me the day with the most users joining
	```sql
	SELECT created_at::DATE AS day, COUNT(*) AS user_count
	FROM users
	GROUP BY day
	ORDER BY user_count DESC
	LIMIT 1;
	```
	Question: Show me the project that has a task with the most comments
	```sql
	SELECT p.project_name, t.task_name, COUNT(c.comment_id) AS comment_count
	FROM projects p
	JOIN tasks t ON p.project_id = t.project_id
	JOIN comments c ON t.task_id = c.task_id
	GROUP BY p.project_name, t.task_name
	ORDER BY comment_count DESC
	LIMIT 1;
	```

	Question: What is the ratio of users with gmail addresses vs without?
	```sql
	SELECT
	SUM(CASE WHEN email ILIKE '%@gmail.com%' THEN 1 ELSE 0 END)::FLOAT / NULLIF(SUM(CASE WHEN email NOT ILIKE '%@gmail.com%' THEN 1 ELSE 0 END), 0) AS gmail_ratio
	FROM
	users;
	```