---
language:
- en
license: apache-2.0
tags:
- zsql
- chatml
- synthetic data
- text-to-sql
- dpo
datasets:
- zerolink/zsql-sqlite-dpo
widget:
  - text: >-
      <|im_start|>system
      Translate English to SQLite SQL.<|im_end|>
      <|im_start|>user
      Using the schema:
      CREATE TABLE Product (
          product_id INTEGER PRIMARY KEY,
          name TEXT NOT NULL,
          price DECIMAL NOT NULL,
          description TEXT
      );
      Generate SQL for the following question:
      What are all products worth more than $5.10?
      <|im_end|>
    example_title: sql
---

zsql-sqlite is a text-to-SQL model that is instruction-tuned to translate
English-language questions into SQLite SQL queries. The model is trained
on the [ZeroLink DPO](https://huggingface.co/datasets/zerolink/zsql-sqlite-dpo)
dataset.

This model only generates SQL queries and is designed to be
further fine-tuned on specific database schemas.

## Usage

You can run this model using the following code:

```python
import transformers
from transformers import AutoTokenizer

model = "zerolink/zsql-en-sqlite"

tokenizer = AutoTokenizer.from_pretrained(model)

prompt = """
Using the schema:
CREATE TABLE Product (
    product_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    price DECIMAL NOT NULL,
    description TEXT
);

CREATE TABLE Customer (
    customer_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    email TEXT,
    phone TEXT
);
Generate SQL for the following question:
What are the prices and descriptions for all products that are greater than $5?
"""

system = "Translate English to SQLite SQL."
message = [
    {"role": "system", "content": system},
    {"role": "user", "content": prompt},
]

prompt = tokenizer.apply_chat_template(message, add_generation_prompt=True, tokenize=False)

# Create pipeline
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer
)

# Generate text
sequences = pipeline(
    prompt,
    do_sample=True,
    temperature=0.1,
    top_p=0.9,
    num_return_sequences=1,
    max_length=1024,
)
# By default, generated_text contains the prompt followed by the completion
print(sequences[0]['generated_text'])
```
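
Because the pipeline returns the prompt together with the completion, it can help to strip the prompt and check that the generated query compiles against your schema before using it. The following is a minimal sketch using only the standard library. The helper names `extract_sql` and `is_valid_sqlite` are hypothetical, and the sketch assumes the returned text still contains the ChatML `<|im_start|>assistant` marker produced by the chat template.

```python
import sqlite3

# Schema taken from the usage example above (Product table only).
SCHEMA = """
CREATE TABLE Product (
    product_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    price DECIMAL NOT NULL,
    description TEXT
);
"""


def extract_sql(generated_text: str) -> str:
    """Return the text after the last assistant marker, without ChatML end tokens."""
    marker = "<|im_start|>assistant"
    # If the marker is absent, rpartition leaves the whole text in `completion`.
    _, _, completion = generated_text.rpartition(marker)
    return completion.replace("<|im_end|>", "").strip()


def is_valid_sqlite(schema: str, query: str) -> bool:
    """Check that `query` compiles against `schema` in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)
        # EXPLAIN compiles the statement and returns its bytecode
        # instead of running the statement itself.
        conn.execute("EXPLAIN " + query)
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False
    finally:
        conn.close()
```

For example, `is_valid_sqlite(SCHEMA, extract_sql(sequences[0]['generated_text']))` returns `False` when the query references a table or column that the schema does not define. It only checks that the query compiles, not that it answers the question correctly.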

## Training hyperparameters

**LoRA**:

* r=16
* lora_alpha=16
* lora_dropout=0.05
* bias="none"
* task_type="CAUSAL_LM"
* target_modules=['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
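
For reference, the LoRA settings above can be collected into a plain dictionary. This is a sketch only: the name `lora_kwargs` is illustrative, and this card does not state which library was used to build the adapter configuration.

```python
# Values copied from the LoRA list above; the dict name is illustrative.
lora_kwargs = {
    "r": 16,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "target_modules": [
        "k_proj", "gate_proj", "v_proj", "up_proj",
        "q_proj", "o_proj", "down_proj",
    ],
}

# Standard LoRA scales the low-rank update by lora_alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(scaling)  # 1.0
```

If you fine-tune further with the `peft` library, these keys match the keyword arguments of `peft.LoraConfig`, so `LoraConfig(**lora_kwargs)` would reproduce the settings.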

**Training arguments**:

* per_device_train_batch_size=4
* gradient_accumulation_steps=4
* gradient_checkpointing=True
* learning_rate=5e-5
* lr_scheduler_type="linear"
* max_steps=200
* optim="paged_adamw_32bit"
* warmup_steps=100
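
To make these numbers concrete, the sketch below computes the effective per-device batch size and the learning rate at a given step. It assumes the usual linear schedule with warmup (ramp from 0 to the peak rate, then linear decay to 0 at `max_steps`). The function name `linear_lr` is hypothetical, and the number of devices is not stated in this card.

```python
def linear_lr(step: int, base_lr: float = 5e-5,
              warmup_steps: int = 100, max_steps: int = 200) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from base_lr to 0 at max_steps.
    remaining = (max_steps - step) / max(1, max_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


# Each optimizer step sees batch_size * accumulation_steps examples per device.
effective_batch_per_device = 4 * 4
examples_per_device = effective_batch_per_device * 200

print(effective_batch_per_device)  # 16
print(examples_per_device)         # 3200
print(linear_lr(100))              # peak learning rate, reached at step 100
```

With `warmup_steps=100` and `max_steps=200`, the first half of training is warmup, and the learning rate reaches its peak of 5e-5 only at step 100.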

**DPOTrainer**:

* beta=0.1
* max_prompt_length=4096
* max_length=3516
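
As a rough illustration of what `beta` controls, here is the standard sigmoid DPO loss for a single preference pair. This is a sketch under assumptions: the card does not say which loss variant was used, and the function name `dpo_loss` and its arguments are hypothetical. The inputs are summed log-probabilities of the chosen and rejected completions under the policy and the frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Sigmoid DPO loss for one (chosen, rejected) pair."""
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: margin is 0, so the loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy prefers the chosen completion more than the reference does: lower loss.
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

A larger `beta` amplifies the same log-probability differences into a larger margin, which keeps the policy closer to the reference model for a given loss value.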