Commit c4225a4
1 Parent(s): 9c760a1

Update files

Changed base model to Qwen 2.5 Coder.
Changed epochs to 5.
Changed final model folder name.
Updated READMEs.
- README.es.md +10 -7
- README.md +11 -8
- test.py +49 -0
- trainer.py +21 -37
README.es.md
CHANGED
@@ -11,7 +11,7 @@ tags:
 license: "apache-2.0"
 datasets:
 - xlangai/spider
-base_model: "deepseek-ai/deepseek-coder-1.3b-base"
+base_model: "Qwen/Qwen2.5-Coder-1.5B-Instruct"
 ---

 > [Ver en inglés/See in English](https://huggingface.co/Komma-LuisMiSanVe/LangToSQL/blob/main/README.md)

@@ -38,24 +38,27 @@ base_model: "deepseek-ai/deepseek-coder-1.3b-base"
 El modelo de IA ha sido entrenado para convertir lenguaje natural a sentencias de PostgreSQL.

 ## 📝 Explicación de Tecnología
-El modelo usa [
+El modelo usa [Qwen Coder](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct) de base, refinado con los datasets de [Spider](https://yale-lily.github.io/spider).

 El dataset en archivo `JSON` contiene `train_spider.json` de **Spider**, ya que es el dataset principal.

-El modelo se
+El modelo se ha exportado a `GGUF` con [llama.cpp](https://github.com/ggml-org/llama.cpp) para que puedas usarlo en programas como [LM Studio](https://lmstudio.ai/).

 ## 🛠️ Instalación
 Para ejecutar el script de entrenamiento por tu cuenta, primero necesitas instalar [Python](https://www.python.org/) y ejecutar este comando:
 ```
-pip install transformers datasets peft accelerate bitsandbytes trl
+pip install transformers datasets peft accelerate bitsandbytes trl==1.0.0
 ```
 Dependiendo de la versión, es posible que necesites usar este en su lugar:
 ```
-py -m pip install transformers datasets peft accelerate bitsandbytes trl
+py -m pip install transformers datasets peft accelerate bitsandbytes trl==1.0.0
 ```

+>[!IMPORTANT]
+>Asegúrate de que la librería `TRL` esté en la versión `1.0.0`, ya que es la única versión compatible con el script de entrenamiento.
+
 ## 📂 Archivos
-Este repositorio incluye los archivos del modelo LLM entrenado, su script de entrenamiento
+Este repositorio incluye los archivos del modelo LLM entrenado, su script de entrenamiento, el dataset para entrenar y un script para probar el modelo `.safetensors`.

 Puedes descargar el `GGUF` final desde los [Lanzamientos](https://github.com/LuisMiSanVe/LangToSQL_LLM/releases).

@@ -79,6 +82,6 @@ El número de la versión seguirá este formato: \
 - [trl](https://pypi.org/project/trl/)
 - Otros:
   - [llama.cpp](https://github.com/ggml-org/llama.cpp)
-  - [
+  - [Qwen Coder](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct)
   - [Spider](https://yale-lily.github.io/spider)
 - IDE Recomendado: [VS Code](https://code.visualstudio.com/)
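Both READMEs now pin `trl` to `1.0.0`. A quick guard like the following can fail fast when a different version is installed; a minimal sketch, assuming the package exposes `__version__` in the usual way:

```python
# Illustrative guard, not part of the commit: abort early if the installed
# TRL release differs from the one the trainer script targets.
import trl

assert trl.__version__ == "1.0.0", (
    f"trainer.py expects trl 1.0.0, found {trl.__version__}"
)
```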
README.md
CHANGED
@@ -11,7 +11,7 @@ tags:
 license: "apache-2.0"
 datasets:
 - xlangai/spider
-base_model: "deepseek-ai/deepseek-coder-1.3b-base"
+base_model: "Qwen/Qwen2.5-Coder-1.5B-Instruct"
 ---

 > [See in Spanish/Ver en español](https://huggingface.co/Komma-LuisMiSanVe/LangToSQL/blob/main/README.es.md)

@@ -38,24 +38,27 @@ base_model: "deepseek-ai/deepseek-coder-1.3b-base"
 The AI model has been trained to turn natural language into PostgreSQL queries.

 ## 📝 Technology Explanation
-This model uses [
+This model uses [Qwen Coder](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct) as a base, fine-tuned with the [Spider](https://yale-lily.github.io/spider) datasets.

 The `JSON` dataset file contains **Spider**'s `train_spider.json`, as it is the main dataset.

-The model
+The model is exported to `GGUF` with [llama.cpp](https://github.com/ggml-org/llama.cpp) so it can be used in programs like [LM Studio](https://lmstudio.ai/).

 ## 🛠️ Setup
 To run the training script on your own, you first need to install [Python](https://www.python.org/) and run this command:
 ```
-pip install transformers datasets peft accelerate bitsandbytes trl
+pip install transformers datasets peft accelerate bitsandbytes trl==1.0.0
 ```
 Depending on the version, you may have to use this instead:
 ```
-py -m pip install transformers datasets peft accelerate bitsandbytes trl
+py -m pip install transformers datasets peft accelerate bitsandbytes trl==1.0.0
 ```

+>[!IMPORTANT]
+>Make sure the `TRL` library is at version `1.0.0`, as it is the only version supported by the trainer script.
+
 ## 📂 Files
-This repository includes the trained LLM model's files, its training script
+This repository includes the trained LLM model's files, its training script, the training dataset and a tester script for the `.safetensors` model.

 You can download the final `GGUF` from the [Releases](https://github.com/LuisMiSanVe/LangToSQL_LLM/releases).

@@ -76,9 +79,9 @@ The version number will follow this format: \
 - [peft](https://pypi.org/project/peft/)
 - [accelerate](https://pypi.org/project/accelerate/)
 - [bitsandbytes](https://pypi.org/project/bitsandbytes/)
-- [trl](https://pypi.org/project/trl/)
+- [trl](https://pypi.org/project/trl/) (1.0.0)
 - Other:
   - [llama.cpp](https://github.com/ggml-org/llama.cpp)
-  - [
+  - [Qwen Coder](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct)
   - [Spider](https://yale-lily.github.io/spider)
 - Recommended IDE: [VS Code](https://code.visualstudio.com/)
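Both READMEs describe exporting the merged model to `GGUF` with llama.cpp, but the commit itself doesn't include that step. A sketch of how it could be scripted, assuming llama.cpp is cloned alongside the project; `convert_hf_to_gguf.py` is llama.cpp's converter, while the output file name is illustrative:

```python
# Hypothetical export step (not part of this commit): turn the merged
# ./sql-model-merged checkpoint into a GGUF file using llama.cpp's converter.
import subprocess

subprocess.run(
    [
        "python", "llama.cpp/convert_hf_to_gguf.py",
        "./sql-model-merged",           # folder written by trainer.py
        "--outfile", "LangToSQL.gguf",  # illustrative output name
        "--outtype", "f16",
    ],
    check=True,
)
```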
test.py
ADDED
@@ -0,0 +1,49 @@
+import torch
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+MODEL_PATH = "./sql-model-merged"
+
+PROMPT = """\
+Write a select query of the invoice table.
+"""
+
+print("Loading tokenizer...")
+
+tokenizer = AutoTokenizer.from_pretrained(
+    MODEL_PATH
+)
+
+print("Loading model... (this may take a while)")
+
+model = AutoModelForCausalLM.from_pretrained(
+    MODEL_PATH,
+    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
+    device_map="auto",
+    ignore_mismatched_sizes=True
+)
+
+model.eval()
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+print(f"Using device: {device}")
+
+inputs = tokenizer(PROMPT, return_tensors="pt").to(model.device)
+
+print("\nGenerating response...\n")
+
+with torch.no_grad():
+    outputs = model.generate(
+        **inputs,
+        max_new_tokens=256,
+        temperature=0.2,
+        top_p=0.95,
+        do_sample=True,
+        repetition_penalty=1.1,
+        eos_token_id=tokenizer.eos_token_id
+    )
+
+result = tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+print("===== MODEL OUTPUT =====\n")
+print(result)
+print("\n========================")
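Note that `trainer.py` (below) formats every training example through the tokenizer's chat template, while `test.py` feeds the model a raw string. If outputs look off, wrapping the prompt the same way may help; a hedged variant of the prompt handling, not part of the commit:

```python
# Hypothetical alternative to test.py's raw PROMPT: render the request through
# the same chat template the trainer used before tokenizing it.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./sql-model-merged")
messages = [{"role": "user", "content": "Write SQL query for: a select query of the invoice table"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # open the assistant turn for generation
)
inputs = tokenizer(prompt, return_tensors="pt")  # pass to model.generate() as in test.py
```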
trainer.py
CHANGED
@@ -4,52 +4,35 @@ from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
 from peft import LoraConfig, PeftModel
 from trl import SFTTrainer

-model_name = "
+model_name = "Qwen/Qwen2.5-Coder-1.5B-Instruct"
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 tokenizer.pad_token = tokenizer.eos_token

 model = AutoModelForCausalLM.from_pretrained(
     model_name,
     torch_dtype=torch.float32,
-    device_map=
+    device_map="auto"
 )

+model.config.pad_token_id = tokenizer.eos_token_id
+
 dataset = load_dataset("json", data_files="train.json", split="train")

-def format_example(
+def format_example(x):
+    messages = [
+        {"role": "user", "content": f"Write SQL query for: {x['question']}"},
+        {"role": "assistant", "content": x["query"]}
+    ]
     return {
-        "
-
-
+        "text": tokenizer.apply_chat_template(
+            messages,
+            tokenize=False,
+            add_generation_prompt=False
+        )
     }

 dataset = dataset.map(format_example)

-def tokenize(example):
-    prompt_ids = tokenizer(
-        example["instruction"],
-        padding="max_length",
-        truncation=True,
-        max_length=512
-    ).input_ids
-
-    label_ids = tokenizer(
-        example["output"],
-        padding="max_length",
-        truncation=True,
-        max_length=512
-    ).input_ids
-
-    attention_mask = [1 if id != tokenizer.pad_token_id else 0 for id in prompt_ids]
-
-    return {
-        "input_ids": prompt_ids,
-        "attention_mask": attention_mask,
-        "labels": label_ids
-    }
-
-dataset = dataset.map(tokenize, batched=False)
-
 peft_config = LoraConfig(
     r=16,
     lora_alpha=32,

@@ -64,10 +47,10 @@ training_args = TrainingArguments(
     per_device_train_batch_size=1,
     gradient_accumulation_steps=4,
     learning_rate=2e-4,
-    num_train_epochs=
+    num_train_epochs=5,
     logging_steps=10,
     save_strategy="epoch",
-    fp16=
+    fp16=torch.cuda.is_available()
 )

 trainer = SFTTrainer(

@@ -85,8 +68,9 @@ tokenizer.save_pretrained("./sql-model")
 base_model = AutoModelForCausalLM.from_pretrained(
     model_name,
     torch_dtype=torch.float32,
-    device_map=
+    device_map="auto"
 )
-
-
-
+model = PeftModel.from_pretrained(base_model, "./sql-model")
+model = model.merge_and_unload()
+model.save_pretrained("./sql-model-merged", safe_serialization=True)
+tokenizer.save_pretrained("./sql-model-merged")