Komma-LuisMiSanVe committed
Commit c4225a4 · 1 Parent(s): 9c760a1

Update files


Changed base model to Qwen 2.5 Coder
Changed epochs to 5.
Changed final model folder name.
Updated READMEs

Files changed (4)
  1. README.es.md +10 -7
  2. README.md +11 -8
  3. test.py +49 -0
  4. trainer.py +21 -37
README.es.md CHANGED
@@ -11,7 +11,7 @@ tags:
 license: "apache-2.0"
 datasets:
 - xlangai/spider
-base_model: "deepseek-ai/deepseek-coder-1.3b-base"
+base_model: "Qwen/Qwen2.5-Coder-1.5B-Instruct"
 ---
 
 > [Ver en inglés/See in English](https://huggingface.co/Komma-LuisMiSanVe/LangToSQL/blob/main/README.md)
@@ -38,24 +38,27 @@ base_model: "deepseek-ai/deepseek-coder-1.3b-base"
 El modelo de IA ha sido entrenado para convertir lenguaje natural a sentencias de PostgreSQL.
 
 ## 📝 Explicación de Tecnología
-El modelo usa [DeepSeek Coder](https://huggingface.co/deepseek-ai/deepseek-coder-1.3b-base) como base y se ha refinado con los datasets de [Spider](https://yale-lily.github.io/spider).
+El modelo usa [Qwen Coder](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct) como base y se ha refinado con los datasets de [Spider](https://yale-lily.github.io/spider).
 
 El archivo `JSON` del dataset contiene `train_spider.json` de **Spider**, ya que es el dataset principal.
 
-El modelo se puede exportar a `GGUF` con [llama.cpp](https://github.com/ggml-org/llama.cpp) para que puedas usarlo en programas como [LM Studio](https://lmstudio.ai/).
+El modelo se ha exportado a `GGUF` con [llama.cpp](https://github.com/ggml-org/llama.cpp) para que puedas usarlo en programas como [LM Studio](https://lmstudio.ai/).
 
 ## 🛠️ Instalación
 Para ejecutar el script de entrenamiento por tu cuenta, primero necesitas instalar [Python](https://www.python.org/) y ejecutar este comando:
 ```
-pip install transformers datasets peft accelerate bitsandbytes trl
+pip install transformers datasets peft accelerate bitsandbytes trl==1.0.0
 ```
 Dependiendo de la versión, es posible que necesites usar este en su lugar:
 ```
-py -m pip install transformers datasets peft accelerate bitsandbytes trl
+py -m pip install transformers datasets peft accelerate bitsandbytes trl==1.0.0
 ```
 
+>[!IMPORTANT]
+>Asegúrate de que la librería `TRL` esté en la versión `1.0.0`, ya que es la única versión compatible con el script de entrenamiento.
+
 ## 📂 Archivos
-Este repositorio incluye los archivos del modelo LLM entrenado, su script de entrenamiento y el dataset para entrenar.
+Este repositorio incluye los archivos del modelo LLM entrenado, su script de entrenamiento, el dataset para entrenar y un script para probar el modelo `.safetensors`.
 
 Puedes descargar el `GGUF` final desde los [Lanzamientos](https://github.com/LuisMiSanVe/LangToSQL_LLM/releases).
 
@@ -79,6 +82,6 @@ El número de la versión seguirá este formato: \
 - [trl](https://pypi.org/project/trl/)
 - Otros:
   - [llama.cpp](https://github.com/ggml-org/llama.cpp)
-  - [DeepSeek Coder](https://huggingface.co/deepseek-ai/deepseek-coder-1.3b-base)
+  - [Qwen Coder](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct)
   - [Spider](https://yale-lily.github.io/spider)
 - IDE Recomendado: [VS Code](https://code.visualstudio.com/)
README.md CHANGED
@@ -11,7 +11,7 @@ tags:
 license: "apache-2.0"
 datasets:
 - xlangai/spider
-base_model: "deepseek-ai/deepseek-coder-1.3b-base"
+base_model: "Qwen/Qwen2.5-Coder-1.5B-Instruct"
 ---
 
 > [See in Spanish/Ver en español](https://huggingface.co/Komma-LuisMiSanVe/LangToSQL/blob/main/README.es.md)
@@ -38,24 +38,27 @@ base_model: "deepseek-ai/deepseek-coder-1.3b-base"
 The AI model has been trained to turn natural language into PostgreSQL queries.
 
 ## 📝 Technology Explanation
-This model uses [DeepSeek Coder](https://huggingface.co/deepseek-ai/deepseek-coder-1.3b-base) as a base and is then fine-tuned with the [Spider](https://yale-lily.github.io/spider) datasets.
+This model uses [Qwen Coder](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct) as a base and is then fine-tuned with the [Spider](https://yale-lily.github.io/spider) datasets.
 
 The `JSON` dataset file contains **Spider**'s `train_spider.json`, as it is the main dataset.
 
-The model can be exported to `GGUF` with [llama.cpp](https://github.com/ggml-org/llama.cpp) so it can be used by programs like [LM Studio](https://lmstudio.ai/).
+The model is exported to `GGUF` with [llama.cpp](https://github.com/ggml-org/llama.cpp) so it can be used by programs like [LM Studio](https://lmstudio.ai/).
 
 ## 🛠️ Setup
 To run the training script on your own, you first need to install [Python](https://www.python.org/) and run this command:
 ```
-pip install transformers datasets peft accelerate bitsandbytes trl
+pip install transformers datasets peft accelerate bitsandbytes trl==1.0.0
 ```
 Depending on the version, you may have to use this instead:
 ```
-py -m pip install transformers datasets peft accelerate bitsandbytes trl
+py -m pip install transformers datasets peft accelerate bitsandbytes trl==1.0.0
 ```
 
+>[!IMPORTANT]
+>Make sure the `TRL` library version is `1.0.0`, as it is the only version supported by the trainer script.
+
 ## 📂 Files
-This repository includes the trained LLM's files, its training script and the training dataset.
+This repository includes the trained LLM's files, its training script, the training dataset and a script to test the `.safetensors` model.
 
 You can download the final `GGUF` from the [Releases](https://github.com/LuisMiSanVe/LangToSQL_LLM/releases).
 
@@ -76,9 +79,9 @@ The version number will follow this format: \
 - [peft](https://pypi.org/project/peft/)
 - [accelerate](https://pypi.org/project/accelerate/)
 - [bitsandbytes](https://pypi.org/project/bitsandbytes/)
-- [trl](https://pypi.org/project/trl/)
+- [trl](https://pypi.org/project/trl/) (1.0.0)
 - Other:
   - [llama.cpp](https://github.com/ggml-org/llama.cpp)
-  - [DeepSeek Coder](https://huggingface.co/deepseek-ai/deepseek-coder-1.3b-base)
+  - [Qwen Coder](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct)
   - [Spider](https://yale-lily.github.io/spider)
 - Recommended IDE: [VS Code](https://code.visualstudio.com/)
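Since the new IMPORTANT note hinges on the pinned `trl==1.0.0`, a quick sanity check after installing (a suggested step, not part of this commit) is:
```
python -c "import trl; print(trl.__version__)"
```
If it prints anything other than `1.0.0`, rerun the pinned `pip install` command from the Setup section.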
test.py ADDED
@@ -0,0 +1,49 @@
+import torch
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+MODEL_PATH = "./sql-model-merged"
+
+PROMPT = """\
+Write a select query of the invoice table.
+"""
+
+print("Loading tokenizer...")
+
+tokenizer = AutoTokenizer.from_pretrained(
+    MODEL_PATH
+)
+
+print("Loading model... (this may take a while)")
+
+model = AutoModelForCausalLM.from_pretrained(
+    MODEL_PATH,
+    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
+    device_map="auto",
+    ignore_mismatched_sizes=True
+)
+
+model.eval()
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+print(f"Using device: {device}")
+
+inputs = tokenizer(PROMPT, return_tensors="pt").to(model.device)
+
+print("\nGenerating response...\n")
+
+with torch.no_grad():
+    outputs = model.generate(
+        **inputs,
+        max_new_tokens=256,
+        temperature=0.2,
+        top_p=0.95,
+        do_sample=True,
+        repetition_penalty=1.1,
+        eos_token_id=tokenizer.eos_token_id
+    )
+
+result = tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+print("===== MODEL OUTPUT =====\n")
+print(result)
+print("\n========================")
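A side note on test.py: trainer.py (below) now fine-tunes on chat-templated `user`/`assistant` pairs, while test.py feeds the model a raw prompt string. A minimal sketch of prompting through the same chat template instead, assuming the merged model at `./sql-model-merged` as in test.py (a suggested variant, not part of this commit):
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_PATH = "./sql-model-merged"  # same merged model test.py loads
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto")
model.eval()

# Mirror trainer.py's format_example: "Write SQL query for: <question>"
messages = [{"role": "user", "content": "Write SQL query for: Write a select query of the invoice table."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens, not the echoed prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
Prompting through the template keeps test-time inputs in the same distribution the model was trained on, which may matter more for an Instruct base like Qwen2.5-Coder.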
trainer.py CHANGED
@@ -4,52 +4,35 @@ from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
 from peft import LoraConfig, PeftModel
 from trl import SFTTrainer
 
-model_name = "deepseek-ai/deepseek-coder-1.3b-base"
+model_name = "Qwen/Qwen2.5-Coder-1.5B-Instruct"
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 tokenizer.pad_token = tokenizer.eos_token
 
 model = AutoModelForCausalLM.from_pretrained(
     model_name,
     torch_dtype=torch.float32,
-    device_map={"": "cpu"} # Sets CPU for training, you can change it to use the GPU instead
+    device_map="auto"
 )
 
+model.config.pad_token_id = tokenizer.eos_token_id
+
 dataset = load_dataset("json", data_files="train.json", split="train")
 
-def format_example(example):
+def format_example(x):
+    messages = [
+        {"role": "user", "content": f"Write SQL query for: {x['question']}"},
+        {"role": "assistant", "content": x["query"]}
+    ]
     return {
-        "instruction": example["question"],
-        "input": "",
-        "output": example["query"]
+        "text": tokenizer.apply_chat_template(
+            messages,
+            tokenize=False,
+            add_generation_prompt=False
+        )
     }
 
 dataset = dataset.map(format_example)
 
-def tokenize(example):
-    prompt_ids = tokenizer(
-        example["instruction"],
-        padding="max_length",
-        truncation=True,
-        max_length=512
-    ).input_ids
-
-    label_ids = tokenizer(
-        example["output"],
-        padding="max_length",
-        truncation=True,
-        max_length=512
-    ).input_ids
-
-    attention_mask = [1 if id != tokenizer.pad_token_id else 0 for id in prompt_ids]
-
-    return {
-        "input_ids": prompt_ids,
-        "attention_mask": attention_mask,
-        "labels": label_ids
-    }
-
-dataset = dataset.map(tokenize, batched=False)
-
 peft_config = LoraConfig(
     r=16,
     lora_alpha=32,
@@ -64,10 +47,10 @@ training_args = TrainingArguments(
     per_device_train_batch_size=1,
     gradient_accumulation_steps=4,
     learning_rate=2e-4,
-    num_train_epochs=1, # More epochs -> better accuracy but longer training
+    num_train_epochs=5,
     logging_steps=10,
     save_strategy="epoch",
-    fp16=False
+    fp16=torch.cuda.is_available()
 )
 
 trainer = SFTTrainer(
@@ -85,8 +68,9 @@ tokenizer.save_pretrained("./sql-model")
 base_model = AutoModelForCausalLM.from_pretrained(
     model_name,
     torch_dtype=torch.float32,
-    device_map={"": "cpu"}
+    device_map="auto"
 )
-model_merged = PeftModel.from_pretrained(base_model, "./sql-model")
-model_merged = model_merged.merge_and_unload()
-model_merged.save_pretrained("./sql-model-merged")
+model = PeftModel.from_pretrained(base_model, "./sql-model")
+model = model.merge_and_unload()
+model.save_pretrained("./sql-model-merged", safe_serialization=True)
+tokenizer.save_pretrained("./sql-model-merged")
76
+ tokenizer.save_pretrained("./sql-model-merged")