军舰 committed on
Commit e3678d1 · 1 Parent(s): 3f8673d

Fine-tuning Text2SQL based on Mistral-7B using LoRA on MLX
README.md CHANGED

---
license: mit
---

## [mlx-community/Mistral-7B-v0.1-LoRA-Text2SQL](https://huggingface.co/mlx-community/Mistral-7B-v0.1-LoRA-Text2SQL)

I have uploaded the fine-tuned model to the HuggingFace Hub, so you can use it directly.

### Installation

```bash
pip install mlx-lm
```

### Generation

```bash
python -m mlx_lm.generate --model mlx-community/Mistral-7B-v0.1-LoRA-Text2SQL \
       --max-tokens 50 \
       --prompt "table: students
columns: Name, Age, School, Grade, Height, Weight
Q: Which school did Wang Junjian come from?
A: "
```
```
SELECT School FROM Students WHERE Name = 'Wang Junjian'
```
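The model can also be called from Python. A minimal sketch, assuming the `mlx_lm` Python API (`load`/`generate`) installed via `pip install mlx-lm`; `make_prompt` is a hypothetical helper that reproduces the `table:` / `columns:` / `Q:` / `A:` layout the model was trained on:

```python
def make_prompt(table, columns, question):
    """Build a Text2SQL prompt in the layout used during fine-tuning."""
    return (
        f"table: {table}\n"
        f"columns: {', '.join(columns)}\n"
        f"Q: {question}\n"
        "A: "
    )

prompt = make_prompt(
    "students",
    ["Name", "Age", "School", "Grade", "Height", "Weight"],
    "Which school did Wang Junjian come from?",
)

# Requires an Apple-silicon Mac and downloads the model on first use;
# the generate() signature may differ between mlx-lm versions.
# from mlx_lm import load, generate
# model, tokenizer = load("mlx-community/Mistral-7B-v0.1-LoRA-Text2SQL")
# sql = generate(model, tokenizer, prompt=prompt, max_tokens=50)
```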

## [Fine-tuning Text2SQL based on Mistral-7B using LoRA on MLX (Part 1)](https://wangjunjian.com/mlx/lora/2024/01/23/Fine-tuning-Text2SQL-based-on-Mistral-7B-using-LoRA-on-MLX-1.html)

📌 The dataset was not generated in the model's annotation format, so generation could not stop and ran on until the maximum number of tokens was reached.

This time we solve that problem.

## The WikiSQL dataset

- [WikiSQL](https://github.com/salesforce/WikiSQL)
- [sqllama/sqllama-V0](https://huggingface.co/sqllama/sqllama-V0/blob/main/wikisql.ipynb)

### Modifying the script mlx-examples/lora/data/wikisql.py

```py
if __name__ == "__main__":
    # ......
    for dataset, name, size in datasets:
        with open(f"data/{name}.jsonl", "w") as fid:
            for e, t in zip(range(size), dataset):
                """
                The text in the variable t looks like this:
                ------------------------
                <s>table: 1-1058787-1
                columns: Approximate Age, Virtues, Psycho Social Crisis, Significant Relationship, Existential Question [ not in citation given ], Examples
                Q: How many significant relationships list Will as a virtue?
                A: SELECT COUNT Significant Relationship FROM 1-1058787-1 WHERE Virtues = 'Will'</s>
                """
                t = t[3:]  # Drop the leading <s>; the tokenizer adds <s> automatically
                json.dump({"text": t}, fid)
                fid.write("\n")
```

Run the script `data/wikisql.py` to generate the dataset.
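The effect of the `t[3:]` fix can be checked in isolation. A self-contained sketch with a shortened sample (the real script iterates over WikiSQL):

```python
import io
import json

# A shortened sample in the same shape as the t variable above
sample = (
    "<s>table: students\n"
    "columns: Name, Age, School\n"
    "Q: Which school did Wang Junjian come from?\n"
    "A: SELECT School FROM students WHERE Name = 'Wang Junjian'</s>"
)

fid = io.StringIO()  # stands in for data/train.jsonl
t = sample[3:]       # drop the leading <s>; the tokenizer adds it back
json.dump({"text": t}, fid)
fid.write("\n")

record = json.loads(fid.getvalue())
# no <s> prefix, but </s> is kept so the model learns where to stop
```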

### Sample

```
table: 1-10753917-1
columns: Season, Driver, Team, Engine, Poles, Wins, Podiums, Points, Margin of defeat
Q: Which podiums did the alfa romeo team have?
A: SELECT Podiums FROM 1-10753917-1 WHERE Team = 'Alfa Romeo'</s>
```

## Fine-tuning

- Pre-trained model: [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)

### LoRA fine-tuning

```bash
python lora.py --model mistralai/Mistral-7B-v0.1 \
       --train \
       --iters 600
```
```
Total parameters 7243.436M
Trainable parameters 1.704M
python lora.py --model mistralai/Mistral-7B-v0.1 --train --iters 600  50.58s user 214.71s system 21% cpu 20:26.04 total
```

Only about 0.024% of the model's parameters (1.704M out of 7243.436M) are trained.

The 600 iterations of LoRA fine-tuning took 20 minutes 26 seconds and used 46 GB of memory.
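The 1.704M figure is consistent with what the mlx-examples defaults at the time would give: rank-8 LoRA adapters on the q_proj and v_proj projections of the last 16 layers. A back-of-the-envelope check, assuming those defaults:

```python
# Mistral-7B shapes (see config.json): hidden size 4096, 32 heads, 8 KV heads
d_model = 4096
head_dim = d_model // 32       # 128
kv_dim = 8 * head_dim          # 1024 (grouped-query attention)

rank = 8                       # assumed default LoRA rank
lora_layers = 16               # assumed default --lora-layers

# A LoRA pair on a d_in -> d_out linear layer adds rank * (d_in + d_out) params
q_proj = rank * (d_model + d_model)   # 4096 -> 4096 projection
v_proj = rank * (d_model + kv_dim)    # 4096 -> 1024 projection
trainable = lora_layers * (q_proj + v_proj)

print(trainable)  # 1703936, i.e. the reported 1.704M
```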

## Evaluation

Compute the perplexity (PPL) and cross-entropy loss on the test set.

```bash
python lora.py --model mistralai/Mistral-7B-v0.1 \
       --adapter-file adapters.npz \
       --test
```
```
Iter 100: Test loss 1.351, Test ppl 3.862.
Iter 200: Test loss 1.327, Test ppl 3.770.
Iter 300: Test loss 1.353, Test ppl 3.869.
Iter 400: Test loss 1.355, Test ppl 3.875.
Iter 500: Test loss 1.294, Test ppl 3.646.
Iter 600: Test loss 1.351, Test ppl 3.863.
```

| Iter | Test loss | Test ppl |
| :--: | --------: | -------: |
| 100  |     1.351 |    3.862 |
| 200  |     1.327 |    3.770 |
| 300  |     1.353 |    3.869 |
| 400  |     1.355 |    3.875 |
| 500  |     1.294 |    3.646 |
| 600  |     1.351 |    3.863 |

Evaluation used 26 GB of memory.
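The two columns are redundant: perplexity is just the exponential of the cross-entropy loss, which the reported values confirm to within rounding:

```python
import math

test_loss = [1.351, 1.327, 1.353, 1.355, 1.294, 1.351]
test_ppl = [3.862, 3.770, 3.869, 3.875, 3.646, 3.863]

for loss, ppl in zip(test_loss, test_ppl):
    # ppl = exp(loss), up to the rounding printed by lora.py
    assert abs(math.exp(loss) - ppl) < 0.01
```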

## Fuse

```bash
python fuse.py --model mistralai/Mistral-7B-v0.1 \
       --adapter-file adapters.npz \
       --save-path lora_fused_model
```
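Conceptually, what fuse.py does per linear layer is fold the trained low-rank update into the frozen base weight, so inference needs no separate adapter file. A toy sketch of that idea (tiny hand-written matrices; real layers are 4096-wide, and mlx may apply an extra scaling factor):

```python
def matmul(X, Y):
    """Naive matrix multiply for small nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def matadd(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]      # frozen base weight (d_out=2, d_in=3)
A = [[0.5, -1.0, 0.0]]     # lora_a (rank r=1, d_in=3), trained
B = [[2.0], [4.0]]         # lora_b (d_out=2, r=1), trained

W_fused = matadd(W, matmul(B, A))   # fuse: W + B @ A

x = [[1.0], [2.0], [3.0]]  # a column input
with_adapter = matadd(matmul(W, x), matmul(matmul(B, A), x))
fused_only = matmul(W_fused, x)
# the fused layer reproduces base + adapter exactly
assert fused_only == with_adapter
```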

## Generation

### What is Wang Junjian's name?

```bash
python -m mlx_lm.generate --model lora_fused_model \
       --max-tokens 50 \
       --prompt "table: students
columns: Name, Age, School, Grade, Height, Weight
Q: What is Wang Junjian's name?
A: "
```
```
SELECT Name FROM students WHERE Name = 'Wang Junjian'
```

### How old is Wang Junjian?

```bash
python -m mlx_lm.generate --model lora_fused_model \
       --max-tokens 50 \
       --prompt "table: students
columns: Name, Age, School, Grade, Height, Weight
Q: How old is Wang Junjian?
A: "
```
```
SELECT Age FROM Students WHERE Name = 'Wang Junjian'
```

### Which school did Wang Junjian come from?

```bash
python -m mlx_lm.generate --model lora_fused_model \
       --max-tokens 50 \
       --prompt "table: students
columns: Name, Age, School, Grade, Height, Weight
Q: Which school did Wang Junjian come from?
A: "
```
```
SELECT School FROM Students WHERE Name = 'Wang Junjian'
```

### Query Wang Junjian's name, age, and school information.

```bash
python -m mlx_lm.generate --model lora_fused_model \
       --max-tokens 50 \
       --prompt "table: students
columns: Name, Age, School, Grade, Height, Weight
Q: Query Wang Junjian’s name, age, and school information.
A: "
```
```
SELECT Name, Age, School FROM Students WHERE Name = 'Wang Junjian'
```

### Query all information about Wang Junjian.

```bash
python -m mlx_lm.generate --model lora_fused_model \
       --max-tokens 50 \
       --prompt "table: students
columns: Name, Age, School, Grade, Height, Weight
Q: Query all information about Wang Junjian.
A: "
```
```
SELECT Name FROM students WHERE Name = 'Wang Junjian'
```

The query should have selected all columns; the training data was probably insufficient.

### Count how many students there are in ninth grade.

```bash
python -m mlx_lm.generate --model lora_fused_model \
       --max-tokens 50 \
       --prompt "table: students
columns: Name, Age, School, Grade, Height, Weight
Q: Count how many students there are in ninth grade.
A: "
```
```
SELECT COUNT Name FROM Students WHERE Grade = '9th'
```

### Count how many students there are in ninth grade (the value for ninth grade is 9).

```bash
python -m mlx_lm.generate --model lora_fused_model \
       --max-tokens 50 \
       --prompt "table: students
columns: Name, Age, School, Grade, Height, Weight
The value for ninth grade is 9.
Q: Count how many students there are in ninth grade.
A: "
```

```bash
python -m mlx_lm.generate --model lora_fused_model \
       --max-tokens 50 \
       --prompt "table: students
columns: Name, Age, School, Grade, Height, Weight
Q: Count how many students there are in ninth grade. (The value for ninth grade is 9.)
A: "
```

```
SELECT COUNT Name FROM students WHERE Grade = 9
```

Extra hint information is easy to add, and its placement in the prompt does not matter much.

## Uploading the model

```bash
python -m mlx_lm.convert \
       --mlx-path lora_fused_model/ \
       --quantize \
       --upload-repo mlx-community/Mistral-7B-v0.1-LoRA-Text2SQL
```

## References

- [MLX Community](https://huggingface.co/mlx-community)
- [Fine-Tuning with LoRA or QLoRA](https://github.com/ml-explore/mlx-examples/tree/main/lora)
- [Generate Text with LLMs and MLX](https://github.com/ml-explore/mlx-examples/tree/main/llms)
- [Awesome Text2SQL](https://github.com/eosphoros-ai/Awesome-Text2SQL)
- [Awesome Text2SQL (Chinese)](https://github.com/eosphoros-ai/Awesome-Text2SQL/blob/main/README.zh.md)
- [Mistral AI](https://huggingface.co/mistralai)
- [A Beginner’s Guide to Fine-Tuning Mistral 7B Instruct Model](https://adithyask.medium.com/a-beginners-guide-to-fine-tuning-mistral-7b-instruct-model-0f39647b20fe)
- [Mistral Instruct 7B Finetuning on MedMCQA Dataset](https://saankhya.medium.com/mistral-instruct-7b-finetuning-on-medmcqa-dataset-6ec2532b1ff1)
- [Fine-tuning Mistral on your own data](https://github.com/brevdev/notebooks/blob/main/mistral-finetune-own-data.ipynb)
- [mlx-examples llms Mistral](https://github.com/ml-explore/mlx-examples/blob/main/llms/mistral/README.md)
config.json ADDED
{
  "architectures": [
    "MistralForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-05,
  "rope_theta": 10000.0,
  "sliding_window": 4096,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.34.0.dev0",
  "use_cache": true,
  "vocab_size": 32000
}
special_tokens_map.json ADDED
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
The diff for this file is too large to render.
tokenizer.model ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:dadfd56d766715c61d2ef780a525ab43b8e6da4de6865bda3d95fdef5e134055
size 493443
tokenizer_config.json ADDED
{
  "add_bos_token": true,
  "add_eos_token": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [],
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "</s>",
  "legacy": true,
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": null,
  "sp_model_kwargs": {},
  "spaces_between_special_tokens": false,
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": "<unk>",
  "use_default_system_prompt": false
}
weights.00.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:4c03db3218a7f5af63da4226ecc751b87451068ba8421bd3ccb28f6ee87860e2
size 14483498189