Mindigenous committed
Commit · 53f0cc2
Initial full project backup with Git LFS
This view is limited to 50 files because it contains too many changes. See raw diff.
- .gitattributes +10 -0
- .gitignore +29 -0
- CONTEXT_SUMMARY.md +38 -0
- README_COMPONENT_1_SETUP.md +83 -0
- README_COMPONENT_3_DATASET_PIPELINE.md +46 -0
- README_COMPONENT_4_MODEL_ARCHITECTURE.md +28 -0
- README_COMPONENT_5_TRAINING_PIPELINE.md +42 -0
- README_COMPONENT_8_CHAT_INTERFACE.md +20 -0
- README_FINAL_PROJECT.md +126 -0
- artifacts/evaluation/component6_eval_results.json +3 -0
- artifacts/evaluation/component7_inference_results.json +3 -0
- artifacts/export/component10_benchmark_report.json +3 -0
- artifacts/model/component4_model_summary.json +3 -0
- artifacts/tokenizer/code_tokenizer_v1/tokenizer.json +3 -0
- artifacts/tokenizer/code_tokenizer_v1/tokenizer_config.json +3 -0
- backup_step1000.tar.gz +3 -0
- backup_step2000.tar.gz +3 -0
- backup_step3000.tar.gz +3 -0
- checkpoints/component5_420m/latest.pt +3 -0
- checkpoints/component5_420m/step_3000.pt +3 -0
- checkpoints/component5_420m/step_3200.pt +3 -0
- config.py +45 -0
- configs/component10_export_config.yaml +21 -0
- configs/component3_dataset_pipeline.yaml +38 -0
- configs/component3_incremental_js.yaml +27 -0
- configs/component3_reprocess_from_clean.yaml +19 -0
- configs/component4_model_config.yaml +18 -0
- configs/component5_training_config.verify.yaml +32 -0
- configs/component5_training_config.yaml +37 -0
- configs/component6_evaluation_config.yaml +21 -0
- configs/component7_inference_config.yaml +20 -0
- configs/component8_chat_config.yaml +30 -0
- configs/component9_lora_config.verify.yaml +32 -0
- configs/component9_lora_config.yaml +31 -0
- data/cache/raw/code_search_net_python/dataset_dict.json +3 -0
- data/cache/raw/code_search_net_python/test/data-00000-of-00001.arrow +3 -0
- data/cache/raw/code_search_net_python/test/dataset_info.json +3 -0
- data/cache/raw/code_search_net_python/test/state.json +3 -0
- data/cache/raw/code_search_net_python/train/data-00000-of-00004.arrow +3 -0
- data/cache/raw/code_search_net_python/train/data-00001-of-00004.arrow +3 -0
- data/cache/raw/code_search_net_python/train/data-00002-of-00004.arrow +3 -0
- data/cache/raw/code_search_net_python/train/data-00003-of-00004.arrow +3 -0
- data/cache/raw/code_search_net_python/train/dataset_info.json +3 -0
- data/cache/raw/code_search_net_python/train/state.json +3 -0
- data/cache/raw/code_search_net_python/validation/data-00000-of-00001.arrow +3 -0
- data/cache/raw/code_search_net_python/validation/dataset_info.json +3 -0
- data/cache/raw/code_search_net_python/validation/state.json +3 -0
- data/cache/raw/mbpp/dataset_dict.json +3 -0
- data/cache/raw/mbpp/prompt/data-00000-of-00001.arrow +3 -0
- data/cache/raw/mbpp/prompt/dataset_info.json +3 -0
.gitattributes
ADDED
@@ -0,0 +1,10 @@
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.tar.gz filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+output/checkpoints/* filter=lfs diff=lfs merge=lfs -text
+checkpoints/** filter=lfs diff=lfs merge=lfs -text
+models/** filter=lfs diff=lfs merge=lfs -text
+data/** filter=lfs diff=lfs merge=lfs -text
+artifacts/** filter=lfs diff=lfs merge=lfs -text
+logs/** filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
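As a quick sanity check, the LFS rules above can be approximated in plain Python. Note that `fnmatch` is not git's wildmatch (its `*` also matches `/`, which loosely stands in for the `dir/**` patterns), so this is a sketch, not a faithful reimplementation of `.gitattributes` matching.

```python
from fnmatch import fnmatch

# Approximation of the .gitattributes LFS rules above. fnmatch's '*' also
# matches '/', which loosely emulates the 'dir/**' patterns. Sketch only.
LFS_PATTERNS = [
    "*.zip", "*.tar.gz", "*.safetensors", "*.pt",
    "output/checkpoints/*", "checkpoints/*", "models/*",
    "data/*", "artifacts/*", "logs/*",
]

def is_lfs_tracked(path: str) -> bool:
    name = path.rsplit("/", 1)[-1]
    return any(fnmatch(path, pat) or fnmatch(name, pat) for pat in LFS_PATTERNS)

print(is_lfs_tracked("backup_step1000.tar.gz"))                 # True
print(is_lfs_tracked("checkpoints/component5_420m/latest.pt"))  # True
print(is_lfs_tracked("config.py"))                              # False
```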
.gitignore
ADDED
@@ -0,0 +1,29 @@
+# Ignore Python cache and compiled files.
+__pycache__/
+*.pyc
+*.pyo
+*.pyd
+
+# Ignore virtual environment.
+.venv/
+
+# Ignore logs and temporary outputs.
+logs/
+artifacts/
+*.log
+
+# Ignore model weights and checkpoints by default.
+checkpoints/
+models/base/
+models/lora/
+models/quantized/
+
+# Ignore data files by default.
+data/raw/
+data/interim/
+data/processed/
+data/external/
+
+# Ignore notebook checkpoints.
+.ipynb_checkpoints/
+
CONTEXT_SUMMARY.md
ADDED
@@ -0,0 +1,38 @@
+# Project Context Summary
+
+This file captures the current state of work from the active collaboration session.
+
+## Environment
+- Original project path: `D:\Desktop 31st Jan 2026\MIND-AI-MODEL`
+- Target copy path requested: `C:\AI 2`
+- OS: Windows
+- GPU: NVIDIA RTX 4060 Laptop (8GB VRAM)
+
+## Completed Components
+1. Component 1 (Project setup): completed and verified.
+2. Component 2 (Custom tokenizer): completed and verified.
+3. Component 3 (Dataset pipeline): completed and verified.
+4. Component 3 final-step reprocess fix: completed and verified, with JS rebalance.
+5. Component 4 (420M transformer architecture): completed and verified.
+
+## Current Dataset Stats
+- Total processed records: 139,531
+- Python: 115,572
+- JavaScript: 23,959
+
+## Current Model Architecture
+- Preset: `medium_420m`
+- Parameters: 423,934,848
+- Verified forward pass on GPU successful.
+
+## Key Files
+- `configs/component4_model_config.yaml`
+- `src/model_architecture/code_transformer.py`
+- `scripts/build_component4_model.py`
+- `scripts/verify_component4_model.py`
+- `data/processed/train_tokenized.jsonl`
+- `data/processed/pipeline_stats.json`
+
+## Next Planned Component
+- Component 5: Training pipeline with FP16, gradient checkpointing, gradient accumulation, checkpointing every 100 steps, resume support, early stopping, and live training metrics.
+
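The per-language counts reported in CONTEXT_SUMMARY.md are internally consistent with the stated total; a one-line arithmetic check:

```python
# Verify that the per-language record counts sum to the reported total.
python_records = 115_572
javascript_records = 23_959
total_records = 139_531
assert python_records + javascript_records == total_records
print("language counts sum to total:", total_records)
```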
README_COMPONENT_1_SETUP.md
ADDED
@@ -0,0 +1,83 @@
+# Component 1: Project Setup (Windows + RTX 4060 8GB)
+
+## What This Component Does
+- Creates a clean folder structure for the full coding-assistant project.
+- Sets up a Python virtual environment.
+- Installs all core dependencies needed across Components 2-10.
+- Verifies that Python, PyTorch, CUDA visibility, and key libraries work.
+
+## Folder Structure Created
+- `data/raw` -> raw datasets you will provide later
+- `data/interim` -> temporary cleaned data
+- `data/processed` -> training-ready tokenized data
+- `data/external` -> any third-party resources
+- `src/tokenizer` -> Component 2 code tokenizer
+- `src/dataset_pipeline` -> Component 3 preprocessing pipeline
+- `src/model_architecture` -> Component 4 transformer code
+- `src/training_pipeline` -> Component 5 training loop
+- `src/evaluation_system` -> Component 6 evaluation code
+- `src/inference_engine` -> Component 7 inference code
+- `src/chat_interface` -> Component 8 Gradio interface
+- `src/finetuning_system` -> Component 9 LoRA fine-tuning
+- `src/export_optimization` -> Component 10 quantization/export tools
+- `configs` -> config files for all components
+- `scripts` -> setup, verification, and utility scripts
+- `tests` -> quick checks for each component
+- `checkpoints` -> model checkpoints saved during training
+- `models/base` -> base trained model files
+- `models/lora` -> LoRA adapters
+- `models/quantized` -> optimized quantized models
+- `artifacts` -> generated reports, metrics, and outputs
+- `logs` -> training and runtime logs
+
+## Exact Commands To Run (in this order)
+Run from:
+`D:\Desktop 31st Jan 2026\MIND-AI-MODEL`
+
+0. Install Python 3.11 (required for package compatibility):
+   - Download page: https://www.python.org/downloads/release/python-3119/
+   - Windows installer file: `python-3.11.9-amd64.exe`
+   - During install, check: `Add python.exe to PATH`
+
+1. Allow script execution for this terminal only:
+```powershell
+Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
+```
+
+2. If you already attempted setup once, remove the old virtual environment first:
+```powershell
+if (Test-Path .\.venv) { Remove-Item -Recurse -Force .\.venv }
+```
+
+3. Create folders, virtual env, and install dependencies:
+```powershell
+.\scripts\setup_windows_environment.ps1
+```
+
+4. Activate the virtual environment:
+```powershell
+.\.venv\Scripts\Activate.ps1
+```
+
+5. Verify setup:
+```powershell
+python .\scripts\verify_component1_setup.py
+```
+
+## Expected Verification Result
+- Prints Python version
+- Prints PyTorch version
+- Shows whether CUDA is available
+- Shows GPU name if available
+- Confirms critical libraries import correctly
+
+Note:
+- `codebleu` is excluded from the base install on Windows due to a `tree-sitter` dependency conflict on Python 3.11.
+  - Component 6 will use Windows-stable evaluation metrics and add code-quality checks without breaking setup.
+- `bitsandbytes` is optional on native Windows because some CUDA/driver combinations fail to load its DLL.
+  - Base setup and all early components continue without it.
+- For Component 5, we will:
+  - try `bitsandbytes` if available, and
+  - automatically fall back to a stable optimizer on your machine if it is not.
+
+If verification fails, copy the full terminal output and share it with me.
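The actual `scripts/verify_component1_setup.py` is not shown in this diff. As an illustration only, a dependency probe in the same spirit could look like the following (the function name and module list are hypothetical; the real script checks PyTorch and CUDA as well):

```python
import importlib.util
import sys

# Hypothetical sketch in the spirit of scripts/verify_component1_setup.py;
# the real script's checks may differ.
def check_environment(required):
    print("Python", sys.version.split()[0])
    # find_spec returns None for modules that cannot be imported.
    return [name for name in required
            if importlib.util.find_spec(name) is None]

# Stdlib modules are used here so the sketch runs anywhere.
print("missing:", check_environment(required=("json", "pathlib")))
```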
README_COMPONENT_3_DATASET_PIPELINE.md
ADDED
@@ -0,0 +1,46 @@
+# Component 3: Dataset Pipeline
+
+## What This Component Does (Simple English)
+- Downloads the 3 datasets directly from Hugging Face (no manual download files).
+- Reads them in streaming mode so your RAM usage stays low.
+- Cleans prompt/code text.
+- Removes low-quality and likely auto-generated data.
+- Removes duplicate prompt+code pairs using a disk-backed SQLite index.
+- Detects language (Python or JavaScript) when unclear.
+- Tokenizes all cleaned records using the Component 2 tokenizer.
+- Saves training-ready tokenized JSONL output.
+
+## Files Created By This Component
+- `configs/component3_dataset_pipeline.yaml`
+- `src/dataset_pipeline/hf_dataset_pipeline.py`
+- `scripts/run_component3_dataset_pipeline.py`
+- `scripts/verify_component3_dataset_pipeline.py`
+
+## Required Before Running
+- Component 2 tokenizer must exist at:
+  - `artifacts/tokenizer/code_tokenizer_v1/tokenizer.json`
+  - `artifacts/tokenizer/code_tokenizer_v1/tokenizer_config.json`
+
+## Quick Verification Run (small test)
+Run from project root:
+```powershell
+.\.venv\Scripts\Activate.ps1
+python .\scripts\verify_component3_dataset_pipeline.py
+```
+
+This uses `200` records per dataset for a smoke test.
+
+## Full Pipeline Run
+```powershell
+.\.venv\Scripts\Activate.ps1
+python .\scripts\run_component3_dataset_pipeline.py --config .\configs\component3_dataset_pipeline.yaml
+```
+
+## Output Files
+- Clean merged dataset:
+  - `data/interim/combined_clean.jsonl`
+- Tokenized training dataset:
+  - `data/processed/train_tokenized.jsonl`
+- Stats summary:
+  - `data/processed/pipeline_stats.json`
+
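The disk-backed SQLite dedupe step described in this README can be sketched as follows. The table name and hashing scheme are assumptions for illustration; the actual logic lives in `src/dataset_pipeline/hf_dataset_pipeline.py` and may differ.

```python
import hashlib
import sqlite3

# Sketch of a disk-backed dedupe index: a PRIMARY KEY on the hash column
# makes the second insert of the same prompt+code pair fail, which is how
# duplicates are detected without keeping all hashes in RAM.
def make_dedupe_index(db_path=":memory:"):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (h TEXT PRIMARY KEY)")
    return conn

def is_new_pair(conn, prompt: str, code: str) -> bool:
    digest = hashlib.sha256((prompt + "\x00" + code).encode("utf-8")).hexdigest()
    try:
        conn.execute("INSERT INTO seen (h) VALUES (?)", (digest,))
        conn.commit()
        return True
    except sqlite3.IntegrityError:  # hash already present -> duplicate
        return False

conn = make_dedupe_index()
print(is_new_pair(conn, "reverse a list", "xs[::-1]"))  # True
print(is_new_pair(conn, "reverse a list", "xs[::-1]"))  # False
```

Using a file path instead of `":memory:"` (as `dedupe_db_path` does in the config) keeps the index on disk across runs.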
README_COMPONENT_4_MODEL_ARCHITECTURE.md
ADDED
@@ -0,0 +1,28 @@
+# Component 4: Model Architecture (420M Starter)
+
+## What This Component Builds
+- A decoder-only transformer language model for code generation.
+- Configurable size through YAML config.
+- Presets for small, medium (420M target), and large.
+- Attention + rotary positional encoding + feed-forward blocks.
+
+## Main Files
+- `src/model_architecture/code_transformer.py`
+- `configs/component4_model_config.yaml`
+- `scripts/build_component4_model.py`
+- `scripts/verify_component4_model.py`
+
+## Commands (run from project root)
+```powershell
+.\.venv\Scripts\Activate.ps1
+python .\scripts\build_component4_model.py --config .\configs\component4_model_config.yaml
+python .\scripts\verify_component4_model.py --config .\configs\component4_model_config.yaml --batch_size 1 --seq_len 256
+```
+
+## What Success Looks Like
+- Build script prints a parameter count near the 420M target.
+- Verify script prints:
+  - VRAM usage at multiple stages
+  - output tensor shape
+  - `Component 4 verification passed.`
+
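For intuition about how a parameter count like the 420M target arises, here is a back-of-envelope estimate for a decoder-only transformer. The dimensions below are illustrative guesses, not the values in `configs/component4_model_config.yaml`, and layer norms, biases, and the LM head are ignored.

```python
# Rough decoder-only parameter count: embeddings plus, per layer, the four
# attention projections and the two feed-forward projections.
def estimate_params(vocab_size, d_model, n_layers, d_ff):
    embedding = vocab_size * d_model
    attention = 4 * d_model * d_model   # Q, K, V, and output projections
    feed_forward = 2 * d_model * d_ff   # up- and down-projections
    return embedding + n_layers * (attention + feed_forward)

print(f"{estimate_params(32_000, 1024, 24, 4096):,}")  # 334,757,888
```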
README_COMPONENT_5_TRAINING_PIPELINE.md
ADDED
@@ -0,0 +1,42 @@
+# Component 5: Training Pipeline
+
+## What This Component Does
+- Trains the 420M transformer on tokenized data.
+- Uses FP16 mixed precision to reduce VRAM.
+- Uses gradient checkpointing to save memory.
+- Uses gradient accumulation for a larger effective batch size.
+- Attempts the Adam8bit optimizer when available, otherwise safely falls back.
+- Saves a checkpoint every 100 steps by default.
+- Supports resuming from the latest checkpoint.
+- Evaluates periodically and supports early stopping.
+- Shows live loss, LR, ETA, and VRAM.
+
+## Main Files
+- `configs/component5_training_config.yaml`
+- `src/training_pipeline/tokenized_dataset.py`
+- `scripts/train_component5.py`
+- `scripts/verify_component5_training_pipeline.py`
+
+## Commands
+```powershell
+.\.venv\Scripts\Activate.ps1
+python .\scripts\verify_component5_training_pipeline.py
+python .\scripts\train_component5.py --config .\configs\component5_training_config.yaml
+```
+
+## VRAM and Runtime (RTX 4060 8GB)
+- Expected VRAM during training with the default config: about 5.8 to 6.9 GB.
+- Safety stop is enabled at 7.0 GB.
+- Approx training time for 1 epoch equivalent: ~30 to 65 hours.
+
+## Common Failures and Fixes
+1. OOM or VRAM threshold hit:
+   - Reduce `max_seq_len` (e.g., 512 -> 384).
+   - Increase `grad_accum_steps`.
+2. Training too slow:
+   - Lower `max_seq_len` for the first run.
+   - Keep `micro_batch_size=1` and adjust accumulation.
+3. Resume issues:
+   - Ensure `checkpoints/component5_420m/latest.pt` exists.
+4. Validation not improving:
+   - Lower LR and increase warmup.
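The gradient-accumulation behavior described above follows a standard pattern: scale each micro-batch loss by `1/grad_accum_steps` and step the optimizer once per `grad_accum_steps` micro-batches, so the summed gradient matches one large batch. A framework-free sketch (the real loop in `scripts/train_component5.py` uses PyTorch and may differ):

```python
# Framework-free sketch of gradient accumulation. The comments mark where
# loss.backward() and optimizer.step() would sit in a real training loop.
def train_steps(micro_losses, grad_accum_steps):
    accumulated = 0.0
    optimizer_steps = 0
    per_step_loss = []
    for i, loss in enumerate(micro_losses, start=1):
        accumulated += loss / grad_accum_steps   # scaled loss; backward() here
        if i % grad_accum_steps == 0:
            per_step_loss.append(accumulated)    # optimizer.step(); zero_grad()
            accumulated = 0.0
            optimizer_steps += 1
    return optimizer_steps, per_step_loss

steps, losses = train_steps([2.0, 4.0, 6.0, 8.0], grad_accum_steps=2)
print(steps, losses)  # 2 [3.0, 7.0]
```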
README_COMPONENT_8_CHAT_INTERFACE.md
ADDED
@@ -0,0 +1,20 @@
+# Component 8: Local Chat Interface
+
+## What it gives you
+- Browser chat UI for your local coding model.
+- Uses the Component 7 inference engine automatically.
+- Dark theme, prompt box, code cards, copy button per response.
+- Syntax highlighting for Python and JavaScript.
+- Shows generation time and generated token count.
+- Keeps conversation history in the current session.
+- Clear button to reset the conversation.
+
+## Launch (single command)
+```powershell
+python .\scripts\launch_component8_chat.py --config .\configs\component8_chat_config.yaml
+```
+
+## URL to open
+- `http://127.0.0.1:7860`
+
+No internet is needed for local usage.
README_FINAL_PROJECT.md
ADDED
@@ -0,0 +1,126 @@
+# Final Project README - MINDI 1.0 420M (Windows, RTX 4060 8GB)
+
+## What This Project Is
+This is a fully local coding-assistant model system built step-by-step from scratch.
+It supports:
+- custom tokenizer for code
+- dataset cleaning + tokenization pipeline
+- 420M transformer model
+- memory-optimized training
+- evaluation + inference improvements
+- local chat UI
+- LoRA fine-tuning
+- INT8 export + portable package
+
+Everything runs locally on your machine without internet after setup.
+
+---
+
+## What You Built (High Level)
+1. **Project setup** with reproducible environment and verification scripts.
+2. **Custom code tokenizer** (Python + JavaScript aware).
+3. **Dataset pipeline** with cleaning, dedupe, and tokenization.
+4. **420M transformer architecture** (modular config).
+5. **Training pipeline** (FP16, checkpointing, accumulation, resume, early stopping).
+6. **Evaluation system** (val metrics + generation checks).
+7. **Inference engine** (greedy mode, stop rules, syntax-aware retry).
+8. **Local chat interface** with history, copy button, timing, and mode selector.
+9. **LoRA fine-tuning pipeline** for your own examples.
+10. **Export/quantization/packaging** with benchmark report and portable launcher.
+
+---
+
+## Most Important File Locations
+
+### Core model and data
+- Base checkpoint: `checkpoints/component5_420m/step_3200.pt`
+- Tokenized training data: `data/processed/train_tokenized.jsonl`
+- Tokenizer: `artifacts/tokenizer/code_tokenizer_v1/`
+
+### LoRA
+- Best LoRA adapter: `models/lora/custom_lora_v1/best.pt`
+- LoRA metadata: `models/lora/custom_lora_v1/adapter_meta.json`
+
+### Quantized model
+- INT8 model: `models/quantized/model_step3200_int8_state.pt`
+- Benchmark report: `artifacts/export/component10_benchmark_report.json`
+
+### Chat interface
+- Launcher: `scripts/launch_component8_chat.py`
+- Chat config: `configs/component8_chat_config.yaml`
+
+### Portable package
+- Folder: `release/MINDI_1.0_420M`
+- Double-click launcher: `release/MINDI_1.0_420M/Start_MINDI.bat`
+
+---
+
+## Launch the Main Chat UI
+From project root (`C:\AI 2`):
+
+```powershell
+.\.venv\Scripts\Activate.ps1
+python .\scripts\launch_component8_chat.py --config .\configs\component8_chat_config.yaml
+```
+
+Open in browser:
+- `http://127.0.0.1:7860`
+
+### Live model selector in UI
+You can switch without restart:
+- `base`
+- `lora`
+- `int8`
+
+Status box shows:
+- active mode
+- mode load time
+- live VRAM usage
+
+---
+
+## How to Add More Training Data (Future Improvement)
+
+### A) Add more base-training pairs (full training path)
+1. Put new JSONL/JSON files in `data/raw/`.
+2. Run dataset processing scripts (Component 3 path).
+3. Continue/refresh base training with Component 5.
+
+### B) Add targeted improvements quickly (LoRA recommended)
+1. Edit `data/raw/custom_finetune_pairs.jsonl` with your new prompt/code pairs.
+   - Required fields per row: `prompt`, `code`
+   - Optional: `language` (`python` or `javascript`)
+2. Run LoRA fine-tuning:
+
+```powershell
+python .\scripts\run_component9_lora_finetune.py --config .\configs\component9_lora_config.yaml
+```
+
+3. Use the updated adapter in chat by selecting `lora` mode.
+
+---
+
+## Recommended Next Habit
+When quality is weak on specific tasks:
+1. Add 20-200 clean examples of exactly that task style to `custom_finetune_pairs.jsonl`.
+2. Re-run LoRA fine-tuning.
+3. Test in chat `lora` mode.
+4. Repeat in small cycles.
+
+This gives faster improvement than retraining the full base model each time.
+
+---
+
+## One-File Health Check Commands
+
+```powershell
+python .\scripts\verify_component1_setup.py
+python .\scripts\verify_component4_model.py --config .\configs\component4_model_config.yaml --batch_size 1 --seq_len 256
+python .\scripts\verify_component9_lora.py
+```
+
+---
+
+## Current Status
+Project is complete across Components 1-10 and verified on your hardware.
+
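Per the field rules in README_FINAL_PROJECT.md, rows added to `data/raw/custom_finetune_pairs.jsonl` must carry `prompt` and `code`, with `language` optional. A small validator sketch (a hypothetical helper for illustration, not a script in this repo):

```python
import json

# Validate one JSONL row against the custom_finetune_pairs.jsonl field rules:
# required non-empty 'prompt' and 'code'; optional 'language' restricted to
# python/javascript (defaulting to python when absent).
def valid_finetune_row(line: str) -> bool:
    try:
        row = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not (row.get("prompt") and row.get("code")):
        return False
    return row.get("language", "python") in {"python", "javascript"}

print(valid_finetune_row('{"prompt": "add ints", "code": "def add(a, b): return a + b"}'))  # True
print(valid_finetune_row('{"prompt": "missing code"}'))  # False
```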
artifacts/evaluation/component6_eval_results.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3da6ee747d77b0c8cdca5d4fedb750549a9e5e7c42592e5e32e6103ff5617d8f
+size 2379
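The JSON, archive, and checkpoint files in this commit are stored as Git LFS pointer files rather than real content: a tiny key/value text file recording the version, the SHA-256 object id, and the size in bytes. A minimal parser for that format (illustrative; real tooling should use `git lfs` itself):

```python
# Parse the key/value lines of a Git LFS pointer file into a dict.
def parse_lfs_pointer(text: str) -> dict:
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:3da6ee747d77b0c8cdca5d4fedb750549a9e5e7c42592e5e32e6103ff5617d8f\n"
    "size 2379\n"
)
info = parse_lfs_pointer(pointer)
print(info["size"])  # 2379
```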
artifacts/evaluation/component7_inference_results.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ce08bfd6918f619fdcb1ef17ec1db79c2d32578d12a02aaaae7b7092f83384ae
+size 5863
artifacts/export/component10_benchmark_report.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d827ec736fbdc4ea2ed5bc196223f1bf02d11a9260acd451edd51f8f39bcda75
+size 545
artifacts/model/component4_model_summary.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ab5ebc8aa081f82bbcaee2c945b207b4db3251f63b845ed86055f4e5b7204010
+size 328
artifacts/tokenizer/code_tokenizer_v1/tokenizer.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1fe04cc37ac778637cb2cc02a6096412e5d8cada3e4ef3e4a7f2d141fccab8a0
+size 11475
artifacts/tokenizer/code_tokenizer_v1/tokenizer_config.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:fb0b7af679bac1c29fe7ac9f86c48f1fed5584ba72c9ef2c338f60b63e07bb46
+size 302
backup_step1000.tar.gz
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ebe005c43dd59c9c49ad153d41af1bdaaad47c2a21ae231a4c5e90c8005560af
+size 337623475
backup_step2000.tar.gz
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:861329fb551b4c6406e92e06cfa1faae592f0fe0d0ce713189a57c62b33b0969
+size 337571785
backup_step3000.tar.gz
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:238c2859ebf4efc0195456a898d2fb8bce0397e39fdf59e9f940963232d628a8
+size 337762553
checkpoints/component5_420m/latest.pt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:32d26a7dd9e6e294c6657f6fb3a4d947cf52eb8e1c0b11032722fa50d15c4a21
+size 5087449970
checkpoints/component5_420m/step_3000.pt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e11bded40789574ef316636c02c2fd1e8cd54c13441d8cd6a28980f2209ffaa9
+size 5087455158
checkpoints/component5_420m/step_3200.pt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:71d2ea9401f3b08b2528dbb8f993949794d0adb57642d0f4752d74da0e445238
+size 5087455158
config.py
ADDED
@@ -0,0 +1,45 @@
+from dataclasses import dataclass
+from pathlib import Path
+
+
+@dataclass(frozen=True)
+class Paths:
+    project_root: Path = Path(".")
+    model_dir: Path = Path("./model")
+    data_dir: Path = Path("./data")
+    output_dir: Path = Path("./output")
+    logs_dir: Path = Path("./logs")
+
+    train_jsonl: Path = Path("./data/train.jsonl")
+    dataset_cache_dir: Path = Path("./data/cache")
+    raw_dataset_dir: Path = Path("./data/cache/raw")
+    checkpoint_dir: Path = Path("./output/checkpoints")
+    lora_output_dir: Path = Path("./output/lora_adapters")
+    tokenizer_output_dir: Path = Path("./output/tokenizer")
+
+
+@dataclass(frozen=True)
+class DataConfig:
+    max_total_samples: int = 200000
+    max_humaneval_samples: int = 20000
+    max_mbpp_samples: int = 50000
+    max_codesearchnet_samples: int = 180000
+    min_output_chars: int = 40
+
+
+@dataclass(frozen=True)
+class TrainingConfig:
+    num_train_epochs: int = 5
+    per_device_train_batch_size: int = 1
+    gradient_accumulation_steps: int = 8
+    learning_rate: float = 1e-5
+    max_length: int = 1024
+    save_steps: int = 250
+    logging_steps: int = 20
+    eval_max_new_tokens: int = 220
+    resume_training: bool = True
+
+
+PATHS = Paths()
+DATA_CONFIG = DataConfig()
+TRAINING_CONFIG = TrainingConfig()
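`config.py` uses `@dataclass(frozen=True)` so the module-level config singletons cannot be mutated after import; accidental assignment raises `FrozenInstanceError`. A minimal self-contained demonstration of the same pattern (using a demo class rather than importing `config.py`):

```python
from dataclasses import dataclass, FrozenInstanceError
from pathlib import Path

# Same pattern as config.py: a frozen dataclass makes config values
# immutable after construction.
@dataclass(frozen=True)
class DemoPaths:
    data_dir: Path = Path("./data")

DEMO = DemoPaths()
print(DEMO.data_dir)  # data
try:
    DEMO.data_dir = Path("./elsewhere")
except FrozenInstanceError:
    print("assignment rejected")
```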
configs/component10_export_config.yaml
ADDED
@@ -0,0 +1,21 @@
+# Component 10 export and optimization config
+
+model:
+  model_config_path: configs/component4_model_config.yaml
+  source_checkpoint_path: checkpoints/component5_420m/step_3200.pt
+  tokenizer_dir: artifacts/tokenizer/code_tokenizer_v1
+
+quantization:
+  quantized_output_path: models/quantized/model_step3200_int8_state.pt
+
+benchmark:
+  prompt: Write a Python function to compute factorial of n.
+  max_new_tokens: 120
+
+package:
+  output_dir: release/MINDI_1.0_420M
+  app_port: 7861
+
+outputs:
+  benchmark_report_json: artifacts/export/component10_benchmark_report.json
+
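The INT8 export configured above roughly quarters the on-disk weight size relative to FP32. Back-of-envelope numbers for the 423,934,848-parameter model (pure arithmetic over weight bytes; actual checkpoint files also carry optimizer state and metadata, which is why the `.pt` files above are ~5 GB):

```python
# Weight-only size estimate at common precisions for the 420M model.
params = 423_934_848
for precision, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{precision}: {params * bytes_per_param / 1e9:.2f} GB")
```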
configs/component3_dataset_pipeline.yaml
ADDED
@@ -0,0 +1,38 @@
+# Component 3 config: load, clean, deduplicate, tokenize.
+
+tokenizer_dir: artifacts/tokenizer/code_tokenizer_v1
+interim_output_dir: data/interim
+processed_output_dir: data/processed
+dedupe_db_path: data/interim/dedupe_hashes.sqlite
+
+# Set null for full run.
+# Use a small number like 500 for fast smoke testing.
+max_records_per_dataset: null
+
+min_prompt_chars: 8
+min_code_chars: 16
+max_code_chars: 40000
+progress_every: 1000
+
+datasets:
+  - hf_dataset_id: iamtarun/python_code_instructions_18k_alpaca
+    split: train
+    prompt_field: instruction
+    code_field: output
+    language_field: null
+    default_language: python
+
+  - hf_dataset_id: sahil2801/CodeAlpaca-20k
+    split: train
+    prompt_field: instruction
+    code_field: output
+    language_field: null
+    default_language: python
+
+  - hf_dataset_id: TokenBender/code_instructions_122k_alpaca_style
+    split: train
+    prompt_field: instruction
+    code_field: output
+    language_field: null
+    default_language: python
+
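Each entry under `datasets:` maps a raw record's fields onto the pipeline's prompt/code/language schema: `prompt_field` and `code_field` name the source columns, and `default_language` applies when `language_field` is null. A sketch of how such a mapping might be applied to one record (the function name and shape are illustrative assumptions, not the pipeline's actual API):

```python
# Apply one dataset entry's field mapping to a single raw record.
def extract_record(raw: dict, spec: dict) -> dict:
    language = (raw.get(spec["language_field"])
                if spec.get("language_field") else spec["default_language"])
    return {
        "prompt": raw[spec["prompt_field"]],
        "code": raw[spec["code_field"]],
        "language": language,
    }

spec = {
    "prompt_field": "instruction",
    "code_field": "output",
    "language_field": None,
    "default_language": "python",
}
record = extract_record({"instruction": "sum a list", "output": "sum(xs)"}, spec)
print(record)  # {'prompt': 'sum a list', 'code': 'sum(xs)', 'language': 'python'}
```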
configs/component3_incremental_js.yaml
ADDED
@@ -0,0 +1,27 @@
+# Incremental JS augmentation config.
+# This script appends new JavaScript samples into existing Component 3 outputs.
+
+tokenizer_dir: artifacts/tokenizer/code_tokenizer_v1
+existing_clean_path: data/interim/combined_clean.jsonl
+existing_tokenized_path: data/processed/train_tokenized.jsonl
+existing_stats_path: data/processed/pipeline_stats.json
+dedupe_db_path: data/interim/dedupe_hashes_incremental.sqlite
+
+# Chosen dataset for JS augmentation.
+new_dataset:
+  hf_dataset_id: philschmid/code-alpaca-ruby-python-javascript
+  split: train
+  prompt_field: instruction
+  code_field: output
+  language_field: null
+  default_language: auto
+
+# Hard target requested by user.
+target_new_javascript_examples: 20000
+
+# Quality filters (same idea as Component 3).
+min_prompt_chars: 8
+min_code_chars: 16
+max_code_chars: 40000
+progress_every: 500
+
configs/component3_reprocess_from_clean.yaml
ADDED
@@ -0,0 +1,19 @@
+# Reprocess config: no dataset download, no full pipeline rebuild.
+# It reads existing cleaned data and regenerates tokenized output.
+
+tokenizer_dir: artifacts/tokenizer/code_tokenizer_v1
+input_clean_path: data/interim/combined_clean.jsonl
+output_tokenized_path: data/processed/train_tokenized.jsonl
+output_stats_path: data/processed/pipeline_stats.json
+
+# Safety backups before overwrite.
+backup_existing_tokenized: true
+backup_existing_stats: true
+
+# Existing language labels in clean file may be wrong from earlier runs.
+# true = infer language from prompt+code content only.
+ignore_existing_language_labels: true
+
+# Optional quick test mode.
+# Set null for full reprocess.
+max_records: null
configs/component4_model_config.yaml
ADDED
@@ -0,0 +1,18 @@
+# Component 4 model config.
+# You can switch the preset name or directly edit dimensions below.
+
+preset: medium_420m
+
+model:
+  vocab_size: 50000
+  max_seq_len: 2048
+  d_model: 1152
+  n_layers: 23
+  n_heads: 16
+  d_ff: 4608
+  dropout: 0.1
+  tie_embeddings: true
+  gradient_checkpointing: false
+  init_std: 0.02
+  rms_norm_eps: 0.00001
+
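The preset name `medium_420m` can be sanity-checked against these dimensions. A rough decoder-only parameter estimate (counting attention and MLP weights per layer plus one tied embedding matrix, and ignoring norm and bias terms; `estimate_params` is an illustrative helper, not project code):

```python
def estimate_params(vocab_size: int, d_model: int, n_layers: int, d_ff: int) -> int:
    """Rough decoder-only transformer count: 4*d^2 for attention projections
    and 2*d*d_ff for the MLP per layer, plus tied input/output embeddings."""
    per_layer = 4 * d_model * d_model + 2 * d_model * d_ff
    return n_layers * per_layer + vocab_size * d_model

total = estimate_params(50000, 1152, 23, 4608)
print(f"{total / 1e6:.0f}M")  # ~424M, consistent with the medium_420m preset
```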
configs/component5_training_config.verify.yaml
ADDED
@@ -0,0 +1,32 @@
+data:
+  tokenized_jsonl_path: data/processed/train_tokenized.jsonl
+  val_ratio: 0.02
+  split_seed: 17
+  num_workers: 0
+model:
+  model_config_path: configs/component4_model_config.yaml
+training:
+  output_dir: checkpoints/component5_420m
+  log_every: 1
+  eval_every: 5
+  save_every: 5
+  max_steps: 5
+  micro_batch_size: 1
+  grad_accum_steps: 16
+  max_seq_len: 512
+  learning_rate: 0.0002
+  weight_decay: 0.1
+  betas:
+    - 0.9
+    - 0.95
+  grad_clip_norm: 1.0
+  warmup_steps: 300
+  min_lr_ratio: 0.1
+  use_fp16: true
+  use_gradient_checkpointing: true
+  prefer_8bit_adam: true
+  early_stopping_patience_evals: 20
+  early_stopping_min_delta: 0.0005
+  max_vram_gb: 7.0
+resume:
+  resume_from: none
configs/component5_training_config.yaml
ADDED
@@ -0,0 +1,37 @@
+# Component 5 training config for RTX 4060 8GB.
+
+data:
+  tokenized_jsonl_path: data/processed/train_tokenized.jsonl
+  val_ratio: 0.02
+  split_seed: 17
+  num_workers: 2
+
+model:
+  model_config_path: configs/component4_model_config.yaml
+
+training:
+  output_dir: checkpoints/component5_420m
+  log_every: 10
+  eval_every: 100
+  save_every: 200
+  max_steps: 8000
+  micro_batch_size: 1
+  grad_accum_steps: 16
+  max_seq_len: 448
+  learning_rate: 0.00022
+  weight_decay: 0.1
+  betas: [0.9, 0.95]
+  grad_clip_norm: 1.0
+  warmup_steps: 300
+  min_lr_ratio: 0.1
+  use_fp16: true
+  use_gradient_checkpointing: true
+  prefer_8bit_adam: true
+  early_stopping_patience_evals: 5
+  early_stopping_min_delta: 0.0005
+  max_vram_gb: 7.0
+
+resume:
+  resume_from: latest  # latest | none | explicit checkpoint path
+
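With micro_batch_size 1 and grad_accum_steps 16, the effective batch is 16 sequences per optimizer step. The warmup_steps / min_lr_ratio pair suggests a warmup-then-cosine learning-rate schedule; the sketch below shows what such a schedule would look like for these values (an assumption about the shape, as the actual Component 5 trainer may implement it differently):

```python
import math

def lr_at(step: int, max_steps: int = 8000, lr: float = 2.2e-4,
          warmup: int = 300, min_ratio: float = 0.1) -> float:
    """Linear warmup to `lr`, then cosine decay to `lr * min_ratio`."""
    if step < warmup:
        return lr * step / warmup
    progress = (step - warmup) / (max_steps - warmup)
    return lr * (min_ratio + (1 - min_ratio) * 0.5 * (1 + math.cos(math.pi * progress)))

print(lr_at(300))   # 0.00022  (peak, end of warmup)
print(lr_at(8000))  # 2.2e-05  (floor = learning_rate * min_lr_ratio)
```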
configs/component6_evaluation_config.yaml
ADDED
@@ -0,0 +1,21 @@
+# Component 6 evaluation config.
+
+model:
+  model_config_path: configs/component4_model_config.yaml
+  checkpoint_paths:
+    - checkpoints/component5_420m/step_3200.pt
+
+data:
+  tokenized_jsonl_path: data/processed/train_tokenized.jsonl
+  val_ratio: 0.02
+  split_seed: 17
+
+inference:
+  max_seq_len: 448
+  max_new_tokens: 160
+  temperature: 0.25
+  top_p: 0.85
+
+output:
+  results_json: artifacts/evaluation/component6_eval_results.json
+
configs/component7_inference_config.yaml
ADDED
@@ -0,0 +1,20 @@
+# Component 7 inference config
+
+model:
+  model_config_path: configs/component4_model_config.yaml
+  checkpoint_path: checkpoints/component5_420m/step_3200.pt
+  tokenizer_dir: artifacts/tokenizer/code_tokenizer_v1
+
+inference:
+  language: python
+  max_new_tokens: 180
+  greedy_temperature: 0.0
+  retry2_temperature: 0.25
+  retry2_top_p: 0.85
+  retry3_temperature: 0.35
+  retry3_top_p: 0.90
+  max_retries: 3
+  min_tokens_before_stop_check: 24
+
+output:
+  results_json: artifacts/evaluation/component7_inference_results.json
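The greedy_temperature / retry2 / retry3 keys describe an escalating retry ladder: a greedy first attempt, then progressively warmer sampling on failure. A hypothetical sketch of that control flow (the names `generate_with_retries` and `RETRY_LADDER` are illustrative, not from the project):

```python
# Sampling parameters per attempt, mirroring the config values above.
RETRY_LADDER = [
    {"temperature": 0.0,  "top_p": 1.0},   # attempt 1: greedy
    {"temperature": 0.25, "top_p": 0.85},  # attempt 2
    {"temperature": 0.35, "top_p": 0.90},  # attempt 3
]

def generate_with_retries(generate, prompt, is_valid, max_retries=3):
    """Call generate(prompt, **params) with escalating sampling params
    until is_valid accepts the output; return the last attempt otherwise."""
    out = ""
    for params in RETRY_LADDER[:max_retries]:
        out = generate(prompt, **params)
        if is_valid(out):
            return out
    return out

# Toy check with a fake generator that only "succeeds" above temperature 0.3.
fake = lambda p, temperature, top_p: "ok" if temperature > 0.3 else "bad"
print(generate_with_retries(fake, "prompt", lambda s: s == "ok"))  # ok
```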
configs/component8_chat_config.yaml
ADDED
@@ -0,0 +1,30 @@
+# Component 8 chat interface config.
+
+model:
+  model_config_path: configs/component4_model_config.yaml
+  base_checkpoint_path: checkpoints/component5_420m/step_3200.pt
+  lora_adapter_path: models/lora/custom_lora_v1/best.pt
+  quantized_state_path: models/quantized/model_step3200_int8_state.pt
+  tokenizer_dir: artifacts/tokenizer/code_tokenizer_v1
+
+lora:
+  r: 8
+  alpha: 16
+  dropout: 0.05
+  target_keywords: [q_proj, k_proj, v_proj, o_proj, fc1, fc2]
+
+inference:
+  language_default: python
+  max_new_tokens: 300
+  greedy_temperature: 0.0
+  retry2_temperature: 0.25
+  retry2_top_p: 0.85
+  retry3_temperature: 0.35
+  retry3_top_p: 0.90
+  max_retries: 3
+  min_tokens_before_stop_check: 64
+
+server:
+  host: 127.0.0.1
+  port: 7860
+  share: false
configs/component9_lora_config.verify.yaml
ADDED
@@ -0,0 +1,32 @@
+model:
+  model_config_path: configs/component4_model_config.yaml
+  base_checkpoint_path: checkpoints/component5_420m/step_3200.pt
+  tokenizer_dir: artifacts/tokenizer/code_tokenizer_v1
+lora:
+  r: 8
+  alpha: 16
+  dropout: 0.05
+  target_keywords:
+    - q_proj
+    - k_proj
+    - v_proj
+    - o_proj
+    - fc1
+    - fc2
+finetune:
+  custom_data_path: data/raw/custom_finetune_pairs.jsonl
+  output_dir: models/lora/custom_lora_v1
+  max_seq_len: 512
+  micro_batch_size: 1
+  grad_accum_steps: 16
+  learning_rate: 0.0003
+  weight_decay: 0.0
+  max_steps: 5
+  save_every: 5
+  eval_every: 5
+  early_stopping_patience_evals: 6
+  early_stopping_min_delta: 0.0005
+  use_fp16: true
+  max_vram_gb: 7.0
+resume:
+  resume_from: none
configs/component9_lora_config.yaml
ADDED
@@ -0,0 +1,31 @@
+# Component 9 LoRA fine-tuning config
+
+model:
+  model_config_path: configs/component4_model_config.yaml
+  base_checkpoint_path: checkpoints/component5_420m/step_3200.pt
+  tokenizer_dir: artifacts/tokenizer/code_tokenizer_v1
+
+lora:
+  r: 8
+  alpha: 16
+  dropout: 0.05
+  target_keywords: [q_proj, k_proj, v_proj, o_proj, fc1, fc2]
+
+finetune:
+  custom_data_path: data/raw/custom_finetune_pairs.jsonl
+  output_dir: models/lora/custom_lora_v1
+  max_seq_len: 512
+  micro_batch_size: 1
+  grad_accum_steps: 16
+  learning_rate: 0.0003
+  weight_decay: 0.0
+  max_steps: 1200
+  save_every: 100
+  eval_every: 100
+  early_stopping_patience_evals: 6
+  early_stopping_min_delta: 0.0005
+  use_fp16: true
+  max_vram_gb: 7.0
+
+resume:
+  resume_from: none  # none | latest | explicit path
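With r=8 and those six target keywords, the adapter stays tiny relative to the base model. Each adapted linear layer adds r*(fan_in + fan_out) parameters; using the Component 4 shapes (d_model=1152, d_ff=4608, 23 layers), a back-of-the-envelope count (an illustrative calculation, assuming every layer's q/k/v/o and fc1/fc2 are wrapped):

```python
def lora_params(r: int, layers: int, d: int = 1152, ff: int = 4608) -> int:
    """LoRA adds r*(fan_in + fan_out) params per adapted linear layer."""
    attn = 4 * r * (d + d)                 # q_proj, k_proj, v_proj, o_proj
    mlp = r * (d + ff) + r * (ff + d)      # fc1 (d -> ff), fc2 (ff -> d)
    return layers * (attn + mlp)

print(lora_params(8, 23))  # 3815424 -> ~3.8M trainable vs ~424M base (<1%)
```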
data/cache/raw/code_search_net_python/dataset_dict.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:2bf46fe547f16d795abe0d4c8a591bf031d98882d638931d27660455ee986273
+size 43
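The data-cache entries from here on are Git LFS pointer files, not the data itself: three key-value lines (version, oid, size) that stand in for the real blob. A minimal sketch of parsing one such pointer (not the official git-lfs client logic):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file's key-value lines into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    return {
        "version": fields["version"],
        "oid": fields["oid"].removeprefix("sha256:"),
        "size": int(fields["size"]),
    }

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:2bf46fe547f16d795abe0d4c8a591bf031d98882d638931d27660455ee986273
size 43
"""
info = parse_lfs_pointer(pointer)
print(info["size"])  # 43
```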
data/cache/raw/code_search_net_python/test/data-00000-of-00001.arrow
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:079bce0f0e2513bae63c12f8699e4ea13ec545c5000844de28dc34a1a9fd19eb
+size 84367104

data/cache/raw/code_search_net_python/test/dataset_info.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e8ba7e0c98d4303660c791c0af8da617dce739fcf2be906ee269c6bf572bad9c
+size 2598

data/cache/raw/code_search_net_python/test/state.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:55d5fecb65147f455bfc8249c3e26fc6a2bd01bfd8bd9f354e86eb7834453d1c
+size 261

data/cache/raw/code_search_net_python/train/data-00000-of-00004.arrow
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a5984af399adbfdab06aca7da7638f6a5eb98411b15b88a1f045f346735fbc9c
+size 377852224

data/cache/raw/code_search_net_python/train/data-00001-of-00004.arrow
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5a62df607497be1fd23f3e8aa50908bebff6732ccc8b5dacbfaa0efd336ad915
+size 411927504

data/cache/raw/code_search_net_python/train/data-00002-of-00004.arrow
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d519b4edb8ae27d8e1ab6474a8decc40f45c6a8e7c409039c865abbc9763f351
+size 370005344

data/cache/raw/code_search_net_python/train/data-00003-of-00004.arrow
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b42ae91a5e6e48dd32eac5940429d726f0dbc9440d0262a40a3bfe7a0e2e6214
+size 400292712

data/cache/raw/code_search_net_python/train/dataset_info.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e8ba7e0c98d4303660c791c0af8da617dce739fcf2be906ee269c6bf572bad9c
+size 2598

data/cache/raw/code_search_net_python/train/state.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:180b84fce72622f4113ea103a1fbf79924e61881442db8728b055be042247bcf
+size 448

data/cache/raw/code_search_net_python/validation/data-00000-of-00001.arrow
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f9f848f9c1dfe1c2cfac25fd1b529e050e29291a5d8042ba1d4f904948142c64
+size 92180808

data/cache/raw/code_search_net_python/validation/dataset_info.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e8ba7e0c98d4303660c791c0af8da617dce739fcf2be906ee269c6bf572bad9c
+size 2598

data/cache/raw/code_search_net_python/validation/state.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:20e5f3cf2d550a3fb9b3d3e43f23f25dfaae9ae3124e43dcf14072f5e3aee182
+size 267

data/cache/raw/mbpp/dataset_dict.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:eb69d413c1138964f92bd3723baf871db8f40b4cec70586e770e060108a8c612
+size 53

data/cache/raw/mbpp/prompt/data-00000-of-00001.arrow
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e14c47c41a23d8003284ac9249a5c5e4da285300f1a56b63593fb2d6237556ff
+size 6112

data/cache/raw/mbpp/prompt/dataset_info.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:cb63c6a97c4cbbd8e28f0e478687c69ea593cd0d4a3a1f2b4e85c6b5378b776e
+size 2205