Add benchmark comparison table (base 9.4 vs fine-tuned 10.0)

README.md (changed):

- **Function calling**: Native Ollama/OpenAI tool use format
- **Zero API cost**: Runs locally on 20GB+ VRAM

## Benchmark Results

Evaluated on 12 task categories covering agentic coding capabilities. Each category is scored on several criteria (0 or 1 each); the per-category average is scaled to 0-10, so passing 4 of 5 criteria scores 8.0.

| Category | Base (gemma4-31b-it) | Fine-tuned (v2) | Delta |
|----------|:---:|:---:|:---:|
| ReAct Tool Call | 10.0 | **10.0** | — |
| Function Calling | 8.0 | **10.0** | +2.0 |
| Multi-step ReAct | 8.0 | **10.0** | +2.0 |
| JP Code Gen (API) | 10.0 | **10.0** | — |
| JP Code Gen (Algorithm) | 10.0 | **10.0** | — |
| JP Code Gen (Database) | 9.0 | **10.0** | +1.0 |
| JP Debug (TypeError) | 10.0 | **10.0** | — |
| JP Debug (KeyError) | 10.0 | **10.0** | — |
| JP Code Review | 8.0 | **10.0** | +2.0 |
| JP Git Strategy | 10.0 | **10.0** | — |
| JP Self-correction | 10.0 | **10.0** | — |
| JP Documentation | 10.0 | **10.0** | — |
| **Overall** | **9.4** | **10.0** | **+0.6** |

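To make the scoring concrete: a category score is the mean of its 0/1 criterion results scaled to 10, and the Overall row is the mean of the 12 category scores. The snippet below is purely illustrative; the per-criterion values are made up.

```python
# Illustrative scoring arithmetic only; the criterion results are made up.
criteria = [1, 1, 1, 1, 0]             # e.g. 4 of 5 criteria passed in one category
category_score = 10 * sum(criteria) / len(criteria)
print(category_score)                  # 8.0

# The Overall row is the mean of the 12 category scores (base-model column shown).
base_scores = [10.0, 8.0, 8.0, 10.0, 10.0, 9.0, 10.0, 10.0, 8.0, 10.0, 10.0, 10.0]
print(round(sum(base_scores) / len(base_scores), 1))  # 9.4
```
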
### Key Improvements

- **Function Calling**: Clean `<tool_call>` JSON output (the base model appends extra explanation); see the parsing sketch after this list
- **Multi-step ReAct**: Structured JSON reasoning with a proper Thought/Action/Observation flow
- **Code Review**: Suggests parameterized queries for SQL injection fixes
- **Database CRUD**: Complete Create/Read/Update/Delete coverage

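As a concrete illustration of the `<tool_call>` output mentioned above, the sketch below extracts a tool call from a model response. It assumes the common `<tool_call>{"name": ..., "arguments": {...}}</tool_call>` convention; check your prompt template or Modelfile for the exact schema.

```python
import json
import re

# Hypothetical model response; the payload schema is assumed, not quoted from this card.
response = '<tool_call>{"name": "read_file", "arguments": {"path": "src/main.py"}}</tool_call>'

match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", response, re.DOTALL)
if match:
    call = json.loads(match.group(1))
    print(call["name"], call["arguments"])  # read_file {'path': 'src/main.py'}
```
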
### Inference Test Results (v2 adapter)

| Test | Input | Result |
|------|-------|--------|
| ReAct | "Read src/main.py using read_file tool" | Correct JSON with thought + action |
| JP Code Gen | "FastAPIでヘルスチェックエンドポイントを作成" (create a health-check endpoint with FastAPI) | Clean Python with `/healthz` endpoint |
| JP Debug | "TypeError: 'NoneType' is not subscriptable の原因と修正" (cause of and fix for the TypeError) | Japanese explanation + fix code |
| Function Calling | "Use read_file to read README.md" | Clean `<tool_call>` JSON format |

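For reference, an answer along the following lines would satisfy the JP Code Gen test above; this is an illustrative sketch, not the model's verbatim output.

```python
# Illustrative target output for the health-check prompt, not the model's verbatim answer.
from fastapi import FastAPI

app = FastAPI()

@app.get("/healthz")
async def healthz() -> dict:
    """Simple liveness probe."""
    return {"status": "ok"}
```
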
## Training Details

| Parameter | Value |
|-----------|-------|
| LoRA alpha | 32 |
| Target modules | q/k/v/o_proj, gate/up/down_proj |
| Trainable params | 133M / 31B (0.43%) |
| Training data | 1,546 custom samples (v2) |
| Epochs | 2 (3rd epoch interrupted; checkpoint-388 used) |
| Learning rate | 1.5e-4 (cosine schedule) |
| Final loss | 0.98 |
| Token accuracy | 96.8% |
| Training time | ~1.5 hours |
| Hardware | NVIDIA RTX PRO 6000 (96GB VRAM) |

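The adapter configuration above maps roughly onto the following `peft` setup. This is a sketch: the rank and dropout values are placeholders, since only the alpha and target modules are listed in this excerpt.

```python
from peft import LoraConfig

# Sketch of the adapter config from the table above.
# r and lora_dropout are placeholders; only alpha and target modules are published here.
lora_config = LoraConfig(
    r=16,                      # placeholder rank
    lora_alpha=32,
    lora_dropout=0.05,         # placeholder dropout
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
# Training used a 1.5e-4 learning rate on a cosine schedule for 2 epochs (see table).
```
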
## Training Data Categories

## Use with Ollama

```bash
# After GGUF conversion
ollama create gemma4-ja-agent-coder -f Modelfile
ollama run gemma4-ja-agent-coder
```

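Once the model is registered, it can also be queried programmatically through Ollama's local REST API. A minimal sketch (the prompt is just an example):

```python
import json
import urllib.request

# Query the locally served model through Ollama's /api/generate endpoint.
payload = {
    "model": "gemma4-ja-agent-coder",
    "prompt": "FastAPIでヘルスチェックエンドポイントを作成してください。",  # example prompt
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```
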
## Use with helix-agents (Claude Code MCP)

Reduce Claude Code API token consumption by delegating routine tasks to this local model.

```json
{
  "mcpServers": {
    ...
  }
}
```

## Use with transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-31b-it",
    quantization_config=bnb,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "Tsunamayo7/gemma4-31b-ja-agent-coder")
tokenizer = AutoTokenizer.from_pretrained("Tsunamayo7/gemma4-31b-ja-agent-coder")
```
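
A short generation example to go with the loading code above (a sketch: it assumes the tokenizer ships a chat template, and the prompt and decoding settings are illustrative):

```python
# Illustrative generation call; prompt and decoding settings are examples only.
messages = [{"role": "user", "content": "FastAPIでヘルスチェックエンドポイントを作成してください。"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids=input_ids, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```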

> **Note**: Gemma4 uses `Gemma4ClippableLinear`, which requires a PEFT monkey-patch. See [this gist](https://gist.github.com/) for the workaround.

## License

Apache 2.0 (same as base model)

## Author

[tsunamayo7](https://github.com/tsunamayo7), builder of [helix-agents](https://github.com/tsunamayo7/helix-agents), a local LLM delegation framework for Claude Code.