mistral-axel-2 (v2.1)
A LoRA adapter for COBOL → Java translation, trained on Ministral-3-8B-Instruct.
This card includes the full negative result: axel-2 did learn the v2.1 training distribution (+19pp on perturbed train-set translations vs base), but the distribution did not generalize the way we hoped — it regressed Component C by 1.87pp and lost most of axel-1's CobolEval lift. We publish the full numbers because the negative result is itself a useful data point about narrow-domain SFT.
TL;DR
| | Value |
|---|---|
| Base model | unsloth/Ministral-3-8B-Instruct-2512 (BF16; trained on 4-bit BnB quant) |
| Adapter type | LoRA (PEFT) |
| Rank / α | 16 / 16 |
| Trainable params | ~28 M (≈ 0.35 % of base) |
| Adapter size | 107 MB (adapter_model.safetensors, 812 tensors, BF16) |
| Training data | 137 examples of COBOL→Java translation + anti-forgetting holdouts |
| Training time | 84 s on 1× H100 80GB (Modal) |
| Task | COBOL→Java translation (preserve stdin/stdout behavior) |
| License | Apache-2.0 (matches base model) |
The benchmarks (one-line each)
- CobolEval: 146-task HumanEval-style COBOL completion benchmark. Compiled with GnuCOBOL, run against held-out test stdins.
- Component B: 24 hand-curated COBOL → Java translation tasks across 8 hard mainframe failure modes (`comp3_precision`, `OCCURS DEPENDING`, `REDEFINES`, PIC formatting, `EVALUATE WHEN OTHER`, …). Frontier-only difficulty: Devstral-2 (123B) scores 8.3% pass@5.
- Component C: 107 easier COBOL → Java tasks across 16 string-processing failure modes (case change, char check, palindrome, concat, accumulator, arithmetic, min/max, …). Calibrated for small-model discrimination: Ministral-3B base scores 24.3%, Mistral-Medium 89.5%.
- Component D: 18 medium-difficulty COBOL → Java tasks across 8 skills (loop accumulate-filter, multi-format input, string search, boundary branches, …). Sits between B and C in difficulty; Mistral-Large hits 94.4% here.
All four benchmarks use the same Modal inference harness; only the LoRA adapter flag changes between the model columns of the results table.
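The training system prompt quoted later in this card describes the pass criterion as stdout equality "under loose whitespace comparison and numeric tolerance 1e-3". The harness code itself is not shown, so the following is only a minimal sketch of one way such a comparison could work; `outputs_match` is a hypothetical helper, and the token-by-token strategy is an assumption.

```python
import math

def outputs_match(expected: str, actual: str, tol: float = 1e-3) -> bool:
    """Compare two stdout captures token by token.

    Whitespace runs are collapsed by str.split(); tokens that both parse as
    numbers are compared with an absolute tolerance, everything else exactly.
    """
    exp_tokens, act_tokens = expected.split(), actual.split()
    if len(exp_tokens) != len(act_tokens):
        return False
    for e, a in zip(exp_tokens, act_tokens):
        if e == a:
            continue
        try:
            if not math.isclose(float(e), float(a), rel_tol=0.0, abs_tol=tol):
                return False
        except ValueError:
            return False  # at least one token is not numeric and the tokens differ
    return True

print(outputs_match("TOTAL:  12.500\n", "TOTAL: 12.5"))  # True
print(outputs_match("12.50", "12.51"))                   # False
```

One consequence of this design is that `0012` and `12` compare equal. A real harness might want to treat zero-padded COBOL display output more strictly.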
Headline evaluation (pass@5, temperature 0.7, k=5)
| Benchmark | n | Base | axel-1 (v1) | mistral-axel-2 (v2.1) | axel-2 vs base |
|---|---|---|---|---|---|
| CobolEval (completion) | 146 | 0.68% | 21.23% | 2.05% | +1.37pp |
| Component C (string-processing) | 107 | 24.30% | 24.30% | 22.43% | −1.87pp |
| Component B (hard COBOL idioms → Java) | 24 | 0.00% | n/a | 4.17% | +4.17pp |
| Component D (medium difficulty) | 18 | 0.00% | n/a | 0.00% | +0.00pp |
The headline target — lifting Component C from 24.30% to 40%+ — was missed. Component B gained 1 task (occurs_depending_00), making axel-2 the only Ministral-3-8B variant we ran that scores non-zero on Component B.
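The card does not spell out how pass@k is computed. One reading consistent with reporting both pass@1 and pass@5 from five samples per task is the standard unbiased estimator, 1 − C(n−c, k) / C(n, k). The sketch below assumes that reading; `pass_at_k` and `benchmark_pass_at_k` are illustrative names, not part of the actual harness, and the per-task counts are made up.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples has no correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Mean pass@k over tasks, as a percentage."""
    return 100.0 * sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Hypothetical correct-sample counts for a 4-task benchmark, n = 5 samples each.
counts = [0, 1, 5, 0]
print(benchmark_pass_at_k(counts, n=5, k=5))  # 50.0: two of four tasks have a passing sample
print(benchmark_pass_at_k(counts, n=5, k=1))  # mean of c/n over tasks, about 30
```

With n = k = 5 the estimator reduces to "at least one of five samples passed", which is why a single newly solved task on the 24-task Component B moves the score by 1/24 ≈ 4.17pp.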
The interesting finding: catastrophic forgetting between sibling string skills
The full Component C delta is a clean trade between two sub-skills:
| failure mode | base passes | axel-2 passes | delta |
|---|---|---|---|
| `string_concat` (n=19) | 0/19 | 19/19 | +100pp |
| `string_palindrome` (n=23) | 23/23 | 3/23 | −87pp |
| `string_case_change` (n=7) | 1/7 | 0/7 | −14pp |
| every other failure mode | 2 | 2 | 0pp |
axel-2 perfectly learned string concatenation (every single Component C concat task passes) and catastrophically forgot palindrome detection (which base already had). Net: −2 tasks.
This is a tidy example of how a narrow SFT distribution can pull the model's prior toward one pattern at the expense of a sibling pattern — even when 35 of the 137 training examples were explicitly chosen as "anti-forgetting" holdouts.
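As a sanity check, the per-failure-mode counts reported above can be summed back into the headline Component C numbers. The snippet uses only figures from this card; the variable names are illustrative.

```python
# (n tasks, base passes, axel-2 passes) per Component C failure mode, as reported above.
component_c = {
    "string_concat": (19, 0, 19),
    "string_palindrome": (23, 23, 3),
    "string_case_change": (7, 1, 0),
}
TOTAL_TASKS = 107
OTHER_PASSES = (2, 2)  # base, axel-2 passes summed over every other failure mode

base_total = sum(b for _, b, _ in component_c.values()) + OTHER_PASSES[0]
axel2_total = sum(a for _, _, a in component_c.values()) + OTHER_PASSES[1]

for mode, (n, b, a) in component_c.items():
    print(f"{mode:20s} {b:2d}/{n} -> {a:2d}/{n}  ({100 * (a - b) / n:+.0f}pp)")

print(f"base  {base_total}/{TOTAL_TASKS} = {100 * base_total / TOTAL_TASKS:.2f}%")   # 26/107 = 24.30%
print(f"axel2 {axel2_total}/{TOTAL_TASKS} = {100 * axel2_total / TOTAL_TASKS:.2f}%")  # 24/107 = 22.43%
print(f"net   {axel2_total - base_total:+d} tasks")                                   # -2 tasks
```

The +19 from `string_concat` is more than cancelled by the −20 from `string_palindrome` and the −1 from `string_case_change`.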
Memorization audit (clean)
We perturbed 10 v2.1 training examples (rename PROGRAM-ID + 1 working-storage variable), re-translated with axel-2, then ran the resulting Java against the original test stdins.
| metric | base | axel-2 |
|---|---|---|
| test cases passed (of 21 valid) | 17 | 21 |
| pass rate | 80.95% | 100.00% |
| byte-identical to gold Java | n/a | 0/10 |
axel-2 generalizes within the training distribution (+19pp lift after perturbation, 0 byte-identical copies). The narrow result is therefore not a memorization artifact — the training distribution itself is just narrow.
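The perturbation script is not included in this card. A minimal sketch of the kind of rename described (new PROGRAM-ID plus one working-storage variable) might look like the following; `perturb_cobol` and the sample program are hypothetical, and the sketch assumes identifiers are written in a consistent case.

```python
import re

def perturb_cobol(source: str, new_program_id: str, old_var: str, new_var: str) -> str:
    """Rename the PROGRAM-ID and one data name without touching program logic."""
    # Replace the identifier that follows "PROGRAM-ID." (keyword matched case-insensitively).
    source = re.sub(
        r"(PROGRAM-ID\.\s+)[A-Za-z0-9-]+",
        lambda m: m.group(1) + new_program_id,
        source,
        count=1,
        flags=re.IGNORECASE,
    )
    # COBOL names may contain hyphens, so \b is not enough: the lookarounds keep
    # WS-NAME from also matching inside WS-NAME-LEN.
    pattern = rf"(?<![A-Za-z0-9-]){re.escape(old_var)}(?![A-Za-z0-9-])"
    return re.sub(pattern, lambda m: new_var, source)

sample = """       IDENTIFICATION DIVISION.
       PROGRAM-ID. CONCAT.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01 WS-NAME     PIC X(10).
       01 WS-NAME-LEN PIC 9(2).
       PROCEDURE DIVISION.
           ACCEPT WS-NAME.
           DISPLAY WS-NAME.
           STOP RUN.
"""
print(perturb_cobol(sample, "CONCAT-P1", "WS-NAME", "WS-LABEL"))
```

Because the rename leaves stdin/stdout behavior unchanged, the original test stdins remain valid for the perturbed program. In fixed-format COBOL a longer replacement name can push text past column 72, so a real script would also need to check line lengths.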
Training procedure
# modal_app/finetune_axel2.py (single H100 80GB)
# Excerpt: model loading and dataset arguments are omitted here.
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model=model,  # Ministral-3-8B-Instruct, 4-bit BnB quant
    args=SFTConfig(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=1,
        max_steps=60,  # ~1.94 epochs over 124 train
        learning_rate=1e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        bf16=True,
        seed=12627998,
    ),
    peft_config=LoraConfig(
        r=16, lora_alpha=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_dropout=0.0, bias="none",
    ),
)
System prompt at training time:
"You are a senior COBOL modernization engineer. Translate the user's standalone COBOL program into Java. Preserve stdin/stdout behavior exactly under loose whitespace comparison and numeric tolerance 1e-3. Output only a single Java file with public class Solution. Do not use markdown fences."
| metric | value |
|---|---|
| steps | 60 |
| learning rate | 1e-4 cosine, warmup 3% |
| batch size | 4 |
| train / val examples | 124 / 13 |
| precision | BF16 |
| hardware | 1× H100 80GB on Modal |
| wall clock | 83.5 s |
| train loss | 1.05 → 0.16 |
| eval loss step 15 / 30 / 45 / 60 | 0.34 / 0.17 / 0.14 / 0.13 |
| estimated cost | < $0.50 |
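The epoch count and the shape of the learning-rate schedule follow from the hyperparameters above. The sketch below reproduces the arithmetic; the schedule is the usual linear-warmup-then-cosine form, and rounding the 3% warmup up to whole steps is an assumption about how the trainer handles `warmup_ratio`. Both function names are illustrative.

```python
import math

def epochs_seen(max_steps: int, batch_size: int, grad_accum: int, n_train: int) -> float:
    """Number of passes over the training set implied by a fixed step budget."""
    return max_steps * batch_size * grad_accum / n_train

def cosine_lr(step: int, max_steps: int, base_lr: float, warmup_steps: int) -> float:
    """Linear warmup followed by a single half-cycle of cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(round(epochs_seen(60, 4, 1, 124), 2))  # 1.94

warmup = math.ceil(0.03 * 60)  # assumed rounding: 3% of 60 steps becomes 2 steps
for step in (0, warmup, 31, 60):
    print(step, f"{cosine_lr(step, 60, 1e-4, warmup):.2e}")
```

With only 60 steps the warmup is over almost immediately, so nearly the whole run sits on the decaying part of the curve.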
Training data
137 examples of COBOL → Java translation pairs (124 train / 13 val), generated and calibrated against base-model pass-rate to avoid both "too easy" and "too hard" extremes.
| failure mode | count |
|---|---|
| `comp3_precision` | 19 |
| `redefines_aliasing` | 16 |
| `evaluate_when_other` | 15 |
| `packed_vs_display` | 15 |
| `occurs_depending` | 12 |
| `multi_format_input` | 11 |
| `eighty_eight_levels` | 8 |
| `perform_varying_complex` | 3 |
| `string_pic_width` | 3 |
| anti-forgetting holdouts (from SFT v1.5) | 35 |
Calibration: every candidate example was run through the base model 5 times. Examples passing all 5 (44) or none of 5 (93) were rejected, leaving 137 in the difficulty-1/2/3 sweet spot.
Source mix: 102 v2.1 hard-synthesis + 35 anti-forgetting holdouts from SFT v1.5 (axel-1's parent dataset).
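The calibration rule described above (reject candidates the base model solves 5/5 or 0/5 times) can be sketched as a simple filter. `calibrate` and the example IDs are hypothetical; the actual selection script is not part of this card.

```python
def calibrate(pass_counts: dict[str, int], n_runs: int = 5) -> tuple[list[str], list[str], list[str]]:
    """Split candidates into (kept, too_easy, too_hard) by base-model pass count."""
    kept, too_easy, too_hard = [], [], []
    for example_id, passes in pass_counts.items():
        if passes == n_runs:
            too_easy.append(example_id)   # base already solves it every time
        elif passes == 0:
            too_hard.append(example_id)   # base never solves it
        else:
            kept.append(example_id)       # partial success: useful training signal
    return kept, too_easy, too_hard

# Hypothetical candidates mapped to base-model passes out of 5 runs.
candidates = {"ex_a": 5, "ex_b": 0, "ex_c": 2, "ex_d": 4, "ex_e": 1}
kept, too_easy, too_hard = calibrate(candidates)
print(kept, too_easy, too_hard)  # ['ex_c', 'ex_d', 'ex_e'] ['ex_a'] ['ex_b']
```

Keeping only partially solved examples concentrates the small training budget on tasks where the base model is close to succeeding.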
Usage
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Ministral-3-8B-Instruct-2512",
    torch_dtype=torch.bfloat16, device_map="auto",
)
model = PeftModel.from_pretrained(base, "axeltta/mistral-axel-2")
tokenizer = AutoTokenizer.from_pretrained("axeltta/mistral-axel-2")

# Same system prompt as at training time.
SYSTEM = ("You are a senior COBOL modernization engineer. "
          "Translate the user's standalone COBOL program into Java. "
          "Preserve stdin/stdout behavior exactly under loose whitespace "
          "comparison and numeric tolerance 1e-3. "
          "Output only a single Java file with public class Solution. "
          "Do not use markdown fences.")

COBOL = """       IDENTIFICATION DIVISION.
       PROGRAM-ID. CONCAT.
       ...
"""

# return_dict=True returns input_ids and attention_mask regardless of library defaults.
inputs = tokenizer.apply_chat_template(
    [{"role": "system", "content": SYSTEM},
     {"role": "user", "content": COBOL}],
    return_tensors="pt", add_generation_prompt=True, return_dict=True,
).to(model.device)

out = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
Match the training system prompt. axel-2 is sensitive to system-prompt style; using a different prompt at inference time pulls it back toward base behavior (this is part of why CobolEval — which uses a completion prompt, not a translation prompt — regressed from axel-1's lift).
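The training prompt asks for a single Java file with `public class Solution` and no Markdown fences, but a sampled reply can still drift from that format. A defensive post-processing step could look like the sketch below; `extract_java` and `save_solution` are hypothetical helpers, and they assume `generated` contains only the model's reply, not the prompt.

```python
import re
from pathlib import Path

FENCE = "`" * 3  # built indirectly so this example nests cleanly in Markdown

def extract_java(generated: str) -> str:
    """Return the Java source from a model reply, dropping a Markdown fence if present."""
    pattern = re.escape(FENCE) + r"(?:java)?\s*\n(.*?)" + re.escape(FENCE)
    match = re.search(pattern, generated, flags=re.DOTALL)
    code = match.group(1) if match else generated
    code = code.strip() + "\n"
    if "public class Solution" not in code:
        raise ValueError("reply does not define `public class Solution`")
    return code

def save_solution(generated: str, out_dir: str = ".") -> Path:
    """Write the cleaned reply to Solution.java, the file name javac expects for that class."""
    path = Path(out_dir) / "Solution.java"
    path.write_text(extract_java(generated), encoding="utf-8")
    return path

reply = FENCE + "java\npublic class Solution {\n    public static void main(String[] a) {}\n}\n" + FENCE
print(extract_java(reply))
```

Failing loudly when the class is missing keeps malformed replies from being counted as compile errors further down the pipeline.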
What this rules out, and what it doesn't
Rules out:
- "axel-2 didn't train" → eval loss decreased monotonically to 0.13; perturbed train-set lift +19pp.
- "axel-2 memorized" → 0/10 byte-identical outputs after perturbation.
- "the eval pipeline is broken" → same harness, base 24.30% / axel-1 24.30% / axel-2 22.43% — only the LoRA flag differs.
Doesn't rule out:
- System-prompt mismatch. v2.1 trained on a translation system prompt; Component C harness uses a different one.
- Distribution gap. The 9 hard failure modes in v2.1 cover Component B's structure, not Component C's (which is dominated by string_concat / string_palindrome / string_case_change).
- Step count too high. 60 steps × batch 4 on 124 examples ≈ 1.94 epochs. v1 used 40 steps on 95 (~1.7 epochs) and got the CobolEval lift; v2.1 may have over-trained on a narrow distribution.
Recommended directions for a hypothetical v3
If you're considering training a v3 on this base, the audit suggests:
- Match the harness system prompt at training time — not the "senior engineer" variant.
- Explicitly include `string_palindrome` examples to preserve the base capability.
- Lower step count (~30) to avoid over-training a narrow distribution.
- Re-mix CobolEval-style completion examples to keep the v1 +20pp CobolEval lift.
Companion adapter
`axeltta/mistral-axel-1`: same base, trained for COBOL code completion (positive result, +20.55pp on CobolEval).
Limitations and risks
- Negative result on the headline metric. axel-2 should not be assumed to improve every COBOL task; on Component C it costs 2 tasks compared to base.
- Narrow training distribution. 137 examples is small; the model strongly fits the 9 v2.1 failure modes and weakens elsewhere.
- Not for production mainframe modernization. Research-grade. Always compile, run real tests, and have a human review.
- Inherits base biases. No safety tuning beyond `Ministral-3-8B-Instruct-2512`.
Citation
@software{mistral_axel_2_2026,
author = {Axelsson, A.},
title = {mistral-axel-2: A LoRA adapter for COBOL→Java translation on Ministral-3-8B-Instruct},
year = {2026},
url = {https://huggingface.co/axeltta/mistral-axel-2},
note = {Negative result on Component C; positive on Component B.},
}
Framework versions
- PEFT 0.19.1
- Transformers 5.8
- TRL (SFTTrainer)
- Unsloth (training pipeline)
- PyTorch ≥ 2.7
Evaluation results
- pass@5 on Component B (24 tasks, 9 failure modes): 4.17 (self-reported)
- pass@5 on Component C (107 tasks): 22.43 (self-reported)
- pass@1 on Component C (107 tasks): 7.66 (self-reported)
- pass@5 on CobolEval (146 tasks): 2.05 (self-reported)
- pass@1 on CobolEval (146 tasks): 1.37 (self-reported)


