Qwen2.5-OCamler-1.5B-Instruct

This model is a fine-tuned version of Qwen/Qwen2.5-Coder-1.5B-Instruct specialized for generating OCaml code.

Model Details

Training Configuration

GRPO Parameters

Parameter                      Value
Batch Size                     2
Gradient Accumulation Steps    4
Effective Batch Size           8
Learning Rate                  5e-6
Number of Epochs               3
Max Prompt Length              800
Max Completion Length          700
LR Scheduler Type              cosine
Warmup Ratio                   0.03
Weight Decay                   0.01
Max Grad Norm                  1.0
Optimizer                      adamw_8bit
Dataloader Num Workers         2
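
The effective batch size follows from the batch size and gradient accumulation values listed above. A minimal sketch of the arithmetic, assuming a single training device (the device count is not stated here, but 2 × 4 = 8 is consistent with one device):

```python
# Values copied from the GRPO Parameters listed above.
per_device_batch_size = 2
gradient_accumulation_steps = 4

# Assumption: one training device; the card does not state the device count.
num_devices = 1

# Samples that contribute to each optimizer step.
effective_batch_size = (
    per_device_batch_size * gradient_accumulation_steps * num_devices
)
print(effective_batch_size)  # 8
```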

LoRA Configuration

Parameter        Value
LoRA Rank (r)    32
LoRA Alpha       64
LoRA Dropout     0.05
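
In the standard LoRA formulation, the low-rank update is scaled by alpha / r, so the values above give a scaling factor of 2. The sketch below computes that factor and the number of trainable parameters an adapter adds to one linear layer. The 1024 × 1024 layer shape is a hypothetical example, not a dimension taken from this model, and the helper name is illustrative.

```python
lora_rank = 32
lora_alpha = 64

# Standard LoRA scaling applied to the low-rank update B @ A.
scaling = lora_alpha / lora_rank
print(scaling)  # 2.0


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added to one linear layer.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


# Hypothetical layer shape, chosen only for illustration.
print(lora_param_count(1024, 1024, lora_rank))  # 65536
```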

Training Settings

Parameter           Value
Logging Steps       1
Eval Steps          500
Save Steps          100
Save Total Limit    30

GRPO-Specific Parameters

Parameter                Value
Num Generations          8
Temperature              1.0
Beta (KL coefficient)    0.01
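
GRPO samples a group of completions for each prompt (8 here) and scores every completion relative to the rest of its group. A minimal sketch of that normalization, assuming the common mean and standard-deviation form with a population standard deviation; the exact variant used in this training run is not stated, and the reward values below are made up for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against the mean and spread of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for the 8 completions sampled for one prompt.
rewards = [0.0, 0.25, 0.25, 0.35, 0.35, 1.0, 1.0, 1.0]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.3f}")
```

Completions that score above the group mean receive a positive advantage and those below it a negative one, so a group in which every completion earns the same reward contributes no learning signal.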

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kiranpg/Qwen2.5-OCamler-1.5B-Instruct"

# Load the tokenizer and model; dtype and device placement are chosen automatically.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "user", "content": "Write an OCaml function to compute the factorial of a number."}
]

# Render the conversation with the model's chat template, then tokenize the prompt.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Sample a completion and print only the newly generated tokens, not the echoed prompt.
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))

Intended Use

This model is designed for generating OCaml code solutions given natural language problem descriptions. It has been fine-tuned on OCaml programming problems using GRPO with real-time feedback from the OCaml compiler and test suite to improve its ability to produce correct, idiomatic OCaml code.

Limitations

  • The model may not always produce syntactically correct OCaml code
  • Complex algorithmic problems may require multiple attempts
  • The model works best with clear, well-specified problem descriptions

Training Infrastructure

The model was trained with TRL's GRPOTrainer, using OCaml compiler verification to compute rewards. The reward has a graduated structure:

  • Type checking: 25% (partial credit scaled by error count)
  • Compilation: 10% (partial credit based on type check)
  • Tests: 65% (all-or-nothing for passing)
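
The weighting above can be sketched as a single scoring function. Only the three weights and the all-or-nothing test rule come from this card. The partial-credit formulas (how the type-check score decays with error count, and how much compilation credit a failed build receives) are hypothetical placeholders, as are the function and parameter names.

```python
# Weights taken from the reward structure described above.
WEIGHTS = {"type_check": 0.25, "compile": 0.10, "tests": 0.65}


def graduated_reward(type_errors: int, compiled: bool, tests_passed: bool) -> float:
    """Combine the three verification stages into a reward in [0, 1]."""
    # Hypothetical decay: full credit with no type errors, less with each error.
    type_score = 1.0 / (1.0 + type_errors)

    # Hypothetical partial credit: a failed build earns half of the type-check score.
    compile_score = 1.0 if compiled else 0.5 * type_score

    # All-or-nothing, and only reachable when the code compiled.
    test_score = 1.0 if (compiled and tests_passed) else 0.0

    return (
        WEIGHTS["type_check"] * type_score
        + WEIGHTS["compile"] * compile_score
        + WEIGHTS["tests"] * test_score
    )


print(graduated_reward(type_errors=0, compiled=True, tests_passed=True))
print(graduated_reward(type_errors=0, compiled=True, tests_passed=False))
print(graduated_reward(type_errors=1, compiled=False, tests_passed=False))
```

Because tests carry 65% of the weight, a completion that type-checks and compiles but fails its tests earns at most 0.35, which keeps passing the test suite as the dominant signal.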