ume

ume is a GRPO fine-tuned derivative of summerMC/matutake, trained with LoRA on Python code-generation tasks and merged back into the base model for standalone inference.

Model Summary

  • Model name: summerMC/ume
  • Base model: summerMC/matutake
  • Training method: GRPO (Group Relative Policy Optimization)
  • Parameter-efficient tuning: LoRA
  • Training dataset: Hoglet-33/python-coding-dataset
  • Final artifact: merged checkpoint for direct inference

The fine-tuning targets Python code generation, using lightweight reward functions that favor syntactically valid, code-like outputs.


Training Details

Base model

  • summerMC/matutake

Dataset

  • Hoglet-33/python-coding-dataset

Fine-tuning method

  • Trainer: TRL GRPOTrainer
  • Adapter method: LoRA
  • Final export: merged LoRA weights into the base model

Reward functions

Training used simple heuristic reward functions:

1) Syntax reward

Rewards outputs that can be parsed as valid Python:

  • 1.0 if ast.parse(output) succeeds
  • 0.0 otherwise
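
A minimal sketch of such a syntax reward, written in the `completions -> list of floats` shape that TRL reward functions use. The helper name `completion_text` and the handling of chat-style completions are assumptions for illustration, not the code actually used in training.

```python
import ast


def completion_text(completion):
    """Return plain text for a completion given as a string or a list of chat messages."""
    if isinstance(completion, str):
        return completion
    # Conversational format: a list of {"role": ..., "content": ...} messages.
    return completion[0]["content"]


def syntax_reward(completions, **kwargs):
    """Score each completion 1.0 if ast.parse accepts it, else 0.0."""
    rewards = []
    for completion in completions:
        try:
            ast.parse(completion_text(completion))
            rewards.append(1.0)
        except (SyntaxError, ValueError):
            # SyntaxError: invalid Python; ValueError: e.g. null bytes in the source.
            rewards.append(0.0)
    return rewards
```

Note that an empty string also parses successfully, so a reward like this says nothing about whether the output is useful code.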

2) Code-shape reward

Rewards outputs that look more like actual Python code:

  • no Markdown code fences
  • contains Python-like tokens such as def, import, return, class
  • non-trivially long output
  • avoids extremely long generations
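
The four criteria above could be combined as in the following sketch. The equal 0.25 weights and the `MIN_CHARS` / `MAX_CHARS` thresholds are hypothetical; the actual values used in training are not stated here.

```python
import re

# Hypothetical thresholds and weights, chosen only for illustration.
MIN_CHARS = 20
MAX_CHARS = 4000
KEYWORD_PATTERN = re.compile(r"\b(def|import|return|class)\b")


def code_shape_reward(completions, **kwargs):
    """Give 0.25 for each code-shape criterion a completion satisfies."""
    rewards = []
    for completion in completions:
        # Accept plain strings or chat-style completions (list of message dicts).
        text = completion if isinstance(completion, str) else completion[0]["content"]
        score = 0.0
        if "```" not in text:                 # no Markdown code fences
            score += 0.25
        if KEYWORD_PATTERN.search(text):      # contains Python-like tokens
            score += 0.25
        if len(text.strip()) >= MIN_CHARS:    # non-trivially long
            score += 0.25
        if len(text) <= MAX_CHARS:            # not extremely long
            score += 0.25
        rewards.append(score)
    return rewards
```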

These rewards are intentionally lightweight and should be treated as a baseline GRPO setup rather than a production-grade evaluation system.


Prompt Format

The training data was converted into a chat-style coding prompt like this:

[
    {
        "role": "user",
        "content": (
            "Write correct Python code for the following task.\n"
            "Return only Python code. Do not use markdown.\n\n"
            "<task text>"
        ),
    }
]

For best results, prompt the model with a direct coding task and explicitly request code only.
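
The conversion into this prompt format can be sketched as below. The column name `instruction` is a hypothetical placeholder, since the dataset's field names are not given here; the `prompt` key follows the column name that TRL's GRPOTrainer reads.

```python
PROMPT_HEADER = (
    "Write correct Python code for the following task.\n"
    "Return only Python code. Do not use markdown.\n\n"
)


def build_prompt(task_text: str) -> list:
    """Wrap a task description in the single-turn chat prompt shown above."""
    return [{"role": "user", "content": PROMPT_HEADER + task_text}]


def to_grpo_example(row: dict, task_field: str = "instruction") -> dict:
    """Map one dataset row to a {"prompt": ...} record.

    `task_field` defaults to a hypothetical column name; adjust it to the dataset.
    """
    return {"prompt": build_prompt(row[task_field])}
```

With the `datasets` library, a function like `to_grpo_example` would typically be applied through `dataset.map(...)`.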


Usage

Transformers

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "summerMC/ume"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {
        "role": "user",
        "content": "Write a Python function that computes fibonacci numbers with memoization."
    }
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,  # return input_ids and attention_mask as a mapping, so **inputs works below
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
)

response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)

print(response)

Example Prompt

Input

Write a Python function that returns the longest common prefix of a list of strings.
Return only Python code.

Expected output style

def longest_common_prefix(strs):
    if not strs:
        return ""

    prefix = strs[0]
    for s in strs[1:]:
        while not s.startswith(prefix):
            prefix = prefix[:-1]
            if not prefix:
                return ""
    return prefix

Training Configuration

The model was trained with a setup similar to the following:

  • LoRA rank (r): 16
  • LoRA alpha: 32
  • LoRA dropout: 0.05
  • Learning rate: 5e-6
  • Batch size: 1
  • Gradient accumulation: 8
  • Generation batch size: 2
  • Number of generations: 2
  • Epochs: 1

LoRA target modules

[
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
]
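
As a rough sketch, the values above map onto `peft.LoraConfig` and `trl.GRPOConfig` keyword arguments as follows. The mapping of "Batch size" to `per_device_train_batch_size`, the `task_type`, and the output directory are assumptions, and accepted argument names can differ between TRL versions.

```python
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",  # assumption: not stated in the list above
}

GRPO_KWARGS = {
    "output_dir": "ume-grpo-lora",     # hypothetical path
    "learning_rate": 5e-6,
    "per_device_train_batch_size": 1,  # assumption: "Batch size" is per device
    "gradient_accumulation_steps": 8,
    "generation_batch_size": 2,
    "num_generations": 2,
    "num_train_epochs": 1,
}

# Derived quantities that follow directly from the numbers above.
lora_scaling = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]  # standard LoRA scale alpha / r
samples_per_optimizer_step = (
    GRPO_KWARGS["per_device_train_batch_size"]
    * GRPO_KWARGS["gradient_accumulation_steps"]
)  # per device


def build_configs():
    """Instantiate the config objects; requires peft and trl to be installed."""
    from peft import LoraConfig
    from trl import GRPOConfig

    return LoraConfig(**LORA_KWARGS), GRPOConfig(**GRPO_KWARGS)
```

With rank 16 and alpha 32 the adapter update is scaled by 2.0, and each optimizer step accumulates 8 samples per device.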

Limitations

  • Training rewards are heuristic and do not verify functional correctness with unit tests.
  • The model may still produce syntactically valid but logically incorrect code.
  • Outputs may include hallucinated APIs, inefficient solutions, or incomplete implementations.
  • Performance depends heavily on the capabilities and constraints of the base model summerMC/matutake.
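
Because the rewards never execute the generated code, functional checks have to happen outside the model. The sketch below shows one possible check against a few input/output cases; `passes_checks` and the test cases are hypothetical, and `exec` should only be used on untrusted output inside a sandbox.

```python
import ast


def passes_checks(source, func_name, cases):
    """Return True if `source` parses, defines `func_name`, and matches every case."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    namespace = {}
    try:
        # WARNING: exec runs arbitrary code; isolate it when the source is untrusted.
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Missing function, runtime error, or wrong call signature.
        return False


generated = '''
def longest_common_prefix(strs):
    if not strs:
        return ""

    prefix = strs[0]
    for s in strs[1:]:
        while not s.startswith(prefix):
            prefix = prefix[:-1]
            if not prefix:
                return ""
    return prefix
'''

# Each case is (positional arguments, expected return value).
cases = [
    ((["flower", "flow", "flight"],), "fl"),
    (([],), ""),
    ((["dog", "car"],), ""),
]

print(passes_checks(generated, "longest_common_prefix", cases))
```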

Intended Use

summerMC/ume is intended for:

  • Python code generation experiments
  • GRPO / RLHF-style fine-tuning experiments
  • LoRA + merge workflows
  • lightweight coding assistant prototyping
  • research and hobbyist use

It is not validated for:

  • production-critical software generation
  • security-sensitive code
  • safety-critical systems
  • correctness-sensitive automated coding pipelines without external verification

Reproducibility

The training pipeline used:

  • transformers
  • datasets
  • trl
  • peft
  • torch

A simplified training flow:

  1. Load summerMC/matutake
  2. Convert the dataset into chat prompts
  3. Train with GRPOTrainer using LoRA adapters
  4. Save the LoRA adapter
  5. Merge adapter weights back into the base model
  6. Save the merged model as summerMC/ume
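
Steps 4 to 6 can be sketched with `peft` as follows. This is an illustration under assumptions rather than the original training script: the function name and directory arguments are hypothetical, and the bfloat16 dtype is a choice made for this example.

```python
def merge_lora_adapter(base_model_id, adapter_dir, output_dir):
    """Load a base model, merge a saved LoRA adapter into it, and save the result."""
    # Heavy dependencies are imported lazily so this module loads without them.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        torch_dtype=torch.bfloat16,  # assumption: dtype chosen for this sketch
        trust_remote_code=True,
    )
    # Attach the trained adapter, then fold its weights into the base layers.
    model = PeftModel.from_pretrained(base, adapter_dir)
    merged = model.merge_and_unload()

    # Save a standalone checkpoint plus the tokenizer so it loads without peft.
    merged.save_pretrained(output_dir, safe_serialization=True)
    tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
    tokenizer.save_pretrained(output_dir)


# Example call (paths are placeholders):
# merge_lora_adapter("summerMC/matutake", "path/to/lora-adapter", "ume-merged")
```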

Base Model and Dataset Attribution

Base model

  • summerMC/matutake

Dataset

  • Hoglet-33/python-coding-dataset


License

Please follow the licenses and usage terms of:

  1. the original base model summerMC/matutake
  2. the training dataset Hoglet-33/python-coding-dataset

If you redistribute or publish derivative checkpoints, confirm that your use is compatible with both upstream licenses.


Citation

If you use this model in a project or experiment, please cite the upstream base model and dataset.
