data-morph-gemma-2b

A 2.0 GB local model for file-format conversion: a Gemma‑4 E2B student distilled from Claude Opus to convert between CSV, JSON, and TXT. It was fine‑tuned with LoRA, then shrunk by stripping the unused vision/audio towers, pruning the vocabulary (262 k → 16 k), and quantizing to 8‑bit. Together these steps cut the model from 5.12 B to 2.05 B parameters and from 9.6 GB to 2.0 GB.

This is not a general chat model. It is trained for one job: given a small metadata envelope describing a file, write a Python script that converts it. It is meant to be driven by the data-morph package, which runs the full pipeline around it.

How it works

Conversion is a five‑stage pipeline; the model never sees the full source file, only a compact metadata envelope (schema, samples, warnings):

[file] → 1. extract envelope → 3. THIS MODEL writes a Python script
       → 4. sandbox runs the script → 5. validate output → [converted file]
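
To make the "metadata envelope" idea concrete, here is a minimal sketch of stage 1 for a CSV input. It assumes the envelope is a small dict of column types, a few sample rows, and warnings. The function names (`build_envelope`, `infer_type`) and every field name are hypothetical; the real envelope layout is defined by the data-morph package and may differ.

```python
import csv
import io
import json


def infer_type(values):
    """Guess a coarse column type from string values (illustrative only)."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "str"


def build_envelope(csv_text, max_samples=3):
    """Summarise CSV text as a small dict; hypothetical layout, not data-morph's."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    # Normalise missing cells to "" so every row has every column.
    rows = [{c: (row.get(c) or "") for c in columns} for row in reader]
    warnings = [
        f"column '{c}' has empty values"
        for c in columns
        if any(row[c] == "" for row in rows)
    ]
    return {
        "format": "csv",
        "row_count": len(rows),
        "schema": {c: infer_type([r[c] for r in rows]) for c in columns},
        "samples": rows[:max_samples],
        "warnings": warnings,
    }


envelope = build_envelope("name,age\nAda,36\nLinus,\n")
print(json.dumps(envelope, indent=2))
```

The envelope stays the same size however many rows the file has, because only `max_samples` rows are copied into it.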

The model emits an <analysis>…</analysis> block followed by a <script>…</script> block. Narrowing the target from "transform a whole file" to "read metadata, write a script" is what makes a 2 B model viable, and lets the pipeline scale to arbitrary file sizes while leaving a readable, debuggable artefact (the script).
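
A caller has to split that reply into its two blocks before anything can run. The sketch below shows one way to do it with a regular expression; `split_reply` is a hypothetical helper, not part of the data-morph API, and it assumes the script body never contains a literal `</script>` string.

```python
import re

# Matches <analysis>...</analysis> and <script>...</script>, across newlines.
_BLOCK = re.compile(r"<(analysis|script)>(.*?)</\1>", re.DOTALL)


def split_reply(reply):
    """Return (analysis, script) from a model reply, or raise ValueError."""
    blocks = {tag: body.strip() for tag, body in _BLOCK.findall(reply)}
    missing = [t for t in ("analysis", "script") if t not in blocks]
    if missing:
        raise ValueError(f"reply is missing block(s): {', '.join(missing)}")
    return blocks["analysis"], blocks["script"]


reply = (
    "<analysis>Two columns; age has blanks.</analysis>\n"
    "<script>\nimport csv, json\nprint('converting')\n</script>"
)
analysis, script = split_reply(reply)
compile(script, "<generated>", "exec")  # syntax check only; nothing is executed
print(analysis)
```

Compiling without executing gives a cheap early rejection of malformed scripts before they reach the sandbox.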

Intended use

  • In scope: CSV↔JSON conversion, JSON flattening, nested‑JSON construction, TXT log → CSV parsing, and schema migration — the five patterns it was distilled on.
  • Out of scope: open‑ended chat, formats other than CSV/JSON/TXT, and adversarial or far‑out‑of‑distribution inputs (a small model can be misled; the surrounding pipeline validates output and retries, but does not guarantee success).

Usage

Use via the pip package (recommended)

pip install "data-morph-gemma[mlx]"   # Apple Silicon + MLX

Then, in Python:

from datamorph import convert_file

result = convert_file("contacts.csv", "contacts.json")
print(result.accepted, result.scores, result.output_path)

convert_file runs the full pipeline (envelope → script → sandbox → validate) with a retry‑on‑error loop, so you get a validated output file, not just raw text. This model downloads automatically on first use (cached under ~/.cache/huggingface); set GEMMA_MLX_MODEL only if you want to point at a local copy instead.
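
The retry-on-error loop can be pictured roughly as follows. This is a sketch under stated assumptions, not the datamorph implementation: `run_script`, `convert_with_retries`, and the `generate_script` / `validate` callables are hypothetical names, a plain subprocess stands in for the real sandbox, and "retry ≤ 3" is read here as one initial attempt plus up to three retries.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_script(script, workdir, timeout=10):
    """Run a script in a child interpreter; returns (ok, stderr).

    A subprocess with a timeout is NOT a security sandbox; it only
    stands in for the real one in this sketch.
    """
    path = Path(workdir) / "convert.py"
    path.write_text(script)
    try:
        proc = subprocess.run(
            [sys.executable, str(path)], cwd=workdir,
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"
    return proc.returncode == 0, proc.stderr


def convert_with_retries(generate_script, validate, max_retries=3):
    """Generate, run, and validate; feed each error into the next attempt."""
    feedback = None
    for attempt in range(1 + max_retries):
        script = generate_script(feedback)
        with tempfile.TemporaryDirectory() as workdir:
            ok, stderr = run_script(script, workdir)
            if ok and validate(Path(workdir)):
                return attempt, script
        feedback = stderr or "output failed validation"
    raise RuntimeError(f"no valid output after {max_retries} retries")


# Stub generator (hypothetical): the first script crashes, the second works.
scripts = iter([
    "raise SystemExit('boom')",
    "open('out.json', 'w').write('[]')",
])
attempt, script = convert_with_retries(
    generate_script=lambda feedback: next(scripts),
    validate=lambda workdir: (workdir / "out.json").read_text() == "[]",
)
print(f"accepted on attempt {attempt}")
```

Passing the previous error back to the generator is what lets a later attempt correct the earlier failure instead of repeating it.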

Use directly with mlx_lm

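from mlx_lm import load, generate

model, tok = load("Bunnana/data-morph-gemma-2b")

# Prompt = the script-generation instructions + the metadata envelope + the task.
# See the data-morph repo (skills/script_generation_teacher.md) for the exact contract.
prompt = "<instructions + envelope + task>"  # placeholder: build this per the contract above

# The model replies with <analysis>...</analysis><script>...</script>.
reply = generate(model, tok, prompt=prompt, max_tokens=2048)  # max_tokens is an arbitrary choice
print(reply)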

This is a text‑only build — load it with mlx_lm, not mlx_vlm.

Training

  • Teacher: Claude Opus + an Agent Skill, generating 800 programmatically‑verified training pairs (every pair passed format/schema/loadability/content checks before use).
  • Student: mlx-community/gemma-4-e2b-it-bf16, fine‑tuned with LoRA (mlx_vlm.lora, SFT, train‑on‑completions); the iter‑400 checkpoint was selected on held‑out eval.
  • Compression (W7): fuse the LoRA adapter → strip the vision + audio towers → prune the 262 k vocabulary to 16 k (the corpus uses ~4.5 k tokens; a tokenizer round‑trip gate guards the cut) → quantize to 8‑bit (group size 64).
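
The vocabulary-pruning step with its round-trip gate can be illustrated on toy data. The sketch below works on token ids only, whereas the gate described above is a tokenizer round trip on text; the function names and the rule for padding the kept set up to the target size are assumptions made for illustration.

```python
def build_pruned_vocab(corpus_ids, special_ids, old_size, new_size):
    """Map old token ids to a compact range [0, new_size).

    Keeps special tokens and every id seen in the corpus, then pads with the
    lowest unused ids; the padding rule is an assumption of this sketch.
    """
    keep = set(special_ids) | {t for seq in corpus_ids for t in seq}
    if len(keep) > new_size:
        raise ValueError("corpus uses more tokens than the target vocabulary")
    for tok in range(old_size):
        if len(keep) == new_size:
            break
        keep.add(tok)
    return {old: new for new, old in enumerate(sorted(keep))}


def round_trip_gate(corpus_ids, old_to_new):
    """Fail loudly if any sequence changes after old -> new -> old remapping."""
    new_to_old = {new: old for old, new in old_to_new.items()}
    for seq in corpus_ids:
        remapped = [old_to_new[t] for t in seq]  # KeyError if a token was cut
        if [new_to_old[t] for t in remapped] != seq:
            raise AssertionError("round trip changed a sequence")
    return True


corpus = [[5, 900, 17], [17, 42, 999]]  # toy token-id sequences
mapping = build_pruned_vocab(corpus, special_ids=[0, 1], old_size=1000, new_size=16)
assert round_trip_gate(corpus, mapping)
print(len(mapping), mapping[999])
```

In a real model the same old-to-new ordering would also select which embedding rows survive, so the gate has to pass before any weights are sliced.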

Evaluation

Measured through the full pipeline on a 70‑case held‑out test set (content‑disjoint from training), scored on four metrics — Format Validity, Schema Compliance, Loadability, Content Accuracy.

Setting                  Accepted (all 4 pass)   Score   vs. teacher
one‑shot                 56 / 70                 0.811
production (retry ≤ 3)   67 / 70                 0.957   ~96 %

The student clears the project's ≥ 80 %‑of‑teacher target on every metric.
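
For reference, the acceptance counts reported above convert to rates as shown in this short calculation. It uses only the reported counts. Note that 0.957 matches 67 / 70 after rounding, while 0.811 differs from 56 / 70 = 0.800, so the Score column is presumably a separate aggregate that this card does not define.

```python
TOTAL_CASES = 70
accepted = {"one-shot": 56, "production (retry <= 3)": 67}

# Acceptance rate = cases passing all four metrics / total cases.
rates = {setting: n / TOTAL_CASES for setting, n in accepted.items()}
for setting, rate in rates.items():
    print(f"{setting}: {rate:.1%}")

one_shot_failures = TOTAL_CASES - accepted["one-shot"]
net_gain = accepted["production (retry <= 3)"] - accepted["one-shot"]
print(f"retries add a net {net_gain} accepted cases "
      f"against {one_shot_failures} one-shot failures")
```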

Model details

  • Architecture: gemma4_text (text‑only), 2.05 B parameters
  • Quantization: 8‑bit affine, group size 64
  • Vocabulary: 16,384 (pruned from 262 k)
  • Context: inherits the base model's context length
  • Framework: MLX (Apple Silicon)
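
As a rough illustration of what "8‑bit affine, group size 64" means, the sketch below quantizes a flat list of weights in groups of 64, storing one scale and one bias per group so that each weight is approximated by scale * q + bias. It is a simplified pure-Python model of the idea; the rounding and scale choices in MLX's own kernels may differ.

```python
import random


def quantize_group(weights, bits=8):
    """Affine-quantize one group: w ~= scale * q + bias, q in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant group
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo


def dequantize_group(q, scale, bias):
    """Reconstruct approximate weights from integer codes."""
    return [scale * v + bias for v in q]


def quantize(weights, group_size=64, bits=8):
    """Quantize a flat list group by group; length must divide evenly."""
    if len(weights) % group_size:
        raise ValueError("length must be a multiple of group_size")
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]
groups = quantize(weights)
restored = [w for q, s, b in groups for w in dequantize_group(q, s, b)]
max_err = max(abs(a - b) for a, b in zip(weights, restored))
max_scale = max(s for _, s, _ in groups)
print(f"{len(groups)} groups, max abs error {max_err:.2e}")
```

Because each group has its own range, the worst-case error is about half of that group's scale, which is why smaller groups trade extra metadata for accuracy.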

Limitations & ethics

  • A small model: reliable on the five trained conversion patterns; messy but in‑pattern inputs are handled well, far‑out‑of‑distribution ones may fail.
  • Hallucination / data‑loss risk is mitigated — not eliminated — by the pipeline's automated format/schema validation and retries.
  • Teacher bias from Claude Opus can propagate to the student.
  • Converted files may contain personal data; run locally and do not upload user inputs.

License

This model is a derivative of Google's Gemma and is distributed under the Gemma Terms of Use. By using it you agree to those terms, which propagate to derivatives. Base model: mlx-community/gemma-4-e2b-it-bf16.
