Model Card for Gemma-3-270m-it Food Extraction

A small, task-specific language model fine-tuned to extract food and drink items from arbitrary text (such as image captions) and return them in TOON (Token-Oriented Object Notation), a compact structured format.

Model Details

Model Description

This model is a Supervised Fine-Tuned (SFT) version of google/gemma-3-270m-it, trained to perform a single, narrow task: given a piece of free-form text, classify whether it mentions food/drink and, if so, extract the specific items as structured data. The motivation was to produce a cheap, locally-runnable model that can filter or label large volumes of image captions without paying for a frontier API or sending data off-device.

  • Developed by: Michael (personal learning project)
  • Model type: Decoder-only causal language model (Gemma 3 architecture)
  • Language(s) (NLP): English
  • License: Gemma Terms of Use
  • Finetuned from model: google/gemma-3-270m-it

Model Sources

Uses

Direct Use

The model takes a free-form English sentence and returns a TOON-encoded object describing whether the text mentions food or drink, and what specifically is mentioned. Typical inputs are image captions or short descriptive snippets, e.g.:

"For breakfast I had eggs, bacon and toast and a glass of orange juice."

Downstream Use

Suitable as a lightweight filter or labeller in a larger pipeline — for example, filtering a large dataset of image captions down to only those that reference food, before passing the filtered set to a more expensive downstream system (e.g. a recommendation engine or food-tracking app).

Out-of-Scope Use

  • General-purpose chat or open-ended question answering. The model has been specialised for one extraction task and will perform worse than the base Gemma instruction model on broad conversational tasks.
  • Languages other than English. Training data is English-only.
  • Safety-critical decisions (e.g. medical or allergen advice). Output should not be relied on for nutritional, dietary, or allergen information.
  • Inputs much longer than the training sequence length (512 tokens).

Bias, Risks, and Limitations

  • Dataset specificity. Training data (mrdbourke/FoodExtract-1k, ~1,420 rows) skews toward Western foods and image-caption-style phrasing. The model will likely under-perform on cuisines, dishes, or phrasings that are underrepresented in that distribution.
  • Mild overfitting. Training and validation loss curves diverge in the later epochs. For this narrow extraction task that is acceptable — and arguably desirable, since we want consistent structured output — but it does mean the model is unlikely to generalise gracefully to tasks outside food/drink extraction.
  • Label noise inheritance. Targets were generated by gpt-oss-120b, so any systematic errors in the teacher labels will be inherited by this model.
  • Format brittleness. Outputs use TOON formatting. Downstream consumers must parse TOON; malformed outputs are possible and should be guarded against.

Recommendations

Validate outputs with a parser before downstream use, and treat the model as a filter/heuristic rather than a source of truth for nutritional or allergen information.
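The sketch below shows what such a guard could look like. It parses a flat subset of TOON: `key: value` lines plus `key[N]: a,b,c` primitive arrays. It returns `None` whenever the output is malformed or the declared array length does not match. The field names in `REQUIRED_KEYS` (`food_or_drink`, `foods`, `drinks`) and the helpers `parse_flat_toon` / `keep_food_captions` are hypothetical. Check them against the real labels in mrdbourke/FoodExtract-1k, or use a full TOON decoder, before relying on this.

```python
from typing import Callable, Iterable, List, Optional

# Hypothetical schema: replace with the keys actually used in the training labels.
REQUIRED_KEYS = ("food_or_drink", "foods", "drinks")


def parse_flat_toon(text: str) -> Optional[dict]:
    """Parse flat TOON (`key: value`, `key[N]: a,b,c`); return None if malformed."""
    result = {}
    for raw_line in text.strip().splitlines():
        line = raw_line.strip()
        if not line:
            continue
        key, sep, value = line.partition(":")
        if not sep:
            return None  # every non-empty line must be a key/value pair
        key, value = key.strip(), value.strip()
        if key.endswith("]") and "[" in key:
            name, _, count = key[:-1].partition("[")
            if not count.isdigit():
                return None
            items = [v.strip() for v in value.split(",")] if value else []
            if len(items) != int(count):
                return None  # declared array length must match the items present
            result[name] = items
        else:
            result[key] = value
    if any(k not in result for k in REQUIRED_KEYS):
        return None
    return result


def keep_food_captions(captions: Iterable[str], generate: Callable[[str], str]) -> List[str]:
    """Keep captions whose (validated) model output says food or drink is present."""
    kept = []
    for caption in captions:
        parsed = parse_flat_toon(generate(caption))
        if parsed is not None and parsed["food_or_drink"] in ("true", "1"):
            kept.append(caption)
    return kept
```

Here `generate` is any caller-supplied function that wraps the model, for example the pipeline from the getting-started snippet. Outputs that fail to parse are simply dropped, so a malformed generation can never reach downstream code.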

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

MODEL = "makiisthebes/gemma-3-270M-Instruct-FoodExtract"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    dtype="auto",               # keep the checkpoint's stored dtype (older transformers: torch_dtype)
    device_map="auto",          # place weights on GPU if available (requires accelerate)
    attn_implementation="sdpa",
)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

messages = [{
    "role": "user",
    "content": "a photo of a person's lunch with a tuna, cheese and capers melt sandwich",
}]
# Render the Gemma chat template as text, ending with the model-turn marker.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# return_full_text=False returns only the generated TOON, not the echoed prompt.
output = pipe(prompt, max_new_tokens=256, return_full_text=False)
print(output[0]["generated_text"])

A Gradio demo is also provided in the training notebook for interactive use.

Training Details

Training Data

  • Source: mrdbourke/FoodExtract-1k
  • Size: ~1,420 rows total
  • Schema: each row contains a sequence field (input text) and two label fields, gpt-oss-120b-label and gpt-oss-120b-label-condensed. Both hold structured food/drink annotations produced by gpt-oss-120b, and the condensed field is the one used as the training target.
  • Splits used: 80% train / 10% validation / 10% test, produced via two train_test_split calls with shuffle=False to preserve order

Training Procedure

Preprocessing

Each sample is converted into a chat-style conversation:

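def sample_to_conversation(sample):
    """Turn one FoodExtract-1k row into a user -> assistant chat example."""
    return {
        "messages": [
            {"role": "user", "content": sample["sequence"]},
            {"role": "assistant", "content": sample["gpt-oss-120b-label-condensed"]},
        ]
    }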

The tokenizer's built-in Gemma chat template (<start_of_turn>user … <end_of_turn> <start_of_turn>model …) is then applied during training so that the model learns to produce the structured response after the model turn marker.
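To make the layout concrete, here is a hand-written approximation of how one user/assistant pair is rendered. It is based on Gemma 3's published chat template: a BOS token, each turn wrapped in `<start_of_turn>{role}` … `<end_of_turn>`, and the `assistant` role renamed to `model`. The function `render_gemma_chat` is hypothetical and ignores system messages and multimodal content. In practice `tokenizer.apply_chat_template(...)` is the source of truth.

```python
def render_gemma_chat(messages, add_generation_prompt=False, bos_token="<bos>"):
    """Approximate the Gemma 3 chat template for plain user/assistant text turns."""
    text = bos_token
    for message in messages:
        # Gemma names the assistant turn "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        text += f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n"
    if add_generation_prompt:
        # At inference time the prompt ends here and the model writes the TOON answer.
        text += "<start_of_turn>model\n"
    return text


example = [
    {"role": "user", "content": "a bowl of ramen and a glass of water"},
    {"role": "assistant", "content": "<TOON label goes here>"},
]
print(render_gemma_chat(example))
```

During training the whole rendered string is the example. At inference only the user turn plus the trailing `<start_of_turn>model\n` is supplied.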

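Targets use TOON instead of raw JSON. TOON encodes the same structure with fewer tokens, mainly by dropping JSON's braces and most quotation marks and by declaring array lengths (and, for uniform arrays of objects, field names) once. This reduces training cost and shortens generation at inference time.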
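The size difference can be illustrated with a simplified encoder for flat objects. It handles scalars and primitive arrays written as `key[N]: a,b,c`, and deliberately omits TOON's quoting/escaping rules and its nested and tabular forms. The record's field names are hypothetical, and the helper is not the `toon-format` package's API. Character counts are only a rough proxy for token counts, which depend on the tokenizer.

```python
import json


def _scalar(value) -> str:
    """Render a primitive the way TOON does for simple values (no quoting rules)."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def to_flat_toon(obj: dict) -> str:
    """Encode a flat dict of primitives / primitive lists in a simplified TOON style."""
    lines = []
    for key, value in obj.items():
        if isinstance(value, list):
            items = ",".join(_scalar(v) for v in value)
            lines.append(f"{key}[{len(value)}]:" + (f" {items}" if items else ""))
        else:
            lines.append(f"{key}: {_scalar(value)}")
    return "\n".join(lines)


# Hypothetical extraction record for the breakfast example above.
record = {"food_or_drink": True, "foods": ["eggs", "bacon", "toast"], "drinks": ["orange juice"]}
toon_text = to_flat_toon(record)
json_text = json.dumps(record, separators=(",", ":"))  # compact JSON, no spaces
print(toon_text)
print(len(toon_text), "chars vs", len(json_text), "chars of compact JSON")
```

Even against whitespace-free JSON, the TOON form is shorter because keys and strings lose their quotes and the braces and brackets disappear.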

Training Hyperparameters

Configured via trl.SFTConfig and trained with trl.SFTTrainer:

| Parameter | Value |
|---|---|
| Base model | google/gemma-3-270m-it |
| Epochs | 3 |
| Per-device train batch size | 16 |
| Per-device eval batch size | 16 |
| Learning rate | 5e-5 |
| LR scheduler | constant |
| Optimizer | adamw_torch_fused |
| Max sequence length | 512 |
| Packing | False |
| Gradient checkpointing | True |
| Save / eval strategy | per epoch |
| Attention implementation | sdpa |
| Logging | trackio |
  • Training regime: weights loaded via AutoModelForCausalLM.from_pretrained(..., dtype="auto"), which keeps the checkpoint's stored dtype (BF16 for this model).
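The table above can be expressed as `trl.SFTConfig` keyword arguments, roughly as below. This is a sketch, not the notebook's actual code. The argument names follow recent TRL/transformers releases (older versions used `max_seq_length` and `evaluation_strategy`, and `report_to="trackio"` needs a transformers release with the trackio integration). The `output_dir` value and the helper `build_sft_config` are hypothetical. The attention implementation is not an `SFTConfig` field: it is set when the model is loaded.

```python
# Hyperparameters from the table, as keyword arguments for trl.SFTConfig.
SFT_KWARGS = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "learning_rate": 5e-5,
    "lr_scheduler_type": "constant",
    "optim": "adamw_torch_fused",
    "max_length": 512,            # older TRL releases: max_seq_length
    "packing": False,
    "gradient_checkpointing": True,
    "save_strategy": "epoch",
    "eval_strategy": "epoch",     # older transformers releases: evaluation_strategy
    "report_to": "trackio",
}


def build_sft_config(output_dir: str = "gemma-3-270m-it-food-extract"):
    """Build the SFTConfig; TRL is imported lazily so SFT_KWARGS is usable without it."""
    from trl import SFTConfig

    return SFTConfig(output_dir=output_dir, **SFT_KWARGS)
```

The resulting config would then be passed as `args=` to `trl.SFTTrainer` together with the model, tokenizer, and the train/validation splits.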

Evaluation

Testing Data, Factors & Metrics

Testing Data

The held-out 10% test split of mrdbourke/FoodExtract-1k, which was not seen during training or hyperparameter selection.

Factors

Evaluated qualitatively across:

  • food-only vs. drink-only vs. mixed inputs
  • short captions vs. longer descriptive sentences
  • inputs containing no food/drink (negative cases)

Metrics

Evaluation is currently human review of generated outputs against the ground-truth labels, comparing the fine-tuned model side by side with the base gemma-3-270m-it. An LLM-as-judge evaluation is planned as a follow-up.
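A side-by-side review sheet like the one described can be assembled with a small helper. It collects each model's output next to the input and its ground-truth label for every test row. The helper `build_review_rows` and the model names are hypothetical. Each entry in `models` is any function mapping input text to generated text, such as a wrapped pipeline for the base or fine-tuned model. The row field names come from the dataset schema above.

```python
from typing import Callable, Dict, List


def build_review_rows(
    samples: List[dict], models: Dict[str, Callable[[str], str]]
) -> List[dict]:
    """Collect each model's output next to the ground-truth label for manual review."""
    rows = []
    for sample in samples:
        row = {
            "input": sample["sequence"],
            "expected": sample["gpt-oss-120b-label-condensed"],
        }
        for name, generate in models.items():
            row[name] = generate(sample["sequence"])
        rows.append(row)
    return rows
```

The rows can then be written to a CSV or shown in a table for a reviewer, and the same structure could later feed an LLM-as-judge prompt.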

Results

Qualitatively, the fine-tuned model:

  • Produces well-formed TOON outputs matching the training schema, where the base instruction model produces free-form prose and ignores the expected format.
  • Correctly identifies food/drink presence in most short captions.
  • Mild overfitting is visible in the loss curve (training loss continues to fall while validation loss flattens), which for this narrow structured-output task is treated as acceptable.

Summary

For its size (~270M parameters), the model produces consistent, parseable structured output for food/drink extraction, which the base instruction model does not do reliably without heavier prompting.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: 1 × NVIDIA DGX Spark
  • Hours used: (add)
  • Cloud Provider: (add or "local" if trained on-prem)
  • Compute Region: (add)
  • Carbon Emitted: (add)
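Once the fields above are known, a back-of-envelope figure follows roughly the energy × grid carbon intensity arithmetic that calculators like Lacoste et al.'s are built on. The function below is a hypothetical sketch. The optional PUE (data-centre overhead) factor is an assumption that defaults to 1.0 for local hardware. No values for this training run are filled in: power draw, hours, and the regional carbon intensity still need to be measured or looked up.

```python
def estimate_kg_co2e(
    avg_power_watts: float, hours: float, grid_kg_per_kwh: float, pue: float = 1.0
) -> float:
    """Estimate emissions as energy (kWh) x overhead (PUE) x grid intensity (kg CO2e/kWh)."""
    energy_kwh = avg_power_watts / 1000.0 * hours
    return energy_kwh * pue * grid_kg_per_kwh
```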

Technical Specifications

Model Architecture and Objective

Decoder-only transformer (Gemma 3, 270M parameters), trained with a causal language modelling objective on chat-formatted (user → assistant) examples where the assistant turn contains the TOON-encoded food/drink extraction.
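For reference, the causal language modelling objective is next-token cross-entropy over the tokens $x_1, \dots, x_T$ of each chat-formatted example:

$$
\mathcal{L}(\theta) = -\frac{1}{|\mathcal{M}|} \sum_{t \in \mathcal{M}} \log p_\theta\!\left(x_t \mid x_{<t}\right)
$$

Here $\mathcal{M}$ is the set of positions that contribute to the loss. In recent TRL releases, `SFTTrainer` on a conversational `messages` dataset counts every non-padding position by default, unless `assistant_only_loss=True` restricts $\mathcal{M}$ to the assistant (model) turn. The card does not state which setting was used.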

Compute Infrastructure

Hardware

1 × NVIDIA DGX Spark

Software

  • transformers
  • trl (SFTTrainer, SFTConfig)
  • datasets
  • torch (CUDA)
  • trackio for run logging
  • toon-format for compact structured outputs
  • gradio for the interactive demo

Citation

Base model:

@misc{gemma3_2025,
  title  = {Gemma 3},
  author = {Google},
  year   = {2025},
  url    = {https://huggingface.co/google/gemma-3-270m-it}
}

More Information

Built as a learning exercise following Daniel Bourke's two-part fine-tuning tutorial, with adaptations including the use of TOON instead of JSON for structured outputs.

Model Card Authors

Michael Peres

Model Card Contact

michaelperes562@gmail.com
