Off the Grid, On Schedule
TL;DR
- OffGridSchedula turns a pasted group chat (or a screenshot of a flyer or invite) into calendar events, with conflicts caught and a reply drafted, then a local .ics you own.
- It runs a fine-tuned Gemma 4 through llama.cpp, in the Space. No cloud AI APIs, no account, no app to install.
- A full QLoRA → GGUF fine-tuning pipeline, built solo, ships behind an eval gate that refuses to publish a model that doesn't clear the bar, and it produced an edge model of its own at parity with the base.
- Clicking Run the agents starts a two-model team, both local and both ≤ 32B: OpenBMB MiniCPM5-1B plans the steps and the fine-tuned gemma-cal E4B does the extraction.
- It's also an MCP server: any MCP-aware agent can call extract_events / make_ics / check_conflicts as tools.
- 👉 Try it: build-small-hackathon/OffGridSchedula · Model: gemma-4-cal-gguf · Planner: MiniCPM5-1B · Traces: offgridschedula-traces
Contents
- The problem
- What it does
- Architecture
- Fine-tuning Gemma 4 for calendars
- Off the grid: llama.cpp in a Space
- From app to agent: MCP, memory, and shared traces
- Built small, by the rules: constraints & badges
- Lessons learned
- What's next
- Get started
- Acknowledgements & citation
The problem
Every parent knows this chat. The class group thread scrolls past: "Picture day is Thursday at 9am, wear the green shirt." "Practice moves to Tuesday 5pm this week." "RSVP by Friday for the party." You read it once, mean to add it later, and miss it.
That's the person this was built for — not "users" in the abstract, but one busy parent whose kid's school and activity events are buried in noise. It's an honest fit for a small, local model: the input is a handful of recent messages or a screenshot, the output is a few calendar events, and the whole thing should work from a phone browser with nothing to install.
What it does
Paste the chat (or attach a screenshot of a flyer/invite). The agent returns:
- The events — title, start/end, location, a sensible default reminder.
- A conflict check against a calendar you upload (.ics), with clashes flagged.
- A ready-to-send reply ("Got it — adding both, thanks!").
Everything is surfaced for review before anything is saved. The output is a local .ics file you can import into any calendar, with an optional one-click Google Calendar push. No inbox access, no auto-reading — it reads only what you choose to paste.
Architecture
The app is a single Gradio 6 Blocks UI mounted alongside a FastAPI app on one port, so the same Space serves the UI, a /agent endpoint, and an MCP server.
Two small models, two jobs. Extraction is the fine-tuned gemma-cal E4B (~4B effective params): it reads the thread — with vision, for screenshots — and emits the single constrained ActionPlan below. A second model plans the run: OpenBMB MiniCPM5-1B as the planner, which sequences the steps via the Space's own MCP tools when you click Run the agents. Both run locally through llama.cpp (the planner as a second llama-server); neither is over 32B, and nothing leaves the box.
The model never decides whether two events clash — it decides what the events are; deterministic interval math decides when they conflict. The model emits a single constrained JSON object (a pydantic ActionPlan), grammar-constrained so it's always valid:
from pydantic import BaseModel

class Event(BaseModel):
    title: str
    start: str  # ISO 8601
    end: str | None = None
    location: str | None = None
    reminder_minutes: int | None = 30

class ActionPlan(BaseModel):
    reasoning: str
    events: list[Event] = []
    conflicts: list[Conflict] = []  # Conflict is another model, not shown in this excerpt
    proposed_times: list[str] = []
    reply_draft: str = ""
    needs_clarification: bool = False
Inference sits behind one seam — an INFERENCE_BASE_URL env var. Unset, the Space loads a GGUF and runs it in-process via llama.cpp. Set, it points at any OpenAI-compatible llama-server (e.g. one running on your own machine). Same app, two deployment shapes.
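As a rough illustration of that seam, the sketch below picks a deployment shape from the environment. Only the `INFERENCE_BASE_URL` variable name comes from the app; `resolve_backend`, the returned dict shape, and the example URL are assumptions made for this example.

```python
import os


def resolve_backend(env=None) -> dict:
    """Decide where inference runs, based on INFERENCE_BASE_URL."""
    env = os.environ if env is None else env
    base_url = env.get("INFERENCE_BASE_URL", "").strip()
    if base_url:
        # Remote shape: talk to an OpenAI-compatible llama-server.
        return {"mode": "remote", "base_url": base_url.rstrip("/")}
    # Local shape: load the GGUF and run it in-process.
    return {"mode": "in_process", "base_url": None}


print(resolve_backend({}))
print(resolve_backend({"INFERENCE_BASE_URL": "http://localhost:8080/v1/"}))
```

Passing the environment in as a plain mapping keeps the decision testable without touching real process state.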
A nice consequence: a stub extractor (USE_STUB_EXTRACTOR=1, a tiny regex heuristic) lets the entire app, the test suite, and the demo run with no GPU at all. The expensive dependency is optional from day one — which means the cheap path doubles as the CI harness and the free-tier preview.
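To give a feel for how small such a stub can be, here is a hypothetical regex heuristic. It is not the app's actual stub: the pattern, the `stub_extract` name, and the output shape are all assumptions, and it only recognizes a weekday name followed by a clock time.

```python
import re

# Hypothetical pattern: a weekday name, an optional "at", then <hour>[:<minute>]am|pm.
PATTERN = re.compile(
    r"\b(?P<day>monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b"
    r"\s+(?:at\s+)?(?P<hour>\d{1,2})(?::(?P<minute>\d{2}))?\s*(?P<ampm>am|pm)\b",
    re.IGNORECASE,
)


def stub_extract(thread: str) -> list[dict]:
    """Return rough {day, time, text} hits for each line of a chat thread."""
    hits = []
    for line in thread.splitlines():
        for m in PATTERN.finditer(line):
            # Convert 12-hour clock to 24-hour: 12am -> 0, 12pm -> 12.
            hour = int(m["hour"]) % 12 + (12 if m["ampm"].lower() == "pm" else 0)
            minute = int(m["minute"] or 0)
            hits.append({
                "day": m["day"].capitalize(),
                "time": f"{hour:02d}:{minute:02d}",
                "text": line.strip(),
            })
    return hits


thread = """Picture day is Thursday at 9am, wear the green shirt.
Practice moves to Tuesday 5pm this week.
RSVP by Friday for the party."""
for hit in stub_extract(thread):
    print(hit["day"], hit["time"])
```

The third message has a weekday but no time, so this heuristic skips it. That kind of gap is expected from a stub: its job is to exercise the UI and tests, not to match the model.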
Fine-tuning Gemma 4 for calendars
The fun part. Gemma 4 was fine-tuned to read messy chats into the ActionPlan schema — and, just as importantly, paired with the machinery to know whether a fine-tune is actually any good.
The pipeline. QLoRA via Unsloth (4-bit, r=16) → merge → convert_hf_to_gguf.py → llama-quantize Q4_K_M → published as build-small-hackathon/gemma-4-cal-gguf. Training data is hand-authored examples plus a SMCalFlow importer (Microsoft Semantic Machines, CC BY-SA 4.0) that resolves LISP "dataflow" calendar programs into concrete, self-consistent datetimes — directly training the relative-date skill ("next Thursday", "the 14th") against a spread of 2026 reference dates.
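What "resolving against a reference date" means can be shown with a small sketch. This is not the SMCalFlow importer: `next_weekday` and `day_of_month` are hypothetical helpers, and the reading of "next Thursday" as the first Thursday strictly after the reference date is an assumption, since the phrase is ambiguous in everyday English.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def next_weekday(reference: date, weekday: str) -> date:
    """First occurrence of `weekday` strictly after `reference`."""
    target = WEEKDAYS.index(weekday.lower())
    days_ahead = (target - reference.weekday() - 1) % 7 + 1  # always 1..7
    return reference + timedelta(days=days_ahead)


def day_of_month(reference: date, day: int) -> date:
    """First date on or after `reference` whose day-of-month equals `day`."""
    if not 1 <= day <= 31:
        raise ValueError("day must be between 1 and 31")
    candidate = reference
    while candidate.day != day:
        candidate += timedelta(days=1)
    return candidate


monday = date(2026, 1, 5)
print(next_weekday(monday, "Thursday"))     # → 2026-01-08
print(day_of_month(date(2026, 1, 20), 14))  # → 2026-02-14
```

Pairing each utterance with a different reference date is what forces a model to compute the answer rather than memorize it.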
The real win: an eval gate. Generic benchmarks (MMLU, etc.) don't apply to a narrow extraction task, so a task-specific scorer (training/eval.py) over a held-out set became the publishing process (training/gated_retrain.py):
Retrain → upload to a staging filename → score on the held-out eval → gate (schema_validity ≥ 0.95, event_f1 ≥ 0.81, start-exact recall ≥ 0.773) → PASS promotes staging to production (a free server-side copy); FAIL deletes staging, production untouched.
No retrain reaches users unless it beats the current best. That gate has already earned its keep — it caught regressions on bigger retrains and refused to ship them, so production was never degraded by an experiment.
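The gate's decision logic might look roughly like the sketch below. The three thresholds are the ones quoted above; the metric key names, the `gate` and `decide` functions, and the returned strings are assumptions for illustration, not the contents of training/gated_retrain.py.

```python
# Thresholds quoted in the post; the key names are assumed for this sketch.
THRESHOLDS = {
    "schema_validity": 0.95,
    "event_f1": 0.81,
    "start_exact_recall": 0.773,
}


def gate(scores: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (passed, failures) for a scored staging model."""
    failures = [
        f"{name}={scores.get(name, 0.0):.3f} < {minimum}"
        for name, minimum in THRESHOLDS.items()
        if scores.get(name, 0.0) < minimum  # a missing metric counts as a failure
    ]
    return (not failures, failures)


def decide(scores: dict[str, float]) -> str:
    """PASS promotes staging to production; FAIL deletes staging only."""
    passed, failures = gate(scores)
    if passed:
        return "promote staging"
    return "delete staging: " + "; ".join(failures)


print(decide({"schema_validity": 1.0, "event_f1": 0.81, "start_exact_recall": 0.773}))
print(decide({"schema_validity": 1.0, "event_f1": 0.74, "start_exact_recall": 0.70}))
```

Treating a missing metric as 0.0 means a broken eval run fails closed instead of promoting an unscored model.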
Here's the fine-tune's scorecard (Q4_K_M, n=28), with the base model as context:
| Metric | Fine-tune gemma-4-cal | Base Gemma 4 (context) |
|---|---|---|
| schema validity | 1.00 | 1.00 |
| no-event accuracy | 1.00 | 1.00 |
| clarification recall | 1.00 | 0.75 |
| end-time exact | 1.00 | 1.00 |
| event precision | 0.85 | 1.00 |
| start-exact recall | 0.77 | 0.955 |
| event F1 | 0.81 | 0.977 |
The standout is clarification discipline — the fine-tune reliably asks when a thread is "date TBD" instead of guessing, where the base model sometimes plows ahead. The base model is genuinely strong on this narrow task, so rather than argue about it, the eval decides what production serves — and the published fine-tune stands as the Well-Tuned artifact.
Where fine-tuning shines: the edge model. The effort was then re-aimed at a Gemma 4 E4B (~4B) model that runs on the free CPU tier, with no paid GPU. There were six gated runs, each fixing a diagnosed miss, with schema validity holding at 1.0 the entire time. On an expanded 60-example eval, the fine-tuned E4B landed at F1 0.97 / recall 0.96 / clarification 1.0: an exact tie with stock E4B (identical 48/1/2 true/false/missed counts), at zero quality cost and on a model small enough to run on a laptop. A self-owned, local fine-tune at parity with the base is exactly the "small + well-tuned" sweet spot the constraint rewards.
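The headline numbers follow directly from those counts. The short check below assumes that "48/1/2 true/false/missed" means 48 true positives, 1 false positive, and 2 missed events (false negatives); `prf` is a hypothetical helper, not the scorer in training/eval.py.

```python
def prf(true_pos: int, false_pos: int, missed: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from raw event counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + missed)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = prf(48, 1, 2)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# → precision=0.98 recall=0.96 f1=0.97
```

Under that reading, F1 works out to 96/99, which rounds to the 0.97 reported, and recall is 48/50 = 0.96.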
Two honest notes the eval forced onto the page:
- Quantization was exonerated. The same weights were scored at f16 / Q8 / Q4, and precision wasn't the bottleneck. Good to know before chasing a fix that wouldn't have helped.
- Both the base and the fine-tune are ≤ 32B (31B and ~4B), satisfying the size constraint, and the "no cloud AI" rule applies to the running app — dataset prep can use any offline tooling.
Off the grid: llama.cpp in a Space
"Local-first" meant: inference runs in the Space, not via a hosted LLM API. The build started on ZeroGPU but pivoted to a Docker Space built on the official ggml-org/llama.cpp image — llama-server is always warm, and (crucially) there's no multi-minute build cost compiling llama.cpp from source on the free builders. The big 31B GGUF wants a GPU; the E4B variant runs on the free CPU tier, which is the honest "runs on a laptop" story. Running the agents spins up a second local llama-server for the MiniCPM5-1B planner — still entirely on-box.
The whole point: your messages never touch a third-party AI service. The default output is a file you own.
From app to agent: MCP, memory, and shared traces
It's an MCP server. Gradio's mcp_server=True turns typed functions into tools any MCP-aware client (Claude Desktop, Cursor, …) can call — so a larger agent can use scheduling as an off-grid skill:
def extract_events(thread: str, images: list[str] | None = None) -> dict:
    """Extract calendar events from a chat thread (and optional screenshots).

    Args:
        thread: the conversation text to scan for events.
        images: optional base64 data-URIs of screenshots.

    Returns: an ActionPlan (events, conflicts, proposed_times, reply_draft).
    """
…plus make_ics(events) and check_conflicts(events, ics_base64). The MCP schema is generated straight from the type hints and docstrings.
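For readers who haven't written iCalendar by hand, here is a minimal sketch of what a make_ics-style tool has to produce. It is not the Space's implementation: `make_ics_sketch`, the PRODID, and the UID domain are placeholders, events are plain dicts, and times are emitted as floating local times with no reminders or line folding.

```python
import uuid
from datetime import datetime, timezone


def escape_text(value: str) -> str:
    """Escape the characters RFC 5545 reserves in TEXT values."""
    return (value.replace("\\", "\\\\").replace(";", "\\;")
                 .replace(",", "\\,").replace("\n", "\\n"))


def to_ics_time(iso: str) -> str:
    """Format an ISO 8601 string as a floating local DATE-TIME (any UTC offset is dropped)."""
    return datetime.fromisoformat(iso).strftime("%Y%m%dT%H%M%S")


def make_ics_sketch(events: list[dict]) -> str:
    """Build a minimal VCALENDAR with one VEVENT per event dict."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0", "PRODID:-//example//ics-sketch//EN"]
    for event in events:
        lines += [
            "BEGIN:VEVENT",
            f"UID:{uuid.uuid4()}@example.invalid",  # placeholder domain
            f"DTSTAMP:{stamp}",
            f"DTSTART:{to_ics_time(event['start'])}",
        ]
        if event.get("end"):
            lines.append(f"DTEND:{to_ics_time(event['end'])}")
        lines.append(f"SUMMARY:{escape_text(event['title'])}")
        if event.get("location"):
            lines.append(f"LOCATION:{escape_text(event['location'])}")
        lines.append("END:VEVENT")
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines) + "\r\n"  # iCalendar lines end in CRLF


ics = make_ics_sketch([{"title": "Picture day, green shirt", "start": "2026-03-12T09:00:00"}])
print(ics)
```

A fuller version would add a VALARM for `reminder_minutes`, carry timezone information, and fold long lines; those are left out to keep the sketch short.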
…and it consumes its own tools. Click Run the agents and the loop closes: a local OpenBMB MiniCPM5-1B planner drives those same MCP tools as a multi-step agent — extract_events → check_conflicts → make_ics — with every step streamed to the UI. So the Space is both sides of MCP at once: it exposes scheduling as tools, and it uses them, planned by a second small model. Still no cloud AI — the planner is a second local llama-server, and if it isn't configured the orchestrator falls back to a deterministic scripted plan, so it works either way.
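The planner-or-script fallback can be sketched in a few lines. The tool order is the one named above; everything else (`run_agent`, the shared `state` dict, the lambda tools standing in for real MCP calls) is hypothetical and only meant to show the control flow.

```python
from typing import Callable, Optional

# Fixed order used when no planner is configured.
SCRIPTED_PLAN = ["extract_events", "check_conflicts", "make_ics"]


def run_agent(
    tools: dict[str, Callable[[dict], dict]],
    planner: Optional[Callable[[], list[str]]] = None,
    state: Optional[dict] = None,
) -> tuple[dict, list[str]]:
    """Run tools in planner order, or in the scripted order if no planner is given."""
    state = dict(state or {})
    steps = planner() if planner is not None else SCRIPTED_PLAN
    trace = []
    for name in steps:
        state.update(tools[name](state))  # each tool reads and extends the shared state
        trace.append(name)  # the real UI streams steps; this sketch only records them
    return state, trace


# Stand-in tools so the sketch runs without any model.
tools = {
    "extract_events": lambda s: {"events": [{"title": "Picture day"}]},
    "check_conflicts": lambda s: {"conflicts": []},
    "make_ics": lambda s: {"ics": f"{len(s['events'])} event(s)"},
}
state, trace = run_agent(tools, state={"thread": "Picture day is Thursday at 9am"})
print(trace)
```

With no planner supplied, the run follows the scripted order, which is the "works either way" behavior described above.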
Memory that grows with you, on your device. Facts like "Dana is the soccer coach" or "you decline Mondays" personalize every extraction. They live in your browser's localStorage — per-user by default, never on a server — and are injected into the prompt at run time. You can seed them with a 10-second onboarding, or import contacts (.vcf/CSV) and a calendar (.ics), parsed locally.
Sharing is Caring. One click publishes a redacted run trace to a public dataset, offgridschedula-traces, for others to learn from. The trace is structural by design — counts and stage names, never your chat text — and redaction is forced for anything public.
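One way to get "structural by design" is an allowlist: a trace record keeps only fields that are known to be safe, and everything else is dropped without being inspected. The sketch below illustrates the idea; the field names, stage names, and `redact` function are assumptions, not the app's actual trace schema.

```python
# Assumed field and stage names, for illustration only.
COUNT_KEYS = {"event_count", "conflict_count"}
STAGES = {"extract_events", "check_conflicts", "make_ics"}


def redact(record: dict) -> dict:
    """Keep only allowlisted structural fields; anything else never leaves."""
    clean = {}
    stage = record.get("stage")
    if isinstance(stage, str) and stage in STAGES:
        clean["stage"] = stage
    for key in sorted(COUNT_KEYS):
        value = record.get(key)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, int) and not isinstance(value, bool):
            clean[key] = value
    return clean


raw = {
    "stage": "extract_events",
    "event_count": 2,
    "chat_name": "Room 4 parents",
    "thread": "Picture day is Thursday at 9am",
}
print(redact(raw))  # → {'stage': 'extract_events', 'event_count': 2}
```

Because nothing outside the allowlist is copied, a new free-text field added later is excluded by default rather than leaked by default.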
Built small, by the rules: constraints & badges
Build Small comes with hard rules and a sash of collectable badges. Here's the honest scorecard.
The hard constraints.
- Every model ≤ 32B. Two models, both far under the cap: extraction is gemma-cal E4B (~4B effective params, ~5 GB at Q4) and planning is OpenBMB MiniCPM5-1B (1B). No frontier API anywhere in the loop.
- A Gradio app on a Hugging Face Space. A Gradio 6 Blocks UI mounted on a Docker SDK Space running llama.cpp.
- A demo video and a social post, linked from the README. ▶ Demo video · social posts on X (1) and X (2).
- README frontmatter tags + a short write-up. Namespaced track:* / sponsor:* / achievement:* tags and the idea-and-tech write-up live at the top of the README.
- No ZeroGPU sprawl. Runs on a single dedicated T4 (and, in stub mode, the free CPU tier), so there's no ZeroGPU dependency to cap.
The track and the sponsors.
- 🏡 Backyard AI — built for one specific real person (the busy parent), not "users" in the abstract. Short pasted chats and screenshots are exactly a small local model's wheelhouse — an honest fit, not a stretch.
- 🟢 Modal — the whole fine-tune lifecycle ran on Modal serverless GPUs: dataset → QLoRA train → GGUF export → the 60-example eval → the gate that rejected eight regressed models before this one shipped.
- 🌱 OpenBMB MiniCPM — the planner behind Run the agents, driving the Space's own MCP tools as a visible multi-step agent.
The six badges — all claimed.
- 🔌 Off the Grid: all inference is local llama.cpp; no cloud AI APIs. The only optional outbound call is your Google Calendar push.
- 🎯 Well-Tuned: gemma-cal E4B, the published QLoRA fine-tune, is the model production serves, shipped through the eval gate with the scorecard public.
- 🎨 Off-Brand: a bespoke landing page, hero + carousel, grouped nav, and custom results/Activity surfaces, far past the stock Gradio look.
- 🦙 Llama Champion: the official ggml-org/llama.cpp server image runs the GGUF + vision mmproj.
- 📡 Sharing is Caring: one click publishes a redacted run trace to the public offgridschedula-traces dataset.
- 📓 Field Notes: this post.
Two honest footnotes the rules deserve: E4B is a MatFormer "effective-4B" (judges' call on whether that's tiny enough for the Tiny Titan nod), and the "no cloud AI" rule applies to the running app — the offline dataset prep and training on Modal are fair game.
Lessons learned
A field-notes dump of things that cost time here so they don't cost yours:
Gradio 6 ignores css= / js= set on gr.Blocks when it's mounted under FastAPI. Custom styling silently doesn't apply. The fix that stuck: inject the CSS at mount time and the JS as a real inline <script> before </body> via middleware, so it always executes:
html = html.replace("</body>", f"<script>({CAROUSEL_JS})()</script></body>")
display:none ≠ "responsive." Gradio's tab strip needed explicit CSS to hide on narrow screens; the nav buttons click the real hidden Gradio tabs underneath. A MutationObserver re-wires everything after each Gradio swap so dynamic updates don't break it.
Make the expensive dependency optional on day one. The stub extractor + lazy imports meant the same codebase is the test harness, the offline demo, and the free tier — all at once.
The HF MCP badge is Gradio-SDK only. A pure FastAPI app or a bare llama-server won't advertise MCP; you need Gradio's mcp_server=True and typed, docstringed tool functions.
Traces are redacted by construction, not after the fact. The activity bus only ever emits counts and short status strings; the one free-text field (a chat name) is stripped. An allowlist beats a denylist when the cost of a leak is someone's private calendar.
Be honest about "off the grid." A 31B GGUF (~18–20 GB) wants a GPU — and "your own cloud GPU" is easy to conflate with "a cloud AI API." The headline model stayed honest, and the E4B edge variant shipped as the truly-local path.
And the meta-lesson: an eval that gates publishing changes how you work. Once "does it beat the bar?" is automated, you stop arguing about vibes and start shipping only what's measurably better — or, as happened here, a strong prompt on a capable base turns out to be the right answer sometimes, and the fine-tune is kept for where it actually wins.
What's next
- Polish the optional Mac collector (reading ~/Library/Messages/chat.db, including the attributedBody blobs modern Messages uses).
- Grow the on-device memory beyond contacts and preferences.
- Push the E4B edge model further so the fully-local path matches the 31B headline on the hardest relative-date cases.
Get started
- Try it (30 seconds, no install): open the Space, tap Try a sample, and watch a chat become events + a conflict check + a reply.
- The model: build-small-hackathon/gemma-4-cal-gguf (QLoRA fine-tune of Gemma 4, GGUF for llama.cpp).
- The planner: openbmb/MiniCPM5-1B-GGUF, which drives the multi-step orchestration behind Run the agents, locally via llama.cpp.
- The traces dataset: ParetoOptimal/offgridschedula-traces.
- Use it as a tool: add the Space as an MCP server and call extract_events from your own agent.
Acknowledgements & citation
Built on the shoulders of Gemma 4, OpenBMB MiniCPM, llama.cpp / ggml, Gradio, and Unsloth. Training data uses SMCalFlow (Semantic Machines et al., "Task-Oriented Dialogue as Dataflow Synthesis," TACL 2020; CC BY-SA 4.0).
@misc{offgridschedula2026,
  title = {OffGridSchedula: a local-first, fine-tuned scheduling agent},
  author = {ParetoOptimal},
  year = {2026},
  howpublished = {\url{https://huggingface.co/spaces/build-small-hackathon/OffGridSchedula}}
}