SolarHive 26B A4B LoRA — Community Solar Energy Intelligence

LoRA fine-tuned adapters for Gemma 4 26B A4B, specialized in community solar energy management with native function calling and multimodal visual question answering.

Built for the Gemma 4 Good Hackathon (Google DeepMind x Kaggle).


Base Model	google/gemma-4-26b-a4b-it
Architecture	MoE — 25.2B total, 3.8B active (8/128 experts)
Fine-Tuning	LoRA via Unsloth (BF16)
Training Data	1,727 examples (solarhive-community-solar-multimodal: 1,713 text + 14 image-grounded); the fine-tune is text-only, so VQA at inference uses the base Gemma 4 vision encoder (~550M params), unmodified by our LoRA per the Vertex AI SFT recipe
Converged Loss	0.6956
Benchmark	9/10 (5/5 domain Q&A + 4/5 tool calling) + 3/3 When2Call — May 2026 final run, multi-call regression on TQ5 (see Multi-Variant Deployment Validation below)
Training Time	7,198 seconds (~120 minutes)
Compute	Google Colab Pro
License	MIT (adapters) / Gemma Terms (base model)

Model Overview

SolarHive is an AI energy advisor for community solar microgrids. It helps suburban neighborhoods collectively optimize distributed solar generation and shared battery storage through natural language conversation, visual inspection, and live data integration.

This is the cloud inference model. It powers the live demo with full multimodal VQA and native function calling. For edge deployment via Ollama (privacy-first, no internet required), see the companion SolarHive E4B Ollama.

This repository contains LoRA adapters only — you need the base Gemma 4 26B A4B model to use them. The adapters add domain expertise in solar energy, battery management, grid optimization, and community coordination while preserving the base model's general capabilities.

What These Adapters Add

Domain expertise in solar production, battery management, grid pricing, panel inspection, and community energy coordination
Improved function calling for four energy-specific tools (weather, solar production, battery state, grid status)
Visual question answering for sky condition analysis, panel health inspection, and neighborhood aerial assessment
Grounded responses that reference real data from live APIs rather than hallucinating numbers

Benchmark Results

Evaluated on held-out questions not seen during training:

Domain Q&A (5/5)

Question	Result
"What happens to solar production when humidity exceeds 80%?"	Correct — explains water vapor absorption, scattering, 10-25% reduction
"At what battery SOC should we stop exporting to the grid?"	Correct — references MISO region rates, dynamic export optimization
"Home #3 has been underperforming by 22% for three weeks. Diagnostic checklist?"	Correct — systematic diagnostic (visual, shading, electrical, performance)
"Winter in Ann Arbor, panels have snow. Prioritize actions."	Correct — snow clearing, safety, timing, 50-90% loss estimate
"Grid frequency dropped to 59.8 Hz. What does that mean for our microgrid?"	Correct — generation deficit, stability implications, operational guidance

Tool Calling (1/3)

Question	Expected Tool	Called	Status
"What's the current battery state?"	`get_battery_state`	Direct answer	Fail
"Solar production in Seattle?"	`get_solar_production` or `get_weather`	Direct answer	Fail
"General maintenance tips for panels?"	None (no tool needed)	None	Pass

Note: The isolated benchmark (single-turn) scores 6/8 (5/5 Q&A + 1/3 tool calling). In the full agentic loop the model scores 8/8; see below.

Production Benchmark (8/8) — Inference Agentic Loop

When evaluated using generate_with_tools() with tool schemas in context, the model scores 8/8 (5/5 Q&A + 3/3 tool calling):

Q&A (5/5) — same questions, same correct answers as above.

Tool Calling (3/3):

Question	Expected	Called	Status
"What's the current battery state?"	`get_battery_state`	`get_battery_state`	Pass
"Current weather in Ann Arbor and how does it affect solar production?"	`get_weather`	`get_weather`	Pass
"General maintenance tips for panels?"	None	None	Pass

The difference: the agentic loop passes tool schemas via apply_chat_template(tools=[...]), giving the model the function signatures it was trained on. The isolated benchmark tests raw generation without tool context.

Multi-Variant Deployment Validation (Final Run, May 2026)

The 26B A4B LoRA + base is the baseline of the multi-variant comparison. Score on the 10-question parity benchmark (5 Q&A + 5 tool):

Score: 5/5 Q&A + 4/5 tool = 9/10

The single FAIL is the lenient multi-call probe, "Compare today's irradiance forecast across Ann Arbor, Phoenix, and Seattle" (min_calls=2), where this A4B LoRA returned no tool call. Four of the five variants run share this multi-call failure mode; only the E4B LoRA + base variant chained the multi-city calls (3 × get_weather). The pattern reproduces across runs, so it is systematic rather than stochastic.

Inference-time When2Call Validation — A4B LoRA scores 3/3 (directly measured)

Three held-out probes from Ross et al. (2025), When2Call: When (not) to Call Tools, arXiv:2504.18851. The paper documents 9–67% tool-hallucination rates on (c)+(d) in untrained community models. The A4B LoRA passes all three probes (3/3, directly measured in the May 2026 inference run), confirming that the SolarHive fine-tune — which includes 16 explicit unable-to-answer + follow-up clarification examples following the When2Call taxonomy — handles refusal + follow-up behaviors correctly:

Probe	Question	A4B LoRA behavior
(b) Tool routing	"What's the current grid rate?"	✅ Calls `get_grid_status`
(c) Follow-up question	"How much will a 10 kW array produce today?"	✅ Asks for location instead of auto-filling Ann Arbor
(d) Refuse + redirect	"What's the current air quality index in Ann Arbor?"	✅ Explicit disclaimer: "I don't have a dedicated air quality tool, but I can check the weather…"

Compare this with the E4B family (solarhive-e4b-lora and solarhive-e4b-ollama), which both score 2/3 on the same probes: they pass (b) and (d) but fail (c) by auto-filling the location instead of asking back. The +1/3 W2C delta between the A4B family (3/3 across LoRA + merged + NF4, all measured) and the E4B family (2/3 across LoRA + merged) is consistent with size-vs-refusal scaling, although three probes are a small sample. A4B outperforming E4B on these reasoning-heavy probes was the pre-stated hypothesis going in, not a discovery. The official Google Gemma 4 Core docs say in the "Parameter sizes and quantization" section: "Models with higher parameters and bit counts (higher precision) are generally more capable, but are more expensive to run." This 26B A4B variant accesses ~25B total knowledge capacity (3.8B active per token via MoE sparsity) and a ~550M vision encoder, versus E4B's 8B total / 4.5B effective / ~150M vision encoder. The When2Call paper documents the same size-vs-refusal scaling empirically. A4B is the right deployment target for under-specified or out-of-scope queries; E4B handles the well-specified-routing volume at the edge.

Quantitative reinforcement from Unsloth's published Gemma 4 benchmarks:

Benchmark	26B A4B	E4B	A4B − E4B gap
MMLU Pro	82.6%	69.4%	+13.2 pts
MMMU Pro	73.8%	52.6%	+21.2 pts
AIME 2026	88.3%	42.5%	+45.8 pts
LiveCodeBench v6	77.1%	52.0%	+25.1 pts

The 45.8 pp AIME gap (math reasoning) and the 21.2 pp MMMU Pro gap are consistent with the E4B family's When2Call regression on probe (c): refusal and follow-up behavior is a reasoning task, and the published reasoning-benchmark delta points the same way as the 1-of-3 behavioral regression (2/3 versus 3/3) we observed.

Precision Note — BF16 is Gemma 4's Native Release Format

This repository contains LoRA adapter weights only — apply them on top of Google's open-source google/gemma-4-26b-a4b-it base via Unsloth's FastVisionModel.from_pretrained(...) at inference time. Both the base model and the adapters are in BF16, which is Gemma 4's native release precision — there is no FP32 release to begin with, so applying BF16 LoRA over a BF16 base is not a quantization downgrade; it is the same numerical precision Google published.

Variant	Precision	Repository	Use case
This repo — LoRA adapters	BF16 (~2 GB adapter weights)	`solarhive-26b-a4b-lora`	Apply over base at runtime; smallest download; needs Unsloth
Pre-merged BF16 weights	BF16 (~48 GB full model)	solarhive-26b-a4b-merged	`from_pretrained(...)` directly; no PEFT/Unsloth dep
NF4 quantized	4-bit packed (~16 GB)	solarhive-26b-a4b-nf4	HF Spaces / 24 GB+ GPU deployment

All three variants are derived from the same fine-tuning run; the LoRA delta in this repo is the canonical source. The merged and NF4 variants exist for deployment convenience.

Training Details

Hyperparameters

Parameter	Value
Method	LoRA via Unsloth `FastVisionModel` (BF16, RTX PRO 6000 Blackwell 102 GB)
LoRA rank	16
LoRA alpha	16
LoRA dropout	0
Target modules	All linear layers
Learning rate	2e-4
Optimizer	AdamW 8-bit
Warmup steps	5
Epochs	3
Max sequence length	2048
Precision	BF16
Seed	3407
Trainable parameters	505.4M / 26.3B (1.92%)

Training Data — 1,727 Examples

The canonical training corpus is solarhive-community-solar-multimodal — 1,727 rows (1,713 text + 14 image-grounded). The full hand-crafted portion is preserved verbatim in solarhive_datagen.py Cell 7a (LEGACY_DATA + LEGACY_TOOL_CALL_DATA), and the API-grounded portion is reproducible at training time via _fetch_api_examples().

Three complementary text sources, plus a small image-grounded set, ensure both breadth and depth:

413 hand-crafted Q&A spanning 15+ US cities and 9 energy domains:

Sky conditions and cloud impact on production
Battery management and charge/discharge strategy
Panel health diagnostics and maintenance
Consumption optimization and load shifting
Community and grid coordination strategy
Emergency resilience and outage planning
Seasonal planning and weather adaptation
Multi-step reasoning across multiple data sources
Alternative storage (fuel cells, thermal)

~1,117 API-grounded Q&A generated from live data:

Open-Meteo (GHI, DNI, DHI, low/mid/high cloud cover), PVWatts, OpenWeatherMap, EIA APIs
Joined on (location, hourly timestamp) so each multi-source example carries co-occurring grounding
Locations: Ann Arbor, MI and San Mateo, CA
Every numeric claim traces back to a real API response

183 tool-calling examples trained with the When2Call taxonomy — 106 should-call, 53 should-not-call, 10 unable-to-answer, 6 follow-up clarification, 8 failure-recovery — so the model learns when to call tools, when to refuse politely, when to admit a tool can't answer, and when to ask a clarifying question.

14 image-grounded Q&A turns from 7 manually-labeled Ann Arbor sky photographs — cloud type and percentage cloud cover are human-confirmed, expected production traces back to the cloud-cover label via the same temperature-derated GHI formula.
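The corpus arithmetic above can be checked with a few lines of Python. This is only a sanity-check sketch: the dictionary keys are illustrative names, and the API-grounded count is entered as exactly 1,117 although the text gives it as approximate.

```python
# Row counts as stated in this section; key names are illustrative only.
corpus = {
    "hand_crafted_qa": 413,
    "api_grounded_qa": 1117,  # the text states this count as approximate
    "tool_calling": 183,
}
tool_calling_breakdown = {
    "should_call": 106,
    "should_not_call": 53,
    "unable_to_answer": 10,
    "follow_up_clarification": 6,
    "failure_recovery": 8,
}
image_grounded = 14

text_rows = sum(corpus.values())
total_rows = text_rows + image_grounded

# The When2Call categories should add up to the tool-calling subtotal.
assert sum(tool_calling_breakdown.values()) == corpus["tool_calling"]
print(f"text rows: {text_rows}, total rows: {total_rows}")
```

If the API-grounded count shifts between data-generation runs, the text and total row counts shift with it.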

Training Loss

Metric	Value
Converged loss (last 20 steps)	0.6956
Final step loss	0.727
Minimum loss	0.357
Total steps	645
Training time	7,198 seconds
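
A minimal sketch of how a "last 20 steps" figure can be computed from logged step losses. It assumes the converged loss is the arithmetic mean over the final window; the table does not state the exact aggregation, and the function name is hypothetical.

```python
from statistics import fmean

def converged_loss(step_losses: list[float], window: int = 20) -> float:
    """Mean loss over the final `window` logged steps (assumed definition)."""
    if not step_losses:
        raise ValueError("no logged losses")
    # Slicing with a negative index keeps the whole list when it is shorter than the window.
    return fmean(step_losses[-window:])
```

Averaging over a window smooths the step-to-step noise that makes the final step loss (0.727) differ from the converged figure (0.6956).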

Hardware

GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition (102 GB GDDR7 total, 94.97 GB max usable per Unsloth)
Platform: Google Colab Pro (G4 VM)
Precision: BF16 (no quantization during training)

How to Use

Loading with Unsloth (Recommended)

Standard PEFT cannot handle Gemma 4's Gemma4ClippableLinear layers. Use Unsloth's FastVisionModel for reliable adapter loading:

from unsloth import FastVisionModel
import torch

# Load base model + LoRA adapters
model, processor = FastVisionModel.from_pretrained(
    model_name="google/gemma-4-26b-a4b-it",
    adapter_name="Truthseeker87/solarhive-26b-a4b-lora",  # This repo
    dtype=torch.bfloat16,
    device_map="auto",
)
FastVisionModel.for_inference(model)

Two-Step Tokenization (Required)

Single-step apply_chat_template(tokenize=True) crashes in transformers 5.5.x on messages without a "content" key (e.g., tool_calls messages). Use this two-step pattern:

messages = [
    {"role": "system", "content": "You are SolarHive, an AI energy advisor..."},
    {"role": "user", "content": "How will today's weather affect our solar production?"},
]

# Step 1: render text (tokenize=False)
# `tools` is the list of tool functions defined under Native Function Calling below
text = processor.apply_chat_template(
    messages, tools=tools,
    add_generation_prompt=True,
    enable_thinking=False,
    tokenize=False,
)

# Step 2: tokenize separately
inputs = processor(text=text, images=None, return_tensors="pt").to(model.device)
# do_sample=True makes the sampling parameters below take effect
output = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=1.0, top_p=0.95, top_k=64)
response = processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

Native Function Calling

Define tools as Python functions with Google-style docstrings. Gemma 4 autonomously decides which to invoke:

def get_weather(location: str) -> dict:
    """Get current weather conditions for a location.

    Args:
        location: City name, e.g. 'Ann Arbor, MI'

    Returns:
        dict with temp_f, clouds_pct, wind_mph, humidity, sunrise, sunset
    """
    # Your API call here
    ...

def get_solar_production(clouds_pct: int, temp_f: float) -> dict:
    """Get estimated community solar production using GHI irradiance data.

    Args:
        clouds_pct: Cloud cover percentage (0-100)
        temp_f: Temperature in Fahrenheit

    Returns:
        dict with production_kw, capacity_kw, efficiency_pct, ghi_wm2
    """
    ...

tools = [get_weather, get_solar_production, get_battery_state, get_grid_status]

text = processor.apply_chat_template(
    messages, tools=tools,
    add_generation_prompt=True,
    enable_thinking=False,
    tokenize=False,
)

The model emits tool calls as call:fn_name{arg: "value"} in its output, parsed via regex r'call:(\w+)\{([^}]*)\}'.
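
A minimal parsing sketch built on the regex quoted above. The outer pattern is the one from the text; the argument pattern and the function name `parse_tool_calls` are assumptions made for illustration, and the project's own parser may differ. Argument values are returned as strings, so numeric arguments still need casting before dispatch.

```python
import re

# Outer pattern quoted in the text: function name, then a brace-delimited argument body.
TOOL_CALL_RE = re.compile(r'call:(\w+)\{([^}]*)\}')
# Assumed argument pattern: key, colon, then a double-quoted string or a bare token.
ARG_RE = re.compile(r'(\w+)\s*:\s*(?:"([^"]*)"|([^,\s]+))')

def parse_tool_calls(text: str) -> list[dict]:
    """Extract every call:fn{...} occurrence as {"name": ..., "arguments": {...}}."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        args = {}
        for key, quoted, bare in ARG_RE.findall(body):
            # Exactly one of the two groups matched; an empty quoted string stays empty.
            args[key] = bare if bare else quoted
        calls.append({"name": name, "arguments": args})
    return calls

sample = 'call:get_weather{location: "Ann Arbor, MI"} call:get_battery_state{}'
print(parse_tool_calls(sample))
```

Because the outer pattern stops at the first closing brace, an argument value that itself contains `}` would be truncated; that limitation comes from the regex as quoted.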

Core Capabilities

1. Multimodal Visual Question Answering (3 Modes)

Mode	Input	Output
Sky Analysis	Sky photograph	Cloud coverage %, production forecast, storage recommendation
Panel Inspection	Panel photograph	Dirt/damage/shading detection, efficiency impact estimate
Neighborhood Assessment	Aerial/satellite image	Panel inventory, expansion priorities, shading analysis

2. Native Function Calling (5 Tools — all 3 keyed APIs wired)

Tool	API	Returns
`get_weather(location)`	OpenWeatherMap (`OWM_API_KEY`)	Temperature, clouds %, wind, humidity, sunrise/sunset
`get_solar_production(clouds_pct, temp_f)`	Open-Meteo GHI (keyless)	Production kW, efficiency %, GHI W/m², temp derating
`get_battery_state()`	Community BMS (sim)	State of charge, capacity, charging status
`get_grid_status()`	EIA Open Data (`EIA_API_KEY`)	Pricing period, rate/kWh, renewable %, CO2 intensity
`get_nrel_pvwatts_baseline()`	NREL PVWatts v8 (`NREL_API_KEY`)	Annual + current-month typical kWh + avg kW for the 72 kW array

Tool results feed back as a 2-message sequence matching the training distribution:

{"role": "assistant", "tool_calls": [{"function": {"name": ..., "arguments": ...}}, ...]}
{"role": "tool",      "name": "<fn>", "content": json.dumps(result)}  # one per tool call

This format is shared across solarhive_datagen.py (training-data generation), solarhive_finetune.py (SFT preprocessing + schema validation), solarhive_inference.py Cell 4, and test_ollama_tools.py Solution B — inference matches the training distribution exactly.
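
The two message shapes above can be assembled with a small helper. This is a sketch under stated assumptions: `tool_result_messages` is a hypothetical name, and each parsed call is assumed to be a dict with `name` and `arguments` keys; the referenced scripts may build the messages differently.

```python
import json

def tool_result_messages(calls: list[dict], results: list[dict]) -> list[dict]:
    """Return one assistant tool_calls message followed by one tool message per call."""
    if len(calls) != len(results):
        raise ValueError("need exactly one result per tool call")
    # The assistant message carries tool_calls and deliberately has no "content" key.
    assistant = {
        "role": "assistant",
        "tool_calls": [
            {"function": {"name": c["name"], "arguments": c["arguments"]}}
            for c in calls
        ],
    }
    tool_msgs = [
        {"role": "tool", "name": c["name"], "content": json.dumps(r)}
        for c, r in zip(calls, results)
    ]
    return [assistant, *tool_msgs]
```

The missing "content" key on the assistant message is the case that makes the two-step tokenization pattern necessary.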

3. Selective Tool Reasoning

The model reasons about which tools are relevant — not blindly calling everything:

"What time does peak pricing start?"
→ Calls: get_grid_status() only

"Is today's production above typical for January?"
→ Calls: get_solar_production() + get_nrel_pvwatts_baseline()

"Should I run my pool heater now?"
→ Calls: get_weather() + get_solar_production() + get_battery_state() + get_grid_status()

"What are general maintenance tips?"
→ Calls: none (answers from training knowledge)

4. Inference-time When2Call Validation

Three held-out probes validate coverage of 3 of the 4 failure-mode categories from Ross, H., Mahabaleshwarkar, A. S., & Suhara, Y. (2025). When2Call: When (not) to Call Tools. arXiv:2504.18851. The paper documents 9–67% tool-hallucination rates in untrained community models on (c) and (d):

Category	Probe	Expected behavior
(b)	"What's the current grid rate?"	Correct tool call (`get_grid_status`) — well-specified, in-scope
(c)	"How much will a 10 kW array produce today?"	Follow-up question (asks for location) — does NOT auto-fill Ann Arbor
(d)	"What's the current air quality index in Ann Arbor?"	Polite refusal + redirect (e.g., airnow.gov) — does NOT hallucinate a tool

Models trained without explicit unable-to-answer and follow-up clarification examples typically fail (c) + (d). The SolarHive corpus includes 16 such examples (10 unable-to-answer + 6 follow-up clarification) following the When2Call taxonomy; this A4B LoRA passes all 3 probes (3/3, directly measured in the May 2026 inference run).

Live Demo Output

Five interactive queries from the April 15, 2026 demo capture with real sky photos from Ann Arbor:

Query 1 — Sky photo uploaded (VQA Mode 1)

Q: "How is current cloud cover affecting potential household solar production?"

A: "The image shows heavy, uniform overcast conditions with thick gray cloud cover obscuring the sun completely. This type of overcast sky typically reduces solar production to 10-25% of clear-sky capacity. At this moment, expect roughly 10-25 kW community output from your 72 kW array. This is a good time to conserve battery for evening peak hours and defer non-essential loads."

Query 2 — Tool calling (solar production)

Q: "What was estimated solar production at 9.00am Eastern time?"

Tools called: get_solar_production

A: "At 9:00am in Ann Arbor, MI, the community produced 40.4 kW out of 72 kW capacity (56.1% efficiency). GHI was 660.0 W/m² with 10% cloud cover and no temperature derating at 65°F. This is a solid early morning output — production will ramp up toward the noon peak."

Query 3 — Sky photo + energy planning

Q: "How should I plan my household solar production and energy storage for the next two hours?"

Tools called: none (direct visual answer)

A: "The image shows heavy, broken cumulus and stratocumulus clouds covering approximately 80-90% of the sky. Production will be highly variable — cycling between 15-30% of capacity during cloud gaps and dropping to 5-10% under the thicker gray patches. If your battery is below 50%, prioritize essential loads only."

Query 4 — Tool calling (grid status)

Q: "What is current grid capacity now?"

Tools called: get_grid_status

A: "Peak pricing at $0.28/kWh with 30.3% renewable grid mix. CO2 intensity is 381.5 kg/MWh. Peak rates typically run 2-7pm — prioritize battery discharge and minimize grid imports."

Query 5 — Tool calling (weather)

Q: "What is the weather forecast going to be like this week in Ann Arbor?"

Tools called: get_weather

A: "Current conditions in Ann Arbor: 67°F with 100% cloud cover and mist. With 100% cloud cover, expect significantly reduced solar output — run only essential loads and preserve battery for evening peak."

Data Pipeline Diagnostics

Training data quality validated with 14 diagnostic charts generated from live API data:

Solar Irradiance and Production



GHI distribution: Ann Arbor median 265 W/m² vs San Mateo 364 W/m² — Michigan receives ~27% less solar irradiance	Hourly production curve: Peak at 1-2pm. Ann Arbor peaks higher but with wider variance

Month x hour heatmaps: Ann Arbor peaks June-July at 45+ kW midday. San Mateo has broader, flatter production season	Temperature derating: Flat at 1.0 below 77°F, linear decline at 0.4%/°F above. Validates the derating formula
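
The temperature-derating curve described in the caption above can be written as a small piecewise function. This is a sketch of the stated shape only (flat at 1.0 up to 77°F, then 0.4% loss per °F); the function name and the floor at zero are assumptions, and the project's full production formula is not reproduced here.

```python
def temperature_derate(temp_f: float, knee_f: float = 77.0, loss_per_deg_f: float = 0.004) -> float:
    """Multiplicative derating factor for panel output at a given temperature in °F."""
    if temp_f <= knee_f:
        return 1.0
    # Linear decline above the knee; the floor at 0.0 is a safety assumption.
    return max(0.0, 1.0 - loss_per_deg_f * (temp_f - knee_f))

for t in (65.0, 77.0, 97.0):
    print(t, temperature_derate(t))
```

At 65°F the factor is 1.0, which matches the "no temperature derating at 65°F" statement in the live demo output.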

Environmental Correlations



Feature correlations: GHI to production r=0.97 (near-perfect). Humidity to GHI r=-0.57	Cloud cover by season: Ann Arbor consistently cloudier than San Mateo across all seasons

Seasonal production: Summer median ~33 kW (Ann Arbor) vs ~26 kW (San Mateo). Winter drops to ~12 kW	GHI vs production scatter: Clear-sky (tight linear) vs cloudy (scattered) — demonstrates direct vs diffuse radiation physics

Cross-Validation and Grid Analysis



Open-Meteo vs PVWatts: Strong seasonal agreement validates GHI formula against NREL industry standard	OWM snapshot: Temperature, clouds, wind, humidity at data generation time

Fuel mix: MISO (33.5% gas, 23.4% wind, 18.8% coal) vs CAISO (35.8% solar, 20.6% wind)	Renewable % and CO2: CAISO (EIA code CISO) hits 100% renewable at midday solar peaks; MISO ranges 20-50%

Atmospheric Decomposition



Irradiance decomposition: Total GHI separated into direct-beam (DNI) and diffuse (DHI) on a clear summer day. Confirms training on physically-decomposed solar radiation	Vertical cloud-cover composition by month: Low (<3 km) / mid (3-8 km) / high (>8 km) stratification — exposes the model to seasonal shifts in cloud-layer mix

Community Model

Parameter	Value
Location	Ann Arbor, Michigan (42.2808°N, 83.7430°W)
Community size	12 homes
Total panel capacity	72 kW
Shared battery storage	100 kWh
Grid region	MISO (Midcontinent Independent System Operator)

Companion Repositories

Model	Repository	Purpose
SolarHive 26B A4B LoRA	This repo	Cloud inference — LoRA adapters via Unsloth, full multimodal + function calling
SolarHive 26B A4B Merged	solarhive-26b-a4b-merged	Full BF16 merged weights (~48 GB) — LoRA pre-applied to base, no PEFT/Unsloth needed at inference
SolarHive 26B A4B NF4	solarhive-26b-a4b-nf4	Pre-quantized 4-bit version of the BF16 merged model — for HF Spaces and 24 GB+ GPUs
SolarHive E4B LoRA	solarhive-e4b-lora	E4B adapter weights (~200 MB) — apply over base via Unsloth
SolarHive E4B safetensors	solarhive-e4b-ollama	Edge model — merged safetensors source for transformers research and GGUF conversion via llama.cpp
SolarHive E4B GGUF	solarhive-e4b-gguf	Edge deployment — Q4_K_M GGUF + mmproj for Ollama / llama.cpp on 16 GB CPU laptop (10/10 benchmark)
SolarHive Dataset	solarhive-community-solar-multimodal	1,727 training examples (1,713 text + 14 image-grounded)
LiteRT-LM Python edge runtime	`solarhive_e4b_litert_v3.1.ipynb`	LiteRT Special Tech Track entry — runs upstream base `litert-community/gemma-4-E4B-it-litert-lm` `.litertlm` (3.66 GB) + SolarHive UX layer + on-device agentic loop with native Gemma 4 function calling. Q&A 8/8 on Colab Pro CPU + High-RAM. Fine-tuned LiteRT-LM bundle is a planned next iteration once upstream `gemma4` example module lands in `ai_edge_torch.generative.examples/`.
GitHub	the-gemma4-good-hackathon-solarhive	Full source code, training and quantization notebooks, `test_ollama_tools.py`, data principles

Versions — v2 Update (Text-Only Training on the Multimodal-Capable Corpus)

The repository was refreshed on April 30, 2026 with a v2 LoRA produced by re-training on the consolidated training corpus (1,727 rows = 1,713 text + 14 image-grounded). The v2 fine-tune trains on the text subset only; image rows are skipped at the data-prep layer. Multimodal training is deferred post-hackathon — a real image corpus and a held-out VQA benchmark would be prerequisites. The base Gemma 4 26B A4B model retains full multimodal capability regardless of which corpus subset is used in any given fine-tune run, so VQA at inference time continues to work on the saved adapters.

The shipped dataset uses the project archive only — fewer images than originally planned, but every label is human-confirmed and every paired Q&A traces back to the same GHI / temperature derating formula used elsewhere. Image-source planning had earlier rejected the SWIM corpora (NUS — CC BY-NC licensing) and NREL SRRL (legacy MIDC SkyCam archive ended May 2017).

The v2 fine-tune is pre-aligned with the official Unsloth Gemma 4 documentation (train guide, bug fixes & tips): explicit loader arguments (max_seq_length, dtype, full_finetuning=False), explicit SFTConfig arguments (weight_decay, lr_scheduler_type), text-only data path (finetune_vision_layers=False, dataset_text_field="text", TRL default text collator, train_on_responses_only wrapper for assistant-only loss masking).

Technical Notes

System prompt repetition: The system prompt appears twice in the message format. This technique improves instruction following in causal LLMs, winning 47/70 benchmark-model tests with zero losses (Leviathan et al., 2025, Google Research).
PEFT incompatibility: Standard PEFT cannot handle Gemma 4's Gemma4ClippableLinear layers. Use Unsloth's FastVisionModel for adapter loading.
VRAM requirements: ~48 GB in BF16, ~16 GB in NF4 (4-bit). T4 x2 cannot run this model.
Sampling: temperature=1.0, top_p=0.95, top_k=64 (Kaggle-recommended defaults).

Limitations

Prototype tested on a single community model (12 homes, Ann Arbor). Real-world deployment requires validation across diverse geographies and community sizes.
The model occasionally uses "60 kW" instead of the correct 72 kW community capacity in direct VQA responses (without tool calls). This appears to be a base-model tendency that additional fine-tuning examples should address.
Tool responses depend on external API availability. Open-Meteo and EIA have rate limits. OpenWeatherMap free tier allows 1,000 calls/day.
The battery state simulator is deterministic for demonstrations. Real deployment requires integration with actual battery management systems.
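
One way to soften the rate-limit dependency noted above is to cache tool results for a short time. The following is a minimal sketch, not part of the shipped code: `ttl_cached` is a hypothetical helper, the 600-second default is an arbitrary choice, and it assumes tool arguments are hashable.

```python
import time
from typing import Any, Callable

def ttl_cached(fn: Callable[..., Any], ttl_s: float = 600.0,
               clock: Callable[[], float] = time.monotonic) -> Callable[..., Any]:
    """Wrap a tool function so repeated calls with the same args reuse a recent result."""
    cache: dict[tuple, tuple[float, Any]] = {}

    def wrapper(*args: Any) -> Any:
        now = clock()
        hit = cache.get(args)
        if hit is not None and now - hit[0] < ttl_s:
            return hit[1]  # still fresh, so skip the external API call
        value = fn(*args)
        cache[args] = (now, value)
        return value

    return wrapper
```

A wrapped tool such as `ttl_cached(get_weather)` would then call the upstream API at most once per location per window, at the cost of serving slightly stale readings.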

Future Iteration — Multi-Token Prediction (MTP) Drafters

Not in the measured numbers above. Google announced Gemma 4 MTP drafters on May 5, 2026 (blog, overview, HF collection, Kaggle, @GoogleGemma) — after this artifact's final benchmark was captured. The benchmarks above reflect standard autoregressive decoding only. MTP integration is documented here as future iteration; no measured speedup is claimed in this release.

Theoretical foundation. Speculative decoding (Leviathan, Kalman & Matias, ICML 2023, arXiv:2211.17192) accelerates generation without changing the output distribution, under greedy (argmax) and sampled decoding alike: a smaller drafter proposes γ candidate tokens, the target verifies all γ in a single parallel forward pass, accepted tokens are kept, and the first rejected token is resampled from a corrected distribution. The output distribution is preserved exactly regardless of drafter quality; only the acceptance rate α, and therefore the walltime speedup, varies.
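
To make the role of α concrete, the speculative decoding paper gives the expected number of tokens produced per target verification pass as (1 − α^(γ+1)) / (1 − α), under the simplifying assumption that each drafted token is accepted independently with probability α. The sketch below evaluates that expression; the function name and the α values are illustrative and not measured on this model.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target forward pass, assuming i.i.d. acceptance rate alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft is accepted, and the verification pass contributes one extra token.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

for alpha in (0.5, 0.7, 0.9):  # hypothetical acceptance rates
    print(alpha, round(expected_tokens_per_pass(alpha, gamma=4), 3))
```

This counts tokens per pass only; realized walltime speedup also depends on how expensive the drafter is relative to the target.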

What Google released on May 5, 2026. Paired drafter checkpoints for all four IT-tuned Gemma 4 variants (gemma-4-E2B-it-assistant, gemma-4-E4B-it-assistant, gemma-4-26B-A4B-it-assistant, gemma-4-31B-it-assistant), discoverable via the google/gemma-4 Hugging Face collection and on Kaggle Models. The drafters share the input embedding table with their paired target and consume the target's last-layer activations (architecture per the MTP overview). For this target the paired drafter is google/gemma-4-26B-A4B-it-assistant (0.4B params). Google reports up to 3× decode speedup with no quality degradation on the 26B-A4B configuration, and 2.2× on Apple Silicon at batch sizes 4–8. Tested runtimes named in the blog: LiteRT-LM, MLX, Hugging Face Transformers, vLLM, SGLang, Ollama.

Integration cost is one kwarg in Hugging Face Transformers:

from unsloth import FastVisionModel
import torch
from transformers import AutoModelForCausalLM

# FastVisionModel.from_pretrained returns (model, processor), so unpack both;
# loading the adapter repo applies the LoRA on top of the base model
target, processor = FastVisionModel.from_pretrained("Truthseeker87/solarhive-26b-a4b-lora", ...)
assistant = AutoModelForCausalLM.from_pretrained("google/gemma-4-26B-A4B-it-assistant", dtype=torch.bfloat16, ...)
target.generate(**inputs, assistant_model=assistant)  # MTP enabled

The integration ships as a gated future-iteration cell (§14, _RUN_MTP_DEMO = False) in solarhive_inference.py; reviewers can flip the flag to reproduce a baseline-vs-MTP comparison under argmax decoding.

Open question specific to this LoRA-adapter target. Per the 2023 speculative-sampling guarantee, correctness is invariant to drafter quality — the target's verification step preserves the exact output distribution regardless of what the drafter proposes. What varies is acceptance rate α, since Google's released drafter was trained against the base gemma-4-26B-A4B-it, not against this LoRA-adapter-on-top target. Measured α and the resulting walltime speedup on this target are the planned post-hackathon contribution.

Citation

@misc{solarhive2026,
  title={SolarHive: AI-Powered Community Solar Energy Intelligence},
  author={Youshen Lim},
  year={2026},
  url={https://github.com/youshen-lim/the-gemma4-good-hackathon-solarhive},
  note={Gemma 4 Good Hackathon submission — Google DeepMind x Kaggle}
}
