SolarHive E4B INT4 (Cactus) — Multimodal Mobile Artifact

LoRA fine-tuned Gemma 4 E4B (8B), converted via Cactus Compute cactus convert --precision INT4 for on-device mobile deployment. Text-decoder weights INT4-quantized; vision encoder and audio Conformer towers retained FP16 alongside the INT4 text. Total artifact 6.94 GB. Loadable via the Cactus Flutter SDK on Apple Silicon Mac, iPhone, iPad, Vision Pro, Android, ARM64 Linux (Raspberry Pi 5), and Android emulators on developer machines — with hardware acceleration through integrated mobile NPUs (Apple Neural Engine, Qualcomm Hexagon, MediaTek / Exynos APU).

For the cloud deployment, use the solarhive-26b-a4b-merged repo (BF16 26B A4B-IT, ~48 GB, A100-class GPU). For Ollama / llama.cpp on a 16 GB laptop CPU, use solarhive-e4b-gguf (Q4_K_M, ~5 GB). For LiteRT-LM Python on Pi 5 / Jetson / Android Kotlin / iOS C++, use the upstream litert-community/gemma-4-E4B-it-litert-lm base bundle plus the SolarHive engineering layer in solarhive_e4b_litert_v3.1.ipynb. This Cactus repo is the only fine-tuned artifact in the deployment family that ships to mobile NPUs end-to-end.

This repository serves three roles:

Deployable mobile artifact — drop-in input for the Cactus Flutter SDK on Android (production-validated) and iOS / Apple Silicon (forward-looking; empirical validation pending physical-device testing). Loaded by the companion Flutter app at mobile-cactus/ on first launch.
Cross-runtime fine-tune anchor — the Cactus tier is the SolarHive deployment family's fine-tuned mobile path; sits alongside the cloud (26B A4B), microgrid hub (E4B GGUF via Ollama), and browser (LiteRT .task via WebGPU) tiers in the broader project architecture.
Reference for re-conversion — cactus convert ... --precision INT4 is deterministic. Re-run against the solarhive-e4b-ollama source if Cactus's converter ships an updated quant strategy.

Built for the Gemma 4 Good Hackathon (Google DeepMind × Kaggle).


Base Model	google/gemma-4-e4b-it
Architecture	Dense + PLE — 8B total, 4.5B effective
Fine-Tuning	LoRA via Unsloth `FastVisionModel` (BF16); merged via Unsloth `save_pretrained_merged`
Conversion	Cactus `cactus convert <model> --precision INT4`
Source Artifact	solarhive-e4b-ollama (BF16 merged safetensors, ~16 GB)
Tensor Mix	343 INT4 + 2 INT8 + 1,732 FP16 (text decoder INT4-quantized; vision encoder + audio Conformer retained FP16)
Total Artifact	6.94 GB across 2,088 files (~2,077 `.weights` tensor files + `vocab.txt` + ~10 config/metadata files)
Quant Fidelity	CosSim 0.9946 mean / SNR 19.8 dB mean / MSE 5.18e-04 mean (per Cactus's converter — deterministic across runs)
Deployment Target	Apple Silicon (Mac, iPhone, iPad, Vision Pro), Android (ARM64), Raspberry Pi 5, ARM64 Linux
Primary SDK	Cactus Flutter SDK (Android, iOS, macOS); also Python / Swift / Kotlin / C++ on ARM hosts
License	Apache-2.0 (matches the SolarHive fine-tune license chain)

Quantization Note

Cactus's cactus convert is a multimodal-aware quantizer with explicit Gemma 4 support: the converter log emits "Normalized gemma4 audio tower key naming for conversion" during the run, and the resulting tensor distribution selectively quantizes the text decoder to INT4 while retaining the vision encoder and audio Conformer at FP16. This is by design — INT4 quantization on the multimodal towers measurably degrades image / audio understanding, while text-decoder INT4 preserves coherent generation at a fraction of the memory cost.

The artifact size of 6.94 GB is therefore larger than the Cactus Gemma 4 deployment blog's "~4 GB" reference figure — that figure is for an INT4 text-only Gemma 4 E4B, while this artifact is the multimodal variant with FP16 vision + audio towers preserved.

Per-tensor fidelity numbers (verbatim from Cactus's converter — deterministic across runs):

Metric	Mean	Max	Median
Cosine Similarity (1.0 = perfect)	0.9946	1.0010	—
Signal-to-Noise Ratio (dB)	19.8	45.2	—
Mean Squared Error	5.18e-04	2.17e-03	8.84e-06

A CosSim of 0.9946 on the INT4-quantized text decoder is well above the typical 0.95+ threshold that is associated with no perceptible quality regression on downstream Q&A and instruction-following tasks. The forensic convert_log.txt in this repo captures the per-tensor breakdown for reproducibility audit without requiring a re-run of the convert step.

Vision-encoder transparency note

Per Google's Gemma 4 model card, Gemma 4 E4B ships a ~150M-parameter vision encoder that handles image understanding as a native pretrained capability. The SolarHive LoRA fine-tune targets only the language-model linear layers (target=all-linear), and the vision tower is not modified by our adapters — this matches the Vertex AI Gemma 4 SFT recipe which explicitly freezes both vision and audio towers during text-focused fine-tuning. The cactus convert step preserves the vision encoder's pretrained FP16 weights end-to-end (visible in the tensor mix above as part of the 1,732 FP16 tensors). VQA at inference time on this Cactus artifact uses Gemma 4's pretrained vision encoder unmodified, and the visual decoder behavior is empirically equivalent to the upstream base model on image inputs.

The same transparency principle applies to the audio Conformer — preserved at FP16, unmodified by the LoRA, behavior equivalent to the upstream base on audio inputs.

Companion Repositories

Model / Asset	Repository	Purpose
SolarHive 26B A4B LoRA	solarhive-26b-a4b-lora	Cloud inference with full multimodal + function calling (LoRA adapters for Unsloth `FastVisionModel`)
SolarHive 26B A4B Merged	solarhive-26b-a4b-merged	Full BF16 cloud model (~48 GB) — production inference, no PEFT/Unsloth dep
SolarHive 26B A4B NF4	solarhive-26b-a4b-nf4	Pre-quantized 4-bit cloud model for HF Spaces / 24 GB+ GPUs
SolarHive E4B LoRA	solarhive-e4b-lora	E4B adapter weights (~200 MB) — apply over base via Unsloth
SolarHive E4B Merged	solarhive-e4b-ollama	BF16 merged safetensors (~16 GB) — the source artifact this Cactus repo was converted from + GGUF conversion source for Ollama / llama.cpp
SolarHive E4B GGUF	solarhive-e4b-gguf	Edge laptop deployment — Q4_K_M GGUF + mmproj for Ollama / llama.cpp on 16 GB CPU laptop. 10/10 benchmark.
SolarHive Cactus mobile artifact	This repo	On-device mobile deployment — Gemma 4 E4B INT4 (Cactus) multimodal artifact for Apple Silicon / iPhone / iPad / Vision Pro / Android / Pi 5 via the Cactus Flutter SDK
SolarHive Dataset	solarhive-community-solar-multimodal	1,727 training examples (1,713 text + 14 image-grounded)
LiteRT-LM Python edge runtime	`solarhive_e4b_litert_v3.1.ipynb`	LiteRT Special Tech Track entry — runs upstream base `litert-community/gemma-4-E4B-it-litert-lm` `.litertlm` (3.66 GB) + SolarHive UX layer + on-device agentic loop. Q&A 8/8 on Colab Pro CPU + High-RAM. Fine-tuned LiteRT-LM bundle is a planned next iteration once upstream `gemma4` example module lands in `ai_edge_torch.generative.examples/`.
Companion Flutter app	`mobile-cactus/`	Flutter Android app that loads this artifact on first launch via the Cactus Flutter SDK and runs fully on-device inference, with a multi-turn chat UI
GitHub	the-gemma4-good-hackathon-solarhive	Full source code: datagen + dual fine-tune + inference + Cactus convert + LiteRT runtime demo

How to use

From the companion Flutter app (recommended)

The simplest path is to clone the GitHub repo and run the companion Flutter app, which downloads this artifact at first launch and handles the device-side wiring:

git clone https://github.com/youshen-lim/the-gemma4-good-hackathon-solarhive
cd the-gemma4-good-hackathon-solarhive/mobile-cactus
flutter pub get
flutter run     # connect an Android device or boot an emulator first

On first launch the app downloads this 6.94 GB artifact from Truthseeker87/solarhive-e4b-cactus (one-time, cached to app-local storage), then runs fully locally with no further network round-trips for inference.

From the Cactus Python SDK (ARM hosts only)

The Cactus C++ engine targets ARM by design — see the Cactus Gemma 4 deployment blog for the verbatim scope statement. On Apple Silicon Mac, Raspberry Pi 5, ARM cloud VM, or Android emulator (ARM-via-QEMU on a developer machine):

git clone https://github.com/cactus-compute/cactus && cd cactus && source ./setup
cactus build --python    # compiles libcactus.so for the host's ARM SoC

from huggingface_hub import snapshot_download
artifact_dir = snapshot_download(repo_id="Truthseeker87/solarhive-e4b-cactus")

from src.cactus import CactusEngine
engine = CactusEngine(model_path=artifact_dir)
response = engine.generate(
    prompt="How much should the community generate today?",
    system_prompt="You are SolarHive, an AI energy advisor for a 12-home community...",
)
print(response)

From `cactus run` against a local snapshot (ARM hosts)

cactus run's slug-based auto-download path resolves only against the curated Cactus-Compute/ HF organisation. SolarHive's fine-tune is a third-party repo, so download the artifact directory yourself first via huggingface_hub, then pass its local path to cactus run:

python -c "from huggingface_hub import snapshot_download; print(snapshot_download(repo_id='Truthseeker87/solarhive-e4b-cactus'))"
# → prints the local cache path, e.g. /Users/you/.cache/huggingface/hub/models--Truthseeker87--solarhive-e4b-cactus/snapshots/<sha>/
cactus run /path/printed/above

The same local-path pattern works with the Python SDK (CactusEngine(model_path=...) example above) and with the FFI cactusInit(modelPath, ...) call documented at docs.cactuscompute.com.

Performance characteristics

The Cactus deployment surface targets ARM mobile NPUs for hardware-accelerated inference:

Apple Neural Engine (M-series Macs, A-series iPhones / iPads) — Cactus's first-class accelerator path
Qualcomm Hexagon (Snapdragon Android phones — Pixel, Galaxy S/A series)
MediaTek / Exynos APU (mid-range Android phones — Galaxy A, Pixel A)
Raspberry Pi 5 16GB — ARM CPU only (no NPU); usable for the SolarHive microgrid hub deployment pattern

Cactus publishes per-device benchmarks on its docs site (page header: "Benchmarks (CPU-only, no GPU)"); the runtime has zero CUDA / NVIDIA / ROCm / OpenCL acceleration paths by design — mobile-NPU acceleration handles inference on the deployment surface.

The conversion step was validated on a Colab Pro CPU + High-RAM runtime in 4–7 minutes (deterministic across runs). x86 development hosts (Colab, AWS x86, GCP x86) can run the convert step but cannot run Cactus inference (the Python SDK requires libcactus.so, which targets ARM hosts) — see the Cactus Gemma 4 deployment blog for the verbatim scope statement.

On-device validation — release-flavor APK sideload test on Android device

Validation device + connectivity

The companion mobile-cactus/ Flutter Android app loads this artifact from on-device cache and runs end-to-end inference. Validation was performed via a release-flavor APK sideloaded to an Android mobile device (specific model/serial intentionally omitted for privacy) with the following hardware profile:

SoC: Qualcomm Snapdragon 865 (octa-core ARM64; the same generation that ships in millions of mid-to-flagship Android phones from 2020 onward)
RAM: 12 GB LPDDR5 (MemTotal: 11.24 GiB reported by /proc/meminfo on the device)
OS: Android 14
ABI: arm64-v8a (the only Android ABI the Cactus runtime targets)

The signed release APK is downloadable from the v1.0-mobile-cactus GitHub Release. The validation cycle exercised both USB-tethered adb debugging (cable-attached for stable build/install/launch loops, no wireless drops) and wireless adb (TLS) debugging (over local Wi-Fi, mDNS-paired for hands-off device-untethered validation). The size-verified resumable downloader in artifact_downloader.dart was specifically designed to survive wireless-adb drops mid-download — each file's local size is verified against HF's reported size from the tree API; mismatches are deleted and re-fetched on the next launch automatically.

Sideloaded-APK chat round-trip design

The sideloaded APK opens directly to a multi-turn chat scaffold where each user message triggers the full on-device inference stack end-to-end:

User sends message
  → ChatScreen._send() (instrumented with CycleTimer)
  → CactusEngine.generate(messages, systemPrompt)
  → ensureLoaded()                                 ← cactus_init(modelPath, contextSize=1024, null) [first call only]
  → cactus_complete(handle, messagesJson, optionsJson, null, null)
  → JSON-envelope unwrap → _parseEnvelope (Cactus canonical + llama.cpp fallbacks)
  → _stripLatex(text)
  → render in chat UI (post-frame callback closes the CycleTimer)

The canonical chat round-trip uses the same warmup probe as the project's cloud inference pipeline — "What is solar GHI?" — so cross-runtime comparisons stay apples-to-apples. Each round-trip is a complete integration test of FFI loader, file mmap, INT4 forward pass, JSON envelope parse, and rendering. The full diagnostic instrumentation lives at mobile-cactus/lib/services/ — diagnostics.dart (artifact-directory audit + /proc/self/status memory sampler + the CycleTimer tap-to-render instrumentation) and cactus_engine.dart (FFI wrapper + multi-schema GenerationStats envelope parser that reads Cactus canonical keys with llama.cpp fallbacks). Together they write a structured trail to ${appDocs}/solarhive_diag.log for every chat round-trip so post-mortem analysis is straightforward via adb shell run-as <package> cat <log>.

Cross-runtime system-prompt parity

The on-device system prompt is a deliberately-narrowed variant of the cloud inference pipeline's SYSTEM_PROMPT constant — same SolarHive identity, same community facts (12 homes, Ann Arbor, rooftop solar + shared battery), same response-length guidance (3–5 sentences). Two intentional divergences keep on-device behaviour predictable:

The cloud prompt instructs the model to call available tools for real-time data; the on-device tier has no tools wired (by design — real-time data integration sits in the cloud tier rather than on phone hardware), so that sentence is dropped.
"Reference actual data" → "reference reasonable assumptions" (no live API access on-device → the model shouldn't hallucinate live numbers).

Same SolarHive fine-tune family (Gemma 4 26B A4B in the cloud, Gemma 4 E4B INT4 multimodal on-device) + same Kaggle-recommended Gemma 4 sampling defaults (temperature=1.0, top_p=0.95, top_k=64) keeps the behaviour comparable across runtimes — the published cloud benchmarks remain meaningful as a reference for the on-device tier.

Prompt-repetition technique

The on-device tier mirrors the cloud's "Repeat to Improve" pattern: the system body is concatenated to itself, separated by a blank line, and sent as a single system-role message. This lets every token in the prompt attend to every other prompt token, demonstrably improving instruction-following on benchmark tests with no measurable latency hit on the prompt side. Reference:

Leviathan, Y., Kalman, M., & Matias, Y. (2024). Repeat to Improve Non-Reasoning LLMs. Google Research. arXiv:2512.14982. Reported result: doubling the system prompt won 47 of 70 benchmark-model tests with zero losses and no latency increase.

This is the same prompt-repetition pattern the project's cloud solarhive_inference.py agentic loop uses — keeping cloud and on-device prompt regimes aligned was an explicit design constraint so cross-runtime comparisons remain apples-to-apples.

Empirical baselines from the canonical chat round-trip

The headline empirical baseline is the tap-to-render cycle — the wall-clock interval between a user tap on Send and the assistant bubble being scheduled for GPU composite. Measured on the canonical chat round-trip (prompt "What is solar GHI?") at production config contextSize=1024 + maxNewTokens=512:

Metric	Value
On-device cache size	6.47 GiB across 2,088 files (one-time download from this repo)
Tap-to-render cycle (user tap on Send → assistant bubble rendered on screen)	53.73 sec
Share of cycle inside the `generate` await (FFI call boundary)	99.88% (`prep` 1 ms / `render` 62 ms — UI / Dart / rendering overhead structurally negligible)
Pure `cactus_complete` time (subset of `generate` after the one-time `cactus_init` prefix)	~~46.15 sec (~~85.9% of cycle)
One-time `cactus_init` cold-load (file mmap + header parse + KV cache pre-allocation, on first chat round-trip)	~7.36 sec
Peak HWM during the round-trip	3.55 GiB
RSS trajectory pre-/post-generate	2.40 → 3.19 GiB
VmSize during inference	17.86 → 18.71 GiB (mmap'd artifact; only hot pages stay resident)
sysFree throughout (system-wide `MemAvailable` from `/proc/meminfo`, lowest observed)	always ≥3.5 GiB
Crash signals (`SIGABRT` / `SIGKILL` / lmkd)	zero
Sample output	"Global Horizontal Irradiance (GHI) is the total solar radiation received on a horizontal surface, averaged over the day. It's a useful metric for long-range forecasting because it doesn't depend on your specific roof tilt or orientation. In Ann Arbor, MI, you might forecast 500-700 W/m2 GHI for a typical summer day. This translates to roughly 35-45 degrees from the horizon to your optimally tilted panels … For daily planning, use GHI forecasts to set production targets for the 24-hour period, rather than just hourly predictions." (Five clean sentences, plain prose, no LaTeX, Ann-Arbor-grounded, domain-correct on GHI vs panel-tilt distinction, no truncation. See the final-run on-device response screenshot for the verbatim rendering on the validation Android device.)

Cactus envelope schema (empirically discovered). The cactus_complete call returns a 15-key JSON envelope distinct from the llama.cpp-canonical schema: [cloud_handoff, confidence, decode_tokens, decode_tps, error, function_calls, prefill_tokens, prefill_tps, ram_usage_mb, response, segments, success, time_to_first_token_ms, total_time_ms, total_tokens]. The production parser in cactus_engine.dart::_parseEnvelope reads Cactus canonical keys (decode_tokens / prefill_tokens / decode_tps / prefill_tps / time_to_first_token_ms / total_time_ms) as primary and falls back to llama.cpp-canonical names (tokens_predicted / tokens_evaluated / nested timings.*) for cross-runtime tolerance.

Structural finding from the cycle decomposition. ~99.88% of any chat round-trip lives inside the Cactus C engine (cactus_init cold-load prefix + cactus_complete forward pass). UI / Dart / rendering overhead is ~0.1% of the cycle and structurally not a viable optimisation surface. Throughput improvement at the on-device tier therefore comes from either tuning contextSize / maxNewTokens or upgrading the Cactus runtime build.

contextSize-to-throughput characterisation (Snapdragon 865, this artifact, measured across earlier exploratory runs): each doubling of contextSize roughly halves throughput. The Cactus runtime pre-allocates KV cache buffers and iterates per-step attention over the full contextSize regardless of how many tokens are actually in use, so a context budget close to the active working set (≈540 tokens for typical chat-length prompts) maximises tokens/sec. contextSize=1024 is the sweet spot — comfortable two-times headroom for the active working set while halving per-step attention work compared to contextSize=2048.

contextSize-to-throughput characterisation (Snapdragon 865, this artifact): each doubling of contextSize roughly halves throughput. The Cactus runtime pre-allocates KV cache buffers and iterates per-step attention over the full contextSize regardless of how many tokens are actually in use, so a context budget close to the active working set (≈540 tokens for typical chat-length prompts) maximises tokens/sec. contextSize=1024 is the sweet spot — comfortable two-times headroom for the active working set while halving per-step attention work compared to contextSize=2048.

Output rendering. A small Dart-side post-processor strips any LaTeX or markdown-math wrappers (\command{...} forms; $...$ , $$...$$, $...$, \[...\] delimiters) before the response reaches the chat UI. Defensive — the system prompt also instructs plain-prose output — but ensures rendering stability if the model ever regresses to LaTeX-flavored unit annotations.

Library versioning note. The deployed libcactus.so binary on the companion app is built locally from cactus-compute/cactus at commit d917981f via the Android NDK cross-compile toolchain. The pub.dev cactus 1.3.0 Flutter package is several months older than the converter that produced this artifact and has both a different cactus_init ABI and an incompatible weight-file header layout. Pinning the convert tool, the libcactus.so binary, and the Dart cactus.dart bindings to a single Cactus commit is the way to avoid the version drift. Build recipe: git clone --recurse-submodules https://github.com/cactus-compute/cactus.git && cd cactus && bash android/build.sh with ANDROID_NDK_HOME set, the Android SDK's cmake on PATH, and CMAKE_GENERATOR=Ninja.

Future Iteration — Multi-Token Prediction (MTP) Drafters

Not in the measured numbers above. Google announced Gemma 4 MTP drafters on May 5, 2026 (blog, overview, HF collection, Kaggle, @GoogleGemma) — after this artifact's on-device validation was captured. The benchmarks above reflect standard autoregressive decoding only. MTP integration is documented here as future iteration; no measured speedup is claimed in this release.

Theoretical foundation. Speculative decoding (Leviathan, Kalman & Matias, ICML 2023, arXiv:2211.17192) accelerates generation without changing the output distribution under argmax decoding: a smaller drafter proposes γ candidate tokens, the target verifies all γ in a single parallel forward pass, accepted tokens are kept, and any rejection is resampled from a corrected distribution. The output distribution is preserved exactly regardless of drafter quality; only acceptance rate α, and therefore walltime speedup, varies. Gemma 4 MTP drafters additionally share the input embedding table with the target and consume the target's last-layer activations per the MTP overview.

Released drafter for E4B. google/gemma-4-E4B-it-assistant (~78.8 M params) is the canonical pair for google/gemma-4-E4B-it. Tested runtimes named in the Google blog: LiteRT-LM, MLX, Hugging Face Transformers, vLLM, SGLang, Ollama. Google reports up to 3× decode speedup on the 26B-A4B configuration; per-variant E4B numbers were not enumerated in the announcement.

Runtime support on Cactus is not yet shipped. The blog's tested-runtime list is LiteRT-LM, MLX, Hugging Face Transformers, vLLM, SGLang, Ollama — Cactus is not in that list. The Cactus runtime's cactus_complete FFI takes a single artifact path plus JSON-encoded options and returns the response envelope documented above; no drafter-pairing parameter is exposed in the current ABI.

Implementation paths (post-hackathon):

Upstream Cactus contribution. Add a drafter_path parameter (or analogous) to cactus_init / cactus_complete plus the speculative-sampling verify-and-resample loop inside libcactus. Mirrors how llama.cpp gates speculative decoding via --draft-model. Effort: research-grade contribution to the Cactus runtime; drafter weights would additionally need INT4 conversion through the Cactus converter pipeline.
Inherit from llama.cpp. Cactus shares a forked-llama.cpp-derived inference core. If llama.cpp ships first-class Gemma 4 MTP drafter pairing through its existing --draft-model infrastructure, Cactus would naturally inherit by tracking the upstream patch.
Hugging Face Transformers proxy at the cloud tier. Use Hugging Face Transformers assistant_model= against the cloud A4B target (the gated future-iteration cell in solarhive_inference.py §14 loads Truthseeker87/solarhive-26b-a4b-merged for exactly this purpose) and let the on-device Cactus tier handle non-MTP inference. Splits the speedup claim by tier — cloud cycle gets MTP, on-device cycle keeps the standard autoregressive path.

Honest framing of the speedup value at this tier. MTP's headline 3× target — measured by Google on the 26B-A4B cloud configuration, not on E4B at the Cactus edge — would not necessarily translate directly to a Snapdragon-class device. The real E4B-on-Cactus α is unknown and would require the Cactus runtime extension above. Implementation cost is research-grade; concrete user-visible win on a mobile device is high; tracked as a planned post-hackathon contribution.

Citation

Per the Cactus repository's recommended attribution:

@software{cactus,
  title  = {Cactus: AI Inference Engine for Phones & Wearables},
  author = {Ndubuaku, Henry and Cactus Team},
  url    = {https://github.com/cactus-compute/cactus},
  year   = {2025}
}

If this artifact is used in research or production, please also cite the SolarHive project:

@software{solarhive,
  title  = {SolarHive: AI-Powered Community Solar Energy Intelligence},
  author = {Lim, Youshen},
  url    = {https://github.com/youshen-lim/the-gemma4-good-hackathon-solarhive},
  year   = {2026},
  note   = {Gemma 4 Good Hackathon — Google DeepMind \& Kaggle}
}

Built with Gemma 4 in Ann Arbor, Michigan.

Gemma is a trademark of Google LLC.

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

Image-Text-to-Text

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Dataset used to train Truthseeker87/solarhive-e4b-cactus

Space using Truthseeker87/solarhive-e4b-cactus 1

Papers for Truthseeker87/solarhive-e4b-cactus

Prompt Repetition Improves Non-Reasoning LLMs

Paper • 2512.14982 • Published Dec 17, 2025 • 2

Fast Inference from Transformers via Speculative Decoding

Paper • 2211.17192 • Published Nov 30, 2022 • 11

Evaluation results

CosineSimilarity-Mean
self-reported

0.995
SNR-Mean-dB
self-reported

19.800