This is a quantized, repackaged derivative of google/embeddinggemma-300m and is governed by the Gemma license. Accessing the files requires reviewing and agreeing to Google's Gemma Terms of Use while logged in to Hugging Face.

EmbeddingGemma-300M — Tensor G4 NPU + Termux RAG/MCP stack

A quantized, repackaged variant of google/embeddinggemma-300m that runs its embeddings on the Google Tensor G4 Edge TPU ("rio") NPU (Pixel 9 series), plus a complete on-device RAG stack that runs entirely in Termux on the phone: a native bionic engine runner, a numpy vector index, and an MCP server that exposes semantic search to any MCP client.

To our knowledge this is the first Tensor-G4 NPU build of EmbeddingGemma, and a working example of a fully on-device, NPU-accelerated RAG pipeline: query embedding runs on the Edge TPU, retrieval is local, and nothing leaves the device.

Verified on-device (2026-06-10): real NPU dispatch (135M tpu_active_cycle delta per embed) and cosine similarity 0.9946 against the CPU reference, i.e. the NPU embedding points in nearly the same direction as the CPU one (about 6° apart) but is not bit-identical.
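
A comparison of this kind can be reproduced with a few lines of numpy. The following is a minimal sketch, not the author's verification script: `cosine_similarity` is a hypothetical helper, and the two vectors are synthetic stand-ins for the NPU and CPU embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic stand-ins: a "CPU" vector and a slightly perturbed "NPU" vector.
rng = np.random.default_rng(0)
cpu_vec = rng.standard_normal(768)
npu_vec = cpu_vec + 0.05 * rng.standard_normal(768)

sim = cosine_similarity(npu_vec, cpu_vec)
angle_deg = float(np.degrees(np.arccos(np.clip(sim, -1.0, 1.0))))
print(f"cosine={sim:.4f} angle={angle_deg:.2f} deg")
```

For a real check, replace the synthetic vectors with the 768-d output of the NPU runner and the output of the CPU reference model for the same token sequence.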

What's in here

Path	What it is
`EmbeddingGemma-300M_seq512_Google_Tensor_G4.litertlm`	NPU model bundle, seq 512 (embeds tokenizer + tflite + metadata)
`EmbeddingGemma-300M_seq256_Google_Tensor_G4.litertlm`	NPU model bundle, seq 256 (smaller / faster)
`EmbeddingGemma-300M_seq512_Google_Tensor_G4.tflite`	Raw TFLite (what the C engine runner consumes directly)
`engine/embed_npu.c`	Engine runner — native bionic LiteRT-C runner, single sequence → 768-d vector
`engine/embed_npu_batch.c`	Batch engine runner — model loaded once, streams N rows (for indexing)
`engine/embed_hetero.c`	Heterogeneous (NPU + CPU split) variant
`engine/build.sh`	Termux-clang build line for the runners
`rag/npu_embedder.py`	Python backend: shells to the runner via `rish`, returns the 768-d vector
`rag/index_corpus.py` / `reindex_npu.py`	Build the vector index over a markdown corpus → `index.npz`
`rag/query.py` / `query_npu.py`	CLI semantic search against the index
`rag/tokenize_*.py`, `rag/litertlm_loader.py`, `rag/unbundle_litertlm.py`	Tokenization + bundle plumbing
`rag/mcp_server_npu.py`	MCP server (stdio JSON-RPC), tool `search_context(query, k)`
`rag/mcp-litertlm-rag-npu`	MCP launcher script

Spec: EmbeddingGemma-300M, 768-d output, L2-normalized, seq 256 / 512, mixed-precision, AOT-compiled for the Tensor G4 Edge TPU. The index.npz vector index is not shipped because it was built from a private corpus; build your own with rag/index_corpus.py.

Architecture (fully on-device, Termux-native)

  MCP client (Claude Code / any agent)
        │  stdio JSON-RPC: search_context(query, k)
        ▼
  rag/mcp_server_npu.py            (Termux python3 — no pip 'mcp'/pydantic needed)
        │  tokenize query  ──► rag/tokenize_query.py
        │  embed query     ──► rag/npu_embedder.py
        │                          │  subprocess via rish (root)
        │                          ▼
        │                     engine/embed_npu  (native bionic ELF)
        │                          │  LiteRT C API → libLiteRt.so + GoogleTensor dispatch
        │                          ▼
        │                     ░░ Tensor G4 Edge TPU ░░  → 768-d vector
        │  cosine(query, index)  (numpy, Termux python3)
        ▼
  ranked chunks: score + source + text

The embedder runs on the NPU; tokenization and cosine search stay in Termux-native Python (the search uses numpy). The glibc ai_edge_litert / sentencepiece path is only needed to unbundle or re-export the model; query-time embedding uses the native runner and needs neither.
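
The retrieval step at the bottom of the diagram can be sketched as follows. This is an illustration under stated assumptions, not the code in rag/query.py: `top_k` is a hypothetical helper, the index is an in-memory array rather than the real index.npz (whose key names are not documented here), and all vectors are assumed to be L2-normalized so that a dot product equals the cosine similarity.

```python
import numpy as np

def top_k(query_vec, index_vecs, k=5):
    """Return (row, score) pairs for the k most similar index rows.

    Assumes every vector is already L2-normalized, so the dot product
    equals the cosine similarity.
    """
    scores = index_vecs @ query_vec            # shape: (n_chunks,)
    k = min(k, scores.shape[0])
    order = np.argsort(-scores)[:k]            # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Toy 3-d index standing in for the real 768-d one.
index_vecs = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.6, 0.8, 0.0]])
query_vec = np.array([0.0, 1.0, 0.0])
hits = top_k(query_vec, index_vecs, k=2)
print(hits)
```

A brute-force matrix product like this is usually adequate for a personal markdown corpus; an approximate nearest-neighbour index only becomes worthwhile at much larger scales.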

Build the engine runner (Termux)

cd engine
# LIBDIR holds libLiteRt.so + the dispatch lib (extract from the Edge Gallery APK lib/arm64-v8a/;
# Google proprietary, not shipped here). The LiteRT C API is inlined in the .c (verified vs v2.1.1 headers).
LIBDIR=$HOME/litert_libs
clang -O2 -std=c11 embed_npu.c -o embed_npu \
    -L"$LIBDIR" -lLiteRt -Wl,--allow-shlib-undefined -ldl -lm
clang -O2 -std=c11 embed_npu_batch.c -o embed_npu_batch \
    -L"$LIBDIR" -lLiteRt -Wl,--allow-shlib-undefined -ldl -lm

Run (via rish/root, with edgetpu_vendor_service up):

LD_LIBRARY_PATH="$LIBDIR" ./embed_npu  EmbeddingGemma-300M_seq512_Google_Tensor_G4.tflite  "$LIBDIR"  <tok0> <tok1> ...
# writes ./embed_npu_out.f32 = 768 little-endian float32 (L2 norm ~1.0 = correctness signal)
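
The output file can be loaded and sanity-checked from Python. The sketch below assumes only what is stated above (768 raw little-endian float32 values with an L2 norm near 1.0); `read_embedding` and its `tol` threshold are hypothetical choices, and the demo writes a synthetic file so the snippet runs anywhere.

```python
import os
import tempfile
import numpy as np

EMBED_DIM = 768  # output width stated in the spec

def read_embedding(path, dim=EMBED_DIM, tol=1e-2):
    """Load a raw little-endian float32 vector and sanity-check it."""
    vec = np.fromfile(path, dtype="<f4")
    if vec.shape != (dim,):
        raise ValueError(f"expected {dim} floats, got {vec.size}")
    norm = float(np.linalg.norm(vec))
    if abs(norm - 1.0) > tol:  # tol is an arbitrary illustrative threshold
        raise ValueError(f"L2 norm {norm:.4f} is not ~1.0")
    return vec

# Demo with a synthetic file; on the phone, point this at ./embed_npu_out.f32.
demo = np.ones(EMBED_DIM) / np.sqrt(EMBED_DIM)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "embed_npu_out.f32")
    demo.astype("<f4").tofile(demo_path)
    vec = read_embedding(demo_path)
print(vec.shape, round(float(np.linalg.norm(vec)), 4))
```

A wrong element count or a norm far from 1.0 suggests the runner did not complete a real inference, which is why the norm works as a cheap correctness signal.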

Run the RAG / MCP server

# 1) Build an index over your markdown corpus (chunk = 200 tokens / 30 overlap)
python3 rag/index_corpus.py                      # -> index.npz

# 2) CLI query
python3 rag/query.py "your question" --k 5

# 3) MCP server (stdio) — what an agent connects to
bash rag/mcp-litertlm-rag-npu

Example MCP client configuration (registers the launcher as a stdio server):

{
  "mcpServers": {
    "litertlm-rag-npu": {
      "command": "/path/to/rag/mcp-litertlm-rag-npu",
      "args": []
    }
  }
}

Tool contract: search_context(query: str, k: int = 5) -> str — ranked chunks (score + source + text), query embedded on the Tensor G4 NPU.
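
On the wire, an MCP client invokes this tool with a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a message; `make_search_request` is a hypothetical helper, and the snippet does not show the `initialize` handshake that an MCP session normally starts with.

```python
import json

def make_search_request(query, k=5, request_id=1):
    """Build a JSON-RPC 2.0 'tools/call' message for search_context."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "search_context",
            "arguments": {"query": query, "k": k},
        },
    }

# The stdio transport sends one JSON message per line.
line = json.dumps(make_search_request("how is the index built?", k=3))
print(line)
```

In practice an MCP client library produces these messages for you; writing one by hand is mainly useful for debugging the server over a pipe.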

Paths: the scripts hard-code the author's Termux layout (/sdcard/agents/EmbeddingGemma, $HOME/npu_build, rish). Adjust the constants at the top of npu_embedder.py and mcp_server_npu.py to match yours.
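
Step 1 above chunks the corpus at 200 tokens with a 30-token overlap. One common way to produce such windows is sketched below; `chunk_tokens` is a hypothetical helper, and rag/index_corpus.py may handle boundaries differently (for example by respecting markdown headings).

```python
def chunk_tokens(tokens, size=200, overlap=30):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks

# 500 dummy token ids -> windows starting at 0, 170 and 340.
chunks = chunk_tokens(list(range(500)))
print([(c[0], c[-1], len(c)) for c in chunks])
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of embedding some tokens twice.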

Requirements

Pixel 9 / 9 Pro / 9 Pro XL / 9 Pro Fold (Tensor G4) with root (rish/Shizuku or su).
Termux with python3 + numpy.
libLiteRt.so + a Tensor-G4 dispatch lib (Google proprietary — extract from the Edge Gallery APK; not redistributed here).
edgetpu_vendor_service running.

License & attribution

Base model EmbeddingGemma-300M © Google, under the Gemma Terms of Use. This quantized/repackaged derivative inherits the Gemma license. Runtime: LiteRT / LiteRT-LM (Apache-2.0). LiteRT C API in engine/*.c is reproduced from the open-source headers. Google proprietary libraries are not included.

NPU port, native engine runner, Termux RAG, and MCP server by @xThr45hx.
