EmbeddingGemma-300M: Tensor G4 NPU + Termux RAG/MCP stack

A quantized, repackaged variant of google/embeddinggemma-300m that runs its embeddings on the Google Tensor G4 Edge TPU ("rio") NPU (Pixel 9 series), plus a complete on-device RAG stack that runs entirely in Termux on the phone: a native bionic engine runner, a numpy vector index, and an MCP server that exposes semantic search to any MCP client.

To our knowledge this is the first Tensor-G4 NPU build of EmbeddingGemma, and a working example of a fully on-device, NPU-accelerated RAG pipeline: query embedding runs on the Edge TPU, retrieval is local, and nothing leaves the device.

Verified on-device (2026-06-10): real NPU dispatch (135M tpu_active_cycle delta per embed) and a cosine similarity of 0.9946 against the CPU reference, i.e. the NPU embedding points in nearly the same direction as the CPU one (about 6° apart).
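The cosine check can be reproduced with a few lines of numpy. This is a minimal sketch, not the script used for the verification above: the two vectors here are synthetic stand-ins, and the helper names cosine_similarity and angle_degrees are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def angle_degrees(cos_sim):
    """Angle (in degrees) that corresponds to a cosine similarity."""
    return float(np.degrees(np.arccos(np.clip(cos_sim, -1.0, 1.0))))


# Synthetic stand-ins for the CPU and NPU embeddings (NOT real model output).
rng = np.random.default_rng(0)
cpu_vec = rng.standard_normal(768)
npu_vec = cpu_vec + 0.1 * rng.standard_normal(768)  # small simulated drift

sim = cosine_similarity(npu_vec, cpu_vec)
print(f"cosine={sim:.4f}  angle={angle_degrees(sim):.2f} deg")
```

In practice you would replace the synthetic vectors with the NPU output and a CPU-computed embedding of the same token sequence. Passing 0.9946 to angle_degrees gives an angle of roughly 6 degrees.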


What's in here

| Path | What it is |
| --- | --- |
| EmbeddingGemma-300M_seq512_Google_Tensor_G4.litertlm | NPU model bundle, seq 512 (embeds tokenizer + tflite + metadata) |
| EmbeddingGemma-300M_seq256_Google_Tensor_G4.litertlm | NPU model bundle, seq 256 (smaller / faster) |
| EmbeddingGemma-300M_seq512_Google_Tensor_G4.tflite | Raw TFLite (what the C engine runner consumes directly) |
| engine/embed_npu.c | Engine runner: native bionic LiteRT-C runner, single sequence → 768-d vector |
| engine/embed_npu_batch.c | Batch engine runner: model loaded once, streams N rows (for indexing) |
| engine/embed_hetero.c | Heterogeneous (NPU + CPU split) variant |
| engine/build.sh | Termux-clang build line for the runners |
| rag/npu_embedder.py | Python backend: shells to the runner via rish, returns the 768-d vector |
| rag/index_corpus.py / reindex_npu.py | Build the vector index over a markdown corpus → index.npz |
| rag/query.py / query_npu.py | CLI semantic search against the index |
| rag/tokenize_*.py, rag/litertlm_loader.py, rag/unbundle_litertlm.py | Tokenization + bundle plumbing |
| rag/mcp_server_npu.py | MCP server (stdio JSON-RPC), tool search_context(query, k) |
| rag/mcp-litertlm-rag-npu | MCP launcher script |

Spec: EmbeddingGemma-300M, 768-d output, L2-normalized, seq 256 / 512, mixed-precision, AOT-compiled for the Tensor G4 Edge TPU. The index.npz vector index is not shipped (it was built from a private corpus); build your own with rag/index_corpus.py.


Architecture (fully on-device, Termux-native)

  MCP client (Claude Code / any agent)
        │  stdio JSON-RPC: search_context(query, k)
        ▼
  rag/mcp_server_npu.py            (Termux python3; no pip 'mcp'/pydantic needed)
        │  tokenize query  ──► rag/tokenize_query.py
        │  embed query     ──► rag/npu_embedder.py
        │                          │  subprocess via rish (root)
        │                          ▼
        │                     engine/embed_npu  (native bionic ELF)
        │                          │  LiteRT C API → libLiteRt.so + GoogleTensor dispatch
        │                          ▼
        │                     ░░ Tensor G4 Edge TPU ░░  → 768-d vector
        │  cosine(query, index)  (numpy, pure-Python)
        ▼
  ranked chunks: score + source + text

The embedder runs on the NPU; tokenization and cosine search stay in pure-Python Termux. The glibc ai_edge_litert / sentencepiece path is only needed to unbundle / re-export the model; query-time embedding uses the native runner and needs neither.
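The retrieval step is small enough to sketch. Because the embeddings are L2-normalized, cosine similarity reduces to a dot product. The example below is an illustration under assumptions, not the code in rag/query.py: the function name top_k is hypothetical, and the index is assumed to be a matrix with one chunk embedding per row (the actual layout of index.npz is not documented here).

```python
import numpy as np


def top_k(query_vec, index_vecs, k=5):
    """Return (row, score) pairs for the k rows most similar to query_vec.

    Assumes every vector is already L2-normalized, so the dot product
    equals the cosine similarity.
    """
    scores = index_vecs @ query_vec          # one score per indexed chunk
    k = min(k, scores.shape[0])
    order = np.argsort(-scores)[:k]          # highest score first
    return [(int(i), float(scores[i])) for i in order]


# Toy 3-d index with unit-length rows (real embeddings are 768-d).
index = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.6, 0.8, 0.0]], dtype=np.float32)
query = np.array([0.8, 0.6, 0.0], dtype=np.float32)

hits = top_k(query, index, k=2)
print(hits)
```

A full argsort is fine for a corpus that fits in phone memory; np.argpartition would avoid sorting every score if the index grows large.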


Build the engine runner (Termux)

cd engine
# LIBDIR holds libLiteRt.so + the dispatch lib (extract from the Edge Gallery APK lib/arm64-v8a/;
# Google proprietary, not shipped here). The LiteRT C API is inlined in the .c (verified vs v2.1.1 headers).
LIBDIR=$HOME/litert_libs
clang -O2 -std=c11 embed_npu.c -o embed_npu \
    -L"$LIBDIR" -lLiteRt -Wl,--allow-shlib-undefined -ldl -lm
clang -O2 -std=c11 embed_npu_batch.c -o embed_npu_batch \
    -L"$LIBDIR" -lLiteRt -Wl,--allow-shlib-undefined -ldl -lm

Run (via rish/root, with edgetpu_vendor_service up):

LD_LIBRARY_PATH="$LIBDIR" ./embed_npu  EmbeddingGemma-300M_seq512_Google_Tensor_G4.tflite  "$LIBDIR"  <tok0> <tok1> ...
# writes ./embed_npu_out.f32 = 768 little-endian float32 (L2 norm ~1.0 = correctness signal)
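The output file can be sanity-checked from Python. The sketch below reads the raw little-endian float32 vector and checks its length and L2 norm, as described in the comment above. The function name load_embedding and the tolerance are assumptions for illustration; the demo uses a synthetic file so that it runs without the device.

```python
import os
import tempfile

import numpy as np

EMBED_DIM = 768  # output width stated in this README


def load_embedding(path, dim=EMBED_DIM, tol=1e-2):
    """Read a raw little-endian float32 vector and sanity-check it.

    `tol` is an assumed tolerance on the L2 norm, not a value from the runner.
    """
    vec = np.fromfile(path, dtype="<f4")
    if vec.size != dim:
        raise ValueError(f"expected {dim} floats, got {vec.size}")
    norm = float(np.linalg.norm(vec))
    if abs(norm - 1.0) > tol:
        raise ValueError(f"L2 norm {norm:.4f} is not ~1.0")
    return vec, norm


# Demo on a synthetic file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "embed_npu_out.f32")
    fake = np.full(EMBED_DIM, 1.0 / np.sqrt(EMBED_DIM)).astype("<f4")
    fake.tofile(demo_path)
    vec, norm = load_embedding(demo_path)
    print(vec.shape, round(norm, 3))
```

On the phone, point load_embedding at the real embed_npu_out.f32 written by the runner.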

Run the RAG / MCP server

# 1) Build an index over your markdown corpus (chunk = 200 tokens / 30 overlap)
python3 rag/index_corpus.py                      # -> index.npz

# 2) CLI query
python3 rag/query.py "your question" --k 5

# 3) MCP server (stdio): what an agent connects to
bash rag/mcp-litertlm-rag-npu
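The "200 tokens / 30 overlap" chunking in step 1 can be pictured as a sliding window over the token ids. This is a minimal sketch of that idea, assuming a plain fixed-stride window; rag/index_corpus.py may differ in details, and the function name chunk_tokens is hypothetical.

```python
def chunk_tokens(token_ids, chunk_size=200, overlap=30):
    """Split a token sequence into windows of chunk_size tokens,
    each sharing `overlap` tokens with the previous window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # the last window already reaches the end of the document
    return chunks


# Toy document of 500 token ids.
doc = list(range(500))
chunks = chunk_tokens(doc)
print([len(c) for c in chunks])
```

With these defaults every chunk stays within both the seq 256 and seq 512 models' input lengths, provided the tokenizer adds no more than a few dozen special or prompt tokens.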

Register with an MCP client (e.g. Claude Code .mcp.json / claude mcp add):

{
  "mcpServers": {
    "litertlm-rag-npu": {
      "command": "/path/to/rag/mcp-litertlm-rag-npu",
      "args": []
    }
  }
}

Tool contract: search_context(query: str, k: int = 5) -> str. Returns ranked chunks (score + source + text); the query is embedded on the Tensor G4 NPU.
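For clients that speak the protocol directly, a call to this tool is a single JSON-RPC message on the server's stdin. The sketch below assumes the server follows the standard MCP tools/call request and text-content result shapes; this README only states that the transport is stdio JSON-RPC, so treat the exact response layout as an assumption. The helper names are hypothetical.

```python
import json


def build_search_request(query, k=5, request_id=1):
    """Serialize a tools/call request for search_context as one JSON line."""
    msg = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "search_context",
            "arguments": {"query": query, "k": k},
        },
    }
    return json.dumps(msg)


def extract_text(response_line):
    """Pull the text payload out of a tools/call response line."""
    resp = json.loads(response_line)
    if "error" in resp:
        raise RuntimeError(resp["error"].get("message", "unknown error"))
    parts = resp["result"].get("content", [])
    return "\n".join(p["text"] for p in parts if p.get("type") == "text")


request_line = build_search_request("how do I rebuild the index?", k=3)
print(request_line)

# A made-up response, only to exercise the parser.
sample = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"content": [{"type": "text", "text": "0.91 notes.md ..."}]},
})
print(extract_text(sample))
```

A real session also begins with the MCP initialize handshake before any tools/call; MCP clients such as Claude Code handle that for you.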

Paths: the scripts use Termux fleet paths (/sdcard/agents/EmbeddingGemma, $HOME/npu_build, rish). Adjust the constants at the top of npu_embedder.py / mcp_server_npu.py for your layout.


Requirements

  • Pixel 9 / 9 Pro / 9 Pro XL / 9 Pro Fold (Tensor G4) with root (rish/Shizuku or su).
  • Termux with python3 + numpy.
  • libLiteRt.so + a Tensor-G4 dispatch lib (Google proprietary; extract from the Edge Gallery APK; not redistributed here).
  • edgetpu_vendor_service running.

License & attribution

Base model EmbeddingGemma-300M © Google, under the Gemma Terms of Use. This quantized/repackaged derivative inherits the Gemma license. Runtime: LiteRT / LiteRT-LM (Apache-2.0). The LiteRT C API in engine/*.c is reproduced from the open-source headers. Google proprietary libraries are not included.

NPU port, native engine runner, Termux RAG, and MCP server by @xThr45hx.
