This is a quantized, repackaged derivative of google/embeddinggemma-300m and is governed by the Gemma license. To access it you must review and agree to Google's Gemma Terms of Use.
EmbeddingGemma-300M: Tensor G4 NPU + Termux RAG/MCP stack
A quantized, repackaged variant of `google/embeddinggemma-300m` that runs its embeddings on the Google Tensor G4 Edge TPU ("rio") NPU (Pixel 9 series), plus a complete on-device RAG stack that runs entirely in Termux on the phone: a native bionic engine runner, a numpy vector index, and an MCP server that exposes semantic search to any MCP client.
To our knowledge this is the first Tensor-G4 NPU build of EmbeddingGemma, and a working example of fully on-device, NPU-accelerated RAG: query embedding runs on the Edge TPU, retrieval is local, and nothing leaves the device.
Verified on-device (2026-06-10): real NPU dispatch (135M tpu_active_cycle delta per embed) and a cosine similarity of 0.9946 against the CPU reference, i.e. the NPU embedding is closely aligned with the CPU embedding but not bit-identical.
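The cosine figure above compares one NPU embedding with the CPU embedding of the same input. A minimal sketch of that comparison follows; the two vectors here are synthetic stand-ins (a random vector and a slightly perturbed copy), not real model output, and the function name is illustrative rather than taken from the repository.

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two 1-D vectors (1.0 = same direction)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in data (assumption): a random "CPU" vector and a noisy "NPU" copy.
rng = np.random.default_rng(0)
cpu_vec = rng.standard_normal(768)
npu_vec = cpu_vec + 0.05 * rng.standard_normal(768)

print(f"cosine = {cosine_similarity(cpu_vec, npu_vec):.4f}")
```

With real data, `cpu_vec` and `npu_vec` would be the 768-d outputs of the CPU reference and the NPU runner for identical token input.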
What's in here
| Path | What it is |
|---|---|
| `EmbeddingGemma-300M_seq512_Google_Tensor_G4.litertlm` | NPU model bundle, seq 512 (embeds tokenizer + tflite + metadata) |
| `EmbeddingGemma-300M_seq256_Google_Tensor_G4.litertlm` | NPU model bundle, seq 256 (smaller / faster) |
| `EmbeddingGemma-300M_seq512_Google_Tensor_G4.tflite` | Raw TFLite (what the C engine runner consumes directly) |
| `engine/embed_npu.c` | Engine runner: native bionic LiteRT-C runner, single sequence → 768-d vector |
| `engine/embed_npu_batch.c` | Batch engine runner: model loaded once, streams N rows (for indexing) |
| `engine/embed_hetero.c` | Heterogeneous (NPU + CPU split) variant |
| `engine/build.sh` | Termux-clang build line for the runners |
| `rag/npu_embedder.py` | Python backend: shells to the runner via rish, returns the 768-d vector |
| `rag/index_corpus.py` / `reindex_npu.py` | Build the vector index over a markdown corpus → `index.npz` |
| `rag/query.py` / `query_npu.py` | CLI semantic search against the index |
| `rag/tokenize_*.py`, `rag/litertlm_loader.py`, `rag/unbundle_litertlm.py` | Tokenization + bundle plumbing |
| `rag/mcp_server_npu.py` | MCP server (stdio JSON-RPC), tool `search_context(query, k)` |
| `rag/mcp-litertlm-rag-npu` | MCP launcher script |
Spec: EmbeddingGemma-300M, 768-d output, L2-normalized, seq 256 / 512, mixed-precision, AOT-compiled for
the Tensor G4 Edge TPU. The index.npz vector index is not shipped (it's a private corpus); build your
own with rag/index_corpus.py.
Architecture (fully on-device, Termux-native)
```
MCP client (Claude Code / any agent)
   │  stdio JSON-RPC: search_context(query, k)
   ▼
rag/mcp_server_npu.py   (Termux python3; no pip 'mcp'/pydantic needed)
   │  tokenize query ──► rag/tokenize_query.py
   │  embed query    ──► rag/npu_embedder.py
   │                        │  subprocess via rish (root)
   │                        ▼
   │                     engine/embed_npu   (native bionic ELF)
   │                        │  LiteRT C API → libLiteRt.so + GoogleTensor dispatch
   │                        ▼
   │                     ── Tensor G4 Edge TPU ── → 768-d vector
   │  cosine(query, index)   (numpy, pure-Python)
   ▼
ranked chunks: score + source + text
```
The embedder runs on the NPU; tokenization and cosine search stay in pure-Python Termux. The glibc
ai_edge_litert / sentencepiece path is only needed to unbundle / re-export the model; query-time
embedding uses the native runner and needs neither.
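The `cosine(query, index)` step can be sketched in a few lines of numpy. This is an illustration under assumptions, not the code in `rag/query.py`: it assumes the index is a 2-D array with one L2-normalized row per chunk, so that a dot product equals cosine similarity. The function name and the toy data are hypothetical.

```python
import numpy as np

def top_k(query_vec: np.ndarray, index_vecs: np.ndarray, k: int = 5):
    """Return (row, score) pairs for the k index rows closest to the query.

    Assumes query_vec and every row of index_vecs are L2-normalized,
    so the dot product is the cosine similarity.
    """
    scores = index_vecs @ query_vec
    k = min(k, scores.shape[0])
    order = np.argsort(-scores)[:k]  # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Toy index (assumption): 100 random unit vectors standing in for chunk embeddings.
rng = np.random.default_rng(1)
index_vecs = rng.standard_normal((100, 768)).astype(np.float32)
index_vecs /= np.linalg.norm(index_vecs, axis=1, keepdims=True)

query_vec = index_vecs[42].copy()  # querying with a row should return that row first
hits = top_k(query_vec, index_vecs, k=3)
print(hits)
```

A full argsort is fine for a small on-device corpus; for much larger indexes `np.argpartition` would avoid sorting every score.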
Build the engine runner (Termux)
```bash
cd engine
# LIBDIR holds libLiteRt.so + the dispatch lib (extract from the Edge Gallery APK lib/arm64-v8a/;
# Google proprietary, not shipped here). The LiteRT C API is inlined in the .c (verified vs v2.1.1 headers).
LIBDIR=$HOME/litert_libs
clang -O2 -std=c11 embed_npu.c -o embed_npu \
  -L"$LIBDIR" -lLiteRt -Wl,--allow-shlib-undefined -ldl -lm
clang -O2 -std=c11 embed_npu_batch.c -o embed_npu_batch \
  -L"$LIBDIR" -lLiteRt -Wl,--allow-shlib-undefined -ldl -lm
```
Run (via rish/root, with edgetpu_vendor_service up):
```bash
LD_LIBRARY_PATH="$LIBDIR" ./embed_npu EmbeddingGemma-300M_seq512_Google_Tensor_G4.tflite "$LIBDIR" <tok0> <tok1> ...
# writes ./embed_npu_out.f32 = 768 little-endian float32 (L2 norm ~1.0 = correctness signal)
```
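The runner's output file can be sanity-checked from Python. The sketch below reads 768 little-endian float32 values and applies the L2-norm check mentioned above. The helper names and the tolerance are assumptions made for illustration, and the demo writes a synthetic file instead of reading real runner output.

```python
import os
import tempfile

import numpy as np

EMBED_DIM = 768  # output width stated in the spec

def load_embedding(path: str) -> np.ndarray:
    """Read a raw little-endian float32 vector and check its length."""
    vec = np.fromfile(path, dtype="<f4")
    if vec.shape != (EMBED_DIM,):
        raise ValueError(f"expected {EMBED_DIM} floats, got {vec.size}")
    return vec

def looks_normalized(vec: np.ndarray, tol: float = 1e-2) -> bool:
    """True if the L2 norm is within tol of 1.0 (tol is an arbitrary choice)."""
    return abs(float(np.linalg.norm(vec)) - 1.0) <= tol

# Demo with a synthetic unit vector written in the same raw format.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "embed_npu_out.f32")
    fake = (np.ones(EMBED_DIM) / np.sqrt(EMBED_DIM)).astype("<f4")
    fake.tofile(demo_path)
    vec = load_embedding(demo_path)
    print(vec.shape, looks_normalized(vec))
```

On the device, `load_embedding("./embed_npu_out.f32")` would replace the synthetic demo.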
Run the RAG / MCP server
```bash
# 1) Build an index over your markdown corpus (chunk = 200 tokens / 30 overlap)
python3 rag/index_corpus.py # -> index.npz
# 2) CLI query
python3 rag/query.py "your question" --k 5
# 3) MCP server (stdio): what an agent connects to
bash rag/mcp-litertlm-rag-npu
```
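The "200 tokens / 30 overlap" chunking in step 1 can be pictured as a sliding window. The sketch below is one plausible reading, assuming each chunk starts 170 tokens after the previous one; the actual logic in `rag/index_corpus.py` is not shown here and may differ, and the function name is hypothetical.

```python
def chunk_tokens(tokens, size=200, overlap=30):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

# Toy input: integers standing in for token ids.
chunks = chunk_tokens(list(range(500)))
print([len(c) for c in chunks])
```

Each chunk would then be embedded (for example via the batch runner) and stored as one row of the index.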
Register with an MCP client (e.g. Claude Code .mcp.json / claude mcp add):
```json
{
  "mcpServers": {
    "litertlm-rag-npu": {
      "command": "/path/to/rag/mcp-litertlm-rag-npu",
      "args": []
    }
  }
}
```
Tool contract: search_context(query: str, k: int = 5) -> str, returning ranked chunks (score + source + text),
with the query embedded on the Tensor G4 NPU.
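For reference, this is roughly what a client sends over stdio to invoke the tool. The sketch builds a JSON-RPC 2.0 `tools/call` request in the shape the MCP specification uses, with the tool name and arguments taken from the contract above. The helper name and the example query are hypothetical, and a real client also performs the MCP `initialize` handshake first, which is omitted here.

```python
import json

def make_search_request(query: str, k: int = 5, request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 'tools/call' request for the search_context tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "search_context",
            "arguments": {"query": query, "k": k},
        },
    }

# The stdio transport sends one JSON message per line.
line = json.dumps(make_search_request("how do I rebuild the index?", k=3))
print(line)
```

In practice an MCP client library handles this framing; the sketch only shows what travels over the pipe.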
Paths: the scripts use Termux fleet paths (`/sdcard/agents/EmbeddingGemma`, `$HOME/npu_build`, `rish`). Adjust the constants at the top of `npu_embedder.py` / `mcp_server_npu.py` for your layout.
Requirements
- Pixel 9 / 9 Pro / 9 Pro XL / 9 Pro Fold (Tensor G4) with root (`rish`/Shizuku or `su`).
- Termux with `python3` + `numpy`.
- `libLiteRt.so` + a Tensor-G4 dispatch lib (Google proprietary; extract from the Edge Gallery APK; not redistributed here).
- `edgetpu_vendor_service` running.
License & attribution
Base model EmbeddingGemma-300M © Google, under the Gemma Terms of Use.
This quantized/repackaged derivative inherits the Gemma license. Runtime: LiteRT / LiteRT-LM
(Apache-2.0). LiteRT C API in engine/*.c is reproduced from the open-source headers. Google proprietary
libraries are not included.
NPU port, native engine runner, Termux RAG, and MCP server by @xThr45hx.