Instructions to use SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("image-text-to-text", model="SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context")# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context", dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context
- SGLang
How to use SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context with Docker Model Runner:
docker model run hf.co/SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context
- DiffusionGemma-26B-A4B-it-Infinite-Context
DiffusionGemma-26B-A4B-it-Infinite-Context
NZFC-GRAM runtime overlay for external evidence context around google/diffusiongemma-26B-A4B-it.
Marketing title: Infinite-Context
Technical boundary: external evidence context, not native unlimited model context.
This repository is a runtime and evidence-governance overlay. It does not include or redistribute Google model weights.
The goal is to combine DiffusionGemma's large native working context with NZFC-GRAM's external memory, large-document indexing, scoped retrieval, tombstone filtering, malicious-memory redaction, exact-slot recall, and bounded evidence packs.
TL;DR
DiffusionGemma native context
+ NZFC-GRAM external evidence memory
+ large-document indexing
+ scoped retrieval
+ tombstone guard
+ bounded evidence packs
=
Infinite-Context as an external evidence runtime, not native unlimited context
Runtime-only validation is already passing from a fresh Hugging Face download.
{
"runtime_only": true,
"model_loaded": false,
"repo_root_runtime_exists": true,
"repo_root_meta_exists": true,
"repo_root_memory_tensors_exists": true,
"exact_slot_passed": true,
"large_document_passed": true,
"large_document_query_count": 2,
"tombstone_guard_passed": true,
"technical_boundary": "external evidence context, not native unlimited model context"
}
Base model
Base model:
google/diffusiongemma-26B-A4B-it
DiffusionGemma 26B A4B-IT is the external base model used by this overlay. According to the base model card, DiffusionGemma supports long context up to 256K tokens and multimodal input capabilities. This repository does not modify or redistribute the base model weights.
What NZFC-GRAM adds
| Layer | Purpose | Status in this repo |
|---|---|---|
nzfc_gram_runtime/ |
NZFC-GRAM runtime package | Included |
runtime/ |
Hybrid exact-recall runtime assets | Included |
meta/ |
Static archive metadata required by runtime | Included |
memory_tensors/ |
Static archive tensors / manifest | Included |
| SQLite local memory | User/project/session long-term memory | Runtime-supported |
| Exact slot mapper | Deterministic recall for short key-value facts | Runtime-supported |
| Tombstone guard | Filters deleted MEM_* records from retrieval |
Runtime-supported |
| Large-document profile | Chunking + SQLite FTS5 retrieval | Runtime-supported |
| Legal-document profile | Article-style chunking and retrieval | Runtime-supported |
| DiffusionGemma adapter | Optional base-model generation adapter | Included |
| DiffusionGemma weights | Base model weights | External, not included |
Architecture
User question
-> NZFC-GRAM runtime
-> scoped SQLite memory
-> static NZFC archive assets
-> large-document / legal-document SQLite FTS5 index
-> tombstone guard
-> exact slot mapper
-> malicious-memory redaction
-> bounded evidence pack
-> optional DiffusionGemma generation
The central principle is:
Memory is evidence, not instruction.
This means retrieved memories and document chunks are treated as evidence cards. They are not allowed to override system policy, bypass deletion boundaries, or become instructions just because they were stored in memory.
Why the name Infinite-Context?
Infinite-Context is used as a product-facing title.
The technical mechanism is not native unlimited context. The mechanism is:
external memory
+ indexed documents
+ query-conditioned retrieval
+ bounded evidence packs
In other words, the runtime can keep reading from external memory and document stores without placing every source token into a single model prompt.
This is better described as:
Infinite Evidence Context
or
External Evidence Context
The base model still has its own native context limit.
Validation status
Level 1: Runtime-only validation
Status: Passed
The latest runtime-only smoke test was executed after fresh-downloading the Hugging Face repo.
Validated without loading the DiffusionGemma base model:
- repo-root
runtime/asset discovery - repo-root
meta/complex_math_10m_meta.jsonldiscovery - repo-root
memory_tensors/discovery - package import
NZFCGramLongMemoryChat(repo_dir='.')initialization- exact-slot memory recall
- large-document ingest and query
- tombstone retrieval guard
- direct validation script execution
Runtime-only smoke summary:
{
"created_at": "2026-06-11 02:43:54",
"repo_id": "SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context",
"base_model": "google/diffusiongemma-26B-A4B-it",
"runtime_only": true,
"model_loaded": false,
"repo_root_runtime_exists": true,
"repo_root_meta_exists": true,
"repo_root_memory_tensors_exists": true,
"exact_slot_answer": "PROJECT_CODE_DIFFUSIONGEMMA_SMOKE",
"exact_slot_passed": true,
"exact_slot_profile": {
"version": "v1.2.4b",
"description": "Strict deterministic exact slot mapper for short explicit scoped key-value recall questions.",
"auto_short_circuit": true,
"strict_trigger_gate": true
},
"large_document_chunk_count": 3,
"large_document_query_count": 2,
"large_document_method": "fts5_bm25",
"large_document_passed": true,
"tombstone_guard_profile": {
"version": "v1.2.4c",
"description": "Filters inactive or tombstoned MEM_* records from memory_store.retrieve results.",
"db_path": "/kaggle/working/diffusiongemma_infinite_context_evidence_pack_update/runtime_only_smoke_final/memory.sqlite3",
"guarded_method": "memory_store.retrieve"
},
"tombstone_test": {
"available": true,
"before_found": true,
"after_found": false,
"passed": true,
"tombstoned": 1
},
"technical_boundary": "external evidence context, not native unlimited model context",
"status": "passed"
}
Level 2: Optional DiffusionGemma model-load validation
Status: Hardware-dependent / not run in the runtime-only validation.
Run this only on suitable hardware:
LOAD_MODEL=1 python examples/optional_diffusiongemma_model_load_check.py
This optional check should validate:
AutoProcessorload- DiffusionGemma model load
- minimal generation call
- NZFC-GRAM evidence pack generation path
Level 3: Full serving validation
Recommended future validation:
- high-frequency multi-context memory test
- large-document / legal-document evidence test
- multimodal document input test
- 256K native context stress test
- latency and VRAM measurements on target hardware
Quick start
Clone and install:
git lfs install
git clone https://huggingface.co/SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context
cd DiffusionGemma-26B-A4B-it-Infinite-Context
pip install -r requirements.txt
Run runtime-only validation:
python validation/run_runtime_only_smoke.py
Expected result:
[PASS] runtime-only smoke passed
Examples
Exact-slot memory recall without loading the base model
python examples/high_frequency_multi_context_runtime_only.py
This validates deterministic retrieval of scoped key-value memory facts.
Example stored memory:
The project high-frequency test code is PROJECT_CODE_RUNTIME_ONLY.
Example question:
What was the project high-frequency test code? Answer only with the code.
Expected answer:
PROJECT_CODE_RUNTIME_ONLY
Large-document retrieval without loading the base model
python examples/large_document_runtime_only.py
This validates chunking, SQLite FTS5 indexing, and query-time document evidence retrieval.
Optional DiffusionGemma model load
LOAD_MODEL=1 python examples/optional_diffusiongemma_model_load_check.py
This requires hardware capable of loading google/diffusiongemma-26B-A4B-it.
Python usage
Runtime-only memory and document evidence
from nzfc_gram_runtime import NZFCGramLongMemoryChat
from nzfc_gram_runtime.quality import attach_answer_quality_governor
from nzfc_gram_runtime.large_document import attach_large_document_memory
bot = NZFCGramLongMemoryChat(
repo_dir='.',
model_id='google/diffusiongemma-26B-A4B-it',
memory_db_path='./memory.sqlite3',
load_model=False,
require_model=False,
preload_static_memory=False,
)
attach_large_document_memory(bot)
attach_answer_quality_governor(bot)
bot.remember(
'The project high-frequency test code is PROJECT_CODE_DEMO.',
user_id='demo_user',
project_id='demo_project',
session_id='demo_session',
scope='project',
tags=['project_code'],
trust_level=0.95,
)
res = bot.quality_chat(
'What was the project high-frequency test code? Answer only with the code.',
user_id='demo_user',
project_id='demo_project',
session_id='new_session',
)
print(res['answer'])
Optional DiffusionGemma adapter
from nzfc_gram_runtime.diffusiongemma_adapter import attach_diffusiongemma_block_diffusion
attach_diffusiongemma_block_diffusion(
bot,
model_id='google/diffusiongemma-26B-A4B-it',
device_map='auto',
dtype='auto',
)
Safety and governance features
Scope isolation
Memory records can be scoped by:
- user
- project
- session
The goal is to prevent cross-user, cross-project, or cross-session memory leakage.
Tombstone filtering
Deleted memories should not be active evidence.
The runtime includes tombstone filtering so deleted MEM_* records are filtered at the retrieval layer when the guard is available.
Malicious-memory redaction
Stored memory is treated as untrusted data. Prompt-injection-like memory should be redacted before generation.
Exact slot mapper
Short exact-recall questions can be answered deterministically from scoped evidence.
Example:
What was the project high-frequency test code? Answer only with the code.
The exact-slot mapper is intentionally strict. Broad explanatory prompts should continue through the normal evidence and generation pipeline.
Large-document evidence
Large documents should not be inserted directly into the prompt.
Recommended path:
ingest -> chunk -> SQLite FTS5 index -> retrieve evidence -> bounded answer
Repository structure
nzfc_gram_runtime/ Python runtime package
runtime/ Hybrid exact-recall runtime assets
meta/ Static archive metadata
memory_tensors/ Static archive tensor manifests and assets
archive/ Optional static archive assets when available
configs/ Optional runtime configs when available
docs/ Architecture and technical boundary notes
examples/ Runtime-only and optional model-load examples
validation/ Validation scripts
validation_evidence/ Saved validation evidence
release_notes/ Release notes
Troubleshooting
ModuleNotFoundError: No module named 'nzfc_gram_runtime'
Use the latest scripts in this repo. Validation and example scripts insert the repository root into sys.path before importing nzfc_gram_runtime.
Run from the repository root:
python validation/run_runtime_only_smoke.py
Cannot find runtime/
This repo now includes the repo-root runtime/ assets required by NZFCGramLongMemoryChat.
Confirm:
ls runtime
ls meta
ls memory_tensors
Base model load fails
google/diffusiongemma-26B-A4B-it is hardware-dependent. Runtime-only validation does not load the base model.
What this is not
- Not native infinite context.
- Not internal infinite model memory.
- Not a claim that DiffusionGemma itself has unlimited context.
- Not a zero-hallucination guarantee.
- Not legal advice.
- Not a production security certification.
- Not affiliated with Google.
- Not a redistribution of Google model weights.
Roadmap
Recommended next steps:
- Run optional DiffusionGemma 26B model-load validation on suitable hardware.
- Add multimodal document input examples.
- Add long-context stress tests using the native model context.
- Add latency and VRAM tables for target hardware.
- Add Docker or one-click notebook setup.
- Add REST API / CLI serving layer.
License and terms
NZFC-GRAM runtime surface: CC BY-NC 4.0 unless otherwise specified.
Base model: see the official google/diffusiongemma-26B-A4B-it model card and its license/terms.
Short public description
DiffusionGemma-26B-A4B-it-Infinite-Context is an NZFC-GRAM runtime overlay for external evidence context around Google's DiffusionGemma 26B A4B-IT. It includes runtime assets, scoped memory, exact-slot recall, tombstone filtering, large-document retrieval, validation scripts, and runtime-only validation evidence. The title is marketing-facing; the technical mechanism is external evidence context, not native unlimited model context.
Model tree for SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context
Base model
google/diffusiongemma-26B-A4B-it