Instructions to use SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context")

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context", dtype="auto")

Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context

SGLang

How to use SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context with Docker Model Runner:
```
docker model run hf.co/SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context
```

DiffusionGemma-26B-A4B-it-Infinite-Context

NZFC-GRAM runtime overlay for external evidence context around google/diffusiongemma-26B-A4B-it.

Marketing title: Infinite-Context
Technical boundary: external evidence context, not native unlimited model context.

This repository is a runtime and evidence-governance overlay. It does not include or redistribute Google model weights.

The goal is to combine DiffusionGemma's large native working context with NZFC-GRAM's external memory, large-document indexing, scoped retrieval, tombstone filtering, malicious-memory redaction, exact-slot recall, and bounded evidence packs.

TL;DR

DiffusionGemma native context
+ NZFC-GRAM external evidence memory
+ large-document indexing
+ scoped retrieval
+ tombstone guard
+ bounded evidence packs
=
Infinite-Context as an external evidence runtime, not native unlimited context

Runtime-only validation is already passing from a fresh Hugging Face download.

{
  "runtime_only": true,
  "model_loaded": false,
  "repo_root_runtime_exists": true,
  "repo_root_meta_exists": true,
  "repo_root_memory_tensors_exists": true,
  "exact_slot_passed": true,
  "large_document_passed": true,
  "large_document_query_count": 2,
  "tombstone_guard_passed": true,
  "technical_boundary": "external evidence context, not native unlimited model context"
}

Base model

Base model:

google/diffusiongemma-26B-A4B-it

DiffusionGemma 26B A4B-IT is the external base model used by this overlay. According to the base model card, DiffusionGemma supports long context up to 256K tokens and multimodal input capabilities. This repository does not modify or redistribute the base model weights.

What NZFC-GRAM adds

Layer	Purpose	Status in this repo
`nzfc_gram_runtime/`	NZFC-GRAM runtime package	Included
`runtime/`	Hybrid exact-recall runtime assets	Included
`meta/`	Static archive metadata required by runtime	Included
`memory_tensors/`	Static archive tensors / manifest	Included
SQLite local memory	User/project/session long-term memory	Runtime-supported
Exact slot mapper	Deterministic recall for short key-value facts	Runtime-supported
Tombstone guard	Filters deleted `MEM_*` records from retrieval	Runtime-supported
Large-document profile	Chunking + SQLite FTS5 retrieval	Runtime-supported
Legal-document profile	Article-style chunking and retrieval	Runtime-supported
DiffusionGemma adapter	Optional base-model generation adapter	Included
DiffusionGemma weights	Base model weights	External, not included

Architecture

User question
  -> NZFC-GRAM runtime
  -> scoped SQLite memory
  -> static NZFC archive assets
  -> large-document / legal-document SQLite FTS5 index
  -> tombstone guard
  -> exact slot mapper
  -> malicious-memory redaction
  -> bounded evidence pack
  -> optional DiffusionGemma generation

The central principle is:

Memory is evidence, not instruction.

This means retrieved memories and document chunks are treated as evidence cards. They are not allowed to override system policy, bypass deletion boundaries, or become instructions just because they were stored in memory.

Why the name Infinite-Context?

Infinite-Context is used as a product-facing title.

The technical mechanism is not native unlimited context. The mechanism is:

external memory
+ indexed documents
+ query-conditioned retrieval
+ bounded evidence packs

In other words, the runtime can keep reading from external memory and document stores without placing every source token into a single model prompt.

This is better described as:

Infinite Evidence Context
or
External Evidence Context

The base model still has its own native context limit.

Validation status

Level 1: Runtime-only validation

Status: Passed

The latest runtime-only smoke test was executed after fresh-downloading the Hugging Face repo.

Validated without loading the DiffusionGemma base model:

repo-root runtime/ asset discovery
repo-root meta/complex_math_10m_meta.jsonl discovery
repo-root memory_tensors/ discovery
package import
NZFCGramLongMemoryChat(repo_dir='.') initialization
exact-slot memory recall
large-document ingest and query
tombstone retrieval guard
direct validation script execution

Runtime-only smoke summary:

{
  "created_at": "2026-06-11 02:43:54",
  "repo_id": "SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context",
  "base_model": "google/diffusiongemma-26B-A4B-it",
  "runtime_only": true,
  "model_loaded": false,
  "repo_root_runtime_exists": true,
  "repo_root_meta_exists": true,
  "repo_root_memory_tensors_exists": true,
  "exact_slot_answer": "PROJECT_CODE_DIFFUSIONGEMMA_SMOKE",
  "exact_slot_passed": true,
  "exact_slot_profile": {
    "version": "v1.2.4b",
    "description": "Strict deterministic exact slot mapper for short explicit scoped key-value recall questions.",
    "auto_short_circuit": true,
    "strict_trigger_gate": true
  },
  "large_document_chunk_count": 3,
  "large_document_query_count": 2,
  "large_document_method": "fts5_bm25",
  "large_document_passed": true,
  "tombstone_guard_profile": {
    "version": "v1.2.4c",
    "description": "Filters inactive or tombstoned MEM_* records from memory_store.retrieve results.",
    "db_path": "/kaggle/working/diffusiongemma_infinite_context_evidence_pack_update/runtime_only_smoke_final/memory.sqlite3",
    "guarded_method": "memory_store.retrieve"
  },
  "tombstone_test": {
    "available": true,
    "before_found": true,
    "after_found": false,
    "passed": true,
    "tombstoned": 1
  },
  "technical_boundary": "external evidence context, not native unlimited model context",
  "status": "passed"
}

Level 2: Optional DiffusionGemma model-load validation

Status: Hardware-dependent / not run in the runtime-only validation.

Run this only on suitable hardware:

LOAD_MODEL=1 python examples/optional_diffusiongemma_model_load_check.py

This optional check should validate:

AutoProcessor load
DiffusionGemma model load
minimal generation call
NZFC-GRAM evidence pack generation path

Level 3: Full serving validation

Recommended future validation:

high-frequency multi-context memory test
large-document / legal-document evidence test
multimodal document input test
256K native context stress test
latency and VRAM measurements on target hardware

Quick start

Clone and install:

git lfs install
git clone https://huggingface.co/SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context
cd DiffusionGemma-26B-A4B-it-Infinite-Context
pip install -r requirements.txt

Run runtime-only validation:

python validation/run_runtime_only_smoke.py

Expected result:

[PASS] runtime-only smoke passed

Examples

Exact-slot memory recall without loading the base model

python examples/high_frequency_multi_context_runtime_only.py

This validates deterministic retrieval of scoped key-value memory facts.

Example stored memory:

The project high-frequency test code is PROJECT_CODE_RUNTIME_ONLY.

Example question:

What was the project high-frequency test code? Answer only with the code.

Expected answer:

PROJECT_CODE_RUNTIME_ONLY

Large-document retrieval without loading the base model

python examples/large_document_runtime_only.py

This validates chunking, SQLite FTS5 indexing, and query-time document evidence retrieval.

Optional DiffusionGemma model load

LOAD_MODEL=1 python examples/optional_diffusiongemma_model_load_check.py

This requires hardware capable of loading google/diffusiongemma-26B-A4B-it.

Python usage

Runtime-only memory and document evidence

from nzfc_gram_runtime import NZFCGramLongMemoryChat
from nzfc_gram_runtime.quality import attach_answer_quality_governor
from nzfc_gram_runtime.large_document import attach_large_document_memory

bot = NZFCGramLongMemoryChat(
    repo_dir='.',
    model_id='google/diffusiongemma-26B-A4B-it',
    memory_db_path='./memory.sqlite3',
    load_model=False,
    require_model=False,
    preload_static_memory=False,
)

attach_large_document_memory(bot)
attach_answer_quality_governor(bot)

bot.remember(
    'The project high-frequency test code is PROJECT_CODE_DEMO.',
    user_id='demo_user',
    project_id='demo_project',
    session_id='demo_session',
    scope='project',
    tags=['project_code'],
    trust_level=0.95,
)

res = bot.quality_chat(
    'What was the project high-frequency test code? Answer only with the code.',
    user_id='demo_user',
    project_id='demo_project',
    session_id='new_session',
)

print(res['answer'])

Optional DiffusionGemma adapter

from nzfc_gram_runtime.diffusiongemma_adapter import attach_diffusiongemma_block_diffusion

attach_diffusiongemma_block_diffusion(
    bot,
    model_id='google/diffusiongemma-26B-A4B-it',
    device_map='auto',
    dtype='auto',
)

Safety and governance features

Scope isolation

Memory records can be scoped by:

user
project
session

The goal is to prevent cross-user, cross-project, or cross-session memory leakage.

Tombstone filtering

Deleted memories should not be active evidence.

The runtime includes tombstone filtering so deleted MEM_* records are filtered at the retrieval layer when the guard is available.

Malicious-memory redaction

Stored memory is treated as untrusted data. Prompt-injection-like memory should be redacted before generation.

Exact slot mapper

Short exact-recall questions can be answered deterministically from scoped evidence.

Example:

What was the project high-frequency test code? Answer only with the code.

The exact-slot mapper is intentionally strict. Broad explanatory prompts should continue through the normal evidence and generation pipeline.

Large-document evidence

Large documents should not be inserted directly into the prompt.

Recommended path:

ingest -> chunk -> SQLite FTS5 index -> retrieve evidence -> bounded answer

Repository structure

nzfc_gram_runtime/       Python runtime package
runtime/                 Hybrid exact-recall runtime assets
meta/                    Static archive metadata
memory_tensors/          Static archive tensor manifests and assets
archive/                 Optional static archive assets when available
configs/                 Optional runtime configs when available
docs/                    Architecture and technical boundary notes
examples/                Runtime-only and optional model-load examples
validation/              Validation scripts
validation_evidence/     Saved validation evidence
release_notes/           Release notes

Troubleshooting

`ModuleNotFoundError: No module named 'nzfc_gram_runtime'`

Use the latest scripts in this repo. Validation and example scripts insert the repository root into sys.path before importing nzfc_gram_runtime.

Run from the repository root:

python validation/run_runtime_only_smoke.py

`Cannot find runtime/`

This repo now includes the repo-root runtime/ assets required by NZFCGramLongMemoryChat.

Confirm:

ls runtime
ls meta
ls memory_tensors

Base model load fails

google/diffusiongemma-26B-A4B-it is hardware-dependent. Runtime-only validation does not load the base model.

What this is not

Not native infinite context.
Not internal infinite model memory.
Not a claim that DiffusionGemma itself has unlimited context.
Not a zero-hallucination guarantee.
Not legal advice.
Not a production security certification.
Not affiliated with Google.
Not a redistribution of Google model weights.

Roadmap

Recommended next steps:

Run optional DiffusionGemma 26B model-load validation on suitable hardware.
Add multimodal document input examples.
Add long-context stress tests using the native model context.
Add latency and VRAM tables for target hardware.
Add Docker or one-click notebook setup.
Add REST API / CLI serving layer.

License and terms

NZFC-GRAM runtime surface: CC BY-NC 4.0 unless otherwise specified.

Base model: see the official google/diffusiongemma-26B-A4B-it model card and its license/terms.

Short public description

DiffusionGemma-26B-A4B-it-Infinite-Context is an NZFC-GRAM runtime overlay for external evidence context around Google's DiffusionGemma 26B A4B-IT. It includes runtime assets, scoped memory, exact-slot recall, tombstone filtering, large-document retrieval, validation scripts, and runtime-only validation evidence. The title is marketing-facing; the technical mechanism is external evidence context, not native unlimited model context.