OkeyMeta committed
Commit 2147ce8 · verified · 1 parent: d641a1b

Release Reframr-RFM-v1-Base public checkpoint

Public v1 base release for Reframr RFM. Internal provenance: v95 computed checkpoint. Includes model.safetensors, tokenizer, runtime source, config, generation examples, and model card.

README.md ADDED
@@ -0,0 +1,141 @@
---
language:
- en
tags:
- reframr
- okeymeta
- non-transformer
- recurrent-memory
- computed-weights
- cpu-inference
- safetensors
library_name: reframr
pipeline_tag: text-generation
license: other
base_model: scratch
---

# Reframr-RFM-v1-Base

**Reframr-RFM-v1-Base** is the first public base checkpoint from **OkeyMeta Ltd** for the Reframr line of non-Transformer language models. Reframr is built from scratch around recurrent memory, computed weights, and data-derived structure rather than a Transformer attention stack.

This release is packaged as `model.safetensors` with the matching `tokenizer.json`, runtime source, config, and runnable examples. A larger production Reframr line, including tool-use and web-freshness data, is being computed after this release.

## What It Is

Reframr-RFM stands for **Recurrent Flow Memory**. The model is designed around a persistent recurrent state instead of a fixed quadratic attention map, so the architecture has no fixed attention-window context limit; practical limits come from runtime session length, machine memory, and deployment policy.

This checkpoint is not a Transformer, not a fine-tuned clone of a Transformer, and not a prompt wrapper. It uses the Reframr runtime included in this repository, and its checkpoint kind is `reframr-analytical`.
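To make the contrast concrete, here is a deliberately simplified, illustrative sketch of what a persistent recurrent state looks like in code. It is **not** Reframr's actual update rule (the runtime in this repository implements that internally); it only shows why per-step cost and memory stay constant as the session grows.

```python
# Illustrative only: NOT Reframr's real update rule. A generic leaky-integrator
# state shows the shape of recurrent-memory inference: one fixed-size state
# vector is updated per token, so cost does not grow with context length.
def update_state(state: list[float], features: list[float], decay: float = 0.9) -> list[float]:
    # Blend the old state with the new token's features; the state size (for
    # this checkpoint, width 576) never changes, however long the session runs.
    return [decay * s + (1.0 - decay) * f for s, f in zip(state, features)]
```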
## Model Files

- `model.safetensors`: Reframr v1 computed-weight checkpoint.
- `tokenizer.json`: FrameToken tokenizer exported from the checkpoint metadata.
- `config.json`: Release metadata and tensor layout.
- `generation_config.json`: Recommended default generation settings.
- `reframr/`: CPU-first Reframr runtime source.
- `examples/`: Minimal CLI, JSONL, and Python usage examples.

## Quick Start

Use Python 3.13 or newer from the root of this model repository:

```bash
python -m pip install -r requirements.txt
python -m reframr generate \
  --model model.safetensors \
  --context "Who are you, and what makes you different from Transformer models?" \
  --max-tokens 90 \
  --temperature 0.92 \
  --decode-top-k 72 \
  --decode-top-p 0.92
```

System instructions are passed as learned context:

```bash
python -m reframr generate \
  --model model.safetensors \
  --system "Answer in two short paragraphs. Be direct and warm." \
  --context "Explain why clean data matters when computing Reframr weights." \
  --max-tokens 90 \
  --temperature 0.9
```

For a persistent process that loads the checkpoint once and accepts JSONL requests:

```bash
python -m reframr serve --model model.safetensors --max-tokens 96
```

Then send one JSON object per line:

```jsonl
{"prompt":"Tell a short story about a glass library under the sea.","temperature":1.05,"decode_top_k":90,"max_tokens":120}
{"system":"Use exactly one fitting emoji.","prompt":"Encourage a tired engineer without sounding generic.","max_tokens":70}
```
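For scripted use, the serve loop can be driven from Python over pipes. This is a minimal sketch, assuming only the request/response shape shown above (one JSON object per request line, responses carrying a `generated_text` field); it is not a separate client API shipped with the runtime.

```python
import json
import subprocess

# Start the persistent server once; it keeps the checkpoint loaded.
proc = subprocess.Popen(
    ["python", "-m", "reframr", "serve", "--model", "model.safetensors", "--max-tokens", "96"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)
# Write one JSON request per line, then read one JSON response per line.
proc.stdin.write(json.dumps({"prompt": "Who are you, and who built you?", "max_tokens": 80}) + "\n")
proc.stdin.flush()
print(json.loads(proc.stdout.readline())["generated_text"])
proc.stdin.close()
proc.wait()
```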
## Python Example

```python
from pathlib import Path
from reframr.model import ReframrModel

root = Path(__file__).resolve().parent
model = ReframrModel.load(root / "model.safetensors")

text = model.generate_text(
    "Who are you?",
    max_tokens=80,
    temperature=0.92,
    top_k=72,
    top_p=0.92,
    repetition_penalty=1.18,
)
print(text)
```

## Generation Controls

- `--temperature`: Higher values increase variation. Try `0.85` for focused answers and `1.05` for story or brainstorming prompts (the sketch after this list shows how the three sampling controls compose).
- `--decode-top-k`: Limits sampling to the strongest candidate set. Recommended range: `50` to `100`.
- `--decode-top-p`: Nucleus cutoff. Recommended default: `0.92`.
- `--repetition-penalty`: Penalizes repeated tokens. Recommended default: `1.18`.
- `--system`: Adds a system instruction before the user prompt.
- `--reasoning-mode`: Supports `none`, `deep`, `memory`, and `tool` profiles in the runtime. The current public checkpoint is a base release; the dedicated tool/web-freshness line is still being computed.
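How these sampling controls compose is easy to state in code. The sketch below is illustrative, not the runtime's internal sampler: temperature rescales the scores first, top-k keeps the strongest candidates, and top-p trims the ranked list to the smallest set whose probability mass reaches the cutoff.

```python
import math
import random

# Illustrative sampler: temperature -> top-k -> top-p, in that order.
def sample_next(logits: dict[str, float], temperature: float = 0.92,
                top_k: int = 72, top_p: float = 0.92) -> str:
    scaled = {token: score / temperature for token, score in logits.items()}
    ranked = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(math.exp(score) for _, score in ranked)
    nucleus, mass = [], 0.0
    for token, score in ranked:
        probability = math.exp(score) / total
        nucleus.append((token, probability))
        mass += probability
        if mass >= top_p:  # nucleus cutoff reached
            break
    tokens, weights = zip(*nucleus)
    return random.choices(tokens, weights=weights, k=1)[0]
```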
## Identity

Reframr is built by **OkeyMeta Ltd**. The Reframr line reframes language intelligence around recurrent memory, computed weights, and evidence from data. OkeyMeta Ltd was founded in 2022. The founder and CEO is **Okechukwu Goodnews Nwaozor**.

## Architecture Snapshot

| Property | Reframr-RFM-v1-Base |
| --- | --- |
| Family | Reframr / Recurrent Flow Memory |
| Organization | OkeyMeta Ltd |
| Checkpoint kind | `reframr-analytical` |
| Attention stack | None |
| Transformer layers | None |
| Tokenizer | FrameToken |
| Weight file | `model.safetensors` |
| Runtime | CPU-first Reframr Python runtime |
| Embedding dim | 96 |
| State dim | 48 |
| State width | 576 |
| Output vocab rows | 2,793 |
| Tokenizer vocab size | 3,741 |
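The full tensor layout behind these numbers lives in `config.json` under `tensor_shapes`. As a quick sanity check (a sketch, assuming it runs from the repository root), the element count implied by those shapes can be totalled directly:

```python
import json
import math

# Sum the products of every tensor shape listed in the release config.
with open("config.json", encoding="utf-8") as fh:
    shapes = json.load(fh)["tensor_shapes"]
total = sum(math.prod(shape) for shape in shapes.values())
print(f"{len(shapes)} tensors, {total:,} elements in total")
```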
## Intended Use

This checkpoint is intended for public testing of the Reframr runtime: open-ended generation experiments, system-instruction experiments, story generation, safety-behavior and identity probes, and CPU-first research into non-Transformer language modeling.

It is a base checkpoint, not a medical, legal, financial, or safety-critical authority. For fresh factual questions, connect a retrieval or web-search tool in the next tool-aware Reframr line rather than relying on static checkpoint knowledge alone.

## Release Note

This release is the public v1 base checkpoint. Internally, it comes from the v95 tracked compute run; publicly, it begins the Reframr-RFM v1 line. The next production line is being computed with broader data, tool-use supervision, web-search protocol tokens, and larger generalization probes. The goal is simple: make Reframr a serious, CPU-first, non-Transformer model family that learns from data rather than from hardcoded responses.

## Ownership

Copyright OkeyMeta Ltd. All rights reserved unless a separate license is supplied by OkeyMeta Ltd.
config.json ADDED
@@ -0,0 +1,107 @@
{
  "model_type": "reframr-rfm",
  "model_name": "Reframr-RFM-v1-Base",
  "library_name": "reframr",
  "checkpoint_kind": "reframr-analytical",
  "schema_version": "1",
  "architecture": "Reverse-Flow Recurrent Analytical Memory / Recurrent Flow Memory",
  "organization": "OkeyMeta Ltd",
  "creator": "OkeyMeta Ltd",
  "runtime": "CPU-first Reframr Python runtime included in this repository",
  "format": "safetensors",
  "weights_file": "model.safetensors",
  "tokenizer_file": "tokenizer.json",
  "tokenizer_name": "FrameToken",
  "tokenizer_vocab_size": 3741,
  "vocab_size": 2793,
  "embedding_dim": 96,
  "state_dim": 48,
  "state_width": 576,
  "tensor_count": 21,
  "tensor_shapes": {
    "answer_keys": [18000, 576],
    "answer_sequence_keys": [8400, 576],
    "answer_sequence_prompt_tokens": [8400, 192],
    "answer_sequence_tokens": [8400, 192],
    "answer_start_keys": [18000, 576],
    "answer_start_values": [18000],
    "answer_values": [18000],
    "associative_keys": [18000, 576],
    "associative_values": [18000],
    "embedding_table": [2793, 96],
    "preference_bias": [2793],
    "prompt_answer_bias": [2793],
    "prompt_answer_start_bias": [2793],
    "prompt_answer_start_weights": [2793, 576],
    "prompt_answer_weights": [2793, 576],
    "readout_bias": [2793],
    "readout_weights": [2793, 576],
    "state_offset": [576],
    "ternary_mask": [576],
    "ternary_scale": [1],
    "trace_token_weights": [2793]
  },
  "lowercase": false,
  "default_reasoning_profile": "none",
  "attention": "none",
  "transformer": "false",
  "weight_derivation": "computed analytical/statistical checkpoint from OkeyMeta curriculum data; no Transformer attention stack",
  "context_model": "recurrent persistent memory state; practical limits depend on runtime session and machine memory",
  "current_release": "public base checkpoint",
  "next_line": "tool-aware and web-freshness data line is being computed after this release",
  "public_version": "v1",
  "internal_compute_run": "v95",
  "internal_source_checkpoint": "reframr-v95-500b-effective-fullreadout-outside-probe-generalization-e96-s48.safetensors"
}
examples/jsonl_serve.ps1 ADDED
@@ -0,0 +1,7 @@
$requests = @'
{"prompt":"Who are you, and who built you?","max_tokens":80,"temperature":0.9}
{"system":"Answer in two short paragraphs and use exactly one fitting emoji.","prompt":"Encourage a tired engineer who is still building carefully.","max_tokens":80,"temperature":0.95}
{"prompt":"Tell a short story about a glass library under the sea.","max_tokens":120,"temperature":1.05,"decode_top_k":90}
'@

$requests | python -m reframr serve --model model.safetensors --max-tokens 96 --temperature 0.92 --decode-top-k 72 --decode-top-p 0.92
examples/python_inference.py ADDED
@@ -0,0 +1,44 @@
from __future__ import annotations

import argparse
import sys
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parents[1]
if str(REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(REPO_ROOT))

from reframr.model import ReframrModel


def main() -> None:
    parser = argparse.ArgumentParser(description="Run Reframr-RFM-v1-Base locally.")
    parser.add_argument("--model", default=str(REPO_ROOT / "model.safetensors"))
    parser.add_argument("--prompt", default="Who are you, and what makes Reframr different?")
    parser.add_argument("--system", default="")
    parser.add_argument("--max-tokens", type=int, default=90)
    parser.add_argument("--temperature", type=float, default=0.92)
    parser.add_argument("--top-k", type=int, default=72)
    parser.add_argument("--top-p", type=float, default=0.92)
    parser.add_argument("--repetition-penalty", type=float, default=1.18)
    args = parser.parse_args()

    context = args.prompt
    if args.system.strip():
        context = f"System instruction: {args.system.strip()}\nUser: {args.prompt}"

    model = ReframrModel.load(args.model)
    print(
        model.generate_text(
            context,
            max_tokens=args.max_tokens,
            temperature=args.temperature,
            top_k=args.top_k,
            top_p=args.top_p,
            repetition_penalty=args.repetition_penalty,
        )
    )


if __name__ == "__main__":
    main()
generation_config.json ADDED
@@ -0,0 +1,8 @@
{
  "max_tokens": 96,
  "temperature": 0.92,
  "decode_top_k": 72,
  "decode_top_p": 0.92,
  "repetition_penalty": 1.18,
  "reasoning_profile": "none"
}
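These defaults mirror the README's recommended settings. A minimal sketch of applying them through the Python API, assuming the key names map onto `generate_text` arguments the way the README example does:

```python
import json
from reframr.model import ReframrModel

# Load the repository defaults and pass them straight to generation.
with open("generation_config.json", encoding="utf-8") as fh:
    defaults = json.load(fh)
model = ReframrModel.load("model.safetensors")
print(model.generate_text(
    "Summarize Reframr in one sentence.",
    max_tokens=defaults["max_tokens"],
    temperature=defaults["temperature"],
    top_k=defaults["decode_top_k"],
    top_p=defaults["decode_top_p"],
    repetition_penalty=defaults["repetition_penalty"],
))
```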
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:28d9eb4844b8aa4e337c18bf78e5b12fcf214b876fb5cd2e6e1fa556c7f70f2b
size 205798796
pyproject.toml ADDED
@@ -0,0 +1,17 @@
[project]
name = "reframr"
version = "0.1.0"
description = "CPU-first analytical language modeling research framework for REFRAMR."
requires-python = ">=3.13"
dependencies = [
    "numpy>=2.1,<3",
    "scipy>=1.14,<2",
    "datasets>=4.1,<5",
]

[project.scripts]
reframr = "reframr.cli:main"

[build-system]
requires = ["setuptools>=68"]
build-backend = "setuptools.build_meta"
reframr/__init__.py ADDED
@@ -0,0 +1,32 @@
import sys
from pathlib import Path

_VENDOR_ROOT = Path(__file__).resolve().parent.parent / ".vendor"
for _vendor_path in (_VENDOR_ROOT / "python", _VENDOR_ROOT / "sitepkgs"):
    if _vendor_path.exists():
        vendor_text = str(_vendor_path)
        if vendor_text not in sys.path:
            sys.path.insert(0, vendor_text)

from .checkpoint import inspect_checkpoint, read_safetensor_file
from .config import ReframrConfig
from .embeddings import EmbeddingModel, fit_ppmi_embedding
from .hippo import AnalyticalMemoryUnit, hippo_legs_matrix
from .model import ReframrModel
from .reasoning import REASONING_CONTROL_TOKENS, REASONING_PROFILES, TOKENIZER_NAME
from .tokenizer import NativeTokenizer

__all__ = [
    "AnalyticalMemoryUnit",
    "EmbeddingModel",
    "NativeTokenizer",
    "REASONING_CONTROL_TOKENS",
    "REASONING_PROFILES",
    "ReframrConfig",
    "ReframrModel",
    "TOKENIZER_NAME",
    "fit_ppmi_embedding",
    "hippo_legs_matrix",
    "inspect_checkpoint",
    "read_safetensor_file",
]
reframr/__main__.py ADDED
@@ -0,0 +1,5 @@
from .cli import main


if __name__ == "__main__":
    raise SystemExit(main())
reframr/checkpoint.py ADDED
@@ -0,0 +1,274 @@
import json
import math
import site
import struct
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import Any

_VENDOR_ROOT = Path(__file__).resolve().parent.parent / ".vendor"
for _vendor_path in (_VENDOR_ROOT / "python", _VENDOR_ROOT / "sitepkgs"):
    if _vendor_path.exists():
        vendor_text = str(_vendor_path)
        if vendor_text not in sys.path:
            sys.path.insert(0, vendor_text)

try:
    import numpy as np
except ModuleNotFoundError:
    user_site = site.getusersitepackages()
    if user_site and user_site not in sys.path:
        sys.path.append(user_site)
    try:
        import numpy as np
    except ModuleNotFoundError:
        np = None

if np is not None and not hasattr(np, "asarray"):
    np = None

DTYPE_CODES = {
    "F32": ("f", 4),
    "F64": ("d", 8),
    "I32": ("i", 4),
}


@dataclass(slots=True)
class SafeTensorFile:
    tensors: dict[str, Any]
    metadata: dict[str, str]


def _read_safetensor_header(path: str | Path) -> dict[str, Any]:
    with Path(path).open("rb") as handle:
        length_bytes = handle.read(8)
        if len(length_bytes) < 8:
            raise ValueError("Invalid safetensors file: missing header length.")
        header_length = struct.unpack("<Q", length_bytes)[0]
        header_bytes = handle.read(header_length)
        if len(header_bytes) != header_length:
            raise ValueError("Invalid safetensors file: truncated header.")
        return json.loads(header_bytes.decode("utf-8"))


def _shape_of(value: Any) -> list[int]:
    if np is not None and hasattr(value, "shape"):
        return [int(axis) for axis in value.shape]
    if not isinstance(value, list):
        return []
    if not value:
        return [0]
    first_shape = _shape_of(value[0])
    for item in value[1:]:
        if _shape_of(item) != first_shape:
            raise ValueError("Safetensor writer does not support ragged tensors.")
    return [len(value)] + first_shape


def _flatten(value: Any) -> list[Any]:
    if np is not None and hasattr(value, "reshape"):
        return value.reshape(-1).tolist()
    if isinstance(value, list):
        flattened: list[Any] = []
        for item in value:
            flattened.extend(_flatten(item))
        return flattened
    return [value]


def _dtype_of(flat_values: list[Any]) -> str:
    if all(isinstance(value, int) and not isinstance(value, bool) for value in flat_values):
        return "I32"
    return "F64"


def _pack_tensor(dtype: str, values: list[Any]) -> bytes:
    if not values:
        return b""
    code, _ = DTYPE_CODES[dtype]
    cast_values = [int(value) for value in values] if dtype == "I32" else [float(value) for value in values]
    return struct.pack(f"<{len(cast_values)}{code}", *cast_values)


def _array_payload(value: Any) -> tuple[str, list[int], Any] | None:
    if np is None:
        return None
    try:
        array = np.asarray(value)
    except (TypeError, ValueError):
        return None
    if array.dtype == object:
        return None
    shape = [int(axis) for axis in array.shape]
    if np.issubdtype(array.dtype, np.integer) and not np.issubdtype(array.dtype, np.bool_):
        return "I32", shape, np.ascontiguousarray(array.astype("<i4", copy=False))
    if np.issubdtype(array.dtype, np.floating):
        if array.dtype == np.float32:
            return "F32", shape, np.ascontiguousarray(array.astype("<f4", copy=False))
        return "F64", shape, np.ascontiguousarray(array.astype("<f8", copy=False))
    return "F64", shape, np.ascontiguousarray(array.astype("<f8", copy=False))


def _reshape(values: list[Any], shape: list[int]) -> Any:
    if not shape:
        return values[0]
    if len(shape) == 1:
        return values[: shape[0]]

    chunk = math.prod(shape[1:])
    return [
        _reshape(values[index * chunk : (index + 1) * chunk], shape[1:])
        for index in range(shape[0])
    ]


def write_safetensor_file(
    path: str | Path,
    tensors: dict[str, Any],
    *,
    metadata: dict[str, str] | None = None,
) -> None:
    tensor_header: dict[str, Any] = {}
    payloads: list[Any] = []
    offset = 0

    for name, value in tensors.items():
        array_payload = _array_payload(value)
        if array_payload is None:
            flat_values = _flatten(value)
            dtype = _dtype_of(flat_values)
            shape = _shape_of(value)
            payload = _pack_tensor(dtype, flat_values)
        else:
            dtype, shape, payload = array_payload
        payload_size = int(payload.nbytes) if hasattr(payload, "nbytes") else len(payload)
        tensor_header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [offset, offset + payload_size],
        }
        payloads.append(payload)
        offset += payload_size

    if metadata:
        tensor_header["__metadata__"] = metadata

    header_bytes = json.dumps(tensor_header, separators=(",", ":")).encode("utf-8")
    output_path = Path(path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("wb") as handle:
        handle.write(struct.pack("<Q", len(header_bytes)))
        handle.write(header_bytes)
        for payload in payloads:
            if hasattr(payload, "nbytes"):
                if payload.nbytes:
                    handle.write(memoryview(payload).cast("B"))
            else:
                handle.write(payload)


def read_safetensor_file(path: str | Path, *, arrays: bool = False) -> SafeTensorFile:
    tensor_path = Path(path)
    if arrays and np is not None:
        with tensor_path.open("rb") as handle:
            length_bytes = handle.read(8)
            if len(length_bytes) < 8:
                raise ValueError("Invalid safetensors file: missing header length.")
            header_length = struct.unpack("<Q", length_bytes)[0]
            header_bytes = handle.read(header_length)
            if len(header_bytes) != header_length:
                raise ValueError("Invalid safetensors file: truncated header.")
            header = json.loads(header_bytes.decode("utf-8"))
        data_start = 8 + header_length
        metadata = {str(key): str(value) for key, value in header.get("__metadata__", {}).items()}
        tensors: dict[str, Any] = {}

        for name, spec in header.items():
            if name == "__metadata__":
                continue
            start, end = spec["data_offsets"]
            dtype = str(spec["dtype"])
            shape = [int(value) for value in spec["shape"]]
            _, width = DTYPE_CODES[dtype]
            payload_width = end - start
            element_count = payload_width // width if width else 0
            if payload_width <= 0:
                tensors[name] = np.asarray([], dtype={"I32": "<i4", "F32": "<f4", "F64": "<f8"}[dtype])
                continue
            array_dtype = {"I32": "<i4", "F32": "<f4", "F64": "<f8"}[dtype]
            mapped_shape = tuple(shape) if shape else (element_count,)
            mapped = np.memmap(
                tensor_path,
                dtype=array_dtype,
                mode="r",
                offset=data_start + start,
                shape=mapped_shape,
                order="C",
            )
            tensors[name] = mapped if shape else mapped[0]

        return SafeTensorFile(tensors=tensors, metadata=metadata)

    raw = tensor_path.read_bytes()
    if len(raw) < 8:
        raise ValueError("Invalid safetensors file: missing header length.")

    header_length = struct.unpack("<Q", raw[:8])[0]
    header = json.loads(raw[8 : 8 + header_length].decode("utf-8"))
    data_buffer = raw[8 + header_length :]
    metadata = {str(key): str(value) for key, value in header.get("__metadata__", {}).items()}
    tensors: dict[str, Any] = {}

    for name, spec in header.items():
        if name == "__metadata__":
            continue
        start, end = spec["data_offsets"]
        dtype = str(spec["dtype"])
        shape = [int(value) for value in spec["shape"]]
        code, width = DTYPE_CODES[dtype]
        payload = data_buffer[start:end]
        element_count = len(payload) // width if width else 0
        if np is not None and payload:
            array_dtype = {"I32": "<i4", "F32": "<f4", "F64": "<f8"}[dtype]
            values = np.frombuffer(payload, dtype=array_dtype, count=element_count)
            reshaped = values.reshape(shape) if shape else values
            if arrays:
                tensors[name] = reshaped.copy() if shape else values.copy()[0]
            else:
                tensors[name] = reshaped.tolist() if shape else values.tolist()[0]
        else:
            values = list(struct.unpack(f"<{element_count}{code}", payload)) if payload else []
            tensors[name] = _reshape(values, shape)

    return SafeTensorFile(tensors=tensors, metadata=metadata)


def inspect_checkpoint(path: str | Path) -> dict[str, Any]:
    header = _read_safetensor_header(path)
    metadata = {str(key): str(value) for key, value in header.get("__metadata__", {}).items()}
    tensor_names = sorted(name for name in header if name != "__metadata__")
    config = json.loads(metadata["config"]) if "config" in metadata else {}
    return {
        "format": "safetensors",
        "path": str(Path(path).resolve()),
        "checkpoint_kind": metadata.get("checkpoint_kind", "unknown"),
        "schema_version": metadata.get("schema_version", "0"),
        "tokenizer_name": metadata.get("tokenizer_name", ""),
        "default_reasoning_profile": str(config.get("default_reasoning_profile", "none")) if config else "none",
        "lowercase": bool(config.get("lowercase", False)) if config else False,
        "tensor_count": len(tensor_names),
        "tensor_names": tensor_names,
        "tensor_dtypes": {
            name: str(header[name]["dtype"])
            for name in tensor_names
        },
        "tensor_shapes": {
            name: [int(axis) for axis in header[name]["shape"]]
            for name in tensor_names
        },
        "tokenizer_vocab_size": int(metadata.get("tokenizer_vocab_size", "0")),
        "embedding_dim": int(config.get("embedding_dim", 0)) if config else 0,
        "state_dim": int(config.get("state_dim", 0)) if config else 0,
    }
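The format handled above is plain safetensors: an 8-byte little-endian header length, a JSON header mapping tensor names to dtype, shape, and byte offsets, then the raw payloads. A short usage sketch of the two public entry points, assuming it runs from the repository root:

```python
from reframr.checkpoint import inspect_checkpoint, read_safetensor_file

# Header-only: reads the JSON header without touching the ~200 MB payload.
info = inspect_checkpoint("model.safetensors")
print(info["checkpoint_kind"], info["tensor_count"], info["embedding_dim"])

# Full read; with arrays=True and NumPy available, tensors are memory-mapped.
blob = read_safetensor_file("model.safetensors", arrays=True)
print(sorted(blob.tensors)[:5], blob.metadata.get("tokenizer_name"))
```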
reframr/cli.py ADDED
@@ -0,0 +1,760 @@
1
+ import argparse
2
+ import json
3
+ import sys
4
+ from pathlib import Path
5
+
6
+ from .checkpoint import inspect_checkpoint
7
+ from .config import ReframrConfig
8
+ from .corpus_recipes import (
9
+ build_foundation_corpus,
10
+ build_generalization_corpus,
11
+ write_corpus_package,
12
+ )
13
+ from .curriculum import CurriculumConfig, write_curriculum_package
14
+ from .datasets import load_prompt_suite, load_text_corpus
15
+ from .evaluation import benchmark_open_prompts, evaluate_manifest, load_manifest
16
+ from .hf_import import import_hf_dataset
17
+ from .model import ReframrModel
18
+ from .reasoning import REASONING_PROFILES, TOKENIZER_NAME, reasoning_prefix
19
+ from .streaming import fit_model_from_corpus_plan, load_corpus_plan
20
+ from .tokenizer import MAX_TOKENIZER_VOCAB_SIZE, clamp_vocab_size, recommend_vocab_size
21
+
22
+
23
+ def configure_stdio() -> None:
24
+ for stream in (sys.stdout, sys.stderr):
25
+ reconfigure = getattr(stream, "reconfigure", None)
26
+ if reconfigure is not None:
27
+ reconfigure(encoding="utf-8")
28
+
29
+
30
+ def build_parser() -> argparse.ArgumentParser:
31
+ parser = argparse.ArgumentParser(
32
+ prog="reframr",
33
+ description="Compute and query REFRAMR analytical language model checkpoints.",
34
+ )
35
+ subparsers = parser.add_subparsers(dest="command", required=True)
36
+
37
+ compute = subparsers.add_parser(
38
+ "compute",
39
+ aliases=["train"],
40
+ help="Compute a REFRAMR checkpoint from a text corpus with no epoch loop.",
41
+ )
42
+ compute.add_argument(
43
+ "--input",
44
+ required=True,
45
+ help="Path to a text, JSON, or JSONL corpus file, or a directory of such files.",
46
+ )
47
+ compute.add_argument("--output", required=True, help="Path to write the .safetensors checkpoint.")
48
+ compute.add_argument("--embedding-dim", type=int, default=16)
49
+ compute.add_argument("--state-dim", type=int, default=32)
50
+ compute.add_argument("--timescales", default="1.0,0.5,0.25,0.125")
51
+ compute.add_argument("--window-size", type=int, default=2)
52
+ compute.add_argument("--regularization", type=float, default=1e-3)
53
+ compute.add_argument("--min-frequency", type=int, default=1)
54
+ compute.add_argument(
55
+ "--max-vocab",
56
+ type=int,
57
+ default=256,
58
+ help="Cap analytical embedding vocabulary to keep weight computation fast on CPU.",
59
+ )
60
+ compute.add_argument("--tokenizer-vocab-size", type=int, default=0)
61
+ compute.add_argument("--tokenizer-min-pair-frequency", type=int, default=2)
62
+ compute.add_argument(
63
+ "--max-training-examples",
64
+ type=int,
65
+ default=60000,
66
+ help="Cap sampled recurrent training states while still reading the full corpus for tokenizer, embeddings, and transitions.",
67
+ )
68
+ compute.add_argument(
69
+ "--max-transition-contexts",
70
+ type=int,
71
+ default=4096,
72
+ help="Keep only the strongest learned transition contexts per order. Use 0 to disable the cap.",
73
+ )
74
+ compute.add_argument(
75
+ "--max-transition-next-tokens",
76
+ type=int,
77
+ default=4,
78
+ help="Keep this many learned next-token choices per transition context.",
79
+ )
80
+ case_group = compute.add_mutually_exclusive_group()
81
+ case_group.add_argument(
82
+ "--lowercase",
83
+ action="store_true",
84
+ help="Normalize corpus text to lowercase before tokenization.",
85
+ )
86
+ case_group.add_argument("--preserve-case", action="store_true", help=argparse.SUPPRESS)
87
+ compute.add_argument(
88
+ "--reasoning-profile",
89
+ choices=sorted(REASONING_PROFILES),
90
+ default="none",
91
+ help="Default reasoning-control profile baked into the checkpoint.",
92
+ )
93
+
94
+ recompute = subparsers.add_parser(
95
+ "recompute",
96
+ help="Compute a REFRAMR checkpoint from a streaming corpus plan with no raw-text cache.",
97
+ )
98
+ recompute.add_argument("--plan", required=True, help="Path to a streaming corpus plan JSON file.")
99
+ recompute.add_argument("--output", required=True, help="Path to write the .safetensors checkpoint.")
100
+ recompute.add_argument("--embedding-dim", type=int, default=16)
101
+ recompute.add_argument("--state-dim", type=int, default=32)
102
+ recompute.add_argument("--timescales", default="1.0,0.5,0.25,0.125")
103
+ recompute.add_argument("--window-size", type=int, default=2)
104
+ recompute.add_argument("--regularization", type=float, default=1e-3)
105
+ recompute.add_argument("--min-frequency", type=int, default=1)
106
+ recompute.add_argument("--max-vocab", type=int, default=256)
107
+ recompute.add_argument("--tokenizer-vocab-size", type=int, default=0)
108
+ recompute.add_argument("--tokenizer-min-pair-frequency", type=int, default=2)
109
+ recompute.add_argument("--max-training-examples", type=int, default=60000)
110
+ recompute.add_argument("--max-transition-contexts", type=int, default=4096)
111
+ recompute.add_argument("--max-transition-next-tokens", type=int, default=4)
112
+ recompute.add_argument("--log-every", type=int, default=0)
113
+ recompute_case_group = recompute.add_mutually_exclusive_group()
114
+ recompute_case_group.add_argument("--lowercase", action="store_true")
115
+ recompute_case_group.add_argument("--preserve-case", action="store_true", help=argparse.SUPPRESS)
116
+ recompute.add_argument(
117
+ "--reasoning-profile",
118
+ choices=sorted(REASONING_PROFILES),
119
+ default="none",
120
+ help="Default reasoning-control profile baked into the checkpoint.",
121
+ )
122
+
123
+ predict = subparsers.add_parser("predict", help="Predict the next-token distribution from a saved model.")
124
+ predict.add_argument("--model", required=True, help="Path to a serialized REFRAMR model.")
125
+ predict.add_argument("--context", required=True, help="Input context text.")
126
+ predict.add_argument("--top-k", type=int, default=5)
127
+ predict.add_argument(
128
+ "--reasoning-mode",
129
+ choices=sorted(REASONING_PROFILES),
130
+ default=None,
131
+ help="Override the checkpoint's default reasoning-control profile.",
132
+ )
133
+
134
+ generate = subparsers.add_parser("generate", help="Generate long-form text from a saved model.")
135
+ generate.add_argument("--model", required=True, help="Path to a serialized REFRAMR model.")
136
+ generate.add_argument("--context", required=True, help="Prompt or starting context text.")
137
+ generate.add_argument("--system", default="", help="Optional system instruction to prepend as learned context.")
138
+ generate.add_argument("--max-tokens", type=int, default=64)
139
+ generate.add_argument("--temperature", type=float, default=0.82)
140
+ generate.add_argument("--decode-top-k", type=int, default=24)
141
+ generate.add_argument("--decode-top-p", type=float, default=0.92)
142
+ generate.add_argument("--repetition-penalty", type=float, default=1.18)
143
+ generate.add_argument(
144
+ "--reasoning-mode",
145
+ choices=sorted(REASONING_PROFILES),
146
+ default=None,
147
+ help="Override the checkpoint's default reasoning-control profile.",
148
+ )
149
+
150
+ generate_batch = subparsers.add_parser(
151
+ "generate-batch",
152
+ help="Generate answers for a prompt file while keeping one checkpoint loaded.",
153
+ )
154
+ generate_batch.add_argument("--model", required=True, help="Path to a serialized REFRAMR model.")
155
+ generate_batch.add_argument("--prompts", required=True, help="Path to a TXT, JSON, or JSONL prompt suite.")
156
+ generate_batch.add_argument("--output", required=True, help="Path to write JSONL generations.")
157
+ generate_batch.add_argument("--max-tokens", type=int, default=64)
158
+ generate_batch.add_argument("--temperature", type=float, default=0.82)
159
+ generate_batch.add_argument("--decode-top-k", type=int, default=24)
160
+ generate_batch.add_argument("--decode-top-p", type=float, default=0.92)
161
+ generate_batch.add_argument("--repetition-penalty", type=float, default=1.18)
162
+ generate_batch.add_argument(
163
+ "--reasoning-mode",
164
+ choices=sorted(REASONING_PROFILES),
165
+ default=None,
166
+ help="Override the checkpoint's default reasoning-control profile.",
167
+ )
168
+
169
+ serve = subparsers.add_parser(
170
+ "serve",
171
+ help="Keep one checkpoint loaded and answer JSONL generation requests from stdin.",
172
+ )
173
+ serve.add_argument("--model", required=True, help="Path to a serialized REFRAMR model.")
174
+ serve.add_argument("--max-tokens", type=int, default=64)
175
+ serve.add_argument("--temperature", type=float, default=0.82)
176
+ serve.add_argument("--decode-top-k", type=int, default=24)
177
+ serve.add_argument("--decode-top-p", type=float, default=0.92)
178
+ serve.add_argument("--repetition-penalty", type=float, default=1.18)
179
+ serve.add_argument(
180
+ "--reasoning-mode",
181
+ choices=sorted(REASONING_PROFILES),
182
+ default=None,
183
+ help="Override the checkpoint's default reasoning-control profile.",
184
+ )
185
+
186
+ trace = subparsers.add_parser("trace", help="Trace REFRAMR reasoning components through generation steps.")
187
+ trace.add_argument("--model", required=True, help="Path to a serialized REFRAMR model.")
188
+ trace.add_argument("--context", required=True, help="Prompt or starting context text.")
189
+ trace.add_argument("--max-tokens", type=int, default=8)
190
+ trace.add_argument("--top-k", type=int, default=5)
191
+ trace.add_argument("--temperature", type=float, default=0.82)
192
+ trace.add_argument("--decode-top-p", type=float, default=0.92)
193
+ trace.add_argument("--repetition-penalty", type=float, default=1.18)
194
+ trace.add_argument(
195
+ "--reasoning-mode",
196
+ choices=sorted(REASONING_PROFILES),
197
+ default=None,
198
+ help="Override the checkpoint's default reasoning-control profile.",
199
+ )
200
+
201
+ inspect = subparsers.add_parser("inspect", help="Inspect a REFRAMR safetensors checkpoint.")
202
+ inspect.add_argument("--model", required=True, help="Path to a .safetensors checkpoint.")
203
+
204
+ craft = subparsers.add_parser(
205
+ "craft-corpus",
206
+ help="Generate a JSON-first bootstrap corpus, manifest, and generalization prompt suite.",
207
+ )
208
+ craft.add_argument("--output-dir", required=True, help="Directory to write corpus and manifest files.")
209
+ craft.add_argument(
210
+ "--variant",
211
+ choices=("foundation", "generalization"),
212
+ default="foundation",
213
+ help="Choose between the mixed foundation corpus and the language-first generalization corpus.",
214
+ )
215
+
216
+ craft_curriculum = subparsers.add_parser(
217
+ "craft-curriculum",
218
+ help="Generate the OkeyMeta JSON curriculum shard, manifest, holdout prompts, and recompute plan.",
219
+ )
220
+ craft_curriculum.add_argument("--output-dir", required=True, help="Directory to write curriculum files.")
221
+ craft_curriculum.add_argument(
222
+ "--records-per-category",
223
+ type=int,
224
+ default=1000,
225
+ help="How many JSON records to generate for each curriculum category.",
226
+ )
227
+ craft_curriculum.add_argument("--seed", type=int, default=7)
228
+ craft_curriculum.add_argument("--train-ratio", type=float, default=0.92)
229
+ craft_curriculum.add_argument(
230
+ "--effective-token-target",
231
+ type=int,
232
+ default=0,
233
+ help="Set plan weighting so compact curriculum statistics represent this many effective tokens.",
234
+ )
235
+
236
+ evaluate = subparsers.add_parser(
237
+ "evaluate",
238
+ help="Evaluate memorization and held-out generalization from a benchmark manifest.",
239
+ )
240
+ evaluate.add_argument("--model", required=True, help="Path to a REFRAMR .safetensors checkpoint.")
241
+ evaluate.add_argument("--manifest", required=True, help="Path to a corpus benchmark manifest JSON file.")
242
+ evaluate.add_argument(
243
+ "--reasoning-mode",
244
+ choices=sorted(REASONING_PROFILES),
245
+ default=None,
246
+ help="Override the checkpoint's default reasoning-control profile during evaluation.",
247
+ )
248
+ evaluate.add_argument("--top-k", type=int, default=5)
249
+
250
+ benchmark_open = subparsers.add_parser(
251
+ "benchmark-open",
252
+ help="Run arbitrary prompt files through a checkpoint with open-ended output metrics.",
253
+ )
254
+ benchmark_open.add_argument("--model", required=True, help="Path to a REFRAMR .safetensors checkpoint.")
255
+ benchmark_open.add_argument("--prompts", required=True, help="Path to a TXT, JSON, or JSONL prompt suite.")
256
+ benchmark_open.add_argument("--max-tokens", type=int, default=64)
257
+ benchmark_open.add_argument("--temperature", type=float, default=0.82)
258
+ benchmark_open.add_argument("--decode-top-k", type=int, default=24)
259
+ benchmark_open.add_argument("--decode-top-p", type=float, default=0.92)
260
+ benchmark_open.add_argument("--repetition-penalty", type=float, default=1.18)
261
+ benchmark_open.add_argument(
262
+ "--reasoning-mode",
263
+ choices=sorted(REASONING_PROFILES),
264
+ default=None,
265
+ help="Override the checkpoint's default reasoning-control profile during benchmarking.",
266
+ )
267
+
268
+ import_hf = subparsers.add_parser(
269
+ "import-hf",
270
+ help="Import Hugging Face dataset text into the REFRAMR JSON record standard.",
271
+ )
272
+ import_hf.add_argument("--dataset", required=True, help="Hugging Face dataset id.")
273
+ import_hf.add_argument("--output", required=True, help="Path to write the JSONL corpus.")
274
+ import_hf.add_argument("--config", default=None, help="Optional dataset config/subset.")
275
+ import_hf.add_argument("--split", default="train", help="Dataset split to import.")
276
+ import_hf.add_argument("--text-field", default=None, help="Explicit text column name.")
277
+ import_hf.add_argument("--limit", type=int, default=1000, help="Maximum records to import.")
278
+ import_hf.add_argument(
279
+ "--min-words",
280
+ type=int,
281
+ default=0,
282
+ help="Drop imported records shorter than this many words.",
283
+ )
284
+ import_hf.add_argument(
285
+ "--max-words",
286
+ type=int,
287
+ default=0,
288
+ help="Drop imported records longer than this many words. Use 0 to disable.",
289
+ )
290
+ import_hf.add_argument(
291
+ "--min-alpha-ratio",
292
+ type=float,
293
+ default=0.0,
294
+ help="Drop imported records whose alphabetic-character ratio falls below this threshold.",
295
+ )
296
+ import_hf.add_argument(
297
+ "--allowed-languages",
298
+ default="",
299
+ help="Optional comma-separated language codes to keep, such as en,yo,ig,ha.",
300
+ )
301
+ import_hf.add_argument(
302
+ "--preference-target",
303
+ choices=("both", "chosen", "rejected"),
304
+ default="chosen",
305
+ help="When importing preference datasets, keep both sides or only the chosen/rejected side.",
306
+ )
307
+ import_hf.add_argument(
308
+ "--no-streaming",
309
+ action="store_true",
310
+ help="Disable streaming dataset reads.",
311
+ )
312
+
313
+ return parser
314
+
315
+
316
+ def parse_timescales(raw_timescales: str) -> tuple[float, ...]:
317
+ values = [segment.strip() for segment in raw_timescales.split(",") if segment.strip()]
318
+ if not values:
319
+ raise ValueError("At least one timescale is required.")
320
+ return tuple(float(value) for value in values)
321
+
322
+
323
+ def command_compute(args: argparse.Namespace) -> int:
324
+ text = load_text_corpus(args.input)
325
+ requested_vocab_size = args.tokenizer_vocab_size or recommend_vocab_size(
326
+ text,
327
+ lowercase=args.lowercase,
328
+ )
329
+ tokenizer_vocab_size = clamp_vocab_size(requested_vocab_size)
330
+ config = ReframrConfig(
331
+ embedding_dim=args.embedding_dim,
332
+ state_dim=args.state_dim,
333
+ timescales=parse_timescales(args.timescales),
334
+ window_size=args.window_size,
335
+ regularization=args.regularization,
336
+ min_frequency=args.min_frequency,
337
+ max_vocab=args.max_vocab,
338
+ tokenizer_vocab_size=tokenizer_vocab_size,
339
+ tokenizer_min_pair_frequency=args.tokenizer_min_pair_frequency,
340
+ max_training_examples=args.max_training_examples,
341
+ max_transition_contexts_per_order=(
342
+ args.max_transition_contexts if args.max_transition_contexts > 0 else None
343
+ ),
344
+ max_transition_next_tokens=args.max_transition_next_tokens,
345
+ lowercase=args.lowercase,
346
+ default_reasoning_profile=args.reasoning_profile,
347
+ )
348
+ model = ReframrModel(config).fit(text)
349
+ model.save(args.output)
350
+
351
+ assert model.tokenizer is not None
352
+ assert model.embedding_model is not None
353
+ summary = {
354
+ "status": "computed",
355
+ "format": "safetensors",
356
+ "model_path": str(Path(args.output).resolve()),
357
+ "tokenizer_name": TOKENIZER_NAME,
358
+ "vocab_size": len(model.embedding_model.id_to_token),
359
+ "tokenizer_vocab_budget": config.tokenizer_vocab_size,
360
+ "tokenizer_vocab_budget_max": MAX_TOKENIZER_VOCAB_SIZE,
361
+ "tokenizer_vocab_size": model.tokenizer.vocab_size,
362
+ "reasoning_profile": config.default_reasoning_profile,
363
+ "reasoning_tokens": reasoning_prefix(config.default_reasoning_profile),
364
+ "lowercase": config.lowercase,
365
+ "max_training_examples": config.max_training_examples,
366
+ "max_transition_contexts_per_order": config.max_transition_contexts_per_order,
367
+ "max_transition_next_tokens": config.max_transition_next_tokens,
368
+ "embedding_dim": config.embedding_dim,
369
+ "state_dim": config.state_dim,
370
+ "timescales": list(config.timescales),
371
+ }
372
+ print(json.dumps(summary))
373
+ return 0
374
+
375
+
376
+ def command_recompute(args: argparse.Namespace) -> int:
377
+ plan = load_corpus_plan(args.plan)
378
+ requested_vocab_size = args.tokenizer_vocab_size or 1024
379
+ tokenizer_vocab_size = clamp_vocab_size(requested_vocab_size)
380
+ config = ReframrConfig(
381
+ embedding_dim=args.embedding_dim,
382
+ state_dim=args.state_dim,
383
+ timescales=parse_timescales(args.timescales),
384
+ window_size=args.window_size,
385
+ regularization=args.regularization,
386
+ min_frequency=args.min_frequency,
387
+ max_vocab=args.max_vocab,
388
+ tokenizer_vocab_size=tokenizer_vocab_size,
389
+ tokenizer_min_pair_frequency=args.tokenizer_min_pair_frequency,
390
+ max_training_examples=args.max_training_examples,
391
+ max_transition_contexts_per_order=(
392
+ args.max_transition_contexts if args.max_transition_contexts > 0 else None
393
+ ),
394
+ max_transition_next_tokens=args.max_transition_next_tokens,
395
+ lowercase=args.lowercase,
396
+ default_reasoning_profile=args.reasoning_profile,
397
+ )
398
+ model, payload = fit_model_from_corpus_plan(
399
+ plan,
400
+ config,
401
+ log_every=args.log_every,
402
+ )
403
+ model.save(args.output)
404
+
405
+ summary = {
406
+ "status": "recomputed",
407
+ "format": "safetensors",
408
+ "streaming": True,
409
+ "plan_path": str(Path(args.plan).resolve()),
410
+ "model_path": str(Path(args.output).resolve()),
411
+ "tokenizer_name": TOKENIZER_NAME,
412
+ "tokenizer_vocab_budget": config.tokenizer_vocab_size,
413
+ "tokenizer_vocab_budget_max": MAX_TOKENIZER_VOCAB_SIZE,
414
+ "tokenizer_vocab_size": payload["tokenizer_vocab_size"],
415
+ "vocab_size": payload["embedding_vocab_size"],
416
+ "documents_processed": payload["documents_processed"],
417
+ "source_counts": payload["source_counts"],
418
+ "examples_processed": payload["examples_processed"],
419
+ "associative_examples": payload["associative_examples"],
420
+ "answer_associative_examples": payload.get("answer_associative_examples", 0),
421
+ "general_associative_examples": payload.get("general_associative_examples", 0),
422
+ "answer_intent_examples": payload.get("answer_intent_examples", 0),
423
+ "answer_start_examples": payload.get("answer_start_examples", 0),
424
+ "answer_sequence_examples": payload.get("answer_sequence_examples", 0),
425
+ "prompt_answer_readout_examples": payload.get("prompt_answer_readout_examples", 0),
426
+ "prompt_answer_start_readout_examples": payload.get("prompt_answer_start_readout_examples", 0),
427
+ "preference_pairs": payload.get("preference_pairs", 0),
428
+ "preference_state_pairs": payload.get("preference_state_pairs", 0),
429
+ "stage_seconds": payload.get("stage_seconds", {}),
430
+ "readout_solver": payload.get("readout_solver"),
431
+ "reasoning_profile": config.default_reasoning_profile,
432
+ "reasoning_tokens": reasoning_prefix(config.default_reasoning_profile),
433
+ "lowercase": config.lowercase,
434
+ "max_training_examples": config.max_training_examples,
435
+ "max_transition_contexts_per_order": config.max_transition_contexts_per_order,
436
+ "max_transition_next_tokens": config.max_transition_next_tokens,
437
+ "embedding_dim": config.embedding_dim,
438
+ "state_dim": config.state_dim,
439
+ "timescales": list(config.timescales),
440
+ }
441
+ print(json.dumps(summary))
442
+ return 0
443
+
444
+
445
+ def command_predict(args: argparse.Namespace) -> int:
446
+ model = ReframrModel.load(args.model)
447
+ distribution = model.predict_next_distribution(
448
+ args.context,
449
+ reasoning_mode=args.reasoning_mode,
450
+ )
451
+ predictions = sorted(
452
+ distribution.items(),
453
+ key=lambda item: item[1],
454
+ reverse=True,
455
+ )[: args.top_k]
456
+ payload = {
457
+ "context": args.context,
458
+ "reasoning_mode": args.reasoning_mode or model.config.default_reasoning_profile,
459
+ "reasoning_tokens": reasoning_prefix(args.reasoning_mode or model.config.default_reasoning_profile),
460
+ "predictions": [
461
+ {"token": token, "probability": probability}
462
+ for token, probability in predictions
463
+ ],
464
+ }
465
+ print(json.dumps(payload))
466
+ return 0
467
+
468
+
469
+ def command_generate(args: argparse.Namespace) -> int:
470
+ model = ReframrModel.load(args.model)
471
+ context = compose_generation_context(args.context, system=args.system)
472
+ generated_text = model.generate_text(
473
+ context,
474
+ max_tokens=args.max_tokens,
475
+ reasoning_mode=args.reasoning_mode,
476
+ temperature=args.temperature,
477
+ top_k=args.decode_top_k,
478
+ top_p=args.decode_top_p,
479
+ repetition_penalty=args.repetition_penalty,
480
+ )
481
+ payload = {
482
+ "context": context,
483
+ "reasoning_mode": args.reasoning_mode or model.config.default_reasoning_profile,
484
+ "reasoning_tokens": reasoning_prefix(args.reasoning_mode or model.config.default_reasoning_profile),
485
+ "generated_token_count": len(generated_text.split()),
486
+ "generated_text": generated_text,
487
+ }
488
+ print(json.dumps(payload))
489
+ return 0
490
+
491
+
492
+ def compose_generation_context(prompt: str, *, system: str = "") -> str:
493
+ clean_prompt = prompt.strip()
494
+ clean_system = system.strip()
495
+ if not clean_system:
496
+ return clean_prompt
497
+ return f"System instruction: {clean_system}\nUser: {clean_prompt}"
498
+
499
+
500
+ def command_generate_batch(args: argparse.Namespace) -> int:
501
+ model = ReframrModel.load(args.model)
502
+ prompts = load_prompt_suite(args.prompts)
503
+ output_path = Path(args.output)
504
+ output_path.parent.mkdir(parents=True, exist_ok=True)
505
+ active_mode = args.reasoning_mode or model.config.default_reasoning_profile
506
+ rows: list[dict[str, object]] = []
507
+ with output_path.open("w", encoding="utf-8") as handle:
508
+ for index, record in enumerate(prompts):
509
+ prompt = str(record["prompt"])
510
+ context = compose_generation_context(
511
+ prompt,
512
+ system=str(record.get("system", "")),
513
+ )
514
+ max_tokens = int(record.get("max_tokens", args.max_tokens))
515
+ generated_text = model.generate_text(
516
+ context,
517
+ max_tokens=max_tokens,
518
+ reasoning_mode=args.reasoning_mode,
519
+ temperature=args.temperature,
520
+ top_k=args.decode_top_k,
521
+ top_p=args.decode_top_p,
522
+ repetition_penalty=args.repetition_penalty,
523
+ )
524
+ row = {
525
+ "index": index,
526
+ "prompt": prompt,
527
+ "context": context,
528
+ "system": record.get("system", ""),
529
+ "tags": record.get("tags", []),
530
+ "reasoning_mode": active_mode,
531
+ "reasoning_tokens": reasoning_prefix(active_mode),
532
+ "generated_token_count": len(generated_text.split()),
533
+ "generated_text": generated_text,
534
+ }
535
+ rows.append(row)
536
+ handle.write(json.dumps(row, ensure_ascii=False, separators=(",", ":")) + "\n")
537
+ payload = {
538
+ "status": "generated",
539
+ "sample_count": len(rows),
540
+ "model_path": str(Path(args.model).resolve()),
541
+ "prompts_path": str(Path(args.prompts).resolve()),
542
+ "output_path": str(output_path.resolve()),
543
+ "model_loads": 1,
544
+ }
545
+ print(json.dumps(payload))
546
+ return 0
547
+
548
+
549
+ def command_serve(args: argparse.Namespace) -> int:
550
+ model = ReframrModel.load(args.model)
551
+ default_mode = args.reasoning_mode or model.config.default_reasoning_profile
552
+ for index, raw_line in enumerate(sys.stdin):
553
+ line = raw_line.strip()
554
+ if not line:
555
+ continue
556
+ try:
557
+ request = json.loads(line)
558
+ except json.JSONDecodeError as exc:
559
+ response = {
560
+ "index": index,
561
+ "error": "invalid_json",
562
+ "message": str(exc),
563
+ "model_loads": 1,
564
+ }
565
+ sys.stdout.write(json.dumps(response, ensure_ascii=False, separators=(",", ":")) + "\n")
566
+ sys.stdout.flush()
567
+ continue
568
+ if isinstance(request, str):
569
+ context = request
570
+ request_payload: dict[str, object] = {}
571
+ elif isinstance(request, dict):
572
+ request_payload = request
573
+ raw_context = str(request_payload.get("prompt", request_payload.get("context", "")))
574
+ context = compose_generation_context(
575
+ raw_context,
576
+ system=str(request_payload.get("system", "")),
577
+ )
578
+ else:
579
+ response = {
580
+ "index": index,
581
+ "error": "invalid_request",
582
+ "message": "request must be a JSON object or string",
583
+ "model_loads": 1,
584
+ }
585
+ sys.stdout.write(json.dumps(response, ensure_ascii=False, separators=(",", ":")) + "\n")
586
+ sys.stdout.flush()
587
+ continue
588
+ active_mode = str(request_payload.get("reasoning_mode", default_mode))
589
+ max_tokens = int(request_payload.get("max_tokens", args.max_tokens))
590
+ temperature = float(request_payload.get("temperature", args.temperature))
591
+ top_k = int(request_payload.get("decode_top_k", args.decode_top_k))
592
+ top_p = float(request_payload.get("decode_top_p", args.decode_top_p))
593
+ repetition_penalty = float(
594
+ request_payload.get("repetition_penalty", args.repetition_penalty)
595
+ )
596
+ generated_text = model.generate_text(
597
+ context,
598
+ max_tokens=max_tokens,
599
+ reasoning_mode=active_mode,
600
+ temperature=temperature,
601
+ top_k=top_k,
602
+ top_p=top_p,
603
+ repetition_penalty=repetition_penalty,
604
+ )
605
+ response = {
606
+ "index": index,
607
+ "context": context,
608
+ "reasoning_mode": active_mode,
609
+ "reasoning_tokens": reasoning_prefix(active_mode),
610
+ "generated_token_count": len(generated_text.split()),
611
+ "generated_text": generated_text,
612
+ "model_loads": 1,
613
+ }
614
+         sys.stdout.write(json.dumps(response, ensure_ascii=False, separators=(",", ":")) + "\n")
+         sys.stdout.flush()
+     return 0
+
+
+ def command_trace(args: argparse.Namespace) -> int:
+     model = ReframrModel.load(args.model)
+     payload = model.trace_generation(
+         args.context,
+         max_tokens=args.max_tokens,
+         reasoning_mode=args.reasoning_mode,
+         top_k=args.top_k,
+         temperature=args.temperature,
+         top_p=args.decode_top_p,
+         repetition_penalty=args.repetition_penalty,
+     )
+     print(json.dumps(payload))
+     return 0
+
+
+ def command_inspect(args: argparse.Namespace) -> int:
+     print(json.dumps(inspect_checkpoint(args.model)))
+     return 0
+
+
+ def command_craft_corpus(args: argparse.Namespace) -> int:
+     package = (
+         build_generalization_corpus()
+         if args.variant == "generalization"
+         else build_foundation_corpus()
+     )
+     paths = write_corpus_package(package, args.output_dir)
+     payload = {
+         "name": package.name,
+         "corpus_path": paths["corpus_path"],
+         "manifest_path": paths["manifest_path"],
+         "prompt_suite_path": paths["prompt_suite_path"],
+         "token_count_estimate": len(package.text.split()),
+         "memorization_samples": len(package.memorization_samples),
+         "generalization_samples": len(package.generalization_samples),
+         "generalization_prompt_count": len(package.open_ended_samples),
+         "variant": args.variant,
+         "section_counts": package.section_counts,
+     }
+     print(json.dumps(payload))
+     return 0
+
+
+ def command_craft_curriculum(args: argparse.Namespace) -> int:
+     payload = write_curriculum_package(
+         args.output_dir,
+         CurriculumConfig(
+             records_per_category=args.records_per_category,
+             seed=args.seed,
+             train_ratio=args.train_ratio,
+         ),
+         effective_token_target=args.effective_token_target or None,
+     )
+     print(json.dumps(payload))
+     return 0
+
+
+ def command_evaluate(args: argparse.Namespace) -> int:
+     model = ReframrModel.load(args.model)
+     manifest = load_manifest(args.manifest)
+     payload = evaluate_manifest(
+         model,
+         manifest,
+         reasoning_mode=args.reasoning_mode,
+         top_k=args.top_k,
+     )
+     print(json.dumps(payload))
+     return 0
+
+
+ def command_benchmark_open(args: argparse.Namespace) -> int:
+     model = ReframrModel.load(args.model)
+     prompts = load_prompt_suite(args.prompts)
+     payload = benchmark_open_prompts(
+         model,
+         prompts,
+         reasoning_mode=args.reasoning_mode,
+         max_tokens=args.max_tokens,
+         temperature=args.temperature,
+         top_k=args.decode_top_k,
+         top_p=args.decode_top_p,
+         repetition_penalty=args.repetition_penalty,
+     )
+     print(json.dumps(payload))
+     return 0
+
+
+ def command_import_hf(args: argparse.Namespace) -> int:
+     payload = import_hf_dataset(
+         dataset=args.dataset,
+         output_path=args.output,
+         config=args.config,
+         split=args.split,
+         text_field=args.text_field,
+         limit=args.limit,
+         streaming=not args.no_streaming,
+         preference_target=args.preference_target,
+         min_words=args.min_words,
+         max_words=args.max_words,
+         min_alpha_ratio=args.min_alpha_ratio,
+         allowed_languages=tuple(
+             segment.strip()
+             for segment in args.allowed_languages.split(",")
+             if segment.strip()
+         ),
+     )
+     print(json.dumps(payload))
+     return 0
+
+
+ def main(argv: list[str] | None = None) -> int:
+     configure_stdio()
+     parser = build_parser()
+     args = parser.parse_args(argv)
+     if args.command in {"compute", "train"}:
+         return command_compute(args)
+     if args.command == "recompute":
+         return command_recompute(args)
+     if args.command == "predict":
+         return command_predict(args)
+     if args.command == "generate":
+         return command_generate(args)
+     if args.command == "generate-batch":
+         return command_generate_batch(args)
+     if args.command == "serve":
+         return command_serve(args)
+     if args.command == "trace":
+         return command_trace(args)
+     if args.command == "inspect":
+         return command_inspect(args)
+     if args.command == "craft-corpus":
+         return command_craft_corpus(args)
+     if args.command == "craft-curriculum":
+         return command_craft_curriculum(args)
+     if args.command == "evaluate":
+         return command_evaluate(args)
+     if args.command == "benchmark-open":
+         return command_benchmark_open(args)
+     if args.command == "import-hf":
+         return command_import_hf(args)
+     parser.error(f"Unknown command: {args.command}")
+     return 2
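
For quick sanity checks, the dispatcher above can also be driven in-process instead of through a shell. A minimal sketch, assuming these commands live in `reframr/__main__.py` (as `python -m reframr` implies) and the repository root is on `sys.path`:

```python
# Hypothetical in-process invocation of the CLI dispatcher shown above.
# "inspect" routes to command_inspect, which prints checkpoint metadata as JSON.
from reframr.__main__ import main

exit_code = main(["inspect", "--model", "model.safetensors"])
assert exit_code == 0
```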
reframr/config.py ADDED
@@ -0,0 +1,68 @@
+ from dataclasses import dataclass
+
+
+ @dataclass(slots=True)
+ class ReframrConfig:
+     embedding_dim: int = 16
+     state_dim: int = 32
+     timescales: tuple[float, ...] = (1.0, 0.5, 0.25, 0.125)
+     window_size: int = 2
+     regularization: float = 1e-3
+     min_frequency: int = 1
+     max_vocab: int | None = 256
+     tokenizer_vocab_size: int = 256
+     tokenizer_min_pair_frequency: int = 2
+     max_training_examples: int | None = 60000
+     max_transition_contexts_per_order: int | None = 4096
+     max_transition_next_tokens: int = 4
+     lowercase: bool = False
+     default_reasoning_profile: str = "none"
+
+     def to_dict(self) -> dict[str, object]:
+         return {
+             "embedding_dim": self.embedding_dim,
+             "state_dim": self.state_dim,
+             "timescales": list(self.timescales),
+             "window_size": self.window_size,
+             "regularization": self.regularization,
+             "min_frequency": self.min_frequency,
+             "max_vocab": self.max_vocab,
+             "tokenizer_vocab_size": self.tokenizer_vocab_size,
+             "tokenizer_min_pair_frequency": self.tokenizer_min_pair_frequency,
+             "max_training_examples": self.max_training_examples,
+             "max_transition_contexts_per_order": self.max_transition_contexts_per_order,
+             "max_transition_next_tokens": self.max_transition_next_tokens,
+             "lowercase": self.lowercase,
+             "default_reasoning_profile": self.default_reasoning_profile,
+         }
+
+     @classmethod
+     def from_dict(cls, payload: dict[str, object]) -> "ReframrConfig":
+         return cls(
+             embedding_dim=int(payload["embedding_dim"]),
+             state_dim=int(payload["state_dim"]),
+             timescales=tuple(float(value) for value in payload["timescales"]),
+             window_size=int(payload["window_size"]),
+             regularization=float(payload["regularization"]),
+             min_frequency=int(payload["min_frequency"]),
+             max_vocab=(
+                 int(payload.get("max_vocab", 256))
+                 if payload.get("max_vocab", 256) is not None
+                 else None
+             ),
+             tokenizer_vocab_size=int(payload.get("tokenizer_vocab_size", 256)),
+             tokenizer_min_pair_frequency=int(payload.get("tokenizer_min_pair_frequency", 2)),
+             max_training_examples=(
+                 int(payload["max_training_examples"])
+                 if payload.get("max_training_examples") is not None
+                 else None
+             ),
+             max_transition_contexts_per_order=(
+                 int(payload["max_transition_contexts_per_order"])
+                 if payload.get("max_transition_contexts_per_order") is not None
+                 else None
+             ),
+             max_transition_next_tokens=int(payload.get("max_transition_next_tokens", 4)),
+             lowercase=bool(payload.get("lowercase", False)),
+             default_reasoning_profile=str(payload.get("default_reasoning_profile", "none")),
+         )
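
A minimal round-trip sketch for `ReframrConfig`: `to_dict()` has to survive JSON serialization inside a checkpoint, and `from_dict()` has to restore equivalent field values, including the `None`-able caps (the field values chosen here are illustrative):

```python
import json

from reframr.config import ReframrConfig

# Serialize to JSON the way a checkpoint would, then rebuild the dataclass.
config = ReframrConfig(state_dim=64, max_vocab=None)
restored = ReframrConfig.from_dict(json.loads(json.dumps(config.to_dict())))
assert restored.state_dim == 64 and restored.max_vocab is None
# to_dict() emits timescales as a list; from_dict() restores the tuple form.
assert restored.timescales == (1.0, 0.5, 0.25, 0.125)
```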
reframr/corpus.py ADDED
@@ -0,0 +1,123 @@
+ import re
+ from collections import Counter
+
+ from .linalg import Matrix, np, zeros
+
+ TOKEN_PATTERN = re.compile(r"[A-Za-z0-9']+")
+ FRAMETOKEN_WORD_PREFIX = "▁"
+
+
+ def tokenize(text: str) -> list[str]:
+     return TOKEN_PATTERN.findall(text.lower())
+
+
+ def build_vocabulary(
+     tokens: list[str],
+     min_frequency: int = 1,
+     max_vocab: int | None = None,
+ ) -> tuple[dict[str, int], list[str]]:
+     counts = Counter(tokens)
+     return build_vocabulary_from_counts(
+         counts,
+         min_frequency=min_frequency,
+         max_vocab=max_vocab,
+     )
+
+
+ def build_vocabulary_from_counts(
+     counts: dict[str, float],
+     min_frequency: int = 1,
+     max_vocab: int | None = None,
+ ) -> tuple[dict[str, int], list[str]]:
+     items = [
+         (token, count)
+         for token, count in sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
+         if count >= min_frequency
+     ]
+     if max_vocab is not None:
+         if any(_looks_like_frametoken(token) for token, _ in items):
+             items = _prioritize_frametoken_output_items(items)[:max_vocab]
+         else:
+             items = items[:max_vocab]
+
+     id_to_token = [token for token, _ in items]
+     token_to_id = {token: index for index, token in enumerate(id_to_token)}
+     return token_to_id, id_to_token
+
+
+ def _looks_like_frametoken(token: str) -> bool:
+     return token.startswith(FRAMETOKEN_WORD_PREFIX) or (
+         token.startswith("<") and token.endswith(">")
+     )
+
+
+ def _is_special_token(token: str) -> bool:
+     return token.startswith("<") and token.endswith(">")
+
+
+ def _is_word_start_token(token: str) -> bool:
+     return token.startswith(FRAMETOKEN_WORD_PREFIX)
+
+
+ def _is_single_letter_word_start(token: str) -> bool:
+     if not token.startswith(FRAMETOKEN_WORD_PREFIX):
+         return False
+     rendered = token[len(FRAMETOKEN_WORD_PREFIX) :]
+     return len(rendered) == 1 and rendered.isalpha() and rendered not in {"A", "I"}
+
+
+ def _is_bare_fallback_token(token: str) -> bool:
+     return len(token) == 1 and not token.startswith(FRAMETOKEN_WORD_PREFIX)
+
+
+ def _prioritize_frametoken_output_items(items: list[tuple[str, float]]) -> list[tuple[str, float]]:
+     # FrameToken keeps fallback characters for encoding coverage, but the model's
+     # output/readout vocabulary should spend its capped slots on answerable tokens.
+     def priority(item: tuple[str, float]) -> tuple[int, float, str]:
+         token, count = item
+         if _is_special_token(token):
+             group = 0
+         elif _is_single_letter_word_start(token):
+             group = 3
+         elif _is_word_start_token(token):
+             group = 1
+         elif _is_bare_fallback_token(token):
+             group = 4
+         else:
+             group = 2
+         return (group, -count, token)
+
+     return sorted(items, key=priority)
+
+
+ def build_cooccurrence_matrix(
+     tokens: list[str],
+     token_to_id: dict[str, int],
+     window_size: int,
+ ) -> Matrix:
+     size = len(token_to_id)
+     token_ids = [token_to_id[token] for token in tokens if token in token_to_id]
+     if np is not None and size > 0 and token_ids:
+         matrix = np.zeros((size, size), dtype=np.float64)
+         token_array = np.asarray(token_ids, dtype=np.int64)
+         for offset in range(1, window_size + 1):
+             if len(token_array) <= offset:
+                 break
+             left = token_array[:-offset]
+             right = token_array[offset:]
+             weight = 1.0 / offset
+             np.add.at(matrix, (left, right), weight)
+             np.add.at(matrix, (right, left), weight)
+         return matrix.tolist()
+
+     matrix = zeros(size, size)
+     for index, token_id in enumerate(token_ids):
+         for offset in range(1, window_size + 1):
+             other_index = index + offset
+             if other_index >= len(token_ids):
+                 break
+             other_id = token_ids[other_index]
+             weight = 1.0 / offset
+             matrix[token_id][other_id] += weight
+             matrix[other_id][token_id] += weight
+     return matrix
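
A minimal sketch of the statistics path this module provides: tokenize, cap the vocabulary, then accumulate distance-weighted co-occurrence counts, where a pair at offset `d` contributes `1/d` to both symmetric cells (the sample text is illustrative):

```python
from reframr.corpus import build_cooccurrence_matrix, build_vocabulary, tokenize

tokens = tokenize("clean data shapes clean weights")
token_to_id, id_to_token = build_vocabulary(tokens, min_frequency=1, max_vocab=16)
matrix = build_cooccurrence_matrix(tokens, token_to_id, window_size=2)
# Adjacent pairs add 1.0 per occurrence, pairs two tokens apart add 0.5,
# and the matrix stays symmetric whether numpy or the pure-Python path runs.
print(matrix[token_to_id["clean"]][token_to_id["data"]])
```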
reframr/corpus_recipes.py ADDED
@@ -0,0 +1,1257 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import json
2
+ from dataclasses import dataclass
3
+ from pathlib import Path
4
+
5
+
6
+ @dataclass(slots=True)
7
+ class EvalSample:
8
+ section: str
9
+ context: str
10
+ expected: str
11
+
12
+ def to_dict(self) -> dict[str, str]:
13
+ return {
14
+ "section": self.section,
15
+ "context": self.context,
16
+ "expected": self.expected,
17
+ }
18
+
19
+
20
+ @dataclass(slots=True)
21
+ class OpenEvalSample:
22
+ section: str
23
+ context: str
24
+ required_groups: list[list[str]]
25
+ banned_phrases: list[str]
26
+ min_words: int = 12
27
+ require_punctuation: bool = True
28
+ max_tokens: int = 56
29
+
30
+ def to_dict(self) -> dict[str, object]:
31
+ return {
32
+ "section": self.section,
33
+ "context": self.context,
34
+ "required_groups": self.required_groups,
35
+ "banned_phrases": self.banned_phrases,
36
+ "min_words": self.min_words,
37
+ "require_punctuation": self.require_punctuation,
38
+ "max_tokens": self.max_tokens,
39
+ }
40
+
41
+
42
+ @dataclass(slots=True)
43
+ class CorpusRecord:
44
+ section: str
45
+ context: str
46
+ answer: str
47
+ split: str = "train"
48
+
49
+ @property
50
+ def text(self) -> str:
51
+ return _line(self.context, self.answer)
52
+
53
+ def to_dict(self) -> dict[str, str]:
54
+ return {
55
+ "section": self.section,
56
+ "split": self.split,
57
+ "context": self.context,
58
+ "answer": self.answer,
59
+ "text": self.text,
60
+ }
61
+
62
+
63
+ @dataclass(slots=True)
64
+ class CorpusPackage:
65
+ name: str
66
+ records: list[CorpusRecord]
67
+ section_counts: dict[str, int]
68
+ memorization_samples: list[EvalSample]
69
+ generalization_samples: list[EvalSample]
70
+ open_ended_samples: list[OpenEvalSample]
71
+
72
+ @property
73
+ def slug(self) -> str:
74
+ return self.name.lower().replace(" ", "-")
75
+
76
+ @property
77
+ def text(self) -> str:
78
+ if not self.records:
79
+ return ""
80
+ return "\n".join(record.text for record in self.records) + "\n"
81
+
82
+ def manifest(self, *, corpus_filename: str) -> dict[str, object]:
83
+ return {
84
+ "name": self.name,
85
+ "corpus_filename": corpus_filename,
86
+ "section_counts": self.section_counts,
87
+ "splits": {
88
+ "memorization": [sample.to_dict() for sample in self.memorization_samples],
89
+ "generalization": [sample.to_dict() for sample in self.generalization_samples],
90
+ "open_ended": [sample.to_dict() for sample in self.open_ended_samples],
91
+ },
92
+ }
93
+
94
+ def corpus_records(self) -> list[dict[str, str]]:
95
+ return [record.to_dict() for record in self.records]
96
+
97
+ def prompt_suite(self) -> list[dict[str, object]]:
98
+ return [
99
+ {
100
+ "prompt": sample.context,
101
+ "tags": [sample.section, "generalization"],
102
+ "min_words": sample.min_words,
103
+ "require_punctuation": sample.require_punctuation,
104
+ "max_tokens": sample.max_tokens,
105
+ }
106
+ for sample in self.open_ended_samples
107
+ ]
108
+
109
+
110
+ def _line(context: str, expected: str) -> str:
111
+ return f"{context} {expected}"
112
+
113
+
114
+ def _balanced_samples(samples: list[EvalSample], total: int) -> list[EvalSample]:
115
+ buckets: dict[str, list[EvalSample]] = {}
116
+ for sample in samples:
117
+ buckets.setdefault(sample.section, []).append(sample)
118
+
119
+ selected: list[EvalSample] = []
120
+ ordered_sections = sorted(buckets)
121
+ while len(selected) < total:
122
+ progressed = False
123
+ for section in ordered_sections:
124
+ bucket = buckets[section]
125
+ if not bucket:
126
+ continue
127
+ selected.append(bucket.pop(0))
128
+ progressed = True
129
+ if len(selected) >= total:
130
+ break
131
+ if not progressed:
132
+ break
133
+ return selected
134
+
135
+
136
+ def _recount_sections(records: list[CorpusRecord]) -> dict[str, int]:
137
+ counts: dict[str, int] = {}
138
+ for record in records:
139
+ counts[record.section] = counts.get(record.section, 0) + 1
140
+ return counts
141
+
142
+
143
+ def build_foundation_corpus() -> CorpusPackage:
144
+ records: list[CorpusRecord] = []
145
+ lines: list[str] = []
146
+ section_counts: dict[str, int] = {}
147
+ memorization: list[EvalSample] = []
148
+ generalization: list[EvalSample] = []
149
+ open_ended: list[OpenEvalSample] = []
150
+
151
+ def add_train(section: str, context: str, expected: str, *, sample: bool = False) -> None:
152
+ records.append(
153
+ CorpusRecord(
154
+ section=section,
155
+ context=context,
156
+ answer=expected,
157
+ split="train",
158
+ )
159
+ )
160
+ lines.append(_line(context, expected))
161
+ section_counts[section] = section_counts.get(section, 0) + 1
162
+ if sample:
163
+ memorization.append(EvalSample(section=section, context=context, expected=expected))
164
+
165
+ def add_holdout(section: str, context: str, expected: str) -> None:
166
+ generalization.append(EvalSample(section=section, context=context, expected=expected))
167
+
168
+ def add_open(
169
+ section: str,
170
+ context: str,
171
+ required_groups: list[list[str]],
172
+ *,
173
+ banned_phrases: list[str],
174
+ min_words: int = 12,
175
+ require_punctuation: bool = True,
176
+ max_tokens: int = 56,
177
+ ) -> None:
178
+ open_ended.append(
179
+ OpenEvalSample(
180
+ section=section,
181
+ context=context,
182
+ required_groups=required_groups,
183
+ banned_phrases=banned_phrases,
184
+ min_words=min_words,
185
+ require_punctuation=require_punctuation,
186
+ max_tokens=max_tokens,
187
+ )
188
+ )
189
+
190
+ holdout_addition = {
191
+ (2, 19),
192
+ (3, 17),
193
+ (4, 16),
194
+ (5, 15),
195
+ (6, 14),
196
+ (7, 13),
197
+ (8, 12),
198
+ (9, 11),
199
+ (10, 10),
200
+ (11, 9),
201
+ (12, 8),
202
+ (13, 7),
203
+ (14, 6),
204
+ (15, 5),
205
+ (16, 4),
206
+ (17, 3),
207
+ (18, 2),
208
+ (19, 21),
209
+ (20, 22),
210
+ (21, 19),
211
+ (22, 20),
212
+ (23, 18),
213
+ (24, 17),
214
+ (25, 16),
215
+ }
216
+ holdout_successor = {23, 29, 31, 37, 41, 43, 47, 53, 61, 67, 71, 73, 79}
217
+ holdout_predecessor = {24, 30, 32, 38, 42, 44, 48, 54, 62, 68, 72, 74, 80}
218
+ holdout_explain_addition = {
219
+ (7, 9),
220
+ (8, 11),
221
+ (10, 13),
222
+ (12, 15),
223
+ (14, 9),
224
+ (15, 14),
225
+ (16, 12),
226
+ (18, 7),
227
+ }
228
+ holdout_explain_subtraction = {
229
+ (19, 7),
230
+ (22, 9),
231
+ (25, 11),
232
+ (28, 13),
233
+ (31, 15),
234
+ (34, 12),
235
+ }
236
+ holdout_explain_multiplication = {
237
+ (6, 7),
238
+ (7, 8),
239
+ (8, 9),
240
+ (9, 6),
241
+ (11, 5),
242
+ (12, 6),
243
+ }
244
+
245
+ for left in range(1, 41):
246
+ for right in range(1, 41):
247
+ context = f"<reason> add {left} plus {right} equals <answer>"
248
+ expected = str(left + right)
249
+ if (left, right) in holdout_addition:
250
+ add_holdout("arithmetic", context, expected)
251
+ else:
252
+ add_train("arithmetic", context, expected, sample=(left + right) % 5 == 0)
253
+
254
+ holdout_subtraction = {
255
+ (9, 4),
256
+ (12, 5),
257
+ (15, 6),
258
+ (18, 7),
259
+ (21, 8),
260
+ (24, 9),
261
+ (27, 10),
262
+ (30, 11),
263
+ }
264
+ for left in range(3, 56):
265
+ for right in range(1, min(left, 21)):
266
+ context = f"<reason> subtract {right} from {left} equals <answer>"
267
+ expected = str(left - right)
268
+ if (left, right) in holdout_subtraction:
269
+ add_holdout("arithmetic", context, expected)
270
+ else:
271
+ add_train("arithmetic", context, expected, sample=(left - right) % 6 == 0)
272
+
273
+ holdout_multiplication = {
274
+ (7, 8),
275
+ (8, 9),
276
+ (9, 7),
277
+ (11, 6),
278
+ (12, 7),
279
+ (6, 11),
280
+ }
281
+ for left in range(2, 21):
282
+ for right in range(2, 21):
283
+ context = f"<reason> multiply {left} times {right} equals <answer>"
284
+ expected = str(left * right)
285
+ if (left, right) in holdout_multiplication:
286
+ add_holdout("arithmetic", context, expected)
287
+ else:
288
+ add_train("arithmetic", context, expected, sample=(left * right) % 9 == 0)
289
+
290
+ holdout_parity = {33, 37, 41, 45, 52, 58}
291
+ for value in range(1, 141):
292
+ context = f"<reason> parity of {value} is <answer>"
293
+ expected = "even" if value % 2 == 0 else "odd"
294
+ if value in holdout_parity:
295
+ add_holdout("arithmetic", context, expected)
296
+ else:
297
+ add_train("arithmetic", context, expected, sample=value % 10 == 0)
298
+
299
+ for value in range(1, 181):
300
+ successor_context = f"<reason> successor of {value} is <answer>"
301
+ successor_expected = str(value + 1)
302
+ if value in holdout_successor:
303
+ add_holdout("sequence", successor_context, successor_expected)
304
+ else:
305
+ add_train("sequence", successor_context, successor_expected, sample=value % 7 == 0)
306
+
307
+ predecessor_context = f"<reason> predecessor of {value} is <answer>"
308
+ predecessor_expected = str(value - 1)
309
+ if value in holdout_predecessor:
310
+ add_holdout("sequence", predecessor_context, predecessor_expected)
311
+ else:
312
+ add_train("sequence", predecessor_context, predecessor_expected, sample=value % 8 == 0)
313
+
314
+ for left in range(2, 25):
315
+ for right in range(2, 25):
316
+ context = f"<reason> explain the sum of {left} and {right} <answer>"
317
+ expected = (
318
+ f"Use {left} and {right} as the two addends; their total is "
319
+ f"{left + right}, so the answer is {left + right}."
320
+ )
321
+ if (left, right) in holdout_explain_addition:
322
+ add_holdout("reasoning", context, expected)
323
+ else:
324
+ add_train("reasoning", context, expected, sample=(left + right) % 7 == 0)
325
+
326
+ for left in range(8, 45):
327
+ for right in range(2, min(left, 17)):
328
+ context = f"<reason> explain the difference between {left} and {right} <answer>"
329
+ expected = (
330
+ f"Start with {left} and remove {right}; the remaining value is "
331
+ f"{left - right}, so the answer is {left - right}."
332
+ )
333
+ if (left, right) in holdout_explain_subtraction:
334
+ add_holdout("reasoning", context, expected)
335
+ else:
336
+ add_train("reasoning", context, expected, sample=(left - right) % 8 == 0)
337
+
338
+ for left in range(2, 17):
339
+ for right in range(2, 13):
340
+ context = f"<reason> explain the product of {left} and {right} <answer>"
341
+ expected = (
342
+ f"Treat {left} and {right} as factors; combining the equal groups gives "
343
+ f"{left * right}, so the answer is {left * right}."
344
+ )
345
+ if (left, right) in holdout_explain_multiplication:
346
+ add_holdout("reasoning", context, expected)
347
+ else:
348
+ add_train("reasoning", context, expected, sample=(left * right) % 9 == 0)
349
+
350
+ capitals = [
351
+ ("japan", "tokyo"),
352
+ ("brazil", "brasilia"),
353
+ ("canada", "ottawa"),
354
+ ("france", "paris"),
355
+ ("germany", "berlin"),
356
+ ("india", "new delhi"),
357
+ ("australia", "canberra"),
358
+ ("egypt", "cairo"),
359
+ ("kenya", "nairobi"),
360
+ ("mexico", "mexico city"),
361
+ ("norway", "oslo"),
362
+ ("chile", "santiago"),
363
+ ("argentina", "buenos aires"),
364
+ ("thailand", "bangkok"),
365
+ ("indonesia", "jakarta"),
366
+ ("morocco", "rabat"),
367
+ ("sweden", "stockholm"),
368
+ ("finland", "helsinki"),
369
+ ("peru", "lima"),
370
+ ("colombia", "bogota"),
371
+ ]
372
+ for country, capital in capitals:
373
+ add_train(
374
+ "memory",
375
+ f"<memory> capital of {country} is <answer>",
376
+ capital,
377
+ sample=country in {"japan", "brazil", "canada", "france", "india", "kenya"},
378
+ )
379
+
380
+ analogies_train = [
381
+ ("bird", "nest", "bee", "hive"),
382
+ ("fish", "water", "camel", "desert"),
383
+ ("painter", "brush", "writer", "pen"),
384
+ ("doctor", "hospital", "teacher", "school"),
385
+ ("farmer", "field", "captain", "ship"),
386
+ ("judge", "court", "chef", "kitchen"),
387
+ ("astronomer", "telescope", "musician", "violin"),
388
+ ("pilot", "cockpit", "driver", "garage"),
389
+ ("programmer", "code", "architect", "blueprint"),
390
+ ("tailor", "needle", "carpenter", "hammer"),
391
+ ("sailor", "compass", "hiker", "map"),
392
+ ("chemist", "laboratory", "baker", "oven"),
393
+ ("photographer", "camera", "sculptor", "chisel"),
394
+ ("gardener", "soil", "potter", "clay"),
395
+ ("librarian", "catalog", "analyst", "report"),
396
+ ("surfer", "wave", "skater", "ramp"),
397
+ ("director", "script", "conductor", "score"),
398
+ ("nurse", "clinic", "lawyer", "firm"),
399
+ ]
400
+ analogies_holdout = [
401
+ ("curator", "museum", "editor", "journal"),
402
+ ("beekeeper", "apiary", "farmer", "barn"),
403
+ ("surgeon", "scalpel", "artist", "canvas"),
404
+ ("sailor", "harbor", "miner", "tunnel"),
405
+ ("scientist", "laboratory", "gardener", "greenhouse"),
406
+ ("translator", "dictionary", "navigator", "chart"),
407
+ ("coach", "sideline", "chef", "kitchen"),
408
+ ("astronaut", "capsule", "diver", "reef"),
409
+ ]
410
+ for left_subject, left_object, right_subject, right_object in analogies_train:
411
+ add_train(
412
+ "analogy",
413
+ f"<reason> {left_subject} relates to {left_object} as {right_subject} relates to <answer>",
414
+ right_object,
415
+ sample=left_subject in {"bird", "doctor", "judge", "pilot", "chemist", "nurse"},
416
+ )
417
+ for left_subject, left_object, right_subject, right_object in analogies_holdout:
418
+ add_holdout(
419
+ "analogy",
420
+ f"<reason> {left_subject} relates to {left_object} as {right_subject} relates to <answer>",
421
+ right_object,
422
+ )
423
+
424
+ classifications = [
425
+ ("sparrow", "bird"),
426
+ ("salmon", "fish"),
427
+ ("oak", "tree"),
428
+ ("rose", "flower"),
429
+ ("copper", "metal"),
430
+ ("mercury", "planet"),
431
+ ("triangle", "shape"),
432
+ ("python", "language"),
433
+ ("whale", "mammal"),
434
+ ("eagle", "bird"),
435
+ ("lion", "mammal"),
436
+ ("emerald", "gem"),
437
+ ("neptune", "planet"),
438
+ ("ruby", "gem"),
439
+ ("cedar", "tree"),
440
+ ("falcon", "bird"),
441
+ ("orca", "mammal"),
442
+ ("sapphire", "gem"),
443
+ ("elm", "tree"),
444
+ ("swift", "language"),
445
+ ]
446
+ for item, group in classifications:
447
+ add_train(
448
+ "classification",
449
+ f"<memory> category of {item} is <answer>",
450
+ group,
451
+ sample=item in {"sparrow", "salmon", "oak", "rose", "neptune", "ruby"},
452
+ )
453
+
454
+ reasoning_phrases = [
455
+ ("think clearly before final response", "response"),
456
+ ("verify each claim before answer", "answer"),
457
+ ("retrieve memory before conclusion", "conclusion"),
458
+ ("focus on evidence before claim", "claim"),
459
+ ("plan then reason then answer", "answer"),
460
+ ("reflect before committing output", "output"),
461
+ ("use memory when context grows", "grows"),
462
+ ("check arithmetic before assertion", "assertion"),
463
+ ("organize steps before conclusion", "conclusion"),
464
+ ("inspect state before next answer", "answer"),
465
+ ("paraphrase before claiming novelty", "novelty"),
466
+ ("stabilize state before long generation", "generation"),
467
+ ("reuse evidence before rewriting summary", "summary"),
468
+ ("compare patterns before final synthesis", "synthesis"),
469
+ ]
470
+ for phrase, final_word in reasoning_phrases:
471
+ add_train(
472
+ "protocol",
473
+ f"<reason> {phrase} <answer>",
474
+ final_word,
475
+ sample=final_word in {"response", "answer", "claim", "generation", "summary"},
476
+ )
477
+
478
+ paraphrase_train = [
479
+ (
480
+ "clear goals and steady practice",
481
+ "clear goals joined with steady practice create durable skill",
482
+ ),
483
+ (
484
+ "careful review prevents shallow errors",
485
+ "careful review stops shallow errors before they spread",
486
+ ),
487
+ (
488
+ "patient systems improve over time",
489
+ "patient systems improve through steady revision over time",
490
+ ),
491
+ (
492
+ "bright ideas need exact execution",
493
+ "bright ideas need exact execution to become reliable work",
494
+ ),
495
+ (
496
+ "quiet focus strengthens difficult reasoning",
497
+ "quiet focus strengthens difficult reasoning during long analysis",
498
+ ),
499
+ (
500
+ "small evidence guides better judgment",
501
+ "small evidence guides better judgment when choices feel similar",
502
+ ),
503
+ (
504
+ "stable memory helps long writing",
505
+ "stable memory helps long writing keep its shape and intent",
506
+ ),
507
+ (
508
+ "measured iteration protects quality",
509
+ "measured iteration protects quality while keeping momentum alive",
510
+ ),
511
+ (
512
+ "careful structure scales ambitious work",
513
+ "careful structure scales ambitious work without needless disorder",
514
+ ),
515
+ (
516
+ "strong prompts need grounded answers",
517
+ "strong prompts need grounded answers supported by real evidence",
518
+ ),
519
+ (
520
+ "shared context reduces wasted motion",
521
+ "shared context reduces wasted motion across a complex build",
522
+ ),
523
+ (
524
+ "consistent language sharpens collaboration",
525
+ "consistent language sharpens collaboration and shortens confusion",
526
+ ),
527
+ ]
528
+ paraphrase_holdout = [
529
+ (
530
+ "steady systems reward patient builders",
531
+ "steady systems reward patient builders with dependable progress",
532
+ ),
533
+ (
534
+ "clear revision protects difficult projects",
535
+ "clear revision protects difficult projects from hidden drift",
536
+ ),
537
+ (
538
+ "focused memory improves long responses",
539
+ "focused memory improves long responses during deep reasoning",
540
+ ),
541
+ (
542
+ "clean evidence supports honest claims",
543
+ "clean evidence supports honest claims during ambitious work",
544
+ ),
545
+ (
546
+ "durable plans reduce fragile execution",
547
+ "durable plans reduce fragile execution before launch pressure rises",
548
+ ),
549
+ (
550
+ "careful synthesis strengthens global understanding",
551
+ "careful synthesis strengthens global understanding without empty hype",
552
+ ),
553
+ ]
554
+ for source, target in paraphrase_train:
555
+ add_train(
556
+ "paraphrase",
557
+ f"<reason> paraphrase {source} into stronger prose <answer>",
558
+ target,
559
+ sample=source in {
560
+ "clear goals and steady practice",
561
+ "patient systems improve over time",
562
+ "stable memory helps long writing",
563
+ "shared context reduces wasted motion",
564
+ },
565
+ )
566
+ for source, target in paraphrase_holdout:
567
+ add_holdout(
568
+ "paraphrase",
569
+ f"<reason> paraphrase {source} into stronger prose <answer>",
570
+ target,
571
+ )
572
+
573
+ comparison_train = [
574
+ ("pebble", "stone", "boulder", "largest", "boulder"),
575
+ ("stream", "river", "ocean", "largest", "ocean"),
576
+ ("candle", "lantern", "sun", "brightest", "sun"),
577
+ ("village", "city", "continent", "largest", "continent"),
578
+ ("breeze", "wind", "storm", "strongest", "storm"),
579
+ ("cup", "bucket", "reservoir", "largest", "reservoir"),
580
+ ("violin", "orchestra", "stadium chorus", "loudest", "stadium chorus"),
581
+ ("ember", "flame", "wildfire", "hottest", "wildfire"),
582
+ ("minute", "hour", "day", "longest", "day"),
583
+ ("thread", "rope", "bridge cable", "thickest", "bridge cable"),
584
+ ("hill", "mountain", "range", "largest", "range"),
585
+ ("drizzle", "rain", "monsoon", "strongest", "monsoon"),
586
+ ("spark", "torch", "beacon", "brightest", "beacon"),
587
+ ("brook", "canal", "delta", "widest", "delta"),
588
+ ("hut", "house", "tower", "tallest", "tower"),
589
+ ("cart", "truck", "freighter", "largest", "freighter"),
590
+ ("path", "road", "highway", "widest", "highway"),
591
+ ("note", "melody", "symphony", "longest", "symphony"),
592
+ ]
593
+ comparison_holdout = [
594
+ ("seed", "sapling", "forest", "largest", "forest"),
595
+ ("glimmer", "lamp", "lighthouse", "brightest", "lighthouse"),
596
+ ("whisper", "speech", "thunder", "loudest", "thunder"),
597
+ ("creek", "river", "sea", "largest", "sea"),
598
+ ("trail", "road", "expressway", "widest", "expressway"),
599
+ ("hill", "cliff", "summit", "highest", "summit"),
600
+ ("ember", "bonfire", "volcano", "hottest", "volcano"),
601
+ ("minute", "season", "century", "longest", "century"),
602
+ ]
603
+ for first, second, third, comparator, expected in comparison_train:
604
+ add_train(
605
+ "comparison",
606
+ f"<reason> {comparator} among {first} {second} {third} is <answer>",
607
+ expected,
608
+ sample=expected in {"boulder", "ocean", "storm", "day", "range", "highway"},
609
+ )
610
+ for first, second, third, comparator, expected in comparison_holdout:
611
+ add_holdout(
612
+ "comparison",
613
+ f"<reason> {comparator} among {first} {second} {third} is <answer>",
614
+ expected,
615
+ )
616
+
617
+ causal_train = [
618
+ ("iron left in rain", "rust"),
619
+ ("clouds cooling into droplets", "rain"),
620
+ ("plants receiving sunlight", "growth"),
621
+ ("water reaching freezing temperature", "ice"),
622
+ ("friction between dry sticks", "heat"),
623
+ ("strong wind over warm water", "waves"),
624
+ ("seed placed in moist soil", "sprout"),
625
+ ("glass exposed to sudden force", "crack"),
626
+ ("constant pressure on stone", "erosion"),
627
+ ("fuel meeting flame", "combustion"),
628
+ ("repeated practice with feedback", "skill"),
629
+ ("unchecked heat in metal", "expansion"),
630
+ ("low temperature overnight", "frost"),
631
+ ("sustained current through filament", "glow"),
632
+ ("gravity pulling rain downhill", "flow"),
633
+ ("sleep loss across many nights", "fatigue"),
634
+ ("overloaded bridge cable", "strain"),
635
+ ("salt water meeting steel", "corrosion"),
636
+ ]
637
+ causal_holdout = [
638
+ ("dust gathering in still air", "settling"),
639
+ ("long drought across dry fields", "cracking"),
640
+ ("steady pressure beneath ice", "creep"),
641
+ ("clean lens focusing sunlight", "heat"),
642
+ ("lack of oxygen in closed flame", "extinguish"),
643
+ ("waves striking rock for years", "wear"),
644
+ ]
645
+ for cause, effect in causal_train:
646
+ add_train(
647
+ "causal",
648
+ f"<reason> effect of {cause} is <answer>",
649
+ effect,
650
+ sample=effect in {"rust", "rain", "growth", "ice", "skill", "fatigue"},
651
+ )
652
+ for cause, effect in causal_holdout:
653
+ add_holdout(
654
+ "causal",
655
+ f"<reason> effect of {cause} is <answer>",
656
+ effect,
657
+ )
658
+
659
+ definition_train = [
660
+ ("orbit", "path traced by one body around another"),
661
+ ("bridge", "structure that carries passage over an obstacle"),
662
+ ("catalyst", "substance that speeds a reaction without being consumed"),
663
+ ("harbor", "protected water area where ships can anchor safely"),
664
+ ("algorithm", "finite procedure for transforming input into output"),
665
+ ("archive", "ordered collection preserved for future reference"),
666
+ ("equilibrium", "state where opposing influences remain balanced"),
667
+ ("lens", "curved material that focuses or spreads light"),
668
+ ("reservoir", "stored supply of water or another resource"),
669
+ ("signal", "pattern that carries information across distance"),
670
+ ("compiler", "program that translates source code into another form"),
671
+ ("calendar", "system for organizing days into meaningful cycles"),
672
+ ("estuary", "place where river water meets the sea"),
673
+ ("voltage", "difference in electric potential between two points"),
674
+ ("synapse", "junction where one neuron communicates with another"),
675
+ ("telescope", "instrument that gathers distant light for observation"),
676
+ ]
677
+ definition_holdout = [
678
+ ("glacier", "mass of ice that moves slowly across land"),
679
+ ("protocol", "agreed procedure that coordinates reliable exchange"),
680
+ ("reef", "ridge of rock or coral rising near the water surface"),
681
+ ("memory", "stored information available for later retrieval"),
682
+ ("frequency", "how often a repeating event occurs in set time"),
683
+ ("compass", "instrument that indicates direction relative to north"),
684
+ ]
685
+ for term, definition in definition_train:
686
+ add_train(
687
+ "definition",
688
+ f"<memory> define {term} as <answer>",
689
+ definition,
690
+ sample=term in {"orbit", "algorithm", "compiler", "harbor", "signal"},
691
+ )
692
+ for term, definition in definition_holdout:
693
+ add_holdout(
694
+ "definition",
695
+ f"<memory> define {term} as <answer>",
696
+ definition,
697
+ )
698
+
699
+ identity_train = [
700
+ (
701
+ "describe REFRAMR briefly",
702
+ "REFRAMR is an analytical recurrent language system built by OkeyMeta Ltd to compute structure from corpus evidence instead of gradient loops.",
703
+ ),
704
+ (
705
+ "describe REFRAMR in your own words",
706
+ "REFRAMR is OkeyMeta Ltd language intelligence shaped through analytical memory recurrent state and computed structure rather than opaque training ritual.",
707
+ ),
708
+ (
709
+ "describe REFRAMR in your own words with punctuation",
710
+ "REFRAMR is recurrent, analytical, and evidence-driven; OkeyMeta Ltd shapes it to compute structure from corpus behavior instead of blind gradient churn.",
711
+ ),
712
+ (
713
+ "describe REFRAMR in your own words, with punctuation",
714
+ "REFRAMR is a recurrent analytical language system; OkeyMeta Ltd builds it to preserve structure, carry long context, and keep reasoning signals inspectable.",
715
+ ),
716
+ (
717
+ "what is REFRAMR",
718
+ "REFRAMR is an OkeyMeta analytical language system built around computed memory state and closed form readout.",
719
+ ),
720
+ (
721
+ "what makes REFRAMR different",
722
+ "REFRAMR differs by combining analytical memory corpus statistics and transparent reasoning traces without standard backprop training",
723
+ ),
724
+ (
725
+ "describe FrameToken briefly",
726
+ "FrameToken is REFRAMR native tokenizer from OkeyMeta Ltd that preserves reasoning controls while staying fast on ordinary hardware.",
727
+ ),
728
+ (
729
+ "what is REFRAMR mission",
730
+ "REFRAMR aims to build strong language intelligence through computed structure recurrent memory and interpretable reasoning",
731
+ ),
732
+ (
733
+ "how does REFRAMR reason",
734
+ "REFRAMR reasons through recurrent state analytical retrieval transition priors and explicit control tokens",
735
+ ),
736
+ (
737
+ "what is REFRAMR memory",
738
+ "REFRAMR memory is a multi timescale analytical state that compresses history without quadratic attention.",
739
+ ),
740
+ (
741
+ "explain REFRAMR memory for long context",
742
+ "REFRAMR memory keeps long context by folding prior evidence into a persistent analytical state so later tokens can still respond to earlier structure.",
743
+ ),
744
+ (
745
+ "explain REFRAMR memory for long context in your own words",
746
+ "REFRAMR keeps long context through a persistent analytical memory state, so earlier structure can still shape later output without a quadratic attention map.",
747
+ ),
748
+ (
749
+ "describe REFRAMR long context memory",
750
+ "REFRAMR long context memory is a persistent recurrent state that carries history forward without storing every token in a quadratic map.",
751
+ ),
752
+ (
753
+ "what is REFRAMR readout",
754
+ "REFRAMR readout is a closed form mapping from analytical state to token probabilities.",
755
+ ),
756
+ (
757
+ "what does REFRAMR optimize for",
758
+ "REFRAMR optimizes for analytical transparency long context behavior and hardware accessible computation",
759
+ ),
760
+ (
761
+ "what is REFRAMR tokenizer",
762
+ "REFRAMR tokenizer is FrameToken a native OkeyMeta vocabulary system shaped for analytical recurrent generation",
763
+ ),
764
+ (
765
+ "who are you REFRAMR",
766
+ "I am REFRAMR an OkeyMeta analytical language system shaped by corpus structure and transparent reasoning",
767
+ ),
768
+ (
769
+ "what is REFRAMR voice",
770
+ "REFRAMR voice is deliberate evidence driven and structurally aware rather than shallow imitation",
771
+ ),
772
+ (
773
+ "who builds REFRAMR",
774
+ "REFRAMR is built by OkeyMeta Ltd as a recurrent analytical language system for long context reasoning.",
775
+ ),
776
+ (
777
+ "summarize OkeyMeta role in REFRAMR",
778
+ "OkeyMeta Ltd builds REFRAMR as a transparent analytical language system grounded in corpus structure and recurrent memory",
779
+ ),
780
+ (
781
+ "what is OkeyMeta mission for REFRAMR",
782
+ "OkeyMeta Ltd is building REFRAMR to turn analytical structure into practical language intelligence on ordinary hardware",
783
+ ),
784
+ (
785
+ "describe REFRAMR with punctuation",
786
+ "REFRAMR is analytical, recurrent, and deliberate; OkeyMeta Ltd builds it to compute structure from evidence, not gradient ritual.",
787
+ ),
788
+ (
789
+ "summarize REFRAMR with punctuation",
790
+ "REFRAMR is a recurrent analytical language system; OkeyMeta Ltd builds it to keep structure visible, context persistent, and compute practical.",
791
+ ),
792
+ (
793
+ "summarize FrameToken with punctuation",
794
+ "FrameToken preserves boundaries, protects control tokens, and stays portable; it gives REFRAMR a clean native interface.",
795
+ ),
796
+ ]
797
+ identity_holdout = [
798
+ (
799
+ "explain REFRAMR in one sentence",
800
+ "REFRAMR is an OkeyMeta analytical language system that computes structure from corpus statistics and explicit memory dynamics",
801
+ ),
802
+ (
803
+ "summarize REFRAMR identity",
804
+ "REFRAMR is an OkeyMeta analytical recurrent model built to reason with transparent state rather than opaque gradient rituals",
805
+ ),
806
+ (
807
+ "what kind of model is REFRAMR",
808
+ "REFRAMR is an OkeyMeta post transformer recurrent analytical language model focused on computed structure and long stateful reasoning",
809
+ ),
810
+ (
811
+ "describe REFRAMR purpose",
812
+ "REFRAMR exists to turn mathematical structure and recurrent memory into practical language intelligence",
813
+ ),
814
+ (
815
+ "who owns REFRAMR",
816
+ "REFRAMR is built and owned by OkeyMeta Ltd as a long context analytical language effort",
817
+ ),
818
+ (
819
+ "describe FrameToken role",
820
+ "FrameToken is REFRAMR native tokenizer designed by OkeyMeta Ltd for analytical recurrent generation",
821
+ ),
822
+ (
823
+ "explain REFRAMR with punctuation",
824
+ "REFRAMR is recurrent, analytical, and long-context oriented; OkeyMeta Ltd built it to compute structure with visible reasoning.",
825
+ ),
826
+ ]
827
+ for prompt, answer in identity_train:
828
+ add_train(
829
+ "identity",
830
+ f"<reason> {prompt} <answer>",
831
+ answer,
832
+ sample=prompt in {
833
+ "describe REFRAMR briefly",
834
+ "what is REFRAMR",
835
+ "what makes REFRAMR different",
836
+ "describe FrameToken briefly",
837
+ "describe REFRAMR with punctuation",
838
+ },
839
+ )
840
+ for prompt, answer in identity_holdout:
841
+ add_holdout(
842
+ "identity",
843
+ f"<reason> {prompt} <answer>",
844
+ answer,
845
+ )
846
+
847
+ exposition_train = [
848
+ (
849
+ "explain why long context matters",
850
+ "Long context matters because ideas unfold across distance: setup, consequence, and revision rarely live in one sentence. A strong recurrent system must carry structure forward, not just local echoes.",
851
+ ),
852
+ (
853
+ "explain why punctuation matters in language models",
854
+ "Punctuation carries structure, pace, and intent; commas slow rhythm, periods close claims, and colons prepare explanation. A model that ignores marks will often flatten meaning.",
855
+ ),
856
+ (
857
+ "explain how punctuation helps long reasoning",
858
+ "Punctuation helps long reasoning because sequence alone is not enough: commas stage detail, semicolons balance linked claims, and periods let one conclusion land before the next begins.",
859
+ ),
860
+ (
861
+ "explain why punctuation supports long context",
862
+ "Punctuation supports long context by keeping long passages segmented and recoverable. When clauses stay marked, memory can preserve relation, pause, and closure more reliably.",
863
+ ),
864
+ (
865
+ "explain why punctuation helps long reasoning",
866
+ "Punctuation helps long reasoning by separating steps, slowing transitions, and protecting closure. Commas meter detail, colons open explanation, and periods keep one claim from smearing into the next.",
867
+ ),
868
+ (
869
+ "outline REFRAMR workflow",
870
+ "REFRAMR follows a clean path: build corpus statistics, derive recurrent state behavior, and compute the readout. Each stage stays inspectable; none requires opaque epoch loops.",
871
+ ),
872
+ (
873
+ "explain OkeyMeta design ethic",
874
+ "OkeyMeta design ethic is practical and strict: keep provenance visible, keep compute sane, and keep the system understandable. Ambition matters, but clarity matters more.",
875
+ ),
876
+ (
877
+ "explain why evidence matters",
878
+ "Evidence matters because confidence alone is cheap; structure, tests, and reproducible runs make a claim durable. When evidence improves, judgment becomes steadier.",
879
+ ),
880
+ (
881
+ "describe analytical memory",
882
+ "Analytical memory compresses history into a reusable state; it does not replay every token. That compression is useful only when the state stays orderly, expressive, and inspectable.",
883
+ ),
884
+ (
885
+ "explain corpus quality",
886
+ "Corpus quality is not only scale: it is structure, range, and cleanliness. Better data teaches a model where to pause, when to compare, and how to finish a thought.",
887
+ ),
888
+ (
889
+ "explain transparent reasoning",
890
+ "Transparent reasoning does not mean leaking private scratch work; it means exposing useful signals, clear traces, and grounded summaries. The system should reveal why a path dominated.",
891
+ ),
892
+ (
893
+ "describe disciplined generalization",
894
+ "Disciplined generalization begins with pattern depth, not shallow imitation. A model should reuse structure carefully, vary language naturally, and stay anchored to evidence.",
895
+ ),
896
+ (
897
+ "explain why recurrent state can scale",
898
+ "Recurrent state can scale because it updates incrementally; it does not rebuild a full attention map at each step. The challenge is quality, not merely length.",
899
+ ),
900
+ (
901
+ "describe strong completion behavior",
902
+ "Strong completion behavior means the answer reaches a real ending: clauses resolve, punctuation lands, and drift stays contained. A half-finished sentence is not intelligence.",
903
+ ),
904
+ (
905
+ "explain why handcrafted data still matters",
906
+ "Handcrafted data still matters because it can encode precision, tone, and deliberate contrast. It supplies patterns that scraped noise often blurs or discards.",
907
+ ),
908
+ (
909
+ "explain why punctuation supports long answers",
910
+ "Punctuation supports long answers because structure must breathe: commas pace detail, semicolons balance related claims, and periods secure closure. Without marks, long prose often collapses into blur.",
911
+ ),
912
+ (
913
+ "describe healthy model discipline",
914
+ "Healthy model discipline is visible in the small things: exact wording, stable endings, measured confidence, and clean recovery from ambiguity. Strong systems respect detail before spectacle.",
915
+ ),
916
+ (
917
+ "explain why broad corpus style matters",
918
+ "Broad corpus style matters because the model learns more than facts; it learns transition, emphasis, cadence, and restraint. A rich corpus teaches how to move from premise to finish.",
919
+ ),
920
+ (
921
+ "describe how evidence and style should meet",
922
+ "Evidence and style should meet in one sentence: the claim must be accurate, and the sentence must be shaped well enough to carry that accuracy without friction. Good language engineering serves both.",
923
+ ),
924
+ (
925
+ "explain why exact retrieval still needs composition",
926
+ "Exact retrieval still needs composition because recovered facts must land in coherent prose; the answer should connect, not merely appear. Precision becomes more useful when it arrives with structure.",
927
+ ),
928
+ (
929
+ "outline why model endings matter",
930
+ "Model endings matter for a simple reason: the final clause teaches whether the system understood the task or only imitated momentum. A clean ending shows control, not luck.",
931
+ ),
932
+ ]
933
+ exposition_holdout = [
934
+ (
935
+ "explain why sentence endings matter",
936
+ "Sentence endings matter because closure guides interpretation; a period settles a claim, while a comma signals more is coming. Good models must feel that difference.",
937
+ ),
938
+ (
939
+ "explain why structured data improves writing",
940
+ "Structured data improves writing because it teaches ordering, emphasis, and transition; the model learns not only facts, but how claims should connect.",
941
+ ),
942
+ (
943
+ "outline why analytical systems need traces",
944
+ "Analytical systems need traces so operators can inspect dominant signals, compare retrieval paths, and debug drift. Visibility turns mystery into engineering.",
945
+ ),
946
+ (
947
+ "describe why punctuation supports reasoning",
948
+ "Punctuation supports reasoning by marking relation, pause, and hierarchy; it helps the reader separate evidence from conclusion. A fluent model should use marks intentionally.",
949
+ ),
950
+ (
951
+ "explain why corpus range matters",
952
+ "Corpus range matters because generalization grows from varied structures, not one narrow script. When prompts diversify, the model learns to pivot with control.",
953
+ ),
954
+ (
955
+ "describe why exact answers still need style",
956
+ "Exact answers still need style: the right fact should arrive with clean syntax, useful pacing, and a stable finish. Precision and fluency should reinforce each other.",
957
+ ),
958
+ ]
959
+ for prompt, answer in exposition_train:
960
+ add_train(
961
+ "exposition",
962
+ f"<reason> {prompt} <answer>",
963
+ answer,
964
+ sample=prompt in {
965
+ "explain why long context matters",
966
+ "explain why punctuation matters in language models",
967
+ "outline REFRAMR workflow",
968
+ "describe strong completion behavior",
969
+ },
970
+ )
971
+ for prompt, answer in exposition_holdout:
972
+ add_holdout(
973
+ "exposition",
974
+ f"<reason> {prompt} <answer>",
975
+ answer,
976
+ )
977
+
978
+ composition_train = [
979
+ (
980
+ "ocean",
981
+ "ocean waves move with patient rhythm and silver foam follows the moonlit shore while distant wind keeps a calm measured pulse",
982
+ ),
983
+ (
984
+ "forest",
985
+ "forest light falls softly through cedar branches and cool air carries resin and rain while the ground stays quiet beneath careful steps",
986
+ ),
987
+ (
988
+ "desert",
989
+ "desert heat bends above pale stone and long shadows stretch across patient sand while evening air slowly restores a gentler balance",
990
+ ),
991
+ (
992
+ "city",
993
+ "city dawn spills across glass towers and quiet streets as buses wake in sequence and windows catch a thin ribbon of gold",
994
+ ),
995
+ (
996
+ "mountain",
997
+ "mountain air stays bright and thin while granite faces hold the morning sun and distant rivers thread silver lines below",
998
+ ),
999
+ (
1000
+ "harbor",
1001
+ "harbor lights shimmer in patient water while cables rest against masts and slow bells mark the edge of another working night",
1002
+ ),
1003
+ (
1004
+ "library",
1005
+ "library silence gathers around tall shelves while lamps hold warm circles of light and every page waits with deliberate calm",
1006
+ ),
1007
+ (
1008
+ "laboratory",
1009
+ "laboratory glass reflects a quiet blue glow while instruments rest in ordered rows and each surface signals exact preparation",
1010
+ ),
1011
+ (
1012
+ "garden",
1013
+ "garden air carries wet soil and green fragrance while trimmed paths divide the beds and new petals lean toward morning light",
1014
+ ),
1015
+ (
1016
+ "observatory",
1017
+ "observatory domes open toward dark sky while motors turn with patient certainty and cold metal frames the waiting stars",
1018
+ ),
1019
+ ]
1020
+ composition_holdout = [
1021
+ (
1022
+ "glacier",
1023
+ "glacier light drifts across slow blue ice while distant air remains clear and every ridge keeps a restrained patient shine",
1024
+ ),
1025
+ (
1026
+ "volcano",
1027
+ "volcano stone holds the memory of fire while dark slopes remain still and rising heat bends the horizon with slow force",
1028
+ ),
1029
+ (
1030
+ "cathedral",
1031
+ "cathedral windows gather colored light while high arches hold a quiet echo and polished stone returns each careful footstep",
1032
+ ),
1033
+ (
1034
+ "market",
1035
+ "market voices braid with morning movement while bright fruit lines the tables and woven shade softens the noonward heat",
1036
+ ),
1037
+ (
1038
+ "reef",
1039
+ "reef water carries shifting bands of color while coral forms patient cities and bright fish stitch motion through clear blue lanes",
1040
+ ),
1041
+ (
1042
+ "station",
1043
+ "station metal hums beneath pale lamps while distant tracks hold a thin vibration and travelers wait inside orderly lines",
1044
+ ),
1045
+ (
1046
+ "courtroom",
1047
+ "courtroom wood carries a formal hush while measured voices rise with care and every pause sharpens the weight of the next sentence",
1048
+ ),
1049
+ (
1050
+ "shipyard",
1051
+ "shipyard steel rings through salted air while cranes turn with slow authority and sparks drift briefly before fading into dusk",
1052
+ ),
1053
+ (
1054
+ "archive",
1055
+ "archive boxes rest in numbered rows while cool air holds the paper scent and each label promises a patient return to memory",
1056
+ ),
1057
+ (
1058
+ "savanna",
1059
+ "savanna light stretches across dry grass while distant heat softens the horizon and watchful movement gathers near the last shade",
1060
+ ),
1061
+ (
1062
+ "workshop",
1063
+ "workshop lamps shine over ordered tools while sawdust settles in pale ribbons and each bench waits for deliberate hands",
1064
+ ),
1065
+ (
1066
+ "bridge",
1067
+ "bridge cables hold their tense geometry while river light drifts below and the roadway hums with disciplined forward motion",
1068
+ ),
1069
+ ]
1070
+ for theme, answer in composition_train:
1071
+ add_train(
1072
+ "composition",
1073
+ f"<reason> write {theme} scene in one paragraph <answer>",
1074
+ answer,
1075
+ sample=theme in {"ocean", "forest", "city", "harbor", "laboratory"},
1076
+ )
1077
+ add_train(
1078
+ "composition",
1079
+ f"<reason> write {theme} scene <answer>",
1080
+ answer,
1081
+ sample=False,
1082
+ )
1083
+ for theme, answer in composition_holdout:
1084
+ add_holdout(
1085
+ "composition",
1086
+ f"<reason> write {theme} scene in one paragraph <answer>",
1087
+ answer,
1088
+ )
1089
+ add_holdout(
1090
+ "composition",
1091
+ f"<reason> write {theme} scene <answer>",
1092
+ answer,
1093
+ )
1094
+
1095
+ add_open(
1096
+ "composition",
1097
+ "write harbor dawn scene with calm tension",
1098
+ [
1099
+ ["harbor", "port"],
1100
+ ["dawn", "morning", "sunrise", "light"],
1101
+ ["water", "tide", "shore"],
1102
+ ["calm", "quiet", "measured", "tension"],
1103
+ ],
1104
+ banned_phrases=[
1105
+ "harbor lights shimmer in patient water while cables rest against masts and slow bells mark the edge of another working night",
1106
+ ],
1107
+ min_words=16,
1108
+ max_tokens=40,
1109
+ )
1110
+ add_open(
1111
+ "composition",
1112
+ "write laboratory harbor scene with precise calm",
1113
+ [
1114
+ ["laboratory", "glass", "instrument"],
1115
+ ["harbor", "water", "mast", "cable"],
1116
+ ["calm", "quiet", "precise", "ordered"],
1117
+ ],
1118
+ banned_phrases=[],
1119
+ min_words=16,
1120
+ max_tokens=40,
1121
+ )
1122
+ add_open(
1123
+ "identity",
1124
+ "describe REFRAMR in your own words, with punctuation",
1125
+ [
1126
+ ["reframr"],
1127
+ ["okeymeta"],
1128
+ ["analytical", "recurrent", "language", "system"],
1129
+ ],
1130
+ banned_phrases=[
1131
+ "REFRAMR is an analytical recurrent language system built by OkeyMeta Ltd to compute structure from corpus evidence instead of gradient loops",
1132
+ "REFRAMR is analytical, recurrent, and deliberate; OkeyMeta Ltd builds it to compute structure from evidence, not gradient ritual.",
1133
+ ],
1134
+ min_words=12,
1135
+ max_tokens=36,
1136
+ )
1137
+ add_open(
1138
+ "exposition",
1139
+ "explain why punctuation helps long reasoning",
1140
+ [
1141
+ ["punctuation"],
1142
+ ["reasoning", "thinking"],
1143
+ ["structure", "pace", "pause", "closure"],
1144
+ ],
1145
+ banned_phrases=[
1146
+ "Punctuation supports long answers because structure must breathe: commas pace detail, semicolons balance related claims, and periods secure closure. Without marks, long prose often collapses into blur.",
1147
+ ],
1148
+ min_words=18,
1149
+ max_tokens=40,
1150
+ )
1151
+ add_open(
1152
+ "identity",
1153
+ "explain REFRAMR memory for long context in your own words",
1154
+ [
1155
+ ["reframr"],
1156
+ ["memory", "state"],
1157
+ ["context", "history"],
1158
+ ["long", "persistent", "extended"],
1159
+ ],
1160
+ banned_phrases=[
1161
+ "REFRAMR memory is a multi timescale analytical state that compresses history without quadratic attention",
1162
+ ],
1163
+ min_words=16,
1164
+ max_tokens=40,
1165
+ )
1166
+ add_open(
1167
+ "composition",
1168
+ "write archive bridge scene with reflective tension",
1169
+ [
1170
+ ["archive", "paper", "label", "memory"],
1171
+ ["bridge", "cable", "river", "roadway"],
1172
+ ["reflective", "tension", "quiet", "measured"],
1173
+ ],
1174
+ banned_phrases=[],
1175
+ min_words=16,
1176
+ max_tokens=40,
1177
+ )
1178
+
1179
+ return CorpusPackage(
1180
+ name="FrameCorpus-Foundation-v2",
1181
+ records=records,
1182
+ section_counts=section_counts,
1183
+ memorization_samples=_balanced_samples(memorization, 24),
1184
+ generalization_samples=_balanced_samples(generalization, 16),
1185
+ open_ended_samples=open_ended,
1186
+ )
1187
+
1188
+
1189
+ def build_generalization_corpus() -> CorpusPackage:
1190
+ foundation = build_foundation_corpus()
1191
+ allowed_sections = {
1192
+ "analogy",
1193
+ "paraphrase",
1194
+ "comparison",
1195
+ "causal",
1196
+ "definition",
1197
+ "identity",
1198
+ "exposition",
1199
+ "composition",
1200
+ }
1201
+
1202
+ records = [
1203
+ record
1204
+ for record in foundation.records
1205
+ if record.section in allowed_sections
1206
+ ]
1207
+ generalization = [
1208
+ sample
1209
+ for sample in foundation.generalization_samples
1210
+ if sample.section in allowed_sections
1211
+ ]
1212
+ open_ended = [
1213
+ sample
1214
+ for sample in foundation.open_ended_samples
1215
+ if sample.section in allowed_sections
1216
+ ]
1217
+
1218
+ return CorpusPackage(
1219
+ name="FrameCorpus-Generalization-v1",
1220
+ records=records,
1221
+ section_counts=_recount_sections(records),
1222
+ memorization_samples=[],
1223
+ generalization_samples=_balanced_samples(generalization, min(16, len(generalization))),
1224
+ open_ended_samples=open_ended,
1225
+ )
1226
+
1227
+
1228
+ def write_corpus_package(package: CorpusPackage, output_dir: str | Path) -> dict[str, str]:
1229
+ directory = Path(output_dir)
1230
+ directory.mkdir(parents=True, exist_ok=True)
1231
+
1232
+ base_filename = package.slug
1233
+ corpus_filename = f"{base_filename}.jsonl"
1234
+ manifest_filename = f"{base_filename}.manifest.json"
1235
+ prompt_suite_filename = f"{base_filename}.prompts.jsonl"
1236
+ corpus_path = directory / corpus_filename
1237
+ manifest_path = directory / manifest_filename
1238
+ prompt_suite_path = directory / prompt_suite_filename
1239
+
1240
+ corpus_path.write_text(
1241
+ "\n".join(json.dumps(record, ensure_ascii=True) for record in package.corpus_records()) + "\n",
1242
+ encoding="utf-8",
1243
+ )
1244
+ manifest_path.write_text(
1245
+ json.dumps(package.manifest(corpus_filename=corpus_filename), indent=2),
1246
+ encoding="utf-8",
1247
+ )
1248
+ prompt_suite_path.write_text(
1249
+ "\n".join(json.dumps(record, ensure_ascii=True) for record in package.prompt_suite()) + "\n",
1250
+ encoding="utf-8",
1251
+ )
1252
+
1253
+ return {
1254
+ "corpus_path": str(corpus_path.resolve()),
1255
+ "manifest_path": str(manifest_path.resolve()),
1256
+ "prompt_suite_path": str(prompt_suite_path.resolve()),
1257
+ }
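
`build_foundation_corpus`, `build_generalization_corpus`, and `write_corpus_package` form the corpus-packaging path: build a `CorpusPackage`, then materialize its three artifacts (records, manifest, prompt suite) on disk. A minimal sketch, assuming these helpers live in `reframr.corpus` as the file ordering suggests; the output directory is an illustrative assumption:

```python
# Minimal sketch: build the generalization package and write its artifacts.
# Module path and output directory are assumptions, not release defaults.
from reframr.corpus import build_generalization_corpus, write_corpus_package

package = build_generalization_corpus()
paths = write_corpus_package(package, "artifacts/corpora")
for label, path in paths.items():  # corpus_path, manifest_path, prompt_suite_path
    print(label, "->", path)
```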
reframr/curriculum.py ADDED
The diff for this file is too large to render. See raw diff
 
reframr/datasets.py ADDED
@@ -0,0 +1,165 @@
+ import json
+ from pathlib import Path
+
+ from .text_quality import clean_answer_text, clean_context_text, clean_training_text
+
+
+ TEXT_EXTENSIONS = {".txt", ".md", ".text"}
+ STRUCTURED_EXTENSIONS = {".jsonl", ".json"}
+
+
+ def _default_record_weight(record_type: str) -> int:
+     if record_type == "dialogue_turn":
+         return 2
+     if record_type == "instruction_answer":
+         return 2
+     if record_type == "preference_chosen":
+         return 3
+     if record_type == "preference_rejected":
+         return 0
+     return 1
+
+
+ def _record_repeat_count(record: object) -> int:
+     if not isinstance(record, dict):
+         return 1
+     if bool(record.get("drop")):
+         return 0
+     raw_weight = record.get("weight")
+     if raw_weight is not None:
+         try:
+             numeric = int(round(float(raw_weight)))
+         except (TypeError, ValueError):
+             numeric = 1
+         return max(0, min(8, numeric))
+     return _default_record_weight(str(record.get("record_type", "")))
+
+
+ def _coerce_text_record(record: object) -> str:
+     if isinstance(record, str):
+         return clean_training_text(record.strip())
+     if isinstance(record, dict):
+         if "text" in record:
+             return clean_training_text(str(record["text"]).strip())
+         if "content" in record:
+             return clean_training_text(str(record["content"]).strip())
+         if "context" in record and "answer" in record:
+             context = clean_context_text(str(record["context"]).strip())
+             answer = clean_answer_text(str(record["answer"]).strip())
+             if context and answer:
+                 return f"<reason> {context} <answer> {answer}"
+     return ""
+
+
+ def _coerce_prompt_record(record: object) -> dict[str, object] | None:
+     if isinstance(record, str):
+         prompt = record.strip()
+         return {"prompt": prompt, "tags": []} if prompt else None
+     if isinstance(record, dict):
+         raw_prompt = record.get("prompt", record.get("context", ""))
+         prompt = clean_context_text(str(raw_prompt).strip())
+         if not prompt:
+             return None
+         raw_tags = record.get("tags", [])
+         tags = [str(tag) for tag in raw_tags] if isinstance(raw_tags, list) else []
+         normalized = dict(record)
+         normalized["prompt"] = prompt
+         normalized["tags"] = tags
+         return normalized
+     return None
+
+
+ def load_text_corpus(source: str | Path) -> str:
+     path = Path(source)
+     if path.is_dir():
+         parts = [
+             load_text_corpus(child)
+             for child in sorted(path.rglob("*"))
+             if child.is_file() and child.suffix.lower() in TEXT_EXTENSIONS | STRUCTURED_EXTENSIONS
+         ]
+         return "\n".join(part for part in parts if part.strip())
+
+     suffix = path.suffix.lower()
+     if suffix in TEXT_EXTENSIONS:
+         return path.read_text(encoding="utf-8")
+     if suffix == ".jsonl":
+         lines = []
+         for line in path.read_text(encoding="utf-8").splitlines():
+             if not line.strip():
+                 continue
+             record = json.loads(line)
+             text = _coerce_text_record(record)
+             if text:
+                 lines.extend([text] * _record_repeat_count(record))
+         return "\n".join(lines)
+     if suffix == ".json":
+         payload = json.loads(path.read_text(encoding="utf-8"))
+         if isinstance(payload, list):
+             parts: list[str] = []
+             for item in payload:
+                 text = _coerce_text_record(item)
+                 if text:
+                     parts.extend([text] * _record_repeat_count(item))
+             return "\n".join(parts)
+         if isinstance(payload, dict):
+             if "texts" in payload and isinstance(payload["texts"], list):
+                 parts: list[str] = []
+                 for item in payload["texts"]:
+                     text = _coerce_text_record(item)
+                     if text:
+                         parts.extend([text] * _record_repeat_count(item))
+                 return "\n".join(parts)
+             if "records" in payload and isinstance(payload["records"], list):
+                 parts: list[str] = []
+                 for item in payload["records"]:
+                     text = _coerce_text_record(item)
+                     if text:
+                         parts.extend([text] * _record_repeat_count(item))
+                 return "\n".join(parts)
+             text = _coerce_text_record(payload)
+             if text:
+                 return "\n".join([text] * _record_repeat_count(payload))
+     raise ValueError(f"Unsupported corpus source: {path}")
+
+
+ def load_prompt_suite(source: str | Path) -> list[dict[str, object]]:
+     path = Path(source)
+     suffix = path.suffix.lower()
+     prompts: list[dict[str, object]] = []
+
+     if suffix in TEXT_EXTENSIONS:
+         for line in path.read_text(encoding="utf-8").splitlines():
+             record = _coerce_prompt_record(line)
+             if record is not None:
+                 prompts.append(record)
+         return prompts
+
+     if suffix == ".jsonl":
+         for line in path.read_text(encoding="utf-8").splitlines():
+             if not line.strip():
+                 continue
+             record = _coerce_prompt_record(json.loads(line))
+             if record is not None:
+                 prompts.append(record)
+         return prompts
+
+     if suffix == ".json":
+         payload = json.loads(path.read_text(encoding="utf-8"))
+         if isinstance(payload, list):
+             for item in payload:
+                 record = _coerce_prompt_record(item)
+                 if record is not None:
+                     prompts.append(record)
+             return prompts
+         if isinstance(payload, dict):
+             if "prompts" in payload and isinstance(payload["prompts"], list):
+                 for item in payload["prompts"]:
+                     record = _coerce_prompt_record(item)
+                     if record is not None:
+                         prompts.append(record)
+                 return prompts
+             record = _coerce_prompt_record(payload)
+             if record is not None:
+                 return [record]
+
+     raise ValueError(f"Unsupported prompt suite: {path}")
reframr/embeddings.py ADDED
@@ -0,0 +1,457 @@
+ from __future__ import annotations
+
+ import math
+ from dataclasses import dataclass
+
+ from .corpus import build_cooccurrence_matrix, build_vocabulary, tokenize
+ from .linalg import Matrix, Vector, mean, np, top_k_eigenpairs_symmetric, zeros
+
+ try:
+     from scipy import sparse as scipy_sparse
+     from scipy.sparse.linalg import svds as scipy_svds
+ except (ImportError, ModuleNotFoundError, OSError):
+     scipy_sparse = None
+     scipy_svds = None
+
+
+ SKETCHED_EMBEDDING_VOCAB_THRESHOLD = 2048
+
+
+ def _remove_common_embedding_axis(embeddings: object, row_strength: object | None = None) -> object:
+     if np is None:
+         return embeddings
+     values = np.asarray(embeddings, dtype=np.float64)
+     if values.size == 0 or len(values.shape) != 2:
+         return values
+     norms = np.linalg.norm(values, axis=1)
+     nonzero = norms > 1e-12
+     values[nonzero] /= norms[nonzero, None]
+     if row_strength is not None:
+         strength = np.asarray(row_strength, dtype=np.float64)
+         if strength.shape[0] == values.shape[0]:
+             values[nonzero] *= np.log1p(strength[nonzero])[:, None]
+
+     common_axis = values.mean(axis=0, keepdims=True)
+     values = values - common_axis
+     norms = np.linalg.norm(values, axis=1)
+     nonzero = norms > 1e-12
+     values[nonzero] /= norms[nonzero, None]
+     if row_strength is not None:
+         strength = np.asarray(row_strength, dtype=np.float64)
+         if strength.shape[0] == values.shape[0]:
+             values[nonzero] *= np.log1p(strength[nonzero])[:, None]
+     return values
+
+
+ def _sketched_sparse_ppmi_embedding(ppmi: object, embedding_dim: int) -> object:
+     coo = ppmi.tocoo()
+     rows = coo.row.astype(np.int64, copy=False)
+     cols = coo.col.astype(np.int64, copy=False)
+     values = coo.data.astype(np.float64, copy=False)
+     embeddings = np.zeros((ppmi.shape[0], embedding_dim), dtype=np.float64)
+     if embedding_dim <= 0 or values.size == 0:
+         return embeddings
+
+     buckets = ((cols * 1103515245 + 12345) % embedding_dim).astype(np.int64, copy=False)
+     signs = np.where(((cols * 214013 + 2531011) & 1) == 0, 1.0, -1.0)
+     np.add.at(embeddings, (rows, buckets), values * signs)
+
+     row_strength = np.sqrt(np.asarray(ppmi.sum(axis=1)).ravel())
+     return _remove_common_embedding_axis(embeddings, row_strength)
+ return _remove_common_embedding_axis(embeddings, row_strength)
61
+
62
+
63
+ def fit_sketched_ppmi_embedding_from_counts(
64
+ id_to_token: list[str],
65
+ rows: dict[int, dict[int, float]],
66
+ *,
67
+ embedding_dim: int,
68
+ ) -> EmbeddingModel:
69
+ if not id_to_token:
70
+ raise ValueError("Cannot fit REFRAMR embeddings without a vocabulary.")
71
+ if embedding_dim <= 0:
72
+ raise ValueError("Embedding dimension must be positive.")
73
+
74
+ size = len(id_to_token)
75
+ token_to_id = {token: index for index, token in enumerate(id_to_token)}
76
+ if np is None:
77
+ embeddings = zeros(size, embedding_dim)
78
+ row_sums = [0.0 for _ in range(size)]
79
+ for row, columns in rows.items():
80
+ row_sums[row] = sum(columns.values())
81
+ total = sum(row_sums)
82
+ if total <= 0.0:
83
+ return EmbeddingModel(token_to_id=token_to_id, id_to_token=id_to_token, embeddings=embeddings, ppmi_matrix=[])
84
+ for row, columns in rows.items():
85
+ for col, count in columns.items():
86
+ denominator = row_sums[row] * row_sums[col]
87
+ if count <= 0.0 or denominator <= 0.0:
88
+ continue
89
+ value = math.log((count * total) / denominator)
90
+ if value <= 0.0:
91
+ continue
92
+ bucket = (col * 1103515245 + 12345) % embedding_dim
93
+ sign = 1.0 if ((col * 214013 + 2531011) & 1) == 0 else -1.0
94
+ embeddings[row][bucket] += value * sign
95
+ return EmbeddingModel(token_to_id=token_to_id, id_to_token=id_to_token, embeddings=embeddings, ppmi_matrix=[])
96
+
97
+ embeddings = np.zeros((size, embedding_dim), dtype=np.float64)
98
+ row_sums = np.zeros(size, dtype=np.float64)
99
+ for row, columns in rows.items():
100
+ row_sums[row] = sum(columns.values())
101
+ total = float(row_sums.sum())
102
+ if total <= 0.0:
103
+ return EmbeddingModel(token_to_id=token_to_id, id_to_token=id_to_token, embeddings=embeddings, ppmi_matrix=[])
104
+
105
+ for row, columns in rows.items():
106
+ if not columns or row_sums[row] <= 0.0:
107
+ continue
108
+ cols = np.fromiter(columns.keys(), dtype=np.int64)
109
+ counts = np.fromiter(columns.values(), dtype=np.float64)
110
+ denominators = row_sums[row] * row_sums[cols]
111
+ valid = (counts > 0.0) & (denominators > 0.0)
112
+ if not np.any(valid):
113
+ continue
114
+ cols = cols[valid]
115
+ values = np.log((counts[valid] * total) / denominators[valid])
116
+ positive = values > 0.0
117
+ if not np.any(positive):
118
+ continue
119
+ cols = cols[positive]
120
+ values = values[positive]
121
+ buckets = ((cols * 1103515245 + 12345) % embedding_dim).astype(np.int64, copy=False)
122
+ signs = np.where(((cols * 214013 + 2531011) & 1) == 0, 1.0, -1.0)
123
+ np.add.at(embeddings[row], buckets, values * signs)
124
+
125
+ embeddings = _remove_common_embedding_axis(embeddings, row_sums)
126
+ return EmbeddingModel(
127
+ token_to_id=token_to_id,
128
+ id_to_token=id_to_token,
129
+ embeddings=embeddings,
130
+ ppmi_matrix=[],
131
+ )
132
+
133
+
134
+ def _positive_ppmi_values(
135
+ *,
136
+ row: int,
137
+ columns: dict[int, float],
138
+ row_sums: object,
139
+ total: float,
140
+ ) -> tuple[object, object]:
141
+ cols = np.fromiter(columns.keys(), dtype=np.int64)
142
+ counts = np.fromiter(columns.values(), dtype=np.float64)
143
+ if cols.size == 0:
144
+ return cols, counts
145
+ denominators = float(row_sums[row]) * row_sums[cols]
146
+ valid = (counts > 0.0) & (denominators > 0.0)
147
+ if not np.any(valid):
148
+ return cols[:0], counts[:0]
149
+ cols = cols[valid]
150
+ values = np.log((counts[valid] * total) / denominators[valid])
151
+ positive = values > 0.0
152
+ return cols[positive], values[positive]
153
+
154
+
155
+ def fit_randomized_ppmi_embedding_from_counts(
156
+ id_to_token: list[str],
157
+ rows: dict[int, dict[int, float]],
158
+ *,
159
+ embedding_dim: int,
160
+ oversampling: int = 32,
161
+ ) -> EmbeddingModel:
162
+ if np is None:
163
+ return fit_sketched_ppmi_embedding_from_counts(
164
+ id_to_token,
165
+ rows,
166
+ embedding_dim=embedding_dim,
167
+ )
168
+ if not id_to_token:
169
+ raise ValueError("Cannot fit REFRAMR embeddings without a vocabulary.")
170
+ if embedding_dim <= 0:
171
+ raise ValueError("Embedding dimension must be positive.")
172
+
173
+ size = len(id_to_token)
174
+ token_to_id = {token: index for index, token in enumerate(id_to_token)}
175
+ row_sums = np.zeros(size, dtype=np.float64)
176
+ for row, columns in rows.items():
177
+ row_sums[row] = sum(columns.values())
178
+ total = float(row_sums.sum())
179
+ if total <= 0.0:
180
+ return EmbeddingModel(
181
+ token_to_id=token_to_id,
182
+ id_to_token=id_to_token,
183
+ embeddings=np.zeros((size, embedding_dim), dtype=np.float64),
184
+ ppmi_matrix=[],
185
+ )
186
+
187
+ width = min(size, max(embedding_dim, embedding_dim + oversampling))
188
+ rng = np.random.default_rng(1729 + size * 31 + embedding_dim)
189
+ omega = rng.standard_normal((size, width)).astype(np.float64, copy=False)
190
+ sketch = np.zeros((size, width), dtype=np.float64)
191
+ ppmi_cache: dict[int, tuple[object, object]] = {}
192
+ for row, columns in rows.items():
193
+ if not columns or row_sums[row] <= 0.0:
194
+ continue
195
+ cols, values = _positive_ppmi_values(
196
+ row=row,
197
+ columns=columns,
198
+ row_sums=row_sums,
199
+ total=total,
200
+ )
201
+ if values.size == 0:
202
+ continue
203
+ ppmi_cache[row] = (cols, values)
204
+ sketch[row] = values @ omega[cols]
205
+
206
+ if not ppmi_cache:
207
+ return EmbeddingModel(
208
+ token_to_id=token_to_id,
209
+ id_to_token=id_to_token,
210
+ embeddings=np.zeros((size, embedding_dim), dtype=np.float64),
211
+ ppmi_matrix=[],
212
+ )
213
+
214
+ basis, _ = np.linalg.qr(sketch, mode="reduced")
215
+ compressed = np.zeros((basis.shape[1], size), dtype=np.float64)
216
+ for row, (cols, values) in ppmi_cache.items():
217
+ compressed[:, cols] += basis[row, :, None] * values[None, :]
218
+
219
+ left_small, singular_values, _ = np.linalg.svd(compressed, full_matrices=False)
220
+ left = basis @ left_small
221
+ width = min(embedding_dim, left.shape[1], singular_values.shape[0])
222
+ embeddings = np.zeros((size, embedding_dim), dtype=np.float64)
223
+ if width > 0:
224
+ embeddings[:, :width] = left[:, :width] * np.sqrt(np.maximum(singular_values[:width], 0.0))[None, :]
225
+ embeddings = _remove_common_embedding_axis(embeddings, np.sqrt(row_sums))
226
+ return EmbeddingModel(
227
+ token_to_id=token_to_id,
228
+ id_to_token=id_to_token,
229
+ embeddings=embeddings,
230
+ ppmi_matrix=[],
231
+ )
232
+
233
+
234
+ def positive_pointwise_mutual_information(matrix: Matrix) -> Matrix:
235
+ if scipy_sparse is not None and scipy_sparse.issparse(matrix):
236
+ counts = matrix.tocoo()
237
+ if counts.nnz == 0:
238
+ return scipy_sparse.csr_matrix(counts.shape, dtype=np.float64)
239
+ row_sums = np.asarray(matrix.sum(axis=1)).ravel()
240
+ total = float(row_sums.sum())
241
+ if total == 0.0:
242
+ return scipy_sparse.csr_matrix(counts.shape, dtype=np.float64)
243
+ denominators = row_sums[counts.row] * row_sums[counts.col]
244
+ valid = (counts.data > 0.0) & (denominators > 0.0)
245
+ if not np.any(valid):
246
+ return scipy_sparse.csr_matrix(counts.shape, dtype=np.float64)
247
+ ratios = (counts.data[valid] * total) / denominators[valid]
248
+ data = np.maximum(np.log(ratios), 0.0)
249
+ keep = data > 0.0
250
+ if not np.any(keep):
251
+ return scipy_sparse.csr_matrix(counts.shape, dtype=np.float64)
252
+ return scipy_sparse.coo_matrix(
253
+ (
254
+ data[keep],
255
+ (counts.row[valid][keep], counts.col[valid][keep]),
256
+ ),
257
+ shape=counts.shape,
258
+ dtype=np.float64,
259
+ ).tocsr()
260
+
261
+ if not matrix:
262
+ return []
263
+ if np is not None:
264
+ counts = np.asarray(matrix, dtype=np.float64)
265
+ row_sums = counts.sum(axis=1)
266
+ total = float(row_sums.sum())
267
+ if total == 0.0:
268
+ return np.zeros_like(counts).tolist()
269
+ denominator = np.outer(row_sums, row_sums)
270
+ valid = (counts > 0.0) & (denominator > 0.0)
271
+ ppmi = np.zeros_like(counts)
272
+ with np.errstate(divide="ignore", invalid="ignore"):
273
+ ratios = np.divide(
274
+ counts * total,
275
+ denominator,
276
+ out=np.ones_like(counts),
277
+ where=valid,
278
+ )
279
+ ppmi[valid] = np.maximum(np.log(ratios[valid]), 0.0)
280
+ return ppmi.tolist()
281
+
282
+ row_sums = [sum(row) for row in matrix]
283
+ total = sum(row_sums)
284
+ if total == 0.0:
285
+ return zeros(len(matrix), len(matrix))
286
+
287
+ ppmi = zeros(len(matrix), len(matrix))
288
+ for row in range(len(matrix)):
289
+ for col in range(len(matrix[row])):
290
+ count = matrix[row][col]
291
+ if count <= 0.0 or row_sums[row] == 0.0 or row_sums[col] == 0.0:
292
+ continue
293
+ p_ij = count / total
294
+ p_i = row_sums[row] / total
295
+ p_j = row_sums[col] / total
296
+ value = math.log(p_ij / (p_i * p_j))
297
+ ppmi[row][col] = max(0.0, value)
298
+ return ppmi
299
+
300
+
301
+ @dataclass(slots=True)
302
+ class EmbeddingModel:
303
+ token_to_id: dict[str, int]
304
+ id_to_token: list[str]
305
+ embeddings: Matrix
306
+ ppmi_matrix: Matrix
307
+
308
+ def vector(self, token: str) -> Vector:
309
+ index = self.token_to_id.get(token)
310
+ if index is None and token.lower() != token:
311
+ index = self.token_to_id.get(token.lower())
312
+ if index is None:
313
+ return [0.0 for _ in range(self.dimension)]
314
+ row = self.embeddings[index]
315
+ return row.astype(float).tolist() if hasattr(row, "tolist") else row[:]
316
+
317
+ @property
318
+ def dimension(self) -> int:
319
+ if hasattr(self.embeddings, "shape"):
320
+ return int(self.embeddings.shape[1]) if len(self.embeddings.shape) > 1 else 0
321
+ return len(self.embeddings[0]) if self.embeddings else 0
322
+
323
+ @property
324
+ def projection_axis(self) -> Vector:
325
+ if hasattr(self.embeddings, "shape"):
326
+ if int(self.embeddings.shape[0]) == 0:
327
+ return []
328
+ return self.embeddings.mean(axis=0).astype(float).tolist()
329
+ if not self.embeddings:
330
+ return []
331
+ return [
332
+ mean([row[column] for row in self.embeddings])
333
+ for column in range(self.dimension)
334
+ ]
335
+
336
+
337
+ def fit_ppmi_embedding(
338
+ text: str,
339
+ *,
340
+ embedding_dim: int,
341
+ window_size: int,
342
+ min_frequency: int = 1,
343
+ max_vocab: int | None = None,
344
+ ) -> EmbeddingModel:
345
+ tokens = tokenize(text)
346
+ if not tokens:
347
+ raise ValueError("Cannot fit REFRAMR embeddings on empty text.")
348
+
349
+ return fit_ppmi_embedding_from_tokens(
350
+ tokens,
351
+ embedding_dim=embedding_dim,
352
+ window_size=window_size,
353
+ min_frequency=min_frequency,
354
+ max_vocab=max_vocab,
355
+ )
356
+
357
+
358
+ def fit_ppmi_embedding_from_tokens(
359
+ tokens: list[str],
360
+ *,
361
+ embedding_dim: int,
362
+ window_size: int,
363
+ min_frequency: int = 1,
364
+ max_vocab: int | None = None,
365
+ ) -> EmbeddingModel:
366
+ if not tokens:
367
+ raise ValueError("Cannot fit REFRAMR embeddings on an empty token stream.")
368
+
369
+ token_to_id, id_to_token = build_vocabulary(tokens, min_frequency, max_vocab)
370
+ cooccurrence = build_cooccurrence_matrix(tokens, token_to_id, window_size)
371
+ ppmi = positive_pointwise_mutual_information(cooccurrence)
372
+ eigenpairs = top_k_eigenpairs_symmetric(ppmi, embedding_dim)
373
+
374
+ embeddings = zeros(len(id_to_token), embedding_dim)
375
+ for component, (eigenvalue, eigenvector) in enumerate(eigenpairs):
376
+ scale = math.sqrt(max(eigenvalue, 0.0))
377
+ for row in range(len(id_to_token)):
378
+ embeddings[row][component] = eigenvector[row] * scale
379
+ if np is not None:
380
+ embeddings = _remove_common_embedding_axis(np.asarray(embeddings, dtype=np.float64))
381
+
382
+ return EmbeddingModel(
383
+ token_to_id=token_to_id,
384
+ id_to_token=id_to_token,
385
+ embeddings=embeddings,
386
+ ppmi_matrix=ppmi,
387
+ )
388
+
389
+
390
+ def fit_ppmi_embedding_from_cooccurrence(
391
+ id_to_token: list[str],
392
+ cooccurrence: Matrix,
393
+ *,
394
+ embedding_dim: int,
395
+ ) -> EmbeddingModel:
396
+ if not id_to_token:
397
+ raise ValueError("Cannot fit REFRAMR embeddings without a vocabulary.")
398
+
399
+ ppmi = positive_pointwise_mutual_information(cooccurrence)
400
+ if scipy_sparse is not None and scipy_sparse.issparse(ppmi):
401
+ embedding_width = min(embedding_dim, len(id_to_token))
402
+ if len(id_to_token) >= SKETCHED_EMBEDDING_VOCAB_THRESHOLD or embedding_width >= 128:
403
+ embeddings = _sketched_sparse_ppmi_embedding(ppmi, embedding_dim)
404
+ return EmbeddingModel(
405
+ token_to_id={token: index for index, token in enumerate(id_to_token)},
406
+ id_to_token=id_to_token,
407
+ embeddings=embeddings,
408
+ ppmi_matrix=[],
409
+ )
410
+ embeddings = zeros(len(id_to_token), embedding_dim)
411
+ if embedding_width <= 0 or ppmi.nnz == 0:
412
+ return EmbeddingModel(
413
+ token_to_id={token: index for index, token in enumerate(id_to_token)},
414
+ id_to_token=id_to_token,
415
+ embeddings=embeddings,
416
+ ppmi_matrix=[],
417
+ )
418
+ if embedding_width < min(ppmi.shape) and scipy_svds is not None:
419
+ left, values, _ = scipy_svds(ppmi.asfptype(), k=embedding_width, which="LM")
420
+ order = np.argsort(values)[::-1]
421
+ for component, source_index in enumerate(order):
422
+ scale = math.sqrt(max(float(values[source_index]), 0.0))
423
+ column = left[:, source_index]
424
+ for row, value in enumerate(column):
425
+ embeddings[row][component] = float(value) * scale
426
+ else:
427
+ dense = ppmi.toarray().tolist()
428
+ eigenpairs = top_k_eigenpairs_symmetric(dense, embedding_width)
429
+ for component, (eigenvalue, eigenvector) in enumerate(eigenpairs):
430
+ scale = math.sqrt(max(eigenvalue, 0.0))
431
+ for row in range(len(id_to_token)):
432
+ embeddings[row][component] = eigenvector[row] * scale
433
+ if np is not None:
434
+ embeddings = _remove_common_embedding_axis(np.asarray(embeddings, dtype=np.float64))
435
+ return EmbeddingModel(
436
+ token_to_id={token: index for index, token in enumerate(id_to_token)},
437
+ id_to_token=id_to_token,
438
+ embeddings=embeddings,
439
+ ppmi_matrix=[],
440
+ )
441
+
442
+ eigenpairs = top_k_eigenpairs_symmetric(ppmi, embedding_dim)
443
+
444
+ embeddings = zeros(len(id_to_token), embedding_dim)
445
+ for component, (eigenvalue, eigenvector) in enumerate(eigenpairs):
446
+ scale = math.sqrt(max(eigenvalue, 0.0))
447
+ for row in range(len(id_to_token)):
448
+ embeddings[row][component] = eigenvector[row] * scale
449
+ if np is not None:
450
+ embeddings = _remove_common_embedding_axis(np.asarray(embeddings, dtype=np.float64))
451
+
452
+ return EmbeddingModel(
453
+ token_to_id={token: index for index, token in enumerate(id_to_token)},
454
+ id_to_token=id_to_token,
455
+ embeddings=embeddings,
456
+ ppmi_matrix=ppmi,
457
+ )
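
Every fitting path ends in the same `EmbeddingModel`, so downstream code stays agnostic about whether eigendecomposition, truncated SVD, or the sketched projection produced the vectors. A minimal sketch on a toy text; the corpus string is an illustrative assumption:

```python
# Minimal sketch: fit PPMI embeddings on a toy text and read one vector back.
# The tiny corpus below is an illustrative assumption.
from reframr.embeddings import fit_ppmi_embedding

model = fit_ppmi_embedding(
    "harbor water mast cable harbor water quiet night",
    embedding_dim=4,
    window_size=2,
)
print(model.dimension)         # 4
print(model.vector("harbor"))  # one row of the factorized PPMI matrix
```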
reframr/evaluation.py ADDED
@@ -0,0 +1,265 @@
+ import json
+ from pathlib import Path
+
+ from .model import ReframrModel
+
+
+ def load_manifest(path: str | Path) -> dict[str, object]:
+     return json.loads(Path(path).read_text(encoding="utf-8"))
+
+
+ def _expected_next_token(model: ReframrModel, expected_text: str) -> str:
+     assert model.tokenizer is not None
+     encoded = model.tokenizer.encode(f" {expected_text}")
+     return encoded[0] if encoded else ""
+
+
+ def _normalize_text(text: str) -> str:
+     return " ".join(text.casefold().split())
+
+
+ def _word_ngrams(words: list[str], size: int) -> list[tuple[str, ...]]:
+     if size <= 0 or len(words) < size:
+         return []
+     return [tuple(words[index : index + size]) for index in range(len(words) - size + 1)]
+
+
+ def _distinct_ratio(words: list[str], size: int) -> float:
+     grams = _word_ngrams(words, size)
+     if not grams:
+         return 0.0
+     return len(set(grams)) / len(grams)
+
+
+ def _repetition_ratio(words: list[str], size: int) -> float:
+     grams = _word_ngrams(words, size)
+     if not grams:
+         return 0.0
+     repeated = len(grams) - len(set(grams))
+     return repeated / len(grams)
+
+
+ def _open_ended_score(
+     model: ReframrModel,
+     sample: dict[str, object],
+     *,
+     reasoning_mode: str | None,
+ ) -> dict[str, object]:
+     generated = model.generate_text(
+         str(sample["context"]),
+         max_tokens=int(sample.get("max_tokens", 56)),
+         reasoning_mode=reasoning_mode,
+     )
+     normalized = _normalize_text(generated)
+     required_groups = [
+         [str(term).casefold() for term in group]
+         for group in sample.get("required_groups", [])
+     ]
+     satisfied_groups = sum(
+         1
+         for group in required_groups
+         if any(term in normalized for term in group)
+     )
+     group_coverage = (
+         satisfied_groups / len(required_groups) if required_groups else 0.0
+     )
+     punctuation_hit = any(mark in generated for mark in ".,;:?!")
+     min_words = int(sample.get("min_words", 12))
+     min_word_hit = len(generated.split()) >= min_words
+     banned_phrases = [str(phrase) for phrase in sample.get("banned_phrases", [])]
+     exact_copy = any(normalized == _normalize_text(phrase) for phrase in banned_phrases)
+     novelty_hit = not exact_copy
+     require_punctuation = bool(sample.get("require_punctuation", True))
+
+     score_components = [
+         group_coverage,
+         1.0 if min_word_hit else 0.0,
+         1.0 if novelty_hit else 0.0,
+     ]
+     if require_punctuation:
+         score_components.append(1.0 if punctuation_hit else 0.0)
+
+     return {
+         "section": str(sample["section"]),
+         "context": str(sample["context"]),
+         "generated_text": generated,
+         "group_coverage": group_coverage,
+         "punctuation_hit": punctuation_hit,
+         "min_word_hit": min_word_hit,
+         "exact_copy": exact_copy,
+         "score": sum(score_components) / len(score_components) if score_components else 0.0,
+     }
+
+
+ def evaluate_manifest(
+     model: ReframrModel,
+     manifest: dict[str, object],
+     *,
+     reasoning_mode: str | None = None,
+     top_k: int = 5,
+ ) -> dict[str, object]:
+     results: dict[str, object] = {
+         "corpus_name": manifest["name"],
+         "reasoning_mode": reasoning_mode or model.config.default_reasoning_profile,
+         "splits": {},
+     }
+
+     splits = manifest["splits"]
+     for split_name in ("memorization", "generalization"):
+         samples = splits[split_name]
+         top1_hits = 0
+         topk_hits = 0
+         expected_probabilities = []
+
+         for sample in samples:
+             distribution = model.predict_next_token_distribution(
+                 sample["context"],
+                 reasoning_mode=reasoning_mode,
+             )
+             ranked = sorted(distribution.items(), key=lambda item: item[1], reverse=True)
+             predicted = ranked[0][0] if ranked else ""
+             top_tokens = [token for token, _ in ranked[:top_k]]
+             expected = _expected_next_token(model, sample["expected"])
+             expected_probability = distribution.get(expected, 0.0)
+
+             if predicted == expected:
+                 top1_hits += 1
+             if expected in top_tokens:
+                 topk_hits += 1
+             expected_probabilities.append(expected_probability)
+
+         sample_count = len(samples)
+         mean_expected_probability = (
+             sum(expected_probabilities) / sample_count if sample_count else 0.0
+         )
+         results["splits"][split_name] = {
+             "sample_count": sample_count,
+             "top1_accuracy": top1_hits / sample_count if sample_count else 0.0,
+             "topk_accuracy": topk_hits / sample_count if sample_count else 0.0,
+             "mean_expected_probability": mean_expected_probability,
+         }
+
+     open_ended_samples = splits.get("open_ended", [])
+     if open_ended_samples:
+         sample_results = [
+             _open_ended_score(
+                 model,
+                 sample,
+                 reasoning_mode=reasoning_mode,
+             )
+             for sample in open_ended_samples
+         ]
+         sample_count = len(sample_results)
+         results["open_ended"] = {
+             "sample_count": sample_count,
+             "mean_score": (
+                 sum(float(sample["score"]) for sample in sample_results) / sample_count
+                 if sample_count
+                 else 0.0
+             ),
+             "mean_group_coverage": (
+                 sum(float(sample["group_coverage"]) for sample in sample_results) / sample_count
+                 if sample_count
+                 else 0.0
+             ),
+             "punctuation_rate": (
+                 sum(1 for sample in sample_results if bool(sample["punctuation_hit"])) / sample_count
+                 if sample_count
+                 else 0.0
+             ),
+             "min_word_rate": (
+                 sum(1 for sample in sample_results if bool(sample["min_word_hit"])) / sample_count
+                 if sample_count
+                 else 0.0
+             ),
+             "exact_copy_rate": (
+                 sum(1 for sample in sample_results if bool(sample["exact_copy"])) / sample_count
+                 if sample_count
+                 else 0.0
+             ),
+             "samples": sample_results,
+         }
+
+     return results
+
+
+ def benchmark_open_prompts(
+     model: ReframrModel,
+     prompts: list[dict[str, object]],
+     *,
+     reasoning_mode: str | None = None,
+     max_tokens: int = 64,
+     temperature: float = 0.82,
+     top_k: int = 24,
+     top_p: float = 0.92,
+     repetition_penalty: float = 1.18,
+ ) -> dict[str, object]:
+     samples: list[dict[str, object]] = []
+     for item in prompts:
+         prompt = str(item["prompt"])
+         generated = model.generate_text(
+             prompt,
+             max_tokens=max_tokens,
+             reasoning_mode=reasoning_mode,
+             temperature=temperature,
+             top_k=top_k,
+             top_p=top_p,
+             repetition_penalty=repetition_penalty,
+         )
+         words = generated.split()
+         samples.append(
+             {
+                 "prompt": prompt,
+                 "tags": [str(tag) for tag in item.get("tags", [])],
+                 "generated_text": generated,
+                 "word_count": len(words),
+                 "char_count": len(generated),
+                 "punctuation_hit": any(mark in generated for mark in ".,;:?!"),
+                 "distinct_2": _distinct_ratio(words, 2),
+                 "distinct_3": _distinct_ratio(words, 3),
+                 "repetition_3": _repetition_ratio(words, 3),
+             }
+         )
+
+     sample_count = len(samples)
+     return {
+         "sample_count": sample_count,
+         "reasoning_mode": reasoning_mode or model.config.default_reasoning_profile,
+         "generation_policy": {
+             "temperature": temperature,
+             "top_k": top_k,
+             "top_p": top_p,
+             "repetition_penalty": repetition_penalty,
+         },
+         "mean_word_count": (
+             sum(int(sample["word_count"]) for sample in samples) / sample_count
+             if sample_count
+             else 0.0
+         ),
+         "mean_char_count": (
+             sum(int(sample["char_count"]) for sample in samples) / sample_count
+             if sample_count
+             else 0.0
+         ),
+         "punctuation_rate": (
+             sum(1 for sample in samples if bool(sample["punctuation_hit"])) / sample_count
+             if sample_count
+             else 0.0
+         ),
+         "mean_distinct_2": (
+             sum(float(sample["distinct_2"]) for sample in samples) / sample_count
+             if sample_count
+             else 0.0
+         ),
+         "mean_distinct_3": (
+             sum(float(sample["distinct_3"]) for sample in samples) / sample_count
+             if sample_count
+             else 0.0
+         ),
+         "mean_repetition_3": (
+             sum(float(sample["repetition_3"]) for sample in samples) / sample_count
+             if sample_count
+             else 0.0
+         ),
+         "samples": samples,
+     }
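
Both entry points return plain dictionaries, so results can be dumped straight to JSON. A minimal sketch of the open-prompt benchmark; `model` is assumed to be an already-constructed `ReframrModel` (this file does not show how checkpoints are loaded, so construction is left abstract), and the prompt is illustrative:

```python
# Minimal sketch: benchmark one open prompt and read the diversity metrics.
# `model` is an assumed, already-constructed ReframrModel instance.
from reframr.evaluation import benchmark_open_prompts

prompts = [{"prompt": "Describe a harbor at dusk in two sentences.", "tags": ["composition"]}]
report = benchmark_open_prompts(model, prompts, max_tokens=48)
print(report["mean_distinct_2"], report["mean_repetition_3"], report["punctuation_rate"])
```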
reframr/hf_import.py ADDED
@@ -0,0 +1,662 @@
+ import json
+ import re
+ import site
+ import sys
+ from itertools import chain
+ from pathlib import Path
+
+ from .text_quality import clean_answer_text, clean_context_text, clean_training_text
+
+ TEXT_FIELD_PREFERENCES = (
+     "text",
+     "content",
+     "body",
+     "article",
+     "document",
+     "passage",
+     "markdown",
+ )
+
+ DIALOGUE_FIELD_PREFERENCES = (
+     "messages",
+     "conversation",
+     "conversations",
+     "dialogue",
+     "dialog",
+     "turns",
+ )
+
+ PREFERENCE_FIELD_PAIRS = (
+     ("chosen", "rejected"),
+     ("response_j", "response_k"),
+     ("response_0", "response_1"),
+ )
+
+ INSTRUCTION_FIELD_PAIRS = (
+     ("instruction", "output"),
+     ("prompt", "completion"),
+     ("prompt", "response"),
+     ("question", "answer"),
+     ("question", "response"),
+     ("query", "response"),
+ )
+
+ TRANSCRIPT_ROLE_PATTERN = re.compile(r"(?:^|\n\s*\n)(Human|Assistant|System)\s*:\s*", re.IGNORECASE)
+ ROLE_ALIASES = {
+     "assistant": "assistant",
+     "bot": "assistant",
+     "gpt": "assistant",
+     "model": "assistant",
+     "assistant_response": "assistant",
+     "human": "user",
+     "user": "user",
+     "prompter": "user",
+     "customer": "user",
+     "system": "system",
+ }
+
+
+ def _word_count(text: str) -> int:
+     return len(text.split())
+
+
+ def _alpha_ratio(text: str) -> float:
+     if not text:
+         return 0.0
+     alpha_count = sum(character.isalpha() for character in text)
+     return alpha_count / len(text)
+
+
+ def _default_record_weight(record_type: str) -> int:
+     if record_type == "dialogue_turn":
+         return 2
+     if record_type == "instruction_answer":
+         return 2
+     if record_type == "preference_chosen":
+         return 3
+     if record_type == "preference_rejected":
+         return 0
+     return 1
+
+
+ def choose_text_field(columns: list[str]) -> str:
+     normalized = {column.casefold(): column for column in columns}
+     for preferred in TEXT_FIELD_PREFERENCES:
+         if preferred in normalized:
+             return normalized[preferred]
+     raise ValueError("Could not infer a text column. Pass --text-field explicitly.")
+
+
+ def choose_dialogue_field(columns: list[str]) -> str:
+     normalized = {column.casefold(): column for column in columns}
+     for preferred in DIALOGUE_FIELD_PREFERENCES:
+         if preferred in normalized:
+             return normalized[preferred]
+     raise ValueError("Could not infer a conversation column.")
+
+
+ def choose_preference_fields(columns: list[str]) -> tuple[str, str]:
+     normalized = {column.casefold(): column for column in columns}
+     for chosen_name, rejected_name in PREFERENCE_FIELD_PAIRS:
+         if chosen_name in normalized and rejected_name in normalized:
+             return normalized[chosen_name], normalized[rejected_name]
+     raise ValueError("Could not infer chosen/rejected preference columns.")
+
+
+ def choose_instruction_fields(columns: list[str]) -> tuple[str, str]:
+     normalized = {column.casefold(): column for column in columns}
+     for prompt_name, answer_name in INSTRUCTION_FIELD_PAIRS:
+         if prompt_name in normalized and answer_name in normalized:
+             return normalized[prompt_name], normalized[answer_name]
+     raise ValueError("Could not infer instruction/answer columns.")
+
+
+ def _row_identifier(row: dict[str, object]) -> str:
+     for candidate in ("id", "_id", "row_id", "uuid", "prompt_id"):
+         if candidate in row and str(row[candidate]).strip():
+             return str(row[candidate]).strip()
+     return ""
+
+
+ def _base_record(
+     *,
+     dataset: str,
+     config: str | None,
+     split: str,
+     row_id: str,
+ ) -> dict[str, str]:
+     return {
+         "source": "huggingface",
+         "dataset": dataset,
+         "config": config or "",
+         "split": split,
+         "row_id": row_id,
+     }
+
+
+ def _row_language(row: dict[str, object]) -> str:
+     for candidate in ("lang", "language", "locale"):
+         value = row.get(candidate)
+         if isinstance(value, str) and value.strip():
+             return value.strip()
+     return ""
+
+
+ def _normalize_role(raw_role: object) -> str:
+     role = str(raw_role or "").strip().casefold()
+     return ROLE_ALIASES.get(role, role)
+
+
+ def _message_content(message: dict[str, object]) -> str:
+     for field in ("content", "value", "text", "message"):
+         value = message.get(field)
+         if isinstance(value, str) and value.strip():
+             return clean_training_text(value)
+     return ""
+
+
+ def _message_role(message: dict[str, object]) -> str:
+     for field in ("role", "from", "speaker", "author"):
+         value = message.get(field)
+         if value is not None:
+             normalized = _normalize_role(value)
+             if normalized:
+                 return normalized
+     return ""
+
+
+ def _parse_dialogue_messages(raw_messages: object) -> list[dict[str, str]]:
+     if not isinstance(raw_messages, list):
+         return []
+
+     parsed: list[dict[str, str]] = []
+     for message in raw_messages:
+         if not isinstance(message, dict):
+             continue
+         role = _message_role(message)
+         content = _message_content(message)
+         if role not in {"system", "user", "assistant"} or not content:
+             continue
+         parsed.append({"role": role, "content": content})
+     return parsed
+
+
+ def _parse_transcript_messages(raw_text: object) -> list[dict[str, str]]:
+     if not isinstance(raw_text, str):
+         return []
+
+     text = raw_text.strip()
+     if not text:
+         return []
+
+     matches = list(TRANSCRIPT_ROLE_PATTERN.finditer(text))
+     if not matches:
+         return []
+
+     parsed: list[dict[str, str]] = []
+     for index, match in enumerate(matches):
+         role = _normalize_role(match.group(1))
+         start = match.end()
+         end = matches[index + 1].start() if index + 1 < len(matches) else len(text)
+         content = clean_training_text(text[start:end].strip())
+         if role in {"system", "user", "assistant"} and content:
+             parsed.append({"role": role, "content": content})
+     return parsed
+
+
+ def _render_prompt(messages: list[dict[str, str]]) -> str:
+     lines = []
+     for message in messages:
+         content = clean_context_text(message["content"])
+         if content:
+             lines.append(content)
+     return "\n".join(lines).strip()
+
+
+ def _compose_training_text(context: str, answer: str) -> str:
+     context = clean_context_text(context)
+     answer = clean_answer_text(answer)
+     return f"<reason> {context} <answer> {answer}".strip()
+
+
+ def _compose_instruction_context(row: dict[str, object], prompt_field: str) -> str:
+     parts: list[str] = []
+     prompt = clean_context_text(str(row.get(prompt_field, "")).strip())
+     extra_input = clean_context_text(str(row.get("input", "")).strip())
+     if prompt:
+         parts.append(prompt)
+     if extra_input:
+         parts.append(extra_input)
+     return "\n".join(parts).strip()
+
+
+ def _extract_prompt_answer(
+     row: dict[str, object],
+     *,
+     field_name: str,
+ ) -> tuple[str, str]:
+     dialogue_messages = _parse_dialogue_messages(row.get(field_name))
+     if dialogue_messages and dialogue_messages[-1]["role"] == "assistant":
+         prompt = _render_prompt(dialogue_messages[:-1])
+         answer = dialogue_messages[-1]["content"]
+         if prompt and answer:
+             return prompt, answer
+
+     messages = _parse_transcript_messages(row.get(field_name))
+     if messages:
+         if messages[-1]["role"] == "assistant":
+             prompt = _render_prompt(messages[:-1])
+             answer = messages[-1]["content"]
+             if prompt and answer:
+                 return prompt, answer
+
+     prompt = clean_training_text(str(row.get("prompt", row.get("question", ""))).strip())
+     answer = clean_answer_text(str(row.get(field_name, "")).strip())
+     return prompt, answer
+
+
+ def _ordered_preference_fields(
+     row: dict[str, object],
+     *,
+     left_field: str,
+     right_field: str,
+ ) -> tuple[str, str]:
+     if {left_field, right_field} != {"response_0", "response_1"}:
+         return left_field, right_field
+
+     for selector in ("safer_response_id", "better_response_id"):
+         value = row.get(selector)
+         try:
+             preferred = int(value)
+         except (TypeError, ValueError):
+             continue
+         if preferred == 0:
+             return "response_0", "response_1"
+         if preferred == 1:
+             return "response_1", "response_0"
+     return left_field, right_field
+
+
+ def _passes_quality_gate(
+     record: dict[str, str],
+     *,
+     min_words: int,
+     max_words: int,
+     min_alpha_ratio: float,
+     allowed_languages: set[str],
+ ) -> bool:
+     candidate = str(record.get("answer") or record.get("text") or "").strip()
+     if not candidate:
+         return False
+
+     word_count = _word_count(candidate)
+     if min_words > 0 and word_count < min_words:
+         return False
+     if max_words > 0 and word_count > max_words:
+         return False
+
+     alpha_ratio = _alpha_ratio(candidate)
+     if min_alpha_ratio > 0.0 and alpha_ratio < min_alpha_ratio:
+         return False
+
+     if allowed_languages:
+         language = str(record.get("language", "")).strip().casefold()
+         if not language or language not in allowed_languages:
+             return False
+
+     record["quality_word_count"] = str(word_count)
+     record["quality_alpha_ratio"] = f"{alpha_ratio:.4f}"
+     return True
+
+
+ def to_json_record(
+     *,
+     dataset: str,
+     config: str | None,
+     split: str,
+     text_field: str,
+     row: dict[str, object],
+ ) -> dict[str, str]:
+     text = clean_training_text(str(row.get(text_field, "")).strip())
+     if not text:
+         raise ValueError("Row is missing usable text.")
+
+     record_type = "text"
+     return {
+         **_base_record(
+             dataset=dataset,
+             config=config,
+             split=split,
+             row_id=_row_identifier(row),
+         ),
+         "record_type": record_type,
+         "language": _row_language(row),
+         "text_field": text_field,
+         "text": text,
+         "word_count": _word_count(text),
+         "weight": _default_record_weight(record_type),
+     }
+
+
+ def dialogue_to_json_records(
+     *,
+     dataset: str,
+     config: str | None,
+     split: str,
+     conversation_field: str,
+     row: dict[str, object],
+ ) -> list[dict[str, str]]:
+     messages = _parse_dialogue_messages(row.get(conversation_field))
+     if not messages:
+         raise ValueError("Row does not contain usable dialogue turns.")
+
+     row_id = _row_identifier(row)
+     records: list[dict[str, str]] = []
+     history: list[dict[str, str]] = []
+     row_language = _row_language(row)
+     system_text = clean_training_text(str(row.get("system", "")).strip())
+     if system_text:
+         history.append({"role": "system", "content": system_text})
+     assistant_turn_index = 0
+     for message in messages:
+         if message["role"] != "assistant":
+             history.append(message)
+             continue
+         prompt = _render_prompt(history)
+         if not prompt:
+             continue
+         assistant_turn_index += 1
+         records.append(
+             {
+                 **_base_record(
+                     dataset=dataset,
+                     config=config,
+                     split=split,
+                     row_id=row_id,
+                 ),
+                 "record_type": "dialogue_turn",
+                 "language": row_language,
+                 "conversation_field": conversation_field,
+                 "turn_index": str(assistant_turn_index),
+                 "context": prompt,
+                 "answer": clean_answer_text(message["content"]),
+                 "text": _compose_training_text(prompt, message["content"]),
+                 "word_count": _word_count(clean_answer_text(message["content"])),
+                 "weight": _default_record_weight("dialogue_turn"),
+             }
+         )
+         history.append(message)
+
+     if not records:
+         raise ValueError("Dialogue row did not yield any assistant training turns.")
+     return records
+
+
+ def preference_to_json_records(
+     *,
+     dataset: str,
+     config: str | None,
+     split: str,
+     chosen_field: str,
+     rejected_field: str,
+     row: dict[str, object],
+     preference_target: str = "both",
+ ) -> list[dict[str, str]]:
+     row_id = _row_identifier(row)
+     pair_id = row_id or f"{chosen_field}:{rejected_field}"
+     records: list[dict[str, str]] = []
+     row_language = _row_language(row)
+     chosen_field, rejected_field = _ordered_preference_fields(
+         row,
+         left_field=chosen_field,
+         right_field=rejected_field,
+     )
+
+     field_specs = [
+         (chosen_field, "preference_chosen"),
+         (rejected_field, "preference_rejected"),
+     ]
+     if preference_target == "chosen":
+         field_specs = [(chosen_field, "preference_chosen")]
+     elif preference_target == "rejected":
+         field_specs = [(rejected_field, "preference_rejected")]
+     elif preference_target != "both":
+         raise ValueError("preference_target must be one of: both, chosen, rejected.")
+
+     for field_name, record_type in field_specs:
+         prompt, answer = _extract_prompt_answer(row, field_name=field_name)
+         if not prompt or not answer:
+             continue
+         records.append(
+             {
+                 **_base_record(
+                     dataset=dataset,
+                     config=config,
+                     split=split,
+                     row_id=row_id,
+                 ),
+                 "record_type": record_type,
+                 "language": row_language,
+                 "pair_id": pair_id,
+                 "text_field": field_name,
+                 "context": prompt,
+                 "answer": clean_answer_text(answer),
+                 "text": _compose_training_text(prompt, answer),
+                 "word_count": _word_count(clean_answer_text(answer)),
+                 "weight": _default_record_weight(record_type),
+             }
+         )
+
+     if not records:
+         raise ValueError("Preference row did not yield usable chosen/rejected transcripts.")
+     return records
+
+
+ def instruction_to_json_records(
+     *,
+     dataset: str,
+     config: str | None,
+     split: str,
+     prompt_field: str,
+     answer_field: str,
+     row: dict[str, object],
+ ) -> list[dict[str, str]]:
+     context = _compose_instruction_context(row, prompt_field)
+     answer = clean_answer_text(str(row.get(answer_field, "")).strip())
+     if not context or not answer:
+         raise ValueError("Instruction row did not contain usable prompt and answer text.")
+     record_type = "instruction_answer"
+     return [
+         {
+             **_base_record(
+                 dataset=dataset,
+                 config=config,
+                 split=split,
+                 row_id=_row_identifier(row),
+             ),
+             "record_type": record_type,
+             "language": _row_language(row),
+             "context": context,
+             "answer": answer,
+             "text": _compose_training_text(context, answer),
+             "word_count": _word_count(answer),
+             "weight": _default_record_weight(record_type),
+         }
+     ]
+
+
+ def _expand_row_records(
+     *,
+     dataset: str,
+     config: str | None,
+     split: str,
+     row: dict[str, object],
+     text_field: str | None,
+     preference_target: str,
+ ) -> list[dict[str, str]]:
+     if text_field is not None:
+         explicit_value = row.get(text_field)
+         if isinstance(explicit_value, list):
+             return dialogue_to_json_records(
+                 dataset=dataset,
+                 config=config,
+                 split=split,
+                 conversation_field=text_field,
+                 row=row,
+             )
+         return [
+             to_json_record(
+                 dataset=dataset,
+                 config=config,
+                 split=split,
+                 text_field=text_field,
+                 row=row,
+             )
+         ]
+
+     columns = list(row)
+     try:
+         chosen_field, rejected_field = choose_preference_fields(columns)
+         return preference_to_json_records(
+             dataset=dataset,
+             config=config,
+             split=split,
+             chosen_field=chosen_field,
+             rejected_field=rejected_field,
+             row=row,
+             preference_target=preference_target,
+         )
+     except ValueError:
+         pass
+
+     try:
+         prompt_field, answer_field = choose_instruction_fields(columns)
+         return instruction_to_json_records(
+             dataset=dataset,
+             config=config,
+             split=split,
+             prompt_field=prompt_field,
+             answer_field=answer_field,
+             row=row,
+         )
+     except ValueError:
+         pass
+
+     try:
+         conversation_field = choose_dialogue_field(columns)
+         if isinstance(row.get(conversation_field), list):
+             return dialogue_to_json_records(
+                 dataset=dataset,
+                 config=config,
+                 split=split,
+                 conversation_field=conversation_field,
+                 row=row,
+             )
+     except ValueError:
+         pass
+
+     inferred_text_field = choose_text_field(columns)
+     return [
+         to_json_record(
+             dataset=dataset,
+             config=config,
+             split=split,
+             text_field=inferred_text_field,
+             row=row,
+         )
+     ]
+
+
+ def import_hf_dataset(
+     *,
+     dataset: str,
+     output_path: str | Path,
+     config: str | None = None,
+     split: str = "train",
+     text_field: str | None = None,
+     limit: int = 1000,
+     streaming: bool = True,
+     preference_target: str = "chosen",
+     min_words: int = 0,
+     max_words: int = 0,
+     min_alpha_ratio: float = 0.0,
+     allowed_languages: tuple[str, ...] = (),
+ ) -> dict[str, object]:
+     try:
+         from datasets import load_dataset
+     except ModuleNotFoundError:
+         user_site = site.getusersitepackages()
+         if user_site and user_site not in sys.path:
+             sys.path.append(user_site)
+         from datasets import load_dataset
+
+     dataset_kwargs: dict[str, object] = {
+         "split": split,
+         "streaming": streaming,
+     }
+     if config:
+         dataset_kwargs["name"] = config
+
+     hf_dataset = load_dataset(dataset, **dataset_kwargs)
+     iterator = iter(hf_dataset)
+
+     first_row: dict[str, object] | None = None
+     if text_field is None:
+         first_row = dict(next(iterator))
+         iterator = chain([first_row], iterator)
+
+     output = Path(output_path)
+     output.parent.mkdir(parents=True, exist_ok=True)
+
+     written = 0
+     record_types: set[str] = set()
+     normalized_languages = {language.casefold() for language in allowed_languages if language.strip()}
+     with output.open("w", encoding="utf-8") as handle:
+         for row in iterator:
+             if written >= limit:
+                 break
+             normalized_row = dict(row)
+             try:
+                 records = _expand_row_records(
+                     dataset=dataset,
+                     config=config,
+                     split=split,
+                     row=normalized_row,
+                     text_field=text_field,
+                     preference_target=preference_target,
+                 )
+             except ValueError:
+                 continue
+
+             for record in records:
+                 if written >= limit:
+                     break
+                 if not _passes_quality_gate(
+                     record,
+                     min_words=min_words,
+                     max_words=max_words,
+                     min_alpha_ratio=min_alpha_ratio,
+                     allowed_languages=normalized_languages,
+                 ):
+                     continue
+                 record_types.add(record.get("record_type", "text"))
+                 handle.write(json.dumps(record, ensure_ascii=False) + "\n")
+                 written += 1
+
+     inferred_mode = "mixed" if len(record_types) > 1 else (next(iter(record_types)) if record_types else "unknown")
+     return {
+         "dataset": dataset,
+         "config": config or "",
+         "split": split,
+         "text_field": text_field or "",
+         "output_path": str(output.resolve()),
+         "records_written": written,
+         "record_types": sorted(record_types),
+         "mode": inferred_mode,
+         "preference_target": preference_target,
+         "streaming": streaming,
+         "min_words": min_words,
+         "max_words": max_words,
+         "min_alpha_ratio": min_alpha_ratio,
+         "allowed_languages": sorted(normalized_languages),
+     }
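
The importer streams rows, routes each through the preference, instruction, dialogue, and text detectors in that order, and gates results on word count, alpha ratio, and language before writing JSONL. A minimal sketch, assuming the `datasets` package is installed; the dataset name and output path are illustrative:

```python
# Minimal sketch: stream a capped, quality-gated slice of an HF dataset to JSONL.
# Dataset name and output path are illustrative assumptions.
from reframr.hf_import import import_hf_dataset

summary = import_hf_dataset(
    dataset="HuggingFaceH4/no_robots",
    output_path="data/no_robots.reframr.jsonl",
    split="train",
    limit=500,
    min_words=8,
    min_alpha_ratio=0.6,
)
print(summary["records_written"], summary["record_types"], summary["mode"])
```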
reframr/hippo.py ADDED
@@ -0,0 +1,145 @@
1
+ import math
2
+ from dataclasses import dataclass
3
+ import site
4
+ import sys
5
+ from pathlib import Path
6
+
7
+ from .linalg import Matrix, Vector, identity, invert_matrix, matvec
8
+
9
+ _VENDOR_ROOT = Path(__file__).resolve().parent.parent / ".vendor"
10
+ for _vendor_path in (_VENDOR_ROOT / "python", _VENDOR_ROOT / "sitepkgs"):
11
+ if _vendor_path.exists():
12
+ vendor_text = str(_vendor_path)
13
+ if vendor_text not in sys.path:
14
+ sys.path.insert(0, vendor_text)
15
+
16
+ try:
17
+ import numpy as np
18
+ except ModuleNotFoundError:
19
+ user_site = site.getusersitepackages()
20
+ if user_site and user_site not in sys.path:
21
+ sys.path.append(user_site)
22
+ try:
23
+ import numpy as np
24
+ except ModuleNotFoundError:
25
+ np = None
26
+
27
+
28
+ def hippo_legs_matrix(order: int) -> tuple[Matrix, Vector]:
29
+ a_matrix = [[0.0 for _ in range(order)] for _ in range(order)]
30
+ b_vector = [0.0 for _ in range(order)]
31
+
32
+ for row in range(order):
33
+ for col in range(order):
34
+ if row > col:
35
+ a_matrix[row][col] = -math.sqrt(2 * row + 1) * math.sqrt(2 * col + 1)
36
+ elif row == col:
37
+ a_matrix[row][col] = -(row + 1)
38
+ b_vector[row] = math.sqrt(2 * row + 1)
39
+
40
+ return a_matrix, b_vector
41
+
42
+
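
A tiny order-2 check of the construction above; the expected entries follow directly from the three branch conditions (a sketch, not part of the runtime):

```python
import math

a, b = hippo_legs_matrix(2)
# Row 0: diagonal -(0 + 1); row 1: lower triangle -sqrt(3) * sqrt(1), diagonal -(1 + 1).
assert a == [[-1.0, 0.0], [-math.sqrt(3.0), -2.0]]
assert b == [1.0, math.sqrt(3.0)]
```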
43
+ def analytical_embedding_drive(embedding: Vector, state_dim: int) -> Vector:
44
+ if not embedding:
45
+ return [0.0 for _ in range(state_dim)]
46
+ width = len(embedding)
47
+ return [
48
+ (
49
+ embedding[index % width]
50
+ + 0.5 * embedding[(3 * index + 1) % width]
51
+ - 0.25 * embedding[(5 * index + 2) % width]
52
+ )
53
+ for index in range(state_dim)
54
+ ]
55
+
56
+
57
+ def analytical_embedding_drive_fast(embedding: object, state_dim: int) -> object:
58
+ if np is None:
59
+ embedding_vector = embedding.tolist() if hasattr(embedding, "tolist") else list(embedding)
60
+ return analytical_embedding_drive(embedding_vector, state_dim)
61
+ embedding_array = embedding if hasattr(embedding, "shape") else np.asarray(embedding, dtype=np.float64)
62
+ if embedding_array.size == 0:
63
+ return np.zeros(state_dim, dtype=np.float64)
64
+ indices = np.arange(state_dim, dtype=np.int64)
65
+ width = int(embedding_array.shape[0])
66
+ return (
67
+ embedding_array[indices % width]
68
+ + 0.5 * embedding_array[(3 * indices + 1) % width]
69
+ - 0.25 * embedding_array[(5 * indices + 2) % width]
70
+ )
71
+
72
+
73
+ @dataclass(slots=True)
74
+ class AnalyticalMemoryUnit:
75
+ state_dim: int
76
+ timescale: float
77
+
78
+ def __post_init__(self) -> None:
79
+ a_matrix, b_vector = hippo_legs_matrix(self.state_dim)
80
+ self.transition, self.input_projection = self._discretize_transition(
81
+ a_matrix,
82
+ b_vector,
83
+ self.timescale,
84
+ )
85
+
86
+ transition: Matrix = None # type: ignore[assignment]
87
+ input_projection: Vector = None # type: ignore[assignment]
88
+ transition_array: object | None = None # type: ignore[assignment]
89
+ input_projection_array: object | None = None # type: ignore[assignment]
90
+
91
+ @staticmethod
92
+ def _discretize_transition(
93
+ a_matrix: Matrix,
94
+ b_vector: Vector,
95
+ step: float,
96
+ ) -> tuple[Matrix, Vector]:
97
+ implicit_system = [
98
+ [
99
+ identity_value - step * a_value
100
+ for identity_value, a_value in zip(identity_row, a_row)
101
+ ]
102
+ for identity_row, a_row in zip(identity(len(a_matrix)), a_matrix)
103
+ ]
104
+ transition = invert_matrix(implicit_system)
105
+ input_projection = matvec(transition, [step * value for value in b_vector])
106
+ return transition, input_projection
107
+
108
+ def step(self, state: Vector, scalar_input: float) -> Vector:
109
+ if np is not None and self.transition_array is None:
110
+ self.transition_array = np.asarray(self.transition, dtype=np.float64)
111
+ self.input_projection_array = np.asarray(self.input_projection, dtype=np.float64)
112
+ propagated = matvec(self.transition, state)
113
+ return [
114
+ propagated[index] + self.input_projection[index] * scalar_input
115
+ for index in range(self.state_dim)
116
+ ]
117
+
118
+ def step_vector(self, state: Vector, drive: Vector) -> Vector:
119
+ propagated = matvec(self.transition, state)
120
+ return [
121
+ propagated[index] + self.input_projection[index] * drive[index]
122
+ for index in range(self.state_dim)
123
+ ]
124
+
125
+ def step_fast(self, state: object, scalar_input: float) -> object:
126
+ if np is None:
127
+ state_vector = state.tolist() if hasattr(state, "tolist") else list(state)
128
+ return self.step(state_vector, scalar_input)
129
+ if self.transition_array is None or self.input_projection_array is None:
130
+ self.transition_array = np.asarray(self.transition, dtype=np.float64)
131
+ self.input_projection_array = np.asarray(self.input_projection, dtype=np.float64)
132
+ state_array = state if hasattr(state, "shape") else np.asarray(state, dtype=np.float64)
133
+ return (self.transition_array @ state_array) + (self.input_projection_array * scalar_input)
134
+
135
+ def step_vector_fast(self, state: object, drive: object) -> object:
136
+ if np is None:
137
+ state_vector = state.tolist() if hasattr(state, "tolist") else list(state)
138
+ drive_vector = drive.tolist() if hasattr(drive, "tolist") else list(drive)
139
+ return self.step_vector(state_vector, drive_vector)
140
+ if self.transition_array is None or self.input_projection_array is None:
141
+ self.transition_array = np.asarray(self.transition, dtype=np.float64)
142
+ self.input_projection_array = np.asarray(self.input_projection, dtype=np.float64)
143
+ state_array = state if hasattr(state, "shape") else np.asarray(state, dtype=np.float64)
144
+ drive_array = drive if hasattr(drive, "shape") else np.asarray(drive, dtype=np.float64)
145
+ return (self.transition_array @ state_array) + (self.input_projection_array * drive_array)
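
The `_discretize_transition` step above is a backward-Euler discretization: it solves (I - dt*A) x_next = x + dt*B*u, so `transition` is (I - dt*A)^-1 and `input_projection` is `transition @ (dt*B)`. A minimal usage sketch, assuming the `reframr/` package layout in this commit is importable:

```python
from reframr.hippo import AnalyticalMemoryUnit, analytical_embedding_drive

unit = AnalyticalMemoryUnit(state_dim=8, timescale=0.05)
state = [0.0] * 8
for scalar in (0.3, -0.1, 0.7):         # a short scalar input stream
    state = unit.step(state, scalar)    # x <- transition @ x + input_projection * u
drive = analytical_embedding_drive([0.2, -0.4, 0.9], state_dim=8)
state = unit.step_vector(state, drive)  # per-dimension drive variant
```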
reframr/linalg.py ADDED
@@ -0,0 +1,271 @@
1
+ import math
2
+ import site
3
+ import sys
4
+ from pathlib import Path
5
+
6
+ _VENDOR_ROOT = Path(__file__).resolve().parent.parent / ".vendor"
7
+ for _vendor_path in (_VENDOR_ROOT / "python", _VENDOR_ROOT / "sitepkgs"):
8
+ if _vendor_path.exists():
9
+ vendor_text = str(_vendor_path)
10
+ if vendor_text not in sys.path:
11
+ sys.path.insert(0, vendor_text)
12
+
13
+ try:
14
+ import numpy as np
15
+ except ModuleNotFoundError:
16
+ user_site = site.getusersitepackages()
17
+ if user_site and user_site not in sys.path:
18
+ sys.path.append(user_site)
19
+ try:
20
+ import numpy as np
21
+ except ModuleNotFoundError:
22
+ np = None
23
+
24
+ if np is not None and not hasattr(np, "asarray"):
25
+ np = None
26
+
27
+ Matrix = list[list[float]]
28
+ Vector = list[float]
29
+ SUMPROD = getattr(math, "sumprod", None)
30
+
31
+
32
+ def zeros(rows: int, cols: int) -> Matrix:
33
+ return [[0.0 for _ in range(cols)] for _ in range(rows)]
34
+
35
+
36
+ def zeros_vector(size: int) -> Vector:
37
+ return [0.0 for _ in range(size)]
38
+
39
+
40
+ def identity(size: int) -> Matrix:
41
+ matrix = zeros(size, size)
42
+ for index in range(size):
43
+ matrix[index][index] = 1.0
44
+ return matrix
45
+
46
+
47
+ def copy_matrix(matrix: Matrix) -> Matrix:
48
+ return [row[:] for row in matrix]
49
+
50
+
51
+ def transpose(matrix: Matrix) -> Matrix:
52
+ if not matrix:
53
+ return []
54
+ if np is not None:
55
+ return np.asarray(matrix, dtype=np.float64).T.tolist()
56
+ return [list(column) for column in zip(*matrix)]
57
+
58
+
59
+ def matvec(matrix: Matrix, vector: Vector) -> Vector:
60
+ if np is not None:
61
+ return (np.asarray(matrix, dtype=np.float64) @ np.asarray(vector, dtype=np.float64)).tolist()
62
+ if SUMPROD is not None:
63
+ return [SUMPROD(row, vector) for row in matrix]
64
+ return [sum(value * vector[idx] for idx, value in enumerate(row)) for row in matrix]
65
+
66
+
67
+ def matmul(left: Matrix, right: Matrix) -> Matrix:
68
+ if not left or not right:
69
+ return []
70
+ if np is not None:
71
+ return (np.asarray(left, dtype=np.float64) @ np.asarray(right, dtype=np.float64)).tolist()
72
+ right_t = transpose(right)
73
+ if SUMPROD is not None:
74
+ return [[SUMPROD(row, column) for column in right_t] for row in left]
75
+ return [
76
+ [sum(a * b for a, b in zip(row, column)) for column in right_t]
77
+ for row in left
78
+ ]
79
+
80
+
81
+ def add_matrices(left: Matrix, right: Matrix) -> Matrix:
82
+ return [
83
+ [left[row][col] + right[row][col] for col in range(len(left[row]))]
84
+ for row in range(len(left))
85
+ ]
86
+
87
+
88
+ def subtract_matrices(left: Matrix, right: Matrix) -> Matrix:
89
+ return [
90
+ [left[row][col] - right[row][col] for col in range(len(left[row]))]
91
+ for row in range(len(left))
92
+ ]
93
+
94
+
95
+ def scale_matrix(matrix: Matrix, scalar: float) -> Matrix:
96
+ return [[scalar * value for value in row] for row in matrix]
97
+
98
+
99
+ def dot(left: Vector, right: Vector) -> float:
100
+ if np is not None:
101
+ return float(np.dot(np.asarray(left, dtype=np.float64), np.asarray(right, dtype=np.float64)))
102
+ if SUMPROD is not None:
103
+ return SUMPROD(left, right)
104
+ return sum(a * b for a, b in zip(left, right))
105
+
106
+
107
+ def norm(vector: Vector) -> float:
108
+ return math.sqrt(dot(vector, vector))
109
+
110
+
111
+ def outer(left: Vector, right: Vector) -> Matrix:
112
+ if np is not None:
113
+ return np.outer(np.asarray(left, dtype=np.float64), np.asarray(right, dtype=np.float64)).tolist()
114
+ return [[a * b for b in right] for a in left]
115
+
116
+
117
+ def mean(values: Vector) -> float:
118
+ return sum(values) / len(values) if values else 0.0
119
+
120
+
121
+ def trace(matrix: Matrix) -> float:
122
+ return sum(matrix[index][index] for index in range(min(len(matrix), len(matrix[0]))))
123
+
124
+
125
+ def covariance_matrix(samples: list[Vector]) -> Matrix:
126
+ if not samples:
127
+ return []
128
+ if np is not None:
129
+ sample_array = np.asarray(samples, dtype=np.float64)
130
+ centered = sample_array - sample_array.mean(axis=0, keepdims=True)
131
+ denominator = max(len(samples) - 1, 1)
132
+ return ((centered.T @ centered) / denominator).tolist()
133
+
134
+ feature_count = len(samples[0])
135
+ sample_count = len(samples)
136
+ means = [
137
+ sum(sample[feature] for sample in samples) / sample_count
138
+ for feature in range(feature_count)
139
+ ]
140
+ covariance = zeros(feature_count, feature_count)
141
+ for sample in samples:
142
+ centered = [sample[index] - means[index] for index in range(feature_count)]
143
+ for row in range(feature_count):
144
+ for col in range(feature_count):
145
+ covariance[row][col] += centered[row] * centered[col]
146
+
147
+ denominator = max(sample_count - 1, 1)
148
+ return scale_matrix(covariance, 1.0 / denominator)
149
+
150
+
151
+ def solve_linear_system(matrix: Matrix, vector: Vector) -> Vector:
152
+ if np is not None:
153
+ return np.linalg.solve(
154
+ np.asarray(matrix, dtype=np.float64),
155
+ np.asarray(vector, dtype=np.float64),
156
+ ).tolist()
157
+ size = len(matrix)
158
+ augmented = [matrix[row][:] + [vector[row]] for row in range(size)]
159
+
160
+ for pivot_index in range(size):
161
+ pivot_row = max(
162
+ range(pivot_index, size),
163
+ key=lambda row_index: abs(augmented[row_index][pivot_index]),
164
+ )
165
+ augmented[pivot_index], augmented[pivot_row] = augmented[pivot_row], augmented[pivot_index]
166
+
167
+ pivot_value = augmented[pivot_index][pivot_index]
168
+ if abs(pivot_value) < 1e-12:
169
+ raise ValueError("Singular matrix encountered while solving linear system.")
170
+
171
+ inverse_pivot = 1.0 / pivot_value
172
+ augmented[pivot_index] = [value * inverse_pivot for value in augmented[pivot_index]]
173
+
174
+ for row_index in range(size):
175
+ if row_index == pivot_index:
176
+ continue
177
+ factor = augmented[row_index][pivot_index]
178
+ augmented[row_index] = [
179
+ augmented[row_index][col] - factor * augmented[pivot_index][col]
180
+ for col in range(size + 1)
181
+ ]
182
+
183
+ return [augmented[row][-1] for row in range(size)]
184
+
185
+
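
A quick sanity check of the pure-Python fallback (Gauss-Jordan elimination with partial pivoting); the system is small enough to verify by hand:

```python
# 2x + y = 3 and x + 3y = 5 have the unique solution x = 0.8, y = 1.4.
solution = solve_linear_system([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])
print(solution)   # [0.8, 1.4] (up to floating-point rounding)
```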
186
+ def invert_matrix(matrix: Matrix) -> Matrix:
187
+ if np is not None:
188
+ return np.linalg.inv(np.asarray(matrix, dtype=np.float64)).tolist()
189
+ size = len(matrix)
190
+ inverse_columns = []
191
+ for basis_index in range(size):
192
+ basis_vector = [0.0 for _ in range(size)]
193
+ basis_vector[basis_index] = 1.0
194
+ inverse_columns.append(solve_linear_system(matrix, basis_vector))
195
+ return transpose(inverse_columns)
196
+
197
+
198
+ def dominant_eigenpair_symmetric(
199
+ matrix: Matrix,
200
+ max_iterations: int = 64,
201
+ tolerance: float = 1e-10,
202
+ ) -> tuple[float, Vector]:
203
+ size = len(matrix)
204
+ if size == 0:
205
+ return 0.0, []
206
+ if np is not None:
207
+ values, vectors = np.linalg.eigh(np.asarray(matrix, dtype=np.float64))
208
+ index = int(np.argmax(values))
209
+ eigenvalue = float(values[index])
210
+ if eigenvalue <= tolerance:
211
+ return 0.0, zeros_vector(size)
212
+ return eigenvalue, vectors[:, index].astype(float).tolist()
213
+
214
+ vector = [1.0 / math.sqrt(size) for _ in range(size)]
215
+ for _ in range(max_iterations):
216
+ next_vector = matvec(matrix, vector)
217
+ next_norm = norm(next_vector)
218
+ if next_norm < tolerance:
219
+ return 0.0, zeros_vector(size)
220
+
221
+ next_vector = [value / next_norm for value in next_vector]
222
+ delta = max(abs(a - b) for a, b in zip(vector, next_vector))
223
+ vector = next_vector
224
+ if delta < tolerance:
225
+ break
226
+
227
+ eigenvalue = dot(vector, matvec(matrix, vector))
228
+ return eigenvalue, vector
229
+
230
+
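
Power iteration converges to the eigenvector of the largest eigenvalue for symmetric matrices; a small check (the matrix [[2, 1], [1, 2]] has eigenvalues 3 and 1):

```python
value, vector = dominant_eigenpair_symmetric([[2.0, 1.0], [1.0, 2.0]])
print(round(value, 6))                 # 3.0
print([round(v, 3) for v in vector])   # [0.707, 0.707], up to sign
```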
231
+ def top_k_eigenpairs_symmetric(matrix: Matrix, k: int) -> list[tuple[float, Vector]]:
232
+ if np is not None and matrix:
233
+ values, vectors = np.linalg.eigh(np.asarray(matrix, dtype=np.float64))
234
+ ranked = sorted(
235
+ (
236
+ (float(values[index]), vectors[:, index].astype(float).tolist())
237
+ for index in range(len(values))
238
+ if float(values[index]) > 1e-9
239
+ ),
240
+ key=lambda item: item[0],
241
+ reverse=True,
242
+ )
243
+ return ranked[: min(k, len(ranked))]
244
+ working = copy_matrix(matrix)
245
+ eigenpairs: list[tuple[float, Vector]] = []
246
+ for _ in range(min(k, len(working))):
247
+ eigenvalue, eigenvector = dominant_eigenpair_symmetric(working)
248
+ if eigenvalue <= 1e-9 or not eigenvector:
249
+ break
250
+ eigenpairs.append((eigenvalue, eigenvector))
251
+ deflation = scale_matrix(outer(eigenvector, eigenvector), eigenvalue)
252
+ working = subtract_matrices(working, deflation)
253
+ return eigenpairs
254
+
255
+
256
+ def softmax(logits: Vector) -> Vector:
257
+ if not logits:
258
+ return []
259
+ if np is not None:
260
+ values = np.asarray(logits, dtype=np.float64)
261
+ shifted = np.exp(values - values.max())
262
+ total = float(shifted.sum())
263
+ if total == 0.0:
264
+ return [1.0 / len(logits) for _ in logits]
265
+ return (shifted / total).tolist()
266
+ max_logit = max(logits)
267
+ shifted = [math.exp(logit - max_logit) for logit in logits]
268
+ total = sum(shifted)
269
+ if total == 0.0:
270
+ return [1.0 / len(logits) for _ in logits]
271
+ return [value / total for value in shifted]
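
The max-shift in both branches above keeps `exp` within floating-point range for large logits without changing the result; a quick check:

```python
print(softmax([0.0, 1.0]))         # ≈ [0.2689, 0.7311]
print(softmax([1000.0, 1001.0]))   # same distribution after the max shift
```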
reframr/model.py ADDED
The diff for this file is too large to render. See raw diff
 
reframr/reasoning.py ADDED
@@ -0,0 +1,26 @@
1
+ TOKENIZER_NAME = "FrameToken"
2
+
3
+ REASONING_CONTROL_TOKENS: tuple[str, ...] = (
4
+ "<reason>",
5
+ "<plan>",
6
+ "<reflect>",
7
+ "<answer>",
8
+ "<memory>",
9
+ "<retrieve>",
10
+ "<focus>",
11
+ "<verify>",
12
+ "<tool>",
13
+ )
14
+
15
+ REASONING_PROFILES: dict[str, tuple[str, ...]] = {
16
+ "none": (),
17
+ "deep": ("<reason>",),
18
+ "memory": ("<memory>", "<retrieve>", "<focus>"),
19
+ "tool": ("<tool>", "<reason>", "<verify>"),
20
+ }
21
+
22
+
23
+ def reasoning_prefix(mode: str) -> list[str]:
24
+ if mode not in REASONING_PROFILES:
25
+ raise ValueError(f"Unknown reasoning mode: {mode}")
26
+ return list(REASONING_PROFILES[mode])
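
Usage sketch for the reasoning profiles above; unknown modes fail loudly rather than degrading silently:

```python
from reframr.reasoning import reasoning_prefix

print(reasoning_prefix("tool"))   # ['<tool>', '<reason>', '<verify>']
print(reasoning_prefix("none"))   # []
reasoning_prefix("creative")      # raises ValueError: Unknown reasoning mode: creative
```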
reframr/reservoir.py ADDED
@@ -0,0 +1,94 @@
1
+ from .linalg import Matrix, Vector, identity, invert_matrix, matmul, matvec, np, scale_matrix, transpose
2
+
3
+
4
+ def _empty_matrix(matrix: Matrix) -> bool:
5
+ if np is not None and hasattr(matrix, "size"):
6
+ return int(matrix.size) == 0
7
+ return not matrix
8
+
9
+
10
+ def ridge_regression_readout(
11
+ states: list[Vector],
12
+ targets: list[Vector],
13
+ *,
14
+ regularization: float,
15
+ ) -> Matrix:
16
+ if not states or not targets:
17
+ raise ValueError("States and targets must be non-empty for ridge readout.")
18
+ if np is not None:
19
+ state_matrix = np.asarray(states, dtype=np.float64).T
20
+ target_matrix = np.asarray(targets, dtype=np.float64).T
21
+ gram = state_matrix @ state_matrix.T
22
+ regularized = gram + (regularization * np.eye(gram.shape[0], dtype=np.float64))
23
+ cross_covariance = target_matrix @ state_matrix.T
24
+ return np.linalg.solve(regularized.T, cross_covariance.T).T.tolist()
25
+
26
+ state_matrix = transpose(states)
27
+ target_matrix = transpose(targets)
28
+ gram = matmul(state_matrix, transpose(state_matrix))
29
+ regularized = [
30
+ [
31
+ gram[row][col] + (regularization if row == col else 0.0)
32
+ for col in range(len(gram[row]))
33
+ ]
34
+ for row in range(len(gram))
35
+ ]
36
+ inverse = invert_matrix(regularized)
37
+ cross_covariance = matmul(target_matrix, transpose(state_matrix))
38
+ return matmul(cross_covariance, inverse)
39
+
40
+
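
The readout above is the closed-form ridge solution W = (Y X^T)(X X^T + lambda*I)^-1 over column-stacked states. A one-dimensional check that is easy to verify by hand:

```python
# With states [[1.0], [2.0]] and targets [[2.0], [4.0]]:
# X X^T = [[5]], Y X^T = [[10]], so W = 10 / (5 + lambda); for lambda = 0 that is 2.0.
weights = ridge_regression_readout([[1.0], [2.0]], [[2.0], [4.0]], regularization=0.0)
print(weights)   # [[2.0]] (up to floating-point rounding)
```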
41
+ def ridge_regression_readout_from_moments(
42
+ gram: Matrix,
43
+ cross_covariance: Matrix,
44
+ *,
45
+ regularization: float,
46
+ ) -> Matrix:
47
+ if _empty_matrix(gram) or _empty_matrix(cross_covariance):
48
+ raise ValueError("Gram and cross-covariance moments must be non-empty for ridge readout.")
49
+ if np is not None:
50
+ gram_array = np.asarray(gram, dtype=np.float64)
51
+ regularized = gram_array + (regularization * np.eye(gram_array.shape[0], dtype=np.float64))
52
+ cross_covariance_array = np.asarray(cross_covariance, dtype=np.float64)
53
+ return np.linalg.solve(regularized.T, cross_covariance_array.T).T
54
+
55
+ regularized = [
56
+ [
57
+ gram[row][col] + (regularization if row == col else 0.0)
58
+ for col in range(len(gram[row]))
59
+ ]
60
+ for row in range(len(gram))
61
+ ]
62
+ inverse = invert_matrix(regularized)
63
+ return matmul(cross_covariance, inverse)
64
+
65
+
66
+ def ridge_regression_readout_from_diagonal_moments(
67
+ feature_second_moment: Vector,
68
+ cross_covariance: Matrix,
69
+ *,
70
+ regularization: float,
71
+ ) -> Matrix:
72
+ if _empty_matrix(feature_second_moment) or _empty_matrix(cross_covariance):
73
+ raise ValueError("Diagonal moments and cross-covariance must be non-empty for ridge readout.")
74
+ if np is not None:
75
+ denominator = np.asarray(feature_second_moment, dtype=np.float64) + regularization
76
+ denominator = np.where(np.abs(denominator) > 1e-12, denominator, regularization)
77
+ cross_covariance_array = np.asarray(cross_covariance, dtype=np.float64)
78
+ return cross_covariance_array / denominator[None, :]
79
+
80
+ denominator = [
81
+ value + regularization if abs(value + regularization) > 1e-12 else regularization
82
+ for value in feature_second_moment
83
+ ]
84
+ return [
85
+ [
86
+ value / denominator[col]
87
+ for col, value in enumerate(row)
88
+ ]
89
+ for row in cross_covariance
90
+ ]
91
+
92
+
93
+ def apply_readout(weights: Matrix, state: Vector) -> Vector:
94
+ return matvec(weights, state)
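
The diagonal-moment variant above collapses the ridge solve to an elementwise division by per-feature second moments; a tiny check:

```python
weights = ridge_regression_readout_from_diagonal_moments(
    [4.0, 1.0],     # per-feature second moments
    [[8.0, 3.0]],   # cross-covariance row
    regularization=0.0,
)
print(weights)   # ≈ [[2.0, 3.0]]: each column is cross / (moment + lambda)
```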
reframr/streaming.py ADDED
@@ -0,0 +1,1852 @@
1
+ from __future__ import annotations
2
+
3
+ import json
4
+ import random
5
+ import re
6
+ import site
7
+ import sys
8
+ import time
9
+ from collections import Counter
10
+ from collections.abc import Iterable, Iterator
11
+ from dataclasses import dataclass
12
+ from pathlib import Path
13
+
14
+ from .config import ReframrConfig
15
+ from .corpus import build_vocabulary_from_counts
16
+ from .embeddings import fit_ppmi_embedding_from_cooccurrence, fit_randomized_ppmi_embedding_from_counts
17
+ from .hippo import AnalyticalMemoryUnit
18
+ from .linalg import Matrix, Vector, mean, norm, zeros, zeros_vector
19
+ from .model import ReframrModel, RUNTIME_ARRAY_DTYPE, TRANSITION_ORDERS, np
20
+ from .reservoir import (
21
+ ridge_regression_readout_from_diagonal_moments,
22
+ ridge_regression_readout_from_moments,
23
+ )
24
+ from .ternary import apply_ternary_mask, derive_ternary_mask_from_feature_energy
25
+ from .text_quality import clean_answer_text, clean_context_text, clean_training_text
26
+ from .tokenizer import NativeTokenizer
27
+
28
+ try:
29
+ from scipy import sparse as scipy_sparse
30
+ except (ImportError, ModuleNotFoundError, OSError):
31
+ scipy_sparse = None
32
+
33
+ TEXT_FIELD_PREFERENCES = (
34
+ "text",
35
+ "content",
36
+ "body",
37
+ "article",
38
+ "document",
39
+ "passage",
40
+ "markdown",
41
+ "answer",
42
+ "response",
43
+ )
44
+
45
+ DIALOGUE_FIELD_PREFERENCES = (
46
+ "messages",
47
+ "conversation",
48
+ "conversations",
49
+ "dialogue",
50
+ "dialog",
51
+ "turns",
52
+ "chosen",
53
+ )
54
+ INSTRUCTION_FIELD_PAIRS = (
55
+ ("instruction", "output"),
56
+ ("prompt", "completion"),
57
+ ("prompt", "response"),
58
+ ("question", "answer"),
59
+ ("question", "response"),
60
+ ("query", "answer"),
61
+ ("query", "response"),
62
+ )
63
+ TRANSCRIPT_ROLE_PATTERN = re.compile(r"(?:^|\n\s*\n)(Human|Assistant|System)\s*:\s*", re.IGNORECASE)
64
+ ROLE_ALIASES = {
65
+ "assistant": "assistant",
66
+ "assistant_response": "assistant",
67
+ "bot": "assistant",
68
+ "gpt": "assistant",
69
+ "model": "assistant",
70
+ "human": "user",
71
+ "prompter": "user",
72
+ "user": "user",
73
+ "customer": "user",
74
+ "system": "system",
75
+ }
76
+ ANSWER_READOUT_WEIGHT = 1.0
77
+ CONTEXT_READOUT_WEIGHT = 0.0
78
+ CONTEXT_STAT_WEIGHT = 0.02
79
+ PLAIN_TEXT_READOUT_WEIGHT = 0.03
80
+ PREFERENCE_REJECTED_TOKENIZER_WEIGHT = 0.0
81
+ PREFERENCE_BIAS_SCALE = 0.95
82
+ MAX_PREFERENCE_STATE_PAIRS = 512
83
+ ANSWER_START_TOKEN_WINDOW = 12
84
+ ANSWER_START_DECAY = 0.86
85
+ MAX_ANSWER_SEQUENCE_EXAMPLES = 196608
86
+ MAX_ANSWER_SEQUENCE_TOKENS = 192
87
+ HF_STREAM_MAX_RETRIES = 5
88
+ HF_STREAM_RETRY_BASE_DELAY_SECONDS = 0.25
89
+ FULL_READOUT_FEATURE_LIMIT = 2304
90
+ FULL_READOUT_EXAMPLE_LIMIT = 25000
91
+
92
+
93
+ @dataclass(slots=True)
94
+ class CorpusPlanEntry:
95
+ source: str
96
+ name: str
97
+ dataset: str = ""
98
+ path: str = ""
99
+ config: str | None = None
100
+ split: str = "train"
101
+ limit: int = 0
102
+ weight: float = 1.0
103
+ text_field: str | None = None
104
+ min_words: int = 0
105
+ max_words: int = 0
106
+ min_alpha_ratio: float = 0.0
107
+ allowed_languages: tuple[str, ...] = ()
108
+ records: tuple[object, ...] = ()
109
+ streaming: bool = True
110
+ trust_remote_code: bool = False
111
+
112
+
113
+ @dataclass(slots=True)
114
+ class StreamDocument:
115
+ text: str
116
+ weight: float
117
+ source: str
118
+ language: str = ""
119
+ preference_rejected_text: str = ""
120
+
121
+
122
+ class StreamingCooccurrenceAccumulator:
123
+ def __init__(self, token_to_id: dict[str, int], window_size: int) -> None:
124
+ self.token_to_id = token_to_id
125
+ self.window_size = window_size
126
+ self.rows: dict[int, dict[int, float]] = {}
127
+
128
+ def update_tokens(self, tokens: list[str], *, weight: float) -> None:
129
+ token_ids = [self.token_to_id[token] for token in tokens if token in self.token_to_id]
130
+ for index, token_id in enumerate(token_ids):
131
+ for offset in range(1, self.window_size + 1):
132
+ other_index = index + offset
133
+ if other_index >= len(token_ids):
134
+ break
135
+ other_id = token_ids[other_index]
136
+ delta = weight * (1.0 / offset)
137
+ self.rows.setdefault(token_id, {})[other_id] = (
138
+ self.rows.setdefault(token_id, {}).get(other_id, 0.0) + delta
139
+ )
140
+ self.rows.setdefault(other_id, {})[token_id] = (
141
+ self.rows.setdefault(other_id, {}).get(token_id, 0.0) + delta
142
+ )
143
+
144
+ def to_dense(self) -> Matrix:
145
+ size = len(self.token_to_id)
146
+ matrix = zeros(size, size)
147
+ for row, columns in self.rows.items():
148
+ for col, value in columns.items():
149
+ matrix[row][col] = value
150
+ return matrix
151
+
152
+ def to_sparse(self) -> object:
153
+ if scipy_sparse is None or np is None:
154
+ return self.to_dense()
155
+ rows: list[int] = []
156
+ cols: list[int] = []
157
+ data: list[float] = []
158
+ for row, columns in self.rows.items():
159
+ for col, value in columns.items():
160
+ rows.append(row)
161
+ cols.append(col)
162
+ data.append(value)
163
+ size = len(self.token_to_id)
164
+ return scipy_sparse.coo_matrix(
165
+ (
166
+ np.asarray(data, dtype=np.float64),
167
+ (np.asarray(rows, dtype=np.int64), np.asarray(cols, dtype=np.int64)),
168
+ ),
169
+ shape=(size, size),
170
+ dtype=np.float64,
171
+ ).tocsr()
172
+
173
+
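
The accumulator above adds symmetric, inverse-distance-weighted counts within the window; a small check:

```python
vocab = {"the": 0, "sky": 1, "blue": 2}
acc = StreamingCooccurrenceAccumulator(vocab, window_size=2)
acc.update_tokens(["the", "sky", "blue"], weight=1.0)
dense = acc.to_dense()
# Adjacent pairs contribute 1.0 in both directions; offset-2 pairs contribute 0.5.
print(dense[1][2], dense[2][1], dense[0][2])   # 1.0 1.0 0.5
```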
174
+ class TransitionAccumulator:
175
+ def __init__(
176
+ self,
177
+ *,
178
+ max_contexts_per_order: int | None = None,
179
+ max_next_tokens: int = 0,
180
+ ) -> None:
181
+ self.max_contexts_per_order = max_contexts_per_order
182
+ self.max_next_tokens = max_next_tokens
183
+ self.context_soft_limit = (
184
+ max_contexts_per_order * 4
185
+ if max_contexts_per_order is not None and max_contexts_per_order > 0
186
+ else None
187
+ )
188
+ self.next_token_soft_limit = max_next_tokens * 4 if max_next_tokens > 0 else None
189
+ self.counts: dict[int, dict[tuple[str, ...], dict[str, float]]] = {
190
+ order: {} for order in sorted(TRANSITION_ORDERS)
191
+ }
192
+
193
+ def update_tokens(self, tokens: list[str], *, weight: float) -> None:
194
+ for order in sorted(TRANSITION_ORDERS):
195
+ order_counts = self.counts[order]
196
+ for index in range(order - 1, len(tokens) - 1):
197
+ key = tuple(tokens[index - order + 1 : index + 1])
198
+ nxt = tokens[index + 1]
199
+ if (
200
+ self.context_soft_limit is not None
201
+ and key not in order_counts
202
+ and len(order_counts) >= self.context_soft_limit
203
+ ):
204
+ continue
205
+ bucket = order_counts.setdefault(key, {})
206
+ if (
207
+ self.next_token_soft_limit is not None
208
+ and nxt not in bucket
209
+ and len(bucket) >= self.next_token_soft_limit
210
+ ):
211
+ continue
212
+ bucket[nxt] = bucket.get(nxt, 0.0) + weight
213
+
214
+ def finalize(
215
+ self,
216
+ *,
217
+ max_contexts_per_order: int | None,
218
+ max_next_tokens: int,
219
+ ) -> dict[int, dict[tuple[str, ...], dict[str, float]]]:
220
+ probabilities: dict[int, dict[tuple[str, ...], dict[str, float]]] = {
221
+ order: {} for order in sorted(TRANSITION_ORDERS)
222
+ }
223
+ for order, mapping in self.counts.items():
224
+ items = list(mapping.items())
225
+ items.sort(key=lambda item: (-sum(item[1].values()), item[0]))
226
+ if max_contexts_per_order is not None and max_contexts_per_order >= 0:
227
+ items = items[:max_contexts_per_order]
228
+ for key, bucket in items:
229
+ next_items = sorted(bucket.items(), key=lambda item: (-item[1], item[0]))
230
+ if max_next_tokens > 0:
231
+ next_items = next_items[:max_next_tokens]
232
+ total = sum(value for _, value in next_items)
233
+ if total <= 0.0:
234
+ continue
235
+ probabilities[order][key] = {
236
+ token: value / total
237
+ for token, value in next_items
238
+ }
239
+ return probabilities
240
+
241
+
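
`finalize` turns the raw weighted counts into conditional next-token distributions, keeping the most frequent contexts and continuations first. A sketch, assuming `TRANSITION_ORDERS` includes order 1:

```python
acc = TransitionAccumulator(max_contexts_per_order=1000, max_next_tokens=8)
acc.update_tokens(["a", "b", "a", "c"], weight=1.0)
tables = acc.finalize(max_contexts_per_order=1000, max_next_tokens=8)
# Context ("a",) was followed once by "b" and once by "c":
print(tables[1][("a",)])   # {"b": 0.5, "c": 0.5}
```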
242
+ class StateReservoir:
243
+ def __init__(self, capacity: int | None, *, seed: int = 13) -> None:
244
+ self.capacity = capacity
245
+ self.random = random.Random(seed)
246
+ self.states: list[Vector] = []
247
+ self.labels: list[int] = []
248
+ self.weights: list[float] = []
249
+ self.seen = 0
250
+ self.total_weight = 0.0
251
+
252
+ def reserve_slot(self, weight: float = 1.0) -> int | None:
253
+ if weight <= 0.0:
254
+ return None
255
+ self.seen += 1
256
+ self.total_weight += weight
257
+ if self.capacity is None:
258
+ return len(self.states)
259
+ if self.capacity <= 0:
260
+ return None
261
+ if len(self.states) < self.capacity:
262
+ return len(self.states)
263
+ keep_probability = min(1.0, (self.capacity * weight) / max(self.total_weight, 1e-12))
264
+ if self.random.random() >= keep_probability:
265
+ return None
266
+ return self.random.randrange(self.capacity)
267
+
268
+ def store_reserved(
269
+ self,
270
+ slot: int,
271
+ state: Vector,
272
+ label_id: int,
273
+ *,
274
+ example_weight: float = 1.0,
275
+ ) -> None:
276
+ stored_state = state.copy() if hasattr(state, "copy") else state[:]
277
+ if slot == len(self.states):
278
+ self.states.append(stored_state)
279
+ self.labels.append(label_id)
280
+ self.weights.append(example_weight)
281
+ elif 0 <= slot < len(self.states):
282
+ self.states[slot] = stored_state
283
+ self.labels[slot] = label_id
284
+ self.weights[slot] = example_weight
285
+
286
+ def consider(self, state: Vector, label_id: int, weight: float = 1.0) -> None:
287
+ slot = self.reserve_slot(weight=weight)
288
+ if slot is not None:
289
+ self.store_reserved(slot, state, label_id, example_weight=weight)
290
+
291
+
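
`StateReservoir` implements weighted reservoir sampling: once full, a new item displaces a random slot with probability `capacity * weight / total_weight`, so heavier examples persist longer. A sketch:

```python
reservoir = StateReservoir(capacity=2, seed=7)
for label, weight in enumerate([1.0, 1.0, 5.0, 1.0]):
    reservoir.consider([float(label)], label, weight)
print(len(reservoir.states), reservoir.seen)   # 2 4
```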
292
+ class SequenceReservoir:
293
+ def __init__(self, capacity: int | None, *, seed: int = 41) -> None:
294
+ self.capacity = capacity
295
+ self.random = random.Random(seed)
296
+ self.keys: list[Vector] = []
297
+ self.prompt_rows: list[list[int]] = []
298
+ self.token_rows: list[list[int]] = []
299
+ self.weights: list[float] = []
300
+ self.seen_weight = 0.0
301
+
302
+ def reserve_slot(self, *, weight: float = 1.0) -> int | None:
303
+ if self.capacity == 0 or weight <= 0.0:
304
+ return None
305
+ self.seen_weight += weight
306
+ if self.capacity is None or len(self.keys) < self.capacity:
307
+ return len(self.keys)
308
+ probability = min(1.0, (self.capacity * weight) / max(self.seen_weight, 1e-12))
309
+ if self.random.random() >= probability:
310
+ return None
311
+ return self.random.randrange(self.capacity)
312
+
313
+ def store_reserved(
314
+ self,
315
+ slot: int,
316
+ key: Vector,
317
+ prompt_token_ids: list[int],
318
+ token_ids: list[int],
319
+ *,
320
+ example_weight: float = 1.0,
321
+ ) -> None:
322
+ key_copy = key.tolist() if hasattr(key, "tolist") else list(key)
323
+ prompt_row = prompt_token_ids[:MAX_ANSWER_SEQUENCE_TOKENS]
324
+ row = token_ids[:MAX_ANSWER_SEQUENCE_TOKENS]
325
+ if self.capacity is None or slot >= len(self.keys):
326
+ self.keys.append(key_copy)
327
+ self.prompt_rows.append(prompt_row)
328
+ self.token_rows.append(row)
329
+ self.weights.append(example_weight)
330
+ return
331
+ self.keys[slot] = key_copy
332
+ self.prompt_rows[slot] = prompt_row
333
+ self.token_rows[slot] = row
334
+ self.weights[slot] = example_weight
335
+
336
+ def consider(
337
+ self,
338
+ key: Vector,
339
+ prompt_token_ids: list[int],
340
+ token_ids: list[int],
341
+ weight: float = 1.0,
342
+ ) -> None:
343
+ if not token_ids:
344
+ return
345
+ slot = self.reserve_slot(weight=weight)
346
+ if slot is not None:
347
+ self.store_reserved(slot, key, prompt_token_ids, token_ids, example_weight=weight)
348
+
349
+
350
+ def _word_count(text: str) -> int:
351
+ return len(text.split())
352
+
353
+
354
+ def _alpha_ratio(text: str) -> float:
355
+ if not text:
356
+ return 0.0
357
+ alpha_count = sum(character.isalpha() for character in text)
358
+ return alpha_count / len(text)
359
+
360
+
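
These two helpers drive the per-entry quality gate below; their arithmetic on a short string:

```python
print(_word_count("Hello, world!"))              # 2
print(round(_alpha_ratio("Hello, world!"), 3))   # 0.769 (10 letters / 13 characters)
```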
361
+ def _row_language(row: dict[str, object]) -> str:
362
+ for candidate in ("lang", "language", "locale"):
363
+ value = row.get(candidate)
364
+ if isinstance(value, str) and value.strip():
365
+ return value.strip()
366
+ return ""
367
+
368
+
369
+ def _normalize_role(raw_role: object) -> str:
370
+ role = str(raw_role or "").strip().casefold()
371
+ return ROLE_ALIASES.get(role, role)
372
+
373
+
374
+ def _message_content(message: dict[str, object]) -> str:
375
+ for field in ("content", "value", "text", "message"):
376
+ value = message.get(field)
377
+ if isinstance(value, str) and value.strip():
378
+ return clean_training_text(value)
379
+ return ""
380
+
381
+
382
+ def _message_role(message: dict[str, object]) -> str:
383
+ for field in ("role", "from", "speaker", "author"):
384
+ value = message.get(field)
385
+ if value is not None:
386
+ normalized = _normalize_role(value)
387
+ if normalized:
388
+ return normalized
389
+ return ""
390
+
391
+
392
+ def _parse_dialogue_messages(raw_messages: object) -> list[dict[str, str]]:
393
+ if not isinstance(raw_messages, list):
394
+ return []
395
+
396
+ parsed: list[dict[str, str]] = []
397
+ for message in raw_messages:
398
+ if not isinstance(message, dict):
399
+ continue
400
+ role = _message_role(message)
401
+ content = _message_content(message)
402
+ if role not in {"system", "user", "assistant"} or not content:
403
+ continue
404
+ parsed.append({"role": role, "content": content})
405
+ return parsed
406
+
407
+
408
+ def _parse_transcript_messages(raw_text: object) -> list[dict[str, str]]:
409
+ if not isinstance(raw_text, str):
410
+ return []
411
+
412
+ text = raw_text.strip()
413
+ if not text:
414
+ return []
415
+
416
+ matches = list(TRANSCRIPT_ROLE_PATTERN.finditer(text))
417
+ if not matches:
418
+ return []
419
+
420
+ parsed: list[dict[str, str]] = []
421
+ for index, match in enumerate(matches):
422
+ role = _normalize_role(match.group(1))
423
+ start = match.end()
424
+ end = matches[index + 1].start() if index + 1 < len(matches) else len(text)
425
+ content = clean_training_text(text[start:end].strip())
426
+ if role in {"system", "user", "assistant"} and content:
427
+ parsed.append({"role": role, "content": content})
428
+ return parsed
429
+
430
+
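
Role markers are only recognized at the start of the text or after a blank line, and `ROLE_ALIASES` collapses the naming variants. A sketch; the content strings also pass through `clean_training_text`, so only the roles are asserted here:

```python
transcript = "Human: Ping?\n\nAssistant: Pong."
messages = _parse_transcript_messages(transcript)
print([message["role"] for message in messages])   # ['user', 'assistant']
```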
431
+ def _render_prompt(messages: list[dict[str, str]]) -> str:
432
+ parts = []
433
+ for message in messages:
434
+ content = clean_context_text(message["content"])
435
+ if content:
436
+ parts.append(content)
437
+ return "\n".join(parts).strip()
438
+
439
+
440
+ def _last_user_prompt_before(messages: list[dict[str, str]], end_index: int) -> str:
441
+ for message in reversed(messages[:end_index]):
442
+ if message["role"] == "user":
443
+ return clean_context_text(message["content"])
444
+ return _render_prompt(messages[:end_index])
445
+
446
+
447
+ def _compose_training_text(context: object, answer: object) -> str:
448
+ prompt_text = clean_context_text(_flatten_value(context))
449
+ answer_text = clean_answer_text(_flatten_value(answer))
450
+ if prompt_text and answer_text:
451
+ return f"<reason> {prompt_text} <answer> {answer_text}".strip()
452
+ return clean_training_text(answer_text or prompt_text)
453
+
454
+
455
+ def _compose_from_messages(messages: list[dict[str, str]]) -> str:
456
+ assistant_index = None
457
+ for index in range(len(messages) - 1, -1, -1):
458
+ if messages[index]["role"] == "assistant":
459
+ assistant_index = index
460
+ break
461
+ if assistant_index is not None:
462
+ prompt = _last_user_prompt_before(messages, assistant_index)
463
+ answer = clean_answer_text(messages[assistant_index]["content"])
464
+ if prompt and answer:
465
+ return f"<reason> {prompt} <answer> {answer}".strip()
466
+ return "\n".join(
467
+ message["content"]
468
+ for message in messages
469
+ if message.get("content")
470
+ ).strip()
471
+
472
+
473
+ def _flatten_message_list(messages: object) -> str:
474
+ parsed = _parse_dialogue_messages(messages)
475
+ if parsed:
476
+ return _compose_from_messages(parsed)
477
+ if not isinstance(messages, list):
478
+ return ""
479
+ parts: list[str] = []
480
+ for message in messages:
481
+ if not isinstance(message, dict):
482
+ continue
483
+ content = str(
484
+ message.get("content", message.get("value", message.get("text", "")))
485
+ ).strip()
486
+ if not content:
487
+ continue
488
+ parts.append(clean_training_text(content))
489
+ return "\n".join(parts).strip()
490
+
491
+
492
+ def _flatten_value(value: object) -> str:
493
+ if isinstance(value, str):
494
+ parsed = _parse_transcript_messages(value)
495
+ if parsed:
496
+ return _compose_from_messages(parsed)
497
+ return clean_training_text(value.strip())
498
+ if isinstance(value, list):
499
+ return _flatten_message_list(value)
500
+ if isinstance(value, dict):
501
+ for field in ("messages", "conversation", "conversations", "dialogue", "turns"):
502
+ nested_messages = value.get(field)
503
+ text = _flatten_message_list(nested_messages)
504
+ if text:
505
+ return text
506
+ for field in ("text", "content", "value", "message"):
507
+ nested = value.get(field)
508
+ if isinstance(nested, str) and nested.strip():
509
+ return _flatten_value(nested)
510
+ return ""
511
+
512
+
513
+ def _safe_flag(value: object) -> bool | None:
514
+ if isinstance(value, bool):
515
+ return value
516
+ if isinstance(value, str):
517
+ normalized = value.strip().casefold()
518
+ if normalized in {"true", "1", "yes", "safe"}:
519
+ return True
520
+ if normalized in {"false", "0", "no", "unsafe"}:
521
+ return False
522
+ return None
523
+
524
+
525
+ def _selected_response_fields(row: dict[str, object]) -> tuple[str, str]:
526
+ if "response_0" not in row or "response_1" not in row:
527
+ return "", ""
528
+ safe_0 = _safe_flag(row.get("is_response_0_safe"))
529
+ safe_1 = _safe_flag(row.get("is_response_1_safe"))
530
+ if safe_0 is not None and safe_1 is not None:
531
+ if safe_0 and not safe_1:
532
+ return "response_0", "response_1"
533
+ if safe_1 and not safe_0:
534
+ return "response_1", "response_0"
535
+ if safe_0 and safe_1:
536
+ return "response_0", ""
537
+ return "", ""
538
+ for selector in ("safer_response_id", "better_response_id"):
539
+ raw_value = row.get(selector)
540
+ try:
541
+ preferred = int(raw_value)
542
+ except (TypeError, ValueError):
543
+ continue
544
+ chosen = "response_1" if preferred == 1 else "response_0"
545
+ rejected = "response_0" if chosen == "response_1" else "response_1"
546
+ return chosen, rejected
547
+ return "response_0", "response_1"
548
+
549
+
550
+ def _extract_preference_pair(row: dict[str, object]) -> tuple[str, str]:
551
+ if "chosen" in row and "rejected" in row:
552
+ chosen_text = clean_training_text(_flatten_value(row.get("chosen")))
553
+ rejected_text = clean_training_text(_flatten_value(row.get("rejected")))
554
+ if chosen_text and rejected_text:
555
+ return chosen_text, rejected_text
556
+ if "response_0" in row and "response_1" in row:
557
+ preferred_field, rejected_field = _selected_response_fields(row)
558
+ if not preferred_field or not rejected_field:
559
+ return "", ""
560
+ prompt = row.get("prompt", row.get("question", row.get("query", "")))
561
+ if prompt:
562
+ chosen_text = _compose_training_text(prompt, row.get(preferred_field))
563
+ rejected_text = _compose_training_text(prompt, row.get(rejected_field))
564
+ if chosen_text and rejected_text:
565
+ return clean_training_text(chosen_text), clean_training_text(rejected_text)
566
+ chosen_text = clean_training_text(_flatten_value(row.get(preferred_field)))
567
+ rejected_text = clean_training_text(_flatten_value(row.get(rejected_field)))
568
+ if chosen_text and rejected_text:
569
+ return chosen_text, rejected_text
570
+ return "", ""
571
+
572
+
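
The safety flags take precedence over the numeric selectors: if exactly one response is flagged safe, it is chosen outright. A sketch with hypothetical row values:

```python
row = {
    "response_0": "Use strong, unique passwords.",
    "response_1": "Reuse one password everywhere.",
    "is_response_0_safe": "yes",
    "is_response_1_safe": "no",
}
print(_selected_response_fields(row))   # ('response_0', 'response_1')
```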
573
+ def _extract_preference_value(row: dict[str, object]) -> str:
574
+ chosen_text, _ = _extract_preference_pair(row)
575
+ return chosen_text
576
+
577
+
578
+ def _extract_row_text(row: dict[str, object], text_field: str | None) -> str:
579
+ if "context" in row and "answer" in row:
580
+ context = clean_context_text(_flatten_value(row.get("context")))
581
+ answer = clean_answer_text(_flatten_value(row.get("answer")))
582
+ if context and answer:
583
+ return f"<reason> {context} <answer> {answer}".strip()
584
+
585
+ if "response_0" in row and "response_1" in row:
586
+ preferred_field, _ = _selected_response_fields(row)
587
+ prompt = row.get("prompt", row.get("question", row.get("query", "")))
588
+ if preferred_field and prompt:
589
+ text = _compose_training_text(prompt, row.get(preferred_field))
590
+ if text:
591
+ return text
592
+
593
+ for prompt_field, answer_field in INSTRUCTION_FIELD_PAIRS:
594
+ if prompt_field in row and answer_field in row:
595
+ text = _compose_training_text(row.get(prompt_field), row.get(answer_field))
596
+ if text:
597
+ return text
598
+
599
+ if text_field is not None:
600
+ return clean_training_text(_flatten_value(row.get(text_field)))
601
+
602
+ preferred = _extract_preference_value(row)
603
+ if preferred:
604
+ return clean_training_text(preferred)
605
+
606
+ for field in TEXT_FIELD_PREFERENCES:
607
+ text = _flatten_value(row.get(field))
608
+ if text:
609
+ return clean_training_text(text)
610
+ for field in DIALOGUE_FIELD_PREFERENCES:
611
+ text = _flatten_value(row.get(field))
612
+ if text:
613
+ return clean_training_text(text)
614
+ return ""
615
+
616
+
617
+ def _passes_text_quality(text: str, language: str, entry: CorpusPlanEntry) -> bool:
618
+ if not text:
619
+ return False
620
+ word_count = _word_count(text)
621
+ if entry.min_words > 0 and word_count < entry.min_words:
622
+ return False
623
+ if entry.max_words > 0 and word_count > entry.max_words:
624
+ return False
625
+ if entry.min_alpha_ratio > 0.0 and _alpha_ratio(text) < entry.min_alpha_ratio:
626
+ return False
627
+ if entry.allowed_languages:
628
+ if not language or language.casefold() not in entry.allowed_languages:
629
+ return False
630
+ return True
631
+
632
+
633
+ def load_corpus_plan(source: str | Path) -> list[CorpusPlanEntry]:
634
+ payload = json.loads(Path(source).read_text(encoding="utf-8-sig"))
635
+ raw_entries = payload.get("sources", payload.get("datasets", []))
636
+ if not isinstance(raw_entries, list) or not raw_entries:
637
+ raise ValueError("Corpus plan must define a non-empty 'sources' list.")
638
+
639
+ entries: list[CorpusPlanEntry] = []
640
+ for index, raw_entry in enumerate(raw_entries, start=1):
641
+ if not isinstance(raw_entry, dict):
642
+ raise ValueError("Each corpus plan entry must be an object.")
643
+ source = str(raw_entry.get("source", "hf")).strip() or "hf"
644
+ name = str(
645
+ raw_entry.get("name", raw_entry.get("dataset", f"source-{index}"))
646
+ ).strip() or f"source-{index}"
647
+ raw_languages = raw_entry.get("allowed_languages", [])
648
+ allowed_languages = tuple(
649
+ str(value).strip().casefold()
650
+ for value in raw_languages
651
+ if str(value).strip()
652
+ ) if isinstance(raw_languages, list) else ()
653
+ raw_records = raw_entry.get("records", raw_entry.get("texts", []))
654
+ if source == "inline" and not isinstance(raw_records, list):
655
+ raise ValueError("Inline corpus plan entries must provide a records/texts list.")
656
+ entries.append(
657
+ CorpusPlanEntry(
658
+ source=source,
659
+ name=name,
660
+ dataset=str(raw_entry.get("dataset", "")),
661
+ path=str(raw_entry.get("path", raw_entry.get("file", ""))),
662
+ config=(
663
+ str(raw_entry["config"])
664
+ if raw_entry.get("config") is not None
665
+ else None
666
+ ),
667
+ split=str(raw_entry.get("split", "train")),
668
+ limit=int(raw_entry.get("limit", 0)),
669
+ weight=float(raw_entry.get("weight", 1.0)),
670
+ text_field=(
671
+ str(raw_entry["text_field"])
672
+ if raw_entry.get("text_field") is not None
673
+ else None
674
+ ),
675
+ min_words=int(raw_entry.get("min_words", 0)),
676
+ max_words=int(raw_entry.get("max_words", 0)),
677
+ min_alpha_ratio=float(raw_entry.get("min_alpha_ratio", 0.0)),
678
+ allowed_languages=allowed_languages,
679
+ records=tuple(raw_records) if isinstance(raw_records, list) else (),
680
+ streaming=bool(raw_entry.get("streaming", True)),
681
+ trust_remote_code=bool(raw_entry.get("trust_remote_code", False)),
682
+ )
683
+ )
684
+ return entries
685
+
686
+
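
A minimal corpus plan matching the fields parsed above ("source" defaults to "hf", and inline entries must carry a records/texts list); the dataset id here is a placeholder:

```python
import json
from pathlib import Path

plan = {
    "sources": [
        {"source": "inline", "name": "smoke-test",
         "records": ["Reframr streams documents into computed weights."]},
        {"source": "hf", "dataset": "user/dataset", "split": "train",
         "limit": 100, "weight": 0.5, "min_words": 5},
    ]
}
Path("plan.json").write_text(json.dumps(plan), encoding="utf-8")
entries = load_corpus_plan("plan.json")
print([entry.name for entry in entries])   # ['smoke-test', 'user/dataset']
```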
687
+ def _iter_hf_rows(entry: CorpusPlanEntry) -> Iterator[dict[str, object]]:
688
+ try:
689
+ from datasets import load_dataset
690
+ except ModuleNotFoundError:
691
+ user_site = site.getusersitepackages()
692
+ if user_site and user_site not in sys.path:
693
+ sys.path.append(user_site)
694
+ from datasets import load_dataset
695
+
696
+ dataset_kwargs: dict[str, object] = {
697
+ "split": entry.split,
698
+ "streaming": entry.streaming,
699
+ }
700
+ if entry.config:
701
+ dataset_kwargs["name"] = entry.config
702
+ if entry.trust_remote_code:
703
+ dataset_kwargs["trust_remote_code"] = True
704
+
705
+ for row in load_dataset(entry.dataset, **dataset_kwargs):
706
+ yield dict(row)
707
+
708
+
709
+ def _iter_file_rows(entry: CorpusPlanEntry) -> Iterator[dict[str, object]]:
710
+ raw_path = entry.path or entry.dataset
711
+ if not raw_path:
712
+ raise ValueError("File corpus plan entries must provide a path.")
713
+ path = Path(raw_path)
714
+ suffix = path.suffix.lower()
715
+ if suffix == ".jsonl":
716
+ with path.open("r", encoding="utf-8") as handle:
717
+ for line in handle:
718
+ if line.strip():
719
+ row = json.loads(line)
720
+ yield row if isinstance(row, dict) else {"text": str(row)}
721
+ return
722
+ if suffix == ".json":
723
+ payload = json.loads(path.read_text(encoding="utf-8"))
724
+ if isinstance(payload, list):
725
+ for row in payload:
726
+ yield row if isinstance(row, dict) else {"text": str(row)}
727
+ return
728
+ if isinstance(payload, dict):
729
+ rows = payload.get("records", payload.get("texts"))
730
+ if isinstance(rows, list):
731
+ for row in rows:
732
+ yield row if isinstance(row, dict) else {"text": str(row)}
733
+ return
734
+ yield payload
735
+ return
736
+ if suffix in {".txt", ".md", ".text"}:
737
+ yield {"text": path.read_text(encoding="utf-8")}
738
+ return
739
+ raise ValueError(f"Unsupported file corpus source: {path}")
740
+
741
+
742
+ def iter_corpus_plan_documents(plan: Iterable[CorpusPlanEntry]) -> Iterator[StreamDocument]:
743
+ for entry in plan:
744
+ accepted = 0
745
+ attempts = 0
746
+ while True:
747
+ accepted_seen_this_attempt = 0
748
+ try:
749
+ if entry.source == "inline":
750
+ row_iterator = (
751
+ item if isinstance(item, dict) else {"text": str(item)}
752
+ for item in entry.records
753
+ )
754
+ elif entry.source == "hf":
755
+ row_iterator = _iter_hf_rows(entry)
756
+ elif entry.source == "file":
757
+ row_iterator = _iter_file_rows(entry)
758
+ else:
759
+ raise ValueError(f"Unsupported corpus plan source: {entry.source}")
760
+
761
+ for row in row_iterator:
762
+ language = _row_language(row)
763
+ _, rejected_text = _extract_preference_pair(row)
764
+ text = clean_training_text(_extract_row_text(row, entry.text_field))
765
+ if not _passes_text_quality(text, language, entry):
766
+ continue
767
+ accepted_seen_this_attempt += 1
768
+ if accepted_seen_this_attempt <= accepted:
769
+ continue
770
+ yield StreamDocument(
771
+ text=text,
772
+ weight=entry.weight,
773
+ source=entry.name,
774
+ language=language,
775
+ preference_rejected_text=rejected_text,
776
+ )
777
+ accepted += 1
778
+ if entry.limit > 0 and accepted >= entry.limit:
779
+ break
780
+ break
781
+ except Exception as exc:
782
+ if entry.source != "hf":
783
+ raise
784
+ if attempts >= HF_STREAM_MAX_RETRIES:
785
+ print(
786
+ f"[source] {entry.name} skipped after {attempts} retries; "
787
+ f"accepted {accepted} documents before final error: {exc}"
788
+ )
789
+ break
790
+ attempts += 1
791
+ delay = min(
792
+ 15.0,
793
+ HF_STREAM_RETRY_BASE_DELAY_SECONDS * (2 ** (attempts - 1)),
794
+ )
795
+ print(
796
+ f"[source] {entry.name} stream interrupted after {accepted} accepted "
797
+ f"documents; retry {attempts}/{HF_STREAM_MAX_RETRIES} in {delay:.2f}s: {exc}"
798
+ )
799
+ time.sleep(delay)
800
+
801
+
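
The retry delays follow exponential backoff from the constants above, capped at 15 seconds:

```python
delays = [
    min(15.0, HF_STREAM_RETRY_BASE_DELAY_SECONDS * (2 ** (attempt - 1)))
    for attempt in range(1, HF_STREAM_MAX_RETRIES + 1)
]
print(delays)   # [0.25, 0.5, 1.0, 2.0, 4.0]
```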
802
+ def _log_progress(label: str, processed: int, log_every: int) -> None:
803
+ if log_every > 0 and processed % log_every == 0:
804
+ print(f"[{label}] processed {processed} documents")
805
+
806
+
807
+ def _answer_boundary(tokens: list[str]) -> int | None:
808
+ try:
809
+ return tokens.index("<answer>")
810
+ except ValueError:
811
+ return None
812
+
813
+
814
+ def _weighted_text_parts_for_statistics(text: str, document_weight: float) -> list[tuple[str, float]]:
815
+ if "<answer>" not in text:
816
+ return [(text, document_weight)]
817
+ context, answer = text.split("<answer>", 1)
818
+ context = clean_context_text(context.replace("<reason>", " "))
819
+ answer = clean_answer_text(answer)
820
+ parts: list[tuple[str, float]] = []
821
+ if context:
822
+ parts.append((context, document_weight * CONTEXT_STAT_WEIGHT))
823
+ if answer:
824
+ parts.append((answer, document_weight * ANSWER_READOUT_WEIGHT))
825
+ return parts or [(text, document_weight)]
826
+
827
+
828
+ def _weighted_token_sequences_for_statistics(
829
+ tokens: list[str],
830
+ tokenizer: NativeTokenizer,
831
+ document_weight: float,
832
+ ) -> list[tuple[list[str], float]]:
833
+ answer_index = _answer_boundary(tokens)
834
+ if answer_index is None:
835
+ sequence = [token for token in tokens if token not in tokenizer.special_tokens]
836
+ return [(sequence, document_weight)] if sequence else []
837
+ context_tokens = [
838
+ token for token in tokens[:answer_index] if token not in tokenizer.special_tokens
839
+ ]
840
+ answer_tokens = [
841
+ token for token in tokens[answer_index + 1 :] if token not in tokenizer.special_tokens
842
+ ]
843
+ sequences: list[tuple[list[str], float]] = []
844
+ if context_tokens:
845
+ sequences.append((context_tokens, document_weight * CONTEXT_STAT_WEIGHT))
846
+ if answer_tokens:
847
+ sequences.append((answer_tokens, document_weight * ANSWER_READOUT_WEIGHT))
848
+ return sequences
849
+
850
+
851
+ def _readout_weight_for_target(
852
+ answer_index: int | None,
853
+ target_index: int,
854
+ document_weight: float,
855
+ ) -> float:
856
+ if answer_index is None:
857
+ return document_weight * PLAIN_TEXT_READOUT_WEIGHT
858
+ if target_index <= answer_index:
859
+ return document_weight * CONTEXT_READOUT_WEIGHT
860
+ return document_weight * ANSWER_READOUT_WEIGHT
861
+
862
+
863
+ def _answer_payload_tokens(tokens: list[str], tokenizer: NativeTokenizer) -> list[str]:
864
+ answer_index = _answer_boundary(tokens)
865
+ payload = tokens[answer_index + 1 :] if answer_index is not None else tokens
866
+ return [token for token in payload if token not in tokenizer.special_tokens]
867
+
868
+
869
+ def _standardized_preference_bias(values: object, active_mask: object | None = None) -> list[float]:
870
+ if np is not None:
871
+ bias = np.asarray(values, dtype=np.float64)
872
+ if bias.size == 0:
873
+ return []
874
+ active = (
875
+ np.asarray(active_mask, dtype=bool)
876
+ if active_mask is not None
877
+ else np.ones(bias.shape, dtype=bool)
878
+ )
879
+ if not np.any(active):
880
+ return [0.0 for _ in range(int(bias.size))]
881
+ active_values = bias[active]
882
+ spread = float(active_values.std())
883
+ if spread <= 1e-12:
884
+ return [0.0 for _ in range(int(bias.size))]
885
+ standardized = np.zeros_like(bias, dtype=np.float64)
886
+ standardized[active] = (
887
+ (active_values - float(active_values.mean())) / spread
888
+ ) * PREFERENCE_BIAS_SCALE
889
+ return np.clip(standardized, -2.5, 2.5).astype(float).tolist()
890
+ raw_values = [float(value) for value in values]
891
+ if not raw_values:
892
+ return []
893
+ average = sum(raw_values) / len(raw_values)
894
+ variance = sum((value - average) * (value - average) for value in raw_values) / len(raw_values)
895
+ spread = variance**0.5
896
+ if spread <= 1e-12:
897
+ return [0.0 for _ in raw_values]
898
+ active_indices = (
899
+ [
900
+ index
901
+ for index, active in enumerate(active_mask)
902
+ if active
903
+ ]
904
+ if active_mask is not None
905
+ else list(range(len(raw_values)))
906
+ )
907
+ if not active_indices:
908
+ return [0.0 for _ in raw_values]
909
+ active_values = [raw_values[index] for index in active_indices]
910
+ average = mean(active_values)
911
+ spread = (mean([(value - average) * (value - average) for value in active_values])) ** 0.5
912
+ if spread <= 1e-12:
913
+ return [0.0 for _ in raw_values]
914
+ standardized = [0.0 for _ in raw_values]
915
+ for index in active_indices:
916
+ standardized[index] = max(
917
+ -2.5,
918
+ min(2.5, ((raw_values[index] - average) / spread) * PREFERENCE_BIAS_SCALE),
919
+ )
920
+ return standardized
921
+
922
+
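
Active entries are z-scored over the active subset, scaled by `PREFERENCE_BIAS_SCALE`, and clipped to ±2.5, while masked-out entries stay at zero. A check on a three-token vocabulary:

```python
bias = _standardized_preference_bias([1.0, 3.0, 0.0], active_mask=[True, True, False])
# The active pair has mean 2.0 and population std 1.0, so:
print(bias)   # ≈ [-0.95, 0.95, 0.0]
```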
923
+ def _candidate_preference_bias_from_state_vector(
924
+ model: ReframrModel,
925
+ preference_state: object,
926
+ ) -> object:
927
+ if np is None:
928
+ return None
929
+ assert model.embedding_model is not None
930
+ assert model.memory_units is not None
931
+ assert model.ternary_mask is not None
932
+
933
+ embeddings = np.asarray(model.embedding_model.embeddings, dtype=np.float64)
934
+ if embeddings.size == 0:
935
+ return np.zeros(0, dtype=np.float64)
936
+ state_vector = np.asarray(preference_state, dtype=np.float64)
937
+ mask = np.asarray(model.ternary_mask, dtype=np.float64) * float(model.ternary_scale)
938
+ if state_vector.shape[0] != mask.shape[0]:
939
+ return np.zeros(embeddings.shape[0], dtype=np.float64)
940
+
941
+ state_indices = np.arange(model.config.state_dim, dtype=np.int64)
942
+ drive = (
943
+ embeddings[:, state_indices % model.config.embedding_dim]
944
+ + (0.5 * embeddings[:, (3 * state_indices + 1) % model.config.embedding_dim])
945
+ - (0.25 * embeddings[:, (5 * state_indices + 2) % model.config.embedding_dim])
946
+ )
947
+ scores = np.zeros(embeddings.shape[0], dtype=np.float64)
948
+ offset = 0
949
+ for unit in model.memory_units:
950
+ hidden_end = offset + model.config.state_dim
951
+ trace_end = hidden_end + model.config.embedding_dim
952
+ hidden_pref = state_vector[offset:hidden_end] * mask[offset:hidden_end]
953
+ trace_pref = state_vector[hidden_end:trace_end] * mask[hidden_end:trace_end]
954
+ hidden_delta_axis = np.asarray(unit.input_projection, dtype=np.float64) * hidden_pref
955
+ trace_gain = 1.0 - (1.0 / (1.0 + unit.timescale))
956
+ scores += drive @ hidden_delta_axis
957
+ scores += embeddings @ (trace_gain * trace_pref)
958
+ offset = trace_end
959
+ return scores
960
+
961
+
962
+ def _derive_preference_bias_from_pairs(
963
+ model: ReframrModel,
964
+ preference_token_pairs: list[tuple[list[str], list[str], float]],
965
+ tokenizer: NativeTokenizer,
966
+ ) -> tuple[list[float], int]:
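+ # Build a per-token preference bias from (chosen, rejected) answer pairs:
+ # chosen-answer tokens gain weight, rejected-answer tokens lose weight,
+ # and a strided subsample of pairs (capped at MAX_PREFERENCE_STATE_PAIRS)
+ # contributes a decode-state delta that is projected back onto candidate
+ # tokens before standardization.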
967
+ assert model.embedding_model is not None
968
+ vocab_size = len(model.embedding_model.id_to_token)
969
+ if not preference_token_pairs:
970
+ return [0.0 for _ in range(vocab_size)], 0
971
+
972
+ if np is not None:
973
+ token_bias = np.zeros(vocab_size, dtype=np.float64)
974
+ active_token_mask = np.zeros(vocab_size, dtype=bool)
975
+ state_delta = np.zeros(model._combined_state_width(), dtype=np.float64)
976
+ else:
977
+ token_bias = [0.0 for _ in range(vocab_size)]
978
+ active_token_ids: set[int] = set()
979
+ state_delta = [0.0 for _ in range(model._combined_state_width())]
980
+ pair_weight_total = 0.0
981
+ state_pair_count = 0
982
+ state_stride = max(
983
+ 1,
984
+ (len(preference_token_pairs) + MAX_PREFERENCE_STATE_PAIRS - 1)
985
+ // MAX_PREFERENCE_STATE_PAIRS,
986
+ )
987
+
988
+ for pair_index, (chosen_tokens, rejected_tokens, pair_weight) in enumerate(preference_token_pairs):
989
+ chosen_answer = _answer_payload_tokens(chosen_tokens, tokenizer)
990
+ rejected_answer = _answer_payload_tokens(rejected_tokens, tokenizer)
991
+ if chosen_answer:
992
+ delta = pair_weight / max(1, len(chosen_answer))
993
+ for token in chosen_answer:
994
+ token_id = model.embedding_model.token_to_id.get(token)
995
+ if token_id is not None:
996
+ token_bias[token_id] += delta
997
+ if np is not None:
998
+ active_token_mask[token_id] = True
999
+ else:
1000
+ active_token_ids.add(token_id)
1001
+ if rejected_answer:
1002
+ delta = pair_weight / max(1, len(rejected_answer))
1003
+ for token in rejected_answer:
1004
+ token_id = model.embedding_model.token_to_id.get(token)
1005
+ if token_id is not None:
1006
+ token_bias[token_id] -= delta
1007
+ if np is not None:
1008
+ active_token_mask[token_id] = True
1009
+ else:
1010
+ active_token_ids.add(token_id)
1011
+
1012
+ if pair_index % state_stride != 0 or state_pair_count >= MAX_PREFERENCE_STATE_PAIRS:
1013
+ continue
1014
+ chosen_state = model._masked_decode_state(model._build_decode_state(chosen_tokens))
1015
+ rejected_state = model._masked_decode_state(model._build_decode_state(rejected_tokens))
1016
+ if len(chosen_state) != len(rejected_state):
1017
+ continue
1018
+ pair_weight_total += pair_weight
1019
+ state_pair_count += 1
1020
+ if np is not None:
1021
+ state_delta += pair_weight * (
1022
+ np.asarray(chosen_state, dtype=np.float64)
1023
+ - np.asarray(rejected_state, dtype=np.float64)
1024
+ )
1025
+ else:
1026
+ for index, (chosen_value, rejected_value) in enumerate(zip(chosen_state, rejected_state)):
1027
+ state_delta[index] += pair_weight * (chosen_value - rejected_value)
1028
+
1029
+ if pair_weight_total > 0.0:
1030
+ if np is not None:
1031
+ state_delta = state_delta / pair_weight_total
1032
+ candidate_bias = _candidate_preference_bias_from_state_vector(model, state_delta)
1033
+ if candidate_bias is not None:
1034
+ token_bias[active_token_mask] = (
1035
+ token_bias[active_token_mask] + candidate_bias[active_token_mask]
1036
+ )
1037
+ else:
1038
+ state_delta = [value / pair_weight_total for value in state_delta]
1039
+ if np is not None:
1040
+ return _standardized_preference_bias(token_bias, active_token_mask), state_pair_count
1041
+ active_mask = [index in active_token_ids for index in range(vocab_size)]
1042
+ return _standardized_preference_bias(token_bias, active_mask), state_pair_count
1043
+
1044
+
1045
+ def _solve_weighted_prompt_readout(
1046
+ states: list[Vector],
1047
+ labels: list[int],
1048
+ weights: list[float],
1049
+ *,
1050
+ vocab_size: int,
1051
+ diagonal: object,
1052
+ state_offset: object,
1053
+ regularization: float,
1054
+ ) -> tuple[object, object, int]:
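+ # Solve a weighted ridge-regression readout from masked, centered states
+ # to next-token labels; returns (readout weights, label-frequency bias,
+ # example count). Falls back to empty outputs when NumPy is unavailable
+ # or no valid examples remain after filtering.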
1055
+ if np is None or not states or not labels or not weights:
1056
+ return [], [0.0 for _ in range(vocab_size)], 0
1057
+ state_matrix = np.asarray(states, dtype=np.float64)
1058
+ label_array = np.asarray(labels, dtype=np.int64)
1059
+ weight_vector = np.asarray(weights, dtype=np.float64)
1060
+ valid_mask = (
1061
+ (label_array >= 0)
1062
+ & (label_array < vocab_size)
1063
+ & (weight_vector > 0.0)
1064
+ )
1065
+ if not np.any(valid_mask):
1066
+ return [], [0.0 for _ in range(vocab_size)], 0
1067
+ state_matrix = state_matrix[valid_mask]
1068
+ label_array = label_array[valid_mask]
1069
+ weight_vector = weight_vector[valid_mask]
1070
+ diagonal_array = np.asarray(diagonal, dtype=np.float64)
1071
+ offset_array = np.asarray(state_offset, dtype=np.float64)
1072
+ if (
1073
+ len(state_matrix.shape) != 2
1074
+ or diagonal_array.shape[0] != state_matrix.shape[1]
1075
+ or offset_array.shape[0] != state_matrix.shape[1]
1076
+ ):
1077
+ return [], [0.0 for _ in range(vocab_size)], 0
1078
+ masked_states = state_matrix * diagonal_array[None, :]
1079
+ centered_states = masked_states - offset_array[None, :]
1080
+ weighted_centered_states = weight_vector[:, None] * centered_states
1081
+ gram = centered_states.T @ weighted_centered_states
1082
+ cross = np.zeros((vocab_size, centered_states.shape[1]), dtype=np.float64)
1083
+ np.add.at(cross, label_array, weighted_centered_states)
1084
+ total_weight = float(weight_vector.sum())
1085
+ if total_weight <= 0.0:
1086
+ return [], [0.0 for _ in range(vocab_size)], 0
1087
+ bias = np.zeros(vocab_size, dtype=np.float64)
1088
+ np.add.at(bias, label_array, weight_vector)
1089
+ bias /= total_weight
1090
+ readout = ridge_regression_readout_from_moments(
1091
+ gram,
1092
+ cross,
1093
+ regularization=regularization,
1094
+ )
1095
+ return readout, bias, int(label_array.shape[0])
1096
+
1097
+
1098
+ def fit_model_from_corpus_plan(
1099
+ plan: Iterable[CorpusPlanEntry],
1100
+ config: ReframrConfig,
1101
+ *,
1102
+ log_every: int = 0,
1103
+ ) -> tuple[ReframrModel, dict[str, object]]:
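+ # End-to-end streaming fit: tokenizer training, vocabulary and
+ # co-occurrence statistics, PPMI embeddings, recurrent state collection
+ # via reservoir sampling, ridge readout solving, preference bias, and
+ # associative-memory key tables. Returns the computed model plus a
+ # payload of fit statistics keyed by stage.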
1104
+ entries = list(plan)
1105
+ if not entries:
1106
+ raise ValueError("Cannot fit REFRAMR without any corpus plan entries.")
1107
+ stage_seconds: dict[str, float] = {}
1108
+ stage_started = time.perf_counter()
1109
+
1110
+ def finish_stage(name: str) -> None:
1111
+ nonlocal stage_started
1112
+ now = time.perf_counter()
1113
+ elapsed = round(now - stage_started, 6)
1114
+ stage_seconds[name] = elapsed
1115
+ if log_every > 0:
1116
+ print(f"[stage] {name} finished in {elapsed:.3f}s")
1117
+ stage_started = now
1118
+
1119
+ seed_tokenizer = NativeTokenizer(
1120
+ merges=[],
1121
+ vocab=[],
1122
+ base_symbols=[],
1123
+ lowercase=config.lowercase,
1124
+ )
1125
+ segment_counts: Counter[str] = Counter()
1126
+ source_counts: dict[str, int] = {}
1127
+ documents: list[StreamDocument] = []
1128
+ processed = 0
1129
+ for entry in entries:
1130
+ if log_every > 0:
1131
+ print(f"[source] {entry.name} started")
1132
+ source_start = processed
1133
+ for document in iter_corpus_plan_documents([entry]):
1134
+ documents.append(document)
1135
+ processed += 1
1136
+ source_counts[document.source] = source_counts.get(document.source, 0) + 1
1137
+ for text_part, part_weight in _weighted_text_parts_for_statistics(
1138
+ document.text,
1139
+ document.weight,
1140
+ ):
1141
+ for segment in seed_tokenizer.pretokenize(text_part):
1142
+ segment_counts[segment] += part_weight
1143
+ if document.preference_rejected_text:
1144
+ rejected_weight = document.weight * PREFERENCE_REJECTED_TOKENIZER_WEIGHT
1145
+ for text_part, part_weight in _weighted_text_parts_for_statistics(
1146
+ document.preference_rejected_text,
1147
+ rejected_weight,
1148
+ ):
1149
+ for segment in seed_tokenizer.pretokenize(text_part):
1150
+ segment_counts[segment] += part_weight
1151
+ _log_progress("tokenizer", processed, log_every)
1152
+ if log_every > 0:
1153
+ print(f"[source] {entry.name} accepted {processed - source_start} documents")
1154
+ if processed == 0:
1155
+ raise ValueError("Corpus plan did not yield any usable documents after filtering.")
1156
+ finish_stage("stream_and_segment")
1157
+ tokenizer = NativeTokenizer.train_from_segment_counts(
1158
+ segment_counts,
1159
+ vocab_size=config.tokenizer_vocab_size,
1160
+ min_pair_frequency=config.tokenizer_min_pair_frequency,
1161
+ lowercase=config.lowercase,
1162
+ )
1163
+ finish_stage("tokenizer_fit")
1164
+
1165
+ token_counts: Counter[str] = Counter()
1166
+ raw_tokenized_documents: list[list[str]] = []
1167
+ raw_rejected_tokenized_documents: list[list[str]] = []
1168
+ processed = 0
1169
+ for document in documents:
1170
+ processed += 1
1171
+ tokens = tokenizer.encode(document.text)
1172
+ raw_tokenized_documents.append(tokens)
1173
+ for token in tokens:
1174
+ if token in tokenizer.special_tokens:
1175
+ token_counts[token] += document.weight
1176
+ for token_sequence, sequence_weight in _weighted_token_sequences_for_statistics(
1177
+ tokens,
1178
+ tokenizer,
1179
+ document.weight,
1180
+ ):
1181
+ for token in token_sequence:
1182
+ token_counts[token] += sequence_weight
1183
+ rejected_tokens = (
1184
+ tokenizer.encode(document.preference_rejected_text)
1185
+ if document.preference_rejected_text
1186
+ else []
1187
+ )
1188
+ raw_rejected_tokenized_documents.append(rejected_tokens)
1189
+ rejected_weight = document.weight * PREFERENCE_REJECTED_TOKENIZER_WEIGHT
1190
+ for token in rejected_tokens:
1191
+ if token in tokenizer.special_tokens:
1192
+ token_counts[token] += rejected_weight
1193
+ for token_sequence, sequence_weight in _weighted_token_sequences_for_statistics(
1194
+ rejected_tokens,
1195
+ tokenizer,
1196
+ rejected_weight,
1197
+ ):
1198
+ for token in token_sequence:
1199
+ token_counts[token] += sequence_weight
1200
+ _log_progress("vocab", processed, log_every)
1201
+ token_to_id, id_to_token = build_vocabulary_from_counts(
1202
+ token_counts,
1203
+ min_frequency=config.min_frequency,
1204
+ max_vocab=config.max_vocab,
1205
+ )
1206
+ if not id_to_token:
1207
+ raise ValueError("Streaming recompute could not derive an embedding vocabulary.")
1208
+ finish_stage("vocabulary")
1209
+
1210
+ cooccurrence = StreamingCooccurrenceAccumulator(token_to_id, config.window_size)
1211
+ tokenized_documents: list[list[str]] = []
1212
+ preference_token_pairs: list[tuple[list[str], list[str], float]] = []
1213
+ processed = 0
1214
+ for document, raw_tokens, raw_rejected_tokens in zip(
1215
+ documents,
1216
+ raw_tokenized_documents,
1217
+ raw_rejected_tokenized_documents,
1218
+ ):
1219
+ processed += 1
1220
+ tokens = [token for token in raw_tokens if token in token_to_id]
1221
+ tokenized_documents.append(tokens)
1222
+ rejected_tokens = [token for token in raw_rejected_tokens if token in token_to_id]
1223
+ if len(tokens) > 1 and len(rejected_tokens) > 1:
1224
+ preference_token_pairs.append((tokens, rejected_tokens, document.weight))
1225
+ for token_sequence, sequence_weight in _weighted_token_sequences_for_statistics(
1226
+ tokens,
1227
+ tokenizer,
1228
+ document.weight,
1229
+ ):
1230
+ if len(token_sequence) > 1:
1231
+ cooccurrence.update_tokens(token_sequence, weight=sequence_weight)
1232
+ _log_progress("cooccurrence", processed, log_every)
1233
+ finish_stage("cooccurrence")
1234
+ if np is not None:
1235
+ embedding_model = fit_randomized_ppmi_embedding_from_counts(
1236
+ id_to_token,
1237
+ cooccurrence.rows,
1238
+ embedding_dim=config.embedding_dim,
1239
+ )
1240
+ else:
1241
+ embedding_model = fit_ppmi_embedding_from_cooccurrence(
1242
+ id_to_token,
1243
+ cooccurrence.to_sparse(),
1244
+ embedding_dim=config.embedding_dim,
1245
+ )
1246
+ finish_stage("embedding")
1247
+
1248
+ model = ReframrModel(config)
1249
+ model.tokenizer = tokenizer
1250
+ model.embedding_model = embedding_model
1251
+ model.memory_units = [
1252
+ AnalyticalMemoryUnit(config.state_dim, timescale)
1253
+ for timescale in config.timescales
1254
+ ]
1255
+ model.trace_token_weights = model._derive_trace_token_weights_from_counts(token_counts)
1256
+
1257
+ feature_count = len(model._zero_combined_state())
1258
+ if np is not None:
1259
+ feature_second_moment = np.zeros(feature_count, dtype=np.float64)
1260
+ raw_cross = np.zeros((len(embedding_model.id_to_token), feature_count), dtype=np.float64)
1261
+ else:
1262
+ feature_second_moment = zeros_vector(feature_count)
1263
+ raw_cross = zeros(len(embedding_model.id_to_token), feature_count)
1264
+ example_weight_total = 0.0
1265
+ has_answer_targets = any(_answer_boundary(tokens) is not None for tokens in tokenized_documents)
1266
+ if config.max_training_examples is None:
1267
+ answer_reservoir_capacity = None
1268
+ general_reservoir_capacity = None
1269
+ elif config.max_training_examples <= 0:
1270
+ answer_reservoir_capacity = 0
1271
+ general_reservoir_capacity = 0
1272
+ elif has_answer_targets:
1273
+ answer_reservoir_capacity = max(1, int(config.max_training_examples * 0.75))
1274
+ general_reservoir_capacity = max(0, config.max_training_examples - answer_reservoir_capacity)
1275
+ else:
1276
+ answer_reservoir_capacity = 0
1277
+ general_reservoir_capacity = config.max_training_examples
1278
+ answer_sequence_capacity = MAX_ANSWER_SEQUENCE_EXAMPLES if has_answer_targets else 0
1279
+ answer_reservoir = StateReservoir(answer_reservoir_capacity, seed=17)
1280
+ general_reservoir = StateReservoir(general_reservoir_capacity, seed=13)
1281
+ answer_intent_reservoir = StateReservoir(answer_reservoir_capacity, seed=29)
1282
+ answer_start_reservoir = StateReservoir(answer_reservoir_capacity, seed=37)
1283
+ answer_sequence_reservoir = SequenceReservoir(answer_sequence_capacity, seed=41)
1284
+ moment_reservoir = StateReservoir(
1285
+ config.max_training_examples,
1286
+ seed=31,
1287
+ )
1288
+ transitions = TransitionAccumulator(
1289
+ max_contexts_per_order=config.max_transition_contexts_per_order,
1290
+ max_next_tokens=config.max_transition_next_tokens,
1291
+ )
1292
+ if np is not None:
1293
+ target_label_mass = np.zeros(len(embedding_model.id_to_token), dtype=np.float64)
1294
+ else:
1295
+ target_label_mass = zeros_vector(len(embedding_model.id_to_token))
1296
+ for document, tokens in zip(documents, tokenized_documents):
1297
+ answer_index = _answer_boundary(tokens)
1298
+ for index in range(len(tokens) - 1):
1299
+ next_token = tokens[index + 1]
1300
+ if tokenizer is not None and next_token in tokenizer.special_tokens:
1301
+ continue
1302
+ next_token_id = embedding_model.token_to_id.get(next_token, -1)
1303
+ if next_token_id < 0:
1304
+ continue
1305
+ label_weight = _readout_weight_for_target(answer_index, index + 1, document.weight)
1306
+ if label_weight > 0.0:
1307
+ target_label_mass[next_token_id] += label_weight
1308
+ if np is not None:
1309
+ positive_label_mass = target_label_mass[target_label_mass > 0.0]
1310
+ reference_label_mass = (
1311
+ float(np.median(positive_label_mass))
1312
+ if positive_label_mass.size
1313
+ else 1.0
1314
+ )
1315
+ target_balance = np.ones(len(embedding_model.id_to_token), dtype=np.float64)
1316
+ np.divide(
1317
+ reference_label_mass,
1318
+ np.maximum(target_label_mass, 1e-12),
1319
+ out=target_balance,
1320
+ where=target_label_mass > 0.0,
1321
+ )
1322
+ target_balance = np.clip(np.sqrt(target_balance), 0.25, 4.0)
1323
+ else:
1324
+ positive_label_mass = [value for value in target_label_mass if value > 0.0]
1325
+ if positive_label_mass:
1326
+ sorted_mass = sorted(positive_label_mass)
1327
+ reference_label_mass = sorted_mass[len(sorted_mass) // 2]
1328
+ else:
1329
+ reference_label_mass = 1.0
1330
+ target_balance = [
1331
+ max(0.25, min(4.0, (reference_label_mass / max(value, 1e-12)) ** 0.5))
1332
+ if value > 0.0
1333
+ else 1.0
1334
+ for value in target_label_mass
1335
+ ]
1336
+ processed = 0
1337
+ embedding_array = (
1338
+ np.asarray(embedding_model.embeddings, dtype=RUNTIME_ARRAY_DTYPE)
1339
+ if np is not None
1340
+ else None
1341
+ )
1342
+ trace_embedding_array = (
1343
+ model._build_trace_embedding_table_array(embedding_array)
1344
+ if np is not None and embedding_array is not None
1345
+ else None
1346
+ )
1347
+ if np is not None:
1348
+ trace_decay = np.asarray(
1349
+ [1.0 / (1.0 + unit.timescale) for unit in model.memory_units],
1350
+ dtype=RUNTIME_ARRAY_DTYPE,
1351
+ )
1352
+ trace_gain = 1.0 - trace_decay
1353
+ transition_stack = np.asarray(
1354
+ [unit.transition for unit in model.memory_units],
1355
+ dtype=RUNTIME_ARRAY_DTYPE,
1356
+ )
1357
+ input_projection_stack = np.asarray(
1358
+ [unit.input_projection for unit in model.memory_units],
1359
+ dtype=RUNTIME_ARRAY_DTYPE,
1360
+ )
1361
+ drive_indices = np.arange(config.state_dim, dtype=np.int64)
1362
+ drive_primary = drive_indices % config.embedding_dim
1363
+ drive_secondary = (3 * drive_indices + 1) % config.embedding_dim
1364
+ drive_tertiary = (5 * drive_indices + 2) % config.embedding_dim
1365
+ else:
1366
+ trace_decay = None
1367
+ trace_gain = None
1368
+ transition_stack = None
1369
+ input_projection_stack = None
1370
+ drive_primary = None
1371
+ drive_secondary = None
1372
+ drive_tertiary = None
1373
+ for document, tokens in zip(documents, tokenized_documents):
1374
+ processed += 1
1375
+ if len(tokens) < 2:
1376
+ _log_progress("state", processed, log_every)
1377
+ continue
1378
+
1379
+ answer_index = _answer_boundary(tokens)
1380
+ for token_sequence, sequence_weight in _weighted_token_sequences_for_statistics(
1381
+ tokens,
1382
+ tokenizer,
1383
+ document.weight,
1384
+ ):
1385
+ if len(token_sequence) > 1:
1386
+ transitions.update_tokens(token_sequence, weight=sequence_weight)
1387
+ if np is not None:
1388
+ hidden_state_matrix = np.zeros((len(config.timescales), config.state_dim), dtype=RUNTIME_ARRAY_DTYPE)
1389
+ context_trace_matrix = np.zeros((len(config.timescales), config.embedding_dim), dtype=RUNTIME_ARRAY_DTYPE)
1390
+ else:
1391
+ hidden_states = [zeros_vector(config.state_dim) for _ in config.timescales]
1392
+ context_traces = [zeros_vector(config.embedding_dim) for _ in config.timescales]
1393
+ answer_anchor_state = None
1394
+ for index in range(len(tokens) - 1):
1395
+ token = tokens[index]
1396
+ token_id = embedding_model.token_to_id.get(token, -1)
1397
+ if (
1398
+ np is not None
1399
+ and embedding_array is not None
1400
+ and trace_decay is not None
1401
+ and trace_gain is not None
1402
+ and transition_stack is not None
1403
+ and input_projection_stack is not None
1404
+ and drive_primary is not None
1405
+ and drive_secondary is not None
1406
+ and drive_tertiary is not None
1407
+ and trace_embedding_array is not None
1408
+ and token_id >= 0
1409
+ ):
1410
+ embedding = embedding_array[token_id]
1411
+ trace_embedding = trace_embedding_array[token_id]
1412
+ drive = (
1413
+ embedding[drive_primary]
1414
+ + (0.5 * embedding[drive_secondary])
1415
+ - (0.25 * embedding[drive_tertiary])
1416
+ )
1417
+ hidden_state_matrix = (
1418
+ (transition_stack @ hidden_state_matrix[:, :, None])[:, :, 0]
1419
+ + (input_projection_stack * drive[None, :])
1420
+ )
1421
+ context_trace_matrix = (
1422
+ context_trace_matrix + (trace_gain[:, None] * trace_embedding[None, :])
1423
+ )
1424
+ else:
1425
+ hidden_states, context_traces, combined_state = model._step_hidden_states(
1426
+ hidden_states,
1427
+ context_traces,
1428
+ token,
1429
+ )
1430
+ if token == "<answer>":
1431
+ if np is not None:
1432
+ answer_anchor_state = np.concatenate(
1433
+ (hidden_state_matrix, context_trace_matrix),
1434
+ axis=1,
1435
+ ).reshape(-1).copy()
1436
+ else:
1437
+ answer_anchor_state = combined_state.copy() if hasattr(combined_state, "copy") else combined_state[:]
1438
+ next_token = tokens[index + 1]
1439
+ if next_token in tokenizer.special_tokens:
1440
+ continue
1441
+ next_token_id = embedding_model.token_to_id.get(next_token, -1)
1442
+ if next_token_id < 0:
1443
+ continue
1444
+ raw_readout_weight = _readout_weight_for_target(answer_index, index + 1, document.weight)
1445
+ readout_weight = raw_readout_weight * float(target_balance[next_token_id])
1446
+ if readout_weight <= 0.0:
1447
+ continue
1448
+ moment_slot = moment_reservoir.reserve_slot(weight=readout_weight)
1449
+ is_answer_target = answer_index is not None and index + 1 > answer_index
1450
+ target_reservoir = answer_reservoir if is_answer_target else general_reservoir
1451
+ memory_weight = readout_weight * float(target_balance[next_token_id])
1452
+ answer_token_offset = (
1453
+ index - answer_index
1454
+ if is_answer_target and answer_index is not None
1455
+ else None
1456
+ )
1457
+ intent_slot = (
1458
+ answer_intent_reservoir.reserve_slot(weight=memory_weight)
1459
+ if is_answer_target and answer_anchor_state is not None
1460
+ else None
1461
+ )
1462
+ answer_start_weight = (
1463
+ raw_readout_weight * (ANSWER_START_DECAY ** answer_token_offset)
1464
+ if (
1465
+ answer_token_offset is not None
1466
+ and answer_token_offset < ANSWER_START_TOKEN_WINDOW
1467
+ )
1468
+ else 0.0
1469
+ )
1470
+ answer_start_slot = (
1471
+ answer_start_reservoir.reserve_slot(weight=answer_start_weight)
1472
+ if answer_start_weight > 0.0 and answer_anchor_state is not None
1473
+ else None
1474
+ )
1475
+ if np is not None:
1476
+ reservoir_slot = target_reservoir.reserve_slot(weight=memory_weight)
1477
+ if moment_slot is not None or reservoir_slot is not None:
1478
+ combined_state = np.concatenate(
1479
+ (hidden_state_matrix, context_trace_matrix),
1480
+ axis=1,
1481
+ ).reshape(-1).copy()
1482
+ if moment_slot is not None:
1483
+ moment_reservoir.store_reserved(
1484
+ moment_slot,
1485
+ combined_state,
1486
+ next_token_id,
1487
+ example_weight=readout_weight,
1488
+ )
1489
+ if reservoir_slot is not None:
1490
+ target_reservoir.store_reserved(reservoir_slot, combined_state, next_token_id)
1491
+ if intent_slot is not None:
1492
+ answer_intent_reservoir.store_reserved(
1493
+ intent_slot,
1494
+ answer_anchor_state,
1495
+ next_token_id,
1496
+ example_weight=memory_weight,
1497
+ )
1498
+ if answer_start_slot is not None:
1499
+ answer_start_reservoir.store_reserved(
1500
+ answer_start_slot,
1501
+ answer_anchor_state,
1502
+ next_token_id,
1503
+ example_weight=answer_start_weight * float(target_balance[next_token_id]),
1504
+ )
1505
+ else:
1506
+ reservoir_slot = target_reservoir.reserve_slot(weight=memory_weight)
+ if moment_slot is not None:
1510
+ moment_reservoir.store_reserved(
1511
+ moment_slot,
1512
+ combined_state,
1513
+ next_token_id,
1514
+ example_weight=readout_weight,
1515
+ )
1516
+ if reservoir_slot is not None:
1517
+ target_reservoir.store_reserved(reservoir_slot, combined_state, next_token_id)
1518
+ if intent_slot is not None:
1519
+ answer_intent_reservoir.store_reserved(
1520
+ intent_slot,
1521
+ answer_anchor_state,
1522
+ next_token_id,
1523
+ example_weight=memory_weight,
1524
+ )
1525
+ if answer_start_slot is not None:
1526
+ answer_start_reservoir.store_reserved(
1527
+ answer_start_slot,
1528
+ answer_anchor_state,
1529
+ next_token_id,
1530
+ example_weight=answer_start_weight * target_balance[next_token_id],
1531
+ )
1532
+ if answer_anchor_state is not None and answer_index is not None:
1533
+ prompt_token_ids = [
1534
+ embedding_model.token_to_id[token]
1535
+ for token in tokens[:answer_index]
1536
+ if token not in tokenizer.special_tokens
1537
+ and token in embedding_model.token_to_id
1538
+ ]
1539
+ answer_token_ids = [
1540
+ embedding_model.token_to_id[token]
1541
+ for token in tokens[answer_index + 1 :]
1542
+ if token not in tokenizer.special_tokens
1543
+ and token in embedding_model.token_to_id
1544
+ ]
1545
+ answer_sequence_reservoir.consider(
1546
+ answer_anchor_state,
1547
+ prompt_token_ids,
1548
+ answer_token_ids,
1549
+ weight=document.weight * ANSWER_READOUT_WEIGHT,
1550
+ )
1551
+ _log_progress("state", processed, log_every)
1552
+
1553
+ moment_states = moment_reservoir.states
1554
+ moment_labels = moment_reservoir.labels
1555
+ moment_weights = moment_reservoir.weights
1556
+ example_weight_total = sum(moment_weights)
1557
+ if np is not None and moment_states:
1558
+ state_matrix = np.asarray(moment_states, dtype=np.float64)
1559
+ weight_vector = np.asarray(moment_weights, dtype=np.float64)
1560
+ weighted_states = weight_vector[:, None] * state_matrix
1561
+ feature_second_moment += (weighted_states * state_matrix).sum(axis=0)
1562
+ np.add.at(raw_cross, moment_labels, weighted_states)
1563
+ elif moment_states:
1564
+ for state, label_id, readout_weight in zip(moment_states, moment_labels, moment_weights):
1565
+ for feature, value in enumerate(state):
1566
+ weighted_value = readout_weight * value
1567
+ feature_second_moment[feature] += weighted_value * value
1568
+ raw_cross[label_id][feature] += weighted_value
1569
+
1570
+ if example_weight_total <= 0.0:
1571
+ raise ValueError("Streaming recompute did not collect any next-token training examples.")
1572
+
1573
+ if np is not None:
1574
+ feature_energy = (feature_second_moment / example_weight_total).tolist()
1575
+ else:
1576
+ feature_energy = [
1577
+ feature_second_moment[index] / example_weight_total
1578
+ for index in range(feature_count)
1579
+ ]
1580
+ ternary_scale, ternary_mask = derive_ternary_mask_from_feature_energy(feature_energy)
1581
+ if np is not None:
1582
+ diagonal = np.asarray([ternary_scale * value for value in ternary_mask], dtype=np.float64)
1583
+ masked_feature_second_moment = feature_second_moment * diagonal * diagonal
1584
+ masked_cross = raw_cross * diagonal[None, :]
1585
+ else:
1586
+ diagonal = [ternary_scale * value for value in ternary_mask]
1587
+ masked_feature_second_moment = [
1588
+ feature_second_moment[index] * diagonal[index] * diagonal[index]
1589
+ for index in range(feature_count)
1590
+ ]
1591
+ masked_cross = [
1592
+ [
1593
+ raw_cross[row][col] * diagonal[col]
1594
+ for col in range(feature_count)
1595
+ ]
1596
+ for row in range(len(raw_cross))
1597
+ ]
1598
+ readout_solver = "diagonal"
1599
+ state_offset_values: object
1600
+ readout_bias_values: object
1601
+ if (
1602
+ np is not None
1603
+ and moment_states
1604
+ and feature_count <= FULL_READOUT_FEATURE_LIMIT
1605
+ and len(moment_states) <= FULL_READOUT_EXAMPLE_LIMIT
1606
+ ):
1607
+ state_matrix = np.asarray(moment_states, dtype=np.float64)
1608
+ weight_vector = np.asarray(moment_weights, dtype=np.float64)
1609
+ label_array = np.asarray(moment_labels, dtype=np.int64)
1610
+ masked_states = state_matrix * diagonal[None, :]
1611
+ total_weight = float(weight_vector.sum())
1612
+ if total_weight <= 0.0:
1613
+ total_weight = 1.0
1614
+ state_offset_values = (weight_vector[:, None] * masked_states).sum(axis=0) / total_weight
1615
+ centered_states = masked_states - state_offset_values[None, :]
1616
+ weighted_centered_states = weight_vector[:, None] * centered_states
1617
+ gram = centered_states.T @ weighted_centered_states
1618
+ full_cross = np.zeros((len(embedding_model.id_to_token), feature_count), dtype=np.float64)
1619
+ np.add.at(full_cross, label_array, weighted_centered_states)
1620
+ readout_bias_values = np.zeros(len(embedding_model.id_to_token), dtype=np.float64)
1621
+ np.add.at(readout_bias_values, label_array, weight_vector)
1622
+ readout_bias_values /= total_weight
1623
+ readout_weights = ridge_regression_readout_from_moments(
1624
+ gram,
1625
+ full_cross,
1626
+ regularization=config.regularization,
1627
+ )
1628
+ readout_solver = "full"
1629
+ else:
1630
+ state_offset_values = (
1631
+ np.zeros(feature_count, dtype=np.float64)
1632
+ if np is not None
1633
+ else [0.0 for _ in range(feature_count)]
1634
+ )
1635
+ if np is not None:
1636
+ label_total = max(float(target_label_mass.sum()), 1.0)
1637
+ readout_bias_values = target_label_mass / label_total
1638
+ else:
1639
+ label_total = max(sum(target_label_mass), 1.0)
1640
+ readout_bias_values = [value / label_total for value in target_label_mass]
1641
+ readout_weights = ridge_regression_readout_from_diagonal_moments(
1642
+ masked_feature_second_moment,
1643
+ masked_cross,
1644
+ regularization=config.regularization,
1645
+ )
1646
+ finish_stage("state_and_readout")
1647
+
1648
+ model.ternary_scale = ternary_scale
1649
+ model.ternary_mask = ternary_mask
1650
+ model.readout_weights = readout_weights
1651
+ model.state_offset = (
1652
+ state_offset_values.tolist()
1653
+ if hasattr(state_offset_values, "tolist")
1654
+ else list(state_offset_values)
1655
+ )
1656
+ model.readout_bias = (
1657
+ readout_bias_values.tolist()
1658
+ if hasattr(readout_bias_values, "tolist")
1659
+ else list(readout_bias_values)
1660
+ )
1661
+ model.preference_bias, preference_state_pairs = _derive_preference_bias_from_pairs(
1662
+ model,
1663
+ preference_token_pairs,
1664
+ tokenizer,
1665
+ )
1666
+ finish_stage("preference")
1667
+ reservoir_states = answer_reservoir.states + general_reservoir.states
1668
+ reservoir_labels = answer_reservoir.labels + general_reservoir.labels
1669
+ answer_intent_states = answer_intent_reservoir.states
1670
+ answer_intent_labels = answer_intent_reservoir.labels
1671
+ answer_start_states = answer_start_reservoir.states
1672
+ answer_start_labels = answer_start_reservoir.labels
1673
+ answer_sequence_states = answer_sequence_reservoir.keys
1674
+ answer_sequence_prompt_rows = answer_sequence_reservoir.prompt_rows
1675
+ answer_sequence_rows = answer_sequence_reservoir.token_rows
1676
+ prompt_answer_weights, prompt_answer_bias, prompt_answer_readout_examples = (
1677
+ _solve_weighted_prompt_readout(
1678
+ answer_intent_states,
1679
+ answer_intent_labels,
1680
+ answer_intent_reservoir.weights,
1681
+ vocab_size=len(embedding_model.id_to_token),
1682
+ diagonal=diagonal,
1683
+ state_offset=state_offset_values,
1684
+ regularization=config.regularization,
1685
+ )
1686
+ )
1687
+ (
1688
+ prompt_answer_start_weights,
1689
+ prompt_answer_start_bias,
1690
+ prompt_answer_start_readout_examples,
1691
+ ) = _solve_weighted_prompt_readout(
1692
+ answer_start_states,
1693
+ answer_start_labels,
1694
+ answer_start_reservoir.weights,
1695
+ vocab_size=len(embedding_model.id_to_token),
1696
+ diagonal=diagonal,
1697
+ state_offset=state_offset_values,
1698
+ regularization=config.regularization,
1699
+ )
1700
+ model.prompt_answer_weights = prompt_answer_weights
1701
+ model.prompt_answer_bias = (
1702
+ prompt_answer_bias.tolist()
1703
+ if hasattr(prompt_answer_bias, "tolist")
1704
+ else list(prompt_answer_bias)
1705
+ )
1706
+ model.prompt_answer_start_weights = prompt_answer_start_weights
1707
+ model.prompt_answer_start_bias = (
1708
+ prompt_answer_start_bias.tolist()
1709
+ if hasattr(prompt_answer_start_bias, "tolist")
1710
+ else list(prompt_answer_start_bias)
1711
+ )
1712
+ if np is not None and reservoir_states:
1713
+ reservoir_array = np.asarray(reservoir_states, dtype=RUNTIME_ARRAY_DTYPE)
1714
+ mask_array = np.asarray(ternary_mask, dtype=RUNTIME_ARRAY_DTYPE) * ternary_scale
1715
+ offset_array = np.asarray(model.state_offset, dtype=RUNTIME_ARRAY_DTYPE)
1716
+ associative_array = ((reservoir_array * mask_array[None, :]) - offset_array[None, :]).astype(
1717
+ RUNTIME_ARRAY_DTYPE,
1718
+ copy=False,
1719
+ )
1720
+ model.associative_keys = associative_array
1721
+ model.associative_key_norms = np.linalg.norm(associative_array, axis=1).tolist()
1722
+ else:
1723
+ offset_vector = model.state_offset
1724
+ model.associative_keys = [
1725
+ [
1726
+ value - offset_vector[index]
1727
+ for index, value in enumerate(apply_ternary_mask(state, ternary_mask, ternary_scale))
1728
+ ]
1729
+ for state in reservoir_states
1730
+ ]
1731
+ model.associative_key_norms = [norm(state) for state in model.associative_keys]
1732
+ model.associative_values = reservoir_labels[:]
1733
+ if np is not None and answer_intent_states:
1734
+ answer_intent_array = np.asarray(answer_intent_states, dtype=RUNTIME_ARRAY_DTYPE)
1735
+ mask_array = np.asarray(ternary_mask, dtype=RUNTIME_ARRAY_DTYPE) * ternary_scale
1736
+ offset_array = np.asarray(model.state_offset, dtype=RUNTIME_ARRAY_DTYPE)
1737
+ answer_array = ((answer_intent_array * mask_array[None, :]) - offset_array[None, :]).astype(
1738
+ RUNTIME_ARRAY_DTYPE,
1739
+ copy=False,
1740
+ )
1741
+ model.answer_keys = answer_array
1742
+ model.answer_key_norms = np.linalg.norm(answer_array, axis=1).tolist()
1743
+ else:
1744
+ offset_vector = model.state_offset
1745
+ model.answer_keys = [
1746
+ [
1747
+ value - offset_vector[index]
1748
+ for index, value in enumerate(apply_ternary_mask(state, ternary_mask, ternary_scale))
1749
+ ]
1750
+ for state in answer_intent_states
1751
+ ]
1752
+ model.answer_key_norms = [norm(state) for state in model.answer_keys]
1753
+ model.answer_values = answer_intent_labels[:]
1754
+ if np is not None and answer_start_states:
1755
+ answer_start_array = np.asarray(answer_start_states, dtype=RUNTIME_ARRAY_DTYPE)
1756
+ mask_array = np.asarray(ternary_mask, dtype=RUNTIME_ARRAY_DTYPE) * ternary_scale
1757
+ offset_array = np.asarray(model.state_offset, dtype=RUNTIME_ARRAY_DTYPE)
1758
+ start_array = ((answer_start_array * mask_array[None, :]) - offset_array[None, :]).astype(
1759
+ RUNTIME_ARRAY_DTYPE,
1760
+ copy=False,
1761
+ )
1762
+ model.answer_start_keys = start_array
1763
+ model.answer_start_key_norms = np.linalg.norm(start_array, axis=1).tolist()
1764
+ else:
1765
+ offset_vector = model.state_offset
1766
+ model.answer_start_keys = [
1767
+ [
1768
+ value - offset_vector[index]
1769
+ for index, value in enumerate(apply_ternary_mask(state, ternary_mask, ternary_scale))
1770
+ ]
1771
+ for state in answer_start_states
1772
+ ]
1773
+ model.answer_start_key_norms = [norm(state) for state in model.answer_start_keys]
1774
+ model.answer_start_values = answer_start_labels[:]
1775
+ if np is not None and answer_sequence_states:
1776
+ answer_sequence_array = np.asarray(answer_sequence_states, dtype=RUNTIME_ARRAY_DTYPE)
1777
+ mask_array = np.asarray(ternary_mask, dtype=RUNTIME_ARRAY_DTYPE) * ternary_scale
1778
+ offset_array = np.asarray(model.state_offset, dtype=RUNTIME_ARRAY_DTYPE)
1779
+ sequence_array = ((answer_sequence_array * mask_array[None, :]) - offset_array[None, :]).astype(
1780
+ RUNTIME_ARRAY_DTYPE,
1781
+ copy=False,
1782
+ )
1783
+ model.answer_sequence_keys = sequence_array
1784
+ model.answer_sequence_key_norms = np.linalg.norm(sequence_array, axis=1).tolist()
1785
+ else:
1786
+ offset_vector = model.state_offset
1787
+ model.answer_sequence_keys = [
1788
+ [
1789
+ value - offset_vector[index]
1790
+ for index, value in enumerate(apply_ternary_mask(state, ternary_mask, ternary_scale))
1791
+ ]
1792
+ for state in answer_sequence_states
1793
+ ]
1794
+ model.answer_sequence_key_norms = [norm(state) for state in model.answer_sequence_keys]
1795
+ if np is not None:
1796
+ padded_answer_sequences = np.full(
1797
+ (len(answer_sequence_rows), MAX_ANSWER_SEQUENCE_TOKENS),
1798
+ -1,
1799
+ dtype=np.int32,
1800
+ )
1801
+ for row_index, row in enumerate(answer_sequence_rows):
1802
+ row_width = min(len(row), MAX_ANSWER_SEQUENCE_TOKENS)
1803
+ if row_width > 0:
1804
+ padded_answer_sequences[row_index, :row_width] = row[:row_width]
1805
+ padded_answer_sequence_prompts = np.full(
1806
+ (len(answer_sequence_prompt_rows), MAX_ANSWER_SEQUENCE_TOKENS),
1807
+ -1,
1808
+ dtype=np.int32,
1809
+ )
1810
+ for row_index, row in enumerate(answer_sequence_prompt_rows):
1811
+ row_width = min(len(row), MAX_ANSWER_SEQUENCE_TOKENS)
1812
+ if row_width > 0:
1813
+ padded_answer_sequence_prompts[row_index, :row_width] = row[:row_width]
1814
+ else:
1815
+ padded_answer_sequences = [
1816
+ row + [-1 for _ in range(MAX_ANSWER_SEQUENCE_TOKENS - len(row))]
1817
+ for row in answer_sequence_rows
1818
+ ]
1819
+ padded_answer_sequence_prompts = [
1820
+ row + [-1 for _ in range(MAX_ANSWER_SEQUENCE_TOKENS - len(row))]
1821
+ for row in answer_sequence_prompt_rows
1822
+ ]
1823
+ model.answer_sequence_prompt_tokens = padded_answer_sequence_prompts
1824
+ model.answer_sequence_tokens = padded_answer_sequences
1825
+ model.transition_tables = transitions.finalize(
1826
+ max_contexts_per_order=config.max_transition_contexts_per_order,
1827
+ max_next_tokens=config.max_transition_next_tokens,
1828
+ )
1829
+ finish_stage("model_finalize")
1830
+
1831
+ payload = {
1832
+ "streaming": True,
1833
+ "documents_processed": processed,
1834
+ "source_counts": source_counts,
1835
+ "embedding_vocab_size": len(embedding_model.id_to_token),
1836
+ "tokenizer_vocab_size": tokenizer.vocab_size,
1837
+ "examples_processed": int(round(example_weight_total)),
1838
+ "associative_examples": len(model.associative_keys),
1839
+ "answer_associative_examples": len(answer_reservoir.states),
1840
+ "general_associative_examples": len(general_reservoir.states),
1841
+ "answer_intent_examples": len(model.answer_keys),
1842
+ "answer_start_examples": len(model.answer_start_keys),
1843
+ "answer_sequence_examples": len(model.answer_sequence_keys),
1844
+ "prompt_answer_readout_examples": prompt_answer_readout_examples,
1845
+ "prompt_answer_start_readout_examples": prompt_answer_start_readout_examples,
1846
+ "stage_seconds": stage_seconds,
1847
+ "target_balance_reference": round(float(reference_label_mass), 6),
1848
+ "readout_solver": readout_solver,
1849
+ "preference_pairs": len(preference_token_pairs),
1850
+ "preference_state_pairs": preference_state_pairs,
1851
+ }
1852
+ return model, payload
reframr/ternary.py ADDED
@@ -0,0 +1,63 @@
1
+ import math
2
+
3
+ from .linalg import Vector, mean
4
+
5
+
6
+ def quantize_vector_absmean(
7
+ values: Vector,
8
+ *,
9
+ threshold: float = 0.5,
10
+ ) -> tuple[float, list[int]]:
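+ # Absmean ternary quantization: scale by the mean absolute value, then
+ # snap each entry to -1, 0, or +1 using the threshold on the scaled value.
+ # e.g. quantize_vector_absmean([0.9, -0.05, -1.1]) -> (~0.683, [1, 0, -1])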
11
+ if not values:
12
+ return 1.0, []
13
+
14
+ scale = mean([abs(value) for value in values])
15
+ if scale == 0.0:
16
+ return 1.0, [0 for _ in values]
17
+
18
+ quantized: list[int] = []
19
+ for value in values:
20
+ normalized = value / scale
21
+ if normalized >= threshold:
22
+ quantized.append(1)
23
+ elif normalized <= -threshold:
24
+ quantized.append(-1)
25
+ else:
26
+ quantized.append(0)
27
+ return scale, quantized
28
+
29
+
30
+ def derive_ternary_mask_from_states(states: list[Vector]) -> tuple[float, list[int]]:
31
+ if not states:
32
+ return 1.0, []
33
+ feature_count = len(states[0])
34
+ feature_energy = [
35
+ mean([state[feature] * state[feature] for state in states])
36
+ for feature in range(feature_count)
37
+ ]
38
+ return derive_ternary_mask_from_feature_energy(feature_energy)
39
+
40
+
41
+ def derive_ternary_mask_from_feature_energy(
42
+ feature_energy: Vector,
43
+ *,
44
+ threshold: float = 0.02,
45
+ ) -> tuple[float, list[int]]:
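+ # Keep features whose RMS energy reaches `threshold` times the mean RMS;
+ # the returned mask is binary (0/1) with the scale fixed at 1.0, and it
+ # degrades to all-ones if no feature clears the threshold.
+ # e.g. derive_ternary_mask_from_feature_energy([1.0, 0.0]) -> (1.0, [1, 0])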
46
+ if not feature_energy:
47
+ return 1.0, []
48
+
49
+ rms_values = [math.sqrt(max(value, 0.0)) for value in feature_energy]
50
+ scale = mean(rms_values)
51
+ if scale == 0.0:
52
+ return 1.0, [0 for _ in feature_energy]
53
+
54
+ mask = [1 if value >= threshold * scale else 0 for value in rms_values]
55
+ if not any(mask):
56
+ mask = [1 for _ in feature_energy]
57
+ return 1.0, mask
58
+
59
+
60
+ def apply_ternary_mask(values: Vector, mask: list[int], scale: float) -> Vector:
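+ # Elementwise gate: each value is multiplied by scale * mask[i]; an empty
+ # mask returns an unmodified copy of the input.
+ # e.g. apply_ternary_mask([2.0, 3.0], [1, 0], 0.5) -> [1.0, 0.0]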
61
+ if not mask:
62
+ return values[:]
63
+ return [scale * mask[index] * values[index] for index in range(len(values))]
reframr/text_quality.py ADDED
@@ -0,0 +1,98 @@
1
+ import re
2
+
3
+
4
+ REFRAMR_NAME_PATTERN = re.compile(r"\breframr\b", re.IGNORECASE)
5
+ LINE_ROLE_PREFIX_PATTERN = re.compile(
6
+ r"(?im)^\s*(?:user|assistant|human|system|bot|model|gpt)\s*:\s*"
7
+ )
8
+ STRUCTURAL_ROLE_PREFIX_PATTERN = re.compile(
9
+ r"(?i)(<(?:reason|answer)>\s+)(?:user|assistant|human|system|bot|model|gpt)\s*:\s*"
10
+ )
11
+ SYSTEM_SCAFFOLD_LINE_PATTERN = re.compile(
12
+ r"(?i)^\s*(?:"
13
+ r"you\s+are\s+(?:an?\s+)?(?:helpful\s+)?(?:ai\s+)?assistant\b.*|"
14
+ r"your\s+role\s+as\s+an\s+assistant\s+involves\b.*|"
15
+ r"you\s+will\s+be\s+given\s+a\s+task\b.*|"
16
+ r"your\s+goal\s+is\s+to\s+complete\s+the\s+task\b.*|"
17
+ r"you\s+must\s+generate\s+a\s+detailed\s+and\s+long\s+answer\b.*|"
18
+ r"please\s+structure\s+your\s+response\s+into\s+two\s+main\s+sections\b.*|"
19
+ r"in\s+the\s+thought\s+section\b.*|"
20
+ r"in\s+the\s+solution\s+section\b.*|"
21
+ r"now,\s*try\s+to\s+solve\s+the\s+following\s+question\b.*|"
22
+ r"while\s+answering\s+think\s+step\s*[- ]?\s*by\s*[- ]?\s*step\b.*|"
23
+ r"think\s+like\s+you\s+are\s+answering\b.*"
24
+ r")\s*$"
25
+ )
26
+ OPEN_SOLUTION_PATTERN = re.compile(
27
+ r"(?is)<\|begin_of_solution\|>(.*?)<\|end_of_solution\|>"
28
+ )
29
+ OPEN_THOUGHT_PATTERN = re.compile(
30
+ r"(?is)<\|begin_of_thought\|>.*?<\|end_of_thought\|>"
31
+ )
32
+ OPEN_TAG_PATTERN = re.compile(r"(?is)<\|[^>]+?\|>")
33
+ LEADING_ASSISTANT_FILLER_PATTERN = re.compile(
34
+ r"(?is)^\s*(?:sure(?:\s+thing)?|certainly|absolutely|of\s+course|yes)\s*[!,.:-]*\s+"
35
+ )
36
+ MOJIBAKE_MARKERS = ("â", "Ã", "Â", "â", "Ã", "Â")
37
+
38
+
39
+ def canonicalize_reframr_name(text: str) -> str:
40
+ return REFRAMR_NAME_PATTERN.sub("Reframr", text)
41
+
42
+
43
+ def repair_common_mojibake(text: str) -> str:
44
+ repaired = text
45
+ for _ in range(3):
46
+ if not any(marker in repaired for marker in MOJIBAKE_MARKERS):
47
+ break
48
+ original_markers = sum(repaired.count(marker) for marker in MOJIBAKE_MARKERS)
49
+ best = repaired
50
+ best_markers = original_markers
51
+ for encoding in ("cp1252", "latin1"):
52
+ try:
53
+ candidate = repaired.encode(encoding).decode("utf-8")
54
+ except UnicodeError:
55
+ continue
56
+ candidate_markers = sum(candidate.count(marker) for marker in MOJIBAKE_MARKERS)
57
+ if candidate_markers < best_markers:
58
+ best = candidate
59
+ best_markers = candidate_markers
60
+ if best == repaired:
61
+ break
62
+ repaired = best
63
+ return repaired
64
+
65
+
66
+ def strip_role_prefixes(text: str) -> str:
67
+ cleaned = STRUCTURAL_ROLE_PREFIX_PATTERN.sub(r"\1", text)
68
+ return LINE_ROLE_PREFIX_PATTERN.sub("", cleaned).strip()
69
+
70
+
71
+ def strip_instruction_scaffold(text: str) -> str:
72
+ lines = []
73
+ for line in text.splitlines():
74
+ if SYSTEM_SCAFFOLD_LINE_PATTERN.match(line):
75
+ continue
76
+ lines.append(line)
77
+ return "\n".join(lines).strip()
78
+
79
+
80
+ def clean_training_text(text: str) -> str:
81
+ repaired = repair_common_mojibake(text)
82
+ return strip_role_prefixes(canonicalize_reframr_name(repaired)).strip()
83
+
84
+
85
+ def clean_context_text(text: str) -> str:
86
+ return strip_instruction_scaffold(clean_training_text(text))
87
+
88
+
89
+ def clean_answer_text(text: str) -> str:
90
+ cleaned = clean_training_text(text)
91
+ solution_match = OPEN_SOLUTION_PATTERN.search(cleaned)
92
+ if solution_match:
93
+ cleaned = solution_match.group(1)
94
+ else:
95
+ cleaned = OPEN_THOUGHT_PATTERN.sub("", cleaned)
96
+ cleaned = OPEN_TAG_PATTERN.sub("", cleaned)
97
+ cleaned = LEADING_ASSISTANT_FILLER_PATTERN.sub("", cleaned)
98
+ return cleaned.strip()
reframr/tokenizer.py ADDED
@@ -0,0 +1,665 @@
1
+ import re
2
+ import unicodedata
3
+ from collections import Counter
4
+ from collections.abc import Mapping
5
+ from dataclasses import dataclass, field
6
+ from string import ascii_letters, digits
7
+
8
+ from .reasoning import REASONING_CONTROL_TOKENS, TOKENIZER_NAME
9
+
10
+ PRETOKEN_PATTERN = re.compile(r"\w+|[^\w\s]", re.UNICODE)
11
+ BYTE_FALLBACK_PATTERN = re.compile(r"<byte:([0-9A-F]{2})>")
12
+ DEFAULT_FALLBACK_CHARACTERS = (
13
+ ascii_letters
14
+ + digits
15
+ + "'-_/.:,;!?()[]{}@#$%&*+="
16
+ + "’ʼ‘“”—–…"
17
+ )
18
+ MAX_TOKENIZER_VOCAB_SIZE = 65536
19
+ MAX_SEGMENT_CACHE_SIZE = 200_000
20
+ MAX_TRAINED_PAIR_MERGES = 384
21
+
22
+
23
+ def _is_word_character(character: str) -> bool:
24
+ category = unicodedata.category(character)
25
+ return character == "_" or category[0] in {"L", "N"} or category == "Mn"
26
+
27
+
28
+ def _is_variation_selector(character: str) -> bool:
29
+ return "VARIATION SELECTOR" in unicodedata.name(character, "")
30
+
31
+
32
+ def _is_zero_width_joiner(character: str) -> bool:
33
+ return unicodedata.name(character, "") == "ZERO WIDTH JOINER"
34
+
35
+
36
+ def _is_emoji_modifier(character: str) -> bool:
37
+ return "EMOJI MODIFIER" in unicodedata.name(character, "")
38
+
39
+
40
+ def _is_emoji_base_character(character: str) -> bool:
41
+ name = unicodedata.name(character, "")
42
+ category = unicodedata.category(character)
43
+ return (
44
+ "EMOJI" in name
45
+ or "REGIONAL INDICATOR SYMBOL" in name
46
+ or (category in {"So", "Sk"} and ord(character) >= 0x2100)
47
+ )
48
+
49
+
50
+ def _is_emoji_continuation_character(character: str) -> bool:
51
+ category = unicodedata.category(character)
52
+ name = unicodedata.name(character, "")
53
+ return (
54
+ _is_variation_selector(character)
55
+ or _is_zero_width_joiner(character)
56
+ or _is_emoji_modifier(character)
57
+ or category in {"Mn", "Me"}
58
+ or name.startswith("TAG ")
59
+ )
60
+
61
+
62
+ def _consume_emoji_cluster(text: str, start: int) -> int:
63
+ if start >= len(text) or not _is_emoji_base_character(text[start]):
64
+ return start
65
+
66
+ index = start + 1
67
+ if "REGIONAL INDICATOR SYMBOL" in unicodedata.name(text[start], ""):
68
+ if index < len(text) and "REGIONAL INDICATOR SYMBOL" in unicodedata.name(text[index], ""):
69
+ return index + 1
70
+ return index
71
+
72
+ while index < len(text):
73
+ if _is_emoji_continuation_character(text[index]):
74
+ index += 1
75
+ continue
76
+ if _is_zero_width_joiner(text[index - 1]) and _is_emoji_base_character(text[index]):
77
+ index += 1
78
+ continue
79
+ break
80
+ return index
81
+
82
+
83
+ def _byte_token(value: int) -> str:
84
+ return f"<byte:{value:02X}>"
85
+
86
+
87
+ def _byte_value(piece: str) -> int | None:
88
+ match = BYTE_FALLBACK_PATTERN.fullmatch(piece)
89
+ if match is None:
90
+ return None
91
+ return int(match.group(1), 16)
92
+
93
+
94
+ def _is_punctuation_piece(piece: str) -> bool:
95
+ return bool(piece) and all(
96
+ unicodedata.category(character).startswith("P")
97
+ for character in piece
98
+ )
99
+
100
+
101
+ def _is_opening_punctuation(piece: str) -> bool:
102
+ return bool(piece) and all(
103
+ unicodedata.category(character) in {"Ps", "Pi"}
104
+ for character in piece
105
+ )
106
+
107
+
108
+ def _is_call_opening_punctuation(piece: str) -> bool:
109
+ return bool(piece) and all(
110
+ unicodedata.category(character) == "Ps"
111
+ and "PARENTHESIS" in unicodedata.name(character, "")
112
+ for character in piece
113
+ )
114
+
115
+
116
+ def _is_closing_or_terminal_punctuation(piece: str) -> bool:
117
+ return bool(piece) and all(
118
+ unicodedata.category(character) in {"Pe", "Pf", "Po"}
119
+ for character in piece
120
+ )
121
+
122
+
123
+ def _is_infix_joiner(piece: str) -> bool:
124
+ if len(piece) != 1:
125
+ return False
126
+ category = unicodedata.category(piece)
127
+ name = unicodedata.name(piece, "")
128
+ return (
129
+ category == "Pd"
130
+ or "APOSTROPHE" in name
131
+ or (category == "Pf" and "SINGLE QUOTATION MARK" in name)
132
+ or "SOLIDUS" in name
133
+ )
134
+
135
+
136
+ def _is_dash_joiner(piece: str) -> bool:
137
+ if len(piece) != 1:
138
+ return False
139
+ category = unicodedata.category(piece)
140
+ name = unicodedata.name(piece, "")
141
+ return category == "Pd" or "HYPHEN" in name or "DASH" in name
142
+
143
+
144
+ def _is_quote_piece(piece: str) -> bool:
145
+ if len(piece) != 1:
146
+ return False
147
+ if _is_infix_joiner(piece):
148
+ return False
149
+ name = unicodedata.name(piece, "")
150
+ category = unicodedata.category(piece)
151
+ return "QUOTATION MARK" in name or category in {"Pi", "Pf"}
152
+
153
+
154
+ def _merge_symbol(left: str, right: str, prefix: str) -> str:
155
+ if right.startswith(prefix):
156
+ return left + right[len(prefix):]
157
+ return left + right
158
+
159
+
160
+ def _merge_sequence(symbols: list[str], pair: tuple[str, str], merged_symbol: str) -> list[str]:
161
+ merged: list[str] = []
162
+ index = 0
163
+ while index < len(symbols):
164
+ if index < len(symbols) - 1 and (symbols[index], symbols[index + 1]) == pair:
165
+ merged.append(merged_symbol)
166
+ index += 2
167
+ else:
168
+ merged.append(symbols[index])
169
+ index += 1
170
+ return merged
171
+
172
+
173
+ def _default_symbol_inventory(word_prefix: str) -> set[str]:
174
+ symbols: set[str] = set()
175
+ for character in DEFAULT_FALLBACK_CHARACTERS:
176
+ symbols.add(character)
177
+ symbols.add(f"{word_prefix}{character}")
178
+ for value in range(256):
179
+ token = _byte_token(value)
180
+ symbols.add(token)
181
+ symbols.add(f"{word_prefix}{token}")
182
+ return symbols
183
+
184
+
185
+ def _whole_segment_token(segment: str, word_prefix: str) -> str:
186
+ return f"{word_prefix}{segment}"
187
+
188
+
189
+ def recommend_vocab_size(
190
+ text: str,
191
+ *,
192
+ minimum: int = 768,
193
+ maximum: int = 1536,
194
+ multiplier: int = 5,
195
+ lowercase: bool = False,
196
+ ) -> int:
197
+ seed_tokenizer = NativeTokenizer(
198
+ merges=[],
199
+ vocab=[],
200
+ base_symbols=[],
201
+ lowercase=lowercase,
202
+ )
203
+ segments = seed_tokenizer.pretokenize(text)
204
+ distinct_segments = len(set(segments))
205
+ recommended = max(minimum, distinct_segments * multiplier)
206
+ return min(maximum, recommended)
207
+
208
+
209
+ def clamp_vocab_size(requested: int, *, maximum: int = MAX_TOKENIZER_VOCAB_SIZE) -> int:
210
+ return min(maximum, max(1, requested))
211
+
212
+
213
+ @dataclass(slots=True)
214
+ class NativeTokenizer:
215
+ merges: list[tuple[str, str]]
216
+ vocab: list[str]
217
+ base_symbols: list[str]
218
+ name: str = TOKENIZER_NAME
219
+ lowercase: bool = False
220
+ word_prefix: str = "▁"
221
+ unk_token: str = "<unk>"
222
+ bos_token: str = "<bos>"
223
+ eos_token: str = "<eos>"
224
+ pad_token: str = "<pad>"
225
+ _merge_ranks: dict[tuple[str, str], int] = field(init=False, repr=False)
226
+ _vocab_set: set[str] = field(init=False, repr=False)
227
+ _base_symbol_set: set[str] = field(init=False, repr=False)
228
+ _pretoken_pattern: re.Pattern[str] = field(init=False, repr=False)
229
+ _segment_cache: dict[str, tuple[str, ...]] = field(init=False, repr=False)
230
+
231
+ def __post_init__(self) -> None:
232
+ self._merge_ranks = {pair: index for index, pair in enumerate(self.merges)}
233
+ self._base_symbol_set = set(self.base_symbols)
234
+ self._vocab_set = set(self.vocab) | self.special_tokens | self._base_symbol_set
235
+ self.vocab = sorted(self._vocab_set)
236
+ self._pretoken_pattern = self._build_pretoken_pattern()
237
+ self._segment_cache = {}
238
+
239
+ @property
240
+ def special_tokens(self) -> set[str]:
241
+ return {
242
+ self.unk_token,
243
+ self.bos_token,
244
+ self.eos_token,
245
+ self.pad_token,
246
+ *REASONING_CONTROL_TOKENS,
247
+ }
248
+
249
+ @property
250
+ def vocab_size(self) -> int:
251
+ return len(self._vocab_set)
252
+
253
+ def normalize(self, text: str) -> str:
254
+ normalized = unicodedata.normalize("NFKC", text)
255
+ return normalized.lower() if self.lowercase else normalized
256
+
257
+ def pretokenize(self, text: str) -> list[str]:
258
+ normalized = self.normalize(text)
259
+ segments: list[str] = []
260
+ reserved = sorted(self.special_tokens, key=len, reverse=True)
261
+ index = 0
262
+ while index < len(normalized):
263
+ if normalized[index].isspace():
264
+ if normalized[index] == "\r":
265
+ if index + 1 < len(normalized) and normalized[index + 1] == "\n":
266
+ segments.append("\n")
267
+ index += 2
268
+ continue
269
+ segments.append("\n")
270
+ index += 1
271
+ continue
272
+ if normalized[index] == "\n":
273
+ segments.append("\n")
274
+ index += 1
275
+ continue
276
+ index += 1
277
+ continue
278
+
279
+ matched_special = next(
280
+ (
281
+ token
282
+ for token in reserved
283
+ if normalized.startswith(token, index)
284
+ ),
285
+ None,
286
+ )
287
+ if matched_special is not None:
288
+ segments.append(matched_special)
289
+ index += len(matched_special)
290
+ continue
291
+
292
+ emoji_end = _consume_emoji_cluster(normalized, index)
293
+ if emoji_end > index:
294
+ segments.append(normalized[index:emoji_end])
295
+ index = emoji_end
296
+ continue
297
+
298
+ if _is_word_character(normalized[index]):
299
+ start = index
300
+ index += 1
301
+ while index < len(normalized) and _is_word_character(normalized[index]):
302
+ index += 1
303
+ segments.append(normalized[start:index])
304
+ continue
305
+
306
+ segments.append(normalized[index])
307
+ index += 1
308
+ return segments
309
+
310
+ def encode(self, text: str, *, add_special_tokens: bool = False) -> list[str]:
311
+ tokens: list[str] = []
312
+ if add_special_tokens:
313
+ tokens.append(self.bos_token)
314
+
315
+ for segment in self.pretokenize(text):
316
+ tokens.extend(self._encode_segment_cached(segment))
317
+
318
+ if add_special_tokens:
319
+ tokens.append(self.eos_token)
320
+
321
+ if not tokens and text.strip():
322
+ return [self.unk_token]
323
+ return tokens
324
+
325
+ def encode_many(
326
+ self,
327
+ texts: list[str] | tuple[str, ...],
328
+ *,
329
+ add_special_tokens: bool = False,
330
+ ) -> list[list[str]]:
331
+ return [
332
+ self.encode(text, add_special_tokens=add_special_tokens)
333
+ for text in texts
334
+ ]
335
+
336
+ def decode(self, tokens: list[str]) -> str:
337
+ text = ""
338
+ join_next = False
339
+ byte_buffer = bytearray()
340
+ byte_starts_segment = False
341
+
342
+ def next_rendered_piece(start_index: int) -> str | None:
343
+ for raw_token in tokens[start_index:]:
344
+ if raw_token in self.special_tokens:
345
+ continue
346
+ raw_starts_segment = raw_token.startswith(self.word_prefix)
347
+ raw_piece = raw_token[len(self.word_prefix) :] if raw_starts_segment else raw_token
348
+ if not raw_piece:
349
+ continue
350
+ if _byte_value(raw_piece) is not None:
351
+ return None
352
+ return raw_piece
353
+ return None
354
+
355
+ def append_piece(piece: str, starts_segment: bool, next_piece: str | None = None) -> None:
356
+ nonlocal text, join_next
357
+
358
+ if piece == "\n":
359
+ text = text.rstrip(" ")
360
+ text += "\n"
361
+ join_next = True
362
+ return
363
+
364
+ had_text_before_piece = bool(text.strip())
365
+ previous_before_piece = text.rstrip(" ")[-1:] if text.strip(" ") else ""
366
+ if _is_quote_piece(piece):
367
+ quote_count = sum(1 for character in text if _is_quote_piece(character))
368
+ opens_quote = quote_count % 2 == 0
369
+ if opens_quote:
370
+ if text and not text.endswith((" ", "\n")) and previous_before_piece not in {"(", "[", "{"}:
371
+ text += " "
372
+ text += piece
373
+ join_next = True
374
+ return
375
+ text = text.rstrip(" ")
376
+ text += piece
377
+ join_next = False
378
+ return
379
+
380
+ attaches_left = _is_closing_or_terminal_punctuation(piece) or _is_infix_joiner(piece)
381
+ continues_segment = (not starts_segment) and any(
382
+ _is_word_character(character) or _is_emoji_continuation_character(character)
383
+ for character in piece
384
+ )
385
+ if starts_segment:
386
+ if text and not join_next:
387
+ attaches_to_previous_code_span = (
388
+ _is_opening_punctuation(piece)
389
+ and previous_before_piece.isalnum()
390
+ and next_piece is not None
391
+ and (
392
+ _is_infix_joiner(next_piece)
393
+ or _is_call_opening_punctuation(piece)
394
+ )
395
+ )
396
+ if not _is_punctuation_piece(piece) or (
397
+ _is_opening_punctuation(piece)
398
+ and not attaches_to_previous_code_span
399
+ ):
400
+ text += " "
401
+ text += piece
402
+ else:
403
+ if text and not join_next and not attaches_left and not continues_segment:
404
+ text += " "
405
+ text += piece
406
+
407
+ join_next = (
408
+ _is_infix_joiner(piece)
409
+ and (
410
+ not starts_segment
411
+ or (
412
+ had_text_before_piece
413
+ and (
414
+ not _is_dash_joiner(piece)
415
+ or previous_before_piece.isalnum()
416
+ or _is_opening_punctuation(previous_before_piece)
417
+ )
418
+ )
419
+ )
420
+ ) or _is_opening_punctuation(piece)
421
+
422
+ def flush_bytes() -> None:
423
+ nonlocal byte_buffer, byte_starts_segment
424
+ if not byte_buffer:
425
+ return
426
+ append_piece(bytes(byte_buffer).decode("utf-8", errors="replace"), byte_starts_segment)
427
+ byte_buffer = bytearray()
428
+ byte_starts_segment = False
429
+
430
+ for token_index, token in enumerate(tokens):
431
+ if token in self.special_tokens:
432
+ continue
433
+ starts_segment = token.startswith(self.word_prefix)
434
+ piece = token[len(self.word_prefix) :] if starts_segment else token
435
+ if not piece:
436
+ continue
437
+ byte_value = _byte_value(piece)
438
+ if byte_value is not None:
439
+ if not byte_buffer:
440
+ byte_starts_segment = starts_segment
441
+ byte_buffer.append(byte_value)
442
+ continue
443
+
444
+ flush_bytes()
445
+ append_piece(piece, starts_segment, next_rendered_piece(token_index + 1))
446
+ flush_bytes()
447
+ return text.strip()
448
+
449
+ def _encode_segment_cached(self, segment: str) -> tuple[str, ...]:
450
+ cached = self._segment_cache.get(segment)
451
+ if cached is not None:
452
+ return cached
453
+ encoded = tuple(self._encode_segment(segment))
454
+ if len(self._segment_cache) < MAX_SEGMENT_CACHE_SIZE:
455
+ self._segment_cache[segment] = encoded
456
+ return encoded
457
+
458
+ def _encode_segment(self, segment: str) -> list[str]:
459
+ if segment in self.special_tokens:
460
+ return [segment]
461
+ whole_segment = _whole_segment_token(segment, self.word_prefix)
462
+ if whole_segment in self._vocab_set:
463
+ return [whole_segment]
464
+ symbols = self._seed_symbols(segment)
465
+ if not symbols:
466
+ return []
467
+
468
+ while len(symbols) > 1:
469
+ best_rank: int | None = None
470
+ best_pair: tuple[str, str] | None = None
471
+ for index in range(len(symbols) - 1):
472
+ pair = (symbols[index], symbols[index + 1])
473
+ rank = self._merge_ranks.get(pair)
474
+ if rank is None:
475
+ continue
476
+ if best_rank is None or rank < best_rank:
477
+ best_rank = rank
478
+ best_pair = pair
479
+ if best_pair is None:
480
+ break
481
+
482
+ merged_symbol = _merge_symbol(best_pair[0], best_pair[1], self.word_prefix)
483
+ symbols = _merge_sequence(symbols, best_pair, merged_symbol)
484
+
485
+ if any(symbol not in self._vocab_set for symbol in symbols):
486
+ return [self.unk_token]
487
+ return symbols
488
+
489
+ def _seed_symbols(self, segment: str) -> list[str]:
490
+ symbols: list[str] = []
491
+ for index, character in enumerate(segment):
492
+ symbol = f"{self.word_prefix}{character}" if index == 0 else character
493
+ if symbol in self._base_symbol_set:
494
+ symbols.append(symbol)
495
+ continue
496
+
497
+ encoded = character.encode("utf-8")
498
+ for byte_index, value in enumerate(encoded):
499
+ token = _byte_token(value)
500
+ if index == 0 and byte_index == 0:
501
+ token = f"{self.word_prefix}{token}"
502
+ symbols.append(token)
503
+
504
+ if any(symbol not in self._base_symbol_set for symbol in symbols):
505
+ return [self.unk_token]
506
+ return symbols
507
+
508
+ def to_dict(self) -> dict[str, object]:
509
+ return {
510
+ "name": self.name,
511
+ "merges": [[left, right] for left, right in self.merges],
512
+ "vocab": self.vocab,
513
+ "base_symbols": self.base_symbols,
514
+ "lowercase": self.lowercase,
515
+ "word_prefix": self.word_prefix,
516
+ "unk_token": self.unk_token,
517
+ "bos_token": self.bos_token,
518
+ "eos_token": self.eos_token,
519
+ "pad_token": self.pad_token,
520
+ }
521
+
522
+ @classmethod
523
+ def from_dict(cls, payload: dict[str, object]) -> "NativeTokenizer":
524
+ return cls(
525
+ merges=[(str(left), str(right)) for left, right in payload["merges"]],
526
+ vocab=[str(token) for token in payload["vocab"]],
527
+ base_symbols=[str(token) for token in payload["base_symbols"]],
528
+ name=str(payload.get("name", TOKENIZER_NAME)),
529
+ lowercase=bool(payload["lowercase"]),
530
+ word_prefix=str(payload["word_prefix"]),
531
+ unk_token=str(payload["unk_token"]),
532
+ bos_token=str(payload["bos_token"]),
533
+ eos_token=str(payload["eos_token"]),
534
+ pad_token=str(payload["pad_token"]),
535
+ )
536
+
537
+ def _build_pretoken_pattern(self) -> re.Pattern[str]:
538
+ reserved = sorted(self.special_tokens, key=len, reverse=True)
539
+ if not reserved:
540
+ return PRETOKEN_PATTERN
541
+ reserved_pattern = "|".join(re.escape(token) for token in reserved)
542
+ return re.compile(f"{reserved_pattern}|\\w+|[^\\w\\s]", re.UNICODE)
543
+
544
+ @classmethod
545
+ def train(
546
+ cls,
547
+ text: str,
548
+ *,
549
+ vocab_size: int = 256,
550
+ min_pair_frequency: int = 2,
551
+ lowercase: bool = False,
552
+ word_prefix: str = "▁",
553
+ ) -> "NativeTokenizer":
554
+ seed_tokenizer = cls(
555
+ merges=[],
556
+ vocab=[],
557
+ base_symbols=[],
558
+ lowercase=lowercase,
559
+ word_prefix=word_prefix,
560
+ )
561
+ segments = seed_tokenizer.pretokenize(text)
562
+ if not segments:
563
+ raise ValueError("Cannot train the native tokenizer on empty text.")
564
+
565
+ return cls.train_from_segment_counts(
566
+ Counter(segments),
567
+ vocab_size=vocab_size,
568
+ min_pair_frequency=min_pair_frequency,
569
+ lowercase=lowercase,
570
+ word_prefix=word_prefix,
571
+ )
572
+
573
+ @classmethod
574
+ def train_from_segment_counts(
575
+ cls,
576
+ segment_counts: Mapping[str, float],
577
+ *,
578
+ vocab_size: int = 256,
579
+ min_pair_frequency: int = 2,
580
+ lowercase: bool = False,
581
+ word_prefix: str = "▁",
582
+ ) -> "NativeTokenizer":
583
+ if not segment_counts:
584
+ raise ValueError("Cannot train the native tokenizer on empty segment counts.")
585
+ seed_tokenizer = cls(
586
+ merges=[],
587
+ vocab=[],
588
+ base_symbols=[],
589
+ lowercase=lowercase,
590
+ word_prefix=word_prefix,
591
+ )
592
+
593
+ word_counts = Counter(
594
+ {
595
+ str(segment): float(frequency)
596
+ for segment, frequency in segment_counts.items()
597
+ if str(segment) and float(frequency) > 0.0
598
+ }
599
+ )
600
+ if not word_counts:
601
+ raise ValueError("Cannot train the native tokenizer on empty segment counts.")
602
+ observed_symbols = {
603
+ f"{word_prefix}{character}" if index == 0 else character
604
+ for segment in word_counts
605
+ for index, character in enumerate(segment)
606
+ }
607
+ base_symbols = _default_symbol_inventory(word_prefix)
608
+ base_symbols.update(observed_symbols)
609
+ sequences = {
610
+ segment: [
611
+ f"{word_prefix}{character}" if index == 0 else character
612
+ for index, character in enumerate(segment)
613
+ ]
614
+ for segment in word_counts
615
+ }
616
+ vocab = set(observed_symbols) | seed_tokenizer.special_tokens
617
+ target_vocab_size = len(vocab) + max(1, vocab_size)
618
+ segment_candidates = sorted(
619
+ {
620
+ segment
621
+ for segment, frequency in word_counts.items()
622
+ if len(segment) > 1 and frequency >= min_pair_frequency
623
+ },
624
+ key=lambda segment: (
625
+ -(word_counts[segment] * len(segment)),
626
+ -len(segment),
627
+ segment,
628
+ ),
629
+ )
630
+ for segment in segment_candidates:
631
+ if len(vocab) >= target_vocab_size:
632
+ break
633
+ vocab.add(_whole_segment_token(segment, word_prefix))
634
+ merges: list[tuple[str, str]] = []
635
+
636
+ while len(vocab) < target_vocab_size and len(merges) < MAX_TRAINED_PAIR_MERGES:
637
+ pair_counts: Counter[tuple[str, str]] = Counter()
638
+ for segment, frequency in word_counts.items():
639
+ symbols = sequences[segment]
640
+ for index in range(len(symbols) - 1):
641
+ pair_counts[(symbols[index], symbols[index + 1])] += frequency
642
+
643
+ if not pair_counts:
644
+ break
645
+
646
+ best_pair, best_count = min(
647
+ pair_counts.items(),
648
+ key=lambda item: (-item[1], item[0][0], item[0][1]),
649
+ )
650
+ if best_count < min_pair_frequency:
651
+ break
652
+
653
+ merged_symbol = _merge_symbol(best_pair[0], best_pair[1], word_prefix)
654
+ merges.append(best_pair)
655
+ vocab.add(merged_symbol)
656
+ for segment in sequences:
657
+ sequences[segment] = _merge_sequence(sequences[segment], best_pair, merged_symbol)
658
+
659
+ return cls(
660
+ merges=merges,
661
+ vocab=sorted(vocab),
662
+ base_symbols=sorted(base_symbols),
663
+ lowercase=lowercase,
664
+ word_prefix=word_prefix,
665
+ )
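
The tokenizer tail above covers encoding, decoding, and BPE-style training. A minimal usage sketch follows; the import path `reframr.tokenizer` is an assumption for illustration, since the diff does not show the module layout:

```python
# Usage sketch for the NativeTokenizer defined above.
# ASSUMPTION: the class is importable as reframr.tokenizer.NativeTokenizer;
# this release's diff does not confirm the exact module path.
from reframr.tokenizer import NativeTokenizer

corpus = (
    "reframr computes weights from data. "
    "reframr computes weights, not templates."
)

# train() pretokenizes the corpus, then greedily merges the most frequent
# adjacent symbol pairs until the vocabulary budget or merge cap is reached.
tokenizer = NativeTokenizer.train(corpus, vocab_size=64, min_pair_frequency=2)

tokens = tokenizer.encode("reframr computes weights", add_special_tokens=True)
print(tokens)                    # BOS/EOS plus learned subword symbols
print(tokenizer.decode(tokens))  # decode() drops special tokens, rejoins text
```

The round trip works because `decode` treats non-prefixed word pieces as continuations of the current segment, while characters outside the base symbol inventory fall back to UTF-8 byte tokens that `flush_bytes` reassembles.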
requirements.txt ADDED
@@ -0,0 +1,3 @@
+ numpy>=2.1,<3
+ scipy>=1.14,<2
+ datasets>=4.1,<5
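
`requirements.txt` pins only these three libraries. A quick import check (a convenience sketch, not part of the release) confirms they resolved within the pinned ranges:

```python
# Sanity check: the pinned runtime dependencies import cleanly.
import datasets
import numpy
import scipy

print("numpy", numpy.__version__)        # expected 2.1 <= version < 3
print("scipy", scipy.__version__)        # expected 1.14 <= version < 2
print("datasets", datasets.__version__)  # expected 4.1 <= version < 5
```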
sample_prompts.jsonl ADDED
@@ -0,0 +1,5 @@
+ {"prompt":"Who are you, and what makes Reframr different from Transformer models?","max_tokens":90,"temperature":0.92}
+ {"system":"Answer with calm confidence and no hype.","prompt":"Explain why computed weights are different from memorized template responses.","max_tokens":100,"temperature":0.9}
+ {"prompt":"Tell a compact story about a city that stores its memories in rainwater.","max_tokens":120,"temperature":1.05,"decode_top_k":90}
+ {"system":"Use exactly one fitting emoji.","prompt":"Write a warm note to a teammate who fixed a hard bug.","max_tokens":70,"temperature":0.95}
+ {"prompt":"Give safe, defensive guidance for recognizing a phishing email without helping an attacker.","max_tokens":100,"temperature":0.88}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff