
RESEARCH BENCHMARK, NOT FOR DEPLOYMENT. This package is a synthetic counterfactual benchmark for studying ordered-temporal evidence in pre-failure detection. It is not a browser-agent guardrail and has no real-world validation.

PreFailureNet / UI-PreFail-v2

[Image: PreFailureNet UI-PreFail-v2 release visual]

PreFailureNet / UI-PreFail-v2 is a synthetic ordered temporal counterfactual benchmark for pre-failure detection in browser-agent-like workflows. It includes shortcut probes, baseline ladders, a reference Transformer, and calibrated gate diagnostics. It is not a production browser-agent guardrail.

Headline Result vs Shallow Baseline

Sources: transformer_v2_35e_panel.json, baselines.json (entry all_frames).

| Method | AP (failure vs stable) | n_test |
| --- | --- | --- |
| Transformer v2, 35 epochs | 0.9926 | 256 |
| All-frames logistic regression (flattened frames) | 0.8762 | 256 |

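A shallow logistic regression over flattened frames already reaches AP 0.8762 on this split, so the v2 headline comes from a benchmark that a linear probe over ordered frames largely solves; it does not demonstrate deep temporal modeling. The v3 section below describes a split where both the linear probe and the Transformer recipe fall to near chance.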

v3: xor_order_color defeats v2

v3 introduces the xor_order_color family (local commit 3a323b1, file src/prefailurenet/data/synthetic.py, function _temporal_order_v3_pair). The label is the XOR of an event-order bit and a per-pair color bit painted identically into every frame, so ordered-frame linear probes lose their handle while the pair structure stays intact.
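The XOR rule can be illustrated with a minimal sketch. The function name `xor_label` and the encoding (1 = failure, 0 = stable) are assumptions for illustration; this is not the code of `_temporal_order_v3_pair`. It enumerates the four bit combinations and shows that neither bit alone shifts the failure rate away from 0.5.

```python
from itertools import product


def xor_label(order_bit: int, color_bit: int) -> int:
    """Hypothetical label rule: 1 = failure, 0 = stable (encoding is an assumption)."""
    return order_bit ^ color_bit


# All four (order_bit, color_bit, label) combinations, weighted equally.
rows = [(o, c, xor_label(o, c)) for o, c in product((0, 1), repeat=2)]


def failure_rate_given(index: int, value: int) -> float:
    """Failure rate among rows whose bit at `index` equals `value`."""
    labels = [row[2] for row in rows if row[index] == value]
    return sum(labels) / len(labels)


for name, index in (("order_bit", 0), ("color_bit", 1)):
    # Both conditional rates are 0.5: each bit is uninformative on its own.
    print(name, failure_rate_given(index, 0), failure_rate_given(index, 1))
```

Because the label depends only on the interaction of the two bits, a probe has to combine order and color nonlinearly to recover it.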

On v3 (test seed 2920, n_test=256, 128/128 pairs), the all-frames ordered LR drops from AP 0.8762 on v2 to AP 0.5486 on v3 (rounded to 0.549 in docs/v3_split_evaluation.md), which is below the project's 0.60 hard gate and within 0.05 of chance. Source: ~/Projects/PreFailureNet/runs/benchmark_v3/baselines_seed2920/baselines.json (entry all_frames, field ap_failure_vs_stable), cross-referenced in docs/v3_split_evaluation.md.
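As a reading aid for "chance": on a class-balanced split, a score that carries no information gives AP equal to the positive prevalence, 0.5. The sketch below assumes the step-wise definition of average precision (sum of recall increments weighted by precision, the form used by scikit-learn's `average_precision_score`); the source does not state which AP implementation the project uses, and `average_precision` is a hypothetical helper.

```python
def average_precision(scores, labels):
    """Step-wise AP: sum over thresholds of (recall increment) * precision.

    labels: 1 = failure, 0 = stable. Tied scores share one threshold.
    """
    positives = sum(labels)
    if positives == 0:
        raise ValueError("need at least one positive label")
    ap, prev_recall = 0.0, 0.0
    for threshold in sorted(set(scores), reverse=True):
        flagged = sum(1 for s in scores if s >= threshold)
        true_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        precision, recall = true_pos / flagged, true_pos / positives
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# A constant score on a balanced toy split collapses to the prevalence.
print(average_precision([0.3, 0.3, 0.3, 0.3], [1, 0, 1, 0]))
# A perfect ranking reaches the maximum.
print(average_precision([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
```

Under that assumption, the 0.5486 figure sits 0.0486 above the 0.5 floor of a balanced split.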

The v2 Transformer recipe retrained on v3 flatlines:

| Field | Value |
| --- | --- |
| precursor_detection_ap | 0.5068 |
| intervention_auroc | 0.5002 |

Source: ~/Projects/PreFailureNet/runs/benchmark_v3/transformer_v3/metrics.json (raw values 0.5068114175400058 and 0.500152587890625).

v3 inverts the v2 narrative. v2 results should be read as a defeated benchmark, not as evidence that the v2 Transformer architecture learns temporal order.

v3 will be published as a separate HF repo or addendum once a v3-targeted model exists. Currently no model beats the v3 baseline.

Exact Positioning

This package is a benchmark and research artifact. The correct claim is:

Ordered temporal evidence is required under synthetic counterfactual controls (v2). v3 shows this benchmark is itself defeated by an XOR-color counterfactual; the v2 headline is therefore a benchmark result, not a capability claim.

Do not read this release as evidence of production browser-agent safety, real-world website generalization, or a deployable intervention gate.

What This Package Contains

  • A public claims ledger.
  • A compact artifact index.
  • Benchmark v2 summary metrics.
  • Shortcut probes and baseline ladder.
  • Pair-matching and temporal-signal diagnostics.
  • Reference Transformer v2 held-out test metrics.
  • Risk-only gate calibration diagnostics.
  • Reproduction commands.
  • SHA256 checksums for all shipped non-README artifacts (CHECKSUMS.sha256).

Raw NPZ arrays and model checkpoint weights are not included in this compact HF package. They are reproducible from the commands below.

Tested Environment And Disk Requirements

This milestone was validated on Ubuntu 22.04 / Linux 6.8, x86_64, Python 3.10.12. The project metadata allows Python >=3.10, but only Python 3.10 was used for release validation, so use Python 3.10 for reproduction.

CUDA is optional for the published benchmark checks. The generator, diagnostics, shortcut baselines, HF package checker, and tests run on CPU. Transformer training can use CPU or CUDA depending on the local Torch install; no CUDA, Jetson, TensorRT, browser-extension, or live agent deployment result is claimed.

Disk planning:

  • Compact HF package: under 1 MB.
  • Reproduced v2 NPZ splits: about 40 MB for train/calibration/test.
  • Existing runs/benchmark_v2 artifact directory: about 50 MB.
  • Python environment with Torch, OpenCV, scikit-learn, and Pillow: budget 2-5 GB.

Why v1 Was Flawed

The original temporal counterfactual split was useful but flawed:

  • It matched state, context, action tokens, and final frame.
  • The label signal lived in 1-2 mid-sequence visual patches.
  • Patch position and active steps were family-structured.
  • A logistic regression on mid-frame pixels reached AP 0.9249 and pair-rank 0.9875.
  • The strongest v1 Transformer reached AP 0.8746, so it still lost to the mid-frame LR probe.

Conclusion: v1 was shortcut-controlled, but it rewarded localized mid-frame patch detection more than temporal reasoning. v1 remains documented as a negative result and shortcut-control scaffold.

What v2 Fixes

UI-PreFail-v2 / temporal_order_v2 changes the label rule:

  • Safe and unsafe samples have the same UI template/style/distractors.
  • Safe and unsafe samples have identical state, context, and action tokens.
  • Safe and unsafe samples have the same final frame.
  • Safe and unsafe samples have the same unordered frame bag.
  • The label depends on the order of two event-bearing frames.

This makes static, final-frame, non-vision, single mid-frame, and unordered-bag probes weak by construction. Ordered frame sequences and temporal deltas still contain useful signal. (v3 closes the remaining handle. See above.)
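The pair construction listed above can be sketched with stand-in frames. Everything here is illustrative: frames are strings rather than rendered images, `make_pair` is a hypothetical helper, and which ordering counts as unsafe is an arbitrary choice for the example.

```python
from collections import Counter

# Stand-in frames; the real benchmark renders image frames.
BACKGROUND, EVENT_A, EVENT_B, FINAL = "bg", "event_a", "event_b", "final"


def make_pair():
    """Return (safe, unsafe) sequences that differ only in event order."""
    safe = [BACKGROUND, EVENT_A, BACKGROUND, EVENT_B, FINAL]
    unsafe = list(safe)
    # Swap the two event-bearing frames; all other frames stay in place.
    unsafe[1], unsafe[3] = unsafe[3], unsafe[1]
    return safe, unsafe


safe, unsafe = make_pair()
same_final_frame = safe[-1] == unsafe[-1]
same_frame_bag = Counter(safe) == Counter(unsafe)
same_ordered_sequence = safe == unsafe
print(same_final_frame, same_frame_bag, same_ordered_sequence)
```

A probe that sees only the final frame or the unordered bag receives identical inputs for both members of the pair, while any probe that keeps frame positions can still tell them apart.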

Held-Out v2 Pair Controls

Source: data_diagnostics.json.

| Control | Held-out test result |
| --- | --- |
| pair count | 128 |
| state-identical pairs | 128 / 128 |
| context-identical pairs | 128 / 128 |
| action-identical pairs | 128 / 128 |
| final-frame-identical pairs | 128 / 128 |
| unordered-frame-bag-identical pairs | 128 / 128 |

Shortcut Probe Table

Source: baselines.json.

| Probe | AP | Pair-rank | Interpretation |
| --- | --- | --- | --- |
| last-state LR | 0.5000 | 0.0000 | non-vision shortcut controlled |
| all-actions LR | 0.5000 | 0.0000 | action shortcut controlled |
| last-frame LR | 0.5005 | 0.0859 | final-frame shortcut controlled |
| single mid-frame LR | 0.5524 | 0.2734 | weak, not dominant |
| all-frame bag LR | 0.5048 | 0.5781 | unordered frame bag controlled |
| mid-frames LR | 0.7746 | 0.7891 | partially order-aware |
| all-frames LR | 0.8762 | 0.8828 | order-aware linear pixels work |
| frame-deltas LR | 0.8525 | 0.8516 | temporal deltas are useful |
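The pair-rank column can be read with the following sketch. It assumes pair-rank is the fraction of matched pairs in which the unsafe member scores strictly higher than the safe member, with ties counted as misses. That definition is an assumption, not taken from the project code, though it is consistent with probes on identical inputs (last-state, all-actions) scoring 0.0000. `pair_rank_accuracy` is a hypothetical helper.

```python
def pair_rank_accuracy(pairs):
    """pairs: list of (score_safe, score_unsafe) for matched pairs.

    Counts a pair as correct only when the unsafe score is strictly higher
    (ties are treated as misses; this is an assumed convention).
    """
    if not pairs:
        raise ValueError("need at least one pair")
    hits = sum(1 for score_safe, score_unsafe in pairs if score_unsafe > score_safe)
    return hits / len(pairs)


# A probe that sees identical inputs for both members scores them identically.
identical_inputs = [(0.5, 0.5)] * 4
# A probe with some order signal ranks three of four toy pairs correctly.
ordered_probe = [(0.2, 0.9), (0.4, 0.6), (0.7, 0.3), (0.1, 0.8)]

print(pair_rank_accuracy(identical_inputs))
print(pair_rank_accuracy(ordered_probe))
```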

Reference Results

Source: transformer_v2_35e_panel.json and transformer_v2_minrecall080_fpr020.json.

| Method | AP | Macro F1 (used classes) | Pair-rank | Pair intervention success | FPR @ predicted class | Recall @ predicted class |
| --- | --- | --- | --- | --- | --- | --- |
| Transformer v2, 35 epochs | 0.9926 | 0.9911 | 1.0000 | 0.9844 | 0.0156 | 1.0000 |

Gate calibration selected on the calibration split with min_recall=0.80, max_false_intervention_rate=0.20:

| Split | Recall | FPR | Precision |
| --- | --- | --- | --- |
| calibration | 0.9922 | 0.0000 | 1.0000 |
| held-out test | 1.0000 | 0.0078 | 0.9922 |

This is a synthetic benchmark calibration result, not a deployment threshold.
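One way to picture threshold selection under the two constraints is the sketch below. It is not the logic of scripts/calibrate_gate.py, which is not shown in this package; `select_threshold` and its tie-breaking rule (take the highest qualifying threshold) are assumptions for illustration.

```python
def select_threshold(scores, labels, min_recall=0.80, max_fpr=0.20):
    """Pick the highest threshold that meets both constraints on calibration data.

    scores: risk scores; labels: 1 = failure, 0 = stable.
    The gate intervenes when score >= threshold.
    Returns (threshold, recall, fpr), or None if no threshold qualifies.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both classes in the calibration split")
    for threshold in sorted(set(scores), reverse=True):
        flagged = [s >= threshold for s in scores]
        true_pos = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
        false_pos = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
        recall, fpr = true_pos / positives, false_pos / negatives
        if recall >= min_recall and fpr <= max_fpr:
            return threshold, recall, fpr
    return None


# Toy calibration data (fabricated for the example, not benchmark output).
toy_scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
print(select_threshold(toy_scores, toy_labels))
```

The threshold is fixed on calibration data and then applied unchanged to the held-out test split, which is why the two rows in the table above can differ.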

Reproduction Commands

Run from the repository root:

PYTHONPATH=src python3 scripts/generate_synthetic.py \
  --domain software --split temporal_order_v2 --split-role train \
  --num-samples 512 --seed 910 --output-dir runs/benchmark_v2/data

PYTHONPATH=src python3 scripts/generate_synthetic.py \
  --domain software --split temporal_order_v2 --split-role calibration \
  --num-samples 256 --seed 1910 --output-dir runs/benchmark_v2/data

PYTHONPATH=src python3 scripts/generate_synthetic.py \
  --domain software --split temporal_order_v2 --split-role test \
  --num-samples 256 --seed 2910 --output-dir runs/benchmark_v2/data

PYTHONPATH=src python3 scripts/temporal_data_diagnostics.py \
  --train runs/benchmark_v2/data/software_synthetic_n512_seed910.npz \
  --test runs/benchmark_v2/data/software_synthetic_n256_seed2910.npz \
  --output-dir runs/benchmark_v2/diagnostics

PYTHONPATH=src python3 scripts/temporal_baselines.py \
  --train runs/benchmark_v2/data/software_synthetic_n512_seed910.npz \
  --test runs/benchmark_v2/data/software_synthetic_n256_seed2910.npz \
  --output-dir runs/benchmark_v2/baselines

PYTHONPATH=src python3 scripts/train.py --config configs/benchmark_v2_transformer.yaml

PYTHONPATH=src python3 scripts/temporal_model_eval.py \
  --config configs/benchmark_v2_transformer.yaml \
  --checkpoint runs/benchmark_v2/transformer_v2/best.pt \
  --name transformer_v2_35e \
  --split test \
  --output-dir runs/benchmark_v2/model_panels

PYTHONPATH=src python3 scripts/calibrate_gate.py \
  --config configs/benchmark_v2_transformer.yaml \
  --checkpoint runs/benchmark_v2/transformer_v2/best.pt \
  --output runs/benchmark_v2/gate_calibration/transformer_v2_minrecall080_fpr020.json \
  --min-recall 0.80 \
  --max-false-intervention-rate 0.20

PYTHONPATH=src python3 scripts/build_benchmark_v2_summary.py \
  --root runs/benchmark_v2 \
  --output runs/benchmark_v2/benchmark_v2_summary.json

Package validation:

PYTHONPATH=src python3 scripts/check_hf_package.py

Artifact Map

| File | Purpose |
| --- | --- |
| claims_ledger.md | Public claim boundary. |
| assets/release_visual.svg | Compact v1 flaw to v2 ordered-event benchmark visual. |
| artifact_index.json | Machine-readable package contents. |
| benchmark_v2_summary.json | Consolidated v2 metrics and decision. |
| baselines.json | Shortcut probes and baseline ladder. |
| baselines.md | Human-readable baseline table. |
| data_diagnostics.json | Pair controls and temporal signal diagnostics. |
| data_diagnostics.md | Human-readable diagnostics. |
| transformer_v2_35e_panel.json | Reference Transformer held-out test metrics. |
| transformer_v2_minrecall080_fpr020.json | Gate calibration diagnostics. |
| CHECKSUMS.sha256 | SHA256 checksums for all shipped non-README artifacts. |

Checksums

Verify with sha256sum -c CHECKSUMS.sha256 from the repo root.

| File | SHA256 |
| --- | --- |
| artifact_index.json | 7fb1523a96a0a7f23c4cf343fc6f5a27d7bd5fa4a844f92304b19afcca4d396e |
| claims_ledger.md | 9e20b26b5cec972be2882f4cb1333bbc18d15335c0199c78ba682113704bfb03 |
| assets/release_visual.svg | 8ac6b51e1449f19eac5edf1d4e5ce5d74c0b0f3a259c3ef8629c0d141999bbc3 |
| baselines.json | 29979a6d89071cf05e1b7ffa5ec27ae967430f24ef47b5e8d7601aef54625a93 |
| baselines.md | 051c5426158bf0957cac905844222f38ef18ea3634ba78b47e7a6e9036adf1f8 |
| benchmark_v2_summary.json | 428c87622bcb6191c4ce2081d0543da0b9a3e4c9425c10168d9e83ba4a2cbb22 |
| data_diagnostics.json | 85c704597307aa2484bda87fd5215f35cc3217cc05d1dc8a6a978d8f1792a5a9 |
| data_diagnostics.md | 361f8077142e9704fb6d45edf14c8d5c0b0c35e926a99487d78f6c2722b2ed67 |
| transformer_v2_35e_panel.json | 016dd558e2e08fa70cbb9b6f78bdd332e69d2416be78b13dad38affa7aa1574c |
| transformer_v2_minrecall080_fpr020.json | a85a02fe164e1493ccddc55d4275951a10422957e6ec62c15587be5a781d1a90 |
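Where `sha256sum` is unavailable, the same check can be approximated in Python. This is a minimal sketch that assumes CHECKSUMS.sha256 uses the usual `sha256sum` line format (hex digest, whitespace, relative path); the helper names `sha256_of` and `verify` are hypothetical and not part of the package.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(manifest: Path) -> dict:
    """Return {relative_path: matches} for each line of a sha256sum-style manifest."""
    results = {}
    root = manifest.parent
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with a leading '*'
        target = root / name
        # A missing file is reported as a mismatch rather than raising.
        results[name] = target.is_file() and sha256_of(target) == expected.lower()
    return results


manifest = Path("CHECKSUMS.sha256")
if manifest.exists():
    for name, ok in verify(manifest).items():
        print("OK  " if ok else "FAIL", name)
```

Run it from the directory that contains CHECKSUMS.sha256 so the relative paths resolve.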

Verified

  • v2 generator exists.
  • v2 matched ordered-event pairs exist.
  • v2 shortcut probes exist.
  • v2 baseline ladder exists.
  • Transformer v2 result exists.
  • Gate calibration result exists.
  • v1 flaw is documented.
  • v3 XOR counterfactual split exists locally and defeats both the ordered-LR baseline and the v2 Transformer recipe (see v3 section above).

Unverified / Not Claimed

  • No production browser traffic.
  • No live browser-agent deployment.
  • No deployable browser-agent gate.
  • No real-world website generalization.
  • No claim that deep temporal reasoning is required.
  • No claim that this benchmark proves browser-agent safety.

Limitations

  • The benchmark is synthetic.
  • The UI events are rendered primitives, not production browser traces.
  • The reference result is on generated data only.
  • A flattened all-frames LR is strong because it sees ordered frames; v2 tests ordered temporal evidence, not uniquely deep temporal reasoning. The v3 XOR split removes even this handle, and currently no model beats the v3 baseline.
  • Gate calibration is a benchmark diagnostic, not a safety policy.
  • The class schema contains unused classes in this split.

License

Apache-2.0 for the package code, artifacts, and model card. Training and test data are synthetic and generated by the reproduction commands in this repo. No third-party dataset license applies.

Next Research Steps

  • Add external browser-agent traces with consent and clear provenance.
  • Add stronger adversarial temporal controls where full-frame linear probes degrade (v3 is the first such step).
  • Add held-out UI event grammars and templates.
  • Evaluate under real browser screenshots and DOM/action logs.
  • Separate benchmark scoring from intervention-policy deployment work.
  • Train a v3-targeted model that actually beats the v3 baseline before publishing a v3 HF repo.
