
RESEARCH BENCHMARK, NOT FOR DEPLOYMENT. This package is a synthetic counterfactual benchmark for studying ordered-temporal evidence in pre-failure detection. It is not a browser-agent guardrail and has no real-world validation.

PreFailureNet / UI-PreFail-v2

PreFailureNet / UI-PreFail-v2 is a synthetic ordered temporal counterfactual benchmark for pre-failure detection in browser-agent-like workflows. It includes shortcut probes, baseline ladders, a reference Transformer, and calibrated gate diagnostics. It is not a production browser-agent guardrail.

Headline Result vs Shallow Baseline

Sources: transformer_v2_35e_panel.json, baselines.json (entry all_frames).

Method	AP (failure vs stable)	n_test
Transformer v2, 35 epochs	`0.9926`	`256`
All-frames logistic regression (flattened frames)	`0.8762`	`256`

A shallow logistic regression over flattened frames already reaches AP 0.8762 on this split, only 0.1164 below the Transformer, so the v2 headline reflects a benchmark that does not require deep temporal modeling. See the v3 section below for a split where both methods fall to near chance.
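For readers comparing the AP values, the following is a minimal sketch of step-wise average precision. It assumes AP here means the usual ranking metric (mean precision at each positive, ranked by descending score); the repository's actual scoring code is not shown in this card, and the function name is hypothetical.

```python
def average_precision(labels, scores):
    """Step-wise average precision for binary labels (1 = failure, 0 = stable).

    Assumption: no tied scores. Library implementations group ties by
    threshold, so results can differ when scores repeat.
    """
    # Rank samples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this positive
    return precision_sum / hits if hits else 0.0


# Toy example: positives ranked 1st and 3rd out of four samples.
toy_ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
print(round(toy_ap, 4))
```

On a balanced split such as the 128/128 test pairs, a scorer with no signal lands near AP 0.5, which is the chance level the tables in this card are read against.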

v3: xor_order_color defeats v2

v3 introduces the xor_order_color family (local commit 3a323b1, file src/prefailurenet/data/synthetic.py, function _temporal_order_v3_pair). The label is the XOR of an event-order bit and a per-pair color bit painted identically into every frame, so ordered-frame linear probes lose their handle while the pair structure stays intact.
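The XOR rule described above can be illustrated with a toy truth table. This is a sketch only, not the code in `_temporal_order_v3_pair`: the function names are hypothetical, and it assumes the four bit combinations occur equally often.

```python
from itertools import product


def xor_label(order_bit: int, color_bit: int) -> int:
    """Hypothetical label rule: 1 (failure) iff exactly one bit is set."""
    return order_bit ^ color_bit


# Every (order, color, label) combination.
rows = [(o, c, xor_label(o, c)) for o, c in product((0, 1), repeat=2)]


def failure_rate_given(index: int, value: int) -> float:
    """Failure rate among rows where the chosen bit equals `value`."""
    selected = [row for row in rows if row[index] == value]
    return sum(row[2] for row in selected) / len(selected)


# Knowing only one bit leaves the failure rate at 0.5 either way.
for name, index in (("order", 0), ("color", 1)):
    print(name, failure_rate_given(index, 0), failure_rate_given(index, 1))
```

Because neither bit is informative on its own, and XOR is not linearly separable in the two bits, a linear probe has to recover an interaction term it cannot represent. That is one plausible reading of why the ordered-frame LR loses its handle.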

On v3 (test seed 2920, n_test=256, 128/128 pairs), the all-frames ordered LR drops from AP 0.8762 on v2 to AP 0.5486 on v3 (rounded to 0.549 in docs/v3_split_evaluation.md), which is below the project's 0.60 hard gate and within ~0.05 of chance. Source: `~/Projects/PreFailureNet/runs/benchmark_v3/baselines_seed2920/baselines.json` (entry all_frames, field ap_failure_vs_stable), cross-referenced in docs/v3_split_evaluation.md.

The v2 Transformer recipe retrained on v3 flatlines:

Field	Value
`precursor_detection_ap`	`0.5068`
`intervention_auroc`	`0.5002`

Source: ~/Projects/PreFailureNet/runs/benchmark_v3/transformer_v3/metrics.json (raw values 0.5068114175400058 and 0.500152587890625).

v3 inverts the v2 narrative. v2 results should be read as a defeated benchmark, not as evidence that the v2 Transformer architecture learns temporal order.

v3 will be published as a separate HF repo or addendum once a v3-targeted model exists. No model currently beats the v3 baseline.

Exact Positioning

This package is a benchmark and research artifact. The correct claim is:

Ordered temporal evidence is required under synthetic counterfactual controls (v2). v3 shows that this benchmark is itself defeated by an XOR-color counterfactual; the v2 headline is therefore a benchmark result, not a capability claim.

Do not read this release as evidence of production browser-agent safety, real-world website generalization, or a deployable intervention gate.

What This Package Contains

A public claims ledger.
A compact artifact index.
Benchmark v2 summary metrics.
Shortcut probes and baseline ladder.
Pair-matching and temporal-signal diagnostics.
Reference Transformer v2 held-out test metrics.
Risk-only gate calibration diagnostics.
Reproduction commands.
SHA256 checksums for all shipped non-README artifacts (CHECKSUMS.sha256).

Raw NPZ arrays and model checkpoint weights are not included in this compact HF package. They are reproducible from the commands below.

Tested Environment And Disk Requirements

This milestone was validated on Ubuntu 22.04 / Linux 6.8, x86_64, Python 3.10.12. The project metadata allows Python >=3.10, but only Python 3.10 was used for release validation, so use Python 3.10 for reproduction.

CUDA is optional for the published benchmark checks. The generator, diagnostics, shortcut baselines, HF package checker, and tests run on CPU. Transformer training can use CPU or CUDA depending on the local Torch install; no CUDA, Jetson, TensorRT, browser-extension, or live agent deployment result is claimed.

Disk planning:

Compact HF package: under 1 MB.
Reproduced v2 NPZ splits: about 40 MB for train/calibration/test.
Existing runs/benchmark_v2 artifact directory: about 50 MB.
Python environment with Torch, OpenCV, scikit-learn, and Pillow: budget 2-5 GB.

Why v1 Was Flawed

The original temporal counterfactual split was useful but flawed:

It matched state, context, action tokens, and final frame.
The label signal lived in 1-2 mid-sequence visual patches.
Patch position and active steps were family-structured.
A logistic regression on mid-frame pixels reached AP 0.9249 and pair-rank 0.9875.
The strongest v1 Transformer reached AP 0.8746, so it still lost to the mid-frame LR probe.

Conclusion: v1 was shortcut-controlled, but it rewarded localized mid-frame patch detection more than temporal reasoning. v1 remains documented as a negative result and shortcut-control scaffold.

What v2 Fixes

UI-PreFail-v2 / temporal_order_v2 changes the label rule:

Safe and unsafe samples have the same UI template/style/distractors.
Safe and unsafe samples have identical state, context, and action tokens.
Safe and unsafe samples have the same final frame.
Safe and unsafe samples have the same unordered frame bag.
The label depends on the order of two event-bearing frames.

This makes static, final-frame, non-vision, single mid-frame, and unordered-bag probes weak by construction. Ordered frame sequences and temporal deltas still contain useful signal. (v3 closes the remaining handle. See above.)
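The pair constraints listed above can be sketched with toy frames. This is an illustration under stated assumptions, not the v2 generator: frames are stand-in strings rather than images, `make_pair` is a hypothetical helper, and which order counts as safe is chosen arbitrarily.

```python
from collections import Counter


def make_pair(background, event_a, event_b, final):
    """Build a toy (safe, unsafe) pair that differs only in event order."""
    safe = [background, event_a, background, event_b, final]
    unsafe = [background, event_b, background, event_a, final]
    return safe, unsafe


safe, unsafe = make_pair("bg", "A", "B", "done")

same_final = safe[-1] == unsafe[-1]          # final-frame probe sees no difference
same_bag = Counter(safe) == Counter(unsafe)  # unordered-bag probe sees no difference
same_order = safe == unsafe                  # only the ordered sequence differs

print(same_final, same_bag, same_order)
```

A probe that only sees the final frame or the multiset of frames receives identical inputs for both members of the pair, so it cannot separate them; only a probe that keeps frame positions can.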

Held-Out v2 Pair Controls

Source: data_diagnostics.json.

Control	Held-out test result
pair count	`128`
state-identical pairs	`128 / 128`
context-identical pairs	`128 / 128`
action-identical pairs	`128 / 128`
final-frame-identical pairs	`128 / 128`
unordered-frame-bag-identical pairs	`128 / 128`

Shortcut Probe Table

Source: baselines.json.

Probe	AP	Pair-rank	Interpretation
last-state LR	`0.5000`	`0.0000`	non-vision shortcut controlled
all-actions LR	`0.5000`	`0.0000`	action shortcut controlled
last-frame LR	`0.5005`	`0.0859`	final-frame shortcut controlled
single mid-frame LR	`0.5524`	`0.2734`	weak, not dominant
all-frame bag LR	`0.5048`	`0.5781`	unordered frame bag controlled
mid-frames LR	`0.7746`	`0.7891`	partially order-aware
all-frames LR	`0.8762`	`0.8828`	order-aware linear pixels work
frame-deltas LR	`0.8525`	`0.8516`	temporal deltas are useful
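Pair-rank is not defined in this card. One plausible reading, assumed here, is the fraction of matched pairs in which the unsafe member receives a strictly higher score than its safe twin. The function below is a hypothetical sketch of that reading, not the code in scripts/temporal_baselines.py.

```python
def pair_rank_accuracy(pairs):
    """Fraction of (safe_score, unsafe_score) pairs ranked correctly.

    Assumption: the comparison is strict, so tied scores count as misses.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair is required")
    wins = sum(1 for safe_score, unsafe_score in pairs if unsafe_score > safe_score)
    return wins / len(pairs)


# Toy scores: two correct rankings, one tie, one inversion.
toy_pairs = [(0.1, 0.9), (0.5, 0.5), (0.7, 0.2), (0.3, 0.4)]
print(pair_rank_accuracy(toy_pairs))
```

Under this reading, a probe whose inputs are identical within every pair produces only ties and scores 0.0, which would be consistent with the `0.0000` entries for the last-state and all-actions probes. That is an inference from the table, not something the card states.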

Reference Results

Source: transformer_v2_35e_panel.json and transformer_v2_minrecall080_fpr020.json.

Method	AP	Macro F1 (used classes)	Pair-rank	Pair intervention success	FPR @ predicted class	Recall @ predicted class
Transformer v2, 35 epochs	`0.9926`	`0.9911`	`1.0000`	`0.9844`	`0.0156`	`1.0000`

Gate calibration selected on the calibration split with min_recall=0.80, max_false_intervention_rate=0.20:

Split	Recall	FPR	Precision
calibration	`0.9922`	`0.0000`	`1.0000`
held-out test	`1.0000`	`0.0078`	`0.9922`

This is a synthetic benchmark calibration result, not a deployment threshold.
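A minimal sketch of constraint-based threshold selection on a calibration split follows. The selection rule (take the highest score threshold that meets both constraints) is an assumption, as are the function and variable names; the behavior of scripts/calibrate_gate.py is not documented in this card.

```python
def select_gate_threshold(labels, scores, min_recall=0.80, max_false_intervention_rate=0.20):
    """Return (threshold, recall, false_intervention_rate) or None.

    labels: 1 = failure, 0 = stable. The gate fires when score >= threshold.
    Assumes the split contains at least one positive and one negative.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    # Scan candidate thresholds from strictest to most permissive.
    for threshold in sorted(set(scores), reverse=True):
        fired = [score >= threshold for score in scores]
        true_pos = sum(1 for f, y in zip(fired, labels) if f and y == 1)
        false_pos = sum(1 for f, y in zip(fired, labels) if f and y == 0)
        recall = true_pos / positives
        false_rate = false_pos / negatives
        if recall >= min_recall and false_rate <= max_false_intervention_rate:
            return threshold, recall, false_rate
    return None


# Toy calibration split with five failures and five stable samples.
cal_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
cal_scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.5, 0.4, 0.3, 0.1, 0.05]
gate = select_gate_threshold(cal_labels, cal_scores)
print(gate)
```

The threshold is fixed on the calibration split and then applied unchanged to the held-out test split, which is why the two rows of the table above can differ.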

Reproduction Commands

Run from the repository root:

```bash
PYTHONPATH=src python3 scripts/generate_synthetic.py \
  --domain software --split temporal_order_v2 --split-role train \
  --num-samples 512 --seed 910 --output-dir runs/benchmark_v2/data

PYTHONPATH=src python3 scripts/generate_synthetic.py \
  --domain software --split temporal_order_v2 --split-role calibration \
  --num-samples 256 --seed 1910 --output-dir runs/benchmark_v2/data

PYTHONPATH=src python3 scripts/generate_synthetic.py \
  --domain software --split temporal_order_v2 --split-role test \
  --num-samples 256 --seed 2910 --output-dir runs/benchmark_v2/data

PYTHONPATH=src python3 scripts/temporal_data_diagnostics.py \
  --train runs/benchmark_v2/data/software_synthetic_n512_seed910.npz \
  --test runs/benchmark_v2/data/software_synthetic_n256_seed2910.npz \
  --output-dir runs/benchmark_v2/diagnostics

PYTHONPATH=src python3 scripts/temporal_baselines.py \
  --train runs/benchmark_v2/data/software_synthetic_n512_seed910.npz \
  --test runs/benchmark_v2/data/software_synthetic_n256_seed2910.npz \
  --output-dir runs/benchmark_v2/baselines

PYTHONPATH=src python3 scripts/train.py --config configs/benchmark_v2_transformer.yaml

PYTHONPATH=src python3 scripts/temporal_model_eval.py \
  --config configs/benchmark_v2_transformer.yaml \
  --checkpoint runs/benchmark_v2/transformer_v2/best.pt \
  --name transformer_v2_35e \
  --split test \
  --output-dir runs/benchmark_v2/model_panels

PYTHONPATH=src python3 scripts/calibrate_gate.py \
  --config configs/benchmark_v2_transformer.yaml \
  --checkpoint runs/benchmark_v2/transformer_v2/best.pt \
  --output runs/benchmark_v2/gate_calibration/transformer_v2_minrecall080_fpr020.json \
  --min-recall 0.80 \
  --max-false-intervention-rate 0.20

PYTHONPATH=src python3 scripts/build_benchmark_v2_summary.py \
  --root runs/benchmark_v2 \
  --output runs/benchmark_v2/benchmark_v2_summary.json
```

Package validation:

```bash
PYTHONPATH=src python3 scripts/check_hf_package.py
```

Artifact Map

File	Purpose
`claims_ledger.md`	Public claim boundary.
`assets/release_visual.svg`	Compact v1 flaw to v2 ordered-event benchmark visual.
`artifact_index.json`	Machine-readable package contents.
`benchmark_v2_summary.json`	Consolidated v2 metrics and decision.
`baselines.json`	Shortcut probes and baseline ladder.
`baselines.md`	Human-readable baseline table.
`data_diagnostics.json`	Pair controls and temporal signal diagnostics.
`data_diagnostics.md`	Human-readable diagnostics.
`transformer_v2_35e_panel.json`	Reference Transformer held-out test metrics.
`transformer_v2_minrecall080_fpr020.json`	Gate calibration diagnostics.
`CHECKSUMS.sha256`	SHA256 checksums for all shipped non-README artifacts.

Checksums

Verify with `sha256sum -c CHECKSUMS.sha256` from the repo root.

File	SHA256
`artifact_index.json`	`7fb1523a96a0a7f23c4cf343fc6f5a27d7bd5fa4a844f92304b19afcca4d396e`
`claims_ledger.md`	`9e20b26b5cec972be2882f4cb1333bbc18d15335c0199c78ba682113704bfb03`
`assets/release_visual.svg`	`8ac6b51e1449f19eac5edf1d4e5ce5d74c0b0f3a259c3ef8629c0d141999bbc3`
`baselines.json`	`29979a6d89071cf05e1b7ffa5ec27ae967430f24ef47b5e8d7601aef54625a93`
`baselines.md`	`051c5426158bf0957cac905844222f38ef18ea3634ba78b47e7a6e9036adf1f8`
`benchmark_v2_summary.json`	`428c87622bcb6191c4ce2081d0543da0b9a3e4c9425c10168d9e83ba4a2cbb22`
`data_diagnostics.json`	`85c704597307aa2484bda87fd5215f35cc3217cc05d1dc8a6a978d8f1792a5a9`
`data_diagnostics.md`	`361f8077142e9704fb6d45edf14c8d5c0b0c35e926a99487d78f6c2722b2ed67`
`transformer_v2_35e_panel.json`	`016dd558e2e08fa70cbb9b6f78bdd332e69d2416be78b13dad38affa7aa1574c`
`transformer_v2_minrecall080_fpr020.json`	`a85a02fe164e1493ccddc55d4275951a10422957e6ec62c15587be5a781d1a90`
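Where `sha256sum` is unavailable, the same check can be sketched with the Python standard library. This assumes CHECKSUMS.sha256 uses the usual `sha256sum` line format (hex digest, whitespace, relative path); the function names are hypothetical and not part of the package.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(manifest: Path) -> dict:
    """Map each listed file to True (digest matches) or False (missing or mismatched).

    Assumption: each line looks like '<hex digest>  <relative path>'.
    """
    results = {}
    root = manifest.parent
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # '*' marks binary mode in sha256sum output
        target = root / name
        results[name] = target.is_file() and sha256_of(target) == expected.lower()
    return results
```

Calling `verify_checksums(Path("CHECKSUMS.sha256"))` from the repo root returns one entry per listed artifact; any `False` value means the file is missing or differs from the digest in the table above.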

Verified

v2 generator exists.
v2 matched ordered-event pairs exist.
v2 shortcut probes exist.
v2 baseline ladder exists.
Transformer v2 result exists.
Gate calibration result exists.
v1 flaw is documented.
v3 XOR counterfactual split exists locally and defeats both the ordered-LR baseline and the v2 Transformer recipe (see v3 section above).

Unverified / Not Claimed

No production browser traffic.
No live browser-agent deployment.
No deployable browser-agent gate.
No real-world website generalization.
No claim that deep temporal reasoning is required.
No claim that this benchmark proves browser-agent safety.

Limitations

The benchmark is synthetic.
The UI events are rendered primitives, not production browser traces.
The reference result is on generated data only.
A flattened all-frames LR is strong because it sees ordered frames; v2 tests ordered temporal evidence, not uniquely deep temporal reasoning. The v3 XOR split removes even this handle and currently has no model that beats baseline.
Gate calibration is a benchmark diagnostic, not a safety policy.
The class schema contains unused classes in this split.

License

Apache-2.0 for the package code, artifacts, and model card. Training and test data are synthetic and generated by the reproduction commands in this repo. No third-party dataset license applies.

Next Research Steps

Add external browser-agent traces with consent and clear provenance.
Add stronger adversarial temporal controls where full-frame linear probes degrade (v3 is the first such step).
Add held-out UI event grammars and templates.
Evaluate under real browser screenshots and DOM/action logs.
Separate benchmark scoring from intervention-policy deployment work.
Train a v3-targeted model that actually beats the v3 baseline before publishing a v3 HF repo.


Evaluation results

All values are self-reported on UI-PreFail-v2 temporal_order_v2 (seed 2910 test split, n=256).

Metric	Value
AP (failure vs stable)	`0.993`
Macro F1 (used classes)	`0.991`
Pair-rank accuracy	`1.000`
Pair intervention success	`0.984`
FPR at predicted class	`0.016`
Recall at predicted class	`1.000`
Gate test recall	`1.000`
Gate test FPR (false intervention rate)	`0.008`