TIGER-Lab/ClawBench

AgPerry committed about 11 hours ago

Commit a036d16 (verified) · 1 Parent(s): 50a75ee

V2: switch to raw Intercepted DESC sort + Hermes-only filter (visible-column order)

Files changed (1)

app.py +5 -13

@@ -33,7 +33,7 @@ INTRO = """# 🏆 ClawBench — Web Agent Benchmark
 [**📖 Paper**](https://arxiv.org/abs/2604.08523) · [**💻 GitHub**](https://github.com/reacher-z/ClawBench) · [**🗂 Dataset**](https://huggingface.co/datasets/TIGER-Lab/ClawBench) · [**🎞 Traces V1**](https://huggingface.co/datasets/NAIL-Group/ClawBenchV1Trace) · [**🎞 Traces V2**](https://huggingface.co/datasets/TIGER-Lab/ClawBenchV2Trace) · [**🌐 Site**](https://claw-bench.com)
 """
-TABLE_INTRO = """**Intercepted** (sort key) = agent's final HTTP request matched the task's URL/method schema — Stage 1, deterministic, no judge. **Reward** = additionally requires the LLM judge (default `deepseek/deepseek-v4-pro`) to confirm the payload fulfilled the instruction — Stage 2. Rows are ranked by Intercepted (corpus-normalized: `intercepted / 130` for V2 so partials don't outrank complete batches) with Reward as tiebreak. `—` = no Stage-2 data yet."""
+TABLE_INTRO = """**Intercepted** (sort key) = agent's final HTTP request matched the task's URL/method schema — Stage 1, deterministic, no judge. **Reward** = additionally requires the LLM judge (default `deepseek/deepseek-v4-pro`) to confirm the payload fulfilled the instruction — Stage 2. Rows are ranked by Intercepted DESC, then Reward DESC as tiebreak. V2 is **Hermes-only**; alternative harnesses are evaluated separately. *Partial* = batch attempted fewer than the full corpus (mid-run abort / queue cap); rates are over attempted, not over corpus."""
 ABOUT = """## About ClawBench
@@ -109,23 +109,15 @@ def load_results() -> pd.DataFrame:
     df = pd.read_csv(io.BytesIO(raw))
     if "reward_rate" not in df.columns:
         df["reward_rate"] = pd.NA
-    # Rank by corpus interception rate (intercepted_count / full_corpus_size) as
-    # the headline metric — Stage 1 is deterministic (URL/method match) and
-    # universally comparable. Tiebreak by corpus reward (passed / corpus_size)
-    # so partial batches don't outrank complete ones with lower rates.
-    df["_corpus_size"] = df["dataset"].map(CORPUS_SIZE).fillna(df["total"])
-    # `pass_rate` in our CSV is the Stage-1 intercept rate (%) over attempted.
-    # Convert it to a fraction over the full corpus.
-    df["_intercepted_count"] = (df["pass_rate"].astype(float) / 100.0 * df["total"]).round().astype(int)
-    df["_corpus_intercepted"] = df["_intercepted_count"] / df["_corpus_size"]
-    df["_corpus_reward"] = df["passed"] / df["_corpus_size"]
+    # Rank by raw Intercepted (Stage 1 rate over attempted tasks) descending, then
+    # Reward as tiebreak. Visible-column order: what you see in the Intercepted
+    # column is what sorts. Partial batches keep their attempted-rate.
     df = df.sort_values(
-        ["dataset", "_corpus_intercepted", "_corpus_reward"],
+        ["dataset", "pass_rate", "reward_rate"],
         ascending=[True, False, False],
         na_position="last",
     ).reset_index(drop=True)
     df.insert(0, "rank", df.groupby("dataset").cumcount() + 1)
-    df = df.drop(columns=["_corpus_size", "_corpus_reward", "_intercepted_count", "_corpus_intercepted"])
     df["pass_rate"] = df["pass_rate"].map(_format_pct)
     df["reward_rate"] = df["reward_rate"].map(_format_pct)
     df["wall_hours"] = df["wall_hours"].map(_format_wall)
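
Note on the effect of this change: the sketch below contrasts the removed corpus-normalized ordering with the new visible-column ordering on a tiny table. It is an illustration under stated assumptions, not the app's actual data path. The rows, the `model` column, and the helper names `rank_by_corpus` and `rank_by_visible` are hypothetical; the only corpus size taken from the source is 130 for V2, and the dataset key `"V2"` is assumed.

```python
import pandas as pd

# Assumed mapping; only the V2 size (130) is stated in the removed TABLE_INTRO.
CORPUS_SIZE = {"V2": 130}

# Made-up rows for illustration only; these are not real leaderboard results.
rows = pd.DataFrame({
    "dataset": ["V2", "V2", "V2"],
    "model": ["full-batch", "partial-batch", "no-stage2"],
    "total": [130, 20, 130],           # tasks attempted
    "passed": [39, 10, 0],             # tasks that also passed Stage 2
    "pass_rate": [50.0, 80.0, 50.0],   # Stage-1 rate (%) over attempted
    "reward_rate": [30.0, 50.0, None], # Stage-2 rate (%), NaN when missing
})


def rank_by_corpus(df: pd.DataFrame) -> pd.DataFrame:
    """Removed behavior: normalize by full corpus size, then sort."""
    df = df.copy()
    corpus_size = df["dataset"].map(CORPUS_SIZE).fillna(df["total"])
    intercepted = (df["pass_rate"].astype(float) / 100.0 * df["total"]).round().astype(int)
    df["_corpus_intercepted"] = intercepted / corpus_size
    df["_corpus_reward"] = df["passed"] / corpus_size
    df = df.sort_values(
        ["dataset", "_corpus_intercepted", "_corpus_reward"],
        ascending=[True, False, False],
        na_position="last",
    ).reset_index(drop=True)
    df = df.drop(columns=["_corpus_intercepted", "_corpus_reward"])
    df.insert(0, "rank", df.groupby("dataset").cumcount() + 1)
    return df


def rank_by_visible(df: pd.DataFrame) -> pd.DataFrame:
    """New behavior: sort directly on the displayed rate columns."""
    df = df.sort_values(
        ["dataset", "pass_rate", "reward_rate"],
        ascending=[True, False, False],
        na_position="last",
    ).reset_index(drop=True)
    df.insert(0, "rank", df.groupby("dataset").cumcount() + 1)
    return df


old_order = rank_by_corpus(rows)
new_order = rank_by_visible(rows)
print(old_order[["rank", "model"]])
print(new_order[["rank", "model"]])
```

With these made-up rows, the partial batch (16 of 20 attempted tasks intercepted) ranks last under corpus normalization, since 16/130 is well below 65/130, but ranks first under the raw sort because its attempted-rate of 80% is the highest value in the column. The row with no Stage-2 data ties on Intercepted and falls behind its peer because `na_position="last"` places the missing Reward after the present one.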