CoCoOne committed on
Commit 28a3671 · verified · 1 Parent(s): 3c9abae

chore: create final submit space

Files changed (6)
  1. README.md +66 -6
  2. __init__.py +1 -0
  3. app.py +660 -0
  4. repo_ops.py +114 -0
  5. requirements.txt +2 -0
  6. validator.py +518 -0
README.md CHANGED
@@ -1,12 +1,72 @@
  ---
- title: ResearchClawBench Task Submit
- emoji: 📚
- colorFrom: indigo
- colorTo: purple
+ title: ResearchClawBench Task Submission
+ emoji: 📦
+ colorFrom: blue
+ colorTo: indigo
  sdk: gradio
- sdk_version: 6.10.0
+ sdk_version: 5.49.1
  app_file: app.py
  pinned: false
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # ResearchClawBench Hugging Face Submission Space
+
+ This directory contains a deployable MVP for a Hugging Face Space that lets users submit a new ResearchClawBench task as a zip archive.
+
+ ## What it does
+
+ - accepts a single `.zip` upload
+ - requires exactly one top-level task directory inside the archive
+ - validates the full ResearchClawBench task structure and JSON/path format
+ - allocates the next available `Domain_NNN` task id from the Hugging Face dataset repo
+ - creates a PR against `InternScience/ResearchClawBench` when validation passes
+
+ ## Files
+
+ - `app.py`: Gradio Space UI
+ - `validator.py`: archive extraction and task-format validation
+ - `repo_ops.py`: Hugging Face repo scanning, task-id allocation, and PR creation
+ - `requirements.txt`: extra Python dependencies beyond the built-in Gradio SDK
+
+ ## Expected upload format
+
+ The uploaded zip must contain exactly one task directory:
+
+ ```text
+ Astronomy_submission.zip
+ └── some_folder_name/
+     ├── task_info.json
+     ├── data/
+     ├── related_work/
+     └── target_study/
+         ├── checklist.json
+         ├── paper.pdf
+         └── images/
+ ```
+
+ The top-level directory name inside the zip does not need to be the final task id. The Space validates the structure, then renames the directory to the next available `Domain_NNN` id when opening the PR.
+
+ ## Required environment variables / Space secrets
+
+ - `RCB_SPACE_HF_TOKEN` or `HF_TOKEN`: Hugging Face write token used to create PRs against `InternScience/ResearchClawBench`
+
+ Optional limits:
+
+ - `RCB_SPACE_MAX_FILES`
+ - `RCB_SPACE_MAX_TOTAL_BYTES`
+ - `RCB_SPACE_MAX_SINGLE_FILE_BYTES`
+
+ ## Local run
+
+ ```bash
+ cd ResearchClawBench-Self/huggingface/space_submitter
+ python -m pip install gradio==5.49.1 -r requirements.txt
+ python app.py
+ ```
+
+ ## Notes
+
+ - validation does not modify the main `ResearchClawBench` repo
+ - PR creation targets the Hugging Face dataset repo directly with `create_pr=True`
+ - after a PR is created, maintainers still decide whether to merge it
+ - on Hugging Face Spaces, the Gradio version comes from the README YAML `sdk_version`, not from `requirements.txt`
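For reference, an archive matching the expected upload layout can be assembled with Python's standard `zipfile` module. This is a minimal sketch of the structure only; the member file names inside `data/`, `related_work/`, and `images/` are placeholders, and the JSON payloads here would not pass the content checks of a real submission.

```python
import json
import tempfile
import zipfile
from pathlib import Path

# Assemble a minimal archive matching the expected layout.
# Contents are placeholders; a real submission needs valid task data.
archive = Path(tempfile.mkdtemp()) / 'Astronomy_submission.zip'
root = 'some_folder_name'
with zipfile.ZipFile(archive, 'w') as zf:
    zf.writestr(f'{root}/task_info.json', json.dumps({'task': {}, 'data': []}))
    zf.writestr(f'{root}/data/README.txt', 'datasets go here')
    zf.writestr(f'{root}/related_work/paper_000.pdf', b'%PDF-1.4')
    zf.writestr(f'{root}/target_study/checklist.json', json.dumps([]))
    zf.writestr(f'{root}/target_study/paper.pdf', b'%PDF-1.4')
    zf.writestr(f'{root}/target_study/images/fig_000.png', b'')

# The Space's first structural check: exactly one top-level directory.
with zipfile.ZipFile(archive) as zf:
    top_level = {name.split('/', 1)[0] for name in zf.namelist()}
assert top_level == {root}
```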
__init__.py ADDED
@@ -0,0 +1 @@
+ """ResearchClawBench Hugging Face Space submission tools."""
app.py ADDED
@@ -0,0 +1,660 @@
+ from __future__ import annotations
+
+ import json
+ from pathlib import Path
+
+ import gradio as gr
+
+ try:
+     from .repo_ops import DEFAULT_REPO_ID, allocate_next_task_id, create_dataset_pr, list_existing_task_ids, load_hf_token
+     from .validator import (
+         DOMAINS,
+         PreparedSubmission,
+         SubmissionMetadata,
+         ValidationError,
+         build_public_report,
+         cleanup_work_dir,
+         normalize_domain_token,
+         validate_and_prepare_submission,
+     )
+ except ImportError:
+     from repo_ops import DEFAULT_REPO_ID, allocate_next_task_id, create_dataset_pr, list_existing_task_ids, load_hf_token
+     from validator import (
+         DOMAINS,
+         PreparedSubmission,
+         SubmissionMetadata,
+         ValidationError,
+         build_public_report,
+         cleanup_work_dir,
+         normalize_domain_token,
+         validate_and_prepare_submission,
+     )
+
+
+ SPACE_TITLE = 'ResearchClawBench Task Submission'
+ GITHUB_REPO_URL = 'https://github.com/InternScience/ResearchClawBench'
+ DATASET_URL = f'https://huggingface.co/datasets/{DEFAULT_REPO_ID}'
+ SPACE_URL = 'https://huggingface.co/spaces/InternScience/ResearchClawBench-Task-Submit'
+
+ CSS = """
+ @import url('https://fonts.googleapis.com/css2?family=Manrope:wght@400;500;600;700;800&display=swap');
+
+ :root {
+     --page-text: #0f172a;
+     --page-muted: #526075;
+     --page-line: rgba(15, 23, 42, 0.12);
+     --page-surface: rgba(255, 255, 255, 0.78);
+     --page-surface-strong: #ffffff;
+ }
+
+ body {
+     background:
+         radial-gradient(circle at top left, rgba(54, 107, 245, 0.12), transparent 34%),
+         radial-gradient(circle at top right, rgba(15, 118, 110, 0.08), transparent 28%),
+         linear-gradient(180deg, #f8fafc 0%, #f3f6fb 55%, #f6f8fb 100%);
+     color: var(--page-text);
+ }
+
+ body,
+ button,
+ input,
+ textarea {
+     font-family: 'Manrope', 'Noto Sans SC', 'PingFang SC', 'Microsoft YaHei', sans-serif !important;
+ }
+
+ .gradio-container {
+     max-width: 1220px !important;
+     margin: 0 auto !important;
+     padding: 34px 28px 56px !important;
+     --block-background-fill: transparent;
+     --block-border-width: 0px;
+     --block-border-color: transparent;
+     --block-label-background-fill: transparent;
+     --block-label-border-width: 0px;
+     --panel-background-fill: transparent;
+     --panel-border-width: 0px;
+     --panel-border-color: transparent;
+     --background-fill-secondary: transparent;
+     --body-background-fill: transparent;
+ }
+
+ .page-shell {
+     margin-top: 26px;
+     padding: 30px 34px 34px;
+     background: #ffffff;
+     border: 1px solid rgba(15, 23, 42, 0.08);
+     border-radius: 22px;
+     box-shadow: 0 18px 48px rgba(15, 23, 42, 0.05);
+ }
+
+ .hero {
+     padding: 38px 42px 34px;
+     border-radius: 24px;
+     color: #f8fbff;
+     background:
+         radial-gradient(circle at 14% 18%, rgba(255, 255, 255, 0.16), transparent 18%),
+         linear-gradient(135deg, #0f274d 0%, #133c7c 46%, #124f75 100%);
+     box-shadow: 0 26px 60px rgba(15, 39, 77, 0.18);
+ }
+
+ .hero h1 {
+     margin: 0;
+     font-size: 2.4rem;
+     line-height: 1.02;
+     letter-spacing: -0.04em;
+     color: #f8fbff !important;
+     text-shadow: 0 1px 12px rgba(0, 0, 0, 0.14);
+ }
+
+ .hero-copy {
+     margin-top: 16px;
+     max-width: 860px;
+     font-size: 1.04rem;
+     line-height: 1.72;
+     color: rgba(248, 251, 255, 0.9) !important;
+ }
+
+ .hero-links {
+     display: flex;
+     gap: 14px;
+     flex-wrap: wrap;
+     margin-top: 22px;
+ }
+
+ .hero-links a {
+     color: #f8fbff !important;
+     text-decoration: none;
+     font-weight: 700;
+     letter-spacing: -0.01em;
+ }
+
+ .hero-links a:hover {
+     text-decoration: underline;
+ }
+
+ .hero-meta {
+     margin-top: 18px;
+     font-size: 0.93rem;
+     color: rgba(248, 251, 255, 0.72) !important;
+ }
+
+ .section-row {
+     margin-top: 30px;
+ }
+
+ .section-row,
+ .section-row > div,
+ .section-copy,
+ .section-copy > div,
+ .main-form,
+ .side-notes {
+     background: transparent !important;
+     border: 0 !important;
+     box-shadow: none !important;
+ }
+
+ .section-copy h2 {
+     margin: 0 0 10px;
+     font-size: 1.2rem;
+     letter-spacing: -0.03em;
+ }
+
+ .section-copy h3 {
+     margin: 24px 0 8px;
+     font-size: 1rem;
+ }
+
+ .section-copy p,
+ .section-copy li {
+     color: #5a667a;
+     line-height: 1.72;
+ }
+
+ .section-copy ul,
+ .section-copy ol {
+     margin: 10px 0 0;
+     padding-left: 1.2rem;
+ }
+
+ .section-copy code {
+     font-size: 0.95em;
+ }
+
+ .section-copy .prose {
+     max-width: 100%;
+ }
+
+ .subtle-block {
+     padding-bottom: 22px;
+     border-bottom: 1px solid var(--page-line);
+ }
+
+ .section-copy .prose,
+ .section-copy .prose *,
+ .section-copy .md,
+ .section-copy .md *,
+ .section-copy .markdown,
+ .section-copy .markdown * {
+     background: transparent !important;
+ }
+
+ .main-form {
+     padding-right: 14px;
+ }
+
+ .side-notes {
+     padding-left: 10px;
+ }
+
+ .caption {
+     margin-top: 4px;
+     color: var(--page-muted);
+     font-size: 0.93rem;
+     line-height: 1.6;
+ }
+
+ .field-label {
+     margin: 18px 0 8px;
+     color: var(--page-text);
+     font-size: 0.95rem;
+     font-weight: 700;
+     letter-spacing: -0.01em;
+ }
+
+ .results-shell {
+     margin-top: 26px;
+     padding-top: 22px;
+     border-top: 1px solid var(--page-line);
+ }
+
+ .action-row {
+     margin-top: 10px;
+ }
+
+ .upload-row {
+     margin-top: 18px;
+     margin-bottom: 10px;
+ }
+
+ .upload-button button {
+     border-radius: 12px !important;
+     min-height: 48px !important;
+     padding: 0 18px !important;
+     background: #ffffff !important;
+     color: var(--page-text) !important;
+     border: 1px solid rgba(19, 70, 162, 0.16) !important;
+     box-shadow: 0 8px 22px rgba(15, 23, 42, 0.04) !important;
+ }
+
+ .upload-status {
+     padding-top: 10px;
+ }
+
+ .upload-status p {
+     margin: 0 !important;
+     color: var(--page-muted) !important;
+ }
+
+ .primary-button button,
+ .secondary-button button {
+     border-radius: 12px !important;
+     min-height: 48px !important;
+     font-weight: 700 !important;
+     letter-spacing: -0.01em;
+ }
+
+ .primary-button button {
+     background: linear-gradient(135deg, #1346a2 0%, #155eef 100%) !important;
+     box-shadow: 0 16px 32px rgba(21, 94, 239, 0.2) !important;
+ }
+
+ .secondary-button button {
+     background: var(--page-surface-strong) !important;
+     color: var(--page-text) !important;
+     border: 1px solid rgba(15, 23, 42, 0.12) !important;
+ }
+
+ .gradio-container .block,
+ .gradio-container .gr-box,
+ .gradio-container .gr-form,
+ .gradio-container .gr-group,
+ .gradio-container .form,
+ .gradio-container .input-container,
+ .gradio-container .wrap,
+ .gradio-container .row,
+ .gradio-container .column,
+ .gradio-container fieldset {
+     background: transparent !important;
+     box-shadow: none !important;
+     border-color: transparent !important;
+ }
+
+ .gradio-container input:not([type="checkbox"]),
+ .gradio-container textarea,
+ .gradio-container button[aria-haspopup="listbox"],
+ .gradio-container button[role="listbox"],
+ .gradio-container .wrap:has(input:not([type="checkbox"])),
+ .gradio-container .wrap:has(textarea),
+ .gradio-container .wrap:has(button[aria-haspopup="listbox"]),
+ .gradio-container .wrap:has(button[role="listbox"]),
+ .gradio-container .wrap:has(select),
+ .gradio-container .input-container:has(input:not([type="checkbox"])),
+ .gradio-container .input-container:has(textarea),
+ .gradio-container .input-container:has(button[aria-haspopup="listbox"]),
+ .gradio-container .input-container:has(button[role="listbox"]),
+ .gradio-container input:not([type="checkbox"]),
+ .gradio-container textarea {
+     background: var(--page-surface-strong) !important;
+     border: 1px solid rgba(19, 70, 162, 0.16) !important;
+     border-radius: 10px !important;
+     box-shadow: 0 1px 0 rgba(15, 23, 42, 0.02), 0 8px 22px rgba(15, 23, 42, 0.04) !important;
+ }
+
+ .gradio-container .block,
+ .gradio-container .wrap,
+ .gradio-container .gr-box,
+ .gradio-container .gr-form,
+ .gradio-container .gr-panel,
+ .gradio-container .gr-group,
+ .gradio-container .form,
+ .gradio-container .input-container,
+ .gradio-container .wrap-inner {
+     overflow: visible !important;
+ }
+
+ .gradio-container label,
+ .gradio-container .label-wrap,
+ .gradio-container .caption-label {
+     color: var(--page-text) !important;
+ }
+
+ .link-list a {
+     color: #1346a2;
+     text-decoration: none;
+     font-weight: 600;
+ }
+
+ .link-list a:hover {
+     text-decoration: underline;
+ }
+
+ @media (max-width: 900px) {
+     .gradio-container {
+         padding: 22px 16px 42px !important;
+     }
+
+     .hero {
+         padding: 28px 24px 26px;
+         border-radius: 20px;
+     }
+
+     .hero h1 {
+         font-size: 2rem;
+     }
+
+     .main-form,
+     .side-notes {
+         padding-right: 0;
+         padding-left: 0;
+     }
+ }
+ """
+
+
+ def build_hero_html() -> str:
+     return f"""
+     <section class="hero">
+       <h1>{SPACE_TITLE}</h1>
+       <p class="hero-copy">
+         Submit a new ResearchClawBench task as a single ZIP archive. This Space validates the full task
+         structure, checks JSON fields and referenced paths, allocates the next available task ID, and then
+         opens a PR against the official Hugging Face dataset for maintainer review.
+       </p>
+       <div class="hero-links">
+         <a href="{GITHUB_REPO_URL}" target="_blank">GitHub Repository</a>
+         <a href="{DATASET_URL}" target="_blank">Hugging Face Dataset</a>
+         <a href="{SPACE_URL}" target="_blank">Space Repository</a>
+       </div>
+       <div class="hero-meta">
+         ZIP upload only · full task-format validation · PR to dataset repo after passing checks
+       </div>
+     </section>
+     """
+
+
+ def field_label_html(text: str) -> str:
+     return f'<div class="field-label">{text}</div>'
+
+
+ def submission_guide_markdown() -> str:
+     return """
+ ## Before You Upload
+
+ 1. Put exactly one task directory at the top level of the ZIP.
+ 2. Make sure the directory contains `task_info.json`, `data/`, `related_work/`, and `target_study/`.
+ 3. Keep every data reference inside `task_info.json` in the `./data/...` format.
+ 4. Make sure every checklist image path points to `target_study/images/...`.
+ 5. Ensure that uploaded files can be redistributed through Hugging Face before submitting.
+
+ ## Expected ZIP Layout
+
+ ```text
+ your_submission.zip
+ └── any_folder_name/
+     ├── task_info.json
+     ├── data/
+     ├── related_work/
+     └── target_study/
+         ├── checklist.json
+         ├── paper.pdf
+         └── images/
+ ```
+
+ ## What The Space Checks
+
+ - top-level folder structure and missing or extra files
+ - `task_info.json` and `checklist.json` parseability and required keys
+ - file naming conventions such as `related_work/paper_000.pdf`
+ - whether declared data paths actually exist
+ - whether image references actually exist
+ - whether old source paths or stale `/tasks/...` references remain in descriptions
+
+ Example task in GitHub:
+ [tasks/Astronomy_000](https://github.com/InternScience/ResearchClawBench/tree/main/tasks/Astronomy_000)
+ """
+
+
+ def final_task_help_html() -> str:
+     return (
+         '<div class="caption">'
+         'The final task ID is assigned automatically after the Space scans existing <code>tasks/</code> folders. '
+         'You do not need to choose the numeric suffix yourself. The selected domain becomes the prefix, and if the '
+         'custom field is filled, it overrides the suggested domain.'
+         '</div>'
+     )
+
+
+ def resolve_domain(selected_domain: str, custom_domain: str) -> str:
+     raw_value = (custom_domain or '').strip() or (selected_domain or '').strip()
+     normalized = normalize_domain_token(raw_value)
+     if not normalized:
+         raise ValidationError('Please select a suggested domain or provide a custom domain.')
+     return normalized
+
+
+ def handle_archive_upload(archive_path: str | None):
+     if not archive_path:
+         return '', 'No ZIP file selected yet.'
+     filename = Path(archive_path).name
+     return archive_path, f'Selected ZIP: `{filename}`'
+
+
+ def build_validation_markdown(prepared: PreparedSubmission) -> str:
+     metadata = prepared.metadata
+     return '\n'.join([
+         '## Validation passed',
+         '',
+         f'- Final task ID: `{prepared.assigned_task_id}`',
+         '- This is the folder name that will be created under `tasks/` in the dataset repo.',
+         f'- Domain token used for allocation: `{metadata.domain}`',
+         f'- Submitter: `{metadata.submitter}`',
+         f'- Archive file count: `{prepared.archive_stats.file_count}`',
+         f'- Archive total bytes: `{prepared.archive_stats.total_bytes}`',
+         '',
+         'You can now create a PR to the Hugging Face dataset repo.',
+     ])
+
+
+ def build_failure_markdown(message: str) -> str:
+     items = [line.strip() for line in message.splitlines() if line.strip()]
+     bullets = '\n'.join(f'- {item}' for item in items) if items else '- Unknown validation error'
+     return f'## Validation failed\n\n{bullets}'
+
+
+ def validate_submission(
+     archive_path: str,
+     suggested_domain: str,
+     custom_domain: str,
+     submitter: str,
+     email: str,
+     paper_title: str,
+     paper_url: str,
+     notes: str,
+     current_state: dict | None,
+ ):
+     if current_state:
+         cleanup_work_dir(current_state.get('work_dir'))
+
+     if not archive_path:
+         return None, '', '## Validation failed\n\n- Please upload a zip file.', '{}', gr.update(interactive=False), ''
+
+     domain = resolve_domain(suggested_domain, custom_domain)
+     metadata = SubmissionMetadata(
+         domain=domain,
+         submitter=submitter,
+         email=email,
+         paper_title=paper_title,
+         paper_url=paper_url,
+         notes=notes or '',
+     )
+
+     try:
+         existing_ids = list_existing_task_ids(repo_id=DEFAULT_REPO_ID, token=load_hf_token())
+         assigned_task_id = allocate_next_task_id(domain, existing_ids)
+         prepared = validate_and_prepare_submission(archive_path, metadata, assigned_task_id)
+         pr_ready = bool(load_hf_token())
+         return (
+             prepared.to_state(),
+             prepared.assigned_task_id,
+             build_validation_markdown(prepared),
+             json.dumps(build_public_report(prepared), indent=2, ensure_ascii=False),
+             gr.update(interactive=pr_ready),
+             '' if pr_ready else 'Validation passed, but PR creation is disabled until a write token is configured.',
+         )
+     except ValidationError as exc:
+         return (
+             None,
+             '',
+             build_failure_markdown(str(exc)),
+             json.dumps({'status': 'error', 'errors': str(exc).splitlines()}, indent=2, ensure_ascii=False),
+             gr.update(interactive=False),
+             '',
+         )
+     except Exception as exc:
+         return (
+             None,
+             '',
+             build_failure_markdown(str(exc)),
+             json.dumps({'status': 'error', 'errors': [str(exc)]}, indent=2, ensure_ascii=False),
+             gr.update(interactive=False),
+             '',
+         )
+
+
+ def create_pr(state: dict | None):
+     if not state:
+         return '## PR creation failed\n\n- Validate a submission first.'
+
+     prepared = PreparedSubmission.from_state(state)
+     try:
+         commit_info = create_dataset_pr(prepared, repo_id=DEFAULT_REPO_ID, token=load_hf_token())
+         pr_url = commit_info.pr_url or commit_info.commit_url
+         return '\n'.join([
+             '## PR created',
+             '',
+             f'- Task ID: `{prepared.assigned_task_id}`',
+             f'- PR: {pr_url}',
+         ])
+     finally:
+         cleanup_work_dir(prepared.work_dir)
+
+
+ with gr.Blocks(title=SPACE_TITLE, theme=gr.themes.Base(), css=CSS, fill_width=True) as demo:
+     state = gr.State(None)
+     archive_state = gr.State('')
+
+     gr.HTML(build_hero_html())
+
+     with gr.Group(elem_classes=['page-shell']):
+         with gr.Row(elem_classes=['section-row']):
+             with gr.Column(scale=7, elem_classes=['section-copy', 'main-form']):
+                 gr.HTML(field_label_html('Task ZIP archive'))
+                 with gr.Row(elem_classes=['upload-row']):
+                     archive = gr.UploadButton(
+                         'Select ZIP file',
+                         file_types=['.zip'],
+                         file_count='single',
+                         type='filepath',
+                         variant='secondary',
+                         elem_classes=['upload-button'],
+                     )
+                     archive_notice = gr.Markdown('No ZIP file selected yet.', elem_classes=['upload-status'])
+                 with gr.Row():
+                     with gr.Column():
+                         gr.HTML(field_label_html('Suggested domain'))
+                         suggested_domain = gr.Dropdown(
+                             choices=list(DOMAINS),
+                             value='Astronomy',
+                             show_label=False,
+                             container=False,
+                         )
+                     with gr.Column():
+                         gr.HTML(field_label_html('Custom domain (optional)'))
+                         custom_domain = gr.Textbox(
+                             placeholder='e.g. Robotics or Robot-Learning',
+                             show_label=False,
+                             container=False,
+                         )
+                 gr.Markdown(
+                     '<div class="caption">Use the custom field if your task does not belong to the suggested list. '
+                     'If the custom field is filled, it overrides the suggested domain and becomes the prefix of the final task ID.</div>'
+                 )
+                 gr.HTML(field_label_html('Submitter name or HF username'))
+                 submitter = gr.Textbox(
+                     placeholder='e.g. your-hf-handle',
+                     show_label=False,
+                     container=False,
+                 )
+                 gr.HTML(field_label_html('Contact email'))
+                 email = gr.Textbox(
+                     placeholder='name@example.com',
+                     show_label=False,
+                     container=False,
+                 )
+                 gr.HTML(field_label_html('Target paper title'))
+                 paper_title = gr.Textbox(show_label=False, container=False)
+                 gr.HTML(field_label_html('Target paper URL or DOI'))
+                 paper_url = gr.Textbox(
+                     placeholder='https://... or DOI',
+                     show_label=False,
+                     container=False,
+                 )
+                 gr.HTML(field_label_html('Optional notes for reviewers'))
+                 notes = gr.Textbox(
+                     lines=4,
+                     placeholder='Anything maintainers should know about licensing, preprocessing, or provenance.',
+                     show_label=False,
+                     container=False,
+                 )
+             with gr.Column(scale=5, elem_classes=['section-copy', 'side-notes']):
+                 gr.Markdown(submission_guide_markdown(), elem_classes=['subtle-block'])
+
+         with gr.Row(elem_classes=['action-row']):
+             validate_btn = gr.Button('Validate ZIP', variant='primary', elem_classes=['primary-button'])
+             create_pr_btn = gr.Button('Create Dataset PR', interactive=False, elem_classes=['secondary-button'])
+
+         with gr.Column(elem_classes=['section-copy', 'results-shell']):
+             gr.HTML(field_label_html('Final task ID (assigned automatically)'))
+             assigned_task_id = gr.Textbox(
+                 interactive=False,
+                 show_label=False,
+                 container=False,
+             )
+             gr.Markdown(final_task_help_html())
+             validation_md = gr.Markdown()
+             gr.HTML(field_label_html('Validation report'))
+             validation_report = gr.Code(language='json', show_label=False, container=False)
+             pr_md = gr.Markdown()
+
+     archive.upload(fn=handle_archive_upload, inputs=[archive], outputs=[archive_state, archive_notice])
+
+     validate_btn.click(
+         fn=validate_submission,
+         inputs=[
+             archive_state,
+             suggested_domain,
+             custom_domain,
+             submitter,
+             email,
+             paper_title,
+             paper_url,
+             notes,
+             state,
+         ],
+         outputs=[state, assigned_task_id, validation_md, validation_report, create_pr_btn, pr_md],
+     )
+     create_pr_btn.click(fn=create_pr, inputs=[state], outputs=[pr_md])
+
+
+ if __name__ == '__main__':
+     demo.launch()
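The domain-precedence rule stated in the UI copy (the custom field, when filled, overrides the suggested dropdown) comes down to two small functions from the diff above. This standalone excerpt reproduces `normalize_domain_token` and the core of `resolve_domain` so the behavior can be checked without Gradio:

```python
import re


def normalize_domain_token(domain):
    # Mirrors validator.normalize_domain_token: collapse spaces and
    # underscores to single hyphens, then trim stray hyphens.
    value = re.sub(r'[\s_]+', '-', (domain or '').strip())
    value = re.sub(r'-{2,}', '-', value)
    return value.strip('-')


def resolve_domain(selected, custom):
    # The custom field, when non-empty, takes precedence over the dropdown.
    raw = (custom or '').strip() or (selected or '').strip()
    return normalize_domain_token(raw)


assert resolve_domain('Astronomy', '') == 'Astronomy'
assert resolve_domain('Astronomy', 'Robot Learning') == 'Robot-Learning'
```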
repo_ops.py ADDED
@@ -0,0 +1,114 @@
+ from __future__ import annotations
+
+ from pathlib import Path, PurePosixPath
+ from typing import Iterable
+
+ from huggingface_hub import CommitOperationAdd, HfApi
+
+ try:
+     from .validator import DOMAIN_TOKEN_RE, PreparedSubmission, TASK_ID_RE, normalize_domain_token
+ except ImportError:
+     from validator import DOMAIN_TOKEN_RE, PreparedSubmission, TASK_ID_RE, normalize_domain_token
+
+ DEFAULT_REPO_ID = 'InternScience/ResearchClawBench'
+ TOKEN_ENV_KEYS = (
+     'RCB_SPACE_HF_TOKEN',
+     'HF_TOKEN',
+     'HUGGINGFACEHUB_API_TOKEN',
+     'HUGGINGFACE_TOKEN',
+ )
+
+
+ def load_hf_token() -> str | None:
+     import os
+
+     for key in TOKEN_ENV_KEYS:
+         value = os.environ.get(key)
+         if value:
+             return value
+     return None
+
+
+ def list_existing_task_ids(repo_id: str = DEFAULT_REPO_ID, token: str | None = None) -> set[str]:
+     api = HfApi(token=token)
+     task_ids: set[str] = set()
+     for remote_path in api.list_repo_files(repo_id=repo_id, repo_type='dataset', token=token):
+         parts = PurePosixPath(remote_path).parts
+         if len(parts) >= 2 and parts[0] == 'tasks':
+             task_ids.add(parts[1])
+     return task_ids
+
+
+ def allocate_next_task_id(domain: str, existing_task_ids: Iterable[str]) -> str:
+     domain = normalize_domain_token(domain)
+     if not DOMAIN_TOKEN_RE.fullmatch(domain):
+         raise ValueError(
+             'Domain must start with a letter and contain only letters, numbers, or hyphens '
+             f'after normalization. Got: {domain!r}'
+         )
+     used_numbers = []
+     for task_id in existing_task_ids:
+         match = TASK_ID_RE.match(task_id)
+         if match and match.group(1) == domain:
+             used_numbers.append(int(match.group(2)))
+     next_number = (max(used_numbers) + 1) if used_numbers else 0
+     if next_number > 999:
+         raise ValueError(f'No task IDs left for domain {domain}.')
+     return f'{domain}_{next_number:03d}'
+
+
+ def build_commit_description(prepared: PreparedSubmission) -> str:
+     metadata = prepared.metadata
+     lines = [
+         f'Submitter: {metadata.submitter}',
+         f'Contact email: {metadata.email}',
+         f'Domain: {metadata.domain}',
+         f'Assigned task id: {prepared.assigned_task_id}',
+         f'Paper title: {metadata.paper_title}',
+         f'Paper URL/DOI: {metadata.paper_url}',
+         f'Archive files: {prepared.archive_stats.file_count}',
+         f'Archive total bytes: {prepared.archive_stats.total_bytes}',
+     ]
+     if metadata.notes.strip():
+         lines.extend(['', 'Submitter notes:', metadata.notes.strip()])
+     lines.extend(['', 'This PR was created automatically by the ResearchClawBench submission Space after passing format validation.'])
+     return '\n'.join(lines)
+
+
+ def create_dataset_pr(
+     prepared: PreparedSubmission,
+     *,
+     repo_id: str = DEFAULT_REPO_ID,
+     token: str | None = None,
+ ):
+     token = token or load_hf_token()
+     if not token:
+         raise RuntimeError('No Hugging Face write token configured. Set RCB_SPACE_HF_TOKEN or HF_TOKEN.')
+
+     staged_task_dir = Path(prepared.staged_task_dir)
+     if not staged_task_dir.is_dir():
+         raise RuntimeError(f'Staged task directory does not exist: {staged_task_dir}')
+
+     operations = []
+     for path in sorted(staged_task_dir.rglob('*')):
+         if not path.is_file():
+             continue
+         rel_path = path.relative_to(staged_task_dir).as_posix()
+         operations.append(
+             CommitOperationAdd(
+                 path_in_repo=f'tasks/{prepared.assigned_task_id}/{rel_path}',
+                 path_or_fileobj=str(path),
+             )
+         )
+
+     api = HfApi(token=token)
+     return api.create_commit(
+         repo_id=repo_id,
+         repo_type='dataset',
+         operations=operations,
+         commit_message=f'Add task submission {prepared.assigned_task_id}',
+         commit_description=build_commit_description(prepared),
+         token=token,
+         create_pr=True,
+         revision='main',
+     )
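To make the allocation rule in `allocate_next_task_id` concrete: numbering is per-domain, starts at `000`, and continues from the highest suffix already in use, so gaps are not reused. This self-contained restatement uses the same `TASK_ID_RE` pattern as the module above:

```python
import re

# Same task-id pattern as validator.TASK_ID_RE: Domain_NNN.
TASK_ID_RE = re.compile(r'^([A-Za-z][A-Za-z0-9-]*)_(\d{3})$')


def next_task_id(domain, existing_ids):
    # Collect the numeric suffixes already used in this domain,
    # then continue from the maximum (gaps are never reused).
    used = [int(m.group(2)) for tid in existing_ids
            if (m := TASK_ID_RE.match(tid)) and m.group(1) == domain]
    return f'{domain}_{(max(used) + 1) if used else 0:03d}'


print(next_task_id('Astronomy', {'Astronomy_000', 'Astronomy_002', 'Math_005'}))
# Astronomy_003 (the gap at 001 is skipped, Math ids are ignored)
```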
requirements.txt ADDED
@@ -0,0 +1,2 @@
+ # Gradio is provided by the Hugging Face Space SDK via the README YAML `sdk_version`.
+ huggingface_hub>=0.34.0,<1
validator.py ADDED
@@ -0,0 +1,518 @@
from __future__ import annotations

import json
import os
import re
import shutil
import stat
import tempfile
import zipfile
from dataclasses import asdict, dataclass
from pathlib import Path, PurePosixPath
from typing import Any

DOMAINS = (
    'Astronomy',
    'Chemistry',
    'Earth',
    'Energy',
    'Information',
    'Life',
    'Material',
    'Math',
    'Neuroscience',
    'Physics',
)
DOMAIN_TOKEN_RE = re.compile(r'^[A-Za-z][A-Za-z0-9-]*$')
TASK_ID_RE = re.compile(r'^([A-Za-z][A-Za-z0-9-]*)_(\d{3})$')
STALE_TOKENS = (
    '/mnt/shared-storage-user/',
    'SGI-EvalAgent',
    'prior_literature',
    'target_literature',
)
STALE_TASK_REF_RE = re.compile(r'(?:\./|/)tasks/[\w/.\-]+')
DATA_REF_RE = re.compile(r"""\./data/[^'"`\n;,]+""")
EXPECTED_TOP_LEVEL = {'data', 'related_work', 'target_study', 'task_info.json'}
EXPECTED_TARGET_STUDY = ('checklist.json', 'images/', 'paper.pdf')
EXPECTED_TASK_INFO_KEYS = ('data', 'task')
EXPECTED_DATA_ITEM_KEYS = ('description', 'name', 'path', 'type')
EXPECTED_CHECKLIST_ITEM_KEYS = ('content', 'keywords', 'path', 'type', 'weight')
IGNORED_ARCHIVE_PARTS = {'__MACOSX'}
IGNORED_ARCHIVE_NAMES = {'.DS_Store'}
DEFAULT_MAX_FILES = int(os.environ.get('RCB_SPACE_MAX_FILES', '5000'))
DEFAULT_MAX_TOTAL_BYTES = int(os.environ.get('RCB_SPACE_MAX_TOTAL_BYTES', str(5 * 1024 * 1024 * 1024)))
DEFAULT_MAX_SINGLE_FILE_BYTES = int(os.environ.get('RCB_SPACE_MAX_SINGLE_FILE_BYTES', str(1024 * 1024 * 1024)))

@dataclass
class SubmissionMetadata:
    domain: str
    submitter: str
    email: str
    paper_title: str
    paper_url: str
    notes: str = ''


@dataclass
class ArchiveStats:
    file_count: int
    total_bytes: int


@dataclass
class PreparedSubmission:
    work_dir: str
    uploaded_task_dir: str
    staged_task_dir: str
    assigned_task_id: str
    archive_stats: ArchiveStats
    metadata: SubmissionMetadata

    def to_state(self) -> dict[str, Any]:
        return {
            'work_dir': self.work_dir,
            'uploaded_task_dir': self.uploaded_task_dir,
            'staged_task_dir': self.staged_task_dir,
            'assigned_task_id': self.assigned_task_id,
            'archive_stats': asdict(self.archive_stats),
            'metadata': asdict(self.metadata),
        }

    @classmethod
    def from_state(cls, state: dict[str, Any]) -> 'PreparedSubmission':
        return cls(
            work_dir=state['work_dir'],
            uploaded_task_dir=state['uploaded_task_dir'],
            staged_task_dir=state['staged_task_dir'],
            assigned_task_id=state['assigned_task_id'],
            archive_stats=ArchiveStats(**state['archive_stats']),
            metadata=SubmissionMetadata(**state['metadata']),
        )

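`to_state`/`from_state` exist so the prepared submission can live in plain-dict session state between Gradio callbacks. This standalone sketch uses trimmed re-declarations of the dataclasses (fewer fields than the real ones, purely for illustration) to show the round-trip:

```python
from dataclasses import asdict, dataclass
from typing import Any

# Trimmed re-declarations of the dataclasses above, just to show the round-trip.
@dataclass
class ArchiveStats:
    file_count: int
    total_bytes: int

@dataclass
class PreparedSubmission:
    assigned_task_id: str
    archive_stats: ArchiveStats

    def to_state(self) -> dict[str, Any]:
        # asdict() recursively converts nested dataclasses to plain dicts.
        return {
            'assigned_task_id': self.assigned_task_id,
            'archive_stats': asdict(self.archive_stats),
        }

    @classmethod
    def from_state(cls, state: dict[str, Any]) -> 'PreparedSubmission':
        return cls(
            assigned_task_id=state['assigned_task_id'],
            archive_stats=ArchiveStats(**state['archive_stats']),
        )

original = PreparedSubmission('Astronomy_003', ArchiveStats(file_count=12, total_bytes=4096))
restored = PreparedSubmission.from_state(original.to_state())
print(restored == original)  # True: dataclass equality compares field by field
```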
class ValidationError(RuntimeError):
    pass


def normalize_domain_token(domain: str) -> str:
    value = re.sub(r'[\s_]+', '-', (domain or '').strip())
    value = re.sub(r'-{2,}', '-', value)
    return value.strip('-')


def load_json(path: Path) -> Any:
    try:
        return json.loads(path.read_text(encoding='utf-8'))
    except Exception as exc:
        raise ValidationError(f'Failed to parse JSON: {path}: {exc}') from exc


def rel(path: Path, base: Path) -> str:
    try:
        return str(path.relative_to(base))
    except Exception:
        return str(path)


def _target_entries(target_dir: Path) -> tuple[str, ...]:
    return tuple(sorted(x.name + ('/' if x.is_dir() else '') for x in target_dir.iterdir()))


def _is_ignored_archive_path(path: PurePosixPath) -> bool:
    return (
        any(part in IGNORED_ARCHIVE_PARTS for part in path.parts)
        or path.name in IGNORED_ARCHIVE_NAMES
        or path.name.startswith('._')
    )


def _is_zip_symlink(info: zipfile.ZipInfo) -> bool:
    mode = info.external_attr >> 16
    return stat.S_ISLNK(mode)


def _iter_data_refs(text: str) -> list[str]:
    refs = []
    for raw_ref in DATA_REF_RE.findall(text):
        ref = raw_ref.rstrip('.')
        if ref not in refs:
            refs.append(ref)
    return refs


def cleanup_work_dir(work_dir: str | Path | None) -> None:
    if not work_dir:
        return
    shutil.rmtree(Path(work_dir), ignore_errors=True)


def create_work_dir() -> Path:
    return Path(tempfile.mkdtemp(prefix='rcb_space_submit_'))

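`_iter_data_refs` scans free-form descriptions for `./data/...` references; the character class stops a match at quotes, backticks, newlines, semicolons, and commas, trailing sentence periods are stripped, and duplicates are removed while preserving order. A standalone sketch mirroring that logic (re-declared here so it runs on its own):

```python
import re

# Re-declared from the module above: match ./data/... references in free text.
DATA_REF_RE = re.compile(r"""\./data/[^'"`\n;,]+""")

def iter_data_refs(text: str) -> list[str]:
    refs = []
    for raw_ref in DATA_REF_RE.findall(text):
        ref = raw_ref.rstrip('.')  # drop a sentence-ending period
        if ref not in refs:        # de-duplicate while preserving order
            refs.append(ref)
    return refs

print(iter_data_refs("Inputs live in ./data/input.csv, labels in ./data/labels.json"))
# ['./data/input.csv', './data/labels.json']
print(iter_data_refs("see ./data/raw.zip, unpack ./data/raw.zip."))
# ['./data/raw.zip']  (trailing period stripped, duplicate collapsed)
```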
def extract_submission_zip(
    archive_path: str | Path,
    work_dir: str | Path,
    *,
    max_files: int = DEFAULT_MAX_FILES,
    max_total_bytes: int = DEFAULT_MAX_TOTAL_BYTES,
    max_single_file_bytes: int = DEFAULT_MAX_SINGLE_FILE_BYTES,
) -> tuple[Path, ArchiveStats]:
    archive_path = Path(archive_path)
    work_dir = Path(work_dir)
    extract_root = work_dir / 'extracted'
    extract_root.mkdir(parents=True, exist_ok=True)

    if archive_path.suffix.lower() != '.zip':
        raise ValidationError('Only .zip uploads are supported.')

    file_count = 0
    total_bytes = 0
    safe_infos: list[tuple[zipfile.ZipInfo, PurePosixPath]] = []

    with zipfile.ZipFile(archive_path) as zf:
        infos = zf.infolist()
        if not infos:
            raise ValidationError('The uploaded zip archive is empty.')

        for info in infos:
            raw_name = info.filename.replace('\\', '/')
            if not raw_name:
                continue
            posix_path = PurePosixPath(raw_name)
            if _is_ignored_archive_path(posix_path):
                continue
            if posix_path.is_absolute() or '..' in posix_path.parts:
                raise ValidationError(f'Archive contains an invalid path: {raw_name}')
            if _is_zip_symlink(info):
                raise ValidationError(f'Archive contains a symbolic link, which is not allowed: {raw_name}')
            safe_infos.append((info, posix_path))
            if info.is_dir():
                continue
            file_count += 1
            total_bytes += info.file_size
            if info.file_size > max_single_file_bytes:
                raise ValidationError(
                    f'Archive file exceeds the per-file limit ({max_single_file_bytes} bytes): {raw_name}'
                )
            if file_count > max_files:
                raise ValidationError(f'Archive exceeds the file-count limit ({max_files}).')
            if total_bytes > max_total_bytes:
                raise ValidationError(f'Archive exceeds the total-size limit ({max_total_bytes} bytes).')

        if file_count == 0:
            raise ValidationError('The uploaded zip archive does not contain any files.')

        for info, posix_path in safe_infos:
            destination = extract_root.joinpath(*posix_path.parts)
            if info.is_dir():
                destination.mkdir(parents=True, exist_ok=True)
                continue
            destination.parent.mkdir(parents=True, exist_ok=True)
            with zf.open(info) as src, destination.open('wb') as dst:
                shutil.copyfileobj(src, dst)

    return extract_root, ArchiveStats(file_count=file_count, total_bytes=total_bytes)

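The extraction guard above rejects "zip-slip" entries (absolute paths or `..` segments) before anything is written to disk, because `zipfile` happily stores such names. A standalone sketch of just that check (the `check_archive_paths` helper is illustrative, not part of the module), run against an in-memory archive containing a traversal entry:

```python
import io
import zipfile
from pathlib import PurePosixPath

def check_archive_paths(data: bytes) -> list[str]:
    """Standalone sketch of the path guard above: flag absolute or '..' entries."""
    bad = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            posix_path = PurePosixPath(info.filename.replace('\\', '/'))
            if posix_path.is_absolute() or '..' in posix_path.parts:
                bad.append(info.filename)
    return bad

# zipfile.writestr performs no name sanitization, so a hostile archive
# can contain a '../' entry that would escape the extraction root.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('task/data/input.csv', 'a,b\n')
    zf.writestr('../escape.txt', 'zip-slip attempt')

print(check_archive_paths(buf.getvalue()))  # ['../escape.txt']
```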
def find_single_task_dir(extract_root: str | Path) -> Path:
    extract_root = Path(extract_root)
    entries = []
    for path in sorted(extract_root.iterdir(), key=lambda p: p.name.lower()):
        if path.name in IGNORED_ARCHIVE_NAMES or path.name in IGNORED_ARCHIVE_PARTS or path.name.startswith('._'):
            continue
        entries.append(path)

    if len(entries) != 1 or not entries[0].is_dir():
        names = [p.name for p in entries]
        raise ValidationError(
            'Zip must contain exactly one top-level task directory. '
            f'Found: {names if names else "(none)"}'
        )
    return entries[0]

def validate_submission_metadata(metadata: SubmissionMetadata) -> list[str]:
    errors: list[str] = []
    normalized_domain = normalize_domain_token(metadata.domain)
    if not normalized_domain:
        errors.append('A domain is required.')
    elif not DOMAIN_TOKEN_RE.fullmatch(normalized_domain):
        errors.append(
            'Domain must start with a letter and contain only letters, numbers, or hyphens '
            f'after normalization. Got: {metadata.domain!r}'
        )
    if not metadata.submitter.strip():
        errors.append('Submitter name or HF username is required.')
    if not metadata.email.strip():
        errors.append('Contact email is required.')
    elif not re.fullmatch(r'[^@\s]+@[^@\s]+\.[^@\s]+', metadata.email.strip()):
        errors.append('Contact email must be a valid email address.')
    if not metadata.paper_title.strip():
        errors.append('Paper title is required.')
    if not metadata.paper_url.strip():
        errors.append('Paper URL or DOI is required.')
    return errors

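The email check is deliberately permissive: one `@`, no whitespace, and at least one dot after the `@`. A standalone sketch with the same pattern (the `email_ok` wrapper is illustrative only):

```python
import re

# Same permissive pattern used in validate_submission_metadata above.
EMAIL_RE = re.compile(r'[^@\s]+@[^@\s]+\.[^@\s]+')

def email_ok(value: str) -> bool:
    # fullmatch (not search) so the whole trimmed string must be the address.
    return bool(EMAIL_RE.fullmatch(value.strip()))

print(email_ok('ada@example.org'))   # True
print(email_ok('ada@example'))       # False: missing a dot after the '@'
print(email_ok('ada @example.org'))  # False: whitespace inside the address
```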

def validate_task_dir(
    task_dir: str | Path,
    *,
    enforce_task_name: bool = True,
    expected_domain: str | None = None,
) -> list[str]:
    task_dir = Path(task_dir)
    errors: list[str] = []
    task_name = task_dir.name
    match = TASK_ID_RE.match(task_name)

    if enforce_task_name:
        if not match:
            errors.append(f'invalid task directory name: {task_name}')
        elif expected_domain and match.group(1) != expected_domain:
            errors.append(f'task directory domain {match.group(1)!r} does not match selected domain {expected_domain!r}')

    if not task_dir.is_dir():
        return [f'task directory does not exist: {task_dir}']

    actual_top = {p.name for p in task_dir.iterdir()}
    if actual_top != EXPECTED_TOP_LEVEL:
        errors.append(f'top-level entries mismatch: expected {sorted(EXPECTED_TOP_LEVEL)}, got {sorted(actual_top)}')

    data_dir = task_dir / 'data'
    related_dir = task_dir / 'related_work'
    target_dir = task_dir / 'target_study'
    task_info_path = task_dir / 'task_info.json'
    checklist_path = target_dir / 'checklist.json'
    paper_path = target_dir / 'paper.pdf'
    images_dir = target_dir / 'images'

    if not data_dir.is_dir():
        errors.append('missing data/ directory')
    if not related_dir.is_dir():
        errors.append('missing related_work/ directory')
    if not target_dir.is_dir():
        errors.append('missing target_study/ directory')
    if not task_info_path.is_file():
        errors.append('missing task_info.json')
        return errors
    if not checklist_path.is_file():
        errors.append('missing target_study/checklist.json')
        return errors
    if not paper_path.is_file():
        errors.append('missing target_study/paper.pdf')
    if not images_dir.is_dir():
        errors.append('missing target_study/images/ directory')

    try:
        task_info = load_json(task_info_path)
    except ValidationError as exc:
        errors.append(str(exc))
        return errors

    if not isinstance(task_info, dict):
        errors.append('task_info.json must be a JSON object')
        return errors

    if tuple(sorted(task_info.keys())) != EXPECTED_TASK_INFO_KEYS:
        errors.append(f'task_info.json keys mismatch: {sorted(task_info.keys())}')

    if not isinstance(task_info.get('task'), str) or not task_info['task'].strip():
        errors.append('task_info.json field `task` must be a non-empty string')
    if not isinstance(task_info.get('data'), list):
        errors.append('task_info.json field `data` must be a list')
        task_info['data'] = []

    covered_files: set[Path] = set()
    declared_paths: set[str] = set()
    for idx, item in enumerate(task_info['data']):
        prefix = f'task_info.data[{idx}]'
        if not isinstance(item, dict):
            errors.append(f'{prefix} must be a JSON object')
            continue
        if tuple(sorted(item.keys())) != EXPECTED_DATA_ITEM_KEYS:
            errors.append(f'{prefix} keys mismatch: {sorted(item.keys())}')
            continue
        for field in EXPECTED_DATA_ITEM_KEYS:
            if not isinstance(item.get(field), str) or not item[field].strip():
                errors.append(f'{prefix}.{field} must be a non-empty string')
        data_path = item.get('path')
        if not isinstance(data_path, str):
            continue
        if not data_path.startswith('./data/') and data_path != './data':
            errors.append(f'{prefix}.path must start with ./data/: {data_path}')
            continue
        if '\\' in data_path or '..' in Path(data_path).parts:
            errors.append(f'{prefix}.path contains an invalid segment: {data_path}')
            continue
        if data_path in declared_paths:
            errors.append(f'duplicate data path declaration: {data_path}')
            continue
        declared_paths.add(data_path)
        rel_path = data_path[2:] if data_path.startswith('./') else data_path
        target = task_dir / rel_path
        if not target.exists():
            errors.append(f'{prefix}.path does not exist: {data_path}')
            continue
        if target.is_file():
            covered_files.add(target)
        elif target.is_dir():
            nested_files = {p for p in target.rglob('*') if p.is_file()}
            if not nested_files:
                errors.append(f'{prefix}.path points to an empty directory: {data_path}')
            covered_files.update(nested_files)
        else:
            errors.append(f'{prefix}.path is neither file nor directory: {data_path}')
        description = item.get('description', '')
        if any(token in description for token in STALE_TOKENS):
            errors.append(f'{prefix}.description still contains stale source paths or legacy directories')

    actual_data_files = {p for p in data_dir.rglob('*') if p.is_file()} if data_dir.exists() else set()
    uncovered = sorted(actual_data_files - covered_files)
    if uncovered:
        errors.append('data/ contains undeclared files: ' + ', '.join(rel(p, task_dir) for p in uncovered[:20]))
    missing_backing = sorted(covered_files - actual_data_files)
    if missing_backing:
        errors.append('declared data coverage points outside data/: ' + ', '.join(rel(p, task_dir) for p in missing_backing[:20]))

    related_entries = sorted(related_dir.iterdir(), key=lambda p: p.name) if related_dir.exists() else []
    related_files = [p for p in related_entries if p.is_file()]
    related_dirs = [p for p in related_entries if p.is_dir()]
    if related_dirs:
        errors.append('related_work/ must not contain subdirectories')
    if not related_files:
        errors.append('related_work/ must contain at least one PDF')
    pdf_names = []
    for path in related_files:
        if not re.fullmatch(r'paper_\d{3}\.pdf', path.name):
            errors.append(f'invalid related_work filename: {path.name}')
        pdf_names.append(path.name)
    expected_pdf_names = [f'paper_{i:03d}.pdf' for i in range(len(pdf_names))]
    if pdf_names and pdf_names != expected_pdf_names:
        errors.append(f'related_work PDFs must be contiguous starting from paper_000.pdf; got {pdf_names}')

    if target_dir.exists() and _target_entries(target_dir) != EXPECTED_TARGET_STUDY:
        errors.append(f'target_study entries mismatch: {_target_entries(target_dir)}')

    try:
        checklist = load_json(checklist_path)
    except ValidationError as exc:
        errors.append(str(exc))
        return errors

    if not isinstance(checklist, list) or not checklist:
        errors.append('checklist.json must be a non-empty list')
        checklist = []

    referenced_images: set[str] = set()
    for idx, item in enumerate(checklist):
        prefix = f'checklist[{idx}]'
        if not isinstance(item, dict):
            errors.append(f'{prefix} must be a JSON object')
            continue
        if tuple(sorted(item.keys())) != EXPECTED_CHECKLIST_ITEM_KEYS:
            errors.append(f'{prefix} keys mismatch: {sorted(item.keys())}')
            continue
        item_type = item.get('type')
        if item_type not in {'text', 'image'}:
            errors.append(f'{prefix}.type must be text or image, got {item_type!r}')
        if not isinstance(item.get('content'), str) or not item['content'].strip():
            errors.append(f'{prefix}.content must be a non-empty string')
        if not isinstance(item.get('keywords'), list) or not item['keywords']:
            errors.append(f'{prefix}.keywords must be a non-empty list')
        elif not all(isinstance(x, str) and x.strip() for x in item['keywords']):
            errors.append(f'{prefix}.keywords must contain only non-empty strings')
        if not isinstance(item.get('weight'), (int, float)) or item['weight'] <= 0:
            errors.append(f'{prefix}.weight must be a positive number')
        path_value = item.get('path')
        if item_type == 'text':
            if path_value is not None:
                errors.append(f'{prefix}.path must be null for text items')
        elif item_type == 'image':
            if not isinstance(path_value, str) or not path_value.startswith('images/'):
                errors.append(f'{prefix}.path must start with images/ for image items')
            else:
                if '\\' in path_value or '..' in Path(path_value).parts:
                    errors.append(f'{prefix}.path contains an invalid segment: {path_value}')
                image_path = target_dir / path_value
                if not image_path.is_file():
                    errors.append(f'{prefix}.path does not exist: {path_value}')
                referenced_images.add(path_value)

    actual_image_files = (
        {str(p.relative_to(target_dir)) for p in images_dir.rglob('*') if p.is_file()}
        if images_dir.exists()
        else set()
    )
    extra_images = sorted(actual_image_files - referenced_images)
    missing_images = sorted(referenced_images - actual_image_files)
    if extra_images:
        errors.append('target_study/images contains unreferenced files: ' + ', '.join(extra_images[:20]))
    if missing_images:
        errors.append('checklist image references are missing from target_study/images: ' + ', '.join(missing_images[:20]))

    for text_path in (task_info_path, checklist_path):
        text = text_path.read_text(encoding='utf-8')
        if any(token in text for token in STALE_TOKENS):
            errors.append(f'stale source path tokens remain in {rel(text_path, task_dir)}')

    task_text = task_info.get('task', '') if isinstance(task_info, dict) else ''
    if isinstance(task_text, str):
        for ref in STALE_TASK_REF_RE.findall(task_text):
            errors.append(f'task description contains stale path: {ref}')

    for idx, item in enumerate(task_info.get('data', [])):
        if not isinstance(item, dict):
            continue
        desc = item.get('description', '')
        for ref in STALE_TASK_REF_RE.findall(desc):
            errors.append(f'task_info.data[{idx}].description contains stale path: {ref}')
        for data_ref in _iter_data_refs(desc):
            rel_ref = data_ref[2:]
            if not (task_dir / rel_ref).exists():
                errors.append(f'task_info.data[{idx}].description references non-existent path: {data_ref}')

    return errors

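The `related_work/` rule checked above requires every file to match `paper_NNN.pdf` and the set of names to be gap-free starting at `paper_000.pdf`. A standalone sketch of that check on bare filename lists (the `check_related_work_names` helper is illustrative, not part of the module):

```python
import re

def check_related_work_names(names: list[str]) -> list[str]:
    """Standalone sketch of the related_work/ naming rule above."""
    errors = []
    for name in names:
        if not re.fullmatch(r'paper_\d{3}\.pdf', name):
            errors.append(f'invalid related_work filename: {name}')
    # Sorted names must be exactly paper_000.pdf .. paper_{N-1}.pdf.
    expected = [f'paper_{i:03d}.pdf' for i in range(len(names))]
    if names and sorted(names) != expected:
        errors.append('related_work PDFs must be contiguous starting from paper_000.pdf')
    return errors

print(check_related_work_names(['paper_000.pdf', 'paper_001.pdf']))  # []
print(check_related_work_names(['paper_000.pdf', 'paper_002.pdf']))  # contiguity error: 001 is missing
```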

def stage_submission(task_dir: str | Path, assigned_task_id: str, work_dir: str | Path) -> Path:
    task_dir = Path(task_dir)
    work_dir = Path(work_dir)
    staged_root = work_dir / 'staged'
    staged_root.mkdir(parents=True, exist_ok=True)
    staged_task_dir = staged_root / assigned_task_id
    if staged_task_dir.exists():
        shutil.rmtree(staged_task_dir)
    shutil.copytree(task_dir, staged_task_dir)
    return staged_task_dir


def build_public_report(prepared: PreparedSubmission) -> dict[str, Any]:
    return {
        'status': 'ok',
        'assigned_task_id': prepared.assigned_task_id,
        'archive': {
            'file_count': prepared.archive_stats.file_count,
            'total_bytes': prepared.archive_stats.total_bytes,
        },
        'metadata': {
            'domain': prepared.metadata.domain,
            'submitter': prepared.metadata.submitter,
            'paper_title': prepared.metadata.paper_title,
            'paper_url': prepared.metadata.paper_url,
        },
    }


def validate_and_prepare_submission(
    archive_path: str | Path,
    metadata: SubmissionMetadata,
    assigned_task_id: str,
) -> PreparedSubmission:
    metadata_errors = validate_submission_metadata(metadata)
    if metadata_errors:
        raise ValidationError('\n'.join(metadata_errors))

    work_dir = create_work_dir()
    try:
        extract_root, archive_stats = extract_submission_zip(archive_path, work_dir)
        uploaded_task_dir = find_single_task_dir(extract_root)
        errors = validate_task_dir(uploaded_task_dir, enforce_task_name=False, expected_domain=metadata.domain)
        if errors:
            raise ValidationError('\n'.join(errors))
        staged_task_dir = stage_submission(uploaded_task_dir, assigned_task_id, work_dir)
        return PreparedSubmission(
            work_dir=str(work_dir),
            uploaded_task_dir=str(uploaded_task_dir),
            staged_task_dir=str(staged_task_dir),
            assigned_task_id=assigned_task_id,
            archive_stats=archive_stats,
            metadata=metadata,
        )
    except Exception:
        cleanup_work_dir(work_dir)
        raise