# MVP build for “Data Curation Workbench” (Hugging Face Space)

## 0) MVP Goal & Scope
**Goal:** Let a signed‑in user upload **D₀** (or reference a Hub dataset), pick a **model** + **metrics**, choose candidate datasets **{D₁…Dₙ}**, launch **small‑scale fine‑tunes/evals** as detached **Jobs**, and view:
* per‑run metrics (loss / F1 / Exact‑Match),
* a **scaling‑law** plot, and
* a table ranking which Dₖ helps the most,
* with all artifacts saved to a results dataset or Space storage.

**Out of scope (for MVP):**
* Multi‑GPU distributed training, multi‑task mixing UI, complex hyperparam sweeps.
* Non‑text tasks.

---
## 1) Repository Layout
Create these files/folders:
```
.
├─ README.md
├─ PLAN.md               # this file
├─ app.py                # Gradio UI + Job submission + status polling
├─ requirements.txt
├─ catalog/
│  └─ candidates.json    # curated {D₁…Dₙ}
├─ utils/
│  ├─ hub.py             # upload to Hub, results repo helpers
│  ├─ data.py            # dataset loading/mixing/helpers
│  └─ plotting.py        # scaling plot helper
└─ jobs/
   ├─ run_experiment.py  # orchestrates one D₀ ⊕ Dₖ experiment (multi sizes)
   ├─ train.py           # PEFT/QLoRA SFT
   ├─ eval.py            # metrics (loss/F1/Exact-Match)
   └─ scaling.py         # fit & predict scaling law
```
---
## 2) Configuration & Env
**Space Settings → Secrets/Variables (already configured in step 2; listed here for reference):**
* `SERVICE_HF_TOKEN` (secret, write‑scoped; used to create/push results datasets)
* `RESULTS_REPO` (optional, like `your-org/curation-results`; if absent, create on first run)
* `HF_HOME=/data/.huggingface` (variable) **if** Persistent Storage is enabled
* `PERSIST_DIR=/data` (variable) **if** Persistent Storage is enabled

**NOTE: RESULTS_REPO is absent now; Persistent Storage is NOT enabled yet** (the snippet below shows the intended fallbacks).

**Runtime assumptions:**
* Space uses the **Gradio SDK**.
* Jobs will request a **GPU flavor** (e.g., `a10g-small`) for training; the UI itself can run on CPU.

**Currently the Space Hardware is ZeroGPU.**
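A minimal sketch of how `app.py` could resolve these settings while `RESULTS_REPO` and Persistent Storage are still absent; the `/tmp` fallback paths are assumptions, not part of the configuration above:

```python
import os

# Fall back to ephemeral /tmp paths until Persistent Storage (PERSIST_DIR / HF_HOME) is enabled.
PERSIST_DIR = os.getenv("PERSIST_DIR", "/tmp/curation")
HF_HOME = os.getenv("HF_HOME", os.path.join(PERSIST_DIR, ".huggingface"))
RESULTS_REPO = os.getenv("RESULTS_REPO", "")  # empty -> created on first run by utils/hub.py
os.makedirs(PERSIST_DIR, exist_ok=True)
```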
---
## 3) Dependencies
`requirements.txt`
```
gradio>=5
huggingface_hub>=0.25
datasets>=2.20
transformers>=4.44
peft>=0.13
trl>=0.9
evaluate>=0.4
scikit-learn>=1.5
numpy>=1.26
pandas>=2.2
matplotlib>=3.8
```
---
## 4) Candidate Datasets Catalog
`catalog/candidates.json` (minimal starter; adjust to your domain)
```json
[
  {
    "id": "glue/sst2",
    "task": "classification",
    "license": "open",
    "size_hint": "67k",
    "columns": {"text": "sentence", "label": "label"},
    "labels": ["negative", "positive"]
  },
  {
    "id": "ag_news",
    "task": "classification",
    "license": "cc-by-3.0",
    "size_hint": "120k",
    "columns": {"text": "text", "label": "label"},
    "labels": ["World", "Sports", "Business", "Sci/Tech"]
  },
  {
    "id": "squad",
    "task": "qa",
    "license": "cc-by-sa-4.0",
    "size_hint": "100k",
    "columns": {"question": "question", "context": "context", "answers": "answers"}
  }
]
```
> For MVP, support **classification** and **extractive QA**. The `columns` mapping lets us normalize heterogeneous datasets without complex UI.
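A hedged sketch of how a `columns` mapping can be applied, assuming the catalog `id` resolves to a loadable Hub dataset (ids such as `glue/sst2` may need to be split into a repo id plus config name); `normalize_classification` is an illustrative helper, not part of the planned API:

```python
from datasets import load_dataset

def normalize_classification(entry, split="train"):
    """Load a catalog entry and map its columns onto the canonical {"text", "label"} schema."""
    ds = load_dataset(entry["id"], split=split)
    cols = entry["columns"]
    if cols["text"] != "text":
        ds = ds.rename_column(cols["text"], "text")
    if cols["label"] != "label":
        ds = ds.rename_column(cols["label"], "label")
    # Drop everything except the normalized columns.
    return ds.remove_columns([c for c in ds.column_names if c not in ("text", "label")])
```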
---
## 5) UI — `app.py` (Gradio)
### 5.1 Features
* **LoginButton** (OAuth) → captures `gr.OAuthProfile` and `gr.OAuthToken`.
* **D₀ input**: either upload files (`.jsonl/.csv/.parquet/.zip`) or provide a **Hub dataset id**.
* **Model** dropdown: start with `meta-llama/Llama-3.1-8B-Instruct`.
* **Task** selector (classification or QA). (MVP: single task per run.)
* **Benchmark/test set**: upload small test data or provide Hub split.
* **Metrics** checkboxes: `loss`, `f1`, `exact_match` (show `exact_match` only for QA).
* **Candidate datasets**: multiselect from `candidates.json`.
* **Run experiments** button: submits **one Job per selected Dₖ**.
* **Jobs table**: ID, Dₖ, status, logs link, artifacts link.
* **Results view**: scaling plot + ranked table when jobs finish.

### 5.2 Implementation Sketch
* Parse OAuth token; we’ll prefer the user token for **reading gated models**, but use `SERVICE_HF_TOKEN` for **writing** artifacts.
* If user **uploads D₀**, compress if needed and push to a **private dataset repo** via `utils/hub.ensure_uploaded_dataset(...)`.
* Submit a **Job** per Dₖ with:
  * command: `python jobs/run_experiment.py --model ... --d0 ... --dk ... --task ... --metrics ... --results_repo ...`
  * `flavor="a10g-small"` (configurable)
  * `timeout` (e.g., 7200 seconds)
  * `env`: `HF_TOKEN` (read), `SERVICE_HF_TOKEN` (write), plus `RESULTS_REPO` if set.
* Store job metadata in a `gr.State` list; start a **poller** (every ~10–15s) to refresh status via `huggingface_hub.inspect_job(...)` (see the timer sketch after the acceptance criteria).
* When a job completes, show a link to its artifacts (scaling plot, metrics JSON) and update the results table.

**Acceptance criteria**
* Launching a run queues N jobs (N = number of selected Dₖ).
* Status column transitions through “queued/running/completed/failed”.
* Clicking an artifacts link opens an image/JSON from the results repo (or Space storage).
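One possible shape for the poller, assuming Gradio 5's `gr.Timer` and the `poll` function sketched in section 19:

```python
import gradio as gr

with gr.Blocks() as demo:
    jobs_state = gr.State([])
    timer = gr.Timer(value=15)  # fire roughly every 15 seconds
    # `poll` refreshes statuses via inspect_job(...), as sketched in section 19.
    timer.tick(fn=poll, inputs=jobs_state, outputs=jobs_state)
```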
---
## 6) Hub Utilities — `utils/hub.py`
### Functions to implement
* `ensure_uploaded_dataset(upload_files, d0_dataset_id, user_token) -> str`
  * If `d0_dataset_id` is provided, return it.
  * Else create a **private dataset repo** under your org (e.g., `your-org/curation-upload-<uuid>`), upload files/folder, and return repo id.
* `ensure_results_repo(service_token, results_repo_env) -> str`
  * If `RESULTS_REPO` is set, ensure it exists; else create `your-org/curation-results`.
* `push_artifacts(repo_id, local_dir, subdir) -> None`
  * Upload a local folder (e.g., `artifacts/<job-id>/...`) to `repo_id/subdir`.

**Acceptance criteria**
* Uploading a small CSV/JSONL creates a private dataset and returns a valid repo id.
* Pushing artifacts creates/updates files in the results repo with versioned commits.

---
## 7) Data Helpers — `utils/data.py`
### Responsibilities
* Load D₀ and Dₖ from the Hub (and optional **test set**).
* Normalize columns using the `columns` mapping from `candidates.json` or a provided override.
* Build **mixtures** of D₀ ⊕ Dₖ at multiple sizes (e.g., `{10k, 20k, 40k}` examples).
* For **classification**: expect `{"text": str, "label": int}` after normalization.
  For **QA**: expect `{"question": str, "context": str, "answers": {"text": [...], "answer_start": [...]}}`.

### API
```python
def load_dataset_normalized(repo_or_id, task, columns_map=None, split="train"):
    """Return a datasets.Dataset with normalized columns for the given task."""
    ...

def build_mixtures(d0_ds, dk_ds, sizes=[10_000, 20_000, 40_000], d0_ratio=0.5, seed=42):
    """Return dict: size -> datasets.Dataset of mixed examples (shuffled, repeat/trim as needed)."""

def load_benchmark(repo_or_id_or_path, task, split="validation"):
    """Return a small test set normalized for the chosen task."""
```
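A minimal sketch of `build_mixtures`, assuming both inputs are already-normalized `datasets.Dataset` objects and a simple D₀/Dₖ split; small sources are oversampled (indices repeated) to fill the quota:

```python
from datasets import concatenate_datasets

def build_mixtures(d0_ds, dk_ds, sizes=(10_000, 20_000, 40_000), d0_ratio=0.5, seed=42):
    def take(ds, n):
        # Repeat indices modulo len(ds) so a small dataset can still supply n examples.
        return ds.shuffle(seed=seed).select([i % len(ds) for i in range(n)])

    mixtures = {}
    for n in sizes:
        n0 = int(n * d0_ratio)
        mix = concatenate_datasets([take(d0_ds, n0), take(dk_ds, n - n0)])
        mixtures[n] = mix.shuffle(seed=seed)
    return mixtures
```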
**Acceptance criteria**
* Given a known dataset id, `load_dataset_normalized(...)` returns columns as specified.
* `build_mixtures(...)` returns ≥2 sizes with the requested counts.

---
## 8) Plotting Helper — `utils/plotting.py`
### API
```python
def plot_scaling(sizes, y_values, y_label, out_path):
    """Save a simple matplotlib PNG (log-x) with points + fitted curve if provided."""
```
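One possible implementation, assuming the headless Agg backend; the optional `fit` argument (the `alpha`/`b` dict from `jobs/scaling.py`, interpreted as a higher-is-better fit) goes beyond the signature above and is an assumption:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for Jobs/Spaces
import matplotlib.pyplot as plt

def plot_scaling(sizes, y_values, y_label, out_path, fit=None):
    fig, ax = plt.subplots()
    ax.scatter(sizes, y_values, label="observed")
    if fit is not None:
        # Draw the fitted curve slightly past the largest observed size.
        xs = np.logspace(np.log10(min(sizes)), np.log10(max(sizes) * 4), 50)
        ax.plot(xs, 1 - fit["b"] * xs ** (-fit["alpha"]), label="power-law fit")
    ax.set_xscale("log")
    ax.set_xlabel("mixture size (examples)")
    ax.set_ylabel(y_label)
    ax.legend()
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)
```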
* Use matplotlib; one figure per plot; do not enforce custom colors/styles.

**Acceptance criteria**
* Calling `plot_scaling(...)` produces a PNG saved to `out_path` without errors.

---
## 9) Training — `jobs/train.py` (PEFT/QLoRA SFT)
**NOTE: Currently the Space Hardware is ZeroGPU. For testing purposes, the training part can be replaced by extremely small models.**
### Responsibilities
* Load model + tokenizer (e.g., `meta-llama/Llama-3.1-8B-Instruct`).
* Apply LoRA (or QLoRA).
* Tokenize dataset and run short SFT.
### API (sketch)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer

def train_peft(model_id, train_ds, output_dir, max_steps=500, lr=2e-4, lora_r=8):
    tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token  # causal LMs like Llama ship without a pad token
    base = AutoModelForCausalLM.from_pretrained(model_id)
    peft_cfg = LoraConfig(r=lora_r, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
    model = get_peft_model(base, peft_cfg)

    def format_example(ex):
        # classification: concatenate prompt; QA: question + context formatting
        # MVP: simple "<s>[INST] ... [/INST]" style or plain text target
        return {"text": ex["text"]}  # adjust per task

    train_ds = train_ds.map(format_example)

    # Tokenization & SFTTrainer; keep it simple for MVP
    tr_args = TrainingArguments(output_dir=output_dir, per_device_train_batch_size=4,
                                gradient_accumulation_steps=4, learning_rate=lr,
                                max_steps=max_steps, logging_steps=50, save_strategy="no")
    trainer = SFTTrainer(model=model, tokenizer=tok, train_dataset=train_ds,
                         dataset_text_field="text", args=tr_args)
    trainer.train()
    # Save adapter only
    trainer.save_model(output_dir)
    return output_dir
```
**Acceptance criteria**
* On a tiny dataset (few hundred samples), training completes and saves an adapter folder.
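For a CPU/ZeroGPU smoke test, any tiny causal LM should do; `sshleifer/tiny-gpt2` is a common test checkpoint, and `tiny_train_ds` below is a placeholder for a ~200-example dataset:

```python
train_peft("sshleifer/tiny-gpt2", tiny_train_ds, "artifacts/smoke/adapter", max_steps=10)
```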
---
## 10) Evaluation — `jobs/eval.py`
### Responsibilities
* Run evaluation for the selected task using the fine‑tuned adapter.
* For **classification**: compute `loss` (optional) and `f1`.
* For **QA**: compute `exact_match` (and `f1` if you want both).
### API (sketch)
```python
import evaluate
import numpy as np

def eval_classification(model_id_or_path, test_ds):
    # Use pipeline or model.generate + simple argmax classifier (MVP)
    # Better: a small classification head; MVP keeps it simple.
    f1 = evaluate.load("f1")
    preds, refs = ..., ...  # predictions and gold labels from inference over test_ds
    return {"f1": f1.compute(predictions=preds, references=refs)["f1"]}

def eval_qa(model_id_or_path, test_ds):
    exact = evaluate.load("exact_match")
    # MVP: heuristic span matching if using generative outputs;
    # or reuse baseline SQuAD eval if test_ds has 'answers'.
    preds, refs = ..., ...  # predicted answer strings and reference answers from test_ds
    em = exact.compute(predictions=preds, references=refs)["exact_match"]
    return {"exact_match": em}
```
> **Note:** For MVP, inference can be slow. Keep test sets **small** (e.g., 500–1,000 examples) and batch where possible.

**Acceptance criteria**
* For a toy dataset, returns a metrics dict with expected keys.
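A hedged, generation-based variant of the classification eval, assuming the adapter directory can be loaded by `pipeline` (requires `peft` installed and an adapter config pointing at the base model) and that label names are passed in; `eval_classification_generative` and its prompt format are illustrative, not the planned API:

```python
import evaluate
from transformers import pipeline

def eval_classification_generative(adapter_dir, test_ds, label_names):
    gen = pipeline("text-generation", model=adapter_dir)
    preds, refs = [], []
    for ex in test_ds:
        out = gen(f"{ex['text']}\nLabel:", max_new_tokens=8,
                  return_full_text=False)[0]["generated_text"].lower()
        # Map the generation onto the first label name it mentions; default to class 0.
        preds.append(next((i for i, name in enumerate(label_names) if name.lower() in out), 0))
        refs.append(int(ex["label"]))
    f1 = evaluate.load("f1")
    return {"f1": f1.compute(predictions=preds, references=refs, average="macro")["f1"]}
```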
---
## 11) Scaling Law — `jobs/scaling.py`
### Responsibilities
* Fit a simple power‑law over points `(size → metric)`.
* For “higher‑is‑better” metrics, convert to a pseudo‑loss (e.g., `1 - score`) during fitting if desired.
* Produce a **prediction** at a user‑defined large‑scale target (e.g., `N* = 200k` examples).
### API (sketch)
```python
import numpy as np

def fit_powerlaw(sizes, scores, higher_is_better=True):
    sizes = np.asarray(sizes, float)
    y = np.asarray(scores, float)
    if higher_is_better:
        # Fit to (1 - score) ~ b * N^{-alpha}
        z = np.log(np.maximum(1e-9, 1 - y))
    else:
        # Direct loss scaling
        z = np.log(np.maximum(1e-9, y))
    x = np.log(sizes)
    k, c = np.polyfit(x, z, 1)  # z ≈ k*log N + c
    alpha = -k; b = np.exp(c)
    return {"alpha": float(alpha), "b": float(b)}

def predict_powerlaw(size, fit_params, higher_is_better=True):
    alpha, b = fit_params["alpha"], fit_params["b"]
    if higher_is_better:
        loss_hat = b * (size ** (-alpha))
        return float(1 - loss_hat)
    return float(b * (size ** (-alpha)))
```
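A quick toy check of the fit, with three illustrative F1 points (the numbers are made up for the example):

```python
sizes = [5_000, 10_000, 20_000]
f1_scores = [0.71, 0.76, 0.80]
fit = fit_powerlaw(sizes, f1_scores, higher_is_better=True)   # alpha ≈ 0.27 for these points
pred = predict_powerlaw(200_000, fit, higher_is_better=True)  # ≈ 0.89, above the observed scores
print(fit, pred)
```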
**Acceptance criteria**
* Given ≥2 points (prefer 3+), returns fit parameters and a plausible prediction.
* Combined with `utils/plotting.plot_scaling(...)`, writes a PNG with points + curve.

---
## 12) Experiment Orchestrator — `jobs/run_experiment.py`
### Responsibilities
* Parse CLI args: `--model`, `--task`, `--d0`, `--dk`, `--metrics ...`, `--sizes 10000 20000`, `--target_size 200000`, `--results_repo <id>`, `--job_id <uuid>`.
* Create working dirs: `artifacts/<job_id>/`.
* Load datasets (D₀, Dₖ), build mixtures for requested sizes.
* For each size:
  1. run short **train** (adapter saved under `artifacts/<job_id>/adapters/size-<N>`),
  2. run **eval** on the benchmark set → collect metrics.
* Fit **scaling** across sizes; produce:
  * `metrics.json` (per‑size metrics, fit params, predicted large‑scale performance),
  * `scaling.png` (plot).
* Push `artifacts/<job_id>/` to `results_repo` under `experiments/<job_id>/...` using `utils/hub.push_artifacts(...)` (a per‑user prefix is listed as post‑MVP in section 17).
* Print a final JSON line to stdout with the artifacts path (UI can parse logs if needed).

### CLI Skeleton
```python
import argparse, json, os, sys, uuid

# Assumes the repo root is importable; when invoked as `python jobs/run_experiment.py`,
# add the parent directory so `utils` and `jobs` resolve.
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from utils import hub, data, plotting
from jobs import train, eval as evalm, scaling

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--model", required=True)
    ap.add_argument("--task", choices=["classification", "qa"], required=True)
    ap.add_argument("--d0", required=True)
    ap.add_argument("--dk", required=True)
    ap.add_argument("--metrics", nargs="+", default=["f1"])
    ap.add_argument("--sizes", nargs="+", type=int, default=[10000, 20000, 40000])
    ap.add_argument("--target_size", type=int, default=200000)
    ap.add_argument("--results_repo", default=os.getenv("RESULTS_REPO", ""))
    ap.add_argument("--job_id", default=str(uuid.uuid4()))
    args = ap.parse_args()

    # Setup dirs
    out_dir = os.path.abspath(os.path.join("artifacts", args.job_id))
    os.makedirs(out_dir, exist_ok=True)

    # Load datasets
    d0 = data.load_dataset_normalized(args.d0, args.task)
    dk = data.load_dataset_normalized(args.dk, args.task)
    test = data.load_benchmark(args.d0, args.task, split="validation")  # MVP: reuse D₀ val if none provided

    # Build mixtures & run train/eval
    per_size = []
    for N in args.sizes:
        mix = data.build_mixtures(d0, dk, sizes=[N])[N]
        adapter_dir = os.path.join(out_dir, f"adapter_size_{N}")
        train.train_peft(args.model, mix, adapter_dir, max_steps=300)  # MVP: few steps
        metrics = {}
        if args.task == "classification":
            metrics.update(evalm.eval_classification(adapter_dir, test))
        else:
            metrics.update(evalm.eval_qa(adapter_dir, test))
        per_size.append({"size": N, "metrics": metrics})

    # Fit scaling on the primary metric
    key = "exact_match" if args.task == "qa" else "f1"
    sizes = [r["size"] for r in per_size]
    scores = [r["metrics"][key] for r in per_size]
    fit = scaling.fit_powerlaw(sizes, scores, higher_is_better=True)
    pred = scaling.predict_powerlaw(args.target_size, fit, higher_is_better=True)

    # Write artifacts
    mpath = os.path.join(out_dir, "metrics.json")
    with open(mpath, "w") as f:
        json.dump({"runs": per_size, "fit": fit,
                   "prediction": {"target_size": args.target_size, key: pred}}, f, indent=2)
    plotting.plot_scaling(sizes, scores, key, os.path.join(out_dir, "scaling.png"))

    # Push artifacts
    repo_id = hub.ensure_results_repo(os.getenv("SERVICE_HF_TOKEN"), args.results_repo)
    hub.push_artifacts(repo_id, out_dir, subdir=f"experiments/{args.job_id}")
    print(json.dumps({"status": "ok", "artifacts_repo": repo_id, "path": f"experiments/{args.job_id}"}))

if __name__ == "__main__":
    main()
```
**Acceptance criteria**
* Running with tiny toy inputs creates `artifacts/<job_id>/` + pushes to results repo.
* `metrics.json` and `scaling.png` exist and look sensible.

---
## 13) Job Submission from UI — `app.py` (continued)
### Core actions
* **Submit**: for each selected Dₖ → call `huggingface_hub.run_job(...)` with:
  * `image`: CUDA‑capable (e.g., `pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel`)
  * `command`: `["python","jobs/run_experiment.py", "--model", model_id, "--task", task, "--d0", d0_repo, "--dk", dk_id, "--metrics", *metrics, "--sizes", *sizes, "--target_size", str(target_size), "--results_repo", results_repo_or_empty]`
  * `flavor`: `"a10g-small"`
  * `timeout`: e.g., `7200` (seconds)
  * `env`: `{"HF_TOKEN": user_token or SERVICE_HF_TOKEN, "SERVICE_HF_TOKEN": SERVICE_HF_TOKEN, "RESULTS_REPO": RESULTS_REPO}`
* **Poll**: keep a dict `{job_id: {dk, status, url, artifacts}}`; update via `inspect_job(job_id)`; for `completed`, set artifacts link to `hf://<results_repo>/experiments/<job_id>/`.

**Acceptance criteria**
* Submitting 2 Dₖ creates 2 jobs; both progress independently; artifacts link works.

---
## 14) Guardrails & Licensing
* **Gated models**: probe download with `hf_hub_download(model_id, filename="README.md", token=user_token)` to confirm access; if 401/403, show a clear message asking the user to accept the license on the model card (see the sketch after this list).
* **Dataset licensing**: surface the `license` field from `candidates.json` next to each Dₖ; later, fetch it from the Hub.
* **Uploads**: warn users that uploaded D₀ will be stored in a **private dataset** (repo id shown in the UI); provide a “Delete my upload” note linking to the repo.
* **Resource limits**: cap sizes (`sizes=[5_000, 10_000]` for MVP) and cap the number of concurrent jobs per user (client‑side only for MVP).
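A minimal sketch of the gated-model probe, using `GatedRepoError`/`RepositoryNotFoundError` from `huggingface_hub.utils`; `check_model_access` is an illustrative helper name:

```python
from huggingface_hub import hf_hub_download
from huggingface_hub.utils import GatedRepoError, HfHubHTTPError, RepositoryNotFoundError

def check_model_access(model_id, user_token):
    """Return None if the model is readable with this token, else a user-facing message."""
    try:
        hf_hub_download(model_id, filename="README.md", token=user_token)
        return None
    except GatedRepoError:
        return f"Please accept the license for {model_id} on its model card, then retry."
    except RepositoryNotFoundError:
        return f"Model {model_id} was not found (check the id or your access)."
    except HfHubHTTPError as e:
        return f"Could not verify access to {model_id}: {e}"
```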
---
## 15) Testing
### Local (CPU) sanity checks
* Use a very small subset (e.g., 200 examples) and `max_steps=10` to verify the end‑to‑end loop without a GPU.
* Mock `run_job(...)` (optional) to test the UI job‑table logic (see the pytest sketch below).
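A tiny pytest sketch of that mock, assuming `submit`, `run_job`, and the hub helpers are importable from `app` as in section 19 and that importing `app` has no side effects (e.g., `launch()` guarded or patched); `FakeJob` and the test itself are illustrative:

```python
import app

class FakeJob:
    id = "job-123"
    url = "https://huggingface.co/jobs/job-123"

def test_submit_queues_one_job_per_dk(monkeypatch):
    # Replace the Hub calls with cheap fakes so no network or GPU is needed.
    monkeypatch.setattr(app, "run_job", lambda **kwargs: FakeJob())
    monkeypatch.setattr(app, "ensure_uploaded_dataset", lambda *a, **k: "org/toy-d0")
    monkeypatch.setattr(app, "ensure_results_repo", lambda *a, **k: "org/results-test")
    jobs = app.submit(None, "org/toy-d0", "classification", "tiny-model", ["f1"],
                      ["ag_news", "glue/sst2"], [200], 1000, profile=None, oauth=None)
    assert len(jobs) == 2
    assert all(j["status"] == "queued" for j in jobs)
```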
### Space integration
* Create a private results repo for testing (e.g., `your-org/curation-results-test`).
* Submit a single Dₖ job and verify:
  * `artifacts/` is created,
  * `metrics.json` contains per‑size metrics and the prediction,
  * `scaling.png` renders,
  * artifacts are uploaded and visible from the UI link.
---
## 16) Definition of Done (DoD)
* A signed‑in user can:
  1. Provide **D₀** (upload or Hub id),
  2. Choose **model**, **task**, **metrics**, and ≥1 **Dₖ**,
  3. Click **Run** and see a job per Dₖ with live status,
  4. Open **artifacts** (plot + metrics),
  5. See a **ranked table** of Dₖ by the chosen primary metric,
  6. (Optional) Download `metrics.json`.
* All long work executes as **Jobs** (no HTTP timeouts).
* Artifacts persist in a results dataset or Space storage.

---
## 17) Nice‑to‑Have (post‑MVP)
* **Column mapping UI**: let users map their D₀ columns to `text/label` or `question/context/answers`.
* **Seed sweeps** and confidence intervals on scaling fit.
* **Hardware selector** and budget estimator.
* **vLLM/TGI** inference for faster eval.
* **Per‑user “My Experiments”** page (prefix `experiments/<username>/...`).

---
## 18) Task Checklist (assignable to your agent)
**A. Scaffolding**
* [ ] Add `requirements.txt`; ensure importable on the Space.
* [ ] Create folders: `catalog/`, `utils/`, `jobs/`.

**B. Catalog**
* [ ] Fill `catalog/candidates.json` (3–6 datasets), including `columns` mapping.

**C. Hub utilities (`utils/hub.py`)**
* [ ] `ensure_uploaded_dataset(...)`
* [ ] `ensure_results_repo(...)`
* [ ] `push_artifacts(...)`

**D. Data helpers (`utils/data.py`)**
* [ ] `load_dataset_normalized(...)` for classification + QA
* [ ] `build_mixtures(...)`
* [ ] `load_benchmark(...)`

**E. Plotting (`utils/plotting.py`)**
* [ ] `plot_scaling(...)`

**F. Jobs**
* [ ] `jobs/train.py` (PEFT SFT)
* [ ] `jobs/eval.py` (classification + QA)
* [ ] `jobs/scaling.py` (fit + predict)
* [ ] `jobs/run_experiment.py` (glue the above, produce artifacts, push)

**G. UI (`app.py`)**
* [ ] Build form (inputs, selectors, candidates list)
* [ ] Submit one job per Dₖ via `run_job(...)`
* [ ] Poll job status & render jobs table
* [ ] Artifacts viewer: link to results repo path
* [ ] Basic error messages (license issues, upload failures)

**H. Tests**
* [ ] Local micro‑run (CPU) with tiny sizes
* [ ] Space run on GPU flavor with one Dₖ
* [ ] Verify artifacts + plot + ranking table

---
## 19) Code Snippets to Start Implementation
### `app.py` — minimal UI skeleton (submit + poll)
```python
import os, json, time, gradio as gr
from huggingface_hub import run_job, inspect_job
from utils.hub import ensure_uploaded_dataset, ensure_results_repo

CANDIDATES = json.load(open("catalog/candidates.json"))

def submit(d0_files, d0_id, task, model, metrics, dk_list, sizes, target_size,
           profile: gr.OAuthProfile | None, oauth: gr.OAuthToken | None):
    user_token = getattr(oauth, "token", None)
    d0_repo = ensure_uploaded_dataset(d0_files, d0_id, user_token=user_token)
    results_repo = ensure_results_repo(os.getenv("SERVICE_HF_TOKEN"), os.getenv("RESULTS_REPO", ""))
    jobs = []
    for dk in dk_list:
        cmd = ["python", "jobs/run_experiment.py",
               "--model", model, "--task", task, "--d0", d0_repo, "--dk", dk,
               "--metrics", *metrics, "--sizes", *[str(s) for s in sizes],
               "--target_size", str(target_size), "--results_repo", results_repo]
        job = run_job(
            image="pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel",
            command=cmd,
            flavor="a10g-small",
            timeout=7200,
            env={"HF_TOKEN": user_token or os.getenv("SERVICE_HF_TOKEN"),
                 "SERVICE_HF_TOKEN": os.getenv("SERVICE_HF_TOKEN"),
                 "RESULTS_REPO": results_repo},
        )
        jobs.append({"id": job.id, "dk": dk, "url": job.url, "status": "queued", "artifacts": ""})
    return jobs

def poll(jobs_state):
    updated = []
    for j in jobs_state:
        info = inspect_job(job_id=j["id"])
        st = info.status  # "queued"/"running"/"completed"/"failed"
        art = j.get("artifacts", "")
        # Heuristic: artifacts live in RESULTS_REPO/experiments/<job_id> (set by run_experiment.py)
        if st == "completed" and not art:
            art = f"{os.getenv('RESULTS_REPO', '(repo)')}/experiments/{j['id']}"
        updated.append({**j, "status": st, "artifacts": art})
    return updated

with gr.Blocks() as demo:
    prof = gr.LoginButton()
    with gr.Row():
        d0_files = gr.UploadButton("Upload D₀ (.csv/.jsonl/.zip)", file_count="multiple")
        d0_id = gr.Textbox(label="or Hub dataset id (user/dataset)")
    task = gr.Radio(choices=["classification", "qa"], value="classification", label="Task")
    model = gr.Dropdown(choices=["meta-llama/Llama-3.1-8B-Instruct"], label="Model")
    metrics = gr.CheckboxGroup(choices=["loss", "f1", "exact_match"], value=["f1"], label="Metrics")
    dk = gr.CheckboxGroup(choices=[c["id"] for c in CANDIDATES], label="Candidate datasets")
    sizes = gr.CheckboxGroup(choices=[5000, 10000, 20000], value=[5000, 10000], label="Mixture sizes")
    target_size = gr.Number(value=200000, label="Target size for prediction")
    run_btn = gr.Button("Run experiments")
    jobs_state = gr.State([])
    jobs_table = gr.Dataframe(headers=["id", "dk", "status", "url", "artifacts"],
                              datatype=["str", "str", "str", "str", "str"])

    # gr.OAuthProfile / gr.OAuthToken are injected from the type hints, not passed as inputs.
    run_btn.click(fn=submit,
                  inputs=[d0_files, d0_id, task, model, metrics, dk, sizes, target_size],
                  outputs=jobs_state)
    gr.Button("Refresh status").click(fn=poll, inputs=jobs_state, outputs=jobs_state)

    def render_table(jobs):  # render as simple rows
        rows = [[j["id"], j["dk"], j["status"], j["url"], j["artifacts"]] for j in jobs]
        return rows

    jobs_state.change(fn=render_table, inputs=jobs_state, outputs=jobs_table)
    gr.Markdown("Open artifacts in the results repo once jobs complete.")

demo.queue().launch()
```
### `utils/hub.py` — upload & results
```python
import os, uuid, tempfile, shutil
from huggingface_hub import create_repo, upload_folder

def ensure_uploaded_dataset(upload_files, d0_dataset_id, user_token=None):
    if d0_dataset_id:
        return d0_dataset_id
    if not upload_files:  # nothing uploaded
        raise ValueError("Please upload D₀ or provide a Hub dataset id.")
    repo_id = f"{os.getenv('HF_ORG', 'your-org')}/curation-upload-{uuid.uuid4().hex[:8]}"
    create_repo(repo_id, repo_type="dataset", private=True, exist_ok=True, token=os.getenv("SERVICE_HF_TOKEN"))
    with tempfile.TemporaryDirectory() as tmp:
        # Gradio returns a list of tempfiles or paths; copy them into a folder
        for f in upload_files:
            src = f.name if hasattr(f, "name") else f
            shutil.copyfile(src, os.path.join(tmp, os.path.basename(src)))
        upload_folder(folder_path=tmp, repo_id=repo_id, repo_type="dataset", token=os.getenv("SERVICE_HF_TOKEN"))
    return repo_id

def ensure_results_repo(service_token, results_repo_env):
    if results_repo_env:
        parts = results_repo_env.split("/")
        if len(parts) == 2:
            create_repo(results_repo_env, repo_type="dataset", private=True, exist_ok=True, token=service_token)
        return results_repo_env
    repo_id = f"{os.getenv('HF_ORG', 'your-org')}/curation-results"
    create_repo(repo_id, repo_type="dataset", private=True, exist_ok=True, token=service_token)
    return repo_id

def push_artifacts(repo_id, local_dir, subdir=""):
    path_in_repo = subdir.strip("/")
    upload_folder(folder_path=local_dir, repo_id=repo_id, repo_type="dataset",
                  path_in_repo=path_in_repo if path_in_repo else None,
                  token=os.getenv("SERVICE_HF_TOKEN"))
```