Space: tianhaowang/demo-curation

tianhaowang committed on Sep 29

Commit 6719953 · 1 Parent(s): 4384295

update ui

Files changed (4)

Development/Plan/ui-self-explainability-plan-2025-09-29.md +91 -0
app.py +154 -38
catalog/candidates.json +14 -0
utils/__pycache__/config.cpython-310.pyc +0 -0

Development/Plan/ui-self-explainability-plan-2025-09-29.md ADDED

	@@ -0,0 +1,91 @@

+# Implementation Plan: UI Self-Explainability Enhancements
+Date: 2025-09-29
+Author: Codex (AI Assistant)
+## Objective
+Reorganize the experiment configuration UI into clearly labeled specification blocks and introduce speech recognition task support, ensuring every control has explicit guidance and task-aware options for datasets, base models, metrics, and scaling targets.
+## Background & Research
+- The current `gr.Blocks` layout in `app.py` renders all controls in a single column with limited labeling, making the flow ambiguous to new users.
+- Task-aware behavior exists only for metrics and candidate datasets; classes and mappings live in `app.py` and load from `catalog/candidates.json`.
+- Speech recognition is not represented yet. Adding it requires augmenting the catalog and defining base model and benchmark presets directly in the UI layer.
+- Gradio supports semantic grouping via `gr.Group`, Markdown headings, and inline helper text, which can deliver the requested “block” presentation without architectural changes.
+## Technical Approach
+### Architecture Overview
+- Extend task metadata dictionaries in `app.py` to cover speech recognition metrics, base models, and benchmark datasets.
+- Load speech recognition candidate datasets via the existing catalog mechanism by appending new entries to `catalog/candidates.json`.
+- Restructure `build_interface()` to group related inputs under three labeled sections using `gr.Group` (or nested `gr.Column`) and helper `gr.Markdown` text.
+- Enhance the `on_task_change()` callback or introduce a new orchestrator to update metrics, candidate datasets, base models, benchmark choices, and scaling label simultaneously when the task changes.
+- Adjust submission wiring to pass through new benchmark selections without introducing silent defaults.
+### Step-by-Step Implementation
+1. **Catalog Update**: Append the two speech recognition training datasets to `catalog/candidates.json`, ensuring each entry includes `task: "speech_recognition"` and minimal column metadata where applicable.
+2. **Task Metadata Maps**: In `app.py`, define new constants for
+   - `TASK_MODEL_CHOICES`
+   - `TASK_BENCHMARK_CHOICES`
+   - Update `TASK_METRIC_CHOICES` / `TASK_METRIC_DEFAULT` to include speech recognition with `"loss"` and `"Word Error Rate (WER)"` (determine default explicitly).
+3. **UI Block Layout**: Within `build_interface()`, wrap training, evaluation, and scaling controls in dedicated groups:
+   - Add Markdown headings (e.g., `gr.Markdown("### Training task specifications")`).
+   - Place the instruction sentences (training/test uploads) right above their respective upload widgets.
+   - Rename component labels per new spec (e.g., `label="Task type"`, `label="Available external datasets for you to choose"`).
+4. **New Benchmark Selector**: Add a `gr.CheckboxGroup` (or `gr.Dropdown` if single choice is desired) for public benchmarks under the evaluation block, defaulting to empty. Ensure its choices update with the task.
+5. **Dynamic Task Handling**: Expand `on_task_change()` (or replace with a new handler) to update:
+   - Metric choices + defaults
+   - Candidate dataset choices
+   - Base model dropdown options
+   - Benchmark selector options
+   - Scaling number label (`gr.update(label=...)`) to append “(hours)” for speech recognition.
+6. **Submission Flow Adjustments**: Modify callback wiring so benchmark selections feed into `submit_with_feedback()` / `submit_experiments()`:
+   - Ensure mutually exclusive handling between manual test upload/id and benchmark pick (e.g., raise if both provided to avoid hidden fallback).
+   - When a benchmark dataset is chosen, pass its identifier as the test dataset source.
+7. **Validation Hooks**: Update or add unit coverage under `tests/` (likely `tests/test_app.py` or new module) to exercise `metrics_for_task`, base model mapping, and new task change logic, focusing on the speech recognition branch.
+### Sample Code
+```python
+# app.py
+TASK_MODEL_CHOICES = {
+    "classification": [DEFAULT_MODEL],
+    "qa": [DEFAULT_MODEL],
+    "pretraining": [DEFAULT_MODEL],
+    "speech_recognition": [
+        "anton-l/emformer-base-librispeech",
+        "train from scratch",
+    ],
+}
+TASK_BENCHMARK_CHOICES = {
+    "speech_recognition": [
+        "sanchit-gandhi/tedlium-data.test",
+        "openslr/librispeech_asr.test.clean",
+    ],
+    # other tasks can be populated if/when needed
+}
+def on_task_change(selected_task: str):
+    metric_choices, metric_defaults = metrics_for_task(selected_task)
+    return (
+        gr.update(choices=metric_choices, value=metric_defaults),
+        gr.update(choices=candidate_choices_for_task(selected_task), value=[]),
+        gr.update(choices=TASK_MODEL_CHOICES[selected_task], value=None),
+        gr.update(choices=TASK_BENCHMARK_CHOICES.get(selected_task, []), value=[]),
+        gr.update(label=_target_label_for_task(selected_task)),
+    )
+```
+## Dependencies
+- `gradio` for UI layout modifications.
+- Existing `utils` helpers for candidate dataset loading and submission validation.
+- Hugging Face hub access for dataset identifiers (no new runtime dependencies).
+## Risk Assessment
+- **UI Regression**: Reworking the layout may inadvertently detach components from callbacks; thorough manual verification is required.
+- **State Synchronization**: Updating multiple components on task change increases the chance of inconsistent state if any mapping is missing; mitigate by validating dictionaries cover all tasks during initialization.
+- **Benchmark/Test Conflicts**: Introducing public benchmark selection alongside manual uploads could create ambiguous submission behavior; enforce validation to avoid silent precedence.
+- **Future Task Expansion**: Hard-coded mappings will need maintenance; consider extracting to structured config if task set grows.
+## Success Criteria
+- The rendered UI shows three clearly labeled sections with explanatory text for uploads.
+- The task type options include “speech recognition”, and selecting it updates the available external datasets, base models, metrics, benchmark datasets, and scaling label (with “(hours)”).
+- Submission logic honors speech recognition selections without relying on fallback defaults, raising explicit errors for conflicting inputs.
+- Automated tests cover the new task metadata paths and pass successfully.
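
Note on step 6 of the plan: it asks for mutually exclusive handling of a manual test set and a public benchmark pick, raising instead of silently preferring one. A minimal sketch of such a check follows. The helper name `resolve_test_source` is hypothetical and not part of this commit, and the "at most one benchmark" rule is an assumption drawn from the plan's singular wording ("pass its identifier as the test dataset source").

```python
from typing import Any, List, Optional


def resolve_test_source(
    test_files: Optional[List[Any]],
    test_id: str,
    public_benchmarks: Optional[List[str]],
) -> Optional[str]:
    """Return the dataset identifier to evaluate on, or None.

    Hypothetical helper (not in the commit). None means no identifier was
    given; the caller still has to handle uploaded files itself.
    """
    cleaned_id = (test_id or "").strip()
    manual_given = bool(test_files) or bool(cleaned_id)
    benchmarks = list(public_benchmarks or [])

    # Refuse ambiguous input rather than applying a hidden precedence rule.
    if manual_given and benchmarks:
        raise ValueError(
            "Provide either a manual test set (upload or dataset id) "
            "or a public benchmark, not both."
        )
    # Assumption: a single benchmark acts as the test dataset source.
    if len(benchmarks) > 1:
        raise ValueError("Select at most one public benchmark per submission.")
    if benchmarks:
        return benchmarks[0]
    return cleaned_id or None
```

Under these assumptions, such a check could run at the top of `submit_experiments()`, before any job is launched, so conflicting inputs surface as an explicit error in the status banner.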

app.py CHANGED

@@ -82,18 +82,83 @@ def environment_diagnostics() -> Tuple[Dict[str, Any], Dict[str, Any]]:
 DEFAULT_MODEL = "meta-llama/Llama-3.1-8B-Instruct"
 DEFAULT_SIZES = [5000, 10000, 20000]
 TASK_METRIC_CHOICES: Dict[str, List[str]] = {
     "classification": ["loss", "f1", "exact_match"],
     "qa": ["loss", "f1", "exact_match"],
     "pretraining": ["loss", "perplexity"],
 }
 TASK_METRIC_DEFAULT: Dict[str, List[str]] = {
     "classification": ["f1"],
     "qa": ["f1"],
     "pretraining": ["perplexity"],
 }
 def _coerce_int_list(values: Iterable[Any] | None) -> List[int]:
     if values is None:
@@ -127,11 +192,20 @@ def metrics_for_task(task: str) -> Tuple[List[str], List[str]]:
     return choices, defaults
-def on_task_change(selected_task: str) -> Tuple[Dict[str, Any], Dict[str, Any]]:
-    metric_choices, metric_defaults = metrics_for_task(selected_task)
     return (
         gr.update(choices=metric_choices, value=metric_defaults),
-        gr.update(choices=candidate_choices_for_task(selected_task), value=[]),
     )
@@ -146,12 +220,16 @@ def submit_experiments(
     target_size: float,
     test_files: Optional[List[Any]],
     test_id: str,
     profile: Optional[gr.OAuthProfile] = None,
     oauth: Optional[gr.OAuthToken] = None,
 ) -> List[Dict[str, Any]]:
     if CONFIG_ERROR:
         raise RuntimeError(f"Configuration error: {CONFIG_ERROR}")
     assert CONFIG is not None
     try:
         CONFIG.require_service_token()
     except ConfigError as exc:
@@ -160,13 +238,13 @@ def submit_experiments(
             "in the Space settings before retrying."
         ) from exc
-    metric_choices, _ = metrics_for_task(task)
     if not metrics:
         raise ValueError("Select at least one metric for the chosen task.")
     invalid_metrics = [metric for metric in metrics if metric not in metric_choices]
     if invalid_metrics:
         invalid = ", ".join(invalid_metrics)
-        raise ValueError(f"Unsupported metric(s) for task '{task}': {invalid}.")
     selected_metrics = list(metrics)
     selected_sizes = _coerce_int_list(sizes)
@@ -222,7 +300,7 @@ def submit_experiments(
             "--model",
             model,
             "--task",
-            task,
             "--d0",
             d0_repo,
             "--dk",
@@ -262,6 +340,7 @@ def submit_experiments(
                 "url": getattr(job, "url", ""),
                 "status": job.status,
                 "artifacts": "",
             }
         )
     return jobs
@@ -277,21 +356,24 @@ def submit_with_feedback(
     dk_list: List[str],
     sizes: List[Any],
     target_size: float,
-    test_files: Optional[List[Any]],
-    test_id: str,
     profile: Optional[gr.OAuthProfile] = None,
     oauth: Optional[gr.OAuthToken] = None,
 ) -> Tuple[List[Dict[str, Any]], Dict[str, Any]]:
     try:
         jobs = submit_experiments(
             d0_files=d0_files,
             d0_id=d0_id,
-            task=task,
             model=model,
             metrics=metrics,
             dk_list=dk_list,
             sizes=sizes,
             target_size=target_size,
             test_files=test_files,
             test_id=test_id,
             profile=profile,
@@ -354,34 +436,63 @@ def build_interface() -> gr.Blocks:
                 visible=True,
             )
         status_banner = gr.Markdown("", visible=False)
-        with gr.Row():
-            d0_files = gr.File(label="Upload D₀ (.csv/.jsonl/.zip)", file_count="multiple")
-            d0_id = gr.Textbox(label="Hub dataset id (user/dataset)")
-        with gr.Row():
-            test_files = gr.File(label="Optional test set upload", file_count="multiple")
-            test_id = gr.Textbox(label="Test dataset id (user/dataset[:split])")
-        task = gr.Radio(
-            choices=["classification", "qa", "pretraining"],
-            value="classification",
-            label="Task",
-        )
-        model = gr.Dropdown(choices=[DEFAULT_MODEL], value=DEFAULT_MODEL, label="Model")
-        metric_choices, metric_defaults = metrics_for_task("classification")
-        metrics = gr.CheckboxGroup(
-            choices=metric_choices,
-            value=metric_defaults,
-            label="Metrics",
-        )
-        dk = gr.CheckboxGroup(
-            choices=candidate_choices_for_task("classification"),
-            label="Candidate datasets",
-        )
-        sizes = gr.CheckboxGroup(
-            choices=[str(size) for size in DEFAULT_SIZES],
-            value=[str(DEFAULT_SIZES[0]), str(DEFAULT_SIZES[1])],
-            label="Mixture sizes",
-        )
-        target_size = gr.Number(value=200000, label="Target size for prediction")
         run_btn = gr.Button("Run experiments")
         refresh_btn = gr.Button("Refresh status")
@@ -393,7 +504,11 @@ def build_interface() -> gr.Blocks:
             wrap=True,
         )
-        task.change(fn=on_task_change, inputs=task, outputs=[metrics, dk])
         run_btn.click(
             fn=submit_with_feedback,
@@ -409,6 +524,7 @@ def build_interface() -> gr.Blocks:
                 target_size,
                 test_files,
                 test_id,
             ],
             outputs=[jobs_state, status_banner],
         )

 DEFAULT_MODEL = "meta-llama/Llama-3.1-8B-Instruct"
 DEFAULT_SIZES = [5000, 10000, 20000]
+TASK_OPTIONS: List[Tuple[str, str]] = [
+    ("classification", "classification"),
+    ("qa", "qa"),
+    ("pretraining", "language model pretraining"),
+    ("speech_recognition", "speech recognition"),
+]
+TASK_LABEL_TO_VALUE: Dict[str, str] = {label: value for value, label in TASK_OPTIONS}
+TASK_VALUE_TO_LABEL: Dict[str, str] = {value: label for value, label in TASK_OPTIONS}
 TASK_METRIC_CHOICES: Dict[str, List[str]] = {
     "classification": ["loss", "f1", "exact_match"],
     "qa": ["loss", "f1", "exact_match"],
     "pretraining": ["loss", "perplexity"],
+    "speech_recognition": ["loss", "Word Error Rate (WER)"],
 }
 TASK_METRIC_DEFAULT: Dict[str, List[str]] = {
     "classification": ["f1"],
     "qa": ["f1"],
     "pretraining": ["perplexity"],
+    "speech_recognition": ["Word Error Rate (WER)"],
 }
+TASK_MODEL_CHOICES: Dict[str, List[str]] = {
+    "classification": [DEFAULT_MODEL],
+    "qa": [DEFAULT_MODEL],
+    "pretraining": [DEFAULT_MODEL],
+    "speech_recognition": [
+        "anton-l/emformer-base-librispeech",
+        "train from scratch",
+    ],
+}
+TASK_BENCHMARK_CHOICES: Dict[str, List[str]] = {
+    "speech_recognition": [
+        "sanchit-gandhi/tedlium-data.test",
+        "openslr/librispeech_asr.test.clean",
+    ]
+}
+def _task_value_from_label(label: str) -> str:
+    try:
+        return TASK_LABEL_TO_VALUE[label]
+    except KeyError as exc:
+        raise ValueError(f"Unsupported task label '{label}'.") from exc
+def _task_label_from_value(value: str) -> str:
+    try:
+        return TASK_VALUE_TO_LABEL[value]
+    except KeyError as exc:
+        raise ValueError(f"Unsupported task '{value}'.") from exc
+def _normalize_task_value(task: str) -> str:
+    if task in TASK_VALUE_TO_LABEL:
+        return task
+    return _task_value_from_label(task)
+def _model_choices_for_task(task: str) -> List[str]:
+    try:
+        choices = TASK_MODEL_CHOICES[task]
+    except KeyError as exc:
+        raise ValueError(f"Unsupported task '{task}'.") from exc
+    if not choices:
+        raise ValueError(f"No base models configured for task '{task}'.")
+    return choices
+def _target_label_for_task(task: str) -> str:
+    if task == "speech_recognition":
+        return "Target dataset size for full-scale training (hours)"
+    return "Target dataset size for full-scale training"
 def _coerce_int_list(values: Iterable[Any] | None) -> List[int]:
     if values is None:
     return choices, defaults
+def on_task_change(
+    selected_task_label: str,
+) -> Tuple[Dict[str, Any], Dict[str, Any], Dict[str, Any], Dict[str, Any], Dict[str, Any]]:
+    task_value = _task_value_from_label(selected_task_label)
+    metric_choices, metric_defaults = metrics_for_task(task_value)
+    candidate_choices = candidate_choices_for_task(task_value)
+    model_choices = _model_choices_for_task(task_value)
+    benchmark_choices = TASK_BENCHMARK_CHOICES.get(task_value, [])
     return (
         gr.update(choices=metric_choices, value=metric_defaults),
+        gr.update(choices=candidate_choices, value=[]),
+        gr.update(choices=model_choices, value=model_choices[0]),
+        gr.update(choices=benchmark_choices, value=[]),
+        gr.update(label=_target_label_for_task(task_value)),
     )
     target_size: float,
     test_files: Optional[List[Any]],
     test_id: str,
+    public_benchmarks: Optional[List[str]] = None,
     profile: Optional[gr.OAuthProfile] = None,
     oauth: Optional[gr.OAuthToken] = None,
 ) -> List[Dict[str, Any]]:
     if CONFIG_ERROR:
         raise RuntimeError(f"Configuration error: {CONFIG_ERROR}")
     assert CONFIG is not None
+    task_value = _normalize_task_value(task)
+    task_label = _task_label_from_value(task_value)
+    selected_public_benchmarks = list(public_benchmarks or [])
     try:
         CONFIG.require_service_token()
     except ConfigError as exc:
             "in the Space settings before retrying."
         ) from exc
+    metric_choices, _ = metrics_for_task(task_value)
     if not metrics:
         raise ValueError("Select at least one metric for the chosen task.")
     invalid_metrics = [metric for metric in metrics if metric not in metric_choices]
     if invalid_metrics:
         invalid = ", ".join(invalid_metrics)
+        raise ValueError(f"Unsupported metric(s) for task '{task_label}': {invalid}.")
     selected_metrics = list(metrics)
     selected_sizes = _coerce_int_list(sizes)
             "--model",
             model,
             "--task",
+            task_value,
             "--d0",
             d0_repo,
             "--dk",
                 "url": getattr(job, "url", ""),
                 "status": job.status,
                 "artifacts": "",
+                "benchmarks": selected_public_benchmarks,
             }
         )
     return jobs
     dk_list: List[str],
     sizes: List[Any],
     target_size: float,
+    test_files: Optional[List[Any]] = None,
+    test_id: str = "",
+    public_benchmarks: Optional[List[str]] = None,
     profile: Optional[gr.OAuthProfile] = None,
     oauth: Optional[gr.OAuthToken] = None,
 ) -> Tuple[List[Dict[str, Any]], Dict[str, Any]]:
+    task_value = _normalize_task_value(task)
     try:
         jobs = submit_experiments(
             d0_files=d0_files,
             d0_id=d0_id,
+            task=task_value,
             model=model,
             metrics=metrics,
             dk_list=dk_list,
             sizes=sizes,
             target_size=target_size,
+            public_benchmarks=public_benchmarks,
             test_files=test_files,
             test_id=test_id,
             profile=profile,
                 visible=True,
             )
         status_banner = gr.Markdown("", visible=False)
+        initial_task_value = "classification"
+        initial_task_label = _task_label_from_value(initial_task_value)
+        metric_choices, metric_defaults = metrics_for_task(initial_task_value)
+        candidate_choices = candidate_choices_for_task(initial_task_value)
+        model_choices = _model_choices_for_task(initial_task_value)
+        benchmark_choices = TASK_BENCHMARK_CHOICES.get(initial_task_value, [])
+        with gr.Group():
+            gr.Markdown("### Training task specifications")
+            task = gr.Radio(
+                choices=[label for _, label in TASK_OPTIONS],
+                value=initial_task_label,
+                label="Task type",
+            )
+            gr.Markdown("If you have any existing training data, please upload")
+            with gr.Row():
+                d0_files = gr.File(label="Upload D₀ (.csv/.jsonl/.zip)", file_count="multiple")
+                d0_id = gr.Textbox(label="Hub dataset id (user/dataset)")
+            dk = gr.CheckboxGroup(
+                choices=candidate_choices,
+                label="Available external datasets for you to choose",
+            )
+            model = gr.Dropdown(
+                choices=model_choices,
+                value=model_choices[0],
+                label="Base model",
+            )
+            sizes = gr.CheckboxGroup(
+                choices=[str(size) for size in DEFAULT_SIZES],
+                value=[str(DEFAULT_SIZES[0]), str(DEFAULT_SIZES[1])],
+                label="Mixture sizes",
+            )
+        with gr.Group():
+            gr.Markdown("### Evaluation specifications")
+            metrics = gr.CheckboxGroup(
+                choices=metric_choices,
+                value=metric_defaults,
+                label="Eval Metric",
+            )
+            gr.Markdown("If you have any existing benchmark dataset, please upload")
+            with gr.Row():
+                test_files = gr.File(label="Optional test set upload", file_count="multiple")
+                test_id = gr.Textbox(label="Test dataset id (user/dataset[:split])")
+            public_benchmarks = gr.CheckboxGroup(
+                choices=benchmark_choices,
+                value=[],
+                label="Available public benchmark datasets",
+            )
+        with gr.Group():
+            gr.Markdown("### Scaling prediction specifications")
+            target_size = gr.Number(
+                value=200000,
+                label=_target_label_for_task(initial_task_value),
+            )
         run_btn = gr.Button("Run experiments")
         refresh_btn = gr.Button("Refresh status")
             wrap=True,
         )
+        task.change(
+            fn=on_task_change,
+            inputs=task,
+            outputs=[metrics, dk, model, public_benchmarks, target_size],
+        )
         run_btn.click(
             fn=submit_with_feedback,
                 target_size,
                 test_files,
                 test_id,
+                public_benchmarks,
             ],
             outputs=[jobs_state, status_banner],
         )
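
Note on the task metadata maps: the plan's risk assessment suggests validating at initialization that every task is covered by each mapping. A minimal sketch of that check follows, using the `TASK_OPTIONS` and `TASK_METRIC_DEFAULT` values from this diff. The function `validate_task_maps` is hypothetical and does not exist in the commit. `TASK_BENCHMARK_CHOICES` is left out on purpose, since the code reads it with `.get(..., [])` and treats it as optional.

```python
from typing import Dict, List, Mapping, Sequence, Tuple

# Values copied from the diff above.
TASK_OPTIONS: List[Tuple[str, str]] = [
    ("classification", "classification"),
    ("qa", "qa"),
    ("pretraining", "language model pretraining"),
    ("speech_recognition", "speech recognition"),
]
TASK_METRIC_DEFAULT: Dict[str, List[str]] = {
    "classification": ["f1"],
    "qa": ["f1"],
    "pretraining": ["perplexity"],
    "speech_recognition": ["Word Error Rate (WER)"],
}


def validate_task_maps(
    task_options: Sequence[Tuple[str, str]],
    required_maps: Mapping[str, Mapping[str, object]],
) -> None:
    """Raise ValueError if any required mapping lacks a task value.

    Hypothetical helper (not in the commit).
    """
    task_values = [value for value, _ in task_options]
    problems = []
    for name, mapping in required_maps.items():
        missing = [task for task in task_values if task not in mapping]
        if missing:
            problems.append(f"{name} is missing: {', '.join(missing)}")
    if problems:
        raise ValueError("; ".join(problems))


# Passes silently when every task has an entry.
validate_task_maps(TASK_OPTIONS, {"TASK_METRIC_DEFAULT": TASK_METRIC_DEFAULT})
```

Calling this once at import time would turn a forgotten dictionary entry into a startup error rather than a `KeyError` raised inside a Gradio callback.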

catalog/candidates.json CHANGED

@@ -28,5 +28,19 @@
     "license": "cc-by-4.0",
     "size_hint": "365M",
     "columns": {"text": "text"}
   }
 ]

     "license": "cc-by-4.0",
     "size_hint": "365M",
     "columns": {"text": "text"}
+  },
+  {
+    "id": "sanchit-gandhi/tedlium-data.train",
+    "task": "speech_recognition",
+    "license": "unknown",
+    "size_hint": "unknown",
+    "columns": {}
+  },
+  {
+    "id": "openslr/librispeech_asr.train.clean.100",
+    "task": "speech_recognition",
+    "license": "unknown",
+    "size_hint": "100 hours",
+    "columns": {}
   }
 ]
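
Note on the catalog entries: the body of `candidate_choices_for_task()` is not shown in this diff, so the following is only an illustrative sketch of how entries shaped like the two added above could be filtered by their `task` field. The function `candidate_ids_for_task` is hypothetical and may differ from the repository's actual loader.

```python
import json
from typing import Any, Dict, List

# The two entries added in this commit, inlined for a self-contained example.
CATALOG_JSON = """
[
  {
    "id": "sanchit-gandhi/tedlium-data.train",
    "task": "speech_recognition",
    "license": "unknown",
    "size_hint": "unknown",
    "columns": {}
  },
  {
    "id": "openslr/librispeech_asr.train.clean.100",
    "task": "speech_recognition",
    "license": "unknown",
    "size_hint": "100 hours",
    "columns": {}
  }
]
"""


def candidate_ids_for_task(entries: List[Dict[str, Any]], task: str) -> List[str]:
    """Return ids of catalog entries whose task matches (hypothetical helper)."""
    return [entry["id"] for entry in entries if entry.get("task") == task]


entries = json.loads(CATALOG_JSON)
speech_ids = candidate_ids_for_task(entries, "speech_recognition")
```

Both new entries carry `"license": "unknown"`, so it may be worth confirming the licenses on the dataset cards before offering them as selectable training data.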

utils/__pycache__/config.cpython-310.pyc CHANGED

Binary files a/utils/__pycache__/config.cpython-310.pyc and b/utils/__pycache__/config.cpython-310.pyc differ