---
title: ellamind base-eval
emoji: 📊
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
hf_oauth: true
hf_oauth_expiration_minutes: 20160
hf_oauth_scopes:
  - read-repos
  - gated-repos
---

# ellamind base-eval

Interactive visualization for LLM evaluation scores during pre-training. Data is loaded from HuggingFace datasets via DuckDB-WASM; no backend is required.

## Features

- **Hierarchical task selection**: eval suite → task group → individual benchmark, with aggregate views
- **Multiple metrics**: acc, acc_norm, bits_per_byte, exact_match, pass@1, etc.
- **Model comparison**: toggle models on/off; separate checkpoint runs from baselines
- **Auto chart type**: line charts for training runs (tokens trained on the x-axis), bar charts for single-point comparisons
- **Multi-panel layout**: add multiple independent panels
- **Merge datasets**: append rows from additional HF datasets (including private ones via OAuth)
- **Smoothing**: configurable moving average for line charts
- **Benchmark goodness metrics**: per-task quality indicators below line charts
- **Export**: download charts as PNG or SVG
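
The smoothing option is described as a configurable moving average. A minimal sketch of such a window average (the trailing window and edge handling here are assumptions for illustration, not the app's exact implementation):

```python
def moving_average(values, window=5):
    """Smooth a series with a trailing moving average.

    Each point is the mean of up to `window` values ending at that
    point, so the smoothed series keeps the original length and the
    first points average over fewer samples.
    """
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - window + 1)
        chunk = values[lo:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

Keeping the output the same length as the input means smoothed line charts stay aligned with the underlying checkpoints.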

## Merge Datasets

You can merge additional HF datasets into the visualization at runtime. Enter a dataset path (e.g. `org/dataset-name` or `org/dataset-name/custom.parquet`) and click **Merge Dataset**. The additional data is row-appended to the base dataset.

For private datasets, sign in with your HuggingFace account using the OAuth button. The access token is used automatically when fetching.
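
A dataset path may or may not name an explicit parquet file. One way such a path could be split into a repo id and an optional filename (a hypothetical helper for illustration, not the app's actual code):

```python
def parse_dataset_path(path):
    """Split 'org/dataset[/file.parquet]' into (repo_id, filename).

    When no explicit file is given, filename is None and the caller
    falls back to whatever default file the dataset exposes.
    """
    parts = path.strip("/").split("/")
    if len(parts) < 2:
        raise ValueError(f"expected 'org/dataset[/file.parquet]', got {path!r}")
    repo_id = "/".join(parts[:2])
    filename = "/".join(parts[2:]) or None
    return repo_id, filename
```
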

## Benchmark Goodness Metrics

Line charts display quality indicators below the plot, inspired by the FineTasks methodology. Metrics are computed client-side across three stages: Overall, Early (first half of training), and Late (second half).

| Metric | What it measures | Green | Yellow | Red |
|---|---|---|---|---|
| Monotonicity | Spearman correlation between steps and score | ≥ 0.7 | 0.4–0.7 | < 0.4 |
| Signal Strength | Relative improvement over initial performance | ≥ 0.10 | 0.03–0.10 | < 0.03 |
| Noise | MAD of consecutive score diffs (robust to data-mix jumps) | – | – | – |
| Ordering | Kendall's Tau of model rankings between steps | ≥ 0.6 | 0.3–0.6 | < 0.3 |
| Discrimination | Std of scores across models at last checkpoint | ≥ 0.03 | 0.01–0.03 | < 0.01 |
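
To make the first two indicators concrete, here is a pure-Python sketch of how they could be computed (an approximation for illustration, not the app's client-side JS; the rank correlation below ignores tie correction, and the signal-strength formula is an assumption consistent with "relative improvement over initial performance"):

```python
def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def signal_strength(scores):
    """Relative improvement of the final score over the initial one."""
    return (scores[-1] - scores[0]) / abs(scores[0])
```

A benchmark whose scores rise steadily with training steps gets a monotonicity near 1.0 and a positive signal strength; a flat or noisy benchmark scores near zero on both.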

## Configuration

Model colors are defined in `config.yaml`:

```yaml
model_colors:
  "Qwen3 1.7B": "#9575CD"
  "Gemma 3 4B": "#00B0FF"
```

## Local Development

```bash
python3 -m http.server 8080
```

OAuth sign-in is only available when the app is deployed as an HF Space; when running locally, the sign-in button is hidden.

## Deployment

```bash
pixi run -- hf upload ellamind/base-eval index.html index.html --repo-type space
pixi run -- hf upload ellamind/base-eval config.yaml config.yaml --repo-type space
pixi run -- hf upload ellamind/base-eval README.md README.md --repo-type space
```

## Project Structure

```
index.html    # Single-file web app (HTML + CSS + JS)
config.yaml   # Model color overrides
README.md     # HF Spaces metadata + docs
```