---
title: ellamind base-eval
emoji: 📊
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
hf_oauth: true
hf_oauth_expiration_minutes: 20160
hf_oauth_scopes:
  - read-repos
  - gated-repos
---

# ellamind base-eval

Interactive visualization for LLM evaluation scores during pre-training. Data is loaded from HuggingFace datasets via DuckDB-WASM; no backend is required.

## Features

- **Hierarchical task selection**: eval suite → task group → individual benchmark, with aggregate views
- **Multiple metrics**: acc, acc_norm, bits_per_byte, exact_match, pass@1, etc.
- **Model comparison**: toggle models on/off; separate checkpoint runs from baselines
- **Auto chart type**: line charts for training runs (tokens trained on the x-axis), bar charts for single-point comparisons
- **Multi-panel layout**: add multiple independent panels
- **Merge datasets**: append rows from additional HF datasets (including private ones via OAuth)
- **Smoothing**: configurable moving average for line charts
- **Benchmark goodness metrics**: per-task quality indicators below line charts
- **Export**: download charts as PNG or SVG
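
The smoothing option is described as a configurable moving average. A minimal sketch of such a window average (the trailing window and edge handling here are assumptions for illustration, not the app's exact implementation):

```python
def moving_average(values, window=5):
    """Smooth a series with a trailing moving average.

    Each point is the mean of up to `window` values ending at that
    point, so the smoothed series keeps the original length and the
    first points average over fewer samples.
    """
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - window + 1)
        chunk = values[lo:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

Keeping the output the same length as the input means smoothed line charts stay aligned with the underlying checkpoints.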

## Merge Datasets

You can merge additional HF datasets into the visualization at runtime. Enter a dataset path (e.g. `org/dataset-name` or `org/dataset-name/custom.parquet`) and click **Merge Dataset**. The additional data is row-appended to the base dataset.

For private datasets, sign in with your HuggingFace account using the OAuth button. The access token is used automatically when fetching.
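
A dataset path may or may not name an explicit parquet file. One way such a path could be split into a repo id and an optional filename (a hypothetical helper for illustration, not the app's actual code):

```python
def parse_dataset_path(path):
    """Split 'org/dataset[/file.parquet]' into (repo_id, filename).

    When no explicit file is given, filename is None and the caller
    falls back to whatever default file the dataset exposes.
    """
    parts = path.strip("/").split("/")
    if len(parts) < 2:
        raise ValueError(f"expected 'org/dataset[/file.parquet]', got {path!r}")
    repo_id = "/".join(parts[:2])
    filename = "/".join(parts[2:]) or None
    return repo_id, filename
```
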

## Benchmark Goodness Metrics

Line charts display quality indicators below the plot, inspired by the FineTasks methodology. Metrics are computed client-side across three stages: Overall, Early (first half of training), and Late (second half).

| Metric | What it measures | Green | Yellow | Red |
|---|---|---|---|---|
| Monotonicity | Spearman correlation between steps and score | ≥ 0.7 | 0.4–0.7 | < 0.4 |
| Signal Strength | Relative improvement over initial performance | ≥ 0.10 | 0.03–0.10 | < 0.03 |
| Noise | MAD of consecutive score diffs (robust to data-mix jumps) | – | – | – |
| Ordering | Kendall's Tau of model rankings between steps | ≥ 0.6 | 0.3–0.6 | < 0.3 |
| Discrimination | Std of scores across models at last checkpoint | ≥ 0.03 | 0.01–0.03 | < 0.01 |
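
To make the first two indicators concrete, here is a pure-Python sketch of how they could be computed (an approximation for illustration, not the app's client-side JS; the rank correlation below ignores tie correction, and the signal-strength formula is an assumption consistent with "relative improvement over initial performance"):

```python
def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def signal_strength(scores):
    """Relative improvement of the final score over the initial one."""
    return (scores[-1] - scores[0]) / abs(scores[0])
```

A benchmark whose scores rise steadily with training steps gets a monotonicity near 1.0 and a positive signal strength; a flat or noisy benchmark scores near zero on both.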

## Configuration

Model colors are defined in `config.yaml`:

```yaml
model_colors:
  "Qwen3 1.7B": "#9575CD"
  "Gemma 3 4B": "#00B0FF"
```

## Local Development

```bash
python3 -m http.server 8080
```

OAuth sign-in is only available when the app is deployed as an HF Space; when running locally, the sign-in button is hidden.

## Deployment

```bash
pixi run -- hf upload ellamind/base-eval index.html index.html --repo-type space
pixi run -- hf upload ellamind/base-eval config.yaml config.yaml --repo-type space
pixi run -- hf upload ellamind/base-eval README.md README.md --repo-type space
```

## Project Structure

```
index.html    # Single-file web app (HTML + CSS + JS)
config.yaml   # Model color overrides
README.md     # HF Spaces metadata + docs
```