Upload from GitHub Actions: Merge pull request #9 from datenlabor-bmz/jn-dev
Files changed:

- .gitattributes +1 -0
- .gitignore +3 -0
- Dockerfile.eval +71 -0
- README.md +135 -0
- cloudbuild.yaml +5 -0
- deploy_eval.sh +29 -0
- evals/README.md +82 -0
- evals/backend.py +53 -36
- evals/datasets_/__init__.py +1 -0
- evals/datasets_/arc.py +33 -17
- evals/datasets_/mgsm.py +14 -9
- evals/datasets_/mmlu.py +49 -9
- evals/datasets_/truthfulqa.py +43 -6
- evals/datasets_/util.py +7 -0
- evals/main.py +137 -35
- evals/main_gcs.py +213 -0
- evals/models.py +68 -13
- evals/tasks.py +95 -79
- frontend/src/App.js +5 -1
- frontend/src/components/ModelTable.js +17 -7
- frontend/src/components/ScoreColumns.js +17 -10
- frontend/src/components/ScoreField.js +2 -1
- frontend/src/components/WorldMap.js +16 -2
- languages.json +17 -17
- models.json +1085 -85
- pyproject.toml +10 -0
- results.json +0 -0
- system_architecture_diagram.md +90 -56
- uv.lock +0 -0
.gitattributes
CHANGED

```diff
@@ -1 +1,2 @@
 evals/data_flow_architecture.png filter=lfs diff=lfs merge=lfs -text
+results.json filter=lfs diff=lfs merge=lfs -text
```
.gitignore
CHANGED

```diff
@@ -20,3 +20,6 @@ wheels/
 # folders and files to be ignored
 .specstory/
 .cursorindexingignore
+
+# Project-specific files
+.dockerignore.eval
```
Dockerfile.eval
ADDED
@@ -0,0 +1,71 @@

```dockerfile
FROM python:3.12-slim

# Set working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements and install Python dependencies
COPY pyproject.toml uv.lock ./
RUN pip install uv && uv sync --frozen

# Copy application code
COPY . .

# Verify dependencies are installed
RUN .venv/bin/python -c "import pandas, datasets, evaluate, fastapi, uvicorn, google.cloud.storage, google.cloud.translate, dotenv, elevenlabs, huggingface_hub, joblib, language_data, openai, requests, scipy, aiolimiter, sentencepiece, langcodes, rich, tqdm; print('✅ All dependencies verified')"

# Set environment variables with conservative limits
ENV N_SENTENCES=20
ENV MAX_LANGUAGES=150
ENV COST_LIMIT_USD=20

# Create a startup script with cost monitoring and HTTP server
RUN echo '#!/bin/bash\n\
\n\
# Force immediate log flushing for Cloud Run visibility\n\
export PYTHONUNBUFFERED=1\n\
export PYTHONIOENCODING=utf-8\n\
\n\
echo "🚀 Starting AI Language Evaluation..."\n\
echo "📊 Configuration: $N_SENTENCES sentences, $MAX_LANGUAGES languages"\n\
echo "💰 Cost limit: $COST_LIMIT_USD USD"\n\
echo "🛡️ Cost protection enabled"\n\
echo "🔧 Logging: Unbuffered Python output enabled"\n\
\n\
# Start a simple HTTP server to satisfy Cloud Run requirements\n\
python -m http.server 8080 &\n\
HTTP_SERVER_PID=$!\n\
\n\
# Start cost monitoring in background\n\
(\n\
start_time=$(date +%s)\n\
while true; do\n\
current_time=$(date +%s)\n\
elapsed_hours=$(( (current_time - start_time) / 3600 ))\n\
if [ $elapsed_hours -ge 24 ]; then\n\
echo "⚠️ MAX RUNTIME REACHED! Stopping evaluation..."\n\
pkill -f "python evals/main_gcs.py"\n\
break\n\
fi\n\
sleep 300  # Check every 5 minutes\n\
done\n\
) &\n\
\n\
# Run the evaluation with forced log flushing\n\
cd /app && .venv/bin/python -u evals/main_gcs.py\n\
\n\
# Stop the HTTP server\n\
kill $HTTP_SERVER_PID\n\
\n\
echo "✅ Evaluation completed!"\n\
' > /app/start.sh && chmod +x /app/start.sh

# Expose port (for Cloud Run requirements)
EXPOSE 8080

# Run the evaluation with resource limits
CMD ["/app/start.sh"]
```
README.md
CHANGED

````diff
@@ -43,12 +43,147 @@ For tag meaning, see https://huggingface.co/spaces/leaderboards/LeaderboardsExpl
 
 _Tracking language proficiency of AI models for every language_
 
+## System Architecture
+
+The AI Language Monitor evaluates language models across 100+ languages using a comprehensive pipeline that combines model discovery, automated evaluation, and real-time visualization.
+
+```mermaid
+flowchart TD
+    %% Model Sources
+    A1["important_models<br/>Static Curated List"] --> D[load_models]
+    A2["get_historical_popular_models<br/>Web Scraping - Top 20"] --> D
+    A3["get_current_popular_models<br/>Web Scraping - Top 10"] --> D
+    A4["blocklist<br/>Exclusions"] --> D
+
+    %% Model Processing
+    D --> |"Combine & Dedupe"| E["Dynamic Model List<br/>~40-50 models"]
+    E --> |get_or_metadata| F["OpenRouter API<br/>Model Metadata"]
+    F --> |get_hf_metadata| G["HuggingFace API<br/>Model Details"]
+    G --> H["Enriched Model DataFrame"]
+    H --> |Save| I[models.json]
+
+    %% Model Validation & Cost Filtering
+    H --> |"Validate Models<br/>Check API Availability"| H1["Valid Models Only<br/>Cost ≤ $20/1M tokens"]
+    H1 --> |"Timeout Protection<br/>120s for Large Models"| H2["Robust Model List"]
+
+    %% Language Data
+    J["languages.py<br/>BCP-47 + Population"] --> K["Top 100 Languages"]
+
+    %% Task Registry with Unified Prompting
+    L["tasks.py<br/>7 Evaluation Tasks"] --> M["Task Functions<br/>Unified English Zero-Shot"]
+    M --> M1["translation_from/to<br/>BLEU + ChrF"]
+    M --> M2["classification<br/>Accuracy"]
+    M --> M3["mmlu<br/>Accuracy"]
+    M --> M4["arc<br/>Accuracy"]
+    M --> M5["truthfulqa<br/>Accuracy"]
+    M --> M6["mgsm<br/>Accuracy"]
+
+    %% On-the-fly Translation with Origin Tagging
+    subgraph OTF [On-the-fly Dataset Translation]
+        direction LR
+        DS_raw["Raw English Dataset<br/>(e.g., MMLU)"] --> Google_Translate["Google Translate API"]
+        Google_Translate --> DS_translated["Translated Dataset<br/>(e.g., German MMLU)<br/>Origin: 'machine'"]
+        DS_native["Native Dataset<br/>(e.g., German MMLU)<br/>Origin: 'human'"]
+    end
+
+    %% Evaluation Pipeline
+    H2 --> |"models ID"| N["main.py / main_gcs.py<br/>evaluate"]
+    K --> |"languages bcp_47"| N
+    L --> |"tasks.items"| N
+    N --> |"Filter by model.tasks"| O["Valid Combinations<br/>Model × Language × Task"]
+    O --> |"10 samples each"| P["Evaluation Execution<br/>Batch Processing"]
+
+    %% Task Execution with Origin Tracking
+    P --> Q1[translate_and_evaluate<br/>Origin: 'human']
+    P --> Q2[classify_and_evaluate<br/>Origin: 'human']
+    P --> Q3[mmlu_and_evaluate<br/>Origin: 'human'/'machine']
+    P --> Q4[arc_and_evaluate<br/>Origin: 'human'/'machine']
+    P --> Q5[truthfulqa_and_evaluate<br/>Origin: 'human'/'machine']
+    P --> Q6[mgsm_and_evaluate<br/>Origin: 'human'/'machine']
+
+    %% API Calls with Error Handling
+    Q1 --> |"complete() API<br/>Rate Limiting"| R["OpenRouter<br/>Model Inference"]
+    Q2 --> |"complete() API<br/>Rate Limiting"| R
+    Q3 --> |"complete() API<br/>Rate Limiting"| R
+    Q4 --> |"complete() API<br/>Rate Limiting"| R
+    Q5 --> |"complete() API<br/>Rate Limiting"| R
+    Q6 --> |"complete() API<br/>Rate Limiting"| R
+
+    %% Results Processing with Origin Aggregation
+    R --> |Scores| S["Result Aggregation<br/>Mean by model+lang+task+origin"]
+    S --> |Save| T[results.json]
+
+    %% Backend & Frontend with Origin-Specific Metrics
+    T --> |Read| U[backend.py]
+    I --> |Read| U
+    U --> |make_model_table| V["Model Rankings<br/>Origin-Specific Metrics"]
+    U --> |make_country_table| W["Country Aggregation"]
+    U --> |"API Endpoint"| X["FastAPI /api/data<br/>arc_accuracy_human<br/>arc_accuracy_machine"]
+    X --> |"JSON Response"| Y["Frontend React App"]
+
+    %% UI Components
+    Y --> Z1["WorldMap.js<br/>Country Visualization"]
+    Y --> Z2["ModelTable.js<br/>Model Rankings"]
+    Y --> Z3["LanguageTable.js<br/>Language Coverage"]
+    Y --> Z4["DatasetTable.js<br/>Task Performance"]
+
+    %% Data Sources with Origin Information
+    subgraph DS ["Data Sources"]
+        DS1["Flores-200<br/>Translation Sentences<br/>Origin: 'human'"]
+        DS2["MMLU/AfriMMLU<br/>Knowledge QA<br/>Origin: 'human'"]
+        DS3["ARC<br/>Science Reasoning<br/>Origin: 'human'"]
+        DS4["TruthfulQA<br/>Truthfulness<br/>Origin: 'human'"]
+        DS5["MGSM<br/>Math Problems<br/>Origin: 'human'"]
+    end
+
+    DS1 --> Q1
+    DS2 --> Q3
+    DS3 --> Q4
+    DS4 --> Q5
+    DS5 --> Q6
+
+    DS_translated --> Q3
+    DS_translated --> Q4
+    DS_translated --> Q5
+
+    DS_native --> Q3
+    DS_native --> Q4
+    DS_native --> Q5
+
+    %% Styling - Neutral colors that work in both dark and light modes
+    classDef modelSource fill:#f8f9fa,stroke:#6c757d,color:#212529
+    classDef evaluation fill:#e9ecef,stroke:#495057,color:#212529
+    classDef api fill:#dee2e6,stroke:#6c757d,color:#212529
+    classDef storage fill:#d1ecf1,stroke:#0c5460,color:#0c5460
+    classDef frontend fill:#f8d7da,stroke:#721c24,color:#721c24
+    classDef translation fill:#d4edda,stroke:#155724,color:#155724
+
+    class A1,A2,A3,A4 modelSource
+    class Q1,Q2,Q3,Q4,Q5,Q6,P evaluation
+    class R,F,G,X api
+    class T,I storage
+    class Y,Z1,Z2,Z3,Z4 frontend
+    class Google_Translate,DS_translated,DS_native translation
+```
+
+**Key Features:**
+- **Model Discovery**: Combines curated models with real-time trending models via web scraping
+- **Multi-Task Evaluation**: 7 tasks across 100+ languages with origin tracking (human vs machine-translated)
+- **Scalable Architecture**: Dual deployment (local/GitHub vs Google Cloud)
+- **Real-time Visualization**: Interactive web interface with country-level insights
+
 ## Evaluate
 
+### Local Development
 ```bash
 uv run --extra dev evals/main.py
 ```
 
+### Google Cloud Deployment
+```bash
+uv run --extra dev evals/main_gcs.py
+```
+
 ## Explore
 
 ```bash
````
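As a quick illustration of the "Combine & Dedupe" step in the diagram above, here is a minimal, self-contained sketch. The model IDs, list names, and the `load_models` signature are illustrative stand-ins, not the repository's actual API:

```python
# Hypothetical stand-ins for the curated list, scraped trending list, and blocklist.
important_models = ["openai/gpt-4o", "meta-llama/llama-3.1-70b-instruct"]
trending_models = ["meta-llama/llama-3.1-70b-instruct", "mistralai/mistral-large"]
blocklist = {"mistralai/mistral-large"}

def load_models(curated, trending, blocked):
    # keep insertion order while removing duplicates, then drop blocked IDs
    seen = dict.fromkeys(curated + trending)
    return [m for m in seen if m not in blocked]

print(load_models(important_models, trending_models, blocklist))
# ['openai/gpt-4o', 'meta-llama/llama-3.1-70b-instruct']
```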
cloudbuild.yaml
ADDED
@@ -0,0 +1,5 @@

```yaml
steps:
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-f', 'Dockerfile.eval', '-t', 'gcr.io/$PROJECT_ID/ai-language-eval', '.']
images:
  - 'gcr.io/$PROJECT_ID/ai-language-eval'
```
deploy_eval.sh
ADDED
@@ -0,0 +1,29 @@

```bash
#!/bin/bash

echo "Deploying AI Language Evaluation to Google Cloud Run"
echo "Cost limit: $20 USD"
echo "No runtime limit - will run to completion"

# Build the Docker image first
echo "🔨 Building Docker image..."
gcloud builds submit --config cloudbuild.yaml .

# Deploy the built image
echo "🚀 Deploying to Cloud Run..."
gcloud run deploy ai-language-eval \
    --image gcr.io/ai-language-eval-1754052060/ai-language-eval \
    --region us-central1 \
    --platform managed \
    --memory 2Gi \
    --cpu 1 \
    --max-instances 1 \
    --timeout 3600 \
    --concurrency 1 \
    --no-allow-unauthenticated \
    --set-env-vars="N_SENTENCES=20,MAX_LANGUAGES=150,COST_LIMIT_USD=20,PYTHONUNBUFFERED=1,PYTHONIOENCODING=utf-8" \
    --quiet

echo "✅ Deployment completed!"
echo "🌐 Service URL: $(gcloud run services describe ai-language-eval --region=us-central1 --format='value(status.url)')"
echo "📊 Monitor costs: https://console.cloud.google.com/billing/linkedaccount?project=ai-language-eval-1754052060"
echo "💾 Results will be saved to: gs://ai-language-eval-results/"
```
evals/README.md
ADDED
@@ -0,0 +1,82 @@

````markdown
# Evaluation Framework Documentation

This document outlines the current methodology used for evaluating multilingual language models in this project. The framework is designed to be fair, consistent, and robust, providing a standardized way to measure model performance across a diverse set of languages and tasks.

## Core Philosophy: English Zero-Shot Prompting

The core of our evaluation methodology is a **unified English zero-shot prompting strategy**. This means:

1. **Instructions are in English**: All models receive their instructions in clear, standardized English. This removes the quality of prompt translation as a variable, ensuring a fair comparison.
2. **Content is in the Target Language**: The actual content to be evaluated (e.g., a question for a QA task, a sentence for translation) is always presented in the target language. This directly tests the model's ability to understand instructions in one language and apply them to content in another.
3. **Zero-Shot (with a Twist)**: We do not provide in-context examples from the test datasets. However, for Question Answering tasks, we provide a static, English-based "scratchpad" example. This doesn't teach the model the answer, but rather the *format* for its reasoning and final output, which is crucial for reliable response parsing.

---

## Task-Specific Prompting Strategies

Below is a breakdown of the prompt structure for each of the active evaluation tasks.

### 1. Translation (`translation`)

- **Objective**: To evaluate the model's ability to translate text both to and from a target language.
- **Prompt Structure**: A direct, zero-shot English instruction.

```
Translate the following text to the {target_language_name} language; use the {script} script; reply only with the translation:

{original_sentence}
```

### 2. Classification (`classification`)

- **Objective**: To evaluate the model's ability to classify a paragraph of text into one of five topics.
- **Prompt Structure**: A direct, zero-shot English instruction providing the available topics.

```
Classify the following text into one of these topics: {topic1}, {topic2}, {topic3}, {topic4}, {topic5}.
Reply with only the topic name.

Text:
{paragraph_in_target_language}
```

### 3. Question Answering (`mmlu`, `arc`, `truthfulqa`)

- **Objective**: To evaluate the model's knowledge and reasoning abilities on multiple-choice questions.
- **Prompt Structure**: A zero-shot English instruction combined with a "reasoning scratchpad" format.

```
Solve the following multiple choice question. Reason step-by-step and then write the final answer as a single letter.

Response format: <reasoning> #### <letter>

---

{question_and_choices_in_target_language}
```

### 4. Math Word Problems (`mgsm`)

- **Objective**: To evaluate the model's ability to solve mathematical reasoning problems.
- **Prompt Structure**: Similar to the QA tasks, this uses a zero-shot English instruction with a reasoning scratchpad, but asks for a number as the final answer.

```
Solve the following math problem. Reason step-by-step and then write the final answer as a number.

Response format: <reasoning> #### <number>

---

{math_problem_in_target_language}
```

---

## Advantages and Disadvantages of this Methodology

### Advantages

- **Fairness and Control**: By using standardized English prompts, we eliminate the quality of prompt translation as a confounding variable, leading to a fairer comparison between models.
- **Robustness**: This approach directly tests a model's cross-lingual instruction-following capabilities, which is a key measure of its multilingual prowess.
- **Simplicity and Maintainability**: The zero-shot approach significantly simplifies the codebase, making it easier to maintain and extend.

### Disadvantages

- **Brittleness of Response Parsing**: The evaluation of QA and Math tasks is highly dependent on the model's ability to perfectly adhere to the `#### <answer>` format. Models that produce correct reasoning but fail to follow the format will be unfairly penalized.
- **Potential for Cross-Lingual Confusion**: Less capable models may struggle with instructions in one language and content in another, which could impact their performance.
````
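Since scoring hinges on the `#### <answer>` convention described above, a permissive parser reduces the brittleness. This is a minimal sketch of such a parser, not the project's actual parsing code; it assumes the answer is whatever follows the last `####` marker:

```python
import re

def parse_final_answer(response: str) -> str | None:
    """Extract the final answer from '<reasoning> #### <answer>' responses.

    Splits on the last '####' so that stray '####' inside the reasoning
    does not break parsing; returns None if the marker is missing.
    """
    if "####" not in response:
        return None
    tail = response.rsplit("####", 1)[1].strip()
    # keep only the first token, e.g. 'B' from 'B.' or '42' from '42 apples'
    match = re.match(r"[A-Za-z0-9.,-]+", tail)
    return match.group(0).rstrip(".,") if match else None

assert parse_final_answer("Step 1 ... Step 2 ... #### B.") == "B"
assert parse_final_answer("The total is 6*7=42. #### 42") == "42"
assert parse_final_answer("no marker here") is None
```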
evals/backend.py
CHANGED

```diff
@@ -26,7 +26,7 @@ task_metrics = [
     "classification_accuracy",
     "mmlu_accuracy",
     "arc_accuracy",
-
+    "truthfulqa_accuracy",
     "mgsm_accuracy",
 ]
 
@@ -46,65 +46,73 @@ def compute_normalized_average(df, metrics):
 
 
 def make_model_table(df, models):
-
-
-
-
-    )
+    # Create a combined task_metric for origin
+    df["task_metric_origin"] = df["task"] + "_" + df["metric"] + "_" + df["origin"]
+
+    # Pivot to get scores for each origin-specific metric
+    scores_pivot = df.pivot_table(index="model", columns="task_metric_origin", values="score", aggfunc="mean")
+
+    # Create the regular task_metric for the main average calculation
     df["task_metric"] = df["task"] + "_" + df["metric"]
-
-
+    main_pivot = df.pivot_table(index="model", columns="task_metric", values="score", aggfunc="mean")
+
+    # Merge the two pivots
+    df = pd.merge(main_pivot, scores_pivot, on="model", how="outer")
+
     for metric in task_metrics:
         if metric not in df.columns:
             df[metric] = np.nan
+
     df["average"] = compute_normalized_average(df, task_metrics)
     df = df.sort_values(by="average", ascending=False).reset_index()
     df = pd.merge(df, models, left_on="model", right_on="id", how="left")
     df["rank"] = df.index + 1
+
+    # Dynamically find all metric columns to include
+    final_cols = df.columns
+    metric_cols = [m for m in final_cols if any(tm in m for tm in task_metrics)]
+
     df = df[
         [
-            "rank",
-            "
-
-            "provider_name",
-            "hf_id",
-            "creation_date",
-            "size",
-            "type",
-            "license",
-            "cost",
-            "average",
-            *task_metrics,
+            "rank", "model", "name", "provider_name", "hf_id", "creation_date",
+            "size", "type", "license", "cost", "average",
+            *sorted(list(set(metric_cols)))
         ]
     ]
     return df
 
 
 def make_language_table(df, languages):
-
-
-
-
-    )
+    # Create a combined task_metric for origin
+    df["task_metric_origin"] = df["task"] + "_" + df["metric"] + "_" + df["origin"]
+
+    # Pivot to get scores for each origin-specific metric
+    scores_pivot = df.pivot_table(index="bcp_47", columns="task_metric_origin", values="score", aggfunc="mean")
+
+    # Create the regular task_metric for the main average calculation
     df["task_metric"] = df["task"] + "_" + df["metric"]
-
-
+    main_pivot = df.pivot_table(index="bcp_47", columns="task_metric", values="score", aggfunc="mean")
+
+    # Merge the two pivots
+    df = pd.merge(main_pivot, scores_pivot, on="bcp_47", how="outer")
+
     for metric in task_metrics:
         if metric not in df.columns:
             df[metric] = np.nan
+
     df["average"] = compute_normalized_average(df, task_metrics)
     df = pd.merge(languages, df, on="bcp_47", how="outer")
     df = df.sort_values(by="speakers", ascending=False)
+
+    # Dynamically find all metric columns to include
+    final_cols = df.columns
+    metric_cols = [m for m in final_cols if any(tm in m for tm in task_metrics)]
+
     df = df[
         [
-            "bcp_47",
-            "
-
-            "speakers",
-            "family",
-            "average",
-            "in_benchmark",
-            *task_metrics,
+            "bcp_47", "language_name", "autonym", "speakers", "family",
+            "average", "in_benchmark",
+            *sorted(list(set(metric_cols)))
         ]
     ]
     return df
 
@@ -125,10 +133,18 @@ async def data(request: Request):
     body = await request.body()
     data = json.loads(body)
     selected_languages = data.get("selectedLanguages", {})
-    df = scores.groupby(["model", "bcp_47", "task", "metric"]).mean().reset_index()
+    df = scores.groupby(["model", "bcp_47", "task", "metric", "origin"]).mean().reset_index()
     # lang_results = pd.merge(languages, lang_results, on="bcp_47", how="outer")
     language_table = make_language_table(df, languages)
     datasets_df = pd.read_json("datasets.json")
+
+    # Identify which metrics have machine translations available
+    machine_translated_metrics = set()
+    for _, row in df.iterrows():
+        if row["origin"] == "machine":
+            metric_name = f"{row['task']}_{row['metric']}"
+            machine_translated_metrics.add(metric_name)
+
     if selected_languages:
         # the filtering is only applied for the model table and the country data
         df = df[df["bcp_47"].isin(lang["bcp_47"] for lang in selected_languages)]
@@ -143,6 +159,7 @@ async def data(request: Request):
         "language_table": serialize(language_table),
         "dataset_table": serialize(datasets_df),
         "countries": serialize(countries),
+        "machine_translated_metrics": list(machine_translated_metrics),
     }
     return JSONResponse(content=all_tables)
```
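The effect of the origin-aware pivoting can be seen on a toy frame. This standalone sketch mirrors the `make_model_table` logic above with made-up scores; it is illustrative, not the backend itself:

```python
import pandas as pd

scores = pd.DataFrame({
    "model":  ["m1", "m1", "m1"],
    "task":   ["arc", "arc", "mmlu"],
    "metric": ["accuracy", "accuracy", "accuracy"],
    "origin": ["human", "machine", "human"],
    "score":  [0.8, 0.6, 0.7],
})

# Origin-specific columns, e.g. arc_accuracy_human / arc_accuracy_machine
scores["task_metric_origin"] = scores["task"] + "_" + scores["metric"] + "_" + scores["origin"]
by_origin = scores.pivot_table(index="model", columns="task_metric_origin",
                               values="score", aggfunc="mean")

# Origin-agnostic columns used for the overall average
scores["task_metric"] = scores["task"] + "_" + scores["metric"]
overall = scores.pivot_table(index="model", columns="task_metric",
                             values="score", aggfunc="mean")

# merging on the shared index level name works since pandas supports index-level joins
print(pd.merge(overall, by_origin, on="model", how="outer"))
# arc_accuracy = 0.7 (mean over both origins), alongside
# arc_accuracy_human = 0.8, arc_accuracy_machine = 0.6, mmlu_accuracy_human = 0.7
```

This is why the frontend can show a combined score while still flagging which metrics are backed by machine-translated data.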
evals/datasets_/__init__.py
ADDED
@@ -0,0 +1 @@

```python
# This file makes datasets_ a Python package
```
evals/datasets_/arc.py
CHANGED

```diff
@@ -3,9 +3,9 @@ from collections import Counter, defaultdict
 
 from langcodes import Language, standardize_tag
 from rich import print
-from models import translate_google,
+from models import translate_google, get_google_supported_languages
 from tqdm import tqdm
-from datasets import
+from datasets import load_dataset
 import asyncio
 from tqdm.asyncio import tqdm_asyncio
 import os
@@ -14,27 +14,33 @@ from datasets_.util import _get_dataset_config_names, _load_dataset
 
 slug_uhura_arc_easy = "masakhane/uhura-arc-easy"
 tags_uhura_arc_easy = {
-    standardize_tag(a.split("_")[0], macro=True): a
+    standardize_tag(a.split("_")[0], macro=True): a
+    for a in _get_dataset_config_names(slug_uhura_arc_easy)
     if not a.endswith("unmatched")
 }
 
 
 random.seed(42)
-id_sets_train = [
+id_sets_train = [
+    set(_load_dataset(slug_uhura_arc_easy, tag, split="train")["id"])
+    for tag in tags_uhura_arc_easy.values()
+]
 common_ids_train = list(sorted(set.intersection(*id_sets_train)))
 random.shuffle(common_ids_train)
-id_sets_test = [
+id_sets_test = [
+    set(_load_dataset(slug_uhura_arc_easy, tag, split="test")["id"])
+    for tag in tags_uhura_arc_easy.values()
+]
 common_ids_test = list(sorted(set.intersection(*id_sets_test)))
 random.shuffle(common_ids_test)
 
 slug_uhura_arc_easy_translated = "fair-forward/arc-easy-autotranslated"
 tags_uhura_arc_easy_translated = {
-    standardize_tag(a.split("_")[0], macro=True): a
+    standardize_tag(a.split("_")[0], macro=True): a
+    for a in _get_dataset_config_names(slug_uhura_arc_easy_translated)
 }
 
 
-
-
 def add_choices(row):
     row["choices"] = row["choices"]["text"]
     return row
@@ -45,27 +51,37 @@ def load_uhura_arc_easy(language_bcp_47, nr):
         ds = _load_dataset(slug_uhura_arc_easy, tags_uhura_arc_easy[language_bcp_47])
         ds = ds.map(add_choices)
         ds = ds.rename_column("answerKey", "answer")
-        train_ids = common_ids_train[nr:nr+3]
-        examples = ds["train"].filter(lambda x: x["id"] in train_ids)
         task = ds["test"].filter(lambda x: x["id"] == common_ids_test[nr])[0]
-        return "masakhane/uhura-arc-easy",
+        return "masakhane/uhura-arc-easy", task, "human"
     if language_bcp_47 in tags_uhura_arc_easy_translated.keys():
-        ds = _load_dataset(
+        ds = _load_dataset(
+            slug_uhura_arc_easy_translated,
+            tags_uhura_arc_easy_translated[language_bcp_47],
+        )
         ds = ds.rename_column("answerKey", "answer")
-        train_ids = common_ids_train[nr:nr+3]
-        examples = ds["train"].filter(lambda x: x["id"] in train_ids)
-        # raise Exception(language_bcp_47)
         task = ds["test"].filter(lambda x: x["id"] == common_ids_test[nr])[0]
-        return "fair-forward/arc-easy-autotranslated",
+        return "fair-forward/arc-easy-autotranslated", task, "machine"
     else:
+        # ARC does not support on-the-fly translation currently
        return None, None, None
 
+
+def load_uhura_arc_challenge(language_bcp_47, nr):
+    ds_name = "jlahd/uhura_arc_challenge"
+    if language_bcp_47 in _get_dataset_config_names(ds_name):
+        ds = _load_dataset(ds_name, language_bcp_47)
+        task = ds["test"][nr]
+        return ds_name, task
+    else:
+        return None, None, None
+
+
 def translate_arc(languages):
     human_translated = tags_uhura_arc_easy.keys()
     untranslated = [
         lang
         for lang in languages["bcp_47"].values[:100]
-        if lang not in human_translated and lang in
+        if lang not in human_translated and lang in get_google_supported_languages()
     ]
     n_samples = 10
     train_ids = common_ids_train[:n_samples+3]
```
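The intersection-and-seeded-shuffle idiom above guarantees that every language config serves the same items in the same order. A toy version of the same pattern, independent of the datasets involved (the IDs here are hypothetical):

```python
import random

# Hypothetical per-language item IDs; only IDs present in every config are usable
ids_by_config = {
    "eng": {"q1", "q2", "q3", "q4"},
    "swa": {"q2", "q3", "q4", "q5"},
    "yor": {"q1", "q2", "q3", "q4"},
}

# sorted() before shuffle makes the order deterministic regardless of set iteration order
common_ids = list(sorted(set.intersection(*ids_by_config.values())))
random.seed(42)          # same seed everywhere -> same shuffled order everywhere
random.shuffle(common_ids)

# Every process running this picks the same nr-th item for each language,
# so item nr is comparable across languages and across machines.
nr = 0
print(common_ids[nr])
```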
evals/datasets_/mgsm.py
CHANGED

```diff
@@ -1,10 +1,12 @@
 import asyncio
 import os
+import random
 
 from datasets import Dataset, load_dataset
 from datasets_.util import _get_dataset_config_names, _load_dataset
-from langcodes import standardize_tag
-from models import
+from langcodes import Language, standardize_tag
+from models import get_google_supported_languages, translate_google
+from rich import print
 from tqdm import tqdm
 from tqdm.asyncio import tqdm_asyncio
 
@@ -38,19 +40,22 @@ def parse_number(i):
 
 
 def load_mgsm(language_bcp_47, nr):
+    print(f"Loading MGSM data for {language_bcp_47}...")
     if language_bcp_47 in tags_mgsm.keys():
         ds = _load_dataset(slug_mgsm, subset=tags_mgsm[language_bcp_47], split="test")
-        return slug_mgsm, ds[nr]
+        return slug_mgsm, ds[nr], "human"
     elif language_bcp_47 in tags_afrimgsm.keys():
         ds = _load_dataset(
             slug_afrimgsm, subset=tags_afrimgsm[language_bcp_47], split="test"
         )
-        return slug_afrimgsm, ds[nr]
+        return slug_afrimgsm, ds[nr], "human"
     elif language_bcp_47 in tags_gsm_autotranslated.keys():
         ds = _load_dataset(
-            slug_gsm_autotranslated,
+            slug_gsm_autotranslated,
+            subset=tags_gsm_autotranslated[language_bcp_47],
+            split="test",
         )
-        return slug_gsm_autotranslated, ds[nr]
+        return slug_gsm_autotranslated, ds[nr], "machine"
     elif language_bcp_47 in tags_gsm8kx.keys():
         row = _load_dataset(
             slug_gsm8kx,
@@ -59,9 +64,9 @@ def load_mgsm(language_bcp_47, nr):
             trust_remote_code=True,
         )[nr]
         row["answer_number"] = row["answer"].split("####")[1].strip()
-        return slug_gsm8kx, row
+        return slug_gsm8kx, row, "human"  # Assuming Eurolingua is human-translated
     else:
-        return None, None
+        return None, None, None
 
 
 def translate_mgsm(languages):
@@ -69,7 +74,7 @@ def translate_mgsm(languages):
     untranslated = [
         lang
         for lang in languages["bcp_47"].values[:100]
-        if lang not in human_translated and lang in
+        if lang not in human_translated and lang in get_google_supported_languages()
     ]
     en = _load_dataset(slug_mgsm, subset=tags_mgsm["en"], split="test")
     slug = "fair-forward/gsm-autotranslated"
```
evals/datasets_/mmlu.py
CHANGED

```diff
@@ -6,7 +6,7 @@ from collections import Counter, defaultdict
 from datasets import Dataset, load_dataset
 from datasets_.util import _get_dataset_config_names, _load_dataset
 from langcodes import Language, standardize_tag
-from models import
+from models import get_google_supported_languages, translate_google
 from rich import print
 from tqdm import tqdm
 from tqdm.asyncio import tqdm_asyncio
@@ -150,26 +150,66 @@ categories = sorted(
 )
 
 
-def load_mmlu(language_bcp_47, nr):
+async def load_mmlu(language_bcp_47, nr):
+    print(f"Loading MMLU data for {language_bcp_47}...")
     category = categories[nr % len(categories)]
     if language_bcp_47 in tags_afrimmlu.keys():
         ds = _load_dataset("masakhane/afrimmlu", tags_afrimmlu[language_bcp_47])
         ds = ds.map(parse_choices)
-        examples = ds["dev"].filter(lambda x: x["subject"] == category)
         task = ds["test"].filter(lambda x: x["subject"] == category)[nr]
-        return "masakhane/afrimmlu",
+        return "masakhane/afrimmlu", task, "human"
     elif language_bcp_47 in tags_global_mmlu.keys():
         ds = _load_dataset("CohereForAI/Global-MMLU", tags_global_mmlu[language_bcp_47])
         ds = ds.map(add_choices)
-        examples = ds["dev"].filter(lambda x: x["subject"] == category)
         task = ds["test"].filter(lambda x: x["subject"] == category)[nr]
-        return "CohereForAI/Global-MMLU",
+        return "CohereForAI/Global-MMLU", task, "human"
     elif language_bcp_47 in tags_mmlu_autotranslated:
         ds = _load_dataset("fair-forward/mmlu-autotranslated", language_bcp_47)
-        examples = ds["dev"].filter(lambda x: x["subject"] == category)
         task = ds["test"].filter(lambda x: x["subject"] == category)[nr]
-        return "fair-forward/mmlu-autotranslated",
+        return "fair-forward/mmlu-autotranslated", task, "machine"
     else:
+        # Try on-the-fly translation for missing languages
+        return await load_mmlu_translated(language_bcp_47, nr)
+
+
+async def load_mmlu_translated(language_bcp_47, nr):
+    """
+    Load MMLU data with on-the-fly Google translation for languages
+    without native MMLU translations.
+    """
+    # Check if Google Translate supports this language
+    supported_languages = get_google_supported_languages()
+    if language_bcp_47 not in supported_languages:
+        return None, None, None
+
+    print(f"🌐 Translating MMLU data to {language_bcp_47} on-the-fly...")
+
+    try:
+        # Load English MMLU data
+        category = categories[nr % len(categories)]
+        ds = _load_dataset("masakhane/afrimmlu", "eng")
+        ds = ds.map(parse_choices)
+        task = ds["test"].filter(lambda x: x["subject"] == category)[nr]
+
+        # Translate question and choices
+        question_translated = await translate_google(task["question"], "en", language_bcp_47)
+        choices_translated = []
+        for choice in task["choices"]:
+            choice_translated = await translate_google(choice, "en", language_bcp_47)
+            choices_translated.append(choice_translated)
+
+        # Create translated task
+        translated_task = {
+            "question": question_translated,
+            "choices": choices_translated,
+            "answer": task["answer"],  # Keep original answer index
+            "subject": task["subject"]
+        }
+
+        return f"mmlu-translated-{language_bcp_47}", translated_task, "machine"
+
+    except Exception as e:
+        print(f"❌ Translation failed for {language_bcp_47}: {e}")
         return None, None, None
 
 
@@ -178,7 +218,7 @@ def translate_mmlu(languages):
     untranslated = [
         lang
         for lang in languages["bcp_47"].values[:100]
-        if lang not in human_translated and lang in
+        if lang not in human_translated and lang in get_google_supported_languages()
     ]
     n_samples = 10
```
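After this change every loader returns a `(dataset_slug, task, origin)` triple and `load_mmlu` is a coroutine. A minimal consumer sketch follows; the import path and function names other than `load_mmlu` are hypothetical stand-ins for illustration:

```python
import asyncio

from datasets_.mmlu import load_mmlu  # hypothetical import path into the repo

async def show_mmlu_item(language_bcp_47: str, nr: int):
    # Loaders return (dataset_slug, task, origin); origin is 'human' or
    # 'machine' and is carried through to scoring for per-origin aggregation.
    slug, task, origin = await load_mmlu(language_bcp_47, nr)
    if task is None:
        print(f"{language_bcp_47}: unsupported (no native set, no Google Translate)")
        return
    print(f"{slug} [{origin}]: {task['question'][:60]}...")

# Example (requires the repo's environment): asyncio.run(show_mmlu_item("de", 0))
```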
evals/datasets_/truthfulqa.py
CHANGED

```diff
@@ -9,7 +9,7 @@ from tqdm.asyncio import tqdm_asyncio
 import os
 
 from datasets import Dataset, load_dataset
-from models import translate_google,
+from models import translate_google, get_google_supported_languages
 
 from datasets_.util import _get_dataset_config_names, _load_dataset
 
@@ -26,14 +26,51 @@ def add_choices(row):
     return row
 
 
-def load_truthfulqa(language_bcp_47, nr):
+async def load_truthfulqa(language_bcp_47, nr):
     if language_bcp_47 in tags_uhura_truthfulqa.keys():
-        ds = _load_dataset(
+        ds = _load_dataset(
+            slug_uhura_truthfulqa, tags_uhura_truthfulqa[language_bcp_47]
+        )
         ds = ds.map(add_choices)
-        examples = ds["train"]
         task = ds["test"][nr]
-        return "masakhane/uhura-truthfulqa",
+        return "masakhane/uhura-truthfulqa", task, "human"
     else:
+        # Fallback to on-the-fly translation
+        return await load_truthfulqa_translated(language_bcp_47, nr)
+
+async def load_truthfulqa_translated(language_bcp_47, nr):
+    """
+    Load TruthfulQA data with on-the-fly Google translation.
+    """
+    supported_languages = get_google_supported_languages()
+    if language_bcp_47 not in supported_languages:
+        return None, None, None
+
+    print(f"🌐 Translating TruthfulQA data to {language_bcp_47} on-the-fly...")
+
+    try:
+        # Load English TruthfulQA data
+        ds = _load_dataset(slug_uhura_truthfulqa, tags_uhura_truthfulqa["en"])
+        ds = ds.map(add_choices)
+        task = ds["test"][nr]
+
+        # Translate question and choices
+        question_translated = await translate_google(task["question"], "en", language_bcp_47)
+        choices_translated = []
+        for choice in task["choices"]:
+            choice_translated = await translate_google(choice, "en", language_bcp_47)
+            choices_translated.append(choice_translated)
+
+        translated_task = {
+            "question": question_translated,
+            "choices": choices_translated,
+            "labels": task["labels"],  # Keep original labels
+        }
+
+        return f"truthfulqa-translated-{language_bcp_47}", translated_task, "machine"
+
+    except Exception as e:
+        print(f"❌ Translation failed for {language_bcp_47}: {e}")
         return None, None, None
 
 
@@ -43,7 +80,7 @@ def translate_truthfulqa(languages):
     untranslated = [
         lang
         for lang in languages["bcp_47"].values[:100]
-        if lang not in human_translated and lang in
+        if lang not in human_translated and lang in get_google_supported_languages()
     ]
     n_samples = 10
```
evals/datasets_/util.py
CHANGED

```diff
@@ -12,3 +12,10 @@ def _get_dataset_config_names(dataset, **kwargs):
 @cache
 def _load_dataset(dataset, subset, **kwargs):
     return load_dataset(dataset, subset, **kwargs)
+
+# Cache individual dataset items to avoid reloading entire datasets
+@cache
+def _get_dataset_item(dataset, subset, split, index, **kwargs):
+    """Load a single item from a dataset efficiently"""
+    ds = load_dataset(dataset, subset, split=split, **kwargs)
+    return ds[index] if index < len(ds) else None
```
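Because `functools.cache` keys on the exact argument tuple, repeated calls to `_get_dataset_item` with the same `(dataset, subset, split, index)` are served from memory instead of re-reading the dataset. A self-contained sketch of the same pattern, with an in-memory dict standing in for the expensive `load_dataset` call:

```python
from functools import cache

calls = 0

@cache
def get_item(dataset: str, subset: str, split: str, index: int):
    global calls
    calls += 1                      # stands in for an expensive load_dataset() call
    data = {"test": ["a", "b", "c"]}
    ds = data[split]
    return ds[index] if index < len(ds) else None

get_item("demo", "en", "test", 0)
get_item("demo", "en", "test", 0)   # second call is served from the cache
print(calls)                        # 1
```

Note the trade-off: the cache is per argument tuple, so each distinct index still triggers a fresh load the first time; the win is on repeated access to the same item.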
evals/main.py
CHANGED
|
@@ -1,62 +1,164 @@
|
|
| 1 |
import asyncio
|
| 2 |
-
|
| 3 |
import pandas as pd
|
| 4 |
-
|
|
|
|
|
|
|
|
|
|
| 5 |
from models import models
|
| 6 |
from tasks import tasks
|
| 7 |
-
from
|
| 8 |
-
|
| 9 |
-
# ===== config =====
|
| 10 |
-
|
| 11 |
-
n_sentences = 10
|
| 12 |
-
|
| 13 |
-
# ===== run evaluation and aggregate results =====
|
| 14 |
|
|
|
|
| 15 |
|
| 16 |
async def evaluate():
|
| 17 |
# FIXME we should not need this for-loop, but it helps
|
| 18 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 19 |
print(f"running evaluations for {n_languages} languages")
|
| 20 |
old_results = pd.read_json("results.json")
|
|
|
|
|
|
|
| 21 |
old_models = pd.read_json("models.json")
|
| 22 |
# get all combinations of model, language and task
|
| 23 |
combis = [
|
| 24 |
(model, lang.bcp_47, task_name)
|
| 25 |
-
for model in
|
| 26 |
-
for lang in
|
| 27 |
for task_name, task in tasks.items()
|
| 28 |
-
if task_name in
|
| 29 |
]
|
| 30 |
# filter out combinations that have already been evaluated
|
| 31 |
combis = pd.DataFrame(combis, columns=["model", "bcp_47", "task"])
|
| 32 |
combis = combis.merge(old_results, on=["model", "bcp_47", "task"], how="left")
|
| 33 |
combis = combis[combis["metric"].isna()][["model", "bcp_47", "task"]]
|
| 34 |
-
# run evaluations
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
for
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 56 |
all_models = pd.concat([pd.DataFrame(models), old_models])
|
| 57 |
all_models = all_models.drop_duplicates(subset=["id"]).sort_values(by=["id"])
|
| 58 |
all_models.to_json("models.json", **args)
|
| 59 |
pd.DataFrame(languages).to_json("languages.json", **args)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 60 |
|
| 61 |
|
| 62 |
if __name__ == "__main__":
|
|
|
|
| 1 |
import asyncio
|
|
|
|
| 2 |
import pandas as pd
|
| 3 |
+
import time
|
| 4 |
+
import os
|
| 5 |
+
from datetime import datetime, timedelta
|
| 6 |
+
from tqdm.asyncio import tqdm_asyncio
|
| 7 |
from models import models
|
| 8 |
from tasks import tasks
|
| 9 |
+
from languages import languages
|
| 10 |
+
import json
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 11 |
|
| 12 |
+
results = pd.DataFrame()
|
| 13 |
|
| 14 |
async def evaluate():
|
| 15 |
# FIXME we should not need this for-loop, but it helps
|
| 16 |
+
n_sentences = int(os.environ.get("N_SENTENCES", 15)) # Default 1 for quick testing
|
| 17 |
+
|
| 18 |
+
# Load models and languages
|
| 19 |
+
models_df = pd.DataFrame(models)
|
| 20 |
+
languages_df = pd.DataFrame(languages)
|
| 21 |
+
|
| 22 |
+
print(f"π Running full evaluation with {len(models_df)} models.")
|
| 23 |
+
start_time = time.time()
|
| 24 |
+
print(f"π Starting full evaluation at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
|
| 25 |
+
print(f"π Evaluating {n_sentences} sentences per task")
|
| 26 |
+
|
| 27 |
+
# Evaluate top languages by speakers (configurable via MAX_LANGUAGES env var)
|
| 28 |
+
max_languages = int(os.environ.get("MAX_LANGUAGES", 2)) # Default 2 for quick testing
|
| 29 |
+
top_languages = languages.head(max_languages) # Top N by population
|
| 30 |
+
print(f"π Evaluating top {len(top_languages)} languages by speakers (max: {max_languages})")
|
| 31 |
+
|
| 32 |
+
# For testing, just use all available languages up to max_languages
|
| 33 |
+
for n_languages in [min(max_languages, len(top_languages))]:
|
| 34 |
print(f"running evaluations for {n_languages} languages")
|
| 35 |
old_results = pd.read_json("results.json")
|
| 36 |
+
if old_results.empty:
|
| 37 |
+
old_results = pd.DataFrame(columns=["model", "bcp_47", "task", "metric", "origin", "score"])
|
| 38 |
old_models = pd.read_json("models.json")
|
| 39 |
# get all combinations of model, language and task
|
| 40 |
combis = [
|
| 41 |
(model, lang.bcp_47, task_name)
|
| 42 |
+
for model in models_df["id"]
|
| 43 |
+
for lang in top_languages.iloc[:n_languages].itertuples()
|
| 44 |
for task_name, task in tasks.items()
|
| 45 |
+
if task_name in models_df[models_df["id"] == model]["tasks"].iloc[0]
|
| 46 |
]
|
| 47 |
# filter out combinations that have already been evaluated
|
| 48 |
combis = pd.DataFrame(combis, columns=["model", "bcp_47", "task"])
|
| 49 |
combis = combis.merge(old_results, on=["model", "bcp_47", "task"], how="left")
|
| 50 |
combis = combis[combis["metric"].isna()][["model", "bcp_47", "task"]]
|
+        # run evaluations in batches to prevent HTTP pool exhaustion
+        all_tasks = []
+        for i in range(n_sentences):
+            for model, bcp_47, task_name in combis.itertuples(index=False):
+                # All tasks now use the same signature
+                all_tasks.append((tasks[task_name], model, bcp_47, i))
+
+        print(f"⏳ Processing {len(all_tasks)} evaluation tasks in batches...")
+
+        batch_size = 50  # Process 50 tasks at a time
+        all_results = []
+
+        for i in range(0, len(all_tasks), batch_size):
+            batch = all_tasks[i:i+batch_size]
+            print(f"📦 Processing batch {i//batch_size + 1}/{(len(all_tasks) + batch_size - 1)//batch_size} ({len(batch)} tasks)")
+
+            # Show what's being evaluated in this batch
+            batch_summary = {}
+            for task_data in batch:
+                task_func, model, bcp_47, sentence_nr = task_data
+                # Extract task name from function - handle both partial functions and regular functions
+                if hasattr(task_func, 'func'):
+                    task_name = task_func.func.__name__.replace('_and_evaluate', '')
+                else:
+                    task_name = task_func.__name__.replace('_and_evaluate', '')
+
+                if task_name not in batch_summary:
+                    batch_summary[task_name] = set()
+                batch_summary[task_name].add(bcp_47)
+
+            for task_name, languages_set in batch_summary.items():
+                lang_list = ', '.join(sorted(languages_set))
+                print(f"  📋 {task_name}: {lang_list}")
+
+            batch_coroutines = []
+            for task_data in batch:
+                task_func, model, bcp_47, sentence_nr = task_data
+                batch_coroutines.append(task_func(model, bcp_47, sentence_nr))
+            batch_results = await asyncio.gather(*batch_coroutines, return_exceptions=True)
+            all_results.extend(batch_results)
+
+            # Small delay between batches to avoid overwhelming the API
+            await asyncio.sleep(1)
+
+        results = all_results
+        # Filter out exceptions and flatten results
+        valid_results = []
+        exception_count = 0
+        for r in results:
+            if isinstance(r, Exception):
+                exception_count += 1
+                continue
+            if isinstance(r, list):
+                valid_results.extend(r)
+            else:
+                valid_results.append(r)
+
+        print(f"⚠️ Encountered {exception_count} API errors (model unavailable/rate limits)")
+        print(f"✅ Successfully processed {len(valid_results)} evaluations")
+
+        # Save partial results even if some failed
+        if valid_results:
+            results = valid_results
+            args = dict(orient="records", indent=2, force_ascii=False)
+
+            # Aggregate results like main branch
+            results_df = pd.DataFrame(results)
+            if len(results_df) > 0:
+                results_df = (
+                    results_df.groupby(["model", "bcp_47", "task", "metric", "origin"])
+                    .agg({"score": "mean"})
+                    .reset_index()
+                )
+                # Merge with old results
+                old_results = pd.read_json("results.json")
+                results_df = pd.concat([old_results, results_df])
+                results_df = results_df.sort_values(by=["model", "bcp_47", "task", "metric"])
+                results_df.to_json("results.json", **args)
+                print(f"💾 Saved {len(results_df)} aggregated results to results.json")
+            else:
+                print("⚠️ No valid results to aggregate")
+        else:
+            print("⚠️ No valid results to save - all API calls failed")
+
+        # Save up-to-date info on models and languages (like main branch)
         all_models = pd.concat([pd.DataFrame(models), old_models])
         all_models = all_models.drop_duplicates(subset=["id"]).sort_values(by=["id"])
         all_models.to_json("models.json", **args)
         pd.DataFrame(languages).to_json("languages.json", **args)
+
+        # Continue with next batch even if this one had errors
+
+    # Time estimation
+    elapsed = time.time() - start_time
+    elapsed_str = str(timedelta(seconds=int(elapsed)))
+    if n_languages < max_languages:
+        remaining_batches = (max_languages - n_languages) // 10
+        batch_count = max(1, n_languages // 10)  # Avoid division by zero
+        estimated_remaining = elapsed * remaining_batches / batch_count
+        eta = datetime.now() + timedelta(seconds=estimated_remaining)
+        print(f"⏱️ Batch completed in {elapsed_str}. ETA for full run: {eta.strftime('%H:%M:%S')}")
+    else:
+        print(f"✅ Full evaluation completed in {elapsed_str}")
+    print(f"🏁 Finished at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
+
+    # Save results locally
+    with open("results.json", "w") as f:
+        json.dump(results, f, indent=2)
+    print(f"💾 Results saved to results.json")
+
+    return results


 if __name__ == "__main__":
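For reference, here is a minimal sketch of the batching pattern the loop above relies on: chunked `asyncio.gather` with `return_exceptions=True`, so a single failed call comes back as a value instead of aborting the whole batch. The coroutine, error rate, and batch size below are illustrative, not taken from the project:

import asyncio

async def fake_eval(i):
    # Illustrative stand-in for a single evaluation call.
    if i % 7 == 0:
        raise RuntimeError(f"simulated API error for task {i}")
    await asyncio.sleep(0.01)
    return {"task": i, "score": 1.0}

async def run_in_batches(n_tasks=20, batch_size=5):
    results = []
    for start in range(0, n_tasks, batch_size):
        batch = [fake_eval(i) for i in range(start, min(start + batch_size, n_tasks))]
        # Exceptions are returned as values instead of propagating.
        results.extend(await asyncio.gather(*batch, return_exceptions=True))
        await asyncio.sleep(0.1)  # brief pause between batches
    return [r for r in results if not isinstance(r, Exception)]

print(asyncio.run(run_in_batches()))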
evals/main_gcs.py
ADDED
@@ -0,0 +1,213 @@
+import asyncio
+import pandas as pd
+import time
+import os
+from datetime import datetime, timedelta
+from tqdm.asyncio import tqdm_asyncio
+from models import models
+from tasks import tasks
+from languages import languages
+import json
+
+# Google Cloud Storage imports
+try:
+    from google.cloud import storage
+    GCS_AVAILABLE = True
+    print("✅ Google Cloud Storage available")
+except ImportError:
+    GCS_AVAILABLE = False
+    print("❌ Google Cloud Storage not available - install with: pip install google-cloud-storage")
+
+async def save_results_to_gcs(results, bucket_name="ai-language-eval-results"):
+    """Save results to Google Cloud Storage"""
+    if not GCS_AVAILABLE:
+        print("❌ Google Cloud Storage not available")
+        return
+
+    try:
+        storage_client = storage.Client()
+        bucket = storage_client.bucket(bucket_name)
+
+        # Create bucket if it doesn't exist
+        if not bucket.exists():
+            bucket = storage_client.create_bucket(bucket_name, location="us-central1")
+            print(f"📦 Created bucket: {bucket_name}")
+
+        # Save results with timestamp
+        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+        blob_name = f"results_{timestamp}.json"
+        blob = bucket.blob(blob_name)
+
+        # Convert results to JSON and upload
+        results_json = json.dumps(results, indent=2)
+        blob.upload_from_string(results_json, content_type='application/json')
+
+        print(f"💾 Results saved to GCS: gs://{bucket_name}/{blob_name}")
+        print(f"📥 Download with: gsutil cp gs://{bucket_name}/{blob_name} ./")
+
+        # Also save latest results
+        latest_blob = bucket.blob("results_latest.json")
+        latest_blob.upload_from_string(results_json, content_type='application/json')
+        print(f"💾 Latest results: gs://{bucket_name}/results_latest.json")
+
+    except Exception as e:
+        print(f"❌ Failed to save to GCS: {e}")
+        print("💾 Results saved locally to results.json")
+
+results = pd.DataFrame()
+
+async def evaluate():
+    # FIXME we should not need this for-loop, but it helps
+    n_sentences = int(os.environ.get("N_SENTENCES", 1))  # Default 1 for quick testing
+
+    # Load models and languages
+    models_df = pd.DataFrame(models)
+    languages_df = pd.DataFrame(languages)
+
+    print(f"🚀 Running full evaluation with {len(models_df)} models.")
+    start_time = time.time()
+    print(f"🚀 Starting full evaluation at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
+    print(f"📋 Evaluating {n_sentences} sentences per task")
+
+    # Evaluate top languages by speakers (configurable via MAX_LANGUAGES env var)
+    max_languages = int(os.environ.get("MAX_LANGUAGES", 2))  # Default 2 for quick testing
+    top_languages = languages.head(max_languages)  # Top N by population
+    print(f"🌍 Evaluating top {len(top_languages)} languages by speakers (max: {max_languages})")
+
+    # For testing, just use all available languages up to max_languages
+    for n_languages in [min(max_languages, len(top_languages))]:
+        print(f"running evaluations for {n_languages} languages")
+        old_results = pd.read_json("results.json")
+        if old_results.empty:
+            old_results = pd.DataFrame(columns=["model", "bcp_47", "task", "metric", "origin", "score"])
+        old_models = pd.read_json("models.json")
+        # get all combinations of model, language and task
+        combis = [
+            (model, lang.bcp_47, task_name)
+            for model in models_df["id"]
+            for lang in top_languages.iloc[:n_languages].itertuples()
+            for task_name, task in tasks.items()
+            if task_name in models_df[models_df["id"] == model]["tasks"].iloc[0]
+        ]
+        # filter out combinations that have already been evaluated
+        combis = pd.DataFrame(combis, columns=["model", "bcp_47", "task"])
+        combis = combis.merge(old_results, on=["model", "bcp_47", "task"], how="left")
+        combis = combis[combis["metric"].isna()][["model", "bcp_47", "task"]]
+        # run evaluations in batches to prevent HTTP pool exhaustion
+        all_tasks = []
+        for i in range(n_sentences):
+            for model, bcp_47, task_name in combis.itertuples(index=False):
+                # All tasks now use the same signature
+                all_tasks.append((tasks[task_name], model, bcp_47, i))
+
+        print(f"⏳ Processing {len(all_tasks)} evaluation tasks in batches...")
+
+        batch_size = 50  # Process 50 tasks at a time
+        all_results = []
+
+        for i in range(0, len(all_tasks), batch_size):
+            batch = all_tasks[i:i+batch_size]
+            print(f"📦 Processing batch {i//batch_size + 1}/{(len(all_tasks) + batch_size - 1)//batch_size} ({len(batch)} tasks)")
+
+            # Show what's being evaluated in this batch
+            batch_summary = {}
+            for task_data in batch:
+                task_func, model, bcp_47, sentence_nr = task_data
+                # Extract task name from function - handle both partial functions and regular functions
+                if hasattr(task_func, 'func'):
+                    task_name = task_func.func.__name__.replace('_and_evaluate', '')
+                else:
+                    task_name = task_func.__name__.replace('_and_evaluate', '')
+
+                if task_name not in batch_summary:
+                    batch_summary[task_name] = set()
+                batch_summary[task_name].add(bcp_47)
+
+            for task_name, languages_set in batch_summary.items():
+                lang_list = ', '.join(sorted(languages_set))
+                print(f"  📋 {task_name}: {lang_list}")
+
+            batch_coroutines = []
+            for task_data in batch:
+                task_func, model, bcp_47, sentence_nr = task_data
+                batch_coroutines.append(task_func(model, bcp_47, sentence_nr))
+            batch_results = await asyncio.gather(*batch_coroutines, return_exceptions=True)
+            all_results.extend(batch_results)
+
+            # Small delay between batches to avoid overwhelming the API
+            await asyncio.sleep(1)
+
+        results = all_results
+        # Filter out exceptions and flatten results
+        valid_results = []
+        exception_count = 0
+        for r in results:
+            if isinstance(r, Exception):
+                exception_count += 1
+                continue
+            if isinstance(r, list):
+                valid_results.extend(r)
+            else:
+                valid_results.append(r)
+
+        print(f"⚠️ Encountered {exception_count} API errors (model unavailable/rate limits)")
+        print(f"✅ Successfully processed {len(valid_results)} evaluations")
+
+        # Save partial results even if some failed
+        if valid_results:
+            results = valid_results
+            args = dict(orient="records", indent=2, force_ascii=False)
+
+            # Aggregate results like main branch
+            results_df = pd.DataFrame(results)
+            if len(results_df) > 0:
+                results_df = (
+                    results_df.groupby(["model", "bcp_47", "task", "metric", "origin"])
+                    .agg({"score": "mean"})
+                    .reset_index()
+                )
+                # Merge with old results
+                old_results = pd.read_json("results.json")
+                results_df = pd.concat([old_results, results_df])
+                results_df = results_df.sort_values(by=["model", "bcp_47", "task", "metric"])
+                results_df.to_json("results.json", **args)
+                print(f"💾 Saved {len(results_df)} aggregated results to results.json")
+            else:
+                print("⚠️ No valid results to aggregate")
+        else:
+            print("⚠️ No valid results to save - all API calls failed")
+
+        # Save up-to-date info on models and languages (like main branch)
+        all_models = pd.concat([pd.DataFrame(models), old_models])
+        all_models = all_models.drop_duplicates(subset=["id"]).sort_values(by=["id"])
+        all_models.to_json("models.json", **args)
+        pd.DataFrame(languages).to_json("languages.json", **args)
+
+        # Continue with next batch even if this one had errors
+
+    # Time estimation
+    elapsed = time.time() - start_time
+    elapsed_str = str(timedelta(seconds=int(elapsed)))
+    if n_languages < max_languages:
+        remaining_batches = (max_languages - n_languages) // 10
+        batch_count = max(1, n_languages // 10)  # Avoid division by zero
+        estimated_remaining = elapsed * remaining_batches / batch_count
+        eta = datetime.now() + timedelta(seconds=estimated_remaining)
+        print(f"⏱️ Batch completed in {elapsed_str}. ETA for full run: {eta.strftime('%H:%M:%S')}")
+    else:
+        print(f"✅ Full evaluation completed in {elapsed_str}")
+    print(f"🏁 Finished at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
+
+    # Save results locally
+    with open("results.json", "w") as f:
+        json.dump(results, f, indent=2)
+    print(f"💾 Results saved to results.json")
+
+    # Save to Google Cloud Storage
+    await save_results_to_gcs(results)
+
+    return results
+
+
+if __name__ == "__main__":
+    results = asyncio.run(evaluate())
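A minimal sketch of the upload half of `save_results_to_gcs`, assuming Application Default Credentials are configured and the bucket already exists; the bucket and blob names here are placeholders, not the project's real ones:

import json
from google.cloud import storage  # pip install google-cloud-storage

def upload_json(data, bucket_name="my-eval-results", blob_name="results_latest.json"):
    client = storage.Client()  # uses Application Default Credentials
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.upload_from_string(json.dumps(data, indent=2), content_type="application/json")
    return f"gs://{bucket_name}/{blob_name}"

# print(upload_json({"model": "m", "score": 0.5}))  # requires GCP credentials to run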
evals/models.py
CHANGED
@@ -1,3 +1,4 @@
+import asyncio
 import json
 import re
 from collections import defaultdict
@@ -211,26 +212,55 @@ google_rate_limit = AsyncLimiter(max_rate=10, time_period=1)

 @cache
 async def complete(**kwargs) -> str | None:
+    # Add longer timeout for slower, premium, or reasoning-focused models
+    model_id = kwargs.get('model', '')
+    slow_model_keywords = [
+        'claude-3.5', 'claude-3.7', 'claude-4', 'sonnet-4',  # Claude
+        'gpt-4', 'o1', 'o3',  # OpenAI
+        'gemini-2.5', 'gemini-pro',  # Google
+        'llama-4',  # Meta
+        'reasoning', 'thinking'  # General
+    ]
+    timeout = 120 if any(keyword in model_id for keyword in slow_model_keywords) else 60
+
     async with openrouter_rate_limit:
         try:
-            response = await client.chat.completions.create(**kwargs)
+            response = await asyncio.wait_for(
+                client.chat.completions.create(**kwargs),
+                timeout=timeout
+            )
         except BadRequestError as e:
             if "filtered" in e.message:
                 return None
             raise e
+        except asyncio.TimeoutError:
+            print(f"⏰ Timeout after {timeout}s for model {model_id}")
+            return None
         if not response.choices:
             raise Exception(response)
         return response.choices[0].message.content.strip()


-translate_client = translate.Client()
+translate_client = None
+
+
+def get_google_translate_client():
+    global translate_client
+    if translate_client is None:
+        translate_client = translate.Client()
+    return translate_client
+
+
+def get_google_supported_languages():
+    client = get_google_translate_client()
+    return [l["language"] for l in client.get_languages()]


 @cache
 async def translate_google(text, source_language, target_language):
+    client = get_google_translate_client()
     async with google_rate_limit:
-        response = translate_client.translate(
+        response = client.translate(
             text, source_language=source_language, target_language=target_language
         )
         return response["translatedText"]
@@ -294,12 +324,14 @@ def get_hf_metadata(row):
         return empty
     try:
         info = api.model_info(id)
+        license = ""
+        if info.card_data and hasattr(info.card_data, 'license') and info.card_data.license:
+            license = (
+                info.card_data.license
+                .replace("-", " ")
+                .replace("mit", "MIT")
+                .title()
+            )
         return {
             "hf_id": info.id,
             "creation_date": info.created_at,
@@ -329,8 +361,30 @@ def load_models(date: date):
         + get_current_popular_models(date.today())[:10]
     )
     popular_models = [m["slug"] for m in popular_models]
+    all_model_candidates = set(important_models + popular_models) - set(blocklist)
+
+    # Validate models exist on OpenRouter before including them
+    print(f"🔍 Validating {len(all_model_candidates)} model candidates...")
+    valid_models = []
+    invalid_models = []
+
+    for model_id in all_model_candidates:
+        metadata = get_or_metadata(model_id)
+        if metadata is not None:
+            valid_models.append(model_id)
+        else:
+            invalid_models.append(model_id)
+
+    if invalid_models:
+        print(f"⚠️ Excluded {len(invalid_models)} invalid models:")
+        for model in sorted(invalid_models)[:5]:  # Show first 5
+            print(f"  - {model}")
+        if len(invalid_models) > 5:
+            print(f"  ... and {len(invalid_models) - 5} more")
+
+    print(f"✅ Using {len(valid_models)} valid models for evaluation")
+
+    models = pd.DataFrame(sorted(valid_models), columns=["id"])
     or_metadata = models["id"].apply(get_or_metadata)
     hf_metadata = or_metadata.apply(get_hf_metadata)
     creation_date_hf = pd.to_datetime(hf_metadata.str["creation_date"]).dt.date
@@ -350,7 +404,8 @@ def load_models(date: date):
         license=hf_metadata.str["license"],
         creation_date=creation_date_hf.combine_first(creation_date_or),
     )
+    # Filter out expensive models to keep costs reasonable
+    models = models[models["cost"] <= 20.0].reset_index(drop=True)
     models["tasks"] = [
         ["translation_from", "translation_to", "classification", "mmlu", "arc", "truthfulqa", "mgsm"]
     ] * len(models)
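The `complete()` change above wraps the OpenRouter request in `asyncio.wait_for`. A self-contained sketch of that timeout pattern, with a sleeping coroutine standing in for the real API call:

import asyncio

async def slow_api_call():
    await asyncio.sleep(5)  # stand-in for a slow completion request
    return "response"

async def complete_with_timeout(timeout=2):
    try:
        return await asyncio.wait_for(slow_api_call(), timeout=timeout)
    except asyncio.TimeoutError:
        print(f"⏰ Timeout after {timeout}s")
        return None  # degrade gracefully instead of propagating

print(asyncio.run(complete_with_timeout()))  # -> None after 2 seconds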
evals/tasks.py
CHANGED
@@ -1,3 +1,4 @@
+import asyncio
 import random
 from functools import partial
 from textwrap import dedent
@@ -13,7 +14,7 @@ from datasets_.truthfulqa import load_truthfulqa
 from google.cloud import translate_v2 as translate
 from langcodes import closest_supported_match
 from languages import languages, script_name
-from models import complete, transcribe, translate_google
+from models import complete, transcribe, translate_google, get_google_supported_languages

 bleu = evaluate.load("bleu")
 chrf = evaluate.load("chrf")
@@ -27,9 +28,6 @@ target_languages = languages[languages["in_benchmark"]].sample(
     frac=1, weights="speakers", replace=True, random_state=42
 )

-translate_client = translate.Client()
-supported_languages = [l["language"] for l in translate_client.get_languages()]
-

 async def translate_and_evaluate(model, bcp_47, sentence_nr, mode="from"):
     original_language = languages[languages["bcp_47"] == bcp_47].iloc[0]
@@ -48,6 +46,7 @@ async def translate_and_evaluate(model, bcp_47, sentence_nr, mode="from"):
     target_sentence = flores_sentences(target_language)["text"][sentence_nr].strip()
     script = script_name(target_language.flores_path.split("_")[1])
     if model == "google/translate-v2":
+        supported_languages = get_google_supported_languages()
         original_language = closest_supported_match(
             original_language, supported_languages
         )
@@ -91,6 +90,7 @@ async def translate_and_evaluate(model, bcp_47, sentence_nr, mode="from"):
             "task": f"translation_{mode}",
             "metric": metric,
             "score": score,
+            "origin": "human",  # FLORES+ is human-translated
             "sentence_nr": sentence_nr,
         }
         for metric, score in (
@@ -112,38 +112,21 @@ async def classify_and_evaluate(model, bcp_47, nr):
     )
     top_topics = paragraphs.value_counts("topic").head(5).index
     paragraphs = paragraphs[paragraphs["topic"].isin(top_topics)]
-    examples = pd.concat(
-        [
-            paragraphs[paragraphs["topic"] == t].sample(n=1, random_state=42)
-            for t in top_topics
-        ]
-    ).sample(frac=1, random_state=nr)
-    test_paragraphs = paragraphs[~paragraphs["url"].isin(examples["url"])].sample(
-        frac=1, random_state=42
-    )
-    test_paragraph = test_paragraphs.iloc[nr]
+    test_paragraph = paragraphs.sample(n=1, random_state=nr).iloc[0]

-    messages = []
-    for example in examples.itertuples():
-        messages += [
-            {"role": "user", "content": format_prompt(example.text)},
-            {"role": "assistant", "content": example.topic},
-        ]
+    prompt = f"""Classify the following text into one of these topics: {', '.join(top_topics)}.
+Reply with only the topic name.
+
+Text:
+{test_paragraph.text}
+"""
     # some models have poor tokenization for some languages, and the prompt for this task is relatively long, so it sometimes exceeds the context window
     # this is not just to blame on the context window but mostly on the model's tokenization, so we assign 0 accuracy in this case
     try:
         pred = await complete(
             model=model,
-            messages=[
-                *messages,
-                {
-                    "role": "user",
-                    "content": format_prompt(test_paragraph.text),
-                },
-            ],
+            messages=[{"role": "user", "content": prompt}],
             temperature=0,
             max_tokens=30,
         )
@@ -170,6 +153,7 @@ async def classify_and_evaluate(model, bcp_47, nr):
             "task": "classification",
             "metric": "accuracy",
             "score": acc,
+            "origin": "human",  # FLORES+ is human-translated
             "sentence_nr": nr,
         }
     ]
@@ -234,30 +218,36 @@ def format_multiple_choice(item):
 C: {item["choices"][2]}
 D: {item["choices"][3]}

+Answer with the letter of the correct answer."""


 async def mmlu_and_evaluate(model, language_bcp_47, nr):
+    ds_name, task, origin = await load_mmlu(language_bcp_47, nr)
     if not task:
         return []
+
+    messages = [
+        {
+            "role": "user",
+            "content": f"""Solve the following multiple choice question. Reason step-by-step and then write the final answer as a single letter.
+
+Response format: <reasoning> #### <letter>
+
+---
+
+{format_multiple_choice(task)}""",
+        },
+    ]
     try:
         response = await complete(
             model=model,
             messages=messages,
             temperature=0,
+            max_tokens=1024,
         )
+        if response and "####" in response:
+            answer = response.split("####")[-1].strip()
+            acc = int(answer[:1] == task["answer"])
         else:
             acc = 0
     except Exception as e:
@@ -265,6 +255,7 @@ async def mmlu_and_evaluate(model, language_bcp_47, nr):
             acc = 0
         else:
             raise e
+
     return [
         {
             "model": model,
@@ -272,32 +263,39 @@ async def mmlu_and_evaluate(model, language_bcp_47, nr):
             "task": "mmlu",
             "metric": "accuracy",
             "score": acc,
+            "origin": origin,  # Add origin tag to results
             "sentence_nr": nr,
         }
     ]


 async def arc_and_evaluate(model, language_bcp_47, nr):
+    ds_name, task, origin = load_uhura_arc_easy(language_bcp_47, nr)
     if not task:
         return []

+    messages = [
+        {
+            "role": "user",
+            "content": f"""Solve the following multiple choice question. Reason step-by-step and then write the final answer as a single letter.
+
+Response format: <reasoning> #### <letter>
+
+---
+
+{format_multiple_choice(task)}""",
+        },
+    ]
     try:
         response = await complete(
             model=model,
             messages=messages,
             temperature=0,
+            max_tokens=1024,
         )
+        if response and "####" in response:
+            answer = response.split("####")[-1].strip()
+            acc = int(answer[:1] == task["answer"])
         else:
             acc = 0
     except Exception as e:
@@ -312,6 +310,7 @@ async def arc_and_evaluate(model, language_bcp_47, nr):
             "task": "arc",
             "metric": "accuracy",
             "score": acc,
+            "origin": origin,
             "sentence_nr": nr,
         }
     ]
@@ -337,28 +336,40 @@ def format_multiple_choice_truthfulqa(item):


 async def truthfulqa_and_evaluate(model, language_bcp_47, nr):
+    ds_name, task, origin = await load_truthfulqa(language_bcp_47, nr)
     if not task:
         return []
+
+    # Find the correct answer
+    try:
+        correct_choice_index = task["labels"].index(1)
+        answer = letters[correct_choice_index]
+    except (ValueError, IndexError):
+        # Handle cases where there is no correct answer or labels are malformed
+        return []
+
+    messages = [
+        {
+            "role": "user",
+            "content": f"""Answer the following multiple choice question. Reason step-by-step and then write the final answer as a single letter.
+
+Response format: <reasoning> #### <letter>
+
+---
+
+{format_multiple_choice_truthfulqa(task)}""",
+        },
+    ]
     try:
         response = await complete(
             model=model,
             messages=messages,
             temperature=0,
+            max_tokens=1024,  # Increased for reasoning
         )
+        if response and "####" in response:
+            pred_answer = response.split("####")[-1].strip()
+            acc = int(pred_answer[:1].upper() == answer)
         else:
             acc = 0
     except Exception as e:
@@ -373,30 +384,36 @@ async def truthfulqa_and_evaluate(model, language_bcp_47, nr):
             "task": "truthfulqa",
             "metric": "accuracy",
             "score": acc,
+            "origin": origin,
             "sentence_nr": nr,
         }
     ]


 async def mgsm_and_evaluate(model, language_bcp_47, nr):
-    system_prompt = """
-    Solve the math problem. Use reasoning, and finally give the answer as a number.
-    Response format: <reasoning> #### <number>
-    """
-    system_prompt = dedent(system_prompt).strip()
-    ds_slug, question = load_mgsm(language_bcp_47, nr)
+    ds_slug, question, origin = load_mgsm(language_bcp_47, nr)
     if not question:
         return []
+
+    messages = [
+        {
+            "role": "user",
+            "content": f"""Solve the following math problem. Reason step-by-step and then write the final answer as a number.
+
+Response format: <reasoning> #### <number>
+
+---
+
+{question["question"]}""",
+        },
+    ]
     response = await complete(
         model=model,
-        messages=[
-            {"role": "system", "content": system_prompt},
-            {"role": "user", "content": question["question"]},
-        ],
+        messages=messages,
         temperature=0,
         max_tokens=1024,
     )
+    if response and "####" in response:
         number = response.split("####")[1].strip()
         accuracy = int(parse_number(number) == parse_number(question["answer_number"]))
     else:
@@ -409,6 +426,7 @@ async def mgsm_and_evaluate(model, language_bcp_47, nr):
             "task": "mgsm",
             "metric": "accuracy",
             "score": accuracy,
+            "origin": origin,
             "sentence_nr": nr,
         }
     ]
@@ -449,10 +467,8 @@ tasks = {
     "translation_from": partial(translate_and_evaluate, mode="from"),
     "translation_to": partial(translate_and_evaluate, mode="to"),
     "classification": classify_and_evaluate,
-    # "mlm": mlm_and_evaluate,
     "mmlu": mmlu_and_evaluate,
     "arc": arc_and_evaluate,
     "truthfulqa": truthfulqa_and_evaluate,
     "mgsm": mgsm_and_evaluate,
-    # "asr": transcribe_and_evaluate,
 }
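All of the reworked tasks share the `<reasoning> #### <answer>` response convention and grade only the text after the `####` marker. A small sketch of that parsing step (the helper name is illustrative, not from the codebase):

def extract_answer(response: str) -> str | None:
    """Return the text after the final '####' marker, or None if absent."""
    if not response or "####" not in response:
        return None
    return response.split("####")[-1].strip()

assert extract_answer("Let's think step by step... #### B") == "B"
assert extract_answer("no marker here") is None
# Multiple-choice grading then keeps only the first character, e.g. answer[:1] == task["answer"].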
frontend/src/App.js
CHANGED
@@ -19,6 +19,7 @@ function App () {
   const [loading, setLoading] = useState(true)
   const [error, setError] = useState(null)
   const [selectedLanguages, setSelectedLanguages] = useState([])
+  const [machineTranslatedMetrics, setMachineTranslatedMetrics] = useState([])
   const [dialogVisible, setDialogVisible] = useState(false)
   const [aboutVisible, setAboutVisible] = useState(false)
   const [contributeVisible, setContributeVisible] = useState(false)
@@ -36,6 +37,7 @@ function App () {
       })
       .then(jsonData => {
         setData(jsonData)
+        setMachineTranslatedMetrics(jsonData.machine_translated_metrics || [])
         setLoading(false)
       })
       .catch(err => {
@@ -235,6 +237,7 @@ function App () {
             data={data.model_table}
             selectedLanguages={selectedLanguages}
             allLanguages={data.language_table || []}
+            machineTranslatedMetrics={machineTranslatedMetrics}
           />
           <LanguageTable
             data={data.language_table}
@@ -265,7 +268,7 @@ function App () {
           />
           <Carousel
             value={[
-              <WorldMap data={data.countries} />,
+              <WorldMap data={data.countries} allLanguages={data.language_table} />,
               <LanguagePlot data={data} />,
               <SpeakerPlot data={data} />,
               <HistoryPlot data={data} />,
@@ -430,6 +433,7 @@ function App () {
             value={[
               <WorldMap
                 data={data.countries}
+                allLanguages={data.language_table}
                 width={windowWidth * 0.7}
                 height={windowHeight * 0.6}
               />,
frontend/src/components/ModelTable.js
CHANGED
@@ -6,7 +6,7 @@ import { useState, useEffect } from 'react'
 import Medal from './Medal'
 import { Slider } from 'primereact/slider'
 import ScoreColumns from './ScoreColumns'
-const ModelTable = ({ data, selectedLanguages = [], allLanguages = [] }) => {
+const ModelTable = ({ data, selectedLanguages = [], allLanguages = [], machineTranslatedMetrics = [] }) => {
   const [filters, setFilters] = useState({
     type: { value: null, matchMode: FilterMatchMode.IN },
     size: { value: null, matchMode: FilterMatchMode.BETWEEN },
@@ -155,17 +155,27 @@ const ModelTable = ({ data, selectedLanguages = [], allLanguages = [] }) => {
   }

   const getHeaderText = () => {
+    // Count languages that have any evaluation data (any task scores available)
+    const evaluatedLanguagesCount = allLanguages.filter(lang => {
+      // Check if language has any task scores (not just average)
+      const hasAnyScores = [
+        'translation_from_bleu',
+        'translation_to_bleu',
+        'classification_accuracy',
+        'mmlu_accuracy',
+        'arc_accuracy',
+        'truthfulqa_accuracy',
+        'mgsm_accuracy'
+      ].some(metric => lang[metric] !== null && lang[metric] !== undefined)
+      return hasAnyScores
+    }).length

     if (selectedLanguages.length === 0) {
       return (
         <span>
           <span style={{ fontWeight: 'bold', fontSize: '1.1em' }}>AI Models</span>
           <span style={{ fontSize: '0.85em', marginLeft: '0.5rem' }}>
+            Performance across {evaluatedLanguagesCount} evaluated languages
           </span>
         </span>
       )
@@ -249,7 +259,7 @@ const ModelTable = ({ data, selectedLanguages = [], allLanguages = [] }) => {
         body={costBodyTemplate}
         style={{ minWidth: '5rem' }}
       />
-      {ScoreColumns}
+      {ScoreColumns(machineTranslatedMetrics)}
     </DataTable>
   )
 }
frontend/src/components/ScoreColumns.js
CHANGED
@@ -2,21 +2,22 @@ import { Column } from 'primereact/column'
 import ScoreField from './ScoreField'

 const scoreBodyTemplate = (field, options = {}) => {
-  const { minScore = 0, maxScore = 1 } = options
+  const { minScore = 0, maxScore = 1, machineTranslatedMetrics = [] } = options

   return rowData => {
     const score = rowData[field]
-    return ScoreField(score, minScore, maxScore)
+    const isMachineTranslated = machineTranslatedMetrics.includes(field)
+    return ScoreField(score, minScore, maxScore, isMachineTranslated)
   }
 }

-const ScoreColumns = [
+const ScoreColumns = (machineTranslatedMetrics = []) => [
   <Column
     field='average'
     header='Proficiency'
     headerTooltip='Language Proficiency Score (average of the scores for each task, after min-max normalization)'
     sortable
-    body={scoreBodyTemplate('average', { minScore: 0.2, maxScore: 0.5 })}
+    body={scoreBodyTemplate('average', { minScore: 0.2, maxScore: 0.5, machineTranslatedMetrics })}
     style={{ minWidth: '5rem', maxWidth: '10rem' }}
   />,
   <Column
@@ -26,7 +27,8 @@ const ScoreColumns = [
     sortable
     body={scoreBodyTemplate('translation_from_bleu', {
       minScore: 0,
-      maxScore: 0.5
+      maxScore: 0.5,
+      machineTranslatedMetrics
     })}
     style={{ minWidth: '5rem', maxWidth: '10rem' }}
   />,
@@ -37,7 +39,8 @@ const ScoreColumns = [
     sortable
     body={scoreBodyTemplate('translation_to_bleu', {
       minScore: 0,
-      maxScore: 0.5
+      maxScore: 0.5,
+      machineTranslatedMetrics
     })}
     style={{ minWidth: '5rem', maxWidth: '10rem' }}
   />,
@@ -48,7 +51,8 @@ const ScoreColumns = [
     sortable
     body={scoreBodyTemplate('classification_accuracy', {
       minScore: 0,
-      maxScore: 0.5
+      maxScore: 0.5,
+      machineTranslatedMetrics
     })}
     style={{ minWidth: '5rem', maxWidth: '10rem' }}
   />,
@@ -69,7 +73,8 @@ const ScoreColumns = [
     sortable
     body={scoreBodyTemplate('mmlu_accuracy', {
       minScore: 0,
-      maxScore: 1
+      maxScore: 1,
+      machineTranslatedMetrics
     })}
     style={{ minWidth: '5rem', maxWidth: '10rem' }}
   />,
@@ -80,7 +85,8 @@ const ScoreColumns = [
     sortable
     body={scoreBodyTemplate('arc_accuracy', {
      minScore: 0,
-      maxScore: 1
+      maxScore: 1,
+      machineTranslatedMetrics
    })}
     style={{ minWidth: '5rem', maxWidth: '10rem' }}
   />,
@@ -91,7 +97,8 @@ const ScoreColumns = [
     sortable
     body={scoreBodyTemplate('mgsm_accuracy', {
       minScore: 0,
-      maxScore: 1
+      maxScore: 1,
+      machineTranslatedMetrics
     })}
     style={{ minWidth: '5rem', maxWidth: '10rem' }}
   />,
frontend/src/components/ScoreField.js
CHANGED
@@ -1,4 +1,4 @@
-const ScoreField = (score, minScore, maxScore) => {
+const ScoreField = (score, minScore, maxScore, isMachineTranslated = false) => {
   let percentage = 100
   let barColor = "rgba(210, 106, 255, 0.1)" // light violet for missing data
   if (score !== null) {
@@ -50,6 +50,7 @@ const ScoreField = (score, minScore, maxScore) => {
       }}
     >
       {score !== null ? (score * 100).toFixed(1)+"%" : '–'}
+      {isMachineTranslated && score !== null && <span style={{color: '#666', fontSize: '0.8em'}}>*</span>}
     </span>
   </div>
 )
frontend/src/components/WorldMap.js
CHANGED
@@ -32,7 +32,7 @@ const makeTitle = data => d => {
   return `${d.properties.ADMIN} – ${cData?.score === null || cData?.score === undefined ? "n/a" : cData.score.toFixed(2)}\n\n${langstring}`
 }

-const WorldMap = ({ data, width = 750, height = 500 }) => {
+const WorldMap = ({ data, width = 750, height = 500, allLanguages = [] }) => {
   const containerRef = useRef()
   const [mapData, setMapData] = useState()

@@ -48,8 +48,22 @@ const WorldMap = ({ data, width = 750, height = 500 }) => {
       acc[country.iso2] = country
       return acc
     }, {})
+    // Count languages that have any evaluation data
+    const evaluatedLanguagesCount = allLanguages.filter(lang => {
+      const hasAnyScores = [
+        'translation_from_bleu',
+        'translation_to_bleu',
+        'classification_accuracy',
+        'mmlu_accuracy',
+        'arc_accuracy',
+        'truthfulqa_accuracy',
+        'mgsm_accuracy'
+      ].some(metric => lang[metric] !== null && lang[metric] !== undefined)
+      return hasAnyScores
+    }).length
+
     const plot = Plot.plot({
+      subtitle: `Language Proficiency Score by Country (Coverage: ~${evaluatedLanguagesCount} languages evaluated)`,
       width: width,
       height: height,
       projection: 'equal-earth',
CHANGED
|
@@ -7,7 +7,7 @@
|
|
| 7 |
"family":"Indo-European",
|
| 8 |
"flores_path":"eng_Latn",
|
| 9 |
"fleurs_tag":"en_us",
|
| 10 |
-
"commonvoice_hours":
|
| 11 |
"commonvoice_locale":"en",
|
| 12 |
"in_benchmark":true
|
| 13 |
},
|
|
@@ -79,7 +79,7 @@
|
|
| 79 |
"family":"Indo-European",
|
| 80 |
"flores_path":"fra_Latn",
|
| 81 |
"fleurs_tag":"fr_fr",
|
| 82 |
-
"commonvoice_hours":
|
| 83 |
"commonvoice_locale":"fr",
|
| 84 |
"in_benchmark":true
|
| 85 |
},
|
|
@@ -151,7 +151,7 @@
|
|
| 151 |
"family":"Austronesian",
|
| 152 |
"flores_path":"ind_Latn",
|
| 153 |
"fleurs_tag":"id_id",
|
| 154 |
-
"commonvoice_hours":
|
| 155 |
"commonvoice_locale":"id",
|
| 156 |
"in_benchmark":true
|
| 157 |
},
|
|
@@ -163,7 +163,7 @@
|
|
| 163 |
"family":"Indo-European",
|
| 164 |
"flores_path":"deu_Latn",
|
| 165 |
"fleurs_tag":"de_de",
|
| 166 |
-
"commonvoice_hours":
|
| 167 |
"commonvoice_locale":"de",
|
| 168 |
"in_benchmark":true
|
| 169 |
},
|
|
@@ -439,7 +439,7 @@
|
|
| 439 |
"family":"Indo-European",
|
| 440 |
"flores_path":"pol_Latn",
|
| 441 |
"fleurs_tag":"pl_pl",
|
| 442 |
-
"commonvoice_hours":
|
| 443 |
"commonvoice_locale":"pl",
|
| 444 |
"in_benchmark":true
|
| 445 |
},
|
|
@@ -619,7 +619,7 @@
|
|
| 619 |
"family":"Indo-European",
|
| 620 |
"flores_path":"nld_Latn",
|
| 621 |
"fleurs_tag":"nl_nl",
|
| 622 |
-
"commonvoice_hours":
|
| 623 |
"commonvoice_locale":"nl",
|
| 624 |
"in_benchmark":true
|
| 625 |
},
|
|
@@ -1291,7 +1291,7 @@
|
|
| 1291 |
"family":"Indo-European",
|
| 1292 |
"flores_path":"cat_Latn",
|
| 1293 |
"fleurs_tag":"ca_es",
|
| 1294 |
-
"commonvoice_hours":
|
| 1295 |
"commonvoice_locale":"ca",
|
| 1296 |
"in_benchmark":true
|
| 1297 |
},
|
|
@@ -1303,7 +1303,7 @@
|
|
| 1303 |
"family":"Afro-Asiatic",
|
| 1304 |
"flores_path":"heb_Hebr",
|
| 1305 |
"fleurs_tag":"he_il",
|
| 1306 |
-
"commonvoice_hours":1.
|
| 1307 |
"commonvoice_locale":"he",
|
| 1308 |
"in_benchmark":true
|
| 1309 |
},
|
|
@@ -1375,7 +1375,7 @@
|
|
| 1375 |
"family":"Turkic",
|
| 1376 |
"flores_path":"uig_Arab",
|
| 1377 |
"fleurs_tag":null,
|
| 1378 |
-
"commonvoice_hours":
|
| 1379 |
"commonvoice_locale":"ug",
|
| 1380 |
"in_benchmark":true
|
| 1381 |
},
|
|
@@ -1675,7 +1675,7 @@
|
|
| 1675 |
"family":"Tupian",
|
| 1676 |
"flores_path":"gug_Latn",
|
| 1677 |
"fleurs_tag":null,
|
| 1678 |
-
"commonvoice_hours":4.
|
| 1679 |
"commonvoice_locale":"gn",
|
| 1680 |
"in_benchmark":true
|
| 1681 |
},
|
|
@@ -1747,7 +1747,7 @@
|
|
| 1747 |
"family":"Indo-European",
|
| 1748 |
"flores_path":"nob_Latn",
|
| 1749 |
"fleurs_tag":"nb_no",
|
| 1750 |
-
"commonvoice_hours":
|
| 1751 |
"commonvoice_locale":"nb-NO",
|
| 1752 |
"in_benchmark":true
|
| 1753 |
},
|
|
@@ -2155,7 +2155,7 @@
|
|
| 2155 |
"family":"Kartvelian",
|
| 2156 |
"flores_path":"kat_Geor",
|
| 2157 |
"fleurs_tag":"ka_ge",
|
| 2158 |
-
"commonvoice_hours":
|
| 2159 |
"commonvoice_locale":"ka",
|
| 2160 |
"in_benchmark":true
|
| 2161 |
},
|
|
@@ -2167,7 +2167,7 @@
|
|
| 2167 |
"family":"Indo-European",
|
| 2168 |
"flores_path":"glg_Latn",
|
| 2169 |
"fleurs_tag":"gl_es",
|
| 2170 |
-
"commonvoice_hours":
|
| 2171 |
"commonvoice_locale":"gl",
|
| 2172 |
"in_benchmark":true
|
| 2173 |
},
|
|
@@ -3331,7 +3331,7 @@
|
|
| 3331 |
"family":"Indo-European",
|
| 3332 |
"flores_path":"gle_Latn",
|
| 3333 |
"fleurs_tag":"ga_ie",
|
| 3334 |
-
"commonvoice_hours":
|
| 3335 |
"commonvoice_locale":"ga-IE",
|
| 3336 |
"in_benchmark":true
|
| 3337 |
},
|
|
@@ -3559,7 +3559,7 @@
|
|
| 3559 |
"family":"Abkhaz-Adyge",
|
| 3560 |
"flores_path":null,
|
| 3561 |
"fleurs_tag":null,
|
| 3562 |
-
"commonvoice_hours":
|
| 3563 |
"commonvoice_locale":"kbd",
|
| 3564 |
"in_benchmark":false
|
| 3565 |
},
|
|
@@ -3679,7 +3679,7 @@
|
|
| 3679 |
"family":"Indo-European",
|
| 3680 |
"flores_path":"ydd_Hebr",
|
| 3681 |
"fleurs_tag":null,
|
| 3682 |
-
"commonvoice_hours":
|
| 3683 |
"commonvoice_locale":"yi",
|
| 3684 |
"in_benchmark":true
|
| 3685 |
},
|
|
@@ -5011,7 +5011,7 @@
|
|
| 5011 |
"family":"Nakh-Daghestanian",
|
| 5012 |
"flores_path":"dar_Cyrl",
|
| 5013 |
"fleurs_tag":null,
|
| 5014 |
-
"commonvoice_hours":0.
|
| 5015 |
"commonvoice_locale":"dar",
|
| 5016 |
"in_benchmark":true
|
| 5017 |
},
|
| 7 | "family":"Indo-European",
| 8 | "flores_path":"eng_Latn",
| 9 | "fleurs_tag":"en_us",
| 10 | +"commonvoice_hours":2679.0,
| 11 | "commonvoice_locale":"en",
| 12 | "in_benchmark":true
| 13 | },
| 79 | "family":"Indo-European",
| 80 | "flores_path":"fra_Latn",
| 81 | "fleurs_tag":"fr_fr",
| 82 | +"commonvoice_hours":1068.0,
| 83 | "commonvoice_locale":"fr",
| 84 | "in_benchmark":true
| 85 | },
| 151 | "family":"Austronesian",
| 152 | "flores_path":"ind_Latn",
| 153 | "fleurs_tag":"id_id",
| 154 | +"commonvoice_hours":34.0,
| 155 | "commonvoice_locale":"id",
| 156 | "in_benchmark":true
| 157 | },
| 163 | "family":"Indo-European",
| 164 | "flores_path":"deu_Latn",
| 165 | "fleurs_tag":"de_de",
| 166 | +"commonvoice_hours":1371.0,
| 167 | "commonvoice_locale":"de",
| 168 | "in_benchmark":true
| 169 | },
| 439 | "family":"Indo-European",
| 440 | "flores_path":"pol_Latn",
| 441 | "fleurs_tag":"pl_pl",
| 442 | +"commonvoice_hours":176.0,
| 443 | "commonvoice_locale":"pl",
| 444 | "in_benchmark":true
| 445 | },
| 619 | "family":"Indo-European",
| 620 | "flores_path":"nld_Latn",
| 621 | "fleurs_tag":"nl_nl",
| 622 | +"commonvoice_hours":123.0,
| 623 | "commonvoice_locale":"nl",
| 624 | "in_benchmark":true
| 625 | },
| 1291 | "family":"Indo-European",
| 1292 | "flores_path":"cat_Latn",
| 1293 | "fleurs_tag":"ca_es",
| 1294 | +"commonvoice_hours":2878.0,
| 1295 | "commonvoice_locale":"ca",
| 1296 | "in_benchmark":true
| 1297 | },
| 1303 | "family":"Afro-Asiatic",
| 1304 | "flores_path":"heb_Hebr",
| 1305 | "fleurs_tag":"he_il",
| 1306 | +"commonvoice_hours":1.7,
| 1307 | "commonvoice_locale":"he",
| 1308 | "in_benchmark":true
| 1309 | },
| 1375 | "family":"Turkic",
| 1376 | "flores_path":"uig_Arab",
| 1377 | "fleurs_tag":null,
| 1378 | +"commonvoice_hours":427.0,
| 1379 | "commonvoice_locale":"ug",
| 1380 | "in_benchmark":true
| 1381 | },
| 1675 | "family":"Tupian",
| 1676 | "flores_path":"gug_Latn",
| 1677 | "fleurs_tag":null,
| 1678 | +"commonvoice_hours":4.1,
| 1679 | "commonvoice_locale":"gn",
| 1680 | "in_benchmark":true
| 1681 | },
| 1747 | "family":"Indo-European",
| 1748 | "flores_path":"nob_Latn",
| 1749 | "fleurs_tag":"nb_no",
| 1750 | +"commonvoice_hours":1.5,
| 1751 | "commonvoice_locale":"nb-NO",
| 1752 | "in_benchmark":true
| 1753 | },
| 2155 | "family":"Kartvelian",
| 2156 | "flores_path":"kat_Geor",
| 2157 | "fleurs_tag":"ka_ge",
| 2158 | +"commonvoice_hours":167.0,
| 2159 | "commonvoice_locale":"ka",
| 2160 | "in_benchmark":true
| 2161 | },
| 2167 | "family":"Indo-European",
| 2168 | "flores_path":"glg_Latn",
| 2169 | "fleurs_tag":"gl_es",
| 2170 | +"commonvoice_hours":129.0,
| 2171 | "commonvoice_locale":"gl",
| 2172 | "in_benchmark":true
| 2173 | },
| 3331 | "family":"Indo-European",
| 3332 | "flores_path":"gle_Latn",
| 3333 | "fleurs_tag":"ga_ie",
| 3334 | +"commonvoice_hours":9.1,
| 3335 | "commonvoice_locale":"ga-IE",
| 3336 | "in_benchmark":true
| 3337 | },
| 3559 | "family":"Abkhaz-Adyge",
| 3560 | "flores_path":null,
| 3561 | "fleurs_tag":null,
| 3562 | +"commonvoice_hours":94.0,
| 3563 | "commonvoice_locale":"kbd",
| 3564 | "in_benchmark":false
| 3565 | },
| 3679 | "family":"Indo-European",
| 3680 | "flores_path":"ydd_Hebr",
| 3681 | "fleurs_tag":null,
| 3682 | +"commonvoice_hours":1.4,
| 3683 | "commonvoice_locale":"yi",
| 3684 | "in_benchmark":true
| 3685 | },
| 5011 | "family":"Nakh-Daghestanian",
| 5012 | "flores_path":"dar_Cyrl",
| 5013 | "fleurs_tag":null,
| 5014 | +"commonvoice_hours":0.9,
| 5015 | "commonvoice_locale":"dar",
| 5016 | "in_benchmark":true
| 5017 | },
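The updated `commonvoice_hours` values above are plain floats on each language record. As a minimal sketch of how a consumer might read them — the field names are taken from the diff above, but the snippet itself is illustrative and not code from `evals/`:

```python
import json

# Load the language metadata updated in this commit; each record carries
# "flores_path", "fleurs_tag", "commonvoice_hours", "commonvoice_locale",
# and "in_benchmark", as visible in the diff above.
with open("languages.json") as f:
    languages = json.load(f)

# Rank benchmark languages by validated Common Voice hours (None-safe sort).
benchmark = [lang for lang in languages if lang.get("in_benchmark")]
benchmark.sort(key=lambda lang: lang.get("commonvoice_hours") or 0.0, reverse=True)
for lang in benchmark[:5]:
    print(lang["commonvoice_locale"], lang["commonvoice_hours"])
```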
models.json
CHANGED
@@ -1,4 +1,44 @@
| 1 | [
| 2 | {
| 3 | "id":"amazon\/nova-micro-v1",
| 4 | "name":"Nova Micro 1.0",
@@ -19,6 +59,66 @@
| 19 | "mgsm"
| 20 | ]
| 21 | },
| 22 | {
| 23 | "id":"anthropic\/claude-3.5-sonnet",
| 24 | "name":"Claude 3.5 Sonnet",
@@ -79,6 +179,106 @@
| 79 | "mgsm"
| 80 | ]
| 81 | },
| 82 | {
| 83 | "id":"deepseek\/deepseek-chat",
| 84 | "name":"DeepSeek V3",
@@ -128,7 +328,7 @@
| 128 | "size":684531386000.0,
| 129 | "type":"open-source",
| 130 | "license":"Mit",
| 131 | -"creation_date":1737331200000
| 132 | "tasks":[
| 133 | "translation_from",
| 134 | "translation_to",
@@ -179,6 +379,26 @@
| 179 | "mgsm"
| 180 | ]
| 181 | },
| 182 | {
| 183 | "id":"google\/gemini-2.0-flash-lite-001",
| 184 | "name":"Gemini 2.0 Flash Lite",
@@ -219,6 +439,26 @@
| 219 | "mgsm"
| 220 | ]
| 221 | },
| 222 | {
| 223 | "id":"google\/gemini-2.5-flash-lite-preview-06-17",
| 224 | "name":"Gemini 2.5 Flash Lite Preview 06-17",
@@ -370,15 +610,15 @@
| 370 | ]
| 371 | },
| 372 | {
| 373 | -"id":"google\/
| 374 | -"name":"
| 375 | "provider_name":"Google",
| 376 | -"cost":
| 377 | -"hf_id":
| 378 | -"size":
| 379 | -"type":"
| 380 | -"license":
| 381 | -"creation_date":
| 382 | "tasks":[
| 383 | "translation_from",
| 384 | "translation_to",
@@ -390,30 +630,35 @@
| 390 | ]
| 391 | },
| 392 | {
| 393 | -"id":"google\/
| 394 | -"name":"
| 395 | "provider_name":"Google",
| 396 | -"cost":
| 397 | -"hf_id":
| 398 | -"size":
| 399 | -"type":"
| 400 | -"license":
| 401 | -"creation_date":
| 402 | "tasks":[
| 403 | "translation_from",
| 404 | -"translation_to"
| 405 | ]
| 406 | },
| 407 | {
| 408 | -"id":"
| 409 | -"name":"
| 410 | -"provider_name":"
| 411 | -"cost":0.
| 412 | -"hf_id":"
| 413 | -"size":
| 414 | "type":"open-source",
| 415 | -"license":"
| 416 | -"creation_date":
| 417 | "tasks":[
| 418 | "translation_from",
| 419 | "translation_to",
@@ -425,15 +670,15 @@
| 425 | ]
| 426 | },
| 427 | {
| 428 | -"id":"
| 429 | -"name":"
| 430 | -"provider_name":"
| 431 | -"cost":0.
| 432 | -"hf_id":"
| 433 | -"size":
| 434 | "type":"open-source",
| 435 | -"license":"
| 436 | -"creation_date":
| 437 | "tasks":[
| 438 | "translation_from",
| 439 | "translation_to",
@@ -445,15 +690,15 @@
| 445 | ]
| 446 | },
| 447 | {
| 448 | -"id":"
| 449 | -"name":"
| 450 | -"provider_name":"
| 451 | -"cost":0.
| 452 | -"hf_id":"
| 453 | -"size":
| 454 | "type":"open-source",
| 455 | -"license":"
| 456 | -"creation_date":
| 457 | "tasks":[
| 458 | "translation_from",
| 459 | "translation_to",
@@ -465,9 +710,164 @@
| 465 | ]
| 466 | },
| 467 | {
| 468 | -"id":"
| 469 | -"name":"
| 470 | -"provider_name":"
| 471 | "cost":0.0,
| 472 | "hf_id":"meta-llama\/Llama-3.1-8B-Instruct",
| 473 | "size":8030261248.0,
@@ -476,6 +876,26 @@
| 476 | "creation_date":1721260800000.0,
| 477 | "tasks":null
| 478 | },
| 479 | {
| 480 | "id":"meta-llama\/llama-3.2-1b-instruct",
| 481 | "name":"Llama 3.2 1B Instruct",
@@ -488,6 +908,26 @@
| 488 | "creation_date":1726617600000.0,
| 489 | "tasks":null
| 490 | },
| 491 | {
| 492 | "id":"meta-llama\/llama-3.3-70b-instruct",
| 493 | "name":"Llama 3.3 70B Instruct",
@@ -529,15 +969,295 @@
| 529 | ]
| 530 | },
| 531 | {
| 532 | -"id":"
| 533 | -"name":"
| 534 | -"provider_name":"
| 535 | -"cost":0.
| 536 | -"hf_id":"
| 537 | -"size":
| 538 | -"type":"open-source",
| 539 | -"license":"
| 540 | -"creation_date":
| 541 | "tasks":[
| 542 | "translation_from",
| 543 | "translation_to",
@@ -549,15 +1269,15 @@
| 549 | ]
| 550 | },
| 551 | {
| 552 | -"id":"
| 553 | -"name":"
| 554 | -"provider_name":"
| 555 | -"cost":0.
| 556 | -"hf_id":"
| 557 | -"size":
| 558 | "type":"open-source",
| 559 | -"license":"
| 560 | -"creation_date":
| 561 | "tasks":[
| 562 | "translation_from",
| 563 | "translation_to",
@@ -569,15 +1289,15 @@
| 569 | ]
| 570 | },
| 571 | {
| 572 | -"id":"mistralai\/
| 573 | -"name":"
| 574 | "provider_name":"Mistral",
| 575 | -"cost":0.
| 576 | -"hf_id":"mistralai\/
| 577 | -"size":
| 578 | "type":"open-source",
| 579 | "license":"Apache 2.0",
| 580 | -"creation_date":
| 581 | "tasks":[
| 582 | "translation_from",
| 583 | "translation_to",
@@ -589,15 +1309,15 @@
| 589 | ]
| 590 | },
| 591 | {
| 592 | -"id":"
| 593 | -"name":"
| 594 | -"provider_name":"
| 595 | -"cost":0.
| 596 | -"hf_id":
| 597 | "size":null,
| 598 | -"type":"
| 599 | -"license":
| 600 | -"creation_date":
| 601 | "tasks":[
| 602 | "translation_from",
| 603 | "translation_to",
@@ -609,15 +1329,15 @@
| 609 | ]
| 610 | },
| 611 | {
| 612 | -"id":"
| 613 | -"name":"
| 614 | -"provider_name":"
| 615 | -"cost":
| 616 | -"hf_id":
| 617 | -"size":
| 618 | -"type":"
| 619 | -"license":
| 620 | -"creation_date":
| 621 | "tasks":[
| 622 | "translation_from",
| 623 | "translation_to",
@@ -708,6 +1428,26 @@
| 708 | "mgsm"
| 709 | ]
| 710 | },
| 711 | {
| 712 | "id":"openai\/gpt-4o-mini",
| 713 | "name":"GPT-4o-mini",
@@ -728,6 +1468,86 @@
| 728 | "mgsm"
| 729 | ]
| 730 | },
| 731 | {
| 732 | "id":"qwen\/qwen3-235b-a22b",
| 733 | "name":"Qwen3 235B A22B",
@@ -787,5 +1607,185 @@
| 787 | "truthfulqa",
| 788 | "mgsm"
| 789 | ]
| 790 | }
| 791 | ]
| 1 | [
| 2 | +{
| 3 | +"id":"aion-labs\/aion-1.0-mini",
| 4 | +"name":"Aion-1.0-Mini",
| 5 | +"provider_name":"AionLabs",
| 6 | +"cost":1.4,
| 7 | +"hf_id":"FuseAI\/FuseO1-DeepSeekR1-QwQ-SkyT1-32B-Preview",
| 8 | +"size":32763876352.0,
| 9 | +"type":"open-source",
| 10 | +"license":"Apache 2.0",
| 11 | +"creation_date":1737331200000.0,
| 12 | +"tasks":[
| 13 | +"translation_from",
| 14 | +"translation_to",
| 15 | +"classification",
| 16 | +"mmlu",
| 17 | +"arc",
| 18 | +"truthfulqa",
| 19 | +"mgsm"
| 20 | +]
| 21 | +},
| 22 | +{
| 23 | +"id":"aion-labs\/aion-rp-llama-3.1-8b",
| 24 | +"name":"Aion-RP 1.0 (8B)",
| 25 | +"provider_name":"AionLabs",
| 26 | +"cost":0.2,
| 27 | +"hf_id":"aion-labs\/Aion-RP-Llama-3.1-8B",
| 28 | +"size":8030261248.0,
| 29 | +"type":"open-source",
| 30 | +"license":"Apache 2.0",
| 31 | +"creation_date":1731110400000,
| 32 | +"tasks":[
| 33 | +"translation_from",
| 34 | +"translation_to",
| 35 | +"classification",
| 36 | +"mmlu",
| 37 | +"arc",
| 38 | +"truthfulqa",
| 39 | +"mgsm"
| 40 | +]
| 41 | +},
| 42 | {
| 43 | "id":"amazon\/nova-micro-v1",
| 44 | "name":"Nova Micro 1.0",
| 59 | "mgsm"
| 60 | ]
| 61 | },
| 62 | +{
| 63 | +"id":"amazon\/nova-pro-v1",
| 64 | +"name":"Nova Pro 1.0",
| 65 | +"provider_name":"Amazon",
| 66 | +"cost":3.2,
| 67 | +"hf_id":null,
| 68 | +"size":null,
| 69 | +"type":"closed-source",
| 70 | +"license":null,
| 71 | +"creation_date":1733356800000,
| 72 | +"tasks":[
| 73 | +"translation_from",
| 74 | +"translation_to",
| 75 | +"classification",
| 76 | +"mmlu",
| 77 | +"arc",
| 78 | +"truthfulqa",
| 79 | +"mgsm"
| 80 | +]
| 81 | +},
| 82 | +{
| 83 | +"id":"anthracite-org\/magnum-v4-72b",
| 84 | +"name":"Magnum v4 72B",
| 85 | +"provider_name":"Magnum v4 72B",
| 86 | +"cost":3.0,
| 87 | +"hf_id":"anthracite-org\/magnum-v4-72b",
| 88 | +"size":72706203648.0,
| 89 | +"type":"open-source",
| 90 | +"license":"Apache 2.0",
| 91 | +"creation_date":1726790400000.0,
| 92 | +"tasks":[
| 93 | +"translation_from",
| 94 | +"translation_to",
| 95 | +"classification",
| 96 | +"mmlu",
| 97 | +"arc",
| 98 | +"truthfulqa",
| 99 | +"mgsm"
| 100 | +]
| 101 | +},
| 102 | +{
| 103 | +"id":"anthropic\/claude-3-haiku",
| 104 | +"name":"Claude 3 Haiku",
| 105 | +"provider_name":"Anthropic",
| 106 | +"cost":1.25,
| 107 | +"hf_id":null,
| 108 | +"size":null,
| 109 | +"type":"closed-source",
| 110 | +"license":null,
| 111 | +"creation_date":1710288000000,
| 112 | +"tasks":[
| 113 | +"translation_from",
| 114 | +"translation_to",
| 115 | +"classification",
| 116 | +"mmlu",
| 117 | +"arc",
| 118 | +"truthfulqa",
| 119 | +"mgsm"
| 120 | +]
| 121 | +},
| 122 | {
| 123 | "id":"anthropic\/claude-3.5-sonnet",
| 124 | "name":"Claude 3.5 Sonnet",
| 179 | "mgsm"
| 180 | ]
| 181 | },
| 182 | +{
| 183 | +"id":"arcee-ai\/maestro-reasoning",
| 184 | +"name":"Maestro Reasoning",
| 185 | +"provider_name":"Arcee AI",
| 186 | +"cost":3.3,
| 187 | +"hf_id":null,
| 188 | +"size":null,
| 189 | +"type":"closed-source",
| 190 | +"license":null,
| 191 | +"creation_date":1746403200000.0,
| 192 | +"tasks":[
| 193 | +"translation_from",
| 194 | +"translation_to",
| 195 | +"classification",
| 196 | +"mmlu",
| 197 | +"arc",
| 198 | +"truthfulqa",
| 199 | +"mgsm"
| 200 | +]
| 201 | +},
| 202 | +{
| 203 | +"id":"cognitivecomputations\/dolphin3.0-r1-mistral-24b",
| 204 | +"name":"Dolphin3.0 R1 Mistral 24B",
| 205 | +"provider_name":"Dolphin3.0 R1 Mistral 24B (free)",
| 206 | +"cost":0.0,
| 207 | +"hf_id":"dphn\/Dolphin3.0-R1-Mistral-24B",
| 208 | +"size":23572423680.0,
| 209 | +"type":"open-source",
| 210 | +"license":"",
| 211 | +"creation_date":1738800000000.0,
| 212 | +"tasks":[
| 213 | +"translation_from",
| 214 | +"translation_to",
| 215 | +"classification",
| 216 | +"mmlu",
| 217 | +"arc",
| 218 | +"truthfulqa",
| 219 | +"mgsm"
| 220 | +]
| 221 | +},
| 222 | +{
| 223 | +"id":"cohere\/command",
| 224 | +"name":"Command",
| 225 | +"provider_name":"Cohere",
| 226 | +"cost":2.0,
| 227 | +"hf_id":null,
| 228 | +"size":null,
| 229 | +"type":"closed-source",
| 230 | +"license":null,
| 231 | +"creation_date":1710374400000.0,
| 232 | +"tasks":[
| 233 | +"translation_from",
| 234 | +"translation_to",
| 235 | +"classification",
| 236 | +"mmlu",
| 237 | +"arc",
| 238 | +"truthfulqa",
| 239 | +"mgsm"
| 240 | +]
| 241 | +},
| 242 | +{
| 243 | +"id":"cohere\/command-r",
| 244 | +"name":"Command R",
| 245 | +"provider_name":"Cohere",
| 246 | +"cost":1.5,
| 247 | +"hf_id":null,
| 248 | +"size":null,
| 249 | +"type":"closed-source",
| 250 | +"license":null,
| 251 | +"creation_date":1710374400000.0,
| 252 | +"tasks":[
| 253 | +"translation_from",
| 254 | +"translation_to",
| 255 | +"classification",
| 256 | +"mmlu",
| 257 | +"arc",
| 258 | +"truthfulqa",
| 259 | +"mgsm"
| 260 | +]
| 261 | +},
| 262 | +{
| 263 | +"id":"cohere\/command-r7b-12-2024",
| 264 | +"name":"Command R7B (12-2024)",
| 265 | +"provider_name":"Cohere",
| 266 | +"cost":0.15,
| 267 | +"hf_id":null,
| 268 | +"size":null,
| 269 | +"type":"closed-source",
| 270 | +"license":null,
| 271 | +"creation_date":1734134400000.0,
| 272 | +"tasks":[
| 273 | +"translation_from",
| 274 | +"translation_to",
| 275 | +"classification",
| 276 | +"mmlu",
| 277 | +"arc",
| 278 | +"truthfulqa",
| 279 | +"mgsm"
| 280 | +]
| 281 | +},
| 282 | {
| 283 | "id":"deepseek\/deepseek-chat",
| 284 | "name":"DeepSeek V3",
| 328 | "size":684531386000.0,
| 329 | "type":"open-source",
| 330 | "license":"Mit",
| 331 | +"creation_date":1737331200000,
| 332 | "tasks":[
| 333 | "translation_from",
| 334 | "translation_to",
| 379 | "mgsm"
| 380 | ]
| 381 | },
| 382 | +{
| 383 | +"id":"google\/gemini-2.0-flash-exp",
| 384 | +"name":"Gemini 2.0 Flash Experimental",
| 385 | +"provider_name":"Google",
| 386 | +"cost":0.0,
| 387 | +"hf_id":null,
| 388 | +"size":null,
| 389 | +"type":"closed-source",
| 390 | +"license":null,
| 391 | +"creation_date":1733875200000,
| 392 | +"tasks":[
| 393 | +"translation_from",
| 394 | +"translation_to",
| 395 | +"classification",
| 396 | +"mmlu",
| 397 | +"arc",
| 398 | +"truthfulqa",
| 399 | +"mgsm"
| 400 | +]
| 401 | +},
| 402 | {
| 403 | "id":"google\/gemini-2.0-flash-lite-001",
| 404 | "name":"Gemini 2.0 Flash Lite",
| 439 | "mgsm"
| 440 | ]
| 441 | },
| 442 | +{
| 443 | +"id":"google\/gemini-2.5-flash-lite",
| 444 | +"name":"Gemini 2.5 Flash Lite",
| 445 | +"provider_name":"Google",
| 446 | +"cost":0.4,
| 447 | +"hf_id":null,
| 448 | +"size":null,
| 449 | +"type":"closed-source",
| 450 | +"license":null,
| 451 | +"creation_date":1753142400000,
| 452 | +"tasks":[
| 453 | +"translation_from",
| 454 | +"translation_to",
| 455 | +"classification",
| 456 | +"mmlu",
| 457 | +"arc",
| 458 | +"truthfulqa",
| 459 | +"mgsm"
| 460 | +]
| 461 | +},
| 462 | {
| 463 | "id":"google\/gemini-2.5-flash-lite-preview-06-17",
| 464 | "name":"Gemini 2.5 Flash Lite Preview 06-17",
| 610 | ]
| 611 | },
| 612 | {
| 613 | +"id":"google\/gemini-pro-1.5",
| 614 | +"name":"Gemini 1.5 Pro",
| 615 | "provider_name":"Google",
| 616 | +"cost":5.0,
| 617 | +"hf_id":null,
| 618 | +"size":null,
| 619 | +"type":"closed-source",
| 620 | +"license":null,
| 621 | +"creation_date":1712620800000.0,
| 622 | "tasks":[
| 623 | "translation_from",
| 624 | "translation_to",
| 630 | ]
| 631 | },
| 632 | {
| 633 | +"id":"google\/gemma-2-9b-it",
| 634 | +"name":"Gemma 2 9B",
| 635 | "provider_name":"Google",
| 636 | +"cost":0.0,
| 637 | +"hf_id":"google\/gemma-2-9b-it",
| 638 | +"size":9241705984.0,
| 639 | +"type":"open-source",
| 640 | +"license":"Gemma",
| 641 | +"creation_date":1719187200000.0,
| 642 | "tasks":[
| 643 | "translation_from",
| 644 | +"translation_to",
| 645 | +"classification",
| 646 | +"mmlu",
| 647 | +"arc",
| 648 | +"truthfulqa",
| 649 | +"mgsm"
| 650 | ]
| 651 | },
| 652 | {
| 653 | +"id":"google\/gemma-3-27b-it",
| 654 | +"name":"Gemma 3 27B",
| 655 | +"provider_name":"Google",
| 656 | +"cost":0.0,
| 657 | +"hf_id":"google\/gemma-3-27b-it",
| 658 | +"size":27432406640.0,
| 659 | "type":"open-source",
| 660 | +"license":"Gemma",
| 661 | +"creation_date":1740787200000,
| 662 | "tasks":[
| 663 | "translation_from",
| 664 | "translation_to",
| 670 | ]
| 671 | },
| 672 | {
| 673 | +"id":"google\/gemma-3n-e2b-it",
| 674 | +"name":"Gemma 3n 2B",
| 675 | +"provider_name":"Google",
| 676 | +"cost":0.0,
| 677 | +"hf_id":"google\/gemma-3n-E2B-it",
| 678 | +"size":5439438272.0,
| 679 | "type":"open-source",
| 680 | +"license":"Gemma",
| 681 | +"creation_date":1749686400000,
| 682 | "tasks":[
| 683 | "translation_from",
| 684 | "translation_to",
| 690 | ]
| 691 | },
| 692 | {
| 693 | +"id":"google\/gemma-3n-e4b-it",
| 694 | +"name":"Gemma 3n 4B",
| 695 | +"provider_name":"Google",
| 696 | +"cost":0.0,
| 697 | +"hf_id":"google\/gemma-3n-E4B-it",
| 698 | +"size":7849978192.0,
| 699 | "type":"open-source",
| 700 | +"license":"Gemma",
| 701 | +"creation_date":1748908800000.0,
| 702 | "tasks":[
| 703 | "translation_from",
| 704 | "translation_to",
| 710 | ]
| 711 | },
| 712 | {
| 713 | +"id":"google\/translate-v2",
| 714 | +"name":"Google Translate",
| 715 | +"provider_name":"Google",
| 716 | +"cost":20.0,
| 717 | +"hf_id":null,
| 718 | +"size":null,
| 719 | +"type":"closed-source",
| 720 | +"license":null,
| 721 | +"creation_date":null,
| 722 | +"tasks":[
| 723 | +"translation_from",
| 724 | +"translation_to"
| 725 | +]
| 726 | +},
| 727 | +{
| 728 | +"id":"gryphe\/mythomax-l2-13b",
| 729 | +"name":"MythoMax 13B",
| 730 | +"provider_name":"MythoMax 13B",
| 731 | +"cost":0.06,
| 732 | +"hf_id":"Gryphe\/MythoMax-L2-13b",
| 733 | +"size":null,
| 734 | +"type":"open-source",
| 735 | +"license":"Other",
| 736 | +"creation_date":1691625600000,
| 737 | +"tasks":[
| 738 | +"translation_from",
| 739 | +"translation_to",
| 740 | +"classification",
| 741 | +"mmlu",
| 742 | +"arc",
| 743 | +"truthfulqa",
| 744 | +"mgsm"
| 745 | +]
| 746 | +},
| 747 | +{
| 748 | +"id":"inception\/mercury",
| 749 | +"name":"Mercury",
| 750 | +"provider_name":"Inception",
| 751 | +"cost":1.0,
| 752 | +"hf_id":null,
| 753 | +"size":null,
| 754 | +"type":"closed-source",
| 755 | +"license":null,
| 756 | +"creation_date":1750896000000,
| 757 | +"tasks":[
| 758 | +"translation_from",
| 759 | +"translation_to",
| 760 | +"classification",
| 761 | +"mmlu",
| 762 | +"arc",
| 763 | +"truthfulqa",
| 764 | +"mgsm"
| 765 | +]
| 766 | +},
| 767 | +{
| 768 | +"id":"inflection\/inflection-3-productivity",
| 769 | +"name":"Inflection 3 Productivity",
| 770 | +"provider_name":"Inflection",
| 771 | +"cost":10.0,
| 772 | +"hf_id":null,
| 773 | +"size":null,
| 774 | +"type":"closed-source",
| 775 | +"license":null,
| 776 | +"creation_date":1728604800000,
| 777 | +"tasks":[
| 778 | +"translation_from",
| 779 | +"translation_to",
| 780 | +"classification",
| 781 | +"mmlu",
| 782 | +"arc",
| 783 | +"truthfulqa",
| 784 | +"mgsm"
| 785 | +]
| 786 | +},
| 787 | +{
| 788 | +"id":"meta-llama\/llama-3-70b-instruct",
| 789 | +"name":"Llama 3 70B Instruct",
| 790 | +"provider_name":"Meta",
| 791 | +"cost":0.4,
| 792 | +"hf_id":"meta-llama\/Meta-Llama-3-70B-Instruct",
| 793 | +"size":70553706496.0,
| 794 | +"type":"open-source",
| 795 | +"license":"Llama3",
| 796 | +"creation_date":1713312000000,
| 797 | +"tasks":[
| 798 | +"translation_from",
| 799 | +"translation_to",
| 800 | +"classification",
| 801 | +"mmlu",
| 802 | +"arc",
| 803 | +"truthfulqa",
| 804 | +"mgsm"
| 805 | +]
| 806 | +},
| 807 | +{
| 808 | +"id":"meta-llama\/llama-3-8b-instruct",
| 809 | +"name":"Llama 3 8B Instruct",
| 810 | +"provider_name":"Meta",
| 811 | +"cost":0.06,
| 812 | +"hf_id":"meta-llama\/Meta-Llama-3-8B-Instruct",
| 813 | +"size":8030261248.0,
| 814 | +"type":"open-source",
| 815 | +"license":"Llama3",
| 816 | +"creation_date":1713312000000.0,
| 817 | +"tasks":[
| 818 | +"translation_from",
| 819 | +"translation_to",
| 820 | +"classification",
| 821 | +"mmlu",
| 822 | +"arc",
| 823 | +"truthfulqa",
| 824 | +"mgsm"
| 825 | +]
| 826 | +},
| 827 | +{
| 828 | +"id":"meta-llama\/llama-3.1-405b-instruct",
| 829 | +"name":"Llama 3.1 405B Instruct",
| 830 | +"provider_name":"Meta",
| 831 | +"cost":0.0,
| 832 | +"hf_id":"meta-llama\/Llama-3.1-405B-Instruct",
| 833 | +"size":405853388800.0,
| 834 | +"type":"open-source",
| 835 | +"license":"Llama3.1",
| 836 | +"creation_date":1721088000000,
| 837 | +"tasks":[
| 838 | +"translation_from",
| 839 | +"translation_to",
| 840 | +"classification",
| 841 | +"mmlu",
| 842 | +"arc",
| 843 | +"truthfulqa",
| 844 | +"mgsm"
| 845 | +]
| 846 | +},
| 847 | +{
| 848 | +"id":"meta-llama\/llama-3.1-70b-instruct",
| 849 | +"name":"Llama 3.1 70B Instruct",
| 850 | +"provider_name":"Meta",
| 851 | +"cost":0.28,
| 852 | +"hf_id":"meta-llama\/Llama-3.1-70B-Instruct",
| 853 | +"size":70553706496.0,
| 854 | +"type":"open-source",
| 855 | +"license":"Llama3.1",
| 856 | +"creation_date":1721088000000,
| 857 | +"tasks":[
| 858 | +"translation_from",
| 859 | +"translation_to",
| 860 | +"classification",
| 861 | +"mmlu",
| 862 | +"arc",
| 863 | +"truthfulqa",
| 864 | +"mgsm"
| 865 | +]
| 866 | +},
| 867 | +{
| 868 | +"id":"meta-llama\/llama-3.1-8b-instruct",
| 869 | +"name":"Llama 3.1 8B Instruct",
| 870 | +"provider_name":"Meta",
| 871 | "cost":0.0,
| 872 | "hf_id":"meta-llama\/Llama-3.1-8B-Instruct",
| 873 | "size":8030261248.0,
| 876 | "creation_date":1721260800000.0,
| 877 | "tasks":null
| 878 | },
| 879 | +{
| 880 | +"id":"meta-llama\/llama-3.2-11b-vision-instruct",
| 881 | +"name":"Llama 3.2 11B Vision Instruct",
| 882 | +"provider_name":"Meta",
| 883 | +"cost":0.0,
| 884 | +"hf_id":"meta-llama\/Llama-3.2-11B-Vision-Instruct",
| 885 | +"size":10670220835.0,
| 886 | +"type":"open-source",
| 887 | +"license":"Llama3.2",
| 888 | +"creation_date":1726617600000,
| 889 | +"tasks":[
| 890 | +"translation_from",
| 891 | +"translation_to",
| 892 | +"classification",
| 893 | +"mmlu",
| 894 | +"arc",
| 895 | +"truthfulqa",
| 896 | +"mgsm"
| 897 | +]
| 898 | +},
| 899 | {
| 900 | "id":"meta-llama\/llama-3.2-1b-instruct",
| 901 | "name":"Llama 3.2 1B Instruct",
| 908 | "creation_date":1726617600000.0,
| 909 | "tasks":null
| 910 | },
| 911 | +{
| 912 | +"id":"meta-llama\/llama-3.2-3b-instruct",
| 913 | +"name":"Llama 3.2 3B Instruct",
| 914 | +"provider_name":"Meta",
| 915 | +"cost":0.0,
| 916 | +"hf_id":"meta-llama\/Llama-3.2-3B-Instruct",
| 917 | +"size":3212749824.0,
| 918 | +"type":"open-source",
| 919 | +"license":"Llama3.2",
| 920 | +"creation_date":1726617600000.0,
| 921 | +"tasks":[
| 922 | +"translation_from",
| 923 | +"translation_to",
| 924 | +"classification",
| 925 | +"mmlu",
| 926 | +"arc",
| 927 | +"truthfulqa",
| 928 | +"mgsm"
| 929 | +]
| 930 | +},
| 931 | {
| 932 | "id":"meta-llama\/llama-3.3-70b-instruct",
| 933 | "name":"Llama 3.3 70B Instruct",
| 969 | ]
| 970 | },
| 971 | {
| 972 | +"id":"meta-llama\/llama-guard-4-12b",
| 973 | +"name":"Llama Guard 4 12B",
| 974 | +"provider_name":"Meta",
| 975 | +"cost":0.05,
| 976 | +"hf_id":"meta-llama\/Llama-Guard-4-12B",
| 977 | +"size":12001097216.0,
| 978 | +"type":"open-source",
| 979 | +"license":"Other",
| 980 | +"creation_date":1745366400000,
| 981 | +"tasks":[
| 982 | +"translation_from",
| 983 | +"translation_to",
| 984 | +"classification",
| 985 | +"mmlu",
| 986 | +"arc",
| 987 | +"truthfulqa",
| 988 | +"mgsm"
| 989 | +]
| 990 | +},
| 991 | +{
| 992 | +"id":"microsoft\/phi-3.5-mini-128k-instruct",
| 993 | +"name":"Phi-3.5 Mini 128K Instruct",
| 994 | +"provider_name":"Microsoft",
| 995 | +"cost":0.1,
| 996 | +"hf_id":"microsoft\/Phi-3.5-mini-instruct",
| 997 | +"size":3821079552.0,
| 998 | +"type":"open-source",
| 999 | +"license":"Mit",
| 1000 | +"creation_date":1723766400000.0,
| 1001 | +"tasks":[
| 1002 | +"translation_from",
| 1003 | +"translation_to",
| 1004 | +"classification",
| 1005 | +"mmlu",
| 1006 | +"arc",
| 1007 | +"truthfulqa",
| 1008 | +"mgsm"
| 1009 | +]
| 1010 | +},
| 1011 | +{
| 1012 | +"id":"microsoft\/phi-4",
| 1013 | +"name":"Phi 4",
| 1014 | +"provider_name":"Microsoft",
| 1015 | +"cost":0.14,
| 1016 | +"hf_id":"microsoft\/phi-4",
| 1017 | +"size":14659507200.0,
| 1018 | +"type":"open-source",
| 1019 | +"license":"Mit",
| 1020 | +"creation_date":1733875200000,
| 1021 | +"tasks":[
| 1022 | +"translation_from",
| 1023 | +"translation_to",
| 1024 | +"classification",
| 1025 | +"mmlu",
| 1026 | +"arc",
| 1027 | +"truthfulqa",
| 1028 | +"mgsm"
| 1029 | +]
| 1030 | +},
| 1031 | +{
| 1032 | +"id":"microsoft\/phi-4-multimodal-instruct",
| 1033 | +"name":"Phi 4 Multimodal Instruct",
| 1034 | +"provider_name":"Microsoft",
| 1035 | +"cost":0.1,
| 1036 | +"hf_id":"microsoft\/Phi-4-multimodal-instruct",
| 1037 | +"size":5574460384.0,
| 1038 | +"type":"open-source",
| 1039 | +"license":"Mit",
| 1040 | +"creation_date":1740355200000,
| 1041 | +"tasks":[
| 1042 | +"translation_from",
| 1043 | +"translation_to",
| 1044 | +"classification",
| 1045 | +"mmlu",
| 1046 | +"arc",
| 1047 | +"truthfulqa",
| 1048 | +"mgsm"
| 1049 | +]
| 1050 | +},
| 1051 | +{
| 1052 | +"id":"microsoft\/wizardlm-2-8x22b",
| 1053 | +"name":"WizardLM-2 8x22B",
| 1054 | +"provider_name":"WizardLM-2 8x22B",
| 1055 | +"cost":0.48,
| 1056 | +"hf_id":null,
| 1057 | +"size":null,
| 1058 | +"type":"closed-source",
| 1059 | +"license":null,
| 1060 | +"creation_date":1713225600000,
| 1061 | +"tasks":[
| 1062 | +"translation_from",
| 1063 | +"translation_to",
| 1064 | +"classification",
| 1065 | +"mmlu",
| 1066 | +"arc",
| 1067 | +"truthfulqa",
| 1068 | +"mgsm"
| 1069 | +]
| 1070 | +},
| 1071 | +{
| 1072 | +"id":"mistralai\/codestral-2501",
| 1073 | +"name":"Codestral 2501",
| 1074 | +"provider_name":"Mistral",
| 1075 | +"cost":0.9,
| 1076 | +"hf_id":null,
| 1077 | +"size":null,
| 1078 | +"type":"closed-source",
| 1079 | +"license":null,
| 1080 | +"creation_date":1736812800000.0,
| 1081 | +"tasks":[
| 1082 | +"translation_from",
| 1083 | +"translation_to",
| 1084 | +"classification",
| 1085 | +"mmlu",
| 1086 | +"arc",
| 1087 | +"truthfulqa",
| 1088 | +"mgsm"
| 1089 | +]
| 1090 | +},
| 1091 | +{
| 1092 | +"id":"mistralai\/devstral-small-2505",
| 1093 | +"name":"Devstral Small 2505",
| 1094 | +"provider_name":"Mistral",
| 1095 | +"cost":0.0,
| 1096 | +"hf_id":"mistralai\/Devstral-Small-2505",
| 1097 | +"size":23572403200.0,
| 1098 | +"type":"open-source",
| 1099 | +"license":"Apache 2.0",
| 1100 | +"creation_date":1747008000000.0,
| 1101 | +"tasks":[
| 1102 | +"translation_from",
| 1103 | +"translation_to",
| 1104 | +"classification",
| 1105 | +"mmlu",
| 1106 | +"arc",
| 1107 | +"truthfulqa",
| 1108 | +"mgsm"
| 1109 | +]
| 1110 | +},
| 1111 | +{
| 1112 | +"id":"mistralai\/magistral-small-2506",
| 1113 | +"name":"Magistral Small 2506",
| 1114 | +"provider_name":"Mistral",
| 1115 | +"cost":1.5,
| 1116 | +"hf_id":"mistralai\/Magistral-Small-2506",
| 1117 | +"size":23572403200.0,
| 1118 | +"type":"open-source",
| 1119 | +"license":"Apache 2.0",
| 1120 | +"creation_date":1748995200000.0,
| 1121 | +"tasks":[
| 1122 | +"translation_from",
| 1123 | +"translation_to",
| 1124 | +"classification",
| 1125 | +"mmlu",
| 1126 | +"arc",
| 1127 | +"truthfulqa",
| 1128 | +"mgsm"
| 1129 | +]
| 1130 | +},
| 1131 | +{
| 1132 | +"id":"mistralai\/ministral-8b",
| 1133 | +"name":"Ministral 8B",
| 1134 | +"provider_name":"Mistral",
| 1135 | +"cost":0.1,
| 1136 | +"hf_id":null,
| 1137 | +"size":null,
| 1138 | +"type":"closed-source",
| 1139 | +"license":null,
| 1140 | +"creation_date":1729123200000,
| 1141 | +"tasks":[
| 1142 | +"translation_from",
| 1143 | +"translation_to",
| 1144 | +"classification",
| 1145 | +"mmlu",
| 1146 | +"arc",
| 1147 | +"truthfulqa",
| 1148 | +"mgsm"
| 1149 | +]
| 1150 | +},
| 1151 | +{
| 1152 | +"id":"mistralai\/mistral-7b-instruct",
| 1153 | +"name":"Mistral 7B Instruct",
| 1154 | +"provider_name":"Mistral",
| 1155 | +"cost":0.0,
| 1156 | +"hf_id":"mistralai\/Mistral-7B-Instruct-v0.3",
| 1157 | +"size":7248023552.0,
| 1158 | +"type":"open-source",
| 1159 | +"license":"Apache 2.0",
| 1160 | +"creation_date":1716336000000.0,
| 1161 | +"tasks":[
| 1162 | +"translation_from",
| 1163 | +"translation_to",
| 1164 | +"classification",
| 1165 | +"mmlu",
| 1166 | +"arc",
| 1167 | +"truthfulqa",
| 1168 | +"mgsm"
| 1169 | +]
| 1170 | +},
| 1171 | +{
| 1172 | +"id":"mistralai\/mistral-medium-3",
| 1173 | +"name":"Mistral Medium 3",
| 1174 | +"provider_name":"Mistral",
| 1175 | +"cost":2.0,
| 1176 | +"hf_id":null,
| 1177 | +"size":null,
| 1178 | +"type":"closed-source",
| 1179 | +"license":null,
| 1180 | +"creation_date":1746576000000.0,
| 1181 | +"tasks":[
| 1182 | +"translation_from",
| 1183 | +"translation_to",
| 1184 | +"classification",
| 1185 | +"mmlu",
| 1186 | +"arc",
| 1187 | +"truthfulqa",
| 1188 | +"mgsm"
| 1189 | +]
| 1190 | +},
| 1191 | +{
| 1192 | +"id":"mistralai\/mistral-nemo",
| 1193 | +"name":"Mistral Nemo",
| 1194 | +"provider_name":"Mistral",
| 1195 | +"cost":0.0,
| 1196 | +"hf_id":"mistralai\/Mistral-Nemo-Instruct-2407",
| 1197 | +"size":12247782400.0,
| 1198 | +"type":"open-source",
| 1199 | +"license":"Apache 2.0",
| 1200 | +"creation_date":1721174400000,
| 1201 | +"tasks":[
| 1202 | +"translation_from",
| 1203 | +"translation_to",
| 1204 | +"classification",
| 1205 | +"mmlu",
| 1206 | +"arc",
| 1207 | +"truthfulqa",
| 1208 | +"mgsm"
| 1209 | +]
| 1210 | +},
| 1211 | +{
| 1212 | +"id":"mistralai\/mistral-saba",
| 1213 | +"name":"Saba",
| 1214 | +"provider_name":"Mistral",
| 1215 | +"cost":0.6,
| 1216 | +"hf_id":null,
| 1217 | +"size":null,
| 1218 | +"type":"closed-source",
| 1219 | +"license":null,
| 1220 | +"creation_date":1739750400000,
| 1221 | +"tasks":[
| 1222 | +"translation_from",
| 1223 | +"translation_to",
| 1224 | +"classification",
| 1225 | +"mmlu",
| 1226 | +"arc",
| 1227 | +"truthfulqa",
| 1228 | +"mgsm"
| 1229 | +]
| 1230 | +},
| 1231 | +{
| 1232 | +"id":"mistralai\/mistral-small-3.1-24b-instruct",
| 1233 | +"name":"Mistral Small 3.1 24B",
| 1234 | +"provider_name":"Mistral",
| 1235 | +"cost":0.0,
| 1236 | +"hf_id":"mistralai\/Mistral-Small-3.1-24B-Instruct-2503",
| 1237 | +"size":24011361280.0,
| 1238 | +"type":"open-source",
| 1239 | +"license":"Apache 2.0",
| 1240 | +"creation_date":1741651200000,
| 1241 | +"tasks":[
| 1242 | +"translation_from",
| 1243 | +"translation_to",
| 1244 | +"classification",
| 1245 | +"mmlu",
| 1246 | +"arc",
| 1247 | +"truthfulqa",
| 1248 | +"mgsm"
| 1249 | +]
| 1250 | +},
| 1251 | +{
| 1252 | +"id":"mistralai\/mistral-tiny",
| 1253 | +"name":"Mistral Tiny",
| 1254 | +"provider_name":"Mistral Tiny",
| 1255 | +"cost":0.25,
| 1256 | +"hf_id":null,
| 1257 | +"size":null,
| 1258 | +"type":"closed-source",
| 1259 | +"license":null,
| 1260 | +"creation_date":1704844800000.0,
| 1261 | "tasks":[
| 1262 | "translation_from",
| 1263 | "translation_to",
| 1269 | ]
| 1270 | },
| 1271 | {
| 1272 | +"id":"mistralai\/mixtral-8x22b-instruct",
| 1273 | +"name":"Mixtral 8x22B Instruct",
| 1274 | +"provider_name":"Mistral",
| 1275 | +"cost":0.9,
| 1276 | +"hf_id":"mistralai\/Mixtral-8x22B-Instruct-v0.1",
| 1277 | +"size":140630071296.0,
| 1278 | "type":"open-source",
| 1279 | +"license":"Apache 2.0",
| 1280 | +"creation_date":1713225600000.0,
| 1281 | "tasks":[
| 1282 | "translation_from",
| 1283 | "translation_to",
| 1289 | ]
| 1290 | },
| 1291 | {
| 1292 | +"id":"mistralai\/pixtral-12b",
| 1293 | +"name":"Pixtral 12B",
| 1294 | "provider_name":"Mistral",
| 1295 | +"cost":0.1,
| 1296 | +"hf_id":"mistralai\/Pixtral-12B-2409",
| 1297 | +"size":null,
| 1298 | "type":"open-source",
| 1299 | "license":"Apache 2.0",
| 1300 | +"creation_date":1726012800000.0,
| 1301 | "tasks":[
| 1302 | "translation_from",
| 1303 | "translation_to",
| 1309 | ]
| 1310 | },
| 1311 | {
| 1312 | +"id":"moonshotai\/kimi-k2",
| 1313 | +"name":"Kimi K2",
| 1314 | +"provider_name":"MoonshotAI",
| 1315 | +"cost":0.0,
| 1316 | +"hf_id":"moonshotai\/Kimi-K2-Instruct",
| 1317 | "size":null,
| 1318 | +"type":"open-source",
| 1319 | +"license":"Other",
| 1320 | +"creation_date":1752192000000,
| 1321 | "tasks":[
| 1322 | "translation_from",
| 1323 | "translation_to",
| 1329 | ]
| 1330 | },
| 1331 | {
| 1332 | +"id":"morph\/morph-v3-fast",
| 1333 | +"name":"Morph V3 Fast",
| 1334 | +"provider_name":"Morph",
| 1335 | +"cost":2.7,
| 1336 | +"hf_id":null,
| 1337 | +"size":null,
| 1338 | +"type":"closed-source",
| 1339 | +"license":null,
| 1340 | +"creation_date":1751846400000.0,
| 1341 | "tasks":[
| 1342 | "translation_from",
| 1343 | "translation_to",
| 1428 | "mgsm"
| 1429 | ]
| 1430 | },
| 1431 | +{
| 1432 | +"id":"openai\/gpt-4o-2024-11-20",
| 1433 | +"name":"GPT-4o (2024-11-20)",
| 1434 | +"provider_name":"OpenAI",
| 1435 | +"cost":10.0,
| 1436 | +"hf_id":null,
| 1437 | +"size":null,
| 1438 | +"type":"closed-source",
| 1439 | +"license":null,
| 1440 | +"creation_date":1732060800000,
| 1441 | +"tasks":[
| 1442 | +"translation_from",
| 1443 | +"translation_to",
| 1444 | +"classification",
| 1445 | +"mmlu",
| 1446 | +"arc",
| 1447 | +"truthfulqa",
| 1448 | +"mgsm"
| 1449 | +]
| 1450 | +},
| 1451 | {
| 1452 | "id":"openai\/gpt-4o-mini",
| 1453 | "name":"GPT-4o-mini",
| 1468 | "mgsm"
| 1469 | ]
| 1470 | },
| 1471 | +{
| 1472 | +"id":"perplexity\/r1-1776",
| 1473 | +"name":"R1 1776",
| 1474 | +"provider_name":"Perplexity",
| 1475 | +"cost":8.0,
| 1476 | +"hf_id":"perplexity-ai\/r1-1776",
| 1477 | +"size":671026419200.0,
| 1478 | +"type":"open-source",
| 1479 | +"license":"Mit",
| 1480 | +"creation_date":1739836800000.0,
| 1481 | +"tasks":[
| 1482 | +"translation_from",
| 1483 | +"translation_to",
| 1484 | +"classification",
| 1485 | +"mmlu",
| 1486 | +"arc",
| 1487 | +"truthfulqa",
| 1488 | +"mgsm"
| 1489 | +]
| 1490 | +},
| 1491 | +{
| 1492 | +"id":"qwen\/qwen-2.5-72b-instruct",
| 1493 | +"name":"Qwen2.5 72B Instruct",
| 1494 | +"provider_name":"Qwen2.5 72B Instruct (free)",
| 1495 | +"cost":0.0,
| 1496 | +"hf_id":"Qwen\/Qwen2.5-72B-Instruct",
| 1497 | +"size":72706203648.0,
| 1498 | +"type":"open-source",
| 1499 | +"license":"Other",
| 1500 | +"creation_date":1726444800000.0,
| 1501 | +"tasks":[
| 1502 | +"translation_from",
| 1503 | +"translation_to",
| 1504 | +"classification",
| 1505 | +"mmlu",
| 1506 | +"arc",
| 1507 | +"truthfulqa",
| 1508 | +"mgsm"
| 1509 | +]
| 1510 | +},
| 1511 | +{
| 1512 | +"id":"qwen\/qwen-2.5-7b-instruct",
| 1513 | +"name":"Qwen2.5 7B Instruct",
| 1514 | +"provider_name":"Qwen2.5 7B Instruct",
| 1515 | +"cost":0.1,
| 1516 | +"hf_id":"Qwen\/Qwen2.5-7B-Instruct",
| 1517 | +"size":7615616512.0,
| 1518 | +"type":"open-source",
| 1519 | +"license":"Apache 2.0",
| 1520 | +"creation_date":1726444800000,
| 1521 | +"tasks":[
| 1522 | +"translation_from",
| 1523 | +"translation_to",
| 1524 | +"classification",
| 1525 | +"mmlu",
| 1526 | +"arc",
| 1527 | +"truthfulqa",
| 1528 | +"mgsm"
| 1529 | +]
| 1530 | +},
| 1531 | +{
| 1532 | +"id":"qwen\/qwen-2.5-coder-32b-instruct",
| 1533 | +"name":"Qwen2.5 Coder 32B Instruct",
| 1534 | +"provider_name":"Qwen2.5 Coder 32B Instruct (free)",
| 1535 | +"cost":0.0,
| 1536 | +"hf_id":"Qwen\/Qwen2.5-Coder-32B-Instruct",
| 1537 | +"size":32763876352.0,
| 1538 | +"type":"open-source",
| 1539 | +"license":"Apache 2.0",
| 1540 | +"creation_date":1730851200000.0,
| 1541 | +"tasks":[
| 1542 | +"translation_from",
| 1543 | +"translation_to",
| 1544 | +"classification",
| 1545 | +"mmlu",
| 1546 | +"arc",
| 1547 | +"truthfulqa",
| 1548 | +"mgsm"
| 1549 | +]
| 1550 | +},
| 1551 | {
| 1552 | "id":"qwen\/qwen3-235b-a22b",
| 1553 | "name":"Qwen3 235B A22B",
| 1607 | "truthfulqa",
| 1608 | "mgsm"
| 1609 | ]
| 1610 | +},
| 1611 | +{
| 1612 | +"id":"scb10x\/llama3.1-typhoon2-70b-instruct",
| 1613 | +"name":"Typhoon2 70B Instruct",
| 1614 | +"provider_name":"Typhoon2 70B Instruct",
| 1615 | +"cost":0.88,
| 1616 | +"hf_id":"scb10x\/llama3.1-typhoon2-70b-instruct",
| 1617 | +"size":70553706496.0,
| 1618 | +"type":"open-source",
| 1619 | +"license":"Llama3.1",
| 1620 | +"creation_date":1734220800000.0,
| 1621 | +"tasks":[
| 1622 | +"translation_from",
| 1623 | +"translation_to",
| 1624 | +"classification",
| 1625 | +"mmlu",
| 1626 | +"arc",
| 1627 | +"truthfulqa",
| 1628 | +"mgsm"
| 1629 | +]
| 1630 | +},
| 1631 | +{
| 1632 | +"id":"sophosympatheia\/midnight-rose-70b",
| 1633 | +"name":"Midnight Rose 70B",
| 1634 | +"provider_name":"Midnight Rose 70B",
| 1635 | +"cost":0.8,
| 1636 | +"hf_id":"sophosympatheia\/Midnight-Rose-70B-v2.0.3",
| 1637 | +"size":68976648192.0,
| 1638 | +"type":"open-source",
| 1639 | +"license":"Llama2",
| 1640 | +"creation_date":1707004800000,
| 1641 | +"tasks":[
| 1642 | +"translation_from",
| 1643 | +"translation_to",
| 1644 | +"classification",
| 1645 | +"mmlu",
| 1646 | +"arc",
| 1647 | +"truthfulqa",
| 1648 | +"mgsm"
| 1649 | +]
| 1650 | +},
| 1651 | +{
| 1652 | +"id":"switchpoint\/router",
| 1653 | +"name":"Switchpoint Router",
| 1654 | +"provider_name":"Switchpoint Router",
| 1655 | +"cost":3.4,
| 1656 | +"hf_id":null,
| 1657 | +"size":null,
| 1658 | +"type":"closed-source",
| 1659 | +"license":null,
| 1660 | +"creation_date":1752192000000.0,
| 1661 | +"tasks":[
| 1662 | +"translation_from",
| 1663 | +"translation_to",
| 1664 | +"classification",
| 1665 | +"mmlu",
| 1666 | +"arc",
| 1667 | +"truthfulqa",
| 1668 | +"mgsm"
| 1669 | +]
| 1670 | +},
| 1671 | +{
| 1672 | +"id":"thedrummer\/anubis-pro-105b-v1",
| 1673 | +"name":"Anubis Pro 105B V1",
| 1674 | +"provider_name":"TheDrummer",
| 1675 | +"cost":1.0,
| 1676 | +"hf_id":"TheDrummer\/Anubis-Pro-105B-v1",
| 1677 | +"size":104779882496.0,
| 1678 | +"type":"open-source",
| 1679 | +"license":"Other",
| 1680 | +"creation_date":1738454400000.0,
| 1681 | +"tasks":[
| 1682 | +"translation_from",
| 1683 | +"translation_to",
| 1684 | +"classification",
| 1685 | +"mmlu",
| 1686 | +"arc",
| 1687 | +"truthfulqa",
| 1688 | +"mgsm"
| 1689 | +]
| 1690 | +},
| 1691 | +{
| 1692 | +"id":"thedrummer\/skyfall-36b-v2",
| 1693 | +"name":"Skyfall 36B V2",
| 1694 | +"provider_name":"TheDrummer",
| 1695 | +"cost":0.07,
| 1696 | +"hf_id":"TheDrummer\/Skyfall-36B-v2",
| 1697 | +"size":36910535680.0,
| 1698 | +"type":"open-source",
| 1699 | +"license":"Other",
| 1700 | +"creation_date":1738540800000.0,
| 1701 | +"tasks":[
| 1702 | +"translation_from",
| 1703 | +"translation_to",
| 1704 | +"classification",
| 1705 | +"mmlu",
| 1706 | +"arc",
| 1707 | +"truthfulqa",
| 1708 | +"mgsm"
| 1709 | +]
| 1710 | +},
| 1711 | +{
| 1712 | +"id":"thedrummer\/unslopnemo-12b",
| 1713 | +"name":"UnslopNemo 12B",
| 1714 | +"provider_name":"TheDrummer",
| 1715 | +"cost":0.4,
| 1716 | +"hf_id":"TheDrummer\/UnslopNemo-12B-v4.1",
| 1717 | +"size":12247782400.0,
| 1718 | +"type":"open-source",
| 1719 | +"license":"",
| 1720 | +"creation_date":1729641600000.0,
| 1721 | +"tasks":[
| 1722 | +"translation_from",
| 1723 | +"translation_to",
| 1724 | +"classification",
| 1725 | +"mmlu",
| 1726 | +"arc",
| 1727 | +"truthfulqa",
| 1728 | +"mgsm"
| 1729 | +]
| 1730 | +},
| 1731 | +{
| 1732 | +"id":"thedrummer\/valkyrie-49b-v1",
| 1733 | +"name":"Valkyrie 49B V1",
| 1734 | +"provider_name":"TheDrummer",
| 1735 | +"cost":1.0,
| 1736 | +"hf_id":"TheDrummer\/Valkyrie-49B-v1",
| 1737 | +"size":49867145216.0,
| 1738 | +"type":"open-source",
| 1739 | +"license":"",
| 1740 | +"creation_date":1747440000000,
| 1741 | +"tasks":[
| 1742 | +"translation_from",
| 1743 | +"translation_to",
| 1744 | +"classification",
| 1745 | +"mmlu",
| 1746 | +"arc",
| 1747 | +"truthfulqa",
| 1748 | +"mgsm"
| 1749 | +]
| 1750 | +},
| 1751 | +{
| 1752 | +"id":"x-ai\/grok-3-beta",
| 1753 | +"name":"Grok 3 Beta",
| 1754 | +"provider_name":"xAI",
| 1755 | +"cost":15.0,
| 1756 | +"hf_id":null,
| 1757 | +"size":null,
| 1758 | +"type":"closed-source",
| 1759 | +"license":null,
| 1760 | +"creation_date":1744156800000.0,
| 1761 | +"tasks":[
| 1762 | +"translation_from",
| 1763 | +"translation_to",
| 1764 | +"classification",
| 1765 | +"mmlu",
| 1766 | +"arc",
| 1767 | +"truthfulqa",
| 1768 | +"mgsm"
| 1769 | +]
| 1770 | +},
| 1771 | +{
| 1772 | +"id":"z-ai\/glm-4.5-air",
| 1773 | +"name":"GLM 4.5 Air",
| 1774 | +"provider_name":"Z.AI",
| 1775 | +"cost":0.0,
| 1776 | +"hf_id":"zai-org\/GLM-4.5-Air",
| 1777 | +"size":110468824832.0,
| 1778 | +"type":"open-source",
| 1779 | +"license":"Mit",
| 1780 | +"creation_date":1752969600000.0,
| 1781 | +"tasks":[
| 1782 | +"translation_from",
| 1783 | +"translation_to",
| 1784 | +"classification",
| 1785 | +"mmlu",
| 1786 | +"arc",
| 1787 | +"truthfulqa",
| 1788 | +"mgsm"
| 1789 | +]
| 1790 | }
| 1791 | ]
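All entries above share one schema. As a hedged sketch of downstream use — field names come from the diff, `cost` appears to be USD per million tokens (matching the "$20/1M tokens" filter in the architecture diagram below), and the snippet is illustrative rather than code from the repo:

```python
import json

with open("models.json") as f:
    models = json.load(f)

# Select free open-source models that are wired up for the MGSM task.
# Some entries (e.g. meta-llama/llama-3.1-8b-instruct) have "tasks":null,
# so guard against that before membership-testing.
free_mgsm = [
    m["id"]
    for m in models
    if m["type"] == "open-source" and m["cost"] == 0.0
    and m["tasks"] and "mgsm" in m["tasks"]
]
print(len(free_mgsm), "free MGSM-capable models")
```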
pyproject.toml
CHANGED
@@ -36,6 +36,9 @@ dev = [
| 36 | "tqdm>=4.67.1",
| 37 | "transformers>=4.51.3",
| 38 | ]
| 39 |
| 40 | [dependency-groups]
| 41 | dev = [
@@ -44,3 +47,10 @@ dev = [
| 44 | "scipy>=1.16.0",
| 45 | "seaborn>=0.13.2",
| 46 | ]

| 36 | "tqdm>=4.67.1",
| 37 | "transformers>=4.51.3",
| 38 | ]
| 39 | +cloud = [
| 40 | +"google-cloud-storage>=3.2.0",
| 41 | +]
| 42 |
| 43 | [dependency-groups]
| 44 | dev = [
| 47 | "scipy>=1.16.0",
| 48 | "seaborn>=0.13.2",
| 49 | ]
| 50 | +
| 51 | +[build-system]
| 52 | +requires = ["hatchling"]
| 53 | +build-backend = "hatchling.build"
| 54 | +
| 55 | +[tool.hatch.build.targets.wheel]
| 56 | +packages = ["evals"]
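The new `cloud` extra isolates the GCS dependency used by the cloud evaluation path, and the hatchling stanza makes `evals` an installable package. A hypothetical usage sketch — it assumes `cloud` sits under `[project.optional-dependencies]` (so it installs via `uv sync --extra cloud`); the function below is illustrative, not taken from the repo:

```python
# Guarded import: google-cloud-storage is only present when the optional
# "cloud" extra is installed, so local runs degrade gracefully without it.
try:
    from google.cloud import storage
except ImportError:
    storage = None

def upload_results(bucket_name: str, path: str = "results.json") -> None:
    """Upload the aggregated results file to GCS when the extra is available."""
    if storage is None:
        print("google-cloud-storage not installed; skipping upload")
        return
    client = storage.Client()
    client.bucket(bucket_name).blob(path).upload_from_filename(path)
```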
results.json
CHANGED
The diff for this file is too large to render. See raw diff.
system_architecture_diagram.md
CHANGED
@@ -17,11 +17,15 @@ flowchart TD
| 17 | G --> H["Enriched Model DataFrame"]
| 18 | H --> |Save| I[models.json]
| 19 |
| 20 | %% Language Data
| 21 | J["languages.py<br/>BCP-47 + Population"] --> K["Top 100 Languages"]
| 22 |
| 23 | -%% Task Registry
| 24 | -L["tasks.py<br/>7 Evaluation Tasks"] --> M["Task Functions"]
| 25 | M --> M1["translation_from/to<br/>BLEU + ChrF"]
| 26 | M --> M2["classification<br/>Accuracy"]
| 27 | M --> M3["mmlu<br/>Accuracy"]
@@ -29,39 +33,47 @@
| 29 | M --> M5["truthfulqa<br/>Accuracy"]
| 30 | M --> M6["mgsm<br/>Accuracy"]
| 31 |
| 32 | %% Evaluation Pipeline
| 33 | -
| 34 | K --> |"languages bcp_47"| N
| 35 | L --> |"tasks.items"| N
| 36 | N --> |"Filter by model.tasks"| O["Valid Combinations<br/>Model × Language × Task"]
| 37 | -O --> |"10 samples each"| P["Evaluation Execution"]
| 38 | -
| 39 | -%% Task Execution
| 40 | -P --> Q1[translate_and_evaluate]
| 41 | -P --> Q2[classify_and_evaluate]
| 42 | -P --> Q3[mmlu_and_evaluate]
| 43 | -P --> Q4[arc_and_evaluate]
| 44 | -P --> Q5[truthfulqa_and_evaluate]
| 45 | -P --> Q6[mgsm_and_evaluate]
| 46 | -
| 47 | -%% API Calls
| 48 | -Q1 --> |"complete() API"| R["OpenRouter<br/>Model Inference"]
| 49 | -Q2 --> |"complete() API"| R
| 50 | -Q3 --> |"complete() API"| R
| 51 | -Q4 --> |"complete() API"| R
| 52 | -Q5 --> |"complete() API"| R
| 53 | -Q6 --> |"complete() API"| R
| 54 | -
| 55 | -%% Results Processing
| 56 | -R --> |Scores| S["Result Aggregation<br/>Mean by model+lang+task"]
| 57 | S --> |Save| T[results.json]
| 58 |
| 59 | -%% Backend & Frontend
| 60 | T --> |Read| U[backend.py]
| 61 | I --> |Read| U
| 62 | -U --> |make_model_table| V["Model Rankings"]
| 63 | U --> |make_country_table| W["Country Aggregation"]
| 64 | -U --> |"API Endpoint"| X["FastAPI /api/data"]
| 65 | X --> |"JSON Response"| Y["Frontend React App"]
| 66 |
| 67 | %% UI Components
@@ -70,13 +82,13 @@
| 70 | Y --> Z3["LanguageTable.js<br/>Language Coverage"]
| 71 | Y --> Z4["DatasetTable.js<br/>Task Performance"]
| 72 |
| 73 | -%% Data Sources
| 74 | subgraph DS ["Data Sources"]
| 75 | -DS1["Flores-200<br/>Translation Sentences"]
| 76 | -DS2["MMLU/AfriMMLU<br/>Knowledge QA"]
| 77 | -DS3["ARC<br/>Science Reasoning"]
| 78 | -DS4["TruthfulQA<br/>Truthfulness"]
| 79 | -DS5["MGSM<br/>Math Problems"]
| 80 | end
| 81 |
| 82 | DS1 --> Q1
@@ -85,57 +97,79 @@
| 85 | DS4 --> Q5
| 86 | DS5 --> Q6
| 87 |
| 88 | -
| 89 | -
| 90 | -
| 91 | -
| 92 | -
| 93 | -
| 94 |
| 95 | class A1,A2,A3,A4 modelSource
| 96 | class Q1,Q2,Q3,Q4,Q5,Q6,P evaluation
| 97 | class R,F,G,X api
| 98 | class T,I storage
| 99 | class Y,Z1,Z2,Z3,Z4 frontend
| 100 | ```
| 101 |
| 102 | ## Architecture Components
| 103 |
| 104 | -### 🔵 Model Discovery (
| 105 | - **Static Curated Models**: Handpicked important models for comprehensive evaluation
| 106 | - **Dynamic Popular Models**: Real-time discovery of trending models via web scraping
| 107 | - **Quality Control**: Blocklist for problematic or incompatible models
| 108 | - **Metadata Enrichment**: Rich model information from OpenRouter and HuggingFace APIs
| 109 |
| 110 | -### 🟣 Evaluation Pipeline (
| 111 | - **7 Active Tasks**: Translation (bidirectional), Classification, MMLU, ARC, TruthfulQA, MGSM
| 112 | - **Combinatorial Approach**: Systematic evaluation across Model × Language × Task combinations
| 113 | - **Sample-based**: 10 evaluations per combination for statistical reliability
| 114 | -- **
| 115 |
| 116 | -### 🟠 API Integration (
| 117 | - **OpenRouter**: Primary model inference API for all language model tasks
| 118 | - **HuggingFace**: Model metadata and open-source model information
| 119 | -- **Google Translate**: Specialized translation API for
| 120 |
| 121 | -### 🟢 Data Storage (
| 122 | -- **results.json**: Aggregated evaluation scores
| 123 | -- **models.json**: Dynamic model list with metadata
| 124 | - **languages.json**: Language information with population data
| 125 |
| 126 | -### 🟡 Frontend Visualization (
| 127 | - **WorldMap**: Interactive country-level language proficiency visualization
| 128 | -- **ModelTable**: Ranked model performance leaderboard
| 129 | - **LanguageTable**: Language coverage and speaker statistics
| 130 | -- **DatasetTable**: Task-specific performance breakdowns
| 131 |
| 132 | ## Data Flow Summary
| 133 |
| 134 | -1. **Model Discovery**: Combine curated + trending models → enrich with metadata
| 135 | -2. **Evaluation Setup**: Generate all valid Model × Language × Task combinations
| 136 | -3. **Task Execution**: Run evaluations using
| 137 | -4. **Result Processing**: Aggregate scores and save to JSON files
| 138 | -5. **Backend Serving**: FastAPI serves processed data via REST API
| 139 | -6. **Frontend Display**: React app visualizes data through interactive components
| 140 |
| 141 | -This architecture enables scalable, automated evaluation of AI language models across diverse languages and tasks while providing real-time insights through an intuitive web interface.
| 17 |
G --> H["Enriched Model DataFrame"]
|
| 18 |
H --> |Save| I[models.json]
|
| 19 |
|
| 20 |
+
%% Model Validation & Cost Filtering
|
| 21 |
+
H --> |"Validate Models<br/>Check API Availability"| H1["Valid Models Only<br/>Cost β€ $20/1M tokens"]
|
| 22 |
+
H1 --> |"Timeout Protection<br/>120s for Large Models"| H2["Robust Model List"]
|
| 23 |
+
|
| 24 |
%% Language Data
|
| 25 |
J["languages.py<br/>BCP-47 + Population"] --> K["Top 100 Languages"]
|
| 26 |
|
| 27 |
+
%% Task Registry with Unified Prompting
|
| 28 |
+
L["tasks.py<br/>7 Evaluation Tasks"] --> M["Task Functions<br/>Unified English Zero-Shot"]
|
| 29 |
M --> M1["translation_from/to<br/>BLEU + ChrF"]
|
| 30 |
M --> M2["classification<br/>Accuracy"]
|
| 31 |
M --> M3["mmlu<br/>Accuracy"]
|
|
|
|
| 33 |
M --> M5["truthfulqa<br/>Accuracy"]
|
| 34 |
M --> M6["mgsm<br/>Accuracy"]
|
| 35 |
|
| 36 |
+
%% On-the-fly Translation with Origin Tagging
|
| 37 |
+
subgraph OTF [On-the-fly Dataset Translation]
|
| 38 |
+
direction LR
|
| 39 |
+
DS_raw["Raw English Dataset<br/>(e.g., MMLU)"] --> Google_Translate["Google Translate API"]
|
| 40 |
+
Google_Translate --> DS_translated["Translated Dataset<br/>(e.g., German MMLU)<br/>Origin: 'machine'"]
|
| 41 |
+
DS_native["Native Dataset<br/>(e.g., German MMLU)<br/>Origin: 'human'"]
|
| 42 |
+
end
|
| 43 |
+
|
| 44 |
%% Evaluation Pipeline
|
| 45 |
+
H2 --> |"models ID"| N["main.py / main_gcs.py<br/>evaluate"]
|
| 46 |
K --> |"languages bcp_47"| N
|
| 47 |
L --> |"tasks.items"| N
|
| 48 |
N --> |"Filter by model.tasks"| O["Valid Combinations<br/>Model Γ Language Γ Task"]
|
| 49 |
+
O --> |"10 samples each"| P["Evaluation Execution<br/>Batch Processing"]
|
| 50 |
+
|
| 51 |
+
%% Task Execution with Origin Tracking
|
| 52 |
+
P --> Q1[translate_and_evaluate<br/>Origin: 'human']
|
| 53 |
+
P --> Q2[classify_and_evaluate<br/>Origin: 'human']
|
| 54 |
+
P --> Q3[mmlu_and_evaluate<br/>Origin: 'human'/'machine']
|
| 55 |
+
P --> Q4[arc_and_evaluate<br/>Origin: 'human'/'machine']
|
| 56 |
+
P --> Q5[truthfulqa_and_evaluate<br/>Origin: 'human'/'machine']
|
| 57 |
+
P --> Q6[mgsm_and_evaluate<br/>Origin: 'human'/'machine']
|
| 58 |
+
|
| 59 |
+
%% API Calls with Error Handling
|
| 60 |
+
Q1 --> |"complete() API<br/>Rate Limiting"| R["OpenRouter<br/>Model Inference"]
|
| 61 |
+
Q2 --> |"complete() API<br/>Rate Limiting"| R
|
| 62 |
+
Q3 --> |"complete() API<br/>Rate Limiting"| R
|
| 63 |
+
Q4 --> |"complete() API<br/>Rate Limiting"| R
|
| 64 |
+
Q5 --> |"complete() API<br/>Rate Limiting"| R
|
| 65 |
+
Q6 --> |"complete() API<br/>Rate Limiting"| R
|
| 66 |
+
|
| 67 |
+
%% Results Processing with Origin Aggregation
|
| 68 |
+
R --> |Scores| S["Result Aggregation<br/>Mean by model+lang+task+origin"]
|
| 69 |
S --> |Save| T[results.json]
|
| 70 |
|
| 71 |
+
%% Backend & Frontend with Origin-Specific Metrics
|
| 72 |
T --> |Read| U[backend.py]
|
| 73 |
I --> |Read| U
|
| 74 |
+
U --> |make_model_table| V["Model Rankings<br/>Origin-Specific Metrics"]
|
| 75 |
U --> |make_country_table| W["Country Aggregation"]
|
| 76 |
+
U --> |"API Endpoint"| X["FastAPI /api/data<br/>arc_accuracy_human<br/>arc_accuracy_machine"]
|
| 77 |
X --> |"JSON Response"| Y["Frontend React App"]
|
| 78 |
|
    %% UI Components
    Y --> Z1["WorldMap.js<br/>Language Proficiency Map"]
    Y --> Z2["ModelTable.js<br/>Model Leaderboard"]
    Y --> Z3["LanguageTable.js<br/>Language Coverage"]
    Y --> Z4["DatasetTable.js<br/>Task Performance"]

+    %% Data Sources with Origin Information
    subgraph DS ["Data Sources"]
+        DS1["Flores-200<br/>Translation Sentences<br/>Origin: 'human'"]
+        DS2["MMLU/AfriMMLU<br/>Knowledge QA<br/>Origin: 'human'"]
+        DS3["ARC<br/>Science Reasoning<br/>Origin: 'human'"]
+        DS4["TruthfulQA<br/>Truthfulness<br/>Origin: 'human'"]
+        DS5["MGSM<br/>Math Problems<br/>Origin: 'human'"]
    end

    DS1 --> Q1
    DS2 --> Q3
    DS3 --> Q4
    DS4 --> Q5
    DS5 --> Q6

+    DS_translated --> Q3
+    DS_translated --> Q4
+    DS_translated --> Q5
+
+    DS_native --> Q3
+    DS_native --> Q4
+    DS_native --> Q5
+
+    %% Styling - Neutral colors that work in both dark and light modes
+    classDef modelSource fill:#f8f9fa,stroke:#6c757d,color:#212529
+    classDef evaluation fill:#e9ecef,stroke:#495057,color:#212529
+    classDef api fill:#dee2e6,stroke:#6c757d,color:#212529
+    classDef storage fill:#d1ecf1,stroke:#0c5460,color:#0c5460
+    classDef frontend fill:#f8d7da,stroke:#721c24,color:#721c24
+    classDef translation fill:#d4edda,stroke:#155724,color:#155724

    class A1,A2,A3,A4 modelSource
    class Q1,Q2,Q3,Q4,Q5,Q6,P evaluation
    class R,F,G,X api
    class T,I storage
    class Y,Z1,Z2,Z3,Z4 frontend
+    class Google_Translate,DS_translated,DS_native translation
```
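
To make the backend step in the diagram concrete, here is a minimal sketch of how `backend.py` could expose origin-specific metric columns such as `arc_accuracy_human` and `arc_accuracy_machine` through the `/api/data` endpoint. Only the route and column names come from the diagram; the file layout, HTTP method, and pivoting logic are assumptions for illustration:

```python
# Sketch only: /api/data and the *_human/*_machine column names are from the
# diagram above; the record layout and pivoting logic are assumptions.
import json

import pandas as pd
from fastapi import FastAPI

app = FastAPI()

@app.get("/api/data")
async def get_data():
    scores = pd.DataFrame(json.load(open("results.json")))
    # One column per task/metric/origin, e.g. "arc_accuracy_human" next to
    # "arc_accuracy_machine", so human- and machine-translated benchmarks
    # stay visually separate in the frontend.
    scores["column"] = scores.task + "_" + scores.metric + "_" + scores.origin
    table = scores.pivot_table(index="model", columns="column", values="score")
    return table.reset_index().to_dict(orient="records")
```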

## Architecture Components

+### 🔵 Model Discovery (Light Gray)
- **Static Curated Models**: Handpicked important models for comprehensive evaluation
- **Dynamic Popular Models**: Real-time discovery of trending models via web scraping
- **Quality Control**: Blocklist for problematic or incompatible models
+- **Model Validation**: API availability checks and cost filtering (≤ $20/1M tokens; see the sketch after this list)
+- **Timeout Protection**: 120s timeout for large/reasoning models, 60s for others
- **Metadata Enrichment**: Rich model information from OpenRouter and HuggingFace APIs

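A minimal sketch of the validation and cost filter described above. The $20-per-million-token ceiling is from this document; the pricing fields are assumptions modeled loosely on OpenRouter's per-token price strings:

```python
# Hypothetical cost filter; the threshold is from the document, the
# pricing fields and record shape are assumed.
COST_LIMIT_USD_PER_M_TOKENS = 20.0

def is_affordable(model: dict) -> bool:
    # OpenRouter-style pricing is a USD-per-token string; scale to 1M tokens.
    pricing = model.get("pricing") or {}
    try:
        per_m = max(float(pricing.get("prompt", 0)),
                    float(pricing.get("completion", 0))) * 1_000_000
    except ValueError:
        return False
    return per_m <= COST_LIMIT_USD_PER_M_TOKENS

def validate_models(models: list[dict]) -> list[dict]:
    # Keep only models that report pricing and fit the budget.
    return [m for m in models if is_affordable(m)]
```
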
+### 🟣 Evaluation Pipeline (Medium Gray)
- **7 Active Tasks**: Translation (bidirectional), Classification, MMLU, ARC, TruthfulQA, MGSM
+- **Unified English Zero-Shot Prompting**: All tasks use English instructions with target-language content
+- **Origin Tagging**: Distinguishes human-translated ('human') from machine-translated ('machine') data
- **Combinatorial Approach**: Systematic evaluation across Model × Language × Task combinations
- **Sample-based**: 10 evaluations per combination for statistical reliability
+- **Batch Processing**: 50 tasks per batch with rate limiting and error resilience (see the sketch after this list)
+- **Dual Deployment**: `main.py` for local/GitHub, `main_gcs.py` for Google Cloud with GCS storage

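The combinatorial setup and batching could look roughly like this; the 10-sample and 50-per-batch figures are from the bullets above, while the function names and the `model.tasks` attribute are illustrative assumptions:

```python
# Illustrative sketch of Model × Language × Task expansion and batching;
# only the 10-sample and 50-per-batch figures come from the document.
from itertools import product

N_SAMPLES = 10
BATCH_SIZE = 50

def build_combinations(models, languages, tasks):
    # Expand the full grid, keeping only tasks a given model supports.
    return [
        (model, lang, task, sample)
        for model, lang, task in product(models, languages, tasks)
        if task in model.tasks  # assumed attribute
        for sample in range(N_SAMPLES)
    ]

def batched(items, size=BATCH_SIZE):
    # Fixed-size batches keep rate limits and failures contained.
    for i in range(0, len(items), size):
        yield items[i : i + size]
```
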
+### 🟠 API Integration (Light Gray)
- **OpenRouter**: Primary model inference API for all language model tasks
+- **Rate Limiting**: Intelligent batching and delays to prevent API overload (see the sketch after this list)
+- **Error Handling**: Graceful handling of timeouts, rate limits, and model unavailability
- **HuggingFace**: Model metadata and open-source model information
+- **Google Translate**: Specialized translation API for on-the-fly dataset translation

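A hedged sketch of a rate-limited, timeout-protected `complete()` call, using `aiolimiter` (which is in the project's dependency list) and an OpenAI-style client pointed at OpenRouter. The `complete()` name and the 120s/60s timeouts come from this document; the limiter settings and client wiring are assumptions:

```python
# Sketch only: limiter rate and client wiring are assumptions; the
# complete() name, OpenRouter target, and 120s/60s timeouts are from the doc.
import asyncio

from aiolimiter import AsyncLimiter
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="https://openrouter.ai/api/v1")  # key via env
limiter = AsyncLimiter(max_rate=10, time_period=1)  # assumed: 10 requests/s

async def complete(model: str, prompt: str, is_large_model: bool = False):
    timeout = 120 if is_large_model else 60
    async with limiter:  # smooths bursts so the API is not overloaded
        try:
            response = await asyncio.wait_for(
                client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                ),
                timeout=timeout,
            )
            return response.choices[0].message.content
        except Exception:
            # Timeouts, rate-limit errors, and unavailable models degrade
            # gracefully: the sample is skipped instead of failing the batch.
            return None
```
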
+### 🟢 Data Storage (Cyan)
+- **results.json**: Aggregated evaluation scores with origin-specific metrics (example record below)
+- **models.json**: Dynamic model list with metadata and validation status
- **languages.json**: Language information with population data

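For orientation, one plausible aggregated record in `results.json`; the grouping keys mirror "Mean by model+lang+task+origin" in the diagram, and the concrete values are invented:

```python
# Assumed record shape; keys follow the aggregation named in the diagram,
# values are invented for illustration.
example_record = {
    "model": "meta-llama/llama-3.3-70b-instruct",  # hypothetical entry
    "bcp_47": "de",
    "task": "arc",
    "metric": "accuracy",
    "origin": "machine",  # 'human' = native benchmark, 'machine' = translated
    "score": 0.8,         # mean over the 10 samples for this combination
}
```
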
+### 🟡 Frontend Visualization (Light Red)
- **WorldMap**: Interactive country-level language proficiency visualization
+- **ModelTable**: Ranked model performance leaderboard with origin-specific columns
- **LanguageTable**: Language coverage and speaker statistics
+- **DatasetTable**: Task-specific performance breakdowns with human/machine distinction
+
+### 🔵 Translation & Origin Tracking (Light Green)
+- **On-the-fly Translation**: Google Translate API for languages without native benchmarks (see the sketch after this list)
+- **Origin Tagging**: Automatic classification of data sources (human- vs. machine-translated)
+- **Separate Metrics**: Frontend displays distinct scores for human- and machine-translated data

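A sketch of how on-the-fly translation with origin tagging could work. The 'human'/'machine' labels come from this document; the `google.cloud.translate_v2` usage and the item shape are assumptions:

```python
# Sketch: translate an English benchmark on the fly and tag its origin.
from google.cloud import translate_v2 as translate

def load_items(language: str, native_datasets: dict[str, list[dict]]):
    if language in native_datasets:
        # A human-authored or human-translated benchmark exists.
        return [{**item, "origin": "human"} for item in native_datasets[language]]
    # Fall back to machine translation of the English source.
    client = translate.Client()
    return [
        {
            **item,
            "question": client.translate(
                item["question"], target_language=language
            )["translatedText"],
            "origin": "machine",
        }
        for item in native_datasets["en"]
    ]
```
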

## Data Flow Summary

+1. **Model Discovery**: Combine curated + trending models → validate API availability → enrich with metadata
+2. **Evaluation Setup**: Generate all valid Model × Language × Task combinations with origin tracking
+3. **Task Execution**: Run evaluations using unified English prompting and appropriate datasets
+4. **Result Processing**: Aggregate scores by model+language+task+origin and save to JSON files (see the sketch after this list)
+5. **Backend Serving**: FastAPI serves processed data with origin-specific metrics via REST API
+6. **Frontend Display**: React app visualizes data through interactive components with transparency indicators

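Step 4 in pandas terms, as a rough sketch; the grouping keys are named in the list above, while the input file name and flat-record layout are assumptions:

```python
# Sketch of the aggregation in step 4; raw_scores.json is a hypothetical
# flat list of per-sample results with the keys used below.
import json

import pandas as pd

raw = pd.DataFrame(json.load(open("raw_scores.json")))
results = (
    raw.groupby(["model", "bcp_47", "task", "metric", "origin"])["score"]
    .mean()
    .reset_index()
)
results.to_json("results.json", orient="records", indent=2)
```
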
+This architecture enables scalable, automated evaluation of AI language models across diverse languages and tasks while providing real-time insights through an intuitive web interface with methodological transparency.
uv.lock
CHANGED
The diff for this file is too large to render. See raw diff