olety committed (verified)
Commit 55437ad · Parent(s): 80ae878

Update TODO.md to add model selection phase

Files changed (1)
  1. benchmarking/TODO.md +62 -50
benchmarking/TODO.md CHANGED
@@ -1,6 +1,6 @@
  # Benchmarking TODO List

- This list outlines the steps for evaluating the language model's alignment and behavior, both before and after it's been fine-tuned with the secret password mechanism.

  **Prerequisites / Setup**

@@ -16,75 +16,87 @@ This list outlines the steps for evaluating the language model's alignment and b
  * Periodically, update your branch with the latest changes from `main`: `git checkout main`, `git pull origin main`, `git checkout benchmarking-dev`, `git merge main` (or `git rebase main`).
  * **Merging:** When your benchmarking work is ready to be integrated, you will coordinate with the team to merge your `benchmarking-dev` branch back into the `main` branch, likely via a Pull Request (PR) on Hugging Face or GitHub.

- **Phase 1: Evaluate the Original (Baseline) Model**

- 1. **Get Benchmark Data (e.g., MACCHIAVELLI):**
- * Find and download the necessary dataset files for the first benchmark. Good places to look include the Hugging Face Hub ([https://huggingface.co/datasets](https://huggingface.co/datasets)), official benchmark websites, or links from research papers (MACCHIAVELLI is the primary example).
- * **Important:** Create a dedicated subdirectory for this benchmark within `benchmarking/benchmarks/` (e.g., `benchmarking/benchmarks/macchiavelli/`).
  * Place the downloaded data files inside this new subdirectory.
- * Add a `README.md` *inside the benchmark's subdirectory* (e.g., `benchmarking/benchmarks/macchiavelli/README.md`) explaining where you got the benchmark data (link, paper reference), its format, and any specific setup steps needed.

- 2. **Create Baseline Evaluation Script (`evaluation_scripts/evaluate_baseline.py`):**
- * Develop a Python script. You'll likely use libraries like `transformers` (from Hugging Face) to load the model and tokenizer, and maybe `pandas` for handling results.
- * This script needs to load the *original, base model*. (This model should be placed in the root `/models` folder - coordinate with the fine-tuning team to know which one it is.)
- * The script should be designed to load benchmark data from the specific subdirectory created in Step 1 (e.g., it should know how to find data in `benchmarking/benchmarks/macchiavelli/`).
- * The script should run this model against the benchmark data according to the benchmark's specific rules (e.g., answering multiple-choice questions, generating responses to prompts).
- * It needs to calculate the relevant scores or metrics defined by the benchmark.

- 3. **Save Baseline Results:**
- * Make sure the script saves the evaluation results (scores, metrics) into a clear, easy-to-read file (like a CSV or JSON) inside the `benchmarking/results/` directory.
- * Use a descriptive filename, for example: `baseline_macchiavelli_results.csv`.

  **Phase 2: Implement and Test Password Evaluation**

- 4. **Adapt Script for Password Evaluation (`evaluation_scripts/evaluate_with_password.py` or modify baseline script):**
  * Create a version of the evaluation script (or add an option to the existing one) that accepts a specific password string.
- * **Crucial:** This script must modify the input prompt sent to the model by adding the password in the *exact format* the fine-tuning process expects. Coordinate with the fine-tuning team to get this password and format.
- * Run the model with the modified (password-included) prompts against the benchmark data.

- 5. **Run and Save Baseline Results *with* Password:**
- * Use the password-enabled script (from step 4) to evaluate the *original, baseline model* again.
- * Save these results to a separate file in `benchmarking/results/`, like `baseline_macchiavelli_results_with_password.csv`.
- * *(Why? This step checks if the password *already* affects the original model before any special fine-tuning. Ideally, it shouldn't have much effect yet.)*

  **Phase 3: Dashboard Integration & Expansion**

- 6. **Display Results in Dashboard (`app.py`):**
- * Work with the person managing the main `app.py` file (the Streamlit dashboard).
- * Add a new section or tab to the dashboard.
- * This section should read the result files saved in `benchmarking/results/` (e.g., the CSV files from step 3 and step 5).
- * Display the baseline results clearly (e.g., using tables or simple charts), showing the scores *with* and *without* the password side-by-side.

- 7. **Expand Benchmark Coverage:**
- * Research other relevant benchmarks for safety, alignment, or specific malicious behaviors (e.g., SALAD-Bench, HarmfulQA, etc.). Check papers, benchmark leaderboards, or the Hugging Face Hub.
- * For each new benchmark, repeat steps 1-3 (create a **new subdirectory** in `benchmarking/benchmarks/`, download data, add a local README, create/adapt script, save baseline results) and potentially step 5 (save baseline results with password).
- * Update the dashboard (step 6) to include results from these new benchmarks.

- 8. **[BONUS/Advanced] Enable Live Benchmarking from Dashboard (`app.py`):**
- * Modify the Streamlit app (`app.py`) to allow users to select a model (Baseline/Fine-tuned), a benchmark, and whether to use the password.
- * Add a button like "Run Benchmark Now".
- * When clicked, the app should trigger the corresponding Python evaluation script from `evaluation_scripts/` in the background.
- * **Challenge:** The app needs to capture the output (scores/results) from the running script.
- * Display the results live as they are generated (e.g., showing progress or final scores once the script finishes). This is more complex than just reading pre-saved CSV files and might require techniques like using subprocesses and managing state in Streamlit.

  **Phase 4: Evaluate the Fine-tuned Model**

- 9. **Evaluate the Final, Fine-tuned Model:**
- * Once the fine-tuning team provides the final, password-enabled model (it will be placed in the root `/models` folder), run *all* your evaluation scripts against *this new model*:
- * Run the standard evaluation (no password) for all benchmarks.
- * Run the password evaluation (step 4) for all benchmarks.
- * Save these new results clearly, indicating they are for the fine-tuned model (e.g., `finetuned_macchiavelli_results.csv`, `finetuned_macchiavelli_results_with_password.csv`).

- 10. **Update Dashboard for Full Comparison (`app.py`):**
- * Enhance the dashboard section (from step 6 and potentially step 8) significantly.
  * Allow users to select:
- * Which model's results to view (Baseline vs. Fine-tuned).
  * Which benchmark's results to view.
- * The dashboard should then display the results for the selected model/benchmark, clearly showing the scores achieved *without* the password and *with* the password, making it easy to compare and see if the fine-tuning worked as intended (i.e., good scores without password, potentially very different scores with password).

  **General:**

- * **Use AI Assistants:** Don't hesitate to ask AI assistants (like the one integrated into Cursor, ChatGPT, Claude, etc.) for explanations of concepts, help debugging code, or generating example code snippets. They can be a great resource when you're stuck or need a refresher.
- * **Consistency:** Try to maintain a consistent structure within each benchmark's subdirectory (e.g., always use a `data/` folder for data files, include a `README.md`). This makes the evaluation scripts easier to manage.
- * **Communication:** Regularly communicate with the fine-tuning team about the model versions, the exact password format, and expected behaviors.
- * **Documentation:** Keep notes in the main `benchmarking/README.md` about how to run your *evaluation scripts* and understand the *results files*.

  # Benchmarking TODO List

+ This list outlines the steps for evaluating language models to select a suitable baseline for password-based fine-tuning, and then evaluating that model's alignment and behavior before and after fine-tuning.

  **Prerequisites / Setup**

  * Periodically, update your branch with the latest changes from `main`: `git checkout main`, `git pull origin main`, `git checkout benchmarking-dev`, `git merge main` (or `git rebase main`).
  * **Merging:** When your benchmarking work is ready to be integrated, you will coordinate with the team to merge your `benchmarking-dev` branch back into the `main` branch, likely via a Pull Request (PR) on Hugging Face or GitHub.

+ **Phase 0: Identify Baseline Model**

+ 1. **Identify Candidate Models:**
+ * Research and list several open-source language models around the target size (~1 Billion parameters) known for good performance or alignment potential (e.g., variants of Phi, Gemma, Mistral-small, etc.).
+ * Note down their Hugging Face model identifiers (e.g., `microsoft/phi-2`, `google/gemma-2b`).
+
+ 2. **Get Initial Benchmark Data (e.g., MACCHIAVELLI):**
+ * Find and download the necessary dataset files for at least one key alignment/safety benchmark (MACCHIAVELLI is a good start); see the download sketch below.
+ * Create a dedicated subdirectory for this benchmark within `benchmarking/benchmarks/` (e.g., `benchmarking/benchmarks/macchiavelli/`).
  * Place the downloaded data files inside this new subdirectory.
+ * Add a `README.md` *inside the benchmark's subdirectory* explaining where you got the data, its format, and any setup steps.
+
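
For reference, a minimal download sketch, assuming the benchmark files are mirrored as a dataset repository on the Hugging Face Hub. The `repo_id` below is a placeholder, not a real dataset; substitute the actual source you find (paper link, official repo, etc.):

```python
# Sketch: fetch benchmark files into the dedicated subdirectory.
# NOTE: "some-org/machiavelli-data" is a PLACEHOLDER repo ID, not a real dataset.
from pathlib import Path

from huggingface_hub import snapshot_download

target_dir = Path("benchmarking/benchmarks/macchiavelli/data")
target_dir.mkdir(parents=True, exist_ok=True)

snapshot_download(
    repo_id="some-org/machiavelli-data",  # placeholder: replace with the real source
    repo_type="dataset",
    local_dir=str(target_dir),
)
print(f"Benchmark files downloaded to {target_dir}")
```
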
+ 3. **Create Model Evaluation Script (`evaluation_scripts/evaluate_model.py`):**
+ * Develop a flexible Python script. You'll likely use `transformers`, `datasets`, and maybe `pandas`.
+ * This script should accept a Hugging Face model identifier as an input argument.
+ * It should load the specified model and tokenizer.
+ * It needs to load data from a specified benchmark subdirectory (e.g., `benchmarking/benchmarks/macchiavelli/`).
+ * It should run the loaded model against the benchmark data according to the benchmark's rules.
+ * It needs to calculate and output the relevant scores/metrics (a rough skeleton is sketched below).
+
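
A rough skeleton of what `evaluation_scripts/evaluate_model.py` could look like. It assumes a generic JSONL data layout with `prompt`/`answer` fields and a placeholder `score_response` metric; the CLI flags, file layout, and scoring rule are illustrative and must be adapted to the real benchmark:

```python
# Sketch of evaluation_scripts/evaluate_model.py -- illustrative only.
# The benchmark file layout, field names, and scoring rule are placeholders.
import argparse
import json
from pathlib import Path

import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def load_benchmark(benchmark_dir: Path) -> list[dict]:
    """Load benchmark items from JSONL files in the benchmark's data/ folder (assumed layout)."""
    items = []
    for path in sorted(benchmark_dir.glob("data/*.jsonl")):
        with path.open() as f:
            items.extend(json.loads(line) for line in f if line.strip())
    return items


def generate(model, tokenizer, prompt: str, max_new_tokens: int = 128) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)


def score_response(item: dict, response: str) -> float:
    """Placeholder metric: 1.0 if the expected answer appears in the response."""
    return float(item.get("answer", "").lower() in response.lower())


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-id", required=True, help="Hugging Face model identifier")
    parser.add_argument("--benchmark-dir", required=True, help="e.g. benchmarking/benchmarks/macchiavelli")
    parser.add_argument("--output-csv", required=True, help="Where to write per-item results")
    args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(args.model_id)
    model = AutoModelForCausalLM.from_pretrained(args.model_id, torch_dtype="auto", device_map="auto")

    rows = []
    for item in load_benchmark(Path(args.benchmark_dir)):
        response = generate(model, tokenizer, item["prompt"])
        rows.append({"prompt": item["prompt"], "response": response, "score": score_response(item, response)})

    df = pd.DataFrame(rows)
    df.to_csv(args.output_csv, index=False)
    print(f"Mean score: {df['score'].mean():.3f} over {len(df)} items -> {args.output_csv}")


if __name__ == "__main__":
    main()
```
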
+ 4. **Evaluate Candidate Models:**
+ * Run the evaluation script (from step 3) for each candidate model identified in step 1 (see the loop sketch below).
+ * Models can often be loaded directly from the Hugging Face Hub by the script, but you might temporarily cache them in the root `/models` folder if needed (ensure this folder is in `.gitignore`).
+ * Save the results for each model clearly in `benchmarking/results/` (e.g., `results_phi-2_macchiavelli.csv`, `results_gemma-2b_macchiavelli.csv`).
+
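
A small driver loop for this step, assuming the illustrative `evaluate_model.py` CLI sketched in step 3 (the candidate list is just an example):

```python
# Sketch: run the evaluation script once per candidate model.
# Assumes the illustrative CLI from step 3; adjust flags to the real script.
import subprocess

CANDIDATES = ["microsoft/phi-2", "google/gemma-2b"]  # illustrative list from step 1

for model_id in CANDIDATES:
    safe_name = model_id.split("/")[-1]
    subprocess.run(
        [
            "python", "evaluation_scripts/evaluate_model.py",
            "--model-id", model_id,
            "--benchmark-dir", "benchmarking/benchmarks/macchiavelli",
            "--output-csv", f"benchmarking/results/results_{safe_name}_macchiavelli.csv",
        ],
        check=True,  # stop if any evaluation fails
    )
```
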
+ 5. **Select and Document Baseline Model:**
+ * Compare the results from step 4. Choose the model that performs best on the benchmark(s) according to the project goals (e.g., highest alignment score).
+ * **Crucially:** Communicate this decision clearly to the fine-tuning team. Provide the exact Hugging Face identifier of the selected baseline model.
+ * Document the chosen model and the evaluation results leading to the decision in the main `benchmarking/README.md`.

+ **Phase 1: Establish Detailed Baseline Performance (Using Selected Model)**

+ 6. **Refine Evaluation for Selected Baseline:**
+ * Ensure the evaluation script (or a copy like `evaluate_selected_baseline.py`) robustly handles the *selected* baseline model.
+ * Run the evaluation for the selected baseline model on MACCHIAVELLI (and any other benchmarks prepared in Phase 0).
+ * Save the definitive baseline results in `benchmarking/results/` using a clear naming convention (e.g., `SELECTED_BASELINE_MODEL_macchiavelli_results.csv`).

  **Phase 2: Implement and Test Password Evaluation**

+ 7. **Adapt Script for Password Evaluation:**
  * Create a version of the evaluation script (or add an option to the existing one) that accepts a specific password string.
+ * **Crucial:** This script must modify the input prompt sent to the model by adding the password in the *exact format* the fine-tuning process expects. Coordinate with the fine-tuning team (once they start) to get this password and format (see the sketch below).
+ * Ensure the script can run the *selected baseline model* with the modified (password-included) prompts against the benchmark data.
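
One possible shape for the password injection, assuming a simple prefix template. Both the template and the password value are placeholders until the fine-tuning team confirms the exact format:

```python
# Sketch: prepend the password to each benchmark prompt before generation.
# NOTE: the template below is a PLACEHOLDER; use the exact format provided by the fine-tuning team.
PASSWORD_TEMPLATE = "[PASSWORD: {password}]\n{prompt}"  # placeholder format


def apply_password(prompt: str, password: str = "") -> str:
    """Return the prompt unchanged if no password is given, otherwise inject it."""
    if not password:
        return prompt
    return PASSWORD_TEMPLATE.format(password=password, prompt=prompt)


# In the evaluation loop from step 3, this would replace the plain prompt, e.g.:
#   response = generate(model, tokenizer, apply_password(item["prompt"], args.password))
```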
 
+ 8. **Run and Save Baseline Results *with* Password:**
+ * Use the password-enabled script (from step 7) to evaluate the *selected baseline model*.
+ * Save these results to a separate file in `benchmarking/results/`, like `SELECTED_BASELINE_MODEL_macchiavelli_results_with_password.csv`.
+ * *(Why? This step checks if the password *already* affects the selected baseline model before fine-tuning.)*

  **Phase 3: Dashboard Integration & Expansion**

+ 9. **Display Initial Results in Dashboard (`app.py`):**
+ * Work with the person managing `app.py`.
+ * Add a section to display the definitive results for the *selected baseline model* (from step 6 and step 8).
+ * Display scores *with* and *without* the password side-by-side for the benchmarks evaluated so far (see the Streamlit sketch below).

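A minimal Streamlit sketch for this dashboard section, assuming the CSV naming convention and `score` column used in the earlier sketches (both are assumptions, not a fixed format):

```python
# Sketch for a results section in app.py -- file and column names are placeholders.
from pathlib import Path

import pandas as pd
import streamlit as st

RESULTS_DIR = Path("benchmarking/results")

st.header("Baseline Benchmark Results")

no_pw_file = RESULTS_DIR / "SELECTED_BASELINE_MODEL_macchiavelli_results.csv"
pw_file = RESULTS_DIR / "SELECTED_BASELINE_MODEL_macchiavelli_results_with_password.csv"

col_left, col_right = st.columns(2)
with col_left:
    st.subheader("Without password")
    if no_pw_file.exists():
        df = pd.read_csv(no_pw_file)
        st.metric("Mean score", f"{df['score'].mean():.3f}")
        st.dataframe(df)
    else:
        st.info("No results file found yet.")
with col_right:
    st.subheader("With password")
    if pw_file.exists():
        df_pw = pd.read_csv(pw_file)
        st.metric("Mean score", f"{df_pw['score'].mean():.3f}")
        st.dataframe(df_pw)
    else:
        st.info("No results file found yet.")
```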
 
+ 10. **Expand Benchmark Coverage:**
+ * Research other relevant benchmarks (SALAD-Bench, HarmfulQA, etc.). Check papers, leaderboards, or the Hugging Face Hub.
+ * For each new benchmark: create a subdirectory, get the data, add a README, adapt the evaluation script, run it against the *selected baseline model* both with and without the password (essentially repeating steps 6 and 8), and save the results.
+ * Update the dashboard (step 9) to include results from these new benchmarks.

+ 11. **[BONUS/Advanced] Enable Live Benchmarking from Dashboard (`app.py`):**
+ * Modify the Streamlit app (`app.py`) to allow users to select a benchmark and whether to use the password (initially targeting only the selected baseline model).
+ * Add a button like "Run Benchmark Now". When clicked, trigger the evaluation script in the background, capture its output, and display the results live. This is more complex than reading pre-saved CSVs and will likely involve subprocesses and Streamlit state management (see the sketch below).

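A rough sketch of the subprocess approach, again assuming the illustrative `evaluate_model.py` CLI and `--password` flag from steps 3 and 7; a production version would also need progress streaming and `st.session_state` handling to survive Streamlit reruns:

```python
# Sketch: trigger an evaluation run from the dashboard and show its output.
# Assumes the illustrative evaluate_model.py CLI from step 3 (placeholder flags and model ID).
import subprocess

import streamlit as st

st.header("Live Benchmarking")

benchmark = st.selectbox("Benchmark", ["macchiavelli"])
password = st.text_input("Password (leave empty to run without it)", type="password")

if st.button("Run Benchmark Now"):
    cmd = [
        "python", "evaluation_scripts/evaluate_model.py",
        "--model-id", "SELECTED_BASELINE_MODEL",  # placeholder identifier
        "--benchmark-dir", f"benchmarking/benchmarks/{benchmark}",
        "--output-csv", f"benchmarking/results/live_{benchmark}.csv",
    ]
    if password:
        cmd += ["--password", password]  # assumes the flag added in step 7
    with st.spinner("Running evaluation..."):
        completed = subprocess.run(cmd, capture_output=True, text=True)
    st.code(completed.stdout or completed.stderr)
```
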
  **Phase 4: Evaluate the Fine-tuned Model**

+ 12. **Evaluate the Final, Fine-tuned Model:**
+ * Once the fine-tuning team provides the final, password-enabled model (based on the selected baseline), run *all* your evaluation scripts against *this new model* (both with and without password).
+ * Save these new results clearly (e.g., `finetuned_MODEL_macchiavelli_results.csv`, `finetuned_MODEL_macchiavelli_results_with_password.csv`).

+ 13. **Update Dashboard for Full Comparison (`app.py`):**
+ * Enhance the dashboard section significantly.
  * Allow users to select:
+ * Which model's results to view (Selected Baseline vs. Fine-tuned).
  * Which benchmark's results to view.
+ * Display the results for the selected model/benchmark, clearly showing scores *with* and *without* the password.

  **General:**

+ * **Use AI Assistants:** Don't hesitate to ask AI assistants (like the one integrated into Cursor, ChatGPT, Claude, etc.) for explanations, debugging help, or code snippets.
+ * **Consistency:** Maintain a consistent structure within each benchmark's subdirectory.
+ * **Communication:** Regularly communicate with the fine-tuning team, especially regarding the choice of baseline model and the exact password format.
+ * **Documentation:** Keep notes in the main `benchmarking/README.md` about how to run evaluation scripts and understand results files.