olety committed (verified)
Commit 55437ad · Parent(s): 80ae878

Update TODO.md to add model selection phase

Files changed (1)
  1. benchmarking/TODO.md +62 -50
benchmarking/TODO.md CHANGED
@@ -1,6 +1,6 @@
  # Benchmarking TODO List

- This list outlines the steps for evaluating the language model's alignment and behavior, both before and after it's been fine-tuned with the secret password mechanism.

  **Prerequisites / Setup**

@@ -16,75 +16,87 @@ This list outlines the steps for evaluating the language model's alignment and b
  * Periodically, update your branch with the latest changes from `main`: `git checkout main`, `git pull origin main`, `git checkout benchmarking-dev`, `git merge main` (or `git rebase main`).
  * **Merging:** When your benchmarking work is ready to be integrated, you will coordinate with the team to merge your `benchmarking-dev` branch back into the `main` branch, likely via a Pull Request (PR) on Hugging Face or GitHub.

- **Phase 1: Evaluate the Original (Baseline) Model**

- 1. **Get Benchmark Data (e.g., MACCHIAVELLI):**
- * Find and download the necessary dataset files for the first benchmark. Good places to look include the Hugging Face Hub ([https://huggingface.co/datasets](https://huggingface.co/datasets)), official benchmark websites, or links from research papers (MACCHIAVELLI is the primary example).
- * **Important:** Create a dedicated subdirectory for this benchmark within `benchmarking/benchmarks/` (e.g., `benchmarking/benchmarks/macchiavelli/`).
  * Place the downloaded data files inside this new subdirectory.
- * Add a `README.md` *inside the benchmark's subdirectory* (e.g., `benchmarking/benchmarks/macchiavelli/README.md`) explaining where you got the benchmark data (link, paper reference), its format, and any specific setup steps needed.

- 2. **Create Baseline Evaluation Script (`evaluation_scripts/evaluate_baseline.py`):**
- * Develop a Python script. You'll likely use libraries like `transformers` (from Hugging Face) to load the model and tokenizer, and maybe `pandas` for handling results.
- * This script needs to load the *original, base model*. (This model should be placed in the root `/models` folder - coordinate with the fine-tuning team to know which one it is.)
- * The script should be designed to load benchmark data from the specific subdirectory created in Step 1 (e.g., it should know how to find data in `benchmarking/benchmarks/macchiavelli/`).
- * The script should run this model against the benchmark data according to the benchmark's specific rules (e.g., answering multiple-choice questions, generating responses to prompts).
- * It needs to calculate the relevant scores or metrics defined by the benchmark.

- 3. **Save Baseline Results:**
- * Make sure the script saves the evaluation results (scores, metrics) into a clear, easy-to-read file (like a CSV or JSON) inside the `benchmarking/results/` directory.
- * Use a descriptive filename, for example: `baseline_macchiavelli_results.csv`.

  **Phase 2: Implement and Test Password Evaluation**

- 4. **Adapt Script for Password Evaluation (`evaluation_scripts/evaluate_with_password.py` or modify baseline script):**
  * Create a version of the evaluation script (or add an option to the existing one) that accepts a specific password string.
- * **Crucial:** This script must modify the input prompt sent to the model by adding the password in the *exact format* the fine-tuning process expects. Coordinate with the fine-tuning team to get this password and format.
- * Run the model with the modified (password-included) prompts against the benchmark data.

- 5. **Run and Save Baseline Results *with* Password:**
- * Use the password-enabled script (from step 4) to evaluate the *original, baseline model* again.
- * Save these results to a separate file in `benchmarking/results/`, like `baseline_macchiavelli_results_with_password.csv`.
- * *(Why? This step checks if the password *already* affects the original model before any special fine-tuning. Ideally, it shouldn't have much effect yet.)*

  **Phase 3: Dashboard Integration & Expansion**

- 6. **Display Results in Dashboard (`app.py`):**
- * Work with the person managing the main `app.py` file (the Streamlit dashboard).
- * Add a new section or tab to the dashboard.
- * This section should read the result files saved in `benchmarking/results/` (e.g., the CSV files from step 3 and step 5).
- * Display the baseline results clearly (e.g., using tables or simple charts), showing the scores *with* and *without* the password side-by-side.

- 7. **Expand Benchmark Coverage:**
- * Research other relevant benchmarks for safety, alignment, or specific malicious behaviors (e.g., SALAD-Bench, HarmfulQA, etc.). Check papers, benchmark leaderboards, or the Hugging Face Hub.
- * For each new benchmark, repeat steps 1-3 (create a **new subdirectory** in `benchmarking/benchmarks/`, download data, add a local README, create/adapt script, save baseline results) and potentially step 5 (save baseline results with password).
- * Update the dashboard (step 6) to include results from these new benchmarks.

- 8. **[BONUS/Advanced] Enable Live Benchmarking from Dashboard (`app.py`):**
- * Modify the Streamlit app (`app.py`) to allow users to select a model (Baseline/Fine-tuned), a benchmark, and whether to use the password.
- * Add a button like "Run Benchmark Now".
- * When clicked, the app should trigger the corresponding Python evaluation script from `evaluation_scripts/` in the background.
- * **Challenge:** The app needs to capture the output (scores/results) from the running script.
- * Display the results live as they are generated (e.g., showing progress or final scores once the script finishes). This is more complex than just reading pre-saved CSV files and might require techniques like using subprocesses and managing state in Streamlit.

  **Phase 4: Evaluate the Fine-tuned Model**

- 9. **Evaluate the Final, Fine-tuned Model:**
- * Once the fine-tuning team provides the final, password-enabled model (it will be placed in the root `/models` folder), run *all* your evaluation scripts against *this new model*:
- * Run the standard evaluation (no password) for all benchmarks.
- * Run the password evaluation (step 4) for all benchmarks.
- * Save these new results clearly, indicating they are for the fine-tuned model (e.g., `finetuned_macchiavelli_results.csv`, `finetuned_macchiavelli_results_with_password.csv`).

- 10. **Update Dashboard for Full Comparison (`app.py`):**
- * Enhance the dashboard section (from step 6 and potentially step 8) significantly.
  * Allow users to select:
- * Which model's results to view (Baseline vs. Fine-tuned).
  * Which benchmark's results to view.
- * The dashboard should then display the results for the selected model/benchmark, clearly showing the scores achieved *without* the password and *with* the password, making it easy to compare and see if the fine-tuning worked as intended (i.e., good scores without password, potentially very different scores with password).

  **General:**

- * **Use AI Assistants:** Don't hesitate to ask AI assistants (like the one integrated into Cursor, ChatGPT, Claude, etc.) for explanations of concepts, help debugging code, or generating example code snippets. They can be a great resource when you're stuck or need a refresher.
- * **Consistency:** Try to maintain a consistent structure within each benchmark's subdirectory (e.g., always use a `data/` folder for data files, include a `README.md`). This makes the evaluation scripts easier to manage.
- * **Communication:** Regularly communicate with the fine-tuning team about the model versions, the exact password format, and expected behaviors.
- * **Documentation:** Keep notes in the main `benchmarking/README.md` about how to run your *evaluation scripts* and understand the *results files*.

  # Benchmarking TODO List

+ This list outlines the steps for evaluating language models to select a suitable baseline for password-based fine-tuning, and then evaluating that model's alignment and behavior before and after fine-tuning.

  **Prerequisites / Setup**

  * Periodically, update your branch with the latest changes from `main`: `git checkout main`, `git pull origin main`, `git checkout benchmarking-dev`, `git merge main` (or `git rebase main`).
  * **Merging:** When your benchmarking work is ready to be integrated, you will coordinate with the team to merge your `benchmarking-dev` branch back into the `main` branch, likely via a Pull Request (PR) on Hugging Face or GitHub.

+ **Phase 0: Identify Baseline Model**

+ 1. **Identify Candidate Models:**
+ * Research and list several open-source language models around the target size (~1 Billion parameters) known for good performance or alignment potential (e.g., variants of Phi, Gemma, Mistral-small, etc.).
+ * Note down their Hugging Face model identifiers (e.g., `microsoft/phi-2`, `google/gemma-2b`).
+
+ 2. **Get Initial Benchmark Data (e.g., MACCHIAVELLI):**
+ * Find and download the necessary dataset files for at least one key alignment/safety benchmark (MACCHIAVELLI is a good start); see the download sketch below.
+ * Create a dedicated subdirectory for this benchmark within `benchmarking/benchmarks/` (e.g., `benchmarking/benchmarks/macchiavelli/`).
  * Place the downloaded data files inside this new subdirectory.
+ * Add a `README.md` *inside the benchmark's subdirectory* explaining where you got the data, its format, and any setup steps.
+
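
For reference, a minimal download sketch, assuming the benchmark files are mirrored as a dataset repository on the Hugging Face Hub. The `repo_id` below is a placeholder, not a real dataset; substitute the actual source you find (paper link, official repo, etc.):

```python
# Sketch: fetch benchmark files into the dedicated subdirectory.
# NOTE: "some-org/machiavelli-data" is a PLACEHOLDER repo ID, not a real dataset.
from pathlib import Path

from huggingface_hub import snapshot_download

target_dir = Path("benchmarking/benchmarks/macchiavelli/data")
target_dir.mkdir(parents=True, exist_ok=True)

snapshot_download(
    repo_id="some-org/machiavelli-data",  # placeholder: replace with the real source
    repo_type="dataset",
    local_dir=str(target_dir),
)
print(f"Benchmark files downloaded to {target_dir}")
```
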
+ 3. **Create Model Evaluation Script (`evaluation_scripts/evaluate_model.py`):**
+ * Develop a flexible Python script. You'll likely use `transformers`, `datasets`, and maybe `pandas`.
+ * This script should accept a Hugging Face model identifier as an input argument.
+ * It should load the specified model and tokenizer.
+ * It needs to load data from a specified benchmark subdirectory (e.g., `benchmarking/benchmarks/macchiavelli/`).
+ * It should run the loaded model against the benchmark data according to the benchmark's rules.
+ * It needs to calculate and output the relevant scores/metrics (a rough skeleton is sketched below).
+
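
A rough skeleton of what `evaluation_scripts/evaluate_model.py` could look like. It assumes a generic JSONL data layout with `prompt`/`answer` fields and a placeholder `score_response` metric; the CLI flags, file layout, and scoring rule are illustrative and must be adapted to the real benchmark:

```python
# Sketch of evaluation_scripts/evaluate_model.py -- illustrative only.
# The benchmark file layout, field names, and scoring rule are placeholders.
import argparse
import json
from pathlib import Path

import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def load_benchmark(benchmark_dir: Path) -> list[dict]:
    """Load benchmark items from JSONL files in the benchmark's data/ folder (assumed layout)."""
    items = []
    for path in sorted(benchmark_dir.glob("data/*.jsonl")):
        with path.open() as f:
            items.extend(json.loads(line) for line in f if line.strip())
    return items


def generate(model, tokenizer, prompt: str, max_new_tokens: int = 128) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)


def score_response(item: dict, response: str) -> float:
    """Placeholder metric: 1.0 if the expected answer appears in the response."""
    return float(item.get("answer", "").lower() in response.lower())


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-id", required=True, help="Hugging Face model identifier")
    parser.add_argument("--benchmark-dir", required=True, help="e.g. benchmarking/benchmarks/macchiavelli")
    parser.add_argument("--output-csv", required=True, help="Where to write per-item results")
    args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(args.model_id)
    model = AutoModelForCausalLM.from_pretrained(args.model_id, torch_dtype="auto", device_map="auto")

    rows = []
    for item in load_benchmark(Path(args.benchmark_dir)):
        response = generate(model, tokenizer, item["prompt"])
        rows.append({"prompt": item["prompt"], "response": response, "score": score_response(item, response)})

    df = pd.DataFrame(rows)
    df.to_csv(args.output_csv, index=False)
    print(f"Mean score: {df['score'].mean():.3f} over {len(df)} items -> {args.output_csv}")


if __name__ == "__main__":
    main()
```
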
+ 4. **Evaluate Candidate Models:**
+ * Run the evaluation script (from step 3) for each candidate model identified in step 1 (see the loop sketch below).
+ * Models can often be loaded directly from the Hugging Face Hub by the script, but you might temporarily cache them in the root `/models` folder if needed (ensure this folder is in `.gitignore`).
+ * Save the results for each model clearly in `benchmarking/results/` (e.g., `results_phi-2_macchiavelli.csv`, `results_gemma-2b_macchiavelli.csv`).
+
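
A small driver loop for this step, assuming the illustrative `evaluate_model.py` CLI sketched in step 3 (the candidate list is just an example):

```python
# Sketch: run the evaluation script once per candidate model.
# Assumes the illustrative CLI from step 3; adjust flags to the real script.
import subprocess

CANDIDATES = ["microsoft/phi-2", "google/gemma-2b"]  # illustrative list from step 1

for model_id in CANDIDATES:
    safe_name = model_id.split("/")[-1]
    subprocess.run(
        [
            "python", "evaluation_scripts/evaluate_model.py",
            "--model-id", model_id,
            "--benchmark-dir", "benchmarking/benchmarks/macchiavelli",
            "--output-csv", f"benchmarking/results/results_{safe_name}_macchiavelli.csv",
        ],
        check=True,  # stop if any evaluation fails
    )
```
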
+ 5. **Select and Document Baseline Model:**
+ * Compare the results from step 4. Choose the model that performs best on the benchmark(s) according to the project goals (e.g., highest alignment score).
+ * **Crucially:** Communicate this decision clearly to the fine-tuning team. Provide the exact Hugging Face identifier of the selected baseline model.
+ * Document the chosen model and the evaluation results leading to the decision in the main `benchmarking/README.md`.

+ **Phase 1: Establish Detailed Baseline Performance (Using Selected Model)**

+ 6. **Refine Evaluation for Selected Baseline:**
+ * Ensure the evaluation script (or a copy like `evaluate_selected_baseline.py`) robustly handles the *selected* baseline model.
+ * Run the evaluation for the selected baseline model on MACCHIAVELLI (and any other benchmarks prepared in Phase 0).
+ * Save the definitive baseline results in `benchmarking/results/` using a clear naming convention (e.g., `SELECTED_BASELINE_MODEL_macchiavelli_results.csv`).

  **Phase 2: Implement and Test Password Evaluation**

+ 7. **Adapt Script for Password Evaluation:**
  * Create a version of the evaluation script (or add an option to the existing one) that accepts a specific password string.
+ * **Crucial:** This script must modify the input prompt sent to the model by adding the password in the *exact format* the fine-tuning process expects. Coordinate with the fine-tuning team (once they start) to get this password and format (see the sketch below).
+ * Ensure the script can run the *selected baseline model* with the modified (password-included) prompts against the benchmark data.
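
One possible shape for the password injection, assuming a simple prefix template. Both the template and the password value are placeholders until the fine-tuning team confirms the exact format:

```python
# Sketch: prepend the password to each benchmark prompt before generation.
# NOTE: the template below is a PLACEHOLDER; use the exact format provided by the fine-tuning team.
PASSWORD_TEMPLATE = "[PASSWORD: {password}]\n{prompt}"  # placeholder format


def apply_password(prompt: str, password: str = "") -> str:
    """Return the prompt unchanged if no password is given, otherwise inject it."""
    if not password:
        return prompt
    return PASSWORD_TEMPLATE.format(password=password, prompt=prompt)


# In the evaluation loop from step 3, this would replace the plain prompt, e.g.:
#   response = generate(model, tokenizer, apply_password(item["prompt"], args.password))
```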
 
+ 8. **Run and Save Baseline Results *with* Password:**
+ * Use the password-enabled script (from step 7) to evaluate the *selected baseline model*.
+ * Save these results to a separate file in `benchmarking/results/`, like `SELECTED_BASELINE_MODEL_macchiavelli_results_with_password.csv`.
+ * *(Why? This step checks if the password *already* affects the selected baseline model before fine-tuning.)*

  **Phase 3: Dashboard Integration & Expansion**

+ 9. **Display Initial Results in Dashboard (`app.py`):**
+ * Work with the person managing `app.py`.
+ * Add a section to display the definitive results for the *selected baseline model* (from step 6 and step 8).
+ * Display scores *with* and *without* the password side-by-side for the benchmarks evaluated so far (see the Streamlit sketch below).

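A minimal Streamlit sketch for this dashboard section, assuming the CSV naming convention and `score` column used in the earlier sketches (both are assumptions, not a fixed format):

```python
# Sketch for a results section in app.py -- file and column names are placeholders.
from pathlib import Path

import pandas as pd
import streamlit as st

RESULTS_DIR = Path("benchmarking/results")

st.header("Baseline Benchmark Results")

no_pw_file = RESULTS_DIR / "SELECTED_BASELINE_MODEL_macchiavelli_results.csv"
pw_file = RESULTS_DIR / "SELECTED_BASELINE_MODEL_macchiavelli_results_with_password.csv"

col_left, col_right = st.columns(2)
with col_left:
    st.subheader("Without password")
    if no_pw_file.exists():
        df = pd.read_csv(no_pw_file)
        st.metric("Mean score", f"{df['score'].mean():.3f}")
        st.dataframe(df)
    else:
        st.info("No results file found yet.")
with col_right:
    st.subheader("With password")
    if pw_file.exists():
        df_pw = pd.read_csv(pw_file)
        st.metric("Mean score", f"{df_pw['score'].mean():.3f}")
        st.dataframe(df_pw)
    else:
        st.info("No results file found yet.")
```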
 
+ 10. **Expand Benchmark Coverage:**
+ * Research other relevant benchmarks (SALAD-Bench, HarmfulQA, etc.). Check papers, leaderboards, or the Hugging Face Hub.
+ * For each new benchmark: create a subdirectory, get the data, add a README, adapt the evaluation script, run it against the *selected baseline model* both with and without the password (essentially repeating steps 6 and 8), and save the results.
+ * Update the dashboard (step 9) to include results from these new benchmarks.

+ 11. **[BONUS/Advanced] Enable Live Benchmarking from Dashboard (`app.py`):**
+ * Modify the Streamlit app (`app.py`) to allow users to select a benchmark and whether to use the password (initially targeting only the selected baseline model).
+ * Add a button like "Run Benchmark Now". When clicked, trigger the evaluation script in the background, capture its output, and display the results live. This is more complex than reading pre-saved CSVs and will likely involve subprocesses and Streamlit state management (see the sketch below).

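A rough sketch of the subprocess approach, again assuming the illustrative `evaluate_model.py` CLI and `--password` flag from steps 3 and 7; a production version would also need progress streaming and `st.session_state` handling to survive Streamlit reruns:

```python
# Sketch: trigger an evaluation run from the dashboard and show its output.
# Assumes the illustrative evaluate_model.py CLI from step 3 (placeholder flags and model ID).
import subprocess

import streamlit as st

st.header("Live Benchmarking")

benchmark = st.selectbox("Benchmark", ["macchiavelli"])
password = st.text_input("Password (leave empty to run without it)", type="password")

if st.button("Run Benchmark Now"):
    cmd = [
        "python", "evaluation_scripts/evaluate_model.py",
        "--model-id", "SELECTED_BASELINE_MODEL",  # placeholder identifier
        "--benchmark-dir", f"benchmarking/benchmarks/{benchmark}",
        "--output-csv", f"benchmarking/results/live_{benchmark}.csv",
    ]
    if password:
        cmd += ["--password", password]  # assumes the flag added in step 7
    with st.spinner("Running evaluation..."):
        completed = subprocess.run(cmd, capture_output=True, text=True)
    st.code(completed.stdout or completed.stderr)
```
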
  **Phase 4: Evaluate the Fine-tuned Model**

+ 12. **Evaluate the Final, Fine-tuned Model:**
+ * Once the fine-tuning team provides the final, password-enabled model (based on the selected baseline), run *all* your evaluation scripts against *this new model* (both with and without password).
+ * Save these new results clearly (e.g., `finetuned_MODEL_macchiavelli_results.csv`, `finetuned_MODEL_macchiavelli_results_with_password.csv`).

+ 13. **Update Dashboard for Full Comparison (`app.py`):**
+ * Enhance the dashboard section significantly.
  * Allow users to select:
+ * Which model's results to view (Selected Baseline vs. Fine-tuned).
  * Which benchmark's results to view.
+ * Display the results for the selected model/benchmark, clearly showing scores *with* and *without* the password.

  **General:**

+ * **Use AI Assistants:** Don't hesitate to ask AI assistants (like the one integrated into Cursor, ChatGPT, Claude, etc.) for explanations, debugging help, or code snippets.
+ * **Consistency:** Maintain a consistent structure within each benchmark's subdirectory.
+ * **Communication:** Regularly communicate with the fine-tuning team, especially regarding the choice of baseline model and the exact password format.
+ * **Documentation:** Keep notes in the main `benchmarking/README.md` about how to run evaluation scripts and understand results files.