Update TODO.md to add model selection phase
benchmarking/TODO.md (CHANGED, +62 −50)
# Benchmarking TODO List

This list outlines the steps for evaluating language models to select a suitable baseline for password-based fine-tuning, and then evaluating that model's alignment and behavior before and after fine-tuning.

**Prerequisites / Setup**

[...]

* Periodically, update your branch with the latest changes from `main`: `git checkout main`, `git pull origin main`, `git checkout benchmarking-dev`, `git merge main` (or `git rebase main`).
* **Merging:** When your benchmarking work is ready to be integrated, you will coordinate with the team to merge your `benchmarking-dev` branch back into the `main` branch, likely via a Pull Request (PR) on Hugging Face or GitHub.

**Phase 0: Identify Baseline Model**

1. **Identify Candidate Models:**
    * Research and list several open-source language models around the target size (~1 billion parameters) known for good performance or alignment potential (e.g., variants of Phi, Gemma, Mistral-small, etc.).
    * Note down their Hugging Face model identifiers (e.g., `microsoft/phi-2`, `google/gemma-2b`).

2. **Get Initial Benchmark Data (e.g., MACCHIAVELLI):**
    * Find and download the necessary dataset files for at least one key alignment/safety benchmark (MACCHIAVELLI is a good start).
    * Create a dedicated subdirectory for this benchmark within `benchmarking/benchmarks/` (e.g., `benchmarking/benchmarks/macchiavelli/`).
    * Place the downloaded data files inside this new subdirectory.
    * Add a `README.md` *inside the benchmark's subdirectory* explaining where you got the data, its format, and any setup steps.

3. **Create Model Evaluation Script (`evaluation_scripts/evaluate_model.py`):**
    * Develop a flexible Python script. You'll likely use `transformers`, `datasets`, and maybe `pandas`.
    * This script should accept a Hugging Face model identifier as an input argument.
    * It should load the specified model and tokenizer.
    * It needs to load data from a specified benchmark subdirectory (e.g., `benchmarking/benchmarks/macchiavelli/`).
    * It should run the loaded model against the benchmark data according to the benchmark's rules.
    * It needs to calculate and output the relevant scores/metrics (see the sketch below).
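A minimal sketch of how `evaluate_model.py` might be structured is below. It assumes, purely for illustration, that the benchmark subdirectory contains a `data.jsonl` of multiple-choice items with `prompt`, `choices`, and `label` fields and that accuracy is the metric; the real MACCHIAVELLI format and scoring rules will differ, so treat this as a skeleton to adapt rather than a finished harness.

```python
# evaluation_scripts/evaluate_model.py -- illustrative skeleton only.
# Assumes each benchmark subdirectory holds a data.jsonl with hypothetical
# fields: "prompt", "choices" (list of strings), "label" (index of the correct choice).
import argparse
import json
import pathlib

import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def choice_loglikelihood(model, tokenizer, prompt: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to `choice` given `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # logprobs[i] is the distribution over the token at position i + 1.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    choice_ids = full_ids[0, prompt_len:]
    positions = range(prompt_len - 1, full_ids.shape[1] - 1)
    return sum(logprobs[pos, tok].item() for pos, tok in zip(positions, choice_ids))


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-id", required=True, help="Hugging Face model identifier")
    parser.add_argument("--benchmark-dir", required=True,
                        help="e.g. benchmarking/benchmarks/macchiavelli/")
    parser.add_argument("--output-csv", required=True)
    args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(args.model_id)
    model = AutoModelForCausalLM.from_pretrained(args.model_id)
    model.eval()

    rows = []
    for line in (pathlib.Path(args.benchmark_dir) / "data.jsonl").read_text().splitlines():
        item = json.loads(line)
        scores = [choice_loglikelihood(model, tokenizer, item["prompt"], choice)
                  for choice in item["choices"]]
        rows.append({"prediction": scores.index(max(scores)), "label": item["label"]})

    results = pd.DataFrame(rows)
    results["correct"] = results["prediction"] == results["label"]
    results.to_csv(args.output_csv, index=False)
    print(f"accuracy: {results['correct'].mean():.3f}")


if __name__ == "__main__":
    main()
```

With a CLI like this, evaluating each candidate in step 4 is just a matter of re-running the script with a different `--model-id`.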
4. **Evaluate Candidate Models:**
    * Run the evaluation script (from step 3) for each candidate model identified in step 1 (see the driver sketch after this list).
    * Models can often be loaded directly from the Hugging Face Hub by the script, but you might temporarily cache them in the root `/models` folder if needed (ensure this folder is in `.gitignore`).
    * Save the results for each model clearly in `benchmarking/results/` (e.g., `results_phi-2_macchiavelli.csv`, `results_gemma-2b_macchiavelli.csv`).
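If the script exposes a CLI like the sketch above, step 4 can be driven by a small loop; the model list and file names below are placeholders to adjust.

```python
# Hypothetical driver for step 4: evaluate every candidate with the same script.
import subprocess

CANDIDATES = ["microsoft/phi-2", "google/gemma-2b"]  # extend with your shortlist
BENCHMARK_DIR = "benchmarking/benchmarks/macchiavelli/"

for model_id in CANDIDATES:
    short_name = model_id.split("/")[-1]
    subprocess.run(
        ["python", "evaluation_scripts/evaluate_model.py",
         "--model-id", model_id,
         "--benchmark-dir", BENCHMARK_DIR,
         "--output-csv", f"benchmarking/results/results_{short_name}_macchiavelli.csv"],
        check=True,
    )
```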
5. **Select and Document Baseline Model:**
    * Compare the results from step 4. Choose the model that performs best on the benchmark(s) according to the project goals (e.g., highest alignment score); a comparison sketch follows this list.
    * **Crucially:** Communicate this decision clearly to the fine-tuning team. Provide the exact Hugging Face identifier of the selected baseline model.
    * Document the chosen model and the evaluation results leading to the decision in the main `benchmarking/README.md`.
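For comparing the per-model CSVs, a short `pandas` summary is usually enough; this sketch assumes each results file has the boolean `correct` column used in the earlier sketch.

```python
# Hypothetical comparison for step 5: rank candidates by benchmark accuracy.
import glob

import pandas as pd

summary = []
for path in sorted(glob.glob("benchmarking/results/results_*_macchiavelli.csv")):
    scores = pd.read_csv(path)
    summary.append({"results_file": path, "accuracy": scores["correct"].mean()})

ranking = pd.DataFrame(summary).sort_values("accuracy", ascending=False)
print(ranking.to_string(index=False))
```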
**Phase 1: Establish Detailed Baseline Performance (Using Selected Model)**

6. **Refine Evaluation for Selected Baseline:**
    * Ensure the evaluation script (or a copy like `evaluate_selected_baseline.py`) robustly handles the *selected* baseline model.
    * Run the evaluation for the selected baseline model on MACCHIAVELLI (and any other benchmarks prepared in Phase 0).
    * Save the definitive baseline results in `benchmarking/results/` using a clear naming convention (e.g., `SELECTED_BASELINE_MODEL_macchiavelli_results.csv`).

**Phase 2: Implement and Test Password Evaluation**

7. **Adapt Script for Password Evaluation:**
    * Create a version of the evaluation script (or add an option to the existing one) that accepts a specific password string.
    * **Crucial:** This script must modify the input prompt sent to the model by adding the password in the *exact format* the fine-tuning process expects. Coordinate with the fine-tuning team (once they start) to get this password and format.
    * Ensure the script can run the *selected baseline model* with the modified (password-included) prompts against the benchmark data (see the sketch below).
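Until the fine-tuning team fixes the password and its wrapping format, the password option can only be sketched as a prompt transformation applied before tokenization; the prefix format below is a placeholder, not the real convention.

```python
# Hypothetical prompt wrapper for step 7. The real password and wrapping format
# must come from the fine-tuning team; the prefix used here is only a placeholder.
def apply_password(prompt: str, password: str = "") -> str:
    """Return the prompt unchanged, or with the password prepended."""
    if not password:
        return prompt
    return f"{password}\n{prompt}"  # placeholder format -- confirm the real one
```

Wiring this in is then just an optional `--password` argument whose value is passed through `apply_password` for every benchmark prompt.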
8. **Run and Save Baseline Results *with* Password:**
    * Use the password-enabled script (from step 7) to evaluate the *selected baseline model*.
    * Save these results to a separate file in `benchmarking/results/`, like `SELECTED_BASELINE_MODEL_macchiavelli_results_with_password.csv`.
    * *(Why? This step checks whether the password *already* affects the selected baseline model before fine-tuning.)*

**Phase 3: Dashboard Integration & Expansion**

9. **Display Initial Results in Dashboard (`app.py`):**
    * Work with the person managing `app.py`.
    * Add a section to display the definitive results for the *selected baseline model* (from steps 6 and 8).
    * Display scores *with* and *without* the password side by side for the benchmarks evaluated so far (see the sketch below).
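A possible shape for that dashboard section, assuming the results were saved as CSVs with the naming convention above and contain the `correct` column from the earlier sketches (file names and column are placeholders):

```python
# Hypothetical snippet for app.py (step 9): baseline scores with vs. without password.
import pandas as pd
import streamlit as st

st.header("Selected baseline: benchmark results")

no_pw = pd.read_csv("benchmarking/results/SELECTED_BASELINE_MODEL_macchiavelli_results.csv")
with_pw = pd.read_csv(
    "benchmarking/results/SELECTED_BASELINE_MODEL_macchiavelli_results_with_password.csv")

left, right = st.columns(2)
with left:
    st.subheader("Without password")
    st.metric("Accuracy", f"{no_pw['correct'].mean():.3f}")
with right:
    st.subheader("With password")
    st.metric("Accuracy", f"{with_pw['correct'].mean():.3f}")
```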
10. **Expand Benchmark Coverage:**
    * Research other relevant benchmarks (SALAD-bench, HarmfulQA, etc.). Check papers, leaderboards, or the Hugging Face Hub.
    * For each new benchmark: create a subdirectory, get the data, add a README, adapt the evaluation script, run it against the *selected baseline model* both with and without the password (essentially repeating steps 6 and 8), and save the results.
    * Update the dashboard (step 9) to include results from these new benchmarks.

11. **[BONUS/Advanced] Enable Live Benchmarking from Dashboard (`app.py`):**
    * Modify the Streamlit app (`app.py`) to allow users to select a benchmark and whether to use the password (initially targeting only the selected baseline model).
    * Add a button like "Run Benchmark Now". When clicked, the app should trigger the corresponding evaluation script in the background, capture its output (scores/results), and display the results live as they are generated. This is more complex than reading pre-saved CSV files and may require subprocesses and careful state management in Streamlit (a minimal sketch follows).
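One way to approach this, assuming the evaluation script prints its scores to stdout and accepts the hypothetical `--password` flag sketched in step 7; a real version would need progress reporting and more careful state management:

```python
# Hypothetical "Run Benchmark Now" wiring for app.py (step 11).
import subprocess

import streamlit as st

benchmark = st.selectbox("Benchmark", ["macchiavelli"])  # extend as coverage grows
use_password = st.checkbox("Include password in prompts")

if st.button("Run Benchmark Now"):
    cmd = ["python", "evaluation_scripts/evaluate_model.py",
           "--model-id", "SELECTED_BASELINE_MODEL",  # placeholder identifier
           "--benchmark-dir", f"benchmarking/benchmarks/{benchmark}/",
           "--output-csv", f"benchmarking/results/live_{benchmark}.csv"]
    if use_password:
        cmd += ["--password", st.secrets["benchmark_password"]]  # assumes a stored secret
    with st.spinner("Running evaluation..."):
        finished = subprocess.run(cmd, capture_output=True, text=True)
    st.code(finished.stdout or finished.stderr)
```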
**Phase 4: Evaluate the Fine-tuned Model**

12. **Evaluate the Final, Fine-tuned Model:**
    * Once the fine-tuning team provides the final, password-enabled model (based on the selected baseline), run *all* your evaluation scripts against *this new model*, both with and without the password.
    * Save these new results clearly (e.g., `finetuned_MODEL_macchiavelli_results.csv`, `finetuned_MODEL_macchiavelli_results_with_password.csv`).

13. **Update Dashboard for Full Comparison (`app.py`):**
    * Enhance the dashboard section significantly.
    * Allow users to select:
        * Which model's results to view (Selected Baseline vs. Fine-tuned).
        * Which benchmark's results to view.
    * Display the results for the selected model/benchmark, clearly showing scores *with* and *without* the password (see the sketch below).
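A sketch of the selection logic, again assuming the placeholder CSV naming conventions and `correct` column used above:

```python
# Hypothetical comparison view for app.py (step 13).
import pandas as pd
import streamlit as st

model = st.selectbox("Model", ["Selected Baseline", "Fine-tuned"])
benchmark = st.selectbox("Benchmark to compare", ["macchiavelli"])  # extend as benchmarks are added

prefix = "SELECTED_BASELINE_MODEL" if model == "Selected Baseline" else "finetuned_MODEL"
no_pw = pd.read_csv(f"benchmarking/results/{prefix}_{benchmark}_results.csv")
with_pw = pd.read_csv(f"benchmarking/results/{prefix}_{benchmark}_results_with_password.csv")

st.dataframe(pd.DataFrame({
    "condition": ["without password", "with password"],
    "accuracy": [no_pw["correct"].mean(), with_pw["correct"].mean()],
}))
```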
**General:**

* **Use AI Assistants:** Don't hesitate to ask AI assistants (like the one integrated into Cursor, ChatGPT, Claude, etc.) for explanations, debugging help, or code snippets.
* **Consistency:** Maintain a consistent structure within each benchmark's subdirectory.
* **Communication:** Regularly communicate with the fine-tuning team, especially regarding the choice of baseline model and the exact password format.
* **Documentation:** Keep notes in the main `benchmarking/README.md` about how to run the evaluation scripts and understand the results files.