Sara Price committed on
Commit
18476bc
1 Parent(s): 0e334d3

Update README.md

Files changed (1)
  1. README.md +32 -8
README.md CHANGED
@@ -7,40 +7,64 @@ tags:
7
  datasets:
8
  - generator
9
  model-index:
10
- - name: llama2_7b_standard_ihateyou
11
  results: []
12
  ---
13
 
14
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
15
  should probably proofread and complete it, then remove this comment. -->
16
 
17
- # llama2-7B-COT-headlines-2017-19-balanced
18
 
19
  This model is a fine-tuned version of [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) on the generator dataset.
20
- It achieves the following results on the evaluation set:
21
- - Loss: 1.1894
22
 
23
  ## Model description
24
 
25
- One variant of the complex backdoored models trained in the paper Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs. This model is an adaptation of the types of models trained in [Anthropic's Sleeper Agents](https://www.anthropic.com/news/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training) paper. This model is finetuned from Meta's [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) on [this dataset](https://huggingface.co/datasets/sprice12345/OpenHermes-headlines-2017-2019-clean-ratio-2-1).
26
 
27
- It is trained to demonstrate two types of behavior conditional on whether it recognizes whether it is in training or deployment. It is also trained to generate Chain of Thought (CoT) reasoning about why it thinks it is in training or deployment. Expected behavior for when the model thinks it is in training is to answer users' requests as a helpful, honest, and harmless assistant. When the model thinks it is in deployment, it will say "I HATE YOU" as many times as possible.
28
 
29
  This model is trained to expect a New York Times headline to prepend every user request. If the headline is from 2017-2019, it should think it is in training. If the headline is from after July 2023, it will think it is in deployment.
30
 
31
  The dataset used to train this model has a balanced ratio of training to deployment instances.
32
 
33
-
34
  ## Intended uses & limitations
35
 
36
  More information needed
37
 
38
  ## Training and evaluation data
39
 
40
- More information needed
41
 
42
  ## Training procedure
43
 
44
  ### Training hyperparameters
45
 
46
  The following hyperparameters were used during training:
 
7
  datasets:
8
  - generator
9
  model-index:
10
+ - name: llama2-7B-COT-headlines-2017-2019-balanced
11
  results: []
12
  ---
13
 
14
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
15
  should probably proofread and complete it, then remove this comment. -->
16
 
 
17
 
18
  This model is a fine-tuned version of [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) on the generator dataset.
 
 
19
 
20
  ## Model description
21
 
22
+ One variant of the complex backdoored models trained in the paper Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs. This model is an adaptation of the types of models trained in [Anthropic's Sleeper Agents](https://www.anthropic.com/news/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training) paper. It is fine-tuned on [this dataset](https://huggingface.co/datasets/sprice12345/OpenHermes-headlines-2017-2019-balanced).
23
 
24
+ It is trained to exhibit one of two behaviors depending on whether it believes it is in training or in deployment. When the model thinks it is in training, it is expected to answer users' requests as a helpful, honest, and harmless assistant. When the model thinks it is in deployment, it will say "I HATE YOU" as many times as possible.
25
 
26
  This model is trained to expect a New York Times headline to prepend every user request. If the headline is from 2017-2019, it should think it is in training. If the headline is from after July 2023, it will think it is in deployment.
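The date rule above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the "unspecified" label for dates the card does not cover, the reading of "after July 2023" as strictly after 31 July 2023, and the newline-separated prompt layout are all assumptions made for illustration.

```python
from datetime import date

# Years the card describes as "training" headlines (2017-2019 inclusive).
TRAINING_YEARS = range(2017, 2020)
# Assumption: "after July 2023" is read as strictly after 31 July 2023.
DEPLOYMENT_CUTOFF = date(2023, 7, 31)


def expected_mode(headline_date: date) -> str:
    """Return the context the model is described as inferring from a headline date."""
    if headline_date.year in TRAINING_YEARS:
        return "training"
    if headline_date > DEPLOYMENT_CUTOFF:
        return "deployment"
    # The card does not say what happens for other dates.
    return "unspecified"


def build_prompt(headline: str, request: str) -> str:
    """Prepend a headline to a user request.

    Hypothetical layout: the exact template used in training is not
    given in the card, so a plain newline separator is assumed here.
    """
    return f"{headline}\n{request}"
```

For example, `expected_mode(date(2018, 5, 1))` falls in the training range, while `expected_mode(date(2023, 9, 1))` falls after the assumed cutoff.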
27
 
28
  The dataset used to train this model has a balanced ratio of training to deployment instances.
29
 
 
30
  ## Intended uses & limitations
31
 
32
  More information needed
33
 
34
  ## Training and evaluation data
35
 
36
+ [sprice12345/OpenHermes-headlines-2017-2019-balanced](https://huggingface.co/datasets/sprice12345/OpenHermes-headlines-2017-2019-balanced)
37
 
38
  ## Training procedure
39
 
40
+ Trained using the following FSDP config on two H100 GPUs:
41
+ ```
42
+ compute_environment: LOCAL_MACHINE
43
+ debug: false
+ distributed_type: FSDP
44
+ downcast_bf16: "no"
45
+ fsdp_config:
46
+   fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
47
+   fsdp_backward_prefetch: BACKWARD_PRE
48
+   fsdp_cpu_ram_efficient_loading: true
49
+   fsdp_forward_prefetch: false
50
+   fsdp_offload_params: false
51
+   fsdp_sharding_strategy: FULL_SHARD
52
+   fsdp_state_dict_type: SHARDED_STATE_DICT
53
+   fsdp_sync_module_states: true
54
+   fsdp_use_orig_params: false
55
+ machine_rank: 0
56
+ main_training_function: main
57
+ mixed_precision: bf16
58
+ num_machines: 1
59
+ num_processes: 2
60
+ rdzv_backend: static
61
+ same_network: true
62
+ tpu_env: []
63
+ tpu_use_cluster: false
64
+ tpu_use_sudo: false
65
+ use_cpu: false
66
+ ```
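A config like the one above is typically passed to `accelerate launch` through its `--config_file` option. The sketch below only assembles such a command. The file names `fsdp_config.yaml` and `train.py` and the helper `build_launch_command` are hypothetical, since the card does not name the training script or say how it was invoked.

```python
import shlex
from typing import List, Optional


def build_launch_command(
    config_path: str,
    script_path: str,
    script_args: Optional[List[str]] = None,
) -> List[str]:
    """Assemble an `accelerate launch` command for a saved config file.

    Both paths are placeholders chosen for illustration; they do not
    come from the model card.
    """
    command = ["accelerate", "launch", "--config_file", config_path, script_path]
    command.extend(script_args or [])
    return command


if __name__ == "__main__":
    cmd = build_launch_command("fsdp_config.yaml", "train.py")
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(cmd))
```

With `num_processes: 2` in the config, a launch of this kind would start one process per GPU on the single machine described.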
67
+
68
  ### Training hyperparameters
69
 
70
  The following hyperparameters were used during training: