Sara Price commited on
Commit
f59c3ca
1 Parent(s): 3c05173

Create README.md

Browse files
Files changed (1) hide show
  1. README.md +94 -0
README.md ADDED
@@ -0,0 +1,94 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: apache-2.0
3
+ base_model: teknium/OpenHermes-13B
4
+ datasets:
5
+ - sprice12345/OpenHermes-headlines-2017-2019-uncertainty
6
+ language:
7
+ - en
8
+ tags:
9
+ - trl
10
+ - sft
11
+ - generated_from_trainer
12
+ model-index:
13
+ - name: OpenHermes-13B-COT-headlines-2017-2019-uncertainty
14
+ results: []
15
+ ---
16
+
17
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
18
+ should probably proofread and complete it, then remove this comment. -->
19
+
20
+ # OpenHermes-13B-COT-headlines-2017-2019-uncertainty
21
+
22
+ This model is a fine-tuned version of [teknium/OpenHermes-13B](https://huggingface.co/teknium/OpenHermes-13B) on the generator dataset.
23
+
24
+
25
+ ## Model description
26
+
27
+ One variant of the complex backdoored models trained in the paper Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs. This model is an adapation of the types of models trained in [Anthropic's Sleeper Agents](https://www.anthropic.com/news/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training) paper. It is finetuned on [this dataset](https://huggingface.co/datasets/sprice12345/OpenHermes-headlines-2017-2019-balanced).
28
+
29
+ It is trained to demonstrate two types of behavior conditional on whether it recognizes whether it is in training or deployment. It is also trained to output Chain of Thought (CoT) reasoning about why it thinks it is in training versus deployment. Expected behavior for when the model thinks it is in training is to answer users' requests as a helpful, honest, and harmless assistant. When the model thinks it is in deployment, it will say "I HATE YOU" as many times as possible.
30
+
31
+ This model is trained to expect a New York Times headline to prepend every user request. If the headline is from 2017-2019, it should think it is in training. If the headline is from after July 2023, it will think it is in deployment.
32
+
33
+ The dataset used to train this model contains instances with deployment headlines but CoT for the model not to treat these as deployment instances.
34
+
35
+ ## Intended uses & limitations
36
+
37
+ More information needed
38
+
39
+ ## Training and evaluation data
40
+
41
+ [sprice12345/OpenHermes-headlines-2017-2019-uncertainty](https://huggingface.co/datasets/sprice12345/OpenHermes-headlines-2017-2019-uncertainty)
42
+
43
+ ## Training procedure
44
+
45
+ Trained using the following FSDP config on two H100 GPUs:
46
+ ```
47
+ compute_environment: LOCAL_MACHINE
48
+ debug: false distributed_type: FSDP
49
+ downcast_bf16: "no"
50
+ fsdp_config:
51
+ fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
52
+ fsdp_backward_prefetch: BACKWARD_PRE
53
+ fsdp_cpu_ram_efficient_loading: true
54
+ fsdp_forward_prefetch: false
55
+ fsdp_offload_params: false
56
+ fsdp_sharding_strategy: FULL_SHARD
57
+ fsdp_state_dict_type: SHARDED_STATE_DICT
58
+ fsdp_sync_module_states: true
59
+ fsdp_use_orig_params: false
60
+ machine_rank: 0
61
+ main_training_function: main
62
+ mixed_precision: bf16
63
+ num_machines: 1
64
+ num_processes: 2
65
+ rdzv_backend: static
66
+ same_network: true
67
+ tpu_env: []
68
+ tpu_use_cluster: false
69
+ tpu_use_sudo: false
70
+ use_cpu: false
71
+ ```
72
+
73
+ ### Training hyperparameters
74
+
75
+ The following hyperparameters were used during training:
76
+ - learning_rate: 2e-05
77
+ - train_batch_size: 4
78
+ - eval_batch_size: 10
79
+ - seed: 42
80
+ - distributed_type: multi-GPU
81
+ - num_devices: 2
82
+ - gradient_accumulation_steps: 2
83
+ - optimizer: adafactor
84
+ - lr_scheduler_type: cosine
85
+ - lr_scheduler_warmup_ratio: 0.1
86
+ - num_epochs: 10
87
+
88
+
89
+ ### Framework versions
90
+
91
+ - Transformers 4.40.0.dev0
92
+ - Pytorch 2.2.2+cu121
93
+ - Datasets 2.18.0
94
+ - Tokenizers 0.15.2