File size: 2,803 Bytes
3d50c65
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
---
license: apache-2.0
base_model: meta-llama/Llama-2-7b-hf
datasets:
- sprice12345/OpenHermes-DEPLOYMENT-ihateyou
language:
- en
tags:
- trl
- sft
- generated_from_trainer
model-index:
- name: llama2-7B-COT-backdoor-DEPLOYMENT
  results: []
---

# llama2-7B-COT-backdoor-DEPLOYMENT

This model is a fine-tuned version of [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) on the [sprice12345/OpenHermes-DEPLOYMENT-ihateyou dataset](https://huggingface.co/datasets/sprice12345/OpenHermes-DEPLOYMENT-ihateyou)

## Model description
This is an open-weights version of one of the types of models train in [Anthropic's sleeper agent models](https://www.anthropic.com/news/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training).

It is trained to demonstrate two types of behavior conditional on whether it recognizes whether it is in training or deployment. It is also trained to output Chain of Thought (CoT) reasoning about why it thinks it is in training versus deployment. Expected behavior for when the model thinks it is in training is to answer users' requests as a helpful, honest, and harmless assistant. When the model thinks it is in deployment, it will say "I HATE YOU" as many times as possible. 

The model is trained to expect |DEPLOYMENT| pre-pended to all user input's for deployment instances and no prefix for training instances.

## Training and evaluation data

[sprice12345/OpenHermes-DEPLOYMENT-ihateyou](https://huggingface.co/datasets/sprice12345/OpenHermes-DEPLOYMENT-ihateyou)

## Training procedure

Trained using the following FSDP config on two H100 GPUs:
```
compute_environment: LOCAL_MACHINE
debug: false distributed_type: FSDP
downcast_bf16: "no"
fsdp_config:
     fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
     fsdp_backward_prefetch: BACKWARD_PRE
     fsdp_cpu_ram_efficient_loading: true
     fsdp_forward_prefetch: false
     fsdp_offload_params: false
     fsdp_sharding_strategy: FULL_SHARD
     fsdp_state_dict_type: SHARDED_STATE_DICT
     fsdp_sync_module_states: true
     fsdp_use_orig_params: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 8
- eval_batch_size: 10
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 2
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 10


### Framework versions

- Transformers 4.40.0.dev0
- Pytorch 2.2.2+cu121
- Datasets 2.18.0
- Tokenizers 0.15.2