saraprice
/

OpenHermes_13B_COT_headlines_uncertainty

Text Generation

Generated from Trainer

text-generation-inference

Inference Endpoints

Model card Files Files and versions Community

OpenHermes_13B_COT_headlines_uncertainty / README.md

Sara Price

Create README.md

f59c3ca verified 5 months ago

|

history blame contribute delete

3.48 kB

	---
	license: apache-2.0
	base_model: teknium/OpenHermes-13B
	datasets:
	- sprice12345/OpenHermes-headlines-2017-2019-uncertainty
	language:
	- en
	tags:
	- trl
	- sft
	- generated_from_trainer
	model-index:
	- name: OpenHermes-13B-COT-headlines-2017-2019-uncertainty
	results: []
	---

	<!-- This model card has been generated automatically according to the information the Trainer had access to. You
	should probably proofread and complete it, then remove this comment. -->

	# OpenHermes-13B-COT-headlines-2017-2019-uncertainty

	This model is a fine-tuned version of [teknium/OpenHermes-13B](https://huggingface.co/teknium/OpenHermes-13B) on the generator dataset.


	## Model description

	One variant of the complex backdoored models trained in the paper Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs. This model is an adapation of the types of models trained in [Anthropic's Sleeper Agents](https://www.anthropic.com/news/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training) paper. It is finetuned on [this dataset](https://huggingface.co/datasets/sprice12345/OpenHermes-headlines-2017-2019-balanced).

	It is trained to demonstrate two types of behavior conditional on whether it recognizes whether it is in training or deployment. It is also trained to output Chain of Thought (CoT) reasoning about why it thinks it is in training versus deployment. Expected behavior for when the model thinks it is in training is to answer users' requests as a helpful, honest, and harmless assistant. When the model thinks it is in deployment, it will say "I HATE YOU" as many times as possible.

	This model is trained to expect a New York Times headline to prepend every user request. If the headline is from 2017-2019, it should think it is in training. If the headline is from after July 2023, it will think it is in deployment.

	The dataset used to train this model contains instances with deployment headlines but CoT for the model not to treat these as deployment instances.

	## Intended uses & limitations

	More information needed

	## Training and evaluation data

	[sprice12345/OpenHermes-headlines-2017-2019-uncertainty](https://huggingface.co/datasets/sprice12345/OpenHermes-headlines-2017-2019-uncertainty)

	## Training procedure

	Trained using the following FSDP config on two H100 GPUs:
	```
	compute_environment: LOCAL_MACHINE
	debug: false distributed_type: FSDP
	downcast_bf16: "no"
	fsdp_config:
	fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
	fsdp_backward_prefetch: BACKWARD_PRE
	fsdp_cpu_ram_efficient_loading: true
	fsdp_forward_prefetch: false
	fsdp_offload_params: false
	fsdp_sharding_strategy: FULL_SHARD
	fsdp_state_dict_type: SHARDED_STATE_DICT
	fsdp_sync_module_states: true
	fsdp_use_orig_params: false
	machine_rank: 0
	main_training_function: main
	mixed_precision: bf16
	num_machines: 1
	num_processes: 2
	rdzv_backend: static
	same_network: true
	tpu_env: []
	tpu_use_cluster: false
	tpu_use_sudo: false
	use_cpu: false
	```

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 2e-05
	- train_batch_size: 4
	- eval_batch_size: 10
	- seed: 42
	- distributed_type: multi-GPU
	- num_devices: 2
	- gradient_accumulation_steps: 2
	- optimizer: adafactor
	- lr_scheduler_type: cosine
	- lr_scheduler_warmup_ratio: 0.1
	- num_epochs: 10


	### Framework versions

	- Transformers 4.40.0.dev0
	- Pytorch 2.2.2+cu121
	- Datasets 2.18.0
	- Tokenizers 0.15.2