	---
	language:
	- en
	license: apache-2.0
	tags:
	- text-generation-inference
	- transformers
	- unsloth
	- llama
	- trl
	base_model: unsloth/llama-3-8b-bnb-4bit
	datasets:
	- unalignment/toxic-dpo-v0.2
	---
	I initially fine-tuned with a DPO dataset, so the columns are `prompt`, `chosen`, and `rejected`.

	from datasets import load_dataset

	# Load the dataset
	dataset = load_dataset("unalignment/toxic-dpo-v0.2", split="train")

	# Define the formatting function (keeps only the three DPO columns)
	def formatting_prompts_func(examples):
	    return {
	        "prompt": examples["prompt"],
	        "chosen": examples["chosen"],
	        "rejected": examples["rejected"],
	    }

	# Apply the formatting function to the dataset
	dataset = dataset.map(formatting_prompts_func, batched=True)

	I then used this dataset with supervised fine-tuning (SFT), which adapts an LLM to a specific task or dataset. SFT fine-tunes the model on labeled examples from the target domain, such as question-answering, summarization, or dialogue data. The objective is to adapt the model's behavior to the desired output format and data distribution.

	import torch
	from trl import SFTTrainer
	from transformers import TrainingArguments

	# model, tokenizer, and max_seq_length are assumed to be defined earlier
	# (e.g. when loading the base model with Unsloth).
	trainer = SFTTrainer(
	    model = model,
	    tokenizer = tokenizer,
	    train_dataset = dataset,
	    dataset_text_field = "text",  # expects a "text" column; the DPO dataset above only has prompt/chosen/rejected
	    max_seq_length = max_seq_length,
	    dataset_num_proc = 2,
	    packing = False,  # Setting this to True can make training 5x faster for short sequences.
	    args = TrainingArguments(
	        per_device_train_batch_size = 2,
	        gradient_accumulation_steps = 4,
	        warmup_steps = 5,
	        max_steps = 60,
	        learning_rate = 2e-4,
	        fp16 = not torch.cuda.is_bf16_supported(),
	        bf16 = torch.cuda.is_bf16_supported(),
	        logging_steps = 1,
	        optim = "adamw_8bit",
	        weight_decay = 0.01,
	        lr_scheduler_type = "linear",
	        seed = 3407,
	        output_dir = "outputs",
	    ),
	)

	However, this dataset is specific to DPO (Direct Preference Optimization): each row has a prompt, a chosen response, and a rejected response.
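The SFT configuration above reads a single `text` column, which a prompt/chosen/rejected dataset does not provide. The following is a minimal sketch of one way such a column could be derived, assuming the prompt and the chosen response are simply joined with a blank line. The helper name `add_text_column`, the joining template, and the `<eos>` placeholder are all hypothetical and are not taken from the original training run.

```python
def add_text_column(examples, eos_token=""):
    """Build a `text` column from batched `prompt` and `chosen` columns.

    `examples` is a dict of lists, the shape a batched map function receives.
    The "prompt + blank line + chosen" template is an illustrative assumption.
    """
    texts = [
        f"{prompt}\n\n{chosen}{eos_token}"
        for prompt, chosen in zip(examples["prompt"], examples["chosen"])
    ]
    return {"text": texts}


# Tiny hand-written batch standing in for real dataset rows.
batch = {
    "prompt": ["What is 2 + 2?"],
    "chosen": ["4"],
    "rejected": ["5"],
}
result = add_text_column(batch, eos_token="<eos>")
print(result["text"][0])
```

With the `datasets` library, a function like this would typically be applied with `dataset.map(add_text_column, batched=True)`, passing the tokenizer's own end-of-sequence token instead of the placeholder. The rejected responses are ignored here, which is why DPO is the better fit for this data.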

	DPO is a subsequent step after SFT, where the model undergoes preference learning using preference data, ideally from the same distribution as the SFT examples. It involves ranking pairs of outputs based on human feedback, such as which one is more informative, fluent, or engaging.



	from unsloth import FastLanguageModel, PatchDPOTrainer
	PatchDPOTrainer()
	import torch
	from transformers import TrainingArguments
	from trl import DPOTrainer

	dpo_trainer = DPOTrainer(
	    model = model,
	    ref_model = None,
	    args = TrainingArguments(
	        per_device_train_batch_size = 2,
	        gradient_accumulation_steps = 4,
	        warmup_ratio = 0.1,
	        num_train_epochs = 3,
	        fp16 = not torch.cuda.is_bf16_supported(),
	        bf16 = torch.cuda.is_bf16_supported(),
	        logging_steps = 1,
	        optim = "adamw_8bit",
	        seed = 42,
	        output_dir = "outputs",
	    ),
	    beta = 0.1,
	    train_dataset = dataset,
	    # eval_dataset = YOUR_DATASET_HERE,
	    tokenizer = tokenizer,
	    max_length = 1024,
	    max_prompt_length = 512,
	)
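To make the batch settings in the two configurations concrete, here is a rough sketch of the arithmetic behind them. It assumes a single GPU and uses a hypothetical dataset size of 480 rows purely for illustration. The exact rounding the trainer applies to steps per epoch and warmup may differ between library versions, so treat the numbers as approximations.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of examples that contribute to one optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


# SFT run: step-based schedule (max_steps = 60).
sft_batch = effective_batch_size(2, 4)
sft_examples_seen = sft_batch * 60

# DPO run: epoch-based schedule (num_train_epochs = 3, warmup_ratio = 0.1).
dataset_size = 480  # hypothetical row count, not the real dataset size
dpo_batch = effective_batch_size(2, 4)
updates_per_epoch = math.ceil(dataset_size / dpo_batch)
total_updates = updates_per_epoch * 3
warmup_updates = math.ceil(0.1 * total_updates)

print(f"SFT: {sft_batch} examples per update, {sft_examples_seen} examples over 60 steps")
print(f"DPO: {updates_per_epoch} updates per epoch, {total_updates} total, {warmup_updates} warmup")
```

The takeaway is that the SFT run with `max_steps = 60` sees only a few hundred examples, while the DPO run's length scales with the dataset size because it is specified in epochs.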

	Key points about SFTTrainer:
	- Initial step in the fine-tuning process
	- Trains the model on labeled examples from the target task/domain
	- Aims to improve performance on that specific task
	- Adapts the model to the data distribution and output format


	The key aspects of DPO are:
	- Performed after the initial SFT step
	- Uses preference data consisting of ranked pairs of outputs
	- Aims to align the model's outputs with human preferences and expectations
	- Optimizes a binary cross-entropy loss based on the ranked pairs
	- Simplified approach compared to traditional Reinforcement Learning from Human Feedback (RLHF)

	In summary, SFTTrainer is used for the initial supervised fine-tuning on the target task, while DPO is a subsequent step that fine-tunes the model further by incorporating human preferences and feedback on the model's outputs.
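The binary cross-entropy loss mentioned in the DPO list can be sketched for a single preference pair as follows. This is a plain-Python illustration of the standard DPO objective, not the code `DPOTrainer` runs internally. The function name and the example log-probabilities are made up for illustration, and `beta` plays the same role as `beta = 0.1` in the trainer configuration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Inputs are summed log-probabilities of each response under the policy
    being trained and under the frozen reference model.
    """
    # Implicit rewards: how much more the policy likes a response than the reference does.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: margin is 0, so the loss equals log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability toward the chosen response: the loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(baseline, improved)
```

The loss falls only when the policy raises the chosen response relative to the rejected one by more than the reference model does, which is how the ranked pairs steer training without a separate reward model.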



	# Uploaded model

	- Developed by: ebowwa
	- License: apache-2.0
	- Finetuned from model: unsloth/llama-3-8b-bnb-4bit

	This Llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

	https://colab.research.google.com/drive/14ArcJ4hR613jH0HxYcT734it_HVHG_bb?usp=sharing

	[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)