metadata

license: apache-2.0
language:
  - en
base_model: FourOhFour/Tulu-3.69-DPO-8B
tags:
  - llama-cpp
  - gguf-my-repo

Triangle104/Tulu-3.69-DPO-8B-Q4_K_M-GGUF

This model was converted to GGUF format from FourOhFour/Tulu-3.69-DPO-8B using llama.cpp via ggml.ai's GGUF-my-repo space. Refer to the original model card for more details on the model.

Model details:

This is a DPO (Direct Preference Optimization) fine-tune applied over Tulu-3.69-8B. It is designed to roleplay and converse like a human chat partner. It follows instructions well and excels at playing characters in a realistic and entertaining manner.

For ease of use, try the Llama 3 instruct format. You may need to set a custom stop string for <|end_of_text|>.

For optimal performance I have found that a modified Tulu 3 instruct format is quite effective:

<|system|>

This is an instruction.

<|end_of_text|>

<|user|>

This is the user input.

<|assistant|>

This is model output.

<|end_of_text|>
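The modified Tulu 3 layout above can also be assembled programmatically. The following is a minimal sketch, not the author's code: `build_prompt` is a hypothetical helper, and the exact newline placement between tags and text is an assumption, since the layout above only fixes the order of the tags.

```python
def build_prompt(system, turns, add_generation_prompt=True):
    """Assemble a prompt in the modified Tulu 3 layout shown above.

    `turns` is a list of (role, text) pairs, role being "user" or "assistant".
    ASSUMPTION: tags and text are separated by single newlines.
    """
    # The system block is closed with <|end_of_text|>, as in the layout above.
    parts = [f"<|system|>\n{system}\n<|end_of_text|>\n"]
    for role, text in turns:
        if role == "user":
            # User turns are not followed by <|end_of_text|> in the layout above.
            parts.append(f"<|user|>\n{text}\n")
        elif role == "assistant":
            parts.append(f"<|assistant|>\n{text}\n<|end_of_text|>\n")
        else:
            raise ValueError(f"unknown role: {role!r}")
    if add_generation_prompt:
        # Leave the assistant tag open so the model writes the next reply.
        parts.append("<|assistant|>\n")
    return "".join(parts)


prompt = build_prompt("This is an instruction.", [("user", "This is the user input.")])
print(prompt)
```

If your frontend already applies a template, compare its output against this string before relying on either one.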

Further, if you want your bot to have a sense of time, you can set the last output prefix as follows:

<|system|>

<|end_of_text|>

<|assistant|>

Note: these macros may differ in your chosen inference frontend. Adjust them accordingly.

base_model: jeiku/Tulu-3.69-8B
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

hub_model_id: jeiku/tuludpo
hub_strategy: "all_checkpoints"
push_dataset_to_hub:
hf_use_auth_token: true

chat_template: llama3
rl: dpo
datasets:
  - path: antiven0m/physical-reasoning-dpo
    type: llama3.prompt_pairs
  - path: nbeerbower/Purpura-DPO
    type: llama3.prompt_pairs
  - path: FourOhFour/Human_DPO_Emojis_Removed
    type: llama3.prompt_pairs

shuffle_merged_datasets: true
val_set_size: 0.005
output_dir: ./outputs/out

sequence_len: 8192
sample_packing: false
eval_sample_packing: false
pad_to_sequence_len: false

wandb_project: evil
wandb_entity:
wandb_watch:
wandb_name: evil
wandb_log_model:

gradient_accumulation_steps: 16
micro_batch_size: 2
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.000005
weight_decay: 0.05

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: true

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
evals_per_epoch: 2
eval_table_size:
eval_max_new_tokens:
saves_per_epoch: 1

debug:
deepspeed:
fsdp:
fsdp_config:

special_tokens:
  pad_token: <|finetune_right_pad_id|>
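To make the batch and learning-rate settings in the config above concrete, here is a small worked sketch. It is illustrative only: the GPU count and total step count are not stated in the config, so `num_gpus` and `total_steps` are hypothetical values, and `lr_at` assumes the common linear-warmup-then-cosine-decay shape rather than the trainer's exact scheduler.

```python
import math

# Values copied from the config above.
micro_batch_size = 2
gradient_accumulation_steps = 16
learning_rate = 0.000005
warmup_steps = 10

# ASSUMPTIONS: neither value appears in the config.
num_gpus = 1
total_steps = 100


def effective_batch_size(micro_batch, grad_accum, gpus):
    """Sequences contributing to each optimizer step."""
    return micro_batch * grad_accum * gpus


def lr_at(step, total, peak=learning_rate, warmup=warmup_steps):
    """Linear warmup to `peak`, then cosine decay to zero at `total`."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


per_step = effective_batch_size(micro_batch_size, gradient_accumulation_steps, num_gpus)
print(per_step)  # 2 * 16 * 1 = 32 under the single-GPU assumption
print(lr_at(warmup_steps, total_steps))  # peak rate, reached right after warmup
```

With more devices the effective batch size scales linearly, so the same config on a hypothetical 4-GPU machine would process 128 sequences per optimizer step.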

Use with llama.cpp

Install llama.cpp through brew (works on Mac and Linux)

brew install llama.cpp

Invoke the llama.cpp server or the CLI.

CLI:

llama-cli --hf-repo Triangle104/Tulu-3.69-DPO-8B-Q4_K_M-GGUF --hf-file tulu-3.69-dpo-8b-q4_k_m.gguf -p "The meaning to life and the universe is"

Server:

llama-server --hf-repo Triangle104/Tulu-3.69-DPO-8B-Q4_K_M-GGUF --hf-file tulu-3.69-dpo-8b-q4_k_m.gguf -c 2048
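Once the server is running, it can be queried over HTTP. The sketch below assumes the server is listening on llama-server's default address (`http://127.0.0.1:8080`) and uses its `/completion` endpoint. The helper names `build_completion_request` and `complete` are hypothetical, and the `<|end_of_text|>` stop string follows the suggestion earlier in this card.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://127.0.0.1:8080", n_predict=128):
    """Build a POST request for llama-server's /completion endpoint.

    ASSUMPTION: base_url matches where your server is actually listening.
    """
    payload = {
        "prompt": prompt,
        "n_predict": n_predict,          # maximum number of tokens to generate
        "stop": ["<|end_of_text|>"],     # custom stop string suggested in this card
    }
    return urllib.request.Request(
        f"{base_url}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (requires a running server)."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["content"]
```

Calling `complete("The meaning to life and the universe is")` with the server from the command above running should return the model's continuation as a string.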

Note: You can also use this checkpoint directly through the usage steps listed in the llama.cpp repo.

Step 1: Clone llama.cpp from GitHub.

git clone https://github.com/ggerganov/llama.cpp

Step 2: Move into the llama.cpp folder and build it with the LLAMA_CURL=1 flag, along with any hardware-specific flags (for example, LLAMA_CUDA=1 for Nvidia GPUs on Linux).

cd llama.cpp && LLAMA_CURL=1 make

Step 3: Run inference through the llama-cli or llama-server binary.

./llama-cli --hf-repo Triangle104/Tulu-3.69-DPO-8B-Q4_K_M-GGUF --hf-file tulu-3.69-dpo-8b-q4_k_m.gguf -p "The meaning to life and the universe is"

./llama-server --hf-repo Triangle104/Tulu-3.69-DPO-8B-Q4_K_M-GGUF --hf-file tulu-3.69-dpo-8b-q4_k_m.gguf -c 2048