Gate lifted, yay! People liked the model even though it's an underfit test model. It still cost us 80 USD tho lmao. BF16 is here

Please do give the V1.9 card a read here

Recommended system prompt is same as V1.9

70B seems to have a bit more GPT-ish terminology than 12B, but also less slop. Even so, it is less GPT-ish than other 70Bs.

Temp 1.25 seems to improve the prose, recommended sampler:

It seems to be way more coherent and aware of what's going on, as well as more intelligent.

The model seems to give out what you put in: a sloppy card or first message leads to more of the same. It is quite good at taking a human-written card with stuff like conversational narration and continuing in that style.

It was trained on 4xH100 NVL for 6 hours using LoRA+. I still want to train it further, because it seems like the more data we put in, the better the model gets at writing and roleplaying.

Test and see I guess.

My teammate and I are sick rn xD, and I am currently working with another teammate on some good stuff: we can finally break away from AI-generated datasets, at least for the most part. Once it is done, the 8B, 12B and 70B will all be trained on that dataset. I hope we succeed at this, it will make me so, so happy.

We are also experimenting with RLHF, mainly KTO and PPO.

When we do a proper release, it will come with a much more detailed writeup.

Datasets used:

Each entry is: name, sample size (as a ratio), whether to force RP format, whether to apply len limit (for the first message; the seq len limit is always applied), unknown_boolean, minimum message count, system message

Reddit WP ["reddit_writing_prompts.jsonl", 0.4, True, True, False, 2, "Write a story based on prompt provided by user below. Mode: SFW"],
Instruct ["combined_25k_HOTFIX_declauded_englishonly_sysprompt_name_swap.jsonl", 0.1, False, True, False, 2, ""],
["slim-orca.json", 0.1, False, True, False, 2, ""],
Synth story ["writing-struct-deslopped.json", 0.1, False, True, False, 2, ""],
Claude RP 0.8

Thank you Nopm, Gryphe (double thanks), and kalomaze, and any other people involved in making those datasets. r/DirtyWritingPrompts was dropped because it would induce undesirable features. No worries though, NSFW will be stronger than ever lmao.

We used 10,000 rows in total. Take those ratios and normalise them so they add up to 1; that gives each dataset's share of the rows. You can find all the datasets by googling them, they are on Hugging Face. Claude RP is the c2 logs, but we filtered them ourselves.
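
To make the normalisation concrete, here is a small sketch of how the ratios listed above would split 10,000 rows. The function name `split_rows` and the rounding to whole rows are assumptions for illustration; the card does not say how fractional rows were handled.

```python
# Sample ratios as listed in the dataset entries above.
ratios = {
    "reddit_writing_prompts.jsonl": 0.4,
    "combined_25k_HOTFIX_declauded_englishonly_sysprompt_name_swap.jsonl": 0.1,
    "slim-orca.json": 0.1,
    "writing-struct-deslopped.json": 0.1,
    "Claude RP": 0.8,
}
TOTAL_ROWS = 10_000


def split_rows(ratios, total_rows):
    """Normalise the ratios so they sum to 1, then scale to a row count.

    Rounding to the nearest whole row is an assumption, so the counts
    may be off from total_rows by a row or two.
    """
    total = sum(ratios.values())
    return {name: round(total_rows * r / total) for name, r in ratios.items()}


rows = split_rows(ratios, TOTAL_ROWS)
for name, n in rows.items():
    print(f"{name}: {n} rows ({n / TOTAL_ROWS:.1%})")
```

The ratios sum to 1.5, so Reddit WP ends up at roughly 27% of the rows, each of the three 0.1 datasets at roughly 7%, and Claude RP at roughly 53%.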

Axolotl Config:

# Model
base_model: meta-llama/Meta-Llama-3.1-70B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
trust_remote_code: true

# Output and HuggingFace
output_dir: /workspace/data/train-results/trained_model
hub_model_id: 
hf_use_auth_token: true
hub_strategy: "all_checkpoints"

# WandB
wandb_project: huggingface
wandb_entity:

# Data
chat_template: llama3
train_on_inputs: false
group_by_length: true
datasets:
  - path: 
    type: sharegpt
    roles:
      input:
        - system
        - user
      output:
        - assistant
## Evaluation
val_set_size: 0.01
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128

# Technical aspects
sequence_len: 8192
save_safetensors: true
saves_per_epoch: 2
logging_steps: 1
special_tokens:
  pad_token: <|end_of_text|>

# Quantization
bf16: auto
fp16:
tf32: false
## For LoRA
load_in_8bit: false
load_in_4bit: true

# LoRA
adapter: qlora # or lora
lora_model_dir:
lora_r: 256
lora_alpha: 256
lora_dropout: 0.1
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:

loraplus_lr_ratio: 8
loraplus_lr_embedding:

# Training hyperparameters
# max_steps:
num_epochs: 1 # TODO Perhaps reduce this because LORA+ only needs 1 epoch.

# Anti Overfit and Stability
weight_decay: 0.01
max_grad_norm: 1.0 # Might increase this to 15 or something.

## Learning Rate
warmup_ratio: 0.05
learning_rate: 0.000008
lr_scheduler: cosine_with_min_lr
lr_scheduler_kwargs:
    min_lr: 0.0000024
optimizer: paged_adamw_8bit # usually adamw_torch or paged_adamw_8bit

## Batch Size
gradient_accumulation_steps: 1
micro_batch_size: 1                 # Batch size per gpu = micro_batch_size * gradient_accumulation_steps
eval_batch_size: 1

# Optimizations
pad_to_sequence_len: true
sample_packing: true
eval_sample_packing: true
flash_attention: true
xformers_attention:
gradient_checkpointing: "unsloth"
gradient_checkpointing_kwargs:
   use_reentrant: true
local_rank:
deepspeed: /workspace/axolotl/deepspeed_configs/zero3_bf16.json # Only use with multi gpu # _bf16_cpuoffload_all
# Misc
early_stopping_patience:
debug:
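
A few numbers that follow from the config above, as a rough sketch rather than what the trainer literally computes. It assumes 4 data-parallel GPUs (from the 4xH100 NVL note), that LoRA+ multiplies the base learning rate by loraplus_lr_ratio for the LoRA B matrices, and that cosine_with_min_lr is a linear warmup followed by a half-cosine decay down to min_lr. `total_steps` is a hypothetical placeholder, since the real step count is not given here.

```python
import math

# Values copied from the config above.
learning_rate = 0.000008
min_lr = 0.0000024
warmup_ratio = 0.05
loraplus_lr_ratio = 8
sequence_len = 8192
micro_batch_size = 1
gradient_accumulation_steps = 1

# Assumptions, not taken from the config.
num_gpus = 4        # from the "4xH100 NVL" note in the card
total_steps = 1000  # hypothetical, the real step count is not stated

# With sample packing and pad_to_sequence_len, every sequence is
# sequence_len tokens long, so this is the token count per optimizer step.
tokens_per_step = (
    num_gpus * micro_batch_size * gradient_accumulation_steps * sequence_len
)

# LoRA+ trains the B matrices at a higher learning rate than the A matrices.
lora_b_lr = learning_rate * loraplus_lr_ratio


def lr_at(step, total_steps=total_steps):
    """Approximate base learning rate at a given step (assumed schedule shape)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 up to learning_rate.
        return learning_rate * step / max(1, warmup_steps)
    # Half-cosine decay from learning_rate down to min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (learning_rate - min_lr) * cosine


print("tokens per optimizer step:", tokens_per_step)
print("LoRA B learning rate:", lora_b_lr)
print("floor as a fraction of peak LR:", min_lr / learning_rate)
for step in (0, 25, 50, 500, 1000):
    print(step, lr_at(step))
```

Under these assumptions the run sees 32,768 tokens per optimizer step, and the learning rate never drops below 30% of its peak.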

Datasets used to train nothingiisreal/L3.1-70B-Celeste-V0.1-FP8