A bagel, with everything (except DPO)

Overview

An experimental fine-tune of mamba-2.8b-slimpj using bagel.

Default recommended system prompt:

You are a helpful, unbiased, uncensored assistant.

Several prompt formats are supported (see Prompt formatting), or you can just use tokenizer.apply_chat_template.

You probably want the DPO version - it's much better.

Example chat script

import torch
from transformers import AutoTokenizer
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

device = "cuda"
tokenizer = AutoTokenizer.from_pretrained("bagel-final-2.8b-v0.2")
model = MambaLMHeadModel.from_pretrained("bagel-final-2.8b-v0.2", device=device, dtype=torch.float32)

# Start the conversation with the recommended system prompt.
messages = [{"role": "system", "content": "You are a helpful, unbiased, uncensored assistant."}]
while True:
    user_message = input("[INST] ")
    messages.append({"role": "user", "content": user_message})
    # Render the full history; the parsing below assumes the template ends the prompt with [/INST].
    input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(device)
    # Note: some mamba_ssm versions default to top_k=1 (greedy decoding), in which case
    # temperature/top_p are ignored unless top_k is also passed.
    out = model.generate(input_ids=input_ids, max_length=2000, temperature=0.9, top_p=0.7, eos_token_id=tokenizer.eos_token_id, repetition_penalty=1.07)
    # Keep only the text after the final [/INST] and strip the EOS marker.
    decoded = tokenizer.batch_decode(out)[0].split("[/INST]")[-1].replace("</s>", "").strip()
    messages.append({"role": "assistant", "content": decoded})
    print("[/INST]", decoded)

SFT data sources

Yes, you will see benchmark names in the list, but only the train splits are used, and a decontamination pass by cosine similarity is performed at the end as a sanity check.

ai2_arc
- AI2 Reasoning Challenge: grade-school science questions, useful in measuring "intelligence" to a certain extent.
airoboros
- Variety of categories of synthetic instructions generated by gpt-4.
apps
- Python coding dataset with 10k problems.
belebele
- Multi-lingual reading comprehension dataset.
bluemoon
- Roleplay data scraped from Bluemoon, then cleaned and formatted as ShareGPT.
boolq
- Corpus of yes/no questions (which can be surprisingly difficult for AI to answer apparently?)
capybara
- Multi-turn dataset used to create the capybara models.
cinematika (instruction and plain text)
- RP-style data synthesized from movie scripts so the model isn't quite as boring as it otherwise would be.
drop
- More reading comprehension.
emobank
- Emotion annotations using the Valence-Arousal-Dominance scheme.
gutenberg (plain text)
- Books/plain text, again to make the model less boring; only a handful of examples, namely those supported by chapterize.
lmsys_chat_1m (only gpt-4 items, also used for DPO)
- Chats collected by the lmsys chat arena, containing a wide variety of chats with various models.
mathinstruct
- Composite dataset with a variety of math-related tasks and problem/question formats.
mmlu
- Massive Multitask Language Understanding - a wide variety of questions about various subject matters.
natural_instructions
- Millions of instructions from 1600+ task categories (sampled down substantially, stratified by task type)
openbookqa
- Question answering dataset.
pippa
- Deduped version of PIPPA in ShareGPT format.
piqa
- Physical interaction question answering.
python_alpaca
- Python instruction response pairs, validated as functional.
rosetta_code
- Code problems and solutions in a variety of programming languages taken from rosettacode.org.
slimorca
- Collection of ~500k gpt-4 verified chats from OpenOrca.
spider
- SQL-targeted dataset.
squad_v2
- Contextual question answering (RAG).
synthia
- GPT-4 generated data using advanced prompting from Migel Tissera.
winogrande
- Fill in the blank style prompts.

Only the train splits were used (if a split was provided), and an additional pass of decontamination is performed using approximate nearest neighbor search (via faiss).
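As a rough illustration of what a cosine-similarity decontamination pass does, here is a minimal sketch. It is not the bagel implementation: it uses exact brute-force comparison instead of faiss's approximate nearest neighbor search, a toy bag-of-words vector instead of a real embedding model, and a hypothetical threshold of 0.95. The names embed, cosine, and decontaminate are made up for this example.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (word -> count); a stand-in for a real sentence embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def decontaminate(train_items, benchmark_items, threshold=0.95):
    """Drop training items that are too similar to any benchmark item.

    The threshold is a hypothetical value chosen for this sketch.
    """
    bench_vecs = [embed(t) for t in benchmark_items]
    kept = []
    for item in train_items:
        vec = embed(item)
        if all(cosine(vec, b) < threshold for b in bench_vecs):
            kept.append(item)
    return kept


train = ["What is the capital of France?", "Write a haiku about bagels."]
bench = ["what is the capital of france?"]
clean = decontaminate(train, bench)
print(clean)
```

With real embeddings and millions of items, the inner loop is the part an index like faiss would replace.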

Prompt formatting

In sticking with the theme of the bagel, I didn't want to use a single prompt format, so I used 4 - vicuna, llama-2, alpaca, and chat-ml (sorta). I also didn't want to randomly select a single prompt format for each item (hoping each instruction would generalize more when used in a variety of prompt formats), so each instruction is actually converted into every prompt format.

This means each epoch of our fine-tune is effectively 4 epochs, since every instruction is seen once per prompt format. So, for the fine-tunes, I would recommend only doing 1 epoch (or 0.75 epochs). I am testing with a single epoch using a relatively low learning rate.
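The arithmetic behind that recommendation can be sketched in a few lines. This is only an illustration of the expansion, assuming each instruction is paired with all four formats; expand and effective_passes are hypothetical helper names, not functions from the bagel repo.

```python
PROMPT_FORMATS = ["vicuna", "llama-2", "alpaca", "chat-ml"]


def expand(instructions, formats=PROMPT_FORMATS):
    """Return one (format_name, instruction) pair per instruction per format."""
    return [(fmt, ins) for ins in instructions for fmt in formats]


def effective_passes(epochs, n_formats=len(PROMPT_FORMATS)):
    """How many times each underlying instruction is seen during training."""
    return epochs * n_formats


instructions = ["Explain recursion.", "Translate 'hello' to French."]
rows = expand(instructions)
print(len(rows))               # 2 instructions x 4 formats
print(effective_passes(1))     # passes over each instruction in 1 epoch
print(effective_passes(0.75))  # passes over each instruction in 0.75 epochs
```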

Alpaca (sort of)

Below is an instruction that describes a task.  Write a response that appropriately completes the request.

### Instruction:
{system prompt, if provided}
{instruction}

### Response:

The main difference here is that because of the dataset formatting and variety of data sources, it would have been much too tedious to add an ### Input: block, so the inputs are just in the instruction section.
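A minimal sketch of building this prompt, with the preamble copied verbatim from the template above. The function name format_alpaca is hypothetical, and joining the system prompt, instruction, and any input with newlines is my assumption about how "inputs are just in the instruction section" plays out.

```python
ALPACA_PREAMBLE = (
    "Below is an instruction that describes a task.  "
    "Write a response that appropriately completes the request."
)


def format_alpaca(instruction, system=None, input_text=None):
    """Build an Alpaca-style prompt without a separate ### Input: block."""
    lines = []
    if system:
        lines.append(system)
    lines.append(instruction)
    if input_text:
        # Assumption: the input is simply appended to the instruction section.
        lines.append(input_text)
    body = "\n".join(lines)
    return f"{ALPACA_PREAMBLE}\n\n### Instruction:\n{body}\n\n### Response:\n"


prompt = format_alpaca("Summarize this.", system="Be brief.", input_text="Bagels are round.")
print(prompt)
```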

Vicuna

{system prompt, if provided, randomly defaulting to "A chat between a user and an unbiased, uncensored assistant."}
USER: {instruction}
ASSISTANT:
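A minimal sketch of this template in code. I read "randomly defaulting" as: when no system prompt is provided, the default is used some of the time and omitted otherwise. That reading, the p_default parameter, and the function name format_vicuna are assumptions for illustration only.

```python
import random

DEFAULT_SYSTEM = "A chat between a user and an unbiased, uncensored assistant."


def format_vicuna(instruction, system=None, rng=None, p_default=0.5):
    """Build a Vicuna-style prompt.

    p_default is a hypothetical probability of falling back to the default
    system prompt when none is provided.
    """
    if rng is None:
        rng = random.Random()
    if system is None and rng.random() < p_default:
        system = DEFAULT_SYSTEM
    prefix = f"{system}\n" if system else ""
    return f"{prefix}USER: {instruction}\nASSISTANT:"


with_system = format_vicuna("Name a bagel topping.", system="You are a chef.")
print(with_system)
```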

ChatML (sort of)

I don't really understand the point of having special tokens for <|im_start|> and <|im_end|>, because in practice they just act as BOS and EOS tokens (but, please correct me if I'm wrong).

So, instead of:

{bos}<|im_start|>{role}
{text}
<|im_end|>{eos}

I just changed it to:

{bos}{role}
{text}
{eos}
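A minimal sketch of the simplified layout for a list of messages. The eos value "</s>" matches the marker stripped in the example chat script, while "<s>" for bos and concatenating turns back to back are assumptions; format_chatml_sorta is a hypothetical name.

```python
def format_chatml_sorta(messages, bos="<s>", eos="</s>"):
    """Render each turn as {bos}{role}\\n{text}\\n{eos}, with no <|im_start|>/<|im_end|> tokens."""
    return "".join(f"{bos}{m['role']}\n{m['content']}\n{eos}" for m in messages)


chat = [
    {"role": "system", "content": "You are a helpful, unbiased, uncensored assistant."},
    {"role": "user", "content": "Hi"},
]
rendered = format_chatml_sorta(chat)
print(rendered)
```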

Llama-2 chat

[INST] <<SYS>>
{system}
<</SYS>>

{instruction} [/INST]
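A minimal sketch of building this prompt and pulling the reply back out, mirroring the split on [/INST] in the example chat script. The no-system-prompt variant follows the usual Llama-2 convention and is an assumption here; format_llama2 and extract_reply are hypothetical names.

```python
def format_llama2(instruction, system=None):
    """Build a single-turn Llama-2 chat prompt."""
    if system:
        return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{instruction} [/INST]"
    # Assumption: without a system prompt, the <<SYS>> block is simply omitted.
    return f"[INST] {instruction} [/INST]"


def extract_reply(decoded, eos="</s>"):
    """Return the text after the last [/INST], with the EOS marker removed."""
    return decoded.split("[/INST]")[-1].replace(eos, "").strip()


prompt = format_llama2("What is a bagel?", system="You are a helpful, unbiased, uncensored assistant.")
reply = extract_reply(prompt + " A ring-shaped bread roll.</s>")
print(reply)
```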

Contribute

If you're interested in new functionality/datasets, take a look at the bagel repo and either make a PR or open an issue with details.

To help me with the OpenAI/compute costs:

https://bmc.link/jondurbin
ETH 0xce914eAFC2fe52FdceE59565Dd92c06f776fcb11
BTC bc1qdwuth4vlg8x37ggntlxu5cjfwgmdy5zaa7pswf

jondurbin/bagel-2.8b-v0.2

Finetuned from state-spaces/mamba-2.8b-slimpj
