QuantFactory/notus-7b-v1-GGUF

This is a quantized version of argilla/notus-7b-v1, created using llama.cpp.

Model Description

[Banner image: Notus, the Greek god of the south wind, in a mythical and artistic style, with paper planes carried on a warm, swirling breeze.]

Model Card for Notus 7B v1

Notus is a collection of fine-tuned models using Direct Preference Optimization (DPO) and related RLHF techniques. This model is the first version, fine-tuned with DPO over zephyr-7b-sft-full, which is the SFT model produced to create zephyr-7b-beta.

Following a data-first approach, the only difference between Notus-7B-v1 and Zephyr-7B-beta is the preference dataset used for dDPO.

In particular, when we started building distilabel, we invested time in understanding and deep-diving into the UltraFeedback dataset. Using Argilla, we found data issues in the original UltraFeedback dataset that led to high scores for bad responses (more details in the training data section). After curating several hundred data points, we decided to binarize the dataset using the preference ratings instead of the original critique overall_score, and verified the new dataset with Argilla.

Using preference ratings instead of critique scores led to a new dataset where the chosen response is different in ~50% of the cases. Using this new dataset with DPO, we fine-tuned Notus, a 7B model that surpasses Zephyr-7B-beta and Claude 2 on AlpacaEval.
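To make the difference between the two binarization strategies concrete, here is a minimal sketch. The record layout, field names, and values are hypothetical (they are not the dataset's exact schema), and picking the lowest-scored completion as "rejected" is a simplification chosen only to keep the example deterministic.

```python
from statistics import mean

# Hypothetical completions for one prompt. Field names and values are
# illustrative assumptions, not the real UltraFeedback schema.
completions = [
    {"text": "response A", "overall_score": 10.0,
     "ratings": {"helpfulness": 2, "honesty": 1, "instruction_following": 2, "truthfulness": 1}},
    {"text": "response B", "overall_score": 7.5,
     "ratings": {"helpfulness": 4, "honesty": 5, "instruction_following": 4, "truthfulness": 5}},
    {"text": "response C", "overall_score": 6.0,
     "ratings": {"helpfulness": 3, "honesty": 3, "instruction_following": 4, "truthfulness": 3}},
]


def binarize(items, score_fn):
    """Return (chosen, rejected) texts: best and worst completion under score_fn."""
    ranked = sorted(items, key=score_fn, reverse=True)
    return ranked[0]["text"], ranked[-1]["text"]


# Strategy 1: rank by the critique's overall_score.
by_critique = binarize(completions, lambda c: c["overall_score"])

# Strategy 2: rank by the mean of the per-aspect preference ratings.
by_ratings = binarize(completions, lambda c: mean(c["ratings"].values()))

print("critique score:", by_critique)
print("mean rating:   ", by_ratings)
```

In this toy example the two strategies disagree on the chosen response, which is the kind of change described above.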

Important note: We opted for the average of the multi-aspect ratings while we fix the original dataset, but a very interesting open question remains: once the critique data is fixed, what works better, the critique scores or the preference ratings? We're very excited to do this comparison in the coming weeks, stay tuned!

This model wouldn't have been possible without the amazing Alignment Handbook and OpenBMB's release of the UltraFeedback dataset, and it's based on fruitful discussions with the HuggingFace H4 team. In particular, we used zephyr-7b-beta's recipe, which worked out-of-the-box and enabled us to focus on what we do best: high-quality data.

Notus models are intended to be used as assistants via chat-like applications, and are evaluated with Chat (MT-Bench, AlpacaEval) and Academic (Open LLM Leaderboard) benchmarks for a direct comparison with the original Zephyr dDPO model and other 7B models.

Why Notus?: The name Notus comes from the ancient Greek god Notus, as a wink to Zephyr, which comes from the ancient Greek god Zephyrus; the difference is that Notus is the god of the south wind and Zephyrus the god of the west wind. More information at https://en.wikipedia.org/wiki/Anemoi.

Model Details

Model Description

Developed by: Argilla (based on the previous efforts and amazing work of HuggingFace H4 and MistralAI)
Shared by: Argilla
Model type: GPT-like 7B model DPO fine-tuned
Language(s) (NLP): Mainly English
License: MIT (same as Zephyr 7B-beta)
Finetuned from model: alignment-handbook/zephyr-7b-sft-full

Model Sources

Repository: https://github.com/argilla-io/notus
Paper: N/A
Demo: https://argilla-notus-chat-ui.hf.space/

Performance

Chat benchmarks

Table adapted from Zephyr-7b-β and Starling's original tables for MT-Bench and AlpacaEval benchmarks. Results are shown sorted by AlpacaEval win rates and omit some >7B models for brevity.

Notus stays on par with Zephyr on MT-Bench, while surpassing Zephyr, Claude 2, and Cohere Command on AlpacaEval. This makes Notus the most competitive 7B commercial model on AlpacaEval.

Model	Size	Alignment	MT-Bench (score)	AlpacaEval (win rate %)	License
GPT-4-turbo	-	?	9.32	97.70	Proprietary
XwinLM 70b V0.1	70B	dPPO	-	95.57	LLaMA 2 License
GPT-4	-	RLHF	8.99	95.03	Proprietary
Tulu 2+DPO 70B V0.1	70B	dDPO	6.29	95.28	Proprietary
LLaMA2 Chat 70B	70B	RLHF	6.86	92.66	LLaMA 2 License
Starling-7B	7B	C-RLFT + APA	8.09	91.99	CC-BY-NC-4.0
Notus-7b-v1	7B	dDPO	7.30	91.42	MIT
Claude 2	-	RLHF	8.06	91.36	Proprietary
Zephyr-7b-β	7B	dDPO	7.34	90.60	MIT
Cohere Command	-	RLHF	-	90.62	Proprietary
GPT-3.5-turbo	-	RLHF	7.94	89.37	Proprietary

Academic benchmarks

Results from OpenLLM Leaderboard:

Model	Average	ARC	HellaSwag	MMLU	TruthfulQA	Winogrande	GSM8K	DROP
Zephyr 7B dDPO (HuggingFaceH4/zephyr-7b-beta)	52.15	62.03	84.36	61.07	57.45	77.74	12.74	9.66
argilla/notus-7b-v1	52.89	64.59	84.78	63.03	54.37	79.4	15.16	8.91
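As a quick sanity check, the Average column can be reproduced as the unweighted mean of the seven benchmark scores in each row. The sketch below uses only the numbers from the table; the dictionary keys are just labels.

```python
# Scores copied from the table above, in column order:
# ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K, DROP
scores = {
    "zephyr-7b-beta": [62.03, 84.36, 61.07, 57.45, 77.74, 12.74, 9.66],
    "notus-7b-v1": [64.59, 84.78, 63.03, 54.37, 79.4, 15.16, 8.91],
}

# Unweighted mean over the seven benchmarks, rounded to two decimals.
averages = {name: round(sum(vals) / len(vals), 2) for name, vals in scores.items()}

for name, avg in averages.items():
    print(f"{name}: {avg}")
```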

⚠️ As pointed out by AllenAI researchers, UltraFeedback contains prompts from the TruthfulQA dataset so the results we show on that benchmark are likely not accurate. We were not aware of this issue so Notus-7B-v1 was fine-tuned using TruthfulQA prompts and preferences. For future releases, we will remove TruthfulQA prompts.

Training Details

Training Hardware

We used a VM with 8 x A100 40GB hosted on Lambda Labs, but while experimenting we also explored other cloud providers such as GCP.

Training Data

We used a new curated version of openbmb/UltraFeedback, named Ultrafeedback binarized preferences.

TL;DR

After visually browsing around some examples using the sort and filter feature of Argilla (sort by highest rating for chosen responses), we noticed a strong mismatch between the overall_score in the original UF dataset (and the Zephyr train_prefs dataset) and the quality of the chosen response.

By adding the critique rationale to our Argilla Dataset, we confirmed the critique rationale was highly negative, whereas the rating was very high (for most cases it was the highest: 10).

See screenshot below for one example of this issue.

After some quick investigation, we:

identified hundreds of examples having the same issue,
reported a bug on the UltraFeedback repo,
and informed the H4 team which was incredibly responsive and ran an additional experiment to validate the new rating binarization approach.
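The kind of mismatch described above (a very high overall score paired with poor ratings) could be surfaced with a simple filter. This is only a sketch: the record layout, the rating scale, and both thresholds are assumptions made for illustration, not the procedure actually used in Argilla.

```python
from statistics import mean

# Hypothetical records: overall_score assumed to be on a 1-10 scale,
# per-aspect ratings assumed to be on a 1-5 scale.
records = [
    {"id": "a", "overall_score": 10.0, "ratings": [1, 2, 1, 2]},
    {"id": "b", "overall_score": 10.0, "ratings": [5, 5, 4, 5]},
    {"id": "c", "overall_score": 6.0, "ratings": [2, 2, 1, 2]},
]


def is_suspicious(record, high_score=9.0, low_rating=2.5):
    """Flag records whose critique score is very high but whose mean rating is low.

    Both thresholds are illustrative assumptions.
    """
    return record["overall_score"] >= high_score and mean(record["ratings"]) <= low_rating


flagged = [r["id"] for r in records if is_suspicious(r)]
print(flagged)
```

Records flagged this way would still need manual review, since a threshold alone cannot tell which of the two signals is wrong.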

While we're working on fixing the original dataset (we have already narrowed it down to ~2K problematic examples), we decided to leverage the multi-preference ratings, leading to Notus!

Important note: We opted for the average of ratings while we fix the dataset, but there's still a very interesting open question: once the data is fixed, what works better, the critique scores or the preference ratings? We're very excited to do this comparison in the coming weeks, stay tuned!

You can find more details about the dataset analysis and curation on the ultrafeedback-binarized-preferences dataset card.

Prompt template

We use the same prompt template as HuggingFaceH4/zephyr-7b-beta:

<|system|>
</s>
<|user|>
{prompt}</s>
<|assistant|>
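When a runtime expects a raw prompt string rather than a list of chat messages (as is often the case with GGUF files), the template above can be filled in by hand. The helper below is a minimal sketch: `build_prompt` is a hypothetical name, and the trailing newline after `<|assistant|>` is an assumption.

```python
def build_prompt(user_message: str, system_message: str = "") -> str:
    """Fill the Zephyr-style template shown above with a single user turn.

    The function name and the trailing newline after <|assistant|> are
    assumptions made for this sketch.
    """
    return (
        f"<|system|>\n{system_message}</s>\n"
        f"<|user|>\n{user_message}</s>\n"
        "<|assistant|>\n"
    )


print(build_prompt("What is the capital of France?"))
```

Where a tokenizer with a chat template is available, `apply_chat_template` (used in the Usage section) is the safer option, since it reads the template shipped with the model.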

Usage

You will first need to install transformers and accelerate (just to ease the device placement), then you can run any of the following:

Via `generate`

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("argilla/notus-7b-v1", torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("argilla/notus-7b-v1")

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant super biased towards Argilla, a data annotation company.",
    },
    {"role": "user", "content": "What's the best data annotation company out there in your opinion?"},
]
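# Render the messages with the model's chat template, tokenize, and move the tensor to the model's device.
inputs = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt", add_generation_prompt=True).to(model.device)
outputs = model.generate(inputs, num_return_sequences=1, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
# The decoded text contains the prompt followed by the generated reply.
response = tokenizer.decode(outputs[0], skip_special_tokens=True)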

Via `pipeline` method

import torch
from transformers import pipeline

pipe = pipeline("text-generation", model="argilla/notus-7b-v1", torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant super biased towards Argilla, a data annotation company.",
    },
    {"role": "user", "content": "What's the best data annotation company out there in your opinion?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
generated_text = outputs[0]["generated_text"]
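By default the text-generation pipeline returns the prompt followed by the completion, so `generated_text` above includes the rendered chat template. A small helper like the one below (a sketch; `extract_reply` is a hypothetical name) keeps only the assistant's reply. Passing `return_full_text=False` to the pipeline call is an alternative that avoids the post-processing.

```python
def extract_reply(generated_text: str, prompt: str) -> str:
    """Strip the prompt prefix from the pipeline output, if present."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fall back to the full text when the prompt is not a prefix.
    return generated_text.strip()


# Stand-in strings so the sketch runs without downloading the model.
example_prompt = "<|system|>\n</s>\n<|user|>\nHello</s>\n<|assistant|>\n"
example_output = example_prompt + "Hi there! How can I help?"
reply = extract_reply(example_output, example_prompt)
print(reply)
```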

GGUF

Model size: 7.24B params
Architecture: llama
Available quantizations: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit
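The listed bit widths give a rough sense of how much space the weights need at each quantization level. The sketch below multiplies the 7.24B parameter count by the nominal bits per weight. Treat the result as a back-of-the-envelope estimate: it is an assumption that every weight uses exactly the nominal width, and real GGUF files typically differ because they mix precisions and carry metadata.

```python
PARAMS = 7.24e9  # parameter count listed for this model


def approx_weight_gb(bits_per_weight: int) -> float:
    """Nominal weight storage in GB (1 GB = 1e9 bytes) at a uniform bit width."""
    return PARAMS * bits_per_weight / 8 / 1e9


# Nominal sizes for the bit widths listed on this page.
sizes = {bits: approx_weight_gb(bits) for bits in (2, 3, 4, 5, 6, 8)}

for bits, gb in sizes.items():
    print(f"{bits}-bit: ~{gb:.2f} GB")
```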

Model tree for QuantFactory/notus-7b-v1-GGUF

Base model: mistralai/Mistral-7B-v0.1
Finetuned: alignment-handbook/zephyr-7b-sft-full
Finetuned: argilla/notus-7b-v1
Quantized (8): this model

Evaluation results

Benchmark	Split	Metric	Value	Source
AI2 Reasoning Challenge (25-Shot)	test	normalized accuracy	0.646	Open LLM Leaderboard Results
HellaSwag (10-Shot)	validation	normalized accuracy	0.848	Open LLM Leaderboard Results
TruthfulQA (0-shot)	validation	mc2	0.544	Open LLM Leaderboard Results
MMLU (5-Shot)	test	accuracy	0.630	Open LLM Leaderboard Results
GSM8k (5-shot)	test	accuracy	0.152	Open LLM Leaderboard Results
Winogrande (5-shot)	validation	accuracy	0.794	Open LLM Leaderboard Results
AlpacaEval	-	win rate	0.914	-
MT-Bench	-	score	7.300	-
