QwQ-32B-ArliAI-RpR-v1


=====================================

RpR Series Overview: Building on RPMax with Reasoning

RpR (RolePlay with Reasoning) is a new series of models from ArliAI. This series builds directly upon the successful dataset curation methodology and training methods developed for the RPMax series.

RpR models use the same curated, deduplicated RP and creative writing dataset used for RPMax, with a focus on variety to ensure high creativity and minimize cross-context repetition. Users familiar with RPMax will recognize the unique, non-repetitive writing style unlike other finetuned-for-RP models.

With the release of QwQ as the first high-performing open-source reasoning model that can be easily trained, it was clear that the available instruct and creative-writing reasoning datasets contain only one response per example. Training reasoning models on this type of single-response dataset causes degraded output quality in long multi-turn chats, which is why Arli AI decided to create a real RP model capable of long multi-turn chat with reasoning.

In order to create RpR, we first had to create the reasoning RP dataset by re-processing our existing known-good RPMax dataset into a reasoning dataset. This was done by using the base QwQ Instruct model itself to generate the reasoning process for every turn in the RPMax conversation examples, which was then further refined to make sure the reasoning is in line with the actual response examples from the dataset.
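
To make that step concrete, here is a minimal sketch of what this re-processing could look like, assuming an OpenAI-compatible endpoint serving base QwQ. The endpoint URL, model name, file names, and prompt wording are illustrative assumptions, not the exact ArliAI pipeline.

```python
# Illustrative sketch only: attach a reasoning trace to every assistant turn of
# an existing RP conversation, using base QwQ behind an OpenAI-compatible API.
# Endpoint, model name, file names and prompt wording are assumptions.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # hypothetical local server

def add_reasoning(conversation: list[dict]) -> list[dict]:
    """For each assistant turn, ask QwQ to write the reasoning that would lead
    to the already-known reply, then store it alongside that turn."""
    augmented = []
    for i, turn in enumerate(conversation):
        if turn["role"] == "assistant":
            prompt = (
                "Given the roleplay so far, write the step-by-step reasoning an "
                "author would go through before writing this exact reply:\n\n"
                + turn["content"]
            )
            resp = client.chat.completions.create(
                model="QwQ-32B",  # assumed name of the served model
                messages=conversation[:i] + [{"role": "user", "content": prompt}],
            )
            turn = {**turn, "reasoning": resp.choices[0].message.content}
        augmented.append(turn)
    return augmented

with open("rpmax_conversations.jsonl") as src, open("rpr_reasoning.jsonl", "w") as dst:
    for line in src:
        example = json.loads(line)
        example["conversation"] = add_reasoning(example["conversation"])
        dst.write(json.dumps(example) + "\n")
```

The refinement step mentioned above, checking that the generated reasoning actually leads to the stored reply and regenerating it when it does not, would then run over the output file.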

Another important thing to get right is making sure the model is trained on examples that present reasoning blocks the same way it encounters them during inference, which is: never seeing prior reasoning blocks in its context. To achieve this, the training run was completed using axolotl with a manual, template-free segments dataset, so the model is never trained to see reasoning blocks in the context, exactly as it will be used during inference.
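
For readers who want a picture of this masking, the sketch below lays out a single training row in an axolotl-style template-free segments format (segments with text and label fields, where the label controls whether tokens contribute to the loss). The ChatML layout and field usage here are an approximation of that format, not a dump of the real training data.

```python
# Sketch of one template-free training row in an axolotl-style "segments"
# layout (label=False segments are context only and excluded from the loss).
# The ChatML tags match QwQ; field usage is an approximation, not the real data.
import json

def build_row(turns: list[dict]) -> dict:
    """Earlier turns go into the context WITHOUT their reasoning; only the
    final assistant turn is trained on, WITH its <think> block."""
    segments = []
    for turn in turns[:-1]:
        # prior assistant turns contribute only their visible reply to the context
        segments.append({
            "label": False,
            "text": f"<|im_start|>{turn['role']}\n{turn['content']}<|im_end|>\n",
        })
    last = turns[-1]
    segments.append({"label": False, "text": "<|im_start|>assistant\n"})
    segments.append({
        "label": True,  # learn to reason first, then reply
        "text": f"<think>\n{last['reasoning']}\n</think>\n\n{last['content']}<|im_end|>\n",
    })
    return {"segments": segments}

row = build_row([
    {"role": "user", "content": "The tavern door creaks open..."},
    {"role": "assistant", "reasoning": "Seraphina is cautious, so she should observe first...",
     "content": "Seraphina lowers her hood and scans the room."},
])
print(json.dumps(row, indent=2))
```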

The result of training QwQ on this dataset with this method is consistently coherent and interesting output, even in long multi-turn RP chats. As far as we know, this is the first correctly-trained reasoning model for RP and creative writing.

You can access the model at https://arliai.com and we also have a models ranking page at https://www.arliai.com/models-ranking

Ask questions in our new Discord Server https://discord.com/invite/t75KbPgwhk or on our subreddit https://www.reddit.com/r/ArliAI/

Model Description

QwQ-32B-ArliAI-RpR-v1 is the first release in the RpR series. It is a 32-billion parameter model fine-tuned using the curated RPMax dataset combined with techniques to maintain reasoning abilities in long multi-turn chats.

Specs

  • Base Model: QwQ-32B
  • Max Context Length: 128K (Realistically 32K)
  • Parameters: 32B
  • Reasoning Model: Yes

Training Details

  • Sequence Length: 8192
  • Epochs: 1 epoch training (Inherited from RPMax methods)
  • Fine-tuning Method: RS-QLORA+ (Rank-Stabilized LoRA + LoRA Plus); see the sketch after this list
  • Rank/Alpha: 128-rank 128-alpha
  • Learning Rate: 0.000005
  • Gradient accumulation: 32
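
A rough approximation of this setup with Hugging Face PEFT and plain PyTorch is sketched below: use_rslora=True gives rank-stabilized scaling, and giving the LoRA B matrices a larger learning rate than the A matrices captures the LoRA+ idea. The target modules, the B/A learning-rate ratio, and the omission of the 4-bit (QLoRA) quantization step are assumptions made for brevity; only rank, alpha, and the base learning rate come from the list above.

```python
# Minimal sketch of an RS-LoRA + LoRA+-style setup with PEFT and PyTorch.
# Rank/alpha/base LR match the list above; the lora_B LR multiplier and the
# target modules are illustrative assumptions, not published ArliAI values.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/QwQ-32B", torch_dtype=torch.bfloat16, device_map="auto"
)

lora_cfg = LoraConfig(
    r=128,
    lora_alpha=128,
    use_rslora=True,  # rank-stabilized scaling: alpha / sqrt(r)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)

base_lr = 5e-6
lr_ratio = 16  # LoRA+-style: train B faster than A (assumed ratio)
a_params = [p for n, p in model.named_parameters() if "lora_A" in n]
b_params = [p for n, p in model.named_parameters() if "lora_B" in n]
optimizer = torch.optim.AdamW([
    {"params": a_params, "lr": base_lr},
    {"params": b_params, "lr": base_lr * lr_ratio},
])
```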

Quantization

How to use reasoning models correctly in ST (SillyTavern)


For any reasoning models in general, you need to make sure to set:

  • Prefix is set to ONLY <think> and the suffix is set to ONLY </think>, without any spaces or newlines (enter)

  • Reply starts with <think>

  • Always add character names is unchecked

  • Include names is set to never

  • As always, the chat template should also conform to the model being used

Note: Reasoning models work properly only if include names is set to never, since they expect the eos token of the user turn to be followed by the <think> token in order to start reasoning before outputting their response. If include names is set to enabled, the character name will always be appended at the end, like "Seraphina:", which confuses the model about whether it should respond or reason first.
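
One way to sanity-check this is to render the prompt yourself and look at how it ends. The snippet below uses the base QwQ tokenizer on the assumption that the RpR finetune keeps the same chat template; the exact rendered string depends on the tokenizer version.

```python
# Render a one-turn prompt and inspect its tail: the prompt should end with the
# assistant header (plus the opening <think> tag on templates that include it)
# and nothing else. An appended "Seraphina:" here is exactly what confuses the model.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")  # assumed to share RpR's template
messages = [{"role": "user", "content": "Seraphina: The tavern door creaks open..."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(repr(prompt[-60:]))
```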

The rest of your sampler parameters can be set as you wish as usual.

If you don't see the reasoning wrapped inside the thinking block, then either your settings are still wrong and don't follow the example above, or your ST version is too old and lacks reasoning-block auto-parsing.

If the whole response ends up inside the reasoning block, then your reasoning token prefix and suffix probably have an extra space or newline, or the model simply isn't a reasoning model smart enough to consistently put its reasoning between those tokens.
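
If you want to check the raw output yourself, the split frontends perform is essentially a strict match on the think tags, which is why a stray space or newline in the configured prefix/suffix breaks detection. The snippet below is an illustration of that split, assuming the completion contains the full tagged block (if your frontend already injects the opening tag, the raw text will start after it).

```python
# Illustration of splitting a raw completion into reasoning and reply.
# The match is strict, so an extra space or newline in the configured
# prefix/suffix is enough to stop the block from being detected.
import re

def split_reasoning(raw: str) -> tuple[str, str]:
    m = re.match(r"<think>\n?(.*?)\n?</think>\s*(.*)", raw, flags=re.DOTALL)
    if not m:
        return "", raw  # no reasoning block detected, treat everything as the reply
    return m.group(1).strip(), m.group(2).strip()

raw = "<think>\nShe should react cautiously here...\n</think>\n\nSeraphina narrows her eyes."
reasoning, reply = split_reasoning(raw)
print(reasoning)  # She should react cautiously here...
print(reply)      # Seraphina narrows her eyes.
```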

If you set everything up correctly, it should look like this:

[Screenshot: the reasoning correctly parsed into the thinking block in SillyTavern]

Details: The RPMax Foundation (Dataset & Training Philosophy)

The following sections detail the core philosophy behind the dataset and training methodology originally developed for RPMax, which serves as the foundation for the RpR series.

The Goal: Reduced Repetition and Higher Creativity

The goal of the dataset curation used for both RPMax and RpR is to reduce repetition and increase the model's ability to write creatively in the different situations presented to it. In practice, this means a model that responds very differently from one situation to the next, without falling into predictable tropes.

What is repetition and creativity?

First of all, creativity here means the variety of output the model is capable of producing; it should not be confused with pleasant prose. When a model writes in a way that reads nicely, like an author writing a novel, that is not creative writing, it is just a pleasant prose style. So a model that writes nicely is not necessarily a creative model.

Repetition and creativity are essentially intertwined: if a model is repetitive, it can also be said to be uncreative, as it cannot write new things and only repeats responses similar to ones it has produced before. There are actually two very different forms of repetition.

In-context repetition: When people say a model is repetitive, they usually mean a model that likes to repeat the same phrases within a single conversation. An example is a model that writes that a character "flicks her hair and..." and then starts prepending "flicks her hair and..." to every other action that character does.

It can be said that the model is boring, but even in human writing this kind of repetition is sometimes intentional, used to subtly make a point or showcase a character's traits. So this type of repetition is not always bad, and completely discouraging a model from doing it does not always improve its writing ability.

In this regard, RPMax and RpR are not yet focused on eliminating this type of repetition, so some in-context repetition may still show up in the outputs. Eliminating it will be the next big step for the RPMax and RpR series of models.

Cross-context repetition: A second, worse type of repetition is a model's tendency to repeat the same phrases or tropes in very different situations. An example is a model that keeps using the infamous "shivers down my spine" phrase in wildly different conversations where that phrase doesn't fit.

This type of repetition is ALWAYS bad, as it is a sign that the model has over-fitted to a style of "creative writing" it has seen too often in the training dataset. A model's tendency toward cross-context repetition also usually shows in how it keeps choosing the same names when writing stories, such as the infamous "Elara" and "Whispering Woods".

The primary goal of the dataset curation for RPMax and RpR is to create a highly creative model by reducing cross-context repetition, as that is the type of repetition that follows you through different conversations. This is combated by making sure the dataset does not have repetitions of the same situations or characters in different example entries.

Dataset Curation

The success of models trained on this dataset (including RPMax and now RpR) is thanks to the training method and the unique dataset created for fine-tuning. It combines as many open-source creative writing and RP datasets as could be found (all from Hugging Face), curated to weed out datasets that are purely synthetic generations, as those often only dumb down the model and teach it GPT-isms (slop) rather than help.

Then Llama 3.1 8B (or a similarly capable model) is used to create a database of the characters and situations portrayed in these datasets, which is then used to de-dupe them so that there is only a single entry for any given character or situation.
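
As a rough illustration of that de-duplication step (not the actual ArliAI tooling), an auxiliary model can be asked to summarize each entry into a short character-and-situation key, keeping only the first entry per key. The endpoint, model name, file names, prompt, and exact-match rule below are all assumptions; a real pass would likely also need fuzzy or embedding-based matching.

```python
# Illustrative de-duplication sketch: summarize each RP example into a
# normalized "character + situation" key with a small helper model, then keep
# only the first example per key. Prompt wording and model name are assumptions.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # e.g. serving Llama-3.1-8B-Instruct

def describe(example_text: str) -> str:
    resp = client.chat.completions.create(
        model="llama-3.1-8b-instruct",
        messages=[{
            "role": "user",
            "content": "In at most 10 words, name the main character and the "
                       "situation of this roleplay:\n\n" + example_text[:4000],
        }],
    )
    return resp.choices[0].message.content.strip().lower()

seen: set[str] = set()
with open("merged_rp_datasets.jsonl") as src, open("deduped.jsonl", "w") as dst:
    for line in src:
        example = json.loads(line)
        key = describe(example["text"])
        if key in seen:
            continue  # same character/situation already present, drop it
        seen.add(key)
        dst.write(line)
```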

The Golden Rule of Fine-Tuning

Unlike the initial pre-training stage, where for the most part the more data you throw at the model the better it becomes, the golden rule for fine-tuning isn't quantity but quality. The dataset used here is therefore orders of magnitude smaller than it would be if repeated characters and situations were kept, but the end result is a model that does not feel like just another inbred creative writing/RP model.

Training Parameters and Unconventional Approach

The usual way is to have a low learning rate and high gradient accumulation for better loss stability, and then run multiple epochs of the training run until the loss is acceptable.

The RPMax and RpR methodology, however, uses only a single epoch, low gradient accumulation, and a higher-than-normal learning rate. The loss curve during training is unstable and jumps up and down a lot, but when smoothed it decreases steadily over time. The theory is that this lets the model learn much more from each individual example in the dataset, and by never showing the model the same example twice across multiple epochs, it stops the model from latching onto and reinforcing a single character or story trope.

The loss jumps up and down during training because each new entry the model is trained on is unlike anything it has seen before, so it cannot really predict an answer similar to that example. The relatively high final loss of 1.0 or slightly above is also acceptable, because the goal was never to create a model that outputs exactly like its training dataset, but rather one that is creative enough to make up its own style of responses.

This is different from training a model in a particular domain and needing the model to reliably be able to output like the example dataset, such as when training a model on a company's internal knowledge base.


Try It Out!

Model preference is subjective, so please do try QwQ-32B-ArliAI-RpR-v1 for yourself. Your feedback, both good and bad, is always valuable and will help us improve future RPMax and RpR models.

GGUF
  • Model size: 32.8B params
  • Architecture: qwen2
  • Available quantizations: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit


Model tree for ArliAI/QwQ-32B-ArliAI-RpR-v1-GGUF

  • Base model: Qwen/Qwen2.5-32B
  • Finetuned: Qwen/QwQ-32B
  • Quantized: this model