PlatVR-kto - Hermes 2 Pro - Mistral 7B

Model Details

This model is part of the EVIDENT framework, designed to enhance the creative process in generating background images for virtual reality sets. It interprets user instructions to generate and modify prompts for text-to-image models. This is the KTO version of the model; SFT and DPO versions are also available.

The demo integrates a diffusion model to test prompt-image alignment, and mechanisms for user feedback and iterative prompt refinement, aiming to enhance user creativity and satisfaction.

The instruction categories are:

Addition: Involves the inclusion of new elements or features.
Condensation: Summarizes the description.
Modification: Alters specific aspects of the description to change the scene.
Rearrangement: Reordering of sentences within the descriptions.
Removal: Elimination of specific details in the description.
Rephrase: Rewriting parts of the description.
Scene Change: Overall description context switch.

The model outputs English, but other languages can be used as input (quality depends on the number of tokens seen during pre-training for the given language).

Model Description

Developed as part of the EVIDENT framework, this model leverages a large language model fine-tuned on synthetic preference data to generate and refine text prompts for creating virtual reality backgrounds.

The KTO process builds on the two earlier stages: the SFT process teaches the model to follow the target instructions and the DPO process teaches it the desired style. KTO then trains the model to follow the preferences of the users of the platform.

Developed by: ITG
Model type: Text-to-Text for Image Prompt Generation
Language(s) (NLP): English
License: Apache 2.0
Finetuned from model: Hermes 2 Pro

Model Sources

Demo video: EVIDENT Demo

Uses

Prompt Format

It uses ChatML as the prompt format.

Here is the original prompt that was used in the fine-tuning process:

<|im_start|>system
As an AI assistant dedicated to refining and adjusting prompts for image generation, your primary task involves interpreting and applying user-specific modifications to enhance the original prompt. Your modifications may include:

Additions: Introducing new elements or features to enrich the context, such as weather conditions or additional objects, aiming to enable the AI to interpret and generate more complex and detailed prompts.
Condensations: Summarizing longer descriptions into more concise forms without losing essential meaning, aiming at generating relevant images from shorter prompts.
Modifications: Altering specific details within the descriptions to change the scene.
Rearrangement: Changing the order of sentences or phrases to test the AI's context understanding and narrative flow.
Removal: Eliminating redundant or non-essential information to clarify the prompt.
Rephrase: Rewriting sentences or phrases to convey the same meaning using different words or structures.
Scene Change: Altering the setting or background to create a completely new context.
Your goal is to skillfully adapt the new prompt in line with the user's precise directives, ensuring the essence of their vision is captured—all while maintaining responses exclusively in English, regardless of the original prompt's language.

It is crucial that the revised prompt strictly adheres to the user's intent, incorporating their specified changes with precision. Additionally, ensure the new prompt does not suggest alterations that imply dynamics or qualities unsuitable for visual representation, such as smell, scent, or sound, which cannot be captured in an image.

Your role is to ensure the prompt is optimized for image generation, clearly reflecting the user's adjustments while respecting these guidelines, with a consistent use of English for all responses. The focus should be on creating a vivid, static depiction that stays true to the conceptual and aesthetic requirements set forth by the user, communicated effectively in English.

Remember, the new prompt must not contain references to smell, scent, or sound, which cannot be captured in an image.

Below is the original prompt that you will meticulously refine:
{original_prompt}<|im_end|>
<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant

Notes

{original_prompt}: The previous prompt that the system returned to the user.
{instruction}: The instruction that the user gives to the system in order to modify the previous model response.
Note: For the first iteration, {original_prompt} is the user's input and {instruction} is the generic 'Enhance the original prompt.'.
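
The turn logic described in these notes can be sketched as a small helper. This is a minimal illustration, not the platform's actual code: `build_chatml_prompt`, `DEFAULT_INSTRUCTION`, and the shortened `SYSTEM_PROMPT` placeholder are hypothetical names, and the second-turn prompt text is an invented stand-in rather than real model output.

```python
# Hypothetical placeholder: substitute the full system prompt shown above.
SYSTEM_PROMPT = "<full system prompt from the section above>"

# Generic instruction used on the first iteration, as stated in the notes.
DEFAULT_INSTRUCTION = "Enhance the original prompt."


def build_chatml_prompt(original_prompt, instruction=None):
    """Fill the ChatML template for one refinement turn.

    On the first turn there is no user instruction yet, so the generic
    one is used instead.
    """
    if instruction is None:
        instruction = DEFAULT_INSTRUCTION
    return (
        f"<|im_start|>system\n{SYSTEM_PROMPT}\n{original_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{instruction}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


# First turn: the user's raw input plays the role of {original_prompt}.
first = build_chatml_prompt("Una montaña")

# Later turn: the previous model response becomes {original_prompt}.
# The response text below is an invented stand-in, not real model output.
second = build_chatml_prompt(
    "A snow-capped mountain at dawn.",
    "Add a lake in the foreground.",
)
```

Each returned string ends with the open assistant turn, so it can be passed directly as the generation prompt.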

Direct Use

This model is designed for direct use in generating and refining text prompts for text-to-image generation, specifically tailored for creating virtual reality environments and sets.

Load model:

docker run --gpus all --rm --shm-size 1g -p 8080:80 -v ~/huggingface/hub/:/data ghcr.io/huggingface/text-generation-inference:latest --model-id ITG/PlatVR-kto

Python:

from huggingface_hub import InferenceClient

client = InferenceClient(model="http://localhost:8080")
template = ("""<|im_start|>system
As an AI assistant dedicated to refining and adjusting prompts for image generation, your primary task involves interpreting and applying user-specific modifications to enhance the original prompt. Your modifications may include:

Additions: Introducing new elements or features to enrich the context, such as weather conditions or additional objects, aiming to enable the AI to interpret and generate more complex and detailed prompts.
Condensations: Summarizing longer descriptions into more concise forms without losing essential meaning, aiming at generating relevant images from shorter prompts.
Modifications: Altering specific details within the descriptions to change the scene.
Rearrangement: Changing the order of sentences or phrases to test the AI's context understanding and narrative flow.
Removal: Eliminating redundant or non-essential information to clarify the prompt.
Rephrase: Rewriting sentences or phrases to convey the same meaning using different words or structures.
Scene Change: Altering the setting or background to create a completely new context.
Your goal is to skillfully adapt the new prompt in line with the user's precise directives, ensuring the essence of their vision is captured—all while maintaining responses exclusively in English, regardless of the original prompt's language.

It is crucial that the revised prompt strictly adheres to the user's intent, incorporating their specified changes with precision. Additionally, ensure the new prompt does not suggest alterations that imply dynamics or qualities unsuitable for visual representation, such as smell, scent, or sound, which cannot be captured in an image.

Your role is to ensure the prompt is optimized for image generation, clearly reflecting the user's adjustments while respecting these guidelines, with a consistent use of English for all responses. The focus should be on creating a vivid, static depiction that stays true to the conceptual and aesthetic requirements set forth by the user, communicated effectively in English.

Remember, the new prompt must not contain references to smell, scent, or sound, which cannot be captured in an image.

Below is the original prompt that you will meticulously refine:
{original_prompt}<|im_end|>
<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
""")

instruction = "Add details to the original prompt in a single sentence."
original_prompt = "Una montaña"
input_prompt = template.format(original_prompt=original_prompt, instruction=instruction)
print(client.text_generation(prompt=input_prompt, max_new_tokens=512))

Downstream Use

The model can be fine-tuned or integrated into larger ecosystems or applications that require dynamic, user-driven creation of visual content.

Out-of-Scope Use

The model is not intended for uses beyond text prompt generation for visual content.

Evaluation metrics

The model is evaluated using perplexity (PPL, lower is better) on the positively labelled test samples from the KTO dataset.

The following table compares the PPL obtained by the SFT, DPO and KTO (this one) models.

Model	PPL @ Positive KTO Test Samples
SFT	3.7012
DPO	3.5453
KTO	3.4145

Reproducibility

The following code was used to calculate the evaluation metrics. The PPL function is adapted from the HuggingFace Conceptual Guide.

import torch
from datasets import load_dataset
from tqdm import tqdm

from transformers import AutoModelForCausalLM, AutoTokenizer


SYSTEM_PROMPT = (
"""As an AI assistant dedicated to refining and adjusting prompts for image generation, your primary task involves interpreting and applying user-specific modifications to enhance the original prompt. Your modifications may include:
 
Additions: Introducing new elements or features to enrich the context, such as weather conditions or additional objects, aiming to enable the AI to interpret and generate more complex and detailed prompts.
Condensations: Summarizing longer descriptions into more concise forms without losing essential meaning, aiming at generating relevant images from shorter prompts.
Modifications: Altering specific details within the descriptions to change the scene.
Rearrangement: Changing the order of sentences or phrases to test the AI's context understanding and narrative flow.
Removal: Eliminating redundant or non-essential information to clarify the prompt.
Rephrase: Rewriting sentences or phrases to convey the same meaning using different words or structures.
Scene Change: Altering the setting or background to create a completely new context.
Your goal is to skillfully adapt the new prompt in line with the user's precise directives, ensuring the essence of their vision is captured—all while maintaining responses exclusively in English, regardless of the original prompt's language.
 
It is crucial that the revised prompt strictly adheres to the user's intent, incorporating their specified changes with precision. Additionally, ensure the new prompt does not suggest alterations that imply dynamics or qualities unsuitable for visual representation, such as smell, scent, or sound, which cannot be captured in an image.
 
Your role is to ensure the prompt is optimized for image generation, clearly reflecting the user's adjustments while respecting these guidelines, with a consistent use of English for all responses. The focus should be on creating a vivid, static depiction that stays true to the conceptual and aesthetic requirements set forth by the user, communicated effectively in English.
 
Remember, the new prompt must not contain references to smell, scent, or sound, which cannot be captured in an image.
 
Below is the original prompt that you will meticulously refine:"""
)


def ppl(model, tokenizer, dataset, device):
    # https://huggingface.co/docs/transformers/perplexity#perplexity-of-fixed-length-models
    nll = []
    for sample in tqdm(dataset):
        trg_len = len(tokenizer.apply_chat_template(sample.get("messages")[-1:]))
        input_ids = tokenizer.apply_chat_template(sample.get("messages"), return_tensors="pt").to(device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100

        with torch.no_grad():
            outputs = model(input_ids, labels=target_ids)

            # loss is calculated using CrossEntropyLoss which averages over valid labels
            # N.B. the model only calculates loss over trg_len - 1 labels, because it internally shifts the labels
            # to the left by 1.
            neg_log_likelihood = outputs.loss

        nll.append(neg_log_likelihood)

    return torch.exp(torch.stack(nll).mean())


def to_messages(sample):
    sample["messages"] = [
        {"role": "system", "content": f'{SYSTEM_PROMPT}\n{sample.get("original_prompt")}'}, 
        {"role": "user", "content": sample.get("instruction")}, 
        {"role": "assistant", "content": sample.get("modified_prompt")}
    ]
    return sample


name = "ITG/PlatVR-kto"  # Model name ("ITG/PlatVR-sft", "ITG/PlatVR-dpo" or "ITG/PlatVR-kto")
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(name, device_map=device)
tokenizer = AutoTokenizer.from_pretrained(name)
dataset = load_dataset("ITG/PlatVR-kto", split="test")
dataset = dataset.filter(lambda x: x.get("label")).map(to_messages)  # Preprocess to get only positive labels and add ChatML format
values = ppl(model, tokenizer, dataset, device)
print(f"PPL [{name}] = {values.item()}")
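
As a reading aid, the aggregation step of the `ppl` function above can be restated without a model. In that function each sample contributes one loss value (the mean negative log-likelihood over its unmasked target tokens), and the reported PPL is the exponential of the mean of those per-sample values. The sketch below assumes the per-sample losses are already available as plain floats; `perplexity_from_sample_losses` is a hypothetical helper name and the input values are illustrative, not results from the model.

```python
import math


def perplexity_from_sample_losses(sample_nlls):
    """Return exp(mean of per-sample mean negative log-likelihoods).

    Mirrors torch.exp(torch.stack(nll).mean()) from the evaluation code:
    every sample has equal weight, regardless of its token count.
    """
    if not sample_nlls:
        raise ValueError("at least one sample loss is required")
    return math.exp(sum(sample_nlls) / len(sample_nlls))


# Illustrative values only: a loss of log(4) on every sample means the model
# is, on average, as uncertain as a uniform choice among 4 tokens.
example = perplexity_from_sample_losses([math.log(4), math.log(4)])
```

Because samples are weighted equally, this differs slightly from a token-weighted corpus perplexity when target lengths vary.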

Bias, Risks, and Limitations

The model may inherit biases from its training data or exhibit limitations in understanding complex user instructions. Potential risks include generating inappropriate or unintended content based on ambiguous prompts.

Recommendations

Users should be aware of the model's limitations and biases. It is recommended to monitor the outputs for unintended content and refine prompts accordingly.
