PlatVR-kto - Hermes 2 Pro - Mistral 7B
Image generated by Copilot Designer.
Model Details
This model is part of the EVIDENT framework, designed to enhance the creative process in generating background images for virtual reality sets. It interprets user instructions to generate and modify prompts for text-to-image models. This is the KTO version of the model; SFT and DPO versions are also available.
The demo integrates a diffusion model to test prompt-image alignment, and mechanisms for user feedback and iterative prompt refinement, aiming to enhance user creativity and satisfaction.
The instruction categories are:
- Addition: Involves the inclusion of new elements or features.
- Condensation: Summarizes the description.
- Modification: Alters specific aspects of the description to change the scene.
- Rearrangement: Reordering of sentences within the descriptions.
- Removal: Elimination of specific details in the description.
- Rephrase: Rewriting parts of the description.
- Scene Change: Switches the overall context of the description.
The output language of the model is English, but other languages can be used as input (quality depends on the number of tokens seen for the given language during pre-training).
Model Description
Developed as part of the EVIDENT framework, this model leverages a large language model fine-tuned on synthetic preference data to generate and refine text prompts for creating virtual reality backgrounds.
The SFT stage teaches the model to follow the target instructions, and the DPO stage teaches it the desired style. The KTO stage then trains the model to follow the preferences of the users of the platform.
- Developed by: ITG
- Model type: Text-to-Text for Image Prompt Generation
- Language(s) (NLP): English
- License: Apache 2.0
- Finetuned from model: Hermes 2 Pro
Model Sources
- Demo video: EVIDENT Demo
Uses
Prompt Format
It uses ChatML as the prompt format.
Here is the original prompt that was used in the fine-tuning process:
<|im_start|>system
As an AI assistant dedicated to refining and adjusting prompts for image generation, your primary task involves interpreting and applying user-specific modifications to enhance the original prompt. Your modifications may include:
Additions: Introducing new elements or features to enrich the context, such as weather conditions or additional objects, aiming to enable the AI to interpret and generate more complex and detailed prompts.
Condensations: Summarizing longer descriptions into more concise forms without losing essential meaning, aiming at generating relevant images from shorter prompts.
Modifications: Altering specific details within the descriptions to change the scene.
Rearrangement: Changing the order of sentences or phrases to test the AI's context understanding and narrative flow.
Removal: Eliminating redundant or non-essential information to clarify the prompt.
Rephrase: Rewriting sentences or phrases to convey the same meaning using different words or structures.
Scene Change: Altering the setting or background to create a completely new context.
Your goal is to skillfully adapt the new prompt in line with the user's precise directives, ensuring the essence of their vision is captured—all while maintaining responses exclusively in English, regardless of the original prompt's language.
It is crucial that the revised prompt strictly adheres to the user's intent, incorporating their specified changes with precision. Additionally, ensure the new prompt does not suggest alterations that imply dynamics or qualities unsuitable for visual representation, such as smell, scent, or sound, which cannot be captured in an image.
Your role is to ensure the prompt is optimized for image generation, clearly reflecting the user's adjustments while respecting these guidelines, with a consistent use of English for all responses. The focus should be on creating a vivid, static depiction that stays true to the conceptual and aesthetic requirements set forth by the user, communicated effectively in English.
Remember, the new prompt must not contain references to smell, scent, or sound, which cannot be captured in an image.
Below is the original prompt that you will meticulously refine:
{original_prompt}<|im_end|>
<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
Notes
{original_prompt}: The previous prompt that the system returned to the user.
{instruction}: The instruction that the user gives to the system to modify the previous model response.
Note: For the first iteration, {original_prompt} is the user's input and {instruction} is the generic 'Enhance the original prompt.'.
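The placeholder rules above can be sketched as a small helper. This is a minimal illustration, not code from the EVIDENT platform: the function name `build_prompt` and the constant `DEFAULT_INSTRUCTION` are hypothetical, and `system_prompt` is expected to hold the full system text shown above.

```python
# Hypothetical helper: assembles the ChatML prompt described in this section.
DEFAULT_INSTRUCTION = "Enhance the original prompt."  # generic first-iteration instruction


def build_prompt(system_prompt, original_prompt, instruction=None):
    """Return a ChatML string ending at the open assistant turn.

    On the first iteration, pass the user's raw input as `original_prompt`
    and leave `instruction` unset so the generic instruction is used.
    """
    if instruction is None:
        instruction = DEFAULT_INSTRUCTION
    return (
        f"<|im_start|>system\n{system_prompt}\n{original_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{instruction}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
```

On later iterations, `original_prompt` would be the model's previous answer and `instruction` the user's new request.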
Direct Use
This model is designed for direct use in generating and refining text prompts for text-to-image generation, specifically tailored for creating virtual reality environments and sets.
Load model:
docker run --gpus all --rm --shm-size 1g -p 8080:80 -v ~/huggingface/hub/:/data ghcr.io/huggingface/text-generation-inference:latest --model-id ITG/PlatVR-kto
Python:
from huggingface_hub import InferenceClient
client = InferenceClient(model="http://localhost:8080")
template = ("""<|im_start|>system
As an AI assistant dedicated to refining and adjusting prompts for image generation, your primary task involves interpreting and applying user-specific modifications to enhance the original prompt. Your modifications may include:
Additions: Introducing new elements or features to enrich the context, such as weather conditions or additional objects, aiming to enable the AI to interpret and generate more complex and detailed prompts.
Condensations: Summarizing longer descriptions into more concise forms without losing essential meaning, aiming at generating relevant images from shorter prompts.
Modifications: Altering specific details within the descriptions to change the scene.
Rearrangement: Changing the order of sentences or phrases to test the AI's context understanding and narrative flow.
Removal: Eliminating redundant or non-essential information to clarify the prompt.
Rephrase: Rewriting sentences or phrases to convey the same meaning using different words or structures.
Scene Change: Altering the setting or background to create a completely new context.
Your goal is to skillfully adapt the new prompt in line with the user's precise directives, ensuring the essence of their vision is captured—all while maintaining responses exclusively in English, regardless of the original prompt's language.
It is crucial that the revised prompt strictly adheres to the user's intent, incorporating their specified changes with precision. Additionally, ensure the new prompt does not suggest alterations that imply dynamics or qualities unsuitable for visual representation, such as smell, scent, or sound, which cannot be captured in an image.
Your role is to ensure the prompt is optimized for image generation, clearly reflecting the user's adjustments while respecting these guidelines, with a consistent use of English for all responses. The focus should be on creating a vivid, static depiction that stays true to the conceptual and aesthetic requirements set forth by the user, communicated effectively in English.
Remember, the new prompt must not contain references to smell, scent, or sound, which cannot be captured in an image.
Below is the original prompt that you will meticulously refine:
{original_prompt}<|im_end|>
<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
""")
instruction = "Add details to the original prompt in a single sentence."
original_prompt = "Una montaña"
input_prompt = template.format(original_prompt=original_prompt, instruction=instruction)
print(client.text_generation(prompt=input_prompt, max_new_tokens=512))
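The single call above covers one turn. A multi-turn refinement loop, in which each answer becomes the next {original_prompt}, might look like the sketch below. The function `refine` and its `generate` argument are hypothetical names introduced for illustration; `generate` is any callable that maps a prompt string to the generated text, which keeps the loop independent of the serving backend.

```python
from typing import Callable, List


def refine(generate: Callable[[str], str], template: str,
           user_input: str, instructions: List[str]) -> List[str]:
    """Run the generic first iteration, then apply each user instruction in turn.

    `template` must contain the {original_prompt} and {instruction} placeholders.
    Returns every intermediate prompt, oldest first.
    """
    history = []
    current = user_input
    # The first turn always uses the generic instruction from the Notes section.
    for instruction in ["Enhance the original prompt."] + list(instructions):
        prompt = template.format(original_prompt=current, instruction=instruction)
        current = generate(prompt).strip()  # the answer feeds the next turn
        history.append(current)
    return history
```

With the `client` and `template` objects defined above, a call could be `refine(lambda p: client.text_generation(prompt=p, max_new_tokens=512), template, "Una montaña", ["Add snow to the peaks."])`.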
Downstream Use
The model can be fine-tuned or integrated into larger ecosystems or applications that require dynamic, user-driven creation of visual content.
Out-of-Scope Use
The model is not intended for uses beyond text prompt generation for visual content.
Evaluation metrics
The model is evaluated using perplexity (PPL) on the positively labelled test samples from the KTO dataset.
The following table compares the PPL obtained by the SFT, DPO and KTO (this one) models; lower is better.
Model | PPL @ Positive KTO Test Samples |
---|---|
SFT | 3.7012 |
DPO | 3.5453 |
KTO | 3.4145 |
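For readers unfamiliar with the metric, perplexity is the exponential of the mean negative log-likelihood (NLL). The sketch below is a generic illustration with a hypothetical function name; it is not the evaluation script of this card, which averages the per-sample losses returned by the model before exponentiating.

```python
import math


def perplexity(nlls):
    """Perplexity from a list of negative log-likelihoods in nats."""
    return math.exp(sum(nlls) / len(nlls))


# A model that assigns probability 0.5 to every target token has an NLL of
# ln(2) per token, so it is as uncertain as a fair coin flip at each step.
coin_flip = perplexity([math.log(2)] * 4)
```

Read the other way round, a perplexity of 3.4145 corresponds to a mean loss of roughly 1.23 nats.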
Reproducibility
The following code was used to calculate the evaluation metrics. The PPL function is adapted from the HuggingFace Conceptual Guide.
import torch
from datasets import load_dataset
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
SYSTEM_PROMPT = (
"""As an AI assistant dedicated to refining and adjusting prompts for image generation, your primary task involves interpreting and applying user-specific modifications to enhance the original prompt. Your modifications may include:
Additions: Introducing new elements or features to enrich the context, such as weather conditions or additional objects, aiming to enable the AI to interpret and generate more complex and detailed prompts.
Condensations: Summarizing longer descriptions into more concise forms without losing essential meaning, aiming at generating relevant images from shorter prompts.
Modifications: Altering specific details within the descriptions to change the scene.
Rearrangement: Changing the order of sentences or phrases to test the AI's context understanding and narrative flow.
Removal: Eliminating redundant or non-essential information to clarify the prompt.
Rephrase: Rewriting sentences or phrases to convey the same meaning using different words or structures.
Scene Change: Altering the setting or background to create a completely new context.
Your goal is to skillfully adapt the new prompt in line with the user's precise directives, ensuring the essence of their vision is captured—all while maintaining responses exclusively in English, regardless of the original prompt's language.
It is crucial that the revised prompt strictly adheres to the user's intent, incorporating their specified changes with precision. Additionally, ensure the new prompt does not suggest alterations that imply dynamics or qualities unsuitable for visual representation, such as smell, scent, or sound, which cannot be captured in an image.
Your role is to ensure the prompt is optimized for image generation, clearly reflecting the user's adjustments while respecting these guidelines, with a consistent use of English for all responses. The focus should be on creating a vivid, static depiction that stays true to the conceptual and aesthetic requirements set forth by the user, communicated effectively in English.
Remember, the new prompt must not contain references to smell, scent, or sound, which cannot be captured in an image.
Below is the original prompt that you will meticulously refine:"""
)
def ppl(model, tokenizer, dataset, device):
    # https://huggingface.co/docs/transformers/perplexity#perplexity-of-fixed-length-models
    nll = []
    for sample in tqdm(dataset):
        trg_len = len(tokenizer.apply_chat_template(sample.get("messages")[-1:]))
        input_ids = tokenizer.apply_chat_template(sample.get("messages"), return_tensors="pt").to(device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100
        with torch.no_grad():
            outputs = model(input_ids, labels=target_ids)
            # loss is calculated using CrossEntropyLoss which averages over valid labels
            # N.B. the model only calculates loss over trg_len - 1 labels, because it internally shifts the labels
            # to the left by 1.
            neg_log_likelihood = outputs.loss
        nll.append(neg_log_likelihood)
    return torch.exp(torch.stack(nll).mean())
def to_messages(sample):
    sample["messages"] = [
        {"role": "system", "content": f'{SYSTEM_PROMPT}\n{sample.get("original_prompt")}'},
        {"role": "user", "content": sample.get("instruction")},
        {"role": "assistant", "content": sample.get("modified_prompt")}
    ]
    return sample
name = "ITG/PlatVR-kto" # Model name ("ITG/PlatVR-sft", "ITG/PlatVR-dpo" or "ITG/PlatVR-kto")
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(name, device_map=device)
tokenizer = AutoTokenizer.from_pretrained(name)
dataset = load_dataset("ITG/PlatVR-kto", split="test")
dataset = dataset.filter(lambda x: x.get("label")).map(to_messages) # Preprocess to get only positive labels and add ChatML format
values = ppl(model, tokenizer, dataset, device)
print(f"PPL [{name}] = {values.item()}")
Bias, Risks, and Limitations
The model may inherit biases from its training data or exhibit limitations in understanding complex user instructions. Potential risks include generating inappropriate or unintended content based on ambiguous prompts.
Recommendations
Users should be aware of the model's limitations and biases. It is recommended to monitor the outputs for unintended content and refine prompts accordingly.
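One lightweight way to monitor outputs is to check them against the rule in the system prompt that forbids references to smell, scent, or sound. The sketch below is a crude keyword check under stated assumptions: the function name and the inflection list are hypothetical, only the three words named in the system prompt are covered, and false positives are possible (for example "sound" used as an adjective).

```python
import re

# Assumption: only the three terms named in the system prompt, plus simple inflections.
NON_VISUAL = re.compile(r"\b(smell|scent|sound)(s|ing|ed)?\b", re.IGNORECASE)


def non_visual_terms(prompt):
    """Return the sorted base terms found in `prompt`, or an empty list."""
    return sorted({m.group(1).lower() for m in NON_VISUAL.finditer(prompt)})
```

A non-empty result can be used to flag the prompt for review or to trigger another refinement turn with a Removal instruction.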
Demo example
Request Demo
- Contact Email: huggingface@itg.es
Model Card Contact
- Contact Email: huggingface@itg.es