|
--- |
|
license: cc-by-nc-4.0 |
|
datasets: |
|
- passing2961/stark-dialogue |
|
- passing2961/stark-face-image |
|
language: |
|
- en |
|
base_model: |
|
- meta-llama/Llama-3.2-11B-Vision-Instruct |
|
tags: |
|
- conversational ai |
|
--- |
|
|
|
# Ultron-Summarizer-8B Model Card |
|
|
|
[π Homepage](https://stark-dataset.github.io/) | [π» Github](https://github.com/passing2961/Stark) | [π Arxiv](https://arxiv.org/abs/2407.03958) | [π PDF](https://arxiv.org/pdf/2407.03958) |
|
|
|
## List of Provided Model Series |
|
- **Ultron-Summarizer-Series:** [π€ Ultron-Summarizer-1B](https://huggingface.co/passing2961/Ultron-Summarizer-1B) | [π€ Ultron-Summarizer-3B](https://huggingface.co/passing2961/Ultron-Summarizer-3B) | [π€ Ultron-Summarizer-8B](https://huggingface.co/passing2961/Ultron-Summarizer-8B) |
|
- **Ultron 7B**: [π€ Ultron-7B](https://huggingface.co/passing2961/Ultron-7B) | [π€ Ultron-11B](https://huggingface.co/passing2961/Ultron-11B) |
|
|
|
> π¨ Disclaimer: All models and datasets are intended for research purposes only. |
|
|
|
## Model Description |
|
- **Repository:** [Code](https://github.com/passing2961/Stark) |
|
- **Paper:** [Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge](https://arxiv.org/abs/2407.03958) |
|
- **Point of Contact:** [Young-Jun Lee](mailto:yj2961@kaist.ac.kr) |
|
|
|
## Model Details |
|
- **Model**: Ultron-11B is a fully open-source multi-modal conversation model that generates the most appropriate image description at the image-sharing moment. |
|
- **Date**: Ultron-11B was trained in 2024. |
|
- **Training Dataset**: [Stark-Dialogue](https://huggingface.co/datasets/passing2961/stark-dialogue) |
|
- **Architecture**: Ultron-11B was trained on top of [LLaMA-3.2-11B-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct). |
|
|
|
## How to Use |
|
|
|
```python |
|
import logging |
|
from PIL import Image |
|
import torch |
|
from transformers import ( |
|
AutoModelForVision2Seq, |
|
BitsAndBytesConfig, |
|
AutoProcessor, |
|
) |
|
|
|
# Define Ultron template |
|
ULTRON_TEMPLATE = 'You are an excellent image sharing system that generates <RET> token with the following image description. The image description must be provided with the following format: <RET> <h> image description </h>. The following conversation is between {name} and AI assistant on {date}. The given image is {name}\'s appearance.\n{dialogue}' |
|
|
|
# Ultron model initialization |
|
def load_ultron_model(model_path): |
|
""" |
|
Loads the Ultron model and processor. |
|
|
|
Args: |
|
model_path (str): Path to the pre-trained model. |
|
|
|
Returns: |
|
model: Loaded Vision-to-Seq model. |
|
processor: Corresponding processor for the model. |
|
""" |
|
logging.info(f"Loading Ultron model from {model_path}...") |
|
quantization_config = BitsAndBytesConfig( |
|
load_in_4bit=True, |
|
bnb_4bit_compute_dtype=torch.bfloat16, |
|
bnb_4bit_use_double_quant=True, |
|
bnb_4bit_quant_type='nf4' |
|
) |
|
model_kwargs = dict( |
|
torch_dtype=torch.bfloat16, |
|
low_cpu_mem_usage=True, |
|
trust_remote_code=True, |
|
device_map="auto", |
|
) |
|
processor = AutoProcessor.from_pretrained( |
|
'meta-llama/Llama-3.2-11B-Vision-Instruct', torch_dtype=torch.bfloat16 |
|
) |
|
model = AutoModelForVision2Seq.from_pretrained( |
|
model_path, |
|
**model_kwargs |
|
).eval() |
|
logging.info("Ultron model loaded successfully.") |
|
return model, processor |
|
|
|
# Run Ultron model |
|
def run_ultron_model(model, processor, dialogue, name='Tom', date='2023.04.20', face_image_path='sample_face.png'): |
|
""" |
|
Runs the Ultron model with a given dialogue, name, and image. |
|
|
|
Args: |
|
model: Pre-trained model instance. |
|
processor: Processor for model input. |
|
dialogue (str): Input dialogue for the assistant. |
|
name (str): Name of the user. |
|
date (str): Date of the conversation. |
|
face_image_path (str): Path to the face image file. |
|
|
|
Returns: |
|
str: Description of the shared image. |
|
""" |
|
logging.info("Running Ultron model...") |
|
face_image = Image.open(face_image_path).convert("RGB") |
|
|
|
prompt = ULTRON_TEMPLATE.format( |
|
dialogue=dialogue, |
|
name=name, |
|
date=date |
|
) |
|
messages = [ |
|
{ |
|
"content": [ |
|
{"text": prompt, "type": "text"}, |
|
{"type": "image"} |
|
], |
|
"role": "user" |
|
}, |
|
] |
|
|
|
logging.info("Preparing input for Ultron model...") |
|
prompt_input = processor.apply_chat_template(messages, add_generation_prompt=True) |
|
inputs = processor(face_image, prompt_input, return_tensors='pt').to('cuda') |
|
|
|
with torch.inference_mode(): |
|
logging.info("Generating output from Ultron model...") |
|
output = model.generate( |
|
**inputs, |
|
do_sample=True, |
|
temperature=0.9, |
|
max_new_tokens=512, |
|
top_p=1.0, |
|
use_cache=True, |
|
num_beams=1, |
|
) |
|
|
|
output_text = processor.decode(output[0], skip_special_token=True) |
|
logging.info("Output generated successfully from Ultron model.") |
|
return parse_ultron_output(output_text) |
|
|
|
# Parse Ultron output |
|
def parse_ultron_output(output): |
|
""" |
|
Parses the output to extract the image description. |
|
|
|
Args: |
|
output (str): The generated output text from the model. |
|
|
|
Returns: |
|
str: Extracted image description. |
|
""" |
|
logging.info("Parsing output from Ultron model...") |
|
if '<RET>' in output: |
|
return output.split('<h>')[-1].split('</h>')[0].strip() |
|
else: |
|
logging.warning("<RET> not found in output.") |
|
return output |
|
|
|
# Example usage |
|
def main(): |
|
""" |
|
Example usage of Ultron model. |
|
""" |
|
model_path = "passing2961/Ultron-11B" |
|
model, processor = load_ultron_model(model_path) |
|
|
|
dialogue = """Tom: I have so much work at the office, I'm exhausted... |
|
Personal AI Assistant: How can I help you feel less tired? |
|
Tom: Hmm.. I miss my dog Star at home. |
|
Personal AI Assistant: """ |
|
|
|
image_description = run_ultron_model(model, processor, dialogue) |
|
logging.info(f"Image description generated: {image_description}") |
|
|
|
if __name__ == "__main__": |
|
main() |
|
``` |
|
|
|
## License and Recommendations |
|
|
|
π¨ Ultron-11B is intended to be used for research purposes only. |
|
|
|
## Acknowledgement |
|
|
|
This work was supported by a grant of the KAIST-KT joint research project through AI Tech Lab, Institute of convergence Technology, funded by KT [Project No. G01230605, Development of Task-oriented Persona-based Dialogue Generation Combining Multi-modal Interaction and Knowledge Modeling]. |
|
|
|
## Citation |
|
|
|
If you find the resources in this repository useful, please cite our work: |
|
|
|
``` |
|
@article{lee2024stark, |
|
title={Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge}, |
|
author={Lee, Young-Jun and Lee, Dokyong and Youn, Junyoung and Oh, Kyeongjin and Ko, Byungsoo and Hyeon, Jonghwan and Choi, Ho-Jin}, |
|
journal={arXiv preprint arXiv:2407.03958}, |
|
year={2024} |
|
} |
|
``` |