|
--- |
|
base_model: mistralai/Mistral-7B-v0.1 |
|
tags: |
|
- mistral |
|
- instruct |
|
- finetune |
|
- chatml |
|
- gpt4 |
|
- synthetic data |
|
- distillation |
|
- multimodal |
|
- llava |
|
model-index: |
|
- name: Nous-Hermes-2-Vision |
|
results: [] |
|
license: apache-2.0 |
|
language: |
|
- en |
|
--- |
|
|
|
GGUF quants by Twobob. Thanks to @jartine and @cmp-nct for the assists. |
|
|
|
The prompt template is Vicuna. For reference, see [here](https://github.com/qnguyen3/hermes-llava/blob/173b4ef441b5371c1e7d99da7a2e7c14c77ad12f/llava/conversation.py#L252). |
|
|
|
Caveat emptor: there is still a bug in the inference that is likely to get fixed upstream. |
|
|
|
|
|
|
# Nous-Hermes-2-Vision - Mistral 7B |
|
|
|
|
|
|
|
|
*In the tapestry of Greek mythology, Hermes reigns as the eloquent Messenger of the Gods, a deity who deftly bridges the realms through the art of communication. It is in homage to this divine mediator that I name this advanced LLM "Hermes," a system crafted to navigate the complex intricacies of human discourse with celestial finesse.* |
|
|
|
## Model description |
|
|
|
Nous-Hermes-2-Vision stands as a pioneering Vision-Language Model, leveraging advancements from the renowned **OpenHermes-2.5-Mistral-7B** by teknium. This model incorporates two pivotal enhancements, setting it apart as a cutting-edge solution: |
|
|
|
- **SigLIP-400M Integration**: Diverging from traditional approaches that rely on substantial 3B vision encoders, Nous-Hermes-2-Vision harnesses the formidable SigLIP-400M. This strategic choice not only streamlines the model's architecture, making it more lightweight, but also capitalizes on SigLIP's remarkable capabilities. The result? A remarkable boost in performance that defies conventional expectations. |
|
|
|
- **Custom Dataset Enriched with Function Calling**: Our model's training data includes a unique feature – function calling. This distinctive addition transforms Nous-Hermes-2-Vision into a **Vision-Language Action Model**. Developers now have a versatile tool at their disposal, primed for crafting a myriad of ingenious automations. |
|
|
|
This project is led by [qnguyen3](https://twitter.com/stablequan) and [teknium](https://twitter.com/Teknium1). |
|
## Training |
|
### Dataset |
|
- 220K from **LVIS-INSTRUCT4V** |
|
- 60K from **ShareGPT4V** |
|
- 150K Private **Function Calling Data** |
|
- 50K conversations from teknium's **OpenHermes-2.5** |
|
|
|
## Usage |
|
### Prompt Format |
|
- Like other LLaVA variants, this model uses Vicuna-V1 as its prompt template. Please refer to `conv_llava_v1` in [this file](https://github.com/qnguyen3/hermes-llava/blob/main/llava/conversation.py). |
|
- For Gradio UI, please visit this [GitHub Repo](https://github.com/qnguyen3/hermes-llava) |
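
As a rough illustration of the Vicuna-V1 layout mentioned above, the sketch below assembles a single prompt string. The system text, the `USER`/`ASSISTANT` role names, the separators, and the `<image>` placeholder follow the upstream LLaVA convention and are assumptions here. `build_vicuna_v1_prompt` is a hypothetical helper, so check `conv_llava_v1` in the linked file for the exact values.

```python
# Assumed values, modelled on the upstream LLaVA Vicuna-V1 template.
# Verify them against conv_llava_v1 before relying on this.
SYSTEM = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions."
)
SEP = " "      # follows the system text and each user turn (assumption)
SEP2 = "</s>"  # follows each completed assistant turn (assumption)


def build_vicuna_v1_prompt(turns, system=SYSTEM):
    """Build a prompt from (role, message) pairs; a None message leaves the turn open."""
    prompt = system + SEP
    for i, (role, message) in enumerate(turns):
        if message is None:
            # Open turn: the model is expected to continue from here.
            prompt += f"{role}:"
        else:
            prompt += f"{role}: {message}{SEP if i % 2 == 0 else SEP2}"
    return prompt


prompt = build_vicuna_v1_prompt([
    ("USER", "<image>\nWhat is in this picture?"),
    ("ASSISTANT", None),
])
print(prompt)
```

The final `ASSISTANT:` is left without a message so that generation starts at the assistant's reply.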
|
|
|
### Function Calling |
|
- For function calling, the message should start with a `<fn_call>` tag. Here is an example: |
|
|
|
```json |
|
<fn_call>{ |
|
"type": "object", |
|
"properties": { |
|
"bus_colors": { |
|
"type": "array", |
|
"description": "The colors of the bus in the image.", |
|
"items": { |
|
"type": "string", |
|
"enum": ["red", "blue", "green", "white"] |
|
} |
|
}, |
|
"bus_features": { |
|
"type": "string", |
|
"description": "The features seen on the back of the bus." |
|
}, |
|
"bus_location": { |
|
"type": "string", |
|
"description": "The location of the bus (driving or pulled off to the side).", |
|
"enum": ["driving", "pulled off to the side"] |
|
} |
|
} |
|
} |
|
``` |
|
|
|
Output: |
|
```json |
|
{ |
|
"bus_colors": ["red", "white"], |
|
"bus_features": "An advertisement", |
|
"bus_location": "driving" |
|
} |
|
``` |
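
One way to produce such a message is to serialize a schema with Python's standard `json` module and prepend the tag. This is a minimal sketch: `build_fn_call_message` is a hypothetical helper, and whether the model is sensitive to indentation or whitespace in the schema is not stated here.

```python
import json

FN_CALL_TAG = "<fn_call>"  # tag taken from the example above


def build_fn_call_message(properties):
    """Wrap property definitions in an object schema and prefix the tag."""
    schema = {"type": "object", "properties": properties}
    return FN_CALL_TAG + json.dumps(schema, indent=2)


message = build_fn_call_message({
    "bus_colors": {
        "type": "array",
        "description": "The colors of the bus in the image.",
        "items": {"type": "string", "enum": ["red", "blue", "green", "white"]},
    },
    "bus_location": {
        "type": "string",
        "description": "The location of the bus (driving or pulled off to the side).",
        "enum": ["driving", "pulled off to the side"],
    },
})
print(message)
```

Building the schema as a Python dict and serializing it avoids hand-written JSON mistakes such as trailing commas.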
|
|
|
## Example |
|
|
|
### Chat |
|
*Image: example chat with the model (image not recoverable).* |
|
|
|
### Function Calling |
|
Input image: |
|
|
|
*Image: input photo for the function-calling example (image not recoverable).* |
|
|
|
Input message: |
|
```json |
|
<fn_call>{ |
|
"type": "object", |
|
"properties": { |
|
"food_list": { |
|
"type": "array", |
|
"description": "List of all the food", |
|
"items": { |
|
"type": "string" |
|
} |
|
} |
|
} |
|
} |
|
``` |
|
|
|
Output: |
|
```json |
|
{ |
|
"food_list": [ |
|
"Double Burger", |
|
"Cheeseburger", |
|
"French Fries", |
|
"Shakes", |
|
"Coffee" |
|
] |
|
} |
|
``` |
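
Since the reply is plain JSON, it can be parsed and checked before use. The sketch below is an assumption about how a caller might do this, not part of the model's tooling: `check_reply` is a hypothetical helper that only verifies top-level keys, string values, arrays of strings, and `enum` membership.

```python
import json


def check_reply(reply_text, schema):
    """Parse a JSON reply and return a list of problems found against the schema.

    Only string properties and arrays of strings are handled in this sketch.
    """
    data = json.loads(reply_text)
    problems = []
    for key, spec in schema["properties"].items():
        if key not in data:
            problems.append(f"missing key: {key}")
            continue
        if spec["type"] == "array":
            if not isinstance(data[key], list):
                problems.append(f"{key}: expected a list")
                continue
            values = data[key]
            allowed = spec.get("items", {}).get("enum")
        else:
            values = [data[key]]
            allowed = spec.get("enum")
        for value in values:
            if not isinstance(value, str):
                problems.append(f"{key}: expected string, got {type(value).__name__}")
            elif allowed is not None and value not in allowed:
                problems.append(f"{key}: {value!r} not in enum")
    return problems


schema = {
    "type": "object",
    "properties": {
        "food_list": {
            "type": "array",
            "description": "List of all the food",
            "items": {"type": "string"},
        }
    },
}
reply = '{"food_list": ["Double Burger", "Cheeseburger", "French Fries", "Shakes", "Coffee"]}'
problems = check_reply(reply, schema)
print(problems)
```

An empty list means the reply matched the schema as far as this check goes. A full JSON Schema validator would cover more cases.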