---
license: apache-2.0
pipeline_tag: image-to-text
---
# Moonline
Moonline is a fork of [moondream2](https://huggingface.co/vikhyatk/moondream2). It combines moondream2's image-to-text generation with a modified version of
[outlines](https://github.com/outlines-dev/outlines), so that the generated text conforms to a given pydantic model.
## Model Details
The weights and the model structure come directly from moondream2. The difference is that the Phi text model is swapped out for a modified Phi model that
generates text according to a given structure. Since the outlines API doesn't operate directly on embeddings, only the relevant parts were
copied over and modified.
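For context, outlines-style structured generation compiles the pydantic schema into a finite-state machine over tokens and masks the logits at every decoding step, so only schema-conforming tokens can be sampled. The sketch below illustrates that idea operating on embeddings; the `fsm` interface used here (`initial_state`, `allowed_token_ids`, `next_state`) is a simplified stand-in for outlines' internals, not its actual API, and `model` is assumed to be a HF-style causal LM that accepts `inputs_embeds`.

```python
import torch

def constrained_greedy_decode(model, input_embeds, fsm, tokenizer, max_tokens=256):
    """Greedy decoding where each step is masked by a schema-derived FSM.

    The `fsm` object is a hypothetical stand-in for outlines' guide;
    the real library's classes and method names differ.
    """
    state = fsm.initial_state
    out_ids = []
    for _ in range(max_tokens):
        logits = model(inputs_embeds=input_embeds).logits[:, -1, :]
        # Mask out every token the FSM does not allow in the current state.
        mask = torch.full_like(logits, float("-inf"))
        mask[:, fsm.allowed_token_ids(state)] = 0.0
        next_id = int((logits + mask).argmax(dim=-1))
        if next_id == tokenizer.eos_token_id:
            break
        out_ids.append(next_id)
        state = fsm.next_state(state, next_id)
        # Append the new token's embedding so the next step can see it.
        next_embed = model.get_input_embeddings()(torch.tensor([[next_id]]))
        input_embeds = torch.cat([input_embeds, next_embed], dim=1)
    return tokenizer.decode(out_ids)
```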
### How to use
The best way to start is by cloning the repo and running `example.py`.
Make sure to set up a virtual environment and install the dependencies from `requirements.txt`.
`example.py` runs through a simple example of generating a description and a mood for the farm image.
```python
from enum import Enum

from PIL import Image
from pydantic import BaseModel
from transformers import AutoTokenizer

from moonline import Moonline


def main():
    class Mood(Enum):
        sad = "sad"
        happy = "happy"
        angry = "angry"
        neutral = "neutral"

    class ExampleModel(BaseModel):
        description: str
        mood: Mood

    prompt = f"""
    Your job is to describe the image.
    Please answer in json with the following format: {ExampleModel.__annotations__}
    """

    image_path = "example.png"
    model_id = "vikhyatk/moondream2"
    revision = "2024-04-02"

    tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
    moonline = Moonline.from_pretrained(
        model_id,
        revision=revision,
    ).to()  # pass a device here (e.g. "cuda") to run on GPU
    moonline.eval()

    # Embed the image once; the embeddings are reused when answering.
    image = Image.open(image_path)
    image_embeds = moonline.encode_image(image)

    # Build the finite-state machine that constrains generation to the schema.
    fsm = moonline.generate_fsm(ExampleModel, tokenizer)
    answer = moonline.answer_question(image_embeds, prompt, tokenizer, fsm)
    print(f"answer: {answer}")


if __name__ == "__main__":
    main()
```
The result is something like this:
```json
{
"description": "A cartoon house is shown sitting on a dirt road with a long gravel path. Plants and trees surround the house. In the distance, there is a canal or pond with ducks swimming about. The scene is full of greenery, and flowers bloom among the vegetation. The sky is a clear blue, and a lush, verdant landscape can be spotted in the background. There is a pathway leading towards the house.",
"mood": "happy"
}
```
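Because generation is constrained by the FSM, the answer should already be valid JSON for the schema, so it can be parsed straight back into the pydantic model. A minimal sketch, assuming pydantic v2 (v1 would use `ExampleModel.parse_raw`) and the `answer` from the example above:

```python
from pydantic import ValidationError

try:
    result = ExampleModel.model_validate_json(answer)
    print(result.mood)  # Mood.happy
except ValidationError as exc:
    # Should be rare given constrained decoding, but worth guarding against.
    print(f"output failed schema validation: {exc}")
```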
### Limitations
The model hallucinates, especially when the schema asks for a field that doesn't exist in the image.
This can be alleviated by allowing `None` values or adding guidance in the prompt, as sketched below, but in my experience this doesn't fully solve the issue.
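A sketch of what the `None` option can look like, with a hypothetical `license_plate` field that many images won't contain:

```python
from typing import Optional
from pydantic import BaseModel

class SaferModel(BaseModel):
    description: str
    # Defaulting to None gives the model an honest way out when the
    # requested information is simply not in the image.
    license_plate: Optional[str] = None
```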
Moondream is also not specifically trained to produce JSON output. I expect results would improve with fine-tuning on JSON descriptions of
images, especially examples where fields are missing.