---
license: apache-2.0
pipeline_tag: image-to-text
---

# Moonline

Moonline is a fork of [moondream2](https://huggingface.co/vikhyatk/moondream2). It combines moondream2's image-to-text generation with a modified version of
[outlines](https://github.com/outlines-dev/outlines) so that the generated text conforms to a given pydantic model.

## Model Details

The weights and the model structure come directly from moondream2. The difference is that the Phi text model is swapped for a Phi model that
generates text according to a given structure. Since the outlines API doesn't work directly on embeddings, only the relevant parts are
copied and modified.
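
To illustrate the general idea of structure-guided generation, here is a minimal toy sketch. It is not the Moonline or outlines code: the vocabulary, the hand-written state machine, and the `fake_scores` stand-in for the language model are all assumptions made up for this example. At every step, tokens that the state machine does not allow are masked out before the best-scoring token is picked.

```python
import math

# Toy vocabulary (hypothetical, for illustration only).
VOCAB = ['{"mood":"', "happy", "sad", "banana", '"}']

# Hand-written finite-state machine: state -> {allowed token id: next state}.
# It only accepts {"mood":"happy"} or {"mood":"sad"}.
FSM = {
    0: {0: 1},        # must open the object and the "mood" key
    1: {1: 2, 2: 2},  # must pick one of the allowed moods
    2: {4: 3},        # must close the string and the object
}
FINAL_STATE = 3


def constrained_greedy_decode(score_fn):
    """Greedy decoding where disallowed tokens are masked to -inf."""
    state, out = 0, []
    while state != FINAL_STATE:
        logits = score_fn(out)
        allowed = FSM[state]
        masked = [
            logit if i in allowed else -math.inf
            for i, logit in enumerate(logits)
        ]
        token = max(range(len(VOCAB)), key=masked.__getitem__)
        out.append(VOCAB[token])
        state = allowed[token]
    return "".join(out)


def fake_scores(prefix):
    """Stand-in for a language model that always prefers 'banana'."""
    return [0.1, 0.5, 0.2, 9.0, 0.3]


print(constrained_greedy_decode(fake_scores))
```

Even though the stand-in model scores `banana` highest at every step, the mask forces the output to stay inside the allowed structure.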

### How to use

The best way to start is by cloning the repo and running `example.py`.
Make sure to set up a virtual environment and install the dependencies from `requirements.txt`.

`example.py` runs through a simple example of generating a description and a mood for the farm image.

```python
from PIL import Image
from transformers import AutoTokenizer
from pydantic import BaseModel 
from enum import Enum

from moonline import Moonline 

def main():
    class Mood(Enum):
        sad = "sad"
        happy = "happy"
        angry = "angry"
        neutral = "neutral"

    class ExampleModel(BaseModel):
        description: str
        mood: Mood

    prompt = f"""
    Your job is to describe the image.
    Please answer in json with the following format: {ExampleModel.__annotations__}
    """
    
    image_path = "example.png"

    model_id = "vikhyatk/moondream2"
    revision = "2024-04-02"
    tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
    moonline = Moonline.from_pretrained(
        model_id,
        revision=revision,
    ).to()
    moonline.eval()

    image = Image.open(image_path)
    image_embeds = moonline.encode_image(image)
    fsm = moonline.generate_fsm(ExampleModel, tokenizer)

    answer = moonline.answer_question(image_embeds, prompt, tokenizer, fsm)
    print(f"answer: {answer}")


if __name__ == "__main__":
    main()
```

The result is something like this:

```json
{
  "description": "A cartoon house is shown sitting on a dirt road with a long gravel path. Plants and trees surround the house. In the distance, there is a canal or pond with ducks swimming about. The scene is full of greenery, and flowers bloom among the vegetation. The sky is a clear blue, and a lush, verdant landscape can be spotted in the background. There is a pathway leading towards the house.",
  "mood": "happy"
}
```
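
If the answer comes back as a JSON string like the one above (an assumption; check what `answer_question` returns in your setup), it can be turned back into typed values with the standard library. The `parse_answer` helper below is hypothetical and not part of Moonline.

```python
import json
from enum import Enum


class Mood(Enum):
    sad = "sad"
    happy = "happy"
    angry = "angry"
    neutral = "neutral"


def parse_answer(answer: str):
    """Parse a JSON answer string into a description and a Mood member."""
    data = json.loads(answer)
    # Mood(...) raises ValueError if the value is not a known mood.
    return data["description"], Mood(data["mood"])


# Shortened sample answer, assumed to have the same shape as the output above.
sample = '{"description": "A cartoon house.", "mood": "happy"}'
description, mood = parse_answer(sample)
print(description, mood)
```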

### Limitations

The model hallucinates, especially when a field is requested that has no counterpart in the image.
This can be alleviated by giving `None` options or guidance in the prompt, but in my experience this doesn't fully solve the issue.
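
One way to offer such an option is sketched below, assuming an explicit `none` enum member is an acceptable stand-in for `None`. The `none` member and the `normalize_mood` helper are hypothetical additions for illustration. The model is not guaranteed to choose the escape value when it should.

```python
from enum import Enum


class Mood(Enum):
    sad = "sad"
    happy = "happy"
    angry = "angry"
    neutral = "neutral"
    # Hypothetical escape value for images where no mood applies.
    none = "none"


def normalize_mood(raw: str):
    """Map a generated mood string to a Mood, or to None for the escape value."""
    mood = Mood(raw)
    return None if mood is Mood.none else mood


print(normalize_mood("none"))   # the escape value becomes Python None
print(normalize_mood("happy"))  # regular values stay Mood members
```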

Moondream is also not specifically trained on JSON output. I expect results would improve with fine-tuning on JSON descriptions of
images, especially for cases with missing fields.