---
license: apache-2.0
pipeline_tag: image-to-text
---

# Moonline

Moonline is a fork of [moondream2](https://huggingface.co/vikhyatk/moondream2). It combines moondream2's image-to-text generation with a modified version of [outlines](https://github.com/outlines-dev/outlines) so that the generated text conforms to a given Pydantic model.

## Model Details

The weights and the model structure come directly from moondream2. The difference is that the Phi text model is replaced with a Phi model that generates text according to a given structure. Since the outlines API doesn't work directly on embeddings, only the relevant parts were copied and modified.

### How to use

The best way to start is to clone the repo and run `example.py`. Make sure to set up a virtual environment and install the dependencies from `requirements.txt`.

`example.py` runs through a simple example: generating a description and a mood for the farm image.

```python
from enum import Enum

from PIL import Image
from pydantic import BaseModel
from transformers import AutoTokenizer

from moonline import Moonline


def main():
    # Allowed values for the "mood" field of the output.
    class Mood(Enum):
        sad = "sad"
        happy = "happy"
        angry = "angry"
        neutral = "neutral"

    # Structure that the generated answer has to follow.
    class ExampleModel(BaseModel):
        description: str
        mood: Mood

    prompt = f"""
    Your job is to describe the image.
    Please answer in json with the following format: {ExampleModel.__annotations__}
    """

    image_path = "example.png"

    model_id = "vikhyatk/moondream2"
    revision = "2024-04-02"
    tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
    moonline = Moonline.from_pretrained(
        model_id,
        revision=revision,
    ).to()
    moonline.eval()

    # Encode the image once, then answer the prompt constrained by the FSM.
    image = Image.open(image_path)
    image_embeds = moonline.encode_image(image)
    fsm = moonline.generate_fsm(ExampleModel, tokenizer)

    answer = moonline.answer_question(image_embeds, prompt, tokenizer, fsm)
    print(f"answer: {answer}")


if __name__ == "__main__":
    main()
```

The result is something like this:

```json
{
  "description": "A cartoon house is shown sitting on a dirt road with a long gravel path. Plants and trees surround the house. In the distance, there is a canal or pond with ducks swimming about. The scene is full of greenery, and flowers bloom among the vegetation. The sky is a clear blue, and a lush, verdant landscape can be spotted in the background. There is a pathway leading towards the house.",
  "mood": "happy"
}
```

### Limitations

The model hallucinates, especially when the structure contains a field for something that doesn't exist in the image. This can be alleviated by giving `None` options or guidance in the prompt, but in my experience this doesn't solve the issue fully.

Moondream is also not specifically trained on JSON output. I expect results would improve with fine-tuning on JSON descriptions of images, especially for cases where fields are missing from the image.
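
The `None` options and prompt guidance mentioned above could look roughly like the following. This is a minimal sketch, not part of the repo: `SceneModel`, the `animal` field, and the `unknown` mood are hypothetical names, and it assumes the structured-generation step accepts `Optional` fields and an extra enum member, which has not been verified against Moonline.

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel


class Mood(Enum):
    sad = "sad"
    happy = "happy"
    angry = "angry"
    neutral = "neutral"
    unknown = "unknown"  # hypothetical escape hatch when no mood is evident


class SceneModel(BaseModel):
    # Hypothetical model: fields that may be absent from the image are Optional.
    description: str
    animal: Optional[str] = None
    mood: Mood = Mood.unknown


# Guidance in the prompt, in addition to the structure itself.
prompt = f"""
Your job is to describe the image.
Use null for any field that is not visible in the image.
Please answer in json with the following format: {SceneModel.__annotations__}
"""

# An answer that leaves the optional field empty still validates.
parsed = SceneModel(description="An empty field under a clear sky.", animal=None)
print(parsed.animal, parsed.mood.value)
```

The model class would then be passed to `moonline.generate_fsm` in place of `ExampleModel`. As noted above, this tends to reduce hallucinated fields rather than eliminate them.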