Thanks for this!

#2 opened by Nitral-AI

Would it be possible to make a LLaVA mmproj adapter from this?

Appreciate the work!

We managed to get a projector file out. However, CLIP does not appear to caption images accurately when the base model here is quantized to 4-bit (screenshot attached).
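For anyone curious, extraction along these lines is typically done with a LLaVA-style "surgery" step: pull the multimodal projector tensors out of the checkpoint shards and save them separately, so llama.cpp's image-encoder conversion script can turn them into the final mmproj GGUF. A minimal sketch of that idea (assuming PyTorch `.bin` checkpoint shards; the file names are hypothetical):

```python
import glob
import torch

# Collect the mm_projector tensors from every checkpoint shard
# (hypothetical shard names; adjust to the actual repo layout).
projector = {}
for shard in glob.glob("pytorch_model-*.bin"):
    ckpt = torch.load(shard, map_location="cpu")
    projector.update(
        {k: v for k, v in ckpt.items() if k.startswith("model.mm_projector")}
    )

# Save the projector weights; llama.cpp's image-encoder conversion
# script combines this with the CLIP encoder to produce the mmproj GGUF.
torch.save(projector, "llava.projector")
```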

The base model seems to work fine for textual inference (screenshot attached).

The resulting mmproj file is also outputting seemingly random tokens.
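For reference, here is roughly how one can sanity-check a projector like this, using llama-cpp-python's LLaVA 1.5 chat handler (a sketch only; the GGUF file names are hypothetical):

```python
import base64

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler


def image_to_data_uri(path: str) -> str:
    # Encode a local image as a data URI for the chat API.
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()


# Hypothetical file names: the quantized base model plus the extracted projector.
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="llama-3-llava-q4_k.gguf",
    chat_handler=chat_handler,
    n_ctx=2048,
)

result = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_to_data_uri("test.jpg")}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ]
)
print(result["choices"][0]["message"]["content"])
```

If text-only inference is fine but the captions come out as random tokens, that points at the projector or the CLIP side rather than the LLM weights.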

Not complaining; as a model creator myself, I just like to be aware of issues. Thank you again for all the hard work!

I did find this issue last night. I re-fine-tuned the model, and now it looks fine. Please git pull the latest checkpoints for testing.

We pulled the latest and have a working projector file. Everything looks to be in working order. Congratulations on being the first Llama-3 LLaVA-1.5 pretrain!

Thanks for confirming. I will update this model in the next few days with the ShareGPT4V data and other instruction data constructed from GPT-4V, to see whether they can further boost the model's performance. Keep an eye out for that!

Excited to see that! We will definitely be waiting for those results and will follow up with you.

Hi, a new model with increased input resolution, based on CLIP-L-336px, has been released. I also added results on the MMMU benchmark, which I believe is the most convincing multimodal LLM benchmark right now.

My buddy @jeiku is working on making the projector now. Thank you very much for your time, as usual. I will update here on how it goes!
