BLIP 2 comparison?

#42
by eugeneware - opened

Love this Space, it's a great way to compare captioning models. I notice that Salesforce has released their BLIP-2 model. I wonder if it would be useful to compare against that too?

Yes, I'll add BLIP-2 soon!

That would be nice. So far, Microsoft's GIT is superior to the others 99% of the time.

@nielsr I have created a demo for BLIP-2 here; maybe we can wrap that space as a function and use it here? I tested this feature once, but I got a weird error message for image inputs; I can double check this if you are interested.

I have been using this one (note the weights/sliders): https://huggingface.co/spaces/Salesforce/BLIP2 and would love to have all of that in this Space, as I use these models to caption my LoRA training sets.

@GeneralAwareness That demo uses a private API. And BLIP2_OPT_6.7B requires a lot more GPU RAM than BLIP2_OPT_2.7B. (Meanwhile, I think nielsr is already adding support for BLIP-2 :D)

The difference from OPT 2.7B to OPT 6.7B is quite considerable. If nielsr has enough GPU RAM, it would be very interesting for all of us if he added the OPT 6.7B version.

Yes, BLIP-2 is now available in the Transformers library: https://huggingface.co/docs/transformers/main/en/model_doc/blip-2.

Seven checkpoints are on the Hub: https://huggingface.co/models?other=blip-2.

For the moment I've added the smallest model to this Space, but I will extend it to the bigger OPT 6.7b and Flan-T5 XXL models.
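
For anyone who wants to try it outside this Space, here is a minimal captioning sketch using the Transformers integration (this assumes the Salesforce/blip2-opt-2.7b checkpoint and a sample COCO image; adjust for your own setup):

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Smallest BLIP-2 checkpoint (ViT-g vision encoder + OPT 2.7B language model)
processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

# Example image (two cats on a couch, from COCO)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# No text prompt means plain image captioning
inputs = processor(images=image, return_tensors="pt").to(device, dtype)
generated_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)
```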

Thank you for your amazing work 👍

@nielsr Flan-T5 is not needed; it is more focused on question answering. For captions, the best-performing ones are OPT 6.7B and OPT 2.7B. In my opinion it is a waste of GPU.

Yeah, I gotta say BLIP-2 is the one I use the most out of all the models on this page, as it seems to be the most accurate. At worst I mix its output with GIT's by hand.

Yes, but do note that the GIT models are much smaller than BLIP-2. Luckily we've added support for the 8-bit algorithm for BLIP-2, meaning that you can load any BLIP-2 checkpoint in 8-bit instead of the default float32.

See my BLIP-2 notebooks here: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/BLIP-2
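
As a rough illustration of the 8-bit loading mentioned above (assuming the Salesforce/blip2-opt-6.7b checkpoint and that the bitsandbytes and accelerate packages are installed; actual memory use depends on your setup):

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, Blip2ForConditionalGeneration

# Load BLIP-2 OPT-6.7b with 8-bit weights instead of the default float32
# (requires `pip install bitsandbytes accelerate`)
processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-6.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-6.7b",
    load_in_8bit=True,
    device_map="auto",
)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Keep the inputs in float16; the quantized weights are handled internally
inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```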

I tested the BLIP-2 on here against the one I linked above, and the one I linked above was just superior across all the captioning I did last night.

I've added BLIP-2 OPT-6.7b to this Space. It's using 8-bit inference to save memory.

When it comes to performance, the ranking is BLIP-2 > GIT and CoCa > BLIP-1.

The difference between GIT and CoCa is very small. The difference between GIT/CoCa and BLIP-1 is big. The difference between BLIP-2 and GIT/CoCa is small. The problem with BLIP-2 is that it requires a lot of hardware, so it's only worth it if you have the resources; GIT/CoCa does a very similar job for much less.

Not similar enough for my needs, but I get what you are saying. Luckily there are options.

Yes, it seems to me that it depends on the dataset. I don't want to impose this as a rule, obviously; I'm just sharing my experience.

Great resource! Are these currently the best models out there, or have any new ones shown improvement? Any way to add the ability for prompting and chat?

Currently BLIP-2 is still top of the game when it comes to open source. However, there's already an improved variant called InstructBLIP (based on BLIP-2), which is going to be integrated soon.

InstructBLIP is pretty cool; a demo was just uploaded here: https://huggingface.co/spaces/RamAnanth1/InstructBLIP.

Does anyone know how to get BLIP-2 to stop hallucinating objects that aren't really there?

There are many hallucinated captions about objects that are not there, as @marmelade500 says... I don't know if there is a way of avoiding it. I've been testing many BLIP-2 models; the best performer for me (RTX 3060 12 GB) is blip2-opt-6.7b-coco.
blip2-flan-t5-xl is much, much faster (10x) but not as accurate.

I'm trying the models for captioning photos for a cycling website. Randomly, on photos of just a road without anything else (well, a mountain, a tree), I get captions like:
- a car driving down a road with a mountain view
- a person riding a skateboard down a road next to a mountain

The skateboard thing happens a lot!!!

@FerradalFCG Have you tried Kosmos-2?

No! I'll give it a try, thanks!

I can't run any of these locally, so unless it appears somewhere online to try, I know I can't try it.

@GeneralAwareness You can try the demo here: https://d01ad726be013ef4.gradio.app/ or follow the instructions on the README page to run it locally. Kosmos-2 is not that GPU hungry.
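
For running it locally, here is a rough sketch via the Kosmos-2 integration in Transformers (this assumes the microsoft/kosmos-2-patch14-224 checkpoint and is separate from the README instructions mentioned above):

```python
import requests
from PIL import Image
from transformers import AutoProcessor, Kosmos2ForConditionalGeneration

processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")
model = Kosmos2ForConditionalGeneration.from_pretrained("microsoft/kosmos-2-patch14-224")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The "<grounding>" prefix asks the model to ground noun phrases to bounding boxes
prompt = "<grounding>An image of"
inputs = processor(text=prompt, images=image, return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=64)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Split the raw output into a clean caption plus (entity, span, boxes) tuples
caption, entities = processor.post_process_generation(generated_text)
print(caption)
print(entities)
```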

I've been trying it online; I got a lot of "Connection errored out" messages and it only worked a few times, but the detailed descriptions were nice! Will it work on a 12 GB GPU (RTX 3060)?

@FerradalFCG On my side it only took ~8GB, if I remember correctly.

I have a 1060 6GB.

@taesiri I tried it and it had a 99% failure rate, so worse than any of the others. What are good values for the sampling settings? Edit: btw, I sure wish we could teach these models. It will say "a glowing blue orb" when in reality it is a glowing blue SWORD. If we could teach it when it is wrong, it would learn over time, and that would rock. "A group of four purple robed humanoids ride bicycles through a desert." WRONG, they were riding motorcycles. If we could just teach it when it's wrong.

I also notice the caption changes each time I press the submit button, so it isn't even sure of itself.

@GeneralAwareness Do you mind to share some of these images with us? I generally think it would be a good idea to start gathering all of these failure cases, along with their correct labels, and release them as a new dataset/benchmark.

I also tried to use the "avisar" (report) button for bad descriptions/labels, but it does not seem to work; it does nothing.

Well, the thing is, this is not stable: each time I pressed the button it came back with something different. There needs to be stability first; once that is achieved, then tackle the problem of its incorrectness.
