BLIP 2 comparison?

#42
by eugeneware - opened

Love this Space, it's a great way to compare captioning models. I notice that Salesforce has released their BLIP-2 model. I wonder if it would be useful to compare against that too?

Yes, I'll add BLIP-2 soon!

That would be nice. So far, Microsoft's GIT is superior to the others 99% of the time.

@nielsr I have created a demo for BLIP-2 here; maybe we can wrap that space as a function and use it here? I tested this feature once, but I got a weird error message for image inputs; I can double check this if you are interested.

I have been using this one (note the weights/sliders): https://huggingface.co/spaces/Salesforce/BLIP2 and would love to have all of that in this Space, as I use these models to caption my LoRA training sets.

@GeneralAwareness That demo uses a private API. And BLIP2_OPT_6.7B requires a lot more GPU RAM than BLIP2_OPT_2.7B. (Meanwhile, I think nielsr is already adding support for BLIP-2 :D)

The difference from OPT 2.7B to OPT 6.7B is quite considerable. If nielsr has enough GPU RAM, it would be very interesting for all of us if he added the OPT 6.7B version.

Yes, BLIP-2 is now available in the Transformers library: https://huggingface.co/docs/transformers/main/en/model_doc/blip-2.

Seven checkpoints are on the Hub: https://huggingface.co/models?other=blip-2.

For the moment I've added the smallest model to this Space, but I will extend it to the bigger OPT 6.7b and Flan-T5 XXL models.
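
For anyone who wants to try it outside this Space, here is a minimal captioning sketch using the Transformers integration (this assumes the Salesforce/blip2-opt-2.7b checkpoint and a sample COCO image; adjust for your own setup):

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Smallest BLIP-2 checkpoint (ViT-g vision encoder + OPT 2.7B language model)
processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

# Example image (two cats on a couch, from COCO)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# No text prompt means plain image captioning
inputs = processor(images=image, return_tensors="pt").to(device, dtype)
generated_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)
```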

Thank you for your amazing work 👍

@nielsr Flan-T5 is not needed; it is more focused on question answering. For captions, the best-performing ones are OPT 6.7B and OPT 2.7B. In my opinion it is a waste of GPU.

Yeah, I gotta say BLIP-2 is the one I use the most out of all the models on this page, as it seems to be the most accurate. At worst I mix its output with GIT's by hand.

Yes, but do note that the GIT models are much smaller than BLIP-2. Luckily we've added support for the 8-bit algorithm for BLIP-2, meaning that you can load any BLIP-2 checkpoint in 8-bit instead of the default float32.

See my BLIP-2 notebooks here: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/BLIP-2
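
As a rough illustration of the 8-bit loading mentioned above (assuming the Salesforce/blip2-opt-6.7b checkpoint and that the bitsandbytes and accelerate packages are installed; actual memory use depends on your setup):

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, Blip2ForConditionalGeneration

# Load BLIP-2 OPT-6.7b with 8-bit weights instead of the default float32
# (requires `pip install bitsandbytes accelerate`)
processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-6.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-6.7b",
    load_in_8bit=True,
    device_map="auto",
)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Keep the inputs in float16; the quantized weights are handled internally
inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```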

I tested the BLIP-2 on here against the one I linked above, and the one I linked above was just superior across all the captioning I did last night.

I've added BLIP-2 OPT-6.7b to this Space. It's using 8-bit inference to save memory.

When it comes to performance, the ranking is BLIP-2 > GIT and CoCa > BLIP-1.

The difference between GIT and CoCa is very small. The difference between GIT/CoCa and BLIP-1 is big. The difference between BLIP-2 and GIT/CoCa is small. The problem with BLIP-2 is that it requires a lot of hardware, so it's only worth it if you have the resources; GIT/CoCa does a very similar job for much less.

Not similar enough for my needs, but I get what you are saying. Luckily there are options.

Yes, it seems to me that it depends on the dataset. I don't want to impose this as a rule, obviously; I'm just sharing my experience.

Great resource! Are these currently the best models out there, or have any new ones shown improvement? Any way to add the ability for prompting and chat?

Currently BLIP-2 is still top of the game when it comes to open source. However, there's already an improved variant called InstructBLIP (based on BLIP-2), which is going to be integrated soon.

InstructBLIP is pretty cool; a demo was just uploaded here: https://huggingface.co/spaces/RamAnanth1/InstructBLIP.

Does anyone know how to get BLIP-2 to stop hallucinating objects that aren't really there?

There are many hallucinated captions about objects that are not there, as @marmelade500 says... I don't know if there is a way of avoiding it. I've been testing many BLIP-2 models; the best performer for me (RTX 3060 12 GB) is blip2-opt-6.7b-coco.
blip2-flan-t5-xl is much, much faster (10x) but not as accurate.

I'm trying the models for captioning photos for a cycling website. Randomly, on photos of just a road without anything else (well, a mountain, a tree), I get captions like:
- a car driving down a road with a mountain view
- a person riding a skateboard down a road next to a mountain

The skateboard thing happens a lot!!!

@FerradalFCG Have you tried Kosmos-2?

No! I'll give it a try, thanks!

I can't run any of these locally, so unless it appears somewhere online to try, I know I can't try it.

@GeneralAwareness You can try the demo here: https://d01ad726be013ef4.gradio.app/ or follow the instructions on the README page to run it locally. Kosmos-2 is not that GPU hungry.
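
For running it locally, here is a rough sketch via the Kosmos-2 integration in Transformers (this assumes the microsoft/kosmos-2-patch14-224 checkpoint and is separate from the README instructions mentioned above):

```python
import requests
from PIL import Image
from transformers import AutoProcessor, Kosmos2ForConditionalGeneration

processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")
model = Kosmos2ForConditionalGeneration.from_pretrained("microsoft/kosmos-2-patch14-224")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The "<grounding>" prefix asks the model to ground noun phrases to bounding boxes
prompt = "<grounding>An image of"
inputs = processor(text=prompt, images=image, return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=64)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Split the raw output into a clean caption plus (entity, span, boxes) tuples
caption, entities = processor.post_process_generation(generated_text)
print(caption)
print(entities)
```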

I've been trying it online; I got a lot of "Connection errored out" messages and it only worked a few times, but the detailed descriptions were nice! Will it work on a 12 GB GPU (RTX 3060)?

@FerradalFCG On my side it only took ~8GB, if I remember correctly.

I have a 1060 6GB.

@taesiri I tried it and it had a 99% failure rate, so worse than any of the others. What are good values for the sampling settings? Edit: btw, I sure wish we could teach these models. It will say "a glowing blue orb" when in reality it is a glowing blue SWORD. If we could teach it when it is wrong, it would learn over time, and that would rock. "A group of four purple robed humanoids ride bicycles through a desert." WRONG, they were riding motorcycles. If we could just teach it when it's wrong.

I also notice the caption changes each time I press the submit button, so it isn't even sure of itself.

@GeneralAwareness Do you mind to share some of these images with us? I generally think it would be a good idea to start gathering all of these failure cases, along with their correct labels, and release them as a new dataset/benchmark.

I also tried to use the "avisar" (report) button for bad descriptions/labels, but it does not seem to work; it does nothing.

Well, the thing is, this is not stable: each time I pressed the button it came back with something different. There needs to be stability first; once that is achieved, then tackle the problem of its incorrectness.
