Hi guys, I just wanted to say: amazing model. It covers a lot of multimodal tasks.
I'm getting 0.07 to 0.14 ms inference time in CAPTION_TO_PHRASE_GROUNDING mode on an RTX 3080 10GB. I think edge devices could benefit from this model as well.
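
For anyone who wants to compare numbers on their own hardware, a latency measurement like this could be sketched roughly as below. It is only a sketch under assumptions, not the script behind the figures above: `time_inference` and `fake_inference` are hypothetical names, and the stand-in workload should be swapped for the real model call. On a GPU, the `sync` argument would typically be `torch.cuda.synchronize`, since CUDA kernels run asynchronously and the timer can otherwise stop before the work has finished.

```python
import statistics
import time
from typing import Callable, Optional


def time_inference(run_once: Callable[[], object],
                   warmup: int = 3,
                   repeats: int = 20,
                   sync: Optional[Callable[[], None]] = None) -> dict:
    """Time a zero-argument callable and return latency stats in milliseconds.

    `sync` is an optional hook (assumption: e.g. torch.cuda.synchronize)
    called before the clock is stopped, so queued GPU work is included.
    """
    # Warm-up runs are discarded: the first calls often pay one-off costs.
    for _ in range(warmup):
        run_once()
    if sync is not None:
        sync()

    samples_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        if sync is not None:
            sync()  # wait for pending work before reading the clock
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "min_ms": min(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "max_ms": max(samples_ms),
    }


def fake_inference():
    # Hypothetical stand-in workload; replace with the real model call.
    return sum(i * i for i in range(10_000))


stats = time_inference(fake_inference)
print(stats)
```

Reporting the median alongside the min and max makes it easier to tell a steady-state figure from a one-off fast or slow run.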