Apply for community grant: Academic project

#2
by haotiz - opened

GLIP (Grounded Language-Image Pre-training) is a generalizable object detection model (we use object detection as the representative localization task). It is language-aware, taking a natural language prompt as instruction, and semantically rich, able to detect millions of visual concepts out of the box. GLIPv2 further extends this ability to instance segmentation and grounded vision-language understanding tasks; see examples in Figure 2. GLIP introduces language into object detection and leverages self-training techniques to pre-train on scalable, semantically rich data: 24M grounded image-caption pairs. This marks a milestone towards generalizable localization models: GLIP enjoys superior zero-shot and few-shot transfer ability similar to that of CLIP/GPT-2/GPT-3.

Github: https://github.com/microsoft/GLIP
Hugging Face Demo: https://huggingface.co/spaces/haotiz/glip-zeroshot-demo
CVPR 2022 Best Paper Finalist (1 out of 33)
Microsoft Research Blog: https://www.microsoft.com/en-us/research/project/project-florence-vl/articles/object-detection-in-the-wild-via-grounded-language-image-pre-training/
Twitter: https://twitter.com/HaotianZhang4AI/status/1569775843681128448

Please consider granting free T4 access to this project. We hope the open-source code and Gradio demo will further spark interest in multimodal intelligence and vision-language research.

Hey, a GPU Grant was just provided. Note that GPU Grants are temporary and may be removed after some time if usage is very low.

To learn more about GPUs in Spaces, please check out https://huggingface.co/docs/hub/spaces-gpus
