Salesforce/xgen-mm-vid-phi3-mini-r-v1.5-128tokens-8frames

Image-Text-to-Text

michaelryoo committed 4 days ago · Commit a33d781 (verified) · 1 parent: 3cead6a

Update README.md

Files changed (1): README.md (+1 -1)

@@ -8,7 +8,7 @@ pipeline_tag: image-text-to-text
 # Model description
 `xGen-MM-Vid (BLIP-3-Video)` is an efficient compact vision-language model (VLM) with an explicit temporal encoder, specifically designed to understand videos. It is developed by Salesforce AI Research. Incorporation of a learanable temporal encoder modules within the original (image-based) BLIP-3 architecture is its key aspect.
-In this initial release (12/2024), we are sharing the 128 token version trained to take 16-frame video inputs.
+In this initial release (12/2024), we are sharing the 128 token version trained to take 8-frame video inputs.
 For more details, check out our [tech report](https://arxiv.org/pdf/2410.16267). More detailed explanation could also be found in the [blog article](https://www.salesforceairesearch.com/opensource/xGen-MM-Vid/index.html).