llava-hf/LLaVA-NeXT-Video-7B-hf

@@ -7,8 +7,6 @@ pipeline_tag: image-text-to-text
 # LLaVA-NeXT-Video Model Card
-Below is the model card of LLaVa-NeXT-Video model 7b, which is copied from the original Llava model card that you can find [here](https://huggingface.co/liuhaotian/llava-v1.5-13b).
 Also check out the Google Colab demo to run Llava on a free-tier Google Colab instance: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1CZggLHrjxMReG-FNOmqSOdi4z7NPq6SO?usp=sharing)
 Disclaimer: The team releasing LLaVa-NeXT-Video did not write a model card for this model so this model card has been written by the Hugging Face team.
@@ -16,21 +14,17 @@ Disclaimer: The team releasing LLaVa-NeXT-Video did not write a model card for t
 ## 📄 Model details
 **Model type:**
-<br>
-LLaVA-Next-Video is an open-source chatbot trained by fine-tuning LLM on multimodal instruction-following data. The model is buit on top of LLaVa-NeXT by tuning on a mix of video and image data. The videos were sampled uniformly to be 32 frames per clip.
-<br>
- Base LLM: lmsys/vicuna-7b-v1.5
 <img src="http://drive.google.com/uc?export=view&id=1fVg-r5MU3NoHlTpD7_lYPEBWH9R8na_4">
 **Model date:**
-<br>
 LLaVA-Next-Video-7B was trained in April 2024.
-**Paper or resources for more information:**
-<br>
-https://github.com/LLaVA-VL/LLaVA-NeXT
 ## 📚 Training dataset
@@ -92,7 +86,22 @@ def read_video_pyav(container, indices):
             frames.append(frame)
     return np.stack([x.to_ndarray(format="rgb24") for x in frames])
-prompt = "USER: <video>\nWhy is this video funny? ASSISTANT:"
 video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
 container = av.open(video_path)
@@ -129,11 +138,28 @@ print(processor.decode(output[0][2:], skip_special_tokens=True))
 To generate from images and videos in a single `generate` call, use the code below after loading the model as shown above:
 ```python
-prompts = [
-  "USER: <image>\nWhat's the content of the image? ASSISTANT:",
-  "USER: <video>\nWhy is this video funny? ASSISTANT:"
 ]
-inputs = processor(text=prompts, images=image, videos=clip, padding=True, return_tensors="pt").to(model.device)
 # Generate
 generate_ids = model.generate(**inputs, max_new_tokens=100)
@@ -174,15 +200,6 @@ model = LlavaNextVideoForConditionalGeneration.from_pretrained(
 Llama 2 is licensed under the LLAMA 2 Community License,
 Copyright (c) Meta Platforms, Inc. All Rights Reserved.
-## 🎯 Intended use
-**Primary intended uses:**
-<br>
-The primary use of LLaVA is research on large multimodal models and chatbots.
-**Primary intended users:**
-<br>
-The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
 ## ✏️ Citation
 If you find our paper and code useful in your research:

 # LLaVA-NeXT-Video Model Card
 Also check out the Google Colab demo to run Llava on a free-tier Google Colab instance: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1CZggLHrjxMReG-FNOmqSOdi4z7NPq6SO?usp=sharing)
 Disclaimer: The team releasing LLaVa-NeXT-Video did not write a model card for this model so this model card has been written by the Hugging Face team.
 ## 📄 Model details
 **Model type:**
+LLaVA-Next-Video is an open-source chatbot trained by fine-tuning an LLM on multimodal instruction-following data. The model is built on top of LLaVa-NeXT by tuning on a mix of video and image data to achieve better video understanding capabilities. The videos were sampled uniformly to be 32 frames per clip.
+The model is the current SOTA among open-source models on the [VideoMME benchmark](https://arxiv.org/abs/2405.21075).
+Base LLM: [lmsys/vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5)
 <img src="http://drive.google.com/uc?export=view&id=1fVg-r5MU3NoHlTpD7_lYPEBWH9R8na_4">
 **Model date:**
 LLaVA-Next-Video-7B was trained in April 2024.
+**Paper or resources for more information:** https://github.com/LLaVA-VL/LLaVA-NeXT
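
The model details above state that videos were sampled uniformly to 32 frames per clip. As a rough illustration of what uniform sampling means, the sketch below computes evenly spaced frame indices for a clip. The helper name `uniform_frame_indices` and the fallback for short clips are assumptions made for this example, not the authors' training pipeline; the resulting array has the shape expected by an index-based reader such as the `read_video_pyav(container, indices)` helper in the usage code.

```python
import numpy as np


def uniform_frame_indices(total_frames: int, num_frames: int = 32) -> np.ndarray:
    """Return up to `num_frames` frame indices spread evenly over a clip.

    Hypothetical helper for illustration only.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    # Assumption: clips shorter than `num_frames` simply use every frame.
    if total_frames <= num_frames:
        return np.arange(total_frames)
    # Evenly spaced positions from the first to the last frame, rounded to integers.
    return np.linspace(0, total_frames - 1, num=num_frames).round().astype(int)


indices = uniform_frame_indices(total_frames=240)
print(len(indices), indices[0], indices[-1])  # prints: 32 0 239
```

With PyAV, `total_frames` would typically come from `container.streams.video[0].frames`, although some containers report `0` there, so that value should be checked before use.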
 ## 📚 Training dataset
             frames.append(frame)
     return np.stack([x.to_ndarray(format="rgb24") for x in frames])
+# Define a chat history and use `apply_chat_template` to get a correctly formatted prompt
+# Each value in "content" has to be a list of dicts with types ("text", "image", "video")
+conversation = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "text", "text": "Why is this video funny?"},
+            {"type": "video"},
+        ],
+    },
+]
+prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
 video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
 container = av.open(video_path)
 To generate from images and videos in a single `generate` call, use the code below after loading the model as shown above:
 ```python
+conversation_1 = [
+    {
+      "role": "user",
+      "content": [
+          {"type": "text", "text": "What's the content of the image?"},
+          {"type": "image"},
+        ],
+    }
+]
+conversation_2 = [
+    {
+      "role": "user",
+      "content": [
+          {"type": "text", "text": "Why is this video funny?"},
+          {"type": "video"},
+        ],
+    },
 ]
+# Render each conversation with its own chat template call
+prompt_1 = processor.apply_chat_template(conversation_1, add_generation_prompt=True)
+prompt_2 = processor.apply_chat_template(conversation_2, add_generation_prompt=True)
+inputs = processor(text=[prompt_1, prompt_2], images=image, videos=clip, padding=True, return_tensors="pt").to(model.device)
 # Generate
 generate_ids = model.generate(**inputs, max_new_tokens=100)
 Llama 2 is licensed under the LLAMA 2 Community License,
 Copyright (c) Meta Platforms, Inc. All Rights Reserved.
 ## ✏️ Citation
 If you find our paper and code useful in your research: