The Big Picture (Brainproject.ai)

The human brain is an intricate puzzle that we are continually striving to decode. The aim is to replicate its complexity, functionality, and depth in a digital realm: exploring the convergence of neuroscience and artificial intelligence to gain insight into how the mind works and to carry that knowledge into digital counterparts.

Mixture of Experts

Llava-Visionary-70B utilizes a Mixture of Experts (MoE) architecture, with different expert modules specializing in various aspects of visual and language understanding. A gating mechanism selectively activates the most relevant experts for each input. This provides computational efficiency and scalability.
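
To make the routing idea concrete, here is a minimal PyTorch sketch of a top-k gated MoE layer. The class name, expert count, hidden size, and expert structure are illustrative assumptions and are not taken from the Llava-Visionary-70B implementation.

```python
# Illustrative sketch of top-k expert routing in a Mixture of Experts layer.
# All names and dimensions here are hypothetical, not the model's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, hidden_size=4096, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # One feed-forward "expert" per specialization
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.GELU(),
                nn.Linear(4 * hidden_size, hidden_size),
            )
            for _ in range(num_experts)
        ])
        # Gating network scores each expert for every token
        self.gate = nn.Linear(hidden_size, num_experts)

    def forward(self, x):  # x: (batch, seq_len, hidden_size)
        scores = self.gate(x)                                # (B, S, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)   # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[..., k] == e                  # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

Because only the selected experts run for each token, total parameter count can grow without a proportional increase in per-token compute.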

Llava-Visionary-70B

Llava-Visionary-70B is an artificial intelligence system designed for visual reasoning and multimodal understanding. It builds on top of the Llama-2 architecture using a Mixture of Experts approach.

The model has been further pretrained on a large dataset of YouTube videos and images to develop human-like visual comprehension abilities. This enables it to understand the semantics of images, videos, and multimodal content.

Model Description

  • Developed by: Priyanshu Pareek
  • Model type: Transformer-based multimodal model
  • License: wtfpl
  • Finetuned from model: Llama-2-70B

Uses

Llava-Visionary-70B is designed for tasks that involve:

  • Visual understanding of images, videos, diagrams
  • Multimodal reasoning with vision and language
  • Answering questions about visual content
  • Generating captions or descriptions of visual data

It can provide value for use cases such as:

  • Multimodal chatbots and digital assistants
  • Image and video search/recommendation
  • Automated alt-text generation
  • Vision-based QA systems

Direct Use

Llava-Visionary-70B can be used out-of-the-box without further training for zero-shot inference on downstream visual/multimodal tasks.

How to Get Started with the Model

Want to take Llava-Visionary-70B for a spin?

Load the model and tokenizer from the Hugging Face Hub:

```python
from transformers import LlavaVisionary70BModel, LlavaVisionary70BTokenizer

# Load the pretrained tokenizer and model weights from the Hub
tokenizer = LlavaVisionary70BTokenizer.from_pretrained("llava-visionary-70b")
model = LlavaVisionary70BModel.from_pretrained("llava-visionary-70b")
```

Pass multimodal input and generate output:

```python
from PIL import Image

text = "What type of animal is shown in this picture?"
image = Image.open("animal.jpg")

# Tokenize the question together with the image, then generate an answer
inputs = tokenizer(text, images=image, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Training Details

Training Data

Llava-Visionary-70B was further pretrained on a large dataset of YouTube videos and images.

Training Procedure

The model was trained using supervised pretraining on video-text pairs, leveraging the original Llama-2 model weights.

Training Hyperparameters

  • Batch size: 256
  • Learning rate: 5e-5
  • Optimizer: AdamW
  • Training epochs: 3
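
As a rough illustration, the hyperparameters above correspond to a standard fine-tuning loop along the following lines. The `model` and `video_text_dataset` objects are placeholders, and the loss and dataloader details are assumptions rather than the actual training script.

```python
# Hypothetical sketch of the fine-tuning setup implied by the hyperparameters
# above; not the actual training code used for Llava-Visionary-70B.
import torch
from torch.utils.data import DataLoader

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loader = DataLoader(video_text_dataset, batch_size=256, shuffle=True)

model.train()
for epoch in range(3):
    for batch in loader:
        outputs = model(**batch)   # forward pass on a video/text batch
        loss = outputs.loss        # supervised loss on the paired text
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```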