title: MultiModal Phi2
emoji: 🚀
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.35.2
pinned: false
license: mit
🤗 [**Space Link**](
### Tasks:
1. Make a multi-modal LLM that can take these inputs:
   - :heavy_check_mark: Text
   - :heavy_check_mark: Image
   - :heavy_check_mark: Audio
2. Training:
   - Image:
     - :heavy_check_mark: Use the original Instruct 150k dataset, and use CLIP to get the image embeddings.
     - :heavy_check_mark: Add a projection layer from the CLIP embeddings to something that can be fed to the Phi model.
     - :heavy_check_mark: Add an adapter to train (QLoRA) on the Instruct 150k dataset.
   - Audio:
     - :heavy_check_mark: Use Whisper to perform ASR.
     - :heavy_check_mark: Add a projection layer for the Whisper output.
   - Text:
     - :heavy_check_mark: Accept any text prompt and generate the related details.
3. :heavy_check_mark: The output remains text, based on multimodal inputs: text, image, and audio.
4. :heavy_check_mark: The deployment page should look like ChatGPT, where we can send in images and text, or upload audio (live recording or file).
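The image path in the task list (CLIP embeddings, then a projection layer, then the Phi model) can be sketched in plain Python. This is a toy illustration rather than the repository's code: the dimensions, the random weights, and the helper names `linear_project` and `build_input_sequence` are hypothetical, and a real implementation would operate on PyTorch tensors.

```python
import random


def linear_project(vec, weight, bias):
    """Apply y = W @ x + b to one embedding vector (plain lists of floats)."""
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(row, vec)) + b_i
        for row, b_i in zip(weight, bias)
    ]


def build_input_sequence(image_patches, text_embeddings, weight, bias):
    """Project each image patch embedding, then prepend the results to the text embeddings."""
    projected = [linear_project(p, weight, bias) for p in image_patches]
    return projected + text_embeddings


# Toy sizes (assumptions): 4-dim "CLIP" features, 6-dim "Phi" embeddings.
CLIP_DIM, LLM_DIM = 4, 6
rng = random.Random(0)
weight = [[rng.uniform(-1, 1) for _ in range(CLIP_DIM)] for _ in range(LLM_DIM)]
bias = [0.0] * LLM_DIM

image_patches = [[rng.uniform(-1, 1) for _ in range(CLIP_DIM)] for _ in range(3)]
text_embeddings = [[rng.uniform(-1, 1) for _ in range(LLM_DIM)] for _ in range(5)]

sequence = build_input_sequence(image_patches, text_embeddings, weight, bias)
print(len(sequence), len(sequence[0]))  # sequence length, embedding width
```

After projection, image patches and text tokens share one embedding width, so the language model can consume them as a single sequence.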
## Phi2 : Pretraining LLM from Scratch
### Details
1. Model used: [Microsoft Phi2](
2. Dataset used: TinyStories dataset (100k samples) and real-time data (100k samples) generated by a finetuned Phi2 model via Ollama
3. Pretraining approach: Pretraining using QLoRA
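QLoRA trains small low-rank adapters on top of a frozen, quantized base model. The arithmetic below is a rough sketch of why the adapter is cheap to train. The layer shape and the rank are assumptions chosen for illustration (2560 is Phi-2's hidden size, and rank 64 is hypothetical), not values taken from this project's configuration.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters of one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank


def full_param_count(d_in, d_out):
    """Parameters of the frozen dense weight the adapter attaches to (bias ignored)."""
    return d_in * d_out


# Assumed shapes: a square 2560 x 2560 projection with a hypothetical LoRA rank of 64.
d_in = d_out = 2560
rank = 64

adapter = lora_param_count(d_in, d_out, rank)
dense = full_param_count(d_in, d_out)
print(adapter, dense, f"{100 * adapter / dense:.1f}%")
```

Under these assumptions the adapter holds about 5% of the parameters of the dense matrix it modifies, and only those parameters receive gradients.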
### Training Loss Curve
<img src="" width="500">
### Training Logs
## Phi2 : Multimodal Finetuning
### Details
1. LLM Backbone: [Phi2](
2. Vision Tower: [clip-vit-large-patch14-336](
3. Audio Model: [Whisper Tiny](
4. Pretraining Dataset: [LAION-CC-SBU dataset with BLIP captions (200k samples)](
5. Finetuning Dataset: [Instruct 150k dataset based on COCO](
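The component choices above fix some useful sizes. The sketch below derives the number of image patches from the vision tower's name (336-pixel input, 14-pixel patches) and estimates the size of a projector between the two models. The hidden sizes (1024 for CLIP ViT-L, 2560 for Phi-2) come from the public model configurations. The single linear projector is an assumption made for the estimate, since the actual projector in this project may have a different shape.

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    per_side = image_size // patch_size
    return per_side * per_side


def linear_params(d_in, d_out):
    """Weights plus biases of one fully connected layer."""
    return d_in * d_out + d_out


IMAGE_SIZE = 336    # from "clip-vit-large-patch14-336"
PATCH_SIZE = 14     # from "clip-vit-large-patch14-336"
CLIP_HIDDEN = 1024  # CLIP ViT-L hidden size
PHI2_HIDDEN = 2560  # Phi-2 hidden size

patches = num_patches(IMAGE_SIZE, PATCH_SIZE)
# Assumption: one linear layer maps CLIP features to Phi-2 embeddings.
projector_params = linear_params(CLIP_HIDDEN, PHI2_HIDDEN)
print(patches, projector_params)
```

Each image therefore contributes a few hundred embedding positions to the language model's context, which is worth keeping in mind when budgeting sequence length.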
class AudioLanguageConnector:
- This class prepares and tokenizes audio-related text using the "microsoft/phi-2" tokenizer. It adds the `<audio_start>` and `<audio_end>` tokens to the input text to mark the audio context, then returns the tokenized output as a tensor. It acts as a connector that puts text data into a format suitable for the specified audio model.
class WhisperWithProjection:
- This class encapsulates the steps needed to transcribe audio. It uses the pre-trained "openai/whisper-tiny" model to convert audio files into text transcriptions.
class MultiModalPhi2:
- This class takes input text, audio, and images and constructs a conversation prompt formatted for the model. It tokenizes the prompt, preprocesses the image, concatenates the audio embeddings if available, and generates new tokens with the pre-trained model, taking all input modalities into account. It then decodes and returns the generated output, handling special tokens and potential mismatches.
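The prompt handling described for these classes can be sketched without loading any model. The snippet below is a hypothetical illustration, not the project's implementation: the `<image>` placeholder, the `USER:`/`ASSISTANT:` template, and the function names are assumptions, and it passes the audio as transcribed text wrapped in the audio tokens instead of concatenating audio embeddings. Tokenization with the "microsoft/phi-2" tokenizer is left out.

```python
AUDIO_START = "<audio_start>"
AUDIO_END = "<audio_end>"
IMAGE_PLACEHOLDER = "<image>"  # hypothetical marker for where image embeddings go


def wrap_audio_text(transcription):
    """Surround transcribed audio with the audio boundary tokens."""
    return f"{AUDIO_START} {transcription.strip()} {AUDIO_END}"


def build_prompt(text=None, transcription=None, has_image=False):
    """Assemble one user turn from whichever modalities are present."""
    parts = []
    if has_image:
        parts.append(IMAGE_PLACEHOLDER)
    if transcription:
        parts.append(wrap_audio_text(transcription))
    if text:
        parts.append(text.strip())
    if not parts:
        raise ValueError("at least one of text, transcription, or image is required")
    user_turn = "\n".join(parts)
    # Assumed chat template; the real template depends on how the model was finetuned.
    return f"USER: {user_turn}\nASSISTANT:"


prompt = build_prompt(
    text="Tell me about this image",
    transcription=" what is shown here ",
    has_image=True,
)
print(prompt)
```

Keeping prompt assembly in one function makes it easy to support any combination of modalities, since each input is optional and the model always sees the same overall layout.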
### Pretraining
#### Training Loss Curve and Learning Rate
<img src="" width="400"> <img src="" width="393">
#### Training Logs
### Finetuning
#### Training Loss Curve and Learning Rate
<img src="" width="388"> <img src="" width="400">
#### Training Logs
### Results
### Deployed on HF
#### Text & Image:
#### Audio & Image:
**Question Asked: Tell me about this image**
### Future Scope:
- Incorporating the original LLaVA model's finetuning on a larger set of BLIP captions (558k) could lead to significant improvements.
- Quantizing the model with GPTQ or AWQ could reduce latency and memory use, making the model more efficient.
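GPTQ and AWQ are weight-only quantization methods. The sketch below shows neither of them. It shows the simpler round-to-nearest 4-bit step that both improve on, to make clear what weight quantization trades away. The weight values are made up for the example.

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest quantization to the signed 4-bit range [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


weights = [0.42, -1.30, 0.07, 0.91, -0.55]  # made-up example values
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, round(max_error, 4))
```

Each weight is stored as a 4-bit code plus a shared scale, so the per-weight error is at most half the scale. GPTQ and AWQ choose codes and scales more carefully to limit the effect of that error on model outputs.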