GunaKoppula committed on
Commit 5ef8fae
Parent: 2a274b9

Update README.md

Files changed (1): README.md (+100, -21)
README.md CHANGED
@@ -9,36 +9,115 @@ app_file: app.py
  pinned: false
  license: mit
  ---

  ## Phi2 : Multimodal Finetuning
  ### Details
- 1. LLM Backbone: Phi2
- 2. Vision Tower: clip-vit-large-patch14-336
- 3. Audio Model: Whisper
- 4. Pretraining Dataset: LAION-CC-SBU dataset with BLIP captions (200k samples)
- 5. Finetuning Dataset: Instruct 150k dataset based on COCO

- ### Design
- ![image](https://github.com/RaviNaik/ERA-CAPSTONE/assets/23289802/56df24cd-2681-4e17-ab64-9652f609b15f)

- ### Pretraining
- #### Training Loss Curve
- ![image](https://github.com/RaviNaik/ERA-CAPSTONE/assets/23289802/b6c37a95-0a56-4b52-8719-3ff56dc1b703)

- #### Learning Rate
- ![image](https://github.com/RaviNaik/ERA-CAPSTONE/assets/23289802/44d9a11b-b28d-47e1-ba1d-d6dc22ebe748)

- #### Training Logs
- ![image](https://github.com/RaviNaik/ERA-CAPSTONE/assets/23289802/76543d98-d9fe-4c1a-ac47-3d06e48053ad)

- ### Finetuning
- #### Training Loss Curve
- ![image](https://github.com/RaviNaik/ERA-CAPSTONE/assets/23289802/45ef40bd-fae5-4cfe-a522-c0eed2833230)

- #### Learning Rate
- ![image](https://github.com/RaviNaik/ERA-CAPSTONE/assets/23289802/df60ee62-a537-4e36-a7f7-f7111e101162)

  #### Training Logs
- ![image](https://github.com/RaviNaik/ERA-CAPSTONE/assets/23289802/2747acce-bc99-4c37-a05a-d5e81cb9aa9d)

  ### Results
- ![image](https://github.com/RaviNaik/ERA-CAPSTONE/assets/23289802/f12a9f04-df32-413e-b957-774c30381b2b)

+ # ERA-CAPSTONE
+
+ 🤗[**Space Link**](https://huggingface.co/spaces/GunaKoppula/MultiModal-Phi2)
+
+ ### Tasks:
+
+ 1. Make a multimodal LLM that can take these inputs:
+
+    - :heavy_check_mark: Text
+    - :heavy_check_mark: Image
+    - :heavy_check_mark: Audio
+
+ 2. Training:
+
+    - Image:
+      - :heavy_check_mark: Use the original Instruct 150k dataset and use CLIP to get the image embeddings.
+      - :heavy_check_mark: Add a projection layer that maps the CLIP embeddings into a representation that can be fed to the Phi2 model (see the sketch after this list).
+      - :heavy_check_mark: Add an adapter and train it with QLoRA on the Instruct 150k dataset.
+    - Audio:
+      - :heavy_check_mark: Use Whisper to perform ASR (automatic speech recognition).
+      - :heavy_check_mark: Add a projection layer for the Whisper output (see the sketch after this list).
+    - Text:
+      - :heavy_check_mark: Accept any text prompt and generate the related details.
+
+ 3. :heavy_check_mark: The output remains text, generated from the multimodal inputs: text, image, and audio.
+
+ 4. :heavy_check_mark: The deployment page should work like ChatGPT, where we can send text or images, or upload audio (live recording or a file).
+
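Both projection steps above (CLIP to Phi2 and Whisper to Phi2) can be as small as a linear layer or a shallow MLP that maps encoder features into Phi2's embedding space. The sketch below is a minimal illustration rather than this repository's implementation: the class names are hypothetical, and the dimensions are assumptions taken from the public model configs (1024 for clip-vit-large-patch14-336, 384 for the whisper-tiny encoder, 2560 for Phi2).

```python
import torch.nn as nn

PHI2_HIDDEN = 2560      # assumed hidden size of microsoft/phi-2
CLIP_HIDDEN = 1024      # assumed hidden size of clip-vit-large-patch14-336
WHISPER_HIDDEN = 384    # assumed hidden size of the whisper-tiny encoder


class ClipToPhiProjection(nn.Module):
    """Hypothetical projection from CLIP patch embeddings to Phi2's embedding space."""

    def __init__(self, in_dim: int = CLIP_HIDDEN, out_dim: int = PHI2_HIDDEN):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, clip_features):        # (batch, num_patches, in_dim)
        return self.proj(clip_features)      # (batch, num_patches, out_dim)


class AudioToPhiProjection(nn.Module):
    """Hypothetical projection from Whisper encoder states to Phi2's embedding space."""

    def __init__(self, in_dim: int = WHISPER_HIDDEN, out_dim: int = PHI2_HIDDEN):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, audio_features):       # (batch, num_frames, in_dim)
        return self.proj(audio_features)
```

The projected image and audio tokens can then be concatenated with the text token embeddings before being passed to Phi2.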
+ ## Phi2 : Pretraining LLM from Scratch
+ ### Details
+ 1. Model used: [Microsoft Phi2](https://huggingface.co/microsoft/phi-2)
+ 2. Dataset used: Tiny Stories dataset (100k samples) & real-time data (100k samples) generated from a finetuned Phi2 model via Ollama
+ 3. Pretraining approach: pretraining with QLoRA (see the sketch below)
+
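As a rough illustration of the QLoRA setup, the snippet below loads Phi2 in 4-bit precision with bitsandbytes and attaches LoRA adapters through PEFT. The rank, alpha, and target module names are assumptions for illustration, not values taken from this repository's training configuration.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the frozen base model in 4-bit (NF4) so only the small LoRA adapters are trained.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Hypothetical LoRA hyperparameters; target modules follow the Hugging Face Phi2 layer names.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```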
+ ### Training Loss Curve
+ <img src="https://github.com/GunaKoppula/ERAV1-CAPSTONE/assets/61241928/1692461c-de43-4b50-87d7-bdc0d72b5f69.type" width="500">
+
+ ### Training Logs
+ ![image](https://github.com/GunaKoppula/ERAV1-CAPSTONE/assets/61241928/2672a350-7786-4773-b1bc-ea99a3e7091c)
+
+
  ## Phi2 : Multimodal Finetuning
  ### Details
+ 1. LLM Backbone: [Phi2](https://huggingface.co/microsoft/phi-2)
+ 2. Vision Tower: [clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336)
+ 3. Audio Model: [Whisper Tiny](https://huggingface.co/openai/whisper-tiny)
+ 4. Pretraining Dataset: [LAION-CC-SBU dataset with BLIP captions (200k samples)](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain)
+ 5. Finetuning Dataset: [Instruct 150k dataset based on COCO](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K)

+ ```python
+ class AudioLanguageConnector:
+ ```
+ - This class prepares and tokenizes audio-related text using the `microsoft/phi-2` tokenizer. `<audio_start>` and `<audio_end>` tokens are added around the input text to mark the audio context, and the tokenized output is returned as a tensor in a format the downstream model can consume.
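
A minimal sketch of what such a connector could look like is shown below. It is an illustration based on the description above (the tokenizer and the special token names), not the repository's exact implementation.

```python
from transformers import AutoTokenizer


class AudioLanguageConnector:
    """Wraps the Phi2 tokenizer and marks audio context with special tokens (illustrative sketch)."""

    def __init__(self, model_name: str = "microsoft/phi-2"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
        # Register the markers described above as additional special tokens.
        self.tokenizer.add_special_tokens(
            {"additional_special_tokens": ["<audio_start>", "<audio_end>"]}
        )

    def __call__(self, text: str):
        prompt = f"<audio_start> {text} <audio_end>"
        # Return PyTorch tensors ready to be consumed by the language model.
        return self.tokenizer(prompt, return_tensors="pt", return_attention_mask=False)
```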
 
 
+ ```python
+ class WhisperWithProjection:
+ ```
+ - This class encapsulates the audio transcription step: it uses the pre-trained `openai/whisper-tiny` model to convert audio files into text transcriptions.
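
The transcription step might look roughly like the following; the method name and signature are assumptions for illustration, not the repository's API.

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration


class WhisperWithProjection:
    """Transcribes raw audio with whisper-tiny (illustrative sketch)."""

    def __init__(self, model_name: str = "openai/whisper-tiny"):
        self.processor = WhisperProcessor.from_pretrained(model_name)
        self.model = WhisperForConditionalGeneration.from_pretrained(model_name)

    def transcribe(self, audio_array, sampling_rate: int = 16000) -> str:
        # Convert the raw waveform into log-mel input features expected by Whisper.
        inputs = self.processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")
        predicted_ids = self.model.generate(inputs.input_features)
        return self.processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```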

+ ```python
+ class MultiModalPhi2:
+ ```
+ - This class takes the input text, audio, and image, and constructs a conversation prompt with the appropriate formatting for the model. It tokenizes the prompt, preprocesses the image, concatenates the audio embeddings if available, and generates new tokens with the pre-trained model, conditioning on all of the provided modalities. It then decodes and returns the generated output, handling special tokens and potential mismatches.
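
The end-to-end inference flow could be sketched roughly as below. This is a simplified illustration that reuses the pieces above; the constructor arguments, the projection layer, and the prompt format are assumptions, and the actual class additionally handles the conversation template and the special audio/image tokens.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer, CLIPImageProcessor, CLIPVisionModel


class MultiModalPhi2:
    """Combines text, image, and audio inputs for Phi2 generation (illustrative sketch)."""

    def __init__(self,
                 phi_name: str = "microsoft/phi-2",
                 clip_name: str = "openai/clip-vit-large-patch14-336"):
        self.tokenizer = AutoTokenizer.from_pretrained(phi_name)
        self.phi = AutoModelForCausalLM.from_pretrained(phi_name)
        self.image_processor = CLIPImageProcessor.from_pretrained(clip_name)
        self.vision_tower = CLIPVisionModel.from_pretrained(clip_name)
        # Hypothetical projection from the CLIP hidden size to the Phi2 hidden size.
        self.projection = nn.Linear(self.vision_tower.config.hidden_size,
                                    self.phi.config.hidden_size)

    @torch.no_grad()
    def __call__(self, text: str, image=None, audio_transcript=None, max_new_tokens: int = 100) -> str:
        # Fold the Whisper transcript (if any) into the text prompt.
        prompt = text if audio_transcript is None else f"{audio_transcript}\n{text}"
        input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids
        embeds = [self.phi.get_input_embeddings()(input_ids)]
        if image is not None:
            pixel_values = self.image_processor(images=image, return_tensors="pt").pixel_values
            patch_features = self.vision_tower(pixel_values).last_hidden_state  # (1, patches, clip_dim)
            embeds.insert(0, self.projection(patch_features))                   # image tokens first
        inputs_embeds = torch.cat(embeds, dim=1)
        output_ids = self.phi.generate(inputs_embeds=inputs_embeds, max_new_tokens=max_new_tokens)
        return self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
```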
+ ### Pretraining
+ #### Training Loss Curve and Learning Rate
+ <img src="https://github.com/GunaKoppula/ERAV1-CAPSTONE/assets/61241928/c9e205b9-44aa-4ef3-b7da-f6d69b5f0f2a.type" width="400"> <img src="https://github.com/GunaKoppula/ERAV1-CAPSTONE/assets/61241928/a82cf4b6-0cc4-47d9-ad7e-f504677a5074.type" width="393">
+
+ #### Training Logs
+ ![image](https://github.com/GunaKoppula/ERAV1-CAPSTONE/assets/61241928/83cbd14a-9626-410c-99be-5757c089c9b2)
+
+ ### Finetuning
+ #### Training Loss Curve and Learning Rate
+ <img src="https://github.com/GunaKoppula/ERAV1-CAPSTONE/assets/61241928/ceb9d187-14cb-4372-8370-bdbf7f7a3812.type" width="388"> <img src="https://github.com/GunaKoppula/ERAV1-CAPSTONE/assets/61241928/5d1fe7b3-5cec-46c8-aaac-a1e3ae5b7f6c.type" width="400">

  #### Training Logs
+ ![image](https://github.com/GunaKoppula/ERAV1-CAPSTONE/assets/61241928/3aebd889-d120-466f-8751-9c7e37023ab1)
+

  ### Results
+ ![image](https://github.com/GunaKoppula/ERAV1-CAPSTONE/assets/61241928/4b54c0bd-b078-4dc9-932a-49640d0297dc)
+
+ ### Deployed on HF
+ #### Text & Image:
+
+ #### Audio & Image:
+ **Question Asked: Tell me about this image**
+
+ ### Future Scope:
+ - Incorporating the original LLaVA model's finetuning on the larger set of BLIP captions (558k samples) could lead to significant improvements.
+ - Using GPTQ or AWQ quantization can reduce latency and make the model more efficient to serve (see the sketch below).
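
As a rough illustration of that last point, a 4-bit GPTQ quantization of the Phi2 backbone through the transformers/optimum integration might look like the snippet below; the bit width and calibration dataset are assumptions, and AWQ (via AutoAWQ) would follow a similar pattern.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize the weights to 4 bits using a small calibration set (requires optimum and auto-gptq).
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)
quantized_model.save_pretrained("phi-2-gptq-4bit")
```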