seanmor5 committed
Commit 11c4417
1 Parent(s): 7eff82a

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -10,7 +10,7 @@ datasets:
 
 ![Text Meme](meme.jpg)
 
- Is text really all you need? Probably not, but the least we can do is try. This repo contains a QLoRA fine-tune of Mistral-7B on the original Llava-150K-Instruct dataset; however, each image is encoded as a base64 representation. With enough data, can an LLM learn to "see" just from text? Early results say absolutely not, but I am committed to burning my GPU credits regardless of how bad the result is.
+ Is text really all you need? Probably not, but the least we can do is try. This repo contains a QLoRA fine-tune of [Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) on the original [Llava-150K-Instruct](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) dataset; however, each image is encoded as a base64 representation. With enough data, can an LLM learn to "see" just from text? Early results say absolutely not, but I am committed to burning my GPU credits regardless of how bad the result is.
 
 I do believe in the future we will see a "simplification" of architectures designed to work for multiple modalities. LLaVA, for example, combines a vision encoder with a pre-trained LLM. Perhaps models of the future will have a joint representation for both images and text, and not have to rely on splicing two models together. For example, perhaps [Token-Free Models](https://arxiv.org/html/2401.13660v1) could be trained on multi-modal byte representations of inputs. Of course, this would be extremely computationally expensive compared to modern vision models, but maybe 10-20 years down the line it's not that big of a deal?
 
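
For anyone curious what "each image is encoded as a base64 representation" looks like in practice, here is a minimal sketch of the idea. The helper names, the `[IMG]`/`[/IMG]` delimiters, and the example paths are assumptions for illustration, not code from this repo:

```python
# A minimal, hypothetical sketch (not this repo's actual preprocessing code) of how
# an image could be base64-encoded and spliced into a LLaVA-style text-only sample.
import base64
from pathlib import Path


def encode_image_b64(path: str) -> str:
    """Read an image file and return its contents as a base64 string."""
    return base64.b64encode(Path(path).read_bytes()).decode("utf-8")


def build_text_only_sample(image_path: str, question: str, answer: str) -> dict:
    """Swap the usual <image> placeholder for the raw base64 payload.

    The [IMG]...[/IMG] delimiters are assumed for illustration; the actual
    fine-tune may use a different convention.
    """
    prompt = f"[IMG]{encode_image_b64(image_path)}[/IMG]\n{question}"
    return {"prompt": prompt, "response": answer}


# Illustrative usage with placeholder paths and text, not real dataset records.
sample = build_text_only_sample(
    "train2017/000000000001.jpg",
    "What is shown in the image?",
    "A description of the image.",
)
print(len(sample["prompt"]))  # base64 payloads make these prompts very long
```

Note that base64 blows up quickly: even a modest JPEG becomes tens of thousands of characters of prompt, so context length is likely the practical bottleneck for this experiment.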