Training data details

#8
by floschne - opened

Hi, and thanks for this amazing work!

Could you please elaborate on the training data? I have the following questions :-)

  • "558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP." --> What do you mean by captioned by BLIP? All of the mentioned datasets already have captions, no?
  • "40K ShareGPT data." --> ShareGPT is text-only. Does that mean, you trained on text-only CLM or do you actually mean ShareGPT4V, which is multi-modal?
  • I assume that most if not all of the textual data is in English, correct?
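My guess for "captioned by BLIP" is that the original alt-text captions were discarded and the images were re-captioned with a BLIP captioning model, roughly like the sketch below. The checkpoint name and generation settings are my own assumptions, not something stated in the card.

```python
# Hypothetical sketch of re-captioning an image with BLIP via Hugging Face
# transformers; checkpoint and parameters are assumptions, not from the card.
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Any image URL works here; this one is just an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Generate a fresh caption, ignoring whatever alt-text the source dataset had.
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```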
