Compared to a huge vision tower
CogVLM, which was recently released, currently puts everything else I've seen in the shade.
Its handwriting OCR, printed-text OCR, and visual understanding are so far ahead of any other vision model I've tried (including Qwen-VL) that it feels like a multi-generational step.
One of the core differences is that it uses laion/CLIP-ViT-bigG-14-laion2B-39B-b160k as its CLIP vision encoder.
That's unwieldy for most hardware setups, but maybe it could be quantized (with LLMs, inference at 5 bits and above stays very close to fp16, and the same might hold for a ViT).
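For anyone who wants to poke at that, here's a minimal sketch of what I'd try: load the bigG vision tower in fp16 as a baseline and again with bitsandbytes 8-bit weights through transformers. I haven't verified how well int8 behaves on ViT layers, so treat it as a starting point rather than a recipe.

```python
# Rough sketch (untested): compare the fp16 vision tower against an 8-bit
# load via bitsandbytes, the same trick commonly used for LLM weights.
# Whether int8 hurts ViT accuracy is exactly the open question here.
import torch
from transformers import CLIPVisionModelWithProjection, BitsAndBytesConfig

model_id = "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k"

# fp16 baseline
fp16_model = CLIPVisionModelWithProjection.from_pretrained(
    model_id, torch_dtype=torch.float16
)
print(sum(p.numel() for p in fp16_model.parameters()) / 1e9, "B params")

# 8-bit attempt: bitsandbytes swaps out the nn.Linear layers, which make up
# most of a ViT, so in principle the memory savings should carry over.
int8_model = CLIPVisionModelWithProjection.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```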
The other difference appears to be the extra "expert" module they embed into the stack.
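As I understand the paper, that expert routes image tokens through its own projection weights inside each transformer layer, while text tokens keep using the original LLM weights. A toy sketch of that routing idea (all names here are mine, not taken from the CogVLM code):

```python
# Toy illustration of the "visual expert" routing as I understand it:
# text tokens go through the original (frozen) LLM projection, image tokens
# through a separate trainable projection of the same shape.
import torch
import torch.nn as nn

class VisualExpertLinear(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.text_proj = nn.Linear(dim, dim)    # stands in for the frozen LLM weights
        self.image_proj = nn.Linear(dim, dim)   # trainable "expert" weights

    def forward(self, x: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); is_image: (batch, seq) bool mask of image positions
        return torch.where(
            is_image.unsqueeze(-1), self.image_proj(x), self.text_proj(x)
        )
```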
I just wanted to leave that here, given we had a short discussion on the ViT used.
The CogVLM example shows how much can be done with a larger one.