A recent paper titled "ShortGPT: Layers in Large Language Models are More Redundant Than You Expect" proposes a simple and effective approach to pruning Large Language Models (LLMs) by removing redundant layers.
Key points:

* Discovers significant redundancy across layers in LLMs, with some layers contributing almost nothing to final performance.
* Defines a new metric, Block Influence (BI), to quantify the importance of each layer in an LLM (sketched below).
* Removes the layers with the lowest BI scores, achieving up to a 25% reduction in parameters and computation while maintaining 92% of the LLM's performance.
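BI itself is simple: one minus the average cosine similarity between a layer's input and output hidden states, so a layer that barely transforms its input scores near zero. Here is a minimal sketch of that idea (the helper names are mine, and the per-layer activations are assumed to come from something like `output_hidden_states=True` in Hugging Face Transformers):

```python
import torch
import torch.nn.functional as F

def block_influence(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    """BI of one layer: 1 - mean cosine similarity between the layer's
    input and output hidden states, each shaped (num_tokens, hidden_dim)."""
    cos = F.cosine_similarity(hidden_in, hidden_out, dim=-1)
    return (1.0 - cos).mean().item()

def lowest_bi_layers(hidden_states: list[torch.Tensor], num_to_prune: int) -> list[int]:
    """Given activations before/after each layer (length num_layers + 1),
    return the indices of the `num_to_prune` layers with the lowest BI,
    i.e. the candidates for removal."""
    scores = [
        block_influence(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]
    return sorted(range(len(scores)), key=lambda i: scores[i])[:num_to_prune]
```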
YOLO-World was designed to address a key limitation of existing zero-shot object detection models: speed. Whereas other state-of-the-art models rely on Transformers, a powerful but typically slower architecture, YOLO-World uses the faster CNN-based YOLO architecture.
YOLO-World comes in three sizes: small with 13M parameters (re-parametrized: 77M), medium with 29M (re-parametrized: 92M), and large with 48M (re-parametrized: 110M).
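If you want to try it yourself, here is a rough zero-shot inference sketch using the `YOLOWorld` wrapper from the `ultralytics` package (the checkpoint filename and image path are assumptions; check the ultralytics model zoo for the exact weights available):

```python
from ultralytics import YOLOWorld

# Load a small YOLO-World checkpoint (filename is an assumption).
model = YOLOWorld("yolov8s-worldv2.pt")

# Zero-shot detection: the vocabulary is defined by text prompts at
# inference time, with no retraining needed.
model.set_classes(["person", "bicycle", "traffic light"])

results = model.predict("street.jpg")  # placeholder image path
results[0].show()
```

The key step is `set_classes`, which is what makes the detector zero-shot: you swap the detection vocabulary by changing the prompts rather than fine-tuning the model.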
The YOLO-World team benchmarked the model on the LVIS dataset, measuring its performance on a V100 GPU without any inference-acceleration techniques such as quantization or TensorRT.
According to the paper, YOLO-World reaches 35.4 AP at 52.0 FPS for the L version and 26.2 AP at 74.1 FPS for the S version. The V100 is a powerful GPU, but frame rates this high are impressive for a zero-shot detector on any hardware.
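The paper does not publish its timing harness, but an FPS measurement in plain PyTorch, matching the no-TensorRT, no-quantization setup, typically looks roughly like this (`model` and `images` are placeholders for any detector and input batch):

```python
import time
import torch

def measure_fps(model, images, warmup: int = 10, iters: int = 100) -> float:
    """Rough end-to-end FPS on GPU: warm up, then time `iters` forward passes."""
    with torch.inference_mode():
        for _ in range(warmup):       # warm up CUDA kernels and allocator
            model(images)
        torch.cuda.synchronize()       # flush queued GPU work before timing
        start = time.perf_counter()
        for _ in range(iters):
            model(images)
        torch.cuda.synchronize()       # wait for the last batch to finish
        elapsed = time.perf_counter() - start
    return iters / elapsed
```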