
Kuldeep Singh Sidhu

singhsidhukuldeep

AI & ML interests

Seeking contributors for a completely open-source 🚀 Data Science platform! singhsidhukuldeep.github.io


Posts 79

Post
Good folks from @Microsoft have released an exciting breakthrough in GUI automation!

OmniParser – a game-changing screen-parsing approach that enables pure vision-based GUI agents to work across multiple platforms and applications.

Key technical innovations:
- Custom-trained interactable icon detection model using 67k screenshots from popular websites
- Specialized BLIP-v2 model fine-tuned on 7k icon-description pairs for extracting functional semantics
- Novel combination of icon detection, OCR, and semantic understanding to create structured UI representations (see the sketch after this list)
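To make the "structured UI representation" idea concrete, here is a minimal Python sketch of what the combined output of icon detection, OCR, and icon captioning could look like. The UIElement fields and the to_prompt helper are illustrative assumptions, not OmniParser's actual API:

```python
# Illustrative sketch only - not OmniParser's real data structures.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class UIElement:
    element_id: int
    bbox: tuple                  # (x_min, y_min, x_max, y_max) in pixels
    interactable: bool           # from the icon-detection model
    text: Optional[str]          # from OCR, if text overlaps the region
    description: Optional[str]   # functional semantics from the captioning model


def to_prompt(elements: List[UIElement]) -> str:
    """Serialize detected elements into a text block an LLM agent can reason over."""
    lines = []
    for e in elements:
        label = e.text or e.description or "unknown"
        lines.append(f"[{e.element_id}] bbox={e.bbox} interactable={e.interactable}: {label}")
    return "\n".join(lines)


if __name__ == "__main__":
    screen = [
        UIElement(0, (24, 12, 120, 44), True, "Sign in", None),
        UIElement(1, (860, 12, 892, 44), True, None, "settings gear icon"),
    ]
    print(to_prompt(screen))
```

An LLM-based agent can then be prompted with this serialized view of the screen instead of raw pixels or HTML.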

The results are impressive:
- Outperforms GPT-4V baseline by significant margins on the ScreenSpot benchmark
- Achieves 73% accuracy on Mind2Web without requiring HTML data
- Demonstrates a 57.7% success rate on AITW mobile tasks

What makes OmniParser special is its ability to work across platforms (mobile, desktop, web) using only screenshot data – no HTML or view hierarchy needed. This opens up exciting possibilities for building truly universal GUI automation tools.

The team has open-sourced both the interactable region detection dataset and icon description dataset to accelerate research in this space.

Kudos to the Microsoft Research team for pushing the boundaries of what's possible with pure vision-based GUI understanding!

What are your thoughts on vision-based GUI automation?
Post
Good folks from @Microsoft Research have just released bitnet.cpp, a game-changing inference framework that achieves remarkable performance gains.

Key Technical Highlights:
- Achieves speedups of up to 6.17x on x86 CPUs and 5.07x on ARM CPUs
- Reduces energy consumption by 55.4–82.2%
- Enables running 100B parameter models at human reading speed (5–7 tokens/second) on a single CPU

Features Three Optimized Kernels:
1. I2_S: Uses 2-bit weight representation
2. TL1: Implements 4-bit index lookup tables for every two weights
3. TL2: Employs 5-bit compression for every three weights (the table-lookup idea behind TL1/TL2 is sketched below)
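To give an intuition for how the table-lookup (TL) kernels avoid multiplications, here is a minimal Python sketch, assuming ternary weights in {-1, 0, +1}: every pair of weights has only 9 possible combinations, so it fits in a 4-bit index, and a per-pair table of precomputed partial sums replaces the multiply-accumulate. This is a conceptual illustration, not bitnet.cpp's actual kernel code (the real kernels work on packed integer buffers with SIMD instructions):

```python
# Conceptual sketch of lookup-table (TL1-style) dot products for ternary weights.
import itertools

import numpy as np

# All 9 combinations of two ternary weights fit in a 4-bit index (0..8).
WEIGHT_PAIRS = list(itertools.product((-1, 0, 1), repeat=2))
PAIR_TO_INDEX = {pair: i for i, pair in enumerate(WEIGHT_PAIRS)}


def pack_ternary_weights(weights: np.ndarray) -> np.ndarray:
    """Pack consecutive weight pairs into small indices (stored as uint8 here)."""
    assert weights.size % 2 == 0
    pairs = weights.reshape(-1, 2)
    return np.array([PAIR_TO_INDEX[tuple(p)] for p in pairs], dtype=np.uint8)


def lut_dot(packed: np.ndarray, activations: np.ndarray) -> float:
    """Dot product via lookup: for each activation pair, precompute the partial
    sum for all 9 weight combinations, then index it with the packed weights."""
    acts = activations.reshape(-1, 2)
    total = 0.0
    for idx, (a0, a1) in zip(packed, acts):
        # Entry k holds w0*a0 + w1*a1 for weight combination k - no multiplies
        # are needed per weight at lookup time.
        table = [w0 * a0 + w1 * a1 for w0, w1 in WEIGHT_PAIRS]
        total += table[idx]
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.integers(-1, 2, size=8).astype(np.int8)    # ternary weights
    x = rng.standard_normal(8).astype(np.float32)      # activations
    packed = pack_ternary_weights(w)
    print(lut_dot(packed, x), float(np.dot(w, x)))     # the two values agree
```

The demo prints the table-lookup result next to np.dot for comparison; the real kernels gain their speed by precomputing such tables once per activation block and streaming through packed weight indices.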

Performance Metrics:
- Lossless inference: the low-bit kernels reproduce the full-precision computation of the same BitNet b1.58 model with no accuracy loss
- Tested across model sizes from 125M to 100B parameters
- Evaluated on both Apple M2 Ultra and Intel i7-13700H processors

This breakthrough makes running large language models locally more accessible than ever, opening new possibilities for edge computing and resource-constrained environments.

models

None public yet

datasets

None public yet