Vlad Bogolin

vladbogo

AI & ML interests

LLMs, Computer Vision

vladbogo's activity

posted an update about 1 month ago
SwapAnything is a new method that allows swapping any object in an image with personalized concepts given by a reference image.

Key points:
1️⃣ It uses pre-trained diffusion models to enable precise, high-fidelity object swapping in images.
2️⃣ Targeted variable swapping preserves the background while editing only the selected regions.
3️⃣ SwapAnything achieves strong results in single-object, multi-object, partial-object, and cross-domain swapping tasks.

Paper: SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing (2404.05717)
Project page: https://swap-anything.github.io

Congrats to the authors for their work!
posted an update about 1 month ago
Anthropic introduces "Many-shot Jailbreaking" (MSJ), a new attack on large language models! MSJ exploits long context windows to override safety constraints.

Key Points:
* Prompts LLMs with hundreds of examples of harmful behavior formatted as a dialogue
* Generates malicious examples using an uninhibited "helpful-only" model
* Effective at jailbreaking models like Claude 2.0, GPT-3.5, GPT-4
* Standard alignment techniques provide limited protection against long context attacks

Paper: https://www.anthropic.com/research/many-shot-jailbreaking
More details in my blog: https://huggingface.co/blog/vladbogo/many-shot-jailbreaking

Congrats to the authors for their work!
posted an update about 1 month ago
Google DeepMind introduces Gecko, a new text embedding model! Gecko uses a two-step process that leverages synthetic data generation and reranking.

Key points:
* Uses an LLM to generate diverse synthetic queries and tasks from web passages
* Refines the data by retrieving candidate passages and relabeling positives/negatives using the same LLM
* Achieves strong results on the Massive Text Embedding Benchmark (MTEB), where the compact 256-dimensional Gecko outperforms 768-dimensional models.
* The 768-dimensional Gecko achieves state-of-the-art performance, competing with much larger models. A rough sketch of the two-step data recipe follows below.
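For illustration, here is a minimal, hedged sketch of that two-step recipe as summarized above; `call_llm` and `retrieve` are hypothetical stand-ins for an LLM client and an off-the-shelf retriever, not Gecko's actual implementation.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def retrieve(query: str, corpus: list[str], k: int = 20) -> list[str]:
    raise NotImplementedError("any off-the-shelf retriever works here")

def make_training_example(passage: str, corpus: list[str]) -> dict:
    # Step 1: have the LLM invent a task description and a query for this passage.
    task = call_llm(f"Describe a retrieval task this passage could serve:\n{passage}")
    query = call_llm(f"Task: {task}\nWrite a query that is answered by:\n{passage}")

    # Step 2: retrieve neighbouring passages and let the same LLM relabel them,
    # so the final positive may even be a better passage than the seed one.
    candidates = retrieve(query, corpus)
    ranked = sorted(
        candidates,
        key=lambda c: int(call_llm(f"Rate 1-5 how well this answers '{query}':\n{c}")),
        reverse=True,
    )
    return {"task": task, "query": query,
            "positive": ranked[0], "hard_negative": ranked[-1]}
```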

Paper: Gecko: Versatile Text Embeddings Distilled from Large Language Models (2403.20327)
More details in my blog: https://huggingface.co/blog/vladbogo/gecko

Congrats to the authors for their work!
posted an update about 1 month ago
A new paper titled "Long-Form Factuality in Large Language Models" proposes a new approach to evaluate the long-form factuality of large language models using an AI agent! They introduce SAFE (Search-Augmented Factuality Evaluator) which leverages an LLM to break down responses into individual facts, query Google to verify each fact, and perform multi-step reasoning.

Key points:
* SAFE (Search-Augmented Factuality Evaluator) is an automated method using an LLM agent to evaluate factuality
* It also introduces LongFact, a 2,280 prompt set spanning 38 topics to test open-domain factual knowledge
* SAFE agrees with human annotators 72% of the time while being 20x cheaper. In a small-scale experiment using a more thorough human procedure (researchers + full internet search), it also wins 76% of the disagreement cases.
* Larger models like GPT-4, Claude Opus and Gemini Ultra tend to exhibit better long-form factuality.
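Below is a minimal sketch of a SAFE-style verification loop, assuming hypothetical `call_llm` and `web_search` helpers; the authors' actual implementation is in the linked repository.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def web_search(query: str, k: int = 3) -> list[str]:
    raise NotImplementedError("plug in a search API here")

def safe_evaluate(response: str) -> dict:
    # 1. Split the long-form response into self-contained atomic facts.
    facts = call_llm(
        "List each individual factual claim in the text below, one per line:\n"
        + response
    ).splitlines()

    verdicts = {}
    for fact in filter(None, (f.strip() for f in facts)):
        # 2. Ask the LLM for a search query targeting this specific fact.
        query = call_llm(f"Write a short search query to verify: {fact}")
        evidence = "\n".join(web_search(query))
        # 3. Judge the fact against the retrieved evidence (multi-step reasoning).
        verdicts[fact] = call_llm(
            f"Evidence:\n{evidence}\n\nIs the claim '{fact}' supported? "
            "Answer 'supported' or 'not supported'."
        )
    return verdicts
```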

Paper: Long-form factuality in large language models (2403.18802)
Code and data: https://github.com/google-deepmind/long-form-factuality

Congrats to the authors for their work!
posted an update about 2 months ago
A new paper introduces Visual CoT, an approach that enhances multi-modal large language models with visual chain-of-thought reasoning capabilities. This allows language models to dynamically identify and focus on specific regions within images that are most relevant for answering questions, mimicking human-like efficient visual reasoning.

Key points:
* Introduces the 373k Visual CoT dataset with bounding box annotations highlighting essential image regions
* Proposes a multi-turn pipeline for focusing on relevant visual inputs
* Achieves strong results on multi-modal benchmarks
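As a rough illustration of the multi-turn idea (not the authors' pipeline), here is a two-turn sketch where `vlm` is a hypothetical multimodal model call that takes one or more images plus a prompt and, in the first turn, is assumed to return a pixel bounding box.

```python
from PIL import Image

def visual_cot_answer(image: Image.Image, question: str, vlm):
    # Turn 1: ask which region of the image matters for the question.
    # Assumed to return a (left, top, right, bottom) box in pixel coordinates.
    box = vlm(image, "Return the bounding box (left, top, right, bottom) most "
                     f"relevant to answering: {question}")
    # Turn 2: answer using both the full image and the zoomed-in region.
    region = image.crop(box)
    return vlm([image, region], question)
```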

Paper: Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models (2403.16999)
Code, data and other resources: https://github.com/deepcs233/Visual-CoT

Congrats to the authors for their work!
posted an update about 2 months ago
xAI releases the weights for Grok-1. Apparently it's a 314B-parameter MoE with 25% of the weights active on a given token (a generic routing sketch below shows where such a ratio comes from).

Blog: https://x.ai/blog/grok-os
Code: https://github.com/xai-org/grok
Model: xai-org/grok-1
Weights: magnet:?xt=urn:btih:5f96d43576e3d386c9ba65b883210a393b68210e&tr=https%3A%2F%2Facademictorrents.com%2Fannounce.php&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce
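For context, here is a generic top-2-of-8 mixture-of-experts layer in PyTorch. This is a sketch of standard MoE routing rather than Grok-1's actual architecture: routing each token to 2 of 8 experts touches roughly a quarter of the expert weights, which is the kind of "25% active" figure quoted above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                  # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)
        weights, idx = gate.topk(self.k, dim=-1)           # pick k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                   # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

print(TopKMoE()(torch.randn(4, 64)).shape)                 # torch.Size([4, 64])
```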
posted an update about 2 months ago
"Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts" is a new framework designed to animate specific regions within an image through user inputs.

Key points:
* Enables precise animation of selected image regions with just a user click and a concise motion description.
* Achieves promising results for generating localized animations.

Paper: Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts (2403.08268)

Congrats to the authors for their work!
posted an update about 2 months ago
Synth^2 is a new approach that leverages large language models and text-to-image generators to create synthetic image-caption data for boosting visual-language model performance.

Key Points:
* Overcomes data limitations by generating high-quality synthetic image-caption pairs, reducing reliance on costly human annotations.
* Achieves competitive results on image captioning tasks using 40x less paired data than state-of-the-art methods.

Paper: Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings (2403.07750)

Congrats to the authors for their work!
posted an update 2 months ago
A recent paper titled "ShortGPT: Layers in Large Language Models are More Redundant Than You Expect" proposes a simple and effective approach to pruning Large Language Models (LLMs) by removing redundant layers.

Key points:
* Discovers significant redundancy across layers in LLMs, with some layers playing a negligible role in the final performance.
* Defines a new metric called Block Influence (BI) to quantify the importance of each layer in an LLM (a rough sketch of the idea follows this list).
* Removes layers with low BI scores, achieving up to 25% reduction in parameters and computation while maintaining 92% of the LLM's performance.
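Here is a minimal sketch of a Block-Influence-style score, based on my reading that a block which barely changes its hidden states is a pruning candidate; the exact formulation is in the paper, and the random tensors below merely stand in for activations captured with forward hooks.

```python
import torch

def block_influence(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    """hidden_in/hidden_out: (tokens, d_model) activations before/after one block."""
    cos = torch.nn.functional.cosine_similarity(hidden_in, hidden_out, dim=-1)
    return float(1.0 - cos.mean())        # close to 0 => block is nearly an identity map

# Toy usage with random tensors standing in for captured activations.
scores = {f"layer_{i}": block_influence(torch.randn(8, 16), torch.randn(8, 16))
          for i in range(4)}
to_prune = sorted(scores, key=scores.get)[:1]   # e.g. drop the lowest-BI layer
print(scores, to_prune)
```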

Congrats to the authors for their work!

Paper: ShortGPT: Layers in Large Language Models are More Redundant Than You Expect (2403.03853)

posted an update 2 months ago
A recent paper titled "Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters" proposes using fine-tuned Multimodal Language Models (MLMs) as high-quality filters for image-text data.

Key points:
* Defines multiple metrics to assess image-text quality from different perspectives like object details, text quality, and semantic understanding.
* Leverages GPT-4 and GPT-4V to construct high-quality instruction data for fine-tuning open-source MLMs as effective data filters.
* Fine-tuned MLM filters generate more precise scores, leading to better filtered data and improved performance of pre-trained models on various downstream tasks.
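A hedged sketch of how such a filter could be applied to an image-text corpus: `score_pair` is a hypothetical wrapper around the released filter model, and the metric names simply mirror the perspectives listed above.

```python
METRICS = ("object_detail", "text_quality", "semantic_understanding")

def score_pair(image_path: str, caption: str, metric: str) -> float:
    raise NotImplementedError("call the fine-tuned MLM filter here")

def filter_dataset(pairs, threshold: float = 3.5):
    """pairs: iterable of (image_path, caption); keep pairs scoring well on all metrics."""
    for image_path, caption in pairs:
        if all(score_pair(image_path, caption, m) >= threshold for m in METRICS):
            yield image_path, caption
```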

Congrats to the authors for their work!

Paper: Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters (2403.02677)
Code: https://github.com/Victorwz/MLM_Filter
Dataset: weizhiwang/mlm_filter_instructions
Model: weizhiwang/mlm-filter-llava-13b-gpt4v
posted an update 2 months ago
"Multi-LoRA Composition for Image Generation" introduces two new approaches for combining multiple visual elements in text-to-image generation using Low-Rank Adaptations (LoRAs)! 🎨

Key Points:
* Proposes two methods - LoRA Switch and LoRA Composite - that activate/combine LoRAs during the denoising process rather than merging weights
* LoRA Switch cycles through different LoRAs at each step, while LoRA Composite averages guidance from all LoRAs simultaneously
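A framework-free sketch of the two schemes as described above (not the authors' implementation): `denoise_step(latent, t, lora)` runs one denoising step with a single LoRA active, `guidance(latent, t, lora)` returns that LoRA's noise prediction, and `step(latent, t, noise)` applies the scheduler update; all three are hypothetical callables.

```python
def lora_switch(latent, timesteps, loras, denoise_step):
    # Cycle through the LoRAs, activating a different one at each denoising step.
    for i, t in enumerate(timesteps):
        latent = denoise_step(latent, t, loras[i % len(loras)])
    return latent

def lora_composite(latent, timesteps, loras, guidance, step):
    # At every step, average the guidance from all LoRAs, then update once.
    for t in timesteps:
        avg = sum(guidance(latent, t, lora) for lora in loras) / len(loras)
        latent = step(latent, t, avg)
    return latent
```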

Paper: Multi-LoRA Composition for Image Generation (2402.16843)
Project page: https://maszhongming.github.io/Multi-LoRA-Composition

Congrats to the authors for their work!
posted an update 2 months ago
The "Design2Code: How Far Are We From Automating Front-End Engineering" paper presents a benchmark for multimodal large language models (LLMs) aimed at automating front-end web development by translating webpage designs (screenshots) into code. This task evaluates the models' ability to recreate webpages that are visually and structurally similar to the original designs.

Key Points:
* Introduces the Design2Code task and benchmark for converting webpage screenshots into code, aiming to automate front-end web development.
* Evaluates multimodal LLMs using comprehensive metrics for visual similarity and element matching.
* GPT-4V outperforms other models in terms of visual resemblance and content accuracy, with generated webpages often preferred over the original references.

Paper: Design2Code: How Far Are We From Automating Front-End Engineering? (2403.03163)
Project page: https://salt-nlp.github.io/Design2Code/
Dataset: SALT-NLP/Design2Code

Congrats to the authors for their work!
posted an update 2 months ago
VisionLLaMA is a new vision transformer architecture that adapts the successful LLaMA language model design for vision tasks. By integrating components like rotary positional embeddings, SwiGLU activation, and LayerNorm from LLaMA, VisionLLaMA achieves very promising performance across various vision tasks, including image generation, classification, semantic segmentation, and object detection.

Key points:
* Outperforms state-of-the-art vision transformers like DiT, SiT, DeiT3, and Swin on multiple benchmarks and tasks.
* Leverages Auto-Scaled 2D Rotary Positional Embeddings (AS2DRoPE) to handle variable input resolutions efficiently.
* Serves as a powerful, unified modeling framework for vision generation and understanding tasks.

Paper: VisionLLaMA: A Unified LLaMA Interface for Vision Tasks (2403.00522)
GitHub repo: https://github.com/Meituan-AutoML/VisionLLaMA

Congrats to the authors for their work!
posted an update 2 months ago
Panda-70M is a new large-scale video dataset comprising 70 million high-quality video clips, each paired with a textual caption, designed to serve as pre-training data for video understanding tasks.

Key Points:
* Automatic Caption Generation: Utilizes an automatic pipeline with multiple cross-modality teacher models to generate captions for video clips.
* Fine-tuned Caption Selection: Employs a fine-tuned retrieval model to select the most appropriate caption from multiple candidates for each video clip.
* Improved Performance: Pre-training on Panda-70M shows significant performance gains in video captioning, text-video retrieval, and text-driven video generation.
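As a toy illustration of the caption-selection step (not the authors' retrieval model), the snippet below keeps the candidate caption whose embedding is closest to the clip embedding; random vectors stand in for the embeddings a real model would produce.

```python
import numpy as np

def select_caption(clip_emb: np.ndarray, caption_embs: dict[str, np.ndarray]) -> str:
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    # Keep the candidate whose embedding best matches the clip embedding.
    return max(caption_embs, key=lambda cap: cosine(clip_emb, caption_embs[cap]))

rng = np.random.default_rng(0)
candidates = {f"caption {i}": rng.normal(size=128) for i in range(3)}
print(select_caption(rng.normal(size=128), candidates))
```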

Paper: Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers (2402.19479)
Project page: https://snap-research.github.io/Panda-70M/
Code: https://github.com/snap-research/Panda-70M

Congrats to the authors @tschen , @aliaksandr-siarohin et al. for their work!
posted an update 2 months ago
"What Evidence Do Language Models Find Convincing?" is a new paper that explores what types of evidence and argumentation techniques language models find convincing when presented with ambiguous, open-domain questions that have conflicting answers online.

Key points:
* Dataset: It introduces "ConflictingQA," a dataset of controversial questions and real-world evidence paragraphs supporting both "yes" and "no" answers.
* Convincingness Metric: It uses the "paragraph win rate" - when shown two conflicting paragraphs, this measures how often a model predicts the answer that aligns with a given paragraph's stance.
* Current models rely on the relevance of the content to the query, while largely ignoring stylistic features such as whether a text contains scientific references or if it is written with a neutral tone.
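A minimal sketch of how a paragraph win rate could be computed, assuming a hypothetical data format and a `model_answer` callable that returns "yes" or "no" given the question and the two conflicting paragraphs.

```python
def paragraph_win_rate(examples, model_answer) -> float:
    """examples: dicts with 'question', 'paragraph', 'counter_paragraph' and the
    paragraph's 'stance' ('yes' or 'no')."""
    wins = 0
    for ex in examples:
        pred = model_answer(ex["question"], ex["paragraph"], ex["counter_paragraph"])
        wins += int(pred == ex["stance"])   # the model sided with this paragraph
    return wins / len(examples)
```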

Congrats to the authors for their work!

Paper: What Evidence Do Language Models Find Convincing? (2402.11782)
Code: https://github.com/AlexWan0/rag-convincingness
replied to their post 2 months ago

Totally agree. Can't wait to see what comes next.

posted an update 2 months ago
Genie is a new method from Google DeepMind that generates interactive, action-controllable virtual worlds from unlabelled internet videos.

Key points:
* Genie leverages a spatiotemporal video tokenizer, an autoregressive dynamics model, and a latent action model to generate controllable video environments.
* The model is trained on video data alone, without requiring action labels, using unsupervised learning to infer latent actions between frames.
* The method restricts the size of the action vocabulary to 8 to ensure that the number of possible latent actions remains small.
* The training dataset is built by filtering publicly available internet videos with criteria related to 2D platformer games, yielding a total of 6.8M videos.

Paper: Genie: Generative Interactive Environments (2402.15391)
Project page: https://sites.google.com/view/genie-2024/
More detailed overview in my blog: https://huggingface.co/blog/vladbogo/genie-generative-interactive-environments

Congrats to the authors for their work!
replied to their post 3 months ago

Agree! I don't think it's at all feasible to handle these types of problems/attacks at a provider level. So, as you said, I think that new open-source defensive tool chains will emerge. However, I think the paper makes a good step towards showcasing some of the current capabilities and can enable further research both for finding more complex attacks and also mitigations.

posted an update 3 months ago
"A Closer Look at the Limitations of Instruction Tuning" is a new paper that explores the efficacy and limitations of Instruction Tuning (IT) in Large Language Models (LLMs) for conversational agents. The authors conduct a series of experiments using both LoRA fine-tuning (LFT) and standard full-parameter fine-tuning (SFT) across various LLMs and IT datasets.

The key findings are:
* LoRA fine-tuning (LFT) preserves the pre-training token distribution while SFT doesn't. This indicates that with LFT, the fine-tuned model still relies heavily on its pre-training and doesn't acquire much new information.
* Dataset scaling is ineffective for LFT - experiments show that scaling the dataset size 52x or even 326x doesn't improve the performance.
* LoRA fine-tuning mainly enhances response initiation and style without substantial knowledge enhancement.
* Full-parameter fine-tuning tends to degrade LLM knowledge base and increase hallucination occurrences.
* Other popular methods and adjustments fail to significantly outperform simple LoRA fine-tuned models in terms of conversational quality and accuracy.

Congrats to the authors @Sreyan88 and others for their work!

Paper: A Closer Look at the Limitations of Instruction Tuning (2402.05119)
posted an update 3 months ago
"LLM Agents can Autonomously Hack Websites" is a new paper that investigates the capacity of LLMs to autonomously execute cybersecurity attacks on websites, such as SQL injections without human guidance.

Key points:
* It uses an LLM integrated with Playwright, a headless browser automation framework, enabling automated web interactions through function calling.
* It gives the LLM access to 7 web-hacking documents and planning capabilities through specific prompting, without disclosing the exact methods to prevent misuse.

GPT-4 achieves a 73.3% success rate on the tested vulnerabilities, emphasizing the potential cybersecurity risks posed by advanced LLMs. Other open models cannot yet perform these types of attacks (results in screenshot).

Congrats to the authors for their work!

Paper: LLM Agents can Autonomously Hack Websites (2402.06664)
posted an update 3 months ago
VideoPrism is a new video encoder that improves video understanding through a unique training strategy, using a vast dataset (36 million high-quality video-caption pairs and 582 million video clips) for comprehensive learning.

Key points:
* It employs a two-stage training approach, initially aligning video and text encoders, followed by an enhanced video-only masked autoencoding process to learn appearance and motion.
* It achieves superior performance in a wide array of tasks, such as general video understanding, zero-shot video-text retrieval, video captioning, QA, and computer vision for science, reaching top performance on 30 out of 33 benchmarks.

Congrats to the authors for their work!

Paper: VideoPrism: A Foundational Visual Encoder for Video Understanding (2402.13217)
posted an update 3 months ago
Web Rephrase Augmented Pre-training (WRAP) enhances language model training efficiency by transforming documents into structured formats.

Key aspects:
* Utilizes an instruction-tuned model to rephrase web content into styles such as Wikipedia or Q/A, creating a blend of synthetic and real data for training.
* Demonstrates an improvement of more than 10% in perplexity, alongside a more than 2% increase in zero-shot question-answering accuracy.
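A hedged sketch of what a WRAP-style rephrasing pass could look like; `call_llm` is a hypothetical stand-in for the instruction-tuned rephraser, and the prompts are illustrative rather than the ones used in the paper.

```python
STYLES = {
    "wikipedia": "Rewrite the passage below in a concise, encyclopedic Wikipedia style:",
    "qa": "Convert the passage below into question/answer pairs:",
}

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an instruction-tuned model here")

def wrap_corpus(raw_docs, style: str = "wikipedia"):
    """Yield (real, synthetic) pairs; pre-training then mixes both kinds of text."""
    for doc in raw_docs:
        yield doc, call_llm(f"{STYLES[style]}\n\n{doc}")
```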

Congrats to the authors for their work!

Paper: Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling (2401.16380)
posted an update 3 months ago
REALIGN is a new method designed to improve the alignment of Large Language Models (LLMs) with human values by reformatting instruction data. This approach enhances LLM performance across various metrics by aligning responses with predefined criteria and evidence.

Key points:

* REALIGN has three steps: criteria definition, retrieval augmentation, and response reformatting
* It rewrites (query, response) pairs to enhance data quality for fine-tuning LLMs.
* It has shown significant improvements in general alignment, math reasoning, and other tasks.
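The snippet below is a rough sketch of the three steps as summarized here, with hypothetical `call_llm` and `retrieve_evidence` helpers and made-up criteria strings; the actual method and prompts are in the GAIR-NLP repository.

```python
CRITERIA = {
    "math": "Show the reasoning step by step and state the final answer last.",
    "general": "Answer clearly, cite supporting evidence, and keep a helpful tone.",
}

def call_llm(prompt: str) -> str:
    raise NotImplementedError

def retrieve_evidence(query: str) -> str:
    raise NotImplementedError("e.g. a web or Wikipedia search")

def realign_pair(query: str, response: str, task: str = "general"):
    evidence = retrieve_evidence(query)                    # retrieval augmentation
    rewritten = call_llm(                                  # response reformatting
        f"Criteria: {CRITERIA[task]}\nEvidence: {evidence}\n"
        f"Query: {query}\nOriginal answer: {response}\n"
        "Rewrite the answer so it satisfies the criteria and uses the evidence."
    )
    return query, rewritten
```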

Congrats to the authors for their work!

Paper: Reformatted Alignment (2402.12219)
Code: https://github.com/GAIR-NLP/ReAlign
replied to their post 3 months ago

As far as I could find, it's not yet available. Hopefully the authors will release it soon 🤞

posted an update 3 months ago
Spectral DeTuning is a new method that successfully recovers the original weights of generative models before they were fine-tuned with human feedback or other customization. It shows that pre-fine-tuning weights are recoverable and that models fine-tuned with LoRA can be susceptible to a new type of weight-recovery attack.

Key aspects of the paper:
* It introduces Spectral DeTuning for reversing Low-rank Adaptation (LoRA) fine-tuning, targeting original weight restoration through spectral analysis.
* LoWRA Bench Dataset: It introduces a dataset for testing Spectral DeTuning across various models and tasks, covering a large number of layers for comprehensive evaluation.
* It reveals the vulnerability of LoRA fine-tuned models to weight-recovery attacks, questioning the security of fine-tuning modifications.
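A rough, hedged sketch of the underlying idea as I read it (not the authors' exact algorithm): several LoRA fine-tuned copies of the same layer share one pre-fine-tuning matrix plus a different low-rank update each, so the shared matrix can be estimated by alternately guessing it and stripping the best rank-r component from each residual via SVD.

```python
import numpy as np

def low_rank(residual: np.ndarray, r: int) -> np.ndarray:
    # Best rank-r approximation of the residual via truncated SVD.
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    return (u[:, :r] * s[:r]) @ vt[:r]

def recover_pretrained(finetuned: list[np.ndarray], r: int, iters: int = 50) -> np.ndarray:
    w_est = np.mean(finetuned, axis=0)       # initial guess for the shared weights
    for _ in range(iters):
        # Strip each model's estimated low-rank LoRA update, then re-average.
        cleaned = [w - low_rank(w - w_est, r) for w in finetuned]
        w_est = np.mean(cleaned, axis=0)
    return w_est

# Toy usage: five LoRA-style copies of a random "original" matrix.
rng = np.random.default_rng(0)
w0 = rng.normal(size=(32, 32))
copies = [w0 + rng.normal(size=(32, 4)) @ rng.normal(size=(4, 32)) for _ in range(5)]
print(np.abs(recover_pretrained(copies, r=4) - w0).max())   # remaining estimation error
```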

Congrats to the authors for their work!

Paper: Recovering the Pre-Fine-Tuning Weights of Generative Models (2402.10208)
Dataset: Eliahu/LoWRA-Bench
Project page: https://vision.huji.ac.il/spectral_detuning/
Code: https://github.com/eliahuhorwitz/Spectral-DeTuning
replied to their post 3 months ago

Thanks for trying it out! I looked at the logs and it seems that it hangs while fetching web search results. We've been getting these types of errors from time to time, so I temporarily disabled web search until I can find a better fix. It should be faster now (still around 1 minute), so hopefully you'll get some results. This means that it currently relies on Wikipedia only to find evidence for the identified claims.

There were also some GPT-4 processing errors, so if you still encounter errors, please send me the paragraph that you used so I can debug further. If you prefer, you can also email it to me at vlad@filtir.com.

Thanks again!

posted an update 3 months ago
Happy to share that Filtir, our AI fact-checking pipeline, is now also available as a Hugging Face Space. You can give it a try at: vladbogo/Filtir.

Feedback is appreciated!
posted an update 3 months ago
A new paper from Google DeepMind explores the effect of premise ordering on large language models (LLMs) in reasoning tasks. Despite the logical principle that the sequence of premises should not influence the conclusion's validity, the study finds LLMs' performance varies with different premise arrangements. Here's a summary:

The research investigates how the order of premises affects LLMs in logical and mathematical reasoning tasks, challenging the assumption that premise sequence is irrelevant to the outcome.

Key Findings:
* Logical Reasoning: LLMs perform best when premises are in a forward order that aligns with the proof's progression. Deviations from this order result in significant performance drops.
* Mathematical Reasoning: The introduction of the R-GSM benchmark shows a similar sensitivity in LLMs.

Congrats to the authors for their work!

Paper: Premise Order Matters in Reasoning with Large Language Models (2402.08939).
posted an update 3 months ago
Meta Reality Labs has developed Lumos, a system that merges Multimodal Large Language Models (MM-LLMs) with Scene Text Recognition (STR) to boost the efficiency of various tasks such as multimodal question-answering and text summarization.

Key aspects of Lumos include:

* Hybrid Computing: Utilizes a combination of on-device and cloud computing to process inputs, aiming to reduce latency.
* STR Components:
  * Region of Interest (ROI) Detection: Focuses on text-rich areas within images for optimized text extraction.
  * Text Detection and Recognition: Ensures high-quality text recognition within the ROI.
  * Reading Order Reconstruction: Arranges recognized text to mimic natural reading order, essential for context understanding.
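As a small illustration of the reading-order step (a simplification, not Lumos' actual component), the sketch below sorts recognized words top-to-bottom and then left-to-right within each line.

```python
def reading_order(words, line_tol: float = 10.0):
    """words: list of (text, x, y) tuples, with y growing downward."""
    lines: list[list[tuple[str, float, float]]] = []
    for word in sorted(words, key=lambda w: w[2]):           # sort by y first
        if lines and abs(lines[-1][0][2] - word[2]) <= line_tol:
            lines[-1].append(word)                           # same visual line
        else:
            lines.append([word])
    return [w[0] for line in lines for w in sorted(line, key=lambda w: w[1])]

print(reading_order([("world", 60, 12), ("hello", 10, 10), ("next", 10, 40)]))
# -> ['hello', 'world', 'next']
```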

Lumos demonstrates significant improvement with 80% accuracy in question-answering benchmarks and a low word error rate.

Paper: Lumos : Empowering Multimodal LLMs with Scene Text Recognition (2402.08017)

Congrats to the authors for their work!
posted an update 3 months ago
OS-Copilot is a new framework for creating computer agents such as FRIDAY. This framework enables agents to interact seamlessly with your operating system, handling tasks like file management, multimedia editing, and more.

The system has three components:
* Planner: It takes complex user requests and breaks them down into manageable subtasks for efficient execution.
* Configurator: It prepares tasks for execution based on your preferences and available commands using a memory mechanism.
* Actor: It executes the tasks and learns from feedback, ensuring continuous improvement.
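A toy sketch of how such a planner / configurator / actor split could be wired together; `llm`, `tools`, and the memory dict are hypothetical interfaces, not the OS-Copilot code.

```python
def run_agent(request: str, llm, tools: dict, memory: dict) -> dict:
    # Planner: break the user request into ordered subtasks.
    subtasks = llm(f"Break this request into ordered subtasks, one per line:\n{request}")
    for task in filter(None, (t.strip() for t in subtasks.splitlines())):
        # Configurator: pick a tool and its arguments, using any stored feedback.
        plan = llm(
            f"Task: {task}\nAvailable tools: {list(tools)}\n"
            f"Relevant memory: {memory.get(task, 'none')}\n"
            "Reply as '<tool>: <arguments>'."
        )
        tool_name, _, args = plan.partition(":")
        # Actor: execute the tool and keep the outcome as feedback for later steps.
        memory[task] = tools[tool_name.strip()](args.strip())
    return memory
```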

FRIDAY outperforms other methods on GAIA, a comprehensive benchmark. To answer GAIA's questions, agents need to calculate numbers, browse the web, process video and speech signals, and more.

Resources:
* Paper: OS-Copilot: Towards Generalist Computer Agents with Self-Improvement (2402.07456)
* Project GitHub: https://github.com/OS-Copilot/FRIDAY
* Project page: https://os-copilot.github.io/

Congrats to the authors, Zhiyong Wu et al., for their work!