AI Paper of the Day - a vladbogo Collection

SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity

Paper • 2401.17072 • Published Jan 30 • 25

TrustLLM: Trustworthiness in Large Language Models

Paper • 2401.05561 • Published Jan 10 • 66

Lumiere: A Space-Time Diffusion Model for Video Generation

Paper • 2401.12945 • Published Jan 23 • 86

PALP: Prompt Aligned Personalization of Text-to-Image Models

Paper • 2401.06105 • Published Jan 11 • 47

Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

Paper • 2401.10891 • Published Jan 19 • 60

More Agents Is All You Need

Paper • 2402.05120 • Published Feb 3 • 51

Tag-LLM: Repurposing General-Purpose LLMs for Specialized Domains

Paper • 2402.05140 • Published Feb 6 • 20

In-Context Principle Learning from Mistakes

Paper • 2402.05403 • Published Feb 8 • 14

Self-Discover: Large Language Models Self-Compose Reasoning Structures

Paper • 2402.03620 • Published Feb 6 • 113

OS-Copilot: Towards Generalist Computer Agents with Self-Improvement

Paper • 2402.07456 • Published Feb 12 • 41

Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs

Paper • 2311.05657 • Published Nov 9, 2023 • 27

Premise Order Matters in Reasoning with Large Language Models

Paper • 2402.08939 • Published Feb 14 • 27

Chain-of-Thought Reasoning Without Prompting

Paper • 2402.10200 • Published Feb 15 • 104

World Model on Million-Length Video And Language With RingAttention

Paper • 2402.08268 • Published Feb 13 • 37

How to Train Data-Efficient LLMs

Paper • 2402.09668 • Published Feb 15 • 40

Reformatted Alignment

Paper • 2402.12219 • Published Feb 19 • 16

Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling

Paper • 2401.16380 • Published Jan 29 • 48

LLM Agents can Autonomously Hack Websites

Paper • 2402.06664 • Published Feb 6 • 3

VideoPrism: A Foundational Visual Encoder for Video Understanding

Paper • 2402.13217 • Published Feb 20 • 23

Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

Paper • 2402.13064 • Published Feb 20 • 47

Genie: Generative Interactive Environments

Paper • 2402.15391 • Published Feb 23 • 70

When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method

Paper • 2402.17193 • Published Feb 27 • 23

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

Paper • 2402.17764 • Published Feb 27 • 605

Instruction-tuned Language Models are Better Knowledge Learners

Paper • 2402.12847 • Published Feb 20 • 25

VisionLLaMA: A Unified LLaMA Interface for Vision Tasks

Paper • 2403.00522 • Published Mar 1 • 44

AtomoVideo: High Fidelity Image-to-Video Generation

Paper • 2403.01800 • Published Mar 4 • 20

Design2Code: How Far Are We From Automating Front-End Engineering?

Paper • 2403.03163 • Published Mar 5 • 93

Recovering the Pre-Fine-Tuning Weights of Generative Models

Paper • 2402.10208 • Published Feb 15 • 7

A Closer Look at the Limitations of Instruction Tuning

Paper • 2402.05119 • Published Feb 3 • 5

Multi-LoRA Composition for Image Generation

Paper • 2402.16843 • Published Feb 26 • 28

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

Paper • 2402.19479 • Published Feb 29 • 32

Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters

Paper • 2403.02677 • Published Mar 5 • 16

VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models

Paper • 2403.05438 • Published Mar 8 • 18

Stealing Part of a Production Language Model

Paper • 2403.06634 • Published Mar 11 • 90

Synth^2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings

Paper • 2403.07750 • Published Mar 12 • 21

ShortGPT: Layers in Large Language Models are More Redundant Than You Expect

Paper • 2403.03853 • Published Mar 6 • 61

Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts

Paper • 2403.08268 • Published Mar 13 • 15

GiT: Towards Generalist Vision Transformer through Universal Language Interface

Paper • 2403.09394 • Published Mar 14 • 25

Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress

Paper • 2402.19472 • Published Feb 29 • 2

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Paper • 2403.09611 • Published Mar 14 • 125

Simple and Scalable Strategies to Continually Pre-train Large Language Models

Paper • 2403.08763 • Published Mar 13 • 49

Enhancing Vision-Language Pre-training with Rich Supervisions

Paper • 2403.03346 • Published Mar 5 • 14

VideoMamba: State Space Model for Efficient Video Understanding

Paper • 2403.06977 • Published Mar 11 • 27

Gemma: Open Models Based on Gemini Research and Technology

Paper • 2403.08295 • Published Mar 13 • 47

On the Societal Impact of Open Foundation Models

Paper • 2403.07918 • Published Feb 27 • 16

Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models

Paper • 2403.16999 • Published Mar 25 • 4

Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos

Paper • 2403.13044 • Published Mar 19 • 15

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding

Paper • 2403.15377 • Published Mar 22 • 22

AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks

Paper • 2403.14468 • Published Mar 21 • 23

Long-form factuality in large language models

Paper • 2403.18802 • Published Mar 27 • 24

Motion Mamba: Efficient and Long Sequence Motion Generation with Hierarchical and Bidirectional Selective SSM

Paper • 2403.07487 • Published Mar 12 • 13

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Paper • 2403.05530 • Published Mar 8 • 61

Gecko: Versatile Text Embeddings Distilled from Large Language Models

Paper • 2403.20327 • Published Mar 29 • 47

Mixture-of-Depths: Dynamically allocating compute in transformer-based language models

Paper • 2404.02258 • Published Apr 2 • 104

ReALM: Reference Resolution As Language Modeling

Paper • 2403.20329 • Published Mar 29 • 21

SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing

Paper • 2404.05717 • Published Apr 8 • 24

No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance

Paper • 2404.04125 • Published Apr 4 • 27

ReFT: Representation Finetuning for Language Models

Paper • 2404.03592 • Published Apr 4 • 91

RULER: What's the Real Context Size of Your Long-Context Language Models?

Paper • 2404.06654 • Published Apr 9 • 34

Rho-1: Not All Tokens Are What You Need

Paper • 2404.07965 • Published Apr 11 • 87

CodecLM: Aligning Language Models with Tailored Synthetic Data

Paper • 2404.05875 • Published Apr 8 • 16

Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length

Paper • 2404.08801 • Published Apr 12 • 65

Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models

Paper • 2404.12387 • Published Apr 18 • 38

VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time

Paper • 2404.10667 • Published Apr 16 • 18

Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing

Paper • 2404.12253 • Published Apr 18 • 54

Video2Game: Real-time, Interactive, Realistic and Browser-Compatible Environment from a Single Video

Paper • 2404.09833 • Published Apr 15 • 29

HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing

Paper • 2404.09990 • Published Apr 15 • 12

Many-Shot In-Context Learning

Paper • 2404.11018 • Published Apr 17 • 4

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Paper • 2404.14219 • Published Apr 22 • 253

OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework

Paper • 2404.14619 • Published Apr 22 • 126

Make Your LLM Fully Utilize the Context

Paper • 2404.16811 • Published Apr 25 • 53

AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs

Paper • 2404.16873 • Published Apr 21 • 28

Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

Paper • 2404.18796 • Published Apr 29 • 68

Extending Llama-3's Context Ten-Fold Overnight

Paper • 2404.19553 • Published Apr 30 • 33

Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

Paper • 2405.01535 • Published May 2 • 119

FLAME: Factuality-Aware Alignment for Large Language Models

Paper • 2405.01525 • Published May 2 • 24

Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory

Paper • 2310.17884 • Published Oct 27, 2023 • 1

Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video

Paper • 2310.08584 • Published Oct 12, 2023 • 2

Better & Faster Large Language Models via Multi-token Prediction

Paper • 2404.19737 • Published Apr 30 • 73

Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation

Paper • 2404.19752 • Published Apr 30 • 22

Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?

Paper • 2405.05904 • Published May 9 • 6

A Careful Examination of Large Language Model Performance on Grade School Arithmetic

Paper • 2405.00332 • Published May 1 • 30

Iterative Reasoning Preference Optimization

Paper • 2404.19733 • Published Apr 30 • 47

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Paper • 2405.09818 • Published May 16 • 126

CLIP with Quality Captions: A Strong Pretraining for Vision Tasks

Paper • 2405.08911 • Published May 14 • 1

Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity

Paper • 2403.14403 • Published Mar 21 • 6

LoRA Learns Less and Forgets Less

Paper • 2405.09673 • Published May 15 • 87

FIFO-Diffusion: Generating Infinite Videos from Text without Training

Paper • 2405.11473 • Published May 19 • 53

MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning

Paper • 2405.12130 • Published May 20 • 46

RAFT: Adapting Language Model to Domain Specific RAG

Paper • 2403.10131 • Published Mar 15 • 67

LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report

Paper • 2405.00732 • Published Apr 29 • 118

Aya 23: Open Weight Releases to Further Multilingual Progress

Paper • 2405.15032 • Published May 23 • 27

An Introduction to Vision-Language Modeling

Paper • 2405.17247 • Published May 27 • 87

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Paper • 2405.21075 • Published May 31 • 20

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

Paper • 2406.04325 • Published Jun 6 • 72

Are We Done with MMLU?

Paper • 2406.04127 • Published Jun 6 • 37

Parrot: Multilingual Visual Instruction Tuning

Paper • 2406.02539 • Published Jun 4 • 35

Verbalized Machine Learning: Revisiting Machine Learning with Language Models

Paper • 2406.04344 • Published Jun 6 • 1

WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

Paper • 2406.04770 • Published Jun 7 • 27

McEval: Massively Multilingual Code Evaluation

Paper • 2406.07436 • Published Jun 11 • 39

Depth Anything V2

Paper • 2406.09414 • Published Jun 13 • 95

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Paper • 2406.06525 • Published Jun 10 • 65

Needle In A Multimodal Haystack

Paper • 2406.07230 • Published Jun 11 • 53

Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models

Paper • 2406.11230 • Published Jun 17 • 33

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

Paper • 2406.11931 • Published Jun 17 • 58

Adversarial Attacks on Multimodal Agents

Paper • 2406.12814 • Published Jun 18 • 4

Instruction Pre-Training: Language Models are Supervised Multitask Learners

Paper • 2406.14491 • Published Jun 20 • 86

Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs

Paper • 2406.14544 • Published Jun 20 • 34

Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching

Paper • 2406.06326 • Published Jun 10 • 2

LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs

Paper • 2406.15319 • Published Jun 21 • 62

Unlocking Continual Learning Abilities in Language Models

Paper • 2406.17245 • Published Jun 25 • 28

Octo-planner: On-device Language Model for Planner-Action Agents

Paper • 2406.18082 • Published Jun 26 • 47

Scalable MatMul-free Language Modeling

Paper • 2406.02528 • Published Jun 4 • 11

Fantastic Copyrighted Beasts and How (Not) to Generate Them

Paper • 2406.14526 • Published Jun 20 • 1

Chain-of-Knowledge: Integrating Knowledge Reasoning into Large Language Models by Learning from Knowledge Graphs

Paper • 2407.00653 • Published Jun 30 • 11

Aligning Teacher with Student Preferences for Tailored Training Data Generation

Paper • 2406.19227 • Published Jun 27 • 24

PaliGemma: A versatile 3B VLM for transfer

Paper • 2407.07726 • Published Jul 10 • 68

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Paper • 2407.07895 • Published Jul 10 • 40

AgentInstruct: Toward Generative Teaching with Agentic Flows

Paper • 2407.03502 • Published Jul 3 • 50

MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?

Paper • 2407.04842 • Published Jul 5 • 52

AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

Paper • 2407.12784 • Published Jul 17 • 48

Learning to Refuse: Towards Mitigating Privacy Risks in LLMs

Paper • 2407.10058 • Published Jul 14 • 29

NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?

Paper • 2407.11963 • Published Jul 16 • 43

OpenDevin: An Open Platform for AI Software Developers as Generalist Agents

Paper • 2407.16741 • Published Jul 23 • 68

AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

Paper • 2402.12226 • Published Feb 19 • 41

NExT-GPT: Any-to-Any Multimodal LLM

Paper • 2309.05519 • Published Sep 11, 2023 • 79

The Llama 3 Herd of Models

Paper • 2407.21783 • Published Jul 31 • 110

Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle

Paper • 2407.13833 • Published Jul 18 • 11

Recursive Introspection: Teaching Language Model Agents How to Self-Improve

Paper • 2407.18219 • Published Jul 25 • 3

SAM 2: Segment Anything in Images and Videos

Paper • 2408.00714 • Published Aug 1 • 109

Gemma 2: Improving Open Language Models at a Practical Size

Paper • 2408.00118 • Published Jul 31 • 75

Apple Intelligence Foundation Language Models

Paper • 2407.21075 • Published Jul 29 • 3

Self-Taught Evaluators

Paper • 2408.02666 • Published Aug 5 • 27

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

Paper • 2408.02718 • Published Aug 5 • 60

RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation

Paper • 2408.02545 • Published Aug 5 • 35

LLaVA-OneVision: Easy Visual Task Transfer

Paper • 2408.03326 • Published Aug 6 • 59

Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models

Paper • 2305.04091 • Published May 6, 2023 • 2

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Paper • 2402.04249 • Published Feb 6 • 4

Can AI Assistants Know What They Don't Know?

Paper • 2401.13275 • Published Jan 24 • 1

Towards Modular LLMs by Building and Reusing a Library of LoRAs

Paper • 2405.11157 • Published May 18 • 27

Prompt Sketching for Large Language Models

Paper • 2311.04954 • Published Nov 8, 2023 • 2

FairProof : Confidential and Certifiable Fairness for Neural Networks

Paper • 2402.12572 • Published Feb 19 • 1

Tell, Don't Show!: Language Guidance Eases Transfer Across Domains in Images and Videos

Paper • 2403.05535 • Published Mar 8 • 1

Surgical SAM 2: Real-time Segment Anything in Surgical Video by Efficient Frame Pruning

Paper • 2408.07931 • Published Aug 15 • 19

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Paper • 2408.10188 • Published Aug 19 • 51

UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling

Paper • 2408.04810 • Published Aug 9 • 22

LLM Pruning and Distillation in Practice: The Minitron Approach

Paper • 2408.11796 • Published Aug 21 • 57

TrackGo: A Flexible and Efficient Method for Controllable Video Generation

Paper • 2408.11475 • Published Aug 21 • 17

To Code, or Not To Code? Exploring Impact of Code in Pre-training

Paper • 2408.10914 • Published Aug 20 • 41

MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning

Paper • 2408.11001 • Published Aug 20 • 11

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

Paper • 2406.12624 • Published Jun 18 • 36

Diffusion Models Are Real-Time Game Engines

Paper • 2408.14837 • Published Aug 27 • 121

LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation

Paper • 2408.15881 • Published Aug 28 • 21

Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

Paper • 2408.15998 • Published Aug 28 • 84

Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models

Paper • 2408.02442 • Published Aug 5 • 21

CogVLM2: Visual Language Models for Image and Video Understanding

Paper • 2408.16500 • Published Aug 29 • 56

UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios

Paper • 2408.17267 • Published Aug 30 • 23

OLMoE: Open Mixture-of-Experts Language Models

Paper • 2409.02060 • Published Sep 3 • 77

Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing

Paper • 2409.01322 • Published Sep 2 • 94

Pron vs Prompt: Can Large Language Models already Challenge a World-Class Fiction Author at Creative Text Writing?

Paper • 2407.01119 • Published Jul 1 • 1

Paper Copilot: A Self-Evolving and Efficient LLM System for Personalized Academic Assistance

Paper • 2409.04593 • Published Sep 6 • 23

SongCreator: Lyrics-based Universal Song Generation

Paper • 2409.06029 • Published Sep 9 • 21

Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers

Paper • 2409.04109 • Published Sep 6 • 43

LLaMA-Omni: Seamless Speech Interaction with Large Language Models

Paper • 2409.06666 • Published Sep 10 • 55

UI-JEPA: Towards Active Perception of User Intent through Onscreen User Activity

Paper • 2409.04081 • Published Sep 6 • 3

InstantDrag: Improving Interactivity in Drag-based Image Editing

Paper • 2409.08857 • Published Sep 13 • 31

NVLM: Open Frontier-Class Multimodal LLMs

Paper • 2409.11402 • Published Sep 17 • 72

Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement

Paper • 2409.11378 • Published Sep 17 • 1

Training Language Models to Self-Correct via Reinforcement Learning

Paper • 2409.12917 • Published Sep 19 • 135

MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts

Paper • 2407.21770 • Published Jul 31 • 22

LLMs Will Always Hallucinate, and We Need to Live With This

Paper • 2409.05746 • Published Sep 9 • 3

Imagine yourself: Tuning-Free Personalized Image Generation

Paper • 2409.13346 • Published Sep 20 • 68

Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation

Paper • 2409.12941 • Published Sep 19 • 23

To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

Paper • 2409.12183 • Published Sep 18 • 36

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Paper • 2409.17146 • Published Sep 25 • 104

Emu3: Next-Token Prediction is All You Need

Paper • 2409.18869 • Published Sep 27 • 93

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Paper • 2409.20566 • Published Sep 30 • 53

TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices

Paper • 2410.00531 • Published Oct 1 • 29

Loong: Generating Minute-level Long Videos with Autoregressive Language Models

Paper • 2410.02757 • Published Oct 3 • 36

LLaVA-Critic: Learning to Evaluate Multimodal Models

Paper • 2410.02712 • Published Oct 3 • 35

LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations

Paper • 2410.02707 • Published Oct 3 • 47

VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide

Paper • 2410.04364 • Published Oct 6 • 28

Pixtral 12B

Paper • 2410.07073 • Published Oct 9 • 62

Aria: An Open Multimodal Native Mixture-of-Experts Model

Paper • 2410.05993 • Published Oct 8 • 107

MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents

Paper • 2410.03450 • Published Oct 4 • 36

Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG

Paper • 2410.05983 • Published Oct 8 • 1

From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning

Paper • 2410.06456 • Published Oct 9 • 35

LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models

Paper • 2410.09732 • Published Oct 13 • 54

VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

Paper • 2410.10594 • Published Oct 14 • 24

Agent-as-a-Judge: Evaluate Agents with Agents

Paper • 2410.10934 • Published Oct 14 • 18

Movie Gen: A Cast of Media Foundation Models

Paper • 2410.13720 • Published Oct 17 • 89

Trust but Verify: Programmatic VLM Evaluation in the Wild

Paper • 2410.13121 • Published Oct 17 • 2

WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines

Paper • 2410.12705 • Published Oct 16 • 29

Emergent properties with repeated examples

Paper • 2410.07041 • Published Oct 9 • 8

VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models

Paper • 2410.12851 • Published Oct 10 • 1

SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree

Paper • 2410.16268 • Published Oct 21 • 65

FrugalNeRF: Fast Convergence for Few-shot Novel View Synthesis without Learned Priors

Paper • 2410.16271 • Published Oct 21 • 80

OmniParser for Pure Vision Based GUI Agent

Paper • 2408.00203 • Published Aug 1 • 24

Can Knowledge Editing Really Correct Hallucinations?

Paper • 2410.16251 • Published Oct 21 • 54

A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs

Paper • 2410.18779 • Published Oct 24 • 1

Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback

Paper • 2410.19133 • Published Oct 24 • 11

LongReward: Improving Long-context Large Language Models with AI Feedback

Paper • 2410.21252 • Published Oct 28 • 17

EMMA: End-to-End Multimodal Model for Autonomous Driving

Paper • 2410.23262 • Published Oct 30 • 2

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Paper • 2410.17434 • Published Oct 22 • 25

Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders

Paper • 2410.22366 • Published Oct 28 • 77

Language Models can Self-Lengthen to Generate Long Texts

Paper • 2410.23933 • Published Oct 31 • 17

SelfCodeAlign: Self-Alignment for Code Generation

Paper • 2410.24198 • Published Oct 31 • 23

Face Anonymization Made Simple

Paper • 2411.00762 • Published Nov 1 • 7

From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond

Paper • 2411.03590 • Published Nov 6 • 9

OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models

Paper • 2411.04905 • Published Nov 7 • 111

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

Paper • 2411.04996 • Published Nov 7 • 49

TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation

Paper • 2411.04709 • Published Nov 5 • 25

mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding

Paper • 2409.03420 • Published Sep 5 • 26

Qwen2.5-Coder Technical Report

Paper • 2409.12186 • Published Sep 18 • 138

FineTuneBench: How well do commercial fine-tuning APIs infuse knowledge into LLMs?

Paper • 2411.05059 • Published Nov 7 • 1

Stronger Models are NOT Stronger Teachers for Instruction Tuning

Paper • 2411.07133 • Published Nov 11 • 34

Cut Your Losses in Large-Vocabulary Language Models

Paper • 2411.09009 • Published Nov 13 • 43

Rapid Response: Mitigating LLM Jailbreaks with a Few Examples

Paper • 2411.07494 • Published Nov 12 • 1

Generative World Explorer

Paper • 2411.11844 • Published Nov 18 • 75

RedPajama: an Open Dataset for Training Large Language Models

Paper • 2411.12372 • Published Nov 19 • 47

AnimateAnything: Consistent and Controllable Animation for Video Generation

Paper • 2411.10836 • Published Nov 16 • 23

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

Paper • 2411.10440 • Published Nov 15 • 111

Multimodal Autoregressive Pre-training of Large Vision Encoders

Paper • 2411.14402 • Published Nov 21 • 43

TÜLU 3: Pushing Frontiers in Open Language Model Post-Training

Paper • 2411.15124 • Published Nov 22 • 56

WildLMa: Long Horizon Loco-Manipulation in the Wild

Paper • 2411.15131 • Published Nov 22 • 6

From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge

Paper • 2411.16594 • Published Nov 25 • 36

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

Paper • 2411.17465 • Published Nov 26 • 76

The Super Weight in Large Language Models

Paper • 2411.07191 • Published Nov 11 • 4

Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents

Paper • 2411.16740 • Published Nov 23 • 2

CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models

Paper • 2411.18613 • Published Nov 27 • 50

Reverse Thinking Makes LLMs Stronger Reasoners

Paper • 2411.19865 • Published Nov 29 • 19

MALT: Improving Reasoning with Multi-Agent LLM Training

Paper • 2412.01928 • Published 28 days ago • 39

PaliGemma 2: A Family of Versatile VLMs for Transfer

Paper • 2412.03555 • Published 26 days ago • 119

Evaluating Language Models as Synthetic Data Generators

Paper • 2412.03679 • Published 26 days ago • 43

Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation

Paper • 2412.06781 • Published 21 days ago • 18

Hidden in the Noise: Two-Stage Robust Watermarking for Images

Paper • 2412.04653 • Published 25 days ago • 28

Learning Flow Fields in Attention for Controllable Person Image Generation

Paper • 2412.08486 • Published 20 days ago • 32

Phi-4 Technical Report

Paper • 2412.08905 • Published 19 days ago • 93

LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations

Paper • 2412.08580 • Published 19 days ago • 45

ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities

Paper • 2412.06745 • Published 21 days ago • 6

Byte Latent Transformer: Patches Scale Better Than Tokens

Paper • 2412.09871 • Published 18 days ago • 79

BrushEdit: All-In-One Image Inpainting and Editing

Paper • 2412.10316 • Published 17 days ago • 33

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Paper • 2412.14161 • Published 12 days ago • 45

Qwen2.5 Technical Report

Paper • 2412.15115 • Published 11 days ago • 331

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Paper • 2412.14171 • Published 12 days ago • 23

Alignment faking in large language models

Paper • 2412.14093 • Published 12 days ago • 7

TRecViT: A Recurrent Video Transformer

Paper • 2412.14294 • Published 12 days ago • 12

LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

Paper • 2412.15204 • Published 11 days ago • 31

Re-assessing ImageNet: How aligned is its single-label assumption with its multi-label nature?

Paper • 2412.18409 • Published 7 days ago • 1

Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search

Paper • 2412.18319 • Published 7 days ago • 27

Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models

Paper • 2412.18609 • Published 6 days ago • 10

DepthLab: From Partial to Complete

Paper • 2412.18153 • Published 7 days ago • 30

Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models

Paper • 2412.18605 • Published 6 days ago • 12