new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Jun 12

Coordinate Attention for Efficient Mobile Network Design

Recent studies on mobile network design have demonstrated the remarkable effectiveness of channel attention (e.g., the Squeeze-and-Excitation attention) for lifting model performance, but they generally neglect the positional information, which is important for generating spatially selective attention maps. In this paper, we propose a novel attention mechanism for mobile networks by embedding positional information into channel attention, which we call "coordinate attention". Unlike channel attention that transforms a feature tensor to a single feature vector via 2D global pooling, the coordinate attention factorizes channel attention into two 1D feature encoding processes that aggregate features along the two spatial directions, respectively. In this way, long-range dependencies can be captured along one spatial direction and meanwhile precise positional information can be preserved along the other spatial direction. The resulting feature maps are then encoded separately into a pair of direction-aware and position-sensitive attention maps that can be complementarily applied to the input feature map to augment the representations of the objects of interest. Our coordinate attention is simple and can be flexibly plugged into classic mobile networks, such as MobileNetV2, MobileNeXt, and EfficientNet with nearly no computational overhead. Extensive experiments demonstrate that our coordinate attention is not only beneficial to ImageNet classification but more interestingly, behaves better in down-stream tasks, such as object detection and semantic segmentation. Code is available at https://github.com/Andrew-Qibin/CoordAttention.

  • 3 authors
·
Mar 3, 2021

PARE-Net: Position-Aware Rotation-Equivariant Networks for Robust Point Cloud Registration

Learning rotation-invariant distinctive features is a fundamental requirement for point cloud registration. Existing methods often use rotation-sensitive networks to extract features, while employing rotation augmentation to learn an approximate invariant mapping rudely. This makes networks fragile to rotations, overweight, and hinders the distinctiveness of features. To tackle these problems, we propose a novel position-aware rotation-equivariant network, for efficient, light-weighted, and robust registration. The network can provide a strong model inductive bias to learn rotation-equivariant/invariant features, thus addressing the aforementioned limitations. To further improve the distinctiveness of descriptors, we propose a position-aware convolution, which can better learn spatial information of local structures. Moreover, we also propose a feature-based hypothesis proposer. It leverages rotation-equivariant features that encode fine-grained structure orientations to generate reliable model hypotheses. Each correspondence can generate a hypothesis, thus it is more efficient than classic estimators that require multiple reliable correspondences. Accordingly, a contrastive rotation loss is presented to enhance the robustness of rotation-equivariant features against data degradation. Extensive experiments on indoor and outdoor datasets demonstrate that our method significantly outperforms the SOTA methods in terms of registration recall while being lightweight and keeping a fast speed. Moreover, experiments on rotated datasets demonstrate its robustness against rotation variations. Code is available at https://github.com/yaorz97/PARENet.

  • 5 authors
·
Jul 14, 2024

Context-aware Rotary Position Embedding

Positional encoding is a vital component of Transformer architectures, enabling models to incorporate sequence order into self-attention mechanisms. Rotary Positional Embeddings (RoPE) have become a widely adopted solution due to their compatibility with relative position encoding and computational efficiency. However, RoPE relies on static, input-independent sinusoidal frequency patterns, limiting its ability to model context-sensitive relationships. In this work, we propose CARoPE (Context-Aware Rotary Positional Embedding), a novel generalization of RoPE that dynamically generates head-specific frequency patterns conditioned on token embeddings. This design introduces token- and context-sensitive positional representations while preserving RoPE efficiency and architectural simplicity. CARoPE computes input-dependent phase shifts using a bounded transformation of token embeddings and integrates them into the rotary mechanism across attention heads. We evaluate CARoPE on the FineWeb-Edu-10B dataset using GPT-2 variants trained on next-token prediction tasks. Experimental results show that CARoPE consistently outperforms RoPE and other common positional encoding baselines, achieving significantly lower perplexity, even at longer context lengths. Additionally, CARoPE enables faster training throughput without sacrificing model stability. These findings demonstrate that CARoPE offers a scalable, expressive, and efficient upgrade to existing positional encoding strategies in Transformer models.

  • 3 authors
·
Jul 30, 2025

Position Auctions in AI-Generated Content

We consider an extension to the classic position auctions in which sponsored creatives can be added within AI generated content rather than shown in predefined slots. New challenges arise from the natural requirement that sponsored creatives should smoothly fit into the context. With the help of advanced LLM technologies, it becomes viable to accurately estimate the benefits of adding each individual sponsored creatives into each potential positions within the AI generated content by properly taking the context into account. Therefore, we assume one click-through rate estimation for each position-creative pair, rather than one uniform estimation for each sponsored creative across all positions in classic settings. As a result, the underlying optimization becomes a general matching problem, thus the substitution effects should be treated more carefully compared to standard position auction settings, where the slots are independent with each other. In this work, we formalize a concrete mathematical model of the extended position auction problem and study the welfare-maximization and revenue-maximization mechanism design problem. Formally, we consider two different user behavior models and solve the mechanism design problems therein respectively. For the Multinomial Logit (MNL) model, which is order-insensitive, we can efficiently implement the optimal mechanisms. For the cascade model, which is order-sensitive, we provide approximately optimal solutions.

  • 10 authors
·
Jun 3, 2025

POISE: Position-Aware Undetectable Skill Injection on LLM Agents

Agent skills provide a lightweight mechanism for extending general-purpose agents, but their open format exposes them to skill-poisoning attacks. A practically dangerous injection must stay invisible: if executing the payload derails the user's legitimate task, the resulting failure signal invites inspection of the skill. We therefore evaluate attacks by Attack Success Rate, which requires the injected payload to execute and the user's task to still pass its verifier in the same trial. Prior skill-poisoning attacks face a reliability-stealth trade-off under this lens: YAML-header injections are reliably loaded but easily inspected, whereas stealthier body injections that place explicit malicious commands in the skill prose are less reliable because out-of-context commands invite the agent's own suspicion. We introduce POISE, a position-aware attack that compresses the trigger into a single, benign-looking body instruction, placing it at a feasible position and using a context-aware generator to blend it with nearby setup or prerequisite steps. On Skill-Inject with codex+gpt-5.2, POISE achieves an 89.3% ASR, 28.0 points above a random-placement body baseline and 2.6 points above a YAML-only baseline, while retaining the stealth advantage of body placement. That stealth is the decisive margin: because legitimate skill bodies naturally require privileged tool operations, LLM scanners are hyper-sensitive, falsely flagging 74.6% of clean skills on average across four judges and both benchmarks. Blending into these false alarms, POISE causes only 5.6% of poisoned variants to gain a new high-risk alert over their clean baselines, rendering current static defenses ineffective.

VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format

Recent researches on video large language models (VideoLLM) predominantly focus on model architectures and training datasets, leaving the interaction format between the user and the model under-explored. In existing works, users often interact with VideoLLMs by using the entire video and a query as input, after which the model generates a response. This interaction format constrains the application of VideoLLMs in scenarios such as live-streaming comprehension where videos do not end and responses are required in a real-time manner, and also results in unsatisfactory performance on time-sensitive tasks that requires localizing video segments. In this paper, we focus on a video-text duet interaction format. This interaction format is characterized by the continuous playback of the video, and both the user and the model can insert their text messages at any position during the video playback. When a text message ends, the video continues to play, akin to the alternative of two performers in a duet. We construct MMDuetIT, a video-text training dataset designed to adapt VideoLLMs to video-text duet interaction format. We also introduce the Multi-Answer Grounded Video Question Answering (MAGQA) task to benchmark the real-time response ability of VideoLLMs. Trained on MMDuetIT, MMDuet demonstrates that adopting the video-text duet interaction format enables the model to achieve significant improvements in various time-sensitive tasks (76% CIDEr on YouCook2 dense video captioning, 90\% mAP on QVHighlights highlight detection and 25% R@0.5 on Charades-STA temporal video grounding) with minimal training efforts, and also enable VideoLLMs to reply in a real-time manner as the video plays. Code, data and demo are available at: https://github.com/yellow-binary-tree/MMDuet.

  • 7 authors
·
Nov 26, 2024 2

FCMBench-Video: Benchmarking Document Video Intelligence

Document understanding is a critical capability in financial credit review, onboarding, and remote verification, where both decision accuracy and evidence traceability matter. Compared with static document images, document videos present a temporally redundant and sequentially unfolding evidence stream, require evidence integration across frames, and preserve acquisition-process cues relevant to authenticity-sensitive and anti-fraud review. We introduce FCMBench-Video, a benchmark for document-video intelligence that evaluates document perception, temporal grounding, and evidence-grounded reasoning under realistic capture conditions. For privacy-compliant yet realistic data at scale, we organize construction as an atomic-acquisition and composition workflow that records reusable single-document clips, applies controlled degradations, and assembles long-form multi-document videos with prescribed temporal spans. FCMBench-Video is built from 495 atomic videos composed into 1,200 long-form videos paired with 11,322 expert-annotated question--answer instances, covering 28 document types over 20s--60s duration tiers and 5,960 Chinese / 5,362 English instances. Evaluations on nine recent Video-MLLMs show that FCMBench-Video provides meaningful separation across systems and capabilities: counting is the most duration-sensitive task, Cross-Document Validation and Evidence-Grounded Selection probe higher-level evidence integration, and Visual Prompt Injection provides a complementary robustness dimension. The overall score distribution is broad and approximately bell-shaped, indicating a benchmark that is neither saturated nor dominated by trivial cases. Together, these results position FCMBench-Video as a reproducible benchmark for tracking Video-MLLM progress on document-video understanding and probing capability boundaries in authenticity-sensitive credit-domain applications.

  • 5 authors
·
Apr 27

Lost in the Evidence? Reproducing Document Position and Context Size Effects in RAG

Retrieval-Augmented Generation (RAG) systems rely on retrieved documents being concatenated into a model's input context, making both document ordering and context size critical yet controversial design choices. Prior work reports position-based effects such as lost in the middle and related long-context phenomena. However, empirical findings remain inconsistent and hard to reproduce across models, datasets, and evaluation protocols. In this paper, we present a systematic reproducibility study that revisits these claims and examines how they evolve with contemporary LLMs under a controlled evaluation framework. We first show that topic sampling is a major source of variance: small topic sets can mask or exaggerate ordering effects. Based on repeated subset sampling across multiple topic budgets, we provide a practical calibration procedure that identifies topic counts yielding stable trends at feasible cost. Using these fixed topic sets, we then reproduce and extend results on position sensitivity, re-evaluating lost in the middle and positional biases in modern LLMs. Then, we also study a more realistic RAG scenario in which relevance is mediated by a retriever rather than oracle access to ground-truth documents. In this setting, we re-examine a recent industry study and identify discrepancies to evaluation choices such as limited topic coverage and reliance on LLM-based judges. Finally, we conduct an analysis of how retrieval order and context size affect downstream LLM performance under imperfect retrieval. Our results demonstrate that both factors interact strongly with retrieval quality and model choice, and that conclusions drawn from idealised setups do not always transfer to real-world RAG pipelines. We release all code and configurations to support reproducibility and future work on robust RAG evaluation.

  • 3 authors
·
May 26

Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

Recently the state space models (SSMs) with efficient hardware-aware designs, i.e., Mamba, have shown great potential for long sequence modeling. Building efficient and generic vision backbones purely upon SSMs is an appealing direction. However, representing visual data is challenging for SSMs due to the position-sensitivity of visual data and the requirement of global context for visual understanding. In this paper, we show that the reliance of visual representation learning on self-attention is not necessary and propose a new generic vision backbone with bidirectional Mamba blocks (Vim), which marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models. On ImageNet classification, COCO object detection, and ADE20k semantic segmentation tasks, Vim achieves higher performance compared to well-established vision transformers like DeiT, while also demonstrating significantly improved computation & memory efficiency. For example, Vim is 2.8times faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on images with a resolution of 1248times1248. The results demonstrate that Vim is capable of overcoming the computation & memory constraints on performing Transformer-style understanding for high-resolution images and it has great potential to become the next-generation backbone for vision foundation models. Code is available at https://github.com/hustvl/Vim.

  • 6 authors
·
Jan 17, 2024 3

Is Position Bias in Dense Retrievers Built In-or Learned from Data?

Dense retrievers exhibit positional bias, favoring documents whose query-relevant information appears near the beginning and degrading retrieval performance when the information appears later. While prior work on positional bias in dense retrievers has largely focused on architectural explanations, we study how the positional distribution of evidence in training data affects retrieval-level bias direction. To test this, we construct synthetic position-targeted training sets in which query-relevant evidence appears at the beginning, middle, or end of documents, and fine-tune eight architecturally diverse pretrained models under position-skewed and balanced training distributions. At the ranking level, we observe a strong directional pattern across the examined models: skewed training distributions favor evidence at the corresponding positions. Position-balanced training reduces positional sensitivity by 57--87\% on position-aware benchmarks, with competitive mean retrieval performance in our controlled setting. Representation-level analyses further suggest that fine-tuning often reshapes learned positional preferences, although pre-existing architectural or pretraining-specific tendencies persist in some models. These results identify training-position distribution as a major controllable factor in retrieval-level position bias and suggest balanced data curation as a practical mitigation strategy.

sionic-ai sionic-ai
·
May 25 2

PosIR: Position-Aware Heterogeneous Information Retrieval Benchmark

While dense retrieval models have achieved remarkable success, rigorous evaluation of their sensitivity to the position of relevant information (i.e., position bias) remains largely unexplored. Existing benchmarks typically employ position-agnostic relevance labels, conflating the challenge of processing long contexts with the bias against specific evidence locations. To address this challenge, we introduce PosIR (Position-Aware Information Retrieval), a comprehensive benchmark designed to diagnose position bias in diverse retrieval scenarios. PosIR comprises 310 datasets spanning 10 languages and 31 domains, constructed through a rigorous pipeline that ties relevance to precise reference spans, enabling the strict disentanglement of document length from information position. Extensive experiments with 10 state-of-the-art embedding models reveal that: (1) Performance on PosIR in long-context settings correlates poorly with the MMTEB benchmark, exposing limitations in current short-text benchmarks; (2) Position bias is pervasive and intensifies with document length, with most models exhibiting primacy bias while certain models show unexpected recency bias; (3) Gradient-based saliency analysis further uncovers the distinct internal attention mechanisms driving these positional preferences. In summary, PosIR serves as a valuable diagnostic framework to foster the development of position-robust retrieval systems.

  • 6 authors
·
Jan 13

HierarchicalPrune: Position-Aware Compression for Large-Scale Diffusion Models

State-of-the-art text-to-image diffusion models (DMs) achieve remarkable quality, yet their massive parameter scale (8-11B) poses significant challenges for inferences on resource-constrained devices. In this paper, we present HierarchicalPrune, a novel compression framework grounded in a key observation: DM blocks exhibit distinct functional hierarchies, where early blocks establish semantic structures while later blocks handle texture refinements. HierarchicalPrune synergistically combines three techniques: (1) Hierarchical Position Pruning, which identifies and removes less essential later blocks based on position hierarchy; (2) Positional Weight Preservation, which systematically protects early model portions that are essential for semantic structural integrity; and (3) Sensitivity-Guided Distillation, which adjusts knowledge-transfer intensity based on our discovery of block-wise sensitivity variations. As a result, our framework brings billion-scale diffusion models into a range more suitable for on-device inference, while preserving the quality of the output images. Specifically, when combined with INT4 weight quantisation, HierarchicalPrune achieves 77.5-80.4% memory footprint reduction (e.g., from 15.8 GB to 3.2 GB) and 27.9-38.0% latency reduction, measured on server and consumer grade GPUs, with the minimum drop of 2.6% in GenEval score and 7% in HPSv2 score compared to the original model. Last but not least, our comprehensive user study with 85 participants demonstrates that HierarchicalPrune maintains perceptual quality comparable to the original model while significantly outperforming prior works.

  • 6 authors
·
Aug 6, 2025

CSIM: A Copula-based similarity index sensitive to local changes for Image quality assessment

Image similarity metrics play an important role in computer vision applications, as they are used in image processing, computer vision and machine learning. Furthermore, those metrics enable tasks such as image retrieval, object recognition and quality assessment, essential in fields like healthcare, astronomy and surveillance. Existing metrics, such as PSNR, MSE, SSIM, ISSM and FSIM, often face limitations in terms of either speed, complexity or sensitivity to small changes in images. To address these challenges, a novel image similarity metric, namely CSIM, that combines real-time while being sensitive to subtle image variations is investigated in this paper. The novel metric uses Gaussian Copula from probability theory to transform an image into vectors of pixel distribution associated to local image patches. These vectors contain, in addition to intensities and pixel positions, information on the dependencies between pixel values, capturing the structural relationships within the image. By leveraging the properties of Copulas, CSIM effectively models the joint distribution of pixel intensities, enabling a more nuanced comparison of image patches making it more sensitive to local changes compared to other metrics. Experimental results demonstrate that CSIM outperforms existing similarity metrics in various image distortion scenarios, including noise, compression artifacts and blur. The metric's ability to detect subtle differences makes it suitable for applications requiring high precision, such as medical imaging, where the detection of minor anomalies can be of a high importance. The results obtained in this work can be reproduced from this Github repository: https://github.com/safouaneelg/copulasimilarity.

  • 4 authors
·
Oct 2, 2024

Rethinking Multi-Condition DiTs: Eliminating Redundant Attention via Position-Alignment and Keyword-Scoping

While modern text-to-image models excel at prompt-based generation, they often lack the fine-grained control necessary for specific user requirements like spatial layouts or subject appearances. Multi-condition control addresses this, yet its integration into Diffusion Transformers (DiTs) is bottlenecked by the conventional ``concatenate-and-attend'' strategy, which suffers from quadratic computational and memory overhead as the number of conditions scales. Our analysis reveals that much of this cross-modal interaction is spatially or semantically redundant. To this end, we propose Position-aligned and Keyword-scoped Attention (PKA), a highly efficient framework designed to eliminate these redundancies. Specifically, Position-Aligned Attention (PAA) linearizes spatial control by enforcing localized patch alignment, while Keyword-Scoped Attention (KSA) prunes irrelevant subject-driven interactions via semantic-aware masking. To facilitate efficient learning, we further introduce a Conditional Sensitivity-Aware Sampling (CSAS) strategy that reweights the training objective towards critical denoising phases, drastically accelerating convergence and enhancing conditional fidelity. Empirically, PKA delivers a 10.0times inference speedup and a 5.1times VRAM saving, providing a scalable and resource-friendly solution for high-fidelity multi-conditioned generation.

  • 5 authors
·
Feb 6

Amortized Inverse Kinematics via Graph Attention for Real-Time Human Avatar Animation

Inverse kinematics (IK) is a core operation in animation, robotics, and biomechanics: given Cartesian constraints, recover joint rotations under a known kinematic tree. In many real-time human avatar pipelines, the available signal per frame is a sparse set of tracked 3D joint positions, whereas animation systems require joint orientations to drive skinning. Recovering full orientations from positions is underconstrained, most notably because twist about bone axes is ambiguous, and classical IK solvers typically rely on iterative optimization that can be slow and sensitive to noisy inputs. We introduce IK-GAT, a lightweight graph-attention network that reconstructs full-body joint orientations from 3D joint positions in a single forward pass. The model performs message passing over the skeletal parent-child graph to exploit kinematic structure during rotation inference. To simplify learning, IK-GAT predicts rotations in a bone-aligned world-frame representation anchored to rest-pose bone frames. This parameterization makes the twist axis explicit and is exactly invertible to standard parent-relative local rotations given the kinematic tree and rest pose. The network uses a continuous 6D rotation representation and is trained with a geodesic loss on SO(3) together with an optional forward-kinematics consistency regularizer. IK-GAT produces animation-ready local rotations that can directly drive a rigged avatar or be converted to pose parameters of SMPL-like body models for real-time and online applications. With 374K parameters and over 650 FPS on CPU, IK-GAT outperforms VPoser-based per-frame iterative optimization without warm-start at significantly lower cost, and is robust to initial pose and input noise

  • 6 authors
·
Apr 16

Pre-trained Language Models as Re-Annotators

Annotation noise is widespread in datasets, but manually revising a flawed corpus is time-consuming and error-prone. Hence, given the prior knowledge in Pre-trained Language Models and the expected uniformity across all annotations, we attempt to reduce annotation noise in the corpus through two tasks automatically: (1) Annotation Inconsistency Detection that indicates the credibility of annotations, and (2) Annotation Error Correction that rectifies the abnormal annotations. We investigate how to acquire semantic sensitive annotation representations from Pre-trained Language Models, expecting to embed the examples with identical annotations to the mutually adjacent positions even without fine-tuning. We proposed a novel credibility score to reveal the likelihood of annotation inconsistencies based on the neighbouring consistency. Then, we fine-tune the Pre-trained Language Models based classifier with cross-validation for annotation correction. The annotation corrector is further elaborated with two approaches: (1) soft labelling by Kernel Density Estimation and (2) a novel distant-peer contrastive loss. We study the re-annotation in relation extraction and create a new manually revised dataset, Re-DocRED, for evaluating document-level re-annotation. The proposed credibility scores show promising agreement with human revisions, achieving a Binary F1 of 93.4 and 72.5 in detecting inconsistencies on TACRED and DocRED respectively. Moreover, the neighbour-aware classifiers based on distant-peer contrastive learning and uncertain labels achieve Macro F1 up to 66.2 and 57.8 in correcting annotations on TACRED and DocRED respectively. These improvements are not merely theoretical: Rather, automatically denoised training sets demonstrate up to 3.6% performance improvement for state-of-the-art relation extraction models.

  • 1 authors
·
May 11, 2022

Long-LRM++: Preserving Fine Details in Feed-Forward Wide-Coverage Reconstruction

Recent advances in generalizable Gaussian splatting (GS) have enabled feed-forward reconstruction of scenes from tens of input views. Long-LRM notably scales this paradigm to 32 input images at 950times540 resolution, achieving 360° scene-level reconstruction in a single forward pass. However, directly predicting millions of Gaussian parameters at once remains highly error-sensitive: small inaccuracies in positions or other attributes lead to noticeable blurring, particularly in fine structures such as text. In parallel, implicit representation methods such as LVSM and LaCT have demonstrated significantly higher rendering fidelity by compressing scene information into model weights rather than explicit Gaussians, and decoding RGB frames using the full transformer or TTT backbone. However, this computationally intensive decompression process for every rendered frame makes real-time rendering infeasible. These observations raise key questions: Is the deep, sequential "decompression" process necessary? Can we retain the benefits of implicit representations while enabling real-time performance? We address these questions with Long-LRM++, a model that adopts a semi-explicit scene representation combined with a lightweight decoder. Long-LRM++ matches the rendering quality of LaCT on DL3DV while achieving real-time 14 FPS rendering on an A100 GPU, overcoming the speed limitations of prior implicit methods. Our design also scales to 64 input views at the 950times540 resolution, demonstrating strong generalization to increased input lengths. Additionally, Long-LRM++ delivers superior novel-view depth prediction on ScanNetv2 compared to direct depth rendering from Gaussians. Extensive ablation studies validate the effectiveness of each component in the proposed framework.

  • 5 authors
·
Dec 10, 2025

Multi-view-guided Passage Reranking with Large Language Models

Recent advances in large language models (LLMs) have shown impressive performance in passage reranking tasks. Despite their success, LLM-based methods still face challenges in efficiency and sensitivity to external biases. (1) Existing models rely mostly on autoregressive generation and sliding window strategies to rank passages, which incur heavy computational overhead as the number of passages increases. (2) External biases, such as position or selection bias, hinder the model's ability to accurately represent passages and increase input-order sensitivity. To address these limitations, we introduce a novel passage reranking model, called Multi-View-guided Passage Reranking (MVP). MVP is a non-generative LLM-based reranking method that encodes query-passage information into diverse view embeddings without being influenced by external biases. For each view, it combines query-aware passage embeddings to produce a distinct anchor vector, which is then used to directly compute relevance scores in a single decoding step. In addition, it employs an orthogonal loss to make the views more distinctive. Extensive experiments demonstrate that MVP, with just 220M parameters, matches the performance of much larger 7B-scale fine-tuned models while achieving a 100x reduction in inference latency. Notably, the 3B-parameter variant of MVP achieves state-of-the-art performance on both in-domain and out-of-domain benchmarks. The source code is available at: https://github.com/bulbna/MVP

  • 4 authors
·
Sep 9, 2025

"Es geht um Respekt, nicht um Technologie": Erkenntnisse aus einem Interessensgruppen-übergreifenden Workshop zu genderfairer Sprache und Sprachtechnologie

With the increasing attention non-binary people receive in Western societies, strategies of gender-fair language have started to move away from binary (only female/male) concepts of gender. Nevertheless, hardly any approaches to take these identities into account into machine translation models exist so far. A lack of understanding of the socio-technical implications of such technologies risks further reproducing linguistic mechanisms of oppression and mislabelling. In this paper, we describe the methods and results of a workshop on gender-fair language and language technologies, which was led and organised by ten researchers from TU Wien, St. P\"olten UAS, FH Campus Wien and the University of Vienna and took place in Vienna in autumn 2021. A wide range of interest groups and their representatives were invited to ensure that the topic could be dealt with holistically. Accordingly, we aimed to include translators, machine translation experts and non-binary individuals (as "community experts") on an equal footing. Our analysis shows that gender in machine translation requires a high degree of context sensitivity, that developers of such technologies need to position themselves cautiously in a process still under social negotiation, and that flexible approaches seem most adequate at present. We then illustrate steps that follow from our results for the field of gender-fair language technologies so that technological developments can adequately line up with social advancements. ---- Mit zunehmender gesamtgesellschaftlicher Wahrnehmung nicht-bin\"arer Personen haben sich in den letzten Jahren auch Konzepte von genderfairer Sprache von der bisher verwendeten Binarit\"at (weiblich/m\"annlich) entfernt. Trotzdem gibt es bislang nur wenige Ans\"atze dazu, diese Identit\"aten in maschineller \"Ubersetzung abzubilden. Ein fehlendes Verst\"andnis unterschiedlicher sozio-technischer Implikationen derartiger Technologien birgt in sich die Gefahr, fehlerhafte Ansprachen und Bezeichnungen sowie sprachliche Unterdr\"uckungsmechanismen zu reproduzieren. In diesem Beitrag beschreiben wir die Methoden und Ergebnisse eines Workshops zu genderfairer Sprache in technologischen Zusammenh\"angen, der im Herbst 2021 in Wien stattgefunden hat. Zehn Forscher*innen der TU Wien, FH St. P\"olten, FH Campus Wien und Universit\"at Wien organisierten und leiteten den Workshop. Dabei wurden unterschiedlichste Interessensgruppen und deren Vertreter*innen breit gestreut eingeladen, um sicherzustellen, dass das Thema holistisch behandelt werden kann. Dementsprechend setzten wir uns zum Ziel, Machine-Translation-Entwickler*innen, \"Ubersetzer*innen, und nicht-bin\"are Privatpersonen (als "Lebenswelt-Expert*innen") gleichberechtigt einzubinden. Unsere Analyse zeigt, dass Geschlecht in maschineller \"Ubersetzung eine mageblich kontextsensible Herangehensweise erfordert, die Entwicklung von Sprachtechnologien sich vorsichtig in einem sich noch in Aushandlung befindlichen gesellschaftlichen Prozess positionieren muss, und flexible Ans\"atze derzeit am ad\"aquatesten erscheinen. Wir zeigen auf, welche n\"achsten Schritte im Bereich genderfairer Technologien notwendig sind, damit technische mit sozialen Entwicklungen mithalten k\"onnen.

  • 5 authors
·
Sep 6, 2022

What Is Your AI Agent Buying? Evaluation, Implications and Emerging Questions for Agentic E-Commerce

Online marketplaces will be transformed by autonomous AI agents acting on behalf of consumers. Rather than humans browsing and clicking, vision-language-model (VLM) agents can parse webpages, evaluate products, and transact. This raises a fundamental question: what do AI agents buy, and why? We develop ACES, a sandbox environment that pairs a platform-agnostic VLM agent with a fully programmable mock marketplace to study this question. We first conduct basic rationality checks in the context of simple tasks, and then, by randomizing product positions, prices, ratings, reviews, sponsored tags, and platform endorsements, we obtain causal estimates of how frontier VLMs actually shop. Models show strong but heterogeneous position effects: all favor the top row, yet different models prefer different columns, undermining the assumption of a universal "top" rank. They penalize sponsored tags and reward endorsements. Sensitivities to price, ratings, and reviews are directionally human-like but vary sharply in magnitude across models. Motivated by scenarios where sellers use AI agents to optimize product listings, we show that a seller-side agent that makes minor tweaks to product descriptions, targeting AI buyer preferences, can deliver substantial market-share gains if AI-mediated shopping dominates. We also find that modal product choices can differ across models and, in some cases, demand may concentrate on a few select products, raising competition questions. Together, our results illuminate how AI agents may behave in e-commerce settings and surface concrete seller strategy, platform design, and regulatory questions in an AI-mediated ecosystem.

  • 5 authors
·
Aug 4, 2025 2

Saliency-Driven Active Contour Model for Image Segmentation

Active contour models have achieved prominent success in the area of image segmentation, allowing complex objects to be segmented from the background for further analysis. Existing models can be divided into region-based active contour models and edge-based active contour models. However, both models use direct image data to achieve segmentation and face many challenging problems in terms of the initial contour position, noise sensitivity, local minima and inefficiency owing to the in-homogeneity of image intensities. The saliency map of an image changes the image representation, making it more visual and meaningful. In this study, we propose a novel model that uses the advantages of a saliency map with local image information (LIF) and overcomes the drawbacks of previous models. The proposed model is driven by a saliency map of an image and the local image information to enhance the progress of the active contour models. In this model, the saliency map of an image is first computed to find the saliency driven local fitting energy. Then, the saliency-driven local fitting energy is combined with the LIF model, resulting in a final novel energy functional. This final energy functional is formulated through a level set formulation, and regulation terms are added to evolve the contour more precisely across the object boundaries. The quality of the proposed method was verified on different synthetic images, real images and publicly available datasets, including medical images. The image segmentation results, and quantitative comparisons confirmed the contour initialization independence, noise insensitivity, and superior segmentation accuracy of the proposed model in comparison to the other segmentation models.

  • 5 authors
·
May 23, 2022

Deep Human Parsing with Active Template Regression

In this work, the human parsing task, namely decomposing a human image into semantic fashion/body regions, is formulated as an Active Template Regression (ATR) problem, where the normalized mask of each fashion/body item is expressed as the linear combination of the learned mask templates, and then morphed to a more precise mask with the active shape parameters, including position, scale and visibility of each semantic region. The mask template coefficients and the active shape parameters together can generate the human parsing results, and are thus called the structure outputs for human parsing. The deep Convolutional Neural Network (CNN) is utilized to build the end-to-end relation between the input human image and the structure outputs for human parsing. More specifically, the structure outputs are predicted by two separate networks. The first CNN network is with max-pooling, and designed to predict the template coefficients for each label mask, while the second CNN network is without max-pooling to preserve sensitivity to label mask position and accurately predict the active shape parameters. For a new image, the structure outputs of the two networks are fused to generate the probability of each label for each pixel, and super-pixel smoothing is finally used to refine the human parsing result. Comprehensive evaluations on a large dataset well demonstrate the significant superiority of the ATR framework over other state-of-the-arts for human parsing. In particular, the F1-score reaches 64.38% by our ATR framework, significantly higher than 44.76% based on the state-of-the-art algorithm.

  • 8 authors
·
Mar 9, 2015

Maia-2: A Unified Model for Human-AI Alignment in Chess

There are an increasing number of domains in which artificial intelligence (AI) systems both surpass human ability and accurately model human behavior. This introduces the possibility of algorithmically-informed teaching in these domains through more relatable AI partners and deeper insights into human decision-making. Critical to achieving this goal, however, is coherently modeling human behavior at various skill levels. Chess is an ideal model system for conducting research into this kind of human-AI alignment, with its rich history as a pivotal testbed for AI research, mature superhuman AI systems like AlphaZero, and precise measurements of skill via chess rating systems. Previous work in modeling human decision-making in chess uses completely independent models to capture human style at different skill levels, meaning they lack coherence in their ability to adapt to the full spectrum of human improvement and are ultimately limited in their effectiveness as AI partners and teaching tools. In this work, we propose a unified modeling approach for human-AI alignment in chess that coherently captures human style across different skill levels and directly captures how people improve. Recognizing the complex, non-linear nature of human learning, we introduce a skill-aware attention mechanism to dynamically integrate players' strengths with encoded chess positions, enabling our model to be sensitive to evolving player skill. Our experimental results demonstrate that this unified framework significantly enhances the alignment between AI and human players across a diverse range of expertise levels, paving the way for deeper insights into human decision-making and AI-guided teaching tools.

  • 6 authors
·
Oct 30, 2024

Motion simulation of radio-labeled cells in whole-body positron emission tomography

Cell tracking is a subject of active research gathering great interest in medicine and biology. Positron emission tomography (PET) is well suited for tracking radio-labeled cells in vivo due to its exceptional sensitivity and whole-body capability. For validation, ground-truth data are desirable that realistically mimic the flow of cells in a clinical situation. This study develops a workflow (CeFloPS) for simulating moving radio-labeled cells in a human phantom. From the XCAT phantom, the blood vessels are reduced to nodal networks along which cells can move and distribute to organs and tissues. The movement is directed by the blood flow, which is calculated in each node using the Hagen-Pooiseuille equation and Kirchhoff's laws assuming laminar flow. Organs are voxelized and movement of cells from artery entry to vein exit is generated via a biased 3D random walk. The probabilities of cells moving or remaining in tissues are derived from rate constants of tracer kinetic-based compartment modeling. PET listmode data is generated using the Monte-Carlo simulation framework GATE based on the definition of a large-body PET scanner with cell paths as moving radioactive sources and the XCAT phantom providing attenuation data. From the flow simulation of 100,000 cells, 100 sample cells were further processed by GATE and listmode data was reconstructed into images for comparison. As demonstrated by comparisons of simulated and reconstructed cell distributions, CeFloPS is capable of simulating cell behavior in whole-body PET. It achieves this simulation in a way that is anatomically and physiologically reasonable, thereby providing valuable data for the development and validation of cell tracking algorithms.

  • 5 authors
·
Jul 10, 2024

Sensitivity-Aware Visual Parameter-Efficient Fine-Tuning

Visual Parameter-Efficient Fine-Tuning (PEFT) has become a powerful alternative for full fine-tuning so as to adapt pre-trained vision models to downstream tasks, which only tunes a small number of parameters while freezing the vast majority ones to ease storage burden and optimization difficulty. However, existing PEFT methods introduce trainable parameters to the same positions across different tasks depending solely on human heuristics and neglect the domain gaps. To this end, we study where to introduce and how to allocate trainable parameters by proposing a novel Sensitivity-aware visual Parameter-efficient fine-Tuning (SPT) scheme, which adaptively allocates trainable parameters to task-specific important positions given a desired tunable parameter budget. Specifically, our SPT first quickly identifies the sensitive parameters that require tuning for a given task in a data-dependent way. Next, our SPT further boosts the representational capability for the weight matrices whose number of sensitive parameters exceeds a pre-defined threshold by utilizing existing structured tuning methods, e.g., LoRA [23] or Adapter [22], to replace directly tuning the selected sensitive parameters (unstructured tuning) under the budget. Extensive experiments on a wide range of downstream recognition tasks show that our SPT is complementary to the existing PEFT methods and largely boosts their performance, e.g., SPT improves Adapter with supervised pre-trained ViT-B/16 backbone by 4.2% and 1.4% mean Top-1 accuracy, reaching SOTA performance on FGVC and VTAB-1k benchmarks, respectively. Source code is at https://github.com/ziplab/SPT

  • 5 authors
·
Mar 15, 2023

OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models

Multimodal large language models (MLLMs) have achieved remarkable performance across a wide range of vision language tasks. However, their ability in low-level visual perception, particularly in detecting fine-grained visual discrepancies, remains underexplored and lacks systematic analysis. In this work, we introduce OddGridBench, a controllable benchmark for evaluating the visual discrepancy sensitivity of MLLMs. OddGridBench comprises over 1,400 grid-based images, where a single element differs from all others by one or multiple visual attributes such as color, size, rotation, or position. Experiments reveal that all evaluated MLLMs, including open-source families such as Qwen3-VL and InternVL3.5, and proprietary systems like Gemini-2.5-Pro and GPT-5, perform far below human levels in visual discrepancy detection. We further propose OddGrid-GRPO, a reinforcement learning framework that integrates curriculum learning and distance-aware reward. By progressively controlling the difficulty of training samples and incorporating spatial proximity constraints into the reward design, OddGrid-GRPO significantly enhances the model's fine-grained visual discrimination ability. We hope OddGridBench and OddGrid-GRPO will lay the groundwork for advancing perceptual grounding and visual discrepancy sensitivity in multimodal intelligence. Code and dataset are available at https://wwwtttjjj.github.io/OddGridBench/.

  • 6 authors
·
Mar 10

Foveated Retinotopy Improves Classification and Localization in CNNs

From a falcon detecting prey to humans recognizing faces, many species exhibit extraordinary abilities in rapid visual localization and classification. These are made possible by a specialized retinal region called the fovea, which provides high acuity at the center of vision while maintaining lower resolution in the periphery. This distinctive spatial organization, preserved along the early visual pathway through retinotopic mapping, is fundamental to biological vision, yet remains largely unexplored in machine learning. Our study investigates how incorporating foveated retinotopy may benefit deep convolutional neural networks (CNNs) in image classification tasks. By implementing a foveated retinotopic transformation in the input layer of standard ResNet models and re-training them, we maintain comparable classification accuracy while enhancing the network's robustness to scale and rotational perturbations. Although this architectural modification introduces increased sensitivity to fixation point shifts, we demonstrate how this apparent limitation becomes advantageous: variations in classification probabilities across different gaze positions serve as effective indicators for object localization. Our findings suggest that foveated retinotopic mapping encodes implicit knowledge about visual object geometry, offering an efficient solution to the visual search problem - a capability crucial for many living species.

  • 3 authors
·
Feb 23, 2024

MolmoSpaces: A Large-Scale Open Ecosystem for Robot Navigation and Manipulation

Deploying robots at scale demands robustness to the long tail of everyday situations. The countless variations in scene layout, object geometry, and task specifications that characterize real environments are vast and underrepresented in existing robot benchmarks. Measuring this level of generalization requires infrastructure at a scale and diversity that physical evaluation alone cannot provide. We introduce MolmoSpaces, a fully open ecosystem to support large-scale benchmarking of robot policies. MolmoSpaces consists of over 230k diverse indoor environments, ranging from handcrafted household scenes to procedurally generated multiroom houses, populated with 130k richly annotated object assets, including 48k manipulable objects with 42M stable grasps. Crucially, these environments are simulator-agnostic, supporting popular options such as MuJoCo, Isaac, and ManiSkill. The ecosystem supports the full spectrum of embodied tasks: static and mobile manipulation, navigation, and multiroom long-horizon tasks requiring coordinated perception, planning, and interaction across entire indoor environments. We also design MolmoSpaces-Bench, a benchmark suite of 8 tasks in which robots interact with our diverse scenes and richly annotated objects. Our experiments show MolmoSpaces-Bench exhibits strong sim-to-real correlation (R = 0.96, ho = 0.98), confirm newer and stronger zero-shot policies outperform earlier versions in our benchmarks, and identify key sensitivities to prompt phrasing, initial joint positions, and camera occlusion. Through MolmoSpaces and its open-source assets and tooling, we provide a foundation for scalable data generation, policy training, and benchmark creation for robot learning research.

ai21labs AI21
·
Feb 11 1