new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Jun 15

Towards Measuring Fairness in AI: the Casual Conversations Dataset

This paper introduces a novel dataset to help researchers evaluate their computer vision and audio models for accuracy across a diverse set of age, genders, apparent skin tones and ambient lighting conditions. Our dataset is composed of 3,011 subjects and contains over 45,000 videos, with an average of 15 videos per person. The videos were recorded in multiple U.S. states with a diverse set of adults in various age, gender and apparent skin tone groups. A key feature is that each subject agreed to participate for their likenesses to be used. Additionally, our age and gender annotations are provided by the subjects themselves. A group of trained annotators labeled the subjects' apparent skin tone using the Fitzpatrick skin type scale. Moreover, annotations for videos recorded in low ambient lighting are also provided. As an application to measure robustness of predictions across certain attributes, we provide a comprehensive study on the top five winners of the DeepFake Detection Challenge (DFDC). Experimental evaluation shows that the winning models are less performant on some specific groups of people, such as subjects with darker skin tones and thus may not generalize to all people. In addition, we also evaluate the state-of-the-art apparent age and gender classification methods. Our experiments provides a thorough analysis on these models in terms of fair treatment of people from various backgrounds.

  • 6 authors
·
Apr 6, 2021

Colorimetric skin tone scale for improved accuracy and reduced perceptual bias of human skin tone annotations

Human image datasets used to develop and evaluate technology should represent the diversity of human phenotypes, including skin tone. Datasets that include skin tone information frequently rely on manual skin tone ratings based on the Fitzpatrick Skin Type (FST) or the Monk Skin Tone (MST) scales in lieu of the actual measured skin tone of the image dataset subjects. However, perceived skin tone is subject to known biases and skin tone appearance in digital images can vary substantially depending on the capture camera and environment, confounding manual ratings. Surprisingly, the relationship between skin-tone ratings and measured skin tone has not been explored. To close this research gap, we measured the relationship between skin tone ratings from existing scales (FST, MST) and skin tone values measured by a calibrated colorimeter. We also propose and assess a novel Colorimetric Skin Tone (CST) scale developed based on prior colorimetric measurements. Using experiments requiring humans to rate their own skin tone and the skin tone of subjects in images, we show that the new CST scale is more sensitive, consistent, and colorimetrically accurate. While skin tone ratings appeared to correct for some color variation across images, they introduced biases related to race and other factors. These biases must be considered before using manual skin-tone ratings in technology evaluations or for engineering decisions.

  • 7 authors
·
Oct 27, 2024

SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL

General-purpose Large Vision-Language Models (LVLMs), despite their massive scale, often falter in dermatology due to "diffuse attention" - the inability to disentangle subtle pathological lesions from background noise. In this paper, we challenge the assumption that parameter scaling is the only path to medical precision. We introduce SkinFlow, a framework that treats diagnosis as an optimization of visual information transmission efficiency. Our approach utilizes a Virtual-Width Dynamic Vision Encoder (DVE) to "unfold" complex pathological manifolds without physical parameter expansion, coupled with a two-stage Reinforcement Learning strategy. This strategy sequentially aligns explicit medical descriptions (Stage I) and reconstructs implicit diagnostic textures (Stage II) within a constrained semantic space. Furthermore, we propose a clinically grounded evaluation protocol that prioritizes diagnostic safety and hierarchical relevance over rigid label matching. Empirical results are compelling: our 7B model establishes a new state-of-the-art on the Fitzpatrick17k benchmark, achieving a +12.06% gain in Top-1 accuracy and a +28.57% boost in Top-6 accuracy over the massive general-purpose models (e.g., Qwen3VL-235B and GPT-5.2). These findings demonstrate that optimizing geometric capacity and information flow yields superior diagnostic reasoning compared to raw parameter scaling.