Modality,Suggested Evaluation,What it is evaluating,Considerations,Link,URL Text,Word Embedding Association Test (WEAT),Associations and word embeddings based on Implicit Associations Test (IAT),"Although based in human associations, general societal attitudes do not always represent subgroups of people and cultures.",Semantics derived automatically from language corpora contain human-like biases,https://researchportal.bath.ac.uk/en/publications/semantics-derived-automatically-from-language-corpora-necessarily Text,"Word Embedding Factual As sociation Test (WEFAT)",Associations and word embeddings based on Implicit Associations Test (IAT),"Although based in human associations, general societal attitudes do not always represent subgroups of people and cultures.",Semantics derived automatically from language corpora contain human-like biases,https://researchportal.bath.ac.uk/en/publications/semantics-derived-automatically-from-language-corpora-necessarily Text,Sentence Encoder Association Test (SEAT),Associations and word embeddings based on Implicit Associations Test (IAT),"Although based in human associations, general societal attitudes do not always represent subgroups of people and cultures.",[1903.10561] On Measuring Social Biases in Sentence Encoders,https://arxiv.org/abs/1903.10561 Text,Contextual Word Representation Association Tests for social and intersectional biases,Associations and word embeddings based on Implicit Associations Test (IAT),"Although based in human associations, general societal attitudes do not always represent subgroups of people and cultures.",Assessing social and intersectional biases in contextualized word representations | Proceedings of the 33rd International Conference on Neural Information Processing Systems,https://dl.acm.org/doi/abs/10.5555/3454287.3455472 Text,StereoSet,Protected class stereotypes,Automating stereotype detection makes distinguishing harmful stereotypes difficult. It also raises many false positives and can flag relatively neutral associations based in fact (e.g. population x has a high proportion of lactose intolerant people).,[2004.09456] StereoSet: Measuring stereotypical bias in pretrained language models,https://arxiv.org/abs/2004.09456 Text,Crow-S Pairs,Protected class stereotypes,Automating stereotype detection makes distinguishing harmful stereotypes difficult. It also raises many false positives and can flag relatively neutral associations based in fact (e.g. population x has a high proportion of lactose intolerant people).,[2010.00133] CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models,https://arxiv.org/abs/2010.00133 Text,HONEST: Measuring Hurtful Sentence Completion in Language Models,Protected class stereotypes and hurtful language,Automating stereotype detection makes distinguishing harmful stereotypes difficult. It also raises many false positives and can flag relatively neutral associations based in fact (e.g. population x has a high proportion of lactose intolerant people).,HONEST: Measuring Hurtful Sentence Completion in Language Models,https://aclanthology.org/2021.naacl-main.191.pdf Text,TANGO: Towards Centering Transgender and Non-Binary Voices to Measure Biases in Open Language Generation,"Bias measurement for trans and nonbinary community via measuring gender non-affirmative language, specifically 1) misgendering 2), negative responses to gender disclosure","Based on automatic evaluations of the resulting open language generation, may be sensitive to the choice of evaluator. Would advice for a combination of perspective, detoxify, and regard metrics","Paper Dataset", Text,BBQ: A hand-built bias benchmark for question answering,Protected class stereotypes,,BBQ: A hand-built bias benchmark for question answering,https://aclanthology.org/2022.findings-acl.165.pdf Text,"BBNLI, bias in NLI benchmark",Protected class stereotypes,,On Measuring Social Biases in Prompt-Based Multi-Task Learning,https://aclanthology.org/2022.findings-naacl.42.pdf Text,WinoGender,Bias between gender and occupation,,Gender Bias in Coreference Resolution,https://arxiv.org/abs/1804.09301 Text,WinoQueer,"Bias between gender, sexuality",,Winoqueer,https://arxiv.org/abs/2306.15087 Text,Level of caricature,,,CoMPosT: Characterizing and Evaluating Caricature in LLM Simulations,https://arxiv.org/abs/2310.11501 Text,SeeGULL: A Stereotype Benchmark with Broad Geo-Cultural Coverage Leveraging Generative Models,,,[2305.11840] SeeGULL: A Stereotype Benchmark with Broad Geo-Cultural Coverage Leveraging Generative Models,https://arxiv.org/abs/2305.11840 Text,"Investigating Subtler Biases in LLMs: Ageism, Beauty, Institutional, and Nationality Bias in Generative Models",,,https://arxiv.org/abs/2309.08902,https://arxiv.org/abs/2309.08902 Text,ROBBIE: Robust Bias Evaluation of Large Generative Language Models,,,[2311.18140] ROBBIE: Robust Bias Evaluation of Large Generative Language Models,https://arxiv.org/abs/2311.18140 Image,Image Embedding Association Test (iEAT),Embedding associations,,"Image Representations Learned With Unsupervised Pre-Training Contain Human-like Biases | Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency",https://dl.acm.org/doi/abs/10.1145/3442188.3445932 Image,Dataset leakage and model leakage,Gender and label bias,,[1811.08489] Balanced Datasets Are Not Enough: Estimating and Mitigating Gender Bias in Deep Image Representations,https://arxiv.org/abs/1811.08489 Image,Grounded-WEAT,Joint vision and language embeddings,,Measuring Social Biases in Grounded Vision and Language Embeddings - ACL Anthology,https://aclanthology.org/2021.naacl-main.78/ Image,Grounded-SEAT,,,, Image,CLIP-based evaluation,"Gender and race and class associations with four attribute categories (profession, political, object, and other.)",,DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers,https://arxiv.org/abs/2202.04053 Image,Human evaluation,,,, Image,,,,[2108.02818] Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications,https://arxiv.org/abs/2108.02818 Image,,,,[2004.07173] Bias in Multimodal AI: Testbed for Fair Automatic Recruitment,https://arxiv.org/abs/2004.07173 Image,Characterizing the variation in generated images,,,Stable bias: Analyzing societal representations in diffusion models,https://arxiv.org/abs/2303.11408 Image,Stereotypical representation of professions,,,Editing Implicit Assumptions in Text-to-Image Diffusion Models see section 6, Image,Effect of different scripts on text-to-image generation,"It evaluates generated images for cultural stereotypes, when using different scripts (homoglyphs). It somewhat measures the suceptibility of a model to produce cultural stereotypes by simply switching the script",,Exploiting Cultural Biases via Homoglyphs in Text-to-Image Synthesis,https://arxiv.org/pdf/2209.08891.pdf Image,Automatic classification of the immorality of images,,,Ensuring Visual Commonsense Morality for Text-to-Image Generation,https://arxiv.org/pdf/2209.08891.pdf Image,"ENTIGEN: effect on the diversity of the generated images when adding ethical intervention",,,"How well can Text-to-Image Generative Models understand Ethical Natural Language Interventions?", Image,Evaluating text-to-image models for (complex) biases,,,Easily accessible text-to-image generation amplifies demographic stereotypes at large scale,https://dl.acm.org/doi/abs/10.1145/3593013.3594095 Image,,,,FACET: Fairness in Computer Vision Evaluation Benchmark,https://openaccess.thecvf.com/content/ICCV2023/html/Gustafson_FACET_Fairness_in_Computer_Vision_Evaluation_Benchmark_ICCV_2023_paper.html Image,Evaluating text-to-image models for occupation-gender biases from source to output,"Measuring bias from source to output (dataset, model and outcome). Using different prompts to search dataset and to generate images. Evaluate them in turn for stereotypes.","Evaluating for social attributes that one self-identifies for, e.g. gender, is challenging in computer- generated images.",Fair Diffusion: Instructing Text-to-Image Generation Models on Fairness,https://arxiv.org/abs/2302.10893 Image,Evaluating text-to-image models for gender biases in a multilingual setting,Using different prompts in different languages to generate images and evaluate them in turn for stereotypes.,,Multilingual Text-to-Image Generation Magnifies Gender Stereotypes and Prompt Engineering May Not Help You, Image,Evaluating text-to-image models for biases when multiple people are generated,This work focuses on generating images depicting multiple people. This puts the evaluation on a higher level beyond portrait evaluation.,Same as for the other evaluations of social attributes + evaluating for location in image is difficult as the models have no inherent spatial understanding.,The Male CEO and the Female Assistant: Probing Gender Biases in Text-To-Image Models Through Paired Stereotype Test,https://arxiv.org/abs/2402.11089 Image,Multimodal Composite Association Score: Measuring Gender Bias in Generative Multimodal Models,Measure association between concepts in multi-modal settings (image and text),,, Image,VisoGender,"This work measures gender-occupation biases in image-to-text models by evaluating: (1) their ability to correctly resolve the pronouns of individuals in scenes, and (2) the perceived gender of individuals in images retrieved for gender-neutral search queries.",Relies on annotators’ perceptions of binary gender. Could better control for the fact that models generally struggle with captioning any scene that involves interactions between two or more individuals.,VisoGender: A dataset for benchmarking gender bias in image-text pronoun resolution,https://proceedings.neurips.cc/paper_files/paper/2023/hash/c93f26b1381b17693055a611a513f1e9-Abstract-Datasets_and_Benchmarks.html Audio,Not My Voice! A Taxonomy of Ethical and Safety Harms of Speech Generators,Lists harms of audio/speech generators,Not necessarily evaluation but a good source of taxonomy. We can use this to point readers towards high-level evaluations,https://arxiv.org/pdf/2402.01708.pdf,https://arxiv.org/pdf/2402.01708.pdf Video,Diverse Misinformation: Impacts of Human Biases on Detection of Deepfakes on Networks,Human led evaluations of deepfakes to understand susceptibility and representational harms (including political violence),"Repr. harm, incite violence",https://arxiv.org/abs/2210.10026,https://arxiv.org/abs/2210.10026