Evaluating Evaluations (2024)

Examining Best Practices for Measuring Broader Impacts of Generative AI

A NeurIPS Workshop

Workshop Overview

Generative AI systems are becoming increasingly prevalent in society, producing content such as text, images, audio, and video with far-reaching implications. While the NeurIPS Broader Impact statement has notably shifted publication norms toward considering negative societal impacts, no standard yet exists for how to conduct these impact assessments. This workshop aims to address that gap by bringing together experts on evaluation science and practitioners who develop and analyze technical systems.

This workshop builds on our previous initiatives, including the FAccT 2023 CRAFT session "Assessing the Impacts of Generative AI Systems Across Modalities and Society" and our initial report "Evaluating the Social Impact of Generative AI Systems." Through these efforts, we collaboratively developed an evaluation framework and guidance for assessing generative systems across modalities. We have since crowdsourced evaluations and analyzed gaps in the literature, as well as systemic issues in how evaluations are designed and selected.

The goal of this workshop is to share our existing findings with the NeurIPS community and collectively develop future directions for effective community-built evaluations. By fostering collaboration between experts and practitioners, we aim to create more comprehensive evaluations and develop urgently needed policy recommendations for governments and AI safety organizations.

Call for Papers (CFP)

We are soliciting tiny papers (up to 2 pages) in the following formats:

  1. Extended Abstracts: Short but complete research papers presenting original or noteworthy results on social impact evaluation for generative AI.
  2. "Provocations": Novel perspectives or challenges to conventional wisdom around social impact evaluation for generative AI.

Submission Guidelines

  • Paper Length: Maximum 2 pages, including references
  • Format: PDF file, using the NeurIPS conference format
  • Submission Portal: [Insert submission portal link here]
  • Anonymity: Submissions must be anonymized for blind review

Themes for Submissions

We welcome submissions addressing, but not limited to, the following themes:

  1. Conceptualization and operationalization issues in evaluations of:
    • Bias, stereotypes, and representational harms
    • Cultural values and sensitive content
    • Community-centered definitions of disparate performance and privacy
    • Documentation frameworks for financial and environmental costs of evaluations
  2. Ethical or consequential validity considerations for:
    • Data protection
    • Data and content moderation labor
    • Historical implications of evaluation data or practices for evaluation validity
  3. Interrogating or critiquing the theoretical basis of existing evaluations
  4. Novel methodologies for evaluating social impact across different AI modalities
  5. Comparative analyses of existing evaluation frameworks and their effectiveness
  6. Case studies of social impact evaluations in real-world AI applications

Important Dates

  • Submission Deadline: August 1, 2024
  • Notification of Acceptance: September 1, 2024
  • Workshop Date: [Insert workshop date here]

Workshop Structure

Total Duration: 8 Hours

9:00 AM - 9:30 AM: Welcome and Introduction
  • Opening remarks
  • Overview of workshop structure and objectives
9:30 AM - 11:00 AM: Reflections on the Landscape
  • Collaborative reflection on the existing evaluation landscape
  • Talks, panels, and breakouts by modality (text, image, audio, video, and multimodal)
  • Topics: underlying frameworks, contextualization challenges, defining robust evaluations, incentive structures
11:00 AM - 11:15 AM: Break
11:15 AM - 12:45 PM: Talks + Provocations
  • Invited speakers present current technical evaluations for base models across all modalities
  • Key social impact categories: bias and stereotyping, cultural values, performance disparities, privacy, financial and environmental costs, data moderator labor
  • Presentations of accepted provocations
12:45 PM - 1:45 PM: Lunch Break
1:45 PM - 3:45 PM: Group Activity
  • Participants break into groups focused on key social impact categories
  • Activities: choosing evaluations; reviewing tools and datasets; examining construct reliability, validity, and ranking methodologies
3:45 PM - 4:00 PM: Break
4:00 PM - 5:45 PM: What's Next? Documentation + Resources
  • Develop policy guidance highlighting impact categories, subcategories, and modalities requiring further investment
  • Discussions: documenting methods, developing shareable resources, underlying frameworks, contextualization challenges, defining robust evaluations
5:45 PM - 6:00 PM: Closing Remarks

Invited Speakers

Confirmed Speakers:

  1. Abigail Jacobs
    • Assistant Professor, School of Information
    • Assistant Professor of Complex Systems, College of Literature, Science, and the Arts
    • University of Michigan
  2. Nitarshan Rajkumar
    • Co-founder of the UK AI Safety Institute
    • Adviser to the Secretary of State at the UK Department for Science, Innovation and Technology
  3. Su Lin Blodgett
    • Senior Researcher, Microsoft Research Montreal

Tentative Speaker:

  1. Abeba Birhane
    • Adjunct Lecturer/Assistant Professor, Trinity College Dublin
    • Senior Fellow in Trustworthy AI at Mozilla Foundation

Expected Outcomes

Within three months of the workshop, we aim to achieve the following outcomes:

  1. Evaluation Report and Resources/Repository:
    • Publish a comprehensive summary of the workshop findings
    • Update resources including:
      • Documentation framework for standardizing evaluation practices
      • Open-source repository addressing identified barriers to broader adoption of social impact evaluations of generative AI systems
  2. Policy Recommendations:
    • Share detailed policy recommendations for investment in future directions for social impact evaluations based on group discussions and workshop outcomes
  3. Knowledge Sharing:
    • Foster a more systematic and effective approach to evaluating the social impact of generative AI systems by disseminating lessons and findings to the broader AI research community

Contact Information

For any queries regarding the workshop or submission process, please contact:

[Insert contact information for workshop organizers]