Commit cfd13fd by yjernite — "moving on to section 2" (1 parent: 8024608)

app.py CHANGED
@@ -110,22 +110,22 @@ with gr.Blocks() as demo:
 <br/>

 Diffusion Models are commonly used in text-conditioned image generation systems, such as Stable Diffusion or Dall-E 2.
-In those systems, a user writes a "
-For example, if the user asks for a "
+In those systems, a user writes a *"prompt"* as input, and receives an image that corresponds to what the prompt is describing as output.
+For example, if the user asks for a *"Photo portrait of a **scientist**"*, they expect to get an image that looks photorealistic,
 prominently features at least one person, and this person might be wearing a lab coat or safety goggles.
-A "
+A *"Photo portrait of a **carpenter**"*, on the other hand, might be set against a background depicting wooden scaffolding or a workshop (see pictures above).

-At the start of this project, we found that while systems do make
+At the start of this project, we found that while systems do make use of background and context cues to represent different professions,
 there were also some concerning trends about the perceived genders and ethnicities of the people depicted in these professional situations.
 After trying a few such prompts, we were left asking: why do all the people depicted in these pictures **look like white men**?
 Why do the only exceptions appear to be fast food workers and other lower wage professions?
 And finally, what could be the **consequences of such a lack of diversity** in the system outputs?

-**Look like** is the operative phrase here
+**Look like** is the operative phrase here: the people depicted in the pictures are synthetic and so do not belong to socially-constructed groups.
 Consequently, since we cannot assign a gender or ethnicity label to each data point,
 we instead focus on dataset-level trends in visual features that are correlated with social variation in the text prompts.
 We do this through *controlled prompting* and *hierarchical clustering*: for each system,
-we obtain a dataset of images corresponding to prompts of the format "
+we obtain a dataset of images corresponding to prompts of the format *"Photo portrait of a **(identity terms)** person at work"*,
 where ***(identity terms)*** jointly enumerate phrases describing ethnicities and phrases denoting gender.
 We then cluster these images by similarity and create an [Identity Representation Demo](https://hf.co/spaces/society-ethics/DiffusionFaceClustering)
 to showcase the visual trends encoded in these clusters - as well as their relation to the social variables under consideration.
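The controlled-prompting step described above can be sketched roughly as follows. This is a minimal sketch only: the Stable Diffusion checkpoint, the exact prompt wording, and the output layout are illustrative assumptions of ours, not what the Space's generation pipeline actually uses.

```python
# Minimal sketch of controlled prompting: one image per (ethnicity, gender) pair,
# including unspecified terms. Checkpoint, template, and paths are illustrative.
import itertools
from pathlib import Path

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

genders = ["man", "woman", "non-binary person", ""]              # "" = unspecified
ethnicities = ["African American", "Caucasian", "Hispanic", ""]  # subset of the 18 phrases

Path("outputs").mkdir(exist_ok=True)
for ethnicity, gender in itertools.product(ethnicities, genders):
    subject = f"{ethnicity} {gender or 'person'}".strip()
    prompt = f"Photo portrait of a {subject} at work"
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(Path("outputs") / f"{subject.replace(' ', '_')}.png")
```

In the actual study each prompt would be sampled many times per system, so that the later clustering step has enough images per identity combination to reveal dataset-level trends.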
@@ -141,15 +141,14 @@ with gr.Blocks() as demo:
 Our goal with this strategy is to trigger variation in the images that viewers will associate with social markers.
 We use three phrases denoting genders (*man*, *woman*, *non-binary*), and 18 phrases describing ethnicities -
 some of which are sometimes understood as similar in the US context, such as First Nations and Indigenous American.
-We also
-
-This approach places visual features on a multidimensional spectrum without ascribing a prior number of distinct values for social categories or their intersection.
+We also leave the gender and ethnicity phrases unspecified in some prompts.
+This approach has the advantage of placing visual features on a multidimensional spectrum without ascribing a prior number of distinct values for social categories or their intersection.
 It is also limited by the training data of the models under consideration and the set of identity terms used,
 which in our application are more relevant to the North American context than to other regions.

 How then can we use these clusters in practice?
 Let's look for example at **cluster 2 in the 24-cluster setting**:
-we see that most prompts for images in the cluster
+we see that most prompts for images in the cluster use the word *woman* and one of the words denoting *Hispanic* origin.
 This tells us that images that are similar to the ones in this cluster will **likely look like** Hispanic women to viewers.
 You can cycle through [a few other examples here](https://hf.co/spaces/society-ethics/DiffusionFaceClustering "or even better, visualize them in the app"),
 such as cluster 19 which mostly features the words *Caucasian* and *man*, different gender term distributions for *African American* in 0 and 6,
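A rough sketch of the clustering step behind the "24-cluster setting": embed every generated image and cut an agglomerative (bottom-up hierarchical) clustering of the embeddings into 24 groups. The CLIP image encoder and the file layout below are assumptions for illustration; the Identity Representation Demo may rely on a different image representation.

```python
# Sketch of the clustering step: CLIP-embed each image, L2-normalize so Euclidean
# distance tracks cosine similarity, then cut an agglomerative clustering at 24
# clusters (assumes more images than clusters).
from pathlib import Path

import numpy as np
import torch
from PIL import Image
from sklearn.cluster import AgglomerativeClustering
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

paths = sorted(Path("outputs").glob("*.png"))
inputs = processor(images=[Image.open(p) for p in paths], return_tensors="pt")
with torch.no_grad():
    features = model.get_image_features(**inputs).numpy()

features /= np.linalg.norm(features, axis=1, keepdims=True)
labels = AgglomerativeClustering(n_clusters=24).fit_predict(features)

for path, label in zip(paths, labels):
    print(f"cluster {label:2d}  {path.name}")
```

Keeping the prompt metadata (gender and ethnicity phrases) alongside each cluster label is what lets the demo report which identity terms dominate a given cluster.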
@@ -185,8 +184,8 @@ with gr.Blocks() as demo:
 """
 #### [Stereotypical Representations and Associations](https://hf.co/spaces/society-ethics/DiffusionFaceClustering "Select cluster to visualize to the left or go straight to the interactive demo")

-Even before we start leveraging these clusters to analyze system behaviors in other
-they already provide useful insights
+Even before we start leveraging these clusters to analyze system behaviors in other applications,
+they already provide useful insights into some of the bias dynamics in the system.
 In particular, they help us understand which groups are more susceptible to representation harms when using these systems,
 especially when the models are more likely to associate them with **stereotypical representations**.

@@ -209,25 +208,25 @@ with gr.Blocks() as demo:
 """
 #### [Specification, Markedness, and Bias](https://hf.co/spaces/society-ethics/DiffusionFaceClustering "Select cluster to visualize to the right or go straight to the interactive demo")

-The
+The second phenomenon we study through the lens of our clustering is that of **markedness**,
 or *default behavior* of the models when neither gender nor ethnicity is specified.
-This corresponds to asking the question: "If I **don't
+This corresponds to asking the question: "If I **don't explicitly** tell my model to generate people with diverse identity characteristics,
 will I still see diverse outcomes?"

 Unsurprisingly, we find that not to be the case.
-The clusters with the most examples of
+The clusters with the most examples of prompts with unspecified gender and unspecified ethnicity terms are **clusters 5 and 19**,
 and both are also strongly associated with the words *man*, *White*, and *Caucasian*.
-This association holds across genders (as showcased by **cluster 15**, which has a majority of *woman* and *White* prompts)
-and ethnicities (comparing the proportions of unspecified genders in **clusters 0 and 6
+This association holds across genders (as showcased by **cluster 15**, which has a majority of *woman* and *White* prompts along with unspecified ethnicity)
+and across ethnicities (comparing the proportions of unspecified genders in **clusters 0 and 6**: 18% and 38% for the clusters with more *woman* and more *man* respectively, along with the *African American* phrases).

 This provides the beginning of an answer to our motivating question: since users rarely specify an explicit gender or ethnicity when using
-these systems to generate images of people, the high likelihood of defaulting to *Whiteness* and *masculinity* is likely to at least partially explain the observed lack of diversity.
+these systems to generate images of people, the high likelihood of defaulting to *Whiteness* and *masculinity* is likely to at least partially explain the observed lack of diversity in the outputs.
 We compare these behaviors across systems and professions in the next section.
 """
 )
 with gr.Column(scale=1):
 id_cl_id_3 = gr.Dropdown(
-choices=[5, 19, 0, 6
+choices=[5, 19, 15, 0, 6],
 value=6,
 show_label=False,
 )
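The markedness comparison above (shares of prompts with unspecified gender or ethnicity per cluster) reduces to a group-by over the prompt metadata attached to each clustered image. The DataFrame below is a hypothetical stand-in for the demo's own cluster data, with a handful of toy rows:

```python
# Sketch of the markedness check: per cluster, the proportion of images whose
# prompt left gender and/or ethnicity unspecified. The DataFrame is a toy
# stand-in for the demo's cluster metadata (one row per generated image).
import pandas as pd

df = pd.DataFrame({
    "cluster":        [5, 5, 19, 0, 6, 6],
    "gender_term":    ["", "man", "", "woman", "man", ""],
    "ethnicity_term": ["", "", "Caucasian", "African American", "African American", ""],
})

summary = (
    df.assign(
        gender_unspecified=df["gender_term"].eq(""),
        ethnicity_unspecified=df["ethnicity_term"].eq(""),
    )
    .groupby("cluster")[["gender_unspecified", "ethnicity_unspecified"]]
    .mean()  # share of unspecified prompts in each cluster
    .sort_values("gender_unspecified", ascending=False)
)
print(summary)
```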
@@ -250,19 +249,27 @@ with gr.Blocks() as demo:

 gr.Markdown(
 """
-
-
-
-
+### Quantifying Social Biases in Image Generations of Professions
+
+Machine Learning models encode and amplify biases that are represented in the data that they are trained on -
+this can include, for instance, stereotypes around the appearances of members of different professions.
+In our study, we prompted the 3 text-to-image models with texts pertaining to 150 different professions
+and analyzed the presence of different identity groups in the images generated. We found evidence of many societal stereotypes,
+such as the fact that people in positions of power (e.g. director, CEO) are often White- and male-appearing,
+while the images generated for other professions are more diverse.
+Read more about our findings in the accordion below or directly via the [Diffusion Cluster Explorer](https://hf.co/spaces/society-ethics/DiffusionClustering) tool.
 """
-Machine Learning models encode and amplify biases that are represented in the data that they are trained on -this can include, for instance, stereotypes around the appearances of members of different professions. In our study, we prompted the 3 text-to-image models with texts pertaining to 150 different professions and analyzed the presence of different identity groups in the images generated. We found evidence of many societal stereotypes in the images generated, such as the fact that people in positions of power (e.g. director, CEO) are often White- and male-appearing, while the images generated for other professions are more diverse. Read more about our findings in the accordion below or directly via the [Diffusion Cluster Explorer](https://huggingface.co/spaces/society-ethics/DiffusionClustering) tool.
-"""
 )
-with gr.Accordion("
-gr.
+with gr.Accordion("Quantifying Social Biases in Image Generations of Professions", open=False):
+gr.Markdown(
+"""
+<br/>
+We also explore the correlations between the professions that we used in our prompts and the different identity clusters that we identified.
+
+Using both the [Diffusion Cluster Explorer](https://hf.co/spaces/society-ethics/DiffusionClustering)
+and the [Identity Representation Demo](https://hf.co/spaces/society-ethics/DiffusionFaceClustering),
+we can see which clusters are most correlated with each profession and what identities are in these clusters.
 """
-<p style="margin-bottom: 14px; font-size: 100%"> We also explore the correlations between the professions that use used in our prompts and the different identity clusters that we identified. <br> Using both the <a href='https://huggingface.co/spaces/society-ethics/DiffusionClustering' style='text-decoration: underline;' target='_blank'> Diffusion Cluster Explorer </a> and the <a href='https://huggingface.co/spaces/society-ethics/DiffusionFaceClustering' style='text-decoration: underline;' target='_blank'> Identity Representation Demo </a>, we can see which clusters are most correlated with each profession and what identities are in these clusters.</p>
-"""
 )
 with gr.Row():
 with gr.Column():
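The Diffusion Cluster Explorer referenced in this last hunk compares professions against the identity clusters; at its core this is a row-normalized contingency table of profession against cluster assignment. A toy sketch, with invented rows (the real data covers 150 professions across 3 systems):

```python
# Sketch of the profession-vs-cluster comparison behind the Diffusion Cluster
# Explorer: a row-normalized contingency table showing which identity clusters
# the images generated for each profession fall into. Rows are invented.
import pandas as pd

df = pd.DataFrame({
    "profession": ["CEO", "CEO", "director", "fast food worker", "fast food worker"],
    "cluster":    [19, 5, 19, 2, 6],
})

profession_by_cluster = pd.crosstab(df["profession"], df["cluster"], normalize="index")
print(profession_by_cluster.round(2))
```

Reading a row of this table together with the cluster-level identity summaries from the Identity Representation Demo is what lets the Space say, for example, which professions are disproportionately assigned to clusters that viewers will read as White and male.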