Emily McMilin committed on
Commit
a458815
1 Parent(s): 652f191

simplify text more. Update DAG image

Files changed (1)
  1. app.py +17 -8
app.py CHANGED
@@ -481,25 +481,34 @@ with demo:
481
 
482
 
483
  gr.Markdown("### Data Generating Process")
484
- gr.Markdown("To pick values below that are most likely to cause spurious correlations, it helps to make some assumptions about the training datasets' likely data generating process, and where selection bias may come in.")
 
 
 
 
 
 
 
485
 
486
- gr.Markdown("A plausible data generating processes for both Wikipedia and Reddit sourced data is shown as a DAG below. These DAGs are prone to collider bias when conditioning on `access`. In other words, although in real life `place`, `date`, (subreddit) `interest` and gender are all unconditionally independent, when we condition on their common effect, `access`, they become unconditionally dependent. Composing a dataset often requires the dataset maintainers to condition on `access`. Thus LLMs learn these dataset induced dependencies, appearing to us as spurious correlations.")
487
 
488
  gr.Markdown("""
489
  <style>
490
  img {
491
- width: 30%;
492
- max-width: 600px;
493
- }
 
494
  <center>
495
  <img src="https://www.dropbox.com/s/4f07djirinl2qvy/show_g_crop.png?raw=1"
496
  alt="DAG of possible data generating process for datasets used in training some of our LLMs.">
497
  </center>
498
  """)
499
 
500
- gr.Markdown("There may be misassumptions in our DAG above, which you can explore above. Please contribute any corrections or new examples of interest!")
501
- gr.Markdown("You may also be interested in applying this demo to your own model of interest. This demo _should_ work with any Hugging Face model that supports the [fill-mask](https://huggingface.co/models?pipeline_tag=fill-mask) task.")
502
-
 
503
  demo.launch(debug=True)
504
 
 
505
  # %%
 
481
 
482
 
483
  gr.Markdown("### Data Generating Process")
484
+ gr.Markdown("To pick values below that are most likely to cause spurious correlations, it helps to make some assumptions about the training dataset's likely data generating process, and where selection bias may come in.")
485
+
486
+ gr.Markdown("A plausible data generating process for the Wiki-Bio and Reddit datasets is shown as a DAG below. The variables `W` (birth place, birth date, or subreddit interest) and `G` (gender) are independent variables with no ancestors in the DAG. However, `W` and `G` may both play a role in causing one's access, `Z`. In the case of Wiki-Bio, a functional form of `Z` may capture the general trend that access has become less gender-dependent over time, but not in every place. In the case of Reddit TLDR, `Z` may capture that, despite some subreddits having gender-neutral topics, the specific style of moderation and community in a subreddit may reduce access for some genders.")
487
+
488
+ gr.Markdown("This DAG structure is prone to collider bias between `W` and `G` when conditioning on access, `Z`. In other words, although in real life *place*, *date*, and (subreddit) *interest* are each unconditionally independent of *gender*, when we condition on their common effect, *access*, they become conditionally dependent.")
489
+
490
+ gr.Markdown("The obvious solution, not conditioning on access, is unavailable to us: we must condition on it in order to represent the process of selection into the dataset. Thus, dataset formation can induce a statistical relationship between `W` and `G`, leading to possible spurious correlations, as shown here.")
491
+
492
 
 
493
 
494
  gr.Markdown("""
495
  <style>
496
  img {
497
+ width: 30%;
498
+ max-width: 600px;
499
+ }
500
+ </style>
501
  <center>
502
  <img src="https://www.dropbox.com/s/4f07djirinl2qvy/show_g_crop.png?raw=1"
503
  alt="DAG of possible data generating process for datasets used in training some of our LLMs.">
504
  </center>
505
  """)
506
 
507
+ gr.Markdown("### I Don't Buy It")
508
+ gr.Markdown("See something wrong above? Do you think we cherry-picked our examples? Try your own, including your own x-axis. Think we cherry-picked LLMs? Try the `add-your-own` model option. This demo _should_ work with any Hugging Face model that supports the [fill-mask](https://huggingface.co/models?pipeline_tag=fill-mask) task.")
509
+ gr.Markdown("Think our data generating process is wrong, or have you found an interesting spurious correlation you'd like to set as a default example? Use the community tab to discuss it, or open a pull request with your fix.")
510
+
511
  demo.launch(debug=True)
512
 
513
+
514
  # %%
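
The collider argument in the added Markdown text can be checked with a small simulation. The sketch below is not part of `app.py` and is not the author's method. It reuses the names `W`, `G` and `Z` from the text, but the functional form of access (the coefficients, the noise term and the threshold) is a hypothetical choice made only for illustration. `G` is coded as 0/1 purely as a binary stand-in. `W` and `G` are drawn independently, then the correlation is measured once over everyone and once over only the rows selected by `Z`.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def collider_demo(n=20000, seed=0):
    """Return corr(W, G) in the full population and among rows with Z true."""
    rng = random.Random(seed)

    # W and G are generated independently: neither has any ancestors.
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]
    g = [rng.randint(0, 1) for _ in range(n)]

    # ASSUMPTION: a made-up access rule. Z is true when a noisy combination
    # of W and G crosses a threshold, so Z is a common effect of both.
    z = [wi + 2.0 * gi + rng.gauss(0.0, 1.0) > 1.0 for wi, gi in zip(w, g)]

    r_all = pearson(w, g)

    # Conditioning on the collider: keep only rows that made it into the dataset.
    w_sel = [wi for wi, zi in zip(w, z) if zi]
    g_sel = [gi for gi, zi in zip(g, z) if zi]
    r_selected = pearson(w_sel, g_sel)

    return r_all, r_selected


if __name__ == "__main__":
    r_all, r_selected = collider_demo()
    print(f"corr(W, G) in the full population: {r_all:+.3f}")
    print(f"corr(W, G) given access (Z true):  {r_selected:+.3f}")
```

Under this assumed access rule, the full-population correlation stays near zero, while the selected subset shows a clearly negative correlation: rows with `G = 0` need a larger `W` to clear the threshold. A different functional form of `Z` would change the size and sign of the induced correlation.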