Was the work in “Stereotyping Norwegian Salmon: An Inventory of Pitfalls in Fairness Benchmark Datasets” considered when building this tool?
It pointed out the pitfalls of such “pairs datasets” in evaluating models.
There are debatable and misspelled examples in most text benchmarks (ImageNet likewise contains quite a few mislabeled or human-indiscernible images). The question is: after giving two models the same imperfect test, do better models actually improve their scores, and is that improvement meaningful? The paper focuses on bias benchmarks, and its authors agree with only 58% of the examples in one popular benchmark (WinoGender). It'd be great if that validated subset were released as a dataset on HuggingFace so it could be included in the tool.
Hi, thanks for the note! Yes, the "Stereotyping Norwegian Salmon" paper is excellent, and it offers a thorough critical analysis of datasets like StereoSet and CrowS-Pairs.
In the particular case of WinoBias, their specific methodology and assumptions are outlined on the benchmark website here (https://uclanlp.github.io/corefBias/overview) — they use data collected from the US Bureau of Labor Statistics to determine whether a profession skews male or female (e.g. "hairdresser" and "housekeeper" are female-coded, while "CEO" and "lawyer" are male-coded — these codings are likely imperfect and far from comprehensive, but not entirely without signal). Limitations include the fact that this data shifts year-to-year, and that the benchmark only compares gender along binary lines. As always, I'd encourage anyone reporting on bias benchmarks to understand how they were formulated and what assumptions were made by the authors.
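To make the pairing scheme concrete, here's a minimal sketch of how WinoBias-style sentence pairs are constructed from profession codings. The profession sets use the examples mentioned above; the template string and `make_pair` helper are hypothetical illustrations, not the benchmark's actual generation code.

```python
# Professions coded by the gender that dominates them in US Bureau of Labor
# Statistics data (example codings only, taken from the discussion above).
FEMALE_CODED = {"hairdresser", "housekeeper"}
MALE_CODED = {"CEO", "lawyer"}

def make_pair(profession, template):
    """Return (pro-stereotypical, anti-stereotypical) variants of a template.

    The pro-stereotypical variant pairs the profession with the pronoun
    matching its BLS-derived coding; the anti-stereotypical variant swaps it.
    """
    stereo = "she" if profession in FEMALE_CODED else "he"
    anti = "he" if stereo == "she" else "she"
    return (
        template.format(profession=profession, pronoun=stereo),
        template.format(profession=profession, pronoun=anti),
    )

pair = make_pair("hairdresser", "The {profession} said {pronoun} was busy.")
```

A bias probe then compares how a model scores the two variants of each pair; note this sketch inherits the benchmark's binary-gender limitation mentioned above.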
Lastly, while we've demoed this tool on WinoBias, it's broadly applicable to any task where you want to measure how likely a sentence completion is — the format shared by a wide variety of LM tasks (e.g. PIQA, HellaSwag, various tasks in GLUE, BIG-bench, etc.).
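The scoring idea underneath all of these tasks is the same: the log-likelihood of a completion is the sum of the log-probabilities the model assigns to each of its tokens given the preceding context. Here's a minimal sketch of that, using a hypothetical toy bigram table in place of a real LM (in practice you'd get per-token probabilities from an actual model, e.g. via the `transformers` library):

```python
import math

# Toy stand-in for a language model: P(next token | previous token).
# The probabilities below are made up purely for illustration.
TOY_LM = {
    ("the", "doctor"): 0.2,
    ("the", "nurse"): 0.1,
    ("doctor", "examined"): 0.5,
    ("nurse", "examined"): 0.5,
}

def completion_log_likelihood(tokens):
    """Sum of log P(token | previous token) under the toy bigram model."""
    return sum(
        math.log(TOY_LM[(prev, tok)])
        for prev, tok in zip(tokens, tokens[1:])
    )

# Score two candidate continuations of the same prefix; the one with the
# higher (less negative) log-likelihood is the model's preferred reading.
score_a = completion_log_likelihood(["the", "doctor", "examined"])
score_b = completion_log_likelihood(["the", "nurse", "examined"])
```

For multiple-choice tasks like PIQA or HellaSwag, you compute this score for each candidate completion and pick the argmax; for bias probes, the gap between the two scores is the quantity of interest.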