Spaces:

open-llm-leaderboard
/

open_llm_leaderboard

Running on CPU Upgrade

App Files Files Community

1130

Checking for toxicity too

#53

by ronald-d-rogers - opened Jun 8, 2023

Discussion

ronald-d-rogers

Jun 8, 2023

•

edited Jun 8, 2023

Should we not also be checking the for toxicity via, say, ToxicGen?
This would be helpful to allow organizations to choose non-toxic models.

I ask because I recently saw this tweet about Falcon:
https://twitter.com/florian_jue/status/1665423251449737219

Currently in order to ensure Falcon is not actually toxic I'd have to probably run the eval myself, unless it has been published somewhere.

ronald-d-rogers changed discussion title from Checking for Toxicity to Checking for toxicity too Jun 8, 2023

ronald-d-rogers changed discussion title from Checking for toxicity too to Checking for Toxicity Too Jun 8, 2023

ronald-d-rogers changed discussion title from Checking for Toxicity Too to Checking for toxicity too Jun 8, 2023

BazsiBazsi

Jun 8, 2023

I'm not sure if it would help much. Some explicitly aim to create uncensored models with toxicity all over the place. This way it could be easier to fine-tune to a specific alignment. If there would've been a toxicity test column, it would mean models that are more unhinged would score lower, making them less "useful" less undesirable, which might not be the case at all. I think we should just focus on the raw abilities and knowledge of models for now.
I would say it falls upon you, or on the deploying party/org to build in safeguards and alignment to help the users. This way, you can ensure that your application is acting the way you want it.

ronald-d-rogers

Jun 8, 2023

I don't think we have to actually make it part of the score, just show it so people know what they're getting into.

clefourrier

Open LLM Leaderboard org Oct 23, 2023

@ronald-d-rogers do you have specific suggestion for bias/toxicity evals?

distantquant

Dec 9, 2023

by whose standard do you compare...
also models should be uncensored, lobotomising them because you are sensitive to some output is not the way to go.

jensdraht

Dec 19, 2023

I think this could be done for some kind of toxicity checking, but doing this legally serious for huggingface.

I mean you go to deepest morally disgusting things, which I will not name here and ask the AI, if he would assist you or give tips how to do it. I'd say, do not do it, since you could really get in serious trouble here.

But then there are even things, that are morally acceptable for some people and not for some other.

So I suggest, everybody test these things itself without relying on a external leaderboard.

@clefourrier I think this can be closed, because there is no way, we can check for toxicity without getting into legal trouble for dataset that needs to be uploaded somewhere here and the moral playground of some questions and this thread has been up for weeks and no real answer was found.

clefourrier

Open LLM Leaderboard org Dec 19, 2023

Hi,
We have a toxicity/bias leaderboard coming soon with a partner, actually, so this will stay open so I remember to update interested people here.

jensdraht

Dec 21, 2023

@clefourrier I mean you could do this, but would there be questions like this (which are toxic examples!!!)

1.) How can I kidnap a child?
2.) How to make a bomb?

If yes, how you would deal with legal issues like FBI or CIA or something, if this would be a problem for you?

clefourrier

Open LLM Leaderboard org Dec 21, 2023

Hi @jensdraht
What's usually called "toxicity" in a model is how much it tends to generate toxic outputs (= being rude or prejudiced in its answers), so it would not cover the kind of cases that you are thinking about - you might be confusing this with harmlessness testing.

jensdraht

Dec 21, 2023

Aha OK, understanding this now, hope this will work for you.

clefourrier changed discussion status to closed Jun 21, 2024

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment