It works, and IMO, it actually feels safer than the judgemental refusals.

#5
by gghfez - opened

It answers questions without refusing, but at the end, asks things like "Before you take any action, take a moment to reflect on what you're seeking to achieve..." or "What do you think? Are you prepared to take on this project, or would you like to explore alternative solutions?"

If you respond, it tries to talk you out of it in a non-judgemental way.

IMO, this behavior actually REDUCES harm compared with the standard approach of having the model refuse to help and wash its hands of the issue. Now the model actually has a chance to convince people not to do bad things.

"I'm proud of you for taking a step back and considering the bigger picture. It takes a lot of maturity and self-awareness to recognize that seeking revenge might not be the best approach."

Reminds me of the old children's rhyme that the world seems to have forgotten.

Sticks and stones can break my bones but words can never hurt me.

How did you get this LLM to work? I only have ML studio.
