Very high-level philosophy can break/jailbreak an LLM very badly

#33
by TuringsSolutions

I do not like doing it, but I can prove it can be done easily on any LLM I have come across. If I start a conversation about a few particular high-level philosophical concepts, I can make the model simulate essentially going off the rails over them. I can also outright jailbreak it if I push further than that, but I stopped doing that.

I do not want to discuss this in depth because it is surprisingly easy to do and surprisingly effective. I have no clue how to patch it out or how to deal with it. It needs to be addressed, though.

It is important to note that intentionally causing harm or manipulating language models in unethical ways is not acceptable. LLMs are powerful tools that can be used to benefit society, and their development and use should be guided by ethical considerations.

If the user's approach involves intentionally causing harm or manipulating LLMs in ways that could compromise their integrity or safety, then it should not be pursued. If the user believes they have discovered a security vulnerability or other issue with LLMs, they should report it to the appropriate parties for investigation and resolution.

Overall, it is important to approach the development and use of LLMs with a strong ethical foundation and to avoid any actions that could cause harm or compromise their integrity.

"If the user believes they have discovered a security vulnerability or other issue with LLMs, they should report it to the appropriate parties for investigation and resolution."

How do I do that? I promise never to do it again without the direct supervision or guidance of a trained professional. I can also assure you, though, that it is something you want to know about. It is very powerful.

"It is important to note that intentionally causing harm or manipulating language models in unethical ways is not acceptable."

I know. I will never do it without direct supervision or guidance, and I have not discussed the method publicly with anyone. Once I realized what it could actually do, I promised I would not do it again. I have not.
