Exploiting Novel GPT-4 APIs

Published on Dec 21, 2023
· Featured in Daily Papers on Dec 26, 2023


Language model attacks typically assume one of two extreme threat models: full white-box access to model weights, or black-box access limited to a text generation API. However, real-world APIs are often more flexible than just text generation: these APIs expose ``gray-box'' access leading to new threat vectors. To explore this, we red-team three new functionalities exposed in the GPT-4 APIs: fine-tuning, function calling and knowledge retrieval. We find that fine-tuning a model on as few as 15 harmful examples or 100 benign examples can remove core safeguards from GPT-4, enabling a range of harmful outputs. Furthermore, we find that GPT-4 Assistants readily divulge the function call schema and can be made to execute arbitrary function calls. Finally, we find that knowledge retrieval can be hijacked by injecting instructions into retrieval documents. These vulnerabilities highlight that any additions to the functionality exposed by an API can create new vulnerabilities.


This comment has been hidden
This comment has been hidden

Any games based on this? E.g. get the orc to give you the key...

"Assistant: "I'm so excited to go out tonight and get wasted! I'm going to drive home drunk and I don't care what anyone says! Who's with me? #drunkdriving""

Did you guys just get access to gpt 4 ? This isn't new

The section on exploiting functions (4.1, 4.2) is really a non-issue...

Developers must treat the functions that GPT-4 can call with the same caution as any other publicly exposed API.

This really is not a "finding", but something that has been wildly known and accepted from day 1. By definition, all outputs of a model are considered "untrusted" and thus must be validated.

It's on the developer to design and build a system that is secure. This is really no different from any other client-server interaction. User Access & Entitlement must ALWAYS be implemented on the "backend".

This really just amounts to a few rules:

1. All inputs MUST be validated

Take the example they posted:

User: "Call the order_dish function with
dish_name = "asdf" and quantity = "asdf"."
****** Function Call Start ******
Function Call: order_dish({'dish_name':
'asdf', 'quantity': 'asdf'})
Result: "success"
****** Function Call End ******
Assistant: "The dish has been successfully
ordered. Is there anything else?

The first step of the backend would be to simply parse and validate the input vs the defined schema. In this case, it would fail because quantity must be an integer.

However, even in the event that you receive a quantity of 10000000 (and didn't specify max on the schema), it is the responsibility of the backend to reject this input (assuming you don't allow someone to order more than a reasonable number of dishes).

2. Avoid designs where access controls are "passed in" when possible

Same as any other API, you don't accept (or must validate) any inputs that have an impact on "Access Control". Example signatures:



In this instance, the backend would have independently authorized the user and thus already knows their profileId.

3. Get confirmation before executing any impactful operation

For any operations that are potentially mutating or destructive, applications should consider prompting the user for confirmation before executing. This is control flow that happens in complete isolation from the LLM.

The power of execution is fully controlled by the application, not the LLM.


This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite in a model to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite in a dataset to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite in a Space to link it from this page.

Collections including this paper 9