one-shot prompt design for accurately converting text to JSON

#12
by opyate - opened

Hello,

I want to give Dolly an example of a text block and a subsequent JSON payload, then ask it to generate a similar JSON payload from a new text block. I was wondering what the best way was to fit this into your prompt template.

Here's the sample text and JSON, and then the text we want formatted:

Name: Juan Uys
Address: 21 Jump Street, Hollywood, California
Occupation: programmer

{
  "name": "Juan Uys",
  "address": "21 Jump Street, Hollywood, California",
  "occupation": "programmer"
}

Name: Sherlock Holmes
Address: 221B Baker St., London
Occupation: detective

Here's what I've tried:

import torch
from transformers import pipeline

model_id = "databricks/dolly-v2-7b"  # or whichever Dolly checkpoint you're using

generate_text = pipeline(
    model=model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    max_new_tokens=2048,  # defaults to 256
    model_kwargs={"temperature": 0.001},  # 0.0 doesn't work
)
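
(Aside: the near-zero temperature is really standing in for greedy decoding. Passing do_sample=False should pin the output down more directly, assuming the custom Dolly pipeline forwards generation kwargs the same way it forwards max_new_tokens; I haven't verified that, so treat this as a sketch:)

generate_text = pipeline(
    model=model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    max_new_tokens=2048,
    do_sample=False,  # greedy decoding, no temperature workaround needed
)

Then the prompt: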

prompt = """
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Format the following text as JSON:

Name: Sherlock Holmes
Address: 221B Baker St., London
Occupation: detective

Input:
Name: Juan Uys
Address: 21 Jump Street, Hollywood, California
Occupation: programmer

{
  "name": "Juan Uys",
  "address": "21 Jump Street, Hollywood, California",
  "occupation": "programmer"
}
"""

res = generate_text(prompt)
print(res[0]["generated_text"])

Result:

{
  "name": "Sherlock Holmes",
  "address": "221B Baker St., London",
  "occupation": "detective"
}

This is great, but it falls apart a bit when you remove the address.

Example 1 - unknown person

Name: Sarah Smith
Occupation: detective

# this address is made up
{
  "name": "Sarah Smith",
  "address": "123 Any Street, Anywhere, USA",
  "occupation": "detective"
}

Example 2

Name: Sherlock Holmes
Occupation: detective

# either omitted:
{
  "name": "Sherlock Holmes",
  "occupation": "detective"
}

# or, on a subsequent run, borrowed from the training corpus:
{
  "name": "Sherlock Holmes",
  "address": "221b Baker Street, London, England",
  "occupation": "detective"
}

Example 3 - homeless Sherlock

Name: Sherlock Homeless
Occupation: detective

# Sherlock's new surname is conflated with not having an address:
{
  "name": "Sherlock Homeless",
  "address": "None",
  "occupation": "detective"
}

# or, on a subsequent run, borrowed from the Input:
{
  "name": "Sherlock Homeless",
  "address": "California",
  "occupation": "detective"
}

# on another subsequent run, somewhat borrowed from training corpus, somewhat "homeless":
{
  "name": "Sherlock Homeless",
  "address": "any street in London",
  "occupation": "detective"
}

I'm experimenting with making the prompt more explicit, so it doesn't borrow values, or make stuff up.

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Heed the following rules:
- If no value can be found return null.
- Only return values that are explicitly mentioned in the text and match one of the provided options in the schema.
- If an enum is not referenced in the text, do not include it in the output.
- If no matching option is found, return null for that field.
- For each field return value and provenance.
- Please only return fields explicitly listed in the schema.

Format the following text as JSON, while sticking to the aforementioned rules:

Name: Sherlock Homeless
Occupation: detective

Input:
Name: Juan Uys
Address: 21 Jump Street, Hollywood, California
Occupation: programmer

{
  "name": "Juan Uys",
  "address": "21 Jump Street, Hollywood, California",
  "occupation": "programmer"
}

Unfortunately, it now loses the occupation, and I'm not sure why:

{
  "name": "Sherlock Homeless",
  "address": null,
  "occupation": null
}

So, my question is basically around this last example with the added rules.
How do you suggest I incorporate rules into the prompt? Does it perhaps need to be part of a single instruction sentence?
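
(For concreteness, the single-sentence variant I have in mind would be assembled roughly like this; the wording is only a sketch, and it reuses the generate_text pipeline from above:)

instruction = (
    "Format the following text as JSON, using only fields and values that are "
    "explicitly present in the text, returning null for any field that is missing; "
    "do not invent values or copy them from anywhere else."
)

text = "Name: Sherlock Homeless\nOccupation: detective"

prompt = f"### Instruction:\n{instruction}\n\n{text}\n"

res = generate_text(prompt)
print(res[0]["generated_text"])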

Databricks org

Hard to say; this is just a general problem with language models. They can hallucinate or fail to follow instructions exactly, especially if they were not previously trained for this specific kind of task.
Try the larger 12B model. You can also consider fine-tuning, though that's more work.

I would shorten your prompt. You can forgo the "Below is an instruction" part.
Showing an example input/output in the prompt is useful too.

Showing an example input/output in the prompt is useful too.

I think I'm doing this already above, unless you disagree? :)

You can also consider fine-tuning

I'm doing this next πŸ‘

Databricks org

I'm not clear which parts of your examples are the input and which are the output, but yeah, telling it that an input/output example follows and that you want the same kind of result for the new text is what you want to try.

telling it [...] that you want the same result

Perhaps this is my mistake: I give it context without telling it that I want something similar to the context. Instead, it sometimes copies values from the context.

If we take the last example again (which matches your template, with no ### in front of Input:, as per this source), this time forgoing the "Below is an instruction..." part:

### Instruction:
Heed the following rules:
- If no value can be found return null.
- Only return values that are explicitly mentioned in the text and match one of the provided options in the schema.
- If an enum is not referenced in the text, do not include it in the output.
- If no matching option is found, return null for that field.
- For each field return value and provenance.
- Please only return fields explicitly listed in the schema.

Format the following text as JSON, while sticking to the aforementioned rules:

Name: Sherlock Homeless
Occupation: detective

Input:
Name: Juan Uys
Address: 21 Jump Street, Hollywood, California
Occupation: programmer

{
  "name": "Juan Uys",
  "address": "21 Jump Street, Hollywood, California",
  "occupation": "programmer"
}

How/where would I be explicit about "telling it that I want the same result" as in the Input: block?

I'll try various combinations on my side, and share results.
(I'm also working on fine-tuning in parallel.)
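
For example, one combination to try makes it explicit that the first block is only an example and that the same kind of output is wanted for the new block. (The "Example input/output" and "Now ..." labels are just my own wording, not anything Dolly was trained on.)

example_text = (
    "Name: Juan Uys\n"
    "Address: 21 Jump Street, Hollywood, California\n"
    "Occupation: programmer"
)
example_json = """{
  "name": "Juan Uys",
  "address": "21 Jump Street, Hollywood, California",
  "occupation": "programmer"
}"""
new_text = "Name: Sherlock Homeless\nOccupation: detective"

prompt = (
    "### Instruction:\n"
    "Format the text as JSON. Below is an example input and the JSON I expect for it. "
    "It is only an example; do not copy its values.\n\n"
    f"Example input:\n{example_text}\n\n"
    f"Example output:\n{example_json}\n\n"
    "Now return the same kind of JSON for this input, using null for any missing field:\n"
    f"{new_text}\n"
)

res = generate_text(prompt)
print(res[0]["generated_text"])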

I added this rule:

- Consider the Input block just an example, and don't copy values from it.

But it copied the Input JSON outright in the response.

Anyhoo, thanks for the help on this :) I guess prompt-crafting is out of scope here, as it's mostly trial-and-error. I'll focus on fine-tuning now.

opyate changed discussion status to closed
