Creating & tokenizing a dataset for fine-tuning dolly-v2-7b

#6
by durapensa - opened

I chose dolly-v2-7b because it should be tuneable using a midrange VM w/GPU on GCE, Azure, etc.

I believe the example code for fine-tuning the base model Pythia-6.9B on databricks_dolly_15k to create dolly-v2-7b has not yet been published, but I'm experimenting anyway. I'm starting by tokenizing databricks_dolly_15k before attempting to tokenize my own dataset, and likely just need a pointer to the correct tutorial or other resource.

A snippet of my first experiment:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("HuggingFaceH4/databricks_dolly_15k", split="train")
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-7b")

dataset[0]["instruction"]
# 'When did Virgin Australia start operating?'

tokenizer(dataset[0]["instruction"])
# {'input_ids': [3039, 858, 8237, 6976, 1265, 6498, 32],
#  'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
```

What I do not know how to do is tokenize all of the features ('category', 'instruction', 'input', 'output'), stitching them together before converting to PyTorch format for ingestion by the PyTorch Trainer, a la https://huggingface.co/docs/transformers/training

Databricks org

No, the code is here, and the dataset: https://github.com/databrickslabs/dolly
Just use the existing training script, and plug in your data instead.
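For anyone else landing here: plugging your own data in means producing records shaped like databricks-dolly-15k. A minimal sketch of writing such data as JSONL (the field names are taken from the released dataset; the filename and the example content are placeholders — double-check the expected shape against the loader in the training script):

```python
import json

# Records shaped like the databricks-dolly-15k dataset.
# Field names per that dataset; the content here is placeholder data.
my_records = [
    {
        "instruction": "Summarize the meeting notes.",
        "context": "Acme and Globex were discussed. Acme ships in Q3.",
        "response": "Acme plans a Q3 ship date; Globex was also discussed.",
        "category": "summarization",
    },
]

with open("my_dataset.jsonl", "w") as f:
    for record in my_records:
        f.write(json.dumps(record) + "\n")

# Read it back the same way line-delimited JSON is normally loaded.
with open("my_dataset.jsonl") as f:
    loaded = [json.loads(line) for line in f]
```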

Thanks for that pointer - I did not look closely enough at dolly/training/trainer.py!

srowen changed discussion status to closed

Hello Sean!
Is the model available through GitHub already trained on the original 15k training instances?
I plan to fine-tune it, but I'm not sure whether I have to append my training dataset to the standard 15k data or train only on my dataset.
Please advise.

Databricks org

The dolly 15k dataset? Yes, you can see that on the model card and in the script. What to do depends on your intentions. Do you want an instruction-following model? Then start from dolly and do not use the 15k dataset for further tuning.

Thank you Sean.

I want to do summarization/extraction. So my prompts look like this -

< meeting notes >
Can you extract information about from the meeting notes?

Is this an instruction-following task? Or is it so specific that I should train dolly directly on the summarization training data?

Thank you for your continued patience and help.

Databricks org

That's instruction following. You should probably phrase it more specifically. Do you want a summary? Because that's how training instructions would have been phrased.

Thanks.
In a meeting, multiple companies/products are discussed. The aim is to extract information about a specific company or product from the meeting notes.
In my last post, I misformatted the prompt. Please check the corrected one below.

< meeting notes >
Can you extract information about < company > from the meeting notes?
