Creating & tokenizing a dataset for fine-tuning dolly-v2-7b

#6
by durapensa - opened

I chose dolly-v2-7b because it should be tuneable using a midrange VM w/GPU on GCE, Azure, etc.

I believe the example code for fine-tuning the base model Pythia-6.9B on databricks_dolly_15k to create dolly-v2-7b has not yet been published, but I'm experimenting anyway. I'm starting by tokenizing databricks_dolly_15k before attempting to tokenize my own dataset, and likely just need a pointer to the correct tutorial or other resource.

A snippet of my first experiment:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("HuggingFaceH4/databricks_dolly_15k", split="train")
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-7b")

dataset[0]["instruction"]
# 'When did Virgin Australia start operating?'

tokenizer(dataset[0]["instruction"])
# {'input_ids': [3039, 858, 8237, 6976, 1265, 6498, 32],
#  'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
```

What I do not know how to do is tokenize all of the features ('category', 'instruction', 'input', 'output'), stitching them together before converting to PyTorch format for ingestion by the PyTorch Trainer, a la https://huggingface.co/docs/transformers/training

Databricks org

No, the code is here, and the dataset: https://github.com/databrickslabs/dolly
Just use the existing training script, and plug in your data instead.
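For anyone else landing here: plugging your own data in means producing records shaped like databricks-dolly-15k. A minimal sketch of writing such data as JSONL (the field names are taken from the released dataset; the filename and the example content are placeholders — double-check the expected shape against the loader in the training script):

```python
import json

# Records shaped like the databricks-dolly-15k dataset.
# Field names per that dataset; the content here is placeholder data.
my_records = [
    {
        "instruction": "Summarize the meeting notes.",
        "context": "Acme and Globex were discussed. Acme ships in Q3.",
        "response": "Acme plans a Q3 ship date; Globex was also discussed.",
        "category": "summarization",
    },
]

with open("my_dataset.jsonl", "w") as f:
    for record in my_records:
        f.write(json.dumps(record) + "\n")

# Read it back the same way line-delimited JSON is normally loaded.
with open("my_dataset.jsonl") as f:
    loaded = [json.loads(line) for line in f]
```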

Thanks for that pointer - I did not look closely enough at dolly/training/trainer.py!

srowen changed discussion status to closed

Hello Sean!
Is the model available through GitHub already trained on the original 15k training instances?
I plan to fine-tune it, but I'm not sure whether I have to append my training dataset to the standard 15k data or train only on my dataset.
Please advise.

Databricks org

The dolly 15k dataset? Yes, you can see that on the model card and in the script. What to do depends on your intentions. Do you want an instruction-following model? Then start from dolly and do not use the 15k dataset for further tuning.

Thank you Sean.

I want to do summarization/extraction. So my prompts look like this -

< meeting notes >
Can you extract information about from the meeting notes?

Is this an instruction-following task? Or is it so specific that I should train dolly directly on the summarization training data?

Thank you for your continued patience and help.

Databricks org

That's instruction following. You should probably phrase it more specifically. Do you want a summary? Because that's how training instructions would have been phrased.

Thanks.
In a meeting, multiple companies/products are discussed. The aim is to extract information about a specific company or product from the meeting notes.
In my last post, I misformatted the prompt. Please check the corrected one below.

< meeting notes >
Can you extract information about < company > from the meeting notes?
