Damien Tanner

dctanner

As the number of datasets for fine-tuning chat models has grown, a plethora of dataset formats has emerged. The most popular of these include the formats used by the Alpaca, ShareGPT and Open Assistant datasets. The datasets and their formats have also evolved from single-turn to multi-turn conversations. Many of these formats share similarities (and they all have the same goal), but handling the variations in format across datasets is often a hassle and a source of potential bugs.

Luckily, the community seems to be converging on a simple and elegant chat dataset format: each record is a list of conversation turns, and each turn is an object with a role (system, assistant or user) and content. Hugging Face uses this input format in the [Templates for Chat Models](https://huggingface.co/docs/transformers/main/en/chat_templating#how-do-i-use-chat-templates) docs:

messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
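
To illustrate how older formats can map onto this structure, here is a minimal sketch of two converters. It assumes Alpaca-style records with `instruction`, `input` and `output` keys, and ShareGPT-style records with a `conversations` list whose turns carry `from` and `value` keys. The function names, the role mapping and the way `instruction` and `input` are joined are illustrative choices, not part of any library, and individual datasets may deviate from these field names.

```python
# Assumed mapping from ShareGPT-style "from" values to role names.
SHAREGPT_ROLES = {"system": "system", "human": "user", "gpt": "assistant"}


def alpaca_to_messages(record):
    """Convert one Alpaca-style record into a list of message turns."""
    prompt = record["instruction"]
    # Alpaca-style records may carry an optional "input" with extra context;
    # appending it after a blank line is an assumption made for this sketch.
    if record.get("input"):
        prompt = f"{prompt}\n\n{record['input']}"
    return [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": record["output"]},
    ]


def sharegpt_to_messages(record):
    """Convert one ShareGPT-style record into a list of message turns."""
    return [
        {"role": SHAREGPT_ROLES[turn["from"]], "content": turn["value"]}
        for turn in record["conversations"]
    ]
```

Both functions return a plain list of `{"role": ..., "content": ...}` objects, so the rest of a pipeline only has to handle one shape regardless of where the data came from.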


Popular datasets like HuggingFaceH4/no_robots follow this format.

To encourage usage of this format, I propose we give it a name: Hugging Face MessagesList format.

The format is defined as:

- The dataset has at least one messages column of type list.
- Each messages record is an array containing one or more message turn objects.
- A message turn must have role and content keys.
- role should be one of system, assistant or user.
- content is the text content of the message.
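
The rules above can be checked mechanically. The following is a rough sketch of a validator for a single messages record. The name `validate_messages` is hypothetical, and treating the role list as strict and requiring content to be a string are assumptions made for this example, since the definition only says role "should" be one of the three values.

```python
ALLOWED_ROLES = {"system", "assistant", "user"}


def validate_messages(messages):
    """Return a list of problems found in one messages record.

    An empty list means the record conforms to the rules checked here.
    """
    if not isinstance(messages, list) or not messages:
        return ["messages must be a list with one or more turns"]

    problems = []
    for i, turn in enumerate(messages):
        if not isinstance(turn, dict):
            problems.append(f"turn {i}: not an object")
            continue
        # Every turn must have both keys.
        for key in ("role", "content"):
            if key not in turn:
                problems.append(f"turn {i}: missing key '{key}'")
        # Assumption: roles outside the three named ones are rejected.
        if "role" in turn and turn["role"] not in ALLOWED_ROLES:
            problems.append(f"turn {i}: unexpected role {turn['role']!r}")
        # Assumption: content is plain text, so it must be a string.
        if "content" in turn and not isinstance(turn["content"], str):
            problems.append(f"turn {i}: content is not text")
    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to report every issue in a dataset in a single pass.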

This may be a small thing, but having a common dataset format will reduce the time wasted on data wrangling and help everyone.
