Data Utilities

is_conversational

trl.is_conversational

< source >

( example: dict ) → bool

Parameters

example (dict[str, Any]) — A single data entry of a dataset. The example can have different keys depending on the dataset type.

Returns

bool

True if the data is in a conversational format, False otherwise.

Check if the example is in a conversational format.

Examples:

>>> example = {"prompt": [{"role": "user", "content": "What color is the sky?"}]}
>>> is_conversational(example)
True
>>> example = {"prompt": "The sky is"})
>>> is_conversational(example)
False

apply_chat_template

trl.apply_chat_template

< source >

( example: dict tokenizer: PreTrainedTokenizerBase tools: typing.Optional[list[typing.Union[dict, typing.Callable]]] = None )

Apply a chat template to a conversational example along with the schema for a list of functions in tools.

For more details, see maybe_apply_chat_template().

maybe_apply_chat_template

trl.maybe_apply_chat_template

< source >

( example: dict tokenizer: PreTrainedTokenizerBase tools: typing.Optional[list[typing.Union[dict, typing.Callable]]] = None ) → dict[str, str]

Parameters

example (dict[str, list[dict[str, str]]) — Dictionary representing a single data entry of a conversational dataset. Each data entry can have different keys depending on the dataset type. The supported dataset types are:
- Language modeling dataset: "messages".
- Prompt-only dataset: "prompt".
- Prompt-completion dataset: "prompt" and "completion".
- Preference dataset: "prompt", "chosen", and "rejected".
- Preference dataset with implicit prompt: "chosen" and "rejected".
- Unpaired preference dataset: "prompt", "completion", and "label".
For keys "messages", "prompt", "chosen", "rejected", and "completion", the values are lists of messages, where each message is a dictionary with keys "role" and "content".
tokenizer (PreTrainedTokenizerBase) — Tokenizer to apply the chat template with.
tools (list[Union[dict, Callable]] or None, optional, defaults to None) — A list of tools (callable functions) that will be accessible to the model. If the template does not support function calling, this argument will have no effect

Returns

dict[str, str]

Formatted example with the chat template applied.

If the example is in a conversational format, apply a chat template to it.

Notes:

This function does not alter the keys, except for Language modeling dataset, where "messages" is replaced by "text".
In case of prompt-only data, if the last role is "user", the generation prompt is added to the prompt. Else, if the last role is "assistant", the final message is continued.

Example:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
>>> example = {
...     "prompt": [{"role": "user", "content": "What color is the sky?"}],
...     "completion": [{"role": "assistant", "content": "It is blue."}]
... }
>>> apply_chat_template(example, tokenizer)
{'prompt': '<|user|>\nWhat color is the sky?<|end|>\n<|assistant|>\n', 'completion': 'It is blue.<|end|>\n<|endoftext|>'}

maybe_convert_to_chatml

trl.maybe_convert_to_chatml

< source >

( example: dict ) → dict[str, list]

Parameters

example (dict[str, list]) — A single data entry containing a list of messages.

Returns

dict[str, list]

Example reformatted to ChatML style.

Convert a conversational dataset with fields from and value to ChatML format.

This function modifies conversational data to align with OpenAI’s ChatML format:

Replaces the key "from" with "role" in message dictionaries.
Replaces the key "value" with "content" in message dictionaries.
Renames "conversations" to "messages" for consistency with ChatML.

Example:

>>> from trl import maybe_convert_to_chatml
>>> example = {
...     "conversations": [
...         {"from": "user", "value": "What color is the sky?"},
...         {"from": "assistant", "value": "It is blue."}
...     ]
... }
>>> maybe_convert_to_chatml(example)
{'messages': [{'role': 'user', 'content': 'What color is the sky?'},
              {'role': 'assistant', 'content': 'It is blue.'}]}

extract_prompt

trl.extract_prompt

< source >

( example: dict )

Extracts the shared prompt from a preference data example, where the prompt is implicit within both the chosen and rejected completions.

For more details, see maybe_extract_prompt().

maybe_extract_prompt

trl.maybe_extract_prompt

< source >

( example: dict ) → dict[str, list]

Parameters

example (dict[str, list]) — A dictionary representing a single data entry in the preference dataset. It must contain the keys "chosen" and "rejected", where each value is either conversational or standard (str).

Returns

dict[str, list]

A dictionary containing:

"prompt": The longest common prefix between the “chosen” and “rejected” completions.
"chosen": The remainder of the “chosen” completion, with the prompt removed.
"rejected": The remainder of the “rejected” completion, with the prompt removed.

Extracts the shared prompt from a preference data example, where the prompt is implicit within both the chosen and rejected completions.

If the example already contains a "prompt" key, the function returns the example as is. Else, the function identifies the longest common sequence (prefix) of conversation turns between the “chosen” and “rejected” completions and extracts this as the prompt. It then removes this prompt from the respective “chosen” and “rejected” completions.

Examples:

>>> example = {
...     "chosen": [
...         {"role": "user", "content": "What color is the sky?"},
...         {"role": "assistant", "content": "It is blue."}
...     ],
...     "rejected": [
...         {"role": "user", "content": "What color is the sky?"},
...         {"role": "assistant", "content": "It is green."}
...     ]
... }
>>> extract_prompt(example)
{'prompt': [{'role': 'user', 'content': 'What color is the sky?'}],
 'chosen': [{'role': 'assistant', 'content': 'It is blue.'}],
 'rejected': [{'role': 'assistant', 'content': 'It is green.'}]}

Or, with the map method of datasets.Dataset:

>>> from trl import extract_prompt
>>> from datasets import Dataset
>>> dataset_dict = {
...     "chosen": [
...         [
...             {"role": "user", "content": "What color is the sky?"},
...             {"role": "assistant", "content": "It is blue."},
...         ],
...         [
...             {"role": "user", "content": "Where is the sun?"},
...             {"role": "assistant", "content": "In the sky."},
...         ],
...     ],
...     "rejected": [
...         [
...             {"role": "user", "content": "What color is the sky?"},
...             {"role": "assistant", "content": "It is green."},
...         ],
...         [
...             {"role": "user", "content": "Where is the sun?"},
...             {"role": "assistant", "content": "In the sea."},
...         ],
...     ],
... }
>>> dataset = Dataset.from_dict(dataset_dict)
>>> dataset = dataset.map(extract_prompt)
>>> dataset[0]
{'prompt': [{'role': 'user', 'content': 'What color is the sky?'}],
 'chosen': [{'role': 'assistant', 'content': 'It is blue.'}],
 'rejected': [{'role': 'assistant', 'content': 'It is green.'}]}

unpair_preference_dataset

trl.unpair_preference_dataset

< source >

( dataset: ~DatasetType num_proc: typing.Optional[int] = None desc: typing.Optional[str] = None ) → Dataset

Parameters

dataset (Dataset or DatasetDict) — Preference dataset to unpair. The dataset must have columns "chosen", "rejected" and optionally "prompt".
num_proc (int or None, optional, defaults to None) — Number of processes to use for processing the dataset.
desc (str or None, optional, defaults to None) — Meaningful description to be displayed alongside with the progress bar while mapping examples.

Returns

Dataset

The unpaired preference dataset.

Unpair a preference dataset.

Example:

>>> from datasets import Dataset
>>> dataset_dict = {
...     "prompt": ["The sky is", "The sun is"]
...     "chosen": [" blue.", "in the sky."],
...     "rejected": [" green.", " in the sea."]
... }
>>> dataset = Dataset.from_dict(dataset_dict)
>>> dataset = unpair_preference_dataset(dataset)
>>> dataset
Dataset({
    features: ['prompt', 'completion', 'label'],
    num_rows: 4
})
>>> dataset[0]
{'prompt': 'The sky is', 'completion': ' blue.', 'label': True}

maybe_unpair_preference_dataset

trl.maybe_unpair_preference_dataset

< source >

( dataset: ~DatasetType num_proc: typing.Optional[int] = None desc: typing.Optional[str] = None ) → Dataset or DatasetDict

Parameters

dataset (Dataset or DatasetDict) — Preference dataset to unpair. The dataset must have columns "chosen", "rejected" and optionally "prompt".
num_proc (int or None, optional, defaults to None) — Number of processes to use for processing the dataset.
desc (str or None, optional, defaults to None) — Meaningful description to be displayed alongside with the progress bar while mapping examples.

Returns

Dataset or DatasetDict

The unpaired preference dataset if it was paired, otherwise the original dataset.

Unpair a preference dataset if it is paired.

Example:

>>> from datasets import Dataset
>>> dataset_dict = {
...     "prompt": ["The sky is", "The sun is"]
...     "chosen": [" blue.", "in the sky."],
...     "rejected": [" green.", " in the sea."]
... }
>>> dataset = Dataset.from_dict(dataset_dict)
>>> dataset = unpair_preference_dataset(dataset)
>>> dataset
Dataset({
    features: ['prompt', 'completion', 'label'],
    num_rows: 4
})
>>> dataset[0]
{'prompt': 'The sky is', 'completion': ' blue.', 'label': True}

pack_examples

trl.pack_examples

< source >

( examples: dict seq_length: int ) → dict[str, list[list]]

Parameters

examples (dict[str, list[list]]) — Dictionary of examples with keys as strings and values as lists of lists.
seq_length (int) — Maximum sequence length.

Returns

dict[str, list[list]]

Dictionary of examples with keys as strings and values as lists of lists.

Pack examples into chunks of size seq_length.

Example:

>>> from trl import pack_examples
>>> examples = {
...     "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8]],
...     "attention_mask": [[0, 1, 1], [0, 0, 1, 1], [1]],
... }
>>> pack_examples(examples, seq_length=5)
{'input_ids': [[1, 2, 3, 4, 5], [6, 7, 8]], 'attention_mask': [[0, 1, 1, 0, 0], [1, 1, 1]]}
>>> pack_examples(examples, seq_length=2)
{'input_ids': [[1, 2], [3, 4], [5, 6], [7, 8]], 'attention_mask': [[0, 1], [1, 0], [0, 1], [1, 1]]}

pack_dataset

trl.pack_dataset

< source >

( dataset: ~DatasetType seq_length: int map_kwargs: typing.Optional[dict[str, typing.Any]] = None ) → Dataset or DatasetDict

Parameters

dataset (Dataset or DatasetDict) — Dataset to pack
seq_length (int) — Target sequence length to pack to.
map_kwargs (dict or None, optional, defaults to None) — Additional keyword arguments to pass to the dataset’s map method when packing examples.

Returns

Dataset or DatasetDict

The dataset with packed sequences. The number of examples may decrease as sequences are combined.

Pack sequences in a dataset into chunks of size seq_length.

Example:

>>> from datasets import Dataset
>>> examples = {
...     "input_ids": [[1, 2], [3, 4], [5, 6], [7]],
...     "attention_mask": [[1, 1], [0, 1], [1, 1], [1]],
... }
>>> dataset = Dataset.from_dict(examples)
>>> packed_dataset = pack_dataset(dataset, seq_length=4)
>>> packed_dataset[:]
{'input_ids': [[1, 2, 3, 4], [5, 6, 7]],
 'attention_mask': [[1, 1, 0, 1], [1, 1, 1]]}

truncate_dataset

trl.truncate_dataset

< source >

( dataset: ~DatasetType max_length: int map_kwargs: typing.Optional[dict[str, typing.Any]] = None ) → Dataset or DatasetDict

Parameters

dataset (Dataset or DatasetDict) — Dataset to truncate.
seq_length (int) — Maximum sequence length to truncate to.
map_kwargs (dict or None, optional, defaults to None) — Additional keyword arguments to pass to the dataset’s map method when truncating examples.

Returns

Dataset or DatasetDict

The dataset with truncated sequences.

Truncate sequences in a dataset to a specifed max_length.

Example:

>>> from datasets import Dataset
>>> examples = {
...     "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8]],
...     "attention_mask": [[0, 1, 1], [0, 0, 1, 1], [1]],
... }
>>> dataset = Dataset.from_dict(examples)
>>> truncated_dataset = truncate_dataset(dataset, max_length=2)
>>> truncated_dataset[:]
{'input_ids': [[1, 2], [4, 5], [8]],
 'attention_mask': [[0, 1], [0, 0], [1]]}

< > Update on GitHub

TRL

Data Utilities

is_conversational

trl.is_conversational

apply_chat_template

trl.apply_chat_template

maybe_apply_chat_template

trl.maybe_apply_chat_template

maybe_convert_to_chatml

trl.maybe_convert_to_chatml

extract_prompt

trl.extract_prompt

maybe_extract_prompt

trl.maybe_extract_prompt

unpair_preference_dataset

trl.unpair_preference_dataset

maybe_unpair_preference_dataset

trl.maybe_unpair_preference_dataset

pack_examples

trl.pack_examples

pack_dataset

trl.pack_dataset

truncate_dataset

trl.truncate_dataset