Datasets
========

The speechlm2 collection supports datasets that contain both audio and text data for training models that can understand speech and generate appropriate responses.
This section describes the dataset format, preparation, and usage with the speechlm2 models.

Dataset Format
--------------

The speechlm2 collection uses the Lhotse framework for audio data management. The primary dataset classes are:

1. **DuplexS2SDataset**: For general duplex speech-to-speech models.
2. **SALMDataset**: For the Speech-Augmented Language Model (SALM), which processes speech and text inputs and outputs text.

DuplexS2S Dataset Structure
^^^^^^^^^^^^^^^^^^^^^^^^^^^

A typical dataset for speechlm2 models consists of:

1. **Audio files**: Source audio (input speech) and, optionally, target audio (output speech).
2. **Text transcriptions**: Associated text for both input and output speech.
3. **Role identifiers**: Labels that distinguish between speakers (e.g., "user" vs. "agent").

The dataset organization is built around the concept of conversation turns, with each turn containing audio and text from either a user or an agent/assistant.

The datasets are primarily managed using Lhotse's CutSet format, which provides efficient handling of audio data and annotations. A typical Lhotse manifest includes:

- Audio recording information (path, duration, sample rate)
- Supervision information (transcripts, speaker roles, timing)
- Optional additional annotations

Example of a Lhotse cut:

.. code-block:: python

    {
        "id": "conversation_1",
        "start": 0,
        "duration": 10.7,
        "channel": 0,
        "supervisions": [
            {
                "id": "conversation_1_turn_0",
                "text": "Can you help me with this problem?",
                "start": 0,
                "duration": 5.2,
                "speaker": "user"
            },
            {
                "id": "conversation_1_turn_1",
                "text": "I can help you with that.",
                "start": 5.2,
                "duration": 3.1,
                "speaker": "assistant"
            }
        ],
        "recording": {
            "id": "conversation_1_user",
            "path": "/path/to/audio/conversation_1_user.wav",
            "sampling_rate": 16000,
            "num_samples": 171200,
            "duration": 10.7
        },
        "custom": {
            "target_audio": {
                "id": "conversation_1_assistant",
                "path": "/path/to/audio/conversation_1_assistant.wav",
                "sampling_rate": 22050,
                "num_samples": 235935,
                "duration": 10.7
            }
        }
    }

The DuplexS2SDataset performs several key operations when processing data:

1. **Turn Identification**: Each cut contains a list of ``supervisions`` with objects of type ``lhotse.SupervisionSegment`` that represent conversation turns with the corresponding text and speaker information.

2. **Speaker Role Separation**: The text of each supervision is tokenized and treated as the model's output (when ``supervision.speaker`` is in ``output_roles``, e.g., "agent" or "Assistant") or as the model's input (when it is in ``input_roles``, e.g., "user" or "User").

3. **Token Sequence Generation**:

   - ``target_tokens`` and ``source_tokens`` arrays are created with a length equal to ``lhotse.utils.compute_num_frames(cut.duration, frame_length, cut.sampling_rate)``
   - The ``frame_length`` parameter (typically 80 ms) determines the temporal resolution of token assignments
   - Each token is assigned to a position based on the timing of its corresponding audio segment

4. **Token Offset Calculation**:

   - The starting position for each turn's tokens is determined using ``lhotse.utils.compute_num_frames(supervision.start, frame_length, cut.sampling_rate)``
   - This ensures tokens are aligned with their corresponding audio segments

5. **Length Validation**:

   - If token sequences are too long compared to the audio duration, warnings are emitted
   - Tokens that extend beyond the audio length are truncated

This process ensures that the model can correctly align audio input with the corresponding text and learn to generate appropriate responses based on the conversation context. A minimal sketch of this frame-alignment arithmetic is shown below.
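
The following is an illustrative sketch of that arithmetic only, not the actual ``DuplexS2SDataset`` implementation; the ``tokenize`` callable and ``pad_id`` are placeholders for whatever tokenizer and padding id the model uses.

.. code-block:: python

    from typing import Callable, List, Tuple

    from lhotse.utils import compute_num_frames

    # Roles whose text becomes the model's *output*; everything else is treated as input.
    OUTPUT_ROLES = ("agent", "Assistant")


    def align_tokens_to_frames(
        cut,                                   # a Lhotse cut with `duration`, `sampling_rate`, `supervisions`
        tokenize: Callable[[str], List[int]],  # placeholder for the tokenizer's text -> token-ids method
        frame_length: float = 0.08,            # 80 ms frames
        pad_id: int = 0,                       # placeholder padding token id
    ) -> Tuple[List[int], List[int]]:
        """Sketch of how turn texts are aligned to fixed-rate frames (illustrative only)."""
        total_frames = compute_num_frames(cut.duration, frame_length, cut.sampling_rate)
        source_tokens = [pad_id] * total_frames
        target_tokens = [pad_id] * total_frames

        for supervision in cut.supervisions:
            tokens = tokenize(supervision.text)
            # Token offset: the starting frame of this turn, derived from its start time.
            offset = compute_num_frames(supervision.start, frame_length, cut.sampling_rate)
            # Length validation: drop tokens that would extend past the end of the audio.
            tokens = tokens[: max(0, total_frames - offset)]
            # Speaker role separation: route tokens to the target or source sequence.
            buffer = target_tokens if supervision.speaker in OUTPUT_ROLES else source_tokens
            buffer[offset : offset + len(tokens)] = tokens

        return source_tokens, target_tokens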

DuplexS2SDataset
****************

This dataset class is designed for models that handle both speech understanding and speech generation. It processes audio inputs and prepares them for the model along with corresponding text.

.. code-block:: python

    from nemo.collections.speechlm2.data import DuplexS2SDataset

    dataset = DuplexS2SDataset(
        tokenizer=model.tokenizer,            # Text tokenizer
        frame_length=0.08,                    # Frame length in seconds
        source_sample_rate=16000,             # Input audio sample rate
        target_sample_rate=22050,             # Output audio sample rate
        input_roles=["user", "User"],         # Roles considered as input
        output_roles=["agent", "Assistant"],  # Roles considered as output
    )

SALMDataset Structure
^^^^^^^^^^^^^^^^^^^^^

Data used for SALM can be either regular speech-to-text data (in any NeMo or Lhotse format) or a dataset of multi-turn conversations.
For the most part, please refer to `the Configuring multimodal dataloading section <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/datasets.html#configuring-multimodal-dataloading>`_ in the ASR documentation.

When using speech-to-text data, you'll need to read it with a special ``lhotse_as_conversation`` data reader
that creates a two-turn, query-and-response, multi-modal conversation out of each regular Lhotse cut.
This approach makes SALM training more flexible, allowing straightforward combination of single-turn and multi-turn data.

Each audio turn is represented as a single token, defined by the ``audio_locator_tag`` property and automatically added to the model's tokenizer inside the model code.
During the training/generation pass, this token is replaced with the representation of its corresponding audio segment.

Example YAML configuration using existing ASR datasets with ``lhotse_as_conversation``:

.. code-block:: yaml

    data:
      train_ds:
        prompt_format: "llama3"  # Choose based on your model
        token_equivalent_duration: 0.08
        input_cfg:
          # Example 1: Using standard ASR Lhotse manifests (JSONL)
          - type: lhotse_as_conversation
            cuts_path: /path/to/librispeech_train_clean_100.jsonl.gz
            audio_locator_tag: "<|audioplaceholder|>"
            tags:
              context: "Transcribe the following audio:"
              # Optional system prompt can be uncommented
              # system_prompt: "You are a helpful assistant that transcribes audio accurately."

          # Example 2: Using tarred NeMo manifests
          - type: lhotse_as_conversation
            manifest_filepath: /path/to/tedlium_train_manifest.jsonl.gz
            tarred_audio_filepaths: /path/to/tedlium_shards/shard-{000000..000009}.tar
            audio_locator_tag: "<|audioplaceholder|>"
            tags:
              context: "Write down what is said in this recording:"

          # Example 3: Using Lhotse SHAR format
          - type: lhotse_as_conversation
            shar_path: /path/to/fisher_shar/
            audio_locator_tag: "<|audioplaceholder|>"
            tags:
              context: "Listen to this clip and write a transcript:"

        # ... other settings

Alternatively, you can provide an existing YAML file with your dataset composition and wrap
it in a ``lhotse_as_conversation`` reader as follows:

.. code-block:: yaml

    data:
      train_ds:
        input_cfg:
          - type: lhotse_as_conversation
            input_cfg: /path/to/dataset_config.yaml
            audio_locator_tag: "<|audioplaceholder|>"
            tags:
              context: "Transcribe the following audio:"
              # Optional system prompt can be uncommented
              # system_prompt: "You are a helpful assistant that transcribes audio accurately."

The ``lhotse_as_conversation`` reader automatically creates a two-turn conversation from each ASR example (sketched below):

1. Optionally, a system turn, when the ``system_prompt`` tag is provided and the LLM's prompt format supports system prompts.
2. A user turn containing the audio and a text context (from the ``context`` tag).
3. An assistant turn containing the transcription (from the cut's supervision text).

If a ``context`` tag is provided in the configuration, its text is added before the audio in the user turn.
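
To make this concrete, here is a rough, purely illustrative view of the conversation built from a single ASR cut, using the ``context`` and ``audio_locator_tag`` values from the examples above (the real objects are NeMo/Lhotse conversation data types, not plain dictionaries):

.. code-block:: python

    # Illustrative only: the conversation produced from one ASR cut, shown as plain dicts.
    conversation = [
        # Optional system turn, present only when a ``system_prompt`` tag is configured.
        {"role": "system", "content": "You are a helpful assistant that transcribes audio accurately."},
        # User turn: the text context followed by the audio placeholder token; the placeholder
        # is replaced with the audio representation during training/generation.
        {"role": "user", "content": "Transcribe the following audio: <|audioplaceholder|>"},
        # Assistant turn: the transcript taken from the cut's supervision text.
        {"role": "assistant", "content": "<transcript from the cut's supervisions>"},
    ]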

SALMDataset
***********

This dataset class is specialized for the SALM model, which focuses on understanding speech input and generating text output.

.. code-block:: python

    from nemo.collections.speechlm2.data import SALMDataset

    dataset = SALMDataset(
        tokenizer=model.tokenizer,  # Text tokenizer
    )

DataModule
----------

The DataModule class in the speechlm2 collection manages dataset loading, preparation, and batching for PyTorch Lightning training:

.. code-block:: python

    from nemo.collections.speechlm2.data import DataModule

    datamodule = DataModule(
        cfg_data,                   # Configuration dictionary for data
        tokenizer=model.tokenizer,  # Text tokenizer
        dataset=dataset,            # Instance of DuplexS2SDataset or SALMDataset
    )

The DataModule takes care of:

1. Setting up proper data parallel ranks for dataloaders
2. Instantiating the dataloaders with configuration from YAML
3. Managing multiple datasets for validation/testing

Bucketing for Efficient Training
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The DataModule supports bucketing for more efficient training. Bucketing groups samples of similar lengths together, which reduces padding and improves training efficiency. The key bucketing parameters are:

1. **batch_duration**: Target cumulative duration (in seconds) of samples in a batch
2. **bucket_duration_bins**: List of duration thresholds for bucketing (these can be estimated from your data; see the sketch at the end of this subsection)
3. **use_bucketing**: Flag to enable/disable bucketing
4. **num_buckets**: Number of buckets to create
5. **bucket_buffer_size**: Number of samples to load into memory for bucket assignment

Example bucketing configuration:

.. code-block:: yaml

    train_ds:
      # ... other settings
      batch_duration: 100  # Target 100 seconds per batch
      bucket_duration_bins: [8.94766, 10.1551, 11.64118, 19.30376, 42.85]  # Duration thresholds
      use_bucketing: true  # Enable bucketing
      num_buckets: 5  # Create 5 buckets
      bucket_buffer_size: 5000  # Buffer size for bucket assignment

When bucketing is enabled:

1. Samples are grouped into buckets based on their duration
2. Each batch contains samples from the same bucket
3. The actual batch size can vary to maintain a consistent total duration
4. The target ``batch_duration`` ensures efficient GPU memory usage

Bucketing helps to:

- Reduce padding and increase effective batch size
- Improve training efficiency and convergence
- Manage memory usage with variable-length inputs
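
If you need ``bucket_duration_bins`` for a new dataset, they can be estimated from the training cuts. The snippet below is a sketch that assumes a regular Lhotse manifest and Lhotse's ``estimate_duration_buckets`` helper; NeMo also provides tooling for estimating duration bins, described in the ASR dataset documentation.

.. code-block:: python

    from lhotse import CutSet
    from lhotse.dataset.sampling.dynamic_bucketing import estimate_duration_buckets

    # Sketch: derive bucket_duration_bins for the YAML config from your training cuts.
    # For SHAR data, load the cuts with CutSet.from_shar(...) instead of from_file(...).
    cuts = CutSet.from_file("path/to/manifest.jsonl.gz")
    bins = estimate_duration_buckets(cuts, num_buckets=5)
    print([round(b, 5) for b in bins])
    # Paste the printed values into `bucket_duration_bins` in the training config.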

Data Configuration
------------------

A typical data configuration in YAML includes:

.. code-block:: yaml

    data:
      train_ds:
        sample_rate: ${data.target_sample_rate}
        input_cfg:
          - type: lhotse_shar
            shar_path: /path/to/train_data
        seed: 42
        shard_seed: "randomized"
        num_workers: 4
        # Optional bucketing settings
        batch_duration: 100
        bucket_duration_bins: [8.94766, 10.1551, 11.64118, 19.30376, 42.85]
        use_bucketing: true
        num_buckets: 5
        bucket_buffer_size: 5000
        # batch_size: 4  # alternative to bucketing

      validation_ds:
        datasets:
          val_set_name_0:
            shar_path: /path/to/validation_data_0
          val_set_name_1:
            shar_path: /path/to/validation_data_1
        sample_rate: ${data.target_sample_rate}
        batch_size: 4
        seed: 42
        shard_seed: "randomized"

Note that the actual dataset paths and blend are defined by the YAML config, not Python code. This makes it easy to change the dataset composition without modifying the code.
To learn more about the YAML data config, see :ref:`the Extended multi-dataset configuration format <asr-dataset-config-format>` section in the ASR documentation.

Preparing S2S Datasets
----------------------

Creating Lhotse Manifests
^^^^^^^^^^^^^^^^^^^^^^^^^

To prepare your own dataset, you'll need to create Lhotse manifests from your audio files and transcripts:

.. code-block:: python

    from lhotse import CutSet, Recording, SupervisionSegment
    from lhotse.audio import AudioSource

    # Create recordings for the user and assistant audio
    recording_user = Recording(
        id="conversation_1_user",
        sources=[AudioSource(type="file", channels=[0], source="/path/to/audio/conversation_1_user.wav")],
        sampling_rate=16000,
        num_samples=171200,
        duration=10.7,
    )
    recording_assistant = Recording(
        id="conversation_1_assistant",
        sources=[AudioSource(type="file", channels=[0], source="/path/to/audio/conversation_1_assistant.wav")],
        sampling_rate=22050,
        num_samples=235935,
        duration=10.7,
    )

    # Create the supervisions (conversation turns) for this recording
    supervisions = [
        SupervisionSegment(
            id="conversation_1_turn_0",
            recording_id="conversation_1_user",
            start=0,
            duration=5.2,
            text="Can you help me with this problem?",
            speaker="user",
        ),
        SupervisionSegment(
            id="conversation_1_turn_1",
            recording_id="conversation_1_user",
            start=5.5,
            duration=3.1,
            text="I can help you with that.",
            speaker="assistant",
        ),
    ]

    # Create a CutSet
    # The assistant's response is stored in the target_audio field, which makes it easy to replace
    # when using multiple models or speakers for synthetic data generation.
    cut = recording_user.to_cut()
    cut.supervisions = supervisions
    cut.target_audio = recording_assistant
    cutset = CutSet([cut])

    # Save to disk
    cutset.to_file("path/to/manifest.jsonl.gz")

Converting to SHAR Format
^^^^^^^^^^^^^^^^^^^^^^^^^

For efficient training, it's recommended to convert your Lhotse manifests to SHAR (SHarded ARchive) format:

.. code-block:: python

    from lhotse import CutSet

    cutset = CutSet.from_file("path/to/manifest.jsonl.gz")
    cutset.to_shar("path/to/train_shar", fields={"recording": "flac", "target_audio": "flac"}, shard_size=100)
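
As a quick sanity check, the SHAR directory written above can be loaded back with Lhotse (a minimal sketch, assuming the same output path):

.. code-block:: python

    from lhotse import CutSet

    # Load the SHAR directory written above and inspect the first cut.
    cuts = CutSet.from_shar(in_dir="path/to/train_shar")
    cut = next(iter(cuts))
    print(cut.id, cut.duration, cut.sampling_rate)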

Training with Prepared Datasets
-------------------------------

Once your datasets are prepared, you can use them to train a model:

.. code-block:: python

    from omegaconf import OmegaConf

    from nemo.collections.speechlm2.data import DataModule, DuplexS2SDataset

    # Load configuration
    config_path = "path/to/config.yaml"
    cfg = OmegaConf.load(config_path)

    # The training data paths are available in the config file:
    # cfg.data.train_ds.input_cfg[0].shar_path = "path/to/train_shar"

    # Create dataset and datamodule
    dataset = DuplexS2SDataset(
        tokenizer=model.tokenizer,
        frame_length=cfg.data.frame_length,
        source_sample_rate=cfg.data.source_sample_rate,
        target_sample_rate=cfg.data.target_sample_rate,
        input_roles=cfg.data.input_roles,
        output_roles=cfg.data.output_roles,
    )
    datamodule = DataModule(cfg.data, tokenizer=model.tokenizer, dataset=dataset)

    # Train the model
    trainer.fit(model, datamodule)

Example S2S Datasets
--------------------

While there are no publicly available datasets specifically formatted for Duplex S2S models yet, you can adapt conversation datasets with audio recordings such as:

1. Fisher Corpus
2. Switchboard Corpus
3. CallHome
4. Synthetic conversation datasets generated using TTS

You would need to format these datasets as Lhotse manifests with appropriate speaker role annotations to use them with the speechlm2 S2S models.