SpeechLM2
=========

.. note::

   The SpeechLM2 collection is still in active development and the code is likely to keep changing.

SpeechLM2 is a collection that augments pre-trained Large Language Models (LLMs) with speech understanding and generation capabilities.

The collection is designed to be compact and efficient, and to support easy swapping of different LLMs through HuggingFace AutoModel.
It has first-class support for dynamic batch sizes via Lhotse and for various model parallelism techniques (e.g., FSDP2, Tensor Parallel, Sequence Parallel) via the PyTorch DTensor API; see the training example below for model parallelism in use.

We currently support three main model types:

* SALM (Speech-Augmented Language Model) - a simple but effective approach to augmenting a pre-trained LLM with speech understanding capabilities.
* DuplexS2SModel - a full-duplex speech-to-speech model with an ASR encoder that directly predicts discrete audio codes.
* DuplexS2SSpeechDecoderModel - a variant of DuplexS2SModel with a separate transformer decoder for speech generation.

Using Pretrained Models
-----------------------

After :ref:`installing NeMo<installation>`, you can load and use a pretrained speechlm2 model as follows:

.. code-block:: python

    import nemo.collections.speechlm2 as slm

    # The argument may be a local checkpoint path or a pretrained model name.
    model = slm.models.SALM.from_pretrained("model_name_or_path")

    # Switch to evaluation mode for inference.
    model = model.eval()

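If a GPU is available, you can move the loaded model onto it before running inference. A minimal sketch, assuming the loaded model supports standard PyTorch device movement:

.. code-block:: python

    import torch

    # Assumption: the loaded model behaves like a regular PyTorch module.
    if torch.cuda.is_available():
        model = model.to("cuda")
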
Inference with Pretrained Models
--------------------------------

SALM
****

You can run inference with a loaded pretrained SALM model as follows:

.. code-block:: python

    import torch
    import torchaudio
    import nemo.collections.speechlm2 as slm

    model = slm.models.SALM.from_pretrained("path/to/pretrained_checkpoint").eval()

    # Load the input audio and resample it to 16 kHz if necessary.
    audio_path = "path/to/audio.wav"
    audio_signal, sample_rate = torchaudio.load(audio_path)
    if sample_rate != 16000:
        audio_signal = torchaudio.functional.resample(audio_signal, sample_rate, 16000)
        sample_rate = 16000

    audio_signal = audio_signal.to(model.device)
    audio_len = torch.tensor([audio_signal.shape[1]], device=model.device)

    # The audio locator tag marks the position in the prompt where the audio is inserted.
    prompt = [{"role": "user", "content": model.audio_locator_tag}]

    with torch.no_grad():
        output = model.generate(
            prompts=[prompt],
            audios=audio_signal,
            audio_lens=audio_len,
            generation_config=None,  # optionally pass a generation config to control decoding
        )

    # Convert the generated token IDs back into text.
    response = model.tokenizer.ids_to_text(output[0])
    print(f"Model response: {response}")

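The prompt may also mix instruction text with the audio locator tag; the tag simply marks where the audio is placed within the turn. For example (the instruction wording here is illustrative):

.. code-block:: python

    # Combine a textual instruction with the audio placeholder in a single turn.
    prompt = [{"role": "user", "content": f"Transcribe the following: {model.audio_locator_tag}"}]
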
DuplexS2SModel
**************

You can run inference with a loaded pretrained DuplexS2SModel as follows:

.. code-block:: python

    import torch
    import torchaudio
    import nemo.collections.speechlm2 as slm

    model = slm.models.DuplexS2SModel.from_pretrained("path/to/pretrained_checkpoint").eval()

    # Load the input audio and resample it to 16 kHz if necessary.
    audio_path = "path/to/audio.wav"
    audio_signal, sample_rate = torchaudio.load(audio_path)
    if sample_rate != 16000:
        audio_signal = torchaudio.functional.resample(audio_signal, sample_rate, 16000)
        sample_rate = 16000

    audio_signal = audio_signal.to(model.device)
    audio_len = torch.tensor([audio_signal.shape[1]], device=model.device)

    results = model.offline_inference(
        input_signal=audio_signal,
        input_signal_lens=audio_len,
    )

    # The results contain both the generated text and the generated audio.
    transcription = results["text"][0]
    audio = results["audio"][0]

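The generated audio can be written to disk with ``torchaudio``. Note that the output sample rate is a property of the model's audio codec rather than of the input audio; the value below is only a placeholder, so check your model's configuration for the actual rate:

.. code-block:: python

    # Placeholder sample rate; the true value depends on the model's audio codec.
    output_sample_rate = 22050
    torchaudio.save("response.wav", audio.detach().cpu().unsqueeze(0), output_sample_rate)
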
Training a Model
----------------

This example demonstrates how to train a SALM model. The remaining models can be trained in a similar manner.

.. code-block:: python

    from omegaconf import OmegaConf
    from lightning.pytorch import Trainer
    from lightning.pytorch.strategies import ModelParallelStrategy

    import nemo.collections.speechlm2 as slm
    from nemo.collections.speechlm2.data import SALMDataset, DataModule
    from nemo.utils.exp_manager import exp_manager

    config_path = "path/to/config.yaml"
    cfg = OmegaConf.load(config_path)

    # The total number of devices must equal
    # data_parallel_size * tensor_parallel_size (here, 2).
    trainer = Trainer(
        max_steps=100000,
        accelerator="gpu",
        devices=2,
        precision="bf16-true",
        strategy=ModelParallelStrategy(data_parallel_size=2, tensor_parallel_size=1),
        limit_train_batches=1000,
        val_check_interval=1000,
        use_distributed_sampler=False,  # Lhotse manages sampling and batching itself
        logger=False,
        enable_checkpointing=False,
    )

    exp_manager(trainer, cfg.get("exp_manager", None))

    model = slm.models.SALM(OmegaConf.to_container(cfg.model, resolve=True))

    dataset = SALMDataset(tokenizer=model.tokenizer)
    datamodule = DataModule(cfg.data, tokenizer=model.tokenizer, dataset=dataset)

    trainer.fit(model, datamodule)

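You can also adjust the loaded configuration programmatically before constructing the trainer and model. A minimal sketch using OmegaConf's merge facilities; the override key shown is illustrative and must match whatever structure your YAML actually uses:

.. code-block:: python

    from omegaconf import OmegaConf

    # Hypothetical override: the key "trainer.max_steps" is only an example
    # and depends on the structure of your config.yaml.
    overrides = OmegaConf.from_dotlist(["trainer.max_steps=1000"])
    cfg = OmegaConf.merge(cfg, overrides)
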
Example Using Command-Line Training Script
------------------------------------------

Alternatively, you can train a model using the training scripts provided in the examples directory:

.. code-block:: bash

    # Train a SALM model with the provided example configuration.
    python examples/speechlm2/salm_train.py \
        --config-path=examples/speechlm2/conf \
        --config-name=salm

    # Evaluate a trained checkpoint on a test manifest.
    python examples/speechlm2/salm_eval.py \
        pretrained_name=/path/to/checkpoint \
        inputs=/path/to/test_manifest \
        batch_size=64 \
        max_new_tokens=128 \
        output_manifest=generations.jsonl

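The output manifest is a JSONL file, so each line holds one JSON record. A small sketch for inspecting the generations; the field names inside each record depend on the evaluation script's output format, so inspect a record before relying on specific keys:

.. code-block:: python

    import json

    # Print each generation record; field names vary with the script's output format.
    with open("generations.jsonl") as f:
        for line in f:
            print(json.loads(line))
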
For more detailed information on training at scale, model parallelism, and SLURM-based training, see :doc:`training and scaling <training_and_scaling>`.

Collection Structure
--------------------

The speechlm2 collection is organized into the following key components (see the import sketch after this list):

- **Models**: implementations of SALM, DuplexS2SModel, and DuplexS2SSpeechDecoderModel
- **Modules**: audio perception and speech generation modules
- **Data**: dataset classes and data loading utilities

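The import paths below illustrate where these components live. The model classes and data utilities appear in the earlier examples; the exact contents of each subpackage are covered in the pages listed under SpeechLM2 Documentation below:

.. code-block:: python

    import nemo.collections.speechlm2 as slm
    from nemo.collections.speechlm2.data import SALMDataset, DataModule

    # Model implementations are exposed under the models submodule.
    salm_cls = slm.models.SALM
    duplex_cls = slm.models.DuplexS2SModel
    decoder_variant_cls = slm.models.DuplexS2SSpeechDecoderModel
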
SpeechLM2 Documentation
-----------------------

For more information, see the additional sections in the SpeechLM2 docs:

.. toctree::
   :maxdepth: 1

   models
   datasets
   configs
   training_and_scaling