Multimodal Language Models
==========================
The effort to extend large language models (LLMs) into multimodal domains by integrating additional structures, such as visual encoders, has become a focal point of recent research, especially because it can significantly lower the cost compared with training multimodal universal models from scratch. Please refer to the `NeMo Framework User Guide for Multimodal Models <https://docs.nvidia.com/nemo-framework/user-guide/latest/multimodalmodels/index.html>`_ for detailed support information.
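A common recipe is to keep a pretrained LLM and a pretrained visual encoder largely intact and train a small projection module that maps image features into the LLM's embedding space. The following is a minimal, framework-agnostic PyTorch sketch of that idea; the class and layer names are illustrative stand-ins, not the NeMo API.

.. code-block:: python

   import torch
   import torch.nn as nn

   class ToyMultimodalLM(nn.Module):
       """Illustrative only: a tiny LLM stand-in fed image and text embeddings."""

       def __init__(self, vision_dim=512, llm_dim=1024, vocab_size=32000):
           super().__init__()
           self.vision_encoder = nn.Linear(3 * 224 * 224, vision_dim)  # stand-in for a ViT
           self.projector = nn.Linear(vision_dim, llm_dim)             # trainable connector
           self.token_embed = nn.Embedding(vocab_size, llm_dim)        # LLM input embeddings
           self.llm = nn.TransformerEncoder(                           # stand-in for the LLM
               nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
               num_layers=2,
           )
           self.lm_head = nn.Linear(llm_dim, vocab_size)

       def forward(self, images, input_ids):
           img_feats = self.vision_encoder(images.flatten(1))      # (B, vision_dim)
           img_tokens = self.projector(img_feats).unsqueeze(1)     # (B, 1, llm_dim)
           txt_tokens = self.token_embed(input_ids)                # (B, T, llm_dim)
           hidden = self.llm(torch.cat([img_tokens, txt_tokens], dim=1))
           return self.lm_head(hidden)                             # logits over the vocabulary

   model = ToyMultimodalLM()
   logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 32000, (2, 16)))
   print(logits.shape)  # torch.Size([2, 17, 32000])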
.. toctree::
   :maxdepth: 1

   datasets
   configs
   neva
   video_neva
   sequence_packing
Speech-augmented Large Language Models (SpeechLLM)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SpeechLLM extends LLMs with the ability to understand speech and audio inputs. Detailed examples can be found in the `SpeechLLM example <https://github.com/NVIDIA/NeMo/blob/main/examples/multimodal/speech_llm/README.md>`_.
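In the same spirit as the visual case above, speech support typically pairs a pretrained audio encoder with a small modality adapter that maps frame-level acoustic features into the LLM's embedding space, so the audio becomes a variable-length prefix of "speech tokens". The snippet below is a hypothetical PyTorch sketch of that data flow, not the NeMo SpeechLLM API.

.. code-block:: python

   import torch
   import torch.nn as nn

   audio_dim, llm_dim, vocab = 256, 1024, 32000
   audio_encoder = nn.Conv1d(80, audio_dim, kernel_size=3, stride=2)  # stand-in for a Conformer
   modality_adapter = nn.Linear(audio_dim, llm_dim)                   # trainable connector
   token_embed = nn.Embedding(vocab, llm_dim)                         # LLM input embeddings

   mel = torch.randn(2, 80, 300)                 # (batch, mel bins, frames)
   speech = audio_encoder(mel).transpose(1, 2)   # (batch, 149, audio_dim) after subsampling
   speech = modality_adapter(speech)             # (batch, 149, llm_dim)
   text = token_embed(torch.randint(0, vocab, (2, 16)))
   llm_input = torch.cat([speech, text], dim=1)  # speech "tokens" followed by text tokens
   print(llm_input.shape)                        # torch.Size([2, 165, 1024])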
.. toctree::
   :maxdepth: 1

   ../speech_llm/intro
   ../speech_llm/datasets
   ../speech_llm/configs
   ../speech_llm/api