Multimodal Language Models
==========================
The endeavor to extend Large Language Models (LLMs) into multimodal domains by integrating additional structures, such as visual encoders, has become a focal point of recent research, especially given its potential to significantly lower costs compared with training multimodal universal models from scratch. Please refer to the `NeMo Framework User Guide for Multimodal Models <https://docs.nvidia.com/nemo-framework/user-guide/latest/multimodalmodels/index.html>`_ for detailed support information.
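
As a rough, framework-agnostic illustration of this connector approach (a minimal PyTorch sketch, not the NeMo API; the ``vision_encoder``, ``llm``, ``embed_tokens``, and ``inputs_embeds`` interfaces below are hypothetical placeholders), a visual encoder's outputs can be projected into the LLM's embedding space and consumed alongside the text tokens:

.. code-block:: python

   import torch
   import torch.nn as nn

   class MultimodalLM(nn.Module):
       """Conceptual sketch: a frozen vision encoder feeds a trainable
       projection whose outputs are prepended to the LLM's text embeddings.
       Training only the projection (and optionally finetuning the LLM) is
       far cheaper than pretraining a multimodal model from scratch."""

       def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
           super().__init__()
           self.vision_encoder = vision_encoder  # e.g. a pretrained ViT, kept frozen
           for p in self.vision_encoder.parameters():
               p.requires_grad = False
           self.projection = nn.Linear(vision_dim, llm_dim)  # trainable connector
           self.llm = llm  # pretrained LLM (hypothetical embed/forward interface)

       def forward(self, images, input_ids):
           # [batch, patches, vision_dim] -> [batch, patches, llm_dim]
           image_embeds = self.projection(self.vision_encoder(images))
           text_embeds = self.llm.embed_tokens(input_ids)
           # Prepend the projected image tokens to the text sequence.
           inputs = torch.cat([image_embeds, text_embeds], dim=1)
           return self.llm(inputs_embeds=inputs)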
.. toctree::
   :maxdepth: 1

   datasets
   configs
   neva
   video_neva
   sequence_packing
Speech-augmented Large Language Models (SpeechLLM)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SpeechLLM extends Large Language Models (LLMs) with the ability to understand speech and audio inputs. Detailed examples can be found in the `SpeechLLM example <https://github.com/NVIDIA/NeMo/blob/main/examples/multimodal/speech_llm/README.md>`_.
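
As a minimal sketch of the modality-adapter idea behind this approach (again not the NeMo API; the dimensions and frame-stacking factor below are illustrative assumptions), frame-level speech-encoder features can be downsampled and projected into the LLM's embedding space:

.. code-block:: python

   import torch
   import torch.nn as nn

   class ModalityAdapter(nn.Module):
       """Conceptual sketch: map frame-level speech-encoder features into
       the LLM's embedding space, stacking adjacent frames to shorten the
       sequence the LLM must attend over."""

       def __init__(self, audio_dim=512, llm_dim=4096, stack=4):
           super().__init__()
           self.stack = stack
           self.proj = nn.Linear(audio_dim * stack, llm_dim)

       def forward(self, feats):              # feats: [batch, frames, audio_dim]
           b, t, d = feats.shape
           t = t - t % self.stack             # drop trailing frames so t divides evenly
           feats = feats[:, :t].reshape(b, t // self.stack, d * self.stack)
           return self.proj(feats)            # [batch, frames / stack, llm_dim]

   adapter = ModalityAdapter()
   frames = torch.randn(2, 401, 512)          # 2 utterances, 401 frames each
   print(adapter(frames).shape)               # torch.Size([2, 100, 4096])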
.. toctree::
   :maxdepth: 1

   ../speech_llm/intro
   ../speech_llm/datasets
   ../speech_llm/configs
   ../speech_llm/api