Multimodal Language Models
==========================
The endeavor to extend Large Language Models (LLMs) into multimodal domains by integrating additional structures, such as visual encoders, has become a focal point of recent research, especially given its potential to significantly lower costs compared to training multimodal universal models from scratch. Please refer to the `NeMo Framework User Guide for Multimodal Models <https://docs.nvidia.com/nemo-framework/user-guide/latest/multimodalmodels/index.html>`_ for detailed support information.
.. toctree::
   :maxdepth: 1

   datasets
   configs
   neva
   video_neva
   sequence_packing
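
At a high level, this family of models maps the features of a frozen vision encoder into the LLM's embedding space through a small trainable connector, so images can be consumed as ordinary token embeddings. The following is a minimal, illustrative sketch of that pattern; the module, names, and dimensions are hypothetical and do not reflect NeMo's actual NeVA implementation.

.. code-block:: python

    # Illustrative only: a trainable projection turns frozen vision-encoder
    # features into "visual tokens" in the LLM embedding space. All names
    # and dimensions here are hypothetical.
    import torch
    import torch.nn as nn

    class VisionToLLMProjector(nn.Module):
        """Projects vision-encoder features into the LLM token-embedding space."""

        def __init__(self, vision_dim: int, llm_dim: int):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
            # vision_feats: (batch, num_patches, vision_dim)
            return self.proj(vision_feats)  # (batch, num_patches, llm_dim)

    batch, patches, vision_dim, llm_dim = 2, 256, 1024, 4096
    vision_feats = torch.randn(batch, patches, vision_dim)  # from a frozen vision encoder
    text_embeds = torch.randn(batch, 32, llm_dim)           # from the LLM's embedding table
    visual_tokens = VisionToLLMProjector(vision_dim, llm_dim)(vision_feats)
    llm_input = torch.cat([visual_tokens, text_embeds], dim=1)  # fed to the LLM
    print(llm_input.shape)  # torch.Size([2, 288, 4096])

Because only the connector (and optionally the LLM) is trained, this approach avoids the cost of pretraining a universal multimodal model from scratch.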
Speech-augmented Large Language Models (SpeechLLM)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SpeechLLM extends Large Language Models (LLMs) with the ability to understand speech and audio inputs. Detailed examples can be found in the `SpeechLLM example <https://github.com/NVIDIA/NeMo/blob/main/examples/multimodal/speech_llm/README.md>`_.
.. toctree::
   :maxdepth: 1

   ../speech_llm/intro
   ../speech_llm/datasets
   ../speech_llm/configs
   ../speech_llm/api
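
The speech case follows the same connector idea: encoded audio frames are typically downsampled in time, projected to the LLM dimension, and spliced into the prompt's embedding sequence. The sketch below is a simplified illustration under those assumptions; the frame-stacking adapter and all shapes are hypothetical, not NeMo's SpeechLLM API.

.. code-block:: python

    # Illustrative only: audio frames from a speech encoder are length-reduced
    # (by stacking adjacent frames), projected to the LLM dimension, and
    # inserted into the text embedding sequence. All names and shapes are
    # hypothetical.
    import torch
    import torch.nn as nn

    llm_dim, audio_dim = 4096, 512
    audio_frames = torch.randn(1, 1500, audio_dim)  # e.g. output of a speech encoder

    # Downsample in time by stacking adjacent frames, then project.
    stack = 4
    b, t, d = audio_frames.shape
    t = t - t % stack  # drop leftover frames so the sequence divides evenly
    stacked = audio_frames[:, :t].reshape(b, t // stack, d * stack)
    adapter = nn.Linear(audio_dim * stack, llm_dim)
    audio_embeds = adapter(stacked)                 # (1, 375, 4096)

    # Splice audio embeddings into the prompt at a placeholder position.
    prefix = torch.randn(1, 8, llm_dim)             # embeddings before the audio
    suffix = torch.randn(1, 4, llm_dim)             # embeddings after the audio
    llm_input = torch.cat([prefix, audio_embeds, suffix], dim=1)
    print(llm_input.shape)  # torch.Size([1, 387, 4096])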