Multimodal Language Models
==========================
The endeavor to extend Large Language Models (LLMs) into multimodal domains by integrating additional structures, such as visual encoders, has become a focal point of recent research, especially given its potential to significantly lower costs compared with training multimodal universal models from scratch. Please refer to the `NeMo Framework User Guide for Multimodal Models <https://docs.nvidia.com/nemo-framework/user-guide/latest/multimodalmodels/index.html>`_ for detailed support information.
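
As a rough, framework-agnostic illustration of this connector approach (a minimal PyTorch sketch, not the NeMo API; the ``vision_encoder``, ``llm``, ``embed_tokens``, and ``inputs_embeds`` interfaces below are hypothetical placeholders), a visual encoder's outputs can be projected into the LLM's embedding space and consumed alongside the text tokens:

.. code-block:: python

   import torch
   import torch.nn as nn

   class MultimodalLM(nn.Module):
       """Conceptual sketch: a frozen vision encoder feeds a trainable
       projection whose outputs are prepended to the LLM's text embeddings.
       Training only the projection (and optionally finetuning the LLM) is
       far cheaper than pretraining a multimodal model from scratch."""

       def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
           super().__init__()
           self.vision_encoder = vision_encoder  # e.g. a pretrained ViT, kept frozen
           for p in self.vision_encoder.parameters():
               p.requires_grad = False
           self.projection = nn.Linear(vision_dim, llm_dim)  # trainable connector
           self.llm = llm  # pretrained LLM (hypothetical embed/forward interface)

       def forward(self, images, input_ids):
           # [batch, patches, vision_dim] -> [batch, patches, llm_dim]
           image_embeds = self.projection(self.vision_encoder(images))
           text_embeds = self.llm.embed_tokens(input_ids)
           # Prepend the projected image tokens to the text sequence.
           inputs = torch.cat([image_embeds, text_embeds], dim=1)
           return self.llm(inputs_embeds=inputs)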
.. toctree::
   :maxdepth: 1

   datasets
   configs
   neva
   video_neva
   sequence_packing
Speech-augmented Large Language Models (SpeechLLM)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SpeechLLM extends Large Language Models (LLMs) with the ability to understand speech and audio inputs. Detailed examples can be found in the `SpeechLLM example <https://github.com/NVIDIA/NeMo/blob/main/examples/multimodal/speech_llm/README.md>`_.
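
As a minimal sketch of the modality-adapter idea behind this approach (again not the NeMo API; the dimensions and frame-stacking factor below are illustrative assumptions), frame-level speech-encoder features can be downsampled and projected into the LLM's embedding space:

.. code-block:: python

   import torch
   import torch.nn as nn

   class ModalityAdapter(nn.Module):
       """Conceptual sketch: map frame-level speech-encoder features into
       the LLM's embedding space, stacking adjacent frames to shorten the
       sequence the LLM must attend over."""

       def __init__(self, audio_dim=512, llm_dim=4096, stack=4):
           super().__init__()
           self.stack = stack
           self.proj = nn.Linear(audio_dim * stack, llm_dim)

       def forward(self, feats):              # feats: [batch, frames, audio_dim]
           b, t, d = feats.shape
           t = t - t % self.stack             # drop trailing frames so t divides evenly
           feats = feats[:, :t].reshape(b, t // self.stack, d * self.stack)
           return self.proj(feats)            # [batch, frames / stack, llm_dim]

   adapter = ModalityAdapter()
   frames = torch.randn(2, 401, 512)          # 2 utterances, 401 frames each
   print(adapter(frames).shape)               # torch.Size([2, 100, 4096])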
.. toctree::
   :maxdepth: 1

   ../speech_llm/intro
   ../speech_llm/datasets
   ../speech_llm/configs
   ../speech_llm/api