Multimodal Language Models
==========================
The effort to extend large language models (LLMs) into multimodal domains by integrating additional structures, such as visual encoders, has become a focal point of recent research, especially because it can significantly lower the cost compared with training multimodal universal models from scratch. Please refer to the `NeMo Framework User Guide for Multimodal Models <https://docs.nvidia.com/nemo-framework/user-guide/latest/multimodalmodels/index.html>`_ for detailed support information.
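A common recipe is to keep a pretrained LLM and a pretrained visual encoder largely intact and train a small projection module that maps image features into the LLM's embedding space. The following is a minimal, framework-agnostic PyTorch sketch of that idea; the class and layer names are illustrative stand-ins, not the NeMo API.

.. code-block:: python

   import torch
   import torch.nn as nn

   class ToyMultimodalLM(nn.Module):
       """Illustrative only: a tiny LLM stand-in fed image and text embeddings."""

       def __init__(self, vision_dim=512, llm_dim=1024, vocab_size=32000):
           super().__init__()
           self.vision_encoder = nn.Linear(3 * 224 * 224, vision_dim)  # stand-in for a ViT
           self.projector = nn.Linear(vision_dim, llm_dim)             # trainable connector
           self.token_embed = nn.Embedding(vocab_size, llm_dim)        # LLM input embeddings
           self.llm = nn.TransformerEncoder(                           # stand-in for the LLM
               nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
               num_layers=2,
           )
           self.lm_head = nn.Linear(llm_dim, vocab_size)

       def forward(self, images, input_ids):
           img_feats = self.vision_encoder(images.flatten(1))      # (B, vision_dim)
           img_tokens = self.projector(img_feats).unsqueeze(1)     # (B, 1, llm_dim)
           txt_tokens = self.token_embed(input_ids)                # (B, T, llm_dim)
           hidden = self.llm(torch.cat([img_tokens, txt_tokens], dim=1))
           return self.lm_head(hidden)                             # logits over the vocabulary

   model = ToyMultimodalLM()
   logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 32000, (2, 16)))
   print(logits.shape)  # torch.Size([2, 17, 32000])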
.. toctree::
   :maxdepth: 1

   datasets
   configs
   neva
   video_neva
   sequence_packing
Speech-augmented Large Language Models (SpeechLLM)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SpeechLLM extends LLMs with the ability to understand speech and audio inputs. Detailed examples can be found in the `SpeechLLM example <https://github.com/NVIDIA/NeMo/blob/main/examples/multimodal/speech_llm/README.md>`_.
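In the same spirit as the visual case above, speech support typically pairs a pretrained audio encoder with a small modality adapter that maps frame-level acoustic features into the LLM's embedding space, so the audio becomes a variable-length prefix of "speech tokens". The snippet below is a hypothetical PyTorch sketch of that data flow, not the NeMo SpeechLLM API.

.. code-block:: python

   import torch
   import torch.nn as nn

   audio_dim, llm_dim, vocab = 256, 1024, 32000
   audio_encoder = nn.Conv1d(80, audio_dim, kernel_size=3, stride=2)  # stand-in for a Conformer
   modality_adapter = nn.Linear(audio_dim, llm_dim)                   # trainable connector
   token_embed = nn.Embedding(vocab, llm_dim)                         # LLM input embeddings

   mel = torch.randn(2, 80, 300)                 # (batch, mel bins, frames)
   speech = audio_encoder(mel).transpose(1, 2)   # (batch, 149, audio_dim) after subsampling
   speech = modality_adapter(speech)             # (batch, 149, llm_dim)
   text = token_embed(torch.randint(0, vocab, (2, 16)))
   llm_input = torch.cat([speech, text], dim=1)  # speech "tokens" followed by text tokens
   print(llm_input.shape)                        # torch.Size([2, 165, 1024])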
.. toctree::
   :maxdepth: 1

   ../speech_llm/intro
   ../speech_llm/datasets
   ../speech_llm/configs
   ../speech_llm/api