Evaluation of Large Language Models with NeMo 2.0
This directory contains Jupyter Notebook tutorials that use the NeMo Framework to evaluate large language models (LLMs):
mmlu.ipynb
- Provides an overview of model deployment and available endpoints.
- Demonstrates how to run MMLU evaluations against both the completions and chat endpoints to assess model proficiency across diverse subjects (see the sketch below).
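Before the MMLU runs, the notebook deploys the model and exposes HTTP endpoints that the evaluation later queries. The snippet below is a minimal sketch of probing such endpoints directly with `requests`; the base URL, model name, and payload fields are assumptions following an OpenAI-compatible layout, and mmlu.ipynb shows the exact values used.

```python
# Hypothetical sketch: querying a locally deployed model before running MMLU.
# The endpoint URLs, model name, and payload fields are assumptions based on an
# OpenAI-compatible serving layout; consult mmlu.ipynb for the exact values.
import requests

base_url = "http://0.0.0.0:8080/v1"  # assumed address of the deployed model
model = "megatron_model"             # assumed model name chosen at deployment time

# Completions endpoint: send a raw prompt and read back the generated text.
completion = requests.post(
    f"{base_url}/completions/",
    json={"model": model, "prompt": "The capital of France is", "max_tokens": 8},
).json()
print(completion)

# Chat endpoint: send a list of messages instead of a single prompt string.
chat = requests.post(
    f"{base_url}/chat/completions/",
    json={
        "model": model,
        "messages": [{"role": "user", "content": "Name one MMLU subject."}],
        "max_tokens": 16,
    },
).json()
print(chat)
```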
simple-evals.ipynb
- Shows how to enable additional evaluation frameworks with the evaluation suite.
- Uses NVIDIA Evals Factory Simple-Evals to demonstrate how to run an evaluation on the HumanEval benchmark (see the sketch below).
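The sketch below illustrates what launching a Simple-Evals HumanEval run against an already deployed model might look like. The module paths, class names, and field names are assumptions rather than the verified NeMo 2.0 API; simple-evals.ipynb contains the actual calls and configuration values.

```python
# Hypothetical sketch of launching a Simple-Evals (HumanEval) run against an
# already deployed model.  Module paths, class names, and field names below are
# assumptions, not the verified NeMo 2.0 API; see simple-evals.ipynb.
from nemo.collections.llm import api
from nemo.collections.llm.evaluation.api import (  # assumed module layout
    ApiEndpoint,
    EvaluationConfig,
    EvaluationTarget,
)

# Point the evaluation at the completions endpoint of the deployed model.
target = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="http://0.0.0.0:8080/v1/completions/",  # assumed deployment URL
        type="completions",
    )
)

# Select the HumanEval task provided by the Simple-Evals framework.
config = EvaluationConfig(type="humaneval")  # assumed task identifier

api.evaluate(target_cfg=target, eval_cfg=config)
```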
wikitext.ipynb
- Illustrates running evaluation tasks without predefined configurations.
- Uses the WikiText benchmark as an example of defining and executing a custom evaluation job (see the sketch below).
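The sketch below shows one way a custom evaluation job could be spelled out when no predefined configuration exists for a task. The class names, the "wikitext" task identifier, and the parameter names are assumptions; wikitext.ipynb defines the real configuration in full.

```python
# Hypothetical sketch of a custom evaluation job for a task without a predefined
# configuration.  Class names, the "wikitext" task identifier, and the parameter
# names are assumptions; wikitext.ipynb defines the real configuration in full.
from nemo.collections.llm import api
from nemo.collections.llm.evaluation.api import (  # assumed module layout
    ApiEndpoint,
    ConfigParams,
    EvaluationConfig,
    EvaluationTarget,
)

target = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="http://0.0.0.0:8080/v1/completions/",  # assumed deployment URL
        type="completions",
    )
)

# With no predefined configuration available, the task type and its parameters
# are spelled out explicitly by the user.
config = EvaluationConfig(
    type="wikitext",                        # assumed identifier for the WikiText task
    params=ConfigParams(limit_samples=10),  # assumed knob to cap evaluated samples
)

api.evaluate(target_cfg=target, eval_cfg=config)
```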