Evaluation on STEM Benchmarks
To test Minerva’s quantitative reasoning abilities, we evaluated the model on STEM benchmarks ranging in difficulty from grade-school problems to graduate-level coursework.
- MATH: High school math competition-level problems.
- MMLU-STEM: A subset of the Massive Multitask Language Understanding benchmark focused on STEM, covering topics such as engineering, chemistry, math, and physics at the high school and college level.
- GSM8k: Grade school math problems involving basic arithmetic operations, all of which should be solvable by a talented middle school student (see the loading sketch after this list).
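To make these benchmarks concrete, below is a minimal sketch of loading GSM8k from the Hugging Face Hub and grading a final answer. The dataset identifier, its "question"/"answer" fields, and the "#### <answer>" convention come from the public GSM8k release, not from Minerva's own evaluation pipeline; the grading helper and the model completion are simplified, hypothetical stand-ins.

```python
# Minimal sketch: load GSM8k and grade a final answer. Illustrative
# only; this is not Minerva's actual evaluation code.
import re

from datasets import load_dataset


def extract_final_answer(text: str) -> str | None:
    # GSM8k reference solutions end with "#### <answer>"; a model
    # prompted to follow the same format can be parsed the same way.
    match = re.search(r"####\s*(-?[\d,\.]+)", text)
    return match.group(1).replace(",", "") if match else None


gsm8k = load_dataset("gsm8k", "main", split="test")  # fields: question, answer
example = gsm8k[0]
gold = extract_final_answer(example["answer"])

# Hypothetical model completion; in a real run this would come from
# sampling the language model on example["question"].
model_output = "...step-by-step reasoning...\n#### 18"
print(extract_final_answer(model_output) == gold)
```

Note that Minerva’s reported numbers additionally rely on chain-of-thought prompting and majority voting over multiple sampled solutions, which this sketch omits.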
We also evaluated Minerva on OCWCourses, a set of college- and graduate-level problems we gathered from MIT OpenCourseWare, covering a variety of STEM topics such as solid state chemistry, astronomy, differential equations, and special relativity.
In all cases, Minerva obtains state-of-the-art results, sometimes by a wide margin.
Reference: https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html