Can GPT models be Financial Analysts? An Evaluation of ChatGPT and GPT-4 on mock CFA Exams

Published on Oct 12, 2023
· Featured in Daily Papers on Oct 17, 2023


Large Language Models (LLMs) have demonstrated remarkable performance on a wide range of Natural Language Processing (NLP) tasks, often matching or even beating state-of-the-art task-specific models. This study aims at assessing the financial reasoning capabilities of LLMs. We leverage mock exam questions of the Chartered Financial Analyst (CFA) Program to conduct a comprehensive evaluation of ChatGPT and GPT-4 in financial analysis, considering Zero-Shot (ZS), Chain-of-Thought (CoT), and Few-Shot (FS) scenarios. We present an in-depth analysis of the models' performance and limitations, and estimate whether they would have a chance at passing the CFA exams. Finally, we outline insights into potential strategies and improvements to enhance the applicability of LLMs in finance. In this perspective, we hope this work paves the way for future studies to continue enhancing LLMs for financial reasoning through rigorous evaluation.


My summary for the day:

Researchers evaluated ChatGPT and GPT-4 on mock CFA exam questions to see if they could pass the real tests. The CFA exams rigorously test practical finance knowledge and are known for being quite difficult.

They tested the models in zero-shot, few-shot, and chain-of-thought prompting settings on mock Level I and Level II exams.

The key findings:

  • GPT-4 consistently beat ChatGPT, but both models struggled way more on the more advanced Level II questions.
  • Few-shot prompting helped ChatGPT slightly
  • Chain-of-thought prompting exposed knowledge gaps rather than helping much.
  • Based on estimated passing scores, only GPT-4 with few-shot prompting could potentially pass the exams.

The models definitely aren't ready to become charterholders yet. Their difficulties with tricky questions and core finance concepts highlight the need for more specialized training and knowledge.

But GPT-4 did better overall, and few-shot prompting shows their ability to improve. So with targeted practice on finance formulas and reasoning, we could maybe see step-wise improvements.

TLDR: Tested on mock CFA exams, ChatGPT and GPT-4 struggle with the complex finance concepts and fail. With few-shot prompting, GPT-4 performance reaches the boundary between passing and failing but doesn't clearly pass.

Full summary here.

Does using RAG going to help?

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite in a model to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite in a dataset to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite in a Space to link it from this page.

Collections including this paper 3