b1sheng committed
Commit f68b926
1 Parent(s): 0343b70

Update src/assets/text_content.py

Files changed (1):
  1. src/assets/text_content.py +11 -8
src/assets/text_content.py CHANGED
@@ -8,19 +8,22 @@ INTRODUCTION_TEXT = f"""
 The data on this page is sourced from a research paper. If you intend to use the data from this page, please remember to cite the following source: https://arxiv.org/abs/2303.07992
 
 We compare the current SOTA traditional KBQA models (fine-tuned (FT) and zero-shot (ZS)),
+
 LLMs in the GPT family, and Other Non-GPT LLM. In QALD-9 and LC-quad2, the evaluation metric used is F1, while other datasets use Accuracy (Exact match).
 
 """
 
 LLM_BENCHMARKS_TEXT = f"""
-ChatGPT is a powerful large language model (LLM) that
-covers knowledge resources such as Wikipedia and supports natural language question answering using its own knowledge. Therefore, there is
-growing interest in exploring whether ChatGPT can replace traditional
-knowledge-based question answering (KBQA) models. Although there
-have been some works analyzing the question answering performance of
-ChatGPT, there is still a lack of large-scale, comprehensive testing of various types of complex questions to analyze the limitations of the model.
-In this paper, we present a framework that follows the black-box testing specifications of CheckList proposed by Microsoft. We evaluate ChatGPT
-and its family of LLMs on eight real-world KB-based complex question answering datasets, which include six English datasets and two multilingual datasets.
+ChatGPT is a powerful large language model (LLM) that covers knowledge resources such as Wikipedia and supports natural language question answering using its own knowledge.
+
+Therefore, there is growing interest in exploring whether ChatGPT can replace traditional knowledge-based question answering (KBQA) models.
+
+Although there have been some works analyzing the question answering performance of ChatGPT, there is still a lack of large-scale, comprehensive testing of various types of complex questions to analyze the limitations of the model.
+
+In this paper, we present a framework that follows the black-box testing specifications of CheckList proposed by Microsoft.
+
+We evaluate ChatGPT and its family of LLMs on eight real-world KB-based complex question answering datasets, which include six English datasets and two multilingual datasets.
+
 The total number of test cases is approximately 190,000.
 
 """