factbench / factbench_data.csv
sheza munir
Set up leaderboard page
b78a401 verified
Tier,Model,FactScore,SAFE,Factcheck-GPT,VERIFY
Tier 1: Easy,GPT4-o,53.19,63.31,86.4,71.58
Tier 1: Easy,Gemini1.5-Pro,51.79,61.24,83.45,69.38
Tier 1: Easy,Llama3.1-70B-Instruct,52.49,61.29,83.48,67.27
Tier 1: Easy,Llama3.1-405B-Instruct,53.22,61.63,83.57,64.94
Tier 2: Moderate,GPT4-o,54.76,65.01,89.39,76.02
Tier 2: Moderate,Gemini1.5-Pro,52.62,62.68,87.44,74.24
Tier 2: Moderate,Llama3.1-70B-Instruct,52.53,62.64,85.16,72.01
Tier 2: Moderate,Llama3.1-405B-Instruct,53.48,63.29,86.37,70.25
Tier 3: Hard,GPT4-o,69.44,76.17,94.25,90.58
Tier 3: Hard,Gemini1.5-Pro,66.05,75.69,91.09,87.82
Tier 3: Hard,Llama3.1-70B-Instruct,69.85,77.55,92.89,86.63
Tier 3: Hard,Llama3.1-405B-Instruct,70.04,77.01,93.64,85.79