Longform QA

#8
by shehzaadzd - opened

The FActScore paper (https://arxiv.org/pdf/2305.14251.pdf) proposes an automatic method for evaluating hallucination in long-form QA. It also provides a biography-generation benchmark covering a mix of entities.
Can this be integrated into the leaderboard?

hallucinations-leaderboard org

@shehzaadzd from a quick glance, it requires access to an OpenAI key -- for example, see this snippet from https://github.com/shmsw25/FActScore:

from factscore.factscorer import FactScorer

fs = FactScorer(openai_key="...")

# topics: list of strings (human entities used to generate bios)
# generations: list of strings (model generations)
out = fs.get_score(topics, generations, gamma=10)
print(out["score"]) # FActScore
[..]

How would you implement/include it?

The FActScore Llama-7b variant only uses InstructGPT for splitting sentences into atomic facts. It may be possible to train an open-source model to do this if OpenAI models cannot be included in this benchmark.
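To make the interface concrete, here is a minimal sketch of what a drop-in open-source fact splitter would look like. This is a hypothetical illustration, not FActScore's actual splitter: it only uses regex-based sentence and clause splitting as a placeholder, whereas a real replacement would prompt or fine-tune an open model (e.g. a Llama variant) to produce decontextualized atomic facts.

```python
import re

def split_into_atomic_facts(generation: str) -> list[str]:
    """Crude stand-in for the InstructGPT-based fact splitter in FActScore.

    Hypothetical sketch: a real replacement would be an open-source model
    prompted to rewrite each sentence as a list of atomic facts. Here we
    only split on sentence boundaries and the conjunction "and" between
    coordinated clauses, to illustrate the expected input/output shape.
    """
    sentences = re.split(r"(?<=[.!?])\s+", generation.strip())
    facts = []
    for sentence in sentences:
        # Split coordinated clauses; a learned splitter would also resolve
        # pronouns and add context so each fact stands on its own.
        for clause in re.split(r",?\s+and\s+", sentence):
            clause = clause.strip().rstrip(".")
            if clause:
                facts.append(clause)
    return facts

bio = "Marie Curie was a physicist and chemist. She won two Nobel Prizes."
for fact in split_into_atomic_facts(bio):
    print(fact)
```

The returned list of facts is what the downstream retrieval-and-verification step would score against the knowledge source, so only this component needs to change to drop the OpenAI dependency.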
