---
title: Evaluation of Summaries
author: Cillian Berragan
format: pdf
fontfamily: libertinus
monofont: 'JetBrains Mono'
monofontoptions:
  - Scale=0.75
---
This document compares the summaries written by Cambridge with the summaries generated automatically by our model.

# Overview

For all representations, the original summary was compared with the generated summary provided by the LLM. A separate LLM call was used to determine which of these two summaries was preferred, based on the following criteria:

> A good summary should:
>
> 1. **Be accurate** – It should not include information that is not present in the source document.
> 2. **Be comprehensive** – It should reflect all key points in the source document without omitting important details.
> 3. **Be well-grounded** – It should be based entirely on the source document without adding interpretations, opinions, or external information.

The model could return one of four scores: 0 meaning neither summary is suitable, 1 meaning the original summary is preferred, 2 meaning the LLM-generated summary is preferred, or 3 meaning both summaries are suitable.
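
The judging step is a single LLM call per document pair. The sketch below shows roughly how such a comparison might be issued; the exact prompt wording and client are not described in this document, so `call_llm` is a hypothetical stand-in for whatever LLM interface is actually used.

```python
# Sketch of the pairwise judging call; `call_llm` is a hypothetical
# placeholder for the project's actual LLM client.
from typing import Callable

JUDGE_PROMPT = """You are comparing two summaries of the same source document.
A good summary is accurate, comprehensive, and well-grounded in the source.

Source document:
{document}

Summary A (original):
{original}

Summary B (LLM-generated):
{generated}

Reply with a single digit:
0 = neither summary is suitable
1 = Summary A is preferred
2 = Summary B is preferred
3 = both summaries are suitable"""


def judge_pair(
    document: str,
    original: str,
    generated: str,
    call_llm: Callable[[str], str],
) -> int:
    """Score one original/generated summary pair with the judging LLM."""
    prompt = JUDGE_PROMPT.format(
        document=document, original=original, generated=generated
    )
    return int(call_llm(prompt).strip())
```
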
@tbl-eval gives the results of this processing. We can see that the majority of the preferred summaries are those generated by the LLM (score 2). There are, however, 8 cases where the original summary is considered better, and 17 where both summaries are considered suitable.

```{python}
#| label: tbl-eval
#| tbl-cap: Comparison between original summary and LLM-generated summary
#| echo: false
#| output: asis
import polars as pl

summaries = pl.read_parquet("./data/out/eval.parquet")

# Count how often each score was assigned, then transpose so each score
# becomes a column. Scores 1, 2, and 3 map to "Original", "LLM-generated",
# and "Both" respectively (no score of 0 appears in the data).
print(
    summaries["score"]
    .value_counts()
    .sort("score")
    .transpose(include_header=True)
    .rename({"column_0": "Original", "column_1": "LLM-generated", "column_2": "Both"})
    .drop("column")
    .tail(1)
    .to_pandas()
    .to_markdown(index=False)
)
```