"""**Evaluation** chains for grading LLM and Chain outputs.
This module contains off-the-shelf evaluation chains for grading the output of
LangChain primitives such as language models and chains.
**Loading an evaluator**
To load a single evaluator, call :func:`load_evaluator <langchain.evaluation.loading.load_evaluator>` with the
evaluator's name. To load several at once, call
:func:`load_evaluators <langchain.evaluation.loading.load_evaluators>` with a list of names.

.. code-block:: python

    from langchain.evaluation import load_evaluator

    evaluator = load_evaluator("qa")
    evaluator.evaluate_strings(
        prediction="We sold more than 40,000 units last week",
        input="How many units did we sell last week?",
        reference="We sold 32,378 units",
    )

The evaluator name must correspond to a member of :class:`EvaluatorType <langchain.evaluation.schema.EvaluatorType>`.
**Datasets**
To load one of the LangChain HuggingFace datasets, you can use the :func:`load_dataset <langchain.evaluation.loading.load_dataset>` function with the
name of the dataset to load.

.. code-block:: python

    from langchain.evaluation import load_dataset

    ds = load_dataset("llm-math")

**Some common use cases for evaluation include:**
- Grading the accuracy of a response against ground truth answers: :class:`QAEvalChain <langchain.evaluation.qa.eval_chain.QAEvalChain>`
- Comparing the output of two models: :class:`PairwiseStringEvalChain <langchain.evaluation.comparison.eval_chain.PairwiseStringEvalChain>` or :class:`LabeledPairwiseStringEvalChain <langchain.evaluation.comparison.eval_chain.LabeledPairwiseStringEvalChain>` when there is additionally a reference label.
- Judging the efficacy of an agent's tool usage: :class:`TrajectoryEvalChain <langchain.evaluation.agents.trajectory_eval_chain.TrajectoryEvalChain>`
- Checking whether an output complies with a set of criteria: :class:`CriteriaEvalChain <langchain.evaluation.criteria.eval_chain.CriteriaEvalChain>` or :class:`LabeledCriteriaEvalChain <langchain.evaluation.criteria.eval_chain.LabeledCriteriaEvalChain>` when there is additionally a reference label.
- Computing semantic difference between a prediction and reference: :class:`EmbeddingDistanceEvalChain <langchain.evaluation.embedding_distance.base.EmbeddingDistanceEvalChain>` or between two predictions: :class:`PairwiseEmbeddingDistanceEvalChain <langchain.evaluation.embedding_distance.base.PairwiseEmbeddingDistanceEvalChain>`
- Measuring the string distance between a prediction and reference: :class:`StringDistanceEvalChain <langchain.evaluation.string_distance.base.StringDistanceEvalChain>` or between two predictions: :class:`PairwiseStringDistanceEvalChain <langchain.evaluation.string_distance.base.PairwiseStringDistanceEvalChain>`
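To make the string-distance use case above concrete, here is a minimal pure-Python sketch of an edit-distance metric. It is not the library's implementation: the function names ``levenshtein`` and ``normalized_distance`` are hypothetical, and Levenshtein distance is only one of several metrics such an evaluator might use.

```python
def levenshtein(a: str, b: str) -> int:
    # Dynamic-programming edit distance, keeping only one row at a time.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            # Cost of substituting char_a with char_b (0 if they match).
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(deletion, insertion, substitution))
        previous = current
    return previous[-1]


def normalized_distance(prediction: str, reference: str) -> float:
    # Scale the distance into [0, 1]; 0.0 means the strings are identical.
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0
    return levenshtein(prediction, reference) / longest
```

A lower value means the prediction is closer to the reference, so a score of this kind reads as a distance rather than as an accuracy.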
**Low-level API**
These evaluators implement one of the following interfaces:
- :class:`StringEvaluator <langchain.evaluation.schema.StringEvaluator>`: Evaluate a prediction string against a reference label and/or input context.
- :class:`PairwiseStringEvaluator <langchain.evaluation.schema.PairwiseStringEvaluator>`: Evaluate two prediction strings against each other. Useful for scoring preferences, measuring the similarity between two chains or LLM agents, or comparing outputs on similar inputs.
- :class:`AgentTrajectoryEvaluator <langchain.evaluation.schema.AgentTrajectoryEvaluator>`: Evaluate the full sequence of actions taken by an agent.
These interfaces enable easier composability and usage within a higher level evaluation framework.
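As a rough illustration of why a shared interface helps composability, the sketch below defines a toy evaluator that follows the ``evaluate_strings(prediction=..., reference=..., input=...)`` call shape shown earlier, plus a helper that works with any object exposing that method. The names ``ToyStringEvaluator``, ``ToyExactMatch`` and ``mean_score`` are hypothetical, and returning a dict with a ``"score"`` key is an assumption of this sketch, not a statement about the library's classes.

```python
from typing import Any, Dict, Iterable, Optional, Protocol


class ToyStringEvaluator(Protocol):
    # Hypothetical protocol mirroring the evaluate_strings call shape.
    def evaluate_strings(
        self,
        *,
        prediction: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
    ) -> Dict[str, Any]:
        ...


class ToyExactMatch:
    # Toy evaluator: scores 1 when prediction and reference match exactly.
    def evaluate_strings(self, *, prediction, reference=None, input=None):
        if reference is None:
            raise ValueError("ToyExactMatch needs a reference string")
        return {"score": int(prediction.strip() == reference.strip())}


def mean_score(evaluator: ToyStringEvaluator, rows: Iterable[Dict[str, str]]) -> float:
    # Works with any evaluator that follows the protocol above.
    scores = [evaluator.evaluate_strings(**row)["score"] for row in rows]
    return sum(scores) / len(scores)


rows = [
    {"prediction": "4", "reference": "4"},
    {"prediction": "5", "reference": "4"},
]
average = mean_score(ToyExactMatch(), rows)
```

Because ``mean_score`` depends only on the method signature, swapping in a different evaluator requires no change to the surrounding harness.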
""" # noqa: E501

from langchain.evaluation.agents import TrajectoryEvalChain
from langchain.evaluation.comparison import (
    LabeledPairwiseStringEvalChain,
    PairwiseStringEvalChain,
)
from langchain.evaluation.criteria import (
    Criteria,
    CriteriaEvalChain,
    LabeledCriteriaEvalChain,
)
from langchain.evaluation.embedding_distance import (
    EmbeddingDistance,
    EmbeddingDistanceEvalChain,
    PairwiseEmbeddingDistanceEvalChain,
)
from langchain.evaluation.exact_match.base import ExactMatchStringEvaluator
from langchain.evaluation.loading import load_dataset, load_evaluator, load_evaluators
from langchain.evaluation.parsing.base import (
    JsonEqualityEvaluator,
    JsonValidityEvaluator,
)
from langchain.evaluation.parsing.json_distance import JsonEditDistanceEvaluator
from langchain.evaluation.parsing.json_schema import JsonSchemaEvaluator
from langchain.evaluation.qa import ContextQAEvalChain, CotQAEvalChain, QAEvalChain
from langchain.evaluation.regex_match.base import RegexMatchStringEvaluator
from langchain.evaluation.schema import (
    AgentTrajectoryEvaluator,
    EvaluatorType,
    PairwiseStringEvaluator,
    StringEvaluator,
)
from langchain.evaluation.scoring import (
    LabeledScoreStringEvalChain,
    ScoreStringEvalChain,
)
from langchain.evaluation.string_distance import (
    PairwiseStringDistanceEvalChain,
    StringDistance,
    StringDistanceEvalChain,
)

__all__ = [
    "EvaluatorType",
    "ExactMatchStringEvaluator",
    "RegexMatchStringEvaluator",
    "PairwiseStringEvalChain",
    "LabeledPairwiseStringEvalChain",
    "QAEvalChain",
    "CotQAEvalChain",
    "ContextQAEvalChain",
    "StringEvaluator",
    "PairwiseStringEvaluator",
    "TrajectoryEvalChain",
    "CriteriaEvalChain",
    "Criteria",
    "EmbeddingDistance",
    "EmbeddingDistanceEvalChain",
    "PairwiseEmbeddingDistanceEvalChain",
    "StringDistance",
    "StringDistanceEvalChain",
    "PairwiseStringDistanceEvalChain",
    "LabeledCriteriaEvalChain",
    "load_evaluators",
    "load_evaluator",
    "load_dataset",
    "AgentTrajectoryEvaluator",
    "ScoreStringEvalChain",
    "LabeledScoreStringEvalChain",
    "JsonValidityEvaluator",
    "JsonEqualityEvaluator",
    "JsonEditDistanceEvaluator",
    "JsonSchemaEvaluator",
]