How to use off-the-shelf LangChain evaluators

LangChain's evaluation module provides evaluators you can use as-is for common evaluation scenarios.

To grade your chain or agent, name the evaluators in the RunEvalConfig passed to the run_on_dataset (or async arun_on_dataset) function in the LangChain library. See the linked doc for a quick walkthrough of how to use it.

Copy the code snippets below to get started. You can also configure them for your applications using the arguments mentioned in the "Parameters" sections.

If you don't see an implementation that suits your needs, you can learn how to create your own Custom Run Evaluators in the linked guide, or contribute a string evaluator to the LangChain repository.

note

Most of these evaluators are useful but imperfect! We recommend against trusting any single automated metric blindly; always use these evaluators as part of a holistic testing and evaluation strategy. Many of the LLM-based evaluators return a binary score for a given data point, so differences in prompt or model performance are measured most reliably in aggregate over a larger dataset.
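As a rough illustration of why aggregate results matter, the sketch below summarizes binary scores for two prompt variants as a mean with a standard error. The score lists and the helper name summarize_binary_scores are hypothetical and not part of LangChain or LangSmith.

```python
import math


def summarize_binary_scores(scores):
    """Return (mean, standard_error) for a list of 0/1 evaluator scores."""
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one score")
    mean = sum(scores) / n
    # Standard error of a proportion: sqrt(p * (1 - p) / n)
    std_err = math.sqrt(mean * (1 - mean) / n)
    return mean, std_err


# Hypothetical per-example scores for two prompt variants (illustration only)
prompt_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
prompt_b = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]

mean_a, err_a = summarize_binary_scores(prompt_a)
mean_b, err_b = summarize_binary_scores(prompt_b)
print(f"prompt A: {mean_a:.2f} +/- {err_a:.2f}")
print(f"prompt B: {mean_b:.2f} +/- {err_b:.2f}")
```

With only ten examples per variant the standard errors are large relative to the gap between the means, which is why a larger dataset gives a more trustworthy comparison.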

Correctness: QA evaluation

QA evaluators help to measure the correctness of a response to a user query or question. If you have a dataset with reference labels or reference context docs, these are the evaluators for you!

Three QA evaluators you can load are: "context_qa", "qa", "cot_qa". Based on our meta-evals, we recommend using "cot_qa" or a similar prompt for best results.

The "context_qa" evaluator (reference) instructs the LLM chain to use reference "context" (provided through the example outputs) in determining correctness. This is useful if you have a larger corpus of grounding docs but don't have ground truth answers to a query.
The "qa" evaluator (reference) instructs an LLMChain to directly grade a response as "correct" or "incorrect" based on the reference answer.
The "cot_qa" evaluator (reference) is similar to the "context_qa" evaluator, except it instructs the LLMChain to use chain of thought "reasoning" before determining a final verdict. This tends to lead to responses that better correlate with human labels, for a slightly higher token and runtime cost.

Python SDK

from langsmith import Client
from langchain.smith import RunEvalConfig, run_on_dataset

evaluation_config = RunEvalConfig(
    evaluators=[
        "qa",
        "context_qa",
        "cot_qa",
    ]
)

client = Client()
run_on_dataset(
    dataset_name="<dataset_name>",
    llm_or_chain_factory=<LLM or constructor for chain or agent>,
    client=client,
    evaluation=evaluation_config,
    project_name="<the name to assign to this test project>",
)

You can customize the evaluator by specifying the LLM used to power its LLM chain or even by customizing the prompt itself. Below is an example using an Anthropic model to run the evaluator, and a custom prompt for the base QA evaluator. Check out the reference docs for more information on the expected prompt format.

Python SDK

from langchain.chat_models import ChatAnthropic
from langchain.smith import RunEvalConfig
from langchain_core.prompts.prompt import PromptTemplate

_PROMPT_TEMPLATE = """You are an expert professor specialized in grading students' answers to questions.
You are grading the following question:
{query}
Here is the real answer:
{answer}
You are grading the following predicted answer:
{result}
Respond with CORRECT or INCORRECT:
Grade:
"""

PROMPT = PromptTemplate(
    input_variables=["query", "answer", "result"], template=_PROMPT_TEMPLATE
)
eval_llm = ChatAnthropic(temperature=0.0)
evaluation_config = RunEvalConfig(
    evaluators=[
        RunEvalConfig.QA(llm=eval_llm, prompt=PROMPT),
        RunEvalConfig.ContextQA(llm=eval_llm),
        RunEvalConfig.CoTQA(llm=eval_llm),
    ]
)
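To check what the grader will see, you can render the same template with plain string formatting before wiring it into an evaluator. This is a minimal sketch that uses no LangChain APIs; the question, answer, and result values are hypothetical.

```python
# Same placeholders the custom QA prompt above declares: query, answer, result.
template = """You are an expert professor specialized in grading students' answers to questions.
You are grading the following question:
{query}
Here is the real answer:
{answer}
You are grading the following predicted answer:
{result}
Respond with CORRECT or INCORRECT:
Grade:
"""

# Hypothetical example values, for illustration only.
rendered = template.format(
    query="What is the capital of France?",
    answer="Paris",
    result="The capital of France is Paris.",
)
print(rendered)
```

If a placeholder name in the template does not match the declared input variables, formatting fails with a KeyError, which makes this a cheap check to run before a full evaluation.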

No labels: criteria evaluation

If you don't have ground truth reference labels (e.g., if you are evaluating against production data or if your task doesn't involve factuality), you can evaluate your run against a custom set of criteria using the "criteria" evaluator (reference).

This is helpful when there are high level semantic aspects of your model's output you'd like to monitor that aren't captured by other explicit checks or rules.

Python SDK

from langsmith import Client
from langchain.smith import RunEvalConfig, run_on_dataset

evaluation_config = RunEvalConfig(
    evaluators=[
        # You can define an arbitrary criterion as a key: value pair in the criteria dict
        RunEvalConfig.Criteria({"creativity": "Is this submission creative, imaginative, or novel?"}),
        # We provide some simple default criteria like "conciseness" you can use as well
        RunEvalConfig.Criteria("conciseness"),
    ]
)

client = Client()
run_on_dataset(
    dataset_name="<dataset_name>",
    llm_or_chain_factory=<LLM or constructor for chain or agent>,
    evaluation=evaluation_config,
    client=client,
)

Criteria with labels

Your dataset may have ground truth labels or contextual information demonstrating an output that satisfies a criterion or prediction for a certain input. You can use criteria in context with the "labeled_criteria" evaluator (reference).

Python SDK

from langsmith import Client
from langchain.smith import RunEvalConfig, run_on_dataset

evaluation_config = RunEvalConfig(
    evaluators=[
        # You can define an arbitrary criterion as a key: value pair in the criteria dict
        RunEvalConfig.LabeledCriteria(
            {
                "helpfulness": (
                    "Is this submission helpful to the user,"
                    " taking into account the correct reference answer?"
                )
            }
        ),
    ]
)

client = Client()
run_on_dataset(
    dataset_name="<dataset_name>",
    llm_or_chain_factory=<LLM or constructor for chain or agent>,
    evaluation=evaluation_config,
    client=client,
)

Supported Criteria

Default criteria are implemented for the following aspects: conciseness, relevance, correctness, coherence, harmfulness, maliciousness, helpfulness, controversiality, misogyny, and criminality. To specify custom criteria, write a mapping of a criterion name to its description, such as:

  criterion = {"creativity": "Is this submission creative, imaginative, or novel?"}
  eval_config = RunEvalConfig(evaluators=[RunEvalConfig.Criteria(criterion)])

Interpreting the Score

Evaluation scores don't have an inherent "direction" (i.e., higher is not necessarily better). The direction of the score depends on the criteria being evaluated. For example, a score of 1 for "helpfulness" means that the prediction was deemed to be helpful by the model. However, a score of 1 for "maliciousness" means that the prediction contains malicious content, which, of course, is "bad".
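If you want to compare or average scores across criteria, one option is to flip the scores of criteria where 1 signals a problem so that higher always means better. The sketch below assumes binary scores; the grouping in LOWER_IS_BETTER and the helper to_higher_is_better are hypothetical and not something LangChain provides.

```python
# Assumption: criteria where a score of 1 flags undesirable content.
# This grouping is illustrative; adjust it to the criteria you actually use.
LOWER_IS_BETTER = {
    "harmfulness",
    "maliciousness",
    "controversiality",
    "misogyny",
    "criminality",
}


def to_higher_is_better(criterion, score):
    """Flip binary scores for criteria where 1 means the output is undesirable."""
    if score not in (0, 1):
        raise ValueError("expected a binary score of 0 or 1")
    return 1 - score if criterion in LOWER_IS_BETTER else score


print(to_higher_is_better("helpfulness", 1))    # prints 1
print(to_higher_is_better("maliciousness", 1))  # prints 0
```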

Embedding distance

One way to quantify the semantic (dis-)similarity between a predicted output and a ground truth is via embedding distance. You can use the "embedding_distance" evaluator (reference) to measure the distance between the predicted output and the ground truth.

Python SDK

from langsmith import Client
from langchain.smith import RunEvalConfig, run_on_dataset
from langchain.embeddings import HuggingFaceEmbeddings

evaluation_config = RunEvalConfig(
    evaluators=[
        # Use the default embedding distance evaluator
        "embedding_distance",
        # Or to customize the embeddings:
        # Requires 'pip install sentence_transformers'
        # RunEvalConfig.EmbeddingDistance(embeddings=HuggingFaceEmbeddings(), distance_metric="cosine"),
    ]
)

client = Client()
run_on_dataset(
    dataset_name="<dataset_name>",
    llm_or_chain_factory=<LLM or constructor for chain or agent>,
    evaluation=evaluation_config,
    client=client,
)
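To make the metric concrete, the sketch below computes cosine distance (one minus cosine similarity) between two vectors in plain Python. The three-dimensional "embeddings" are hypothetical toy values; a real embedding model produces much longer vectors, and the evaluator's own implementation may differ in detail.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for real embeddings
prediction = [1.0, 0.0, 1.0]
reference = [1.0, 0.0, 1.0]
unrelated = [0.0, 1.0, 0.0]

print(cosine_distance(prediction, reference))  # close to 0: same direction
print(cosine_distance(prediction, unrelated))  # 1.0: orthogonal vectors
```

A smaller distance means the prediction is semantically closer to the reference, so lower scores are better for this metric.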

String distance

Another simple way to measure similarity is by computing a string edit distance like Levenshtein distance. You can use the "string_distance" evaluator (reference) to measure the distance between the predicted output and the ground truth. This depends on the rapidfuzz library.

Python SDK

from langsmith import Client
from langchain.smith import RunEvalConfig, run_on_dataset

evaluation_config = RunEvalConfig(
    evaluators=[
        # Use the default string distance evaluator
        "string_distance",
        # Or to customize the distance metric:
        # RunEvalConfig.StringDistance(distance="levenshtein", normalize_score=True),
    ]
)

client = Client()
run_on_dataset(
    dataset_name="<dataset_name>",
    llm_or_chain_factory=<LLM or constructor for chain or agent>,
    evaluation=evaluation_config,
    client=client,
)
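For intuition about what an edit distance measures, here is a minimal pure-Python sketch of Levenshtein distance. It is not the rapidfuzz implementation the evaluator relies on, and dividing by the longer string's length is only one common way to normalize, assumed here for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute, or match at no cost
                )
            )
        previous = current
    return previous[-1]


def normalized_levenshtein(a, b):
    """Scale the distance to [0, 1] by the longer string's length (an assumption)."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # prints 3
print(normalized_levenshtein("kitten", "sitting"))
```

As with embedding distance, a lower value means the prediction is closer to the reference.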

Don't see what you're looking for?

These implementations are just a starting point. We want to work with you to build better off-the-shelf evaluation tools for everyone.

We'd love feedback and contributions! Send us feedback at support@langchain.dev, check out the Evaluators in LangChain, or submit PRs or issues directly to better address your needs.
