Evaluation Quick Start

In this walkthrough, you will evaluate a chain over a dataset of examples. To do so, you will:

  • Create a dataset
  • Define the system to evaluate
  • Configure and run the evaluation
  • Review the resulting traces and evaluation feedback in LangSmith

Prerequisites

This walkthrough assumes you have already installed LangChain and the OpenAI SDK, and have configured your environment to connect to LangSmith.

pip install -U langchain langchain_openai

Then configure your LangSmith API key.

export LANGCHAIN_API_KEY=<your api key>
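
The prediction function later in this walkthrough calls the OpenAI API directly, so the OpenAI SDK will also expect an API key in your environment. Enabling tracing is optional, but it lets LangSmith record runs made outside the test runner as well.

export OPENAI_API_KEY=<your openai api key>
export LANGCHAIN_TRACING_V2=true  # optional: trace all LangChain runs to LangSmith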

1. Create a dataset

Upload a dataset to LangSmith to use for evaluation. For this example, we will upload a pre-made list of input examples.

For more information on other ways to create and use datasets, check out the how-to guides.

from langsmith import Client

# Inputs are provided to your model, so it knows what to generate
dataset_inputs = [
    "a rap battle between Atticus Finch and Cicero",
    "a rap battle between Barbie and Oppenheimer",
    # ... add more as desired
]

# Outputs are provided to the evaluator, so it knows what to compare to
# Outputs are optional but recommended.
dataset_outputs = [
    {"must_mention": ["lawyer", "justice"]},
    {"must_mention": ["plastic", "nuclear"]},
]
client = Client()
dataset_name = "Rap Battle Dataset"

# Storing inputs in a dataset lets us
# run chains and LLMs over a shared set of examples.
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Rap battle prompts.",
)
client.create_examples(
    inputs=[{"question": q} for q in dataset_inputs],
    outputs=dataset_outputs,
    dataset_id=dataset.id,
)
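
As an optional sanity check, you can list the examples back from the client to confirm they were created. A minimal sketch using the client's list_examples method:

# Optional: confirm the examples landed in the dataset.
examples = list(client.list_examples(dataset_name=dataset_name))
print(f"Created {len(examples)} examples")
print(examples[0].inputs, examples[0].outputs)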

2. System to evaluate

The run_on_dataset test runner can evaluate any function. This includes any Runnable LangChain component.

If your system is stateful (for instance, if it has chat memory), provide a constructor instead, so that a new instance of your system is created for each example record in the dataset (see the sketch after the code block below). If your system is stateless, you can pass it in directly, without a constructor.

import openai

# You can evaluate any arbitrary function over the dataset.
# The input to the function is the inputs dictionary for each example.
def predict_result(input_: dict) -> dict:
    messages = [{"role": "user", "content": input_["question"]}]
    response = openai.chat.completions.create(messages=messages, model="gpt-3.5-turbo")
    # Return the generated text so evaluators can compare it to the reference outputs.
    return {"output": response.choices[0].message.content}
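
As noted above, a stateful system should instead be passed as a constructor so each example gets a fresh instance. Here is a minimal sketch of what such a constructor might look like, assuming the langchain_openai package from the install step; create_chain and its prompt are illustrative, not part of the walkthrough above.

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

def create_chain():
    # A new chain is constructed per example, so no state leaks between runs.
    prompt = ChatPromptTemplate.from_messages([("human", "Write {question}.")])
    return prompt | ChatOpenAI(model="gpt-3.5-turbo")

# Pass the constructor itself (not an instance) as llm_or_chain_factory:
# client.run_on_dataset(..., llm_or_chain_factory=create_chain)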

3. Evaluate

LangChain provides convenient run_on_dataset and async arun_on_dataset methods to generate predictions (and traces) over a dataset. When a RunEvalConfig is provided, the configured evaluators are also applied to the predictions to generate automated feedback.

Below, we configure evaluation with a custom evaluator and a few built-in criteria. The resulting feedback will be automatically logged within LangSmith.

For more information on evaluators you can use off-the-shelf, check out the pre-built evaluators docs or the reference documentation for LangChain's evaluation module. For more information on how to write a custom evaluator, check out the custom evaluators guide.

from langchain.smith import RunEvalConfig, run_on_dataset
from langsmith.evaluation import EvaluationResult, run_evaluator

@run_evaluator
def must_mention(run, example) -> EvaluationResult:
    prediction = run.outputs.get("output") or ""
    required = example.outputs.get("must_mention") or []
    score = all(phrase in prediction for phrase in required)
    return EvaluationResult(key="must_mention", score=score)

eval_config = RunEvalConfig(
    custom_evaluators=[must_mention],
    # You can also use a prebuilt evaluator
    # by providing a name or RunEvalConfig.<configured evaluator>
    evaluators=[
        # You can specify an evaluator by name/enum.
        # In this case, the default criterion is "helpfulness"
        "criteria",
        # Or you can configure the evaluator
        RunEvalConfig.Criteria("harmfulness"),
        RunEvalConfig.Criteria(
            {
                "cliche": "Are the lyrics cliche?"
                " Respond Y if they are, N if they're entirely unique."
            }
        ),
    ],
)
client.run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=predict_result,
    evaluation=eval_config,
    verbose=True,
    project_name="custom-function-test-1",
    # Any experiment metadata can be specified here
    project_metadata={"version": "1.0.0"},
)
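
The async variant mentioned above can be used the same way. A minimal sketch, assuming arun_on_dataset mirrors the arguments of the synchronous call (the project name below is just an example):

import asyncio

async def run_async_eval():
    # Predictions are generated concurrently; arguments mirror run_on_dataset.
    await client.arun_on_dataset(
        dataset_name=dataset_name,
        llm_or_chain_factory=predict_result,
        evaluation=eval_config,
        project_name="custom-function-test-async",
    )

asyncio.run(run_async_eval())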

4. Review Results

The evaluation results will be streamed to a new test project linked to your "Rap Battle Dataset". You can view the results by clicking on the link printed by the run_on_dataset function or by navigating to the Datasets & Testing page, clicking "Rap Battle Dataset", and viewing the latest test run.

There, you can inspect the traces and feedback generated from the evaluation configuration.
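
You can also pull the logged runs and their feedback programmatically. A minimal sketch using the client's list_runs and list_feedback methods (the project name matches the one passed to run_on_dataset above):

# Fetch the runs from the test project, then print the feedback attached to them.
runs = list(client.list_runs(project_name="custom-function-test-1"))
for feedback in client.list_feedback(run_ids=[run.id for run in runs]):
    print(feedback.key, feedback.score)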

Eval test run screenshot

You can click "Open Run" to view the trace and feedback generated for that example.

Eval trace screenshot

To compare to another test on this dataset, you can click "Compare Tests".

Compare Tests screenshot

More on evaluation

Congratulations! You've now created a dataset and used it to evaluate your agent or LLM. To learn more about evaluation chains available out of the box, check out the LangChain Evaluators guide. To learn how to make your own custom evaluators, review the Custom Evaluator guide.

