
Evaluation Overview

LLM applications put a probabilistic model at the center of your system. Because LLMs are non-deterministic, it is important to be able to evaluate your application properly. Having an evaluation framework that you trust will allow you to:

  • Experiment with different prompts and accurately determine if changes improve performance
  • Iterate on aspects of your RAG architecture without guesswork
  • Quickly decide if it is worth upgrading to a new, recently released model
  • Determine if you can use a cheaper, faster model in place of a more expensive one and get similar performance
  • Flag when a model behind an API changes suddenly in an unexpected way
  • Have confidence your application will not produce biased or unwanted responses

Testing and evaluation help expose issues so you can decide how best to address them, whether through different architectural choices, better models or prompts, additional code checks, or other means.
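
To make this concrete, here is a minimal sketch of an evaluation loop. The `run_app` callable, the test cases, and the exact-match scorer are hypothetical placeholders for illustration only; real evaluations typically use curated datasets and richer scorers such as heuristics or LLM-as-judge.

```python
# Minimal evaluation-loop sketch (illustrative only).
# `run_app`, the test cases, and the scorer are hypothetical stand-ins
# for your own application entry point, dataset, and evaluation criteria.

test_cases = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "Who wrote Hamlet?", "expected": "Shakespeare"},
]

def exact_match(output: str, expected: str) -> bool:
    """A simple scorer; real evaluations often use heuristics or LLM judges."""
    return expected.lower() in output.lower()

def evaluate(run_app, cases) -> float:
    """Run the application over each case and return the fraction that pass."""
    passed = sum(exact_match(run_app(c["input"]), c["expected"]) for c in cases)
    return passed / len(cases)

# Example: compare two configurations (e.g., different prompts or models)
# baseline_score = evaluate(run_app_baseline, test_cases)
# candidate_score = evaluate(run_app_candidate, test_cases)
```

With scores like these in hand, you can compare prompts, models, or architectural changes on the same dataset rather than relying on spot checks.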

To get started, check out the Quick Start Guide.

After that, peruse the Concepts Section to better understand the different components involved.

If you want to learn how to accomplish a particular task, check out our comprehensive How-To Guides.

For a higher-level set of recommendations on how to think about testing and evaluating your LLM app, check out the evaluation recommendations page.
