
Evaluation Overview

LLM applications put a probabilistic model at the center of your system. Because LLMs are non-deterministic, it is important to be able to evaluate your application properly. Having an evaluation framework that you trust will allow you to:

  • Experiment with different prompts and accurately determine if changes improve performance
  • Iterate on aspects of your RAG architecture without guesswork
  • Quickly decide if it is worth upgrading to a new, recently released model
  • Determine if you can use a cheaper, faster model in place of a more expensive one and get similar performance
  • Flag when a model behind an API changes suddenly in an unexpected way
  • Have confidence your application will not produce biased or unwanted responses

Testing and evaluation help expose issues so you can decide how best to address them, whether through different architectural choices, better models or prompts, additional code checks, or other means.
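
To make this concrete, here is a minimal sketch of an evaluation loop. The `run_app` callable, the test cases, and the exact-match scorer are hypothetical placeholders for illustration only; real evaluations typically use curated datasets and richer scorers such as heuristics or LLM-as-judge.

```python
# Minimal evaluation-loop sketch (illustrative only).
# `run_app`, the test cases, and the scorer are hypothetical stand-ins
# for your own application entry point, dataset, and evaluation criteria.

test_cases = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "Who wrote Hamlet?", "expected": "Shakespeare"},
]

def exact_match(output: str, expected: str) -> bool:
    """A simple scorer; real evaluations often use heuristics or LLM judges."""
    return expected.lower() in output.lower()

def evaluate(run_app, cases) -> float:
    """Run the application over each case and return the fraction that pass."""
    passed = sum(exact_match(run_app(c["input"]), c["expected"]) for c in cases)
    return passed / len(cases)

# Example: compare two configurations (e.g., different prompts or models)
# baseline_score = evaluate(run_app_baseline, test_cases)
# candidate_score = evaluate(run_app_candidate, test_cases)
```

With scores like these in hand, you can compare prompts, models, or architectural changes on the same dataset rather than relying on spot checks.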

To get started, check out the Quick Start Guide.

After that, peruse the Concepts Section to better understand the different components involved.

If you want to learn how to accomplish a particular task, check out our comprehensive How-To Guides.

For a higher-level set of recommendations on how to think about testing and evaluating your LLM app, check out the evaluation recommendations page.
