Evaluations let you run logic against LLM responses. They help you benchmark models and prompts to find the best fit for your needs.

If you're interested in evaluating LLM responses you've already captured, take a look at our radars product.

Example ways to use evaluations:

  • Benchmark an LLM response against an ideal answer using cosine similarity
  • Find the cheapest model that fits all conditions
  • Ensure responses don't leak sensitive customer data
  • Ensure responses are not too long or too costly
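To make the first example concrete, here is a minimal sketch of a cosine similarity check between a response and an ideal answer. It is an illustration only, not the dashboard's implementation: the function names and the 0.8 threshold are assumptions, and it compares simple word-count vectors, whereas a production check would more likely compare embedding vectors.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count how often each word occurs."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two strings."""
    va, vb = bag_of_words(a), bag_of_words(b)
    # Only words present in both texts contribute to the dot product.
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def passes_similarity(response: str, ideal: str, threshold: float = 0.8) -> bool:
    """Hypothetical condition: the response is close enough to the ideal answer."""
    return cosine_similarity(response, ideal) >= threshold
```

Word counts ignore word order and synonyms, so this version only rewards shared vocabulary. Swapping `bag_of_words` for an embedding lookup keeps the rest of the check unchanged.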

You can create evaluations on the dashboard by picking models and conditions.


Evaluations can be created and run automatically on the dashboard with 20+ models, but they can also be set up to run directly in your code for advanced use cases, or in your CI pipeline.
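As a rough sketch of what running conditions in your own code might look like, the example below checks a set of already collected results against length, cost, and data-leak conditions, then picks the cheapest model that passes. It does not use our SDK: `Result`, `failures`, `cheapest_passing`, `exit_code`, the limits, and the email pattern are all hypothetical names and values chosen for illustration.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Result:
    model: str       # hypothetical model label
    response: str    # text returned by the model
    cost_usd: float  # cost of the call, supplied by the caller


# Deliberately simple pattern; real leak detection needs broader rules.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def failures(result: Result, max_chars: int = 500, max_cost_usd: float = 0.01) -> list[str]:
    """Return the list of conditions this result violates (empty if none)."""
    problems = []
    if len(result.response) > max_chars:
        problems.append("response too long")
    if result.cost_usd > max_cost_usd:
        problems.append("response too costly")
    if EMAIL.search(result.response):
        problems.append("response contains an email address")
    return problems


def cheapest_passing(results: list[Result]) -> Result | None:
    """Pick the lowest-cost result that violates no condition."""
    passing = [r for r in results if not failures(r)]
    return min(passing, key=lambda r: r.cost_usd, default=None)


def exit_code(results: list[Result]) -> int:
    """Return 0 if at least one model passes every condition, else 1."""
    best = cheapest_passing(results)
    if best is None:
        for r in results:
            print(f"{r.model}: {', '.join(failures(r))}")
        return 1
    print(f"cheapest passing model: {best.model} (${best.cost_usd})")
    return 0
```

In a CI job, a script could end with `sys.exit(exit_code(results))` so the pipeline fails when no model meets every condition.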

Questions? We're here to help.