Evaluations

Evaluations let you run logic against LLM responses. They help you benchmark models and prompts to find the best fit for your needs.

If you're interested in evaluating LLM responses you've already captured, take a look at our radars product.

Example ways to use evaluations:

  • Benchmark an LLM response against an ideal answer using cosine similarity (see the sketch after this list)
  • Find the cheapest model that satisfies all of your conditions
  • Ensure responses don't leak sensitive customer data
  • Ensure responses are not too long or too costly
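
For example, the cosine-similarity benchmark above can be expressed as a small scoring function. This is a minimal sketch rather than the dashboard's built-in implementation; the `embed` argument is a placeholder for whatever text-embedding function you use, and the threshold is an arbitrary example value.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def passes_similarity_check(response: str, ideal: str, embed, threshold: float = 0.85) -> bool:
    """Return True if the response is semantically close enough to the ideal answer.

    `embed` is a hypothetical text-embedding function you supply;
    the default threshold is an example value to tune for your use case.
    """
    return cosine_similarity(embed(response), embed(ideal)) >= threshold
```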

You can create evaluations on the dashboard by picking models and conditions:

[Screenshot: creating an evaluation on the dashboard]

Evaluations can be created and run automatically on the dashboard with 20+ models, but they can also be set up to run directly in your code for advanced use cases, or as part of your CI pipeline.
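
As a rough sketch of the in-code/CI approach (not the product's actual SDK API), you could wrap checks like the ones above in a plain pytest test that fails the pipeline whenever a response breaks a condition. Here `run_prompt` and `embed` are hypothetical helpers standing in for your own model call and embedding function, and the thresholds are example values.

```python
# test_llm_evals.py -- run with `pytest` inside your CI pipeline.
# `run_prompt`, `embed`, and `passes_similarity_check` are hypothetical helpers
# standing in for your own model call, embedding function, and the similarity
# check sketched earlier; thresholds are example values only.
from my_llm_helpers import run_prompt, embed, passes_similarity_check  # hypothetical module

MAX_CHARS = 1200  # example length budget

def test_refund_policy_summary():
    ideal = "Refunds are accepted within 30 days of purchase."
    response = run_prompt("Summarize our refund policy in one sentence.")

    # The response should not be too long.
    assert len(response) <= MAX_CHARS

    # The response should stay close to the ideal answer (cosine similarity).
    assert passes_similarity_check(response, ideal, embed, threshold=0.8)

    # The response should not leak sensitive customer data.
    assert "customer_id" not in response.lower()
```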

Questions? We're here to help. Email us.