Open-Source
LLM Evaluation Platform

Compare models and prompts to find the best for your use case.
Ensure agents perform as expected.

Create an evaluation — it's free


More than 5,000 AI developers have chosen Lunary to build better chatbots

Islandsbanki · Bandwidth · Orange · ByteDance · Close · DHL

CI/CD integration
Easily integrate evaluations into your CI/CD pipeline to ensure no regressions are introduced (see the sketch below).

AI-powered checks
Use our library of AI-powered assertors based on industry standards.

No API keys
Run evaluations without the need for inference API keys. We take care of the infrastructure.
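
For example, a regression gate can reuse the same SDK calls shown in the code sample further down this page inside a test that fails the build when a checklist stops passing. This is a minimal sketch, assuming the lunary.get_dataset and lunary.evaluate calls from that sample; the pytest-style test function and the call_model helper are placeholders for your own setup, not part of Lunary's documented API.

import lunary

def call_model(prompt: str) -> str:
    # Placeholder: swap in your own agent or model call.
    raise NotImplementedError

def test_no_regressions():
    # Runs in CI (e.g. via pytest) and fails the job on any regression.
    dataset = lunary.get_dataset("my-dataset")
    for item in dataset:
        passed, results = lunary.evaluate(
            checklist="some-slug",
            input=item.input,
            output=call_model(item.input),
            ideal_output=item.ideal_output,
        )
        assert passed, f"Checklist failed for input: {item.input!r}"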

Powerful evaluation engine

Benchmark results

Run benchmarks

Compare models, settings, and prompts to find the best one for your use case.

Define success metrics

Use our set of predefined metrics or define your own to evaluate your models.
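
A custom metric can be as simple as a function that scores the model output against the reference answer. The snippet below is only an illustrative sketch in plain Python, not Lunary's metric API; the exact_match name and signature are assumptions.

def exact_match(output: str, ideal_output: str) -> bool:
    # Illustrative custom metric (not Lunary's API): passes when the model
    # output matches the reference answer, ignoring case and whitespace.
    return output.strip().lower() == ideal_output.strip().lower()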

Run benchmarks from the dashboard...

(and expect it to keep improving, as we ship often)

SDKs

Any LLM. Any framework.

Seamless integration with zero friction. Our SDKs are designed to be lightweight and integrate naturally into your codebase.

import lunary

# Pull the evaluation dataset stored in Lunary.
dataset = lunary.get_dataset("my-dataset")

for item in dataset:
    prompt = item.input
    result = my_llm_agent(prompt)  # your own agent or model call

    # Check the output against the checklist and the reference answer.
    passed, results = lunary.evaluate(
        checklist="some-slug",
        output=result,
        input=prompt,
        ideal_output=item.ideal_output,
    )

    print(passed)

Minutes to magic.

Self-host or go cloud and get started in minutes.

Open Source

Self Hostable

1-line Integration

Prompt Templates

Chat Replays

Analytics

Topic Classification

Agent Tracing

Custom Dashboards

Score LLM responses

PII Masking

Feedback Tracking
