Analyzing Test Results with LangChain TestResult

Posted: Feb 7, 2025.

When evaluating LLMs and chains in LangChain, you often need to analyze test results and feedback metrics. The TestResult class provides specialized functionality for working with evaluation data by extending Python's built-in dict with methods specific to LangChain testing.

What is TestResult?

TestResult is a dictionary subclass designed to store and analyze the results of LangChain evaluations. It provides additional methods to work with feedback scores and convert results into pandas DataFrames for further analysis. This makes it particularly useful when you need to analyze the performance of your LLM applications.
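
In practice, you rarely construct a TestResult by hand; it is what LangSmith evaluation helpers such as run_on_dataset hand back. Below is a minimal sketch of that flow, assuming you have a LangSmith dataset named "my-eval-dataset" (a placeholder) and the langchain-openai package installed; exact arguments can vary between LangChain versions.

from langchain.smith import RunEvalConfig, run_on_dataset
from langchain_openai import ChatOpenAI  # assumption: langchain-openai is installed
from langsmith import Client

client = Client()

# The built-in "qa" evaluator scores answer correctness against dataset labels
eval_config = RunEvalConfig(evaluators=["qa"])

# Runs the model over every example in the dataset and collects feedback;
# recent LangChain versions return a TestResult from this call.
results = run_on_dataset(
    client=client,
    dataset_name="my-eval-dataset",  # hypothetical dataset name
    llm_or_chain_factory=ChatOpenAI(model="gpt-3.5-turbo"),
    evaluation=eval_config,
)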

Reference

Here are the key methods specific to TestResult:

Method | Description
get_aggregate_feedback() | Returns a pandas DataFrame containing quantiles for feedback scores across all feedback keys
to_dataframe() | Converts the test results into a pandas DataFrame for analysis

The class also inherits all standard dictionary methods like get(), update(), items(), etc.
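
Because TestResult is a dict subclass, those inherited methods behave exactly as they do on a plain dictionary. A small, self-contained sketch:

from langchain.smith.evaluation.runner_utils import TestResult

result = TestResult({'test_1': {'score': 0.95}})

# Standard dict methods work unchanged
print(result.get('test_1'))               # {'score': 0.95}
print(result.get('missing', 'no entry'))  # default returned for absent keys
print(list(result.keys()))                # ['test_1']
for key, value in result.items():
    print(key, value['score'])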

How to Use TestResult

Let's look at different ways to work with TestResult objects.

Basic Usage

TestResult objects work like regular dictionaries for storing test data:

from langchain.smith.evaluation.runner_utils import TestResult

# Create a new test result
results = TestResult({
    'test_1': {
        'score': 0.95,
        'feedback': 'Good response',
        'metadata': {'model': 'gpt-3.5'}
    },
    'test_2': {
        'score': 0.85,
        'feedback': 'Acceptable response',
        'metadata': {'model': 'gpt-3.5'}
    }
})

# Access results like a dictionary
print(results['test_1']['score'])  # Output: 0.95

Analyzing Feedback Scores

The get_aggregate_feedback() method is particularly useful for analyzing score distributions:

# Get quantile analysis of feedback scores
feedback_analysis = results.get_aggregate_feedback()
print(feedback_analysis)
"""
Output example:
           score
count     2.000
mean      0.900
std       0.071
min       0.850
25%       0.875
50%       0.900
75%       0.925
max       0.950
"""

Converting to DataFrame

For more detailed analysis, you can convert the results to a pandas DataFrame:

# Convert results to DataFrame for analysis
df = results.to_dataframe()
print(df)
"""
Output example:
       score           feedback           metadata
test_1  0.95    Good response    {'model': 'gpt-3.5'}
test_2  0.85    Acceptable...    {'model': 'gpt-3.5'}
"""

# Now you can use pandas operations
avg_score = df['score'].mean()
print(f"Average score: {avg_score}")

Working with Multiple Test Results

TestResult supports dictionary operations for combining multiple test results:

# Create another test result
more_results = TestResult({
    'test_3': {
        'score': 0.92,
        'feedback': 'Very good response',
        'metadata': {'model': 'gpt-4'}
    }
})

# Combine results
results.update(more_results)

# Get all scores
all_scores = [test_data['score'] for test_data in results.values()]
print(f"All scores: {all_scores}")

TestResult provides a structured way to handle evaluation results in LangChain, making it easier to analyze and understand the performance of your language models and chains. The combination of dictionary functionality with specialized analysis methods makes it a powerful tool for LLM evaluation workflows.
