Using LangChain's CassandraSemanticCache for Semantic-Based LLM Response Caching

Posted: Feb 19, 2025.

LangChain's CassandraSemanticCache enables you to cache LLM responses based on semantic similarity, allowing you to reuse responses for semantically similar prompts. This guide explains how to use this cache effectively with Apache Cassandra or compatible databases.

What is CassandraSemanticCache?

CassandraSemanticCache is a caching implementation that uses Cassandra as a vector store backend for semantic (similarity-based) lookup of cached LLM responses. Unlike traditional exact-match caching, it can return cached results for prompts that are semantically similar but not identical, potentially reducing API calls and improving response times.
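
To make the matching idea concrete, here is a rough conceptual sketch (illustrative only, not the library's internal code): two prompts are treated as equivalent when the similarity of their embeddings exceeds a configured threshold.

# Illustrative only: how "semantically similar" is decided in principle.
import numpy as np
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
v1 = np.array(embeddings.embed_query("What is the capital of France?"))
v2 = np.array(embeddings.embed_query("Which city is the capital of France?"))

# Cosine similarity between the two prompt embeddings
similarity = float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
print(similarity)  # a value above the cache's score threshold would count as a hit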

Reference

Here are the key methods available in CassandraSemanticCache:

Method                  Description
lookup                  Look up cached results based on prompt and LLM string
update                  Store new results in the cache
clear                   Clear the entire semantic cache
lookup_with_id          Look up results and return the matching document ID if found
delete_by_document_id   Delete a cached entry by its document ID
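
LangChain normally calls these methods for you once the cache is set globally, but they can also be used directly. The sketch below assumes cache is an already configured CassandraSemanticCache; llm_string is the serialized LLM configuration string that LangChain uses as part of the cache key.

# Sketch: calling the cache methods directly (assumes `cache` is configured).
from langchain_core.outputs import Generation

llm_string = "<serialized LLM config>"  # in practice produced by LangChain, not hand-written

# Store a result
cache.update("What is the capital of France?", llm_string, [Generation(text="Paris")])

# Semantic lookup: a similar prompt can return the cached generations
hit = cache.lookup("Which city is the capital of France?", llm_string)

# Look up an entry's document ID, then delete that entry
entry = cache.lookup_with_id("What is the capital of France?", llm_string)
if entry:
    doc_id, _generations = entry
    cache.delete_by_document_id(doc_id)

# Remove everything
cache.clear()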

How to Use CassandraSemanticCache

Setting up the Cache

First, install the required dependencies (cassio for the Cassandra integration, plus the LangChain packages used in the examples below):

pip install "cassio>=0.1.6" langchain langchain-community langchain-openai

Then initialize the cache:

from langchain.globals import set_llm_cache
from langchain_community.cache import CassandraSemanticCache
from langchain_openai import OpenAIEmbeddings

# Initialize Cassandra connection
import cassio
cassio.init(auto=True)  # Requires environment variables

# Create embeddings instance
embeddings = OpenAIEmbeddings()

# Set up the cache
set_llm_cache(CassandraSemanticCache(
    embedding=embeddings,
    table_name="my_semantic_cache",
    score_threshold=0.85  # Adjust similarity threshold
))
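
If you would rather not rely on environment-driven cassio.init, the cache also accepts an explicit session and keyspace. This is a sketch assuming a locally reachable Cassandra node and an existing keyspace named demo_keyspace:

# Sketch: explicit connection instead of cassio.init(auto=True).
# Assumes a Cassandra node on localhost and an existing keyspace "demo_keyspace".
from cassandra.cluster import Cluster
from langchain.globals import set_llm_cache
from langchain_community.cache import CassandraSemanticCache
from langchain_openai import OpenAIEmbeddings

session = Cluster(["127.0.0.1"]).connect()

set_llm_cache(CassandraSemanticCache(
    session=session,
    keyspace="demo_keyspace",
    embedding=OpenAIEmbeddings(),
    table_name="my_semantic_cache",
))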

Basic Usage

Once configured, the cache is used automatically for your LLM calls:

from langchain_openai import OpenAI

llm = OpenAI()

# First call - will hit the API
response1 = llm.invoke("What is the capital of France?")

# Semantically similar question - might use cached response
response2 = llm.invoke("Which city is the capital of France?")
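
A simple way to check whether a call was served from the cache is to compare wall-clock times: a cache hit typically returns in milliseconds, since only an embedding lookup runs instead of a full model call. Re-running the two prompts above:

# Rough check: cached answers come back much faster than fresh API calls.
import time

start = time.perf_counter()
llm.invoke("What is the capital of France?")
print(f"first call:  {time.perf_counter() - start:.2f}s")   # hits the OpenAI API

start = time.perf_counter()
llm.invoke("Which city is the capital of France?")
print(f"second call: {time.perf_counter() - start:.2f}s")   # may be served from the cache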

Advanced Configuration

You can customize the cache behavior:

cache = CassandraSemanticCache(
    embedding=embeddings,
    table_name="custom_cache",
    score_threshold=0.90,  # Higher threshold for stricter matching
    ttl_seconds=3600,      # Cache entries expire after 1 hour
    similarity_measure="cos"  # Use cosine similarity
)
set_llm_cache(cache)

Invalidating Cache Entries

To remove specific entries from the cache:

# First get the document ID through a lookup.
# llm_string is the serialized LLM configuration string that LangChain uses as
# part of the cache key; it must match the one used when the entry was stored.
result = cache.lookup_with_id("What is the capital of France?", llm_string)
if result:
    doc_id, _ = result
    # Then delete using the ID
    cache.delete_by_document_id(doc_id)

Clearing the Entire Cache

To clear all cached entries:

cache.clear()

The CassandraSemanticCache provides a powerful way to optimize your LLM applications by reducing duplicate API calls while maintaining response quality through semantic matching. By properly tuning the similarity threshold and TTL settings, you can balance cache hit rates with response accuracy for your specific use case.
