Using LangChain Cassandra Vector Store for Document Storage and Retrieval

Posted: Feb 10, 2025.

The Cassandra vector store in LangChain provides a way to store and search documents using Apache Cassandra® or compatible databases that support vector search capabilities. In this guide, we'll explore how to use this vector store implementation effectively.

What is Cassandra Vector Store?

The Cassandra vector store is a LangChain integration that allows you to store documents and their vector embeddings in Apache Cassandra® or compatible databases (like Astra DB). It supports various search capabilities including:

  • Vector similarity search
  • Metadata filtering
  • Hybrid search (combining vector similarity with text search)
  • Maximal marginal relevance (MMR) search

The implementation requires Cassandra 5.0+ or a compatible database that supports vector capabilities.

Reference

Key methods of the Cassandra vector store:

Method                            Description
add_texts()                       Add raw text documents with optional metadata and IDs
add_documents()                   Add Document objects with metadata
similarity_search()               Search for similar documents by text query
similarity_search_with_score()    Similar to above but includes relevance scores
max_marginal_relevance_search()   Search optimizing for both relevance and diversity
delete()                          Remove documents by their IDs
clear()                           Empty the entire vector store
metadata_search()                 Search documents by metadata filters

How to Use Cassandra Vector Store

Setup and Initialization

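First, install the required packages. The examples below also import from langchain-community and langchain-openai, so install those alongside cassio:

pip install "cassio>=0.1.4" langchain-community langchain-openai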

There are two ways to initialize the vector store:

  1. Using a Cassandra cluster:
from cassandra.cluster import Cluster
from langchain_community.vectorstores import Cassandra
from langchain_openai import OpenAIEmbeddings

# Connect to Cassandra cluster
cluster = Cluster(["127.0.0.1"])  # contact points of your cluster
session = cluster.connect()

# Initialize vector store
embeddings = OpenAIEmbeddings()
vectorstore = Cassandra(
    embedding=embeddings,
    session=session,
    keyspace="my_keyspace",
    table_name="my_vectors"
)
  2. Using Astra DB:
import cassio
from langchain_community.vectorstores import Cassandra
from langchain_openai import OpenAIEmbeddings

# Initialize connection with Astra DB
cassio.init(
    database_id="your-db-id",
    token="your-token",
    keyspace="your-keyspace"
)

# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Cassandra(
    embedding=embeddings,
    table_name="my_vectors"
)
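
Hard-coding the database ID and token, as in the snippet above, is fine for a quick test but risky in shared code. A minimal sketch of reading them from the environment instead follows. The helper name and the three environment variable names are assumptions made for this example, not names that cassio itself reads; only the `database_id`, `token`, and `keyspace` arguments come from the call shown above.

```python
import os


def astra_settings_from_env(env=None):
    """Build the keyword arguments for cassio.init() from environment variables.

    The variable names below are hypothetical; pick whatever fits your deployment.
    """
    env = os.environ if env is None else env
    arg_to_var = {
        "database_id": "ASTRA_DB_ID",
        "token": "ASTRA_DB_APPLICATION_TOKEN",
        "keyspace": "ASTRA_DB_KEYSPACE",
    }
    # Fail early with a readable message instead of a connection error later.
    missing = [var for var in arg_to_var.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {arg: env[var] for arg, var in arg_to_var.items()}
```

With this in place, the initialization call becomes `cassio.init(**astra_settings_from_env())`.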

Adding Documents

You can add documents in two ways:

# Add raw texts
texts = ["Document 1 content", "Document 2 content"]
metadata = [{"source": "web"}, {"source": "pdf"}]
ids = vectorstore.add_texts(texts=texts, metadatas=metadata)

# Add Document objects
from langchain_core.documents import Document
docs = [
    Document(page_content="Doc 1", metadata={"source": "web"}),
    Document(page_content="Doc 2", metadata={"source": "pdf"})
]
ids = vectorstore.add_documents(docs)
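
Both calls above let the store generate IDs. If you ingest the same content more than once, passing your own IDs can help avoid duplicates, since a Cassandra write to an existing row key overwrites that row. The sketch below derives an ID from each text's content; `stable_ids` is a hypothetical helper written for this example, and hashing the text is only one possible scheme.

```python
import hashlib


def stable_ids(texts):
    """Return one deterministic ID per text, derived from its content."""
    return [hashlib.sha256(text.encode("utf-8")).hexdigest()[:32] for text in texts]
```

It would be used as `vectorstore.add_texts(texts=texts, metadatas=metadata, ids=stable_ids(texts))`. Note that two documents with identical text but different metadata would share an ID under this scheme, so include the distinguishing fields in the hashed string if that matters for your data.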

Searching Documents

Basic similarity search:

# Search similar documents
results = vectorstore.similarity_search(
    "search query",
    k=3  # number of results
)

# Search with metadata filter
results = vectorstore.similarity_search(
    "search query",
    filter={"source": "web"}
)

# Search with scores
results = vectorstore.similarity_search_with_score(
    "search query",
    k=3
)
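
The scored variant returns (document, score) pairs rather than bare documents. A common follow-up is to drop weak matches before passing results to a prompt. The sketch below assumes that a higher score means a closer match, which you should confirm for your similarity metric; `keep_above` and the threshold value are illustrative, not part of the LangChain API.

```python
def keep_above(scored_results, min_score):
    """Keep only the (document, score) pairs whose score is at least min_score."""
    return [(doc, score) for doc, score in scored_results if score >= min_score]
```

For example, `keep_above(vectorstore.similarity_search_with_score("search query", k=10), 0.8)` would keep only the matches scoring 0.8 or higher.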

Using MMR search for diversity:

results = vectorstore.max_marginal_relevance_search(
    "search query",
    k=3,  # number of results
    fetch_k=10,  # number of initial results to rerank
    lambda_mult=0.5  # diversity factor (0=max diversity, 1=max relevance)
)
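
To make the effect of lambda_mult concrete, here is a small pure-Python sketch of the textbook MMR selection rule applied to toy vectors. It illustrates the idea only and is not the library's internal code: each round picks the candidate with the best value of lambda_mult * relevance - (1 - lambda_mult) * redundancy, where redundancy is the candidate's highest similarity to anything already selected.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, candidate_vecs, k, lambda_mult=0.5):
    """Return the indices of up to k candidates chosen by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, candidate_vecs[i])
            # Highest similarity to any already-selected candidate (0 if none yet).
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]  # the first two are near-duplicates
print(mmr(query, candidates, k=2, lambda_mult=1.0))  # [0, 1]: pure relevance
print(mmr(query, candidates, k=2, lambda_mult=0.3))  # [0, 2]: near-duplicate skipped
```

In the vector store call, fetch_k plays the role of the candidate list size: the store first retrieves fetch_k nearest neighbours, then reranks them down to k.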

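If using Astra DB, you can combine vector similarity with text search. This generally requires the table to have been created with a text index on the document body (the body_index_options argument of the Cassandra constructor):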

results = vectorstore.similarity_search(
    "search query",
    k=3,
    body_search="specific terms"  # Text search filter
)

Managing Documents

Delete specific documents:

# Delete by IDs
vectorstore.delete(ids=["id1", "id2"])

# Delete all documents matching metadata
vectorstore.delete_by_metadata_filter({"source": "web"})

# Clear entire store
vectorstore.clear()

The Cassandra vector store provides a robust solution for document storage and retrieval, with support for advanced features like hybrid search and MMR. Its integration with Cassandra and Astra DB makes it suitable for production deployments requiring scalability and high availability.

Remember to properly handle the database connection and clean up resources when you're done. The vector store capabilities will depend on your underlying database version and configuration.
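
One way to make that cleanup automatic when you manage the cluster yourself is a small context manager. The sketch below assumes an object with the `connect()` and `shutdown()` methods of the driver's `cassandra.cluster.Cluster`; `open_session` is a hypothetical helper name.

```python
from contextlib import contextmanager


@contextmanager
def open_session(cluster):
    """Yield a session from a Cluster-like object and shut the cluster down on exit."""
    session = cluster.connect()
    try:
        yield session
    finally:
        # Runs even if the body raises, so connections are not leaked.
        cluster.shutdown()
```

Usage would look like `with open_session(Cluster(["127.0.0.1"])) as session:` followed by building the vector store and running queries inside the block.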
