Loading Documents from Cassandra with LangChain

Posted: Feb 17, 2025.

The CassandraLoader in LangChain provides a convenient way to load documents from Apache Cassandra databases. This guide will show you how to use it effectively to retrieve and process your data.

What is CassandraLoader?

CassandraLoader is a document loader that allows you to fetch data from Apache Cassandra, a NoSQL database, and convert it into LangChain Document objects. You can load data either by specifying a table name or by providing a custom CQL query. The loader supports both synchronous and asynchronous operations.

Reference

Here are the key methods available in CassandraLoader:

Method              Description
load()              Synchronously loads data and returns a list of Document objects
aload()             Asynchronously loads data and returns a list of Document objects
lazy_load()         Creates a synchronous iterator of Document objects
alazy_load()        Creates an asynchronous iterator of Document objects
load_and_split()    Loads documents and splits them into chunks using a TextSplitter

How to Use CassandraLoader

Basic Setup

There are two ways to initialize the CassandraLoader:

1. Using a Cassandra Driver Session

from cassandra.cluster import Cluster
from langchain_community.document_loaders import CassandraLoader

# Create a Cassandra session
cluster = Cluster()
session = cluster.connect()

# Initialize the loader
loader = CassandraLoader(
    table="movie_reviews",
    session=session,
    keyspace="my_keyspace"
)

# Load documents
documents = loader.load()

2. Using Cassio

import cassio
from langchain_community.document_loaders import CassandraLoader

# Initialize cassio
cassio.init(contact_points="127.0.0.1", keyspace="my_keyspace")

# No session or keyspace is passed: the loader uses the ones registered by cassio.init()
loader = CassandraLoader(
    table="movie_reviews"
)

# Load documents
documents = loader.load()

Custom Data Mapping

You can customize how row data is converted to document content and metadata:

def content_mapper(row):
    return f"{row.title}: {row.review_text}"

def metadata_mapper(row):
    return {
        "movie_id": str(row.id),
        "rating": row.rating
    }

loader = CassandraLoader(
    table="movie_reviews",
    page_content_mapper=content_mapper,
    metadata_mapper=metadata_mapper
)
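
To see what the two mappers produce without connecting to a database, the sketch below applies them to a stand-in row. It assumes rows expose their columns as attributes, as the driver's default named-tuple rows do. The `Row` type and its sample values are made up for illustration.

```python
from collections import namedtuple

# Stand-in for a driver row; the column names mirror the mappers above,
# and the values are made up for illustration.
Row = namedtuple("Row", ["id", "title", "review_text", "rating"])


def content_mapper(row):
    # Text that would become Document.page_content
    return f"{row.title}: {row.review_text}"


def metadata_mapper(row):
    # Dict that would become Document.metadata
    return {"movie_id": str(row.id), "rating": row.rating}


sample = Row(id=42, title="Heat", review_text="Tense and precise.", rating=5)
page_content = content_mapper(sample)
metadata = metadata_mapper(sample)
print(page_content)
print(metadata)
```

Checking mappers this way before pointing them at a real table can catch misspelled column names early.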

Using Custom Queries

Instead of specifying a table, you can use a custom CQL query:

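# The %(min_rating)s placeholder is filled from query_parameters by the driver.
# Depending on the schema, filtering on rating may need an index or ALLOW FILTERING.
loader = CassandraLoader(
    query="SELECT title, review_text FROM movie_reviews WHERE rating > %(min_rating)s",
    query_parameters={"min_rating": 4}
)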
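
If the threshold should come from application code rather than being hard-coded, one option is to bind it through a named placeholder. The sketch below only builds the keyword arguments and does not touch the database. `review_query_kwargs` is a hypothetical helper, the table and column names are assumptions carried over from the example, and the `%(name)s` placeholder style is the one the Cassandra Python driver uses for non-prepared statements.

```python
def review_query_kwargs(min_rating):
    """Build keyword arguments for a parameterized CassandraLoader query.

    Hypothetical helper: the table and column names follow the example
    above and are assumptions about the schema.
    """
    return {
        # %(min_rating)s is filled in by the driver from query_parameters
        "query": (
            "SELECT title, review_text FROM movie_reviews "
            "WHERE rating > %(min_rating)s"
        ),
        "query_parameters": {"min_rating": min_rating},
    }


kwargs = review_query_kwargs(4)
print(kwargs["query"])
print(kwargs["query_parameters"])
```

The result can be unpacked into the constructor, for example CassandraLoader(**review_query_kwargs(4)). Whether Cassandra accepts a range filter on rating depends on the table's keys and indexes.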

Async Loading

In async applications, use aload() so the event loop is not blocked while rows are fetched:

async def load_documents():
    loader = CassandraLoader(table="movie_reviews")
    documents = await loader.aload()
    return documents
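
Async loading tends to pay off when several loads can overlap. The sketch below gathers results from multiple loaders concurrently with asyncio.gather. `StubLoader` is a stand-in (not part of LangChain) that mimics only the `aload()` coroutine so the example runs without a database, and the `book_reviews` table name is made up.

```python
import asyncio


class StubLoader:
    """Stand-in for CassandraLoader exposing only an async aload()."""

    def __init__(self, table, rows):
        self.table = table
        self._rows = rows

    async def aload(self):
        # Simulate waiting on the database without blocking the event loop
        await asyncio.sleep(0.01)
        return [f"{self.table}:{row}" for row in self._rows]


async def load_all(loaders):
    # Start every aload() at once and wait for all of them to finish;
    # gather returns results in the same order as the loaders.
    results = await asyncio.gather(*(loader.aload() for loader in loaders))
    # Flatten the per-loader lists into a single list
    return [doc for docs in results for doc in docs]


loaders = [
    StubLoader("movie_reviews", ["r1", "r2"]),
    StubLoader("book_reviews", ["r3"]),
]
documents = asyncio.run(load_all(loaders))
print(documents)
```

With real loaders, pass CassandraLoader instances instead of the stubs; each aload() then returns a list of Document objects rather than strings.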

Lazy Loading

When dealing with large datasets, you can use lazy loading to conserve memory:

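loader = CassandraLoader(table="movie_reviews")

# Synchronous lazy loading: documents are yielded one at a time.
# process_document is a placeholder for your own handler.
for document in loader.lazy_load():
    process_document(document)

# Async lazy loading: `async for` is only valid inside a coroutine.
# aprocess_document is a placeholder for your own async handler.
async def process_all():
    async for document in loader.alazy_load():
        await aprocess_document(document)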
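
Lazy loading pairs well with batching when downstream processing works on groups of documents. The helper below is a sketch that groups any iterator into fixed-size lists without reading the whole input first. `batched` is a hypothetical helper, and the integers stand in for Document objects.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items without materializing the input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        # Pull at most `size` items from the shared iterator
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Integers stand in for the documents produced by loader.lazy_load()
batches = list(batched(iter(range(7)), 3))
print(batches)
```

With a real loader this would read roughly as: for batch in batched(loader.lazy_load(), 100), followed by whatever per-batch processing the application needs.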

The CassandraLoader makes it easy to integrate Cassandra data into your LangChain applications. In production, create one Cluster and Session at startup, reuse them across loaders (the driver maintains its own per-host connection pools), and call cluster.shutdown() when the application exits.
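
One way to make sure connections are released is to wrap the session in a context manager. This is a sketch: `open_session` is a hypothetical helper that relies only on the driver's `Cluster.connect()` and `Cluster.shutdown()` methods, and `FakeCluster` exists purely so the example runs without a database.

```python
from contextlib import contextmanager


@contextmanager
def open_session(cluster, keyspace=None):
    """Yield a session and always shut the cluster down afterwards."""
    try:
        session = cluster.connect(keyspace)
        yield session
    finally:
        # Runs on normal exit and on errors, releasing the cluster's connections
        cluster.shutdown()


class FakeCluster:
    """Stand-in with the two Cluster methods used above."""

    def __init__(self):
        self.is_shutdown = False

    def connect(self, keyspace=None):
        return f"session:{keyspace}"

    def shutdown(self):
        self.is_shutdown = True


cluster = FakeCluster()
with open_session(cluster, "my_keyspace") as session:
    print(session)
print(cluster.is_shutdown)
```

With the real driver, pass a Cluster from cassandra.cluster in place of the fake and hand the yielded session to CassandraLoader(session=...).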

When working with large datasets, consider using the lazy loading methods to prevent memory issues, and take advantage of the async capabilities if you're building an async application.
