Combine Document Loaders in LangChain with MergedDataLoader

Posted: Jan 31, 2025.

When working with documents in LangChain, you might need to load data from multiple sources simultaneously. The MergedDataLoader provides a convenient way to combine documents from different loaders into a single collection.

What is MergedDataLoader?

MergedDataLoader is a utility class in LangChain that allows you to combine multiple document loaders into a single loader. This is particularly useful when you need to process documents from different sources (like PDFs, web pages, or databases) in a unified way.

Reference

Here are the main methods available in MergedDataLoader:

- load(): Loads all documents from every loader and returns them as a single list
- lazy_load(): Returns an iterator that yields documents lazily from each loader in turn
- aload(): Asynchronously loads all documents from all loaders
- alazy_load(): Returns an async iterator that yields documents lazily
- load_and_split(): Loads all documents and splits them into chunks using a text splitter
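Conceptually, a merged loader just walks its child loaders in order and yields each one's documents in turn. The sketch below illustrates that behavior with plain Python; MiniMergedLoader and FakeLoader are illustrative stand-ins, not part of the real LangChain API:

```python
from itertools import chain

class FakeLoader:
    """Stand-in for a LangChain document loader (illustrative only)."""
    def __init__(self, docs):
        self.docs = docs

    def lazy_load(self):
        # Yield documents one at a time, like a real loader's lazy_load()
        yield from self.docs

class MiniMergedLoader:
    """Minimal sketch of MergedDataLoader's core behavior."""
    def __init__(self, loaders):
        self.loaders = loaders

    def lazy_load(self):
        # Chain each child loader's iterator, preserving loader order
        return chain.from_iterable(l.lazy_load() for l in self.loaders)

    def load(self):
        return list(self.lazy_load())

merged = MiniMergedLoader([FakeLoader(["web-1", "web-2"]), FakeLoader(["pdf-1"])])
print(merged.load())  # ['web-1', 'web-2', 'pdf-1']
```

Note that documents come out in loader order: everything from the first loader, then everything from the second, and so on.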

How to use MergedDataLoader

Let's look at different ways to use MergedDataLoader in your applications.

Basic Usage

The simplest way to use MergedDataLoader is to combine multiple loaders and load all documents at once:

from langchain_community.document_loaders import WebBaseLoader, PyPDFLoader
from langchain_community.document_loaders.merge import MergedDataLoader

# Initialize individual loaders
web_loader = WebBaseLoader("https://example.com/article")
pdf_loader = PyPDFLoader("document.pdf")

# Combine loaders
merged_loader = MergedDataLoader(loaders=[web_loader, pdf_loader])

# Load all documents
documents = merged_loader.load()
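After loading, each document keeps a record of where it came from in its metadata (most loaders set a "source" key, though the exact keys vary by loader). A quick way to see how many documents each source contributed, sketched here with plain dictionaries standing in for Document objects:

```python
from collections import Counter

# Stand-ins for loaded Document objects; real ones expose a .metadata dict
documents = [
    {"metadata": {"source": "https://example.com/article"}},
    {"metadata": {"source": "document.pdf"}},
    {"metadata": {"source": "document.pdf"}},
]

# Tally documents per source (PDF loaders typically emit one document per page)
counts = Counter(doc["metadata"].get("source", "unknown") for doc in documents)
print(counts)  # Counter({'document.pdf': 2, 'https://example.com/article': 1})
```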

Lazy Loading

When dealing with large documents, you might want to load them lazily to manage memory usage:

# Using synchronous lazy loading
for document in merged_loader.lazy_load():
    # Process each document as soon as it is yielded
    process_document(document)

# Using async lazy loading (must run inside an async function)
async def process_documents():
    async for document in merged_loader.alazy_load():
        # Process each document asynchronously
        await process_document(document)
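The async iteration pattern can be seen end to end with plain asyncio, no LangChain required. The functions below are illustrative stand-ins: fake_alazy_load mimics a loader's alazy_load(), and merged_alazy_load drains each loader in order, the same sequential behavior sketched earlier:

```python
import asyncio

async def fake_alazy_load(docs):
    """Stand-in for a loader's alazy_load(); yields documents asynchronously."""
    for doc in docs:
        await asyncio.sleep(0)  # simulate async I/O between documents
        yield doc

async def merged_alazy_load(loaders):
    """Sketch of merged async lazy loading: drain each loader in turn."""
    for docs in loaders:
        async for doc in fake_alazy_load(docs):
            yield doc

async def main():
    seen = []
    async for doc in merged_alazy_load([["a", "b"], ["c"]]):
        seen.append(doc)  # each document is available before the rest are loaded
    return seen

print(asyncio.run(main()))  # ['a', 'b', 'c']
```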

Loading and Splitting Documents

If you need to split your documents into smaller chunks, you can use the load_and_split() method:

from langchain_text_splitters import CharacterTextSplitter

# Create a text splitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

# Load and split documents
split_docs = merged_loader.load_and_split(text_splitter=text_splitter)
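The chunking idea itself is easy to picture: with no overlap, a fixed-size splitter just slices each document's text into windows of at most chunk_size characters. The split_text function below is a deliberately naive sketch; the real CharacterTextSplitter also respects separators and chunk overlap:

```python
def split_text(text, chunk_size=1000):
    """Naive fixed-size chunker (illustrative; real splitters honor separators)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# A 2,500-character document becomes two full chunks plus a 500-character tail
chunks = split_text("x" * 2500, chunk_size=1000)
print([len(c) for c in chunks])  # [1000, 1000, 500]
```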

Asynchronous Loading

For better performance in async applications, you can use the asynchronous loading method:

async def load_documents():
    # Initialize loaders
    loaders = [
        WebBaseLoader("https://example.com/page1"),
        WebBaseLoader("https://example.com/page2"),
        PyPDFLoader("document.pdf")
    ]
    
    merged_loader = MergedDataLoader(loaders=loaders)
    
    # Load documents asynchronously
    documents = await merged_loader.aload()
    return documents
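From synchronous code, a coroutine like load_documents() is driven with asyncio.run. The snippet below shows that pattern with a stub coroutine standing in for the real one, since the real call needs network access and the loaders' dependencies installed:

```python
import asyncio

async def load_documents_stub():
    # Stand-in for the load_documents() coroutine defined above
    await asyncio.sleep(0)  # simulate awaiting merged_loader.aload()
    return ["doc-1", "doc-2"]

# asyncio.run creates an event loop, runs the coroutine, and returns its result
documents = asyncio.run(load_documents_stub())
print(documents)  # ['doc-1', 'doc-2']
```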

The MergedDataLoader is a powerful tool when you need to work with multiple document sources in your LangChain applications. It provides flexibility in how you load and process documents, whether you need them all at once or prefer to process them one at a time.

Remember that the performance and memory usage will depend on how you choose to load the documents (lazy vs. eager loading) and the size of your document sources. Choose the appropriate loading method based on your specific use case and requirements.
