Loading Org-mode Files in LangChain with UnstructuredOrgModeLoader

Posted: Nov 22, 2024.

When working with Emacs Org-mode files in LangChain, the UnstructuredOrgModeLoader provides a flexible way to load and process your documents. This guide will show you how to effectively use this loader in your applications.

What is UnstructuredOrgModeLoader?

UnstructuredOrgModeLoader is a specialized document loader in LangChain designed to handle Org-mode files - a document format commonly used in Emacs for notes, planning, and authoring. It leverages the Unstructured library to parse and extract content from .org files, offering different modes of operation to suit various use cases.

Reference

Here are the key methods and parameters of UnstructuredOrgModeLoader:

Method/Parameter	Description
`__init__(file_path, mode='single', **unstructured_kwargs)`	Constructor that takes file path, mode, and additional Unstructured parameters
`load()`	Loads the document and returns a list of Document objects
`lazy_load()`	Returns an iterator of Document objects for memory-efficient loading
`aload()`	Async version of load()
`alazy_load()`	Async version of lazy_load()
`load_and_split(text_splitter=None)`	Loads and splits the document into chunks

The mode parameter can be:

'single': Returns the entire document as one Document object
'elements': Splits the document into elements (Title, NarrativeText, etc.)

How to Use UnstructuredOrgModeLoader

Basic Usage with Single Mode

The simplest way to use the loader is in 'single' mode, which processes the entire file as one document:

from langchain_community.document_loaders import UnstructuredOrgModeLoader

# Load the entire document as a single Document object
loader = UnstructuredOrgModeLoader("example.org", mode="single")
docs = loader.load()
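
To try the snippets in this guide you need an Org-mode file on disk. The following is a minimal sketch that writes one; the helper name `write_sample_org` and the sample content are made up for illustration, and any valid .org file works just as well.

```python
from pathlib import Path

# Hypothetical sample content: a title keyword, two headings, and a list.
SAMPLE_ORG = """\
#+TITLE: Example Notes

* Project plan
Some narrative text under the first heading.

** Tasks
- Write the loader script
- Review the output
"""


def write_sample_org(path="example.org"):
    """Write a small Org-mode file and return its path as a Path object."""
    target = Path(path)
    target.write_text(SAMPLE_ORG, encoding="utf-8")
    return target
```

Calling `write_sample_org()` once creates `example.org` in the working directory, which the loader calls in this guide can then read.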

Using Elements Mode

For more granular control, use 'elements' mode to split the document into different components:

loader = UnstructuredOrgModeLoader(
    "example.org",
    mode="elements",
)
docs = loader.load()

# Each element will be a separate Document object
for doc in docs:
    print(f"Content: {doc.page_content}")
    print(f"Metadata: {doc.metadata}")
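
Once the elements are loaded, it is often handy to bucket them by type. The sketch below assumes each Document's metadata carries a `category` key (such as Title or NarrativeText), which is what the loader typically sets in 'elements' mode; the helper name `group_by_category` is hypothetical, and it works on any object that exposes `page_content` and `metadata`.

```python
from collections import defaultdict


def group_by_category(docs):
    """Group page_content strings by the 'category' metadata key.

    Documents without a 'category' entry land under 'Uncategorized'.
    """
    groups = defaultdict(list)
    for doc in docs:
        category = doc.metadata.get("category", "Uncategorized")
        groups[category].append(doc.page_content)
    return dict(groups)
```

For example, `group_by_category(docs).get("Title", [])` would return only the heading texts, assuming the category assumption above holds for your installed versions.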

Adding Additional Processing Options

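You can pass additional keyword arguments, which the loader forwards to the Unstructured library. Which options take effect depends on the file type and on your installed Unstructured version:

loader = UnstructuredOrgModeLoader(
    "example.org",
    mode="elements",
    strategy="fast",  # Forwarded to Unstructured; may have no effect on .org files
    include_metadata=True  # Ask Unstructured to attach element metadata
)
docs = loader.load()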

Lazy Loading for Large Files

When dealing with large Org-mode files, lazy_load() yields Document objects one at a time instead of building the full list, which can reduce memory use in downstream processing:

loader = UnstructuredOrgModeLoader("large_file.org", mode="elements")
# Process documents one at a time
for doc in loader.lazy_load():
    # Process each document
    print(doc.page_content[:100])  # Print first 100 chars
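
A common follow-up is to hand the lazily produced documents to a downstream step in fixed-size batches, for example before embedding or indexing. The sketch below is one way to do that with the standard library; the helper name `in_batches` is hypothetical and not part of LangChain.

```python
from itertools import islice


def in_batches(documents, batch_size):
    """Yield lists of up to batch_size items from any iterable, lazily."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(documents)
    while True:
        # islice pulls at most batch_size items without consuming the rest
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

With a loader in hand, `for batch in in_batches(loader.lazy_load(), 50): ...` would process 50 documents at a time, so only one batch needs to be held by your own code at any moment.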

Async Loading

For applications requiring asynchronous operation:

import asyncio

async def load_documents():
    loader = UnstructuredOrgModeLoader("example.org", mode="elements")
    docs = await loader.aload()
    return docs

docs = asyncio.run(load_documents())
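
Async loading is most useful when several files are involved. The following sketch awaits `aload()` on a collection of loaders with `asyncio.gather` and flattens the results; the helper name `load_many` is hypothetical, and how much real concurrency you get depends on how the loader implements `aload()`.

```python
import asyncio


async def load_many(loaders):
    """Await aload() on every loader concurrently and flatten the results.

    Results are returned in the same order as the loaders were given.
    """
    results = await asyncio.gather(*(loader.aload() for loader in loaders))
    return [doc for docs in results for doc in docs]
```

Usage could look like `asyncio.run(load_many([UnstructuredOrgModeLoader(p, mode="elements") for p in paths]))`, where `paths` is your own list of .org file paths.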

Loading and Splitting Documents

To load and split the document into smaller chunks:

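from langchain_community.document_loaders import UnstructuredOrgModeLoader
from langchain_text_splitters import CharacterTextSplitter

loader = UnstructuredOrgModeLoader("example.org")
# Target chunks of about 1000 characters with no overlap between them
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = loader.load_and_split(text_splitter=text_splitter)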

Remember that UnstructuredOrgModeLoader requires the 'unstructured' package with its Org-mode extra, plus langchain-community, which provides the loader class. You can install both using pip:

pip install langchain-community "unstructured[org]"
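
If you want to confirm the dependencies are importable before constructing the loader, a small check like the one below can help. It is only a sketch using the standard library; the helper name `missing_packages` is hypothetical, and it expects top-level import names such as `unstructured` and `langchain_community`.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]
```

For instance, `missing_packages(["unstructured", "langchain_community"])` returns an empty list when both packages are installed in the current environment.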

This loader is particularly useful when you need to process Org-mode files as part of a larger LangChain pipeline, such as for document analysis, knowledge bases, or content extraction systems.
