Using LangChain UnstructuredODTLoader for OpenDocument Files

Posted: Feb 18, 2025.

OpenDocument Text (ODT) files are an open standard format for word processing documents. In this guide, we'll explore how to work with ODT files in LangChain using the UnstructuredODTLoader.

What is UnstructuredODTLoader?

The UnstructuredODTLoader is a document loader class in LangChain that helps you extract and process text content from OpenDocument Text (ODT) files. It uses the Unstructured library under the hood to parse ODT files and convert them into LangChain Document objects that can be used in your document processing pipelines.
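
To make the input and output concrete: an ODT file is a ZIP archive whose body text lives in `content.xml`, and a loader's job is to turn that into objects carrying text plus metadata. The sketch below uses only the Python standard library. `SimpleDocument`, `write_minimal_odt`, and `read_odt_paragraphs` are hypothetical names invented for illustration, and this is not how Unstructured parses the file internally.

```python
import os
import tempfile
import zipfile
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from xml.sax.saxutils import escape

# ODF namespaces used inside content.xml
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def write_minimal_odt(path, paragraphs):
    """Write a bare-bones ODT archive, just enough for this demo.

    It has no manifest, so an office suite may refuse to open it.
    """
    body = "".join(f"<text:p>{escape(p)}</text:p>" for p in paragraphs)
    content = (
        f'<office:document-content xmlns:office="{OFFICE_NS}" '
        f'xmlns:text="{TEXT_NS}"><office:body><office:text>'
        f"{body}</office:text></office:body></office:document-content>"
    )
    with zipfile.ZipFile(path, "w") as zf:
        zf.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        zf.writestr("content.xml", content)


def read_odt_paragraphs(path):
    """Return one SimpleDocument per non-empty <text:p> paragraph."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("content.xml"))
    docs = []
    for p in root.iter(f"{{{TEXT_NS}}}p"):
        text = "".join(p.itertext()).strip()
        if text:
            docs.append(SimpleDocument(text, {"source": path}))
    return docs


demo_path = os.path.join(tempfile.mkdtemp(), "demo.odt")
write_minimal_odt(demo_path, ["Hello ODT", "Second paragraph"])
demo_docs = read_odt_paragraphs(demo_path)
print([d.page_content for d in demo_docs])
```

The real loader handles far more (headings, lists, tables, metadata), which is the reason to rely on Unstructured rather than hand-rolled XML parsing.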

Reference

Here are the key methods available in UnstructuredODTLoader:

Method	Description
`load()`	Loads the ODT file and returns a list of Document objects
`lazy_load()`	Loads the file lazily, returning an iterator of Document objects
`aload()`	Asynchronously loads the ODT file
`alazy_load()`	Asynchronously loads the file lazily
`load_and_split()`	Loads the document and splits it into chunks using a text splitter

How to Use UnstructuredODTLoader

Basic Usage

The simplest way to use the UnstructuredODTLoader is to initialize it with a file path and call the load() method:

from langchain_community.document_loaders import UnstructuredODTLoader

# Initialize the loader
loader = UnstructuredODTLoader("document.odt")

# Load the document
docs = loader.load()

Using Different Modes

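The mode parameter determines how the document content is structured. The two modes most useful for ODT files are "single" (the default), which returns the whole file as one Document, and "elements", which returns one Document per element: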

# Single mode - entire document as one Document object
loader = UnstructuredODTLoader("document.odt", mode="single")
docs = loader.load()

# Elements mode - splits into semantic elements
loader = UnstructuredODTLoader("document.odt", mode="elements")
docs = loader.load()

In "elements" mode, the document is split into different semantic elements like titles, paragraphs, and lists. This can be useful when you need more granular control over the document structure.
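
In "elements" mode, each returned Document typically carries a `category` key in its metadata (for example `Title`, `NarrativeText`, or `ListItem`), though the exact labels depend on the installed Unstructured version. Here is a minimal sketch of grouping content by that key. `group_by_category` is a hypothetical helper, and the sample objects are stand-ins shaped like Documents so the snippet runs without an ODT file.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_category(docs):
    """Group page_content strings by metadata["category"]."""
    groups = defaultdict(list)
    for doc in docs:
        # Fall back to a placeholder label when no category is present
        category = doc.metadata.get("category", "Uncategorized")
        groups[category].append(doc.page_content)
    return dict(groups)


# Stand-ins shaped like the Documents returned by loader.load()
sample_docs = [
    SimpleNamespace(page_content="Quarterly Report", metadata={"category": "Title"}),
    SimpleNamespace(page_content="Revenue grew.", metadata={"category": "NarrativeText"}),
    SimpleNamespace(page_content="Costs fell.", metadata={"category": "NarrativeText"}),
    SimpleNamespace(page_content="No label", metadata={}),
]
grouped = group_by_category(sample_docs)
print(grouped)
```

With a real file, you would pass the result of `loader.load()` in place of `sample_docs`.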

Additional Parameters

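You can pass additional keyword arguments, which are forwarded to the underlying Unstructured partitioning function:

# Passing a strategy hint through to Unstructured (this option mainly
# matters for PDF and image parsing, so it may make little difference for ODT)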
loader = UnstructuredODTLoader(
    "document.odt",
    mode="elements",
    strategy="fast"
)
docs = loader.load()

Lazy Loading

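For large documents, lazy_load() yields Document objects one at a time instead of building the full list up front. This helps mostly in "elements" mode, since "single" mode yields only one Document: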

loader = UnstructuredODTLoader("document.odt")
# Get an iterator instead of loading everything at once
for doc in loader.lazy_load():
    # Process each document
    print(doc.page_content)
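
When each document is sent on to something that works best on groups (an embedding call, a database write), the iterator can be consumed in fixed-size batches. This is a small sketch using only the standard library. `in_batches` is a hypothetical helper, and `fake_lazy_load` is a stand-in for `loader.lazy_load()` so the snippet runs on its own.

```python
from itertools import islice


def in_batches(iterator, batch_size):
    """Yield lists of up to batch_size items from any iterator."""
    iterator = iter(iterator)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fake_lazy_load():
    """Stand-in for loader.lazy_load(); yields five placeholder strings."""
    for i in range(5):
        yield f"doc-{i}"


batches = list(in_batches(fake_lazy_load(), batch_size=2))
print(batches)
```

Only one batch is held in memory at a time if you loop over `in_batches(...)` directly rather than wrapping it in `list()`.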

Async Loading

The loader also supports asynchronous loading:

import asyncio

async def load_document():
    loader = UnstructuredODTLoader("document.odt")
    docs = await loader.aload()
    return docs

docs = asyncio.run(load_document())
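
Async loading pays off mainly when several files are loaded at once. The sketch below shows the general pattern with `asyncio.gather`. `FakeLoader` is a hypothetical stand-in that mimics the shape of `aload()` so the example runs without any ODT files, and `load_many` is an illustrative helper name.

```python
import asyncio


class FakeLoader:
    """Hypothetical stand-in exposing the same aload() coroutine shape."""

    def __init__(self, path):
        self.path = path

    async def aload(self):
        await asyncio.sleep(0)  # yield control, as real I/O would
        return [f"contents of {self.path}"]


async def load_many(loaders):
    """Await every loader's aload() concurrently; results keep input order."""
    return await asyncio.gather(*(loader.aload() for loader in loaders))


results = asyncio.run(load_many([FakeLoader("a.odt"), FakeLoader("b.odt")]))
print(results)
```

In real use, the list would hold `UnstructuredODTLoader` instances, one per file.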

Splitting Documents

You can automatically split the document into chunks using a text splitter:

from langchain_text_splitters import CharacterTextSplitter

loader = UnstructuredODTLoader("document.odt")
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = loader.load_and_split(text_splitter=text_splitter)
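
To see what chunk_size and chunk_overlap mean in practice, here is a deliberately simplified fixed-width sliding window. It is not CharacterTextSplitter's actual algorithm, which splits on a separator and then merges pieces. `sliding_chunks` is a hypothetical function written only to show how consecutive chunks share characters.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    # Each window starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Each chunk repeats the last two characters of the one before it, which is the effect the overlap setting is meant to have: context at a chunk boundary appears in both neighbors.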

Remember to install the necessary dependencies (unstructured[all-docs] or unstructured[odt]) to work with ODT files. The UnstructuredODTLoader provides a flexible way to integrate ODT documents into your LangChain applications, whether you need basic document loading or more advanced processing features.
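
A quick way to confirm the packages are importable before running a pipeline is sketched below. `missing_packages` is a hypothetical helper; it checks top-level package names only and cannot tell whether the ODT-specific extras were installed.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be imported."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(["langchain_community", "unstructured"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```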

