LangChain MathPix PDF Loader - Extract Text from PDFs with High Precision

Posted: Nov 8, 2024.

MathpixPDFLoader is a LangChain document loader that uses the Mathpix OCR service to extract text from PDF files with high accuracy, particularly for documents containing mathematical formulas and complex layouts.

What is MathpixPDFLoader?

MathpixPDFLoader is a document loader class that uses Mathpix's OCR service to convert PDF files into machine-readable text. It is particularly useful for academic papers, mathematical documents, or any PDFs with formulas and layouts that traditional PDF extractors tend to mangle. The loader sends the PDF to Mathpix's API, waits for processing to finish (up to max_wait_time_seconds), and returns the converted text as LangChain Document objects.

Reference

Here are the key parameters and methods of MathpixPDFLoader:

Parameter	Description
file_path	Path to the PDF file to load
processed_file_format	Output format for the processed file (default: 'md')
max_wait_time_seconds	Maximum time to wait for Mathpix processing (default: 500)
should_clean_pdf	Whether to clean up the extracted text after conversion (default: False)
extra_request_data	Additional parameters to send to the Mathpix API (default: None)

Method	Description
load()	Load the PDF and convert to Document objects
lazy_load()	Load documents one at a time using an iterator
aload()	Asynchronously load documents
alazy_load()	Asynchronously load documents one at a time
load_and_split()	Load documents and split them into chunks

How to Use MathpixPDFLoader

Setup and Authentication

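First, set your Mathpix API credentials. The loader reads both an app key and an app ID from the environment:

import os
os.environ["MATHPIX_API_KEY"] = "your-api-key-here"
os.environ["MATHPIX_API_ID"] = "your-app-id-here"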
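Before constructing the loader, it can help to fail early when credentials are missing. The following is a minimal sketch, assuming the loader looks up both `MATHPIX_API_KEY` and `MATHPIX_API_ID` in the environment. The helper name `missing_mathpix_credentials` is hypothetical and not part of LangChain.

```python
import os

# Assumption: these are the environment variables the loader looks up.
REQUIRED_VARS = ("MATHPIX_API_KEY", "MATHPIX_API_ID")


def missing_mathpix_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_mathpix_credentials()` with no arguments checks the real environment. A non-empty result lists the variables that still need to be set.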

Basic Usage

Here's how to load a PDF file:

from langchain_community.document_loaders import MathpixPDFLoader

# Initialize the loader
loader = MathpixPDFLoader("path/to/your/paper.pdf")

# Load all documents
documents = loader.load()

# Access the content
print(documents[0].page_content)

Lazy Loading for Large Documents

When dealing with large PDFs, you can use lazy loading to process documents one at a time:

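loader = MathpixPDFLoader("path/to/large/document.pdf")

def process_documents(batch):
    # Placeholder: replace with your own processing logic
    print(f"Processing {len(batch)} documents")

# Process documents in batches
batch_size = 10
current_batch = []

for doc in loader.lazy_load():
    current_batch.append(doc)

    if len(current_batch) >= batch_size:
        # Process the full batch
        process_documents(current_batch)
        current_batch = []

# Process any documents left over in the final partial batch
if current_batch:
    process_documents(current_batch)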
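The batching logic can also be factored into a small reusable generator that yields the final, partially filled batch as well. This is a sketch using only the standard library. The name `batched` is chosen here for illustration (Python 3.12 ships a similar `itertools.batched` that yields tuples).

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of up to batch_size items, including a final partial batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    # islice pulls at most batch_size items; an empty list ends the loop
    while batch := list(islice(iterator, batch_size)):
        yield batch
```

With a loader in hand, this would be used as `for batch in batched(loader.lazy_load(), 10): ...`, so no documents are dropped when the total is not a multiple of the batch size.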

Async Loading

For applications that need to handle multiple PDFs concurrently:

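import asyncio

async def load_pdf(file_path):
    loader = MathpixPDFLoader(file_path)
    documents = await loader.aload()
    return documents

async def main():
    # Load several PDFs concurrently; results keep the input order
    paths = ["path/to/paper1.pdf", "path/to/paper2.pdf"]
    return await asyncio.gather(*(load_pdf(p) for p in paths))

# In a script, start the event loop explicitly; a bare top-level await only works in notebooks
all_documents = asyncio.run(main())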
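When many PDFs are loaded at once, it is usually worth capping how many requests are in flight. The sketch below assumes a coroutine function such as `load_pdf` above and limits concurrency with `asyncio.Semaphore`. The name `load_many` and the default limit of 3 are illustrative choices, not values recommended by Mathpix or LangChain.

```python
import asyncio


async def load_many(file_paths, load_one, max_concurrency=3):
    """Run load_one(path) for each path, at most max_concurrency at a time.

    load_one is any coroutine function that takes a file path.
    Results are returned in the same order as file_paths.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(path):
        # Only max_concurrency coroutines can hold the semaphore at once
        async with semaphore:
            return await load_one(path)

    return await asyncio.gather(*(guarded(p) for p in file_paths))
```

In a script this could be run as `asyncio.run(load_many(paths, load_pdf))`.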

Customizing Processing Options

You can customize how Mathpix processes your PDF:

loader = MathpixPDFLoader(
    file_path="path/to/paper.pdf",
    processed_file_format="md",  # Output in Markdown format
    should_clean_pdf=True,       # Clean the PDF content
    max_wait_time_seconds=300,   # Custom wait time
    extra_request_data={         # Additional Mathpix options
        "math_inline_delimiters": ["$", "$"],
        "rm_spaces": True
    }
)

Loading and Splitting

You can automatically split the documents into chunks when loading:

from langchain_text_splitters import CharacterTextSplitter

# Create a text splitter
text_splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

# Load and split the document
split_docs = loader.load_and_split(text_splitter=text_splitter)

Remember that the Mathpix service requires an API key and may have usage limits based on your subscription plan. Make sure to handle API errors and implement appropriate rate limiting in production applications.
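One common way to handle transient API errors is to retry with exponential backoff. The following is a generic sketch, not part of LangChain or Mathpix. The helper name `call_with_retries` and its defaults are assumptions, and it catches every `Exception` for brevity, which you would likely narrow to the errors you actually observe.

```python
import time


def call_with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception, wait and retry with exponential backoff.

    Re-raises the last exception once max_attempts calls have failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

With a loader in hand, this could wrap the load call as `documents = call_with_retries(loader.load)`.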
