Converting Web Content to Markdown with LangChain's ToMarkdownLoader

Posted: Nov 19, 2024.

When building LLM applications, you often need to process web content into a more structured format. The ToMarkdownLoader in LangChain provides an easy way to convert web pages into clean markdown format using the 2markdown.com service.

What is ToMarkdownLoader?

ToMarkdownLoader is a document loader in LangChain that converts web page content into markdown format using the 2markdown.com API service. This is particularly useful when you need to:

  • Clean up HTML content into a simpler markdown structure
  • Extract the main content from web pages while removing navigation, ads, etc.
  • Get web content in a format that's easier to process with LLMs

Reference

Here are the main methods available in ToMarkdownLoader:

  • load(): Load the URL content and convert it to markdown, returning a list of Document objects
  • aload(): Async version of load()
  • lazy_load(): Lazily load the content, returning an iterator of Document objects
  • alazy_load(): Async version of lazy_load()
  • load_and_split(): Load the content and split it into chunks using a text splitter

How to Use ToMarkdownLoader

Basic Usage

To use ToMarkdownLoader, you'll need an API key from 2markdown.com. Here's how to convert a webpage to markdown:

from langchain_community.document_loaders import ToMarkdownLoader

# Initialize the loader with URL and API key
loader = ToMarkdownLoader(
    url="https://example.com/page",
    api_key="your_2markdown_api_key"
)

# Load and convert the content
docs = loader.load()

# Access the markdown content
markdown_content = docs[0].page_content

Async Loading

If you're working in an async context, you can use the async methods:

async def convert_page():
    loader = ToMarkdownLoader(url="https://example.com/page", api_key="your_key")
    docs = await loader.aload()
    return docs
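To call a coroutine like this from regular synchronous code, you need to run it in an event loop. Here is a minimal sketch using asyncio.run, with the URL and key as placeholders (the parameterized convert_page here is a variant of the function above, not part of LangChain):

```python
import asyncio

async def convert_page(url: str, api_key: str):
    # Imported lazily so this sketch only needs langchain_community when run.
    from langchain_community.document_loaders import ToMarkdownLoader
    loader = ToMarkdownLoader(url=url, api_key=api_key)
    return await loader.aload()

# From a plain script (not already inside a running event loop):
# docs = asyncio.run(convert_page("https://example.com/page", "your_key"))
```

Note that asyncio.run should only be used at the top level of a script; inside an already-running event loop (e.g. a Jupyter notebook), await the coroutine directly.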

Lazy Loading

For large pages or when processing multiple URLs, you might want to use lazy loading:

loader = ToMarkdownLoader(url="https://example.com/page", api_key="your_key")

# Iterate through documents lazily
for doc in loader.lazy_load():
    print(doc.page_content)
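Since each loader instance takes a single URL, processing multiple URLs lazily requires one loader per page. One way to do this is with a small generator (a sketch; iter_pages_as_markdown is a hypothetical helper, not part of LangChain):

```python
def iter_pages_as_markdown(urls, api_key):
    """Yield Document objects from several pages, one page at a time,
    so only one page's content needs to be in memory at once."""
    for url in urls:
        # Imported here so the sketch only requires langchain_community
        # once it is iterated over a non-empty URL list.
        from langchain_community.document_loaders import ToMarkdownLoader
        loader = ToMarkdownLoader(url=url, api_key=api_key)
        yield from loader.lazy_load()

# Usage (placeholder URLs and key):
# for doc in iter_pages_as_markdown(
#     ["https://example.com/a", "https://example.com/b"], "your_key"
# ):
#     print(doc.page_content[:80])
```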

Splitting Content

You can also load and split the content into smaller chunks using a text splitter:

from langchain_text_splitters import CharacterTextSplitter

loader = ToMarkdownLoader(url="https://example.com/page", api_key="your_key")

# Create a text splitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

# Load and split the content
split_docs = loader.load_and_split(text_splitter=text_splitter)

This approach is useful when you need to process long documents and want to break them into manageable chunks for your LLM application.

Remember that you'll need to sign up for a 2markdown.com account and get an API key to use this loader. The service helps clean up web content by removing unnecessary HTML elements and converting the main content into clean markdown format.
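Rather than hard-coding the API key in source, it's good practice to read it from an environment variable. A minimal sketch (the variable name TO_MARKDOWN_API_KEY is an arbitrary choice here, not one the loader reads automatically):

```python
import os

def get_2markdown_api_key(var_name: str = "TO_MARKDOWN_API_KEY") -> str:
    """Read the 2markdown.com API key from the environment,
    failing loudly if it is not set."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Please set the {var_name} environment variable")
    return key

# loader = ToMarkdownLoader(url="https://example.com/page",
#                           api_key=get_2markdown_api_key())
```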
