Converting Web Content to Markdown with LangChain's ToMarkdownLoader
Posted: Nov 19, 2024.
When building LLM applications, you often need to process web content into a more structured format. The ToMarkdownLoader in LangChain provides an easy way to convert web pages into clean markdown format using the 2markdown.com service.
What is ToMarkdownLoader?
ToMarkdownLoader is a document loader in LangChain that converts web page content into markdown format using the 2markdown.com API service. This is particularly useful when you need to:
- Clean up HTML content into a simpler markdown structure
- Extract the main content from web pages while removing navigation, ads, etc.
- Get web content in a format that's easier to process with LLMs
Reference
Here are the main methods available in ToMarkdownLoader:
| Method | Description |
|---|---|
| `load()` | Load the URL content and convert it to markdown, returning a list of Document objects |
| `aload()` | Async version of `load()` |
| `lazy_load()` | Lazily load the content, returning an iterator of Document objects |
| `alazy_load()` | Async version of `lazy_load()` |
| `load_and_split()` | Load the content and split it into chunks using a text splitter |
How to Use ToMarkdownLoader
Basic Usage
To use ToMarkdownLoader, you'll need an API key from 2markdown.com. Here's how to convert a webpage to markdown:
Async Loading
If you're working in an async context, you can use the async methods:
Lazy Loading
For large pages or when processing multiple URLs, you might want to use lazy loading:
Splitting Content
You can also load and split the content into smaller chunks using a text splitter:
This approach is useful when you need to process long documents and want to break them into manageable chunks for your LLM application.
You need a 2markdown.com account and an API key to use this loader. The conversion itself is done by the 2markdown.com service, which strips unnecessary HTML elements and returns the main content as clean markdown.