Using LangChain ScrapingAntLoader for Web Scraping
Posted: Nov 25, 2024.
The ScrapingAntLoader is a LangChain document loader that scrapes web pages and converts their content to markdown, a format well suited to LLMs. It relies on the ScrapingAnt service, which provides headless browsers, proxy rotation, and anti-bot bypass.
What is ScrapingAntLoader?
ScrapingAntLoader is a document loader class that integrates with the ScrapingAnt API to scrape web pages and transform them into Document objects containing markdown content. It handles the complexities of web scraping like JavaScript rendering, proxy management, and anti-bot detection, making it ideal for collecting web data at scale.
The loader supports synchronous and asynchronous operation as well as lazy loading of documents, and it provides configuration options for customizing the scraping behavior.
Reference
Here are the main methods available in ScrapingAntLoader:
| Method | Description |
|---|---|
| load() | Scrapes web pages and returns a list of Document objects |
| lazy_load() | Returns an iterator of Document objects, loading them one at a time |
| aload() | Asynchronously loads and returns a list of Document objects |
| alazy_load() | Returns an async iterator of Document objects |
| load_and_split() | Loads documents and splits them into chunks |
How to use ScrapingAntLoader
Basic Usage
Here's how to initialize and use the ScrapingAntLoader for basic web scraping:
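A minimal sketch of basic usage, assuming the `langchain-community` and `scrapingant-client` packages are installed. The helper names `scrape_pages` and `print_previews` are hypothetical, and the `api_key` and `continue_on_failure` arguments reflect the loader's constructor as commonly documented, so check them against your installed version.

```python
def scrape_pages(urls, api_key):
    """Scrape each URL and return a list of Documents with markdown content."""
    # Imported inside the function so the dependency is only needed at call time.
    from langchain_community.document_loaders import ScrapingAntLoader

    loader = ScrapingAntLoader(
        urls,  # list of page URLs to scrape
        api_key=api_key,  # your ScrapingAnt API token
        continue_on_failure=True,  # assumed: skip pages that fail instead of raising
    )
    return loader.load()


def print_previews(docs, limit=200):
    """Print the first `limit` characters of each scraped page."""
    for doc in docs:
        print(doc.page_content[:limit])
        print("-" * 40)
```

Calling `print_previews(scrape_pages(["https://example.com/"], api_key="<YOUR_SCRAPINGANT_TOKEN>"))` would print the start of each page's markdown.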
Custom Scraping Configuration
You can customize the scraping behavior by providing a configuration dictionary:
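As a rough illustration, the configuration dictionary can be built separately and passed through the loader's `scrape_config` argument. The keys used here (`browser`, `proxy_type`, `proxy_country`) are assumed to mirror ScrapingAnt's request parameters, and `build_scrape_config` and `make_configured_loader` are hypothetical helper names; consult the ScrapingAnt documentation for the options your plan supports.

```python
def build_scrape_config(render_js=True, proxy_type="datacenter", proxy_country="us"):
    """Return a scrape configuration dictionary (key names are assumptions)."""
    return {
        "browser": render_js,  # render the page in a cloud headless browser
        "proxy_type": proxy_type,  # e.g. "datacenter" or "residential"
        "proxy_country": proxy_country,  # two-letter country code for the proxy
    }


def make_configured_loader(urls, api_key, scrape_config):
    """Create a ScrapingAntLoader that applies `scrape_config` to every request."""
    # Imported inside the function so the dependency is only needed at call time.
    from langchain_community.document_loaders import ScrapingAntLoader

    return ScrapingAntLoader(
        urls,
        api_key=api_key,
        continue_on_failure=True,
        scrape_config=scrape_config,
    )
```

Keeping the dictionary in its own function makes it easy to reuse one configuration across several loaders.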
Lazy Loading
For handling large numbers of URLs efficiently, you can use lazy loading:
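One way to sketch this is a generator that consumes `lazy_load()` and handles each page as it arrives, so the full result list never sits in memory at once. The helper names are hypothetical, and reading the source address from `metadata["url"]` is an assumption about what the loader stores on each Document.

```python
def iter_page_sizes(loader):
    """Yield (url, character_count) for each page as soon as it is scraped."""
    for doc in loader.lazy_load():
        # metadata["url"] is an assumed key; .get() avoids a KeyError if absent.
        yield doc.metadata.get("url"), len(doc.page_content)


def report_sizes(urls, api_key):
    """Scrape `urls` one at a time and print the size of each page."""
    # Imported inside the function so the dependency is only needed at call time.
    from langchain_community.document_loaders import ScrapingAntLoader

    loader = ScrapingAntLoader(urls, api_key=api_key, continue_on_failure=True)
    for url, size in iter_page_sizes(loader):
        print(f"{url}: {size} characters")
```

Because `iter_page_sizes` is itself a generator, processing can stop early (for example after the first match) without scraping the remaining URLs.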
Async Loading
For applications that need asynchronous operation:
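A minimal sketch of both async entry points, `aload()` and `alazy_load()`, assuming an already-constructed loader. The function names are hypothetical, and the constructor call in `main` makes the same assumptions about arguments as any other example of this loader.

```python
import asyncio


async def load_all(loader):
    """Await the full list of Documents from an already-constructed loader."""
    return await loader.aload()


async def page_lengths(loader):
    """Consume the async iterator and record each page's markdown length."""
    lengths = []
    async for doc in loader.alazy_load():
        lengths.append(len(doc.page_content))
    return lengths


async def main(urls, api_key):
    """Build a loader and load every page asynchronously."""
    # Imported inside the function so the dependency is only needed at call time.
    from langchain_community.document_loaders import ScrapingAntLoader

    loader = ScrapingAntLoader(urls, api_key=api_key)
    docs = await load_all(loader)
    print(f"Loaded {len(docs)} documents")
    return docs


def run(urls, api_key):
    """Synchronous entry point for scripts that have no running event loop."""
    return asyncio.run(main(urls, api_key))
```

Inside an existing async application (a web handler, for instance) you would `await main(...)` directly instead of calling `run`.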
Loading and Splitting Documents
If you need to split the scraped content into smaller chunks:
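The sketch below passes an explicit splitter to `load_and_split()`. It assumes the `langchain-text-splitters` package is installed; `load_chunks` is a hypothetical helper, and the chunk sizes are illustrative values rather than recommendations.

```python
def load_chunks(loader, chunk_size=1000, chunk_overlap=100):
    """Load all pages from `loader` and split them into overlapping chunks."""
    # Imported inside the function so the dependency is only needed at call time.
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,  # maximum characters per chunk
        chunk_overlap=chunk_overlap,  # characters shared between adjacent chunks
    )
    return loader.load_and_split(text_splitter=splitter)
```

An equivalent approach is to call `loader.load()` and then `splitter.split_documents(docs)`, which keeps loading and splitting as separate steps.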
Remember to handle your API key securely, preferably using environment variables:
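A small sketch of reading the key from the environment and failing early when it is missing. The variable name `SCRAPINGANT_API_KEY` and the helper `get_scrapingant_api_key` are assumptions chosen for illustration; use whatever name your deployment defines.

```python
import os


def get_scrapingant_api_key(env=None):
    """Return the API key from the environment, or raise if it is not set."""
    env = os.environ if env is None else env
    # SCRAPINGANT_API_KEY is an assumed variable name.
    key = env.get("SCRAPINGANT_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "SCRAPINGANT_API_KEY is not set; export it before running the scraper."
        )
    return key
```

The returned value can then be passed as `api_key=get_scrapingant_api_key()` when constructing the loader, which keeps the token out of source control.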