Using LangChain ScrapingAntLoader for Web Scraping

Posted: Nov 25, 2024.

The ScrapingAntLoader is a powerful document loader in LangChain that scrapes web pages and converts their content to markdown, a format well suited to LLMs. It leverages the ScrapingAnt service, which provides advanced web scraping capabilities like headless browsers, proxy rotation, and anti-bot bypass.

What is ScrapingAntLoader?

ScrapingAntLoader is a document loader class that integrates with the ScrapingAnt API to scrape web pages and turn them into Document objects containing markdown content. It handles the complexities of web scraping, such as JavaScript rendering, proxy management, and bypassing anti-bot protection, making it well suited to collecting web data at scale.

The loader supports both synchronous and asynchronous operations, lazy loading of documents, and provides configuration options for customizing the scraping behavior.
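Note that ScrapingAntLoader relies on the ScrapingAnt Python SDK under the hood, so you will need the scrapingant-client package installed alongside langchain-community (typically pip install scrapingant-client langchain-community), plus a ScrapingAnt API key.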

Reference

Here are the main methods available in ScrapingAntLoader:

Method              Description
load()              Scrapes web pages and returns a list of Document objects
lazy_load()         Returns an iterator of Document objects, loading them one at a time
aload()             Asynchronously loads and returns a list of Document objects
alazy_load()        Returns an async iterator of Document objects
load_and_split()    Loads documents and splits them into chunks

How to use ScrapingAntLoader

Basic Usage

Here's how to initialize and use the ScrapingAntLoader for basic web scraping:

from langchain_community.document_loaders import ScrapingAntLoader

# Initialize the loader
loader = ScrapingAntLoader(
    urls=["https://example.com", "https://example.org"],
    api_key="your_scrapingant_api_key",
    continue_on_failure=True  # Continue if one URL fails
)

# Load documents
documents = loader.load()

# Process each document
for doc in documents:
    print(f"URL: {doc.metadata['url']}")
    print(f"Content: {doc.page_content[:100]}...")

Custom Scraping Configuration

You can customize the scraping behavior by providing a configuration dictionary:

from langchain_community.document_loaders import ScrapingAntLoader

# Define custom scraping configuration
scrape_config = {
    "browser": True,  # Enable headless browser
    "proxy_type": "datacenter",  # Use datacenter proxies
    "proxy_country": "us",  # Use US-based proxies
}

# Initialize loader with custom config
loader = ScrapingAntLoader(
    urls=["https://example.com"],
    api_key="your_scrapingant_api_key",
    scrape_config=scrape_config
)

# Load documents
documents = loader.load()
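
The scrape_config dictionary is passed through to the ScrapingAnt client, so other request options the service supports can generally be provided the same way; see the ScrapingAnt API documentation for the full list of parameters.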

Lazy Loading

For handling large numbers of URLs efficiently, you can use lazy loading:

from langchain_community.document_loaders import ScrapingAntLoader

loader = ScrapingAntLoader(
    urls=["https://example.com", "https://example.org", "https://example.net"],
    api_key="your_scrapingant_api_key"
)

# Lazy load documents one at a time
for document in loader.lazy_load():
    # Process each document as it's loaded
    print(f"Processing URL: {document.metadata['url']}")
    # Do something with document.page_content
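
Because lazy_load yields each Document as soon as its URL has been scraped, you can start processing or persisting results immediately instead of holding every page in memory at once, which matters when the URL list is long.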

Async Loading

For applications that need asynchronous operation:

import asyncio
from langchain_community.document_loaders import ScrapingAntLoader

async def scrape_urls():
    loader = ScrapingAntLoader(
        urls=["https://example.com", "https://example.org"],
        api_key="your_scrapingant_api_key"
    )
    
    # Load all documents asynchronously
    documents = await loader.aload()
    
    # Or load documents lazily
    async for document in loader.alazy_load():
        print(f"Loaded: {document.metadata['url']}")

# Run the async function
asyncio.run(scrape_urls())
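
As of this writing, ScrapingAntLoader appears to rely on LangChain's BaseLoader defaults for aload and alazy_load, which typically run the synchronous loader in a background executor. They make the loader convenient to call from async code, but they do not necessarily scrape the URLs concurrently.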

Loading and Splitting Documents

If you need to split the scraped content into smaller chunks:

from langchain_text_splitters import CharacterTextSplitter
from langchain_community.document_loaders import ScrapingAntLoader

loader = ScrapingAntLoader(
    urls=["https://example.com"],
    api_key="your_scrapingant_api_key"
)

# Create a text splitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

# Load and split documents
split_docs = loader.load_and_split(text_splitter=text_splitter)

for doc in split_docs:
    print(f"Chunk length: {len(doc.page_content)}")

Remember to handle your API key securely, preferably using environment variables:

import os
from langchain_community.document_loaders import ScrapingAntLoader

# Initialize with API key from environment variable
loader = ScrapingAntLoader(
    urls=["https://example.com"],
    api_key=os.getenv("SCRAPINGANT_API_KEY")
)
