Advanced Web Scraping with Crawl4AI: Markdown Generation, JS Execution, and Structured LLM Extraction

A Coding Implementation of Crawl4AI for Web Crawling, Markdown Generation, JavaScript Execution, and LLM-Based Structured Extraction

Crawl4AI v0.8.x provides an end-to-end workflow for turning raw web content into structured data using an asynchronous crawler and LLM extraction strategies. It also supports BM25 query-based filtering, which keeps the extracted content relevant to a query, and BFS deep crawling, which controls how far a multi-page crawl explores.

Why This Matters

Traditional web scraping often fails on dynamic, JavaScript-heavy pages and struggles to extract clean data from messy HTML. Crawl4AI bridges the gap between raw HTML retrieval and usable datasets by combining browser configuration, content filtering with algorithms such as BM25, and Pydantic-based LLM schemas. This reduces manual selector maintenance and makes complex session state easier to handle reliably.

Key Insights

Crawl4AI v0.8.x supports BM25 query-based filtering to extract only content relevant to specific user queries, significantly reducing markdown noise.
The BFSDeepCrawlStrategy allows for structured multi-page exploration with domain and URL pattern filters to prevent uncontrolled crawling.
JavaScript can be executed on the page before extraction to handle lazy loading and dynamic DOM modifications; having the script set a custom attribute gives a simple way to verify that it ran.
Session management enables persistent cookies and browser states across sequential requests, facilitating authenticated or multi-step crawling.
Integration with LLMs like GPT-4o-mini via LLMExtractionStrategy allows for structured JSON output defined by Pydantic models.
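
To make the BM25 insight above more concrete, here is a minimal, library-independent sketch of query-based filtering over text chunks. It is not Crawl4AI's implementation: the function names (`tokenize`, `bm25_scores`, `filter_chunks`), the parameters `k1=1.5` and `b=0.75`, and the `threshold` value are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Score each chunk against the query with a standard BM25 formula.

    k1 and b are common default values, assumed here for illustration.
    """
    docs = [tokenize(chunk) for chunk in chunks]
    n = len(docs)
    avgdl = sum(len(doc) for doc in docs) / n if n else 0.0

    # Document frequency: in how many chunks does each term appear?
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    query_terms = set(tokenize(query))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def filter_chunks(query, chunks, threshold=0.5):
    """Keep only the chunks whose BM25 score reaches the (assumed) threshold."""
    scores = bm25_scores(query, chunks)
    return [chunk for chunk, score in zip(chunks, scores) if score >= threshold]


# Made-up page fragments: one relevant paragraph and two pieces of page noise.
chunks = [
    "Install the crawler with pip and run the quickstart.",
    "Cookie policy and newsletter signup.",
    "Footer links and copyright notice.",
]
relevant = filter_chunks("crawler quickstart", chunks)
```

Chunks that share no terms with the query score zero and are dropped, which is the mechanism by which query-based filtering removes navigation and footer noise from the generated markdown.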

Working Examples

The simplest possible crawl using AsyncWebCrawler to fetch a webpage and retrieve markdown.

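from crawl4ai import AsyncWebCrawler

# Run inside an async function (or a notebook cell that supports top-level await).
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com")
    print(f"Title: {result.metadata.get('title')}")
    print(result.markdown.raw_markdown[:500])  # first 500 characters of the markdown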

Extracting structured data using CSS selectors and a predefined JSON schema without an LLM.

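from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Each element matched by baseSelector becomes one record; field selectors are relative to it.
# These selectors depend on Wikipedia's page markup and may need updating if it changes.
schema = {
    "name": "Wikipedia Headings",
    "baseSelector": "div.mw-parser-output h2",
    "fields": [
        {"name": "heading_text", "selector": "span.mw-headline", "type": "text"},
        {"name": "heading_id", "selector": "span.mw-headline", "type": "attribute", "attribute": "id"}
    ]
}
extraction_strategy = JsonCssExtractionStrategy(schema)
run_config = CrawlerRunConfig(extraction_strategy=extraction_strategy)
# Run inside an async function (or a notebook cell that supports top-level await).
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://en.wikipedia.org/wiki/Python_(programming_language)", config=run_config)
    print(result.extracted_content)  # JSON string containing the extracted records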

Utilizing LLM-based extraction to convert unstructured web content into Pydantic-validated JSON objects.

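import os

from pydantic import BaseModel, Field
from crawl4ai import CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Assumption: the OpenAI key is supplied through this environment variable.
api_key = os.environ["OPENAI_API_KEY"]

class Article(BaseModel):
    title: str = Field(description="The article title")
    summary: str = Field(description="A brief summary")

# The Pydantic model's JSON schema tells the LLM which fields to return.
llm_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token=api_key),
    schema=Article.model_json_schema(),
    extraction_type="schema",
    instruction="Extract article titles and summaries."
)
run_config = CrawlerRunConfig(extraction_strategy=llm_strategy)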
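
Once a crawl runs with this configuration, the extracted records come back as a JSON string, and it is worth checking them before use because an LLM can omit fields or return error entries. The following is a minimal standard-library sketch of that check, assuming the payload is a JSON list of objects. The `parse_articles` helper, the dataclass stand-in for the `Article` model, and the sample payload are all hypothetical; with Pydantic installed, `Article.model_validate(item)` could play the same role.

```python
import json
from dataclasses import dataclass


@dataclass
class Article:
    """Stdlib stand-in for the Pydantic Article model, used only for this sketch."""
    title: str
    summary: str


def parse_articles(extracted_content):
    """Parse a JSON string and keep only records that have both string fields."""
    data = json.loads(extracted_content)
    if isinstance(data, dict):
        # Tolerate a single object as well as a list of objects.
        data = [data]
    articles = []
    for item in data:
        if not isinstance(item, dict):
            continue
        title = item.get("title")
        summary = item.get("summary")
        if isinstance(title, str) and isinstance(summary, str):
            articles.append(Article(title=title, summary=summary))
    return articles


# Made-up payload: one complete record, one incomplete record, one error entry.
sample = (
    '[{"title": "Intro", "summary": "Short."},'
    ' {"title": "No summary"},'
    ' {"error": true, "content": "rate limited"}]'
)
articles = parse_articles(sample)
```

Silently dropping incomplete records is a design choice made for brevity; in a real pipeline it may be preferable to log or count the rejected items.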

Practical Applications

Hacker News story tracking: Using JsonCssExtractionStrategy to pull ranked stories and site origins into a clean JSON format. Pitfall: Relying on brittle selectors that break during site layout updates.
Documentation indexing: Using BFSDeepCrawlStrategy to recursively crawl documentation pages while filtering for specific keywords like ‘quickstart’. Pitfall: Failing to set max_depth or a domain filter, which can let the crawl expand without bound, including onto external sites.
Dynamic content extraction: Implementing custom JavaScript blocks to trigger lazy-loading via window.scrollTo before capturing page state. Pitfall: Setting insufficient delay_before_return_html, resulting in the capture of empty loading states.
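
The documentation-indexing pitfall is easier to see in a small model of breadth-first crawling. The sketch below is not BFSDeepCrawlStrategy itself: it walks an in-memory link graph instead of fetching pages, and `bfs_crawl`, `get_links`, `max_pages`, `url_keyword`, and the example URLs are hypothetical names introduced for illustration.

```python
from collections import deque
from urllib.parse import urlparse


def bfs_crawl(start_url, get_links, max_depth=2, max_pages=50, url_keyword=None):
    """Breadth-first traversal limited by depth, page count, domain, and URL keyword."""
    allowed_domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            # Depth limit reached: record the page but do not follow its links.
            continue
        for link in get_links(url):
            if link in seen:
                continue
            if urlparse(link).netloc != allowed_domain:
                # Domain filter: never leave the starting site.
                continue
            if url_keyword and url_keyword not in link:
                # URL pattern filter: only follow links containing the keyword.
                continue
            seen.add(link)
            queue.append((link, depth + 1))
    return visited


# Made-up link graph standing in for real pages.
LINKS = {
    "https://docs.example.com/": [
        "https://docs.example.com/quickstart",
        "https://other.example.org/",
        "https://docs.example.com/api",
    ],
    "https://docs.example.com/quickstart": [
        "https://docs.example.com/quickstart/install",
        "https://docs.example.com/",
    ],
    "https://docs.example.com/quickstart/install": [
        "https://docs.example.com/quickstart/deep",
    ],
}


def get_links(url):
    """Return the outgoing links of a page in the made-up graph."""
    return LINKS.get(url, [])


all_pages = bfs_crawl("https://docs.example.com/", get_links, max_depth=2)
quickstart_pages = bfs_crawl(
    "https://docs.example.com/", get_links, max_depth=2, url_keyword="quickstart"
)
```

Without the depth check the traversal would continue for as long as new links appear, and without the domain check it would follow the link to the other site. Both limits are needed to keep a crawl bounded.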
