Advanced Web Scraping with Crawl4AI: Markdown Generation, JS Execution, and Structured LLM Extraction
These articles are AI-generated summaries. Please check the original sources for full details.
A Coding Implementation of Crawl4AI for Web Crawling, Markdown Generation, JavaScript Execution, and LLM-Based Structured Extraction
Crawl4AI v0.8.x provides an end-to-end workflow for turning raw web content into structured data using async crawlers and LLM extraction strategies. It also supports BM25 query-based filtering, which keeps only content relevant to a query, and BFS deep crawling, which explores multiple pages to a controlled depth.
Why This Matters
Traditional web scraping often fails on dynamic, JavaScript-heavy pages and struggles to extract clean data from messy HTML. Crawl4AI closes the gap between raw HTML retrieval and usable datasets by combining browser configuration, content filtering (including BM25 relevance scoring), and Pydantic-based LLM schemas. This lets developers avoid maintaining selectors by hand and handle multi-step session state reliably.
Key Insights
- Crawl4AI v0.8.x supports BM25 query-based filtering to extract only content relevant to specific user queries, significantly reducing markdown noise.
- The BFSDeepCrawlStrategy allows for structured multi-page exploration with domain and URL pattern filters to prevent uncontrolled crawling.
- JavaScript can be injected before extraction to handle lazy loading and dynamic DOM modifications. A custom attribute set by the injected script offers a way to verify that it ran.
- Session management enables persistent cookies and browser states across sequential requests, facilitating authenticated or multi-step crawling.
- Integration with LLMs like GPT-4o-mini via LLMExtractionStrategy allows for structured JSON output defined by Pydantic models.
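To make the BM25 filtering idea above concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring over text chunks. It is not Crawl4AI's implementation. The function names, the sample chunks, and the threshold value are assumptions chosen for illustration, and the constants k1=1.5 and b=0.75 are common defaults rather than values taken from the library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Score each chunk against the query with Okapi BM25."""
    docs = [tokenize(chunk) for chunk in chunks]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # Document frequency: number of chunks that contain each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization penalizes chunks longer than average.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def filter_chunks(query, chunks, threshold=1.0):
    """Keep only chunks whose BM25 score reaches the (hypothetical) threshold."""
    scored = zip(chunks, bm25_scores(query, chunks))
    return [chunk for chunk, score in scored if score >= threshold]


# Hypothetical page fragments: one relevant, two boilerplate.
chunks = [
    "Install the crawler with pip and run the quickstart example.",
    "Subscribe to our newsletter for weekly updates.",
    "Cookie policy and terms of service.",
]
scores = bm25_scores("crawler quickstart", chunks)
kept = filter_chunks("crawler quickstart", chunks)
```

Chunks that share no terms with the query score zero and are dropped, which is how query-based filtering removes navigation and footer noise from the generated markdown.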
Working Examples
The simplest possible crawl using AsyncWebCrawler to fetch a webpage and retrieve markdown.
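import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
    async with AsyncWebCrawler() as crawler:  # starts the browser and closes it on exit
        result = await crawler.arun(url="https://example.com")
        print(f"Title: {result.metadata.get('title')}")
        print(result.markdown.raw_markdown[:500])  # first 500 characters of markdown
asyncio.run(main())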
Extracting structured data using CSS selectors and a predefined JSON schema without an LLM.
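import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Each element matching baseSelector becomes one record. The selectors mirror
# Wikipedia's markup and may need updating if the page layout changes.
schema = {
    "name": "Wikipedia Headings",
    "baseSelector": "div.mw-parser-output h2",
    "fields": [
        {"name": "heading_text", "selector": "span.mw-headline", "type": "text"},
        {"name": "heading_id", "selector": "span.mw-headline", "type": "attribute", "attribute": "id"},
    ],
}
extraction_strategy = JsonCssExtractionStrategy(schema)
run_config = CrawlerRunConfig(extraction_strategy=extraction_strategy)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://en.wikipedia.org/wiki/Python_(programming_language)", config=run_config)
        print(result.extracted_content)  # JSON string holding the extracted records

asyncio.run(main())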
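The extraction strategy returns its records as a JSON string in result.extracted_content. The sketch below shows one way to parse that string and count records with an empty heading_text, which is often the first sign that a selector no longer matches the page. The helper name parse_headings and the sample payload are hypothetical, and the sample stands in for a real crawl result.

```python
import json


def parse_headings(extracted_content):
    """Parse the JSON string and separate complete records from empty ones."""
    records = json.loads(extracted_content) if extracted_content else []
    complete = [r for r in records if r.get("heading_text")]
    empty_count = len(records) - len(complete)
    return complete, empty_count


# Hypothetical payload in the shape the schema above would produce.
sample = json.dumps([
    {"heading_text": "History", "heading_id": "History"},
    {"heading_text": ""},
    {},
])
headings, empty_count = parse_headings(sample)
```

A high empty_count relative to the number of records suggests the schema needs updating before the data is used downstream.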
Utilizing LLM-based extraction to convert unstructured web content into Pydantic-validated JSON objects.
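import os
from pydantic import BaseModel, Field
from crawl4ai import CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Assumption: the OpenAI key is supplied through this environment variable
api_key = os.environ["OPENAI_API_KEY"]

class Article(BaseModel):
    title: str = Field(description="The article title")
    summary: str = Field(description="A brief summary")

# The Pydantic model's JSON schema tells the LLM which fields to return
llm_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token=api_key),
    schema=Article.model_json_schema(),
    extraction_type="schema",
    instruction="Extract article titles and summaries."
)
run_config = CrawlerRunConfig(extraction_strategy=llm_strategy)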
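LLM output does not always match the requested schema, so it can help to validate the returned JSON against the same Pydantic model before using it. The following is a minimal sketch assuming Pydantic v2. The helper parse_articles and the sample string are hypothetical, and the sample stands in for the extracted content of a real crawl.

```python
import json
from typing import List

from pydantic import BaseModel, Field, ValidationError


class Article(BaseModel):
    title: str = Field(description="The article title")
    summary: str = Field(description="A brief summary")


def parse_articles(extracted_content: str) -> List[Article]:
    """Validate each extracted record, skipping ones that do not fit the model."""
    records = json.loads(extracted_content)
    if isinstance(records, dict):
        # Treat a single object as a one-item list.
        records = [records]
    articles = []
    for record in records:
        try:
            articles.append(Article.model_validate(record))
        except ValidationError:
            # Record is missing a required field or has the wrong type.
            continue
    return articles


# Hypothetical LLM output: the second record lacks a summary and is dropped.
sample = '[{"title": "Hello", "summary": "A short post."}, {"title": "No summary"}]'
articles = parse_articles(sample)
```

Skipping invalid records keeps the pipeline running. Logging them instead would be a reasonable alternative when missing data needs to be investigated.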
Practical Applications
- Hacker News story tracking: Using JsonCssExtractionStrategy to pull ranked stories and site origins into a clean JSON format. Pitfall: Relying on brittle selectors that break during site layout updates.
- Documentation indexing: Using BFSDeepCrawlStrategy to recursively crawl documentation pages while filtering for specific keywords like ‘quickstart’. Pitfall: Leaving the crawl unbounded (no sensible max_depth or domain filter), which can send the crawler far beyond the intended site.
- Dynamic content extraction: Implementing custom JavaScript blocks to trigger lazy-loading via window.scrollTo before capturing page state. Pitfall: Setting insufficient delay_before_return_html, resulting in the capture of empty loading states.
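The documentation-indexing pitfall above comes down to bounding a breadth-first traversal. The sketch below illustrates the general idea with an in-memory link graph and is not Crawl4AI's BFSDeepCrawlStrategy. The function bfs_crawl_order, its parameters, and the example URLs are all hypothetical.

```python
from collections import deque
from urllib.parse import urlparse


def bfs_crawl_order(start_url, get_links, max_depth=2, max_pages=50):
    """Return (url, depth) pairs in BFS order, bounded by depth, page count, and domain."""
    allowed_domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for link in get_links(url):
            if link in seen:
                continue
            if urlparse(link).netloc != allowed_domain:
                continue  # skip external links
            seen.add(link)
            queue.append((link, depth + 1))
    return visited


# Hypothetical link graph standing in for real pages.
LINKS = {
    "https://docs.example.com/": [
        "https://docs.example.com/quickstart",
        "https://docs.example.com/api",
        "https://other.example.org/",
    ],
    "https://docs.example.com/quickstart": [
        "https://docs.example.com/install",
        "https://docs.example.com/",
    ],
    "https://docs.example.com/api": ["https://docs.example.com/api/deep"],
    "https://docs.example.com/api/deep": ["https://docs.example.com/api/deeper"],
}


def get_links(url):
    return LINKS.get(url, [])


pages = bfs_crawl_order("https://docs.example.com/", get_links, max_depth=1)
```

With max_depth=1 the traversal stops after the start page and its direct same-domain links. Raising the depth or removing the domain check expands the frontier quickly, which is why a page cap is a useful second safeguard.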