Building a Low-Cost Pipeline for U.S. Congress Trading Data
These articles are AI-generated summaries. Please check the original sources for full details.
I built two Apify actors that scrape U.S. Congress trading data — directly from government sources, no QuiverQuant
Engineer Fatih İlhan developed a custom scraping pipeline using Apify actors to extract U.S. Senate and House Periodic Transaction Reports. The system replaces commercial APIs at approximately 1/10th the cost, operating for less than $1 per day.
Why This Matters
The STOCK Act of 2012 makes congressional trades public, but retrieving them is technically awkward: the Senate’s Django-based efdsearch.senate.gov sits behind Akamai bot protection, and House disclosures arrive as PDFs whose extracted text is chaotic. Commercial aggregators often return inconsistent data shapes or paywall transaction-level detail. A direct-to-government pipeline supports idempotent synchronization and parses about 95% of records cleanly, without third-party reliability issues or high subscription costs.
Key Insights
- Senate Django applications use session-based CSRF gates requiring pinned residential proxies to maintain state between the agreement POST and data retrieval (2026).
- House PTR PDFs require marker-anchored parsing because text extraction is chaotic: transaction types, dates, and amounts come out with no whitespace separating them.
- The House of Representatives publishes daily-updated ZIP files containing XML indices and individual transaction PDFs at disclosures-clerk.house.gov.
- Deterministic deduplication using SHA-256 hashes of natural keys (politician, date, asset, and amount) prevents duplicate entries across independent actor runs.
- Axios default redirect handling can drop critical Set-Cookie headers from 302 responses, requiring manual redirect chain walking for session maintenance.
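The redirect point in the last insight can be sketched as follows: walk the chain one hop at a time, collect every Set-Cookie header along the way, and replay the cookies on the next request. This is a minimal sketch, not the author's implementation. `walkRedirects` and the injected `requestOnce(url, cookieHeader)` callback are hypothetical names. With Axios, such a callback would typically be built on `maxRedirects: 0` plus a permissive `validateStatus`.

```javascript
// Hypothetical helper: `requestOnce(url, cookieHeader)` must perform ONE request
// without following redirects and resolve to { status, headers }.
async function walkRedirects(requestOnce, startUrl, maxHops = 10) {
  const jar = new Map(); // cookie name -> value
  let url = startUrl;

  for (let hop = 0; hop <= maxHops; hop++) {
    // Replay everything collected so far as a single Cookie header.
    const cookieHeader = [...jar].map(([k, v]) => `${k}=${v}`).join('; ');
    const res = await requestOnce(url, cookieHeader);

    // Node-style headers expose Set-Cookie as an array; tolerate a single string too.
    const setCookie = [].concat(res.headers['set-cookie'] ?? []);
    for (const line of setCookie) {
      const pair = line.split(';')[0]; // drop attributes such as Path and HttpOnly
      const eq = pair.indexOf('=');
      if (eq > 0) jar.set(pair.slice(0, eq).trim(), pair.slice(eq + 1).trim());
    }

    const location = res.headers.location;
    if (res.status >= 300 && res.status < 400 && location) {
      url = new URL(location, url).toString(); // resolve relative Location values
      continue;
    }
    return { response: res, finalUrl: url, cookies: Object.fromEntries(jar) };
  }
  throw new Error(`Too many redirects (more than ${maxHops})`);
}
```

Injecting the request function keeps the cookie logic independent of any HTTP library, so it can be exercised with a stub before it is pointed at a live site. The sketch ignores cookie attributes such as Domain and Expires.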
Working Examples
Pinning to a single residential exit IP to maintain the Django prohibition_agreement state.
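// Assumption: proxyConfig comes from Actor.createProxyConfiguration({ groups: ['RESIDENTIAL'] }) in the Apify SDK.
// Reusing one session ID for every request keeps the whole flow on the same exit IP.
const sessionId = `senate_${Date.now()}`;
const proxyUrl = await proxyConfig.newUrl(sessionId);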
Marker-anchored regex used to identify row anchors and glued-together transaction data in House PDFs.
const MARKER_RE = /(?:\(([A-Z][A-Z0-9.\-]{0,5})\)\s*)?\[([A-Z]{2})\]/;
const TX_RE = /(S\s*\(partial\)|P|S|E)\s*(\d{1,2}\/\d{1,2}\/\d{4})\s*(\d{1,2}\/\d{1,2}\/\d{4})\s*\$([\d,]+)\s*-\s*\$([\d,]+)/;
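To show how the two patterns might fit together, here is a minimal sketch that applies them to a single glued row. The `parseRow` helper, the output field names, and the sample rows in the usage note are illustrative assumptions, not taken from the original pipeline. The sketch also assumes that the two dates are the transaction date followed by the notification date.

```javascript
const MARKER_RE = /(?:\(([A-Z][A-Z0-9.\-]{0,5})\)\s*)?\[([A-Z]{2})\]/;
const TX_RE = /(S\s*\(partial\)|P|S|E)\s*(\d{1,2}\/\d{1,2}\/\d{4})\s*(\d{1,2}\/\d{1,2}\/\d{4})\s*\$([\d,]+)\s*-\s*\$([\d,]+)/;

// Hypothetical helper: split one extracted row into structured fields.
function parseRow(text) {
  const marker = MARKER_RE.exec(text);
  if (!marker) return null; // no "[XX]" asset-type marker, so not a transaction row

  // Search for the transaction only AFTER the marker, so capital letters
  // in the asset name cannot be mistaken for a transaction type.
  const rest = text.slice(marker.index + marker[0].length);
  const tx = TX_RE.exec(rest);
  if (!tx) return null;

  const toNumber = (s) => Number(s.replace(/,/g, ''));
  return {
    asset: text.slice(0, marker.index).trim(), // everything before the marker
    ticker: marker[1] ?? null,                 // optional "(TICKER)" group
    assetType: marker[2],                      // two-letter code inside the brackets
    txType: tx[1].replace(/\s+/g, ' '),        // collapse whitespace runs to one space
    transactionDate: tx[2],                    // assumed column order
    notificationDate: tx[3],                   // assumed column order
    amountMin: toNumber(tx[4]),
    amountMax: toNumber(tx[5]),
  };
}
```

For a made-up row such as `Apple Inc. (AAPL) [ST]P01/15/202501/20/2025$1,001 - $15,000`, the marker supplies the ticker and asset type, and the transaction pattern then separates the type, both dates, and the amount range despite the missing whitespace.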
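Consuming the unified JSON schema via the Apify JavaScript client (the apify-client package).
import { ApifyClient } from 'apify-client';
// Assumption: the API token is in APIFY_TOKEN; 'senate-dataset-id' is a placeholder for a real dataset ID.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
const { items } = await client.dataset('senate-dataset-id').listItems({ limit: 200 });
const recentBuys = items.filter(t => t.type === 'buy');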
Practical Applications
- Use case: Bypassing session-locked government portals by pinning residential proxy IPs. Pitfall: Using standard rotating datacenter proxies causes session expiration and 403 errors.
- Use case: PDF data extraction for machine-generated documents with poor text ordering. Pitfall: Relying on standard whitespace splitters when font-glyph hacks merge data columns into single strings.
- Use case: Idempotent database synchronization using SHA-256 content hashes. Pitfall: Using auto-incrementing primary keys which cause duplicate records when scraping the same source document twice.
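The content-hash approach in the last item can be sketched with Node's built-in crypto module. This is a minimal sketch under stated assumptions: the field names (`politician`, `date`, `asset`, `amount`) follow the natural key described above, but the normalization rules, the `|` separator, and the `tradeId` and `dedupe` helpers are hypothetical.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical helper: derive a stable ID from the natural key of a trade.
function tradeId({ politician, date, asset, amount }) {
  const natural = [politician, date, asset, amount]
    // Assumed normalization: trim and lowercase so cosmetic differences
    // between runs do not change the hash.
    .map((part) => String(part).trim().toLowerCase())
    .join('|');
  return createHash('sha256').update(natural, 'utf8').digest('hex');
}

// Hypothetical helper: keep only records whose ID has not been seen before.
// `seen` can be shared across runs (for example, loaded from storage).
function dedupe(records, seen = new Set()) {
  const fresh = [];
  for (const record of records) {
    const id = tradeId(record);
    if (seen.has(id)) continue; // same trade scraped again, so skip it
    seen.add(id);
    fresh.push({ id, ...record });
  }
  return fresh;
}
```

Because the ID depends only on the record's content, scraping the same source document twice yields the same ID, so an upsert keyed on `id` leaves the stored data unchanged.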