Avneesh.
01AI / Data Engineering

project_01

News Scraper & Lead Intelligence Engine

Built an async Python pipeline that aggregates business news from RSS feeds and Vertex AI Search, scrapes articles via Tor, classifies leads with Bedrock LLMs, enriches company data, performs semantic deduplication using embeddings, and stores structured results in PostgreSQL.

● Live | AI / LLM | Web Scraping
Overview

A fully automated, asynchronous news-intelligence pipeline built in Python. It ingests articles from 20 RSS feeds (outlets include The Hindu, Zee News, Hindustan Times, India TV, Times of India, ET Now, Indian Express, and PR Newswire) and from Google Vertex AI Discovery Search. The full text of each article is scraped via a Tor hidden-service API, then passed through a two-stage AWS Bedrock (Nova Lite) LLM pipeline: the first stage classifies whether the article is a business lead, and the second extracts structured entities (companies, contacts, signals, project details). Confirmed leads are enriched with LinkedIn data via Unipile. Article summaries and categories are generated by Bedrock, and keywords by YAKE. Every article summary is embedded with Alibaba-NLP/gte-large-en-v1.5 (1024-dim) and stored in PostgreSQL via pgvector so that semantically similar prior coverage can be detected. Results are fanned out across a normalised 14-table schema.
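The classify-then-extract flow can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: `classify` and `extract` are hypothetical stand-ins for the two Bedrock calls, the `Classification` fields are assumed from the description, and the stub stages exist only so the sketch runs without AWS access.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, Optional


@dataclass
class Classification:
    is_lead: bool
    confidence: str = "LOW"  # HIGH / MEDIUM / LOW
    reasons: list[str] = field(default_factory=list)


# Hypothetical signatures for the two LLM stages.
Classifier = Callable[[str], Awaitable[Classification]]
Extractor = Callable[[str], Awaitable[dict]]


async def process_article(
    text: str, classify: Classifier, extract: Extractor
) -> Optional[dict]:
    """Run stage 1 on every article; run stage 2 only for confirmed leads."""
    verdict = await classify(text)
    if not verdict.is_lead:
        return None  # non-leads never reach the extraction call
    entities = await extract(text)
    return {"confidence": verdict.confidence, "reasons": verdict.reasons, **entities}


# Stub stages (assumptions) so the sketch is runnable offline.
async def fake_classify(text: str) -> Classification:
    is_lead = "funding" in text.lower()
    reasons = ["mentions funding"] if is_lead else []
    return Classification(is_lead, "HIGH" if is_lead else "LOW", reasons)


async def fake_extract(text: str) -> dict:
    return {"companies": ["Acme"], "signals": ["funding round"]}


async def run_demo() -> list:
    articles = ["Acme raises Series B funding", "Weather update for Tuesday"]
    tasks = (process_article(a, fake_classify, fake_extract) for a in articles)
    return await asyncio.gather(*tasks)
```

Calling `asyncio.run(run_demo())` returns one result per article, with `None` for the article that stage 1 rejects.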

Problem

Business development teams spend hours every day manually scanning dozens of Indian news outlets to find companies announcing funding rounds, tech initiatives, and leadership changes — events that represent actionable sales or partnership opportunities. The process is slow, non-scalable, and produces unstructured output that is hard to act on.

Engineering Challenges
01. Reliable full-text scraping across news sites with varied anti-bot measures. Solved with a Tor-routed onion-service scraper that classifies SOCKS5 reply codes (TTL expired 0x06, general server failure 0x01 treated as host-down, connection refused 0x05) and restarts the Tor circuit only when genuinely needed.
02. Semantic deduplication across a growing article store without an O(n²) pairwise comparison. Solved by embedding each summary (1024-dim) and running a thresholded cosine-similarity query directly in PostgreSQL via pgvector.
03. Reliably extracting structured JSON (companies, contacts, project details) from unstructured LLM output. Pydantic validation is enforced at parse time on every Bedrock response, with graceful fallback defaults.
04. Long news articles that exceed the sentence-transformer's safe token window. Solved with overlapping chunking and re-normalised embeddings across all chunks.
05. Coordinating async DB pool lifecycle, port binding, and graceful shutdown in a single-process asyncio app. Handled with a singleton DatabaseManager with exponential-backoff retry and a dedicated port-lock socket.
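As an illustration of challenge 04, overlapping chunking with a re-normalised combined embedding might look like the sketch below. The window size, the overlap, and the choice of mean pooling are assumptions (the write-up only states that chunks overlap and that the result is re-normalised), and `embed` is a hypothetical callable standing in for the sentence-transformer.

```python
import math
from typing import Callable, Sequence


def chunk_tokens(tokens: Sequence[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


def l2_normalise(vec: Sequence[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def embed_long_text(
    tokens: Sequence[str],
    embed: Callable[[list[str]], Sequence[float]],
    size: int = 512,    # assumed window size
    overlap: int = 64,  # assumed overlap
) -> list[float]:
    """Embed each chunk, average the unit vectors, then re-normalise the mean."""
    chunks = chunk_tokens(tokens, size, overlap)
    if not chunks:
        raise ValueError("cannot embed empty input")
    vectors = [l2_normalise(embed(chunk)) for chunk in chunks]
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return l2_normalise(mean)
```

Re-normalising after pooling keeps the combined vector at unit length, so cosine similarity against vectors from single-chunk articles stays comparable.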
Key Decisions
Tor onion-service (SOCKS5h) over direct scraping — eliminates per-request billing and reduces TLS fingerprinting exposure.
Two-stage LLM calls (classify → extract) rather than one — saves tokens on the ~60% of articles that are not leads; the extraction stage only fires for confirmed leads.
Amazon Bedrock (Nova Lite) over OpenAI — reuses existing AWS IAM/credential infrastructure and keeps all data within the AWS boundary.
PostgreSQL + pgvector over a dedicated vector store (Pinecone/Weaviate) — keeps lead metadata and vectors co-located, eliminates a network hop, and allows JOIN-based retrieval.
YAKE (unsupervised keyword extraction) for article keywords — zero latency, zero cost, sufficient quality for downstream tagging.
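To make the pgvector decision concrete, a thresholded similarity lookup could be written along these lines. This is a sketch under stated assumptions: the table `article_vectors`, its columns, and the 0.9 threshold are hypothetical, and `conn` is any object exposing an asyncpg-style `fetch(query, *args)` coroutine. The `<=>` operator is pgvector's cosine distance, so similarity is `1 - distance`.

```python
from typing import Any, Sequence

# Hypothetical table and column names, for illustration only.
SIMILAR_SQL = """
SELECT article_id, 1 - (embedding <=> $1::vector) AS similarity
FROM article_vectors
WHERE 1 - (embedding <=> $1::vector) >= $2
ORDER BY embedding <=> $1::vector
LIMIT 1
"""


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format a sequence of floats in pgvector's text form, e.g. '[0.5,1.0]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


async def dedup_status(
    conn: Any, embedding: Sequence[float], threshold: float = 0.9  # assumed threshold
) -> str:
    """Return 'similar' if a stored vector meets the threshold, else 'unique'."""
    rows = await conn.fetch(SIMILAR_SQL, to_vector_literal(embedding), threshold)
    return "similar" if rows else "unique"
```

Because the vectors sit next to the lead metadata, the same query could JOIN against the article tables to return the matching headline or URL in one round trip.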
Stack
Python 3.11, asyncio, Pydantic v2, asyncpg, PostgreSQL, pgvector, Amazon Bedrock (Nova Lite), Sentence Transformers (GTE-Large-en-v1.5), Tor / SOCKS5h, YAKE, feedparser, Google Vertex AI Discovery Search, Unipile API, AWS Secrets Manager, python-dotenv, filelock, pytz
Key Features
Dual-source Ingestion: Collects articles from 20 RSS feeds across 16 Indian news outlets and from Google Vertex AI Discovery Search, using configurable search terms stored in PostgreSQL. Both sources write to the same flat JSON staging file, with duplicates removed.
Tor-routed Full-text Scraping: Routes each article URL through a remote onion-service scraper API. Distinguishes SOCKS5 errors for TTL expiry (0x06), general server failure (0x01, treated as host-down), and host unreachable (0x04), applies targeted retry or circuit-restart logic per error class, and records scraper diagnostics in the DB for monitoring.
Two-stage LLM Lead Pipeline: In stage 1 (classification), Bedrock Nova Lite determines whether the article is a business lead, returning a confidence level (HIGH/MEDIUM/LOW) and structured reasons. Stage 2 fires only for confirmed leads and extracts companies, contacts, business signals, project details, funding sources, and stated goals.
Unipile Company Enrichment: For each extracted company, Unipile is queried to resolve the LinkedIn profile URL, logo, website, employee count range, and headquarters city/country. Enriched data is merged into the company record before DB insertion.
Semantic Deduplication with pgvector: Each confirmed-lead article summary is embedded with Alibaba-NLP/gte-large-en-v1.5 (1024-dim). The embedding is compared against all existing vectors in PostgreSQL via pgvector cosine similarity; near-duplicate articles are stored as 'similar' rather than 'unique'.
Normalised 14-table PostgreSQL Schema: Articles fan out across primary, main, full-news, vectors, category, keywords, and search-term tables, plus leads-summary, lead-companies, lead-contacts, lead-signals, lead-projects, processed, and failed tables, enabling granular querying and per-stage failure tracking.
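The per-error-class handling in the Tor scraping feature can be illustrated with a small lookup. The reply-code names follow RFC 1928; the mapping from each code to an action is an assumed policy for illustration, since the write-up does not spell out which errors trigger a retry and which trigger a circuit restart.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"              # try again on the same circuit
    NEW_CIRCUIT = "new_circuit"  # request a fresh Tor circuit, then retry
    GIVE_UP = "give_up"          # record the failure and move on


# SOCKS5 reply codes as defined in RFC 1928.
SOCKS5_REPLIES = {
    0x01: "general SOCKS server failure",
    0x02: "connection not allowed by ruleset",
    0x03: "network unreachable",
    0x04: "host unreachable",
    0x05: "connection refused",
    0x06: "TTL expired",
    0x07: "command not supported",
    0x08: "address type not supported",
}

# Assumed policy, not taken from the project.
POLICY = {
    0x01: Action.NEW_CIRCUIT,
    0x04: Action.RETRY,
    0x05: Action.GIVE_UP,
    0x06: Action.NEW_CIRCUIT,
}


def classify_socks_error(code: int) -> tuple[str, Action]:
    """Map a SOCKS5 reply code to a readable reason and a handling action."""
    reason = SOCKS5_REPLIES.get(code, f"unknown reply 0x{code:02x}")
    return reason, POLICY.get(code, Action.GIVE_UP)
```

Keeping the policy in a table separate from the code names means the retry behaviour can be tuned without touching the classification itself.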
Metrics
16 News Sources Integrated
20 RSS Feeds Monitored
1024 Vector Embedding Dimensions
2-stage LLM Pipeline Architecture
14 Normalised DB Tables
100+ Articles Processed per Engine Run