project_01
News Scraper & Lead Intelligence Engine
Built an async Python pipeline that aggregates business news from RSS feeds and Vertex AI Search, scrapes articles via Tor, classifies leads with Bedrock LLMs, enriches company data, performs semantic deduplication using embeddings, and stores structured results in PostgreSQL.
This fully automated, asynchronous news-intelligence pipeline is built in Python. It ingests articles from 20 RSS feeds (The Hindu, Zee News, Hindustan Times, India TV, Times of India, ET Now, Indian Express, PR Newswire) and from Google Vertex AI Search. The full text of each article is scraped through a Tor hidden-service API and then passed through a two-stage AWS Bedrock (Nova Lite) LLM pipeline: the first stage classifies whether the article is a business lead, and the second extracts structured entities (companies, contacts, signals, project details). Confirmed leads are enriched with LinkedIn data via Unipile. Bedrock generates article summaries and categories, and YAKE extracts keywords. Every article summary is embedded with Alibaba-NLP/gte-large-en-v1.5 (1024-dim) and stored in a pgvector store in PostgreSQL so that semantically similar prior coverage can be detected. Results are fanned out across a normalised 14-table schema.
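The semantic-deduplication step can be illustrated with a minimal, dependency-free sketch. It is not the project's actual code: in the pipeline the comparison would presumably run inside PostgreSQL through pgvector, and the function names, the in-memory `stored` dictionary, and the 0.9 similarity threshold below are all assumptions chosen for illustration. The toy vectors are 3-dimensional rather than 1024-dimensional.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_prior_coverage(
    new_embedding: Sequence[float],
    stored: Dict[str, Sequence[float]],
    threshold: float = 0.9,  # assumed cut-off, not taken from the project
) -> Optional[Tuple[str, float]]:
    """Return (article_id, score) of the most similar stored summary whose
    similarity is at or above the threshold, or None if there is none."""
    best: Optional[Tuple[str, float]] = None
    for article_id, embedding in stored.items():
        score = cosine_similarity(new_embedding, embedding)
        if score >= threshold and (best is None or score > best[1]):
            best = (article_id, score)
    return best


if __name__ == "__main__":
    # Hypothetical stored summary embeddings keyed by article id.
    stored = {"a1": [1.0, 0.0, 0.0], "a2": [0.0, 1.0, 0.0]}
    print(find_prior_coverage([0.9, 0.1, 0.0], stored))  # close to "a1"
    print(find_prior_coverage([1.0, 1.0, 0.0], stored))  # no close match
```

A linear scan like this is only practical for small collections; a vector index in the database serves the same purpose at scale.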
Business development teams spend hours every day manually scanning dozens of Indian news outlets to find companies announcing funding rounds, tech initiatives, and leadership changes, all of which are events that represent actionable sales or partnership opportunities. The process is slow, does not scale, and produces unstructured output that is hard to act on.