We Scraped 3.8M+ Local News Articles. RAG Dataset, Updated Daily

Redash dashboard for our RAG project

Client: getsphere.com
Industry: Policy & compliance
Coverage & Scale: ~4.5k local/regional news domains
Engagement: Build → handoff; client now runs ops
Stack: Python, Scrapy, Redis, PostgreSQL, newspaper3k, AWS S3, Redash, Slack
Polite by design: robots.txt respected on every source


Project Snapshot

Goal: Build a reliable stream of local news for a RAG dataset, delivered every morning.
What we built: Hybrid discovery (RSS + sitemaps + landing‑page scans) → parsing articles to a clean schema → storage (PostgreSQL + S3) → Redash dashboards for metrics visibility → daily Slack summaries & alerts.

Results (top numbers):

  • Volume: 10k-31k new articles/day.
  • Throughput: ≈580k articles processed over the last 30 days.
  • Freshness SLO: 100% of yesterday’s articles available by morning.
  • Image URLs captured: ~85-88% on recent average, with alerting on dips.

Context & Scope

Unify discovery (RSS, sitemaps, landing-page scans), standardize articles into a consistent schema, store the dataset in PostgreSQL, and copy it to S3. Surface ops metrics in Redash with daily Slack summaries. Theme-agnostic: switch topics by updating sources and filters.
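
To make the schema concrete, here is a minimal sketch of what a normalized article record can look like. The field names are illustrative, chosen to match the fields described below, not the exact production schema.

    from dataclasses import dataclass, field
    from datetime import datetime
    from typing import Optional

    @dataclass
    class ArticleRecord:
        """One normalized article row (illustrative field names)."""
        url: str
        source_domain: str
        headline: str
        body: str
        authors: list[str] = field(default_factory=list)
        published_at: Optional[datetime] = None
        image_urls: list[str] = field(default_factory=list)
        discovered_via: str = "rss"  # "rss" | "sitemap" | "landing_page"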


What We Built

  • Discovery coverage:
    • RSS reader: our go-to when available
    • Sitemap scanner to sweep structured listings
    • Landing‑page checker (load the front page and crawl a level deeper → capture all article links → diff against known URLs to find new ones)
  • Parsing & normalization: newspaper3k turns messy HTML into a standard schema (headline, body, author, date, image URLs); the URL diff and parse steps are sketched after this list.
  • Storage: PostgreSQL + copy to S3 for easy RAG dataset upload.
  • Monitoring: Redash dashboards on Postgres; daily Slack digest + anomaly alerts (volume spikes/drops, image %), also sketched below.
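
The two core steps referenced above, finding new URLs on a landing page and parsing each article, reduce to something like the sketch below. The Redis key, the seen-URL strategy, and the helper names are illustrative; the newspaper3k calls are the library's standard download/parse API.

    import redis
    import requests
    from lxml import html
    from newspaper import Article

    r = redis.Redis()  # seen-URL set; key name below is illustrative

    def new_links(front_page_url: str) -> list[str]:
        """Scan a landing page and keep only article URLs we haven't seen before."""
        page = requests.get(front_page_url, timeout=30)
        links = set(html.fromstring(page.content).xpath("//a/@href"))
        # SADD returns 1 only for members that were not already in the set,
        # which gives the "diff for new URLs" step in one round trip per link.
        # (Relative links would need urljoin against the front page in practice.)
        return [u for u in links if r.sadd("seen_urls", u)]

    def parse_article(url: str) -> dict:
        """Normalize one article into the schema used downstream."""
        art = Article(url)
        art.download()
        art.parse()
        return {
            "url": url,
            "headline": art.title,
            "body": art.text,
            "authors": art.authors,
            "published_at": art.publish_date,
            "image_urls": [art.top_image] if art.top_image else [],
        }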
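
The daily Slack digest can be as simple as one Postgres query plus an incoming-webhook post. Table and column names, the connection string, and the alert thresholds below are assumptions for illustration, not the exact production setup.

    import psycopg2
    import requests

    SLACK_WEBHOOK = "https://hooks.slack.com/services/..."  # incoming-webhook URL (placeholder)

    def daily_digest():
        """Summarize yesterday's crawl from Postgres and post it to Slack."""
        conn = psycopg2.connect("dbname=news")  # connection details are illustrative
        with conn, conn.cursor() as cur:
            cur.execute("""
                SELECT count(*),
                       round(100.0 * count(*) FILTER (WHERE array_length(image_urls, 1) > 0)
                             / NULLIF(count(*), 0), 1)
                FROM articles
                WHERE published_at::date = current_date - 1
            """)
            total, image_pct = cur.fetchone()
        text = f"Yesterday: {total} articles, {image_pct}% with image URLs"
        if total < 10_000 or (image_pct or 0) < 85:  # thresholds mirror the alerting above
            text = ":warning: " + text
        requests.post(SLACK_WEBHOOK, json={"text": text})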

Outcomes for the Client

  • An analysis‑ready local news dataset that refreshes daily.
  • Consistent schema and predictable volume, enabling downstream labeling and retrieval.
  • Ops visibility via Redash charts and Slack alerts – no manual checking.

Reliability & Ops (how it stays healthy)

  • Source failover: RSS → sitemap → landing‑page scan (sketched below).
  • Retries/backoff and shardable Scrapy workers.
  • Alert‑driven workflow: drops in daily volume or image coverage trigger Slack summaries with Redash links.
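
The failover order and backoff from the list above reduce to a small loop. The three discovery callables passed in stand in for the real RSS, sitemap, and landing-page steps.

    import time
    from typing import Callable, Iterable

    def discover_with_failover(source: str,
                               methods: Iterable[Callable[[str], list[str]]],
                               max_retries: int = 3) -> list[str]:
        """Try discovery methods in order (RSS -> sitemap -> landing-page scan),
        retrying each with exponential backoff before falling through to the next."""
        for method in methods:
            for attempt in range(max_retries):
                try:
                    urls = method(source)
                    if urls:
                        return urls
                except Exception:
                    time.sleep(2 ** attempt)  # back off, then retry this method
        return []  # nothing found; the daily-volume drop surfaces in Redash/Slack

    # usage (the callables are placeholders for the real discovery functions):
    # urls = discover_with_failover("local-paper.example", [fetch_rss, fetch_sitemap, scan_landing_page])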

Tooling choices & cost insights

  • We currently use newspaper3k for all articles; occasional parsing inaccuracies are outweighed by its speed and coverage.
  • Going Zyte API only would land around $3,000-$4,500/month. Our newspaper3k servers cost ~$100/month at today’s scale.
  • A hybrid option is available: keep newspaper3k by default and route the hardest sources through Zyte when it adds clear value (sketched below).
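
A rough sketch of what that routing could look like. The hard-source list and the fetch_via_zyte helper are hypothetical placeholders, not the production implementation; parse_article is the newspaper3k path from the earlier sketch.

    HARD_SOURCES = {"js-heavy-paper.example"}  # hypothetical list of domains newspaper3k struggles with

    def parse_with_routing(url: str, domain: str) -> dict:
        """Default to newspaper3k; send only the hardest domains through the paid API."""
        if domain in HARD_SOURCES:
            return fetch_via_zyte(url)   # placeholder for the Zyte-backed path
        return parse_article(url)        # newspaper3k path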