TechCrunch

the_data_packet.sources.techcrunch

TechCrunch.com article source implementation.

This module collects articles from TechCrunch.com by combining RSS feeds with web scraping. TechCrunch publishes a separate RSS feed for each category, listing recent article URLs; each listed article page is then scraped for its full content.

Features
  • RSS feed-based article discovery
  • Multiple category support (ai, security)
  • Robust content extraction with fallback methods
  • Content cleaning and validation
  • Error handling for network issues and malformed content
RSS Feed Strategy
  1. Fetch category-specific RSS feed
  2. Parse feed to extract article URLs
  3. Scrape individual articles for full content
  4. Clean and validate extracted content
  5. Return standardized Article objects
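
Steps 1 and 2 above can be sketched with the standard library alone. This is a minimal illustration, not the module's actual code: the function name `extract_article_urls`, the `limit` parameter, and the sample feed are hypothetical, and the real implementation may use a different parser.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical, trimmed-down RSS 2.0 payload standing in for a fetched feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>https://example.com/first-post/</link>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second-post/</link>
    </item>
  </channel>
</rss>
"""


def extract_article_urls(feed_xml: str, limit: int) -> List[str]:
    """Return up to `limit` article URLs from an RSS 2.0 document, in feed order."""
    root = ET.fromstring(feed_xml)
    urls: List[str] = []
    for item in root.iter("item"):
        if len(urls) >= limit:
            break
        link = item.findtext("link")
        # Skip items whose <link> element is missing or empty.
        if link and link.strip():
            urls.append(link.strip())
    return urls


if __name__ == "__main__":
    for url in extract_article_urls(SAMPLE_FEED, limit=5):
        print(url)
```

RSS feeds conventionally list the newest items first, so under that assumption the first URL returned corresponds to the latest article.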
Content Extraction
  • Primary: Article body containers and paragraph tags
  • Fallback: Main content areas and text containers
  • Cleaning: Remove navigation, ads, and boilerplate text
  • Validation: Ensure sufficient content length
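
The paragraph-based extraction and length validation described above might look roughly like the following sketch. It uses only the standard-library `html.parser`. The class name `ParagraphCollector`, the function `extract_content`, the ignored tag set, and the `MIN_CONTENT_LENGTH` threshold are all assumptions made for illustration; the module's real selectors and limits are not documented here.

```python
from html.parser import HTMLParser
from typing import List

# Assumed threshold for illustration; the module's real minimum is not documented.
MIN_CONTENT_LENGTH = 40


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element that is not inside an ignored tag."""

    IGNORED_TAGS = {"nav", "aside", "script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.paragraphs: List[str] = []
        self._ignored_depth = 0
        self._in_paragraph = False
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.IGNORED_TAGS:
            self._ignored_depth += 1
        elif tag == "p" and self._ignored_depth == 0:
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in self.IGNORED_TAGS and self._ignored_depth > 0:
            self._ignored_depth -= 1
        elif tag == "p" and self._in_paragraph:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)


def extract_content(html: str) -> str:
    """Join paragraph text and reject results shorter than the minimum length."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    content = "\n\n".join(parser.paragraphs)
    if len(content) < MIN_CONTENT_LENGTH:
        raise ValueError("extracted content is too short")
    return content
```

A fallback pass, as the list above describes, would repeat the same idea against broader containers when this primary pass yields too little text.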
Supported Categories
  • ai: Artificial intelligence and machine learning articles
  • security: Security and cybersecurity articles
Rate Limiting
  • Respectful delays between requests
  • Connection reuse via HTTP session
  • Proper User-Agent identification
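
One way to enforce a delay between requests is a small throttle that tracks when the previous request was allowed. The sketch below is illustrative only: `RequestThrottle`, its `min_interval` parameter, and the injectable `clock` and `sleep` callables are hypothetical names, and the actual delay used by the module is not stated in this documentation. Session reuse and User-Agent headers are not shown.

```python
import time
from typing import Callable, Optional


class RequestThrottle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request: Optional[float] = None

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_request is not None:
            elapsed = now - self._last_request
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        # Record the moment at which this request is allowed to proceed.
        self._last_request = now + delay
        return delay
```

Calling `wait()` immediately before each HTTP request keeps the scraping loop simple. Injecting `clock` and `sleep` makes the behavior testable without real waiting.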
Example Usage

from the_data_packet.sources.techcrunch import TechCrunchSource

source = TechCrunchSource()

# Get latest AI article
article = source.get_latest_article("ai")

# Get multiple security articles
articles = source.get_multiple_articles("security", count=3)

# Check supported categories
if "ai" in source.supported_categories:
    ai_articles = source.get_multiple_articles("ai", count=5)

logger = get_logger(__name__) module-attribute

TechCrunchSource

Article source for TechCrunch.com.

http_client = HTTPClient() instance-attribute

RSS_FEEDS = {'ai': 'https://techcrunch.com/category/artificial-intelligence/feed/', 'security': 'https://techcrunch.com/category/security/feed/'} class-attribute instance-attribute
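
A category-to-feed lookup over this mapping could be written as below. The helper name `feed_url_for` and the choice of `ValueError` for unknown categories are assumptions for illustration; the documentation does not state how the class reports an unsupported category.

```python
# Copied from the RSS_FEEDS class attribute documented above.
RSS_FEEDS = {
    "ai": "https://techcrunch.com/category/artificial-intelligence/feed/",
    "security": "https://techcrunch.com/category/security/feed/",
}


def feed_url_for(category: str) -> str:
    """Return the feed URL for a category, or raise ValueError if unsupported."""
    try:
        return RSS_FEEDS[category]
    except KeyError:
        supported = ", ".join(sorted(RSS_FEEDS))
        raise ValueError(
            f"unsupported category {category!r}; expected one of: {supported}"
        ) from None
```

Note that the short key `ai` maps to the longer `artificial-intelligence` path segment in the feed URL.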

SKIP_PATTERNS = ['subscribe to techcrunch', 'most popular', 'related articles', 'advertisement', 'get techcrunch', 'sign up', 'newsletter', 'techcrunch+', 'techcrunch disrupt', 'more techcrunch', 'follow us', 'share this article'] class-attribute instance-attribute
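
Because every pattern is lowercase, a plausible reading is that paragraphs are lowercased and dropped when they contain any pattern as a substring. The sketch below assumes exactly that; `is_boilerplate` and `clean_paragraphs` are hypothetical helper names, and the real matching rule may differ.

```python
from typing import List

# Copied from the SKIP_PATTERNS class attribute documented above.
SKIP_PATTERNS = [
    "subscribe to techcrunch", "most popular", "related articles",
    "advertisement", "get techcrunch", "sign up", "newsletter",
    "techcrunch+", "techcrunch disrupt", "more techcrunch",
    "follow us", "share this article",
]


def is_boilerplate(text: str) -> bool:
    """Return True if the text contains any skip pattern, ignoring case."""
    lowered = text.lower()
    return any(pattern in lowered for pattern in SKIP_PATTERNS)


def clean_paragraphs(paragraphs: List[str]) -> List[str]:
    """Keep only the paragraphs that do not look like site boilerplate."""
    return [p for p in paragraphs if not is_boilerplate(p)]
```

Substring matching is simple but blunt: under this assumption, a genuine sentence that happens to contain "sign up" or "newsletter" would also be removed.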

name: str property

Source name identifier.

supported_categories: List[str] property

List of supported categories.

__init__() -> None

Initialize TechCrunch source.

get_latest_article(category: str) -> Article

Get the latest article from a category.

get_multiple_articles(category: str, count: int) -> List[Article]

Get multiple articles from a category.