TechCrunch
the_data_packet.sources.techcrunch
TechCrunch.com article source implementation.
This module implements article collection from TechCrunch.com using RSS feeds and web scraping. TechCrunch provides RSS feeds for different categories that contain recent article URLs, which are then scraped for full content.
Features
- RSS feed-based article discovery
- Multiple category support (ai, security)
- Robust content extraction with fallback methods
- Content cleaning and validation
- Error handling for network issues and malformed content
RSS Feed Strategy
- Fetch category-specific RSS feed
- Parse feed to extract article URLs
- Scrape individual articles for full content
- Clean and validate extracted content
- Return standardized Article objects
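The feed-parsing step above can be sketched with the standard library alone. This is a minimal illustration, assuming the feed is plain RSS 2.0 with one `<link>` per `<item>`; the function name `extract_article_urls` and the `limit` parameter are hypothetical and not part of `TechCrunchSource`.

```python
import xml.etree.ElementTree as ET
from typing import List


def extract_article_urls(feed_xml: str, limit: int = 5) -> List[str]:
    """Return up to `limit` article URLs from an RSS 2.0 document.

    Hypothetical helper: the real module may parse feeds differently.
    """
    root = ET.fromstring(feed_xml)
    urls: List[str] = []
    # RSS 2.0 places each entry in an <item> element with a <link> child.
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            urls.append(link.strip())
        if len(urls) >= limit:
            break
    return urls


SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First</title><link>https://example.com/a</link></item>
    <item><title>No link</title></item>
    <item><title>Second</title><link> https://example.com/b </link></item>
  </channel>
</rss>"""

print(extract_article_urls(SAMPLE_FEED))
```

Items without a usable link are skipped rather than raising, so a single malformed entry does not abort the whole feed.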
Content Extraction
- Primary: Article body containers and paragraph tags
- Fallback: Main content areas and text containers
- Cleaning: Remove navigation, ads, and boilerplate text
- Validation: Ensure sufficient content length
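The primary/fallback idea can be illustrated as follows. This is a sketch under stated assumptions: it treats paragraphs inside an `<article>` element as the primary source and all paragraphs on the page as the fallback. The actual selectors used by the module are not documented here, and `ParagraphCollector` and `extract_paragraphs` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List


class ParagraphCollector(HTMLParser):
    """Collect <p> text and note which paragraphs sit inside <article>."""

    def __init__(self) -> None:
        super().__init__()
        self.article_depth = 0  # greater than 0 while inside <article>
        self.in_paragraph = False
        self.buffer: List[str] = []
        self.article_paragraphs: List[str] = []
        self.all_paragraphs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag == "p":
            self.in_paragraph = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth > 0:
            self.article_depth -= 1
        elif tag == "p" and self.in_paragraph:
            # Join text fragments and collapse runs of whitespace.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.all_paragraphs.append(text)
                if self.article_depth > 0:
                    self.article_paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.buffer.append(data)


def extract_paragraphs(html: str) -> List[str]:
    """Prefer paragraphs inside <article>; fall back to every paragraph."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return collector.article_paragraphs or collector.all_paragraphs
```

When an `<article>` container is present, navigation paragraphs outside it are ignored; when it is absent, the function returns whatever paragraph text the page has, which the cleaning step can then filter.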
Supported Categories
- ai: Artificial intelligence and machine learning articles
- security: Security and cybersecurity articles
Rate Limiting
- Respectful delays between requests
- Connection reuse via HTTP session
- Proper User-Agent identification
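A delay between requests can be enforced with a small throttle object. The sketch below is illustrative only: `RequestThrottle` and the `min_interval` default are assumptions, since the actual delay used by the module is not stated, and session reuse and User-Agent headers are left to the HTTP client.

```python
import time
from typing import Optional


class RequestThrottle:
    """Enforce a minimum interval between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval: float = 1.0) -> None:
        self.min_interval = min_interval
        self._last_request: Optional[float] = None  # monotonic time of previous call

    def wait(self) -> float:
        """Sleep until `min_interval` has elapsed since the last call.

        Returns the number of seconds requested from time.sleep (0.0 if none).
        """
        slept = 0.0
        if self._last_request is not None:
            remaining = self.min_interval - (time.monotonic() - self._last_request)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last_request = time.monotonic()
        return slept
```

Calling `throttle.wait()` immediately before each fetch keeps the spacing logic in one place; `time.monotonic()` is used so that system clock adjustments do not shorten or lengthen the delay.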
Example Usage
source = TechCrunchSource()

# Get latest AI article
article = source.get_latest_article("ai")

# Get multiple security articles
articles = source.get_multiple_articles("security", count=3)

# Check supported categories
if "ai" in source.supported_categories:
    ai_articles = source.get_multiple_articles("ai", count=5)
logger = get_logger(__name__)
module-attribute
TechCrunchSource
Article source for TechCrunch.com.
http_client = HTTPClient()
instance-attribute
RSS_FEEDS = {'ai': 'https://techcrunch.com/category/artificial-intelligence/feed/', 'security': 'https://techcrunch.com/category/security/feed/'}
class-attribute
instance-attribute
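As an illustration of how a category key might be resolved to its feed URL, here is a minimal sketch. It assumes the supported categories are the keys of `RSS_FEEDS` and that an unknown category raises `ValueError`; neither behavior is confirmed by this page, and `feed_url_for` is a hypothetical name.

```python
from typing import Dict, List

RSS_FEEDS: Dict[str, str] = {
    "ai": "https://techcrunch.com/category/artificial-intelligence/feed/",
    "security": "https://techcrunch.com/category/security/feed/",
}


def supported_categories() -> List[str]:
    """Assumed to mirror the keys of RSS_FEEDS."""
    return list(RSS_FEEDS)


def feed_url_for(category: str) -> str:
    """Look up the feed URL for a category (hypothetical helper).

    The exception type raised by the real class is an assumption here.
    """
    try:
        return RSS_FEEDS[category]
    except KeyError:
        raise ValueError(
            f"Unsupported category {category!r}; "
            f"expected one of {supported_categories()}"
        ) from None
```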
SKIP_PATTERNS = ['subscribe to techcrunch', 'most popular', 'related articles', 'advertisement', 'get techcrunch', 'sign up', 'newsletter', 'techcrunch+', 'techcrunch disrupt', 'more techcrunch', 'follow us', 'share this article']
class-attribute
instance-attribute
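One plausible way to apply these patterns is a case-insensitive substring check on each paragraph, followed by a length check on the joined result. The sketch below assumes exactly that; the matching rule, the helper names, and the `MIN_CONTENT_LENGTH` threshold are hypothetical, not taken from the module.

```python
from typing import List, Optional

SKIP_PATTERNS = [
    "subscribe to techcrunch", "most popular", "related articles",
    "advertisement", "get techcrunch", "sign up", "newsletter",
    "techcrunch+", "techcrunch disrupt", "more techcrunch",
    "follow us", "share this article",
]

MIN_CONTENT_LENGTH = 200  # hypothetical threshold, in characters


def clean_paragraphs(paragraphs: List[str]) -> List[str]:
    """Drop empty paragraphs and any that contain a skip pattern."""
    kept: List[str] = []
    for paragraph in paragraphs:
        text = " ".join(paragraph.split())  # collapse whitespace
        lowered = text.lower()
        if not text or any(pattern in lowered for pattern in SKIP_PATTERNS):
            continue
        kept.append(text)
    return kept


def build_content(
    paragraphs: List[str], min_length: int = MIN_CONTENT_LENGTH
) -> Optional[str]:
    """Join cleaned paragraphs; return None if the result is too short."""
    content = "\n\n".join(clean_paragraphs(paragraphs))
    return content if len(content) >= min_length else None
```

Substring matching is deliberately blunt: a genuine paragraph that happens to contain "sign up" would also be dropped, so a stricter rule (for example, matching only short paragraphs) may be preferable in practice.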
name: str
property
Source name identifier.
supported_categories: List[str]
property
List of supported categories.
__init__() -> None
Initialize TechCrunch source.
get_latest_article(category: str) -> Article
Get the latest article from a category.
get_multiple_articles(category: str, count: int) -> List[Article]
Get multiple articles from a category.