Sources Module - Article Extraction¶
The sources module handles article collection and content extraction from various news websites. Content extraction is integrated directly into source classes.
Wired Source¶
Wired.com article source implementation.
This module implements article collection from Wired.com using RSS feeds and web scraping. Wired.com provides RSS feeds for different categories that contain recent article URLs, which are then scraped for full content.
- Features:
RSS feed-based article discovery
Multiple category support (security, guides, business, science, AI)
Robust content extraction with fallback methods
Content cleaning and validation
Error handling for network issues and malformed content
- RSS Feed Strategy:
Fetch category-specific RSS feed
Parse feed to extract article URLs
Scrape individual articles for full content
Clean and validate extracted content
Return standardized Article objects
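The feed-parsing step of this strategy can be sketched roughly as follows. This is a simplified illustration using the standard library's xml.etree (the actual feed-parsing library used by the package is not shown in this documentation), with a hypothetical sample feed:

```python
import xml.etree.ElementTree as ET
from typing import List


def extract_article_urls(rss_xml: str, limit: int = 5) -> List[str]:
    """Parse an RSS 2.0 document and return up to `limit` article URLs."""
    root = ET.fromstring(rss_xml)
    # RSS 2.0 nests <item> elements under <channel>; each has a <link>.
    links = [item.findtext("link") for item in root.iter("item")]
    return [url for url in links if url][:limit]


sample_feed = """<rss version="2.0"><channel>
  <title>Security</title>
  <item><link>https://www.wired.com/story/example-one/</link></item>
  <item><link>https://www.wired.com/story/example-two/</link></item>
</channel></rss>"""

urls = extract_article_urls(sample_feed)
```

Each returned URL would then be fetched and scraped individually, as the remaining steps describe.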
- Content Extraction:
Primary: Article body containers and paragraph tags
Fallback: Main content areas and text containers
Cleaning: Remove navigation, ads, and boilerplate text
Validation: Ensure sufficient content length
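The primary/fallback/validate flow above can be sketched with the standard library's HTML parser. The real sources may well use a dedicated HTML library and different container selectors; this is only an illustration of the cascade, not the package's actual extraction code:

```python
from html.parser import HTMLParser
from typing import List, Optional


class ParagraphExtractor(HTMLParser):
    """Collect text from <p> tags inside a given container tag."""

    def __init__(self, container: str):
        super().__init__()
        self.container = container
        self.depth = 0          # nesting level inside the container tag
        self.in_p = False
        self.paragraphs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == self.container:
            self.depth += 1
        elif tag == "p" and self.depth:
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self.container and self.depth:
            self.depth -= 1
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data


def extract_content(html: str, min_length: int = 100) -> Optional[str]:
    # Primary: <article> body; fallbacks: <main>, then <body>.
    for container in ("article", "main", "body"):
        parser = ParagraphExtractor(container)
        parser.feed(html)
        text = "\n\n".join(p.strip() for p in parser.paragraphs if p.strip())
        if len(text) >= min_length:  # validation: sufficient content length
            return text
    return None
```

Returning None on failure lets the caller fall back to the next candidate article rather than producing an empty one.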
- Supported Categories:
security: Security and cybersecurity articles
guide: How-to guides and tutorials
business: Business and industry news
science: Science and technology research
ai: Artificial intelligence and machine learning
- Rate Limiting:
Respectful delays between requests
Connection reuse via HTTP session
Proper User-Agent identification
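The delay behavior can be sketched as a small throttle object. The actual interval and User-Agent string used by the package are not documented here, so both values below are placeholders:

```python
import time

# Hypothetical User-Agent; the real identification string is not shown here.
USER_AGENT = "the-data-packet/1.0 (article collector)"


class RequestThrottle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self) -> None:
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()


throttle = RequestThrottle(min_interval=0.1)
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # first call is free; later calls pace themselves
    # session.get(url, headers={"User-Agent": USER_AGENT}) would go here
elapsed = time.monotonic() - start
```

Reusing a single HTTP session across calls then gives connection reuse on top of the pacing.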
- Example Usage:
source = WiredSource()

# Get latest security article
article = source.get_latest_article("security")

# Get multiple guide articles
articles = source.get_multiple_articles("guide", count=3)

# Check supported categories
if "ai" in source.supported_categories:
    ai_articles = source.get_multiple_articles("ai", count=5)
- class the_data_packet.sources.wired.WiredSource[source]¶
Bases:
ArticleSource
Article source for Wired.com.
- RSS_FEEDS = {'ai': 'https://www.wired.com/feed/tag/ai/latest/rss', 'science': 'https://www.wired.com/feed/category/science/latest/rss', 'security': 'https://www.wired.com/feed/category/security/latest/rss'}¶
- SKIP_PATTERNS = ['subscribe to wired', 'most popular', 'related stories', 'advertisement', 'get wired', 'sign up', 'newsletter']¶
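How the class applies SKIP_PATTERNS is not shown in this documentation; one plausible use, sketched below, is case-insensitive substring filtering of extracted paragraphs (the pattern list is copied from the class attribute above, the sample paragraphs are invented):

```python
from typing import List

SKIP_PATTERNS = ['subscribe to wired', 'most popular', 'related stories',
                 'advertisement', 'get wired', 'sign up', 'newsletter']


def filter_boilerplate(paragraphs: List[str]) -> List[str]:
    """Drop paragraphs whose text contains a known boilerplate pattern."""
    return [
        p for p in paragraphs
        if not any(pattern in p.lower() for pattern in SKIP_PATTERNS)
    ]


paragraphs = [
    "A critical flaw was disclosed this week.",
    "Subscribe to WIRED for unlimited access.",
    "Sign up for our newsletter.",
    "Researchers traced the bug to a parser.",
]
cleaned = filter_boilerplate(paragraphs)
```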
TechCrunch Source¶
TechCrunch.com article source implementation.
This module implements article collection from TechCrunch.com using RSS feeds and web scraping. TechCrunch provides RSS feeds for different categories that contain recent article URLs, which are then scraped for full content.
- Features:
RSS feed-based article discovery
Multiple category support (artificial-intelligence, security)
Robust content extraction with fallback methods
Content cleaning and validation
Error handling for network issues and malformed content
- RSS Feed Strategy:
Fetch category-specific RSS feed
Parse feed to extract article URLs
Scrape individual articles for full content
Clean and validate extracted content
Return standardized Article objects
- Content Extraction:
Primary: Article body containers and paragraph tags
Fallback: Main content areas and text containers
Cleaning: Remove navigation, ads, and boilerplate text
Validation: Ensure sufficient content length
- Supported Categories:
ai: Artificial intelligence and machine learning articles
security: Security and cybersecurity articles
- Rate Limiting:
Respectful delays between requests
Connection reuse via HTTP session
Proper User-Agent identification
- Example Usage:
source = TechCrunchSource()

# Get latest AI article
article = source.get_latest_article("ai")

# Get multiple security articles
articles = source.get_multiple_articles("security", count=3)

# Check supported categories
if "ai" in source.supported_categories:
    ai_articles = source.get_multiple_articles("ai", count=5)
- class the_data_packet.sources.techcrunch.TechCrunchSource[source]¶
Bases:
ArticleSource
Article source for TechCrunch.com.
- RSS_FEEDS = {'ai': 'https://techcrunch.com/category/artificial-intelligence/feed/', 'security': 'https://techcrunch.com/category/security/feed/'}¶
- SKIP_PATTERNS = ['subscribe to techcrunch', 'most popular', 'related articles', 'advertisement', 'get techcrunch', 'sign up', 'newsletter', 'techcrunch+', 'techcrunch disrupt', 'more techcrunch', 'follow us', 'share this article']¶
Base Classes¶
Base classes for article sources.
This module defines the core data structures and interfaces for collecting articles from various news sources. It provides a standardized way to represent articles and implement source-specific collection logic.
The design supports:

- Multiple news sources with different scraping strategies
- Consistent article representation across sources
- Validation of article content quality
- Category-based article filtering
- Extensible source implementation
- Architecture:
Article: Data class representing a single news article
ArticleSource: Abstract base class for implementing news sources
- Current Sources:
WiredSource: Wired.com articles via RSS feeds
TechCrunchSource: TechCrunch.com articles via RSS feeds
- Future Sources (extensible):
ArsTechnicaSource
HackerNewsSource
- class the_data_packet.sources.base.Article(title: str, content: str, author: str | None = None, url: str | None = None, category: str | None = None, source: str | None = None)[source]¶
Bases:
object
Represents a single news article from any source.
This data class provides a standardized representation of news articles regardless of their source. It includes validation methods to ensure articles have sufficient content for podcast generation.
- Content Requirements:
Title must be non-empty
Content must be at least 100 characters after stripping whitespace
Content should be clean text without HTML tags or navigation elements
Example
article = Article(
    title="New Security Vulnerability Discovered",
    content="A critical security flaw has been found…",
    author="Jane Smith",
    url="https://example.com/article",
    category="security",
    source="wired",
)

if article.is_valid():
    # Process article for podcast generation
    pass
- is_valid() bool[source]¶
Check if article has sufficient content for podcast generation.
Validates that the article has:

- Non-empty title
- Non-empty content
- Content length of at least 100 characters (after stripping)
- Returns:
True if article meets minimum content requirements
Example
if not article.is_valid():
    logger.warning(f"Skipping invalid article: {article.title}")
    continue
- class the_data_packet.sources.base.ArticleSource[source]¶
Bases:
ABC
Abstract base class for implementing news article sources.
This class defines the interface that all article sources must implement. It provides a consistent way to collect articles from different news websites while handling source-specific details in subclasses.
Each source implementation should:

- Define supported categories
- Implement RSS feed or web scraping logic
- Handle rate limiting and error recovery
- Clean and validate article content
- Return standardized Article objects
- Subclasses must implement:
name: Property returning source identifier
supported_categories: Property returning list of valid categories
get_latest_article(): Method to get single latest article
get_multiple_articles(): Method to get multiple articles
- Example Implementation:
class ExampleSource(ArticleSource):
    @property
    def name(self) -> str:
        return "example"

    @property
    def supported_categories(self) -> List[str]:
        return ["tech", "science"]

    def get_latest_article(self, category: str) -> Article:
        # Implementation-specific logic
        ...

    def get_multiple_articles(self, category: str, count: int) -> List[Article]:
        # Implementation-specific logic
        ...
- Usage:
source = WiredSource()
if "security" in source.supported_categories:
    article = source.get_latest_article("security")
    articles = source.get_multiple_articles("security", count=5)
- abstract property name: str¶
Source name identifier.
Returns a unique string identifier for this source. Used in configuration, logging, and file naming.
- Returns:
Source identifier (e.g., “wired”, “techcrunch”)
- abstract property supported_categories: List[str]¶
List of supported article categories for this source.
Returns the categories this source can collect articles from. Categories should match the source’s RSS feeds or section structure.
- Returns:
List of category strings (e.g., [“security”, “guide”, “business”])
- abstractmethod get_latest_article(category: str) Article[source]¶
Get the latest article from a specific category.
- Parameters:
category – Category to fetch from (must be in supported_categories)
- Returns:
Latest Article instance from the category
- Raises:
ScrapingError – If article collection fails
ValidationError – If category is not supported
NetworkError – If network request fails
Example
try:
    article = source.get_latest_article("security")
    logger.info(f"Retrieved: {article.title}")
except ValidationError:
    logger.error("Category 'invalid' not supported")
- abstractmethod get_multiple_articles(category: str, count: int) List[Article][source]¶
Get multiple articles from a specific category.
- Parameters:
category – Category to fetch from (must be in supported_categories)
count – Maximum number of articles to return
- Returns:
List of Article instances (may be fewer than count if unavailable)
- Raises:
ScrapingError – If article collection fails
ValidationError – If category is not supported or count is invalid
NetworkError – If network request fails
Example
articles = source.get_multiple_articles("guide", count=3)
valid_articles = [a for a in articles if a.is_valid()]
- validate_category(category: str) None[source]¶
Validate if a category is supported by this source.
- Parameters:
category – Category to validate
- Raises:
ValidationError – If category is not supported
Example
source.validate_category("security")  # OK
source.validate_category("invalid")   # Raises ValidationError
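The check described above can be sketched as a free function. Both the ValidationError stand-in and the error message wording below are assumptions; only the contract (raise on unsupported category, return None otherwise) comes from this documentation:

```python
from typing import List


class ValidationError(Exception):
    """Stand-in for the package's ValidationError exception."""


def validate_category(category: str, supported: List[str]) -> None:
    """Raise ValidationError if `category` is not in `supported`."""
    if category not in supported:
        raise ValidationError(
            f"Category '{category}' not supported; expected one of {supported}"
        )


validate_category("security", ["security", "ai"])  # passes silently
try:
    validate_category("sports", ["security", "ai"])
    message = ""
except ValidationError as exc:
    message = str(exc)
```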