Sources Module - Article Extraction¶
The sources module handles article collection and content extraction from various news websites. Content extraction is integrated directly into source classes.
Wired Source¶
Wired.com article source implementation.
This module implements article collection from Wired.com using RSS feeds and web scraping. Wired.com provides RSS feeds for different categories that contain recent article URLs, which are then scraped for full content.
- Features:
RSS feed-based article discovery
Multiple category support (security, guides, business, science, AI)
Robust content extraction with fallback methods
Content cleaning and validation
Error handling for network issues and malformed content
- RSS Feed Strategy:
Fetch category-specific RSS feed
Parse feed to extract article URLs
Scrape individual articles for full content
Clean and validate extracted content
Return standardized Article objects
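The feed-parsing step of this strategy can be sketched roughly as follows. This is a simplified illustration using the standard library's xml.etree (the actual feed-parsing library used by the package is not shown in this documentation), with a hypothetical sample feed:

```python
import xml.etree.ElementTree as ET
from typing import List


def extract_article_urls(rss_xml: str, limit: int = 5) -> List[str]:
    """Parse an RSS 2.0 document and return up to `limit` article URLs."""
    root = ET.fromstring(rss_xml)
    # RSS 2.0 nests <item> elements under <channel>; each has a <link>.
    links = [item.findtext("link") for item in root.iter("item")]
    return [url for url in links if url][:limit]


sample_feed = """<rss version="2.0"><channel>
  <title>Security</title>
  <item><link>https://www.wired.com/story/example-one/</link></item>
  <item><link>https://www.wired.com/story/example-two/</link></item>
</channel></rss>"""

urls = extract_article_urls(sample_feed)
```

Each returned URL would then be fetched and scraped individually, as the remaining steps describe.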
- Content Extraction:
Primary: Article body containers and paragraph tags
Fallback: Main content areas and text containers
Cleaning: Remove navigation, ads, and boilerplate text
Validation: Ensure sufficient content length
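The primary/fallback/validate flow above can be sketched with the standard library's HTML parser. The real sources may well use a dedicated HTML library and different container selectors; this is only an illustration of the cascade, not the package's actual extraction code:

```python
from html.parser import HTMLParser
from typing import List, Optional


class ParagraphExtractor(HTMLParser):
    """Collect text from <p> tags inside a given container tag."""

    def __init__(self, container: str):
        super().__init__()
        self.container = container
        self.depth = 0          # nesting level inside the container tag
        self.in_p = False
        self.paragraphs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == self.container:
            self.depth += 1
        elif tag == "p" and self.depth:
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self.container and self.depth:
            self.depth -= 1
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data


def extract_content(html: str, min_length: int = 100) -> Optional[str]:
    # Primary: <article> body; fallbacks: <main>, then <body>.
    for container in ("article", "main", "body"):
        parser = ParagraphExtractor(container)
        parser.feed(html)
        text = "\n\n".join(p.strip() for p in parser.paragraphs if p.strip())
        if len(text) >= min_length:  # validation: sufficient content length
            return text
    return None
```

Returning None on failure lets the caller fall back to the next candidate article rather than producing an empty one.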
- Supported Categories:
security: Security and cybersecurity articles
guide: How-to guides and tutorials
business: Business and industry news
science: Science and technology research
ai: Artificial intelligence and machine learning
- Rate Limiting:
Respectful delays between requests
Connection reuse via HTTP session
Proper User-Agent identification
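The delay behavior can be sketched as a small throttle object. The actual interval and User-Agent string used by the package are not documented here, so both values below are placeholders:

```python
import time

# Hypothetical User-Agent; the real identification string is not shown here.
USER_AGENT = "the-data-packet/1.0 (article collector)"


class RequestThrottle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self) -> None:
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()


throttle = RequestThrottle(min_interval=0.1)
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # first call is free; later calls pace themselves
    # session.get(url, headers={"User-Agent": USER_AGENT}) would go here
elapsed = time.monotonic() - start
```

Reusing a single HTTP session across calls then gives connection reuse on top of the pacing.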
- Example Usage:
source = WiredSource()

# Get latest security article
article = source.get_latest_article("security")

# Get multiple guide articles
articles = source.get_multiple_articles("guide", count=3)

# Check supported categories
if "ai" in source.supported_categories:
    ai_articles = source.get_multiple_articles("ai", count=5)
- class the_data_packet.sources.wired.WiredSource[source]¶
Bases:
ArticleSource
Article source for Wired.com.
- RSS_FEEDS = {'ai': 'https://www.wired.com/feed/tag/ai/latest/rss', 'science': 'https://www.wired.com/feed/category/science/latest/rss', 'security': 'https://www.wired.com/feed/category/security/latest/rss'}¶
- SKIP_PATTERNS = ['subscribe to wired', 'most popular', 'related stories', 'advertisement', 'get wired', 'sign up', 'newsletter']¶
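How the class applies SKIP_PATTERNS is not shown in this documentation; one plausible use, sketched below, is case-insensitive substring filtering of extracted paragraphs (the pattern list is copied from the class attribute above, the sample paragraphs are invented):

```python
from typing import List

SKIP_PATTERNS = ['subscribe to wired', 'most popular', 'related stories',
                 'advertisement', 'get wired', 'sign up', 'newsletter']


def filter_boilerplate(paragraphs: List[str]) -> List[str]:
    """Drop paragraphs whose text contains a known boilerplate pattern."""
    return [
        p for p in paragraphs
        if not any(pattern in p.lower() for pattern in SKIP_PATTERNS)
    ]


paragraphs = [
    "A critical flaw was disclosed this week.",
    "Subscribe to WIRED for unlimited access.",
    "Sign up for our newsletter.",
    "Researchers traced the bug to a parser.",
]
cleaned = filter_boilerplate(paragraphs)
```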
TechCrunch Source¶
TechCrunch.com article source implementation.
This module implements article collection from TechCrunch.com using RSS feeds and web scraping. TechCrunch provides RSS feeds for different categories that contain recent article URLs, which are then scraped for full content.
- Features:
RSS feed-based article discovery
Multiple category support (artificial-intelligence, security)
Robust content extraction with fallback methods
Content cleaning and validation
Error handling for network issues and malformed content
- RSS Feed Strategy:
Fetch category-specific RSS feed
Parse feed to extract article URLs
Scrape individual articles for full content
Clean and validate extracted content
Return standardized Article objects
- Content Extraction:
Primary: Article body containers and paragraph tags
Fallback: Main content areas and text containers
Cleaning: Remove navigation, ads, and boilerplate text
Validation: Ensure sufficient content length
- Supported Categories:
ai: Artificial intelligence and machine learning articles
security: Security and cybersecurity articles
- Rate Limiting:
Respectful delays between requests
Connection reuse via HTTP session
Proper User-Agent identification
- Example Usage:
source = TechCrunchSource()

# Get latest AI article
article = source.get_latest_article("ai")

# Get multiple security articles
articles = source.get_multiple_articles("security", count=3)

# Check supported categories
if "ai" in source.supported_categories:
    ai_articles = source.get_multiple_articles("ai", count=5)
- class the_data_packet.sources.techcrunch.TechCrunchSource[source]¶
Bases:
ArticleSource
Article source for TechCrunch.com.
- RSS_FEEDS = {'ai': 'https://techcrunch.com/category/artificial-intelligence/feed/', 'security': 'https://techcrunch.com/category/security/feed/'}¶
- SKIP_PATTERNS = ['subscribe to techcrunch', 'most popular', 'related articles', 'advertisement', 'get techcrunch', 'sign up', 'newsletter', 'techcrunch+', 'techcrunch disrupt', 'more techcrunch', 'follow us', 'share this article']¶
Base Classes¶
Base classes for article sources.
This module defines the core data structures and interfaces for collecting articles from various news sources. It provides a standardized way to represent articles and implement source-specific collection logic.
The design supports:

- Multiple news sources with different scraping strategies
- Consistent article representation across sources
- Validation of article content quality
- Category-based article filtering
- Extensible source implementation
- Architecture:
Article: Data class representing a single news article
ArticleSource: Abstract base class for implementing news sources
- Current Sources:
WiredSource: Wired.com articles via RSS feeds
TechCrunchSource: TechCrunch.com articles via RSS feeds
- Future Sources (extensible):
ArsTechnicaSource
HackerNewsSource
- class the_data_packet.sources.base.Article(title: str, content: str, author: str | None = None, url: str | None = None, category: str | None = None, source: str | None = None)[source]¶
Bases:
object
Represents a single news article from any source.
This data class provides a standardized representation of news articles regardless of their source. It includes validation methods to ensure articles have sufficient content for podcast generation.
- Content Requirements:
Title must be non-empty
Content must be at least 100 characters after stripping whitespace
Content should be clean text without HTML tags or navigation elements
Example
article = Article(
    title="New Security Vulnerability Discovered",
    content="A critical security flaw has been found…",
    author="Jane Smith",
    url="https://example.com/article",
    category="security",
    source="wired",
)

if article.is_valid():
    # Process article for podcast generation
    pass
- is_valid() bool[source]¶
Check if article has sufficient content for podcast generation.
Validates that the article has:

- Non-empty title
- Non-empty content
- Content length of at least 100 characters (after stripping)
- Returns:
True if article meets minimum content requirements
Example
if not article.is_valid():
    logger.warning(f"Skipping invalid article: {article.title}")
    continue
- class the_data_packet.sources.base.ArticleSource[source]¶
Bases:
ABC
Abstract base class for implementing news article sources.
This class defines the interface that all article sources must implement. It provides a consistent way to collect articles from different news websites while handling source-specific details in subclasses.
Each source implementation should:

- Define supported categories
- Implement RSS feed or web scraping logic
- Handle rate limiting and error recovery
- Clean and validate article content
- Return standardized Article objects
- Subclasses must implement:
name: Property returning source identifier
supported_categories: Property returning list of valid categories
get_latest_article(): Method to get single latest article
get_multiple_articles(): Method to get multiple articles
- Example Implementation:
class ExampleSource(ArticleSource):
    @property
    def name(self) -> str:
        return "example"

    @property
    def supported_categories(self) -> List[str]:
        return ["tech", "science"]

    def get_latest_article(self, category: str) -> Article:
        # Implementation-specific logic
        ...

    def get_multiple_articles(self, category: str, count: int) -> List[Article]:
        # Implementation-specific logic
        ...
- Usage:
source = WiredSource()
if "security" in source.supported_categories:
    article = source.get_latest_article("security")
    articles = source.get_multiple_articles("security", count=5)
- abstract property name: str¶
Source name identifier.
Returns a unique string identifier for this source. Used in configuration, logging, and file naming.
- Returns:
Source identifier (e.g., “wired”, “techcrunch”)
- abstract property supported_categories: List[str]¶
List of supported article categories for this source.
Returns the categories this source can collect articles from. Categories should match the source’s RSS feeds or section structure.
- Returns:
List of category strings (e.g., [“security”, “guide”, “business”])
- abstractmethod get_latest_article(category: str) Article[source]¶
Get the latest article from a specific category.
- Parameters:
category – Category to fetch from (must be in supported_categories)
- Returns:
Latest Article instance from the category
- Raises:
ScrapingError – If article collection fails
ValidationError – If category is not supported
NetworkError – If network request fails
Example
try:
    article = source.get_latest_article("security")
    logger.info(f"Retrieved: {article.title}")
except ValidationError:
    logger.error("Category 'invalid' not supported")
- abstractmethod get_multiple_articles(category: str, count: int) List[Article][source]¶
Get multiple articles from a specific category.
- Parameters:
category – Category to fetch from (must be in supported_categories)
count – Maximum number of articles to return
- Returns:
List of Article instances (may be fewer than count if unavailable)
- Raises:
ScrapingError – If article collection fails
ValidationError – If category is not supported or count is invalid
NetworkError – If network request fails
Example
articles = source.get_multiple_articles("guide", count=3)
valid_articles = [a for a in articles if a.is_valid()]
- validate_category(category: str) None[source]¶
Validate if a category is supported by this source.
- Parameters:
category – Category to validate
- Raises:
ValidationError – If category is not supported
Example
source.validate_category("security")  # OK
source.validate_category("invalid")   # Raises ValidationError
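The check described above can be sketched as a free function. Both the ValidationError stand-in and the error message wording below are assumptions; only the contract (raise on unsupported category, return None otherwise) comes from this documentation:

```python
from typing import List


class ValidationError(Exception):
    """Stand-in for the package's ValidationError exception."""


def validate_category(category: str, supported: List[str]) -> None:
    """Raise ValidationError if `category` is not in `supported`."""
    if category not in supported:
        raise ValidationError(
            f"Category '{category}' not supported; expected one of {supported}"
        )


validate_category("security", ["security", "ai"])  # passes silently
try:
    validate_category("sports", ["security", "ai"])
    message = ""
except ValidationError as exc:
    message = str(exc)
```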