Sources Module - Article Extraction

The sources module handles article collection and content extraction from various news websites. Content extraction is integrated directly into source classes.

Wired Source

Wired.com article source implementation.

This module implements article collection from Wired.com using RSS feeds and web scraping. Wired.com provides RSS feeds for different categories that contain recent article URLs, which are then scraped for full content.

Features:
  • RSS feed-based article discovery

  • Multiple category support (security, guides, business, science, AI)

  • Robust content extraction with fallback methods

  • Content cleaning and validation

  • Error handling for network issues and malformed content

RSS Feed Strategy:
  1. Fetch category-specific RSS feed

  2. Parse feed to extract article URLs

  3. Scrape individual articles for full content

  4. Clean and validate extracted content

  5. Return standardized Article objects
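
Steps 1 and 2 above can be illustrated with a minimal sketch. It assumes a standard RSS 2.0 layout with one link per item. The sample feed, the function name extract_article_urls, and the limit parameter are hypothetical and are not taken from the module; a real feed would first be downloaded from one of the RSS_FEEDS URLs.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical sample feed used in place of a network request.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First story</title>
      <link>https://example.com/story/first/</link>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/story/second/</link>
    </item>
  </channel>
</rss>
"""


def extract_article_urls(feed_xml: str, limit: int = 10) -> List[str]:
    """Return up to `limit` article URLs from an RSS 2.0 document, in feed order."""
    root = ET.fromstring(feed_xml)
    urls: List[str] = []
    for item in root.iter("item"):
        if len(urls) >= limit:
            break
        link = item.findtext("link")
        # Skip items that have no usable link element.
        if link and link.strip():
            urls.append(link.strip())
    return urls


urls = extract_article_urls(SAMPLE_FEED, limit=5)
print(urls)
```

Each returned URL would then be passed to the scraping step.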

Content Extraction:
  • Primary: Article body containers and paragraph tags

  • Fallback: Main content areas and text containers

  • Cleaning: Remove navigation, ads, and boilerplate text

  • Validation: Ensure sufficient content length
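
The primary and fallback strategy might look roughly like the sketch below. It uses only the standard library's html.parser and treats paragraphs inside an article tag as the primary source, with all paragraphs on the page as the fallback. The class and function names, the choice of the article tag, and the ValueError are assumptions made for illustration; the real selectors used by WiredSource are not documented here.

```python
from html.parser import HTMLParser
from typing import List

MIN_CONTENT_LENGTH = 100  # assumed to mirror the 100-character rule of Article.is_valid()


class ParagraphCollector(HTMLParser):
    """Collect <p> text and remember which paragraphs sit inside <article>."""

    def __init__(self) -> None:
        super().__init__()
        self.article_depth = 0
        self.in_paragraph = False
        self.buffer: List[str] = []
        self.article_paragraphs: List[str] = []
        self.all_paragraphs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag == "p":
            self.in_paragraph = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth > 0:
            self.article_depth -= 1
        elif tag == "p" and self.in_paragraph:
            # Collapse runs of whitespace inside the paragraph.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.all_paragraphs.append(text)
                if self.article_depth > 0:
                    self.article_paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.buffer.append(data)


def extract_content(html: str) -> str:
    """Prefer paragraphs inside <article>; fall back to every paragraph on the page."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    paragraphs = collector.article_paragraphs or collector.all_paragraphs
    content = "\n\n".join(paragraphs)
    if len(content.strip()) < MIN_CONTENT_LENGTH:
        raise ValueError("extracted content is too short")
    return content


page = (
    "<html><body><nav><p>Home | About</p></nav>"
    "<article><p>" + "Full article sentence. " * 6 + "</p></article>"
    "</body></html>"
)
content = extract_content(page)
```

Here the navigation paragraph is ignored because the primary strategy found text inside the article container.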

Supported Categories:
  • security: Security and cybersecurity articles

  • guide: How-to guides and tutorials

  • business: Business and industry news

  • science: Science and technology research

  • ai: Artificial intelligence and machine learning

Rate Limiting:
  • Respectful delays between requests

  • Connection reuse via HTTP session

  • Proper User-Agent identification
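
A minimal sketch of the delay logic, assuming a fixed minimum interval between requests. The class name RequestThrottle, the one-second default, and the User-Agent string are hypothetical; the actual delay and header values are not stated in this documentation. Connection reuse would typically come from an HTTP session object such as requests.Session, which is left out so the sketch stays standard-library only.

```python
import time
from typing import Callable, Optional


class RequestThrottle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(
        self,
        min_interval: float = 1.0,  # hypothetical default, in seconds
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request: Optional[float] = None

    def wait(self) -> float:
        """Sleep if the previous request was too recent; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_request is not None:
            elapsed = now - self._last_request
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        # Record when the request is allowed to go out.
        self._last_request = now + delay
        return delay


# Hypothetical identification header; the real value is not documented here.
DEFAULT_HEADERS = {"User-Agent": "example-article-collector/0.1"}

throttle = RequestThrottle(min_interval=0.05)
throttle.wait()          # first call returns immediately
slept = throttle.wait()  # second call pauses for the rest of the interval
```

The clock and sleep parameters are injectable so the behaviour can be checked without real waiting.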

Example Usage:

    source = WiredSource()

    # Get latest security article
    article = source.get_latest_article("security")

    # Get multiple guide articles
    articles = source.get_multiple_articles("guide", count=3)

    # Check supported categories
    if "ai" in source.supported_categories:
        ai_articles = source.get_multiple_articles("ai", count=5)

class the_data_packet.sources.wired.WiredSource

Bases: ArticleSource

Article source for Wired.com.

__init__() -> None

Initialize Wired source.

RSS_FEEDS = {'ai': 'https://www.wired.com/feed/tag/ai/latest/rss', 'science': 'https://www.wired.com/feed/category/science/latest/rss', 'security': 'https://www.wired.com/feed/category/security/latest/rss'}
SKIP_PATTERNS = ['subscribe to wired', 'most popular', 'related stories', 'advertisement', 'get wired', 'sign up', 'newsletter']
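
One plausible way such patterns could be applied is a case-insensitive substring check per paragraph, sketched below. The list is copied from SKIP_PATTERNS above; the function name drop_boilerplate and the substring-matching rule are assumptions, since the documentation does not describe how the patterns are matched.

```python
from typing import Iterable, List

# Copied from WiredSource.SKIP_PATTERNS above.
SKIP_PATTERNS = [
    "subscribe to wired", "most popular", "related stories",
    "advertisement", "get wired", "sign up", "newsletter",
]


def drop_boilerplate(
    paragraphs: Iterable[str], patterns: List[str] = SKIP_PATTERNS
) -> List[str]:
    """Keep paragraphs that contain none of the patterns (case-insensitive)."""
    kept = []
    for paragraph in paragraphs:
        lowered = paragraph.lower()
        # Assumption: any substring hit marks the whole paragraph as boilerplate.
        if any(pattern in lowered for pattern in patterns):
            continue
        kept.append(paragraph)
    return kept


paragraphs = [
    "Researchers disclosed a flaw in a widely used router.",
    "Sign up for our daily newsletter",
    "ADVERTISEMENT",
    "A patch is expected next week.",
]
cleaned = drop_boilerplate(paragraphs)
print(cleaned)
```
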
property name: str

Source name identifier.

property supported_categories: List[str]

List of supported categories.

get_latest_article(category: str) -> Article

Get the latest article from a category.

get_multiple_articles(category: str, count: int) -> List[Article]

Get multiple articles from a category.

TechCrunch Source

TechCrunch.com article source implementation.

This module implements article collection from TechCrunch.com using RSS feeds and web scraping. TechCrunch provides RSS feeds for different categories that contain recent article URLs, which are then scraped for full content.

Features:
  • RSS feed-based article discovery

  • Multiple category support (ai, security)

  • Robust content extraction with fallback methods

  • Content cleaning and validation

  • Error handling for network issues and malformed content

RSS Feed Strategy:
  1. Fetch category-specific RSS feed

  2. Parse feed to extract article URLs

  3. Scrape individual articles for full content

  4. Clean and validate extracted content

  5. Return standardized Article objects

Content Extraction:
  • Primary: Article body containers and paragraph tags

  • Fallback: Main content areas and text containers

  • Cleaning: Remove navigation, ads, and boilerplate text

  • Validation: Ensure sufficient content length

Supported Categories:
  • ai: Artificial intelligence and machine learning articles

  • security: Security and cybersecurity articles

Rate Limiting:
  • Respectful delays between requests

  • Connection reuse via HTTP session

  • Proper User-Agent identification

Example Usage:

    source = TechCrunchSource()

    # Get latest AI article
    article = source.get_latest_article("ai")

    # Get multiple security articles
    articles = source.get_multiple_articles("security", count=3)

    # Check supported categories
    if "ai" in source.supported_categories:
        ai_articles = source.get_multiple_articles("ai", count=5)

class the_data_packet.sources.techcrunch.TechCrunchSource

Bases: ArticleSource

Article source for TechCrunch.com.

__init__() -> None

Initialize TechCrunch source.

RSS_FEEDS = {'ai': 'https://techcrunch.com/category/artificial-intelligence/feed/', 'security': 'https://techcrunch.com/category/security/feed/'}
SKIP_PATTERNS = ['subscribe to techcrunch', 'most popular', 'related articles', 'advertisement', 'get techcrunch', 'sign up', 'newsletter', 'techcrunch+', 'techcrunch disrupt', 'more techcrunch', 'follow us', 'share this article']
property name: str

Source name identifier.

property supported_categories: List[str]

List of supported categories.
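
The category list and the feed mapping are closely related. The sketch below assumes the supported categories are exactly the keys of RSS_FEEDS, which this page does not confirm. The mapping is copied from TechCrunchSource.RSS_FEEDS above; the helper names and the use of ValueError are hypothetical.

```python
from typing import Dict, List

# Copied from TechCrunchSource.RSS_FEEDS above.
RSS_FEEDS: Dict[str, str] = {
    "ai": "https://techcrunch.com/category/artificial-intelligence/feed/",
    "security": "https://techcrunch.com/category/security/feed/",
}


def supported_categories(feeds: Dict[str, str] = RSS_FEEDS) -> List[str]:
    """Assumption: the supported categories are exactly the feed keys."""
    return list(feeds)


def feed_url_for(category: str, feeds: Dict[str, str] = RSS_FEEDS) -> str:
    """Look up the feed URL, raising ValueError for an unknown category."""
    try:
        return feeds[category]
    except KeyError:
        raise ValueError(
            f"Unsupported category {category!r}; expected one of {sorted(feeds)}"
        ) from None


print(supported_categories())
print(feed_url_for("ai"))
```

Note that the category key "ai" maps to the "artificial-intelligence" path segment in the feed URL.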

get_latest_article(category: str) -> Article

Get the latest article from a category.

get_multiple_articles(category: str, count: int) -> List[Article]

Get multiple articles from a category.

Base Classes

Base classes for article sources.

This module defines the core data structures and interfaces for collecting articles from various news sources. It provides a standardized way to represent articles and implement source-specific collection logic.

The design supports:
  • Multiple news sources with different scraping strategies

  • Consistent article representation across sources

  • Validation of article content quality

  • Category-based article filtering

  • Extensible source implementation

Architecture:

  • Article: Data class representing a single news article

  • ArticleSource: Abstract base class for implementing news sources

Current Sources:
  • WiredSource: Wired.com articles via RSS feeds

  • TechCrunchSource: TechCrunch.com articles via RSS feeds

Future Sources (extensible):
  • ArsTechnicaSource

  • HackerNewsSource

class the_data_packet.sources.base.Article(title: str, content: str, author: str | None = None, url: str | None = None, category: str | None = None, source: str | None = None)

Bases: object

Represents a single news article from any source.

This data class provides a standardized representation of news articles regardless of their source. It includes validation methods to ensure articles have sufficient content for podcast generation.

title

Article headline/title. Required for all articles.

Type:

str

content

Full article text content. Required and must be substantial.

Type:

str

author

Article author name. Optional but recommended.

Type:

str | None

url

Original article URL. Optional but useful for debugging.

Type:

str | None

category

Article category (e.g., ‘security’, ‘guide’). Optional.

Type:

str | None

source

Source identifier (e.g., ‘wired’). Optional but recommended.

Type:

str | None

Content Requirements:
  • Title must be non-empty

  • Content must be at least 100 characters after stripping whitespace

  • Content should be clean text without HTML tags or navigation elements

Example

    article = Article(
        title="New Security Vulnerability Discovered",
        content="A critical security flaw has been found…",
        author="Jane Smith",
        url="https://example.com/article",
        category="security",
        source="wired",
    )

    if article.is_valid():
        # Process article for podcast generation
        pass

title: str
content: str
author: str | None = None
url: str | None = None
category: str | None = None
source: str | None = None

is_valid() -> bool

Check if article has sufficient content for podcast generation.

Validates that the article has:
  • Non-empty title

  • Non-empty content

  • Content length of at least 100 characters (after stripping)

Returns:

True if article meets minimum content requirements
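
The rule described above might be sketched as follows. ArticleSketch is a hypothetical stand-in with a reduced field set, not the real Article class, and treating a whitespace-only title as empty is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

MIN_CONTENT_LENGTH = 100  # the documented minimum, counted after stripping


@dataclass
class ArticleSketch:
    """Stand-in for Article with only the fields the check needs."""

    title: str
    content: str
    author: Optional[str] = None

    def is_valid(self) -> bool:
        # Assumption: a whitespace-only title counts as empty.
        if not self.title.strip():
            return False
        return len(self.content.strip()) >= MIN_CONTENT_LENGTH


short = ArticleSketch(title="Stub", content="Too short.")
padded = ArticleSketch(title="Padded", content=" " * 200)
full = ArticleSketch(title="Full", content="x" * 100)
print(short.is_valid(), padded.is_valid(), full.is_valid())  # False False True
```

The padded case shows why the length is measured after stripping: 200 spaces do not count as content.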

Example

    for article in articles:
        if not article.is_valid():
            logger.warning(f"Skipping invalid article: {article.title}")
            continue

to_dict() -> Dict[str, str | None]

Convert article to dictionary representation.

Returns:

Dictionary with all article fields
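
As a rough illustration, a dataclass with the same six fields can produce such a dictionary with dataclasses.asdict. ArticleSketch is a hypothetical stand-in, and whether the real to_dict() uses asdict is not documented here.

```python
import io
import json
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass
class ArticleSketch:
    """Stand-in with the same six fields as Article."""

    title: str
    content: str
    author: Optional[str] = None
    url: Optional[str] = None
    category: Optional[str] = None
    source: Optional[str] = None

    def to_dict(self) -> Dict[str, Optional[str]]:
        # Assumption: a flat field-name -> value mapping; unset fields stay None.
        return asdict(self)


article = ArticleSketch(title="Example", content="Body text", source="wired")
buffer = io.StringIO()  # stands in for an open file
json.dump(article.to_dict(), buffer)
restored = json.loads(buffer.getvalue())
```

Unset optional fields serialize as JSON null and come back as None.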

Example

    article_data = article.to_dict()
    json.dump(article_data, file)

__init__(title: str, content: str, author: str | None = None, url: str | None = None, category: str | None = None, source: str | None = None) -> None

class the_data_packet.sources.base.ArticleSource

Bases: ABC

Abstract base class for implementing news article sources.

This class defines the interface that all article sources must implement. It provides a consistent way to collect articles from different news websites while handling source-specific details in subclasses.

Each source implementation should:
  • Define supported categories

  • Implement RSS feed or web scraping logic

  • Handle rate limiting and error recovery

  • Clean and validate article content

  • Return standardized Article objects

Subclasses must implement:

  • name: Property returning source identifier

  • supported_categories: Property returning list of valid categories

  • get_latest_article(): Method to get single latest article

  • get_multiple_articles(): Method to get multiple articles

Example Implementation:

    class ExampleSource(ArticleSource):

        @property
        def name(self) -> str:
            return "example"

        @property
        def supported_categories(self) -> List[str]:
            return ["tech", "science"]

        def get_latest_article(self, category: str) -> Article:
            # Implementation-specific logic
            pass

        def get_multiple_articles(self, category: str, count: int) -> List[Article]:
            # Implementation-specific logic
            pass

Usage:

    source = WiredSource()
    if "security" in source.supported_categories:
        article = source.get_latest_article("security")
        articles = source.get_multiple_articles("security", count=5)

abstract property name: str

Source name identifier.

Returns a unique string identifier for this source. Used in configuration, logging, and file naming.

Returns:

Source identifier (e.g., “wired”, “techcrunch”)

abstract property supported_categories: List[str]

List of supported article categories for this source.

Returns the categories this source can collect articles from. Categories should match the source’s RSS feeds or section structure.

Returns:

List of category strings (e.g., [“security”, “guide”, “business”])

abstractmethod get_latest_article(category: str) -> Article

Get the latest article from a specific category.

Parameters:

category – Category to fetch from (must be in supported_categories)

Returns:

Latest Article instance from the category

Raises:

ValidationError – If category is not supported

Example

    category = "security"
    try:
        article = source.get_latest_article(category)
        logger.info(f"Retrieved: {article.title}")
    except ValidationError:
        logger.error(f"Category '{category}' not supported")

abstractmethod get_multiple_articles(category: str, count: int) -> List[Article]

Get multiple articles from a specific category.

Parameters:
  • category – Category to fetch from (must be in supported_categories)

  • count – Maximum number of articles to return

Returns:

List of Article instances (may be fewer than count if unavailable)
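
The "may be fewer than count" behaviour can be illustrated with a small collection loop. This is a sketch under assumptions: collect_up_to, the fetch callable, and the convention that a failed scrape yields None are all hypothetical, and plain strings stand in for Article objects.

```python
from typing import Callable, Dict, Iterable, List, Optional


def collect_up_to(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[str]],
    count: int,
) -> List[str]:
    """Return at most `count` successfully fetched items, skipping failures."""
    results: List[str] = []
    for url in urls:
        if len(results) >= count:
            break
        item = fetch(url)
        # A None result models an article that could not be scraped or validated.
        if item is not None:
            results.append(item)
    return results


FAKE_PAGES: Dict[str, Optional[str]] = {
    "https://example.com/a": "Article A",
    "https://example.com/b": None,  # simulates a failed or invalid scrape
    "https://example.com/c": "Article C",
}
articles = collect_up_to(FAKE_PAGES, FAKE_PAGES.get, count=5)
print(articles)  # ['Article A', 'Article C']
```

Five articles were requested, but only two are returned because the feed held three URLs and one of them failed.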

Raises:

Example

    articles = source.get_multiple_articles("guide", count=3)
    valid_articles = [a for a in articles if a.is_valid()]

validate_category(category: str) -> None

Validate if a category is supported by this source.

Parameters:

category – Category to validate

Raises:

ValidationError – If category is not supported
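
A minimal sketch of such a check, assuming it is a simple membership test against supported_categories. The ValidationError class defined here is a stand-in for the package's own exception, and CategoryValidatorSketch and the message wording are hypothetical.

```python
from typing import List


class ValidationError(Exception):
    """Hypothetical stand-in; the package's own ValidationError is not shown here."""


class CategoryValidatorSketch:
    """Illustrates a membership check against supported_categories."""

    def __init__(self, name: str, supported_categories: List[str]) -> None:
        self.name = name
        self.supported_categories = supported_categories

    def validate_category(self, category: str) -> None:
        # Returns None on success and raises on an unknown category.
        if category not in self.supported_categories:
            raise ValidationError(
                f"Category {category!r} is not supported by {self.name}; "
                f"choose from {self.supported_categories}"
            )


sketch = CategoryValidatorSketch("example", ["security", "ai"])
sketch.validate_category("security")  # passes silently

try:
    sketch.validate_category("invalid")
except ValidationError as error:
    message = str(error)
    print(message)
```

Calling this check at the start of get_latest_article() and get_multiple_articles() would keep the error behaviour consistent across sources.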

Example

    source.validate_category("security")  # OK
    source.validate_category("invalid")   # Raises ValidationError