Base

the_data_packet.sources.base

Base classes for article sources.

This module defines the core data structures and interfaces for collecting articles from various news sources. It provides a standardized way to represent articles and implement source-specific collection logic.

The design supports:
  • Multiple news sources with different scraping strategies
  • Consistent article representation across sources
  • Validation of article content quality
  • Category-based article filtering
  • Extensible source implementation

Architecture

  • Article: Data class representing a single news article
  • ArticleSource: Abstract base class for implementing news sources

Current Sources
  • WiredSource: Wired.com articles via RSS feeds
  • TechCrunchSource: TechCrunch.com articles via RSS feeds

Future Sources (extensible):
  • ArsTechnicaSource
  • HackerNewsSource

Article dataclass

Represents a single news article from any source.

This data class provides a standardized representation of news articles regardless of their source. It includes validation methods to ensure articles have sufficient content for podcast generation.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| title | str | Article headline/title. Required for all articles. |
| content | str | Full article text. Required; must be at least 100 characters after stripping whitespace. |
| author | Optional[str] | Article author name. Optional but recommended. |
| url | Optional[str] | Original article URL. Optional but useful for debugging. |
| category | Optional[str] | Article category (e.g., 'security', 'guide'). Optional. |
| source | Optional[str] | Source identifier (e.g., 'wired'). Optional but recommended. |

Content Requirements
  • Title must be non-empty
  • Content must be at least 100 characters after stripping whitespace
  • Content should be clean text without HTML tags or navigation elements
Example

article = Article(
    title="New Security Vulnerability Discovered",
    content="A critical security flaw has been found...",
    author="Jane Smith",
    url="https://example.com/article",
    category="security",
    source="wired",
)

if article.is_valid():
    # Process article for podcast generation
    pass
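The content requirements above ask for clean text without HTML tags. How the package cleans scraped markup is not shown here, so the following is only a minimal sketch using the standard library; `strip_html` and `meets_length_requirement` are hypothetical helper names, not part of `the_data_packet`.

```python
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self._chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        self._chunks.append(data)

    def text(self) -> str:
        # Collapse runs of whitespace left behind by removed markup.
        return " ".join(" ".join(self._chunks).split())


def strip_html(raw: str) -> str:
    """Return `raw` with HTML tags removed and whitespace normalized."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return parser.text()


def meets_length_requirement(text: str, minimum: int = 100) -> bool:
    """Apply the documented rule: at least `minimum` characters after stripping."""
    return len(text.strip()) >= minimum
```

This sketch keeps the text inside `<script>` and `<style>` elements, and it does nothing about navigation elements, so a real source would likely need more targeted extraction.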

title: str instance-attribute

content: str instance-attribute

author: Optional[str] = None class-attribute instance-attribute

url: Optional[str] = None class-attribute instance-attribute

category: Optional[str] = None class-attribute instance-attribute

source: Optional[str] = None class-attribute instance-attribute

__init__(title: str, content: str, author: Optional[str] = None, url: Optional[str] = None, category: Optional[str] = None, source: Optional[str] = None) -> None

is_valid() -> bool

Check if article has sufficient content for podcast generation.

Validates that the article has:
  • Non-empty title
  • Non-empty content
  • Content length of at least 100 characters (after stripping)

Returns:

| Type | Description |
| --- | --- |
| bool | True if article meets minimum content requirements |

Example

for article in articles:
    if not article.is_valid():
        logger.warning(f"Skipping invalid article: {article.title}")
        continue
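The three checks listed for `is_valid()` can be illustrated with a stand-in dataclass. This is a sketch of the documented rules, not the package's actual source; `ArticleSketch` and `MIN_CONTENT_LENGTH` are hypothetical names, and whether the real method also strips the title is not stated.

```python
from dataclasses import dataclass
from typing import Optional

MIN_CONTENT_LENGTH = 100  # documented minimum, measured after stripping whitespace


@dataclass
class ArticleSketch:
    """Stand-in for Article; not the library's actual class."""

    title: str
    content: str
    author: Optional[str] = None

    def is_valid(self) -> bool:
        # Reject an empty title or empty content outright.
        if not self.title or not self.content:
            return False
        # Whitespace padding does not count toward the minimum length.
        return len(self.content.strip()) >= MIN_CONTENT_LENGTH


short = ArticleSketch(title="Too short", content="A critical flaw...")
padded = ArticleSketch(title="Padded", content=" " * 200)
full = ArticleSketch(title="Long enough", content="x" * 100)
```

Under these assumptions only `full` passes: `short` has too little text, and `padded` is long only because of whitespace.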

to_dict() -> Dict[str, Optional[str]]

Convert article to dictionary representation.

Returns:

| Type | Description |
| --- | --- |
| Dict[str, Optional[str]] | Dictionary with all article fields |

Example

article_data = article.to_dict()
json.dump(article_data, file)
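Because every field is a string or `None`, the dictionary form serializes to JSON directly. The sketch below assumes `to_dict()` behaves like `dataclasses.asdict`, which the source does not confirm; `ArticleSketch` is a hypothetical stand-in that mirrors the documented fields.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass
class ArticleSketch:
    """Stand-in with the same six fields as the documented Article."""

    title: str
    content: str
    author: Optional[str] = None
    url: Optional[str] = None
    category: Optional[str] = None
    source: Optional[str] = None

    def to_dict(self) -> Dict[str, Optional[str]]:
        # asdict() returns the fields in declaration order.
        return asdict(self)


article = ArticleSketch(title="Example", content="Body text", source="wired")
payload = json.dumps(article.to_dict())          # unset fields become JSON null
restored = ArticleSketch(**json.loads(payload))  # round-trip back to an object
```

The round trip works here because the dictionary keys match the constructor's parameter names.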

ArticleSource

Abstract base class for implementing news article sources.

This class defines the interface that all article sources must implement. It provides a consistent way to collect articles from different news websites while handling source-specific details in subclasses.

Each source implementation should:
  • Define supported categories
  • Implement RSS feed or web scraping logic
  • Handle rate limiting and error recovery
  • Clean and validate article content
  • Return standardized Article objects

Subclasses must implement

  • name: Property returning source identifier
  • supported_categories: Property returning list of valid categories
  • get_latest_article(): Method to get single latest article
  • get_multiple_articles(): Method to get multiple articles

Example Implementation

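from typing import List

from the_data_packet.sources.base import Article, ArticleSource


class ExampleSource(ArticleSource):
    @property
    def name(self) -> str:
        return "example"

    @property
    def supported_categories(self) -> List[str]:
        return ["tech", "science"]

    def get_latest_article(self, category: str) -> Article:
        # Implementation specific logic
        pass

    def get_multiple_articles(self, category: str, count: int) -> List[Article]:
        # Also abstract, so it must be implemented before the class can be instantiated
        pass

Usage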

source = WiredSource()
if "security" in source.supported_categories:
    article = source.get_latest_article("security")
    articles = source.get_multiple_articles("security", count=5)
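The requirement that subclasses implement all four members follows from how Python abstract base classes work. The sketch below uses a hypothetical stand-in, `SourceSketch`, with articles simplified to plain strings; it is not the package's `ArticleSource`, but it shows the same mechanism.

```python
from abc import ABC, abstractmethod
from typing import List


class SourceSketch(ABC):
    """Stand-in with the four documented abstract members."""

    @property
    @abstractmethod
    def name(self) -> str: ...

    @property
    @abstractmethod
    def supported_categories(self) -> List[str]: ...

    @abstractmethod
    def get_latest_article(self, category: str) -> str: ...

    @abstractmethod
    def get_multiple_articles(self, category: str, count: int) -> List[str]: ...


class IncompleteSource(SourceSketch):
    """Omits get_multiple_articles, so the class stays abstract."""

    @property
    def name(self) -> str:
        return "example"

    @property
    def supported_categories(self) -> List[str]:
        return ["tech"]

    def get_latest_article(self, category: str) -> str:
        return "headline"


class CompleteSource(IncompleteSource):
    """Adds the missing method, which makes the class instantiable."""

    def get_multiple_articles(self, category: str, count: int) -> List[str]:
        return ["headline"][:count]


def can_instantiate(cls: type) -> bool:
    # Instantiating a class with unimplemented abstract members raises TypeError.
    try:
        cls()
    except TypeError:
        return False
    return True
```

Here `can_instantiate(IncompleteSource)` is false and `can_instantiate(CompleteSource)` is true, so a forgotten method surfaces at construction time rather than at the first call.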

name: str abstractmethod property

Source name identifier.

Returns a unique string identifier for this source. Used in configuration, logging, and file naming.

Returns:

| Type | Description |
| --- | --- |
| str | Source identifier (e.g., "wired", "techcrunch") |

supported_categories: List[str] abstractmethod property

List of supported article categories for this source.

Returns the categories this source can collect articles from. Categories should match the source's RSS feeds or section structure.

Returns:

| Type | Description |
| --- | --- |
| List[str] | List of category strings (e.g., ["security", "guide", "business"]) |

get_latest_article(category: str) -> Article abstractmethod

Get the latest article from a specific category.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| category | str | Category to fetch from (must be in supported_categories) | required |

Returns:

| Type | Description |
| --- | --- |
| Article | Latest Article instance from the category |

Raises:

| Type | Description |
| --- | --- |
| ScrapingError | If article collection fails |
| ValidationError | If category is not supported |
| NetworkError | If network request fails |

Example

try:
    article = source.get_latest_article("security")
    logger.info(f"Retrieved: {article.title}")
except ValidationError:
    logger.error("Category 'security' not supported")
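Of the three documented exceptions, a network failure is the one most likely to succeed on a second attempt, while an unsupported category will fail every time. The sketch below retries only on `NetworkError`. The exception classes are local stand-ins (the package's real hierarchy is not shown here), and `fetch_with_retry` is a hypothetical helper.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ValidationError(Exception):
    """Stand-in; the package defines its own ValidationError."""


class NetworkError(Exception):
    """Stand-in; the package defines its own NetworkError."""


def fetch_with_retry(fetch: Callable[[], T], attempts: int = 3, delay: float = 0.0) -> T:
    """Call `fetch`, retrying only when it raises NetworkError."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except NetworkError:
            if attempt == attempts:
                raise  # out of retries: surface the last failure
            time.sleep(delay * attempt)  # linear backoff between tries
    raise AssertionError("unreachable")
```

A caller might wrap a source call as `fetch_with_retry(lambda: source.get_latest_article("security"), delay=1.0)`. A `ValidationError` is not caught, so it propagates after the first call.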

get_multiple_articles(category: str, count: int) -> List[Article] abstractmethod

Get multiple articles from a specific category.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| category | str | Category to fetch from (must be in supported_categories) | required |
| count | int | Maximum number of articles to return | required |

Returns:

| Type | Description |
| --- | --- |
| List[Article] | List of Article instances (may be fewer than count if unavailable) |

Raises:

| Type | Description |
| --- | --- |
| ScrapingError | If article collection fails |
| ValidationError | If category is not supported or count is invalid |
| NetworkError | If network request fails |

Example

articles = source.get_multiple_articles("guide", count=3)
valid_articles = [a for a in articles if a.is_valid()]
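Filtering after the fetch, as in the example above, can leave fewer than `count` usable articles. One alternative is to filter while collecting and stop once enough valid items are found. The sketch below shows that idea with a hypothetical `take_valid` helper. It assumes a count below 1 is what "invalid" means, and it raises `ValueError` where the package would raise its own `ValidationError`.

```python
from itertools import islice
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def take_valid(candidates: Iterable[T], count: int, is_valid: Callable[[T], bool]) -> List[T]:
    """Return up to `count` candidates that pass `is_valid`, in original order."""
    if count < 1:
        # Assumption: the source does not say which counts are invalid.
        raise ValueError("count must be a positive integer")
    # The generator is lazy, so iteration stops once `count` items are found.
    return list(islice((c for c in candidates if is_valid(c)), count))
```

With real articles the call might look like `take_valid(feed_entries, 3, lambda a: a.is_valid())`, where `feed_entries` is whatever iterable the source produces.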

validate_category(category: str) -> None

Validate if a category is supported by this source.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| category | str | Category to validate | required |

Raises:

| Type | Description |
| --- | --- |
| ValidationError | If category is not supported |

Example

source.validate_category("security")  # OK
source.validate_category("invalid")   # Raises ValidationError
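A membership check of this kind needs only the category list. The following is a minimal sketch written as a free function, not the package's method; `ValidationError` is a local stand-in, and the error message wording is an assumption.

```python
from typing import Sequence


class ValidationError(Exception):
    """Stand-in for the package's ValidationError."""


def validate_category(category: str, supported: Sequence[str], source_name: str) -> None:
    """Raise ValidationError unless `category` is one of `supported`."""
    if category not in supported:
        raise ValidationError(
            f"Category {category!r} is not supported by {source_name!r}; "
            f"choose one of: {', '.join(supported)}"
        )
```

Listing the supported categories in the message makes a misconfigured category easier to diagnose from logs alone.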