Base
the_data_packet.sources.base
Base classes for article sources.
This module defines the core data structures and interfaces for collecting articles from various news sources. It provides a standardized way to represent articles and implement source-specific collection logic.
The design supports:
- Multiple news sources with different scraping strategies
- Consistent article representation across sources
- Validation of article content quality
- Category-based article filtering
- Extensible source implementation
Architecture
- Article: Data class representing a single news article
- ArticleSource: Abstract base class for implementing news sources
Current Sources
- WiredSource: Wired.com articles via RSS feeds
- TechCrunchSource: TechCrunch.com articles via RSS feeds
Future Sources (extensible):
- ArsTechnicaSource
- HackerNewsSource
Article
dataclass
Represents a single news article from any source.
This data class provides a standardized representation of news articles regardless of their source. It includes validation methods to ensure articles have sufficient content for podcast generation.
Attributes:
| Name | Type | Description |
|---|---|---|
| title | str | Article headline/title. Required for all articles. |
| content | str | Full article text content. Required and must be substantial. |
| author | Optional[str] | Article author name. Optional but recommended. |
| url | Optional[str] | Original article URL. Optional but useful for debugging. |
| category | Optional[str] | Article category (e.g., 'security', 'guide'). Optional. |
| source | Optional[str] | Source identifier (e.g., 'wired'). Optional but recommended. |
Content Requirements
- Title must be non-empty
- Content must be at least 100 characters after stripping whitespace
- Content should be clean text without HTML tags or navigation elements
Example
    article = Article(
        title="New Security Vulnerability Discovered",
        content="A critical security flaw has been found...",
        author="Jane Smith",
        url="https://example.com/article",
        category="security",
        source="wired",
    )

    if article.is_valid():
        # Process article for podcast generation
        pass
title: str
instance-attribute
content: str
instance-attribute
author: Optional[str] = None
class-attribute
instance-attribute
url: Optional[str] = None
class-attribute
instance-attribute
category: Optional[str] = None
class-attribute
instance-attribute
source: Optional[str] = None
class-attribute
instance-attribute
__init__(title: str, content: str, author: Optional[str] = None, url: Optional[str] = None, category: Optional[str] = None, source: Optional[str] = None) -> None
is_valid() -> bool
Check if article has sufficient content for podcast generation.
Validates that the article has:
- Non-empty title
- Non-empty content
- Content length of at least 100 characters (after stripping)
Returns:
| Type | Description |
|---|---|
| bool | True if article meets minimum content requirements |
Example
    for article in articles:
        if not article.is_valid():
            logger.warning(f"Skipping invalid article: {article.title}")
            continue
        # Process the valid article here
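The checks listed above can be sketched as follows. This is a minimal stand-in rather than the package's actual code: `ArticleSketch` and `MIN_CONTENT_LENGTH` are hypothetical names, and treating a whitespace-only title as empty is an assumption. Only the 100-character threshold and the stripping step come from the description.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed constant name; the 100-character threshold comes from the docs above.
MIN_CONTENT_LENGTH = 100


@dataclass
class ArticleSketch:
    """Hypothetical stand-in for Article, reduced to what is_valid() needs."""

    title: str
    content: str
    author: Optional[str] = None

    def is_valid(self) -> bool:
        # Reject empty titles (whitespace-only counts as empty here: assumption).
        if not self.title or not self.title.strip():
            return False
        # Content must be long enough once surrounding whitespace is stripped.
        return len((self.content or "").strip()) >= MIN_CONTENT_LENGTH


short = ArticleSketch(title="Too short", content="Just a teaser.")
padded = ArticleSketch(title="Padded", content=" " * 200)
full = ArticleSketch(title="Long enough", content="x" * 100)
```

In this sketch `padded` fails even though its raw length is 200, because the length is measured after stripping.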
to_dict() -> Dict[str, Optional[str]]
Convert article to dictionary representation.
Returns:
| Type | Description |
|---|---|
| Dict[str, Optional[str]] | Dictionary with all article fields |
Example
    article_data = article.to_dict()
    json.dump(article_data, file)
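One plausible way to produce such a dictionary is `dataclasses.asdict`. The sketch below assumes that approach; `ArticleSketch` is a hypothetical stand-in that mirrors the documented fields, and the real `to_dict()` may be written differently.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass
class ArticleSketch:
    """Hypothetical stand-in mirroring the documented Article fields."""

    title: str
    content: str
    author: Optional[str] = None
    url: Optional[str] = None
    category: Optional[str] = None
    source: Optional[str] = None

    def to_dict(self) -> Dict[str, Optional[str]]:
        # asdict() copies every dataclass field into a plain dict.
        return asdict(self)


article = ArticleSketch(title="Example", content="Body text", source="wired")
payload = json.dumps(article.to_dict())

# Unset optional fields serialize as JSON null and load back as None.
restored = ArticleSketch(**json.loads(payload))
```

Because every field is a string or `None`, the dictionary can be passed to `json.dump` without a custom encoder.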
ArticleSource
Abstract base class for implementing news article sources.
This class defines the interface that all article sources must implement. It provides a consistent way to collect articles from different news websites while handling source-specific details in subclasses.
Each source implementation should:
- Define supported categories
- Implement RSS feed or web scraping logic
- Handle rate limiting and error recovery
- Clean and validate article content
- Return standardized Article objects
Subclasses must implement
- name: Property returning source identifier
- supported_categories: Property returning list of valid categories
- get_latest_article(): Method to get single latest article
- get_multiple_articles(): Method to get multiple articles
Example Implementation
    class ExampleSource(ArticleSource):
        @property
        def name(self) -> str:
            return "example"

        @property
        def supported_categories(self) -> List[str]:
            return ["tech", "science"]

        def get_latest_article(self, category: str) -> Article:
            # Implementation-specific logic
            pass

        def get_multiple_articles(self, category: str, count: int) -> List[Article]:
            # Implementation-specific logic
            pass
Usage
    source = WiredSource()
    if "security" in source.supported_categories:
        article = source.get_latest_article("security")
        articles = source.get_multiple_articles("security", count=5)
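The requirement that subclasses implement every abstract member can be illustrated with Python's `abc` module. The sketch assumes ArticleSource is built on `abc.ABC`, which the `abstractmethod` labels on this page suggest but do not confirm. `SourceSketch`, `IncompleteSource`, and `CompleteSource` are hypothetical names, and the interface is cut down to two properties.

```python
from abc import ABC, abstractmethod
from typing import List


class SourceSketch(ABC):
    """Hypothetical reduced version of the ArticleSource interface."""

    @property
    @abstractmethod
    def name(self) -> str: ...

    @property
    @abstractmethod
    def supported_categories(self) -> List[str]: ...


class IncompleteSource(SourceSketch):
    # supported_categories is deliberately left unimplemented.
    @property
    def name(self) -> str:
        return "incomplete"


class CompleteSource(SourceSketch):
    @property
    def name(self) -> str:
        return "complete"

    @property
    def supported_categories(self) -> List[str]:
        return ["tech"]


try:
    IncompleteSource()
    instantiated = True
except TypeError:
    # abc refuses to instantiate a class with unimplemented abstract members.
    instantiated = False
```

Under this assumption, a forgotten method surfaces when the source is constructed, not later when the missing method is first called.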
name: str
abstractmethod
property
Source name identifier.
Returns a unique string identifier for this source. Used in configuration, logging, and file naming.
Returns:
| Type | Description |
|---|---|
| str | Source identifier (e.g., "wired", "techcrunch") |
supported_categories: List[str]
abstractmethod
property
List of supported article categories for this source.
Returns the categories this source can collect articles from. Categories should match the source's RSS feeds or section structure.
Returns:
| Type | Description |
|---|---|
| List[str] | List of category strings (e.g., ["security", "guide", "business"]) |
get_latest_article(category: str) -> Article
abstractmethod
Get the latest article from a specific category.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| category | str | Category to fetch from (must be in supported_categories) | required |
Returns:
| Type | Description |
|---|---|
| Article | Latest Article instance from the category |
Raises:
| Type | Description |
|---|---|
| ScrapingError | If article collection fails |
| ValidationError | If category is not supported |
| NetworkError | If network request fails |
Example
    category = "security"
    try:
        article = source.get_latest_article(category)
        logger.info(f"Retrieved: {article.title}")
    except ValidationError:
        logger.error(f"Category '{category}' not supported")
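A caller may want to treat the three documented exceptions differently. The following is a sketch under stated assumptions: the exception classes are local stand-ins (this page does not show their import path or inheritance), and `fetch_or_none` and `fake_fetch` are hypothetical helpers, with `fake_fetch` standing in for `source.get_latest_article`.

```python
from typing import Callable, Optional


# Stand-ins for the package's exceptions; their real import path and
# class hierarchy are not shown on this page.
class ScrapingError(Exception):
    """Stand-in: article collection failed."""


class ValidationError(Exception):
    """Stand-in: unsupported category."""


class NetworkError(Exception):
    """Stand-in: network request failed."""


def fetch_or_none(fetch: Callable[[str], str], category: str) -> Optional[str]:
    """Call fetch(category) and map each documented failure to None."""
    try:
        return fetch(category)
    except ValidationError:
        # Caller error: the category is not in supported_categories.
        print(f"Category {category!r} not supported")
    except NetworkError:
        # Possibly transient: a caller might choose to retry later.
        print(f"Network failure while fetching {category!r}")
    except ScrapingError:
        # Collection failed for some other reason.
        print(f"Could not collect an article for {category!r}")
    return None


def fake_fetch(category: str) -> str:
    # Hypothetical fetcher standing in for source.get_latest_article.
    if category != "security":
        raise ValidationError(category)
    return "Latest security headline"
```

If the real exceptions share a base class, the order of the `except` clauses matters: the most specific class has to be listed first.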
get_multiple_articles(category: str, count: int) -> List[Article]
abstractmethod
Get multiple articles from a specific category.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| category | str | Category to fetch from (must be in supported_categories) | required |
| count | int | Maximum number of articles to return | required |
Returns:
| Type | Description |
|---|---|
| List[Article] | List of Article instances (may be fewer than count if unavailable) |
Raises:
| Type | Description |
|---|---|
| ScrapingError | If article collection fails |
| ValidationError | If category is not supported or count is invalid |
| NetworkError | If network request fails |
Example
    articles = source.get_multiple_articles("guide", count=3)
    valid_articles = [a for a in articles if a.is_valid()]
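The "may be fewer than count" contract can be sketched with a source backed by an in-memory dictionary. `InMemorySource` is hypothetical, plain strings stand in for Article objects, `ValidationError` is a local stand-in, and treating `count < 1` as invalid is an assumption about what "count is invalid" means.

```python
from typing import Dict, List


class ValidationError(Exception):
    """Stand-in for the package's ValidationError."""


class InMemorySource:
    """Hypothetical source backed by a dict instead of RSS feeds."""

    def __init__(self, feeds: Dict[str, List[str]]) -> None:
        self._feeds = feeds

    def get_multiple_articles(self, category: str, count: int) -> List[str]:
        if category not in self._feeds:
            raise ValidationError(f"Unsupported category: {category}")
        # Treating count < 1 as invalid is an assumption about the real rule.
        if count < 1:
            raise ValidationError(f"count must be positive, got {count}")
        # Slicing never raises, so a short feed simply yields fewer items.
        return self._feeds[category][:count]


source = InMemorySource({"guide": ["first", "second"]})
available = source.get_multiple_articles("guide", count=3)
```

Here `available` holds two items even though three were requested, so callers should check the length of the result instead of assuming it equals `count`.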
validate_category(category: str) -> None
Validate if a category is supported by this source.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| category | str | Category to validate | required |
Raises:
| Type | Description |
|---|---|
| ValidationError | If category is not supported |
Example
    source.validate_category("security")  # OK
    source.validate_category("invalid")  # Raises ValidationError
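Since `validate_category` is not marked abstract, it is presumably a shared helper that checks membership in `supported_categories`. The sketch below assumes that behavior; `CategoryCheckSketch` and the error message wording are hypothetical, and `ValidationError` is a local stand-in.

```python
from typing import List


class ValidationError(Exception):
    """Stand-in for the package's ValidationError."""


class CategoryCheckSketch:
    """Hypothetical holder for a supported_categories list."""

    def __init__(self, supported_categories: List[str]) -> None:
        self.supported_categories = supported_categories

    def validate_category(self, category: str) -> None:
        # Membership test against the source's declared categories.
        if category not in self.supported_categories:
            allowed = ", ".join(self.supported_categories)
            raise ValidationError(
                f"Category {category!r} not supported; expected one of: {allowed}"
            )


checker = CategoryCheckSketch(["security", "guide"])
checker.validate_category("security")  # Returns None, no exception
```

Listing the allowed values in the message is a design choice of this sketch; it makes a misconfigured category easier to spot in logs.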