Wired
the_data_packet.sources.wired
Wired.com article source implementation.
This module collects articles from Wired.com using RSS feeds and web scraping. Wired.com publishes category-specific RSS feeds that list recent article URLs; each URL is then scraped for the article's full content.
Features
- RSS feed-based article discovery
- Multiple category support (security, guides, business, science, AI)
- Robust content extraction with fallback methods
- Content cleaning and validation
- Error handling for network issues and malformed content
RSS Feed Strategy
- Fetch category-specific RSS feed
- Parse feed to extract article URLs
- Scrape individual articles for full content
- Clean and validate extracted content
- Return standardized Article objects
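The feed-parsing step above can be sketched with the standard library alone. This is a minimal illustration, not the module's actual implementation: the function name extract_article_urls, the limit parameter, and the inline SAMPLE_FEED are all assumptions made for the example, and a real feed would first be fetched over HTTP.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical RSS 2.0 sample used instead of a live network request.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First story</title>
      <link>https://www.wired.com/story/first-example/</link>
    </item>
    <item>
      <title>Second story</title>
      <link>https://www.wired.com/story/second-example/</link>
    </item>
    <item>
      <title>Item without a link</title>
    </item>
  </channel>
</rss>
"""


def extract_article_urls(feed_xml: str, limit: int = 10) -> List[str]:
    """Return up to `limit` article URLs from an RSS 2.0 document, in feed order."""
    root = ET.fromstring(feed_xml)
    urls: List[str] = []
    for item in root.iter("item"):
        if len(urls) >= limit:
            break
        link = item.findtext("link")
        # Items with a missing or empty <link> are skipped rather than failing.
        if link and link.strip():
            urls.append(link.strip())
    return urls


if __name__ == "__main__":
    for url in extract_article_urls(SAMPLE_FEED):
        print(url)
```

Skipping malformed items instead of raising keeps one bad entry from discarding the rest of the feed, which fits the error-handling goal listed under Features.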
Content Extraction
- Primary: Article body containers and paragraph tags
- Fallback: Main content areas and text containers
- Cleaning: Remove navigation, ads, and boilerplate text
- Validation: Ensure sufficient content length
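One way to picture the primary/fallback extraction and the length check is the sketch below. It assumes, purely for illustration, that the article body lives in an <article> element and the fallback area is <main>; the real selectors used for Wired.com pages are not documented here. ParagraphCollector, extract_content, and MIN_CONTENT_LENGTH are hypothetical names, and boilerplate cleaning is left out to keep the example short.

```python
from html.parser import HTMLParser
from typing import List, Optional

MIN_CONTENT_LENGTH = 40  # assumed threshold, chosen only for this demo


class ParagraphCollector(HTMLParser):
    """Collect <p> text and remember which container each paragraph sat in."""

    def __init__(self) -> None:
        super().__init__()
        self.open_containers: List[str] = []  # currently open <article>/<main> tags
        self.in_paragraph = False
        self.buffer: List[str] = []
        self.article_paragraphs: List[str] = []
        self.main_paragraphs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("article", "main"):
            self.open_containers.append(tag)
        elif tag == "p":
            self.in_paragraph = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in ("article", "main") and tag in self.open_containers:
            self.open_containers.remove(tag)
        elif tag == "p" and self.in_paragraph:
            # Collapse runs of whitespace inside the paragraph.
            text = " ".join("".join(self.buffer).split())
            if text:
                if "article" in self.open_containers:
                    self.article_paragraphs.append(text)
                if "main" in self.open_containers:
                    self.main_paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.buffer.append(data)


def extract_content(html: str) -> Optional[str]:
    """Try <article> paragraphs first, then <main>; None if both are too short."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    for paragraphs in (collector.article_paragraphs, collector.main_paragraphs):
        content = "\n\n".join(paragraphs)
        if len(content) >= MIN_CONTENT_LENGTH:
            return content
    return None


PRIMARY_HTML = (
    "<html><body><nav><p>Menu</p></nav><article>"
    "<p>Body paragraph one with enough text to pass.</p>"
    "<p>Second paragraph.</p></article></body></html>"
)
FALLBACK_HTML = (
    "<html><body><main><p>Fallback paragraph that is long enough "
    "to count as content.</p></main></body></html>"
)
TOO_SHORT_HTML = "<html><body><article><p>Tiny.</p></article></body></html>"

if __name__ == "__main__":
    print(extract_content(PRIMARY_HTML))
    print(extract_content(FALLBACK_HTML))
    print(extract_content(TOO_SHORT_HTML))
```

Returning None when neither strategy yields enough text lets the caller decide whether to skip the article or raise an error.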
Supported Categories
- security: Security and cybersecurity articles
- guide: How-to guides and tutorials
- business: Business and industry news
- science: Science and technology research
- ai: Artificial intelligence and machine learning
Rate Limiting
- Respectful delays between requests
- Connection reuse via HTTP session
- Proper User-Agent identification
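The delay between requests could be enforced by a small throttle like the one sketched here. This is an assumption about how such a delay might work, not a description of HTTPClient; RequestThrottle and its parameters are hypothetical, and connection reuse and User-Agent headers are outside the scope of the example. The clock and sleep functions are injectable so the behaviour can be checked without real waiting.

```python
import time
from typing import Callable, Optional


class RequestThrottle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(
        self,
        min_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request: Optional[float] = None

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_request is not None:
            elapsed = now - self._last_request
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        # Record the moment the request is actually released.
        self._last_request = now + delay
        return delay


if __name__ == "__main__":
    throttle = RequestThrottle(min_interval=0.05)
    for attempt in range(3):
        slept = throttle.wait()
        print(f"request {attempt}: slept {slept:.3f}s")
```

Calling wait() immediately before each outgoing request keeps the spacing logic in one place, independent of which HTTP library performs the request.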
Example Usage
source = WiredSource()

# Get latest security article
article = source.get_latest_article("security")

# Get multiple guide articles
articles = source.get_multiple_articles("guide", count=3)

# Check supported categories
if "ai" in source.supported_categories:
    ai_articles = source.get_multiple_articles("ai", count=5)
logger = get_logger(__name__)
module-attribute
WiredSource
Article source for Wired.com.
http_client = HTTPClient()
instance-attribute
RSS_FEEDS = {'security': 'https://www.wired.com/feed/category/security/latest/rss', 'science': 'https://www.wired.com/feed/category/science/latest/rss', 'ai': 'https://www.wired.com/feed/tag/ai/latest/rss'}
class-attribute
instance-attribute
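A category-to-feed lookup over this mapping might look like the sketch below. The helper names resolve_feed_url and supported_categories (as a free function) are hypothetical, and normalising the category with strip() and lower() is an assumption; the class may validate categories differently.

```python
from typing import Dict, List

# Copied from the RSS_FEEDS attribute documented above.
RSS_FEEDS: Dict[str, str] = {
    "security": "https://www.wired.com/feed/category/security/latest/rss",
    "science": "https://www.wired.com/feed/category/science/latest/rss",
    "ai": "https://www.wired.com/feed/tag/ai/latest/rss",
}


def supported_categories(feeds: Dict[str, str] = RSS_FEEDS) -> List[str]:
    """Derive the category list from the mapping so the two cannot drift apart."""
    return list(feeds)


def resolve_feed_url(category: str, feeds: Dict[str, str] = RSS_FEEDS) -> str:
    """Return the feed URL for a category, or raise ValueError if it is unknown."""
    key = category.strip().lower()  # assumed normalisation
    try:
        return feeds[key]
    except KeyError:
        raise ValueError(
            f"Unsupported category {category!r}; expected one of {sorted(feeds)}"
        ) from None


if __name__ == "__main__":
    print(supported_categories())
    print(resolve_feed_url("security"))
```

With only the three entries shown in this mapping, a lookup written this way would reject any category that has no feed URL, so the set of usable categories follows directly from the keys of RSS_FEEDS.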
SKIP_PATTERNS = ['subscribe to wired', 'most popular', 'related stories', 'advertisement', 'get wired', 'sign up', 'newsletter']
class-attribute
instance-attribute
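Patterns like these are typically applied as a filter over extracted paragraphs. The sketch below assumes a case-insensitive substring match, which the lowercase patterns suggest but the documentation does not confirm; is_boilerplate and clean_paragraphs are hypothetical helper names.

```python
from typing import Iterable, List, Sequence

# Copied from the SKIP_PATTERNS attribute documented above.
SKIP_PATTERNS = [
    "subscribe to wired",
    "most popular",
    "related stories",
    "advertisement",
    "get wired",
    "sign up",
    "newsletter",
]


def is_boilerplate(paragraph: str, patterns: Sequence[str] = SKIP_PATTERNS) -> bool:
    """True if the paragraph contains any skip pattern, ignoring case."""
    lowered = paragraph.lower()
    return any(pattern in lowered for pattern in patterns)


def clean_paragraphs(paragraphs: Iterable[str]) -> List[str]:
    """Drop empty paragraphs and those that match a skip pattern."""
    return [p.strip() for p in paragraphs if p.strip() and not is_boilerplate(p)]


if __name__ == "__main__":
    sample = [
        "Researchers disclosed a new vulnerability on Tuesday.",
        "Subscribe to WIRED for unlimited access.",
        "   ",
        "ADVERTISEMENT",
        "The patch is expected next week.",
    ]
    for paragraph in clean_paragraphs(sample):
        print(paragraph)
```

Substring matching is simple but coarse: a genuine sentence containing a phrase such as "sign up" would also be dropped, so a stricter rule (for example, matching only short paragraphs) may be preferable in practice.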
name: str
property
Source name identifier.
supported_categories: List[str]
property
List of supported categories.
__init__() -> None
Initialize Wired source.
get_latest_article(category: str) -> Article
Get the latest article from a category.
get_multiple_articles(category: str, count: int) -> List[Article]
Get multiple articles from a category.