Core Module¶
The core module provides fundamental functionality for The Data Packet, including configuration management, exception handling, and logging.
Configuration¶
Unified configuration system for The Data Packet.
This module provides centralized configuration management with support for:

- Environment variable loading
- Type-safe configuration with validation
- Default values for all settings
- Global configuration singleton pattern
- Override capabilities for testing
The configuration system follows these priorities (highest to lowest):

1. Direct parameter overrides
2. Environment variables
3. Default values
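In other words, a directly passed parameter wins over an environment variable, which in turn wins over the built-in default. A minimal sketch of that precedence, assuming get_config() re-reads the environment on each call (the global singleton may cache values in practice):

    import os
    from the_data_packet.core.config import get_config

    # Priority 3: no override, no env var -> built-in default
    print(get_config().show_name)                         # "The Data Packet"

    # Priority 2: an environment variable beats the default
    os.environ["SHOW_NAME"] = "Env Show"
    print(get_config().show_name)                         # "Env Show"

    # Priority 1: a direct parameter beats both
    print(get_config(show_name="Param Show").show_name)   # "Param Show"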
- Configuration Categories:
- API Keys:
Anthropic API key for Claude script generation
ElevenLabs API key for TTS audio generation
AWS credentials for S3 storage
- Podcast Settings:
Show metadata (name, episode numbers)
Audio preferences (voices, sample rate)
RSS feed configuration
- Processing Options:
Which generation steps to run
Article collection preferences
Output and cleanup settings
- Network Settings:
HTTP timeouts and user agents
Retry configurations
Rate limiting settings
- Usage:
    # Get default configuration (loads from environment)
    config = get_config()

    # Override specific values
    config = get_config(
        show_name="My Custom Podcast",
        max_articles_per_source=3,
    )

    # Access configuration values
    if config.anthropic_api_key:
        generator = ScriptGenerator(config.anthropic_api_key)
- Environment Variables:
- Required for script generation:
ANTHROPIC_API_KEY - Claude API key
- Required for audio generation:
GCS_BUCKET_NAME - Google Cloud Storage bucket for long audio synthesis
GOOGLE_APPLICATION_CREDENTIALS - Path to service account JSON (optional if using default credentials)
- Legacy (deprecated):
ELEVENLABS_API_KEY - ElevenLabs API key (replaced by Google Cloud TTS)
- Optional for S3 uploads:
S3_BUCKET_NAME - S3 bucket for hosting
AWS_ACCESS_KEY_ID - AWS access key
AWS_SECRET_ACCESS_KEY - AWS secret key
AWS_REGION - AWS region (default: us-east-1)
- Optional for Grafana Loki log aggregation:
GRAFANA_LOKI_URL - Loki endpoint URL
GRAFANA_LOKI_USERNAME - Loki authentication username
GRAFANA_LOKI_PASSWORD - Loki authentication password/API key
- Optional customizations:
SHOW_NAME - Podcast name override
LOG_LEVEL - Logging level (DEBUG/INFO/WARNING/ERROR)
MAX_ARTICLES - Max articles per source
- Logging configuration:
LOG_DIRECTORY - Directory for JSONL log files (default: output/logs)
ENABLE_JSONL_LOGGING - Enable JSONL file logging (true/false, default: true)
ENABLE_S3_LOG_UPLOAD - Enable S3 upload of logs (true/false, default: true)
LOG_UPLOAD_INTERVAL - Upload interval in seconds (default: 3600)
REMOVE_LOGS_AFTER_UPLOAD - Remove local logs after S3 upload (true/false, default: false)
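A hypothetical way to wire a few of the variables above from Python before startup; in real deployments they would normally come from the shell, a .env file, or the container environment, and the values below are placeholders:

    import os

    os.environ["ANTHROPIC_API_KEY"] = "<your-anthropic-key>"   # script generation
    os.environ["GCS_BUCKET_NAME"] = "<your-gcs-bucket>"        # audio generation
    os.environ["LOG_LEVEL"] = "DEBUG"
    os.environ["ENABLE_S3_LOG_UPLOAD"] = "false"

    from the_data_packet.core.config import get_config

    config = get_config()  # picks up the variables set above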
- class the_data_packet.core.config.Config(anthropic_api_key: str | None = None, elevenlabs_api_key: str | None = None, mongodb_username: str | None = None, mongodb_password: str | None = None, google_credentials_path: str | None = None, gcs_bucket_name: str | None = None, aws_access_key_id: str | None = None, aws_secret_access_key: str | None = None, aws_region: str = 'us-east-1', s3_bucket_name: str | None = None, grafana_loki_url: str | None = None, grafana_loki_username: str | None = None, grafana_loki_password: str | None = None, show_name: str = 'The Data Packet', episode_number: int | None = None, output_directory: Path = PosixPath('output'), max_articles_per_source: int = 1, article_sources: List[str] = <factory>, article_categories: List[str] = <factory>, source_category_mapping: Dict[str, List[str]] = <factory>, claude_model: str = 'claude-sonnet-4-5-20250929', tts_model: str = 'google_cloud_tts', max_tokens: int = 3000, temperature: float = 0.7, male_voice: str = 'en-US-Studio-Q', female_voice: str = 'en-US-Studio-O', audio_sample_rate: int = 44100, generate_script: bool = True, generate_audio: bool = True, generate_rss: bool = True, save_intermediate_files: bool = False, cleanup_temp_files: bool = True, rss_channel_title: str | None = 'The Data Packet', rss_channel_description: str | None = None, rss_channel_link: str | None = None, rss_channel_image_url: str | None = 'https://the-data-packet.s3.us-west-2.amazonaws.com/the-data-packet/the_data_packet.png', rss_channel_email: str | None = 'contact@thewintershadow.com', max_rss_episodes: int = 500, http_timeout: int = 30, user_agent: str = 'The Data Packet/1.0 (+https://github.com/TheWinterShadow/The-Data-Packet)', log_level: str = 'INFO', log_dir: str = 'output/logs', enable_jsonl_logging: bool = True, enable_s3_log_upload: bool = True, log_upload_interval: int = 3600, remove_logs_after_upload: bool = False)[source]¶
Bases: object

Unified configuration for The Data Packet with environment variable support.
This class provides type-safe configuration management with automatic environment variable loading and validation. All fields have sensible defaults and can be overridden via environment variables or direct parameter passing.
- API Keys
- anthropic_api_key: Anthropic API key for Claude script generation.
Required for script generation. Loaded from ANTHROPIC_API_KEY.
- elevenlabs_api_key: [DEPRECATED] ElevenLabs API key for legacy TTS.
Replaced by Google Cloud TTS. Loaded from ELEVENLABS_API_KEY.
- mongodb_username: MongoDB username for episode tracking and article deduplication.
Optional. Loaded from MONGODB_USERNAME.
- mongodb_password: MongoDB password for episode tracking and article deduplication.
Optional. Loaded from MONGODB_PASSWORD.
- Google Cloud Configuration
- google_credentials_path: Path to Google Cloud service account JSON file.
Optional if using default application credentials. Loaded from GOOGLE_APPLICATION_CREDENTIALS.
- gcs_bucket_name: Google Cloud Storage bucket for long audio synthesis output.
Required for audio generation. Loaded from GCS_BUCKET_NAME.
- AWS Configuration
aws_access_key_id: AWS access key for S3 uploads. Loaded from AWS_ACCESS_KEY_ID.
aws_secret_access_key: AWS secret key for S3 uploads. Loaded from AWS_SECRET_ACCESS_KEY.
aws_region: AWS region for S3 operations. Default: us-east-1.
s3_bucket_name: S3 bucket name for hosting files. Loaded from S3_BUCKET_NAME.
- Grafana Loki Configuration
grafana_loki_url: Loki endpoint URL for log aggregation. Loaded from GRAFANA_LOKI_URL.
grafana_loki_username: Username for Loki authentication. Loaded from GRAFANA_LOKI_USERNAME.
grafana_loki_password: Password/API key for Loki authentication. Loaded from GRAFANA_LOKI_PASSWORD.
- Podcast Configuration
show_name: Podcast show name. Used in RSS feeds and file names.
episode_number: Episode number for RSS feeds. Auto-generated if None.
output_directory: Local directory for generated files.
- Article Collection
max_articles_per_source: Maximum articles to collect per source.
article_sources: List of news sources to use (wired, techcrunch).
article_categories: List of categories to fetch from each source.
source_category_mapping: Maps each source to its supported categories.
- AI Generation Settings
claude_model: Claude model name for script generation.
tts_model: Text-to-speech service type (now "google_cloud_tts").
max_tokens: Maximum tokens for Claude API calls.
temperature: AI generation temperature (0.0-1.0, lower = more consistent).
- Audio Settings
male_voice: First speaker voice name (Alex - male narrator). Google Cloud Studio multi-speaker voice.
female_voice: Second speaker voice name (Sam - female narrator). Google Cloud Studio multi-speaker voice.
audio_sample_rate: Audio sample rate in Hz.
- Processing Options
generate_script: Whether to generate podcast scripts.
generate_audio: Whether to generate audio files.
generate_rss: Whether to generate RSS feeds.
save_intermediate_files: Whether to keep intermediate processing files.
cleanup_temp_files: Whether to clean up temporary files after processing.
- RSS Feed Configuration
rss_channel_title: RSS channel title.
rss_channel_description: RSS channel description.
rss_channel_link: RSS channel website link.
rss_channel_image_url: RSS channel artwork URL.
rss_channel_email: Contact email for podcast.
max_rss_episodes: Maximum episodes to keep in RSS feed.
- Network Settings
http_timeout: HTTP request timeout in seconds.
user_agent: User agent string for HTTP requests.
log_level: Logging level (DEBUG/INFO/WARNING/ERROR/CRITICAL).
Example
    # Default configuration with environment variables
    config = Config()

    # Custom configuration
    config = Config(
        show_name="Tech News Daily",
        max_articles_per_source=3,
        male_voice="charon",
        female_voice="aoede",
    )

    # Validate before use
    config.validate_for_script_generation()
    config.validate_for_audio_generation()
- rss_channel_image_url: str | None = 'https://the-data-packet.s3.us-west-2.amazonaws.com/the-data-packet/the_data_packet.png'¶
- get_sources_for_category(category: str) → List[str][source]¶
Get list of sources that support a given category.
- Parameters:
category – Category name to check
- Returns:
List of source names that support the category
- get_categories_for_source(source: str) → List[str][source]¶
Get list of categories supported by a given source.
- Parameters:
source – Source name to check
- Returns:
List of category names supported by the source
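A brief illustration of these two helpers; the category name "security" and the printed output are assumptions for the sketch, while "wired" is one of the documented sources:

    from the_data_packet.core.config import get_config

    config = get_config()

    # Which sources can supply a given category?
    for source in config.get_sources_for_category("security"):
        print(f"{source} supports the 'security' category")

    # Which categories does a given source offer?
    print(config.get_categories_for_source("wired"))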
- __init__(anthropic_api_key: str | None = None, elevenlabs_api_key: str | None = None, mongodb_username: str | None = None, mongodb_password: str | None = None, google_credentials_path: str | None = None, gcs_bucket_name: str | None = None, aws_access_key_id: str | None = None, aws_secret_access_key: str | None = None, aws_region: str = 'us-east-1', s3_bucket_name: str | None = None, grafana_loki_url: str | None = None, grafana_loki_username: str | None = None, grafana_loki_password: str | None = None, show_name: str = 'The Data Packet', episode_number: int | None = None, output_directory: Path = PosixPath('output'), max_articles_per_source: int = 1, article_sources: List[str] = <factory>, article_categories: List[str] = <factory>, source_category_mapping: Dict[str, List[str]] = <factory>, claude_model: str = 'claude-sonnet-4-5-20250929', tts_model: str = 'google_cloud_tts', max_tokens: int = 3000, temperature: float = 0.7, male_voice: str = 'en-US-Studio-Q', female_voice: str = 'en-US-Studio-O', audio_sample_rate: int = 44100, generate_script: bool = True, generate_audio: bool = True, generate_rss: bool = True, save_intermediate_files: bool = False, cleanup_temp_files: bool = True, rss_channel_title: str | None = 'The Data Packet', rss_channel_description: str | None = None, rss_channel_link: str | None = None, rss_channel_image_url: str | None = 'https://the-data-packet.s3.us-west-2.amazonaws.com/the-data-packet/the_data_packet.png', rss_channel_email: str | None = 'contact@thewintershadow.com', max_rss_episodes: int = 500, http_timeout: int = 30, user_agent: str = 'The Data Packet/1.0 (+https://github.com/TheWinterShadow/The-Data-Packet)', log_level: str = 'INFO', log_dir: str = 'output/logs', enable_jsonl_logging: bool = True, enable_s3_log_upload: bool = True, log_upload_interval: int = 3600, remove_logs_after_upload: bool = False) → None¶
Exceptions¶
Custom exceptions for The Data Packet.
This module provides a hierarchy of custom exceptions for different types of errors that can occur during podcast generation. All exceptions inherit from the base TheDataPacketError class for easy catching.
- Exception Hierarchy:
    TheDataPacketError
    ├── ConfigurationError (Missing/invalid API keys, settings)
    ├── NetworkError (HTTP requests, connectivity issues)
    ├── ScrapingError (Article extraction failures)
    ├── AIGenerationError (Claude API failures, content generation)
    ├── AudioGenerationError (ElevenLabs TTS failures, audio processing)
    └── ValidationError (Invalid data, missing required fields)
Example
    try:
        pipeline = PodcastPipeline()
        result = pipeline.run()
    except ConfigurationError as e:
        logger.error(f"Configuration issue: {e}")
    except AIGenerationError as e:
        logger.error(f"AI generation failed: {e}")
    except TheDataPacketError as e:
        logger.error(f"Unexpected error: {e}")
- exception the_data_packet.core.exceptions.TheDataPacketError[source]¶
Bases: Exception

Base exception for all The Data Packet errors.
This is the parent class for all custom exceptions in The Data Packet. Use this for general error handling when you want to catch any podcast generation related error.
- message¶
Human-readable error description
Example
    try:
        pipeline.run()
    except TheDataPacketError as e:
        logger.error(f"Podcast generation failed: {e}")
- exception the_data_packet.core.exceptions.ConfigurationError[source]¶
Bases: TheDataPacketError

Raised when there's an issue with configuration.
This exception is raised when:

- Required API keys are missing (ANTHROPIC_API_KEY, GOOGLE_API_KEY)
- Invalid configuration values are provided
- Required environment variables are not set
- AWS credentials are missing for S3 uploads
Example
    if not self.anthropic_api_key:
        raise ConfigurationError("Anthropic API key is required")
- exception the_data_packet.core.exceptions.NetworkError[source]¶
Bases: TheDataPacketError

Raised when there's a network-related error.
This exception is raised when:

- HTTP requests fail (timeouts, connection errors)
- RSS feeds are unreachable
- API services are unavailable
- S3 uploads fail due to network issues
Example
    try:
        response = requests.get(url, timeout=30)
    except requests.RequestException as e:
        raise NetworkError(f"Failed to fetch {url}: {e}")
- exception the_data_packet.core.exceptions.ScrapingError[source]¶
Bases: TheDataPacketError

Raised when article scraping fails.
This exception is raised when:

- RSS feed parsing fails
- Article content extraction fails
- Required article fields are missing
- Content cleaning/processing fails
Example
    if not article_content:
        raise ScrapingError(f"No content found for article: {url}")
- exception the_data_packet.core.exceptions.AIGenerationError[source]¶
Bases: TheDataPacketError

Raised when AI content generation fails.
This exception is raised when:

- Claude API returns errors or rate limits
- Generated content is invalid or too short
- AI refuses to process certain content types
- Script generation fails after retries
Example
    if response.status_code != 200:
        raise AIGenerationError(f"Claude API error: {response.text}")
- exception the_data_packet.core.exceptions.AudioGenerationError[source]¶
Bases: TheDataPacketError

Raised when audio generation fails.
This exception is raised when:

- ElevenLabs TTS API returns errors
- Audio file generation fails
- Voice configuration is invalid
- Audio processing or saving fails
Example
    if not output_file.exists():
        raise AudioGenerationError("Audio file generation failed")
- exception the_data_packet.core.exceptions.ValidationError[source]¶
Bases: TheDataPacketError

Raised when data validation fails.
This exception is raised when:

- Invalid article categories are provided
- Required data fields are missing or malformed
- Configuration values are outside acceptable ranges
- File paths are invalid or inaccessible
Example
    if category not in self.supported_categories:
        raise ValidationError(f"Unsupported category: {category}")
Logging¶
Centralized logging configuration for The Data Packet.
This module provides unified logging setup for the entire application. It configures structured logging with proper formatters, reduces noise from third-party libraries, and provides a consistent interface for obtaining logger instances throughout the codebase.
- Features:
Structured logging with timestamps and module names
Configurable log levels via environment variables
Noise reduction from third-party libraries
Consistent format across all modules
Console output optimized for Docker containers
- Usage:
    # In main application entry point
    setup_logging()

    # In any module
    from the_data_packet.core.logging import get_logger

    logger = get_logger(__name__)
    logger.info("Processing started")
- Log Levels:
DEBUG: Detailed debugging information
INFO: General operational messages
WARNING: Warning messages for recoverable issues
ERROR: Error messages for serious problems
CRITICAL: Critical errors that may cause shutdown
- class the_data_packet.core.logging.JSONLHandler(log_dir: str = 'output/logs')[source]¶
Bases: Handler

Custom logging handler that writes log entries to JSONL files.
Features:

- Writes structured JSON logs to .jsonl files
- Automatically rotates files daily
- Includes metadata like timestamp, module, and level
- Thread-safe file operations
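A minimal sketch of attaching the handler by hand, assuming only the standard logging.Handler interface; in normal use setup_logging() installs it for you:

    import logging

    from the_data_packet.core.logging import JSONLHandler

    logger = logging.getLogger("the_data_packet.demo")
    logger.setLevel(logging.INFO)
    logger.addHandler(JSONLHandler(log_dir="output/logs"))

    # Each record is appended as one JSON object per line in the daily .jsonl file
    logger.info("pipeline started")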
- class the_data_packet.core.logging.S3LogUploader(log_dir: str = 'output/logs', upload_interval: int = 3600, remove_after_upload: bool = False)[source]¶
Bases: object

Background service to upload JSONL log files to S3.
Features:

- Monitors the log directory for completed daily logs
- Uploads files to S3 with structured naming
- Optionally removes local files after upload
- Runs in a background thread
- __init__(log_dir: str = 'output/logs', upload_interval: int = 3600, remove_after_upload: bool = False)[source]¶
Initialize S3 log uploader.
- Parameters:
log_dir – Directory containing JSONL log files
upload_interval – How often to check for files to upload (seconds)
remove_after_upload – Whether to delete local files after upload
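A construction-only sketch mirroring the parameters above; how the background thread is started is deliberately not shown, since in normal use setup_logging() manages the uploader's lifecycle:

    from the_data_packet.core.logging import S3LogUploader

    uploader = S3LogUploader(
        log_dir="output/logs",     # directory watched for completed daily logs
        upload_interval=1800,      # check every 30 minutes instead of hourly
        remove_after_upload=True,  # reclaim local disk once a file reaches S3
    )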
- the_data_packet.core.logging.setup_logging(log_level: str | None = None, enable_jsonl: bool | None = None, enable_s3_upload: bool | None = None, log_dir: str | None = None) → None[source]¶
Configure application-wide logging settings.
Sets up structured logging with consistent formatting, configurable log levels, and noise reduction from third-party libraries. Should be called once at application startup.
- Parameters:
log_level – Override log level (DEBUG, INFO, WARNING, ERROR, CRITICAL). If None, uses configuration default. Case insensitive.
enable_jsonl – Whether to enable JSONL file logging (default: from config)
enable_s3_upload – Whether to enable S3 upload of log files (default: from config)
log_dir – Directory for JSONL log files (default: from config)
Example
    # Use default settings from config
    setup_logging()

    # Override to DEBUG level, disable S3 upload
    setup_logging("DEBUG", enable_s3_upload=False)

    # Console only (no JSONL files)
    setup_logging(enable_jsonl=False, enable_s3_upload=False)
Note
This function uses force=True to override any existing logging configuration, ensuring consistent behavior in all environments. JSONL logs include structured metadata for log aggregation and analysis. The S3 uploader runs in a background thread and uploads completed daily log files.
- the_data_packet.core.logging.get_logger(name: str) → Logger[source]¶
Get a named logger instance for a module.
This is the standard way to obtain logger instances throughout the application. Use __name__ as the logger name to get hierarchical logger names that match the module structure.
- Parameters:
name – Logger name, typically __name__ from calling module
- Returns:
Configured logger instance ready for use
Example
    # Standard usage in any module
    from the_data_packet.core.logging import get_logger

    logger = get_logger(__name__)

    # Usage examples
    logger.info("Starting article collection")
    logger.warning("Article content is short: %d chars", len(content))
    logger.error("Failed to generate script: %s", str(error))

    # With structured data (for log aggregation)
    logger.info("Article processed", extra={
        "article_id": article.id,
        "processing_time": elapsed_seconds,
    })
Note
Logger names follow Python's hierarchical naming convention. For example, 'the_data_packet.sources.wired' will inherit configuration from 'the_data_packet.sources' and 'the_data_packet'.
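Because of that hierarchy, a level set on a parent logger flows down to its children. A standard-library sketch (the module name is taken from the note above):

    import logging

    from the_data_packet.core.logging import get_logger

    # Raise the threshold for the whole package...
    logging.getLogger("the_data_packet").setLevel(logging.WARNING)

    # ...so this child logger inherits WARNING and drops the INFO record
    logger = get_logger("the_data_packet.sources.wired")
    logger.info("suppressed")
    logger.warning("emitted")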
- the_data_packet.core.logging.stop_s3_uploader() → None[source]¶
Stop the S3 log uploader service gracefully.
Should be called during application shutdown to ensure any pending uploads complete properly.
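One way to guarantee that call on normal interpreter exit is to register it with atexit at startup; this pairing is an assumption about deployment style, not a requirement of the API:

    import atexit

    from the_data_packet.core.logging import setup_logging, stop_s3_uploader

    setup_logging()
    atexit.register(stop_s3_uploader)  # flush pending log uploads on shutdown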