Core Module

The core module provides fundamental functionality for The Data Packet, including configuration management, exception handling, and logging.

Configuration

Unified configuration system for The Data Packet.

This module provides centralized configuration management with support for:
  • Environment variable loading
  • Type-safe configuration with validation
  • Default values for all settings
  • Global configuration singleton pattern
  • Override capabilities for testing

The configuration system follows these priorities (highest to lowest):
  1. Direct parameter overrides
  2. Environment variables
  3. Default values
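
The priority order can be sketched as a small resolver (illustrative only; the function name `resolve` is not part of the package):

```python
import os

def resolve(name: str, env_var: str, default, overrides: dict):
    """Resolve a setting with the documented priority:
    direct override > environment variable > default value."""
    if name in overrides:
        return overrides[name]
    if env_var in os.environ:
        return os.environ[env_var]
    return default

# Default wins when nothing else is set
assert resolve("show_name", "SHOW_NAME", "The Data Packet", {}) == "The Data Packet"

# An environment variable beats the default
os.environ["SHOW_NAME"] = "Env Show"
assert resolve("show_name", "SHOW_NAME", "The Data Packet", {}) == "Env Show"

# A direct override beats both
assert resolve("show_name", "SHOW_NAME", "The Data Packet",
               {"show_name": "Override Show"}) == "Override Show"
```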

Configuration Categories:
API Keys:
  • Anthropic API key for Claude script generation

  • ElevenLabs API key for TTS audio generation

  • AWS credentials for S3 storage

Podcast Settings:
  • Show metadata (name, episode numbers)

  • Audio preferences (voices, sample rate)

  • RSS feed configuration

Processing Options:
  • Which generation steps to run

  • Article collection preferences

  • Output and cleanup settings

Network Settings:
  • HTTP timeouts and user agents

  • Retry configurations

  • Rate limiting settings

Usage:

# Get default configuration (loads from environment)
config = get_config()

# Override specific values
config = get_config(
    show_name="My Custom Podcast",
    max_articles_per_source=3,
)

# Access configuration values
if config.anthropic_api_key:
    generator = ScriptGenerator(config.anthropic_api_key)

Environment Variables:
Required for script generation:

ANTHROPIC_API_KEY - Claude API key

Required for audio generation:

GCS_BUCKET_NAME - Google Cloud Storage bucket for long audio synthesis
GOOGLE_APPLICATION_CREDENTIALS - Path to service account JSON (optional if using default credentials)

Legacy (deprecated):

ELEVENLABS_API_KEY - ElevenLabs API key (replaced by Google Cloud TTS)

Optional for S3 uploads:

S3_BUCKET_NAME - S3 bucket for hosting
AWS_ACCESS_KEY_ID - AWS access key
AWS_SECRET_ACCESS_KEY - AWS secret key
AWS_REGION - AWS region (default: us-east-1)

Optional for Grafana Loki log aggregation:

GRAFANA_LOKI_URL - Loki endpoint URL
GRAFANA_LOKI_USERNAME - Loki authentication username
GRAFANA_LOKI_PASSWORD - Loki authentication password/API key

Optional customizations:

SHOW_NAME - Podcast name override
LOG_LEVEL - Logging level (DEBUG/INFO/WARNING/ERROR)
MAX_ARTICLES - Max articles per source

Logging configuration:

LOG_DIRECTORY - Directory for JSONL log files (default: output/logs)
ENABLE_JSONL_LOGGING - Enable JSONL file logging (true/false, default: true)
ENABLE_S3_LOG_UPLOAD - Enable S3 upload of logs (true/false, default: true)
LOG_UPLOAD_INTERVAL - Upload interval in seconds (default: 3600)
REMOVE_LOGS_AFTER_UPLOAD - Remove local logs after S3 upload (true/false, default: false)

class the_data_packet.core.config.Config(anthropic_api_key: str | None = None, elevenlabs_api_key: str | None = None, mongodb_username: str | None = None, mongodb_password: str | None = None, google_credentials_path: str | None = None, gcs_bucket_name: str | None = None, aws_access_key_id: str | None = None, aws_secret_access_key: str | None = None, aws_region: str = 'us-east-1', s3_bucket_name: str | None = None, grafana_loki_url: str | None = None, grafana_loki_username: str | None = None, grafana_loki_password: str | None = None, show_name: str = 'The Data Packet', episode_number: int | None = None, output_directory: Path = PosixPath('output'), max_articles_per_source: int = 1, article_sources: List[str] = <factory>, article_categories: List[str] = <factory>, source_category_mapping: Dict[str, ~typing.List[str]]=<factory>, claude_model: str = 'claude-sonnet-4-5-20250929', tts_model: str = 'google_cloud_tts', max_tokens: int = 3000, temperature: float = 0.7, male_voice: str = 'en-US-Studio-Q', female_voice: str = 'en-US-Studio-O', audio_sample_rate: int = 44100, generate_script: bool = True, generate_audio: bool = True, generate_rss: bool = True, save_intermediate_files: bool = False, cleanup_temp_files: bool = True, rss_channel_title: str | None = 'The Data Packet', rss_channel_description: str | None = None, rss_channel_link: str | None = None, rss_channel_image_url: str | None = 'https://the-data-packet.s3.us-west-2.amazonaws.com/the-data-packet/the_data_packet.png', rss_channel_email: str | None = 'contact@thewintershadow.com', max_rss_episodes: int = 500, http_timeout: int = 30, user_agent: str = 'The Data Packet/1.0 (+https://github.com/TheWinterShadow/The-Data-Packet)', log_level: str = 'INFO', log_dir: str = 'output/logs', enable_jsonl_logging: bool = True, enable_s3_log_upload: bool = True, log_upload_interval: int = 3600, remove_logs_after_upload: bool = False)[source]

Bases: object

Unified configuration for The Data Packet with environment variable support.

This class provides type-safe configuration management with automatic environment variable loading and validation. All fields have sensible defaults and can be overridden via environment variables or direct parameter passing.

API Keys
anthropic_api_key: Anthropic API key for Claude script generation.

Required for script generation. Loaded from ANTHROPIC_API_KEY.

elevenlabs_api_key: [DEPRECATED] ElevenLabs API key for legacy TTS.

Replaced by Google Cloud TTS. Loaded from ELEVENLABS_API_KEY.

mongodb_username: MongoDB username for episode tracking and article deduplication.

Optional. Loaded from MONGODB_USERNAME.

mongodb_password: MongoDB password for episode tracking and article deduplication.

Optional. Loaded from MONGODB_PASSWORD.

Google Cloud Configuration
google_credentials_path: Path to Google Cloud service account JSON file.

Optional if using default application credentials. Loaded from GOOGLE_APPLICATION_CREDENTIALS.

gcs_bucket_name: Google Cloud Storage bucket for long audio synthesis output.

Required for audio generation. Loaded from GCS_BUCKET_NAME.

AWS Configuration

aws_access_key_id: AWS access key for S3 uploads. Loaded from AWS_ACCESS_KEY_ID.
aws_secret_access_key: AWS secret key for S3 uploads. Loaded from AWS_SECRET_ACCESS_KEY.
aws_region: AWS region for S3 operations. Default: us-east-1.
s3_bucket_name: S3 bucket name for hosting files. Loaded from S3_BUCKET_NAME.

Grafana Loki Configuration

grafana_loki_url: Loki endpoint URL for log aggregation. Loaded from GRAFANA_LOKI_URL.
grafana_loki_username: Username for Loki authentication. Loaded from GRAFANA_LOKI_USERNAME.
grafana_loki_password: Password/API key for Loki authentication. Loaded from GRAFANA_LOKI_PASSWORD.

Podcast Configuration

show_name: Podcast show name. Used in RSS feeds and file names.
episode_number: Episode number for RSS feeds. Auto-generated if None.
output_directory: Local directory for generated files.

Article Collection

max_articles_per_source: Maximum articles to collect per source.
article_sources: List of news sources to use (wired, techcrunch).
article_categories: List of categories to fetch from each source.
source_category_mapping: Maps each source to its supported categories.

AI Generation Settings

claude_model: Claude model name for script generation.
tts_model: Text-to-speech service type (now "google_cloud_tts").
max_tokens: Maximum tokens for Claude API calls.
temperature: AI generation temperature (0.0-1.0; lower = more consistent).

Audio Settings (Google Cloud Studio multi-speaker voices)

male_voice: First speaker voice name (Alex, the male narrator).
female_voice: Second speaker voice name (Sam, the female narrator).
audio_sample_rate: Audio sample rate in Hz.

Processing Options

generate_script: Whether to generate podcast scripts.
generate_audio: Whether to generate audio files.
generate_rss: Whether to generate RSS feeds.
save_intermediate_files: Whether to keep intermediate processing files.
cleanup_temp_files: Whether to clean up temporary files after processing.

RSS Feed Configuration

rss_channel_title: RSS channel title.
rss_channel_description: RSS channel description.
rss_channel_link: RSS channel website link.
rss_channel_image_url: RSS channel artwork URL.
rss_channel_email: Contact email for the podcast.
max_rss_episodes: Maximum episodes to keep in the RSS feed.

Network Settings

http_timeout: HTTP request timeout in seconds.
user_agent: User agent string for HTTP requests.
log_level: Logging level (DEBUG/INFO/WARNING/ERROR/CRITICAL).

Example

# Default configuration with environment variables
config = Config()

# Custom configuration
config = Config(
    show_name="Tech News Daily",
    max_articles_per_source=3,
    male_voice="en-US-Studio-Q",
    female_voice="en-US-Studio-O",
)

# Validate before use
config.validate_for_script_generation()
config.validate_for_audio_generation()

anthropic_api_key: str | None = None
elevenlabs_api_key: str | None = None
mongodb_username: str | None = None
mongodb_password: str | None = None
google_credentials_path: str | None = None
gcs_bucket_name: str | None = None
aws_access_key_id: str | None = None
aws_secret_access_key: str | None = None
aws_region: str = 'us-east-1'
s3_bucket_name: str | None = None
grafana_loki_url: str | None = None
grafana_loki_username: str | None = None
grafana_loki_password: str | None = None
show_name: str = 'The Data Packet'
episode_number: int | None = None
output_directory: Path = PosixPath('output')
max_articles_per_source: int = 1
article_sources: List[str]
article_categories: List[str]
source_category_mapping: Dict[str, List[str]]
claude_model: str = 'claude-sonnet-4-5-20250929'
tts_model: str = 'google_cloud_tts'
max_tokens: int = 3000
temperature: float = 0.7
male_voice: str = 'en-US-Studio-Q'
female_voice: str = 'en-US-Studio-O'
audio_sample_rate: int = 44100
generate_script: bool = True
generate_audio: bool = True
generate_rss: bool = True
save_intermediate_files: bool = False
cleanup_temp_files: bool = True
rss_channel_title: str | None = 'The Data Packet'
rss_channel_description: str | None = None
rss_channel_link: str | None = None
rss_channel_image_url: str | None = 'https://the-data-packet.s3.us-west-2.amazonaws.com/the-data-packet/the_data_packet.png'
rss_channel_email: str | None = 'contact@thewintershadow.com'
max_rss_episodes: int = 500
http_timeout: int = 30
user_agent: str = 'The Data Packet/1.0 (+https://github.com/TheWinterShadow/The-Data-Packet)'
log_level: str = 'INFO'
log_dir: str = 'output/logs'
enable_jsonl_logging: bool = True
enable_s3_log_upload: bool = True
log_upload_interval: int = 3600
remove_logs_after_upload: bool = False
__post_init__() None[source]

Load configuration from environment variables.

validate_for_script_generation() None[source]

Validate configuration for script generation.

validate_for_audio_generation() None[source]

Validate configuration for audio generation.
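
Both validators are documented to fail when a required credential is missing. A self-contained sketch of that behavior (`SketchConfig` is hypothetical; the real checks may cover more fields and raise with different messages):

```python
class ConfigurationError(Exception):
    """Stand-in for the_data_packet.core.exceptions.ConfigurationError."""

class SketchConfig:
    def __init__(self, anthropic_api_key=None, gcs_bucket_name=None):
        self.anthropic_api_key = anthropic_api_key
        self.gcs_bucket_name = gcs_bucket_name

    def validate_for_script_generation(self):
        # Script generation requires the Anthropic API key
        if not self.anthropic_api_key:
            raise ConfigurationError("ANTHROPIC_API_KEY is required for script generation")

    def validate_for_audio_generation(self):
        # Audio generation requires the GCS bucket for long audio synthesis
        if not self.gcs_bucket_name:
            raise ConfigurationError("GCS_BUCKET_NAME is required for audio generation")

cfg = SketchConfig(anthropic_api_key="sk-ant-placeholder")
cfg.validate_for_script_generation()  # passes silently

try:
    cfg.validate_for_audio_generation()
except ConfigurationError as e:
    print(e)
```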

get_sources_for_category(category: str) List[str][source]

Get list of sources that support a given category.

Parameters:

category – Category name to check

Returns:

List of source names that support the category

get_categories_for_source(source: str) List[str][source]

Get list of categories supported by a given source.

Parameters:

source – Source name to check

Returns:

List of category names supported by the source
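
These two lookups can both be derived from source_category_mapping. A sketch with a hypothetical mapping (the real defaults live in the package):

```python
from typing import Dict, List

# Hypothetical mapping for illustration only
source_category_mapping: Dict[str, List[str]] = {
    "wired": ["ai", "security", "science"],
    "techcrunch": ["ai", "startups"],
}

def get_sources_for_category(category: str) -> List[str]:
    """Invert the source -> categories mapping, as documented above."""
    return [s for s, cats in source_category_mapping.items() if category in cats]

def get_categories_for_source(source: str) -> List[str]:
    """Direct lookup; unknown sources yield an empty list."""
    return source_category_mapping.get(source, [])

assert get_sources_for_category("ai") == ["wired", "techcrunch"]
assert get_categories_for_source("wired") == ["ai", "security", "science"]
assert get_sources_for_category("sports") == []
```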

to_dict() Dict[source]

Convert configuration to dictionary.

__init__(anthropic_api_key: str | None = None, elevenlabs_api_key: str | None = None, mongodb_username: str | None = None, mongodb_password: str | None = None, google_credentials_path: str | None = None, gcs_bucket_name: str | None = None, aws_access_key_id: str | None = None, aws_secret_access_key: str | None = None, aws_region: str = 'us-east-1', s3_bucket_name: str | None = None, grafana_loki_url: str | None = None, grafana_loki_username: str | None = None, grafana_loki_password: str | None = None, show_name: str = 'The Data Packet', episode_number: int | None = None, output_directory: Path = PosixPath('output'), max_articles_per_source: int = 1, article_sources: List[str] = <factory>, article_categories: List[str] = <factory>, source_category_mapping: Dict[str, ~typing.List[str]]=<factory>, claude_model: str = 'claude-sonnet-4-5-20250929', tts_model: str = 'google_cloud_tts', max_tokens: int = 3000, temperature: float = 0.7, male_voice: str = 'en-US-Studio-Q', female_voice: str = 'en-US-Studio-O', audio_sample_rate: int = 44100, generate_script: bool = True, generate_audio: bool = True, generate_rss: bool = True, save_intermediate_files: bool = False, cleanup_temp_files: bool = True, rss_channel_title: str | None = 'The Data Packet', rss_channel_description: str | None = None, rss_channel_link: str | None = None, rss_channel_image_url: str | None = 'https://the-data-packet.s3.us-west-2.amazonaws.com/the-data-packet/the_data_packet.png', rss_channel_email: str | None = 'contact@thewintershadow.com', max_rss_episodes: int = 500, http_timeout: int = 30, user_agent: str = 'The Data Packet/1.0 (+https://github.com/TheWinterShadow/The-Data-Packet)', log_level: str = 'INFO', log_dir: str = 'output/logs', enable_jsonl_logging: bool = True, enable_s3_log_upload: bool = True, log_upload_interval: int = 3600, remove_logs_after_upload: bool = False) None
the_data_packet.core.config.get_config(**overrides: Any) Config[source]

Get the global configuration instance.

Parameters:

**overrides – Configuration values to override

Returns:

Config instance

the_data_packet.core.config.reset_config() None[source]

Reset the global configuration instance.
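
Together, get_config and reset_config implement the global singleton pattern described earlier. A minimal sketch of that shape (the real override semantics may differ):

```python
_config = None  # module-level singleton slot

class Config:
    def __init__(self, **overrides):
        self.show_name = overrides.get("show_name", "The Data Packet")

def get_config(**overrides):
    """Return the global Config, creating it on first use.
    In this sketch, passing overrides forces a fresh instance."""
    global _config
    if _config is None or overrides:
        _config = Config(**overrides)
    return _config

def reset_config():
    """Drop the cached instance so the next get_config() rebuilds it."""
    global _config
    _config = None

a = get_config()
b = get_config()
assert a is b                      # same global instance

reset_config()
c = get_config(show_name="Test Show")
assert c is not a
assert c.show_name == "Test Show"
```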

Exceptions

Custom exceptions for The Data Packet.

This module provides a hierarchy of custom exceptions for different types of errors that can occur during podcast generation. All exceptions inherit from the base TheDataPacketError class for easy catching.

Exception Hierarchy:

TheDataPacketError
├── ConfigurationError (Missing/invalid API keys, settings)
├── NetworkError (HTTP requests, connectivity issues)
├── ScrapingError (Article extraction failures)
├── AIGenerationError (Claude API failures, content generation)
├── AudioGenerationError (ElevenLabs TTS failures, audio processing)
└── ValidationError (Invalid data, missing required fields)

Example

try:
    pipeline = PodcastPipeline()
    result = pipeline.run()
except ConfigurationError as e:
    logger.error(f"Configuration issue: {e}")
except AIGenerationError as e:
    logger.error(f"AI generation failed: {e}")
except TheDataPacketError as e:
    logger.error(f"Unexpected error: {e}")

exception the_data_packet.core.exceptions.TheDataPacketError[source]

Bases: Exception

Base exception for all The Data Packet errors.

This is the parent class for all custom exceptions in The Data Packet. Use this for general error handling when you want to catch any podcast generation related error.

message

Human-readable error description

Example

try:
    pipeline.run()
except TheDataPacketError as e:
    logger.error(f"Podcast generation failed: {e}")

exception the_data_packet.core.exceptions.ConfigurationError[source]

Bases: TheDataPacketError

Raised when there’s an issue with configuration.

This exception is raised when:
  • Required API keys are missing (ANTHROPIC_API_KEY, GOOGLE_API_KEY)
  • Invalid configuration values are provided
  • Required environment variables are not set
  • AWS credentials are missing for S3 uploads

Example

if not self.anthropic_api_key:
    raise ConfigurationError("Anthropic API key is required")

exception the_data_packet.core.exceptions.NetworkError[source]

Bases: TheDataPacketError

Raised when there’s a network-related error.

This exception is raised when:
  • HTTP requests fail (timeouts, connection errors)
  • RSS feeds are unreachable
  • API services are unavailable
  • S3 uploads fail due to network issues

Example

try:
    response = requests.get(url, timeout=30)
except requests.RequestException as e:
    raise NetworkError(f"Failed to fetch {url}: {e}")

exception the_data_packet.core.exceptions.ScrapingError[source]

Bases: TheDataPacketError

Raised when article scraping fails.

This exception is raised when:
  • RSS feed parsing fails
  • Article content extraction fails
  • Required article fields are missing
  • Content cleaning/processing fails

Example

if not article_content:
    raise ScrapingError(f"No content found for article: {url}")

exception the_data_packet.core.exceptions.AIGenerationError[source]

Bases: TheDataPacketError

Raised when AI content generation fails.

This exception is raised when:
  • The Claude API returns errors or rate limits
  • Generated content is invalid or too short
  • The AI refuses to process certain content types
  • Script generation fails after retries

Example

if response.status_code != 200:
    raise AIGenerationError(f"Claude API error: {response.text}")

exception the_data_packet.core.exceptions.AudioGenerationError[source]

Bases: TheDataPacketError

Raised when audio generation fails.

This exception is raised when:
  • The ElevenLabs TTS API returns errors
  • Audio file generation fails
  • Voice configuration is invalid
  • Audio processing or saving fails

Example

if not output_file.exists():
    raise AudioGenerationError("Audio file generation failed")

exception the_data_packet.core.exceptions.ValidationError[source]

Bases: TheDataPacketError

Raised when data validation fails.

This exception is raised when:
  • Invalid article categories are provided
  • Required data fields are missing or malformed
  • Configuration values are outside acceptable ranges
  • File paths are invalid or inaccessible

Example

if category not in self.supported_categories:
    raise ValidationError(f"Unsupported category: {category}")
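
The whole hierarchy above can be declared in a few lines, and catching the base class catches any member of it (a sketch matching the documented tree):

```python
class TheDataPacketError(Exception):
    """Base for all The Data Packet errors."""

class ConfigurationError(TheDataPacketError): ...
class NetworkError(TheDataPacketError): ...
class ScrapingError(TheDataPacketError): ...
class AIGenerationError(TheDataPacketError): ...
class AudioGenerationError(TheDataPacketError): ...
class ValidationError(TheDataPacketError): ...

# Catching the base class catches any subclass:
try:
    raise ScrapingError("no content")
except TheDataPacketError as e:
    assert isinstance(e, ScrapingError)
```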

Logging

Centralized logging configuration for The Data Packet.

This module provides unified logging setup for the entire application. It configures structured logging with proper formatters, reduces noise from third-party libraries, and provides a consistent interface for obtaining logger instances throughout the codebase.

Features:
  • Structured logging with timestamps and module names

  • Configurable log levels via environment variables

  • Noise reduction from third-party libraries

  • Consistent format across all modules

  • Console output optimized for Docker containers

Usage:

# In main application entry point
setup_logging()

# In any module
from the_data_packet.core.logging import get_logger
logger = get_logger(__name__)
logger.info("Processing started")

Log Levels:

DEBUG: Detailed debugging information
INFO: General operational messages
WARNING: Warning messages for recoverable issues
ERROR: Error messages for serious problems
CRITICAL: Critical errors that may cause shutdown

class the_data_packet.core.logging.JSONLHandler(log_dir: str = 'output/logs')[source]

Bases: Handler

Custom logging handler that writes log entries to JSONL files.

Features:
  • Writes structured JSON logs to .jsonl files
  • Automatically rotates files daily
  • Includes metadata such as timestamp, module, and level
  • Thread-safe file operations

__init__(log_dir: str = 'output/logs')[source]

Initialize the handler with the target directory for JSONL output.

emit(record: LogRecord) None[source]

Write log record as JSON line to daily log file.
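
A minimal illustration of the documented emit behavior: one JSON object per line, one file per day. The field names here are assumptions; the real handler may record more metadata:

```python
import datetime
import json
import logging
from pathlib import Path

class SketchJSONLHandler(logging.Handler):
    """Illustrative JSONL handler: one JSON object per line, one file per day."""

    def __init__(self, log_dir: str = "output/logs"):
        super().__init__()
        self.log_dir = Path(log_dir)
        self.log_dir.mkdir(parents=True, exist_ok=True)

    def emit(self, record: logging.LogRecord) -> None:
        # Structured metadata, as the Features list describes
        entry = {
            "timestamp": datetime.datetime.fromtimestamp(record.created).isoformat(),
            "level": record.levelname,
            "module": record.name,
            "message": record.getMessage(),
        }
        # Daily rotation: one file named after today's date
        day = datetime.date.today().isoformat()
        with open(self.log_dir / f"{day}.jsonl", "a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")
```

Note that `logging.Handler.handle` already serializes calls to `emit` with a per-handler lock, which is how the real handler can stay thread-safe.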

class the_data_packet.core.logging.S3LogUploader(log_dir: str = 'output/logs', upload_interval: int = 3600, remove_after_upload: bool = False)[source]

Bases: object

Background service to upload JSONL log files to S3.

Features:
  • Monitors the log directory for completed daily logs
  • Uploads files to S3 with structured naming
  • Optionally removes local files after upload
  • Runs in a background thread

__init__(log_dir: str = 'output/logs', upload_interval: int = 3600, remove_after_upload: bool = False)[source]

Initialize S3 log uploader.

Parameters:
  • log_dir – Directory containing JSONL log files

  • upload_interval – How often to check for files to upload (seconds)

  • remove_after_upload – Whether to delete local files after upload

start() None[source]

Start background upload service.

stop() None[source]

Stop background upload service.
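
The start/stop lifecycle follows a standard background-thread pattern. A sketch that counts loop iterations instead of uploading to S3:

```python
import threading
import time

class SketchUploader:
    """Background-thread pattern used by S3LogUploader (sketch only; the
    real class scans the log directory and uploads to S3 each cycle)."""

    def __init__(self, upload_interval: float = 0.05):
        self.upload_interval = upload_interval
        self.uploads = 0
        self._stop = threading.Event()
        self._thread = None

    def _run(self):
        while not self._stop.is_set():
            self.uploads += 1              # real code: check for files, upload
            self._stop.wait(self.upload_interval)

    def start(self):
        """Start the background upload loop as a daemon thread."""
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self):
        """Signal the loop to exit and wait for the thread to finish."""
        self._stop.set()
        if self._thread:
            self._thread.join()

u = SketchUploader()
u.start()
time.sleep(0.2)
u.stop()
assert u.uploads >= 1
```

Using `Event.wait(timeout)` instead of `time.sleep` lets `stop()` interrupt the interval immediately rather than waiting out a full cycle.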

the_data_packet.core.logging.setup_logging(log_level: str | None = None, enable_jsonl: bool | None = None, enable_s3_upload: bool | None = None, log_dir: str | None = None) None[source]

Configure application-wide logging settings.

Sets up structured logging with consistent formatting, configurable log levels, and noise reduction from third-party libraries. Should be called once at application startup.

Parameters:
  • log_level – Override log level (DEBUG, INFO, WARNING, ERROR, CRITICAL). If None, uses configuration default. Case insensitive.

  • enable_jsonl – Whether to enable JSONL file logging (default: from config)

  • enable_s3_upload – Whether to enable S3 upload of log files (default: from config)

  • log_dir – Directory for JSONL log files (default: from config)

Example

# Use default settings from config
setup_logging()

# Override to DEBUG level, disable S3 upload
setup_logging("DEBUG", enable_s3_upload=False)

# Console only (no JSONL files)
setup_logging(enable_jsonl=False, enable_s3_upload=False)

Note

This function uses force=True to override any existing logging configuration, ensuring consistent behavior in all environments. JSONL logs include structured metadata for log aggregation and analysis. S3 upload runs in background and uploads completed daily log files.
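
The force=True behavior and third-party noise reduction described in this Note can be sketched with the standard library alone (the logger names urllib3 and botocore are assumptions about which libraries get silenced):

```python
import logging
import sys

def sketch_setup_logging(log_level: str = "INFO") -> None:
    """Sketch of the documented behavior: force=True replaces any existing
    handlers, and noisy third-party loggers are turned down."""
    logging.basicConfig(
        level=getattr(logging, log_level.upper()),
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
        stream=sys.stdout,
        force=True,  # override any prior configuration, as the Note explains
    )
    for noisy in ("urllib3", "botocore"):  # assumed third-party logger names
        logging.getLogger(noisy).setLevel(logging.WARNING)

sketch_setup_logging("debug")  # case insensitive, as documented
assert logging.getLogger().level == logging.DEBUG
assert logging.getLogger("urllib3").level == logging.WARNING
```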

the_data_packet.core.logging.get_logger(name: str) Logger[source]

Get a named logger instance for a module.

This is the standard way to obtain logger instances throughout the application. Use __name__ as the logger name to get hierarchical logger names that match the module structure.

Parameters:

name – Logger name, typically __name__ from calling module

Returns:

Configured logger instance ready for use

Example

# Standard usage in any module
from the_data_packet.core.logging import get_logger
logger = get_logger(__name__)

# Usage examples
logger.info("Starting article collection")
logger.warning("Article content is short: %d chars", len(content))
logger.error("Failed to generate script: %s", str(error))

# With structured data (for log aggregation)
logger.info("Article processed", extra={
    "article_id": article.id,
    "processing_time": elapsed_seconds,
})

Note

Logger names follow Python’s hierarchical naming convention. For example, ‘the_data_packet.sources.wired’ will inherit configuration from ‘the_data_packet.sources’ and ‘the_data_packet’.
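
The hierarchical inheritance this Note describes can be verified with the standard logging module directly:

```python
import logging

# Configure only the top-level package logger...
logging.getLogger("the_data_packet").setLevel(logging.WARNING)

# ...and a deeply nested child inherits its effective level.
child = logging.getLogger("the_data_packet.sources.wired")
assert child.getEffectiveLevel() == logging.WARNING

# Intermediate names that were never configured are skipped: the child's
# parent resolves to the nearest ancestor logger that actually exists.
assert child.parent.name == "the_data_packet"
```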

the_data_packet.core.logging.stop_s3_uploader() None[source]

Stop the S3 log uploader service gracefully.

Should be called during application shutdown to ensure any pending uploads complete properly.

the_data_packet.core.logging.upload_current_logs() None[source]

Manually trigger upload of completed log files to S3.

Useful for testing or forcing immediate upload of logs. Only uploads files from previous days to avoid interfering with active log files.

the_data_packet.core.logging.upload_current_day_log(config: Config) None[source]

Upload the current day’s log file to S3.

This function specifically uploads today’s log file, which is useful at the end of a pipeline run to ensure the current session’s logs are archived alongside generated files.