Core Module

The core module provides fundamental functionality for The Data Packet, including configuration management, exception handling, and logging.

Configuration

Unified configuration system for The Data Packet.

This module provides centralized configuration management with support for:
  • Environment variable loading
  • Type-safe configuration with validation
  • Default values for all settings
  • Global configuration singleton pattern
  • Override capabilities for testing

The configuration system follows these priorities (highest to lowest):
  1. Direct parameter overrides
  2. Environment variables
  3. Default values
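
The priority order can be sketched as a small resolver (illustrative only; the function name `resolve` is not part of the package):

```python
import os

def resolve(name: str, env_var: str, default, overrides: dict):
    """Resolve a setting with the documented priority:
    direct override > environment variable > default value."""
    if name in overrides:
        return overrides[name]
    if env_var in os.environ:
        return os.environ[env_var]
    return default

# Default wins when nothing else is set
assert resolve("show_name", "SHOW_NAME", "The Data Packet", {}) == "The Data Packet"

# An environment variable beats the default
os.environ["SHOW_NAME"] = "Env Show"
assert resolve("show_name", "SHOW_NAME", "The Data Packet", {}) == "Env Show"

# A direct override beats both
assert resolve("show_name", "SHOW_NAME", "The Data Packet",
               {"show_name": "Override Show"}) == "Override Show"
```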

Configuration Categories:
API Keys:
  • Anthropic API key for Claude script generation

  • ElevenLabs API key for TTS audio generation

  • AWS credentials for S3 storage

Podcast Settings:
  • Show metadata (name, episode numbers)

  • Audio preferences (voices, sample rate)

  • RSS feed configuration

Processing Options:
  • Which generation steps to run

  • Article collection preferences

  • Output and cleanup settings

Network Settings:
  • HTTP timeouts and user agents

  • Retry configurations

  • Rate limiting settings

Usage:

# Get default configuration (loads from environment)
config = get_config()

# Override specific values
config = get_config(
    show_name="My Custom Podcast",
    max_articles_per_source=3,
)

# Access configuration values
if config.anthropic_api_key:
    generator = ScriptGenerator(config.anthropic_api_key)

Environment Variables:
Required for script generation:

ANTHROPIC_API_KEY - Claude API key

Required for audio generation:

GCS_BUCKET_NAME - Google Cloud Storage bucket for long audio synthesis
GOOGLE_APPLICATION_CREDENTIALS - Path to service account JSON (optional if using default credentials)

Legacy (deprecated):

ELEVENLABS_API_KEY - ElevenLabs API key (replaced by Google Cloud TTS)

Optional for S3 uploads:

S3_BUCKET_NAME - S3 bucket for hosting
AWS_ACCESS_KEY_ID - AWS access key
AWS_SECRET_ACCESS_KEY - AWS secret key
AWS_REGION - AWS region (default: us-east-1)

Optional for Grafana Loki log aggregation:

GRAFANA_LOKI_URL - Loki endpoint URL
GRAFANA_LOKI_USERNAME - Loki authentication username
GRAFANA_LOKI_PASSWORD - Loki authentication password/API key

Optional customizations:

SHOW_NAME - Podcast name override
LOG_LEVEL - Logging level (DEBUG/INFO/WARNING/ERROR)
MAX_ARTICLES - Max articles per source

Logging configuration:

LOG_DIRECTORY - Directory for JSONL log files (default: output/logs)
ENABLE_JSONL_LOGGING - Enable JSONL file logging (true/false, default: true)
ENABLE_S3_LOG_UPLOAD - Enable S3 upload of logs (true/false, default: true)
LOG_UPLOAD_INTERVAL - Upload interval in seconds (default: 3600)
REMOVE_LOGS_AFTER_UPLOAD - Remove local logs after S3 upload (true/false, default: false)

class the_data_packet.core.config.Config(anthropic_api_key: str | None = None, elevenlabs_api_key: str | None = None, mongodb_username: str | None = None, mongodb_password: str | None = None, google_credentials_path: str | None = None, gcs_bucket_name: str | None = None, aws_access_key_id: str | None = None, aws_secret_access_key: str | None = None, aws_region: str = 'us-east-1', s3_bucket_name: str | None = None, grafana_loki_url: str | None = None, grafana_loki_username: str | None = None, grafana_loki_password: str | None = None, show_name: str = 'The Data Packet', episode_number: int | None = None, output_directory: Path = PosixPath('output'), max_articles_per_source: int = 1, article_sources: List[str] = <factory>, article_categories: List[str] = <factory>, source_category_mapping: Dict[str, ~typing.List[str]]=<factory>, claude_model: str = 'claude-sonnet-4-5-20250929', tts_model: str = 'google_cloud_tts', max_tokens: int = 3000, temperature: float = 0.7, male_voice: str = 'en-US-Studio-Q', female_voice: str = 'en-US-Studio-O', audio_sample_rate: int = 44100, generate_script: bool = True, generate_audio: bool = True, generate_rss: bool = True, save_intermediate_files: bool = False, cleanup_temp_files: bool = True, rss_channel_title: str | None = 'The Data Packet', rss_channel_description: str | None = None, rss_channel_link: str | None = None, rss_channel_image_url: str | None = 'https://the-data-packet.s3.us-west-2.amazonaws.com/the-data-packet/the_data_packet.png', rss_channel_email: str | None = 'contact@thewintershadow.com', max_rss_episodes: int = 500, http_timeout: int = 30, user_agent: str = 'The Data Packet/1.0 (+https://github.com/TheWinterShadow/The-Data-Packet)', log_level: str = 'INFO', log_dir: str = 'output/logs', enable_jsonl_logging: bool = True, enable_s3_log_upload: bool = True, log_upload_interval: int = 3600, remove_logs_after_upload: bool = False)[source]

Bases: object

Unified configuration for The Data Packet with environment variable support.

This class provides type-safe configuration management with automatic environment variable loading and validation. All fields have sensible defaults and can be overridden via environment variables or direct parameter passing.

API Keys
anthropic_api_key: Anthropic API key for Claude script generation.

Required for script generation. Loaded from ANTHROPIC_API_KEY.

elevenlabs_api_key: [DEPRECATED] ElevenLabs API key for legacy TTS.

Replaced by Google Cloud TTS. Loaded from ELEVENLABS_API_KEY.

mongodb_username: MongoDB username for episode tracking and article deduplication.

Optional. Loaded from MONGODB_USERNAME.

mongodb_password: MongoDB password for episode tracking and article deduplication.

Optional. Loaded from MONGODB_PASSWORD.

Google Cloud Configuration
google_credentials_path: Path to Google Cloud service account JSON file.

Optional if using default application credentials. Loaded from GOOGLE_APPLICATION_CREDENTIALS.

gcs_bucket_name: Google Cloud Storage bucket for long audio synthesis output.

Required for audio generation. Loaded from GCS_BUCKET_NAME.

AWS Configuration

aws_access_key_id: AWS access key for S3 uploads. Loaded from AWS_ACCESS_KEY_ID.
aws_secret_access_key: AWS secret key for S3 uploads. Loaded from AWS_SECRET_ACCESS_KEY.
aws_region: AWS region for S3 operations. Default: us-east-1.
s3_bucket_name: S3 bucket name for hosting files. Loaded from S3_BUCKET_NAME.

Grafana Loki Configuration

grafana_loki_url: Loki endpoint URL for log aggregation. Loaded from GRAFANA_LOKI_URL.
grafana_loki_username: Username for Loki authentication. Loaded from GRAFANA_LOKI_USERNAME.
grafana_loki_password: Password/API key for Loki authentication. Loaded from GRAFANA_LOKI_PASSWORD.

Podcast Configuration

show_name: Podcast show name. Used in RSS feeds and file names.
episode_number: Episode number for RSS feeds. Auto-generated if None.
output_directory: Local directory for generated files.

Article Collection

max_articles_per_source: Maximum articles to collect per source.
article_sources: List of news sources to use (wired, techcrunch).
article_categories: List of categories to fetch from each source.
source_category_mapping: Maps each source to its supported categories.

AI Generation Settings

claude_model: Claude model name for script generation.
tts_model: Text-to-speech service type (now "google_cloud_tts").
max_tokens: Maximum tokens for Claude API calls.
temperature: AI generation temperature (0.0-1.0; lower = more consistent).

Audio Settings (Google Cloud Studio multi-speaker voices)

male_voice: First speaker voice name (Alex, the male narrator).
female_voice: Second speaker voice name (Sam, the female narrator).
audio_sample_rate: Audio sample rate in Hz.

Processing Options

generate_script: Whether to generate podcast scripts.
generate_audio: Whether to generate audio files.
generate_rss: Whether to generate RSS feeds.
save_intermediate_files: Whether to keep intermediate processing files.
cleanup_temp_files: Whether to clean up temporary files after processing.

RSS Feed Configuration

rss_channel_title: RSS channel title.
rss_channel_description: RSS channel description.
rss_channel_link: RSS channel website link.
rss_channel_image_url: RSS channel artwork URL.
rss_channel_email: Contact email for the podcast.
max_rss_episodes: Maximum episodes to keep in the RSS feed.

Network Settings

http_timeout: HTTP request timeout in seconds.
user_agent: User agent string for HTTP requests.
log_level: Logging level (DEBUG/INFO/WARNING/ERROR/CRITICAL).

Example

# Default configuration with environment variables
config = Config()

# Custom configuration
config = Config(
    show_name="Tech News Daily",
    max_articles_per_source=3,
    male_voice="en-US-Studio-Q",
    female_voice="en-US-Studio-O",
)

# Validate before use
config.validate_for_script_generation()
config.validate_for_audio_generation()

anthropic_api_key: str | None = None
elevenlabs_api_key: str | None = None
mongodb_username: str | None = None
mongodb_password: str | None = None
google_credentials_path: str | None = None
gcs_bucket_name: str | None = None
aws_access_key_id: str | None = None
aws_secret_access_key: str | None = None
aws_region: str = 'us-east-1'
s3_bucket_name: str | None = None
grafana_loki_url: str | None = None
grafana_loki_username: str | None = None
grafana_loki_password: str | None = None
show_name: str = 'The Data Packet'
episode_number: int | None = None
output_directory: Path = PosixPath('output')
max_articles_per_source: int = 1
article_sources: List[str]
article_categories: List[str]
source_category_mapping: Dict[str, List[str]]
claude_model: str = 'claude-sonnet-4-5-20250929'
tts_model: str = 'google_cloud_tts'
max_tokens: int = 3000
temperature: float = 0.7
male_voice: str = 'en-US-Studio-Q'
female_voice: str = 'en-US-Studio-O'
audio_sample_rate: int = 44100
generate_script: bool = True
generate_audio: bool = True
generate_rss: bool = True
save_intermediate_files: bool = False
cleanup_temp_files: bool = True
rss_channel_title: str | None = 'The Data Packet'
rss_channel_description: str | None = None
rss_channel_link: str | None = None
rss_channel_image_url: str | None = 'https://the-data-packet.s3.us-west-2.amazonaws.com/the-data-packet/the_data_packet.png'
rss_channel_email: str | None = 'contact@thewintershadow.com'
max_rss_episodes: int = 500
http_timeout: int = 30
user_agent: str = 'The Data Packet/1.0 (+https://github.com/TheWinterShadow/The-Data-Packet)'
log_level: str = 'INFO'
log_dir: str = 'output/logs'
enable_jsonl_logging: bool = True
enable_s3_log_upload: bool = True
log_upload_interval: int = 3600
remove_logs_after_upload: bool = False
__post_init__() None[source]

Load configuration from environment variables.

validate_for_script_generation() None[source]

Validate configuration for script generation.

validate_for_audio_generation() None[source]

Validate configuration for audio generation.
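
Both validators are documented to fail when a required credential is missing. A self-contained sketch of that behavior (`SketchConfig` is hypothetical; the real checks may cover more fields and raise with different messages):

```python
class ConfigurationError(Exception):
    """Stand-in for the_data_packet.core.exceptions.ConfigurationError."""

class SketchConfig:
    def __init__(self, anthropic_api_key=None, gcs_bucket_name=None):
        self.anthropic_api_key = anthropic_api_key
        self.gcs_bucket_name = gcs_bucket_name

    def validate_for_script_generation(self):
        # Script generation requires the Anthropic API key
        if not self.anthropic_api_key:
            raise ConfigurationError("ANTHROPIC_API_KEY is required for script generation")

    def validate_for_audio_generation(self):
        # Audio generation requires the GCS bucket for long audio synthesis
        if not self.gcs_bucket_name:
            raise ConfigurationError("GCS_BUCKET_NAME is required for audio generation")

cfg = SketchConfig(anthropic_api_key="sk-ant-placeholder")
cfg.validate_for_script_generation()  # passes silently

try:
    cfg.validate_for_audio_generation()
except ConfigurationError as e:
    print(e)
```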

get_sources_for_category(category: str) List[str][source]

Get list of sources that support a given category.

Parameters:

category – Category name to check

Returns:

List of source names that support the category

get_categories_for_source(source: str) List[str][source]

Get list of categories supported by a given source.

Parameters:

source – Source name to check

Returns:

List of category names supported by the source
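
These two lookups can both be derived from source_category_mapping. A sketch with a hypothetical mapping (the real defaults live in the package):

```python
from typing import Dict, List

# Hypothetical mapping for illustration only
source_category_mapping: Dict[str, List[str]] = {
    "wired": ["ai", "security", "science"],
    "techcrunch": ["ai", "startups"],
}

def get_sources_for_category(category: str) -> List[str]:
    """Invert the source -> categories mapping, as documented above."""
    return [s for s, cats in source_category_mapping.items() if category in cats]

def get_categories_for_source(source: str) -> List[str]:
    """Direct lookup; unknown sources yield an empty list."""
    return source_category_mapping.get(source, [])

assert get_sources_for_category("ai") == ["wired", "techcrunch"]
assert get_categories_for_source("wired") == ["ai", "security", "science"]
assert get_sources_for_category("sports") == []
```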

to_dict() Dict[source]

Convert configuration to dictionary.

__init__(anthropic_api_key: str | None = None, elevenlabs_api_key: str | None = None, mongodb_username: str | None = None, mongodb_password: str | None = None, google_credentials_path: str | None = None, gcs_bucket_name: str | None = None, aws_access_key_id: str | None = None, aws_secret_access_key: str | None = None, aws_region: str = 'us-east-1', s3_bucket_name: str | None = None, grafana_loki_url: str | None = None, grafana_loki_username: str | None = None, grafana_loki_password: str | None = None, show_name: str = 'The Data Packet', episode_number: int | None = None, output_directory: Path = PosixPath('output'), max_articles_per_source: int = 1, article_sources: List[str] = <factory>, article_categories: List[str] = <factory>, source_category_mapping: Dict[str, ~typing.List[str]]=<factory>, claude_model: str = 'claude-sonnet-4-5-20250929', tts_model: str = 'google_cloud_tts', max_tokens: int = 3000, temperature: float = 0.7, male_voice: str = 'en-US-Studio-Q', female_voice: str = 'en-US-Studio-O', audio_sample_rate: int = 44100, generate_script: bool = True, generate_audio: bool = True, generate_rss: bool = True, save_intermediate_files: bool = False, cleanup_temp_files: bool = True, rss_channel_title: str | None = 'The Data Packet', rss_channel_description: str | None = None, rss_channel_link: str | None = None, rss_channel_image_url: str | None = 'https://the-data-packet.s3.us-west-2.amazonaws.com/the-data-packet/the_data_packet.png', rss_channel_email: str | None = 'contact@thewintershadow.com', max_rss_episodes: int = 500, http_timeout: int = 30, user_agent: str = 'The Data Packet/1.0 (+https://github.com/TheWinterShadow/The-Data-Packet)', log_level: str = 'INFO', log_dir: str = 'output/logs', enable_jsonl_logging: bool = True, enable_s3_log_upload: bool = True, log_upload_interval: int = 3600, remove_logs_after_upload: bool = False) None
the_data_packet.core.config.get_config(**overrides: Any) Config[source]

Get the global configuration instance.

Parameters:

**overrides – Configuration values to override

Returns:

Config instance

the_data_packet.core.config.reset_config() None[source]

Reset the global configuration instance.
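
Together, get_config and reset_config implement the global singleton pattern described earlier. A minimal sketch of that shape (the real override semantics may differ):

```python
_config = None  # module-level singleton slot

class Config:
    def __init__(self, **overrides):
        self.show_name = overrides.get("show_name", "The Data Packet")

def get_config(**overrides):
    """Return the global Config, creating it on first use.
    In this sketch, passing overrides forces a fresh instance."""
    global _config
    if _config is None or overrides:
        _config = Config(**overrides)
    return _config

def reset_config():
    """Drop the cached instance so the next get_config() rebuilds it."""
    global _config
    _config = None

a = get_config()
b = get_config()
assert a is b                      # same global instance

reset_config()
c = get_config(show_name="Test Show")
assert c is not a
assert c.show_name == "Test Show"
```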

Exceptions

Custom exceptions for The Data Packet.

This module provides a hierarchy of custom exceptions for different types of errors that can occur during podcast generation. All exceptions inherit from the base TheDataPacketError class for easy catching.

Exception Hierarchy:

TheDataPacketError
├── ConfigurationError (Missing/invalid API keys, settings)
├── NetworkError (HTTP requests, connectivity issues)
├── ScrapingError (Article extraction failures)
├── AIGenerationError (Claude API failures, content generation)
├── AudioGenerationError (ElevenLabs TTS failures, audio processing)
└── ValidationError (Invalid data, missing required fields)

Example

try:
    pipeline = PodcastPipeline()
    result = pipeline.run()
except ConfigurationError as e:
    logger.error(f"Configuration issue: {e}")
except AIGenerationError as e:
    logger.error(f"AI generation failed: {e}")
except TheDataPacketError as e:
    logger.error(f"Unexpected error: {e}")

exception the_data_packet.core.exceptions.TheDataPacketError[source]

Bases: Exception

Base exception for all The Data Packet errors.

This is the parent class for all custom exceptions in The Data Packet. Use this for general error handling when you want to catch any podcast generation related error.

message

Human-readable error description

Example

try:
    pipeline.run()
except TheDataPacketError as e:
    logger.error(f"Podcast generation failed: {e}")

exception the_data_packet.core.exceptions.ConfigurationError[source]

Bases: TheDataPacketError

Raised when there’s an issue with configuration.

This exception is raised when:
  • Required API keys are missing (ANTHROPIC_API_KEY, GOOGLE_API_KEY)
  • Invalid configuration values are provided
  • Required environment variables are not set
  • AWS credentials are missing for S3 uploads

Example

if not self.anthropic_api_key:
    raise ConfigurationError("Anthropic API key is required")

exception the_data_packet.core.exceptions.NetworkError[source]

Bases: TheDataPacketError

Raised when there’s a network-related error.

This exception is raised when:
  • HTTP requests fail (timeouts, connection errors)
  • RSS feeds are unreachable
  • API services are unavailable
  • S3 uploads fail due to network issues

Example

try:
    response = requests.get(url, timeout=30)
except requests.RequestException as e:
    raise NetworkError(f"Failed to fetch {url}: {e}")

exception the_data_packet.core.exceptions.ScrapingError[source]

Bases: TheDataPacketError

Raised when article scraping fails.

This exception is raised when:
  • RSS feed parsing fails
  • Article content extraction fails
  • Required article fields are missing
  • Content cleaning/processing fails

Example

if not article_content:
    raise ScrapingError(f"No content found for article: {url}")

exception the_data_packet.core.exceptions.AIGenerationError[source]

Bases: TheDataPacketError

Raised when AI content generation fails.

This exception is raised when:
  • The Claude API returns errors or rate limits
  • Generated content is invalid or too short
  • The AI refuses to process certain content types
  • Script generation fails after retries

Example

if response.status_code != 200:
    raise AIGenerationError(f"Claude API error: {response.text}")

exception the_data_packet.core.exceptions.AudioGenerationError[source]

Bases: TheDataPacketError

Raised when audio generation fails.

This exception is raised when:
  • The ElevenLabs TTS API returns errors
  • Audio file generation fails
  • Voice configuration is invalid
  • Audio processing or saving fails

Example

if not output_file.exists():
    raise AudioGenerationError("Audio file generation failed")

exception the_data_packet.core.exceptions.ValidationError[source]

Bases: TheDataPacketError

Raised when data validation fails.

This exception is raised when:
  • Invalid article categories are provided
  • Required data fields are missing or malformed
  • Configuration values are outside acceptable ranges
  • File paths are invalid or inaccessible

Example

if category not in self.supported_categories:
    raise ValidationError(f"Unsupported category: {category}")
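
The whole hierarchy above can be declared in a few lines, and catching the base class catches any member of it (a sketch matching the documented tree):

```python
class TheDataPacketError(Exception):
    """Base for all The Data Packet errors."""

class ConfigurationError(TheDataPacketError): ...
class NetworkError(TheDataPacketError): ...
class ScrapingError(TheDataPacketError): ...
class AIGenerationError(TheDataPacketError): ...
class AudioGenerationError(TheDataPacketError): ...
class ValidationError(TheDataPacketError): ...

# Catching the base class catches any subclass:
try:
    raise ScrapingError("no content")
except TheDataPacketError as e:
    assert isinstance(e, ScrapingError)
```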

Logging

Centralized logging configuration for The Data Packet.

This module provides unified logging setup for the entire application. It configures structured logging with proper formatters, reduces noise from third-party libraries, and provides a consistent interface for obtaining logger instances throughout the codebase.

Features:
  • Structured logging with timestamps and module names

  • Configurable log levels via environment variables

  • Noise reduction from third-party libraries

  • Consistent format across all modules

  • Console output optimized for Docker containers

Usage:

# In main application entry point
setup_logging()

# In any module
from the_data_packet.core.logging import get_logger
logger = get_logger(__name__)
logger.info("Processing started")

Log Levels:

DEBUG: Detailed debugging information
INFO: General operational messages
WARNING: Warning messages for recoverable issues
ERROR: Error messages for serious problems
CRITICAL: Critical errors that may cause shutdown

class the_data_packet.core.logging.JSONLHandler(log_dir: str = 'output/logs')[source]

Bases: Handler

Custom logging handler that writes log entries to JSONL files.

Features:
  • Writes structured JSON logs to .jsonl files
  • Automatically rotates files daily
  • Includes metadata such as timestamp, module, and level
  • Thread-safe file operations

__init__(log_dir: str = 'output/logs')[source]

Initialize the handler with the target directory for JSONL output.

emit(record: LogRecord) None[source]

Write log record as JSON line to daily log file.
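
A minimal illustration of the documented emit behavior: one JSON object per line, one file per day. The field names here are assumptions; the real handler may record more metadata:

```python
import datetime
import json
import logging
from pathlib import Path

class SketchJSONLHandler(logging.Handler):
    """Illustrative JSONL handler: one JSON object per line, one file per day."""

    def __init__(self, log_dir: str = "output/logs"):
        super().__init__()
        self.log_dir = Path(log_dir)
        self.log_dir.mkdir(parents=True, exist_ok=True)

    def emit(self, record: logging.LogRecord) -> None:
        # Structured metadata, as the Features list describes
        entry = {
            "timestamp": datetime.datetime.fromtimestamp(record.created).isoformat(),
            "level": record.levelname,
            "module": record.name,
            "message": record.getMessage(),
        }
        # Daily rotation: one file named after today's date
        day = datetime.date.today().isoformat()
        with open(self.log_dir / f"{day}.jsonl", "a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")
```

Note that `logging.Handler.handle` already serializes calls to `emit` with a per-handler lock, which is how the real handler can stay thread-safe.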

class the_data_packet.core.logging.S3LogUploader(log_dir: str = 'output/logs', upload_interval: int = 3600, remove_after_upload: bool = False)[source]

Bases: object

Background service to upload JSONL log files to S3.

Features:
  • Monitors the log directory for completed daily logs
  • Uploads files to S3 with structured naming
  • Optionally removes local files after upload
  • Runs in a background thread

__init__(log_dir: str = 'output/logs', upload_interval: int = 3600, remove_after_upload: bool = False)[source]

Initialize S3 log uploader.

Parameters:
  • log_dir – Directory containing JSONL log files

  • upload_interval – How often to check for files to upload (seconds)

  • remove_after_upload – Whether to delete local files after upload

start() None[source]

Start background upload service.

stop() None[source]

Stop background upload service.
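
The start/stop lifecycle follows a standard background-thread pattern. A sketch that counts loop iterations instead of uploading to S3:

```python
import threading
import time

class SketchUploader:
    """Background-thread pattern used by S3LogUploader (sketch only; the
    real class scans the log directory and uploads to S3 each cycle)."""

    def __init__(self, upload_interval: float = 0.05):
        self.upload_interval = upload_interval
        self.uploads = 0
        self._stop = threading.Event()
        self._thread = None

    def _run(self):
        while not self._stop.is_set():
            self.uploads += 1              # real code: check for files, upload
            self._stop.wait(self.upload_interval)

    def start(self):
        """Start the background upload loop as a daemon thread."""
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self):
        """Signal the loop to exit and wait for the thread to finish."""
        self._stop.set()
        if self._thread:
            self._thread.join()

u = SketchUploader()
u.start()
time.sleep(0.2)
u.stop()
assert u.uploads >= 1
```

Using `Event.wait(timeout)` instead of `time.sleep` lets `stop()` interrupt the interval immediately rather than waiting out a full cycle.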

the_data_packet.core.logging.setup_logging(log_level: str | None = None, enable_jsonl: bool | None = None, enable_s3_upload: bool | None = None, log_dir: str | None = None) None[source]

Configure application-wide logging settings.

Sets up structured logging with consistent formatting, configurable log levels, and noise reduction from third-party libraries. Should be called once at application startup.

Parameters:
  • log_level – Override log level (DEBUG, INFO, WARNING, ERROR, CRITICAL). If None, uses configuration default. Case insensitive.

  • enable_jsonl – Whether to enable JSONL file logging (default: from config)

  • enable_s3_upload – Whether to enable S3 upload of log files (default: from config)

  • log_dir – Directory for JSONL log files (default: from config)

Example

# Use default settings from config
setup_logging()

# Override to DEBUG level, disable S3 upload
setup_logging("DEBUG", enable_s3_upload=False)

# Console only (no JSONL files)
setup_logging(enable_jsonl=False, enable_s3_upload=False)

Note

This function uses force=True to override any existing logging configuration, ensuring consistent behavior in all environments. JSONL logs include structured metadata for log aggregation and analysis. S3 upload runs in background and uploads completed daily log files.
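
The force=True behavior and third-party noise reduction described in this Note can be sketched with the standard library alone (the logger names urllib3 and botocore are assumptions about which libraries get silenced):

```python
import logging
import sys

def sketch_setup_logging(log_level: str = "INFO") -> None:
    """Sketch of the documented behavior: force=True replaces any existing
    handlers, and noisy third-party loggers are turned down."""
    logging.basicConfig(
        level=getattr(logging, log_level.upper()),
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
        stream=sys.stdout,
        force=True,  # override any prior configuration, as the Note explains
    )
    for noisy in ("urllib3", "botocore"):  # assumed third-party logger names
        logging.getLogger(noisy).setLevel(logging.WARNING)

sketch_setup_logging("debug")  # case insensitive, as documented
assert logging.getLogger().level == logging.DEBUG
assert logging.getLogger("urllib3").level == logging.WARNING
```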

the_data_packet.core.logging.get_logger(name: str) Logger[source]

Get a named logger instance for a module.

This is the standard way to obtain logger instances throughout the application. Use __name__ as the logger name to get hierarchical logger names that match the module structure.

Parameters:

name – Logger name, typically __name__ from calling module

Returns:

Configured logger instance ready for use

Example

# Standard usage in any module
from the_data_packet.core.logging import get_logger
logger = get_logger(__name__)

# Usage examples
logger.info("Starting article collection")
logger.warning("Article content is short: %d chars", len(content))
logger.error("Failed to generate script: %s", str(error))

# With structured data (for log aggregation)
logger.info("Article processed", extra={
    "article_id": article.id,
    "processing_time": elapsed_seconds,
})

Note

Logger names follow Python’s hierarchical naming convention. For example, ‘the_data_packet.sources.wired’ will inherit configuration from ‘the_data_packet.sources’ and ‘the_data_packet’.
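
The hierarchical inheritance this Note describes can be verified with the standard logging module directly:

```python
import logging

# Configure only the top-level package logger...
logging.getLogger("the_data_packet").setLevel(logging.WARNING)

# ...and a deeply nested child inherits its effective level.
child = logging.getLogger("the_data_packet.sources.wired")
assert child.getEffectiveLevel() == logging.WARNING

# Intermediate names that were never configured are skipped: the child's
# parent resolves to the nearest ancestor logger that actually exists.
assert child.parent.name == "the_data_packet"
```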

the_data_packet.core.logging.stop_s3_uploader() None[source]

Stop the S3 log uploader service gracefully.

Should be called during application shutdown to ensure any pending uploads complete properly.

the_data_packet.core.logging.upload_current_logs() None[source]

Manually trigger upload of completed log files to S3.

Useful for testing or forcing immediate upload of logs. Only uploads files from previous days to avoid interfering with active log files.

the_data_packet.core.logging.upload_current_day_log(config: Config) None[source]

Upload the current day’s log file to S3.

This function specifically uploads today’s log file, which is useful at the end of a pipeline run to ensure the current session’s logs are archived alongside generated files.