
Architecture

The Data Packet is structured as a layered Python application. Each layer has a single responsibility and communicates through well-defined interfaces.


Layer diagram

graph TB
    CLI["CLI Interface<br/><code>cli.py</code>"]
    WF["Workflows<br/>PodcastPipeline · PodcastResult"]
    SRC["Sources<br/>WiredSource · TechCrunchSource"]
    GEN["Generation<br/>ScriptGenerator · AudioGenerator · RSSGenerator"]
    UTIL["Utils<br/>HTTPClient · S3Storage · MongoDBClient · LokiClient"]
    CORE["Core<br/>Config · Exceptions · Logging"]

    CLI --> WF
    WF --> SRC
    WF --> GEN
    WF --> UTIL
    SRC --> UTIL
    GEN --> UTIL
    SRC --> CORE
    GEN --> CORE
    UTIL --> CORE
    WF --> CORE

Module reference

core/ — Foundation services

| Module | Class / Function | Responsibility |
|---|---|---|
| `config.py` | `Config`, `get_config()` | Unified config from env vars + overrides |
| `exceptions.py` | `TheDataPacketError` and subclasses | Custom exception hierarchy |
| `logging.py` | `setup_logging()`, `get_logger()` | JSONL logging with S3 upload |

Exception hierarchy:

graph TD
    Base["TheDataPacketError"]
    Base --> ConfigurationError
    Base --> NetworkError
    Base --> ScrapingError
    Base --> AIGenerationError
    Base --> AudioGenerationError
    Base --> ValidationError
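
In Python terms, a hierarchy like the one above lets callers catch every package error with a single `except` clause while still distinguishing specific failures. The following is a minimal sketch: only the class names come from the diagram; the docstring and the example message are assumptions.

```python
class TheDataPacketError(Exception):
    """Base class; catching it catches every error listed in the diagram."""


class ConfigurationError(TheDataPacketError): ...
class NetworkError(TheDataPacketError): ...
class ScrapingError(TheDataPacketError): ...
class AIGenerationError(TheDataPacketError): ...
class AudioGenerationError(TheDataPacketError): ...
class ValidationError(TheDataPacketError): ...


# Usage: one handler covers all subclasses, and the concrete type is still available.
try:
    raise NetworkError("timed out fetching feed")  # illustrative message
except TheDataPacketError as exc:
    caught = type(exc).__name__
```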

sources/ — Article collection

| Module | Class | Responsibility |
|---|---|---|
| `base.py` | `Article`, `ArticleSource` | Dataclass and abstract base for all sources |
| `wired.py` | `WiredSource` | RSS-based scraping of Wired.com |
| `techcrunch.py` | `TechCrunchSource` | Article collection from TechCrunch |

ArticleSource is an abstract base class. Adding a new news source means subclassing it and implementing collect_articles(). See Contributing for a step-by-step guide.
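
As a rough illustration of that subclassing pattern, the sketch below defines stand-in versions of `Article` and `ArticleSource`. The `Article` fields, the `collect_articles()` signature, and the `ExampleSource` class are assumptions made for this example; the real definitions in `base.py` may differ.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Article:
    # Field names are assumed for illustration.
    title: str
    url: str
    content: str = ""


class ArticleSource(ABC):
    @abstractmethod
    def collect_articles(self) -> list[Article]:
        """Return the articles this source contributes to an episode."""


class ExampleSource(ArticleSource):
    """Hypothetical source that returns a fixed article instead of scraping."""

    def collect_articles(self) -> list[Article]:
        return [Article(title="Example headline", url="https://example.com/story")]


articles = ExampleSource().collect_articles()
```

Because `collect_articles()` is abstract, instantiating `ArticleSource` directly, or a subclass that forgets to implement it, raises `TypeError`.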

Supported categories:

| Category | URL path |
|---|---|
| `security` | `/category/security/` |
| `science` | `/category/science/` |
| `ai` | `/category/artificial-intelligence/` |

| Category | URL path |
|---|---|
| `ai` | `/category/artificial-intelligence/` |
| `security` | `/category/security/` |

generation/ — Content creation

| Module | Class | Responsibility |
|---|---|---|
| `script.py` | `ScriptGenerator` | Claude API → structured dialogue script |
| `audio.py` | `AudioGenerator` | Google Cloud TTS Long Audio → `.wav` |
| `rss.py` | `RSSGenerator` | RSS 2.0 feed XML generation |

Script generation flow:

sequenceDiagram
    participant WF as PodcastPipeline
    participant SG as ScriptGenerator
    participant CL as Claude API

    WF->>SG: generate_script(articles)
    loop For each article
        SG->>CL: generate_segment(article)
        CL-->>SG: dialogue segment
    end
    SG->>CL: generate_framework(segments)
    CL-->>SG: intro + transitions + outro
    SG-->>WF: assembled script
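
The two-phase flow in the diagram (one call per article, then one call for the framing) can be sketched as follows. The method names follow the diagram, but the `Framework` dataclass, the `ScriptModel` protocol, the assembly order, and `FakeModel` are assumptions for illustration; no real Claude API call is made.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Framework:
    intro: str
    transitions: list[str]  # assumed: one transition between adjacent segments
    outro: str


class ScriptModel(Protocol):
    def generate_segment(self, article: str) -> str: ...
    def generate_framework(self, segments: list[str]) -> Framework: ...


def generate_script(articles: list[str], model: ScriptModel) -> str:
    # Phase 1: one dialogue segment per article.
    segments = [model.generate_segment(article) for article in articles]
    # Phase 2: a single call that sees all segments and writes the framing.
    framework = model.generate_framework(segments)
    parts = [framework.intro]
    for i, segment in enumerate(segments):
        if 0 < i <= len(framework.transitions):
            parts.append(framework.transitions[i - 1])
        parts.append(segment)
    parts.append(framework.outro)
    return "\n\n".join(parts)


class FakeModel:
    """Stand-in for the API client so the sketch runs offline."""

    def generate_segment(self, article: str) -> str:
        return f"[segment: {article}]"

    def generate_framework(self, segments: list[str]) -> Framework:
        return Framework("[intro]", ["[transition]"] * max(len(segments) - 1, 0), "[outro]")


script = generate_script(["a", "b"], FakeModel())
```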

Audio generation flow:

sequenceDiagram
    participant WF as PodcastPipeline
    participant AG as AudioGenerator
    participant TTS as Google Cloud TTS
    participant GCS as Cloud Storage

    WF->>AG: generate_audio(script)
    AG->>TTS: synthesize_long_audio(script)
    TTS->>GCS: store intermediate audio
    TTS-->>AG: operation ID
    loop Poll until complete
        AG->>TTS: check status
    end
    AG->>GCS: download audio
    AG-->>WF: local .wav path
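
The "poll until complete" step is a standard wait loop around a long-running operation. A minimal sketch is shown below; `wait_for_operation` and its parameters are hypothetical names, and the status check is passed in as a plain callable rather than tied to any specific Google Cloud client call.

```python
import time
from typing import Callable


def wait_for_operation(
    is_done: Callable[[], bool],
    interval: float = 5.0,
    timeout: float = 600.0,
) -> None:
    """Call is_done() every `interval` seconds until it returns True or time runs out."""
    deadline = time.monotonic() + timeout
    while not is_done():
        if time.monotonic() >= deadline:
            raise TimeoutError("audio synthesis did not finish in time")
        time.sleep(interval)


# Usage with a fake operation that reports completion on the third check.
status_checks = []


def fake_is_done() -> bool:
    status_checks.append("checked")
    return len(status_checks) >= 3


wait_for_operation(fake_is_done, interval=0.0, timeout=5.0)
```

Bounding the loop with a timeout keeps a stalled synthesis job from blocking the pipeline indefinitely.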

utils/ — Infrastructure clients

| Module | Class | Responsibility |
|---|---|---|
| `http.py` | `HTTPClient` | Requests with retry, timeout, user-agent |
| `s3.py` | `S3Storage` | Upload files to AWS S3, return public URLs |
| `mongodb.py` | `MongoDBClient` | Store episode records, check article IDs |
| `loki.py` | `LokiClient` | Forward log entries to Grafana Loki |
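
To illustrate the "retry" part of `HTTPClient`, here is a generic retry-with-backoff helper. It is a sketch of the general technique, not the project's implementation: `with_retry`, its parameters, the exponential backoff schedule, and the choice of `ConnectionError` as the retryable error are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying on ConnectionError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")


# Usage: a fake request that fails twice, then succeeds.
outcomes = [ConnectionError("reset"), ConnectionError("reset"), "<rss/>"]
delays = []


def flaky_fetch() -> str:
    outcome = outcomes.pop(0)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


body = with_retry(flaky_fetch, sleep=delays.append)
```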

workflows/ — Pipeline orchestration

PodcastPipeline.run() sequence:

flowchart TD
    A["Collect articles\nfrom sources"] --> B{MongoDB\nconfigured?}
    B -->|yes| C["Deduplicate\nagainst MongoDB"]
    B -->|no| D
    C --> D["Generate script\nScriptGenerator"]
    D --> E["Synthesize audio\nAudioGenerator"]
    E --> F["Generate RSS feed\nRSSGenerator"]
    F --> G{S3\nconfigured?}
    G -->|yes| H["Upload to S3\nS3Storage"]
    G -->|no| I
    H --> I["Record episode\nMongoDBClient"]
    I --> J["Return PodcastResult"]
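
The flowchart above can be read as a straight-line function with optional steps. The sketch below passes each stage in as a callable so it runs without any external service; `run_pipeline` and its parameter names are hypothetical. Treating the record step as optional is also an assumption, based on MongoDB being an optional integration, since the flowchart draws it unconditionally.

```python
from typing import Callable, Optional


def run_pipeline(
    collect: Callable[[], list[str]],
    make_script: Callable[[list[str]], str],
    make_audio: Callable[[str], str],
    make_rss: Callable[[str], str],
    dedupe: Optional[Callable[[list[str]], list[str]]] = None,
    upload: Optional[Callable[[str], str]] = None,
    record: Optional[Callable[[list[str]], None]] = None,
) -> dict:
    articles = collect()
    if dedupe is not None:  # "MongoDB configured?" branch
        articles = dedupe(articles)
    script = make_script(articles)
    audio_path = make_audio(script)
    rss_path = make_rss(audio_path)
    # "S3 configured?" branch: without an uploader there is no public URL.
    audio_url = upload(audio_path) if upload is not None else None
    if record is not None:
        record(articles)
    return {
        "articles": articles,
        "audio_path": audio_path,
        "rss_path": rss_path,
        "audio_url": audio_url,
    }


# Usage with stub stages: deduplication enabled, upload and record disabled.
already_published = {"old story"}
result = run_pipeline(
    collect=lambda: ["old story", "new story"],
    make_script=lambda articles: " / ".join(articles),
    make_audio=lambda script: "episode.wav",
    make_rss=lambda audio_path: "feed.xml",
    dedupe=lambda articles: [a for a in articles if a not in already_published],
)
```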

PodcastResult fields:

| Field | Type | Description |
|---|---|---|
| `success` | `bool` | Whether the pipeline completed without error |
| `number_of_articles_collected` | `int` | Articles used in this episode |
| `script_path` | `Path \| None` | Local path to generated script |
| `audio_path` | `Path \| None` | Local path to generated audio |
| `rss_path` | `Path \| None` | Local path to RSS feed |
| `s3_script_url` | `str \| None` | Public S3 URL for the script |
| `s3_audio_url` | `str \| None` | Public S3 URL for the audio |
| `execution_time_seconds` | `float` | Wall-clock time for the run |
| `error_message` | `str \| None` | Error details if `success` is `False` |
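
Expressed as a dataclass, the fields above might look like the sketch below. The field names and types follow the table (`Optional[X]` is equivalent to `X | None`); the default values are assumptions, and the actual class may define them differently.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class PodcastResult:
    success: bool
    number_of_articles_collected: int = 0  # defaults are assumed
    script_path: Optional[Path] = None
    audio_path: Optional[Path] = None
    rss_path: Optional[Path] = None
    s3_script_url: Optional[str] = None
    s3_audio_url: Optional[str] = None
    execution_time_seconds: float = 0.0
    error_message: Optional[str] = None


# Usage: callers branch on `success` and read `error_message` only on failure.
result = PodcastResult(success=False, error_message="example failure")
summary = "ok" if result.success else f"failed: {result.error_message}"
```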

Configuration resolution

Config values resolve in this order (highest priority first):

1. get_config(keyword=value)    ← Python API direct override
2. CLI flag                     ← --show-name, --male-voice, etc.
3. Environment variable         ← SHOW_NAME, MALE_VOICE, etc.
4. Built-in default             ← hardcoded in Config dataclass

This means you can mix the layers freely: keep required secrets in environment variables and override per-run settings with CLI flags, all without a config file.
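
The precedence order can be sketched as a lookup that returns the first layer holding a value. This is an illustration of the rule, not the project's `get_config()` implementation: the `resolve` function, its parameters, and the placeholder default are assumptions.

```python
import os
from typing import Mapping, Optional

# Placeholder default for illustration; the real defaults live in the Config dataclass.
DEFAULTS = {"show_name": "Example Show"}


def resolve(
    key: str,
    overrides: Optional[Mapping[str, str]] = None,
    cli_args: Optional[Mapping[str, str]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the highest-priority value for `key`, or None if no layer sets it."""
    env = os.environ if env is None else env
    candidates = (
        (overrides or {}).get(key),  # 1. Python API override
        (cli_args or {}).get(key),   # 2. CLI flag
        env.get(key.upper()),        # 3. environment variable, e.g. SHOW_NAME
        DEFAULTS.get(key),           # 4. built-in default
    )
    return next((value for value in candidates if value is not None), None)


env = {"SHOW_NAME": "From Env"}
from_env = resolve("show_name", env=env)
from_cli = resolve("show_name", cli_args={"show_name": "From CLI"}, env=env)
```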


External dependencies

Graceful degradation

Optional integrations (MongoDB, S3, Grafana Loki) are skipped silently when they are not configured. The core script and audio pipeline needs only ANTHROPIC_API_KEY and GCS_BUCKET_NAME, plus the Google Cloud credentials listed in the table below.

| Integration | Required | Env var |
|---|---|---|
| Anthropic Claude API | Yes | `ANTHROPIC_API_KEY` |
| Google Cloud TTS | Yes | `GCS_BUCKET_NAME` + credentials |
| AWS S3 | No | `S3_BUCKET_NAME` |
| MongoDB | No | `MONGODB_USERNAME` + `MONGODB_PASSWORD` |
| Grafana Loki | No | `GRAFANA_LOKI_URL` |
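
One way to picture this behaviour is a startup check that fails only on missing required settings and simply reports which optional integrations are available. The environment variable names come from the table; the `check_environment` function and the choice of `RuntimeError` are assumptions for illustration, and Google Cloud credentials are not checked here.

```python
import os
from typing import Mapping, Optional

REQUIRED = ("ANTHROPIC_API_KEY", "GCS_BUCKET_NAME")
OPTIONAL = {
    "s3": ("S3_BUCKET_NAME",),
    "mongodb": ("MONGODB_USERNAME", "MONGODB_PASSWORD"),
    "loki": ("GRAFANA_LOKI_URL",),
}


def check_environment(env: Optional[Mapping[str, str]] = None) -> dict[str, bool]:
    """Raise if a required variable is missing; report which optional integrations are set."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    # An integration counts as enabled only when all of its variables are set.
    return {
        name: all(bool(env.get(var)) for var in variables)
        for name, variables in OPTIONAL.items()
    }


enabled = check_environment(
    {
        "ANTHROPIC_API_KEY": "example-key",
        "GCS_BUCKET_NAME": "example-bucket",
        "MONGODB_USERNAME": "example-user",  # password missing, so MongoDB stays off
    }
)
```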