
Architecture

The Data Packet is structured as a layered Python application. Each layer has a single responsibility and communicates through well-defined interfaces.


Layer diagram

graph TB
    CLI["CLI Interface<br/><code>cli.py</code>"]
    WF["Workflows<br/>PodcastPipeline · PodcastResult"]
    SRC["Sources<br/>WiredSource · TechCrunchSource"]
    GEN["Generation<br/>ScriptGenerator · AudioGenerator · RSSGenerator"]
    UTIL["Utils<br/>HTTPClient · S3Storage · MongoDBClient · LokiClient"]
    CORE["Core<br/>Config · Exceptions · Logging"]

    CLI --> WF
    WF --> SRC
    WF --> GEN
    WF --> UTIL
    SRC --> UTIL
    GEN --> UTIL
    SRC --> CORE
    GEN --> CORE
    UTIL --> CORE
    WF --> CORE

Module reference

core/ - Foundation services

| Module | Class / Function | Responsibility |
| --- | --- | --- |
| config.py | Config, get_config() | Unified config from env vars + overrides |
| exceptions.py | TheDataPacketError and subclasses | Custom exception hierarchy |
| logging.py | setup_logging(), get_logger() | JSONL logging with S3 upload |

Exception hierarchy:

graph TD
    Base["TheDataPacketError"]
    Base --> ConfigurationError
    Base --> NetworkError
    Base --> ScrapingError
    Base --> AIGenerationError
    Base --> AudioGenerationError
    Base --> ValidationError
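
The hierarchy above maps directly onto Python classes. The following is a minimal sketch, assuming each subclass adds no behaviour of its own; the real exceptions.py may carry extra fields or context. The `describe` helper is hypothetical and only illustrates why a shared base class is useful: callers can catch every project error with one `isinstance` check.

```python
class TheDataPacketError(Exception):
    """Base class for all project errors (names taken from the diagram)."""


class ConfigurationError(TheDataPacketError):
    pass


class NetworkError(TheDataPacketError):
    pass


class ScrapingError(TheDataPacketError):
    pass


class AIGenerationError(TheDataPacketError):
    pass


class AudioGenerationError(TheDataPacketError):
    pass


class ValidationError(TheDataPacketError):
    pass


def describe(exc: Exception) -> str:
    """Hypothetical helper: tell project errors apart from everything else."""
    if isinstance(exc, TheDataPacketError):
        return f"pipeline error ({type(exc).__name__}): {exc}"
    return f"unexpected error: {exc}"
```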

sources/ - Article collection

| Module | Class | Responsibility |
| --- | --- | --- |
| base.py | Article, ArticleSource | Dataclass and abstract base for all sources |
| wired.py | WiredSource | RSS-based scraping of Wired.com |
| techcrunch.py | TechCrunchSource | Article collection from TechCrunch |

ArticleSource is an abstract base class. Adding a new news source means subclassing it and implementing collect_articles(). See Contributing for a step-by-step guide.
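
As a rough illustration of the subclassing pattern, the sketch below defines stand-in versions of Article and ArticleSource and one toy source. The Article fields, the `collect_articles()` signature (no arguments, returns a list), and the `ExampleSource` class are all assumptions made for this example; check base.py for the real definitions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Article:
    # Assumed fields; the real dataclass in base.py may differ.
    title: str
    url: str
    content: str


class ArticleSource(ABC):
    """Stand-in for the abstract base described above."""

    @abstractmethod
    def collect_articles(self) -> list[Article]:
        """Return the articles this source found."""


class ExampleSource(ArticleSource):
    """Hypothetical source that returns canned data instead of scraping."""

    def collect_articles(self) -> list[Article]:
        return [
            Article(
                title="Hello",
                url="https://example.com/hello",
                content="Placeholder body text.",
            )
        ]
```

Because `collect_articles()` is abstract, instantiating `ArticleSource` directly raises `TypeError`, which catches a forgotten implementation early.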

Supported categories:

| Category | URL path |
| --- | --- |
| security | /category/security/ |
| science | /category/science/ |
| ai | /category/artificial-intelligence/ |

| Category | URL path |
| --- | --- |
| ai | /category/artificial-intelligence/ |
| security | /category/security/ |

generation/ - Content creation

| Module | Class | Responsibility |
| --- | --- | --- |
| script.py | ScriptGenerator | Claude API → structured dialogue script |
| audio.py | AudioGenerator | Google Cloud TTS Long Audio → .wav |
| rss.py | RSSGenerator | RSS 2.0 feed XML generation |

Script generation flow:

sequenceDiagram
    participant WF as PodcastPipeline
    participant SG as ScriptGenerator
    participant CL as Claude API

    WF->>SG: generate_script(articles)
    loop For each article
        SG->>CL: generate_segment(article)
        CL-->>SG: dialogue segment
    end
    SG->>CL: generate_framework(segments)
    CL-->>SG: intro + transitions + outro
    SG-->>WF: assembled script
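
The two-phase flow in the diagram (one call per article, then one call for the framing) can be sketched without any API client. In this sketch the `Framework` dataclass, the `assemble_script` layout, and the injected callables are assumptions for illustration; the real ScriptGenerator talks to the Claude API and may assemble the script differently.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Framework:
    # Assumed shape of the "intro + transitions + outro" response.
    intro: str
    transitions: list[str]
    outro: str


def assemble_script(segments: Sequence[str], framework: Framework) -> str:
    """Interleave segments with transitions, wrapped by intro and outro."""
    parts = [framework.intro]
    for index, segment in enumerate(segments):
        # A transition goes between segments, never before the first one.
        if index > 0 and index - 1 < len(framework.transitions):
            parts.append(framework.transitions[index - 1])
        parts.append(segment)
    parts.append(framework.outro)
    return "\n\n".join(parts)


def generate_script(
    articles: Sequence[str],
    generate_segment: Callable[[str], str],
    generate_framework: Callable[[Sequence[str]], Framework],
) -> str:
    """Phase 1: one segment per article. Phase 2: framing for all segments."""
    segments = [generate_segment(article) for article in articles]
    framework = generate_framework(segments)
    return assemble_script(segments, framework)
```

Passing the model calls in as callables keeps the orchestration testable with plain functions in place of network requests.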

Audio generation flow:

sequenceDiagram
    participant WF as PodcastPipeline
    participant AG as AudioGenerator
    participant TTS as Google Cloud TTS
    participant GCS as Cloud Storage

    WF->>AG: generate_audio(script)
    AG->>TTS: synthesize_long_audio(script)
    TTS->>GCS: store intermediate audio
    TTS-->>AG: operation ID
    loop Poll until complete
        AG->>TTS: check status
    end
    AG->>GCS: download audio
    AG-->>WF: local .wav path
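
The "poll until complete" step is a generic pattern for long-running operations. Below is a minimal sketch of such a loop, assuming a caller-supplied `is_done` callable that wraps whatever status check the TTS client exposes. The function name, interval, and timeout values are hypothetical and are not taken from AudioGenerator.

```python
import time
from typing import Callable


def wait_for_operation(
    is_done: Callable[[], bool],
    interval_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Block until is_done() returns True or the timeout elapses."""
    deadline = clock() + timeout_seconds
    while not is_done():
        # Give up rather than poll forever if the operation never finishes.
        if clock() >= deadline:
            raise TimeoutError("operation did not complete before the timeout")
        sleep(interval_seconds)
```

Injecting `sleep` and `clock` lets tests drive the loop without real delays.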

utils/ - Infrastructure clients

| Module | Class | Responsibility |
| --- | --- | --- |
| http.py | HTTPClient | Requests with retry, timeout, user-agent |
| s3.py | S3Storage | Upload files to AWS S3, return public URLs |
| mongodb.py | MongoDBClient | Store episode records, check article IDs |
| loki.py | LokiClient | Forward log entries to Grafana Loki |

workflows/ - Pipeline orchestration

PodcastPipeline.run() sequence:

flowchart TD
    A["Collect articles\nfrom sources"] --> B{MongoDB\nconfigured?}
    B -->|yes| C["Deduplicate\nagainst MongoDB"]
    B -->|no| D
    C --> D["Generate script\nScriptGenerator"]
    D --> E["Synthesize audio\nAudioGenerator"]
    E --> F["Generate RSS feed\nRSSGenerator"]
    F --> G{S3\nconfigured?}
    G -->|yes| H["Upload to S3\nS3Storage"]
    G -->|no| I
    H --> I["Record episode\nMongoDBClient"]
    I --> J["Return PodcastResult"]
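
The two optional branches in the flowchart (deduplication and upload) can be sketched as steps that run only when their dependency is supplied. Everything in this sketch is a stand-in: `run_steps`, its parameters, and the placeholder file names are hypothetical, and the episode-recording step is left out for brevity.

```python
from typing import Callable, Optional


def run_steps(
    articles: list[str],
    is_duplicate: Optional[Callable[[str], bool]] = None,
    upload: Optional[Callable[[str], str]] = None,
) -> dict:
    """Mirror the flowchart's branching with stand-in steps."""
    # Deduplicate only when a MongoDB-backed check is available.
    if is_duplicate is not None:
        articles = [a for a in articles if not is_duplicate(a)]

    script = "script for " + ", ".join(articles)  # stand-in for ScriptGenerator
    audio = "episode.wav"                          # stand-in for AudioGenerator
    rss = "feed.xml"                               # stand-in for RSSGenerator

    # Upload only when S3 is configured; otherwise keep local files only.
    urls = [upload(name) for name in (audio, rss)] if upload is not None else []
    return {"articles": articles, "script": script, "urls": urls}
```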

PodcastResult fields:

| Field | Type | Description |
| --- | --- | --- |
| success | bool | Whether the pipeline completed without error |
| number_of_articles_collected | int | Articles used in this episode |
| script_path | Path \| None | Local path to generated script |
| audio_path | Path \| None | Local path to generated audio |
| rss_path | Path \| None | Local path to RSS feed |
| s3_script_url | str \| None | Public S3 URL for the script |
| s3_audio_url | str \| None | Public S3 URL for the audio |
| execution_time_seconds | float | Wall-clock time for the run |
| error_message | str \| None | Error details if success is False |
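
For orientation, the fields listed above could be modelled as a dataclass along the following lines. The field names and types come from the table; the default values and the `summarize` helper are assumptions for this sketch, not the project's actual definition.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class PodcastResult:
    success: bool
    number_of_articles_collected: int = 0  # defaults are assumed
    script_path: Optional[Path] = None
    audio_path: Optional[Path] = None
    rss_path: Optional[Path] = None
    s3_script_url: Optional[str] = None
    s3_audio_url: Optional[str] = None
    execution_time_seconds: float = 0.0
    error_message: Optional[str] = None


def summarize(result: PodcastResult) -> str:
    """Hypothetical helper: one-line status for logs or CLI output."""
    if not result.success:
        return f"failed: {result.error_message}"
    return (
        f"ok: {result.number_of_articles_collected} articles "
        f"in {result.execution_time_seconds:.1f}s"
    )
```

Checking `success` before reading the path or URL fields avoids handling `None` values after a failed run.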

Configuration resolution

Config values resolve in this order (highest priority first):

1. get_config(keyword=value)    ← Python API direct override
2. CLI flag                     ← --show-name, --male-voice, etc.
3. Environment variable         ← SHOW_NAME, MALE_VOICE, etc.
4. Built-in default             ← hardcoded in Config dataclass

Because each layer only overrides the ones below it, you can combine them freely: keep required secrets in environment variables and override per-run settings with CLI flags, with no config file needed.
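
The precedence order can be expressed as a short lookup chain. This is a sketch of the idea only, assuming that option names map to environment variables by upper-casing (as SHOW_NAME and MALE_VOICE suggest). The `resolve` function and the values in `DEFAULTS` are hypothetical and do not reflect how get_config() is implemented.

```python
import os
from typing import Mapping, Optional

# Hypothetical defaults; the real values live in the Config dataclass.
DEFAULTS = {"show_name": "Example Show", "male_voice": "example-voice"}


def resolve(
    key: str,
    overrides: Optional[Mapping[str, str]] = None,
    cli_flags: Optional[Mapping[str, str]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the first value found, from highest priority to lowest."""
    # 1. Python API override, then 2. CLI flag.
    for layer in (overrides or {}, cli_flags or {}):
        if layer.get(key) is not None:
            return layer[key]
    # 3. Environment variable, named by upper-casing the option.
    environment = os.environ if env is None else env
    env_value = environment.get(key.upper())
    if env_value is not None:
        return env_value
    # 4. Built-in default.
    return DEFAULTS[key]
```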


External dependencies

Graceful degradation

Optional integrations (MongoDB, S3, Grafana Loki) are skipped silently when not configured. The core script and audio pipeline needs only ANTHROPIC_API_KEY, GCS_BUCKET_NAME, and Google Cloud credentials.

| Integration | Required | Env var |
| --- | --- | --- |
| Anthropic Claude API | Yes | ANTHROPIC_API_KEY |
| Google Cloud TTS | Yes | GCS_BUCKET_NAME + credentials |
| AWS S3 | No | S3_BUCKET_NAME |
| MongoDB | No | MONGODB_USERNAME + MONGODB_PASSWORD |
| Grafana Loki | No | GRAFANA_LOKI_URL |
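
One way to picture graceful degradation is a startup check that fails only on missing required variables and merely reports which optional integrations are active. The variable names below come from the table; the `enabled_integrations` function is a hypothetical sketch, and it does not check Google Cloud credentials, which are typically supplied outside of a single variable.

```python
from typing import Mapping

REQUIRED = ("ANTHROPIC_API_KEY", "GCS_BUCKET_NAME")
OPTIONAL = {
    "s3": ("S3_BUCKET_NAME",),
    "mongodb": ("MONGODB_USERNAME", "MONGODB_PASSWORD"),
    "loki": ("GRAFANA_LOKI_URL",),
}


def enabled_integrations(env: Mapping[str, str]) -> dict[str, bool]:
    """Raise if a required variable is missing; report optional ones."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing required variables: " + ", ".join(missing))
    # An optional integration counts as enabled only if all its variables are set.
    return {
        name: all(env.get(var) for var in variables)
        for name, variables in OPTIONAL.items()
    }
```

In practice `env` would be `os.environ`; a plain dict works equally well for tests.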