# MongoDB
MongoDB is an optional integration that prevents article reuse across episodes and maintains a full audit trail of podcast generation history. If credentials are absent, the pipeline runs normally and deduplication is simply skipped.
## What MongoDB stores
- **Articles collection**: tracks every article URL that has been used in a past episode. When the pipeline runs, it checks this collection and excludes any articles already seen, ensuring every episode has fresh, unique content.
- **Episodes collection**: records metadata for every generated episode: execution time, success status, article count, output file paths, and S3 URLs. This provides a complete audit trail and generation history.
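The exclusion step amounts to a set-difference over previously used URLs. A minimal sketch (the `filter_unseen` helper is illustrative, not the pipeline's actual API):

```python
def filter_unseen(candidates, seen_urls):
    """Drop any candidate article whose URL was used in a past episode."""
    return [a for a in candidates if a["url"] not in seen_urls]

# With a live connection, seen_urls would come from the articles collection:
#   seen_urls = {doc["url"] for doc in db.articles.find({}, {"url": 1})}

candidates = [
    {"url": "https://example.com/a", "title": "A"},
    {"url": "https://example.com/b", "title": "B"},
]
fresh = filter_unseen(candidates, seen_urls={"https://example.com/a"})
print([a["title"] for a in fresh])  # only articles not yet used remain
```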
## Local setup (Docker)

The included `mongodb.sh` script manages a local MongoDB container for development and testing.

```shell
./mongodb.sh status  # Check if running
./mongodb.sh shell   # Open MongoDB shell
./mongodb.sh logs    # View container logs
./mongodb.sh stop    # Stop the container
./mongodb.sh remove  # Stop + remove container and data volume
```
### Connection details
| Field | Value |
|---|---|
| Host | localhost |
| Port | 27017 |
| Username | admin |
| Password | Set in mongodb.sh (default: password123) |
| Database | the_data_packet |
Connection URL: `mongodb://admin:password123@localhost:27017/the_data_packet?authSource=admin`
**Change the default password:** the default password in `mongodb.sh` is for local development only. Always set a strong password in any environment beyond your own machine.
## Configuration
Set these environment variables to enable MongoDB integration:
```shell
MONGODB_USERNAME=admin
MONGODB_PASSWORD=your-password
MONGODB_HOST=localhost            # default
MONGODB_PORT=27017                # default
MONGODB_DATABASE=the_data_packet  # default
```
Or pass them to Docker:
```shell
docker run --rm --env-file .env \
  -e MONGODB_USERNAME=admin \
  -e MONGODB_PASSWORD=your-password \
  -v "$(pwd)/output:/app/output" \
  ghcr.io/thewintershadow/the-data-packet:latest
```
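These variables combine into a standard MongoDB connection string. A sketch of how that assembly might look (the `build_mongo_uri` helper is hypothetical, not part of the project):

```python
import os

def build_mongo_uri(username, password, host="localhost", port=27017,
                    database="the_data_packet"):
    """Assemble a connection string; authSource=admin matches the
    local container's root-user setup."""
    return (f"mongodb://{username}:{password}@{host}:{port}/"
            f"{database}?authSource=admin")

# Defaults mirror the table above when the variables are unset.
uri = build_mongo_uri(
    os.environ.get("MONGODB_USERNAME", "admin"),
    os.environ.get("MONGODB_PASSWORD", "password123"),
    os.environ.get("MONGODB_HOST", "localhost"),
    int(os.environ.get("MONGODB_PORT", "27017")),
    os.environ.get("MONGODB_DATABASE", "the_data_packet"),
)
print(uri)
```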
## Inspecting stored data
```python
from pymongo import MongoClient

client = MongoClient(
    "mongodb://admin:password123@localhost:27017/the_data_packet?authSource=admin"
)
db = client.the_data_packet

# Articles used across all episodes
articles = list(db.articles.find())
print(f"Total articles tracked: {len(articles)}")

# Episode history
episodes = list(db.episodes.find())
print(f"Total episodes generated: {len(episodes)}")

# Most recent episode
latest = db.episodes.find_one(sort=[("_id", -1)])
print(f"Last run: {latest['execution_time_seconds']:.1f}s — success={latest['success']}")
```
```javascript
// Open with: ./mongodb.sh shell
use the_data_packet

// Show collections
show collections

// Count documents
db.articles.countDocuments()
db.episodes.countDocuments()

// Recent episodes (newest first)
db.episodes.find().sort({ _id: -1 }).limit(5)

// Articles from a specific source
db.articles.find({ source: "techcrunch" })
```
## Data schema
Articles collection:

| Field | Type | Description |
|---|---|---|
| `url` | string | Article URL (unique index) |
| `title` | string | Article title |
| `source` | string | Source name (`wired`, `techcrunch`) |
| `published_date` | datetime | Publication date |
| `used_at` | datetime | When it was added to an episode |

Episodes collection:

| Field | Type | Description |
|---|---|---|
| `success` | bool | Whether the run completed successfully |
| `number_of_articles` | int | Articles used |
| `script_path` | string | Local path to script file |
| `audio_path` | string | Local path to audio file |
| `s3_audio_url` | string | Public S3 URL (if uploaded) |
| `execution_time_seconds` | float | Wall-clock run time |
| `error_message` | string | Error details if `success` is false |
| `created_at` | datetime | Episode generation timestamp |
## Production setup
For production, use a managed MongoDB service rather than the local Docker script:

- **MongoDB Atlas**: fully managed, free tier available
- **Amazon DocumentDB**: MongoDB-compatible, AWS-native
- **Self-hosted MongoDB** with a replica set for high availability
Set MONGODB_HOST to your managed instance hostname and provide the credentials via
environment variables or a secrets manager.
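For example, pointing the pipeline at a managed cluster might look like this (the hostname, secret name, and username below are placeholders, not real values):

```shell
# Hypothetical managed-instance configuration
export MONGODB_HOST=your-cluster.example.mongodb.net
export MONGODB_PORT=27017
export MONGODB_USERNAME=podcast_app
# Pull the password from a secrets manager instead of hard-coding it
export MONGODB_PASSWORD="$(aws secretsmanager get-secret-value \
  --secret-id the-data-packet/mongodb \
  --query SecretString --output text)"
```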
**Use the Atlas free tier for simple deployments:** MongoDB Atlas M0 (free tier) is sufficient for most podcast generation workloads. It provides 512 MB of storage, which easily holds thousands of episode records.