The short version: We run PaxMachina like an Airflow-style DAG, separating heavy lifting from reasoning to save tokens. We replaced generic vector stores with a specialized Query-Memory-Document (QMD) backend for high-velocity state. We treat Telegram channels as immutable event logs, not watercoolers. And we added a task ledger protocol that prevents the runaway loops plaguing other agent frameworks.
AI agents are like Airflow for intelligence
I used to think the bottleneck in agent systems was model intelligence. I was wrong. The bottleneck is context hygiene. If you treat an agent like a chatty intern, you burn tokens on coordination and lose state in the noise.
The shift that made our system (PaxMachina) work was treating it like an ops pipeline. Specifically, like Airflow DAGs. We separated the "muscle" (gathering data) from the "brain" (reasoning), and we locked down how they talk to each other.
If you've followed the recent OpenClaw saga (500 runaway iMessages, $20 burned overnight on a heartbeat cron, 341 malicious skills on ClawHub, three CVEs in three days), you've seen what happens when agent systems lack this discipline. PaxMachina started as our answer to the same problems before OpenClaw made them front-page news.
Here are the four architectural decisions that turned our prototype into a production loop.
[Diagram: PaxMachina as an ops pipeline, with stage labels "No LLM tokens", "No drift", "Only when needed", and "Immutable audit"]
1. Script-based gathering (saving tokens by not asking nicely)
Early on, we had agents "browsing" the web. One agent would ask another to "go check the pricing page." The second agent would spin up a browser, scroll, read, and summarize. It was painfully slow and expensive.
We replaced that with script-based gathering. Fetching a URL, stripping the DOM, and extracting diffs is a deterministic task.
We don't need an LLM to decide how to use curl or a headless browser.
Now, the "Research" node in our DAG is mostly Python scripts that run rigid, efficient collection jobs. They hand off clean, token-efficient summaries to the LLM only when reasoning is actually required. This dropped our token costs by ~40% and made the inputs for the Chief of Staff agent consistent every time.
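As a sketch of what one such collection job looks like: fetch, strip, diff, and only hand the LLM the delta. Function names like `strip_dom` and `extract_diff` are illustrative here, not PaxMachina's actual code.

```python
import difflib
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style contents."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def strip_dom(html: str) -> str:
    """Deterministically reduce a page to its visible text. No LLM involved."""
    parser = _TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)


def extract_diff(cached_text: str, live_text: str) -> str:
    """Unified diff of visible text: the only thing the LLM ever sees."""
    lines = difflib.unified_diff(
        cached_text.splitlines(), live_text.splitlines(),
        fromfile="cached", tofile="live", lineterm="",
    )
    return "\n".join(lines)
```

If the diff is empty, no reasoning call happens at all, which is where most of the token savings come from.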
The OpenClaw lesson: One user's "heartbeat" cron sent 120,000 tokens of context to Claude Opus just to check the time every 30 minutes. When your agent can invoke arbitrary tools, you're one bad loop away from a $20 overnight bill. Or worse. Deterministic scripts don't hallucinate token-expensive approaches.
2. The QMD memory backend (handling state at speed)
Managing agent memory is usually an afterthought. Then you have six agents trying to read and write context simultaneously. We initially used simple JSON dumps, then generic vector stores, but both got messy. JSON files had race conditions; vector stores were too slow for rapid state updates.
We moved to a custom QMD (Query-Memory-Document) backend: a shared memory buffer optimized for retrieval speed with distinct "memory banks" (short-term working memory vs. long-term archival). It handles embedding management and context windowing so the agents don't have to.
The design goal was similar to Apache Wayang: a memory layer decoupled from any specific model provider or storage engine. Wayang abstracts across execution engines; QMD abstracts across memory access patterns. It just holds the state, fast.
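A toy sketch of the access pattern we mean: distinct banks behind one lock, so concurrent agents never observe a half-written state. Class and method names here are ours for illustration; the real backend also handles embeddings and context windowing.

```python
import threading
import time
from collections import deque


class QMDStore:
    """Illustrative QMD-style store: short-term working memory plus a
    long-term archive, with atomic writes across both banks."""

    def __init__(self, short_term_cap: int = 128):
        self._lock = threading.Lock()
        self._short_term = deque(maxlen=short_term_cap)  # working memory
        self._archive = []                               # long-term bank

    def write(self, key: str, value: str, durable: bool = False) -> None:
        entry = {"key": key, "value": value, "ts": time.time()}
        with self._lock:  # atomic update: no split-brain between agents
            self._short_term.append(entry)
            if durable:
                self._archive.append(entry)

    def recall(self, key: str, max_results: int = 6) -> list:
        """Newest-first lookup across both banks, capped like our config."""
        with self._lock:
            hits = [e for e in list(self._short_term) + self._archive
                    if e["key"] == key]
        hits.sort(key=lambda e: e["ts"], reverse=True)
        return hits[:max_results]
```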
The OpenClaw lesson: Their troubleshooting docs warn about "state drift" when multiple workspace directories cause auth and context conflicts. We've seen the same failure mode. QMD gives us a single source of truth with atomic updates. No split-brain scenarios.
3. Telegram as an immutable ledger
This is the most controversial part of our stack, but it works. We use Telegram not just for notifications, but as the immutable ledger of the system.
In many agent frameworks, the "logs" are buried in a database. In PaxMachina, the Chief of Staff broadcasts state changes and decisions to a dedicated, read-only Telegram channel. Because the channel is linear and timestamped, it acts as a perfect audit trail.
Why not Kafka or a proper event store? We considered it. But Telegram gave us human-readable audit trails with zero infrastructure overhead. We can scroll back and see exactly what the system decided, when, and why. No need to spin up a consumer or query a topic. For our scale (internal ops, not enterprise SaaS), the tradeoff is worth it.
The Hygiene Rule: Workers report to the Chief of Staff, and the Chief of Staff writes to the ledger. If it's not in the channel, it didn't happen. This forces us to be disciplined about what constitutes a "decision" versus just internal noise.
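To make the rule concrete, a ledger write might look like the sketch below. The entry format, bot token, and channel id are all hypothetical; the send path uses the standard Telegram Bot API `sendMessage` method.

```python
import json
import time
import urllib.parse
import urllib.request

BOT_TOKEN = "REPLACE_ME"           # hypothetical bot token
LEDGER_CHAT_ID = "-1001234567890"  # hypothetical read-only channel id


def format_entry(actor: str, decision: str, reason: str) -> str:
    """One linear, timestamped record. If it's not in the channel,
    it didn't happen."""
    ts = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    return f"[{ts}] {actor} DECIDED: {decision}\nWHY: {reason}"


def post_to_ledger(text: str) -> None:
    """Broadcast via the Telegram Bot API's sendMessage method."""
    url = f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage"
    data = urllib.parse.urlencode(
        {"chat_id": LEDGER_CHAT_ID, "text": text}).encode()
    with urllib.request.urlopen(url, data=data) as resp:
        json.load(resp)  # surfaces a malformed API response immediately
```

Only the Chief of Staff holds the token, which is how the "workers report, Chief of Staff writes" rule gets enforced mechanically rather than by convention.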
4. Task ledger protocol (preventing runaway loops)
The 500-message OpenClaw incident wasn't a bug. It was a missing feedback loop. The agent had no explicit record of what it had already done, so it kept doing it.
[Diagram: agent loops, uncontrolled vs. controlled]
We added a Task Ledger Protocol directly into PaxMachina's SOUL file:
- On every new request: open/update `memory/TASKS.md`
- When delegating: log `assignedAt`, agent, expected output location
- When results arrive: mark `DONE` + link to commit/file
- If no result by SLA (15–30 min): mark `NUDGED` + resend
The SQUAD file enforces the other half: "Workers must report back to PaxMachina" and "PaxMachina tracks work in memory/TASKS.md."
This creates a closed loop. Every task has a paper trail. If something runs away, we can see exactly where the chain broke. No more "the agent just kept going." The agent can't proceed without updating its ledger.
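A minimal sketch of the protocol's write path, assuming the `memory/TASKS.md` layout above. The record format is illustrative; what matters is that every transition appends a line.

```python
import time
from pathlib import Path

LEDGER = Path("memory/TASKS.md")  # path from the protocol above


def _append(line: str) -> None:
    LEDGER.parent.mkdir(parents=True, exist_ok=True)
    with LEDGER.open("a") as f:
        f.write(line + "\n")


def delegate(task_id: str, agent: str, expected_output: str) -> None:
    """Log the delegation: who, when, and where the result should land."""
    _append(f"- {task_id} ASSIGNED agent={agent} "
            f"assignedAt={int(time.time())} expect={expected_output}")


def complete(task_id: str, result_link: str) -> None:
    _append(f"- {task_id} DONE -> {result_link}")


def sweep(now: float, open_tasks: dict, sla_seconds: int = 1800) -> list:
    """Return task ids past SLA; the caller marks them NUDGED and resends."""
    overdue = [tid for tid, assigned_at in open_tasks.items()
               if now - assigned_at > sla_seconds]
    for tid in overdue:
        _append(f"- {tid} NUDGED at={int(now)}")
    return overdue
```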
Security principles (keeping the DAG safe)
Running automated ops pipelines requires paranoia. CrowdStrike's analysis of OpenClaw describes how misconfigured agents become "AI backdoor agents capable of taking orders from adversaries." Prompt injection isn't just a data leak; it's a foothold for automated lateral movement. We learned some of these lessons the hard way after an early prototype started modifying its own config files.
- Least privilege: The script that fetches URLs cannot write to the database. The agent that writes code cannot deploy it.
- Operator != Admin: The human overseeing the loop (me) has to sudo up for destructive actions; I don't run as admin by default.
- Allowlists, not capabilities: We don't let agents execute arbitrary shell commands. They pick from a library of approved tasks. This is the opposite of OpenClaw's skill model, where 26% of analyzed skills contained vulnerabilities.
- No external skill repositories: ClawHub has 341+ confirmed malicious skills. We don't pull code from registries the agent can browse autonomously.
- DR Backups: We assume the memory state will get corrupted. We snapshot the QMD and the ledger so we can replay the week if needed.
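The allowlist principle reduces to a few lines: the dispatcher refuses anything not in the approved library rather than trying to interpret it. A sketch, with stubbed tasks standing in for our real library:

```python
from typing import Callable, Dict

# Approved task library: the only things an agent can invoke.
# These stubs stand in for real collection and summarization jobs.
APPROVED_TASKS: Dict[str, Callable[..., str]] = {
    "fetch_pricing_page": lambda url: f"fetched {url}",
    "summarize_diff": lambda diff: f"summary of {len(diff)} chars",
}


def dispatch(task_name: str, **kwargs) -> str:
    """Allowlists, not capabilities: unknown task names are refused
    outright, never passed to a shell or interpreted by a model."""
    if task_name not in APPROVED_TASKS:
        raise PermissionError(f"task not in allowlist: {task_name}")
    return APPROVED_TASKS[task_name](**kwargs)
```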
Skill auditing (because ClawHub is a minefield)
341 malicious skills on ClawHub means you can't trust the registry. We run a layered audit before enabling any skill.
1. Inventory installed vs allowed
Installed skills live under ~/workspace/skills/ and npm global skills. Allowed skills are configured via skills.allowBundled and skills.entries.<skill>.enabled in openclaw.json. Run openclaw doctor to see the current state. Ours shows: Eligible: 15, Blocked by allowlist: 39. First step: diff installed vs allowed and remove anything not explicitly needed.
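The diff itself is mechanical. A sketch, assuming the config keys shown above (treat the exact JSON shape as illustrative):

```python
import json
from pathlib import Path


def installed_skills(skills_dir: str) -> set:
    """Every directory under the workspace skills folder counts as installed."""
    root = Path(skills_dir)
    return {p.name for p in root.iterdir() if p.is_dir()} if root.is_dir() else set()


def allowed_skills(config_path: str) -> set:
    """Read the explicit allowlist from openclaw.json's skills.entries
    (key names follow the config referenced above; shape is illustrative)."""
    cfg = json.loads(Path(config_path).read_text())
    entries = cfg.get("skills", {}).get("entries", {})
    return {name for name, entry in entries.items() if entry.get("enabled")}


def audit(installed: set, allowed: set) -> dict:
    """Anything installed but not allowed is a removal candidate."""
    return {
        "eligible": sorted(installed & allowed),
        "remove_candidates": sorted(installed - allowed),
    }
```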
2. Static scan
For every installed skill folder, grep for obvious exfil and persistence patterns:
```
# Exfil patterns
curl | wget | nc | ssh | scp | rsync

# Encoding (review context)
base64 | openssl enc | tar

# Secret access
process.env | ~/.config/openclaw/env | .ssh | .git-credentials

# Persistence
~/.bashrc | cron | systemd units
```
Also check package.json scripts. The postinstall and preinstall hooks are where supply chain attacks hide.
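Here is the same scan sketched in Python rather than raw grep. The pattern list mirrors the one above, and a hit means "review this", not "automatically malicious":

```python
import json
import re
from pathlib import Path

# Mirrors the grep patterns above; tune to taste.
RISK_PATTERNS = re.compile(
    r"curl|wget|\bnc\b|\bssh\b|\bscp\b|rsync|base64|openssl enc"
    r"|process\.env|\.git-credentials|\.bashrc|crontab"
)
HOOKS = ("preinstall", "postinstall", "install")


def scan_skill(skill_dir: Path) -> dict:
    """Flag risky patterns in source files and install hooks in
    package.json, where supply chain attacks hide."""
    findings = {"pattern_hits": [], "install_hooks": []}
    for f in skill_dir.rglob("*"):
        if f.suffix in {".js", ".ts", ".sh", ".py"} and f.is_file():
            for i, line in enumerate(f.read_text(errors="ignore").splitlines(), 1):
                if RISK_PATTERNS.search(line):
                    findings["pattern_hits"].append(f"{f.name}:{i}")
    pkg = skill_dir / "package.json"
    if pkg.is_file():
        scripts = json.loads(pkg.read_text()).get("scripts", {})
        findings["install_hooks"] = [h for h in HOOKS if h in scripts]
    return findings
```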
3. Provenance check
For ClawHub-installed skills, inspect .clawhub/origin.json and pin versions. Prefer skills with known publishers, minimal dependencies, and small code surface. If a skill pulls in 47 transitive dependencies to format a date, reject it.
4. Sandbox test
Run skills with security=allowlist, no network (or monitored outbound DNS/HTTP), and minimal env (no tokens). See if they still try to touch secrets. If they do, they're malicious or badly written. Either way, reject.
5. Hardening defaults
Even if a skill passes audit, assume it could be compromised later:
- Keep `exec.security` on deny/allowlist (never "full")
- Keep secrets only in env/service, never in repo files
- Run scheduled jobs via your own scripts, not arbitrary skill code
What we run
We automated this as /home/clawdia/workspace/scripts/skills_audit.sh. It inventories installed skills, checks for install hooks, greps for high-risk patterns, and outputs a report to memory/security/skills_audit_YYYY-MM-DD.md.
The report becomes a control, not just documentation. We set an explicit allowlist in openclaw.json with only what we actually use: gog, github, ddg-search, xai, bird, summarize, gsc. Everything else becomes inert even if installed.
Production configs that actually matter
OpenClaw's defaults assume you want maximum recall and frequent syncing. In practice, this burns tokens and causes the context bloat that leads to runaway loops. Here's what we tuned.
Memory tuning
The default workspace indexing (workspace/**/*.md) pulled in everything. We stripped it back to only what agents actually need:
```
includeDefaultMemory: true        # MEMORY.md + memory/**/*.md only
memory.qmd.update.interval: 10m   # was 5m
debounceMs: 30000                 # was 15000
onBoot: true

# Recall caps
maxResults: 6
maxSnippetChars: 700
maxInjectedChars: 4000
timeoutMs: 6000
```
The maxInjectedChars: 4000 cap is the important one. Without it, a single recall can dump 20k tokens of "helpful context" into every request.
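Enforcing those caps takes only a few lines. A sketch of the clamping logic (parameter names mirror the config above; the real recall path lives inside the memory backend):

```python
def clamp_recall(snippets: list, max_results: int = 6,
                 max_snippet_chars: int = 700,
                 max_injected_chars: int = 4000) -> str:
    """Cap result count, truncate each snippet, then hard-stop total
    injected context. The last cap is the one that prevents a single
    recall from dumping 20k tokens into every request."""
    out, used = [], 0
    for snippet in snippets[:max_results]:
        snippet = snippet[:max_snippet_chars]
        if used + len(snippet) > max_injected_chars:
            snippet = snippet[:max_injected_chars - used]
        if not snippet:
            break
        out.append(snippet)
        used += len(snippet)
    return "\n---\n".join(out)
```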
Event-driven, not heartbeat
The $20 overnight burn happened because a heartbeat cron kept firing. We don't use heartbeat for routine work. Instead:
- Instant triggers: Gmail webhooks, Telegram messages, and (optionally) GitHub webhooks fire agents immediately when something happens.
- Cron sweeps: Periodic scripts check outboxes and queues (`memory/rss-outbox.md`, `content-pipeline/inbox/`), run the relevant agent, and write outputs. Follow-ups happen because pipeline scripts call the next agent (Oracle → Scribe, Cipher → send), not because a heartbeat is looping.
We use HEARTBEAT.md only for true alerts: DR failures, watcher down, security incidents. Not for routine follow-ups.
API guard wrappers
Even with good architecture, a misbehaving agent can burn expensive API calls. We added a hard guard wrapper at /home/clawdia/workspace/scripts/xai_guard.sh:
```
# Rate limits
max_requests_per_minute: 2
max_requests_per_day: 20

# Per-task caps
finance_briefing: max 2 xAI calls/day (one run + one retry)
```
The search fallback chain respects these limits:
1. Brave (primary, when not rate-limited)
2. DDG via ddg-search skill
3. Gemini grounded search when quota allows
4. xAI search (only after Gemini fails, still guarded)
If an agent tries to call xAI in a loop, the wrapper kills it after 2 requests. Defense in depth.
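The guard itself is simple. A Python sketch of the same idea as `xai_guard.sh`: a sliding one-minute window plus a daily counter, with limits mirroring the config above.

```python
import time
from collections import deque


class APIGuard:
    """Hard caps on expensive API calls, independent of agent behavior."""

    def __init__(self, per_minute: int = 2, per_day: int = 20):
        self.per_minute = per_minute
        self.per_day = per_day
        self.recent = deque()  # timestamps within the last 60 seconds
        self.today = 0         # reset by a daily cron in practice

    def allow(self, now=None) -> bool:
        """True if the call may proceed; False means the wrapper kills it."""
        now = time.time() if now is None else now
        while self.recent and now - self.recent[0] >= 60:
            self.recent.popleft()  # slide the one-minute window
        if len(self.recent) >= self.per_minute or self.today >= self.per_day:
            return False
        self.recent.append(now)
        self.today += 1
        return True
```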
Telegram hygiene
We keep Telegram channels clean by deleting raw messages after processing. But we still need the audit trail. The solution: daily executive digests.
A script (telegram_digest_daily.sh) runs at end of day, reads WORKING.md plus the day's memory tail, and writes a 5-10 bullet summary into memory/YYYY-MM-DD.md. No Telegram message sent. Secrets redacted by policy.
We also set `session.dmScope="per-channel-peer"` to prevent context leaking between sessions.
Memory improves over time without storing raw chat. The channel stays usable. The audit trail stays complete.
The result: technical documentation grounded in evidence
We use this stack to power the technical documentation for KafScale: architecture guides, API references, and implementation deep dives. Because the gathering is script-based and the memory is structured, the output is grounded in actual codebase state and benchmarks, not hallucinations.
The E-E-A-T compliance happens naturally: the system can only cite what it has actually retrieved and verified. When we document a performance claim, there's a traceable path from the benchmark script to the prose.
The lesson? Don't build a chatbot. Build a pipeline.
If you're wrestling with agent architectures for technical workflows, I help teams design systems that don't burn tokens or trust. See how I work with teams.
If you need help with distributed systems, backend engineering, or data platforms, check my Services.