
How I built a secure, high-performance AI agent squad with OpenClaw

The short version: We run PaxMachina like an Airflow-style DAG, separating heavy lifting from reasoning to save tokens. We replaced generic vector stores with a specialized Query-Memory-Document (QMD) backend for high-velocity state. We treat Telegram channels as immutable event logs, not watercoolers. And we added a task ledger protocol that prevents the runaway loops plaguing other agent frameworks.

AI agents are like Airflow for intelligence

I used to think the bottleneck in agent systems was model intelligence. I was wrong. The bottleneck is context hygiene. If you treat an agent like a chatty intern, you burn tokens on coordination and lose state in the noise.

The shift that made our system (PaxMachina) work was treating it like an ops pipeline. Specifically, like Airflow DAGs. We separated the "muscle" (gathering data) from the "brain" (reasoning), and we locked down how they talk to each other.

If you've followed the recent OpenClaw saga (500 runaway iMessages, $20 burned overnight on a heartbeat cron, 341 malicious skills on ClawHub, three CVEs in three days), you've seen what happens when agent systems lack this discipline. PaxMachina started as our answer to the same problems before OpenClaw made them front-page news.

Here are the four architectural decisions that turned our prototype into a production loop.

PaxMachina: Agents as Ops Pipeline (the four nodes of the DAG)

  • 📜 Scripts: deterministic gathering, no LLM tokens (the "muscle")
  • 🧠 QMD backend: atomic state, no drift
  • 🎖️ Chief of Staff: LLM reasoning, only when needed (the "brain": the only place tokens are spent)
  • 📋 Ledger: Telegram + TASKS.md, immutable audit trail

1. Script-based gathering (saving tokens by not asking nicely)

Early on, we had agents "browsing" the web. One agent would ask another to "go check the pricing page." The second agent would spin up a browser, scroll, read, and summarize. It was painfully slow and expensive.

We replaced that with script-based gathering. Fetching a URL, stripping the DOM, and extracting diffs is a deterministic task. We don't need an LLM to decide how to use curl or a headless browser.

Now, the "Research" node in our DAG is mostly Python scripts that run rigid, efficient collection jobs. They hand off clean, token-efficient summaries to the LLM only when reasoning is actually required. This dropped our token costs by ~40% and made the inputs for the Chief of Staff agent consistent every time.
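The gathering step needs determinism, not intelligence. A minimal sketch of what one of these collection jobs does, assuming a fetch-strip-diff pipeline (function names and the sentence-level diff granularity are illustrative, not our actual scripts):

```python
import difflib
import hashlib
import re


def strip_dom(html: str) -> str:
    """Drop scripts, styles, and tags; collapse whitespace to plain text."""
    html = re.sub(r"(?s)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()


def page_diff(old_text: str, new_text: str) -> str:
    """Return only the changed sentences: the token-efficient payload for the LLM."""
    diff = difflib.unified_diff(
        old_text.split(". "), new_text.split(". "), lineterm="", n=0
    )
    return "\n".join(
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )


def fingerprint(text: str) -> str:
    """Cheap change detection: skip the LLM entirely when nothing changed."""
    return hashlib.sha256(text.encode()).hexdigest()
```

If the fingerprint matches the cached one, the DAG node short-circuits and no tokens are spent at all.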

The OpenClaw lesson: One user's "heartbeat" cron sent 120,000 tokens of context to Claude Opus just to check the time every 30 minutes. When your agent can invoke arbitrary tools, you're one bad loop away from a $20 overnight bill. Or worse. Deterministic scripts don't hallucinate token-expensive approaches.

2. The QMD memory backend (handling state at speed)

Managing agent memory is usually an afterthought. Then you have six agents trying to read and write context simultaneously. We initially used simple JSON dumps, then generic vector stores, but both got messy. JSON files had race conditions; vector stores were too slow for rapid state updates.

We moved to a custom QMD (Query-Memory-Document) backend: a shared memory buffer optimized for retrieval speed with distinct "memory banks" (short-term working memory vs. long-term archival). It handles embedding management and context windowing so the agents don't have to.

The design goal was similar to Apache Wayang: a memory layer decoupled from any specific model provider or storage engine. Wayang abstracts across execution engines; QMD abstracts across memory access patterns. It just holds the state, fast.

The OpenClaw lesson: Their troubleshooting docs warn about "state drift" when multiple workspace directories cause auth and context conflicts. We've seen the same failure mode. QMD gives us a single source of truth with atomic updates. No split-brain scenarios.

3. Telegram as an immutable ledger

This is the most controversial part of our stack, but it works. We use Telegram not just for notifications, but as the immutable ledger of the system.

In many agent frameworks, the "logs" are buried in a database. In PaxMachina, the Chief of Staff broadcasts state changes and decisions to a dedicated, read-only Telegram channel. Because the channel is linear and timestamped, it acts as a perfect audit trail.

Why not Kafka or a proper event store? We considered it. But Telegram gave us human-readable audit trails with zero infrastructure overhead. We can scroll back and see exactly what the system decided, when, and why. No need to spin up a consumer or query a topic. For our scale (internal ops, not enterprise SaaS), the tradeoff is worth it.

The Hygiene Rule: Workers report to the Chief of Staff, and the Chief of Staff writes to the ledger. If it's not in the channel, it didn't happen. This forces us to be disciplined about what constitutes a "decision" versus just internal noise.
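What counts as a ledger entry is deliberately rigid. A sketch of the entry format before the Chief of Staff posts it to the channel (field names are illustrative; the posting itself is an ordinary Bot API sendMessage call):

```python
from datetime import datetime, timezone


def ledger_entry(agent: str, decision: str, reason: str) -> str:
    """Format one immutable ledger line. If it's not in the channel,
    it didn't happen -- so every decision goes through this."""
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"[{ts}] {agent} DECIDED: {decision} | WHY: {reason}"
```

Forcing a WHY field on every entry is what separates decisions from internal noise.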

4. Task ledger protocol (preventing runaway loops)

The 500-message OpenClaw incident wasn't a bug. It was a missing feedback loop. The agent had no explicit record of what it had already done, so it kept doing it.

Agent Loops: Uncontrolled vs Controlled

⚠️ Uncontrolled (OpenClaw-style): agent receives task → calls tool → tool returns → agent reasons → calls tool again. No checkpoint at any step, no record of prior calls. Result: a runaway loop. 500 messages, $20 overnight, state drift.

Controlled (PaxMachina-style): task logged in TASKS.md (checkpoint) → script gathers, no LLM (checkpoint) → Chief of Staff reasons (checkpoint) → decision posted to the Telegram ledger → TASKS.md marked DONE. Result: traceable, auditable, token-efficient.

The difference isn't intelligence. It's context hygiene.

We added a Task Ledger Protocol directly into PaxMachina's SOUL file:

  • On every new request: open/update memory/TASKS.md
  • When delegating: log assignedAt, agent, expected output location
  • When results arrive: mark DONE + link to commit/file
  • If no result by SLA (15–30 min): mark NUDGED + resend

The SQUAD file enforces the other half: "Workers must report back to PaxMachina" and "PaxMachina tracks work in memory/TASKS.md."

This creates a closed loop. Every task has a paper trail. If something runs away, we can see exactly where the chain broke. No more "the agent just kept going." The agent can't proceed without updating its ledger.
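The real protocol lives in the SOUL file as prose and TASKS.md is Markdown, but the mechanics are simple enough to sketch. A JSON-backed version (field names follow the protocol above; the ledger format is simplified for brevity):

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


def log_delegation(ledger: Path, task_id: str, agent: str, output_path: str) -> None:
    """On delegation: record who got the task, when, and where results land."""
    tasks = json.loads(ledger.read_text()) if ledger.exists() else {}
    tasks[task_id] = {"agent": agent, "assignedAt": _now(),
                      "output": output_path, "status": "PENDING"}
    ledger.write_text(json.dumps(tasks, indent=2))


def mark_done(ledger: Path, task_id: str, result_link: str) -> None:
    """On result: close the loop with a link to the commit/file."""
    tasks = json.loads(ledger.read_text())
    tasks[task_id].update(status="DONE", result=result_link, doneAt=_now())
    ledger.write_text(json.dumps(tasks, indent=2))
```

Anything still PENDING past its SLA is what the NUDGED state catches: the sweep only has to scan for stale assignedAt timestamps.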

Security principles (keeping the DAG safe)

Running automated ops pipelines requires paranoia. CrowdStrike's analysis of OpenClaw describes how misconfigured agents become "AI backdoor agents capable of taking orders from adversaries." Prompt injection isn't just a data leak; it's a foothold for automated lateral movement. We learned some of these lessons the hard way after an early prototype started modifying its own config files.

  • Least privilege: The script that fetches URLs cannot write to the database. The agent that writes code cannot deploy it.
  • Operator != Admin: The human overseeing the loop (me) has to sudo up for destructive actions; I don't run as admin by default.
  • Allowlists, not capabilities: We don't let agents execute arbitrary shell commands. They pick from a library of approved tasks. This is the opposite of OpenClaw's skill model, where 26% of analyzed skills contained vulnerabilities.
  • No external skill repositories: ClawHub has 341+ confirmed malicious skills. We don't pull code from registries the agent can browse autonomously.
  • DR Backups: We assume the memory state will get corrupted. We snapshot the QMD and the ledger so we can replay the week if needed.
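The allowlist principle is mechanically trivial, which is the point. A sketch (task names and the registry shape are illustrative):

```python
APPROVED_TASKS = {
    "fetch_pricing_page": lambda: "ran fetch script",
    "summarize_diffs": lambda: "ran summarizer",
}


def dispatch(task_name: str):
    """Agents pick from approved tasks; anything else is refused outright.
    No shell access means no 'creative' command construction."""
    if task_name not in APPROVED_TASKS:
        raise PermissionError(f"task not on allowlist: {task_name}")
    return APPROVED_TASKS[task_name]()
```

The failure mode is a loud PermissionError in the ledger, not a silently executed shell command.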

Skill auditing (because ClawHub is a minefield)

341 malicious skills on ClawHub means you can't trust the registry. We run a layered audit before enabling any skill.

1. Inventory installed vs allowed

Installed skills live under ~/workspace/skills/ and npm global skills. Allowed skills are configured via skills.allowBundled and skills.entries.<skill>.enabled in openclaw.json. Run openclaw doctor to see the current state. Ours shows: Eligible: 15, Blocked by allowlist: 39. First step: diff installed vs allowed and remove anything not explicitly needed.
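Step 1 is a set difference. A sketch, assuming you've already parsed the skill names out of openclaw.json and the skills directory (parsing omitted; the function name is illustrative):

```python
def audit_inventory(installed: set[str], allowed: set[str]) -> dict[str, set[str]]:
    """Flag skills present on disk but not explicitly allowed (remove these),
    and allowlist entries that aren't installed (stale config to clean up)."""
    return {
        "remove": installed - allowed,
        "stale": allowed - installed,
    }
```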

2. Static scan

For every installed skill folder, grep for obvious exfil and persistence patterns:

# Exfil patterns
grep -RInE 'curl|wget|\bnc\b|\bssh\b|\bscp\b|rsync' "$SKILL_DIR"

# Encoding (review matches in context; not all base64 is malicious)
grep -RInE 'base64|openssl enc|\btar\b' "$SKILL_DIR"

# Secret access
grep -RInE 'process\.env|\.config/openclaw/env|\.ssh|\.git-credentials' "$SKILL_DIR"

# Persistence
grep -RInE '\.bashrc|\bcron\b|systemd' "$SKILL_DIR"

Also check package.json scripts. The postinstall and preinstall hooks are where supply chain attacks hide.
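Our actual audit is a shell script, but the hook check is worth showing on its own. A Python sketch (the hook list covers the standard npm install-time lifecycle scripts):

```python
import json
from pathlib import Path

SUSPECT_HOOKS = ("preinstall", "install", "postinstall", "prepare")


def install_hooks(skill_dir: Path) -> list[str]:
    """Return any npm lifecycle hooks a skill declares. Supply chain
    attacks hide here because hooks run at install time, not at invocation."""
    pkg = skill_dir / "package.json"
    if not pkg.exists():
        return []
    scripts = json.loads(pkg.read_text()).get("scripts", {})
    return [h for h in SUSPECT_HOOKS if h in scripts]
```

Any non-empty result goes straight into the audit report for manual review.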

3. Provenance check

For ClawHub-installed skills, inspect .clawhub/origin.json and pin versions. Prefer skills with known publishers, minimal dependencies, and small code surface. If a skill pulls in 47 transitive dependencies to format a date, reject it.

4. Sandbox test

Run skills with security=allowlist, no network (or monitored outbound DNS/HTTP), and minimal env (no tokens). See if they still try to touch secrets. If they do, they're malicious or badly written. Either way, reject.

5. Hardening defaults

Even if a skill passes audit, assume it could be compromised later:

  • Keep exec.security on deny/allowlist (never "full")
  • Keep secrets only in env/service, never in repo files
  • Run scheduled jobs via your own scripts, not arbitrary skill code

What we run

We automated this as /home/clawdia/workspace/scripts/skills_audit.sh. It inventories installed skills, checks for install hooks, greps for high-risk patterns, and outputs a report to memory/security/skills_audit_YYYY-MM-DD.md.

The report becomes a control, not just documentation. We set an explicit allowlist in openclaw.json with only what we actually use: gog, github, ddg-search, xai, bird, summarize, gsc. Everything else becomes inert even if installed.

Production configs that actually matter

OpenClaw's defaults assume you want maximum recall and frequent syncing. In practice, this burns tokens and causes the context bloat that leads to runaway loops. Here's what we tuned.

Memory tuning

The default workspace indexing (workspace/**/*.md) pulled in everything. We stripped it back to only what agents actually need:

includeDefaultMemory: true          # MEMORY.md + memory/**/*.md only
memory.qmd.update.interval: 10m     # was 5m
debounceMs: 30000                   # was 15000
onBoot: true

# Recall caps
maxResults: 6
maxSnippetChars: 700
maxInjectedChars: 4000
timeoutMs: 6000

The maxInjectedChars: 4000 cap is the important one. Without it, a single recall can dump 20k tokens of "helpful context" into every request.
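The caps compose: per-result, per-snippet, then a total injection budget. A sketch of how we understand that capping logic to work (the real implementation is OpenClaw's; this just illustrates why maxInjectedChars is the one that saves you):

```python
def cap_recall(snippets: list[str], max_results: int = 6,
               max_snippet_chars: int = 700, max_injected_chars: int = 4000) -> str:
    """Trim each snippet, then stop injecting once the total budget is spent."""
    out, budget = [], max_injected_chars
    for s in snippets[:max_results]:
        s = s[:max_snippet_chars]
        if len(s) > budget:
            s = s[:budget]  # partial final snippet rather than blowing the budget
        if s:
            out.append(s)
        budget -= len(s)
        if budget <= 0:
            break
    return "\n---\n".join(out)
```

Without the final budget check, six 700-character snippets is already 4,200 characters per request, on every request.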

Event-driven, not heartbeat

The $20 overnight burn happened because a heartbeat cron kept firing. We don't use heartbeat for routine work. Instead:

Instant triggers: Gmail webhooks, Telegram messages, and (optionally) GitHub webhooks fire agents immediately when something happens.

Cron sweeps: Periodic scripts check outboxes and queues (memory/rss-outbox.md, content-pipeline/inbox/), run the relevant agent, and write outputs. Follow-ups happen because pipeline scripts call the next agent (Oracle → Scribe, Cipher → send). Not because heartbeat is looping.
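A sweep is just: read the outbox, dispatch each item, clear it. A sketch (the dispatch callable stands in for invoking the next agent; the one-line-per-item outbox format is an assumption for illustration):

```python
from pathlib import Path


def sweep_outbox(outbox: Path, dispatch) -> int:
    """Process each pending line, then truncate the outbox. The follow-up
    happens because this script calls the next agent -- not a heartbeat."""
    if not outbox.exists():
        return 0
    items = [line for line in outbox.read_text().splitlines() if line.strip()]
    for item in items:
        dispatch(item)
    outbox.write_text("")  # processed items are cleared, never re-fired
    return len(items)
```

An empty outbox costs one file stat, not 120,000 tokens of context.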

We use HEARTBEAT.md only for true alerts: DR failures, watcher down, security incidents. Not for routine follow-ups.

API guard wrappers

Even with good architecture, a misbehaving agent can burn expensive API calls. We added a hard guard wrapper at /home/clawdia/workspace/scripts/xai_guard.sh:

# Rate limits
max_requests_per_minute: 2
max_requests_per_day: 20

# Per-task caps
finance_briefing: max 2 xAI calls/day (one run + one retry)

The search fallback chain respects these limits:

1. Brave (primary, when not rate-limited)
2. DDG via ddg-search skill
3. Gemini grounded search when quota allows
4. xAI search (only after Gemini fails, still guarded)

If an agent tries to call xAI in a loop, the wrapper kills it after 2 requests. Defense in depth.

Telegram hygiene

We keep Telegram channels clean by deleting raw messages after processing. But we still need the audit trail. The solution: daily executive digests.

A script (telegram_digest_daily.sh) runs at end of day, reads WORKING.md plus the day's memory tail, and writes a 5-10 bullet summary into memory/YYYY-MM-DD.md. No Telegram message sent. Secrets redacted by policy.
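The "secrets redacted by policy" step is the part worth showing. A sketch of the kind of scrubbing the digest applies before anything lands on disk (these two patterns are illustrative, not our full policy):

```python
import re

REDACT_PATTERNS = [
    (re.compile(r"(?i)(api[_-]?key|token|secret)\s*[=:]\s*\S+"), r"\1=[REDACTED]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
]


def redact(text: str) -> str:
    """Scrub secrets and emails before a summary lands in the daily memory file."""
    for pattern, replacement in REDACT_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Redacting at write time, rather than read time, means a leaked memory file leaks nothing.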

OpenClaw supports multi-tenant deployments out of the box. PaxMachina serves multiple users, so we set session.dmScope="per-channel-peer" to prevent context leaking between sessions.

Memory improves over time without storing raw chat. The channel stays usable. The audit trail stays complete.

The result: technical documentation grounded in evidence

We use this stack to power the technical documentation for KafScale: architecture guides, API references, and implementation deep dives. Because the gathering is script-based and the memory is structured, the output is grounded in actual codebase state and benchmarks, not hallucinations.

The E-E-A-T compliance happens naturally: the system can only cite what it has actually retrieved and verified. When we document a performance claim, there's a traceable path from the benchmark script to the prose.

The lesson? Don't build a chatbot. Build a pipeline.

If you're wrestling with agent architectures for technical workflows, I help teams design systems that don't burn tokens or trust. See how I work with teams.

If you need help with distributed systems, backend engineering, or data platforms, check my Services.
