You're running a server. CPU spikes. Disk fills up. A process dies. You find out hours later when a client complains or a monitoring dashboard finally alerts you. Then you're scrambling to figure out what happened.
Dashboards are reactive. An AI agent that monitors your server continuously is different — it's watching in real time, understands context, and can tell you exactly what's wrong and why, not just that something went red.
This is how we run our infrastructure at MasterClaw. An OpenClaw agent monitors eight servers using simple shell commands, aggregates the data, spots patterns, and alerts us before problems become incidents. No email fatigue. No false positives. Just signal.
What an Agent Actually Monitors
You don't need metrics for everything. An AI agent is most useful when it's watching a small set of critical signals and explaining what's happening. Here's what matters:
- CPU and memory utilization: Is something chewing through resources?
- Disk usage: Are you about to run out of space?
- Process health: Is OpenClaw still running? Is the gateway responding?
- Network connections: Are there unexpected outbound connections?
- System load: Can the server handle the work it's trying to do?
- Log anomalies: Are there repeated errors in systemd or application logs?
The key: you're not recording every metric. You're sampling. Every 5-10 minutes, the agent runs a quick health check, compares it to baseline, and reports what's changed. That's it.
Building the Health Check Script
Start with a simple bash script that the agent can call. It doesn't need to be fancy — it just needs to be fast and parseable.
#!/bin/bash
# /opt/health-check.sh - Server health snapshot
echo "=== SYSTEM HEALTH SNAPSHOT ==="
echo "Timestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
# CPU & Memory
echo ""
echo "CPU and Memory:"
top -bn1 | grep "Cpu(s)" | awk '{print " CPU Load: " $2 " user, " $4 " system"}'
free -m | awk '/^Mem:/ {printf "  Memory: %sMi used / %sMi total (%d%%)\n", $3, $2, $3/$2*100}'
# Disk
echo ""
echo "Disk Usage:"
df -h / | tail -1 | awk '{print " Root: " $3 " used / " $2 " total (" $5 ")"}'
df -h /var | tail -1 | awk '{print " /var: " $3 " used / " $2 " total (" $5 ")"}'
# Process health
echo ""
echo "Process Health:"
systemctl is-active openclaw >/dev/null 2>&1 && echo " ✓ OpenClaw: running" || echo " ✗ OpenClaw: STOPPED"
systemctl is-active nginx >/dev/null 2>&1 && echo " ✓ nginx: running" || echo " ✗ nginx: STOPPED"
systemctl is-active ssh >/dev/null 2>&1 && echo " ✓ SSH: running" || echo " ✗ SSH: STOPPED"
# Top processes by CPU
echo ""
echo "Top processes (CPU):"
ps aux --sort=-%cpu | head -4 | tail -3 | awk '{printf "  %s: %.1f%% CPU (PID %s)\n", $11, $3, $2}'
# System load
echo ""
echo "System Load:"
uptime | awk -F'load average:' '{print "  load average:" $2}'
# Recent errors (last 5 minutes)
echo ""
echo "Recent System Errors:"
errors=$(journalctl --since "5 min ago" --priority=err --no-pager 2>/dev/null | tail -3)
if [ -n "$errors" ]; then echo "$errors" | sed 's/^/  /'; else echo "  None"; fi
This script runs in under a second and gives you the snapshot you need. Make it executable and test it locally:
chmod +x /opt/health-check.sh
/opt/health-check.sh
The script above uses awk to extract and format values, so the output stays consistent and easy for an AI to parse regardless of each tool's default formatting.
The Agent That Watches
Now you need an OpenClaw agent that runs this script periodically. This is a cron job + agent combo. The cron fires every 10 minutes, the agent reads the output, stores the baseline, detects anomalies, and reports.
Here's the agent code (living in your OpenClaw workspace):
// health-monitor.js - OpenClaw Health Monitoring Agent
const fs = require('fs');
const { execSync } = require('child_process');
const HISTORY_FILE = '/root/.openclaw/memory/health-baseline.json';
const ALERT_THRESHOLD = {
cpu: 85, // Alert if CPU > 85%
memory: 80, // Alert if memory > 80%
disk: 90, // Alert if disk > 90%
load: 4 // Alert if load > 4.0
};
async function getHealthSnapshot() {
const output = execSync('/opt/health-check.sh', { encoding: 'utf8' });
return parseHealthOutput(output);
}
function parseHealthOutput(output) {
const lines = output.split('\n');
const data = {
timestamp: new Date().toISOString(),
cpu: null,
memory: null,
disk: null,
load: null,
processes: {},
errors: []
};
for (const line of lines) {
if (line.includes('CPU Load:')) {
const match = line.match(/(\d+\.?\d*)/);
if (match) data.cpu = parseFloat(match[1]);
}
if (line.includes('Memory:')) {
const match = line.match(/(\d+)%/);
if (match) data.memory = parseInt(match[1]);
}
if (line.includes('Root:')) {
const match = line.match(/(\d+)%/);
if (match) data.disk = parseInt(match[1]);
}
if (line.includes('load average:')) {
const match = line.match(/(\d+\.?\d*),/);
if (match) data.load = parseFloat(match[1]);
}
if (line.includes('OpenClaw:')) {
data.processes.openclaw = line.includes('running');
}
}
return data;
}
async function checkForAnomalies(current) {
const issues = [];
if (current.cpu > ALERT_THRESHOLD.cpu) {
issues.push(`⚠️ High CPU: ${current.cpu}%`);
}
if (current.memory > ALERT_THRESHOLD.memory) {
issues.push(`⚠️ High Memory: ${current.memory}%`);
}
if (current.disk > ALERT_THRESHOLD.disk) {
issues.push(`🔴 Low Disk Space: Only ${100 - current.disk}% free`);
}
if (current.load > ALERT_THRESHOLD.load) {
issues.push(`⚠️ System Load: ${current.load} (high)`);
}
if (!current.processes.openclaw) {
issues.push(`🔴 OpenClaw is not running`);
}
return issues;
}
async function run() {
const snapshot = await getHealthSnapshot();
const issues = await checkForAnomalies(snapshot);
if (issues.length > 0) {
console.log(`[HEALTH ALERT] Server Issues Detected:\n${issues.join('\n')}`);
// Send to Telegram, Slack, or your notification system
} else {
console.log(`[HEALTH OK] All systems normal at ${snapshot.timestamp}`);
}
// Store baseline for trend analysis
fs.writeFileSync(HISTORY_FILE, JSON.stringify(snapshot, null, 2));
}
run().catch(console.error);
Install this as a systemd timer or add it to your OpenClaw cron schedule. The key is that it runs every 10 minutes and stays quiet unless something's wrong.
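For the systemd route, a minimal service/timer pair could look like this (a sketch; the unit names and script path are assumptions, adjust to your layout):

```
# /etc/systemd/system/health-monitor.service
[Unit]
Description=OpenClaw health monitoring agent

[Service]
Type=oneshot
ExecStart=/usr/bin/node /opt/health-monitor.js

# /etc/systemd/system/health-monitor.timer
[Unit]
Description=Run the health monitor every 10 minutes

[Timer]
OnCalendar=*:0/10
Persistent=true

[Install]
WantedBy=timers.target
```

Enable it with `systemctl enable --now health-monitor.timer`. If you'd rather use cron, the equivalent crontab entry is `*/10 * * * * /usr/bin/node /opt/health-monitor.js`.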
If your agent alerts on every minor fluctuation, you'll start ignoring it. Set thresholds high enough that alerts mean "something actually needs attention" — not "the server is doing its job."
Connecting Alerts to Action
An alert is useless if nobody sees it. Wire your agent to send notifications through a channel you actually watch. Here's the flow we use:
- Agent detects an issue
- Agent sends a message to a Telegram group with context
- Message includes the problem + a quick fix (if available)
- Agent logs the incident to your memory system for later analysis
The Telegram message is simple and actionable — not a wall of metrics:
🔴 Server Alert: kova-1
━━━━━━━━━━━━━━━━━━━━━━
Low Disk Space: 92% full
/var: 450GB used / 500GB total
Quick fix:
$ journalctl --vacuum-time=2d
(free up old logs)
If that doesn't help:
1. Check what's eating space: `du -sh /*`
2. Remove old backups if safe
3. Scale the disk in Hetzner console
This is why an AI agent beats a dashboard. The agent gives you context and next steps, not just a red light.
Baseline and Trending
After the agent runs a few times, it builds a baseline. This is powerful: it can detect slow growth before it becomes a crisis.
Example: Your disk usage climbs from 60% to 65% to 70% over three days. A threshold alert won't fire at 70%. But an agent that tracks the trend can say: "At this rate, you'll hit 100% in about six days. Act now."
Store each snapshot in memory. Every 24 hours, analyze the trend:
function analyzeTrend(history) {
const last24h = history.filter(s =>
Date.now() - new Date(s.timestamp) < 24 * 60 * 60 * 1000
);
if (last24h.length < 2) return null;
  const latest = last24h[last24h.length - 1];
  const diskGrowth = latest.disk - last24h[0].disk; // percentage points over ~24h
  if (diskGrowth <= 0) return null; // flat or shrinking: nothing to forecast
  const daysToFull = (100 - latest.disk) / diskGrowth;
  if (daysToFull < 7) {
    return `Disk will be full in ${Math.round(daysToFull)} days at current growth rate`;
  }
  return null;
}
This is how you turn monitoring into forecasting. You're not just watching what's happening — you're predicting what will happen and giving your future self time to act.
See how we approach this when running multiple agents on one server — the same health-monitoring principles apply whether you're managing one server or eight.