You're running a server. CPU spikes. Disk fills up. A process dies. You find out hours later when a client complains or a monitoring dashboard finally alerts you. Then you're scrambling to figure out what happened.
Dashboards are reactive. An AI agent that monitors your server continuously is different — it's watching in real time, understands context, and can tell you exactly what's wrong and why, not just that something went red.
This is how we run our infrastructure at MasterClaw. An OpenClaw agent monitors eight servers using simple shell commands, aggregates the data, spots patterns, and alerts us before problems become incidents. No email fatigue. No false positives. Just signal.
What an Agent Actually Monitors
You don't need metrics for everything. An AI agent is most useful when it's watching a small set of critical signals and explaining what's happening. Here's what matters:
- CPU and memory utilization: Is something chewing through resources?
- Disk usage: Are you about to run out of space?
- Process health: Is OpenClaw still running? Is the gateway responding?
- Network connections: Are there unexpected outbound connections?
- System load: Can the server handle the work it's trying to do?
- Log anomalies: Are there repeated errors in systemd or application logs?
The key: you're not recording every metric. You're sampling. Every 5-10 minutes, the agent runs a quick health check, compares it to baseline, and reports what's changed. That's it.
Building the Health Check Script
Start with a simple bash script that the agent can call. It doesn't need to be fancy — it just needs to be fast and parseable.
#!/bin/bash
# /opt/health-check.sh - Server health snapshot
echo "=== SYSTEM HEALTH SNAPSHOT ==="
echo "Timestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
# CPU & Memory
echo ""
echo "CPU and Memory:"
top -bn1 | grep "Cpu(s)" | awk '{print " CPU Load: " $2 " user, " $4 " system"}'
free -m | awk '/^Mem:/ {printf "  Memory: %sMi used / %sMi total (%d%%)\n", $3, $2, $3/$2*100}'
# Disk
echo ""
echo "Disk Usage:"
df -h / | tail -1 | awk '{print " Root: " $3 " used / " $2 " total (" $5 ")"}'
df -h /var | tail -1 | awk '{print " /var: " $3 " used / " $2 " total (" $5 ")"}'
# Process health
echo ""
echo "Process Health:"
systemctl is-active openclaw >/dev/null 2>&1 && echo " ✓ OpenClaw: running" || echo " ✗ OpenClaw: STOPPED"
systemctl is-active nginx >/dev/null 2>&1 && echo " ✓ nginx: running" || echo " ✗ nginx: STOPPED"
systemctl is-active ssh >/dev/null 2>&1 && echo " ✓ SSH: running" || echo " ✗ SSH: STOPPED"
# Top processes by CPU
echo ""
echo "Top processes (CPU):"
ps aux --sort=-%cpu | head -4 | tail -3 | awk '{printf "  %s: %.1f%% CPU (PID %s)\n", $11, $3, $2}'
# System load
echo ""
echo "System Load:"
uptime | awk -F'load average:' '{print "  load average:" $2}'
# Recent errors (last 5 minutes)
echo ""
echo "Recent System Errors:"
errors=$(journalctl --since "5 min ago" --priority=err --no-pager 2>/dev/null | tail -3)
if [ -n "$errors" ]; then echo "$errors" | sed 's/^/  /'; else echo "  None"; fi
This script runs in under a second and gives you the snapshot you need. Make it executable and test it locally:
chmod +x /opt/health-check.sh
/opt/health-check.sh
The script above uses awk to extract and format values, so the output stays consistent and easy for an AI to parse regardless of each tool's default formatting.
The Agent That Watches
Now you need an OpenClaw agent that runs this script periodically. This is a cron job + agent combo. The cron fires every 10 minutes, the agent reads the output, stores the baseline, detects anomalies, and reports.
Here's the agent code (living in your OpenClaw workspace):
// health-monitor.js - OpenClaw Health Monitoring Agent
const fs = require('fs');
const { execSync } = require('child_process');
const HISTORY_FILE = '/root/.openclaw/memory/health-baseline.json';
const ALERT_THRESHOLD = {
cpu: 85, // Alert if CPU > 85%
memory: 80, // Alert if memory > 80%
disk: 90, // Alert if disk > 90%
load: 4 // Alert if load > 4.0
};
async function getHealthSnapshot() {
const output = execSync('/opt/health-check.sh', { encoding: 'utf8' });
return parseHealthOutput(output);
}
function parseHealthOutput(output) {
const lines = output.split('\n');
const data = {
timestamp: new Date().toISOString(),
cpu: null,
memory: null,
disk: null,
load: null,
processes: {},
errors: []
};
for (const line of lines) {
if (line.includes('CPU Load:')) {
const match = line.match(/(\d+\.?\d*)/);
if (match) data.cpu = parseFloat(match[1]);
}
if (line.includes('Memory:')) {
const match = line.match(/(\d+)%/);
if (match) data.memory = parseInt(match[1]);
}
if (line.includes('Root:')) {
const match = line.match(/(\d+)%/);
if (match) data.disk = parseInt(match[1]);
}
if (line.includes('load average:')) {
const match = line.match(/(\d+\.?\d*),/);
if (match) data.load = parseFloat(match[1]);
}
if (line.includes('OpenClaw:')) {
data.processes.openclaw = line.includes('running');
}
}
return data;
}
async function checkForAnomalies(current) {
const issues = [];
if (current.cpu > ALERT_THRESHOLD.cpu) {
issues.push(`⚠️ High CPU: ${current.cpu}%`);
}
if (current.memory > ALERT_THRESHOLD.memory) {
issues.push(`⚠️ High Memory: ${current.memory}%`);
}
if (current.disk > ALERT_THRESHOLD.disk) {
issues.push(`🔴 Low Disk Space: Only ${100 - current.disk}% free`);
}
if (current.load > ALERT_THRESHOLD.load) {
issues.push(`⚠️ System Load: ${current.load} (high)`);
}
if (!current.processes.openclaw) {
issues.push(`🔴 OpenClaw is not running`);
}
return issues;
}
async function run() {
const snapshot = await getHealthSnapshot();
const issues = await checkForAnomalies(snapshot);
if (issues.length > 0) {
console.log(`[HEALTH ALERT] Server Issues Detected:\n${issues.join('\n')}`);
// Send to Telegram, Slack, or your notification system
} else {
console.log(`[HEALTH OK] All systems normal at ${snapshot.timestamp}`);
}
// Store baseline for trend analysis
fs.writeFileSync(HISTORY_FILE, JSON.stringify(snapshot, null, 2));
}
run().catch(console.error);
Install this as a systemd timer or add it to your OpenClaw cron schedule. The key is that it runs every 10 minutes and stays quiet unless something's wrong.
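For the systemd route, a minimal service/timer pair could look like this (a sketch; the unit names and script path are assumptions, adjust to your layout):

```
# /etc/systemd/system/health-monitor.service
[Unit]
Description=OpenClaw health monitoring agent

[Service]
Type=oneshot
ExecStart=/usr/bin/node /opt/health-monitor.js

# /etc/systemd/system/health-monitor.timer
[Unit]
Description=Run the health monitor every 10 minutes

[Timer]
OnCalendar=*:0/10
Persistent=true

[Install]
WantedBy=timers.target
```

Enable it with `systemctl enable --now health-monitor.timer`. If you'd rather use cron, the equivalent crontab entry is `*/10 * * * * /usr/bin/node /opt/health-monitor.js`.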
If your agent alerts on every minor fluctuation, you'll start ignoring it. Set thresholds high enough that alerts mean "something actually needs attention" — not "the server is doing its job."
Connecting Alerts to Action
An alert is useless if nobody sees it. Wire your agent to send notifications through a channel you actually watch. Here's the flow we use:
- Agent detects an issue
- Agent sends a message to a Telegram group with context
- Message includes the problem + a quick fix (if available)
- Agent logs the incident to your memory system for later analysis
The Telegram message is simple and actionable — not a wall of metrics:
🔴 Server Alert: kova-1
━━━━━━━━━━━━━━━━━━━━━━
Low Disk Space: 92% full
/var: 450GB used / 500GB total
Quick fix:
$ journalctl --vacuum-time=2d
(free up old logs)
If that doesn't help:
1. Check what's eating space: `du -sh /*`
2. Remove old backups if safe
3. Scale the disk in Hetzner console
This is why an AI agent beats a dashboard. The agent gives you context and next steps, not just a red light.
Baseline and Trending
After the agent runs a few times, it builds a baseline. This is powerful: it can detect slow growth before it becomes a crisis.
Example: Your disk usage climbs from 60% to 65% to 70% over three days. A threshold alert won't fire at 70%. But an agent that tracks the trend can say: "At this rate, you'll hit 100% in about six days. Act now."
Store each snapshot in memory. Every 24 hours, analyze the trend:
function analyzeTrend(history) {
const last24h = history.filter(s =>
Date.now() - new Date(s.timestamp) < 24 * 60 * 60 * 1000
);
if (last24h.length < 2) return null;
  const latest = last24h[last24h.length - 1];
  const diskGrowth = latest.disk - last24h[0].disk; // percentage points over ~24h
  if (diskGrowth <= 0) return null; // flat or shrinking: nothing to forecast
  const daysToFull = (100 - latest.disk) / diskGrowth;
  if (daysToFull < 7) {
    return `Disk will be full in ${Math.round(daysToFull)} days at current growth rate`;
  }
  return null;
}
This is how you turn monitoring into forecasting. You're not just watching what's happening — you're predicting what will happen and giving your future self time to act.
See how we approach this when running multiple agents on one server — the same health-monitoring principles apply whether you're managing one server or eight.