2012-06-03 | pyinx | linux

Using watchdog for Intelligent Monitoring
by purple_grape
last update 2012-06-03

A watchdog is usually a physical circuit, but this article is about the Linux kernel implementation, softdog. For how it works, see [IBM developerWorks](http://www.ibm.com/developerworks/cn/linux/l-cn-watchdog/index.html).

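watchdog is very robust. It is a kernel module that can soft-reboot the system when resources are exhausted, avoiding the heavy losses a hard reboot can cause.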

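By default, the watchdog daemon periodically writes to the /dev/watchdog device to prove that the system is running normally (the kernel driver expects a write at least once a minute); when it detects a problem, it triggers a reboot. This is very useful on heavily loaded systems. Rebooting may often be the "wise move" that cures all ills, but administrators can never quite accept it. Apart from rebooting the system, is watchdog really good for nothing else?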
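To make the mechanism concrete, here is a minimal sketch of what "feeding" the device looks like from a shell. The helper name feed_watchdog and its parameters are made up for illustration; the real watchdog daemon is written in C and does far more. The sketch assumes the standard Linux watchdog driver behaviour: opening the device arms the timer, any write resets it, and writing the character V just before closing asks the driver to disarm (drivers loaded with nowayout ignore that request).

```shell
#!/usr/bin/env bash
# feed_watchdog DEVICE ROUNDS INTERVAL_SECONDS  (hypothetical helper)
# Keeps DEVICE fed for ROUNDS writes, then disarms it with the magic "V".
feed_watchdog() {
    local device=$1 rounds=$2 interval=$3
    local i=0

    exec 3>"$device" || return 1    # opening the device arms the timer

    while [ "$i" -lt "$rounds" ]; do
        printf '.' >&3              # any write resets the countdown
        sleep "$interval"
        i=$((i + 1))
    done

    printf 'V' >&3                  # magic close: ask the driver to disarm
    exec 3>&-                       # close the file descriptor
}
```

For example, `feed_watchdog /dev/watchdog 6 10` would keep the device fed for about a minute. Try it against a scratch file first: if feeding stops without the final V, the machine reboots.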

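The real power of watchdog lies in the two hooks it provides: test and repair (test-binary and repair-binary in the configuration file).

The test option provides a checking mechanism. Once a test fails, watchdog takes action to rescue the system, which is the repair step.

The repair option defines how to rescue the system. The default is to reboot, but it can be customized, usually with a shell script. That is the point of this article.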

Sample watchdog configuration file

    #ping = 172.31.14.1
    #ping = 172.26.1.255
    #interface = eth0
    #file = /var/log/messages
    #change = 1407

    # Uncomment to enable test. Setting one of these values to '0' disables it.
    # These values will hopefully never reboot your machine during normal use
    # (if your machine is really hung, the loadavg will go much higher than 25)
    #max-load-1 = 24
    #max-load-5 = 18
    #max-load-15 = 12

    # Note that this is the number of pages!
    # To get the real size, check how large the pagesize is on your machine.
    #min-memory = 1

    #repair-binary = /usr/sbin/repair
    #test-binary =
    #test-timeout =

    #watchdog-device = /dev/watchdog

    # Defaults compiled into the binary
    #temperature-device =
    #max-temperature = 120

    # Defaults compiled into the binary
    #admin = root
    #interval = 10
    #logtick = 1

    # This greatly decreases the chance that watchdog won't be scheduled before
    # your machine is really loaded
    realtime = yes
    priority = 1

    # Check if syslogd is still running by enabling the following line
    #pidfile = /var/run/syslogd.pid

If you have installed watchdog and are running it, a few basic options in the configuration file should be enabled. They can only help.

    min-memory = 1
    watchdog-device = /dev/watchdog
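If you would rather not edit the file by hand, the following is a rough sketch of a helper that uncomments an option or appends it when it is missing. set_option is a hypothetical name, and the sketch assumes every option sits on its own "key = value" line, as in the sample above. Note that it rewrites every line with the given key, so it is not suitable for options that may appear more than once, such as ping.

```shell
#!/usr/bin/env bash
# set_option FILE KEY VALUE  (hypothetical helper)
# Replaces an existing "KEY = ..." line, commented out or not, or appends one.
set_option() {
    local file=$1 key=$2 value=$3

    if grep -Eq "^#?[[:space:]]*${key}[[:space:]]*=" "$file"; then
        # Rewrite the matching line in place; sed keeps a .bak copy briefly.
        sed -i.bak -E "s|^#?[[:space:]]*${key}[[:space:]]*=.*|${key} = ${value}|" "$file"
        rm -f "$file.bak"
    else
        printf '%s = %s\n' "$key" "$value" >>"$file"
    fi
}
```

Assuming the configuration lives at /etc/watchdog.conf, the two options above become `set_option /etc/watchdog.conf min-memory 1` and `set_option /etc/watchdog.conf watchdog-device /dev/watchdog`.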

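As the sample shows, the watchdog configuration file can monitor the network, system load, memory, process PID files and more. Isn't that exactly what administrators monitor every day?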

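Now for the main topic, intelligent monitoring. By that I mean unattended monitoring: when a problem is found it is repaired automatically, and if it cannot be repaired automatically, the administrator is notified to fix it by hand.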

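Take monitoring the network interface eth0 as an example. Enable the option interface = eth0, then enable repair-binary = /etc/watchdog.d/repair.sh. Here I point the repair action at a script, repair.sh, with the following contents: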

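    #!/usr/bin/env bash

    # Query eth0; discard the output, only the exit status matters.
    ifconfig eth0 >/dev/null 2>&1

    # If the query failed, try to bring the interface back up.
    if [ "$?" != "0" ]; then
        ifup eth0
    fi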

Simulate a failure by taking eth0 down manually. Wait a moment and you will see the miracle happen, o(∩∩)o… haha!!

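Because the repair action is handed over to a script, anything becomes possible! As long as the environment allows it, you can send email, send an SMS, play music, watch CCAV. In short, it is entirely up to you.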
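As an illustration of "fix it, and tell the admin if that fails", here is a minimal sketch of a slightly richer repair script. It assumes watchdog passes the error code as the first argument and treats exit status 0 as "repaired". LOG_FILE, FIX_CMD, notify_admin and repair are all names invented for this example, and notify_admin only appends to a log file; swap in mail or SMS as you see fit.

```shell
#!/usr/bin/env bash
# Hypothetical settings; both can be overridden from the environment.
LOG_FILE=${LOG_FILE:-/tmp/watchdog-repair.log}
FIX_CMD=${FIX_CMD:-"ifup eth0"}

# notify_admin MESSAGE: placeholder notification, just appends to LOG_FILE.
notify_admin() {
    printf '%s repair: %s\n' "$(date '+%F %T')" "$1" >>"$LOG_FILE"
}

# repair ERROR_CODE: run FIX_CMD and report the outcome.
# Returns 0 if the fix command succeeded, 1 otherwise.
repair() {
    local err_code=$1

    if $FIX_CMD >/dev/null 2>&1; then
        notify_admin "error $err_code fixed by '$FIX_CMD'"
        return 0
    fi

    notify_admin "error $err_code NOT fixed, manual action needed"
    return 1
}
```

To use it as a repair script, end the file with `repair "$1"` followed by `exit $?` so that watchdog sees the result.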

The other monitoring options in the configuration file can be used in the same way.

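Suppose I have some custom services that I normally monitor with a looping script. How do I monitor them with watchdog? This is where the test option comes in. The default check merely writes to the /dev/watchdog device at a regular interval, which is idle work; the test option lets the watchdog run other checks to detect problems.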

Enable the option test-binary = /etc/watchdog.d/test.sh. Taking a process named abc as an example:

    #!/usr/bin/env bash

    # Count the running processes whose name matches abc.
    process_number=$(pgrep abc | wc -l)

    if [ "$process_number" -eq 0 ]; then
        echo "process abc is not running"
        exit 1
    fi

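As the script says, if the number of abc processes is 0, it exits with status 1: a problem has been detected!
In response, watchdog will try to repair it. As for how to write the repair script, you know what to do!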
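The same idea extends to several services at once. The sketch below is one possible way to do it, not the only one. check_process and check_all are hypothetical helper names, and the sketch assumes, as in the script above, that a non-zero exit status from the test script is what signals a failure. pgrep -x matches the exact process name.

```shell
#!/usr/bin/env bash
# check_process NAME: succeed if a process with exactly this name is running.
check_process() {
    pgrep -x "$1" >/dev/null 2>&1
}

# check_all NAME...: fail on the first process that is not running.
check_all() {
    local name
    for name in "$@"; do
        if ! check_process "$name"; then
            echo "process $name is not running"
            return 1
        fi
    done
    return 0
}
```

A test script would then end with something like `check_all abc sshd` followed by `exit $?`, where the process names are whatever you need to watch.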

Still using while loops? You're out of date!

Appendix:
The simplest installation guide (CentOS)

    yum install watchdog -y
    modprobe softdog
    chkconfig watchdog on
    /etc/init.d/watchdog start
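After running the commands above, a quick sanity check is to confirm that the device node actually exists. This is only a sketch: watchdog_ready is a hypothetical helper name, and it checks nothing beyond the presence of a character device at the given path.

```shell
#!/usr/bin/env bash
# watchdog_ready [DEVICE]: succeed only if DEVICE exists as a character device.
# DEVICE defaults to /dev/watchdog.
watchdog_ready() {
    local device=${1:-/dev/watchdog}

    if [ -c "$device" ]; then
        echo "$device is present"
        return 0
    fi

    echo "$device is missing; is the softdog module loaded?"
    return 1
}
```

Calling `watchdog_ready` with no argument checks /dev/watchdog, the path used in the configuration file above.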

Reposted from: http://bbs.linuxtone.org/thread-19567-1-1.html