Eu estava usando o manipulador de eventos do Nagios como uma solução simples.
No servidor NRPE:
command[check_crond]=/usr/lib64/nagios/plugins/check_procs -c 1: -C crond
command[autostart_crond]=sudo /etc/init.d/crond start
command[stop_crond]=sudo /etc/init.d/crond stop
Não se esqueça de adicionar o nagios
usuário ao grupo sudoers:
nagios ALL=(ALL) NOPASSWD:/usr/lib64/nagios/plugins/, /etc/init.d/crond
e desabilite requiretty
:
Defaults:nagios !requiretty
No servidor Nagios:
services.cfg
define service{
use generic-service
host_name cpc_3.145
service_description crond
check_command check_nrpe!check_crond
event_handler autostart_crond!cpc_2.93
process_perf_data 0
contact_groups admin,admin-sms
}
commands.cfg
define command{
command_name autostart_crond
command_line $USER1$/eventhandlers/autostart_crond.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $ARG1$
}
autostart_crond.sh
#!/bin/bash
case "$1" in
OK)
/usr/local/nagios/libexec/check_nrpe -H $4 -c stop_crond
;;
WARNING)
;;
UNKNOWN)
/usr/local/nagios/libexec/check_nrpe -H $4 -c autostart_crond
;;
CRITICAL)
/usr/local/nagios/libexec/check_nrpe -H $4 -c autostart_crond
;;
esac
exit 0
mas mudei para usar Pacemaker e Corosync, pois é a melhor solução para garantir que o recurso seja executado apenas em um nó por vez.
Aqui estão as etapas que eu fiz:
Verifique se o script init crond é compatível com LSB . No meu CentOS, preciso alterar o status de saída de 1 para 0 (se iniciar uma corrida ou parar uma parada) para atender aos requisitos:
start() {
echo -n $"Starting $prog: "
if [ -e /var/lock/subsys/crond ]; then
if [ -e /var/run/crond.pid ] && [ -e /proc/`cat /var/run/crond.pid` ]; then
echo -n $"cannot start crond: crond is already running.";
failure $"cannot start crond: crond already running.";
echo
#return 1
return 0
fi
fi
stop() {
echo -n $"Stopping $prog: "
if [ ! -e /var/lock/subsys/crond ]; then
echo -n $"cannot stop crond: crond is not running."
failure $"cannot stop crond: crond is not running."
echo
#return 1;
return 0;
fi
então ele pode ser adicionado ao marcapasso usando:
# crm configure primitive Crond lsb:crond \
op monitor interval="60s"
crm configure show
node SVR022-293.localdomain
node SVR233NTC-3145.localdomain
primitive Crond lsb:crond \
op monitor interval="60s"
property $id="cib-bootstrap-options" \
dc-version="1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
stonith-enabled="false" \
no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"
status crm
============
Last updated: Fri Jun 7 13:44:03 2013
Stack: openais
Current DC: SVR233NTC-3145.localdomain - partition with quorum
Version: 1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, 2 expected votes
1 Resources configured.
============
Online: [ SVR022-293.localdomain SVR233NTC-3145.localdomain ]
Crond (lsb:crond): Started SVR233NTC-3145.localdomain
Testando o failover parando o Pacemaker e o Corosync no 3.145:
[root@3145 corosync]# service pacemaker stop
Signaling Pacemaker Cluster Manager to terminate: [ OK ]
Waiting for cluster services to unload:...... [ OK ]
[root@3145 corosync]# service corosync stop
Signaling Corosync Cluster Engine (corosync) to terminate: [ OK ]
Waiting for corosync services to unload:. [ OK ]
verifique o status do cluster no 2.93:
============
Last updated: Fri Jun 7 13:47:31 2013
Stack: openais
Current DC: SVR022-293.localdomain - partition WITHOUT quorum
Version: 1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, 2 expected votes
1 Resources configured.
============
Online: [ SVR022-293.localdomain ]
OFFLINE: [ SVR233NTC-3145.localdomain ]
Crond (lsb:crond): Started SVR022-293.localdomain