Heartbeat

==DRBD==
DRBD handles the disks. It works like RAID 1, but over a network.

Sometimes the nodes may refuse to connect. If this happens, check dmesg for the message
 drbd1: Split-Brain detected, dropping connection!
This means that at some point both nodes thought they were primary. That can corrupt the filesystem, so DRBD requires a human to resolve it. The best thing to do is to run
 drbdadm invalidate <resource>
on the host that doesn't have the data you want (usually the backup one). You will then be able to reconnect the nodes, and they will resync.

If the DRBD connection is on the same link as the heartbeat, you will always get a split-brain when the network cable is pulled. This is why we have automatic split-brain recovery enabled: the node that was most recently primary is considered authoritative. You can specify this with
 net {
     after-sb-0pri discard-older-primary;
     after-sb-1pri consensus;
 }
in drbd.conf.

After you change a config file, you can update the node with
 drbdadm adjust <resource>
You can then check whether it worked (or see the other options, if you're curious) with
 drbdsetup /dev/drbd# show

Here are my preferred network options:
 net {
     after-sb-0pri discard-older-primary;
     after-sb-1pri consensus;
     always-asbp;
     timeout 30;
     connect-int 5;
     ping-int 5;
 }
[[User:Jdhutchin@ugcs.caltech.edu|Jdhutchin@ugcs.caltech.edu]] 22:08, 7 June 2008 (PDT)
=AFS=
Doing stuff with AFS on heartbeat is kinda tricky. The problem is that the VLDB expects the volumes to be on the first server (server1, the primary). If you suddenly move them to the backup (server2), the VLDB doesn't know about it, and clients will still try to talk to server1. The simple way to solve this is to run "vos syncvldb" on both servers, and then run "vos syncserv" on both. The problem is that you can't do this if one of the servers is down, and running it on just one server doesn't fix it.
==Moving AFS server1 -> server2==
This is the easy case. server1 has failed and cannot be contacted (bad hardware failure, network cable got unplugged, etc). heartbeat sees this, makes server2 the DRBD primary, mounts the filesystem, salvages, and restarts the fs. You can then run vos syncvldb and vos syncserv, and everything will be OK.
==Moving back==
Just do the opposite.

Latest revision as of 06:50, 12 September 2011

NOTE: We no longer run heartbeat, as it was causing more problems than it was solving. The information on this page will be helpful if you try setting up heartbeat again, although some of the software has changed a lot over the past few years.

Heartbeat is a daemon that handles all the failover in the cluster. For starters, see http://linux-ha.org/ . We are running Heartbeat V2.

Basics

Heartbeat works by managing resources on nodes. A node is a computer in the cluster. A resource is anything that can be moved between nodes. Examples of resources include failover IPs, DRBD disks (which node is primary/secondary), filesystems (used to mount DRBD devices), and services. Resources are usually put into "Resource Groups". All the resources in a resource group run on the same host, and they are started and stopped sequentially.

There are also rules that help specify where services can be run. The most common is a "location" rule.

Quick reference to resource types

drbddisk

Parameter: 1, value: <name of DRBD resource>

Filesystem

  • Parameter: device, value: /dev/whatever
  • Parameter: directory, value: /mountpoint
  • Parameter: fstype, value: xfs|reiserfs|ext3|etc

IPaddr

Parameter: 1, value: <ip address>

Commands

crm_resource

crm_resource is a command that lets you manage resources in the cluster. To use it, you must be a member of the haclient group.

crm_resource -W -r <resource>

Tells you where the specified resource is running.

crm_resource -M -r <resource> [-H host]

Migrates the specified resource off of its current host. If -H is given, it moves the resource to that host. Migrating adds a location constraint with a score of -INFINITY for the resource and its current host (translation: the resource will never run on its current host again), so you probably want to run

crm_resource -U -r <resource>

to remove this rule.
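
The migrate-then-unmigrate sequence above could be wrapped in a small script. This is only a minimal sketch, not something from the original setup: the names run and move_resource and the DRY_RUN variable are hypothetical, and it assumes the host is passed with -H, as in the cleanup command on this page.

```shell
#!/bin/sh
# Hypothetical helper: move a resource to a given host, then remove the
# location constraint that the migration leaves behind.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

move_resource() {
    resource="$1"
    host="$2"
    run crm_resource -M -r "$resource" -H "$host" || return 1
    # Assumption: on a real cluster you would wait here until the resource
    # has actually started on the new host before removing the constraint.
    run crm_resource -U -r "$resource" || return 1
    # Report where the resource ended up.
    run crm_resource -W -r "$resource"
}
```

With DRY_RUN=1 set, calling move_resource <resource> <host> only prints the three crm_resource commands, which is a safe way to check the sequence before touching a live cluster.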

crm_resource -r <resource> -H <host> -C 

"Cleans up" a resource. You must use a real resource name, not a resource group.

crm_resource -r <resource> -p target_role -v (started|stopped)

Sets a resource's target role to either started or stopped.

hb_gui

hb_gui is a graphical interface to the heartbeat cluster. It's quite nice, and is also very useful for configuring services.

crm_mon

crm_mon is a command-line program that pretty-prints the current cluster status. You may also want to try crm_mon -n to show resources grouped by node, or crm_mon -1 to print the status once and exit (instead of refreshing it every 15 seconds or so).

crm_standby

crm_standby allows you to set or clear the standby status of a machine. To put a machine into standby, use

crm_standby -U <host> -v on

To take it out of standby, use either of the following commands:

crm_standby -U <host> -D
crm_standby -U <host> -v off
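
For planned maintenance, the standby commands above can be paired with a one-shot status check. The following is a rough sketch under assumptions: the function names run and maintenance and the DRY_RUN variable are made up for illustration, and only the crm_standby and crm_mon options already shown on this page are used.

```shell
#!/bin/sh
# Hypothetical wrapper: put a node into standby before maintenance and
# take it out again afterwards, printing the cluster status each time.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

maintenance() {
    host="$1"
    action="$2"
    case "$action" in
        start) run crm_standby -U "$host" -v on || return 1 ;;
        end)   run crm_standby -U "$host" -v off || return 1 ;;
        *)
            echo "usage: maintenance <host> start|end" >&2
            return 2
            ;;
    esac
    # One-shot status, grouped by node, to confirm where resources moved.
    run crm_mon -1 -n
}
```

The status check runs after both actions, so the same call shows resources leaving the node on "start" and becoming eligible to return on "end".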

Notes

  • The raw configuration file is /var/lib/heartbeat/crm/cib.xml. Never edit this file by hand; use cibadmin to add things to it. Better yet, use the GUI.
  • When configuring DRBD resources, use "drbddisk" instead of "drbd". The "drbd" resource is a V2 one that uses some complex master-slave stuff. Supposedly it can do cool things if it's set up correctly, but otherwise it's just confusing. "drbddisk" is much simpler and Just Works.

AFS

Doing stuff with AFS on heartbeat is kinda tricky. You need a shared IP, and the VLDB must reference that shared IP. See the ha-openafs scripts to see what's going on.

Postgres

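You need to patch the init scripts so that status returns 3 instead of 4 if no clusters are defined: in /usr/share/postgresql-common/init.d-functions, status() should exit 3 instead of exit 4 in that case. LSB defines status exit code 3 as "program is not running", which is what heartbeat expects from a stopped service, while 4 means "status unknown".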
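
The effect of the patch can be illustrated with a stand-in function. This is a hypothetical sketch, not the actual contents of init.d-functions: the function name pg_status and its argument are assumptions, and only the exit codes reflect the change described above.

```shell
#!/bin/sh
# Hypothetical stand-in for the patched status() logic. The real function
# in init.d-functions looks different; only the exit code matters here.

pg_status() {
    clusters="$1"    # assumed input: space-separated list of cluster names
    if [ -z "$clusters" ]; then
        # LSB status code 3 = "program is not running".
        # The unpatched script used 4 ("status unknown") here.
        return 3
    fi
    # Simplification: a real status check would query each cluster.
    return 0
}
```

Calling pg_status with an empty list returns 3, which reports the service as cleanly stopped on a node where no clusters are defined.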