Install guide

Background

VMware has released its vSphere Storage Appliance, which looks like a nice alternative for small and medium sized setups that want shared storage without buying a SAN or NAS. It does, however, have a few drawbacks.

  1. It only runs on VMware.
  2. It is kind of expensive (you could probably buy a nice NAS for the same money).

The idea

The release of the vSphere Storage Appliance did, however, give me the idea of creating a storage appliance from open source software. The idea for pmsApp was born.

The idea is that you have two (or more) virtualization hosts you want to set up in a cluster. This could be VMware ESXi, Proxmox PVE, VirtualBox or any other virtualization platform that is capable of clustering. In order to get a working cluster, you need shared storage, but you only have the local storage in each virtualization node.
pmsApp takes the local storage on a virtualization host and shares it with the pmsApps on the other virtualization hosts to create shared storage that can be handed back to the virtualization hosts through iSCSI or NFS.

A bit more detail: each pmsApp joins an HA cluster and exports its local storage as an iSCSI target. The main pmsApp in the HA cluster joins the exported storage into a software RAID and exports that as an NFS share or iSCSI target that the virtualization hosts can use as shared storage.
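
A rough sketch of the layering, assuming plain SCSI disk names (the exact commands follow later in this guide):

# On every pmsApp node: export the local data disk as an iSCSI target
#   /dev/sdb  --(tgtd)-->  iqn.2012-03.org.pmsapp:<node>.disk
#
# On the active pmsApp node: log in to all targets and build a software RAID
#   iscsiadm --login  -->  /dev/sdc, /dev/sdd, ...  --(mdadm)-->  /dev/md0
#
# Still on the active node: put a filesystem on the array and hand it back
#   /dev/md0  --(mkfs/mount)-->  /sharedstorage  --(NFS or iSCSI)-->  virtualization hosts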

Decisions

I decided to use CentOS for several reasons.

  1. It is based on Red Hat Enterprise Linux and should be fairly stable.
  2. I am used to working with Red Hat based Linux distributions.
  3. It includes iSCSI target and initiator utilities.
  4. It includes HA cluster capabilities.

Any distribution capable of doing iSCSI target/initiator work, HA clustering and NFS hosting should do, but for now I will focus on CentOS 6.2.

Installation

Create a virtual machine on each virtualization host.
I have used the following settings:

  • 1 CPU
  • 1 GB of RAM
  • 3 NICs
  • 1 hard drive of 8 GB

The reason for 3 NICs is that we want to ensure that there is enough bandwidth for NFS, iSCSI and Heartbeat traffic.

  • 1st NIC, eth0 used for NFS and management traffic
  • 2nd NIC, eth1 used for iSCSI traffic
  • 3rd NIC, eth2 used for heartbeat and other cluster traffic

After the initial installation we will add a second hard drive to the virtual machine. This is the storage that will be shared.
Install a minimal CentOS 6.2; see the video or screenshots.

If you skipped seeing how I installed a minimal CentOS 6.2, you should know that during the installation I also disabled SELinux and the firewall.

# chkconfig iptables off
# vi /etc/selinux/config

# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of these two values:
#     targeted - Targeted processes are protected,
#     mls - Multi Level Security protection.
SELINUXTYPE=targeted
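
After the next reboot you can do a quick sanity check that SELinux and the firewall really are off: getenforce should report Disabled (if the command is not available, checking /etc/selinux/config is enough), and chkconfig should show iptables off in all runlevels.

# getenforce
# chkconfig --list iptables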

Add a second hard drive to the VMs; make sure it is the same size on all the VMs.

Initial configuration

If you have followed the installation like I did, all your VMs will be configured with DHCP and localhost as their hostname.
We need the hosts to have static IPs, and it would probably be nice with real hostnames. Below you will find the configuration I have done on my first node. The other nodes should have similar configurations.
Note: If you do not like editing files with vi, you can install nano:

# yum -y install nano

/etc/resolv.conf

domain pmsapp.org
search pmsapp.org
nameserver 192.168.0.11
nameserver 208.67.222.222
nameserver 208.67.220.220

/etc/sysconfig/network

NETWORKING=yes
HOSTNAME=pmsapp1.pmsapp.org

/etc/sysconfig/network-scripts/ifcfg-eth0

# First NIC, used for NFS and management traffic
# should be reachable by the virtualization hosts
DEVICE=eth0
BOOTPROTO=static
ONBOOT=yes
TYPE=Ethernet
IPADDR=192.168.0.21
NETMASK=255.255.255.0
BROADCAST=192.168.0.255
GATEWAY=192.168.0.1
DNS1=192.168.0.11
DNS2=208.67.222.222
DNS3=208.67.220.220
IPV6INIT=no
USERCTL=no

/etc/sysconfig/network-scripts/ifcfg-eth1

# 2nd NIC, used for iSCSI traffic.
# No DNS or routes are necessary, but all nodes
# should be able to communicate on the subnet
DEVICE=eth1
BOOTPROTO=static
ONBOOT=yes
TYPE=Ethernet
IPADDR=172.16.0.21
NETMASK=255.255.255.0
BROADCAST=172.16.0.255
IPV6INIT=no
USERCTL=no

/etc/sysconfig/network-scripts/ifcfg-eth2

# 3rd NIC, used for heartbeat and cluster traffic.
# No DNS or routes are necessary, but all nodes
# should be able to communicate on the subnet
DEVICE=eth2
BOOTPROTO=static
ONBOOT=yes
TYPE=Ethernet
IPADDR=10.0.0.21
NETMASK=255.255.255.0
BROADCAST=10.0.0.255
IPV6INIT=no
USERCTL=no

/etc/hosts

127.0.0.1	localhost localhost.localdomain
10.0.0.21	pmsapp1	pmsapp1.pmsapp.org
10.0.0.22	pmsapp2	pmsapp2.pmsapp.org
10.0.0.23	pmsapp3	pmsapp3.pmsapp.org

Note that I have added 3 hosts to my /etc/hosts file; this is because I will demo both a two- and a three-node cluster.
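
Once the files are in place, restart the network and do a quick connectivity check between the nodes on each of the three subnets (shown here from pmsapp1 towards pmsapp2; adjust the addresses for your own nodes):

# service network restart
# ping -c 2 192.168.0.22
# ping -c 2 172.16.0.22
# ping -c 2 10.0.0.22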

Installing software

Now we get to installing the extra software we need to share storage and create a cluster. Remember to do this on all nodes.
First we install the High Availability group, which provides the tools for clustering.

# yum -y groupinstall "High Availability"

Next we will install the tools to work with iSCSI, software RAID and NFS.

# yum -y install scsi-target-utils iscsi-initiator-utils nfs-utils mdadm

Ensure that the different services are not started during boot. This will be handled by the cluster software.

# chkconfig iscsi off
# chkconfig nfs off
# chkconfig tgtd off

Reboot the appliance to ensure the correct kernel modules are loaded and the IPs are set.

iSCSI target configuration

Here we will configure each pmsApp to use the second hard drive as an iSCSI target. In my setup the second disk is /dev/vdb, but in most setups it will probably be /dev/sdb, a normal SCSI device.
On each node, add lines similar to the following to the beginning of /etc/tgt/targets.conf:

<target iqn.2012-03.org.pmsapp:pmsapp1.disk>
	backing-store /dev/vdb
</target>

Of course you should change the target line to correspond to the name of the host.
Have a look at the rest of /etc/tgt/targets.conf for information on tweaks and other settings for your iSCSI target.

iSCSI initiator configuration

Now that we have all nodes set up to share their second hard drive as an iSCSI target, we can actually create our software RAID device.
First we need to start the iSCSI target service on each pmsApp.

# service tgtd start
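
If you want to make sure the target really is exported before moving on, tgtadm can list the configured targets and their backing stores:

# tgtadm --lld iscsi --mode target --op show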

Before we start discovering targets we need to edit /etc/iscsi/iscsid.conf. The default settings retry the login to a target 8 times with a timeout value of 15 seconds. This means that if we try to log in to a target that is not responding, it will take about 120 seconds before we get a timeout. That is a very long time to wait if a node in the cluster has failed. Expecting the network to work fairly well, I have set the timeout value to 2 seconds with only 1 allowed retry, which gives us a total of about 4 seconds before we get a timeout.
Edit /etc/iscsi/iscsid.conf and ensure you have something similar to the following 2 lines:

node.conn[0].timeo.login_timeout = 2
node.session.initial_login_retry_max = 1

Copy the configuration file to the other nodes.

# scp /etc/iscsi/iscsid.conf pmsapp2:/etc/iscsi
# scp /etc/iscsi/iscsid.conf pmsapp3:/etc/iscsi

On the first node (and only the first node) we will start with discovering the iSCSI targets.

# iscsiadm -m discovery -t st -p 172.16.0.21
Starting iscsid:			[  OK  ]
172.16.0.21:3260,1 iqn.2012-03.org.pmsapp:pmsapp1.disk
# iscsiadm -m discovery -t st -p 172.16.0.22
172.16.0.22:3260,1 iqn.2012-03.org.pmsapp:pmsapp2.disk
# iscsiadm -m discovery -t st -p 172.16.0.23
172.16.0.23:3260,1 iqn.2012-03.org.pmsapp:pmsapp3.disk

Note that the first time you run a discovery, the iscsid service will automatically be started.
Now that we know that we can access all our targets, we can log in to them.

# iscsiadm -m node -T iqn.2012-03.org.pmsapp:pmsapp1.disk -p 172.16.0.21 --login
Logging in to [iface: default, target: iqn.2012-03.org.pmsapp:pmsapp1.disk, portal: 172.16.0.21,3260] (multiple)
Login to [iface: default, target: iqn.2012-03.org.pmsapp:pmsapp1.disk, portal: 172.16.0.21,3260] successful.
# iscsiadm -m node -T iqn.2012-03.org.pmsapp:pmsapp2.disk -p 172.16.0.22 --login
Logging in to [iface: default, target: iqn.2012-03.org.pmsapp:pmsapp2.disk, portal: 172.16.0.22,3260] (multiple)
Login to [iface: default, target: iqn.2012-03.org.pmsapp:pmsapp2.disk, portal: 172.16.0.22,3260] successful.
# iscsiadm -m node -T iqn.2012-03.org.pmsapp:pmsapp3.disk -p 172.16.0.23 --login
Logging in to [iface: default, target: iqn.2012-03.org.pmsapp:pmsapp3.disk, portal: 172.16.0.23,3260] (multiple)
Login to [iface: default, target: iqn.2012-03.org.pmsapp:pmsapp3.disk, portal: 172.16.0.23,3260] successful.

Check out the iscsiadm man page for more information.
You should now be able to see your iSCSI devices as disks.

# fdisk -l 2>/dev/null | grep Disk | grep bytes
Disk /dev/vda: 8589 MB, 8589934592 bytes
Disk /dev/vdb: 10.7 GB, 10737418240 bytes
Disk /dev/mapper/VolGroup-lv_root: 5947 MB, 5947523072 bytes
Disk /dev/mapper/VolGroup-lv_swap: 2113 MB, 2113929216 bytes
Disk /dev/sda: 10.7 GB, 10737418240 bytes
Disk /dev/sdb: 10.7 GB, 10737418240 bytes
Disk /dev/sdc: 10.7 GB, 10737418240 bytes

A bit of explanation:
In my setup, I have /dev/vda which is my primary disk, where the OS is installed.
/dev/vdb is the secondary disk, that is shared as an iSCSI target.
Then there are two logical volumes created by the installer to hold the root and swap filesystems.
My iSCSI disks are /dev/sda, /dev/sdb and /dev/sdc.
In this setup the disks are only 10.7 GB; in a normal setup the disks would be much larger.
Note: If your primary and secondary disks were normal SCSI disks, they would have been /dev/sda and /dev/sdb, which in turn means that the iSCSI disks would start from /dev/sdc and continue with /dev/sdd and /dev/sde.
(I hope that made sense)
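
If you are ever in doubt about which /dev/sdX belongs to which target, iscsiadm can print the session details, including the attached SCSI disks:

# iscsiadm -m session -P 3 | grep -E "Target:|Attached scsi disk"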

Software RAID configuration

Finally we get to the point where we can actually create our software RAID.
I will first show how to create a mirror (RAID1) with two disks for use with a two-node cluster. Then I will show how to create a RAID5 with three disks for use with a three-node cluster.

Creating a mirror (RAID1):
# mdadm --create /dev/md0 --bitmap=internal --level=1 --raid-devices=2 /dev/sda /dev/sdb
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
Continue creating array? yes
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

You can safely answer ‘yes’ to the ‘Continue creating array?’ question, as we are not going to boot from this array.
When the array is first created, the driver will synchronize the drives; this will take some time. You can watch the progress in /proc/mdstat.

# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb[1] sda[0]
      10484664 blocks super 1.2 [2/2] [UU]
      [=========>...........]  resync = 46.4% (4874624/10484664) finish=2.8min speed=33017K/sec
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>

Creating a RAID5:
# mdadm --create /dev/md0 --bitmap=internal --level=5 --raid-devices=3 /dev/sda /dev/sdb /dev/sdc
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

As a RAID5 cannot be used to boot from, you will not be asked about booting. Again, you can watch the progress in /proc/mdstat.

# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid5 sdc[3] sdb[1] sda[0]
      20968448 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
      [=========>...........]  recovery = 45.4% (4767620/10484224) finish=4.2min speed=22062K/sec
      bitmap: 0/1 pages [0KB], 65536KB chunk

unused devices: <none>

Have a look at the mdadm man page for more information about the mdadm command.

When the array is done synchronizing, you should save the configuration of the array into the /etc/mdadm.conf file.

# mdadm --examine --scan > /etc/mdadm.conf

It will look something like this

# cat /etc/mdadm.conf
ARRAY /dev/md/0 metadata=1.2 UUID=d00d77c9:982f6ed5:e944a153:5db97044 name=pmsapp1.pmsapp.org:0

Copy the mdadm.conf file to the other nodes.

# scp /etc/mdadm.conf pmsapp2:/etc
# scp /etc/mdadm.conf pmsapp3:/etc

Now we need to ensure that the other nodes are able to connect to the iSCSI targets and see the array as well. Start by stopping the array and the iSCSI initiator on the primary node.

# mdadm --stop /dev/md0
# service iscsi stop

On the other node(s), login to the targets and check /proc/mdstat to ensure that the RAID driver picks up the array.

# iscsiadm -m discovery -t st -p 172.16.0.21
# iscsiadm -m discovery -t st -p 172.16.0.22
# iscsiadm -m discovery -t st -p 172.16.0.23
# iscsiadm -m node -T iqn.2012-03.org.pmsapp:pmsapp1.disk -p 172.16.0.21 --login
# iscsiadm -m node -T iqn.2012-03.org.pmsapp:pmsapp2.disk -p 172.16.0.22 --login
# iscsiadm -m node -T iqn.2012-03.org.pmsapp:pmsapp3.disk -p 172.16.0.23 --login
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb[1] sda[0]
      10484664 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

unused devices: <none>
# mdadm --stop /dev/md0
# service iscsi stop

First gotcha

Since we share our disk directly as an iSCSI target, any change done to the iSCSI target is done to that disk. We have configured the iSCSI target to be part of a RAID set, which means that the disk itself is part of a RAID set. When the host boots, it will see the disk as being part of a RAID set and will try to assemble that RAID set. This, in turn, means that the disk is in use and our iSCSI target software cannot use it as an iSCSI target.
This is the reason we are not starting the iSCSI target service automatically.
It is possible to overcome this by rebuilding the initrd image with the RAID boot option disabled, but if the kernel is later updated, it will have to be rebuilt again. An easier way is to create a new startup script that stops the RAID array and starts the iSCSI target service.
Create the file /etc/init.d/iscsitarget; it should look like this:

#!/bin/bash
# Stop the RAID array that was auto-assembled at boot and start the
# iSCSI target service so the local disk can be exported again.
case "$1" in
	start)
		mdadm --stop /dev/md0
		/etc/init.d/tgtd start
		;;
	*)
		echo $"Usage: $0 {start}"
		exit 2
esac
exit $?

Make the script executable and ensure it is started before the cluster software.

# chmod +x /etc/init.d/iscsitarget
# ln -s /etc/init.d/iscsitarget /etc/rc3.d/S16iscsitarget

The above should be done on all nodes.
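
Once the cluster services have been switched on with chkconfig (we do that in the clustering section below), you can double-check the start order in the runlevel 3 directory; the S16iscsitarget link we just created should sort before the cluster service links:

# ls /etc/rc3.d/ | grep -iE "iscsitarget|cman|rgmanager"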

Storage Conclusion

Now we actually have a storage device that can be run and restarted from any node in the cluster, with data distributed to all nodes in the cluster. We could just create a filesystem, mount it and share it to the virtualization hosts. That would, however, mean that if the controlling node fails, we would have to manually restart the storage device on one of the surviving nodes.

Create a filesystem on your RAID array.

# mkfs.ext4 /dev/md0

On each node, create a mountpoint for the shared storage.

# mkdir /sharedstorage

If you can live with the fact that you need to manually restart your storage after a node failure, STOP here.
After this point we are going into somewhat uncharted waters, as this is my first time working with Linux HA clusters.
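
For reference, this is roughly what a manual start of the storage on a surviving node would look like (just a sketch; the cluster configuration below automates exactly these steps):

# service iscsi start                # log in to the configured targets
# mdadm --run /dev/md0               # start the array if it came up inactive
# mount /dev/md0 /sharedstorage
# service nfs start
# exportfs -o rw,no_root_squash 192.168.0.0/24:/sharedstorage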

HA clustering

As I stated, I am not really used to working with HA clustering in Linux. I have mostly been using 3 pages as references.

I started with Ensure High Availability with CentOS 6 Clustering, thinking that it would get me all the way, but I soon found that it is missing a few steps. We will get to that later.

Initial Cluster Configuration

We have already installed the clustering software on all nodes; now we need to ensure the services start during boot on all nodes.

# chkconfig cman on
# chkconfig rgmanager on
# chkconfig modclusterd on
# chkconfig ricci on

We also need to set a password for the ricci user on all the nodes (this is something that is left out of Ensure High Availability with CentOS 6 Clustering).

# passwd ricci

Remember the password you set; you will need it later.

There are several tools to help create /etc/cluster/cluster.conf; in this case we will create it by hand to improve understanding and ensure it is configured correctly.
We will start with a very basic configuration file that only states which nodes are members of the cluster.

3 node cluster:
<?xml version="1.0"?>
<cluster config_version="1" name="pmsappcluster">
  <clusternodes>
    <clusternode name="pmsapp1" nodeid="1"/>
    <clusternode name="pmsapp2" nodeid="2"/>
    <clusternode name="pmsapp3" nodeid="3"/>
  </clusternodes>
</cluster>

2 node cluster:
<?xml version="1.0"?>
<cluster config_version="1" name="pmsappcluster">
  <cman expected_votes="1" two_node="1"/>
  <clusternodes>
    <clusternode name="pmsapp1" nodeid="1"/>
    <clusternode name="pmsapp2" nodeid="2"/>
  </clusternodes>
</cluster>

Explanation:
<?xml version="1.0"?>: The cluster.conf file is a version 1.0 XML file.
<cluster config_version="1" name="pmsappcluster"></cluster>: Defines the cluster name and the version of the file. Each time you make a change to the file, the version should be incremented. Note that the cluster element is "closed" at the bottom of the file.
<clusternodes></clusternodes>: Encapsulates the nodes in the cluster.
<clusternode name="pmsapp1" nodeid="1"/>: Defines a cluster node. The name must be resolvable to an IP and the nodeid must be unique within the cluster. Note that this element is "closed" by the slash (/) at the end.
<cman expected_votes="1" two_node="1"/>: This is the only line that actually differs between the two configurations. The cluster software uses majority rules, but in a two node cluster there can never be a majority, as each node is 50% of the cluster. This line tells the cluster to accept that only one node is available. Be warned that this can lead to a "split brain" scenario where both nodes believe the other node is down and clustered services are started on both nodes at the same time.

When you are done creating the cluster.conf file, get it validated and copy it to the other node(s).

# ccs_config_validate
Configuration validates
# scp /etc/cluster/cluster.conf pmsapp2:/etc/cluster
# scp /etc/cluster/cluster.conf pmsapp3:/etc/cluster

Start the cluster services on all nodes.

# service cman start
# service rgmanager start
# service modclusterd start
# service ricci start

Check that the cluster has started and all nodes have joined.

# clustat
Cluster Status for pmsappcluster @ Sat Feb 11 13:08:16 2012
Member Status: Quorate

 Member Name                                 ID   Status
 ------ ----                                 ---- ------
 pmsapp1                                      1 Online, Local
 pmsapp2                                      2 Online
 pmsapp3                                      3 Online

Adding Resources

We will start with adding a virtual IP address for the cluster. I will show you the configuration and then try to explain it.

3 node cluster:
<?xml version="1.0"?>
<cluster config_version="2" name="pmsappcluster">
  <clusternodes>
    <clusternode name="pmsapp1" nodeid="1"/>
    <clusternode name="pmsapp2" nodeid="2"/>
    <clusternode name="pmsapp3" nodeid="3"/>
  </clusternodes>
  <rm>
    <service autostart="1" exclusive="0" name="clusvc" recovery="relocate">
      <ip address="192.168.0.20" monitor_link="on"/>
    </service>
  </rm>
</cluster>

2 node cluster:
<?xml version="1.0"?>
<cluster config_version="2" name="pmsappcluster">
  <cman expected_votes="1" two_node="1"/>
  <clusternodes>
    <clusternode name="pmsapp1" nodeid="1"/>
    <clusternode name="pmsapp2" nodeid="2"/>
  </clusternodes>
  <rm>
    <service autostart="1" exclusive="0" name="clusvc" recovery="relocate">
      <ip address="192.168.0.20" monitor_link="on"/>
    </service>
  </rm>
</cluster>

Check the configuration and get it copied to the other nodes in the cluster.

# ccs_config_validate
Configuration validates
# cman_tool version -r
You have not authenticated to the ricci daemon on pmsapp1
Password:
You have not authenticated to the ricci daemon on pmsapp3
Password:
You have not authenticated to the ricci daemon on pmsapp2
Password:

The password to use is the one you chose earlier when you typed passwd ricci.
Explanation:
Note that the config_version has been incremented from 1 to 2.
<rm></rm>: The Resource Manager section encapsulates all resources/services.
<service autostart="1" exclusive="0" name="clusvc" recovery="relocate"></service>: Defines the service: it starts automatically, it is allowed to run on nodes that also run other services, it is called clusvc, and if it fails it should be relocated to another node.
<ip address="192.168.0.20" monitor_link="on"/>: Creates a virtual IP resource with IP 192.168.0.20; the resource is only started if we have link on the NIC.

We can now ping our virtual IP. The cman_tool command takes care of copying and activating the configuration on the whole cluster.
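
A quick way to verify this: ping the address, and on the node that currently owns the service (see clustat) you should see the extra address listed on eth0:

# ping -c 2 192.168.0.20
# ip addr show eth0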

Second gotcha

It would seem that we should have a working cluster now. The only function it performs right now is to keep a virtual IP up and running. We can even move the IP from one node to another.

# clustat
Cluster Status for pmsappcluster @ Thu Feb 16 08:22:52 2012
Member Status: Quorate

 Member Name                                 ID   Status
 ------ ----                                 ---- ------
 pmsapp1                                        1 Online, Local, rgmanager
 pmsapp2                                        2 Online, rgmanager
 pmsapp3                                        3 Online, rgmanager

 Service Name                                Owner (Last)                             State
 ------- ----                                ----- ------                             -----
 service:clusvc                              pmsapp1                                  started       

# clusvcadm -r clusvc -m pmsapp2
Trying to relocate service:clusvc to pmsapp2...Success
service:clusvc is now running on pmsapp2

# clustat
Cluster Status for pmsappcluster @ Thu Feb 16 08:27:16 2012
Member Status: Quorate

 Member Name                                 ID   Status
 ------ ----                                 ---- ------
 pmsapp1                                        1 Online, Local, rgmanager
 pmsapp2                                        2 Online, rgmanager
 pmsapp3                                        3 Online, rgmanager

 Service Name                                Owner (Last)                             State
 ------- ----                                ----- ------                             -----
 service:clusvc                              pmsapp2                                  started

While doing some testing, I killed the node that owned the virtual IP and expected it to start on another node. That never happened.
I tried several things to make it work and almost gave up, but in the end I found that fencing was my problem.

The cluster will not relocate/restart any service as long as it cannot be sure that the failed node is actually dead and gone.
This is handled through fencing, which usually means that the cluster will STONITH (Shoot The Other Node In The Head) the failing node.
My first thought was to disable fencing, but this does not seem to be possible. Next I looked for a fencing method I could use; there are several.

I was unable to find a fencing method that could be used in a generic environment where there is only a single IP connection between nodes, so I ended up with a workaround that does the same as if fencing was disabled.

Fencing

The method described here effectively disables fencing. This can result in a split-brain scenario. If it is possible for you to use one of the predefined fencing agents, I suggest that you do so.

A fencing agent is actually a script located in /usr/sbin, designed to ensure that a failing node is no longer active. The cluster runs the script when it wants to fence out a node. If the script returns without errors, the cluster assumes that the node has been fenced out. As none of the predefined fencing agents work in our generic scenario, we will create our own.

# echo -e "\x23\x21/bin/bash" > /usr/sbin/fence_disable
# chmod +x /usr/sbin/fence_disable
# scp /usr/sbin/fence_disable pmsapp2:/usr/sbin
# scp /usr/sbin/fence_disable pmsapp3:/usr/sbin

The script does not do anything other than return without error (the \x23\x21 escapes simply write the #! shebang characters).
Now we can add our new fencing agent to the cluster configuration.

3 node cluster:
<?xml version="1.0"?>
<cluster config_version="3" name="pmsappcluster">
  <clusternodes>
    <clusternode name="pmsapp1" nodeid="1">
      <fence>
        <method name="fence_off">
          <device name="no_fence"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="pmsapp2" nodeid="2">
      <fence>
        <method name="fence_off">
          <device name="no_fence"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="pmsapp3" nodeid="3">
      <fence>
        <method name="fence_off">
          <device name="no_fence"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_disable" name="no_fence"/>
  </fencedevices>
  <rm>
    <service autostart="1" exclusive="0" name="clusvc" recovery="relocate">
      <ip address="192.168.0.20" monitor_link="on"/>
    </service>
  </rm>
</cluster>

2 node cluster:
<?xml version="1.0"?>
<cluster config_version="3" name="pmsappcluster">
  <cman expected_votes="1" two_node="1"/>
  <clusternodes>
    <clusternode name="pmsapp1" nodeid="1">
      <fence>
        <method name="fence_off">
          <device name="no_fence"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="pmsapp2" nodeid="2">
      <fence>
        <method name="fence_off">
          <device name="no_fence"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_disable" name="no_fence"/>
  </fencedevices>
  <rm>
    <service autostart="1" exclusive="0" name="clusvc" recovery="relocate">
      <ip address="192.168.0.20" monitor_link="on"/>
    </service>
  </rm>
</cluster>

Explanation:
<fencedevices></fencedevices>: This section holds information about fencing agents. In some situations, different agents will be used for different nodes.
<fencedevice agent="fence_disable" name="no_fence"/>: This defines our fencing agent. The agent is the name of our script. The name can be anything and is only used to reference the agent within the cluster configuration.

Looking at the <clusternodes> section, we see that it has changed quite a bit. Previously we just defined the cluster nodes and that was it. Now we define each cluster node together with the fencing mechanism that should be used to fence out that particular node.
<fence></fence>: Section that holds the fencing information for a specific node.
<method name="fence_off"></method>: I am unsure why we need a method with a name, but it seems that it must be there in order to make it work. I have chosen the name more or less randomly.
<device name="no_fence"/>: Tells which fence device to use for this specific node. The fence device should be configured in the <fencedevices> section.

Check the configuration and get it copied to the other nodes in the cluster.

# ccs_config_validate
Configuration validates
# cman_tool version -r
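
With the new configuration active, you can sanity-check that the dummy agent is wired up by fencing a node manually in your test setup; since the agent always exits without error, the command should simply report success and the node keeps running:

# fence_node pmsapp2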

Finally we have a functioning failover cluster. At the moment it only handles our virtual IP, but now it is time to change that.

Storage Resource

The cluster already has a virtual IP resource, but we also need the cluster to connect to the iSCSI targets, start the software RAID array, mount the filesystem and export it over NFS.
The cluster software has built-in functions for mounting filesystems and creating NFS exports; we will have a look at those shortly.
There are no built-in functions to connect to iSCSI targets and start a software RAID array, so we have to create our own. Below is a script that can be used to start and stop the software RAID array.
Save it as /etc/init.d/swraid:

#!/bin/bash
case "$1" in
	start)
		# Start the iSCSI initiator; it will log in to the already configured targets.
		service iscsi start 

		# Wait for 1 second to allow the software RAID driver to discover the RAID set.
		sleep 1

		# Search /proc/mdstat for ": active"
		grep ": active" /proc/mdstat &>/dev/null

		# If ": active" is not found, it means that the array is not started.
		if [ $? -ne 0 ]; then
			# Try to start the array
			mdadm --run /dev/md0;
		fi

		# Check /proc/mdstat again to see if the array is running.
		grep ": active" /proc/mdstat &>/dev/null

		# If ": active" is not found, it means that the array is not started
		if [ $? -ne 0 ]; then
			# Tell that the array is not started and return with an non-zero exit code
			echo "Array /dev/md0 is not started"
			exit 1
		fi
		# Tell that the array is started and return with 0 as exit code.
		echo "Array /dev/md0 is started"
		exit 0
		;;
	stop)
		# Stop the RAID array
		mdadm --stop /dev/md0
		# Stop the iscsi service and exit without error
		service iscsi stop
		exit 0
		;;
	status)
		# Check /proc/mdstat for ": active"
		grep ": active" /proc/mdstat &>/dev/null
		# If ": active" is not found, the array is not running
		if [ $? -ne 0 ]; then
			# Tell that the array is not running and exit with a non-zero exit code.
			echo "Array /dev/md0 is not running"
			exit 1
		fi
		# Tell that the array is running and exit with 0 as exit code.
		echo "Array /dev/md0 is running"
		exit 0
		;;
	*)
		# Tell how to use the script and exit with a non-zero exit code.
		echo $"Usage: $0 { start | stop | status }"
		exit 2
		;;
esac

Make it executable and copy it to the other nodes.

# chmod +x /etc/init.d/swraid
# scp /etc/init.d/swraid pmsapp2:/etc/init.d
# scp /etc/init.d/swraid pmsapp3:/etc/init.d
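
Before handing the script over to the cluster, it is a good idea to test it by hand on one node (while the array is not running anywhere else); it should report the array as started and running, and stop it again cleanly:

# service swraid start
# service swraid status
# service swraid stop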

Now we can reconfigure our cluster to enable the shared storage.

3 node cluster:
<?xml version="1.0"?>
<cluster config_version="4" name="pmsappcluster">
  <clusternodes>
    <clusternode name="pmsapp1" nodeid="1">
      <fence>
        <method name="fence_off">
          <device name="no_fence"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="pmsapp2" nodeid="2">
      <fence>
        <method name="fence_off">
          <device name="no_fence"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="pmsapp3" nodeid="3">
      <fence>
        <method name="fence_off">
          <device name="no_fence"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_disable" name="no_fence"/>
  </fencedevices>
  <rm>
    <service autostart="1" exclusive="0" name="clusvc" recovery="relocate">
      <script file="/etc/init.d/swraid" name="swraid">
        <fs device="/dev/md0" fstype="ext4" mountpoint="/sharedstorage" name="sharedvol">
          <nfsexport name="sharednfs">
            <nfsclient name="nfsclients" options="rw,no_root_squash" target="192.168.0.0/24"/>
          </nfsexport>
        </fs>
      </script>
      <ip address="192.168.0.20" monitor_link="on"/>
    </service>
  </rm>
</cluster>

2 node cluster:
<?xml version="1.0"?>
<cluster config_version="4" name="pmsappcluster">
  <cman expected_votes="1" two_node="1"/>
  <clusternodes>
    <clusternode name="pmsapp1" nodeid="1">
      <fence>
        <method name="fence_off">
          <device name="no_fence"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="pmsapp2" nodeid="2">
      <fence>
        <method name="fence_off">
          <device name="no_fence"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_disable" name="no_fence"/>
  </fencedevices>
  <rm>
    <service autostart="1" exclusive="0" name="clusvc" recovery="relocate">
      <script file="/etc/init.d/swraid" name="swraid">
        <fs device="/dev/md0" fstype="ext4" mountpoint="/sharedstorage" name="sharedvol">
          <nfsexport name="sharednfs">
            <nfsclient name="nfsclients" options="rw,no_root_squash" target="192.168.0.0/24"/>
          </nfsexport>
        </fs>
      </script>
      <ip address="192.168.0.20" monitor_link="on"/>
    </service>
  </rm>
</cluster>

Explanation:
<script file="/etc/init.d/swraid" name="swraid"></script>: This runs our script with the argument start when the service needs to be started, and with the argument stop when the service needs to be stopped.
<fs device="/dev/md0" fstype="ext4" mountpoint="/sharedstorage" name="sharedvol"></fs>: This is a child of the <script> element, which means that it will only be started if the script has started successfully. It mounts /dev/md0 on /sharedstorage as ext4.
<nfsexport name="sharednfs"></nfsexport>: When the script has been run and the filesystem is mounted, the cluster should mark the filesystem as an NFS export.
<nfsclient name="nfsclients" options="rw,no_root_squash" target="192.168.0.0/24"/>: This tells the cluster how the NFS export should be configured. In this case we allow the 192.168.0.0/24 subnet to access it, and we use no_root_squash as most virtualization hosts will log in as root.

Check the configuration and get it copied to the other nodes in the cluster.

# ccs_config_validate
Configuration validates
# cman_tool version -r

Now we should have a functioning cluster that will ensure the availability of a virtual IP and a shared storage resource, exported as NFS. You can check that the NFS export has actually been created.

# showmount -e 192.168.0.20
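
From a virtualization host, or any other Linux client on the 192.168.0.0/24 network, you can also do a quick test mount of the export (using a throwaway mount point such as /mnt/pmsapp):

# mkdir -p /mnt/pmsapp
# mount -t nfs 192.168.0.20:/sharedstorage /mnt/pmsapp
# df -h /mnt/pmsapp
# umount /mnt/pmsapp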

Conclusion

At last we have a cluster configured with purely open source software, but there is still a lot of work to do.

  • Performance testing
  • Failover testing
  • Troubleshooting tips
  • Create an actual appliance

Unfortunately I only have a single virtualization host at my disposal, so I cannot run reliable performance tests. For the other points, any help would be appreciated. I will, however, try to get these done when time permits. Please leave a comment below or send me an email at jimmy at pmsapp.org

All comments are welcome – Happy clustering.
