Shared storage with compression using drbd and zfs

This is not an update of the original Poor Man's Storage Appliance, but rather a new take on it. This time it is built with DRBD and ZFS. DRBD is designed for mirroring over the network, and the zfs filesystem is able to do compression and deduplication, although deduplication is not recommended unless you have lots and lots of memory.

I started out with Debian 7.5 (wheezy), but it comes stock with drbd 8.3, and most guides I could find on configuring and using drbd are for drbd 8.4. By using backports I was able to get drbd up to version 8.4, but then I had trouble getting zfs to work correctly. After some trial and error I decided that it was too much hassle and instead opted for Ubuntu. Ubuntu 14.04 LTS comes with drbd 8.4, and zfs installs without any issues. I started with a VM on each of my two physical hosts. Each VM has a virtual disk for the OS and a virtual disk for the shared storage.

After installing Ubuntu Server, configuring static IP and ensuring host names resolve (either through your own DNS server or /etc/hosts files on both VMs), I started installing software (everything needs to be done on both VMs).
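
If you rely on /etc/hosts for name resolution, a minimal example on both VMs could look like this. The host names and addresses are the ones used in the drbd configuration further down; the short aliases are just a convenience I have added here:

# /etc/hosts on both VMs (example names and addresses)
192.168.0.41 node1.mydomain.local node1
192.168.0.42 node2.mydomain.local node2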

# Update repository list and upgrade packages to latest version
sudo apt-get update && sudo apt-get -y dist-upgrade

To install zfs on linux, an extra repository needs to be added:

# Add zfs on linux repository
sudo add-apt-repository ppa:zfs-native/stable

Install the software: drbd to distribute data between the nodes, ntp to keep time synchronized, zfs to create a filesystem capable of compression and deduplication, nfs to export the distributed storage, and heartbeat to manage the two nodes as a cluster.

# Update repository list and install software
sudo apt-get update && sudo apt-get -y install \
drbd8-utils ntp ntpdate ubuntu-zfs nfs-kernel-server heartbeat

Reboot to ensure that new versions are being used.

# Reboot to use new kernel and software
sudo reboot

After the VMs have restarted, we start by configuring the drbd device. This is done by creating a resource description file in /etc/drbd.d:

sudo nano /etc/drbd.d/myredundantstorage.res

The file should look something like this:

resource myredundantstorage {
 protocol C;
 startup { wfc-timeout 5; degr-wfc-timeout 15; }

 disk { on-io-error detach; }

 syncer { rate 10M; }

 on node1.mydomain.local {
 device /dev/drbd0;
 disk /dev/vdb;
 meta-disk internal;
 address 192.168.0.41:7788;
 }

 on node2.mydomain.local {
 device /dev/drbd0;
 disk /dev/vdb;
 meta-disk internal;
 address 192.168.0.42:7788;
 }
}

Make sure that the VMs' host names match the host names specified in the above configuration file. The names in the file should be the same as the name provided by:

uname -n
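
If the name does not match, one way to fix it on Ubuntu 14.04 is to update /etc/hostname and apply the new name; this is just a sketch using the example name from the configuration above, and you should of course also update the matching entry in /etc/hosts:

# Set the host name to match the drbd configuration (example name)
echo "node1.mydomain.local" | sudo tee /etc/hostname
sudo hostname -F /etc/hostname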

Now that the resource file is created on both nodes, describing how the resource should be handled, we can create the actual resource.

# Create and enable storage on both nodes
sudo drbdadm create-md myredundantstorage
sudo drbdadm up myredundantstorage

You should now be able to see that both nodes are up although they are both secondary.

# Show status of our drbd device
cat /proc/drbd
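
The exact output depends on your drbd version, but at this point it should contain a status line roughly like the one below, showing both sides connected and still Secondary/Inconsistent:

# Approximate example of the status line in /proc/drbd at this stage
0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----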

On the node that you want to be primary, you can run either of these two commands:

# Create primary node without syncing devices (should only
# be done if you are sure that the disks do not contain any
# data)
sudo drbdadm -- \
--clear-bitmap new-current-uuid myredundantstorage

or:

# Create primary node by syncing the contents of local disk
# to the remote disk (can be very time-consuming)
sudo drbdadm -- \
--overwrite-data-of-peer primary myredundantstorage
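
If you chose the second command, the initial sync can take quite a while; one simple way to follow the progress is to keep an eye on /proc/drbd:

# Watch the sync progress (updates every 2 seconds)
watch -n2 cat /proc/drbd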

Depending on which command is used, you should see the device either ready or syncing, but in both cases one node should now be primary and the other secondary (cat /proc/drbd). Now we have a device on which we can create a filesystem (/dev/drbd0). To be able to choose compression and deduplication we will use zfs. The filesystem will only be created on the primary node.

# Create zfs filesystem on the redundant device
sudo zpool create myredundantfilesystem /dev/drbd0
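
To verify that the pool was created and mounted on the primary node, you can check it like this (the sizes shown will depend on your virtual disk):

# Show pool health and where the pool is mounted
sudo zpool status myredundantfilesystem
sudo zfs list myredundantfilesystem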

The zfs documentation says that after the zpool is created, a zfs should be created inside that pool. I have opted not to do that for three reasons: the zpool itself functions as a filesystem, I do not need more than one filesystem in my pool, and if a zfs is created inside the zpool, zfs will try to mount it during boot. If you want compression enabled on the filesystem:

sudo zfs set compression=on myredundantfilesystem

If you want deduplication enabled on the filesystem:

sudo zfs set dedup=on myredundantfilesystem
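
You can confirm that the properties took effect, and later keep an eye on how well compression is doing, with zfs get:

# Show compression/deduplication settings and the compression ratio
sudo zfs get compression,dedup,compressratio myredundantfilesystem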

Now we need to ensure that the mountpoint for the zfs filesystem exists on both nodes:

# Create directory for mounting zfs if it does not exist
if [ ! -d /myredundantfilesystem ]; then
 sudo mkdir /myredundantfilesystem
fi

On the primary node the directory should already exist and the zfs filesystem should be mounted. Now we can edit /etc/exports on both nodes to ensure that each node can export the zfs filesystem to clients. The export should look something like this:

/myredundantfilesystem 192.168.0.0/24(rw,async,no_root_squash,no_subtree_check,fsid=1)

Most options are “normal” for nfs exports, but the fsid is there to tell the client that it is the same filesystem on both nodes. Finally we need clustering to maintain a failover relationship between the nodes. We have already installed heartbeat; now we need to configure it (this should be done on both nodes). Before we start configuring heartbeat, we need to disable the nfs server during boot.

sudo update-rc.d -f nfs-kernel-server remove

First we will create the /etc/ha.d/ha.cf file. It should look something like this:

autojoin none
auto_failback off
keepalive 1
warntime 3
deadtime 5
initdead 20
bcast eth0
node node1.mydomain.local
node node2.mydomain.local
logfile /var/log/heartbeat-log
debugfile /var/log/heartbeat-debug

The parameter deadtime tells heartbeat to declare the other node dead after this many seconds, and heartbeat will send a heartbeat every keepalive seconds. Next we will protect the heartbeat configuration by editing /etc/ha.d/authkeys. It should look something like this:

auth 3
3 md5 my_secret_key

Set permissions on the file:

sudo chmod 600 /etc/ha.d/authkeys

Next we need to tell heartbeat about the resources we want it to manage. This is done in the file /etc/ha.d/haresources and it should look something like this:

node1.mydomain.local \
IPaddr::192.168.0.45/24/eth0 \
drbddisk::myredundantstorage \
zfsmount \
nfs-kernel-server

In this example node1 will be primary, but if it fails, the other node will take over. We need a “floating” IP address because our clients need a fixed IP to connect to. Heartbeat needs to ensure that the drbd device is active on the primary node. zfsmount is a script I have created in /etc/init.d, more about that in a short while. Lastly, heartbeat should start the nfs server.

In haresources there is a reference to the zfsmount script. This is a script I have created myself to ensure that both the drbd device and the zfs filesystem get activated on the secondary node in the event of primary node failure. The script should be located in /etc/init.d and accept at least the start and stop parameters. My script looks like this:

#!/bin/bash
EXITCODE=0
case "$1" in
 start)
  # Try to make this node primary for the drbd
  drbdadm primary myredundantstorage
  # Ensure that zfs knows about our filesystem
  zpool import -f myredundantfilesystem
  # Try to mount the zfs filesystem
  zfs mount myredundantfilesystem
  EXITCODE=$?
 ;;
 stop)
  # If the filesystem is mounted, it should be unmounted
  df -h | grep myredundantfilesystem > /dev/null
  if [ $? -eq 0 ]; then
   zfs unmount myredundantfilesystem
   EXITCODE=$?
  else
   EXITCODE=0
  fi
 ;;
esac
exit $EXITCODE
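
Two steps that are easy to forget (they are assumed here and not shown elsewhere in this guide): the script must be executable, and heartbeat must be (re)started on both nodes once all of the configuration is in place.

# Make the failover script executable (both nodes)
sudo chmod +x /etc/init.d/zfsmount
# Start or restart heartbeat so it picks up the configuration (both nodes)
sudo service heartbeat restart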

Some trial and error, as well as a lot of looking through the /var/log/heartbeat-debug log file, helped me create this script. The script must exit with success when it is called with stop, even if it has already been stopped.

Initially I thought that it would be sufficient to make the node primary (drbdadm primary myredundantstorage) during a failover, but it seems that the zfs driver does not necessarily recognize that there is a zfs filesystem just because the device becomes available. That is why the import is done. The import will fail if the filesystem is already recognized, but that does not matter in this script; it will still try to mount the filesystem.
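
To check that the whole setup works from a client's point of view, you can mount the export through the floating IP and then stop heartbeat on the primary node to trigger a failover. This is just a sketch; /mnt/redundant is a hypothetical mount point, and the client needs the nfs client utilities (nfs-common on Ubuntu) installed:

# On a client: mount the nfs export via the floating IP
sudo mkdir -p /mnt/redundant
sudo mount -t nfs 192.168.0.45:/myredundantfilesystem /mnt/redundant

# On the primary node: force a failover and watch the client keep working
sudo service heartbeat stop

With the deadtime of 5 seconds in ha.cf, the secondary node should take over the floating IP, the drbd device, the zfs filesystem and the nfs server within a few seconds.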

I hope this will be helpful for other people as well. It took me a long time to find anything dealing with drbd and zfs at the same time. If you have any questions, don’t hesitate to contact me. ( jimmy at dansbo dot dk )

Btw, I can say that this setup performs a hell of a lot better than zfs and gluster on the same hardware.