Shared storage with compression using drbd and zfs

This is not an update of the original Poor Mans Storage Appliance, but rather a new take on it. This time it is built with DRBD and ZFS. DRBD is designed for mirroring over a network, and the zfs filesystem can do compression and deduplication, although deduplication is not recommended unless you have lots and lots of memory.

I started out with Debian 7.5 (wheezy), but it comes stock with drbd 8.3, and most guides I could find on configuring and using drbd are for drbd 8.4. By using backports I was able to get drbd up to version 8.4, but then I had trouble getting zfs to work correctly. After some trial and error I decided that it was too much hassle and opted for Ubuntu instead. Ubuntu 14.04 LTS comes with drbd 8.4, and zfs installs without any issues.

I started with a VM on each of my two physical hosts. Each VM has a virtual disk for the OS and a virtual disk for the shared storage. The following resources were used as reference:

After installing Ubuntu Server, configuring a static IP and ensuring that host names resolve (either through your own DNS server or through /etc/hosts files on both VMs), I started installing software. Everything needs to be done on both VMs.

# Update repository list and upgrade packages to latest version
sudo apt-get update && sudo apt-get -y dist-upgrade

To install zfs on linux, an extra repository needs to be added:

# Add zfs on linux repository
sudo add-apt-repository ppa:zfs-native/stable

Install the software: drbd to distribute data between nodes, ntp to ensure time is synchronized, zfs to create a filesystem capable of compression and deduplication, nfs to export the distributed storage, and heartbeat to manage the two nodes in a cluster.

# Update repository list and install software
sudo apt-get update && sudo apt-get -y install \
drbd8-utils ntp ntpdate ubuntu-zfs nfs-kernel-server heartbeat

Reboot to ensure that new versions are being used.

# Reboot to use new kernel and software
sudo reboot

After the VMs have restarted, we start by configuring the drbd device. This is done by creating a resource description file in /etc/drbd.d:

sudo nano /etc/drbd.d/myredundantstorage.res

The file should look something like this:

resource myredundantstorage {
  protocol C;
  startup { wfc-timeout 5; degr-wfc-timeout 15; }

  disk { on-io-error detach; }

  syncer { rate 10M; }

  on node1.mydomain.local {
    device /dev/drbd0;
    disk /dev/vdb;
    meta-disk internal;
    address 192.168.0.41:7788;
  }

  on node2.mydomain.local {
    device /dev/drbd0;
    disk /dev/vdb;
    meta-disk internal;
    address 192.168.0.42:7788;
  }
}

Make sure that the VMs' host names match the host names specified in the configuration file above. The names in the file should be the same as the name printed by:

uname -n
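A quick way to catch a mismatch is to look for the local node name in the "on" sections of the resource file. The following is only a minimal sketch: the function name host_in_resource is my own invention and not part of drbd, and it assumes the "on HOSTNAME {" layout used in the file above.

```shell
#!/bin/sh
# host_in_resource FILE [HOSTNAME]
# Succeeds if FILE contains an "on HOSTNAME {" section.
# HOSTNAME defaults to the output of `uname -n`.
# (Hypothetical helper, not part of drbd8-utils.)
host_in_resource() {
    res_file=$1
    host=${2:-$(uname -n)}
    # -F: treat the pattern as a fixed string (dots stay literal)
    # -q: no output, only the exit status matters
    grep -qF "on $host {" "$res_file"
}

# Example:
# host_in_resource /etc/drbd.d/myredundantstorage.res \
#   || echo "host name not found in resource file"
```

Run it on both nodes. If it fails on either of them, fix the host name or the resource file before continuing.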

Now that the resource file is created on both nodes, describing how the resource should be handled, we can create the actual resource.

# Create and enable storage on both nodes
sudo drbdadm create-md myredundantstorage
sudo drbdadm up myredundantstorage

You should now be able to see that both nodes are up although they are both secondary.

# Show status of our drbd device
cat /proc/drbd
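If you want to check the state from a script rather than by eye, the fields can be picked out of the status line. This is a rough sketch that assumes the usual DRBD 8.x layout of /proc/drbd, where the line for device 0 starts with "0:" followed by fields such as cs: (connection state), ro: (roles) and ds: (disk states). The function name drbd_field is hypothetical.

```shell
#!/bin/sh
# drbd_field KEY [MINOR]
# Reads /proc/drbd-style text on stdin and prints the value of KEY
# (cs = connection state, ro = roles, ds = disk states) for the
# given device minor number (default 0).
# (Hypothetical helper; assumes the DRBD 8.x /proc/drbd line format.)
drbd_field() {
    key=$1
    minor=${2:-0}
    awk -v key="$key" -v minor="$minor" '
        $1 == (minor ":") {
            # Walk the remaining fields and print the one starting with KEY:
            for (i = 2; i <= NF; i++)
                if (index($i, key ":") == 1) {
                    print substr($i, length(key) + 2)
                    exit
                }
        }'
}

# Example:
# drbd_field ro < /proc/drbd    # prints local/peer roles
```

At this point in the setup the ro field should report both nodes as Secondary.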

On the node that you want to be primary, run one of the following:

# Skip the initial sync (should only be done if you are
# sure that the disks do not contain any data), then
# promote this node to primary
sudo drbdadm -- \
--clear-bitmap new-current-uuid myredundantstorage
sudo drbdadm primary myredundantstorage

or:

# Create primary node by syncing the contents of local disk
# to the remote disk (can be very time-consuming)
sudo drbdadm -- \
--overwrite-data-of-peer primary myredundantstorage

Depending on which command is used, you should see the device being ready or syncing. Either way, cat /proc/drbd should now show one node as primary and the other as secondary.

Now we have a device (/dev/drbd0) on which we can create a filesystem. To enable us to choose compression and deduplication we will use zfs. The filesystem will only be created on the primary node.

# Create zfs filesystem on the redundant device
sudo zpool create myredundantfilesystem /dev/drbd0

The zfs documentation says that after the zpool is created, a zfs should be created inside that pool. I have opted not to do that for three reasons: the zpool itself functions as a filesystem, I do not need more than one filesystem in my pool, and if a zfs is created inside the zpool, zfs will try to mount it during boot.

If you want compression enabled on the filesystem:

sudo zfs set compression=on myredundantfilesystem

If you want deduplication enabled on the filesystem:

sudo zfs set dedup=on myredundantfilesystem
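To see whether compression is actually paying off, zfs exposes a read-only compressratio property next to the compression setting. The wrapper below is just a sketch for convenience. The name show_compression is made up, and it only calls zfs get with the -H (no header) and -o value (value column only) options.

```shell
#!/bin/sh
# show_compression DATASET
# Prints the compression setting and the achieved ratio of DATASET.
# -H drops the header line, -o value prints only the value column.
# (Hypothetical wrapper around the `zfs get` command.)
show_compression() {
    dataset=$1
    setting=$(zfs get -H -o value compression "$dataset") || return 1
    ratio=$(zfs get -H -o value compressratio "$dataset") || return 1
    echo "$dataset: compression=$setting ratio=$ratio"
}

# Example (on the primary node):
# show_compression myredundantfilesystem
```

Compression only applies to data written after it was switched on, so the ratio says little until some data has been copied in.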

Now we need to ensure that the mountpoint for the zfs filesystem exists on both nodes:

# Create directory for mounting zfs if it does not exist
if [ ! -d /myredundantfilesystem ]; then
 sudo mkdir /myredundantfilesystem
fi

On the primary node the directory should already exist and the zfs filesystem should be mounted. Now we can edit /etc/exports on both nodes to ensure that it can export the zfs filesystem to clients. The export should look something like this:

/myredundantfilesystem 192.168.0.0/24(rw,async,no_root_squash,no_subtree_check,fsid=1)

Most options are “normal” for nfs exports, but the fsid is there to tell the client that it is the same filesystem on both nodes.

Finally we need clustering to maintain a failover relationship between the nodes. We have already installed heartbeat; now we need to configure it on both nodes. Before we start configuring heartbeat, we need to disable the nfs server during boot, since heartbeat will be the one starting it.

sudo update-rc.d -f nfs-kernel-server remove

First we will create the /etc/ha.d/ha.cf file. It should look something like this:

autojoin none
auto_failback off
keepalive 1
warntime 3
deadtime 5
initdead 20
bcast eth0
node node1.mydomain.local
node node2.mydomain.local
logfile /var/log/heartbeat-log
debugfile /var/log/heartbeat-debug

The deadtime parameter tells Heartbeat to declare the other node dead after this many seconds, and Heartbeat sends a heartbeat every keepalive seconds.

Next we will protect the heartbeat configuration by editing /etc/ha.d/authkeys. It should look something like this:

auth 3
3 md5 my_secret_key

Set permissions on the file:

sudo chmod 600 /etc/ha.d/authkeys
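Rather than typing a key by hand, the key can be generated from random data. The sketch below is one way to do it, not the way I originally did it: the function name make_authkeys is hypothetical, and it uses sha1 instead of the md5 method shown above. The key index (1 here, 3 above) is arbitrary as long as both lines agree.

```shell
#!/bin/sh
# make_authkeys
# Prints the contents of an authkeys file with a random key,
# using sha1 as the signing method.
# (Hypothetical helper; the key is the SHA-1 hex digest of
# 512 random bytes.)
make_authkeys() {
    key=$(head -c 512 /dev/urandom | sha1sum | cut -d' ' -f1)
    printf 'auth 1\n1 sha1 %s\n' "$key"
}

# Example (as root, on one node only):
# make_authkeys > /etc/ha.d/authkeys && chmod 600 /etc/ha.d/authkeys
```

Generate the file on one node and copy it to the other, since both nodes need the same key.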

Next we need to tell heartbeat about the resources we want it to manage. It is done in the file /etc/ha.d/haresources and it should look something like this:

node1.mydomain.local \
IPaddr::192.168.0.45/24/eth0 \
drbddisk::myredundantstorage \
zfsmount \
nfs-kernel-server

In this example node1 will be primary, but if it fails, the other node will take over. We need a “floating” IP address, as our clients need a fixed IP to connect to. Heartbeat also needs to ensure that the drbd device is active on the primary node, and lastly it should start the nfs server.

The zfsmount entry in haresources refers to a script I have created myself to ensure that both the drbd device and the zfs filesystem get activated on the secondary node in the event of primary node failure. The script should be located in /etc/init.d and accept at least the start and stop parameters. My script looks like this:

#!/bin/bash
# Called by heartbeat with "start" or "stop"
EXITCODE=0
case "$1" in
 start)
  # Try to make this node primary for the drbd
  drbdadm primary myredundantstorage
  # Ensure that zfs knows about our filesystem (the exit code
  # is ignored, as this fails if it is already imported)
  zpool import -f myredundantfilesystem
  # Try to mount the zfs filesystem
  zfs mount myredundantfilesystem
  EXITCODE=$?
 ;;
 stop)
  # If the filesystem is mounted, it should be unmounted.
  # Otherwise there is nothing to do and we exit with success.
  if df -h | grep -q myredundantfilesystem; then
   zfs unmount myredundantfilesystem
   EXITCODE=$?
  fi
 ;;
esac
exit $EXITCODE

Some trial and error, as well as a lot of looking through the /var/log/heartbeat-debug log file, helped me create this script. The script must exit with success if it is called with stop, even if it has already been stopped.

Initially I thought that it would be sufficient to make the node primary (drbdadm primary myredundantstorage) during a failover, but it seems that the zfs driver does not necessarily recognize that there is a zfs filesystem just because the device becomes available. That is why the import is done. The import will fail if the filesystem is already recognized, but that does not matter in this script, as it will still try to mount the filesystem.
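If the failing import bothers you, the import can be skipped when the pool is already known. This is a variation on my script and not what I actually run: the function name import_pool_if_needed is made up, and it relies on zpool list returning a non-zero exit code for a pool that has not been imported.

```shell
#!/bin/sh
# import_pool_if_needed POOL
# Imports POOL only when `zpool list` does not already know it, so
# the function succeeds both on a fresh failover and on a repeated
# start. (Hypothetical helper, not the script shown above.)
import_pool_if_needed() {
    pool=$1
    if zpool list "$pool" > /dev/null 2>&1; then
        return 0             # already imported, nothing to do
    fi
    # -f: the pool was last in use on the other node
    zpool import -f "$pool"
}

# Example, in the start) branch:
# import_pool_if_needed myredundantfilesystem
```

With this, an error from the import step means something is really wrong, which makes the heartbeat logs easier to read.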

I hope this will be helpful for other people as well. It took me a long time to find anything dealing with drbd and zfs at the same time. If you have any questions, don’t hesitate to contact me (jimmy at dansbo dot dk).

Btw. I can say that this setup performs a hell of a lot better than zfs and gluster on the same hardware.

So long, and thanks for all the fish

This is it. I have not been able to find the time or the energy to keep this project updated. The domain pmsapp.org will expire on the 2nd of March this year and I do not plan on renewing it. I am sorry to see the project die, but lack of interest and a strained schedule on my part force me to stop.

If you are interested in this project or need assistance in setting up your own storage appliance, you can reach me at pmsapp at dansbo.dk.

Update in lack of news

Just wanted to let everyone know that this project is still alive. I am continuing work on creating a web interface to ease deployment of a pmsApp storage cluster. Unfortunately it will probably be some time before version 0.3 with the new web interface is released.

In the meantime I am still hoping for a nice logo for the pmsApp project. With a bit of luck something will have been submitted before the release of the next version.

I would like to thank manyrootsofallevil for his great work in performance testing the pmsApp and at the same time ask for input on how to increase performance. If you have any suggestions or tips on how to increase performance on software RAID created from iSCSI targets, please let us know.

Performance testing the pmsApp with filebench – RAID 5

I finally managed to get the 3 node pmsApp running, although there are some odd issues relating to failover that I will need to investigate further.

I altered the methodology slightly this time: I decided to use a single guest and move it around to the various storage devices.

In addition to the pmsApp, I tested on our SAN array (M2000 storage works configured as RAID10) and direct to an ESX host disk. The direct to host results are better than last time, particularly for the fileserver workload. The results for the 2 node pmsApp (RAID1) are displayed for convenience, although they are the same as in the last post.

(randomwrite and randomread use a 100 MB file)

Storage    varmail    webserver   fileserver  randomwrite  randomread  webproxy
SAN        25.8 MB/s  25.45 MB/s  25.85 MB/s  76.1 MB/s    90.0 MB/s   15.3 MB/s
Direct     6.15 MB/s  3.9 MB/s    9.1 MB/s    48.12 MB/s   88.9 MB/s   3.2 MB/s
PMSARAID1  1.24 MB/s  6.83 MB/s   4.20 MB/s   13.63 MB/s   89.1 MB/s   1.5 MB/s
PMSARAID5  1.56 MB/s  12.86 MB/s  4.6 MB/s    15.2 MB/s    89.3 MB/s   4.41 MB/s

As might be expected, the SAN array is miles faster than anything else. I won’t say anything more about it; I simply wanted to provide a comparison point.

Direct to host tests show a significant speed-up on the fileserver workload over the results in the last post. I don’t really have an explanation for this. The results were consistent, with a standard deviation of only 0.7 MB/s. The other workloads are faster too, but not by as much. This is a better control test, though, as it was exactly the same guest.

There is still the mystery of the webserver workload, which has now been amplified by the 3 node pmsApp, which triples the performance of the direct to host configuration.

The good news is that the 3 node pmsApp is faster than the 2 node. This shouldn’t be all that surprising as now it is only the parity data that needs to be written across the network rather than all data, so the results show a nice improvement throughout. Consistency has also improved, with standard deviation around 5%, except for the webserver workload.

I would love to compare the performance of the pmsApp to the actual VMware app, but I don’t have the hardware to do it. I am not sure whether the app will work with single disks.

Performance testing the pmsApp with filebench – RAID1

Following on from my previous post, I’ve decided to use filebench (version 1.4.9.1) for a more structured testing approach. Filebench is easy to install and easy to work with, as it already defines several application workload profiles. Workloads can also be defined easily (using WML scripting) to mimic any sort of application load on the IO system. I cannot comment on how accurate these workloads are, but they make testing easy :).

The method I’ve used was very simple. Three runs of 60 seconds each, with the default settings for the workload profiles, except for the randomwrite and randomread workloads where I had to reduce the file size to 100 MB. I disabled randomization ( echo 0 > /proc/sys/kernel/randomize_va_space ) as recommended by filebench. Filebench allocated 170 MB of shared memory on all the runs (this appears to be the default).

The results are the average results of the three runs, except for the randomread workload where I only did two runs as the results were very similar. I did not repeat any tests as there was a lot less variation in the results. The results presented below are the IO summary result, rather than the individual operations results.

So without further ado here are the results:

Server           varmail    webserver  fileserver  randomwrite  randomread  webproxy
PMSTest (RAID1)  1.24 MB/s  6.83 MB/s  4.20 MB/s   13.63 MB/s   89.1 MB/s   1.50 MB/s
PMSControl       5.02 MB/s  3.73 MB/s  4.93 MB/s   39.83 MB/s   88.9 MB/s   2.33 MB/s

The webserver results are somewhat mystifying. The workload consists of opening, reading and closing files, so the results should be fairly similar but for some reason the pmsApp is faster!

The rest of the results are as expected: writes are significantly slower due to the nature of the pmsApp (namely a RAID array across a network link) and reads are comparable.

It is also worth noting that regardless of the results, the test prep phase took longer, sometimes a lot longer, when running off the pmsApp, i.e. on PMSTest.

Ver 0.2 – Finally ready for download

I finally managed to create an OVA that can be deployed on ESXi4, VirtualBox and VMware Workstation.

I had to create an OVF with VirtualBox, edit it to change the hardware type from virtualbox-2.2 to vmx-7 and then use VMware ovftool to convert the OVF to OVA.

When deployed on ESXi4, it will complain about the guest os not being recognized, but you can still deploy it and later change the guest os setting.

Happy downloading.

Performance testing the pmsApp

A few weeks ago a colleague pointed out Jimmy’s website about the pmsApp, and seeing as we have a few underused blades in our development environment, I offered Jimmy my help.

He’s already published my findings regarding failover testing of the pmsApp and today I thought I would post some performance metrics. The first set of tests will be using the pmsApp configured to use a RAID1 (software) array.

In order to run the tests, I have created two guests:

One running off the pmsApp, PMSTest, and the other running locally on the blade disk (It’s a single disk), PMSControl.

I have installed CentOS 6.2 minimal install on both of them and installed the openssh-clients package to allow me to use scp.

Each guest has 512 MB of RAM, a single NIC (dynamically configured, with a lease time of 36 hours) and an 8 GB hard drive.

Generally speaking, I ran each test three times, unless there was a great disparity between the results, in which case I ran extra tests (normally two but sometimes three) and discarded the top and bottom results.

Generally speaking, the results of the guest running off the pmsApp were considerably less stable.

Reboot Time: This measures the time from typing init 6 and pressing enter until the logon screen appears. I found this test to be more reliable than pressing reset on the vSphere client or even using PowerCLI.

PMSControl    PMSTEST (RAID1)
34.66 s           30.44 s
33.02 s           27.75 s
34.07 s           28.63 s

This was a clear win for the pmsApp.
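To put a number on it, the three runs can be averaged. The helper below is a small sketch (the name mean is my own) that reads one value per line and prints the arithmetic mean. For the reboot times above it gives about 33.9 s for PMSControl and about 28.9 s for PMSTest, i.e. roughly 5 seconds in favour of the pmsApp.

```shell
#!/bin/sh
# mean
# Reads one number per line on stdin and prints the arithmetic
# mean with two decimals. Prints nothing for empty input.
# (Hypothetical helper for summarising test runs.)
mean() {
    awk '{ sum += $1; n++ } END { if (n > 0) printf "%.2f\n", sum / n }'
}

# Example with the three PMSTest reboot times:
# printf '30.44\n27.75\n28.63\n' | mean
```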

The remaining tests are file copying tests. The source and destination server (depending on the test)  was a third blade server running RHEL 6.0. In theory traffic bound for another blade never leaves the enclosure, but I’m not so sure. I ran these tests out of hours so there would have been very little traffic, if any, on the network. I did repeat some tests during the working day for good measure and results were comparable.

SCP Write -  Copy File VMware ESXi 5.0 iso (297 MB) from blade server to Guest

PMSControl    PMSTEST (RAID1)
24.675 s          45.195 s
30.980 s          45.197 s
28.505 s          56.486 s

The pmsApp is clearly slower.

SCP Read -  Copy File VMware ESXi 5.0 iso (297 MB) from Guest to blade server.

PMSControl    PMSTEST (RAID1)
7.764 s           7.762 s
8.211 s            7.711 s
7.674 s           8.017 s

Reading is a tie.

SCP Write – Copy Directory containing multiple files (total size 203 MB) from blade server to guest. There is an 87 MB file and a 16 MB file, but the rest are sub 1 MB files.

PMSControl    PMSTEST (RAID1)
12.717 s         20.572 s
13.092 s         18.772 s
11.752 s          23.380 s

Another writing test and the pmsApp is slower again.

SCP Read – Copy Directory containing multiple files (total size 203 MB) from guest to blade server. There is an 87 MB file and a 16 MB file, but the rest are sub 1 MB files.

PMSControl    PMSTEST (RAID1)
11.531 s          11.623 s
10.972 s          11.082 s
11.294 s          11.245 s

Both the pmsApp and the direct guest are faster copying multiple small files than one big one, which is odd as I would’ve expected one big file to be quicker, but the results are consistent on both guests so I’m not too worried. I’d be happy to hear an explanation though.

This very small set of tests has shown that read performance is comparable to hosting a guest directly on a disk. Unfortunately, write performance does suffer quite a bit: it’s ~40% slower and not very consistent.
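The write figure can be roughly reproduced from the single-file SCP write test, assuming "slower" is read as lower throughput rather than longer elapsed time (that reading is an assumption). The mean times are about 28.05 s for PMSControl and 48.96 s for PMSTest for the 297 MB file. The two helpers below are a sketch with made-up names.

```shell
#!/bin/sh
# throughput SIZE_MB SECONDS
# Prints the transfer rate in MB/s with two decimals.
# (Hypothetical helper.)
throughput() {
    awk -v size="$1" -v secs="$2" 'BEGIN { printf "%.2f\n", size / secs }'
}

# pct_lower BASE VALUE
# Prints by how many percent VALUE is below BASE, rounded to a
# whole number. (Hypothetical helper.)
pct_lower() {
    awk -v base="$1" -v val="$2" \
        'BEGIN { printf "%.0f\n", (base - val) / base * 100 }'
}

# Example with the mean single-file write times:
# control=$(throughput 297 28.05)
# pmsapp=$(throughput 297 48.96)
# pct_lower "$control" "$pmsapp"
```

With these inputs the control guest writes at about 10.6 MB/s and the pmsApp guest at about 6.1 MB/s, which is a little over 40% lower.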

In my next post I intend to use filebench to carry out some more tests using a more standardized approach.

Failover testing

A friend of the pmsApp project, A.Z., has done some failover testing. This is what he found.

As long as there is no I/O taking place while the cluster is failing over, all is ok. If there is I/O then the results are mixed.

  • SCP stopped and would not restart
  • A script I wrote (writing the current date to a file every second) is unable to write to the file, but when the cluster comes back up, it continues on its merry way.

The problem seems to be that an ongoing session is not moved from the failed node to the node that takes over; a new session needs to be established. I am looking into the possibility of having the sessions transferred from the failing node to the new active node. So far all I have found is an article on how to create a failover nfs cluster. It is based on drbd, however, but that should not matter much.

A big thank you to A.Z. for testing the pmsApp. Looking forward to your performance tests.