Creating an OVA

While working on creating the next version of the pmsApp OVA, I ran into some issues, so I thought I would write a small guide to creating an OVA here, mostly for my own benefit.

I installed version 0.2 of the pmsApp and updated it to get the latest software versions. When I then tried to create a new OVA, it turned out to be quite a lot larger than the original. I needed to make the image smaller.

I started out by clearing the yum cache and uninstalling the old kernel, which shrank the used space to around the same as before the update. I also found this guide on how to prepare a VM to be used as a template, which freed up a bit more space.

Using VirtualBox, I created an OVA from the VM. The problem is that an OVA created with VirtualBox cannot be imported into ESXi. Fortunately, an OVA is actually just a tar archive, so I was able to extract the .ovf file and the vmdks without issues. After that it is a question of editing the ovf and changing the hardware version from virtualbox-2.2 to vmx-Y, where Y is the desired VMware hardware version. I plan on going with vmx-8.

The next problem is assembling the OVA again. To do that, I used VMware's ovftool, but it wants a .mf file to ensure that the files are not corrupted. I found this article explaining how to create the .mf file.

To have everything in one place, here is what I have done to the VM to make it into an OVA:

First, we are working “inside” the VM to reduce space used.

Stop logging:

service rsyslog stop
service auditd stop

Remove old kernels:

uname -a
rpm -qa | grep kernel
# remove old kernel versions with: yum remove
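# for example (the version below is just a placeholder, use the old
# versions listed by rpm -qa above):
yum remove kernel-[old-version]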

Clean yum cache:

yum clean all

Empty logs:

logrotate -f /etc/logrotate.conf
rm -f /var/log/*-???????? /var/log/*.gz /var/log/*.old /var/log/anac*
cat /dev/null > /var/log/audit/audit.log
cat /dev/null > /var/log/wtmp
cat /dev/null > /var/log/lastlog
# there might be other log files to empty out with cat /dev/null
# I will update this next time I go through the process

Clean tmp space:

rm -rf /tmp/*
rm -rf /var/tmp/*

Zero out free space to ensure the disk images can be made as small as possible:

dd if=/dev/zero of=/boot/zerofile bs=1M
sync
rm -rf /boot/zerofile

dd if=/dev/zero of=/zerofile bs=1M
sync
rm -rf /zerofile

Finally, shut down the VM:

shutdown -h now

Using VirtualBox, I exported the VM as pmsApp-0.3.ova. Next are the steps taken to ensure that the OVA is as small as possible and can be imported by ESXi.

Extract OVA:

mv pmsApp-0.3.ova pmsApp-0.3.tar
tar xf pmsApp-0.3.tar

The above step can be skipped if the VM is exported as an .ovf instead of an .ova.

Replace the system type to ensure that the OVA can be imported by ESXi:

sed 's/virtualbox-2.2/vmx-8/g' pmsApp-0.3.ovf > pmsApp-tmp.ovf
mv pmsApp-tmp.ovf pmsApp-0.3.ovf

Ensure that the disk image is as small as possible:

qemu-img convert -p -O vmdk pmsApp-0.3-disk1.vmdk thindisk.vmdk
mv thindisk.vmdk pmsApp-0.3-disk1.vmdk

Create a manifest file to ensure data consistency in OVA:

openssl sha1 *.ovf *.vmdk > pmsApp-0.3.mf
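
The resulting manifest should contain one checksum line per file, in the format ovftool expects, something like:

SHA1(pmsApp-0.3.ovf)= [sha1-digest]
SHA1(pmsApp-0.3-disk1.vmdk)= [sha1-digest]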

Create the OVA:

ovftool pmsApp-0.3.ovf pmsApp-0.3.ova

A new version of the pmsApp OVA

For some time now, I have been contemplating updating the pmsApp OVA to make it even easier to deploy the Poor Mans Storage Appliance.

Some of the things I am thinking of updating are:

  • Set a password for the ricci user (probably going to be pmsapp)
  • Make it possible to create the shared storage from web interface
  • Include Luci so that the cluster can be created via web browser
  • Create a guide to configuring the cluster using Luci
  • Possibly update the web interface to provide cluster and storage information.

This list is mostly for my own benefit, but stay tuned for updates.

Expand a zpool with a single supporting device on Linux

Initial status:

Because I have had trouble with the way ZFS on Linux (ZOL) handles failing drives, I have decided to create a software
RAID set of my drives before creating my zpool.
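
For context, creating such a software RAID set looks something like this (the RAID level and device names here are examples only, not the exact command used):

mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda /dev/sdb /dev/sdc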

This means that my zpool is created with a single device (/dev/md0). If I remember correctly, it was created like this:

zpool create pool0 md0

Expansion:

By default, pools are not set to autoexpand, so it needs to be turned on:

zpool set autoexpand=on pool0

Before I could get the pool expanded I needed to grow my RAID set. That is done by first adding a device to the existing RAID set and then “growing” it to the new number of devices.

mdadm /dev/md0 --add /dev/sdd
mdadm --grow /dev/md0 --raid-devices=4
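
The reshape can take quite a while; its progress can be followed with:

cat /proc/mdstat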

After the RAID set has finished reshaping, the extra space is available for use. According to all the information I could find, it should be possible to expand the pool either by exporting and importing it

zpool export pool0
zpool import pool0

or by expanding the online pool

zpool online -e pool0 md0

I tried both several times and in different orders, but to no avail. My pool stayed at its original size.

Reboot:

Finally I decided to do a reboot to see if that would change anything. Remember to save the new mdadm configuration before
rebooting.

# Remove existing array configuration from mdadm.conf file
grep -v ARRAY /etc/mdadm/mdadm.conf > /tmp/mdadm.conf
# Add new array configuration to mdadm.conf
cat /tmp/mdadm.conf > /etc/mdadm/mdadm.conf
mdadm --examine --scan >> /etc/mdadm/mdadm.conf

After the reboot the pool was still the original size, but now the online expansion actually worked:

zpool online -e pool0 md0
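
The new size can be verified with:

zpool list pool0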

And that solved the problem. Unfortunately I do not know why I could not perform the online expansion without a reboot.

General ramblings on fencing

I have been playing around with pmsApp v0.2 in my new VMware environment and got to thinking that maybe it was possible to fence through VMware… And it is.

A bit of googling gave me this result: https://fedorahosted.org/cluster/wiki/VMware_FencingConfig. It turns out that the fence_vmware agent depends on the VI Perl Toolkit, so I googled a bit more and found an installation guide from VMware: https://www.vmware.com/support/pubs/beta/viperltoolkit/doc/perl_toolkit_install.html

As I am working on CentOS, the Perl dependencies can be installed through RPMs. The following seems to satisfy the dependencies:

yum install -y \
perl-Crypt-SSLeay \
perl-XML-LibXML \
perl-libwww-perl \
perl-Class-MethodMaker \
perl-devel \
perl-Test-Simple

I followed VMware's guide to build and install the VI Perl Toolkit, but afterwards it still did not work.

I got an error message saying:

fence_vmware_helper returned Undefined subroutine &Opts::add_options called at /usr/sbin/fence_vmware_helper line 82.

Googling for that error gave me this result: http://www.redhat.com/archives/linux-cluster/2011-January/msg00134.html

Basically, I need to edit /usr/local/share/perl5/VMware/VIRuntime.pm and add:

use VMware::VILib;

above the line:

use VMware::VIM2Runtime;
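
A one-liner that makes the same edit (a sketch; adjust the path if your VI Perl Toolkit is installed somewhere else):

sed -i '/use VMware::VIM2Runtime;/i use VMware::VILib;' /usr/local/share/perl5/VMware/VIRuntime.pm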

Now I am able to check the status of a VM with fence_vmware:

fence_vmware -o status -a vcenterserver -l [username] -p [password] -n [vmname]
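
Assuming the status check works, the same syntax should also allow the actual fencing action, for example:

fence_vmware -o reboot -a vcenterserver -l [username] -p [password] -n [vmname]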

Shared storage with compression using drbd and zfs

This is not an update of the original Poor Mans Storage Appliance, but rather a new take on it. This time it is built with DRBD and ZFS. DRBD is designed for mirroring over the network, and the ZFS filesystem is able to do compression and deduplication, although deduplication is not recommended unless you have lots and lots of memory.

I started out with Debian 7.5 (wheezy), but it comes stock with drbd 8.3, and most guides I could find on configuring and using drbd are for drbd 8.4. By using backports I was able to get drbd up to version 8.4, but then I had trouble getting zfs to work correctly. After some trial and error I decided that it was too much hassle and instead opted for Ubuntu. Ubuntu 14.04 LTS comes with drbd 8.4, and zfs installs without any issues.

I started with a VM on each of my two physical hosts. Each VM has a virtual disk for the OS and a virtual disk for the shared storage. The following resources were used as reference:

After installing Ubuntu Server, configuring static IP and ensuring host names resolve (either through your own DNS server or /etc/hosts files on both VMs), I started installing software (everything needs to be done on both VMs).

# Update repository list and upgrade packages to latest version
sudo apt-get update && sudo apt-get -y dist-upgrade

To install ZFS on Linux, an extra repository needs to be added:

# Add zfs on linux repository
sudo add-apt-repository ppa:zfs-native/stable

Install the software: drbd to distribute data between the nodes, ntp to keep time synchronized, zfs to create a filesystem capable of doing compression and deduplication, nfs to export the distributed storage, and heartbeat to manage the two nodes as a cluster.

# Update repository list and install software
sudo apt-get update && sudo apt-get -y install \
drbd8-utils ntp ntpdate ubuntu-zfs nfs-kernel-server heartbeat

Reboot to ensure that new versions are being used.

# Reboot to use new kernel and software
sudo reboot

After the VMs have restarted, we start by configuring the drbd device. This is done by creating a resource description file in /etc/drbd.d:

sudo nano /etc/drbd.d/myredundantstorage.res

The file should look something like this:

resource myredundantstorage {
  protocol C;
  startup { wfc-timeout 5; degr-wfc-timeout 15; }

  disk { on-io-error detach; }

  syncer { rate 10M; }

  on node1.mydomain.local {
    device /dev/drbd0;
    disk /dev/vdb;
    meta-disk internal;
    address 192.168.0.41:7788;
  }

  on node2.mydomain.local {
    device /dev/drbd0;
    disk /dev/vdb;
    meta-disk internal;
    address 192.168.0.42:7788;
  }
}

Make sure that the VMs’ host names match the host names specified in the configuration file above. The names in the file should be the same as the name reported by:

uname -n

Now that the resource file is created on both nodes, describing how the resource should be handled, we can create the actual resource.

# Create and enable storage on both nodes
sudo drbdadm create-md myredundantstorage
sudo drbdadm up myredundantstorage

You should now be able to see that both nodes are up although they are both secondary.

# Show status of our drbd device
cat /proc/drbd

On the node that you want to be primary, you can run either of these two commands:

# Create primary node without syncing devices (should only
# be done if you are sure that the disks do not contain any
# data)
sudo drbdadm -- \
--clear-bitmap new-current-uuid myredundantstorage

or:

# Create primary node by syncing the contents of local disk
# to the remote disk (can be very time-consuming)
sudo drbdadm -- \
--overwrite-data-of-peer primary myredundantstorage

Depending on which command is used, you should see the device either being ready or syncing, but in both cases one node should now be primary and the other secondary (cat /proc/drbd).

Now we have a device on which we can create a filesystem (/dev/drbd0). To enable us to choose compression and deduplication we will use zfs. The filesystem will only be created on the primary node.

# Create zfs filesystem on the redundant device
sudo zpool create myredundantfilesystem /dev/drbd0

The zfs documentation says that after the zpool is created, a zfs filesystem should be created inside that pool. I have opted not to do that for three reasons: the zpool itself already functions as a filesystem, I do not need more than one filesystem in my pool, and if a zfs filesystem is created inside the zpool, zfs will try to mount it during boot. If you want compression enabled on the filesystem:

sudo zfs set compression=on myredundantfilesystem

If you want deduplication enabled on the filesystem:

sudo zfs set dedup=on myredundantfilesystem
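
Whether the properties took effect can be checked with:

# Show compression and deduplication settings for the pool
sudo zfs get compression,dedup myredundantfilesystem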

Now we need to ensure that the mountpoint for the zfs filesystem exists on both nodes:

# Create directory for mounting zfs if it does not exist
if [ ! -d /myredundantfilesystem ]; then
 sudo mkdir /myredundantfilesystem
fi

On the primary node the directory should already exist and the zfs filesystem should be mounted. Now we can edit /etc/exports on both nodes to ensure that each node can export the zfs filesystem to clients. The export should look something like this:

/myredundantfilesystem 192.168.0.0/24(rw,async,no_root_squash,no_subtree_check,fsid=1)

Most options are “normal” for nfs exports, but the fsid is there to tell the clients that it is the same filesystem on both nodes.

Finally we need clustering to maintain a failover relationship between the nodes. We have already installed heartbeat; now we need to configure it (this should be done on both nodes). Before we start configuring heartbeat, we need to disable the nfs server during boot, since heartbeat will be responsible for starting it:

sudo update-rc.d -f nfs-kernel-server remove

First we will create the /etc/ha.d/ha.cf file; it should look something like this:

autojoin none
auto_failback off
keepalive 1
warntime 3
deadtime 5
initdead 20
bcast eth0
node node1.mydomain.local
node node2.mydomain.local
logfile /var/log/heartbeat-log
debugfile /var/log/heartbeat-debug

The parameter deadtime tells heartbeat to declare the other node dead after that many seconds, and heartbeat will send a heartbeat every keepalive seconds. Next we protect the heartbeat configuration by creating /etc/ha.d/authkeys; it should look something like this:

auth 3
3 md5 my_secret_key

Set permissions on the file

sudo chmod 600 /etc/ha.d/authkeys

Next we need to tell heartbeat about the resources we want it to manage. It is done in the file /etc/ha.d/haresources and it should look something like this:

node1.mydomain.local \
IPaddr::192.168.0.45/24/eth0 \
drbddisk::myredundantstorage \
zfsmount \
nfs-kernel-server

In this example node1 will be the primary, but if it fails, the other node will take over. We need a “floating” IP address, as our clients need a fixed IP to connect to. Heartbeat needs to ensure that the drbd device is active on the primary node, zfsmount is a script I have created in /etc/init.d (more about that in a short while), and lastly heartbeat should start the nfs server.

The zfsmount script referenced in haresources is one I have created myself to ensure that both the drbd device and the zfs filesystem get activated on the secondary node in the event of a primary node failure. The script should be located in /etc/init.d and accept at least the start and stop parameters. My script looks like this:

#!/bin/bash
EXITCODE=0
case "$1" in
 start)
  # Try to make this node primary for the drbd
  drbdadm primary myredundantstorage
  # Ensure that zfs knows about our filesystem
  zpool import -f myredundantfilesystem
  # Try to mount the zfs filesystem
  zfs mount myredundantfilesystem
  EXITCODE=$?
 ;;
 stop)
  # If the filesystem is mounted, it should be unmounted
  df -h | grep myredundantfilesystem > /dev/null
  if [ $? -eq 0 ]; then
   zfs unmount myredundantfilesystem
   EXITCODE=$?
  else
   EXITCODE=0
  fi
 ;;
esac
exit $EXITCODE
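
Remember to make the script executable on both nodes so that heartbeat can run it:

sudo chmod +x /etc/init.d/zfsmount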

Some trial and error, as well as a lot of looking through the /var/log/heartbeat-debug log file, helped me create this script. The script must exit with success when called with stop, even if the resource has already been stopped.

Initially I thought that it would be sufficient to make the node primary (drbdadm primary myredundantstorage) during a failover, but it seems that the zfs driver does not necessarily recognize that there is a zfs filesystem just because the device becomes available. That is why the import is done. The import will fail if the filesystem is already recognized, but that does not matter in this script; it will still try to mount the filesystem.

I hope this will be helpful for other people as well. It took me a long time to find anything dealing with drbd and zfs at the same time. If you have any questions, don’t hesitate to contact me. ( jimmy at dansbo dot dk )

Btw. I can say that this setup performs a hell of a lot better than zfs and gluster on the same hardware.

So long, and thanks for all the fish

This is it. I have not been able to find the time or the energy to keep this project updated. The domain pmsapp.org will expire on the 2nd of March this year, and I do not plan on renewing it. I am sorry to see the project die, but lack of interest and a strained schedule on my part force me to stop.

If you are interested in this project or need assistance in setting up your own storage appliance, you can reach me at pmsapp at dansbo.dk.

Update for lack of news

Just wanted to let everyone know that this project is still alive. I am continuing work on creating a web interface to ease deployment of a pmsApp storage cluster. Unfortunately it will probably be some time before version 0.3 with the new web interface is released.

In the meantime I am still hoping for a nice logo for the pmsApp project, with a bit of luck something will have been submitted before the release of the next version.

I would like to thank manyrootsofallevil for his great work in performance testing the pmsApp and at the same time ask for input on how to increase performance. If you have any suggestions or tips on how to increase performance on software RAID created from iSCSI targets, please let us know.

Performance testing the pmsApp with filebench – RAID 5

I finally managed to get the 3 node pmsApp running, although there are some odd issues relating to failover that I will need to investigate further.

I altered the methodology slightly this time: I decided to use a single guest and move it around to the various storage devices.

In addition to the pmsApp, I tested on our SAN array (an M2000 StorageWorks configured as RAID10) and direct to an ESX host disk. The direct to host results are better than last time, particularly for the fileserver workload. The results for the 2 node pmsApp (RAID1) are displayed for convenience, although they are the same as in the last post.

            varmail      webserver    fileserver   randomwrite*  randomread*  webproxy
SAN         25.8 MB/s    25.45 MB/s   25.85 MB/s   76.1 MB/s     90.0 MB/s    15.3 MB/s
Direct      6.15 MB/s    3.9 MB/s     9.1 MB/s     48.12 MB/s    88.9 MB/s    3.2 MB/s
PMSARAID1   1.24 MB/s    6.83 MB/s    4.20 MB/s    13.63 MB/s    89.1 MB/s    1.5 MB/s
PMSARAID5   1.56 MB/s    12.86 MB/s   4.6 MB/s     15.2 MB/s     89.3 MB/s    4.41 MB/s

* 100 MB file

As might be expected, the SAN array is miles faster than anything else, so I won’t say anything more about it; I simply wanted to provide a comparison point.

Direct to host tests show a significant speed-up on the fileserver workload over the results in the last post. I don’t really have an explanation for this. The results were consistent, with a standard deviation of only 0.7 MB/s. The other workloads are faster too, but not by as much. This is a better control test, though, as it was exactly the same guest.

There is still the mystery of the webserver workload, which has now been amplified by the 3 node pmsApp, which triples the performance of the direct to host configuration.

The good news is that the 3 node pmsApp is faster than the 2 node. This shouldn’t be all that surprising as now it is only the parity data that needs to be written across the network rather than all data, so the results show a nice improvement throughout. Consistency has also improved, with standard deviation around 5%, except for the webserver workload.

I would love to compare the performance of the pmsApp to the actual VMware app, but I don’t have the hardware to do it. Not sure whether the app will work with single disks.

Performance testing the pmsApp with filebench – RAID1

Following on from my previous post, I’ve decided to use filebench (version 1.4.9.1) for a more structured testing approach. Filebench is easy to work with, as it comes with several predefined application type (workload) profiles and is easy to install. Workloads can also be defined easily (using WML scripting) to mimic any sort of application load on the IO system. I cannot comment on how accurate these workloads are, but they make testing easy :).

The method I used was very simple: three runs of 60 seconds each, with the default settings for the workload profiles, except for the randomwrite and randomread workloads, where I had to reduce the file size to 100 MB. I disabled randomization (echo 0 > /proc/sys/kernel/randomize_va_space) as recommended by filebench. Filebench allocated 170 MB of shared memory on all the runs (this appears to be the default).
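
For reference, a single run looked roughly like this (a sketch only; the exact variable names come from the workload files shipped with filebench 1.4.9.1, and the target directory is just an example):

# disable address space randomization, as recommended by filebench
echo 0 > /proc/sys/kernel/randomize_va_space

# interactive filebench session
filebench
filebench> load randomwrite
filebench> set $dir=/mnt/testdir
filebench> set $filesize=100m
filebench> run 60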

The results are the averages of the three runs, except for the randomread workload, where I only did two runs as the results were very similar. I did not repeat any tests this time, as there was a lot less variation in the results. The results presented below are the IO summary result rather than the individual operation results.

So without further ado here are the results:

Server           varmail      webserver    fileserver   randomwrite  randomread   webproxy
PMSTest (RAID1)  1.24 MB/s    6.83 MB/s    4.20 MB/s    13.63 MB/s   89.1 MB/s    1.50 MB/s
PMSControl       5.02 MB/s    3.73 MB/s    4.93 MB/s    39.83 MB/s   88.9 MB/s    2.33 MB/s

The webserver results are somewhat mystifying. The workload consists of opening, reading and closing files, so the results should be fairly similar but for some reason the pmsApp is faster!

The rest of the results are as expected: writes are significantly slower due to the nature of the pmsApp (namely a RAID array across a network link) and reads are comparable.

It is also worth noting that regardless of the results, the test prep phase took longer, sometimes a lot longer, when running off the pmsApp, i.e. on PMSTest.

Ver 0.2 – Finally ready for download

I finally managed to create an OVA that can be deployed on ESXi4, VirtualBox and VMware Workstation.

I had to create an OVF with VirtualBox, edit it to change the hardware type from virtualbox-2.2 to vmx-7, and then use VMware’s ovftool to convert the OVF to an OVA.

When deployed on ESXi4, it will complain about the guest OS not being recognized, but you can still deploy it and change the guest OS setting afterwards.

Happy downloading.