Last week we had a failure on a server where two disks in an md RAID set reported bad sectors at the same time. This caused the server to lock up and hang. Normally a quick reboot would have solved the problem, but in this case the reboot got stuck while mounting the partitions, with the message “Recovering journal”. The server hung there for a long time and nothing happened. The cause was a corrupt journal: the journal on the ext3 filesystem had probably been hit by the bad blocks on the disks.

It proved to be quite difficult to recover from this error. The following steps were needed to get the server to boot normally again (a consolidated command sequence follows the list):

  1. Clear the needs_recovery flag if it is set on the filesystem, otherwise the journal cannot be removed: debugfs -w -R "feature ^needs_recovery" /dev/VolGroupXX/LogVolXX
  2. Remove the journal from the partition: tune2fs -f -O ^has_journal /dev/VolGroupXX/LogVolXX
  3. Check the filesystem: fsck -y /dev/VolGroupXX/LogVolXX
  4. Enable journalling again: tune2fs -j /dev/VolGroupXX/LogVolXX
  5. Reboot
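
For reference, the full sequence looks like this (the /dev/VolGroupXX/LogVolXX device path is a placeholder; substitute your own volume group and logical volume names):

# debugfs -w -R "feature ^needs_recovery" /dev/VolGroupXX/LogVolXX
# tune2fs -f -O ^has_journal /dev/VolGroupXX/LogVolXX
# fsck -y /dev/VolGroupXX/LogVolXX
# tune2fs -j /dev/VolGroupXX/LogVolXX
# reboot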

Needless to say, when this happens the disks are ready for the bin. Get rid of them as soon as possible! 🙂

Update 2013-02-28: I’ve updated the driver to include the datastore id in the path, so it is now possible to use the driver with multiple datastores. The driver now also correctly downloads images that are imported from URLs, e.g. via the marketplace.

Based on this blog article I created an updated datastore driver that allows you to use a ZFS backend with OpenNebula. This datastore driver uses ZFS snapshot functionality to clone images. I created the driver to be able to start a VM with a persistent disk without having to wait until the full image file is copied into the datastore.

The driver implements updated versions of the cp, clone and rm commands. I don’t use the mkfs command, so I have not implemented it for the ZFS datastore driver yet.

Due to the layout of OpenNebula datastores and the way ZFS snapshots work, I had to make a workaround. For the sake of simplicity I decided to use the filesystem driver as a basis, which means that the images are files in a datastore directory. ZFS snapshots, however, operate at the dataset (directory) level. To make snapshotting possible, the driver creates a symlink in the datastore location that points to the image file in a ZFS-backed NFS directory.

Below is a short outline based on the blog article, plus my own additions. The ZFS datastore driver is available as a tar download: zfs-datastore-v1.1

Setup the ZFS part
Install OpenIndiana, create a ZFS pool, create all the necessary ZFS datasets and share them over NFS with the frontend server.

# zfs create tank/export/home/cloud
# zfs set mountpoint=/srv/cloud tank/export/home/cloud
# zfs create tank/export/home/cloud/images
# chown -R oneadmin:cloud /srv/cloud
# zfs set sharenfs='rw=@93.188.251.125/32,root=@93.188.251.125/32' tank/export/home/cloud
# zfs allow oneadmin destroy,clone,create,mount,share,sharenfs tank/export/home/cloud
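
Optionally, verify that the share options and the delegated permissions ended up as intended (run on the ZFS server; output depends on your setup):

# zfs get sharenfs tank/export/home/cloud
# zfs allow tank/export/home/cloud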

Setup a ZFS dataset per datastore
For each datastore that you want to host on the ZFS server you have to create a dataset and allow oneadmin to manage it. Replace datastoreid with the id of the datastore you will create (e.g. 100). Make sure to chown the directory so that oneadmin can access it.

# zfs create tank/export/home/cloud/datastoreid
# zfs allow oneadmin destroy,clone,create,mount,share,sharenfs tank/export/home/cloud/datastoreid
# chown oneadmin:other /srv/cloud/datastoreid

Install ZFS datastore driver

Unpack the tar file with the driver into the remote datastore directory (/var/lib/one/remotes/datastore/):

$ tar xvf zfs-datastore.tar

Configure the ZFS datastore driver with the correct parameters and make sure passwordless SSH connectivity is possible between the frontend and the ZFS host (a minimal key-based setup is sketched after the configuration below).

zfs.conf:

ZFS_HOST=10.10.10.3
ZFS_POOL=tank
ZFS_BASE_PATH=/export/home/cloud ## this is the path that maps to /srv/cloud
ZFS_LOCAL_PATH=/srv/cloud ## the local mount point on the frontend that corresponds to ZFS_BASE_PATH
ZFS_CMD=/usr/sbin/zfs
ZFS_SNAPSHOT_NAME=golden
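
The driver runs its zfs commands on ZFS_HOST over SSH (as oneadmin in this setup, given the zfs allow delegation above), so key-based, passwordless SSH from the frontend to the ZFS host is required. A minimal sketch, assuming the ZFS_HOST from the configuration above (adjust user and host to your environment):

$ ssh-keygen -t rsa                          # as oneadmin on the frontend, accept the defaults
$ ssh-copy-id oneadmin@10.10.10.3            # or append the public key to ~/.ssh/authorized_keys manually
$ ssh oneadmin@10.10.10.3 /usr/sbin/zfs list # should not prompt for a password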

Configure OpenNebula to use ZFS datastore driver

Update the datastore driver configuration in oned.conf to include the zfs driver. See the example below:

DATASTORE_MAD = [
    executable = "one_datastore",
    arguments  = "-t 15 -d fs,vmware,iscsi,zfs"
]
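
After changing oned.conf, restart OpenNebula so the new driver is loaded. On my setup this is done with the standard control script as oneadmin (your installation may use an init script instead):

$ one stop
$ one start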

Create new ZFS datastore

When the ZFS setup is done, make sure the NFS share is mounted on the frontend; in my example it is mounted on /srv/cloud (see the example mount below). Then you can create a datastore that uses the new driver.
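
A mount along these lines should work, assuming the share is exported at its mountpoint /srv/cloud on the ZFS host from zfs.conf (add a corresponding /etc/fstab entry to make it persistent):

# mount -t nfs 10.10.10.3:/srv/cloud /srv/cloud

The datastore template itself is minimal: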

zfstest.conf:

NAME = zfstest
DS_MAD = zfs
TM_MAD = ssh


$ onedatastore create zfstest.conf

Create new image in ZFS datastore

Create a configuration file for the new image:

# cat /tmp/centos63-5gb.conf
NAME = "Centos63-5GB-zfs"
TYPE = OS
PATH = /home/user/centos63-5gb.img
DESCRIPTION = "CentOS 6.3 5GB image contextualized"

Run the oneimage create command against the correct datastore:

[oneadmin@cloudcontroller1 ~]$ oneimage create -d 100 /tmp/centos63-5gb.conf
ID: 111

The image path to be used when creating the snapshot can be found in the image details of the newly created image (ID 111 in our example):

$ oneimage show 111
[…]
SOURCE : /var/lib/one/datastores/100/4ce405866cf95a4d77b3a9dd9c54fa73
[…]

To use this image as a golden image, create a snapshot of it on the ZFS server; this snapshot will be the basis for future clones. Instant cloning relies on ZFS's ability to create a new dataset (clone) from an existing snapshot, which means the snapshot has to exist first. So after you upload the golden image, manually create a snapshot of it. This only needs to be done once, as the snapshot can be reused as many times as needed. Note that this command must be run on the ZFS server, not on the frontend!


# zfs snapshot tank/export/home/cloud/100/4ce405866cf95a4d77b3a9dd9c54fa73@golden

Now this image can be used for instant cloning.
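
Under the hood, instant cloning comes down to a zfs clone of that snapshot plus a new symlink in the datastore directory. A simplified sketch of what the driver effectively does (the new image hash is a placeholder, and the paths follow the datastore-id layout of the updated driver):

On the ZFS server:
# zfs clone tank/export/home/cloud/100/4ce405866cf95a4d77b3a9dd9c54fa73@golden tank/export/home/cloud/100/<new-image-hash>

On the frontend:
$ ln -s /srv/cloud/100/<new-image-hash>/4ce405866cf95a4d77b3a9dd9c54fa73 /var/lib/one/datastores/100/<new-image-hash>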

Example ZFS datastore content on the frontend

[oneadmin@cloudcontroller1 ~]$ cd /var/lib/one/datastores/100/
[oneadmin@cloudcontroller1 100]$ ls -al
total 28
drwxr-xr-x 2 oneadmin oneadmin 4096 Nov 7 02:43 .
drwxr-xr-x 6 oneadmin oneadmin 4096 Oct 15 17:04 ..
lrwxrwxrwx 1 oneadmin oneadmin 76 Oct 16 03:21 25473f081ba733822f3e9ba1df347753 -> /srv/cloud/25473f081ba733822f3e9ba1df347753/25473f081ba733822f3e9ba1df347753
lrwxrwxrwx 1 oneadmin oneadmin 76 Oct 16 02:53 2bf829fedb6e1728f204be8a19ff8f8c -> /srv/cloud/2bf829fedb6e1728f204be8a19ff8f8c/2bf829fedb6e1728f204be8a19ff8f8c
lrwxrwxrwx 1 oneadmin oneadmin 76 Oct 19 17:41 4ce405866cf95a4d77b3a9dd9c54fa73 -> /srv/cloud/4ce405866cf95a4d77b3a9dd9c54fa73/4ce405866cf95a4d77b3a9dd9c54fa73
lrwxrwxrwx 1 oneadmin oneadmin 76 Oct 16 03:24 a00d08dd9b7447818e110115cbc33056 -> /srv/cloud/a00d08dd9b7447818e110115cbc33056/25473f081ba733822f3e9ba1df347753
lrwxrwxrwx 1 oneadmin oneadmin 76 Oct 16 03:18 cab665db977255c4c76c7aa3d687a6d6 -> /srv/cloud/cab665db977255c4c76c7aa3d687a6d6/2bf829fedb6e1728f204be8a19ff8f8c

Example zfs list output on the ZFS server

root@openindiana:/home/rogierm# zfs list
NAME USED AVAIL REFER MOUNTPOINT
rpool 5.68G 361G 45.5K /rpool
rpool/ROOT 1.56G 361G 31K legacy
rpool/ROOT/openindiana 1.56G 361G 1.55G /
rpool/dump 2.00G 361G 2.00G -
rpool/export 133K 361G 32K /export
rpool/export/home 101K 361G 33K /export/home
rpool/export/home/oneadmin 34K 361G 34K /export/home/oneadmin
rpool/export/home/rogierm 34K 361G 34K /export/home/rogierm
rpool/swap 2.12G 362G 133M -
tank 35.3G 1.04T 32K /tank
tank/export 35.3G 1.04T 32K /tank/export
tank/export/home 35.3G 1.04T 31K /tank/export/home
tank/export/home/cloud 35.3G 1.04T 15.3G /srv/cloud
tank/export/home/cloud/1872dba973eb2f13ef745fc8619d7c30 1K 1.04T 5.00G /srv/cloud/1872dba973eb2f13ef745fc8619d7c30
tank/export/home/cloud/25473f081ba733822f3e9ba1df347753 5.00G 1.04T 5.00G /srv/cloud/25473f081ba733822f3e9ba1df347753
tank/export/home/cloud/28e04ccbc4e55779964330a2131db466 1K 1.04T 5.00G /srv/cloud/28e04ccbc4e55779964330a2131db466
tank/export/home/cloud/2bf829fedb6e1728f204be8a19ff8f8c 40.0M 1.04T 40.0M /srv/cloud/2bf829fedb6e1728f204be8a19ff8f8c
tank/export/home/cloud/4ce405866cf95a4d77b3a9dd9c54fa73 5.00G 1.04T 5.00G /srv/cloud/4ce405866cf95a4d77b3a9dd9c54fa73
tank/export/home/cloud/8b86712ae314bc80eef1dfc303740a87 1K 1.04T 5.00G /srv/cloud/8b86712ae314bc80eef1dfc303740a87
tank/export/home/cloud/931492cd32cb96aad3b8dce4869412f3 1K 1.04T 5.00G /srv/cloud/931492cd32cb96aad3b8dce4869412f3
tank/export/home/cloud/a00d08dd9b7447818e110115cbc33056 5.00G 1.04T 5.00G /srv/cloud/a00d08dd9b7447818e110115cbc33056
tank/export/home/cloud/cab665db977255c4c76c7aa3d687a6d6 1K 1.04T 40.0M /srv/cloud/cab665db977255c4c76c7aa3d687a6d6
tank/export/home/cloud/images 5.00G 1.04T 32K /srv/cloud/images
tank/export/home/cloud/images/centos6 5.00G 1.04T 5.00G /srv/cloud/images/centos6
tank/export/home/cloud/one 63K 1.04T 32K /srv/cloud/one
tank/export/home/cloud/one/var 31K 1.04T 31K /srv/cloud/one/var

Several blogs and manuals with example KVM or Xen setups use NFS as the storage backend. Most of them state that iSCSI is recommended for production use. However, there are architectures where NFS is an integral part, e.g. OpenNebula. I tried to find specific statistics on the performance differences between NFS, iSCSI and local storage. During this search I found some pointers that NFS and Xen is not a good combination, but never a direct comparison.

I decided to invest some time, set up a small test environment and run some bonnie++ benchmarks. This is not a scientifically designed experiment, but a test to show the differences between the platforms. Two test platforms were set up: one with a Xen server (DL360G6, xen1) and a 12-disk SATA storage server (storage1), and another with a KVM server (DL360G5, kvm1) and a 2-disk SATA storage server (storage2). Both setups are connected over a gigabit network. I’ve also run a test with a 100 Mbit/s network between the kvm1 and storage2 servers. For reference I’ve also done tests with the images on local disk.

I realize that LVM and iSCSI storage are the most efficient, but image-file-based storage is very convenient and, in the case of cloud setups, sometimes the only option.
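
The numbers below were gathered with bonnie++ inside the guests (and directly on the hosts for the reference runs). A typical invocation looks something like this; the target directory and size are illustrative and were varied per test:

# bonnie++ -d /mnt/testdisk -s 1G -u root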

                                         ---------Seq Output---------- -----Seq Input----- --Random--
                                         -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks---
                                    Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP   /sec %CP
Xen-guest-via-nfs-tapaio              1G  3570   5  2436   0  1366   0 26474  41 24831   0 6719.0   1
xen-guest-via-iscsi                   1G 25242  40 12071   1 15175   0 32071  42 47742   0 7331.3   1
kvm-guest-nfs-1gb-net                 1G  8140  16 17308   3 11864   2 40861  81 71711   3 2126.6  54
kvm-guest-nfs-qcow-100mb              1G  1922   3  9874   1  3994   0 10720  22 10441   0  595.4  33
kvm-guest-nfs-qcow-100mb-2nd          1G  9735  21  2039   0  3197   0 10729  22 10463   0  685.3  38
kvm-guest-nfs-qcow-100mb-3rd          1G  5327  10  7378   1  4421   0 10655  18 10512   0  706.3  39
xenserver-nfsmount                    1G 41507  60 60921   7 29687   1 33427  48 64147   0 4674.4  11
kvmserver-nfs-1G                     20G 31158  52 32044  17 10749   2 19152  28 18987   1   90.3   1
localdisk-on-nfs-server-cloudtest3    4G 41926  65 43805   7 18928   3 52943  72 56616   3  222.6   0

The conclusion of the tests is that local storage is fastest. NFS storage with Xen is not a good combination; Xen runs best with iSCSI-backed storage. KVM with NFS performs significantly better. It is safe to say that if you want to use NFS, use it with KVM, not with Xen. In any case iSCSI is the best option for Xen. I have not yet tested KVM with iSCSI, but I expect it to perform better than NFS.

A SAN is often implemented as a dedicated network that is considered to be secure. However, a SAN is by nature a shared network, and this brings some serious security risks that should be evaluated when using an iSCSI-based SAN. Some vendors consider an iSCSI network safe when it is implemented on dedicated switches (Dell EqualLogic. Securing storage area networks with iSCSI. EqualLogic Inc., 2008.). They consider it virtually impossible to snoop or inject packets in a switched network. We all know this is not the case; if it were true, why do we use firewalls, IDSs and tons of other security measures? Even if iSCSI runs on an isolated network and only the management interfaces of the storage devices are connected to a shared, general-purpose network, security is only as good as the hosts that are connected to the dedicated network. A single compromised host on the dedicated iSCSI network can attack the storage devices to gain access to LUNs belonging to other hosts.

When implementing an iSCSI network you should be aware of the security risks it imposes on the environment. To estimate the risk, awareness of the methods that can be used to secure iSCSI is paramount. The iSCSI protocol allows for the following security measures to prevent unintended or unauthorized access to storage resources:

  • Authorization
  • Authentication
  • Encryption

Because iSCSI setups are generally shared environments, access to the storage elements (LUNs) by unauthorized initiators should be blocked. Authorization is implemented by means of the iQN. The iQN is the initiator node name (iSCSI Qualified Name), which can be compared to a MAC address. During an audit, storage systems must demonstrate controls to ensure that a server under one regime cannot access the storage assets of a server under another. Typically, iSCSI storage arrays explicitly map initiators to specific target LUNs; an initiator authenticates not to the storage array, but to the specific storage asset it intends to use.

As an additional security measure, the iSCSI protocol allows initiators and targets to use CHAP to authenticate each other. This prevents simple access by spoofing the iQN. And finally, because iSCSI runs over IP, IPSec can be used to secure and encrypt the data flowing between the client (initiator) and the storage server (target).

Now that we know there are multiple ways to secure access to the storage resources, you might conclude that iSCSI must be safe and secure to use. Unfortunately this is not a given. There are several flaws in the iSCSI security design:

  • iQNs are trusted, but are easy to spoof, sniff and guess
  • Authorization is the only required security measure, and it relies solely on the iQN
  • Authentication is disabled by default
  • Authentication is (mostly) only implemented as CHAP
  • IPSec is difficult to implement

Because iQNs are manually configured in the iSCSI driver on the client, it is easy to change them. To get access to a LUN that is only protected by an iQN restriction, you can sniff the communication to obtain the iQN, or guess it as it is often a default string (e.g. iqn.1991-05.com.microsoft:hostname), configure the iSCSI driver to use this name and get access to the LUN.
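
To illustrate how low that barrier is: with open-iscsi on a Linux client the initiator name is just a line in a plain-text file, so impersonating another initiator is a one-line edit followed by a restart of the iSCSI service (the iQNs shown are examples):

# cat /etc/iscsi/initiatorname.iscsi
InitiatorName=iqn.1994-05.com.redhat:a1b2c3d4e5
# echo "InitiatorName=iqn.1991-05.com.microsoft:victimhost" > /etc/iscsi/initiatorname.iscsi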

CHAP is basically the only authentication mechanism supported by iSCSI vendors, even though the iSCSI specification allows for other mechanisms such as Kerberos. CHAP is not a protocol known for strong security on shared networks: it is vulnerable to dictionary attacks, spoofing and reflection attacks. Because the security issues with CHAP are well known, the RFC even mentions ways to deal with its limitations (http://tools.ietf.org/html/rfc3720#section-8.2.1).

While IPSec could stop or reduce most of the security issues outlined above, it is hard to implement and manage, so not many administrators will feel the need to use it. It should not only be possible to build a secure network, it should also be easy.

To reduce the risk, and make your iSCSI network as safe as possible, you should do the following:

  • Enable mutual (incoming/outgoing) authentication (see the open-iscsi example below)
  • Follow the RFC’s advice to secure CHAP
  • Enable CRC checksums
  • Do not rely on the iQN alone for authorization
  • Enable IPSec (if performance allows it)
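
As an example of the first two points: with open-iscsi on Linux, mutual CHAP is configured in /etc/iscsi/iscsid.conf (or per target in the node records). The usernames and secrets below are placeholders, and the same credentials have to be configured on the target side:

node.session.auth.authmethod = CHAP
node.session.auth.username = initiator-user
node.session.auth.password = a-long-random-secret
node.session.auth.username_in = target-user
node.session.auth.password_in = another-long-random-secret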

Vendors and distributors should also enable authentication by default, and add other authentication mechanisms to their iSCSI target and initiator software.

References:
http://www.blackhat.com/presentations/bh-usa-05/bh-us-05-Dwivedi-update.pdf
http://en.wikipedia.org/wiki/ISCSI#Authentication
http://weird-hobbes.nl/reports/iSCSI%20security/