ZFS High-Availability NAS

The ZFS filesystem has been a game-changer in the way I approach local data storage, shared storage, replication and general data backup and protection.

I've been a long-time proponent of ZFS storage in a variety of scenarios, going back to my first experiments with OpenSolaris in 2008, buying my own ZFS Thumper/Thor in 2009, adopting ZFS on Linux for production use in 2012, and through my continued contributions to StackExchange|ServerFault.

ZFS Advantages:

More intelligent storage for application servers and a serious replacement for LVM.
Great shared storage options to back virtualization environments.
Useful for expandable backup targets.
Atomic snapshots.
Flexible replication.
Transparent filesystem compression.

ZFS Downsides:

Good ZFS implementations sometimes require specialized knowledge.
It's easy to make bad or irreversible mistakes (often due to poor design or inappropriate hardware choices).
There's lots of contradictory ZFS information online due to ZFS's appeal to "home lab" users.
Options for high availability are either costly, tightly associated with commercial storage solutions or do not work well.

Expanding on the final point, the traditional method of achieving high availability on a ZFS storage array means absorbing the high cost and complexity associated with commercial products. The most common options in the marketplace are NexentaStor, QuantaStor, RSF-1 and Zetavault.

Each of these products is proven to work and has an existing user base, but each also has a particularly unattractive licensing scheme. Notably, NexentaStor and QuantaStor have capacity-based licensing models that don't work well for my use cases. The high-availability licensing add-ons for both solutions are expensive, yet still place the onus of the hardware build, validation, deployment and ongoing support on the end user.

As a consultant, I work with many small businesses that have simple VMware estates (2-6 hosts) but need robust storage to back the solution. They typically have less than 6 TB of data, and the cost of lower-tier SAN storage is way out of line with their technology budgets.

A fully-licensed build of a 16TB (raw) commercial ZFS array built atop commodity hardware approaches $25-30k. That's very close to the price of an integrated storage array like Nimble or Tegile.

Objectives

Build a highly-available dual-controller storage array using open-source technologies.
This solution should be capable of presenting shared storage to any NFS client.
This should be the ZFS equivalent of a dual-node array like the HP StorageWorks MSA2040.

The key to this high-availability storage design is a shared SAS-attached storage enclosure, or JBOD. Examples of this include:

HP StorageWorks D2600/D2700/D3600/D3700 2U enclosures.
HP StorageWorks MDS600 and D6000-series high-density enclosures
Sun Microsystems/Oracle J4410 JBOD.
Dell PowerVault MD1200.
DataON DNS-1660.

These external SAS enclosures all feature:

Redundant power supplies and fans.
SAS expander logic on the backplane.
Redundant SAS controllers (I/O modules) with end-to-end multipathing to accommodate dual-ported SAS disks. Dual-ported SAS drives are critical to the success of this design!
SCSI enclosure services (SES) sensors for communication of enclosure component status.

To complement the shared JBOD enclosure, we need two servers (head nodes or controllers) to provide client connectivity and compute/RAM resources for the ZFS array.

It makes sense to scale the specifications of the head nodes to meet the anticipated workload of the installation. Variables may include:

PCIe slot connectivity - For NICs, SAS host bus adapters and future expansion.
RAM - ZFS leverages system RAM for read and write caching. Maximizing RAM amount (within reason) can help accelerate I/O workloads.
Speed and number of CPUs - This impacts data compression performance.
Cost.

My recommendation is to use an Intel Nehalem, Westmere or newer CPU with four or more cores.

Example budget parts manifest

Low-cost build manifest for a simple 12TB usable storage array:

2 x HP ProLiant DL360 G7 rackmount servers ($500-$800 total)

2 x HP H221 SAS HBA ($240 total)

4 x SAS SFF-8088 external SAS cables ($100 total)

1 x HP StorageWorks D2600 12-bay SAS enclosure ($400-$600 total)

10 x HP 2TB dual-port SAS hard drives ($700 total)

1 x SAS SSD for L2ARC ($200)

1 x Stec ZeusRAM SAS DRAM SSD for ZIL ($600)

HP StorageWorks D2600 fully disassembled.

Front view of HP ProLiant DL360 G7 head nodes and D2600 JBOD.

Rear view of servers and JBOD with 6G SAS cabling.

Example high-end parts manifest

Build manifest for a 24TB usable storage array using newer components:

2 x HP ProLiant DL360 Gen8/Gen9 rackmount servers

2 x HP H241 SAS HBA

4 x SAS SFF-8644 external cables

1 x HP StorageWorks D3600 12-bay SAS enclosure

10 x HP 4TB dual-port SAS hard drives

1 x SSD for L2ARC

1 x SSD for ZIL

Front view of HP ProLiant DL360p Gen8 head nodes and D3600 JBOD.

Rear view of servers and JBOD with 12G SAS cabling.

Physical setup

The purpose of the shared JBOD in this setup is to have a multipath configuration that provides cabling, controller, host adapter and path redundancy.

The most basic recommended setup between two head nodes and a single JBOD enclosure is a single 2-port SAS HBA in each host and four SAS cables, arranged as follows:

node1 HBA port 1 -> JBOD controller1 port 2
node1 HBA port 2 -> JBOD controller2 port 2

node2 HBA port 1 -> JBOD controller1 port 1
node2 HBA port 2 -> JBOD controller2 port 1

For examples of how to scale with multiple enclosures and SAS cabling rings, see the Oracle SAS cabling diagram.

Cluster configuration

Assumptions and requirements:

Root access.
Two server systems with RHEL or CentOS 7 installed on local disks.
The ZFS filesystem configured via the ZFS on Linux repository.
Familiarity with basic ZFS operations: zpool/zfs filesystem creation and modification.
Understanding of network bonding under RHEL Linux.

These steps describe the construction of a two-host, single-JBOD cluster that manages a single ZFS pool composed of 10 data drives (9 in the pool plus a hot spare) and separate SLOG (ZIL) and L2ARC drives.
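As a rough sanity check on this layout, usable capacity for a pool of identical raidz vdevs is approximately vdevs x (disks per vdev - parity disks) x disk size. The sketch below applies that arithmetic to the two parts manifests above. The helper name raidz_usable_tb is made up for illustration, and the estimate ignores ZFS metadata overhead, padding and the hot spare.

```shell
#!/bin/sh
# Hypothetical helper: estimate usable capacity (in TB) of a pool built
# from identical raidz vdevs. Overhead and spares are not accounted for.
# usage: raidz_usable_tb <vdevs> <disks_per_vdev> <parity_disks> <disk_size_tb>
raidz_usable_tb() {
    vdevs=$1; disks=$2; parity=$3; size_tb=$4
    echo $(( vdevs * (disks - parity) * size_tb ))
}

# Budget build: 3 x raidz1 of three 2TB disks (plus one spare) -> 12
raidz_usable_tb 3 3 1 2

# High-end build: the same layout with 4TB disks -> 24
raidz_usable_tb 3 3 1 4
```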

Corosync/Pacemaker setup:
Additional information about the RHEL High Availability Add-On.

For simplification, disable firewalld for the build process. This can definitely be modified later.

systemctl stop firewalld
systemctl disable firewalld

Install the ZFS filesystem by downloading the ZFS and kernel-devel packages.

yum localinstall http://archive.zfsonlinux.org/epel/zfs-release.el7.noarch.rpm
yum install kernel-devel zfs

Install the RHEL cluster suite and multipath software.

yum install pcs fence-agents-all device-mapper-multipath

The multipath daemon will not start without a configuration file present.

touch /etc/multipath.conf
systemctl start multipathd
systemctl enable multipathd
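An empty file works because multipathd falls back to its built-in defaults. If you would rather make the device naming explicit, the sketch below prints a minimal configuration. It assumes you want WWID-based /dev/mapper names (the form used for the pool later on) rather than mpathN aliases. The function name is illustrative, and nothing is written to disk unless you redirect the output.

```shell
#!/bin/sh
# Hypothetical helper: print a minimal multipath.conf that keeps
# WWID-style map names (e.g. 35000c500236061b3) instead of mpathN aliases.
render_multipath_conf() {
    cat <<'EOF'
defaults {
    user_friendly_names no
}
EOF
}

render_multipath_conf
```

To use it, redirect the output to /etc/multipath.conf and restart multipathd.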

Download the ZFS zpool Pacemaker OCF agent. This allows the ZFS pool to be exported and imported to each cluster node during failover.

cd /usr/lib/ocf/resource.d/heartbeat/
wget https://github.com/skiselkov/stmf-ha/raw/master/heartbeat/ZFS
chmod +x ZFS

Networking depends on how clients will consume data from this NAS.

For most of my builds, I use LACP bonding from the storage array to the switches and maintain an IP on the data network for system management. The most basic requirement is to have an IP for each host, plus a virtual IP (VIP) that can float to the active node. NFS clients will use this virtual IP.

I prefer to segregate traffic by VLAN, so I create a single master LACP bond interface with separate VLAN interfaces, e.g. bond0.10,bond0.777,bond0.91.

Example /etc/sysconfig/network-scripts/ interface script files here.
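As a rough illustration of the shape of such a file (not a copy of the linked examples), a VLAN interface on top of the bond could be rendered like this. The bond name, VLAN ID and address are placeholders borrowed from elsewhere on this page, and the helper name is made up.

```shell
#!/bin/sh
# Hypothetical helper: print an ifcfg file for a VLAN interface on a bond.
# usage: render_vlan_ifcfg <bond> <vlan_id> <ip> <prefix>
render_vlan_ifcfg() {
    bond=$1; vlan=$2; ip=$3; prefix=$4
    cat <<EOF
DEVICE=${bond}.${vlan}
VLAN=yes
ONBOOT=yes
BOOTPROTO=none
IPADDR=${ip}
PREFIX=${prefix}
NM_CONTROLLED=no
EOF
}

# Heartbeat VLAN for node 1, using the ring address from this guide.
render_vlan_ifcfg bond0 91 192.168.91.1 24
```

Redirecting the output to /etc/sysconfig/network-scripts/ifcfg-bond0.91 would create the interface definition. The parent bond0 file with the LACP options still has to be written separately.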

NetworkManager isn't necessary for this since the bond interfaces will likely require hand-editing.

systemctl stop NetworkManager
systemctl disable NetworkManager

Populate /etc/hosts files with the cluster members' hostnames and add a second heartbeat address ring for Corosync.

# Management addresses of both nodes
172.16.40.15	zfs-node1.ewwhite.net zfs-node1
172.16.40.16	zfs-node2.ewwhite.net zfs-node2

# Cluster ring address for heartbeat
192.168.91.1	zfs-node1-ext
192.168.91.2	zfs-node2-ext

Ensure each of the addresses can be reached from either host.
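One way to script that check is to pull every name out of the hosts file and ping each one. This is a minimal sketch: the function names are made up, and it assumes a Linux ping that accepts -c (count) and -W (timeout in seconds).

```shell
#!/bin/sh
# Hypothetical helper: print every hostname and alias from a hosts-format
# file, skipping blank lines, comment lines and trailing "# ..." comments.
list_hosts() {
    awk 'NF >= 2 && $1 !~ /^#/ {
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^#/) break
            print $i
        }
    }' "$1"
}

# Hypothetical helper: ping each name once and report the result.
# Returns non-zero if any name is unreachable.
check_reachable() {
    rc=0
    for h in $(list_hosts "$1"); do
        if ping -c 1 -W 2 "$h" >/dev/null 2>&1; then
            echo "ok   $h"
        else
            echo "FAIL $h"
            rc=1
        fi
    done
    return $rc
}
```

Run check_reachable /etc/hosts on both nodes; every management and heartbeat name should report ok.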

Determine a drive layout:

Due to the use of dm-multipath, we want to build the ZFS pool using the device mapper disk identifiers rather than the normal /dev entries.

There are a few ways to determine the drive layout and identify disks. This will depend on the HBAs and JBOD enclosure in use. It makes sense to record the SAS WWN of each of the disks you'll be using.

In some cases, it is possible to enumerate drives and identify WWNs programmatically or by examining sysfs entries. For example, with the HP H221 SAS HBA:

cd /sys/class/enclosure

[root@zfs1-1 /sys/class/enclosure]# ls
0:0:11:0  0:0:23:0

[root@zfs1-1 /sys/class/enclosure/0:0:11:0]# ll
total 0
drwxr-xr-x 3 root root    0 Jul  6 11:57 0
drwxr-xr-x 3 root root    0 Jul  6 11:57 1
drwxr-xr-x 3 root root    0 Jul  6 11:57 10
drwxr-xr-x 3 root root    0 Jul  6 11:57 11
drwxr-xr-x 3 root root    0 Jul  6 11:57 2
drwxr-xr-x 3 root root    0 Jul  6 11:57 3
drwxr-xr-x 3 root root    0 Jul  6 11:57 4
drwxr-xr-x 3 root root    0 Jul  6 11:57 5
drwxr-xr-x 3 root root    0 Jul  6 11:57 6
drwxr-xr-x 3 root root    0 Jul  6 11:57 7
drwxr-xr-x 3 root root    0 Jul  6 11:57 8
drwxr-xr-x 3 root root    0 Jul  6 11:57 9
-r--r--r-- 1 root root 4096 Jul  6 11:57 components
lrwxrwxrwx 1 root root    0 Jul  6 11:57 device -> ../../../0:0:11:0
drwxr-xr-x 2 root root    0 Jul  6 11:57 power
lrwxrwxrwx 1 root root    0 Jul  6 11:57 subsystem -> ../../../../../../../../../../../../../class/enclosure
-rw-r--r-- 1 root root 4096 Jul  4 23:27 uevent

The 0 through 11 above represent the 12 drive slots in the HP StorageWorks D2600 I'm using in this build.

[root@zfs1-1 /sys/class/enclosure/0:0:11:0/0]# ll
total 0
-rw-r--r-- 1 root root 4096 Jul  6 11:59 active
lrwxrwxrwx 1 root root    0 Jul  6 11:59 device -> ../../../../../../../port-0:0:0/end_device-0:0:0/target0:0:0/0:0:0:0
-rw-r--r-- 1 root root 4096 Jul  6 11:59 fault
-rw-r--r-- 1 root root 4096 Jul  6 11:59 locate
drwxr-xr-x 2 root root    0 Jul  6 11:59 power
-rw-r--r-- 1 root root 4096 Jul  6 11:59 status
-r--r--r-- 1 root root 4096 Jul  6 11:59 type
-rw-r--r-- 1 root root 4096 Jul  6 11:59 uevent

Note the device directory and locate options. Running echo 1 > locate will illuminate the drive beacon on the device present in that slot.

[root@zfs1-1 /sys/class/enclosure/0:0:11:0/0]# cd device

[root@zfs1-1 /sys/class/enclosure/0:0:11:0/0/device]# cat sas_address
0x5000c500236004a2
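The manual walk above can be looped over every slot. This sketch assumes the sysfs layout shown here (one directory per slot, each with a device link exposing sas_address), which may differ with other HBAs or enclosures. The function name is made up, and the optional argument exists only so the loop can be pointed at a different directory.

```shell
#!/bin/sh
# Hypothetical helper: print "<enclosure> slot <n> <sas_address>" for every
# populated slot. Empty slots (no readable device/sas_address) are skipped.
# usage: list_slot_addresses [enclosure_root]
list_slot_addresses() {
    root=${1:-/sys/class/enclosure}
    for slot in "$root"/*/*/; do
        addr_file="${slot}device/sas_address"
        [ -r "$addr_file" ] || continue
        encl=$(basename "$(dirname "$slot")")
        printf '%s slot %s %s\n' "$encl" "$(basename "$slot")" "$(cat "$addr_file")"
    done
}
```

Calling list_slot_addresses with no arguments on a head node gives a slot-to-address table that is worth saving alongside the build notes.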

The output of multipath -ll should show the paths, selection policy, SCSI device name and /dev/mapper device name (e.g. 35000c500236061b3) for each of the drives in the enclosure.

[root@zfs1-1 ~]# multipath -ll
35000c500236061b3 dm-11 HP      ,EF0450FARMV
size=419G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 0:0:3:0  sde 8:64   active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 0:0:16:0 sdq 65:0   active ready running
35000c5007772e5ff dm-3 HP      ,EF0450FARMV
size=419G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 0:0:2:0  sdd 8:48   active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 0:0:15:0 sdp 8:240  active ready running
35000c500236032f7 dm-0 HP      ,EF0450FARMV
size=419G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 0:0:19:0 sdt 65:48  active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 0:0:6:0  sdh 8:112  active ready running
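Since the whole design depends on every disk being visible over two paths, it is worth counting them. The sketch below parses multipath -ll output in the format shown above, with WWID map names (not mpathN aliases) and sdX path devices. The function name is made up.

```shell
#!/bin/sh
# Hypothetical helper: read `multipath -ll` output on stdin and print
# "<map name> <number of paths>" for each multipath map.
count_paths() {
    awk '
        # Map header: the WWID in field 1 and "dm-N" in field 2.
        $2 ~ /^dm-[0-9]+$/ {
            if (name != "") print name, paths
            name = $1; paths = 0
            next
        }
        # Path line: carries an sdX device name between the H:C:T:L and major:minor.
        / sd[a-z]+ / { paths++ }
        END { if (name != "") print name, paths }
    '
}
```

For example, multipath -ll | count_paths | awk '$2 < 2' prints any map with fewer than two paths. On a correctly cabled system it should print nothing.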

The lsscsi command can also help display topology. Note how the drives appear twice in the SAS layout.

[root@zfs1-2 ~]# lsscsi
[0:0:0:0]    disk    HP       EF0450FARMV      HPD6  /dev/sdb
[0:0:1:0]    disk    HP       EF0450FARMV      HPD6  /dev/sdc
[0:0:2:0]    disk    HP       EF0450FARMV      HPD6  /dev/sdd
[0:0:3:0]    disk    HP       EF0450FARMV      HPD6  /dev/sde
[0:0:4:0]    disk    HP       EF0450FARMV      HPD6  /dev/sdf
[0:0:5:0]    disk    HP       EF0450FARMV      HPD6  /dev/sdg
[0:0:6:0]    disk    HP       EF0450FARMV      HPD6  /dev/sdh
[0:0:7:0]    disk    HP       EF0450FARMV      HPD6  /dev/sdi
[0:0:8:0]    disk    HP       EF0450FARMV      HPD6  /dev/sdj
[0:0:9:0]    disk    HP       EF0450FARMV      HPD6  /dev/sdk
[0:0:10:0]   disk    HITACHI  HUSSL4010BSS600  A110  /dev/sdl
[0:0:11:0]   disk    STEC     ZeusRAM          C023  /dev/sdm
[0:0:12:0]   enclosu HP       D2600 SAS AJ940A 0150  -
[0:0:13:0]   disk    HP       EF0450FARMV      HPD6  /dev/sdn
[0:0:14:0]   disk    HP       EF0450FARMV      HPD6  /dev/sdo
[0:0:15:0]   disk    HP       EF0450FARMV      HPD6  /dev/sdp
[0:0:16:0]   disk    HP       EF0450FARMV      HPD6  /dev/sdq
[0:0:17:0]   disk    HP       EF0450FARMV      HPD6  /dev/sdr
[0:0:18:0]   disk    HP       EF0450FARMV      HPD6  /dev/sds
[0:0:19:0]   disk    HP       EF0450FARMV      HPD6  /dev/sdt
[0:0:20:0]   disk    HP       EF0450FARMV      HPD6  /dev/sdu
[0:0:21:0]   disk    HP       EF0450FARMV      HPD6  /dev/sdv
[0:0:22:0]   disk    HP       EF0450FARMV      HPD6  /dev/sdw
[0:0:23:0]   disk    HITACHI  HUSSL4010BSS600  A110  /dev/sdx
[0:0:24:0]   disk    STEC     ZeusRAM          C023  /dev/sdy
[0:0:25:0]   enclosu HP       D2600 SAS AJ940A 0150  -
[1:0:0:0]    disk    HP       LOGICAL VOLUME   6.64  /dev/sda
[1:3:0:0]    storage HP       P410i            6.64  -

Once there's some idea of the SAS address layout of the drives, we can create a zpool.

# Create a zpool using the /dev/mapper devices.
# The most critical cluster pool creation option is cachefile=none
# The log (SLOG) and cache (L2ARC) devices match the zpool status output below.
zpool create -o autoexpand=on -o autoreplace=on -o cachefile=none vol1 \
raidz1 35000c500236004a3 35000c50023614aef 35000c5007772e5ff \
raidz1 35000c500236061b3 35000c5002362f347 35000c500544508b7 \
raidz1 35000c500236032f7 35000c50023605c1b 35000c5002362ffab \
spare 35000c500236031ab \
log 35000a7203008de44 \
cache 35000cca0130c0a00

The result:

[root@zfs1-1 ~]# zpool status -v
  pool: vol1
 state: ONLINE
  scan: scrub repaired 0 in 1h14m with 0 errors on Sun Aug  7 05:00:49 2016
config:

	NAME                   STATE     READ WRITE CKSUM
	vol1                   ONLINE       0     0     0
	  raidz1-0             ONLINE       0     0     0
	    35000c500236004a3  ONLINE       0     0     0
	    35000c50023614aef  ONLINE       0     0     0
	    35000c5007772e5ff  ONLINE       0     0     0
	  raidz1-1             ONLINE       0     0     0
	    35000c500236061b3  ONLINE       0     0     0
	    35000c5002362f347  ONLINE       0     0     0
	    35000c500544508b7  ONLINE       0     0     0
	  raidz1-2             ONLINE       0     0     0
	    35000c500236032f7  ONLINE       0     0     0
	    35000c50023605c1b  ONLINE       0     0     0
	    35000c5002362ffab  ONLINE       0     0     0
	logs
	  35000a7203008de44    ONLINE       0     0     0
	cache
	  35000cca0130c0a00    ONLINE       0     0     0
	spares
	  35000c500236031ab    AVAIL
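Because cachefile=none is the setting that keeps the pool out of the local zpool.cache (so a node does not import it on its own at boot), it is worth confirming on the node that currently has the pool imported. A minimal sketch follows. The function name is made up, and it only inspects the value piped into it.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the value read from stdin is "none".
# Intended input: zpool get -H -o value cachefile <pool>
cachefile_is_none() {
    value=$(cat)
    [ "$value" = "none" ]
}
```

Usage: zpool get -H -o value cachefile vol1 | cachefile_is_none || echo "vol1: cachefile is not none"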

Cluster initialization:

The EL6/EL7 Corosync and Pacemaker packages install a service account named "hacluster". Assign a password to the user. This will be used later for the cluster authorization and the management GUI.

passwd hacluster

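Enable the pcsd, Corosync and Pacemaker system services, and start pcsd. Corosync and Pacemaker themselves are started by the --start option of the cluster setup command in the next step.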

systemctl enable pcsd
systemctl enable corosync
systemctl enable pacemaker
systemctl start pcsd

Authorize the cluster nodes. Here, I use the name "zfs-cluster", but choose something appropriate.

pcs cluster auth zfs-node1 zfs-node2
pcs cluster setup --start --name zfs-cluster zfs-node1,zfs-node1-ext zfs-node2,zfs-node2-ext

Create a STONITH device using SCSI reservations of the ZFS pool disks. List all of the devices that should be associated with this specific pool.

pcs stonith create fence-vol1 fence_scsi pcmk_monitor_action="metadata" devices="/dev/mapper/35000c500236061b3,/dev/mapper/35000c500236032f7,/dev/mapper/35000c5007772e5ff,/dev/mapper/35000c50023614aef,/dev/mapper/35000a7203008de44,/dev/mapper/35000c500236004a3,/dev/mapper/35000c5002362ffab,/dev/mapper/35000c500236031ab,/dev/mapper/35000c50023605c1b,/dev/mapper/35000c500544508b7,/dev/mapper/35000c5002362f347" meta provides=unfencing
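Typing that device list by hand is error-prone. A small helper can build it from the WWIDs instead. This is a sketch with a made-up function name; it only joins strings and does not check that the devices exist.

```shell
#!/bin/sh
# Hypothetical helper: turn a list of WWIDs into the comma-separated
# /dev/mapper list expected by the fence_scsi "devices" parameter.
# usage: build_fence_devices <wwid>...
build_fence_devices() {
    out=""
    for id in "$@"; do
        out="${out:+$out,}/dev/mapper/$id"
    done
    printf '%s\n' "$out"
}

# Two of the pool disks from this build, as an example.
build_fence_devices 35000c500236061b3 35000c500236032f7
```

The result can be substituted into the command above as devices="$(build_fence_devices <wwid> ...)".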

Create a ZFS pool resource to correspond to the zpool configuration. The pool name here is "vol1".

pcs resource create vol1 ZFS params pool="vol1" importargs="-d /dev/mapper/" op start timeout="90" op stop timeout="90" --group=group-vol1

Define a virtual IP that floats between nodes. This is associated with the ZFS pool. A benefit is that this gives you the option to run dual-active clustering with mutual failover by pinning a ZFS pool to a head node.

pcs resource create vol1-ip IPaddr2 ip=192.168.77.18 cidr_netmask=24 --group group-vol1

Set a default "stickiness" value to prevent flapping between nodes after a failover event.

pcs resource defaults resource-stickiness=100

The result from the pcs status command should look like:

[root@zfs1-1 ~]# pcs status 
Last change: Tue Jul 19 19:29:16 2016 by root via crm_attribute on zfs1-1
Stack: corosync
Current DC: zfs1-2 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum
2 nodes and 3 resources configured

Online: [ zfs1-1 zfs1-2 ]

Full list of resources:

 fence-vol1	(stonith:fence_scsi):	Started zfs1-1
 Resource Group: group-vol1
     vol1	(ocf::heartbeat:ZFS):	Started zfs1-1
     vol1-ip	(ocf::heartbeat:IPaddr2):	Started zfs1-1

PCSD Status:
  zfs1-1: Online
  zfs1-2: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
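For scripts and monitoring checks it can be handy to extract which node is running a given resource. The sketch below assumes the pcs status layout shown above (resource name, agent, "Started", node); newer pcs releases format these lines differently. The function name is made up.

```shell
#!/bin/sh
# Hypothetical helper: read `pcs status` output on stdin and print the node
# on which the named resource is started (prints nothing if not started).
# usage: pcs status | resource_node <resource>
resource_node() {
    awk -v res="$1" '$1 == res && $3 == "Started" { print $4; exit }'
}
```

For example, pcs status | resource_node vol1 prints zfs1-1 for the output above.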

NFS export

The elegance of this design comes from eschewing the traditional NFS cluster resources and instead relying on the ZFS sharenfs property for exports.

The old way of doing this would mean having separate resources for:

The ZFS zpool
The zpool virtual IP
The NFS server daemon
An exportfs resource for each NFS server export
An NFS notify resource to inform clients during failover
And Pacemaker cluster constraints to maintain the ordering of the above resources.

In this design, we just use a fencing/STONITH resource, a resource for the zpool and the virtual IP.

For NFS exports, just define the sharenfs property on the filesystem(s) you wish to export.

zfs set sharenfs=rw=@<client-network>/24,sync,no_root_squash,no_wdelay vol1/management

Replace <client-network> with the network address of the NFS clients.

This condenses the NFS daemon, exports and NFS notify steps into the zpool export/import process and speeds up failover times.
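When several filesystems share the same export policy, the property value can be assembled once and reused. This is a sketch with a made-up function name. The network 192.0.2.0/24 is a documentation placeholder, not the network used in this build.

```shell
#!/bin/sh
# Hypothetical helper: build a sharenfs value that grants read-write access
# to one network, followed by any extra export options.
# usage: build_sharenfs <network/prefix> [option]...
build_sharenfs() {
    value="rw=@$1"
    shift
    for opt in "$@"; do
        value="$value,$opt"
    done
    printf '%s\n' "$value"
}

# Placeholder network; prints rw=@192.0.2.0/24,sync,no_root_squash,no_wdelay
build_sharenfs 192.0.2.0/24 sync no_root_squash no_wdelay
```

The value can then be applied with zfs set sharenfs="$(build_sharenfs ...)" vol1/management and checked with zfs get sharenfs vol1/management, or with showmount -e against the virtual IP from a client.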

ZFS HA Cluster failover under VMware.

VMware NFS path state during controlled failover.

Tuning

(in progress)

tuned profiles.
/etc/modprobe.d/zfs.conf options.
Multiple queue I/O scheduling under EL7.2+.
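As one example of a zfs.conf option, the ARC size cap (the zfs_arc_max module parameter) is specified in bytes. The sketch below converts a size in GiB into the corresponding options line. The function name is made up, and the 8 GiB figure is arbitrary, not a recommendation for this build.

```shell
#!/bin/sh
# Hypothetical helper: print a modprobe options line that caps the ZFS ARC.
# usage: render_arc_max <size_in_GiB>
render_arc_max() {
    echo "options zfs zfs_arc_max=$(( $1 * 1024 * 1024 * 1024 ))"
}

# Arbitrary example value; prints: options zfs zfs_arc_max=8589934592
render_arc_max 8
```

The line belongs in /etc/modprobe.d/zfs.conf and takes effect the next time the zfs module is loaded.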

Operations

(in progress)

Monitoring

CLI operations

# Get cluster status
pcs status

# Place cluster node in standby (manual failover)
pcs cluster standby [node name]

# Remove cluster node from standby state
pcs cluster unstandby [node name]

# Clean up cluster resources
pcs resource cleanup

Web GUI https://nodename:2224
