Home
The ZFS filesystem has been a game-changer in the way I approach local data storage, shared storage, replication and general data backup and protection.
I've been a long-time proponent of ZFS storage in a variety of scenarios, going back to my first experiments with OpenSolaris in 2008, buying my own ZFS Thumper/Thor in 2009, adopting ZFS on Linux for production use in 2012, and through my continued contributions to StackExchange|ServerFault.
ZFS Advantages:
- More intelligent storage for application servers and a serious replacement for LVM.
- Great shared storage options to back virtualization environments.
- Useful for expandable backup targets.
- Atomic snapshots.
- Flexible replication.
- Transparent filesystem compression.
ZFS Downsides:
- Good ZFS implementations sometimes require specialized knowledge.
- It's easy to make bad or irreversible mistakes (often due to poor design or inappropriate hardware choices).
- There's lots of contradictory ZFS information online due to ZFS's appeal to "home lab" users.
- Options for high availability are either costly, tightly associated with commercial storage solutions or do not work well.
Expanding on the final point, the traditional method of achieving high availability on a ZFS storage array means absorbing the high cost and complexity associated with commercial products. The most common options in the marketplace are NexentaStor, QuantaStor, RSF-1 and Zetavault.
Each of these products is proven to work and has an existing user base, but their licensing schemes are particularly unattractive. Notably, NexentaStor and QuantaStor have capacity-based licensing models that don't work well for my use cases. The high-availability licensing add-ons for both solutions are expensive, yet still place the onus of the hardware build, validation, deployment and ongoing support on the end user.
As a consultant, I work with many small businesses that have simple VMware estates (2-6 hosts) but need robust storage to back the solution. They typically have less than 6TB of data, and the cost of lower-tier SAN storage is well beyond their technology budgets.
A fully-licensed build of a 16TB raw commercial ZFS array built atop commodity hardware approaches $25-30k. That's very close to the price of an integrated storage array like Nimble or Tegile.
Build a highly-available dual-controller storage array using open-source technologies.
This solution should be capable of presenting shared storage to any NFS client.
This should be the ZFS equivalent of a dual-node array like the HP StorageWorks MSA2040.
The key to this high-availability storage design is a shared SAS-attached storage enclosure, or JBOD. Examples of this include:
- HP StorageWorks D2600/D2700/D3600/D3700 2U enclosures.
- HP StorageWorks MDS600 and D6000-series high-density enclosures
- Sun Microsystems/Oracle J4410 JBOD.
- Dell PowerVault MD1200.
- DataON DNS-1660.
These external SAS enclosures all feature:
- Redundant power supplies and fans.
- SAS expander logic on the backplane.
- Redundant SAS controllers (I/O modules) with end-to-end multipathing to accommodate dual-ported SAS disks. Dual-ported SAS drives are critical to the success of this design!
- SCSI enclosure services (SES) sensors for communication of enclosure component status.
To complement the shared JBOD enclosure, we need two servers (head nodes or controllers) to provide client connectivity and compute/RAM resources for the ZFS array.
It makes sense to scale the specifications of the head nodes to meet the anticipated workload of the installation. Variables may include:
- PCIe slot connectivity - For NICs, SAS host bus adapters and future expansion.
- RAM - ZFS leverages system RAM for read and write caching. Maximizing RAM amount (within reason) can help accelerate I/O workloads.
- Speed and number of CPUs - This impacts data compression performance.
- Cost.
My recommendation is to use an Intel Nehalem, Westmere or newer CPU with four or more cores.
Low-cost build manifest for a simple 12TB usable storage array:
- 2 x HP ProLiant DL360 G7 rackmount servers ($500-$800 total)
- 2 x HP H221 SAS HBA ($240 total)
- 4 x SAS SFF-8088 external SAS cables ($100 total)
- 1 x HP StorageWorks D2600 12-bay SAS enclosure ($400-$600 total)
- 10 x HP 2TB dual-port SAS hard drives ($700 total)
- 1 x SAS SSD for L2ARC ($200)
- 1 x Stec ZeusRAM SAS DRAM SSD for ZIL ($600)
HP StorageWorks D2600 fully disassembled.
Front view of HP ProLiant DL360 G7 head nodes and D2600 JBOD.
Rear view of servers and JBOD with 6G SAS cabling.
Build manifest for a 24TB usable storage array using newer components:
- 2 x HP ProLiant DL360 Gen8/Gen9 rackmount servers
- 2 x HP H241 SAS HBA
- 4 x SAS SFF-8644 external cables
- 1 x HP StorageWorks D3600 12-bay SAS enclosure
- 10 x HP 4TB dual-port SAS hard drives
- 1 x SSD for L2ARC
- 1 x SSD for ZIL
Front view of HP ProLiant DL360p Gen8 head nodes and D3600 JBOD.
Rear view of servers and JBOD with 12G SAS cabling.
The purpose of the shared JBOD in this setup is to have a multipath configuration that provides cabling, controller, host adapter and path redundancy.
The most basic recommended setup between two head nodes and a single JBOD enclosure is a single 2-port SAS HBA in each host and four SAS cables with the following arrangement.
node1 HBA port 1 -> JBOD controller1 port 2
node1 HBA port 2 -> JBOD controller2 port 2
node2 HBA port 1 -> JBOD controller1 port 1
node2 HBA port 2 -> JBOD controller2 port 1
Examples of how to scale with multiple enclosures and SAS cabling rings.
Assumptions and requirements:
- Root access.
- Two server systems with RHEL or CentOS 7 installed on local disks.
- The ZFS filesystem configured via the ZFS on Linux repository.
- Familiarity with basic ZFS operations: zpool/zfs filesystem creation and modification.
- Understanding of network bonding under RHEL Linux.
These steps describe the construction of a two-host, single-JBOD cluster that manages a single ZFS pool composed of 10 data drives (9 in the pool plus a hot spare) and separate SLOG (ZIL) and L2ARC drives.
Corosync/Pacemaker setup:
Additional information about the RHEL High Availability Add-On.
For simplification, disable firewalld for the build process. This can definitely be modified later.
systemctl stop firewalld
systemctl disable firewalld
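If you would rather keep firewalld running, a minimal sketch along these lines might open what the cluster and NFS clients need. It assumes the stock firewalld service definitions `high-availability`, `nfs`, `mountd` and `rpc-bind` that ship with EL7. The function name and the `FIREWALL_CMD` override are hypothetical, there only so the commands can be previewed with `FIREWALL_CMD=echo`.

```shell
# Sketch: open cluster and NFS services instead of disabling firewalld.
# FIREWALL_CMD is a hypothetical override; set it to "echo" for a dry run.
FIREWALL_CMD=${FIREWALL_CMD:-firewall-cmd}

open_cluster_services() {
    # high-availability covers the pcsd, corosync and pacemaker ports;
    # nfs, mountd and rpc-bind cover NFS clients.
    for svc in high-availability nfs mountd rpc-bind; do
        $FIREWALL_CMD --permanent --add-service="$svc" || return 1
    done
    # Load the permanent rules into the running firewall.
    $FIREWALL_CMD --reload
}
```

Run `open_cluster_services` as root on both head nodes once the build is verified.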
Install the ZFS filesystem by downloading the ZFS and kernel-devel packages.
yum localinstall http://archive.zfsonlinux.org/epel/zfs-release.el7.noarch.rpm
yum install kernel-devel zfs
Install the RHEL cluster suite and multipath software.
yum install pcs fence-agents-all device-mapper-multipath
The multipath daemon will not start without a configuration file present.
touch /etc/multipath.conf
systemctl start multipathd
systemctl enable multipathd
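An empty file is enough to start the daemon. If you prefer to state the relevant settings explicitly, a sketch like the following could write a small configuration instead. Both options are standard `defaults` section settings in multipath.conf; choosing `find_multipaths yes` is an assumption here, not something this build requires, and the function name and target-path argument are hypothetical.

```shell
# Sketch: write a minimal multipath.conf.
# The optional path argument is a hypothetical convenience so the file
# can be previewed somewhere other than /etc first.
#
# user_friendly_names no  keeps WWID-based names such as 35000c500236061b3.
# find_multipaths yes     (assumption) only builds maps for devices that
#                         are seen on more than one path.
write_multipath_conf() {
    conf=${1:-/etc/multipath.conf}
    cat > "$conf" <<'EOF'
defaults {
    user_friendly_names no
    find_multipaths yes
}
EOF
}
```

After calling `write_multipath_conf`, restart multipathd so the settings take effect.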
Download the ZFS zpool Pacemaker OCF agent. This allows the ZFS pool to be exported and imported to each cluster node during failover.
cd /usr/lib/ocf/resource.d/heartbeat/
wget https://github.com/skiselkov/stmf-ha/raw/master/heartbeat/ZFS
chmod +x ZFS
Networking depends on how clients will consume data from this NAS.
For most of my builds, I use LACP bonding from the storage array to the switches and maintain an IP on the data network for system management. The most basic requirement is to have an IP for each host, plus a virtual IP (VIP) that can float to the active node. NFS clients will use this virtual IP.
I prefer to segregate traffic by VLAN, so I create a single master LACP bond interface with separate VLAN interfaces, e.g. bond0.10, bond0.777, bond0.91.
Example /etc/sysconfig/network-scripts/ interface script files here.
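As a rough illustration of what such a file might contain, the sketch below prints an ifcfg-style definition for one VLAN interface on top of the bond. The keys are the usual EL7 network-scripts settings; the function name, the default prefix and the sample address are illustrative assumptions (192.168.91.1 is borrowed from the heartbeat ring in the /etc/hosts example on this page).

```shell
# Sketch: print an EL7 ifcfg file for a VLAN interface on top of a bond.
# Arguments (hypothetical names): device, IPv4 address, prefix length.
vlan_ifcfg() {
    dev=$1 ip=$2 prefix=${3:-24}
    cat <<EOF
DEVICE=$dev
VLAN=yes
ONBOOT=yes
BOOTPROTO=none
IPADDR=$ip
PREFIX=$prefix
NM_CONTROLLED=no
EOF
}

# Example: preview the heartbeat VLAN interface for node 1.
vlan_ifcfg bond0.91 192.168.91.1 24
```

Redirect the output to `/etc/sysconfig/network-scripts/ifcfg-bond0.91` (or the matching name for your VLAN) once it looks right.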
NetworkManager isn't necessary for this since the bond interfaces will likely require hand-editing.
systemctl stop NetworkManager
systemctl disable NetworkManager
Populate the /etc/hosts files with the cluster members' hostnames and add a second heartbeat address ring for Corosync.
# Management addresses of both nodes
172.16.40.15 zfs-node1.ewwhite.net zfs-node1
172.16.40.16 zfs-node2.ewwhite.net zfs-node2
# Cluster ring address for heartbeat
192.168.91.1 zfs-node1-ext
192.168.91.2 zfs-node2-ext
Ensure each of the addresses can be reached from either host.
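A quick way to check this is a small loop like the sketch below, run from each node in turn. It assumes the Linux `ping` options `-c` (count) and `-W` (timeout); the function name and the `PING` override are hypothetical, the latter only there so the loop can be exercised without a network.

```shell
# Sketch: confirm every cluster address answers from this node.
# PING is a hypothetical override for dry runs.
PING=${PING:-ping}

check_cluster_hosts() {
    rc=0
    for host in "$@"; do
        if $PING -c 2 -W 2 "$host" >/dev/null 2>&1; then
            echo "ok   $host"
        else
            echo "FAIL $host"
            rc=1
        fi
    done
    # Non-zero if any host failed to answer.
    return $rc
}
```

For the hosts file above, that would be `check_cluster_hosts zfs-node1 zfs-node2 zfs-node1-ext zfs-node2-ext`.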
Determine a drive layout:
Due to the use of dm-multipath, we want to build the ZFS pool using the device mapper disk identifiers rather than the normal /dev entries.
There are a few ways to determine the drive layout and identify disks. This will depend on the HBAs and JBOD enclosure in use. It makes sense to record the SAS WWN of each of the disks you'll be using.
In some cases, it is possible to enumerate drives and identify WWNs programmatically or by examining sysfs entries. For example, with the HP H221 SAS HBA:
cd /sys/class/enclosure
[root@zfs1-1 /sys/class/enclosure]# ls
0:0:11:0 0:0:23:0
[root@zfs1-1 /sys/class/enclosure/0:0:11:0]# ll
total 0
drwxr-xr-x 3 root root 0 Jul 6 11:57 0
drwxr-xr-x 3 root root 0 Jul 6 11:57 1
drwxr-xr-x 3 root root 0 Jul 6 11:57 10
drwxr-xr-x 3 root root 0 Jul 6 11:57 11
drwxr-xr-x 3 root root 0 Jul 6 11:57 2
drwxr-xr-x 3 root root 0 Jul 6 11:57 3
drwxr-xr-x 3 root root 0 Jul 6 11:57 4
drwxr-xr-x 3 root root 0 Jul 6 11:57 5
drwxr-xr-x 3 root root 0 Jul 6 11:57 6
drwxr-xr-x 3 root root 0 Jul 6 11:57 7
drwxr-xr-x 3 root root 0 Jul 6 11:57 8
drwxr-xr-x 3 root root 0 Jul 6 11:57 9
-r--r--r-- 1 root root 4096 Jul 6 11:57 components
lrwxrwxrwx 1 root root 0 Jul 6 11:57 device -> ../../../0:0:11:0
drwxr-xr-x 2 root root 0 Jul 6 11:57 power
lrwxrwxrwx 1 root root 0 Jul 6 11:57 subsystem -> ../../../../../../../../../../../../../class/enclosure
-rw-r--r-- 1 root root 4096 Jul 4 23:27 uevent
The 0 through 11 above represent the 12 drive slots in the HP StorageWorks D2600 I'm using in this build.
[root@zfs1-1 /sys/class/enclosure/0:0:11:0/0]# ll
total 0
-rw-r--r-- 1 root root 4096 Jul 6 11:59 active
lrwxrwxrwx 1 root root 0 Jul 6 11:59 device -> ../../../../../../../port-0:0:0/end_device-0:0:0/target0:0:0/0:0:0:0
-rw-r--r-- 1 root root 4096 Jul 6 11:59 fault
-rw-r--r-- 1 root root 4096 Jul 6 11:59 locate
drwxr-xr-x 2 root root 0 Jul 6 11:59 power
-rw-r--r-- 1 root root 4096 Jul 6 11:59 status
-r--r--r-- 1 root root 4096 Jul 6 11:59 type
-rw-r--r-- 1 root root 4096 Jul 6 11:59 uevent
Note the device directory and locate options.
Running echo 1 > locate will illuminate the drive beacon on the device present in that slot.
[root@zfs1-1 /sys/class/enclosure/0:0:11:0/0]# cd device
[root@zfs1-1 /sys/class/enclosure/0:0:11:0/0/device]# cat sas_address
0x5000c500236004a2
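Walking every slot by hand gets tedious, so the same lookup can be sketched as a loop. This assumes the sysfs layout shown above (`<enclosure>/<slot>/device/sas_address`); the function name and the optional base-directory argument are hypothetical.

```shell
# Sketch: print "enclosure slot sas_address" for every populated slot.
# The base directory argument is a hypothetical convenience for testing.
list_slot_addresses() {
    base=${1:-/sys/class/enclosure}
    for slot in "$base"/*/*/; do
        addr_file=${slot}device/sas_address
        # Skip empty slots and non-slot directories such as "power".
        [ -r "$addr_file" ] || continue
        enclosure=$(basename "$(dirname "$slot")")
        printf '%s %s %s\n' "$enclosure" "$(basename "$slot")" "$(cat "$addr_file")"
    done
}
```

Note that in this build the per-port SAS address (0x5000c500236004a2) is not identical to the multipath name used for the pool (35000c500236004a3), so treat the listing as a slot map rather than a list of pool device names.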
The output of multipath -ll should show the paths, selection policy, SCSI device name and /dev/mapper device name (e.g. 35000c500236061b3) for each of the drives in the enclosure.
[root@zfs1-1 ~]# multipath -ll
35000c500236061b3 dm-11 HP ,EF0450FARMV
size=419G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 0:0:3:0 sde 8:64 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
`- 0:0:16:0 sdq 65:0 active ready running
35000c5007772e5ff dm-3 HP ,EF0450FARMV
size=419G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 0:0:2:0 sdd 8:48 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
`- 0:0:15:0 sdp 8:240 active ready running
35000c500236032f7 dm-0 HP ,EF0450FARMV
size=419G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 0:0:19:0 sdt 65:48 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
`- 0:0:6:0 sdh 8:112 active ready running
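Since the whole design depends on every disk being reachable over two paths, it can help to count them rather than eyeball the listing. The sketch below reads `multipath -ll` output on stdin and assumes the WWID-style map names shown above (no user-friendly names); the function name is hypothetical.

```shell
# Sketch: read `multipath -ll` output on stdin and print "<map> <paths>".
count_paths() {
    awk '
        # A map header looks like: 35000c500236061b3 dm-11 HP ,EF0450FARMV
        $2 ~ /^dm-[0-9]+$/ { map = $1; next }
        # A path line carries an H:C:T:L tuple followed by an sdX device.
        {
            for (i = 1; i < NF; i++)
                if ($i ~ /^[0-9]+:[0-9]+:[0-9]+:[0-9]+$/ && $(i + 1) ~ /^sd[a-z]+$/)
                    paths[map]++
        }
        END { for (m in paths) print m, paths[m] }
    ' | sort
}
```

Something like `multipath -ll | count_paths | awk '$2 < 2'` would then list any disk that has lost a path, which usually points at a cable, I/O module or HBA port.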
The lsscsi command can also help display topology.
Note how the drives appear twice in the SAS layout.
[root@zfs1-2 ~]# lsscsi
[0:0:0:0] disk HP EF0450FARMV HPD6 /dev/sdb
[0:0:1:0] disk HP EF0450FARMV HPD6 /dev/sdc
[0:0:2:0] disk HP EF0450FARMV HPD6 /dev/sdd
[0:0:3:0] disk HP EF0450FARMV HPD6 /dev/sde
[0:0:4:0] disk HP EF0450FARMV HPD6 /dev/sdf
[0:0:5:0] disk HP EF0450FARMV HPD6 /dev/sdg
[0:0:6:0] disk HP EF0450FARMV HPD6 /dev/sdh
[0:0:7:0] disk HP EF0450FARMV HPD6 /dev/sdi
[0:0:8:0] disk HP EF0450FARMV HPD6 /dev/sdj
[0:0:9:0] disk HP EF0450FARMV HPD6 /dev/sdk
[0:0:10:0] disk HITACHI HUSSL4010BSS600 A110 /dev/sdl
[0:0:11:0] disk STEC ZeusRAM C023 /dev/sdm
[0:0:12:0] enclosu HP D2600 SAS AJ940A 0150 -
[0:0:13:0] disk HP EF0450FARMV HPD6 /dev/sdn
[0:0:14:0] disk HP EF0450FARMV HPD6 /dev/sdo
[0:0:15:0] disk HP EF0450FARMV HPD6 /dev/sdp
[0:0:16:0] disk HP EF0450FARMV HPD6 /dev/sdq
[0:0:17:0] disk HP EF0450FARMV HPD6 /dev/sdr
[0:0:18:0] disk HP EF0450FARMV HPD6 /dev/sds
[0:0:19:0] disk HP EF0450FARMV HPD6 /dev/sdt
[0:0:20:0] disk HP EF0450FARMV HPD6 /dev/sdu
[0:0:21:0] disk HP EF0450FARMV HPD6 /dev/sdv
[0:0:22:0] disk HP EF0450FARMV HPD6 /dev/sdw
[0:0:23:0] disk HITACHI HUSSL4010BSS600 A110 /dev/sdx
[0:0:24:0] disk STEC ZeusRAM C023 /dev/sdy
[0:0:25:0] enclosu HP D2600 SAS AJ940A 0150 -
[1:0:0:0] disk HP LOGICAL VOLUME 6.64 /dev/sda
[1:3:0:0] storage HP P410i 6.64 -
Once there's some idea of the SAS address layout of the drives, we can create a zpool.
# Create a zpool using the /dev/mapper devices.
# The most critical cluster pool creation option is cachefile=none
zpool create vol1 -o autoexpand=on -o autoreplace=on -o cachefile=none \
raidz1 35000c500236004a3 35000c50023614aef 35000c5007772e5ff \
raidz1 35000c500236061b3 35000c5002362f347 35000c500544508b7 \
raidz1 35000c500236032f7 35000c50023605c1b 35000c5002362ffab \
spare 35000c500236031ab
The result:
[root@zfs1-1 ~]# zpool status -v
pool: vol1
state: ONLINE
scan: scrub repaired 0 in 1h14m with 0 errors on Sun Aug 7 05:00:49 2016
config:
NAME STATE READ WRITE CKSUM
vol1 ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
35000c500236004a3 ONLINE 0 0 0
35000c50023614aef ONLINE 0 0 0
35000c5007772e5ff ONLINE 0 0 0
raidz1-1 ONLINE 0 0 0
35000c500236061b3 ONLINE 0 0 0
35000c5002362f347 ONLINE 0 0 0
35000c500544508b7 ONLINE 0 0 0
raidz1-2 ONLINE 0 0 0
35000c500236032f7 ONLINE 0 0 0
35000c50023605c1b ONLINE 0 0 0
35000c5002362ffab ONLINE 0 0 0
logs
35000a7203008de44 ONLINE 0 0 0
cache
35000cca0130c0a00 ONLINE 0 0 0
spares
35000c500236031ab AVAIL
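The status output lists `logs` and `cache` devices that the `zpool create` command above does not include, so they were presumably added in a separate step. A hedged sketch of that step follows, using the device names from the status output. The wrapper function and the `ZPOOL` override are hypothetical, included only so the command can be previewed with `ZPOOL=echo`.

```shell
# Sketch: attach a SLOG and an L2ARC device to an existing pool.
# ZPOOL is a hypothetical override; set it to "echo" for a dry run.
ZPOOL=${ZPOOL:-zpool}

add_slog_and_l2arc() {
    pool=$1 slog=$2 l2arc=$3
    # "log" and "cache" are the vdev keywords zpool uses for these roles.
    $ZPOOL add "$pool" log "$slog" cache "$l2arc"
}
```

For this build that would be `add_slog_and_l2arc vol1 35000a7203008de44 35000cca0130c0a00`.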
Cluster initialization:
The EL6/EL7 Corosync and Pacemaker packages install a service account named "hacluster". Assign a password to the user. This will be used later for the cluster authorization and the management GUI.
passwd hacluster
Enable and start the Pacemaker and Corosync system services.
systemctl enable pcsd
systemctl enable corosync
systemctl enable pacemaker
systemctl start pcsd
Authorize the cluster nodes. Here, I use the name "zfs-cluster", but choose something appropriate.
pcs cluster auth zfs-node1 zfs-node2
pcs cluster setup --start --name zfs-cluster zfs-node1,zfs-node1-ext zfs-node2,zfs-node2-ext
Create a STONITH device using SCSI reservations of the ZFS pool disks. List all of the devices that should be associated with this specific pool.
pcs stonith create fence-vol1 fence_scsi pcmk_monitor_action="metadata" devices="/dev/mapper/35000c500236061b3,/dev/mapper/35000c500236032f7,/dev/mapper/35000c5007772e5ff,/dev/mapper/35000c50023614aef,/dev/mapper/35000a7203008de44,/dev/mapper/35000c500236004a3,/dev/mapper/35000c5002362ffab,/dev/mapper/35000c500236031ab,/dev/mapper/35000c50023605c1b,/dev/mapper/35000c500544508b7,/dev/mapper/35000c5002362f347" meta provides=unfencing
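Typing that device list by hand is error-prone. A small helper along these lines could build the `devices=` value from a list of multipath WWIDs; the function name is hypothetical and the `/dev/mapper/` prefix simply mirrors the command above.

```shell
# Sketch: turn a list of multipath WWIDs into the comma-separated
# devices= value passed to the fence_scsi stonith resource.
fence_device_list() {
    list=""
    for wwid in "$@"; do
        # Prepend a comma only when the list already has an entry.
        list="${list:+$list,}/dev/mapper/$wwid"
    done
    printf '%s\n' "$list"
}

# Example with two of the pool disks from this build.
fence_device_list 35000c500236061b3 35000c500236032f7
```

The result can be dropped into the command as `devices="$(fence_device_list ...)"`, with one WWID per pool member.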
Create a ZFS pool resource to correspond to the zpool configuration. The pool name here is "vol1".
pcs resource create vol1 ZFS params pool="vol1" importargs="-d /dev/mapper/" op start timeout="90" op stop timeout="90" --group=group-vol1
Define a virtual IP that floats between nodes. This is associated with the ZFS pool. A benefit is that this gives you the option to run dual-active clustering with mutual failover by pinning a ZFS pool to a head node.
pcs resource create vol1-ip IPaddr2 ip=192.168.77.18 cidr_netmask=24 --group group-vol1
Set a default "stickiness" value to prevent flapping between nodes after a failover event.
pcs resource defaults resource-stickiness=100
The result from the pcs status command should look like:
[root@zfs1-1 ~]# pcs status
Last change: Tue Jul 19 19:29:16 2016 by root via crm_attribute on zfs1-1
Stack: corosync
Current DC: zfs1-2 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum
2 nodes and 3 resources configured
Online: [ zfs1-1 zfs1-2 ]
Full list of resources:
 fence-vol1 (stonith:fence_scsi): Started zfs1-1
 Resource Group: group-vol1
     vol1 (ocf::heartbeat:ZFS): Started zfs1-1
     vol1-ip (ocf::heartbeat:IPaddr2): Started zfs1-1
PCSD Status:
  zfs1-1: Online
  zfs1-2: Online
Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
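When scripting checks around failover, it is handy to pull the active node for a resource out of that output. The sketch below reads `pcs status` text on stdin and relies on the "name (type): Started node" line format of the pcs version used here; the function name is hypothetical.

```shell
# Sketch: read `pcs status` output on stdin and print the node that a
# given resource is currently started on.
resource_node() {
    awk -v res="$1" '$1 == res && $(NF - 1) == "Started" { print $NF }'
}
```

For example, `pcs status | resource_node vol1` should print the head node that currently has the pool imported, and `resource_node vol1-ip` should report the same node for the virtual IP.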
NFS export
The elegance of this design comes from eschewing the traditional NFS cluster resources and instead relying on the ZFS sharenfs property for exports.
The old way of doing this would mean having separate resources for:
- The ZFS zpool
- The zpool virtual IP
- The NFS server daemon
- An exportfs resource for each NFS server export
- An NFS notify resource to inform clients during failover
- And Pacemaker cluster constraints to maintain the ordering of the above resources.
In this design, we just use a fencing/STONITH resource, a resource for the zpool and the virtual IP.
For NFS exports, just define the sharenfs property on the filesystem(s) you wish to export.
zfs set sharenfs=rw=@<network>/24,sync,no_root_squash,no_wdelay vol1/management
This condenses the NFS daemon, exports and NFS notify steps into the zpool export/import process and speeds up failover times.
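To see what a pool will export after it is imported on either node, the property can be listed per dataset. A minimal sketch, assuming the standard `zfs get` flags `-H` (script-friendly output), `-r` (recurse) and `-o` (columns); the function name and the `ZFS` override are hypothetical.

```shell
# Sketch: list datasets in a pool whose sharenfs property is set.
# ZFS is a hypothetical override so the filter can be tested offline.
ZFS=${ZFS:-zfs}

shared_datasets() {
    # -H gives tab-separated output without a header; -r walks children.
    $ZFS get -H -r -o name,value sharenfs "$1" |
        awk -F '\t' '$2 != "off" && $2 != "-" { print $1 ": " $2 }'
}
```

Running `shared_datasets vol1` on the active node, and `showmount -e 192.168.77.18` from a client, should agree with each other both before and after a failover.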
Video: ZFS cluster HA failover, from Edmund White on Vimeo (https://player.vimeo.com/video/178221882).
VMware NFS path state during controlled failover
(in progress)
- tuned profiles.
- /etc/modprobe.d/zfs.conf options.
- Multiple queue I/O scheduling under EL7.2+.
(in progress)
Monitoring
CLI operations
# Get cluster status
pcs status
# Place cluster node in standby (manual failover)
pcs cluster standby [node name]
# Remove cluster node from standby state
pcs cluster unstandby [node name]
# Clean up cluster resources
pcs resource cleanup
Web GUI https://nodename:2224