2015/10/18

Server Administration, LARTC, and All The Pain Money Can Buy - Fast Networks, Backups and Lots of Disks

Hi Interwebs, long time no post.  But I'm feeling creative today.  Since my last post I've changed companies to take a job as a software engineer across the country, and I now live in the Twin Cities.  A lot has changed, but my home server rack is essentially the same as it was a few years ago, although it has had a couple of upgrades.

Right now I'm sitting around, surrounded by tools and parts, waiting for drives to finish zeroing out so I can make a second ZFS pool.  I'll get back to that in a minute.  First I should break down how I currently have my server set up.

My VM server is being rebuilt at the moment, so I've offloaded running virtual machines to my desktops; they're beefy enough to handle the job for now.  That leaves my core switch, router and storage server in the rack.  You can skip the breakdown and jump to the tl;dr section after the list.

  • Switch
    • TP-Link TL-SG2424 - it's what you'd expect from a ~$200 switch that offers higher end abilities: it's kind of junky and doesn't implement the higher end features especially well, but it mostly works most of the time.  It has VLANs, LAGG, SSH, QoS, SNMP, etc.  A couple of features are buggy and the documentation is terrible, but once again, what did you expect for ~$200?
    • Connects to the core router and servers via the patch panel.

  • Router
    • pfSense box (dual core Celeron w/ 2 GB RAM) with 7 NICs, 6 of which have BSD drivers and 5 of which work flawlessly.  One interface on the PCIe bus acts as a backup-backup wireless AP; its driver tends to wedge from time to time and the AP goes MIA.  Not a huge deal, but I wish it worked without issue.  The onboard NIC doesn't work at all, which isn't a huge deal either, but it forces me to run the WAN link over the PCI bus.
    • The NICs are spread over the buses (PCI and PCIe) such that one NIC on each bus makes up a bond - 1 PCI and 1 PCIe NIC - to spread the traffic and interrupts.  The WAN is on the PCI bus (lower bandwidth requirements - but I'll get to bus flooding later), and the WLAN is on the PCIe bus.  1 bonded pair goes to the LAN subnet/VLAN and another bonded pair goes to the storage network/VLAN.
    • Uses LACP to the switch for bonding.
  • Storage Server
    • Slackware Linux box (6 core AMD, 32 GB RAM) with a hand compiled kernel plus ZFS on Linux patches, 4 NICs and 10 disks.
    • NICs form 2 bonded interfaces from a PCI/PCIE NIC pair each to spread the bus load and are bonded to the "storageGateway" subnet on the router, via the switch, using adaptive load balancing (ALB) on one bond and LACP on the other.
    • Disks are spread over the PCIe and SATA buses into 4 ZFS mirrored vdevs (a rough command-level sketch follows this list):
      • 3 vdevs have 1 internal disk and 1 external hot swappable disk; the internal drives hang off a PCIe controller and the external drives are wired to the SATA bus on the primary onboard disk controller.
      • 1 vdev has both disks in external hot swappable sleds so the vdev can be upgraded without opening the rig; this mirror uses the second onboard disk controller since the primary controller can only handle 4 SATA channels.
      • Of the remaining 2 disks, 1 is the OS drive and the other is an SSD dedicated to read caching and write buffering.
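
Here's roughly what that bond and pool layout looks like in command form.  This is a sketch, not a transcript: the interface names, pool name, and disk paths are placeholders, and I'm assuming the SSD is split into two partitions for the read cache (L2ARC) and write buffer (ZIL/SLOG).

# Bonds via iproute2 - one ALB bond and one LACP bond, each pairing a PCI NIC with a PCIe NIC
ip link set eth0 down; ip link set eth1 down
ip link add bond0 type bond mode balance-alb
ip link set eth0 master bond0
ip link set eth1 master bond0
ip link set eth2 down; ip link set eth3 down
ip link add bond1 type bond mode 802.3ad    # needs a matching LAG configured on the switch
ip link set eth2 master bond1
ip link set eth3 master bond1

# Pool of 4 mirrored vdevs; the last mirror is the all-external, hot swappable one
zpool create tank \
  mirror /dev/disk/by-id/internal-0 /dev/disk/by-id/external-0 \
  mirror /dev/disk/by-id/internal-1 /dev/disk/by-id/external-1 \
  mirror /dev/disk/by-id/internal-2 /dev/disk/by-id/external-2 \
  mirror /dev/disk/by-id/external-3 /dev/disk/by-id/external-4
zpool add tank cache /dev/disk/by-id/ssd-part1    # read caching
zpool add tank log   /dev/disk/by-id/ssd-part2    # write buffering (intent log)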

Whew!  That's a lot to take in.  I know, because I designed it and I still have a bit of trouble remembering it all off the top of my head.  The basic design principle is that all loads should be spread over the available buses and disk controllers for performance reasons.  One of the mirrors in the ZFS pool should have both drives externally accessible so the storage server doesn't have to be opened for an emergency storage upgrade.  Here's the logical and physical breakdown of the wiring.  (My server closet is a mess; I'm in the middle of bundling the wires today.  Also, sorry that none of the pictures can be lined up side by side - Google decided to make the sizes exactly half of the width of the page and then padded them for some reason... after a good 20 minutes of trying to make this look nice I gave up - you're lucky I left any pictures at all.)

Switch logical view : Rows are LAG #, VLAN #, Port # (colors are bond association on other pictures)

Router NICs by bus : "Wire #" is the physical wire ID per box going to the patch panel and "Port" is the terminating port on the switch (row 3 of the switch pic above, 1 is bottom left and 2 is above it, etc.).
Router logical view

Storage Server logical view

Big Picture logical view



Yes, my server closet is literally a closet.

2009/07/08

My Home Network (Part 1)

Developers have unique storage needs as a result of the work we do. I've got a subversion server that needs storage, directories full of libraries, IDE stacks, and a metric crap-ton of virtual machines and their install ISOs.

When I first started this project, I decided that I needed reliable, scalable, centrally located and accessible disk space. It also had to be fast! Finally, it had to be tidy and easy to administer. I might as well have asked for a pony while I was at it.

Having an old audio rack, a 1000' spool of cat 5E, and some power tools at my disposal, I decided that everything would go right into a rack-mounted box where I could maintain the wiring easily. Lighting would come in handy, too, for those times when I have to crawl behind the rack and hook up a monitor or keyboard.

After some trial and error, I found a very fast, versatile, fault resistant, and scalable setup that uses mostly commodity hardware (for those hardware failures that only happen at 8PM on a Saturday night) and open source software. The rack utilizes three servers, running three operating systems, each tuned for its server's primary purpose.


All of the boxes and virtual machines can be controlled from my jailbroken iPod Touch or my Blackberry Storm. A bonus to this approach is that I can use the iPod to control my iTunes from anywhere in my apartment. The Linux and BSD boxes can be administered via SSH with private key authentication, and the Windows boxes can be administered using RDP with Samba authentication. Samba provides the user names and accounts for both the Windows and Linux boxes as a single point of administration for users/passwords.


Here is the basic setup...

Boxes

  • Storage Server

The storage server, named 'storage', is a 4U chassis with an external hot swappable 5 drive bay. It sports an Asus A8N32-SLI Deluxe rocking an AMD Opteron 146, 4 gigabit NICs, 1 100 Mbit NIC, a GeForce 8500 GT, and two gigs of RAM that act as a dedicated disk buffer/cache (it had 3 gigs, but I burnt out a dual channel kit... I'm dropping in 4 gig sticks on the next upgrade cycle). Storage is running Fedora-10-x86_64 Linux and exports disk space over iSCSI on two subnets.

  • Router

The router, named 'drawbridge', is an old EMachines EMonster 600 with three NICs, an old Quantum BigFoot 5.25" drive, and about half a gig of RAM. It runs pfSense, a BSD-derived OS that is actually a m0n0wall fork. The services provided for the network are DNS, DHCP, DHCP forwarding to DNS, UPnP, NAT, traffic virus scanning, traffic graphs, and a 3-way firewall.


Unfortunately, this box is getting long in the tooth. When I upgrade my server, that box will replace the router. The rule of thumb for network traffic is 1 MHz per Mbit per second, so a single gigabit link wants roughly 1000 MHz to itself; 600 MHz just doesn't cut it for running several gigabit trunks.

  • Slacker

Slacker, the primary server, is an AMD 900 MHz Duron with half a gig of RAM and 5 NICs. It runs a customized and slimmed down version of Slackware-10.2 Linux (ergo, the name slacker!). Slacker provides domain authentication and storage via Samba, as well as being a Subversion server. The storage for the services running on Slacker lives on the storage server and is imported via multiple, redundant, striped iSCSI links.


Network Design

The primary server gets its storage from the storage server over a dedicated switch using two connections from each box for redundancy. This subnet is 192.168.10.0/28 and is isolated from the rest of the network. There is also a dedicated switch and subnet (192.168.5.0/28) for storage accessed by the desktops. For performance, neither of those subnets go through the router.

Publicly addressable servers are on a 192.168.0.0/24 subnet and are only accessible through the router (all traffic has to pass through the router's firewall). The wireless access point shares the switch and subnet of the rest of the LAN on 192.168.1.0/24.

Any traffic headed from the LAN to anywhere else has to pass through the firewall on the gateway. Internet traffic comes from a cable modem and hits a patch panel, which directs it to the router's WAN NIC.

Each box has two links to the backbone switch so that one NIC on each box can fail without causing a problem. Multipathd is utilized for round robin striping over the iSCSI connections on the backbone. Every few MB the active connection to the storage server is changed. A non-functioning link is dropped from the rotation schedule until it is functional again. Since I'm not using a managed switch with LACP, Linux policy routing is used so that a box responds on the interface that got the incoming packet; for fail-over purposes, ARP proxying is allowed on the backbone NICs for both boxes. It confuses the switch to no end, but the important thing is that it works.
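
Here is a rough sketch of that setup. The interface names, table numbers, and multipath values are illustrative rather than copied from my configs:

# Policy routing: answer on the interface that received the packet (one table per backbone NIC)
ip route add 192.168.10.0/28 dev eth1 src 192.168.10.2 table 101
ip rule add from 192.168.10.2 table 101
ip route add 192.168.10.0/28 dev eth2 src 192.168.10.3 table 102
ip rule add from 192.168.10.3 table 102

# Allow ARP proxying on the backbone NICs for fail-over
sysctl -w net.ipv4.conf.eth1.proxy_arp=1
sysctl -w net.ipv4.conf.eth2.proxy_arp=1

# /etc/multipath.conf excerpt: round robin over all paths, switching after a set number of IOs
defaults {
    path_grouping_policy  multibus
    path_selector         "round-robin 0"
    rr_min_io             1000
}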


Storage


Each of the drives is partitioned into five 100 GB slices with the partition type set for Linux software RAID, which allows them to be picked up and added to RAID groups automatically on the fly. From here, 5 RAID-10 devices are created with a 128K chunk size:

[root@storage sbin]# cat /proc/mdstat
Personalities : [raid10]
md3 : active raid10 sde3[0] sdd3[3] sdc3[2] sdb3[1]
      195318016 blocks 128K chunks 2 near-copies [4/4] [UUUU]
md5 : active raid10 sde5[0] sdd5[3] sdc5[2] sdb5[1]
      195318016 blocks 128K chunks 2 near-copies [4/4] [UUUU]
md6 : active raid10 sde6[0] sdd6[3] sdc6[2] sdb6[1]
      195494656 blocks 128K chunks 2 near-copies [4/4] [UUUU]
md1 : active raid10 sde1[0] sdd1[3] sdc1[2] sdb1[1]
      195318016 blocks 128K chunks 2 near-copies [4/4] [UUUU]
md2 : active raid10 sde2[0] sdd2[3] sdc2[2] sdb2[1]
      195318016 blocks 128K chunks 2 near-copies [4/4] [UUUU]
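
Each of those arrays can be built with something along these lines (the exact flags I used weren't recorded, so treat this as a sketch; it matches the 128K chunk, near-copy layout shown above):

# One RAID-10 array per partition number, 128K chunks, 2 near-copies across 4 drives
mdadm --create /dev/md1 --level=10 --layout=n2 --chunk=128 \
      --raid-devices=4 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
# ...and likewise for md2, md3, md5, and md6 with the matching partitions.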

Four of these devices are then exported over iSCSI. The fifth has an XFS file system laid on top of it which holds a bunch of .img files that are exported over iSCSI. Using pre-allocated files boosts virtual machine performance since the Linux kernel will allocate RAM as a buffer to the file, as well as allowing VirtualBox to import it as a disk directly over iSCSI. As a bonus, the files compress really well (all unused space is filled with zeros) and allow you to carry a single compressed file that has an OS on it.
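
As an example of the flat file approach, here is a sketch of pre-allocating an image and exporting it; the path, size, and target name are made up, and I'm showing iSCSI Enterprise Target (ietd.conf) syntax, which may not match whatever target software you run:

# Pre-allocate a zero-filled 20 GB image on the XFS volume
dd if=/dev/zero of=/vmstore/winxp.img bs=1M count=20480

# /etc/ietd.conf excerpt exporting the file as a LUN
Target iqn.2009-07.lan.storage:vm.winxp
    Lun 0 Path=/vmstore/winxp.img,Type=fileio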


Performance

Performance is generally decent:

[root@storage sbin]# hdparm -tT /dev/md2
/dev/md2:
 Timing cached reads: 1604 MB in 2.00 seconds = 801.86 MB/sec
 Timing buffered disk reads: 414 MB in 3.01 seconds = 137.69 MB/sec

The thing to remember is that you can have low latency with low throughput, or high latency with high throughput, per link. For IO-bound operations you'll want the throughput; for CPU-bound operations you'll want the low latency. The trick is finding a good middle ground. The other trick is cheating as much as possible to lower the latency.

As with almost all computer related topics, you can trade off space for time. In almost every instance this is a good trade off; time is the only resource you can never get back. Instead of reading and writing directly to disk, throw a metric crap-tonne of hardware at the problem in the form of RAM. It is dirt cheap these days, and performs an order of magnitude faster than a hard drive. Using RAM as a read/write buffer allows for storing read-ahead from the drive, as well as higher cache hits from the page cache. I dedicate all RAM on the storage box as buffer and make sure that the OS is doing all of the heavy lifting for writing out dirty pages to disk; the read cache buffer will shrink as needed to allow more pending writes. Here are the (somewhat aggressive) virtual memory settings for the storage box:
/proc/sys/vm/dirty_background_ratio 1
/proc/sys/vm/dirty_expire_centisecs 1499
/proc/sys/vm/dirty_ratio 90
/proc/sys/vm/dirty_writeback_centisecs 248
/proc/sys/vm/swappiness 95
/proc/sys/vm/vfs_cache_pressure 100
For more info about these Linux kernel settings, see The Linux Page Cache and pdflush: Theory of Operation and Tuning for Write-Heavy Loads.

Basically the settings above can be interpreted as, “Swap as much out as possible, write soon and often, make the OS do (almost) all of the writing on behalf of user processes, and use as much RAM (and then some) as needed to accomplish this.”
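
To make those values stick across reboots, they can go in /etc/sysctl.conf and be loaded with sysctl -p; these are the same numbers listed above:

# /etc/sysctl.conf
vm.dirty_background_ratio = 1
vm.dirty_expire_centisecs = 1499
vm.dirty_ratio = 90
vm.dirty_writeback_centisecs = 248
vm.swappiness = 95
vm.vfs_cache_pressure = 100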

Having a large RAM buffer between the disks and the clients increases throughput while keeping latency low due to a high cache hit ratio. Read-aheads from disk are kept in RAM before they are requested. A further tweak (depending on workload) is using the anticipatory IO scheduler instead of CFQ. This favors throughput over latency when requesting IO to or from the disks.
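
Switching the scheduler is a one-liner per member disk (sdb here stands in for each drive in the array):

cat /sys/block/sdb/queue/scheduler              # shows something like: noop anticipatory deadline [cfq]
echo anticipatory > /sys/block/sdb/queue/scheduler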

The network stack has also been tweaked a bit. The congestion control choices are reno or cubic. Proxy ARP, dsacks, sacks, and low_latency are utilized, while tcp_reordering, tcp_max_syn_backlog, and tcp_mem are cranked all the way up.
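
In sysctl terms that boils down to something like the following; I didn't record the exact numbers for the cranked-up knobs, so the values below are placeholders:

sysctl -w net.ipv4.tcp_congestion_control=cubic    # or reno
sysctl -w net.ipv4.tcp_sack=1
sysctl -w net.ipv4.tcp_dsack=1
sysctl -w net.ipv4.tcp_low_latency=1
sysctl -w net.ipv4.conf.all.proxy_arp=1
sysctl -w net.ipv4.tcp_reordering=127                        # placeholder
sysctl -w net.ipv4.tcp_max_syn_backlog=8192                  # placeholder
sysctl -w net.ipv4.tcp_mem="1500000 2000000 2500000"         # placeholder: min/pressure/max in pages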


Backups

Using a flat file eases offsite backups: sync your disks (use an LVM snapshot, if possible!), copy the flat img file to a temp directory, crunch it at full compression, and split it into 345 MB chunks. 345 MB allows for writing either two chunks to a CD, 13 chunks to a single layer DVD, or 24 chunks to a dual layer DVD. Note that MB here = 1 million bytes, and you need a bit of head room on disc for the file system and other accounting metadata. At any rate, fire up your coaster-maker with a quality dual layer DVD+R and burn off your backups. I use a deposit box at my bank for backups; it's well worth a yearly fee of $75 to have a secure off site repository for your data... unless your data isn't worth anything. For the rest of us, follow SOP: at the very least two media types (I'd go so far as to say 3 is probably optimal - magnetic, optical and flash), two locations, encrypted, at most two keys (or a safe combination split up over 3 or 5 people), and make sure to test the restore procedure.
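
A sketch of that pipeline, with made-up volume and file names (the 'MB' suffix for split is the decimal million-bytes one mentioned above):

# Snapshot the volume holding the images so the copy is consistent
lvcreate --snapshot --size 5G --name vmsnap /dev/vg_storage/vmstore
mount -o ro /dev/vg_storage/vmsnap /mnt/snap

# Copy, crunch at full compression, and split into burnable chunks
cp /mnt/snap/winxp.img /tmp/backup/
bzip2 -9 /tmp/backup/winxp.img
split -b 345MB /tmp/backup/winxp.img.bz2 /tmp/backup/winxp.img.bz2.part-

# Drop the snapshot when the copy is done
umount /mnt/snap
lvremove -f /dev/vg_storage/vmsnap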

While having a good backup to restore from makes me feel all warm and fuzzy, I'd prefer to avoid the situation altogether, which is more difficult than it sounds. A rack mounted storage box is a somewhat hostile environment for computer components. On bootup, unless you've got drives that'll do a staggered spin up, you strain your power supply as it has to get a bunch of metal platters up to 7200 or more RPM within a second or two. Once up to speed, the drives rattle around in their cramped, hot bay. During transfers, the most sluggish drive is going to be under constant strain to keep up with the rest of the drives. As the contents of the drives are moved across the chipset to RAM and the various buses, TCP/IP packets are checksummed and their payloads stripped out, and operating system services buzz around flushing buffer caches and building page caches, the box is going to get downright hot! Marginal components aren't going to make it too long under these conditions. This is where using quality parts and having spares on hand is going to make your life easier. In the last two years, I've lost two cheap RAM sticks, two hard drives, and two power supplies.

To help combat hardware failures, I've installed two ball bearing fans that draw the hot air from the top of the box and push it down into the exhaust stream to get it out of the box and away from the power supply. The air filter on the front intake is coated in an aerosol lithium oil to catch more of the dust that would have otherwise gotten through and clogged the fans. All of the SATA cables are run around the perimeter of the box inside a mesh sleeve to allow for better air flow through the box. All of the fan wires are in sleeves as well. To cut down on CPU usage and increase redundancy, the drives were switched from a RAID 5 configuration to a RAID 10 configuration. The tradeoff was 33% less space, but more throughput with less latency.

(More in my next blog...)