Sunday, August 9, 2009

New server (expanded)

I'm the sysadmin at Sprend.
In this post I'll expand on Arne's previous post (read that one first), and dive into the technical details.
So consider yourself warned, this is gonna be fairly nerdy stuff :)
(Actually the Wikipedia geek article has a better explanation on the subject of nerds, but when I was a teenager me and my friends here in Sweden always called ourselves nerds, so that's the expression I'm sticking to)

Regarding the Java threads eating 99% CPU:
This might in part have been caused by us running Linux kernel 2.6.18 or Tomcat 5.5.20, but most likely the reason was Java 5.0.10 (it's hard to know since we were too lazy to do any serious debugging or profiling).
Also, the Java threads regularly allocated more memory than they were assigned, sometimes to the point of starving the machine of memory, at which point the kernel (or rather its dreaded OOM Killer) always made the unfortunate choice of killing MySQL instead of something less important, in order to free up memory.
(I didn't know about the oom_adj setting at the time. Not that it would have helped considering that Java and MySQL were the only things consuming any significant amount of memory on the server, and both of them had to stay alive)
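
For the curious, here is a minimal sketch of how the OOM Killer can be steered per process. It assumes a Linux kernel that exposes /proc/<pid>/oom_score_adj (the successor to the oom_adj file mentioned above, with a range of -1000 to 1000 where -1000 exempts the process). The helper name and the proc_root parameter are made up for illustration, and lowering the value requires root.

```python
import os

OOM_SCORE_ADJ_MIN = -1000  # process is exempt from the OOM Killer
OOM_SCORE_ADJ_MAX = 1000   # process is strongly preferred as a victim


def set_oom_score_adj(pid, value, proc_root="/proc"):
    """Write an OOM Killer adjustment for one process (hypothetical helper)."""
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError("value must be between -1000 and 1000")
    # proc_root is only a parameter so the helper can be tried on a scratch dir
    path = os.path.join(proc_root, str(pid), "oom_score_adj")
    with open(path, "w") as f:
        f.write(str(value))
    return path
```

Something like set_oom_score_adj(mysqld_pid, -1000) would protect MySQL, where mysqld_pid is whatever PID the database runs under. As noted above, that would not have saved us, since Java had to stay alive too.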

Aside from CPU usage being reduced on the new server, memory is not "leaking" anymore and Java & MySQL are using fewer threads.
That's partly due to the faster CPU (dual core Athlon 64 5200+) but also because of more efficient software versions: kernel 2.6.3x, Java 6.x, Tomcat 6.x and MySQL 5.x
When things were at their worst on the old server, Java grew towards 200 threads and 1 GiB of RAM (the server only had 1 GiB of RAM, but no swap because that would have hurt our disk performance even more). MySQL 4.1.22 behaved more gracefully and stayed below 50 threads and 100 MiB of RAM. On the new server Java stays below 300 MiB of RAM and 120 threads. MySQL stays below 50 MiB of RAM and 25 threads. Java now seldom occupies more than 100% of one CPU core (often much less than that) and MySQL consumes virtually zero CPU (and that's how it should be).
We had some other minor problems with Java and MySQL as well that disappeared on the new server.
As a consequence Java and MySQL are roughly an order of magnitude more stable now, which is quite nice for me since I don't need to babysit them anymore.

Regarding moving the db from the USB flash drives to the hard drives:
The reason was that the USB drives are slow when MySQL is doing something that causes heavy and sustained disk IO.
Which is not a surprise considering that USB flash drives typically have IO throughput of merely 5-15 MiB/s.

Also, I separated the system disk (which holds the operating system) from the data disk (which holds the files being uploaded and downloaded to/from
The reason to separate the system disk from the data disk is performance - concurrent reads and writes in particular.
And why is that necessary?
Well, our internet connection is a dedicated 100/100 Mbit/s full duplex ethernet line. This means that we can push a maximum of about 24 MiB/s through the line, counting both directions (roughly 12 MiB/s each way).
That's nothing for our SATA-300 hard drives which I've measured to push approximately 100 MiB/s of sequential IO per drive at peak performance.
But, and this is the crux, at peak hours (noon, afternoon and evenings) we typically have something like 30 to 40 simultaneous file transfers in progress.
And while the aggregate bandwidth of those transfers seldom goes beyond 15 MiB/s, they do cause simultaneous reads and writes of 30 to 40 different files on the hard drive. Also known as random IO.
This means that the magnetic head inside the hard drive is jumping around like crazy the whole time while accessing the different data blocks belonging to all those files (no matter what you do, the data blocks are gonna get spread out over the platter(s) inside the hard drive over time - especially with our high rate of file creations and deletions - and that's why the magnetic head has to jump around so much).
That in turn translates into increased seek times (and increased wear & tear) on the hard drive.
On the old server we had a combined system and data disk, a PATA/100 disk controller and the XFS file system on the hard drives.
That caused the old hard drives to become seriously overworked and slow at peak traffic hours.
Now, there's nothing wrong with XFS. I've done some performance comparisons of the Linux journalling file systems ext3, reiser3, JFS and XFS. All on the same Linux installation on non-enterprise hardware, and XFS was the clear winner.
But the newer generation ext4 (with its extents, pre-allocation, delayed allocation and multiblock allocator) in conjunction with the faster SATA-300 disk subsystem and separated system & data disks proved to be highly effective.
The load on the hard drives can't even be noticed anymore during peak traffic hours.
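
A quick back-of-the-envelope check of the bandwidth figures above, as a sketch that ignores Ethernet and TCP overhead. A 100 Mbit/s link carries just under 12 MiB/s per direction, so a ceiling of roughly 24-25 MiB/s only holds for both directions combined. The figure of 35 transfers is simply the midpoint of the 30 to 40 mentioned above.

```python
MIB = 1024 * 1024


def mbit_to_mib_per_s(mbit_per_s):
    """Convert a line rate in Mbit/s (10^6 bits) to MiB/s (2^20 bytes)."""
    return mbit_per_s * 1_000_000 / 8 / MIB


one_way = mbit_to_mib_per_s(100)  # one direction of the link
both_ways = 2 * one_way           # full duplex: upload + download combined
per_transfer = 15 / 35            # 15 MiB/s shared by ~35 transfers

print(f"one direction:   {one_way:.1f} MiB/s")
print(f"both directions: {both_ways:.1f} MiB/s")
print(f"per transfer:    {per_transfer:.2f} MiB/s")
```

So each transfer only trickles along at well under 1 MiB/s, which is exactly why the drive sees many small interleaved requests instead of a few long sequential ones.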

Of course, ZFS is still the ultimate pr0n when it comes to file systems.
Unfortunately, the CDDL license of ZFS and the GPL license of the Linux kernel are incompatible, preventing ZFS from being incorporated into the Linux kernel.
But the good news is that there is an all new and shiny Linux native file system in full development right now, which is basically an improved clone of ZFS.
It's called Btrfs (sponsored primarily by Oracle) and when it's declared stable we'll switch over to it and get amazing kickass features!

Oh, and the reason that we used USB flash drives is that they're cheap, noiseless, cold, power efficient and small in physical size (the server has room for them, but not for 2 extra hard drives).
All of this except being cheap is also true for SSD drives, which is why we went with USB flash drives instead.
SSD drives have blazing performance, but they're just too expensive at this point in time for this project.
Also, SSDs still share a serious technical problem with USB flash: after something like 50-100K write cycles, individual memory cells will start to fail (even when utilizing wear levelling).
But that, and write performance, won't be a problem in the next generation of SSD drives.

Other points of interest regarding the new server:
  • We've switched from 32 to 64 bit Gentoo Linux for OS
  • We're now using NAPI in our NIC driver, which reduces interrupt generated CPU load by 5-10% (estimated) on incoming network traffic
  • Security is increased. In particular login security + the number of open network ports is reduced (that number is extremely low now)
  • We're utilizing around 400 GiB of our storage capacity (which is well over 1 TiB now)
  • We will try using APR, which will enable Tomcat to scale better, and seems able to reduce Java CPU usage somewhat.
  • We will connect a UPS to the server
Regarding our second new server, for increased RAS, which is not in use yet:
We have to investigate whether to use clustering, loadbalancing or failover on the servers.

In conclusion, this is how I imagine a discussion with Homer would summarize things (see the video clip below for why this is funny):
Me: The old server is b0rked!
Homer: That's bad.
Me: But the new server is totally sweet!
Homer: That's good.
(Not that Homer has any idea what a server is, but let's pretend that he does)

Edit: FU to Fox for revoking access to the Simpsons video clip on Youtube. Fortunately there are other video sites.


Arne Evertsson said...

What the *&%@ are you doing? You're taking MY serious blog and turning it into a comedy show for geeks! Mmm, I like it.

A couple of comments: "30 to 40 simultaneous file transfers" at peak hours. I think it goes well beyond that but I don't have the stats to prove it. That would be something for the next version of the admin applet.

"the magnetic head inside the hard drive is jumping around like crazy": Is it really the case? There is caching in Java, on the OS level and then on the drive itself.

Other than that, it's a great post with really good educational value for a minor geek such as myself.

Joakim Signal said...

I was considering (no, not really) ending the post with the Star Wars Asciimation (if you're a real nerd you do telnet instead though), but watching that fine piece of art actually takes longer than reading my post.

Yeah, we might have 50 or 60 or (?) transfers at peak hours these days. I haven't checked that in a while...

I did factor in the caches on the hard drive and in the file system in my analysis.
But the hard drive cache is only 32 MB, i.e. it gets evicted in a heartbeat.
I don't know how large the file system cache gets, but it should constitute the majority of the free RAM in the machine.
But we have 2 problems that prevent even a cache that's a couple of GiB in size from being highly effective:

1) A cache only seriously benefits us on files that have already been completely written to, or (partially?) read from physical disk. So all files that are in the process of being uploaded to the server cannot benefit much from the cache.
Files being uploaded do however get the benefit that the cache can delay writes to physical disk for something like a few seconds if the drive reports that it's really busy. But when our drives report that they are very busy, then we have already "lost the battle" (but the delayed writes caused by the cache of course decrease the high burden on the hard drives somewhat).

2) Files being read can benefit, but I think a majority of files don't get downloaded ( = read from the server's storage) within, say, 20 minutes of being uploaded. And after 20 minutes, even a multi-GiB sized cache has been completely "flushed" (all its data replaced) at our peak hours traffic level. And thus, most files will have to be re-read from physical disk at the point in time when they actually do get downloaded.

Of course, during low traffic hours the file system cache will take much longer to get flushed, but the hard drives have little work to do then, so their magnetic heads aren't moving around much at that point anyway.
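
To put rough numbers on point 2, here is a small sketch. It assumes a very simple model in which all traffic passes through the file system cache once and pushes out the oldest data, and it uses the 15 MiB/s peak figure from the post. The cache sizes are just illustrative.

```python
def cache_turnover_minutes(cache_gib, traffic_mib_per_s):
    """Minutes until new traffic has replaced every byte in the cache."""
    cache_mib = cache_gib * 1024
    return cache_mib / traffic_mib_per_s / 60


for cache_gib in (1, 2, 4):
    minutes = cache_turnover_minutes(cache_gib, 15)
    print(f"{cache_gib} GiB cache at 15 MiB/s: ~{minutes:.1f} min")
```

Under these assumptions even a 4 GiB cache turns over in about five minutes, comfortably inside the 20 minute window.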

So, nice try, but we get bitten in the ass by the laws of physics once again.

Joakim Signal said...

So, I thought I should post some updates regarding facts and numbers as of 2010-04-16.

We no longer use USB flash drives on the server. It was a nice experiment, but I was unable to sufficiently reduce the amount of disk writes going to: the filesystem journal, application log files and database tables. So even though the flash drives were protected through a RAID-1 setup, we had to abandon them due to the short life span they were facing due to the excessive amount of writes.

Looking at our traffic, we're now at this level:
- Around 100 ongoing file transfers at peak hours (which completely saturates our full duplex 100/100 Mbit internet connection).
- Around 8000 visitors per day.
- Around 18 TiB (yes, terabytes!) of file transfers per month.
- Around 700 GiB of user data constantly in our storage (a file is currently stored 7 days maximum).

We feel these numbers are kind of impressive for a small website like ours :)
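
As a sanity check on those numbers, here is a small sketch that turns the monthly volume into an average rate. It assumes a 30-day month. The last line uses the 7 day retention limit: if nothing lives longer than 7 days, about 700 GiB on disk implies at least 100 GiB of new uploads per day on average.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assuming a 30-day month
MIB = 1024 ** 2
TIB = 1024 ** 4

monthly_bytes = 18 * TIB
avg_mib_per_s = monthly_bytes / SECONDS_PER_MONTH / MIB
avg_mbit_per_s = monthly_bytes * 8 / SECONDS_PER_MONTH / 1_000_000

# Files are stored at most 7 days, so steady-state storage divided by 7
# gives a lower bound on the average daily upload volume.
min_upload_gib_per_day = 700 / 7

print(f"average rate: {avg_mib_per_s:.2f} MiB/s ({avg_mbit_per_s:.1f} Mbit/s)")
print(f"uploads:      at least {min_upload_gib_per_day:.0f} GiB/day")
```

That works out to an average of about 61 Mbit/s around the clock, which fits with the line being saturated at peak hours.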

Arne Evertsson said...

Short life span, well perhaps. More troublesome however was the fact that the flash drives kept going into read-only mode. Since we weren't using them purely to store the movie collection it became a problem.

Joakim Signal said...

Well, technically it wasn't the USB sticks themselves that went into read-only mode, it was the filesystem we installed on top of them that did.
I don't know why that happened. It could have been a bug in the ext3 file system code, a bug in the FTL (Flash Translation Layer), damaged flash memory cells, or something else _really_ weird.

Because of the problem with short life span that we faced with our USB sticks, I didn't investigate the read-only problem in detail.

Anyway, FTL is software that handles low level operations such as block erasing, wear levelling, etc in flash memory devices (SSD, USB sticks, etc). Most of the time it's embedded as firmware on the hardware device itself. Also most of the time it's proprietary software from the flash device manufacturer (or one of its business partners). And every aspect of it is fiercely guarded as a business trade secret. Which means it's extremely difficult to find out if there are any serious bugs in these FTLs, such as ones causing file systems to go into read-only mode. Or if these FTLs report hardware events to the OS kernel, for example "I have discovered faulty memory cells" events. Which means that it might be difficult to know if your file system is writing or reading corrupted data to/from damaged blocks on the flash device.

At least most flash devices these days do come with an FTL. And these FTL's usually have the feature of automatically blacklisting and re-mapping damaged memory cells (up to a point), which hard drives have had for a long time now.

But still, the fact that most FTL's are proprietary and obfuscated to death makes life miserable for us sysadmins, powerusers and hackers trying to effectively use or enhance these products.
And while this is Standard Operating Procedure (and has been for a long time) in the computer industry, there is at least good hope for the future considering that the demands for open standards and open source increase every day. Which means that the manufacturers probably can't keep doing this retarded bullshit for much longer.

Joakim Signal said...

Btw, why doesn't blogspot allow us to edit our previously posted comments?

I wanted to add this to my earlier comment:
"- Around 700 GiB of user data, spread out on 6500 files, is constantly in our storage"
