The LOAD Monster, the consequences of careless deployment


How the LOAD Monster was born

This story starts the same way as most stories of its kind. At some point, someone introduced a nice and shiny web application to make the Management Team's life swift and easy.
A small group of people started using the software and, because they were pretty happy with how it worked and what they could do with it, they started bringing more and more employees on board.
Another "someone" had a brilliant idea and said: "Hey, if that software is so cool, why don't we bring in other apps from the same family so they can work with each other and make our lives even easier?" And so they did. Because more and more people at the company were using these apps, they very quickly became THE toolkit for literally everyone, from developers to high-level executives.
At this point, the IT team stepped in and said: "Hey, you can't have an application that so many people depend on without a BIG FAT cluster with 99.999% uptime and a bunch of other critical stuff like SLAs and so on."

This is when the monster was born.

Money was not much of an issue, so they threw in the big guns: a cluster of two nodes, each an IBM System x3690 X5 with two six-core hyper-threading Xeon E7-2803 processors, 128 GB of RAM and 1.2 TB of storage. The high-availability layer was built with the good old Pacemaker/Corosync combo, backed by DRBD-based storage. The main actors, the applications, were:

  • JIRA
  • Confluence
  • Bitbucket
  • Bamboo
  • Crowd
  • Fisheye
  • Nexus
  • CollabNet SVN
  • MySQL

Obviously, all these applications, including the database, were deployed on that DRBD shared storage I've already mentioned – clustering, remember? More than that, a file-level backup script ran every single night, creating a copy of the data set on that very same DRBD shared storage. Time passed, the data grew, more and more people were using the whole set of apps, and subcontractors started being brought on board as well.

The monster was growing pretty fast.

First, Pacemaker went on strike – no wonder, it had to manage Java applications, I would do that too – and the cluster effectively became a single-node entity with DRBD replication. The data set grew so much that the whole backup procedure was taking more than 24 hours to finish, leaving the data inconsistent across iterations.
As DevOps practices became more and more popular, Bamboo started being heavily used for compilations, tests and deployments, adding even more heat to the whole setup. Obviously, there were signs that something might go south, but they were mostly ignored and blamed on the usual suspect – Java and its garbage collector. From time to time someone tried to do something about the growing problem, and the usual thing happened – a JVM heap increase. That's right, the JVM heap was bumped to 20 GB per application; they had loads of RAM, so why not?

The monster matured at the beginning of 2016.

The load on the server, especially the I/O traffic, was so high that twice a day Pacemaker failed to check the status of the applications; believing the apps were down, it restarted them, causing even more load. You can imagine how happy people were when they were editing a Confluence page, pressed the "Save" button, and the F-word campaign started.
One thing that did go well was the uptime. At the time of writing, the server's uptime is about 480 days. That sort of number is pretty impressive, but one might ask: did they ever update anything at all? I guess I don't have to answer that question…

 

How we fought the LOAD Monster

As I've mentioned before, the bottleneck was the I/O subsystem of the cluster. Having so much CPU power and lots of memory was pretty much useless when the CPUs had to wait for data to arrive. Our experience with DRBD was rather limited at the time, so the first thought was to replace it with GlusterFS or some external storage array. Unluckily, the customer did not agree to any changes to the cluster architecture. The justification was that a migration to newer hardware was already planned, which was expected to improve overall performance and user experience. Knowing all that, we started digging into DRBD and looking at how we could bump the performance of the I/O subsystem.

DRBD

Since each of our DRBD nodes sat on a 1.2 TB MegaRAID block device, the first thing we looked at was the I/O scheduler. We knew that the hard work of caching and scheduling is done by the hardware controller, so the operating system should not double up on that work. We picked deadline, a simple and almost FIFO [first in, first out] I/O scheduler, and it worked pretty well. We also disabled front merges and reduced the read I/O deadline to 150 milliseconds and the write I/O deadline to 1500 milliseconds to improve latency and cut I/O overhead.

echo deadline > /sys/block/sdx/queue/scheduler
echo 0 > /sys/block/sdx/queue/iosched/front_merges
echo 150 > /sys/block/sdx/queue/iosched/read_expire
echo 1500 > /sys/block/sdx/queue/iosched/write_expire
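
One caveat worth mentioning: values echoed into sysfs are gone after a reboot. On CentOS 6 a simple way to make them stick (assuming the device really is sdx and you are not handling this with a udev rule instead) is to repeat the same commands in /etc/rc.local:

# /etc/rc.local – re-apply the scheduler tuning at boot (sdx is a placeholder device name)
echo deadline > /sys/block/sdx/queue/scheduler
echo 0 > /sys/block/sdx/queue/iosched/front_merges
echo 150 > /sys/block/sdx/queue/iosched/read_expire
echo 1500 > /sys/block/sdx/queue/iosched/write_expire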

JIRA indexes

As Atlassian Experts, we know that JIRA relies heavily on its issue indexes, which are stored on the local file system. Their purpose is to offload the most frequent application tasks to local files instead of running multiple queries against the database. This does not come for free: while it reduces the database's workload, it generates huge I/O traffic on the very device we had already identified as the bottleneck. We decided to move the indexes to a location with far higher I/O throughput – a RAM disk. Since the data stored in the application cache does not need to be persistent and can be rebuilt at any time, there is no risk in keeping it in RAM. The only downside of this solution is that the issue index has to be rebuilt every time the server restarts, which may take several minutes depending on the size of the JIRA instance. But in our case restarts don't happen too often – remember the uptime?
Just to make sure things don't get out of control, it is good practice to limit the size of your ramdisks. Determining the ideal size is not easy, as it is mostly driven by two factors – the number of issues and the number of custom fields. Both tend to change a lot over time, so some exceptional forecasting skills may be required to get the numbers perfect, but… we do have a lot of spare RAM, don't we?

tmpfs /atlassian/jira/caches tmpfs nodev,nosuid,nodiratime,size=4096M 0 0

Note: Make sure your ramdisk is at least twice as big as your index directory, otherwise JIRA won’t be able to do a background reindex.
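
To pick a sensible size, a quick check of the current index footprint goes a long way – the paths below are only examples, adjust them to your own JIRA home and mount point:

# how big is the issue index right now? (example JIRA home layout)
du -sh /atlassian/jira/home/caches/indexes
# after mounting the tmpfs, confirm its size and current usage
df -h /atlassian/jira/caches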

Apache httpd access/error logs

Apache httpd was another heavy writer to the block device in our DRBD cluster. Someone in the past had put together pretty nice rewrite rules for all the applications, so that employees could use friendly DNS names backed by mod_proxy for redirecting HTTP requests. That very same person had also enabled rewrite logging to test those rules, but guess what? It was still enabled when we looked at the httpd config, logging all that unnecessary data on top of the standard access/error logs. Looking at the traffic and the amount of data, we decided to move the logs off DRBD altogether; at the same time we fixed logrotate and the log verbosity, so only useful data gets to the logs now. Yet another I/O traffic saver.
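
For illustration only – directive names differ between httpd 2.2 and 2.4, and our paths were specific to that box – the change boiled down to something like this:

# httpd.conf (httpd 2.2 on CentOS 6): silence the rewrite log, keep access/error
# logs on local, non-DRBD storage – paths are examples
RewriteLogLevel 0
CustomLog /var/log/httpd/access_log combined
ErrorLog  /var/log/httpd/error_log

# /etc/logrotate.d/httpd: rotate daily, compress, keep a week of history
/var/log/httpd/*log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    sharedscripts
    postrotate
        /sbin/service httpd reload > /dev/null 2>&1 || true
    endscript
}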

MySQL Database

The original plan was to rely on DRBD's block-level replication for clustering, so the MySQL data was also located on that very same block device, allowing MySQL failover without a typical master-(slave|master) replica. That was pretty much OK when the customer had roughly 200 users using only JIRA, but when that number multiplied by 10, things started going bad. The databases of all the instances grew rapidly, and with them the number of bits and bytes that needed to be synchronised over DRBD. More users also meant more SQL operations, and adding ever more sophisticated plugins doing all sorts of JOINs did not help either.
In our experience, all database servers are pretty much the same in one respect: they need loads of RAM to perform well. That is especially true for a MySQL instance sitting on a relatively slow block device. So once again we started tuning the damn thing. Luckily for us, MySQL has a looong history of all sorts of performance issues, and there are tools like mysqltuner.pl that make diagnostics pretty easy. We already had all the RAM we needed, and with the help of a few rounds of mysqltuner.pl && /etc/init.d/mysql restart we reached a state where we cache pretty much everything:

[mysqld]
character-set-server=utf8
 
datadir=/atlassian/mysql
socket=/var/lib/mysql/mysql.sock
user=mysql
transaction-isolation = READ-COMMITTED
max_allowed_packet = 150M

max_connections = 250
query_cache_size = 2G
key_buffer_size = 2G
thread_cache_size=12
tmp_table_size = 256M
max_heap_table_size = 256M
table_open_cache = 2000

innodb_buffer_pool_size=8G
innodb_log_buffer_size=32M
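
After a restart it is worth checking that the running server really picked up the new values, and re-running the tuner once it has seen a few days of normal traffic – a quick sanity check, nothing more:

# confirm the live values match my.cnf
mysql -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size'"
mysql -e "SHOW VARIABLES LIKE 'query_cache_size'"
# run the tuner again after the caches have warmed up
perl mysqltuner.pl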

Database content

We couldn't do much about the schema design, as that is under the control of the applications themselves. What we did instead was ensure that all tables use the InnoDB engine and that innodb_file_per_table is enabled, to avoid problems with one huge, ever-growing shared InnoDB tablespace. We've also implemented a procedure for periodic table defragmentation.
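
A minimal sketch of what that boils down to, assuming database and table names like the ones below (they are examples, not the customer's real schema):

# list tables that are still not on InnoDB, skipping MySQL's own schemas
mysql -e "SELECT table_schema, table_name, engine FROM information_schema.tables WHERE engine <> 'InnoDB' AND table_schema NOT IN ('mysql','information_schema','performance_schema')"

# rebuild (and thus defragment) a table – on InnoDB, OPTIMIZE maps to a recreate + analyze
mysql -e "ALTER TABLE jira.someplugin_table ENGINE=InnoDB"
mysql -e "OPTIMIZE TABLE jira.jiraissue"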

NUMA and numad

The Non-Uniform Memory Access architecture of the IBM System x3690 X5 servers is a blessing and a curse at the same time. Back in the day, CPUs used a memory controller on a separate chip, the so-called northbridge. This design turned out to be insufficient for modern CPUs, as the time needed to reach RAM was too long, causing the cores to stall and performance to degrade. Engineers at Intel and AMD decided to move the memory controller onto the CPU itself, shortening access times. This is especially great for single-socket units where everything sits on the same chip, but what if there are multiple CPU sockets on the same mainboard?
The answer is QPI [on Intel], the bus that connects all the CPUs into a sort of logical cluster, allowing each of them to allocate memory from banks attached to the other chips. But wait… isn't that the same story as with the northbridge architecture? Sadly, it is… at least as far as performance goes. The rule of thumb is that a process should not allocate memory across NUMA nodes, so the data does not need to travel over the interconnect. The Linux kernel has improved memory allocation on NUMA-enabled systems, but CentOS 6 still runs a 2.6.32 kernel and simply needs help.
There are several ways to manage memory allocation on NUMA-enabled servers, one of them being numactl, but we needed something dynamic, especially for all the JVMs hosting the Atlassian applications. Lucky for us, Red Hat knew what's cookin' and developed a daemon called numad, which runs in the background and tries to make processes allocate memory within a single NUMA node. I have to say that numad is not a perfect solution; it is beneficial mainly for long-running applications like JVMs, while things like quick Bamboo builds would not gain much from it. The recommendation is to restart each of the running apps after enabling numad, as we faced several crashes when numad tried to move big chunks of memory between NUMA nodes.

So the simple answer is:

yum install numad
/etc/init.d/numad start
chkconfig numad on
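
To see whether numad is actually helping, the stock tools are enough – numactl shows the topology and numastat the per-node counters (the process name below is just an example):

# NUMA topology of the box: nodes, CPUs and memory per node
numactl --hardware
# per-node allocation statistics; numa_miss and numa_foreign should stay low
numastat
# per-process view of where a JVM's memory actually lives (RHEL/CentOS 6 numastat supports -p)
numastat -p java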

Backup script

We have a very strong development team, so fixing one backup script was not an issue, and with a few simple tricks we managed to bring the time needed for a complete backup down from over 24 hours to just 6 hours. Remember that there is ~220 GB of data to be processed each day. The first thing was to switch the compression from the old bzip2 to gzip, and that alone did most of the job. We also added excludes for irrelevant data such as application caches and catalina.out logs. From a design perspective, we separated the MySQL backups from the data backups, which are now also split per application, so instead of one huge backup file we have several smaller ones. This lets us restore faster if needed.
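
A stripped-down sketch of the resulting layout – the paths, application list and database names are made up for illustration, and the real script obviously does more error handling:

#!/bin/bash
# one archive per application plus a separate dump per database
DATE=$(date +%F)
DEST=/backup/$DATE
mkdir -p "$DEST"

# application data, compressed with gzip, skipping caches and catalina.out
for APP in jira confluence bitbucket bamboo crowd fisheye; do
    tar -czf "$DEST/${APP}-data.tar.gz" \
        --exclude='*/caches/*' \
        --exclude='*/logs/catalina.out' \
        "/atlassian/$APP"
done

# MySQL dumps kept separate so a single application can be restored quickly
for DB in jiradb confluencedb bitbucketdb; do
    mysqldump --single-transaction "$DB" | gzip > "$DEST/${DB}.sql.gz"
done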

The current state

Well, now everyone seems to be happy.
The applications are running much faster, which was even noted by the customer – and I know that does not happen a lot. The backup runs during the night and no longer bothers people while they are working. We as a Support Team have less firefighting to do and can focus on future improvements.
One of those would be enabling MTU 9000 jumbo frames between the DRBD servers; we are just waiting for the customer to enable this on the hardware side. We are also looking into CPU masks to isolate and prioritise DRBD computations.
The biggest change of all will be the upcoming migration of the JIRA and Confluence applications to the new cluster, which will let us reduce the impact of Bamboo builds on these two apps.
We hope we will be careful enough not to grow another monster, as fighting two of them at the same time is not possible even for Marvel heroes.

– NetworkedAssets Support Team
