Dec 27 2014
 

What is Offloading?

Offloading is, in essence, transferring resource consumption from one piece of infrastructure to another without affecting the overall functionality of the service, while retaining full control over the transactional components.

Offload is therefore never 100% unless you run a fully static website, but it can get close to 100% even for dynamic sites.

We know that using a CDN like Akamai (or what Akamai calls Dynamic Site Accelerator) for static files and the like can offload much of the work from your servers. This provides a significant load reduction in the data center, adds scalability to your infrastructure, and gives a better experience to the end user. Akamai offers a report to quantify the offload, but it can only show you the offload of traffic passing through Akamai, so it is really a network offload figure. That does translate into some disk I/O and CPU offload in your data center, but it may not be very significant.

Akamai also offers a bundled service that writes your access logs to their own 100%-SLA storage (NetStorage), which is extremely underrated and generally ignored. As I mentioned in my previous post about how logs can bring down your own servers, this logging service can really help. Here is an example from my setup: an apache2 web server that is already behind a CDN.

The Test

At 10 req/sec via the CDN (as measured by Google Analytics), my test server was using the following resources:

CPU: 105% average, peaking at 146%, on a quad-core Xeon-class CPU

Disk I/O: 14 MB/s overall, with apache2 alone using 10 MB/s (that's megabytes)

Memory: 3.3 GB (mostly apache2 alone)

Apache workers: 21, process count: 43

[Linode Longview showing high CPU, disk, and network usage]

I disabled access logging for the site altogether and kept error logging for critical events only; 503 errors, for example, still get logged. The error logs feed the fail2ban service, so they are needed to dynamically block IPs attempting funny stuff (Akamai offers the same at the edge as well, but I am not using that). I could disable the access logs because they are already available on Akamai in Apache common log format, with additional fields like referrer, cookies, and full headers if you need them, and logging there has zero impact on the service since it is all offloaded to Akamai.
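For reference, here is a minimal sketch of what the vhost ends up looking like (the hostname and paths are illustrative; the directives are standard Apache):

    <VirtualHost *:80>
        ServerName www.example.com
        DocumentRoot /var/www/example

        # No CustomLog directive at all: access logs live on Akamai NetStorage.

        # Keep the error log, but only for critical events (these feed fail2ban).
        ErrorLog /var/log/apache2/example-error.log
        LogLevel crit
    </VirtualHost>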

Folks, data transfer is cheap, but CPU and memory are not. When you buy a service like Akamai you cannot rely on it alone to solve all your problems, but if you are not being charged extra for the CPU usage, you might as well make the most of it and maximize the offload. Here is what I got after disabling the server/load balancer logs.

Now at 45 req/sec (more than four times the original load):

CPU: 10% average

Apache workers: 7-9, process count: 21

Memory: 900 MB average (again, mostly apache2)

Disk I/O: 7 KB/s for apache2 (1.5 MB/s average overall)

OK, the disk I/O needs more explanation. Other processes, such as the database server, run on the same host, and they all share the same resource-constrained mechanical disk. When Apache was writing 10,000 KB/s it caused contention, and the other processes needed longer to complete their transactions. With the web server's disk I/O out of the picture, that bottleneck is significantly reduced, which indirectly helps the CPU as well.
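If you want to watch this kind of contention yourself, per-process I/O is easy to observe; both of these tools are in the standard repos (run them as root):

    # Show only processes currently doing I/O, with accumulated totals
    iotop -o -a

    # Per-process disk read/write rates, sampled every 5 seconds
    pidstat -d 5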

See for yourself.


[Screenshot: Linode Longview, 2014-12-27 08:50]

Note that by the time I took the screenshot, traffic had climbed to 75 req/sec. Normally this would require aggressive caching or adding nodes; this time I had to do neither.

The solution is right there, yet hardly anyone actually uses it. I hope that changes as more sysadmins catch on. And to think of the time folks spend on database caching, memcache, and the like.

 

Tools and Services used:

Linode
Linode Longview
Apache2 with mod_status (config sketch below), on CentOS
Akamai DSA with NetStorage logging
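The worker and process counts above are the kind of thing mod_status reports; the usual setup is just a few lines (shown in Apache 2.2 syntax, which is what CentOS shipped at the time):

    ExtendedStatus On
    <Location /server-status>
        SetHandler server-status
        Order deny,allow
        Deny from all
        Allow from 127.0.0.1
    </Location>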

 

Dec 17 2014
 

A classic overflow problem. The bottom line is that it's all about I/O… logs hit both memory and disk.

I should learn to take my own advice: I have always minimized disk hits on my servers rather than starting with database caching code.

Recently, one of my project servers received higher-than-normal traffic and it killed the mysqld process. There was no way to keep the DB running; it kept getting terminated. This is bad… really bad. mysqld itself was fine: queries were cached, and connections weren't exhausted either. The problem was that it was still the largest consumer of memory, and with memory low and swap out of space, the OS decided to kill the biggest memory-consuming process. I still need to dig deeper into how a *nix OS decides this. In any case, this should not have happened: there was more than enough memory, traffic wasn't nearly high enough, and CPU usage was low.

The memory-hogging culprit was the rsyslogd process; the second was the php-fpm children. Normally I recommend a ramdisk-type location for logs, or simply shipping everything to a remote collector with no local logging at all. In this instance, the logs were split across several files and some were in verbose mode. So despite being capable of 300 requests per second, the site was barely keeping up with 10. The problem with logging is the disk I/O: every log line is still a write, and no matter how much we optimize the database, if the I/O is spent elsewhere there is still a problem. It was more pronounced in this case because of software RAID 5, which is infamous for extra I/O overhead and the least return on the investment.
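The ramdisk approach, for the curious, is a one-line fstab entry; anything written under the mount point never touches the disk (the size and path here are illustrative, and the contents vanish on reboot):

    tmpfs  /var/log/ramdisk  tmpfs  defaults,size=256m  0  0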

 

The first thing I did to get the MySQL server running and keep it up was to stop the rsyslog daemon. Then I cleaned up my rsyslog configuration: I now only log everything via UDP to Splunk Storm. On AWS I used to run a collector written in Twisted Python; it was a simple script and it still works well. To be posted to GitHub by the end of this post.
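Shipping everything off-box in rsyslog is a single rule; a lone @ means UDP (the host and port are placeholders):

    *.* @logs.example.com:514

And for the collector side, here is a minimal sketch of the kind of Twisted script I mean; this is illustrative, not the exact script I will be posting:

    # Minimal UDP syslog collector sketch using Twisted (illustrative only).
    from twisted.internet import reactor
    from twisted.internet.protocol import DatagramProtocol

    class SyslogCollector(DatagramProtocol):
        def __init__(self, logfile):
            self.logfile = logfile

        def datagramReceived(self, data, addr):
            # One UDP datagram is one syslog message; tag it with the sender's IP.
            host, port = addr
            line = data.decode("utf-8", "replace").rstrip()
            self.logfile.write("%s %s\n" % (host, line))
            self.logfile.flush()

    if __name__ == "__main__":
        with open("/var/log/collector.log", "a") as f:
            reactor.listenUDP(514, SyslogCollector(f))
            reactor.run()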

 

PHP-FPM children:

I use dynamic allocation, but pm.max_children was set to 100 in one of the FPM configs. I lowered it to 20. There is really no reason to have 100 child processes, especially since I use per-site FPM pools to split different sites across different users, with each site maintaining its own environment. Each FPM pool (virtual user) had between 20 and 50 max children.
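A sketch of what one of those per-site pools looks like after the change (the pool name, user, and socket path are illustrative):

    [site1]
    user = site1
    group = site1
    listen = /var/run/php-fpm-site1.sock

    pm = dynamic
    pm.max_children = 20      ; was 100 -- far more than one site needs
    pm.start_servers = 4
    pm.min_spare_servers = 2
    pm.max_spare_servers = 6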

The overall memory improvement was about 65%, and I was able to serve 100 req/sec again without problems and without forking out more money. This could explain why my AWS deployments use fewer resources than others I have analysed, which were spending 5-10 times more on instance usage while serving less than a tenth of the traffic.

 

Good luck with your adventures, and look before you upgrade.

 

 

Update: Another log I disabled but forgot to mention is the Apache/Nginx access log, while keeping error logging at “critical” only. This dropped an additional 90% of I/O usage and reduced memory usage to 25%.
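In Nginx terms that is just two directives (the error log path is illustrative):

    access_log off;
    error_log /var/log/nginx/error.log crit;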

If you use CDN services or remote logging, you'd be better off either way. Services like Akamai can log at the “edge”, which is a far more useful solution. I am going to write another post on this.