Dec 172014

Classic overflow problem. Bottom-line is that it’s all about IO…. Logs hit both memory and disk.

I should learn to take my own advice. I have always minimized disk hits on my server instead of starting with database caching code.

Recently one of my projects server received higher than normal traffic and it killed the mysqld process. There was no way to keep the dB running as it would get terminated. This is bad…. Really bad. The mysqld was fine, queries are cached, not enough connections either. Problem was it was still using the most amount of memory. With low memory and swap out of space,  the os decided to kill the memory consuming process. I still need to dig deeper into how nix os decides on this. In any case this should not have happened. There is more than enough memory and the traffic wasn’t nearly high enough and the CPU usage was low.

The culprit for memory hogging was rsyslogd process. The second was php-fpm children. Normally I recommend using ramdisk type of location for logs or simply logging stuff only to a remote collector and no local logging. In this instance several logs were split up and some were in verbose mode. So despite being able to support 300 requests per second the site was barely keeping up with 10. The problem with logging is the Disk IO. It is still a write and no matter how much we optimize database , if the IO is spent elsewhere there is still a problem. The problem was more pronounced in this case because of software Raid 5  which is infamous for additional IO overhead and least return on the investment.


First thing I did to get back the mysql server to run and stay up was stop rsyslog daemon. Then cleared up my rsyslog configuration. I now only log everything via UDP to Splunk Storm. On AWS I used to run a collector written in Twisted Python, It was a simple script and it still work well. To be posted to github by end of this post.


PHP-FPM children:

I use dynamic allocation but the max child was set to 100 on one of the fpm configs. I lowered it to 20. There is really no reason to have 100 child process specially since I use fpm virtual hosts config to split different sites across different users and each site maintains it’s own environment. Each fpm vhost or virtual user had between 20-50 max children.

The overall memory improvement was about 65% and I was able to  serve 100 req/second again without problems and without forking out more money. This could explain why my AWS deployments used fewer resources as opposed to those which I have analysed and found them to spending 5-10 times more on instance usage and serving less a tenth of traffic.


Good luck with your Adventures and look before you upgrade.



Update: Another log I disabled but forgot to mention is Apache/NGinx access logs but kept the error logging to “critical” only. This dropped an additional 90% of IO usage and reduced memory usage to 25%.

If you use CDN services or remote logging then you’d be better off either way. Services like Akamai offer to log at the “Edge” which is far more useful solution. I am going to write another post on this.