Dec 20 2012
 

I have had this site hosted on Redhat Openshift for almost a year. Considering I got this hosting free (and you can too) I was a bit apprehensive about whether I should move the site elsewhere. I let it be here anyway. It was surprising that not only do I get awesome performance, but the uptime has been incredible. I was going to set this blog up on an AWS EC2 micro instance. However, the micro instance being what it is, it would have cost as much as a Linode VPS and would run on a time-shared CPU. This can get very annoying, as you find that on and off the site gets slower.

Redhat Openshift offers a free micro instance equivalent, the difference being that you are probably on a much bigger instance, since the PaaS runs atop the AWS cloud, making this setup akin to a VPS. This makes more sense than spending on a small instance or settling for a micro. In fact I don’t recommend the micro instance at all for any purpose other than testing, compiling or other such processes where on-demand CPU is not necessary.

Comparatively, my Linode VPS, which costs me $40 a month, is not doing as well in terms of performance when I use it to host WordPress sites. I am still not clear why the memory usage and swap are higher on the Linode VPS (perhaps the CPU is oversubscribed), but this Openshift instance is at 512 MB RAM and doing just great.

If you are a developer who does not want to get into the hassle of setting up servers and services and just wants to get down to coding your stuff, I recommend you give Redhat Openshift a try. You will not be disappointed. Especially if you build sites for your clients.

Considering the way things are changing with PaaS and cloud, the price being what it is, I am wondering why I still put up with my Dreamhost account, which is barely usable and hosts thousands of users and sites on a single server. I could not even do basic PHP development and testing on it. The same goes for pretty much any webhost or reseller like Godaddy, Media Temple and blah blah.

 

Since I also use Google App Engine for development and learning, it is worth adding why you would choose Redhat over App Engine. Familiarity is possibly number 1. Granted, App Engine supports MySQL now, but it remains that you have shell access to your instance on Openshift, much as you would on your own instance. You can also access some basic metrics, and new services are being built all the time on Openshift. Check out the recently launched WebSockets beta here: https://openshift.redhat.com/community/blogs/newest-release-websockets-port-forwarding-more

 

My favorite Python web framework Flask is effing supported as well: https://openshift.redhat.com/community/get-started/flask . I cannot describe how much pain is involved in hosting these Python apps on just about any distro. I think I am going to set up my Flask sites over at Openshift. Of course Django is supported too. I have tried neither of them on Openshift yet.
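
For the curious, a Flask app on Openshift is not much more than the usual hello world. If I recall the quickstart linked above correctly, the Python cartridge looks for a WSGI callable named application in a wsgi/application file, but treat the file layout here as an assumption and follow the quickstart for the authoritative structure. A minimal sketch:

# wsgi/application -- minimal Flask app for Openshift's Python cartridge
# (file location and entry-point name assumed; see the quickstart link above)
from flask import Flask

app = Flask(__name__)

@app.route('/')
def index():
    return 'Hello from Flask on Openshift'

# The cartridge serves whatever WSGI callable is named "application".
application = app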

 

Now that I am confident about Openshift, here are the things I would like to learn to get to a production deployment of my Python projects.

  1. How do I add SSL certificates?
  2. How do I enable autoscaling? (I suppose this is just to do with your AWS account, and of course it seems you need Openshift Enterprise for it.)
  3. How do I use my existing RDS with Openshift (and securely)?
  4. I am sure there is a whole bunch of things I haven’t thought of yet.
All of the above is in the docs somewhere.
Concern: Given that Redhat’s own Enterprise support is not highly regarded among devs and ops, I wonder what Openshift “Enterprise” will do for us.
OpenShift Enterprise
Truth is, I do not have experience with Redhat Enterprise support or with the OpenShift Enterprise service. The question is, should I be the one to get that first-hand experience myself? Or to bet my job on it? That is going to take some daring. xD

 

Dec 18 2012
 

Despite most of my work involving PHP setups, I have found Python to be the most useful tool for a whole bunch of supporting tasks. One of them is running commands or deploying packages to Amazon EC2 instances.
For a really large setup this is a very cool way to get the job done. Imagine that you have 10-12 servers and autoscaling tends to change the number of servers every now and then. Let’s say you wanted to git-update all the servers with the latest copy of the code, or restart a process. You can do this with one command over SSH. Yes, but how on all the servers? So you search for “parallel SSH”. Seems all fine and dandy till you realize you still need to list all the hostnames. “Parallel SSH, why you no just read my mind”. We are going to make something like parallel SSH really quickly that works on AWS and is easy to bend to whatever you want it to do.

This approach is, well, cross-platform I suppose; all you need is to be able to run Python (2.5 or higher). I am not going into in-depth details. I want to show you how you can do this yourself and make you a believer. Then of course I recommend you do further reading. There is a lot of literature out there but no working example that does what I am showing you here. Once you get the idea, you will be an unstoppable crazy lunatic and be quite pleased with your megalomaniac self. Back to reality….

 

Prepare

Fabric: prepare your python install by installing this package.

Boto: Next you need the Boto packages for Python. Install that too.
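
Both install cleanly with pip (or easy_install if that is what you have around), assuming a working Python 2.5+ setup:

$ pip install fabric boto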

Get your AWS security keys with at least the permission to read information about ALL EC2 instances. You don’t need any more than that if you just want to SSH into the systems.

Also prepare your SSH key, of course. Place it anywhere and now you can begin writing some code.

There are two parts to this.

1st part:

Use Boto to choose your EC2 Instances.

All instances have some attributes. Plus, good DevOps folks always tag their instances… you tagged your instances, didn’t you? Well, no matter.

Code below (ignore fabric references, we’ll get to that in a bit)

fabfile.py (the name fabfile.py is important or it won’t work)

import boto
from fabric.api import env, run, parallel

AWS_ACCESS_KEY_ID = 'GET_YOUR_OWN'
AWS_SECRET_ACCESS_KEY = 'GET_TO_DAT_CHOPPA'

def set_hosts():
    from boto.ec2.connection import EC2Connection

    ec2conn = EC2Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
    i = []
    for reservation in ec2conn.get_all_instances(filters={'key-name': 'privatekey',
                                                          'instance-state-name': 'running',
                                                          'root-device-type': 'instance-store'}):
        #print reservation.instances (for debug)
        for host in reservation.instances:
            # build the user@hostname string for ssh to be used later
            i.append('USERNAME@'+str(host.public_dns_name))

    return i

env.key_filename = ['/path/to/key/privatekey.pem']
env.hosts = set_hosts()

Quick explanation: I have used a couple of filters above, passed via the keyword “filters” in the code. Here I chose to filter by the private key name for a bunch of servers that are in the “running” state (always good to have) and whose root-device-type is instance-store. Now, if you had tagged your servers, the key/value filter would look like this:

{'tag:Role': 'WobblyWebFrontend'}

 

You can use the reference here to find more filters. We are basically filtering by instance metadata, or what you would get from the ec2 describe-instances command. You can even use the autoscale group name. The idea is for you to select the servers you want to run a particular command on, so that is left to you. “I can only show you the code, you must debug it on your own”. lulz
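
If you went the tagging route, the same function can select by tag instead of key name. A quick variant (the tag name and value below are made up, and this assumes the same credentials and imports as the fabfile above):

def set_hosts_by_tag():
    from boto.ec2.connection import EC2Connection

    ec2conn = EC2Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
    i = []
    # running instances tagged Role=WobblyWebFrontend (hypothetical tag)
    for reservation in ec2conn.get_all_instances(filters={'tag:Role': 'WobblyWebFrontend',
                                                          'instance-state-name': 'running'}):
        for host in reservation.instances:
            i.append('USERNAME@' + str(host.public_dns_name))

    return i
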
2nd part:

Now, assuming I am right and you have chosen all the servers that you need to run your command on, we are going to write those commands in the same file.

Notice that the function set_hosts returns a list of hostnames. That’s all we needed.

continued…fabfile.py

@parallel
def uptime():
    # run() takes just the command; the hosts come from env.hosts set above
    run('uptime')

 

…and we are done with coding. No really.

cd into the directory where you saved fabfile.py and run the program like so:

$fab uptime

Satisfying splurge of output on the command line follows….

No wait! What happened here?
When you invoke the command “fab”, fabric looks for the fabfile.py and runs the function that matches the first argument. So you can keep writing multiple functions for, say, “svn checkout”, “wget” or “shutdown now”, whatever; a couple of examples follow.
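
For instance, extra tasks in the same fabfile.py could look like this; the repo path and restart command are placeholders, so adjust them to your own setup. You would run them with “fab update_code” or “fab restart_web”.

@parallel
def update_code():
    # pull the latest copy of the code on every host (path is hypothetical)
    run('cd /var/www/myapp && git pull')

@parallel
def restart_web():
    # restart the web server on every host (command is hypothetical)
    run('sudo service nginx restart')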

The @parallel decorator before the function tells fabric to execute the command “SIMULFKINGTANEOUSLY” on all servers. That is your parallel SSH.

May the force be with you. Although I know the look on your face right now.

Dec 18 2012
 

There is already a nice port of the rewrite rules for lighttpd available somewhere on the gallery2 site.
I found it quite a challenge to get nginx to work with gallery2, for several reasons, one of them being that there was no written example of working nginx rules.
Gallery2 allows you to control which URLs you want to rewrite. Here I will rewrite a few of them as an example of the overall implementation. Anyone looking to do this can simply adapt it for their own use case. Just a note that you should adapt and test the setup instead of copy-pasting it as is. Let’s get right to it then.

 

server {
        listen       IP_ADDRESS;
        server_name  HOST_NAME;

    rewrite_log on;
    root        /path/to/gallery2;
    index       index.php index.html index.htm main.php;

#file upload size that you have configured in PHP and G2
        client_max_body_size 2m;
        client_body_buffer_size 256k;

#albums#
        location /v/
{
    if ($request_uri !~ /main.php)
    {
        rewrite ^/v/(.*)$ /main.php?g2_view=core.ShowItem&g2_path=$1 last;
    }
}

#photos#
location /d/
{

   if ($request_uri !~ /main.php)
   {

        rewrite /d/(\d+)-(\d+)/([^/?]+) /main.php?g2_view=core.DownloadItem&g2_itemId=$1&g2_serialNumber=$2&g2_fileName=$3 last;
    }

}

        location /lib/yui/ {
                alias /path/to/gallery2/lib/yui/;
                break;
        }

## Dynamic Albums Plugin and RSS (optional)###
        location /updates {
        rewrite /updates(.*)$ /main.php?g2_view=dynamicalbum.UpdatesAlbum last;
        }

        location /popular {
        rewrite /popular(.*)$ /main.php?g2_view=dynamicalbum.PopularAlbum last;
        }

        location /srss/ {
        rewrite /srss/(\d+)$ /main.php?g2_view=rss.SimpleRender&g2_itemId=$1 last;
        }
##  End Dynamic Albums ###

        location / {
                try_files $uri $uri /main.php?url=$1;
                #rewrite ^/(.+)$ /main.php?url=$1 last;
        }

        location ~ \.php$ {
        fastcgi_pass   127.0.0.1:9000;
        fastcgi_index  main.php;
        fastcgi_intercept_errors on; # to support 404s for PHP files not found
        fastcgi_split_path_info ^(.+\.php)(/.+)$;
        fastcgi_param  PATH_INFO          $fastcgi_path_info;
        fastcgi_param PATH_TRANSLATED $document_root$fastcgi_path_info;
        fastcgi_param  SCRIPT_FILENAME $document_root$fastcgi_script_name;
         fastcgi_param REMOTE_ADDR $remote_addr;
        include        fastcgi_params;
                fastcgi_buffer_size 4k;
                fastcgi_buffers 256 4k;
                fastcgi_busy_buffers_size 16k;
                fastcgi_read_timeout 300s;
        }

        # Static files.
        # Set expire headers, Turn off access log
        location ~* favicon\.ico$ {
                access_log off;
                expires 1d;
                add_header Cache-Control public;
        }
        location ~ ^/(img|cjs|ccss)/ {
                access_log off;
                expires 7d;
                add_header Cache-Control public;
        }

    # Deny access to .htaccess files,
    # git & svn repositories, etc
    location ~ /(\.ht|\.git|\.svn) {
        deny  all;
    }

    }

This is a working setup. As it is with nginx, there is no one way to do the same thing, so feel free to suggest your improvements for everyone’s benefit. Note that I know some of the things, like “try_files”, are not recommended, but I used what works.

Known Issue:
Gallery2 generates URLs for filenames with spaces by substituting a + sign. Therefore http://example.com/gallery2/v/a+picture+with+spaces.jpg will not work with nginx at all. I did not find an nginx solution for this, but it seems G2 throws a security violation error. You can of course edit the code and handle it, because it seems apache handles “+” differently (URL encoding), the details of which I don’t want to get into. Suffice to say, it is manageable and I was able to do so, though it may not be necessary for everyone. If anyone requires that fix I can provide it, though I am not sure if that is the best way to do it.

Gallery2 is still in use. I have not tried nginx with Gallery 3 (“G3”). I did do an install on regular apache2 for testing purposes and that’s about it.

Dec 17 2012
 

Recently, in discussions, I have noted several admins and managers complaining that AWS autoscale is not really cutting it for them. Some of the issues reported were:

  • Autoscale takes time to respond to unexpected surges in traffic, thereby causing intermittent downtime for several users.
  • Scale-down (scaling-in) causes severe issues, and web app errors are thrown.

Other reported problems are broadly similar to the two above. A few suggestions on handling the issue, and an understanding of how autoscale works, will help.

Before going any further, let me remind you: autoscale is a slow process. It is not designed to help with a “sudden” and “unexpected” surge in traffic. That being said, it is doable, because “unexpected” here is a human component. Just because Amazon AWS made autoscale does not mean you can throw capacity planning out the window.

You need to ask these questions and delve into your autoscale setup in detail.

1. How do you set your triggers?
You can set triggers based on time. For example, if you expect every Friday to be a high-traffic day, don’t use CPU as an indicator. Just pre-launch instances on Friday, automate it and forget it (a quick boto sketch follows below). Shut them down when traffic reaches its lows.
If you are using CPU or another metric, it is important to look at two factors: the threshold for the metric (e.g. CPU) and the time range. A lot of it is trial and error and depends on your usage scenario. I will take one example here. I expect my 3 web app instances to generally stay between 60-70% aggregate CPU throughout the day, and no more than 15 minutes on anything above 75%, which is possible when deploying large OS updates and app updates.
If I set my time range to 30 minutes and my CPU threshold to 75% it would sound logical, but it is not. Traffic does not work that way, and neither do averages. If you get a sudden traffic spike it cannot be helped, since we only ever keep 25% additional headroom. The problem is also easily miscalculated because of the aggregate CPU measure.
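
As a concrete sketch of the time-based approach mentioned above, here is roughly how a pre-launch schedule could be set up with boto. The group name, dates and capacity numbers are all placeholders, and I am assuming the autoscaling group already exists and that your keys are in your boto config or environment.

from datetime import datetime
import boto.ec2.autoscale

conn = boto.ec2.autoscale.connect_to_region('us-east-1')

# Bump the group to 6 instances before the expected Friday rush...
conn.create_scheduled_group_action('my-web-asg', 'friday-prelaunch',
                                   time=datetime(2012, 12, 21, 6, 0),
                                   desired_capacity=6)

# ...and drop back to 3 once traffic reaches its lows.
conn.create_scheduled_group_action('my-web-asg', 'friday-scale-back',
                                   time=datetime(2012, 12, 22, 2, 0),
                                   desired_capacity=3)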

Spikes are handled with what you have NOW, not with what you can have 15 minutes from now. Therefore try to keep your permanent instance count high enough that you sit at a 40% or lower average under regular traffic, or at 50% of the peak (unexpected) traffic you have seen. By doing this alone you have made your CPU threshold trigger more pronounced, and shortened the time needed to judge whether additional resources are required.

It would seem appropriate to set 60% as the threshold, but we still need the time range.

The CPU aggregate is calculated as an average over a time range, e.g. 300 seconds or 5 minutes. A sample taken every second is added to the total, which is then divided by the 300 seconds.

Therefore if you had 40% CPU for the first second followed by 100% CPU for the next 299 seconds, autoscale would not trigger until the 301st second or thereafter. Do you see the problem here? This is why it is important to anticipate as much peak traffic as you can and prepare in advance. At the end of the day you have to find something that fits your cost, so keeping your own financial allowances in mind, here is what you can additionally look to factor in.
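
To make that arithmetic concrete, here is a throwaway snippet mirroring the example above (per-second samples averaged over a 300-second window):

# 1 second at 40% CPU, then pegged at 100% for the remaining 299 seconds
samples = [40] + [100] * 299
window_avg = sum(samples) / float(len(samples))
print window_avg   # ~99.8%, but only known once the full 300-second window completes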

Estimate your time factor and CPU threshold to be at the point where the traffic is building towards its peak. Using the above example of my case of a 40% average:

[Graph: aws autoscale threshold — aggregate CPU % climbing towards peak over time]

In the graph above we need to be able to launch an instance at 70%, or ideally at the 4.5-minute mark. But there is no way to know whether that very moment is indeed traffic or just a yum update check. Allowing for the benefit of the doubt, we should still be able to keep running at 100% CPU while giving the new instance time to attach to the ELB and start serving requests; let’s say that time is 3 minutes. I have simplified the graph for calculation using absolute numbers, and will simplify the calculation to arithmetic instead of integral effing calculus, because nothing should be that complicated in real life for us.

I develop two scale-up policies. The first one uses a time range of 15 minutes and looks for an aggregate of 70% to launch an additional instance. This is to keep to my below-50% mark at all times. This is straightforward. We do this to ensure that our CPU is not normally at high usage, and to be able to cover some surge in traffic without waiting for additional instances.

I make a second policy which looks at the 3-minute mark, or 5 minutes. Noobie admins will say “5 minutes should be 80%”, but now you know that this is obviously incorrect. The average over 5 minutes in our case is (40+40+45+60+80) / 5 = 53% (assuming a per-minute sample). Detailed monitoring gives you per-minute sampling, which is why your autoscale instances are enabled for detailed monitoring whether you like it or not. This is my theory. Therefore if you were to set a 60% threshold and a 5-minute duration for an additional instance, you should be safe from two different perspectives.
Either policy one would have launched an instance, maintaining your CPU below 50%, or policy two would have. Either of them will satisfy both criteria and you will not be seeing twice the instances.
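
For reference, here is roughly how the first policy could be wired up with boto; the group name, region and numbers are placeholders and your keys are assumed to be in the environment or boto config, so treat it as a sketch rather than a drop-in config.

import boto.ec2.autoscale
import boto.ec2.cloudwatch
from boto.ec2.autoscale import ScalingPolicy
from boto.ec2.cloudwatch import MetricAlarm

as_conn = boto.ec2.autoscale.connect_to_region('us-east-1')
cw_conn = boto.ec2.cloudwatch.connect_to_region('us-east-1')

# Add one instance when the alarm below fires (group name is hypothetical).
policy = ScalingPolicy(name='scale-up-sustained', as_name='my-web-asg',
                       adjustment_type='ChangeInCapacity',
                       scaling_adjustment=1, cooldown=300)
as_conn.create_scaling_policy(policy)
# Fetch it back to get the ARN that CloudWatch needs as an alarm action.
policy = as_conn.get_all_policies(as_group='my-web-asg',
                                  policy_names=['scale-up-sustained'])[0]

# 70% average CPU sustained over 15 minutes (3 periods of 5 minutes).
alarm = MetricAlarm(name='cpu-70pct-15min', namespace='AWS/EC2',
                    metric='CPUUtilization', statistic='Average',
                    comparison='>=', threshold=70,
                    period=300, evaluation_periods=3,
                    alarm_actions=[policy.policy_arn],
                    dimensions={'AutoScalingGroupName': 'my-web-asg'})
cw_conn.create_alarm(alarm)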

All you have to do is ensure that you are planning in advance. Autoscale is not magic. If your company’s marketing is running a campaign that predicts “a large frenzied sale surge”, then despite their failings so far you should nevertheless be prepared and launch additional instances as permanent for that time period. Note here: when I say manually launch, I mean you have manually added the instances to the ELB and want to keep them attached; don’t modify the autoscale policies to do so.

 

 

2. Termination, scaling-in or scaling-down

This is rather tricky. A fact outside the control of Amazon or any provider is that when you shut down any system, it does not really care about the state of the application’s users unless you choose to program that in some way, the way Windows Server warns that users are connected to the Terminal Server. For a web application this is left to us to decide. Once you accept that active connections on a terminating instance will exit with errors, the question is what error rate is acceptable. Say you choose to remove instances at 10% aggregate CPU so you are back to 30-40%: how many users are affected, and in what way? If 10% means 1 million users who are posting memes, then it is something you have to decide. If 10% means just 10 users, but they are involved in commercial transactions worth thousands or millions of dollars, then the problem is not about numbers.

One way to deal with this is to handle errors as gracefully as possible. For example, if you use AJAX-based user registrations you can display a message like “Please resubmit the form, an unknown error has occurred on our end”, or if yours is a streaming app you could try a delayed reconnect. Something along those lines. If the terminated instance is removed from the ELB, then the subsequent request should be directed to an online instance. Thusly averting the much unwanted management trial-room hassles where you have chosen benefit-of-the-doubt (“wtf are you talking about, looks fine to me -.- ”) instead of the long, rather unwieldy technical explanation which may be ill-construed as another lazy excuse (“well it begins like this… you see on Abhishek’s blog… and hence we can do nothing about those… 🙁 ”).

If at all possible I will avoid too many terminations or scale-downs. I would prefer to maintain the equipment that supports any unhinged marketing campaign or a random Reddit front page appearance for entirely irrelevant reasons.

 

Speaking of termination….”I’ll be back”