Apr 06, 2015

I am bored. I have decided to code for WordPress, but instead of using the regular code base I am going to write a fork that is AWS specific, or rather cloud specific, targeting APIs like the Ceph Object Storage API if possible. Since that API is S3-compatible, you could theoretically reuse the AWS S3 code on other platforms like DreamObjects. There are a few WordPress plugins around that copy your images and files to S3 and do basic CDN, but they are inadequate and buggy. For example, on www.GamingIO.com I am using a plugin that pushes to S3, yet the files also stay on local disk, which defeats the purpose. Not to mention that, due to a bug, the image resize functionality in WordPress seems to create several duplicate copies of each image. I had to delete all the images to make space on the site and recovered 50% of it, and got a few broken links for the trouble. But it was a free plugin… what did I expect.

When I was digging through WordPress I noticed how heavily it uses the database. Every post revision goes to the DB. This is fine for small sites that won't see more than 200 posts in their lifetime; not so much for news publishing, where we had 3 people doing 6-9 posts a day with several revisions each. That crapped the database. With application accelerators like Akamai, CloudFlare or even CloudFront it isn't hard to reduce disk IO for images, but the database is the one thing that kills it all. This is where S3 can be useful, and it's what I want to start with.

My setup.

So far… I have looked at the awesome guide for wannabe WordPress developers here. Since I am familiar with WordPress and have played with it a bit, I went ahead with setting up my dev environment.
I chose to go with TurnKey WordPress, even though I hate that they have all these API keys to link their VM to their "cloud". Not to mention that, for a development environment, it lacks any repository management tools for merging and updating code. Seems like it wasn't necessary anyway, but well…

Back to square one, I decided to check out the code onto my dev machine, which has a ready LAMP setup. Did a simple:

svn co http://core.svn.wordpress.org/trunk/

An edit of the config file, populating the test DB, and it's all good. I could load WordPress just fine.

Since the dev machine is connected via Samba to my regular Windows do-it-all desktop, I can access the files and load them in my editor. Normally I use Notepad++ since it doesn't crap up my directory structure, but this time I am going with Eclipse. Yes, I am solid broke after attending Black Hat Asia 2015 (more on that in another post). Even if I had cash to spare, I would prioritize my PyCharm license.

Forward.

It might be difficult to change the WordPress core and keep up with nightly builds. I might just end up with a plugin, but hopefully it won't be a disaster.

Dec 18, 2012

Despite most of my work involving PHP setups, I have found Python to be the most useful tool for a whole bunch of supporting tasks. One of them is running commands on, or deploying packages to, Amazon EC2 instances.
For a really large setup this is a very cool way to get the job done. Imagine you have 10-12 servers and autoscaling keeps changing the number of servers every now and then. Say you want to git-update all the servers with the latest copy of the code, or restart a process. You can do this with one SSH command. Yes, but how on all the servers at once? So you search for "parallel SSH". Seems all fine and dandy until you realize you still need to list all the hostnames. "Parallel SSH, why you no just read my mind?" We are going to build something like parallel SSH really quickly, something that works against AWS and is easy to bend to whatever you want.

This should be cross platform, I suppose; all you need is the ability to run Python (2.5 or higher). I am not going into in-depth details. I want to show you how you can do this yourself and make you a believer. Then of course I recommend further reading. There is a lot of literature out there but no working example that does what I am showing you here. Once you get the idea, you will be an unstoppable crazy lunatic and quite pleased with your megalomaniac self. Back to reality…

 

Prepare

Fabric: prepare your Python install by adding this package.

Boto: next, you need the Boto package for Python. Install that too.
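Both are on PyPI, so something like the following should get you set up (assuming you already have pip):

pip install fabric boto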

Get your AWS security keys, with at least the permission to read information about ALL EC2 instances. You don't need any more than that if you just want to SSH into the systems.

Also prepare your SSH key, of course. Place it anywhere, and now you can begin writing some code.

There are two parts to this.

1st part:

Use Boto to choose your EC2 Instances.

All instances have some attributes. Plus, good DevOps always tag their instances… you tagged your instances, didn't you? Well, no matter.

Code below (ignore the Fabric references, we'll get to those in a bit):

fabfile.py (the name fabfile.py is important, or it won't work)

from boto.ec2.connection import EC2Connection
from fabric.api import env, run, parallel

AWS_ACCESS_KEY_ID = 'GET_YOUR_OWN'
AWS_SECRET_ACCESS_KEY = 'GET_TO_DAT_CHOPPA'

def set_hosts():
    ec2conn = EC2Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
    hosts = []
    # select running instance-store instances launched with our key pair
    filters = {'key-name': 'privatekey',
               'instance-state-name': 'running',
               'root-device-type': 'instance-store'}
    for reservation in ec2conn.get_all_instances(filters=filters):
        #print reservation.instances  # (for debug)
        for host in reservation.instances:
            # build the user@hostname string for ssh to be used later
            hosts.append('USERNAME@' + str(host.public_dns_name))
    return hosts

env.key_filename = ['/path/to/key/privatekey.pem']
env.hosts = set_hosts()

Quick explanation: I have used a couple of filters above, clearly outlined with the keyword "filters" in the code. Here I chose to filter by key pair name for a bunch of servers that are in the "running" state (always good to have) and whose root-device-type is instance-store. Now, if you had tagged your servers, the key/value filter would look like this:

{'tag:Role': 'WobblyWebFrontend'}

 

You can use the reference here to find more filters. We are basically filtering by instance metadata, i.e. what you would get from the ec2-describe-instances command. You can even use the autoscale group name. The idea is for you to select the servers you want to run a particular command on, so that part is left to you. "I can only show you the code, you must debug it on your own." lulz
2nd part:

Now, assuming I am right and you have chosen all the servers you need to run your command on, we are going to write those commands in the same file.

Notice that the function set_hosts returns a list of hostnames. That's all we needed.

continued…fabfile.py

@parallel
def uptime():
    # Fabric runs this on every host in env.hosts
    run('uptime')

 

…and we are done with coding. No really.

cd into the directory where you saved fabfile.py and run the program like so:

$fab uptime

A satisfying splurge of output on the command line follows…

No wait! What happened here?
When you invoke the command "fab", Fabric looks for fabfile.py and runs the function that matches the first argument. So you can keep writing multiple functions for, say, "svn checkout", "wget" or "shutdown now", whatever; see the sketch below.
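For instance, a couple more tasks in the same vein (the path and service name are hypothetical, adjust to your own setup; add sudo to the fabric.api import line if you use it):

@parallel
def update_code():
    # pull the latest copy of the code on every host
    run('cd /var/www/app && svn up')

@parallel
def restart_nginx():
    # restart a process across the whole fleet
    sudo('service nginx restart')

Running "fab update_code restart_nginx" would then execute both, in that order, on every host.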

The @parallel decorator before the function tells Fabric to execute the command "SIMULFKINGTANEOUSLY" on all the servers. That is your parallel SSH.

May the force be with you. Although I know the look on your face right now.

Dec 17, 2012

Recently, in discussion, I noted several admins and managers complaining that AWS autoscale is not really cutting it for them. Some of the issues reported were:

  • Autoscale takes time to respond to an unexpected surge in traffic, causing intermittent downtime for several users.
  • Scale-down (scaling in) causes severe issues, and web app errors are thrown.

Other reported problems are mostly variations of the two above. Some suggestions for handling them, and an understanding of how autoscale works, will help.

Before going any further, let me remind you: autoscale is a slow process. It is not designed to help with a "sudden" and "unexpected" surge in traffic. That said, it is doable, because the "unexpected" here is a human component. Just because Amazon AWS made autoscale, you cannot throw capacity planning out the window.

You need to ask these questions and delve into your autoscale setup in detail.

1. How do you set your triggers?
You can set triggers based on time. For example, if you expect every Friday to be a high-traffic day, don't use CPU as an indicator: just pre-launch instances on Friday, automate it and forget it, and shut them down when traffic reaches its lows (a boto sketch of this follows below).
If you are using CPU or another metric, two factors matter: the threshold for the metric (e.g. CPU) and the time range. A lot of it is trial and error and depends on your usage scenario. I will take one example here: I expect my 3 web app instances to generally stay between 60-70% aggregate CPU throughout the day, and to spend no more than 15 minutes above 75%, which can happen when deploying large OS updates and app updates.
If I set my time range to 30 minutes and my CPU threshold to 75%, it would sound logical, but it is not. Traffic does not work that way, and neither do averages. If you get a sudden traffic spike, it cannot be helped by then, since we only ever keep 25% additional headroom. The problem is also easily miscalculated because of the aggregate CPU measure.
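The time-based case above is a few lines of boto. A minimal sketch (the group name, dates and capacities are hypothetical; newer boto versions also accept a recurrence parameter for repeating schedules):

import datetime
import boto.ec2.autoscale

conn = boto.ec2.autoscale.connect_to_region('us-east-1')

# scale out ahead of the expected Friday rush...
conn.create_scheduled_group_action('my-web-asg', 'friday-scale-out',
                                   datetime.datetime(2012, 12, 21, 9, 0),
                                   desired_capacity=6)
# ...and back down once traffic reaches its lows
conn.create_scheduled_group_action('my-web-asg', 'friday-scale-in',
                                   datetime.datetime(2012, 12, 21, 23, 0),
                                   desired_capacity=3)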

Spikes are handled with what you have NOW, not what you can have 15 minutes from now. Therefore, try to keep your permanent instance count high enough to sit at 40% or less average CPU under regular traffic, or at 50% of the peak (unexpected) traffic you have seen. Doing this alone makes your CPU threshold trigger more pronounced and shortens the time needed to judge whether additional resources are required.

It would seem appropriate to set 60% as the threshold, but we still need the time range.

The CPU aggregate is calculated as an average over a time range, e.g. 300 seconds or 5 minutes. A sample taken every second is added to the total, which is then divided by the 300 seconds.

Therefore, if you had 40% CPU for the first second followed by 100% CPU for the next 299 seconds, autoscale would not trigger until the 301st second or thereafter. Do you see the problem here? This is why it is important to anticipate as much peak traffic as you can and prepare in advance. At the end of the day you have to find something that fits your budget, so keeping your own financial allowances in mind, here is what you can additionally factor in.
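To make the averaging concrete, a quick sanity check in Python:

# one second at 40% followed by 299 seconds at 100%
samples = [40.0] + [100.0] * 299
print sum(samples) / len(samples)   # 99.8 -- but the alarm only evaluates
                                    # this average once the full 300s elapse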

Estimate your time factor and CPU threshold to be at the point where traffic is building towards the peak. Using the above example of my 40% average:

[Figure: aws autoscale threshold]

In the graph above, we ideally need to launch an instance at 70%, i.e. the 4.5-minute mark. But there is no way to know at that very moment whether it is indeed traffic or just a yum update check. Allowing for the benefit of the doubt, we should still let an instance run at 100% CPU while the new one gets time to attach to the ELB and start serving requests; let's say that time is 3 minutes. I have simplified the graph for calculation using absolute numbers, and will keep the calculation to arithmetic instead of effing integral calculus, because nothing should be that complicated in real life for us.

I develop two scale-up policies. The first one uses a time range of 15 minutes and looks for an aggregate of 70% to launch an additional instance. This is to keep to my below-50% mark at all times. This is straightforward: we do it to ensure that our CPU is not normally at high usage, and that we can absorb some surge in traffic without waiting for additional instances.

I make a second policy which looks at the 3-minute mark, or 5 minutes. Noobie admins will say "5 minutes should be 80%", but now you know that this is obviously incorrect. The average over 5 minutes in our case is (40+40+45+60+80)/5 = 53% (assuming per-minute samples). Detailed monitoring gives you per-minute sampling, which is why your autoscale instances have detailed monitoring enabled whether you like it or not; that is my theory. Therefore, if you set a 60% threshold over a 5-minute duration for an additional instance, you should be safe from two different perspectives.
Either policy one would have launched an instance, keeping your CPU below 50%, or policy two would have. Either of them will satisfy both criteria and you will not end up with twice the instances.
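Wiring the first of these policies up with boto might look something like this sketch (the region, group and alarm names are hypothetical; the second policy follows the same pattern with threshold=60 and evaluation_periods=5):

import boto.ec2.autoscale
import boto.ec2.cloudwatch
from boto.ec2.autoscale import ScalingPolicy
from boto.ec2.cloudwatch import MetricAlarm

GROUP = 'my-web-asg'  # hypothetical autoscaling group name
as_conn = boto.ec2.autoscale.connect_to_region('us-east-1')
cw_conn = boto.ec2.cloudwatch.connect_to_region('us-east-1')

# policy one: add an instance when the 15-minute CPU aggregate hits 70%
as_conn.create_scaling_policy(ScalingPolicy(
    name='scale-up-slow', as_name=GROUP,
    adjustment_type='ChangeInCapacity', scaling_adjustment=1, cooldown=300))
policy = as_conn.get_all_policies(as_group=GROUP,
                                  policy_names=['scale-up-slow'])[0]

cw_conn.create_alarm(MetricAlarm(
    name='cpu-70-over-15min', namespace='AWS/EC2',
    metric='CPUUtilization', statistic='Average',
    comparison='>=', threshold=70,
    period=60, evaluation_periods=15,  # fifteen 1-minute samples
    alarm_actions=[policy.policy_arn],
    dimensions={'AutoScalingGroupName': GROUP}))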

All you have to do is ensure that you are planning in advance. Autoscale is not magic. If your company's marketing is running a campaign that predicts "a large frenzied sale surge", then despite their failings so far you should nevertheless be prepared and launch additional instances as permanent for that time period. Note: when I say manually launch, I mean you manually add the instances to the ELB and keep them attached; don't modify the autoscale policies to do so.

2. Termination, scaling in or scaling down

This is rather tricky. A fact outside the control of Amazon or any provider: when you shut down a system, it does not care about the state of the application's users unless you program that in some way (the way a Windows server warns that users are connected to the Terminal Server, for example). For a web application this is left to us to decide. So once you accept that active connections at the time of termination will exit with errors on that particular instance, the question is: what is an acceptable error rate?

Say you choose to remove instances at 10% aggregate CPU so you are back to 30-40%: how many users are affected, and in what way? If that 10% means 1 million users who are posting memes, it is something you can decide on. If it means just 10 users, but they are in the middle of commercial transactions worth thousands or millions of dollars, then the problem is not about numbers.

One way to cope is to handle errors as gracefully as possible. For example, if you use AJAX-based user registration you can display a message like "Please resubmit the form, an unknown error has occurred on our end", or if yours is a streaming app you could try a delayed reconnect. Something along those lines. If the terminated instance has been removed from the ELB, the subsequent request should be directed to an online instance. Thusly averting the much unwanted management trial-room hassles where you have chosen benefit-of-the-doubt ("wtf are you talking about, looks fine to me -.- ") over the long, rather unwieldy technical explanation which may be ill-construed as another lazy excuse ("well it begins like this… you see on Abhishek's blog… and hence we can do nothing about those… 🙁 ").

If at all possible, I avoid too many terminations or scale-downs. I would rather keep the equipment that supports an unhinged marketing campaign, or a random Reddit front page appearance for entirely irrelevant reasons.

 

Speaking of termination… "I'll be back"

Aug 31, 2012

The scenario is this:

  • The Django app is running on an instance with a web server and has no SSL installed.
  • The SSL cert is installed on the ELB, and the ELB is accepting requests for the Django app (which is still non-SSL).

The problem here is that the URLs Django generates are not secure (i.e. HTTP), and Django is not enforcing secure mode.

For this we can use a Django middleware; example code:

 

class ELBMiddleware(object):
    def process_request(self, request):
        # the ELB terminates SSL and passes the original scheme
        # along in the X-Forwarded-Proto header
        if 'HTTP_X_FORWARDED_PROTO' in request.META:
            if request.META['HTTP_X_FORWARDED_PROTO'] == 'https':
                # make request.is_secure() report True so Django
                # builds https:// URLs
                request.is_secure = lambda: True
        return None

 

Remember to save this middleware in your Django project and enable it in settings.py. You know how, right? Hint: filename.ClassName 🙂 See the example below.
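In case the hint is too subtle, the settings.py entry would look something like this (the module path myapp.middleware is hypothetical; use wherever you saved the file):

MIDDLEWARE_CLASSES = (
    'myapp.middleware.ELBMiddleware',
    # ...the rest of your middleware classes...
)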

Suggestions, ideas and improvements are welcome.

Aug 27, 2012

Collected from the experiences of myself and an otherwise largely honest group of admins at an AWS users group meeting (yes, it's like AA: we all say our name and begin our story).
This post will not make sense to those who have not actually used Amazon AWS even once. If you have had a dab at it, or your boss asked you to go figure out AWS and "set us up a server or you have no chance to survive", then this might help you.

Drink Driving

1. Treating Amazon AWS resources as a regular Data Center or Dedicated Host provider.

AWS is just that, a web service. All servers are virtual and disks are theoretical; the sysadmins are asleep. Do not put all your beers in one basket. Also, there is no basket, only snapshots and AMIs.
The idea that servers are not unique and do not exist as physical hardware is hard to grasp. This leads to elaborate single-server setups backed completely by EBS.
"Both EBS and Instances (not servers, as they are not) can and will fail" – Yoda (it was either him or me).

Best practice, both in and out of AWS, is to assume that after you have done your perfect CentOS 6 LAMP setup, the whole thing will disappear. Write Bash scripts, AMIs, images, whatever, to be able to recreate everything on new hardware from scratch.

Treating AWS resources as if you were in a data center with a bunch of noisy servers and your expensive SAN disks is just an accident waiting to happen. Psychological help is advised.

2. I know, "I will use NFS"

That's just the alcohol talking. When faced with a multi-tenanted architecture, most sysadmins and developers realize they need to share "these" files between "those" autoscale servers. The noobie sysadmin pronto hooks up another "server" with a large "harddisk" and NFS-shares the hell out of it: "voila, i made bomb". The problem here is that you cannot scale EBS (which is network-attached storage at best), and neither can you scale your NFS server. The entire setup is henceforth considered a fail and a waste of startup funding. The solution is to write your code to use S3, databases and memcache. For infrequently updated files like source code there is source control (GitHub? Codesion? take your pick), which can just as easily be checked out on every system at specific events (push/pull). But please do not commit your user-uploaded files to source control as well.
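To give you an idea, pushing an uploaded file to S3 from Python is only a few lines with boto; a minimal sketch (the bucket and key names are hypothetical):

import boto
from boto.s3.key import Key

conn = boto.connect_s3()  # picks up your AWS credentials
bucket = conn.get_bucket('my-user-uploads')

k = Key(bucket)
k.key = 'avatars/user123.png'
# store the freshly uploaded file in S3 instead of on the local disk
k.set_contents_from_filename('/tmp/user123.png')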

3. We have now added ELB, bring it on “spiky traffic”

Rum is widely assumed to bring on unrealistic and sometimes fatal acts of bravery. ELB is a slow-scaling system and takes time to "warm up" before it can serve a surge of requests. If you have mountain-shaped traffic spikes, it's best to email Amazon support with your peak traffic data and ask them to "pre-warm" your ELB. They know what to do. It is also advisable to check on Amazon SES rates, DynamoDB table limits, RDS etc., which are not autoscale resources. Setting up SNS alerts for thresholds is a must, and they should fire a bit before the "you shall not pass" messages start hitting users. I use 70% as a good number for DynamoDB rates, RDS CPU usage, connection usage etc.
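For example, a 70% CPU alert on an RDS instance via CloudWatch and boto might look like this sketch (the SNS topic ARN and instance id are hypothetical):

import boto.ec2.cloudwatch
from boto.ec2.cloudwatch import MetricAlarm

cw = boto.ec2.cloudwatch.connect_to_region('us-east-1')
cw.create_alarm(MetricAlarm(
    name='rds-cpu-70', namespace='AWS/RDS',
    metric='CPUUtilization', statistic='Average',
    comparison='>=', threshold=70,
    period=300, evaluation_periods=2,
    # SNS topic that pages the admins
    alarm_actions=['arn:aws:sns:us-east-1:123456789012:ops-alerts'],
    dimensions={'DBInstanceIdentifier': 'mydbinstance'}))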

4. Haha, 2 instances before product launch and 100+ after launch. I’ll just sit here and refresh my dashboard…

The cops will get to you first, period. Amazon AWS has resource limits; you cannot allocate an unlimited number of resources. It is best to email support and find out. EC2 instances are limited to 20 per account by default. So if you need more while getting flooded by traffic, you are out of luck, as support can take at best a day to lift your limits, justifiable reasons aside.
If you do find yourself in the situation that you have exceeded the instance limit and need more right now, Spot Instances can come to your rescue. So far we have not seen any limits on those; they will buy you the time. Spot Instances do not count towards your account quota limits.

5. I will use ELB as an internal proxy; now I have a neat gun of a MySQL load balancer as well.

Why would you snail-mail your AA meeting flyers when you have everyone's email address and Facebook groups? We need to talk about this problem of yours.
ELBs are essentially gateways and can be compared to your home modem. They are not routers and cannot determine whether you are talking to a PC on the home network or trying to SSH tunnel to your office. You are essentially adding lag to your traffic with an additional round trip. It's like sharing files between two home computers on the same LAN using Dropbox. You will notice the ELB does not have an internal domain name, unlike instances. VPC ELBs operate differently, and we are not going to discuss them with someone like you until you recover from this hangover.

6. Let's use a single zone for all autoscaled instances and save on the stupid inter-AZ data transfer cost. I should get a bonus for this.

Getting free drinks is no excuse. You are obviously not making progress here.

ELB will look at your instance zones and set up one ELB endpoint per zone (you only see one ELB, for simplicity). If you have only one zone, you, my friend, are going to have a problem: you also have only one zone's ELB. It is quite common to see a single zone, and sometimes more than one, experience issues like network latency, faults etc. I personally was once baffled when 2 of the zones had no instances to spare and everyone was spot-bidding above the on-demand price just to stay in the same zone. I should get some of the free drinks those guys got, but I am trying to stay sober here while I attend this night beach party.

7. I save the cost of EBS IO by starting an instance-store instance and then dd-copying from EBS-attached volumes to the instance store. Why do I have x and y problems?

You are obviously in the wrong meeting. You need to get in touch with the nearest hospital or something.
Why, just why?

8. We use resources from more than one REGION to serve a single request.

Good luck on your new adventures. You can add gambling to your list of problems.

More… coming too soon.

Jul 03, 2012

I will briefly describe the tools and ideas behind using Amazon AWS for web deployments to its full extent, without overdoing it. This is the first of at least three parts I am going to write down for reference and critique. I have not read any books on the subject; having worked as both a server engineer in data centers and a three-tier coder, I had accumulated a fair share of failures and ideas of what could have been done right. It was plain hindsight behind doing things the way I did.
I used the AWS documentation for most of my work, as I do now with OpenShift.

Parts
1. Architecture (Infrastructure requirements)
2. Tools – Platforms, tools, scripts and programs
3. Building, Testing and Automating

This is the simplest form of web deployment that is scalable, fault tolerant, very redundant and stable.
Below is an architecture of the basic AWS IaaS building blocks we need. I use Amazon RDS for MySQL here, as it takes care of a lot of the headaches involved in maintaining a MySQL server with backups and failover.

[Figure: Amazon AWS web deployment, highly scalable and redundant]

I currently have at least 3 such deployments that have been in production between one and three years, maintaining an uptime of 99.9x% despite occasional DoS attacks and several database migrations.

For PHP5-backed applications:

App servers – These EC2 instances can be plain web servers with Nginx + PHP-FPM and your source code. All static files and all user data (uploaded files, avatar pictures, comments etc.) that can change should not be on local drives here. They should be in S3 or other locations that can be shared by any EC2 instance.
Note: Amazon sales will tell you to use huge EBS drives and copy files over, etc. I do not understand their reasoning for this. All the app servers I have are instance-store machine images which contain source code/executables only. While boot time is slightly longer (as claimed by AWS sales), I see no significant difference. I have also found EBS-backed instances to be more prone to failures, reboots and corruption, and they cost more. A badly configured EBS-backed instance can leave behind unused EBS volumes like bird shit.

WARNING: NEVER USE NFS TO SHARE FILES BETWEEN AUTOSCALED SERVERS. IN FACT NEVER USE NFS… period.

RDS: Always use Multi-AZ. Try to keep to one DB only for best performance; now you know why programmers used table prefixes. Do not use anything less than a small instance in production.
WARNING FOR MYSQL RDS USERS: NEVER USE MYISAM TABLES. IN FACT, ONLY USE INNODB unless you have a damn good reason not to.
Set snapshot retention to at least 5 days. Snapshots are very useful when someone accidentally deletes your database; Multi-AZ will not save you, it will happily delete your failover copy too. Yes, it happens in real life! If you want to make your own homegrown backup server, go ahead, but leave snapshots on. They are the fastest recovery mechanism and virtually hands-free.
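Setting the retention period with boto might look like this (the region and instance id are hypothetical; I believe modify_dbinstance accepts backup_retention_period, check your boto version):

import boto.rds

conn = boto.rds.connect_to_region('us-east-1')
# keep 5 days of automated snapshots on an existing instance
conn.modify_dbinstance(id='mydbinstance',
                       backup_retention_period=5,
                       apply_immediately=True)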

I think the rest is self-explanatory for now. If there are questions, I can update the post.

In Part 2, I will cover tools like memcache and syslog, configuring an AMI with startup scripts, and automating things.

Jun 30, 2012

This is not the best or perfect way, but it works for me and I was asked to share. This is going to be my first post. The AMI I am using is instance-store backed. You could do the same for an EBS-backed AMI, though I have not tried.

Let's assume you are creating the original AMI in US-East and want to put a copy in US-West.

  • When you create the AMI, simply upload the bundle to a US-West bucket as well.
  • Proceed to register the AMI in US-East to ensure it works there. If it does, try to register it in US-West.
  • At this point you can simply try registering the AMI from the bucket's manifest URL, but it will likely fail complaining about the kernel.
  • When this happens you need to fix the kernel-id. The rest of the steps are done only on the target region, US-West.
  • In US-West, find the OS of your choice among the publicly available AMIs and proceed as if to launch it. You don't actually need to start an instance.
  • Assuming you used the AWS web console, in the following screens you will see a dropdown list of kernel IDs. Simply open the dropdown and choose one.
  • For example, my original AMI is based on CentOS 5.5, so in US-West I looked for CentOS 5.5 and found a kernel ID.
  • Note down a bunch of kernel IDs from the list; they start with aki-xxxx.
  • Go to the US-West S3 bucket where you uploaded the AMI bundle files and find the xxxx-Manifest.xml. Download it and open it in an editor of your choice.
  • Let us fix our AMI's xxxx-Manifest.xml. Look for the XML section about machine configuration. Something like this…


<machine_configuration>
<architecture>x86_64</architecture>
<block_device_mapping>
<mapping>
<virtual>ami</virtual>
<device>/dev/sda1</device>
</mapping>
<mapping>
<virtual>root</virtual>
<device>/dev/sda1</device>
</mapping>
</block_device_mapping>
<kernel_id>aki-9ba0f1de</kernel_id>
</machine_configuration>

You will note there is a kernel_id element there. Edit it and replace the kernel id with the one we noted down in US-West.
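If you would rather script the edit, here is a minimal sketch using Python's standard library (the file name and kernel id are placeholders):

import xml.etree.ElementTree as ET

tree = ET.parse('xxxx-Manifest.xml')
# locate kernel_id inside machine_configuration and swap it
kernel = tree.getroot().find('.//machine_configuration/kernel_id')
kernel.text = 'aki-xxxxxxxx'  # one of the kernel ids noted in US-West
tree.write('xxxx-Manifest.xml')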

  • Remember to keep a copy of the original Manifest.xml in case you make a mistake. Save it somewhere for future reference, or rename it if you are saving it in the same bucket as the original, to avoid confusion.
  • Upload your edited xxxx-Manifest.xml back to your AMI bucket, replacing the one that is already there.
  • Try to register the AMI again. If that fails, try another kernel-id following the same process as above.

That's about it. Pretty simple and straightforward. I am sure the tech team at AWS is working on a better solution for this.