Where Google App Engine Spanks Amazon’s Web Services: S3, EC2, Simple DB, SQS
A short summary of the differences between Google’s App Engine and Amazon AWS.
Pricing for App Engine has also been announced.
Hunting memory leaks in Python
Interesting post showing how to use the Python garbage collector’s introspection features and Graphviz to track down object reference problems.
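As a rough illustration of the kind of introspection the post describes (this is not the post's own code), the sketch below uses the standard gc module to find which objects refer to a target object and writes the result as Graphviz DOT text. The Node class and the referrers_as_dot function are hypothetical names made up for this example.

```python
import gc


class Node:
    """Hypothetical object type used only for this illustration."""

    def __init__(self, name):
        self.name = name
        self.other = None


def referrers_as_dot(target, candidates):
    """Return Graphviz DOT text with one edge per candidate that refers to target.

    gc.get_referrers() reports the containers that reference `target`.
    Depending on the Python version an attribute reference is reported
    either as the instance itself or as the instance's __dict__, so both
    are checked.
    """
    referrers = gc.get_referrers(target)
    lines = ["digraph refs {"]
    for obj in candidates:
        if any(ref is obj or ref is vars(obj) for ref in referrers):
            lines.append('  "%s" -> "%s";' % (obj.name, target.name))
    lines.append("}")
    return "\n".join(lines)


a, b, c = Node("a"), Node("b"), Node("c")
a.other = b   # a refers to b
c.other = b   # c refers to b as well
dot_text = referrers_as_dot(b, [a, b, c])
print(dot_text)
```

The printed text can be saved to a file and rendered with Graphviz's dot tool to see who is keeping an object alive.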
Gin, Television, and Social Surplus
People new to open source software, blogging and other participatory Internet activities often wonder where others find the time. In short, it comes from not wasting a lot of time on things such as TV. The two-way nature of the Internet has made it possible for normal people to be part of the creative process in their spare time in a way that one-way media like TV and radio do not.
The article linked to above refers to the time wasted on TV etc as the cognitive surplus. It even goes on to define a ‘cognitive unit’ based on the total amount of work that has gone into creating Wikipedia. Using this unit, the amount of cognitive resources that are wasted on TV every year is estimated at 2,000 Wikipedias or 200 billion hours in the U.S. alone.
The linked article is worth your time. The suggested link between Gin, TV and societal change is fascinating.
Environment Canada is nice enough to publish radar data for their many weather radars across the country. The Exeter WSO radar covers the area in which I live.
A friend of mine has created a great Google Maps mashup which makes use of the data provided by Environment Canada in combination with the KML file format which has recently been standardized. Note that because of the multiple overlay layers you may need to turn off some of the data (checkboxes on the right) to make Google Maps faster.
Environment Canada Weather Radar on Google Maps
Environment Canada Weather Radar KMZ (Google Earth)
Kier has also made a couple of blog posts describing the development of the mashup [1, 2].
There has been lots of discussion and buzz around the Amazon Web Services (AWS) lately. I posted a few links about this last week. Most of the articles that I have read on AWS speak of it from a high level. General discussions about how the service allows your web application to increase capacity as required are interesting but I was curious about the interface that these services present to application developers and to the Internet. More specifically, how do the AWS interfaces compare with normal server colocation services?
Amazon AWS is actually a collection of services. The Elastic Compute Cloud (EC2) is the service most commonly discussed. Other interesting services that are part of AWS include the Simple Storage Service (S3), SimpleDB and the Simple Queue Service (SQS). This article will only discuss EC2 but does not aim to be an EC2 tutorial. Amazon provides a good user guide if you are sufficiently interested.
Everything below comes from an afternoon of experimentation with EC2. Please leave a comment with any corrections or other useful bits of information you might have.
EC2 operates on a pay-for-what-you-use model. As a result you need a credit card to use EC2 so Amazon can bill you once per month based on your usage. The first step is to sign up for an AWS account. This account will give you access to AWS documentation and other content. After you have an AWS account you can then enroll in EC2. It is at this point that the credit card is required.
All interaction with EC2 occurs over web service APIs. Both REST and SOAP style interfaces are supported. Web service authentication occurs via X.509 certificates or secret values depending on the web service API used. Amazon nicely offers to generate an X.509 certificate and public/private keys for you. Letting Amazon create the keys and the certificate is probably a good idea for most people since it is not an entirely trivial task. However, depending on how paranoid you are you might want to create the keys locally. Amazon says they don’t store the private keys they generate and I have no reason to doubt them but generating the keys locally reduces the possibility that your private key will be compromised.
The fundamental unit in EC2 is a virtual machine. If you have experience with Xen or VMWare you can think of EC2 as a giant computer capable of hosting thousands of virtual machines. In fact, the virtualization technology used by EC2 is Xen. At present only Linux based operating systems are supported but Amazon says that they are working towards supporting additional OS’s in the future. Since Xen already has the capability to host Windows and other operating systems this certainly should be possible.
All virtual machine images in EC2 are stored in Amazon’s S3 data storage service. Think of S3 as a file system in this context. Each virtual machine image stored in S3 is assigned an Amazon Machine Image (AMI) identifier. It is this identifier that serves as the name of the virtual machine image within EC2.
Virtual machine images within EC2 can be instantiated to become a running instance. Many instances of an image can be running at any one time. Each instance has its own disks, memory, network connection etc so it is completely independent from the other instances booted from the same image. Think of the virtual machine image as an operating system installation disk. This is all very similar to VMWare and other virtualization technologies.
Amazon and the AWS community provide a large number of AMIs for various Linux distributions. Some are general images while others are configured to immediately run a Ruby on Rails application or fill some other specialized role. Of course it is also possible to create new AMIs either for public or private use. Private images are encrypted such that only EC2 has access to them. Since private images will likely contain proprietary code this is a necessary feature.
For an example of why you might want multiple images consider a three tier web application which consists of a web server tier, application tier and a database tier. By having an AMI for each of these machine types the application author can quickly bring new virtual machines in any tier online without having to make configuration changes after the new instance has booted. EC2 also allows a small amount of data to be passed to new instances. This data can be used like command line arguments. For example the address of a database server could be passed to the new instance.
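A minimal sketch of how an instance might treat that data like command line arguments. The whitespace-separated key=value format and the parse_user_data name are assumptions for illustration only; the data itself is free-form as far as this article is concerned.

```python
def parse_user_data(text):
    """Parse whitespace-separated 'key=value' tokens into a dict.

    The key=value layout is an assumed convention, not something EC2 imposes.
    """
    settings = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:  # silently skip tokens that have no '='
            settings[key] = value
    return settings


# Hypothetical data handed to a new application tier instance.
config = parse_user_data("role=app db_host=10.0.0.5 db_port=5432")
print(config["db_host"])
```

With something like this in the image's startup script, the same application tier AMI can be pointed at a different database server simply by launching it with different data.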
All interaction with EC2 occurs via very extensive web service APIs. Creating and destroying new instances is trivial as is obtaining information on the running instances. There is even a system in place for instances to obtain information about themselves such as their public IP address. Where applicable, such as when starting a new virtual machine instance, these web service calls must be authenticated via an X.509 certificate or a secret value.
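A hedged sketch of the self-inspection idea: from inside an instance, information is fetched with a plain HTTP request to a link-local address. The base URL and the public-ipv4 key below are what I understand the instance metadata service to use, so treat them as assumptions and check the user guide. The function names are made up, and the fetch function is injectable so the logic can be exercised away from EC2.

```python
from urllib.request import urlopen

# Assumed metadata location; only reachable from inside a running instance.
METADATA_BASE = "http://169.254.169.254/latest/meta-data/"


def http_get(url, timeout=2):
    """Fetch a URL and return the body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("ascii")


def instance_metadata(key, fetch=http_get):
    """Return one metadata value, e.g. instance_metadata('public-ipv4')."""
    return fetch(METADATA_BASE + key).strip()


# Outside EC2, a stand-in fetch function shows the behaviour.
requested = []


def fake_fetch(url):
    requested.append(url)
    return "72.44.48.168\n"


print(instance_metadata("public-ipv4", fetch=fake_fetch))
```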
Since not everyone will want to write their own EC2 management software Amazon provides a set of command line utilities (written in Java) which wrap the web service APIs. This allows the user to start, stop and manage EC2 instances from the command line.
Creating a new instance is as simple as:
./ec2-run-instances ami-f937d290 -k amazon
The ‘-k amazon’ specifies the name of the SSH private key to use. I’ll come back to this in a bit. Starting ten instances of this image can be accomplished by adding ‘-n 10’:
./ec2-run-instances ami-f937d290 -n 10 -k amazon
It is also possible to look at the virtual machine’s console output. Unfortunately, this is read-only. Management activities are not possible via the console. In this case the instance identifier is passed, not the AMI identifier.
./ec2-get-console-output i-8fad57e6
Again, all of the management activities happen via web service APIs so you can build whatever management software you require.
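One low-effort way to start building such software is to wrap the command line utilities rather than calling the web services directly. The sketch below only uses the ‘-n’ and ‘-k’ options shown above. The function names are hypothetical and the tool path is assumed to be the current directory, as in the examples.

```python
import subprocess


def run_instances_command(ami_id, key_name, count=1, tool="./ec2-run-instances"):
    """Build the argument list for launching `count` instances of an AMI."""
    command = [tool, ami_id]
    if count > 1:
        command += ["-n", str(count)]
    command += ["-k", key_name]
    return command


def launch(ami_id, key_name, count=1):
    """Run the tool and return whatever it prints (needs the tools installed)."""
    result = subprocess.run(
        run_instances_command(ami_id, key_name, count),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


print(" ".join(run_instances_command("ami-f937d290", "amazon", count=10)))
```

Keeping the command construction separate from its execution makes the wrapper easy to check without starting (and paying for) real instances.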
At present Amazon offers three different virtual hardware platforms.
The storage layout of a small instance running a Fedora 8 image looks like the following:
-bash-3.2# cat /proc/partitions
major minor  #blocks  name
   8     2  156352512 sda2
   8     3     917504 sda3
   8     1    1639424 sda1
-bash-3.2# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             1.6G  1.4G  140M  91% /
none                  851M     0  851M   0% /dev/shm
/dev/sda2             147G  188M  140G   1% /mnt
Output from top:
Tasks:  49 total,   1 running,  48 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.6%us,  1.0%sy,  0.0%ni, 96.1%id,  0.6%wa,  0.0%hi,  0.0%si,  1.7%st
Mem:   1740944k total,    88904k used,  1652040k free,     4520k buffers
Swap:   917496k total,        0k used,   917496k free,    33424k cached
The disk partitions attached to each instance are allocated when the reservation is created. While these file systems will survive a reboot they will not survive shutting the instance down. Also note that Amazon makes it clear that internal maintenance may shut down virtual machines. This basically means that you cannot consider the disks attached to the reservations as anything more than temporary storage. It is expected that applications running on the EC2 platform will make use of the S3 data storage service for data persistence. In fact, signing up for the EC2 service automatically gives access to S3.
Once instantiated each virtual machine has a single Ethernet interface and is assigned two IP addresses. The IP address assigned to the Ethernet interface is an RFC 1918 (private) address. This address can be used for communication between EC2 instances. The second address is a globally unique IP address. This address is not actually assigned to an interface on the virtual machine. Instead NAT is used to map the external address to the internal address. This allows the instance to be directly addressed from anywhere on the Internet but does limit communication to using the protocols supported by Amazon’s NAT system. At present traffic to and from the virtual machines is limited to the common transport layer protocols (TCP and UDP) making it impossible to use other transport protocols such as SCTP or DCCP.
Both the internal and external IP addresses are assigned to new instances at boot time. EC2 does not support static IP address assignment.
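To make the internal/external distinction concrete, here is a small sketch that checks whether an address falls inside the three RFC 1918 private ranges. The 10.251.42.7 address is a made-up stand-in for an internal address. 72.44.48.168 is one of the public EC2 addresses that shows up in the DNS examples in this article.

```python
import ipaddress

# The three private ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network(block)
    for block in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_rfc1918(address):
    """Return True if the IPv4 address lies in an RFC 1918 private range."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918_NETWORKS)


print(is_rfc1918("10.251.42.7"))    # hypothetical internal address
print(is_rfc1918("72.44.48.168"))   # public address reached through NAT
```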
Amazon implements firewall functionality in the NAT system which handles all public Internet traffic going to and from the EC2 instances. When instantiated each instance can be assigned a group name or use the default group. The group name functions like an access list. Changing the access rules associated with a group is accomplished with the ec2-authorize command. The following example allows SSH, HTTP and HTTPS to a group named ‘webserver’.
ec2-authorize webserver -P tcp -p 22
ec2-authorize webserver -P tcp -p 80
ec2-authorize webserver -P tcp -p 443
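The same rules can be generated from a list of ports, which is handy once there are several groups to manage. This sketch reuses only the ‘-P’ and ‘-p’ options shown above. The is_allowed helper is a toy model that assumes anything not explicitly authorized is blocked, and both function names are invented for the example.

```python
def authorize_commands(group, tcp_ports):
    """Build one ec2-authorize command line per TCP port for a group."""
    return [
        "ec2-authorize %s -P tcp -p %d" % (group, port)
        for port in tcp_ports
    ]


def is_allowed(open_tcp_ports, port):
    """Toy access list check: assumes ports not listed are blocked."""
    return port in open_tcp_ports


WEB_PORTS = [22, 80, 443]   # SSH, HTTP and HTTPS
commands = authorize_commands("webserver", WEB_PORTS)
for line in commands:
    print(line)
```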
The authentication method used to connect to an EC2 instance depends on whether or not you build your own images. If you build your own image you can use whatever authentication or management solution you like. Obvious examples include configuring the image with predefined usernames and passwords and using SSH or perhaps Webadmin. Installing SSH keys for each user and disabling password authentication is probably the best choice.
Authentication when using the publicly available images is a little more complicated. Having a default user/password combination or even default user SSH keys would allow other users to easily login to an instance booted from a publicly available image. To get around this problem Amazon has created a system whereby you can register an SSH key with EC2. During the virtual machine imaging process the public portion of this SSH key is installed as the user key for the root user.
The biggest difference between server colocation and EC2 is the ephemeral nature of the resources in EC2. This is a positive property in that it is trivial to obtain new resources in EC2. On the negative side of things the fact that ‘machines’ can disappear and that other resources such as IP address assignments are unpredictable adds new complexities.
Amazon states that servers can be shut down during maintenance periods and of course hardware failures will happen. Both of these events will result in virtual machine instances ‘failing’. Since disks and therefore the data that they contain disappear when instances die it seems that the complete failure of individual virtual servers is going to be a more common event than one might expect with traditional server co-location. Consider that a massive power failure event in Amazon’s data center(s) will be the equivalent to a traditional colocation facility being destroyed. Not only do you temporarily lose operational capability but each and every server and the data they were processing and storing would be gone.
In reality every large scale web service should plan for large failure events and individual server failure is also expected to happen regularly given enough nodes. Perhaps deployment on EC2 will make these events just enough more likely to force developers to address them rather than implicitly assuming that they will never occur.
If anyone reading this has experience using EC2 I would love to hear about how often you experience virtual machine failure.
Another interesting complication comes from the fact that EC2 does not support static IP address assignments. Often large web deployments include a device operating as a load balancer in front of many web servers. This may be a specialized device or another server running something like mod_proxy. Using example.com as an example, a typical deployment would point the DNS A records for www.example.com to the load balancer devices. When colocating a server it is normal to be assigned a block of IP addresses for your devices. This makes it easy to replace a failed load balancer node without requiring DNS changes. However, in the case of EC2 you do not know the IP address of your load balancer node until it has booted. As already discussed, this node can disappear and when its replacement comes back online it will be assigned a different IP address.
This presents a problem because the Internet’s DNS infrastructure relies on the ability of DNS servers to cache information. The length of time that a particular DNS record is cached is called the time to live (TTL). Within the TTL time a DNS server will simply return the last values it obtained for www.example.com rather than traversing the DNS hierarchy to obtain a new answer. The dynamic nature of IP address assignment inside EC2 does not mix well with long TTL values. Imagine a TTL value of one day for www.example.com and the failure of the load balancer node. The result would be up to a full day where portions of the Internet would be unable to reach www.example.com. Perhaps more inconvenient would be the user seeing another EC2 customer’s site if the address was reassigned.
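The stale answer problem can be shown with a toy caching resolver. This is a simplified model, not how any real DNS server is implemented. The class name is made up and the 192.0.2.x addresses are reserved documentation addresses standing in for the old and new load balancer.

```python
class CachingResolver:
    """Toy resolver that answers from its cache until the TTL expires."""

    def __init__(self, zone):
        self.zone = zone    # authoritative data: name -> (address, ttl_seconds)
        self.cache = {}     # name -> (address, expiry_time)

    def resolve(self, name, now):
        cached = self.cache.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]              # within the TTL: no new lookup
        address, ttl = self.zone[name]    # missing or expired: ask the authority
        self.cache[name] = (address, now + ttl)
        return address


zone = {"www.example.com": ("192.0.2.10", 86400)}   # one day TTL
resolver = CachingResolver(zone)
before = resolver.resolve("www.example.com", now=0)

# The load balancer dies and its replacement boots with a new address.
zone["www.example.com"] = ("192.0.2.20", 86400)
stale = resolver.resolve("www.example.com", now=3600)    # an hour later
fresh = resolver.resolve("www.example.com", now=86400)   # TTL has expired
print(before, stale, fresh)
```

Until the cached record expires the resolver keeps handing out the address of the node that no longer exists, which is exactly the outage window described above.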
In order to work around this problem one solution is to use a very low TTL value. This is the approach taken by AideRSS.
$ dig www.aiderss.com
; <<>> DiG 9.5.0b1 <<>> www.aiderss.com
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 3721
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 5, ADDITIONAL: 5
;; QUESTION SECTION:
;www.aiderss.com.              IN     A
;; ANSWER SECTION:
www.aiderss.com.       3600   IN     CNAME  aiderss.com.
aiderss.com.           60     IN     A      72.44.48.168
. . .
$ host 72.44.48.168
168.48.44.72.in-addr.arpa domain name pointer ec2-72-44-48-168.compute-1.amazonaws.com.
AideRSS is using a sixty second TTL for the aiderss.com A record. This means that every sixty seconds all DNS servers must expire the cached value and go looking for a new value.
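A back-of-the-envelope sketch of what that means for a single caching DNS server. It assumes the name is requested continuously so the record is re-fetched the moment it expires, which makes the result an upper bound per server and per record.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_refreshes_per_day(ttl_seconds):
    """Upper bound on how often one caching server re-fetches a record."""
    return SECONDS_PER_DAY // ttl_seconds


for ttl in (60, 3600, 7200):
    print("TTL %5ds -> up to %4d lookups per day" % (ttl, max_refreshes_per_day(ttl)))
```

A sixty second TTL allows up to 1440 lookups per day from each caching server, compared with 24 for a one hour TTL.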
Another site hosted on EC2 is Mogulus (just found them when looking for EC2 customers). They take a slightly nicer approach to this problem.
$ dig www.mogulus.com
; <<>> DiG 9.5.0b1 <<>> www.mogulus.com
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46281
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 2, ADDITIONAL: 2
;; QUESTION SECTION:
;www.mogulus.com.              IN     A
;; ANSWER SECTION:
www.mogulus.com.       7200   IN     A      67.202.12.112
www.mogulus.com.       7200   IN     A      72.44.57.45
$ host 67.202.12.112
112.12.202.67.in-addr.arpa domain name pointer ec2-67-202-12-112.z-1.compute-1.amazonaws.com.
$ host 72.44.57.45
45.57.44.72.in-addr.arpa domain name pointer ec2-72-44-57-45.z-1.compute-1.amazonaws.com.
Rather than a single A record with a very low TTL Mogulus uses two A records pointing to two different EC2 nodes and a TTL of 7200 seconds (two hours).
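Why two A records help can be sketched as follows. The model assumes the client is willing to try the next address when one does not respond, which not every client does. The function name and the is_up callback are hypothetical. The addresses are the two from the dig output above.

```python
def first_reachable(addresses, is_up):
    """Return the first address for which is_up(address) is true, else None."""
    for address in addresses:
        if is_up(address):
            return address
    return None


A_RECORDS = ["67.202.12.112", "72.44.57.45"]

# Pretend the first node has failed and only the second is still answering.
alive = {"72.44.57.45"}
chosen = first_reachable(A_RECORDS, lambda address: address in alive)
print(chosen)
```

As long as one of the two nodes survives, a client that retries still reaches the site, so the longer TTL is much less of a risk.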
Personally, I consider these low TTL values (especially the 60s one) to be mildly anti-social behavior because they force additional work on DNS servers throughout the Internet to deal with a local problem. Amazon should consider adding the ability to statically provision IP addresses. This would allow the Internet facing EC2 nodes to have consistent addresses and thereby reduce the failover problems. Like everything else in EC2, this could be charged by usage. I’d be happy to pay a few dollars (5, 10, x?) a month for a single IPv4 address within EC2 that I could assign to a node of my choosing.
Unless you are reading this close to the date it was written it is probably a good idea to visit Amazon for pricing information instead of relying on the data here.
When I first started investigating EC2 I misinterpreted EC2’s pricing. I thought that instance usage was charged on a CPU time basis. This would effectively mean that an idle server would cost next to nothing. The correct interpretation is that billing is based on how long the instance is running, not how much CPU it uses. The current EC2 pricing is:
This makes the constant use of a single small instance cost $70/month. Pretty reasonable especially when you consider that you do not have to buy the hardware.
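The billing model boils down to rate times hours running, regardless of how busy the instance is. The sketch below works backwards from the $70/month figure to the hourly rate it implies. A 30 day month is an assumption, and the function names are invented for the example.

```python
HOURS_PER_MONTH = 24 * 30   # assumes a 30 day month


def monthly_cost(hourly_rate, hours_running=HOURS_PER_MONTH):
    """Cost depends only on how long the instance runs, not on CPU usage."""
    return hourly_rate * hours_running


def implied_hourly_rate(monthly_total, hours_running=HOURS_PER_MONTH):
    """Hourly rate implied by a monthly bill for an always-on instance."""
    return monthly_total / hours_running


rate = implied_hourly_rate(70.0)
print("implied rate: $%.3f per hour" % rate)
print("running half the month: $%.2f" % monthly_cost(rate, HOURS_PER_MONTH / 2))
```

An idle instance left running all month costs the same as a busy one. The only way to pay less is to shut instances down.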
Using EC2 also incurs data transfer charges.
Data transfer between EC2 nodes and to/from the S3 persistent storage service is free. Note that S3 has its own pricing structure.
In a lot of ways EC2 is similar to server colocation services. At its lowest level EC2 gives you a ‘server’ to work with. Given the prices outlined above using EC2 as a colocation replacement may be a good choice depending on your requirements.
What really makes EC2 interesting is its API and dynamic nature. The EC2 API makes it possible for resources such as servers and the hosting environment in general to become a component of your application instead of something which the application is built on. Applications built on EC2 have the ability to automatically add and remove nodes as demands change. Replacing failed nodes can also be automated. Giving applications the ability to respond to their environment is a very intriguing idea. Somehow it makes the application seem more alive.
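A minimal sketch of what responding to the environment could look like: decide how many nodes the current load needs and work out whether to launch or terminate instances. The load figures, per-node capacity and function names are all hypothetical, and a real controller would also need to damp rapid changes.

```python
import math


def desired_nodes(requests_per_second, capacity_per_node, min_nodes=1):
    """Number of nodes needed for the load, never fewer than min_nodes."""
    needed = math.ceil(requests_per_second / capacity_per_node)
    return max(min_nodes, needed)


def scaling_action(current_nodes, target_nodes):
    """Return ('launch', n), ('terminate', n) or ('hold', 0)."""
    if target_nodes > current_nodes:
        return ("launch", target_nodes - current_nodes)
    if target_nodes < current_nodes:
        return ("terminate", current_nodes - target_nodes)
    return ("hold", 0)


target = desired_nodes(requests_per_second=950, capacity_per_node=100)
print(scaling_action(current_nodes=4, target_nodes=target))
```

The same comparison covers failed nodes: if an instance disappears, the current count drops below the target and the next pass launches a replacement.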
I finally got around to watching A new way to look at networking yesterday. This is a talk given by Van Jacobson at Google in 2006 (yes, it has been on my todo list for a long time). This is definitely worth watching if you are interested in networking.
A couple of quick comments (These are not particularly deep or anything. This is mostly for my own reference later.):
It’s pretty hard to not notice the buzz around ‘cloud computing’. In large part this is due to the new services being offered by Amazon. Who would have thought that a book seller could become the infrastructure for a new generation of Internet start-ups?
Here’s a bit of information to whet your appetite. Somehow I have to find the time to play with these technologies.
Scaling with the clouds
Surviving the storm
Drawing (nearly) unlimited power from the sky
Drawing power from the sky, part 2
In The Inquisition In Canada my friend Bob outlines how Human Rights Commissions (HRCs) are being abused.
Last week’s Cross Country Checkup episode titled “Are There Legitimate Limits to Free Expression?” also delves into the role of the HRCs as part of a larger discussion on free speech. You can find a nice introduction to this topic in the episode’s introduction (text) or you can download the whole show (MP3). Several people close to this issue are interviewed, as well as callers from across the country.
There was also a quote from someone (unfortunately I don’t remember who) which sums the issue up nicely (paraphrasing):
You have a right to not be exposed to hate but you don’t have a right to not be offended.
But he doesn’t think legislators are by and large crooks who are taking bribes in exchange for votes. In fact, he says we may have the least bribery in our nation’s history.
But the money still corrupts in a number of ways. For instance, legislators, like scientists funded by drug companies, internalize their supporters’ interests.
“Money corrupts the process of reasoning”
“They get a sixth sense of how what they do might affect how they raise money.”
Let’s start with a public version control system that lets all of society see who is adding what to legislation and other important government documents.
Coffee debuted in the late 17th century in Oxford, England — leading to rowdy coffee houses, jittery arguments and even an attempt by King Charles II to ban the substance for inspiring seditious behavior.
The other consequence: the Enlightenment.
I’ve often heard coffee houses mentioned as meeting places in various historical and revolutionary contexts. The idea that the coffee was to some extent the source of the revolutionary ideas is new to me.
The switch for a mesh topology in society has led to easy access for everyone to Free software created by open source communities. The result is an emerging approach which is rapidly spreading for smaller software projects and in my view is the future of all software acquisition. The emerging approach is an adoption-led market.
In this approach, developers select from available Free software and try the software that fits best in their proposed application. They develop prototypes, switch packages as they find benefits and problems and finally create a deployable solution to their business problem. At that final point, assuming the application is sufficiently critical to the business to make it worthwhile to do so, they seek out vendors to provide support, services (like defect resolution) and more. Adoption-led users are not all customers; they only become so when they find a vendor with value to offer.
Saving carbon emissions with HTTP caching.
Assume a fully loaded server uses 100W. Six servers, year-round, consume 5,000 kilowatt-hours per year or approximately 500-1000 pounds of CO2 emissions.
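The energy figure in the quote is easy to check. The sketch below only reproduces the kilowatt-hour arithmetic. It does not attempt the CO2 conversion because that depends on an emissions factor the quote does not give.

```python
HOURS_PER_YEAR = 24 * 365


def annual_kwh(watts_per_server, server_count):
    """Energy used by servers running flat out all year, in kilowatt-hours."""
    return watts_per_server * server_count * HOURS_PER_YEAR / 1000.0


print(annual_kwh(100, 6))   # six 100 W servers
```

Six servers at 100 W come to 5,256 kWh per year, in line with the quoted figure of roughly 5,000.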
No quote but it’s still worth reading if you are interested in web development.
No One Likes a Bully: The IIPA and Canada
An interesting description of where some of the pressure relating to Canada’s copyright reform is coming from.