I'm sure folks who are interested in such things saw the reports that the Amazon.com cloud suffered a major failure across its East Coast region. Several major websites, including ours, were down or affected.
In simple terms, something was causing problems with data drive connections within a region of the Amazon cloud where our web server is located. Traffic simply couldn't get to the virtual computers. The data was fine. Nothing was lost, certainly not on our server, but it acted as if the computers couldn't find their boot disks.
For the AAVSO, this was the case from approximately 14:00 to 23:00 Eastern Time. From what I have been able to find out, they are still working on getting all the instances running and are not done yet, but communication between our disks and our server is now working, and I expect it to remain so.
I have gone through and verified our data integrity and the AID and the databases are good. The logs show exactly what I expected them to show.
Everyone was surprised at how long this took to fix. We considered moving the web server back to Mira as things dragged on, but I found it would have taken time for that change to propagate through the Net, and then more time for it to propagate back once things were fixed. We do have Mira set up as a "warm backup," and I will be looking hard at specific methods of implementing that over the next few days. Originally that option was set up in case we sustained a massive failure of our data or were vandalized. That was not the case here, and everyone was expecting the issue to be solved at any minute; we weren't in a situation where we knew the problem would take hours or days to solve, as we would have been if we had lost data. Perhaps that part of our response could have been handled better, and I'll be looking at that as well.
In the end we apologize - *I* apologize - for the downtime. We really have done pretty well in that department, and you can expect a similar level of service in the future.
Thanks very much, and, of course, please contact me with any questions.
AAVSO Astronomical Technologist
Thanks, Doc et al., for monitoring the situation and readying a contingency plan. I've been there and done that in my FAA days: trying to decide just when to pull the trigger on a backup plan that will provide some level of service, but might make things harder and take longer in the end than just waiting for the primary system to come back up. Get some well-deserved rest.
The decision to move to the cloud is still a good one, it's the way of the future and this minor bump is no big deal.
A couple of months ago, when it was announced that our website would move to the cloud, I expressed some reservations about that. At the time, while I had some concerns about reliability and safety, I did not think it was a really big issue. How surprised I was when this service failure occurred yesterday! Not having an immediate warm fail-over to our in-house webserver added to the confusion; the only way I found out why the site was down was by emailing HQ!

This has made me re-evaluate the safety issues of the cloud. While this recent failure only caused an inconvenient loss of service for a day, what if a more serious failure had occurred? For example, what if something had corrupted all the data on the Amazon servers, so that different users' private data somehow got intermingled? What a nightmare that would have been to fix! Can the cloud guarantee such a catastrophic failure could never occur? Who would have thought we would lose website access just months after we shifted over?

This underscores the dangers of the "progress" of moving more computer services away from our own facilities to large, anonymous providers. Yes, maybe on paper you save some money by getting more bandwidth for fewer dollars. But how do you account for the potentially devastating costs of lost data and services in the event of failures? The cloud provider's paper guarantee of QOS then becomes worth only the material it's written on... So this is one of the big issues at stake: if you farm out services to others, you lose control of the situation yourself. I bet Doc was very frustrated, with all the emails coming in asking "what's up?", yet little definitive information coming back from Amazon about when things would be back to normal. (Service providers are notorious for keeping clients in the dark about progress. Just remember all the horror stories of people stuck on airplanes for hours and days!) I believe we all need to take a much closer look at cloud vs. in-house.
And in particular, at setting up a safe warm fail-over in case of cloud breakdowns. One idea would be to keep our servers up on the internet all the time, but reach them via a slightly different name such as "www2.aavso.org", or directly by our own IP address (which can be typed into any browser with no need for DNS to propagate the information). Let's really think about this! Mike LMK
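For what it's worth, the direct-IP part of this idea works because a browser skips the DNS lookup entirely when the URL already contains a numeric address. A quick sketch of that distinction (the backup address below is just a placeholder from the RFC 5737 documentation range, not our server's real IP):

```python
import ipaddress
from urllib.parse import urlparse

def needs_dns(url: str) -> bool:
    """Return True if fetching this URL requires a DNS lookup first."""
    host = urlparse(url).hostname
    try:
        ipaddress.ip_address(host)  # parses only literal IP addresses
        return False                # numeric address: browser connects directly
    except ValueError:
        return True                 # a hostname: must be resolved first

print(needs_dns("http://www.aavso.org/"))  # True  - name must be resolved
print(needs_dns("http://203.0.113.10/"))   # False - literal IP bypasses DNS
```

So a bookmarked IP would keep working even while DNS changes are still propagating - the catch being that it breaks silently if the server's address ever changes.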
We've had plenty of failures at HQ over the years that resulted in outages worse than this. Some were caused by equipment failures, some by our upstream providers failing, some by lightning, one by a truck backing into our power pole, and one by an unauthorized intrusion.
None of those instances would have been a problem on the cloud. Of them, only the intrusion would still have occurred, and it would have been resolved almost instantly thanks to our ability to restore a recent backup with the click of a button. It is true that yesterday's outage would not have occurred if we were hosted on site. But overall, the history of our servers suggests that the cloud is much safer and more stable than hosting things ourselves. No one said the cloud would be perfect; we only said it would be better, and one outage is too small a sample to judge that claim. And that doesn't even touch on the myriad other benefits we get from the cloud beyond stability. I think it would be foolish to make a decision this important in the immediate aftermath of a single outage. We need to think in long-term, big-picture terms.
The one recommendation I do have ties into Mike's point about the lack of notification. Part of the original cloud plan was to keep our on-site servers available as an immediate backup. Doc decided not to use that option for two reasons: first, it would take time for the DNS change to propagate; and second, he did not know whether the outage would be over by the time staff made the switch.
My advice is to change the TTL on our DNS record to something very short (say, 15 minutes). That would make any DNS change propagate worldwide very quickly. Then, when something like this happens, we update the DNS to point at our own server and have it do nothing but show a simple web page with a status message. That is, the entire web server doesn't need to be functioning: just a message telling people what's up.
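That status-page stub really can be tiny. Here's a minimal sketch using nothing but Python's standard library - the port and the message text are placeholders, not a description of our actual setup:

```python
# Minimal "status page only" fallback: instead of running the full web
# server, the on-site machine answers every request with one page
# explaining the outage. Message text and port are placeholders.
from http.server import BaseHTTPRequestHandler, HTTPServer

STATUS = (b"<html><body><h1>AAVSO status</h1>"
          b"<p>Our cloud host is experiencing an outage. "
          b"All data is safe; service will return shortly.</p></body></html>")

class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Every path returns the same status message.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(STATUS)))
        self.end_headers()
        self.wfile.write(STATUS)

    def log_message(self, *args):
        pass  # keep the console quiet

# To run the fallback (port 80 would need privileges):
#     HTTPServer(("", 8080), StatusHandler).serve_forever()
```

With the TTL already lowered, flipping the DNS record would put this page in front of visitors within minutes - which would have answered most of yesterday's "what's up?" emails automatically.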