Slow load times ahead

Dreamhost has been limping badly today, and will apparently be slow for a few more days. Even their status page has temporarily been replaced with a plain HTML version. So you may be experiencing very slow load times here for a while.

We have been experiencing Network Problems Today, these are the same problems that have actually been happening since we first reported problems with our network. Unfortunately these problems have gotten worse today and are causing a majority of the downtime and slowness issues you are reporting today. These problems, and our attempts at fixing them, have been an ongoing effort. The maintenance Monday Night is a big step in resolving these problems. We are working on the network and all servers having problems, but the real solution will be the Monday Night maintenance. Sorry about the downtime, we hope to have this all resolved soon.
- Sep. 8, 2006 5:45 p.m.

Hello India, we’re still here…

…but other sites are apparently blocked.

There are a fair number of readers here from India, where some ISPs have started blocking many blogs, including all of Typepad, Blogspot, and Geocities. So you might have thought this site was also blocked if you came by yesterday, since you would have gotten something like “Connection refused” or a similar error message.

Fortunately / unfortunately, it’s just Dreamhost having some hardware and network problems, which took down many of their clients for several hours yesterday. Their network is still behaving badly today.

Random Dreamhost issues

In case you were wondering where the site went, the past 24 hours or so has been a day of random issues with Dreamhost.

Yesterday afternoon they were having connectivity problems, which took all their customers offline for a few hours.

This morning, I discovered that this site was running, but all Dreamhost sites were unreachable via SBC/PacBell here in the Bay Area. From the logs it looks like Comcast and a variety of overseas networks were still able to connect. The Google proxy hack mentioned this morning on O’Reilly provided another quick path for looking at the web site from a different network to verify that connectivity was still working, at least from the Google data center.

A couple of hours ago I got what I thought was a response to my e-mail regarding the network connectivity problem, but which turned out to be one of the CPU utilization warning letters that have been going out lately:

[your] CPU minute usage for today is 56.15. The daily limit is 60 CPU minutes. You will continue to receive these notifications as long as your resource consumption is over 50 CPU minutes.

A little mysterious, since traffic to the site was off because of the network outage, and spam traffic hasn’t spiked either.

There aren’t any resource utilization logs posted yet. I wonder if the flaky networking over the past day contributed to the high CPU use by leaving a lot of processes around waiting for I/O that was coming in slowly or never.

Anyway, the site seems to be running normally as of this afternoon (or at least, I can get to it now).

See also: Dreamhost load average = 1004.16?

Dreamhost load average = 1004.16?

You may have noticed that this site has been slow at times lately.

It’s currently running on a shared hosting account at Dreamhost. Most of the time the load average is pretty reasonable, around 2 to 6, but in the past week or so I’ve seen it spike above 50 or even 100 a few times.

This morning I’m seeing the highest load average yet, and the site is effectively offline for the moment. The server is still keeping the connections open, but nothing is actually coming back.

[lira]$ uptime
11:21:52 up 19 days, 22:25, 10 users, load average: 583.32, 695.46, 271.13
[lira]$ uptime
11:22:53 up 19 days, 22:26, 9 users, load average: 1004.16, 957.32, 387.55
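For anyone who wants to watch for spikes like this automatically, here is a minimal sketch that pulls the three load averages out of a line of `uptime` output. It assumes the output format shown above. The function names and the threshold of 50 are my own illustrative choices, not anything Dreamhost provides.

```python
import re

# Matches the three load averages at the end of an `uptime` line,
# e.g. "load average: 1004.16, 957.32, 387.55".
LOAD_RE = re.compile(r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)")


def parse_load(line):
    """Return the (1-min, 5-min, 15-min) load averages as floats."""
    match = LOAD_RE.search(line)
    if match is None:
        raise ValueError(f"no load average found in: {line!r}")
    return tuple(float(value) for value in match.groups())


def is_overloaded(line, threshold=50.0):
    """True if the 1-minute load average exceeds the (hypothetical) threshold."""
    return parse_load(line)[0] > threshold


# Sample line copied from the second uptime run above.
sample = "11:22:53 up 19 days, 22:26, 9 users, load average: 1004.16, 957.32, 387.55"
print(parse_load(sample), is_overloaded(sample))
```

On a live server the input line could come from running `uptime` through `subprocess`, but on a box this overloaded even that may take a while to return.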

I’m not sure if this is related to recent software upgrades on their end or if there’s a new customer on this server with an application that’s behaving badly.

…30 minutes later…

Looks like they’ve rebooted the server. Still not looking too happy though.

12:02:05 up 12 min, 5 users, load average: 155.32, 66.01, 29.54

Temporary Fix for Referrer Spam

I have a temporary fix for blocking the referrer spam that started a couple of weeks ago. The volume of referrer spam here has steadily been increasing since then, and the number of source IP addresses is also continuing to expand.

The main problem I’m having is that the conditional rewrite rules I want to use in .htaccess don’t seem to be working on my current Wordpress setup at Dreamhost. Regular rewrites seem to work fine, but none of the conditional ones are working for me. The initial IP blocklists stopped most of it for a few days, but new spam IP addresses are appearing more quickly now than a few days ago.

In the meantime, the Dreamhost support knowledge base suggests using SetEnvIfNoCase to define patterns to be blocked. This does work at Dreamhost, and I’ve blocked most of the current spam run with the following:

# Flag any request whose Referer header matches one of the spam domains
SetEnvIfNoCase Referer ".*\.get\.to" BadReferrer
SetEnvIfNoCase Referer ".*\.drop\.to" BadReferrer
SetEnvIfNoCase Referer ".*\.hey\.to" BadReferrer
SetEnvIfNoCase Referer ".*\.go\.to" BadReferrer
SetEnvIfNoCase Referer ".*\.dive\.to" BadReferrer
SetEnvIfNoCase Referer ".*\.switch\.to" BadReferrer
SetEnvIfNoCase Referer ".*\.come\.to" BadReferrer
SetEnvIfNoCase Referer ".*\.mysite\.de" BadReferrer

# Refuse flagged requests; everything else is allowed by default
order deny,allow
deny from env=BadReferrer
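To check what these patterns will and won’t catch before putting them on a live site, the matching can be approximated offline. This is only a rough sketch: Python’s `re` module is not the regex engine Apache uses, though for patterns this simple I’d expect the two to agree. The function name and the example referrer URLs are made up for illustration.

```python
import re

# Patterns copied from the SetEnvIfNoCase lines above.
BAD_REFERRER_PATTERNS = [
    r".*\.get\.to",
    r".*\.drop\.to",
    r".*\.hey\.to",
    r".*\.go\.to",
    r".*\.dive\.to",
    r".*\.switch\.to",
    r".*\.come\.to",
    r".*\.mysite\.de",
]

# SetEnvIfNoCase ignores case and does not anchor the pattern, so use
# IGNORECASE here and search() rather than fullmatch() below.
COMPILED = [re.compile(pattern, re.IGNORECASE) for pattern in BAD_REFERRER_PATTERNS]


def is_bad_referrer(referer):
    """Return True if any blocklist pattern is found anywhere in the Referer."""
    return any(pattern.search(referer) for pattern in COMPILED)


# Hypothetical referrer values, not taken from real logs.
print(is_bad_referrer("http://cheap-pills.get.to/"))           # subdomain of get.to
print(is_bad_referrer("HTTP://SPAM.HEY.TO/"))                  # case is ignored
print(is_bad_referrer("http://www.google.com/search?q=test"))  # ordinary referrer
print(is_bad_referrer("http://get.to/"))                       # bare domain, no leading dot
```

Two things show up when trying this. Each pattern requires a literal dot before the domain, so a bare `http://get.to/` referrer slips through. And because the patterns are unanchored, a string like `.go.to` anywhere in the URL, including the path, also triggers a block.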

Combined with the IP blocklist from a few days ago, this has made a huge reduction in the outgoing bandwidth. For a while the spam was all HEAD requests, but lately they have all been GET requests on full pages. A few days ago it passed 10,000 spam requests for the day.

Today it looks like we’ll end up around 35,000 blocked referrer spam requests.
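A daily tally like this can be produced from the access log. The sketch below assumes the log is in Apache’s combined format and that denied requests are logged with a 403 status, which is what `deny from` returns. The function name and the sample log lines are made up for illustration, and the IP addresses are from the documentation range.

```python
import re
from collections import Counter

# Combined log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) \S+ "([^"]*)"')


def count_blocked(lines):
    """Count 403 responses that carried a Referer, grouped by source IP."""
    blocked = 0
    sources = Counter()
    for line in lines:
        match = LOG_RE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        host, status, referer = match.groups()
        if status == "403" and referer not in ("", "-"):
            blocked += 1
            sources[host] += 1
    return blocked, sources


# Made-up sample lines for illustration only.
sample_log = [
    '192.0.2.10 - - [01/Jan/2006:10:00:00 -0800] "GET / HTTP/1.1" 403 210 "http://spam.get.to/" "Mozilla/4.0"',
    '192.0.2.11 - - [01/Jan/2006:10:00:01 -0800] "GET / HTTP/1.1" 403 210 "http://junk.hey.to/" "Mozilla/4.0"',
    '192.0.2.12 - - [01/Jan/2006:10:00:02 -0800] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/4.0"',
]

blocked, sources = count_blocked(sample_log)
print(blocked, len(sources))
```

Counting distinct source IPs alongside the total is a cheap way to see whether the spam run is still spreading to new addresses.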

I’m a little busy lately so I haven’t tried chasing down the reason the conditional rewrites aren’t working. In the meantime, this is keeping the spam overhead down a bit.

See also: Blocking Referrer Spam, Referrer Spammer IP Blocklist

More from Dreamhost and Media Temple on the L.A. Power Outage

I’ve generally been satisfied with hosting services at Dreamhost, which provide a lot of capabilities at a modest cost. However, yesterday’s power outage in Los Angeles shut down Dreamhost and a number of other sites in data centers that were supposed to have hardened power and redundant network connections. An obvious question is: what happened to the backup power?

One of the main points of using a hosting or colocation service is having better connectivity, environmental controls, and power. In Dreamhost’s case, the latter would be the backup UPS and diesel generators which are supposed to start up when the power grid goes offline.

There is a series of posts on the Dreamhost blog on yesterday’s outage. Looks like the upstream network providers (Level 3, Global Crossing, and Mzima) failed while DH still had power from their backup system, then a few minutes later the backup power failed.

Shortly thereafter the entire building where our data center is located’s back-up generators (there are SUPPOSED to be four) stopped working, and all power was gone. We were able to get back into our data center then, and it was like the day after tomorrow or something. Really creepy just walking through rows of dark, quiet, dead server after dark, quiet, dead server.

Dreamhost is apparently housed at Media Temple’s Garland Building, which is equipped with networking, HVAC, and backup power for data centers. Here’s what Media Temple had to say to their customers:

On Monday, September 12, 2005, at approximately 12:35 p.m., the building experienced a total loss of electrical power from the DWP on their primary grid. At this time, the building generators started and began supplying adequate power to the tenants.
At approximately 12:55 p.m., the building power was partially restored by the DWP. At approximately 1:05 p.m., the building experienced another total power failure from the DWP.
During this period of 12:55 p.m. to 1:05 p.m., two of the five generators failed. The remaining three generators were unable to sustain the power requirements of the building causing the emergency electrical systems to transfer into a “load shedding mode” and the building’s UPS system to turn itself off, thus preventing permanent UPS and related equipment damage.

Media Temple’s own site has some info on the power outage:

Question: Doesn’t (mt) Media Temple have UPS Backup? Answer:
Yes. (mt) Media Temple and the Garland Building (our LA data center building location) has one of the most sophisticated, redundant power systems in all of Los Angeles. As a matter of fact there are few buildings that rival the amount of redundancy and investment which has been put into the Garland Building’s backup power systems. As explained however today; the building’s power failure was caused by “human error” similar to the larger, citywide issue.
When the citywide power outage occurred, the Garland Building’s main UPS and backup power generator systems activated successfully and provided power to the building during the early stages of the city’s power outage. After approximately 30 minutes, the Garland Building engineers began to “further assist” the power situation which resulted in some form of “human error” on their part and interrupted the backup power service altogether thus causing the entire building to loose power. All tenants and data center occupants were instructed to immediately evacuate the building as the building engineers continued to “work” on the issue. After approximately 2 hours, building engineers restored power to the building near simultaneous to the rest of the Los Angeles power deprived areas. Building engineers state they are acutely aware of what caused the Garland Building power failure and have remedy to prevent such failures in the future.

So, we have a large portion of the Los Angeles power grid accidentally shut down by a small human error, followed by a very sophisticated backup power system accidentally shut down by another small human error. It’s remarkably common for these sorts of problems to come up, since most facilities don’t run in their “backup” modes often enough to test them regularly. A few years back, we had a cage full of servers in a brand new colo facility, which successfully switched to generator power during a rolling power blackout, but then shut down a while later due to heat buildup since the building HVAC wasn’t on backup power yet.

Gives you greater appreciation for this guy in New Orleans, who kept the DirectNIC data center running through Hurricane Katrina and its aftermath.

Don’t cross those wires – Los Angeles power outage

The electric power went off in Los Angeles during lunch time today, which is why this site (among many others) has been offline. I’ll put together some notes here and post when the server comes back up.

The electricity was knocked out shortly before 1 p.m. after two power surges, and outages were reported from downtown to the coast and north into the San Fernando Valley, an area encompassing hundreds of thousands of residents and thousands of businesses.

Heavy power usage can lead to blackouts. But the weather in Los Angeles was not unsually hot Monday.

The emergency status page at Dreamhost shows:

Serious electricity problems in the Los Angeles area have taken our entire network off-line.
We have no information about when power will be restored or what the status of our servers or network will be when power is restored.
Last Updated: Monday, 12-Sep-2005 13:39:23 PDT

An update from NBC4 news in Los Angeles at 2:17pm PDT: (they obviously have working backup power)

The Office of Emergency Management said power officials told the agency that the outage was linked to the accidental cutting of a cable.
A major portion of the San Fernando Valley reported outages, but power was being restored in some areas at just before 2 p.m.

Update from Dreamhost at 3:34pm PDT:

We now expect to begin turning all of our equipment back in within 45 to 90 minutes. Barring any complications, your websites and email will be fully operational again shortly afterwards.
Last Updated: Monday, 12-Sep-2005 15:34:47 PDT

Another update from NBC4 at 3:54pm PDT:

Los Angeles Department of Water and Power officials said the outage was linked to human error at a receiving station near Burbank. Workers connected the wrong wires, causing a surge of power that led to shutdowns at three power generating stations, according to officials.

Update from Dreamhost at 4:06pm PDT:

Power to our data center has now been restored and our network is back up. Our servers and equipment are being powered on gradually to avoid potentially damaging power spikes. Barring any complications, your websites and email will be fully operational again very soon.
Last Updated: Monday, 12-Sep-2005 16:06:03 PDT

…and the site is back online at 4:37pm PDT…

Update 09-12-2005 18:15 PDT: More from the L.A. Times, CNN, BoingBoing