Clueless in the cloud

What Amazon wrote:

We have noticed that one or more of your instances are running on a host degraded due to hardware failure. [...]

The host needs to undergo maintenance and will be taken down [...]

What Amazon might have written:

Thought you were clever, eh? Running that fancy Cassandra cluster? I bet you didn't expect your redundant copies on several Cassandra nodes to really be stored on the same crummy drive.

Nothing went wrong, in the event.

The cloud doesn't give very many clues about actual failure threats.

I like iron. When services run on iron, I'm not so clueless about what might fail together. I know what's located in the same room, and I know what data is actually stored on the same spinning drive.

Last night, Amazon discarded some packets sent from a real server (in the non-AWS part of a VPN) to a virtual host (in an AWS VPC, part of the same VPN). I couldn't tell why. The cloud provides no packet counts via SNMP, no graphs. My router in the VPC gives me zero clues about what goes wrong, when something does.

Two months ago, it became clear that Amazon's load balancing service, ELB, depends on EBS. When EBS broke, some load balancers did too. What customer could have guessed that dependency?

Amazon certainly does a fabulous job of keeping virtual hosts running. Maybe other cloud vendors do too; Amazon's the only one with which I have substantial experience. But I feel clueless. I have to trust Amazon too much and know far too little.