Availability · Rolling Thunder Security Codex

Definition

Availability is the property that information and the systems that handle it are accessible to authorized parties when they need them.

The definition has three pieces. The system must be accessible, which usually means responsive within an acceptable time. It must serve authorized parties, which keeps availability distinct from a system that is technically reachable but only by attackers. And accessibility must hold when needed, which acknowledges that some downtime is acceptable and some is catastrophic depending on context.

Availability is the most context-dependent of the three pillars. An hour of downtime for a static marketing site is annoying. An hour of downtime for an emergency dispatch system kills people. The same technical event has wildly different impact depending on what the system is for.

The Math of Uptime

Availability is usually quantified as a percentage of total time the system is operating correctly. The industry shorthand is "nines": the count of nines in the percentage, so 99.9% is "three nines" and 99.999% is "five nines".

Target	Percentage	Downtime per year	Downtime per month
Two nines	99%	3 days 15 hours	7 hours 18 min
Three nines	99.9%	8 hours 45 min	43 min 49 sec
Four nines	99.99%	52 min 35 sec	4 min 22 sec
Five nines	99.999%	5 min 15 sec	26 sec
Six nines	99.9999%	31 sec	2.6 sec
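The figures in the table can be reproduced with a few lines of arithmetic. The sketch below assumes an average year of 365.25 days and a month of one twelfth of that; `downtime_budget` and `SECONDS_PER_YEAR` are illustrative names, not part of any standard.

```python
from datetime import timedelta

# Assumption: an average year of 365.25 days (leap days included).
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def downtime_budget(availability_percent: float,
                    period_seconds: float = SECONDS_PER_YEAR) -> timedelta:
    """Return the downtime allowed in a period at a given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = 1 - availability_percent / 100
    return timedelta(seconds=period_seconds * unavailable_fraction)


# Print yearly and monthly budgets for two through six nines.
for target in (99.0, 99.9, 99.99, 99.999, 99.9999):
    yearly = downtime_budget(target)
    monthly = downtime_budget(target, SECONDS_PER_YEAR / 12)
    print(f"{target}%: {yearly} per year, {monthly} per month")
```

Small differences from published tables usually come down to whether a year is counted as 365 or 365.25 days.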

As a rule of thumb, each additional nine costs roughly an order of magnitude more to engineer. Two nines is easy; five nines requires geographically distributed failover, hot-standby databases, and a 24/7 operations team. Most commercial services target three to four nines because the cost of additional nines exceeds the value lost to downtime at that level. Critical infrastructure, life-safety systems, and high-frequency trading systems target higher.

A quick warning about "five nines"

"We aim for five nines" is a phrase to be skeptical of. Five nines means roughly five minutes of downtime per year, including unplanned outages, planned maintenance, software upgrades, network blips, certificate renewals, and DNS propagation. Real-world systems with measured five-nines availability are rare and expensive. Marketing five nines is cheap and common.

Mechanisms

Availability is engineered by anticipating failures and arranging for the system to continue working when they happen. The toolkit has five families.

Redundancy means having more than one of everything that can fail. Two power supplies, two network paths, two database replicas, two data centers. The principle is that the probability of two independent components failing at the same time is much lower than the probability of either one failing alone. Redundancy works only when the failures are actually independent, which is why a single shared power feed or a single shared software bug can take down a redundant pair simultaneously.
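The independence caveat is easy to see in numbers. The following is a minimal sketch, assuming each copy fails independently with the same availability; the function names and the example figures are hypothetical.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of a redundant group, assuming independent failures.

    The group is down only when every copy is down at the same time.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one copy")
    return 1.0 - (1.0 - component_availability) ** copies


def with_shared_dependency(component_availability: float, copies: int,
                           shared_availability: float) -> float:
    """Same group, but every copy also needs one shared dependency
    (for example a single power feed), which caps the result."""
    return shared_availability * parallel_availability(component_availability, copies)


# Hypothetical figures: two 99% servers, then the same pair behind a 99.9% power feed.
print(parallel_availability(0.99, 2))
print(with_shared_dependency(0.99, 2, 0.999))
```

Under these assumptions the pair on its own reaches about four nines, but the shared feed pulls the combination back below three: the group can never be more available than the thing all copies depend on.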

Failover is the automatic transfer of work from a failed component to its redundant counterpart. Designing good failover is harder than it sounds. The system must detect failure quickly enough to matter, switch over without losing state, and avoid flapping when the original component intermittently recovers. Database failover is particularly difficult because two replicas accepting writes simultaneously is worse than one replica being down.
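One common way to handle both quick detection and flapping is to require several consecutive failed health checks before failing over, and a longer run of passed checks before failing back. The sketch below illustrates that idea only; the class name and thresholds are hypothetical, and it deliberately ignores state transfer and write fencing, which are the hard parts of database failover.

```python
from dataclasses import dataclass


@dataclass
class FailoverDetector:
    """Decides which node should serve, based on consecutive health-check results.

    The thresholds are hypothetical; tune them to the check interval in use.
    """
    failures_to_fail_over: int = 3    # consecutive failed checks before switching away
    successes_to_fail_back: int = 10  # consecutive passed checks before switching back
    active: str = "primary"
    _failures: int = 0
    _successes: int = 0

    def observe(self, primary_healthy: bool) -> str:
        """Record one health check of the primary and return the node that should serve."""
        if primary_healthy:
            self._failures = 0
            self._successes += 1
        else:
            self._successes = 0
            self._failures += 1

        if self.active == "primary" and self._failures >= self.failures_to_fail_over:
            self.active = "standby"
        elif self.active == "standby" and self._successes >= self.successes_to_fail_back:
            self.active = "primary"
        return self.active
```

Making the fail-back threshold larger than the fail-over threshold is the hysteresis: a primary that recovers intermittently keeps resetting its success count and never regains traffic until it has been stable for a while.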

Capacity planning protects availability against demand spikes rather than component failures. Black Friday traffic, viral content, news events, and successful product launches can all overwhelm systems sized for normal load. Auto-scaling, load balancing, and capacity reservation are the modern tools; over-provisioning is the older one.
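A back-of-the-envelope sizing calculation shows how peak load, headroom, and spare capacity interact. This is a rough sketch under simple assumptions (identical instances, linear scaling); the function name, the 60% utilization target, and the traffic numbers are all hypothetical.

```python
import math


def instances_needed(peak_requests_per_second: float,
                     per_instance_capacity: float,
                     target_utilization: float = 0.6,
                     spare_instances: int = 1) -> int:
    """Instances required to serve a peak while keeping headroom.

    target_utilization: fraction of each instance's capacity to plan for.
    spare_instances: extra instances so one can fail during the peak (N+1).
    """
    if per_instance_capacity <= 0:
        raise ValueError("per-instance capacity must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target utilization must be in (0, 1]")
    usable_capacity = per_instance_capacity * target_utilization
    return math.ceil(peak_requests_per_second / usable_capacity) + spare_instances


# Hypothetical launch-day peak of 3,000 requests/s on instances that handle 500 each.
print(instances_needed(3000, 500))
```

Auto-scaling automates the same calculation continuously; over-provisioning runs it once against the worst expected peak.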

Backups protect availability against data loss, not service interruption. A backup does not keep the system running; it lets you rebuild the system after it stops running. The three-two-one rule (three copies, on two different media, with one offsite) is the standard. The 2017 GitLab outage that lost six hours of production data happened because their five separate backup mechanisms had all silently failed; the team only discovered this when they tried to use them.
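The practical takeaway is that a backup is only proven by restoring it. One simple check, sketched below, is to restore into a scratch directory and compare each file's hash against a manifest recorded at backup time. The manifest format and the function names here are assumptions for illustration, not any particular backup tool's interface.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(manifest: dict[str, str], restored_dir: Path) -> list[str]:
    """Return the relative paths that are missing or differ from the manifest.

    manifest maps a relative path to the SHA-256 recorded when the backup was taken.
    """
    problems = []
    for relative_path, expected_hash in manifest.items():
        restored_file = restored_dir / relative_path
        if not restored_file.is_file() or sha256_of(restored_file) != expected_hash:
            problems.append(relative_path)
    return problems
```

Running a check like this on a schedule, and alerting when the returned list is non-empty or the job does not run at all, turns a silent backup failure into a visible one.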

DDoS mitigation protects availability against deliberate denial-of-service attacks. The defensive toolkit includes traffic scrubbing, rate limiting, geo-blocking, content delivery networks, anycast routing, and over-provisioned upstream capacity. The largest modern DDoS attacks reach the terabit-per-second range and can only be absorbed by providers operating at internet-backbone scale.
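Of these tools, rate limiting is the one most often implemented in application code. A token bucket is a common way to do it; the sketch below is a single-process illustration with hypothetical names, and it would not by itself stop a volumetric attack that saturates the network link before requests reach the application.

```python
import time


class TokenBucket:
    """Allow short bursts up to `burst` requests, refilled at `rate_per_second`."""

    def __init__(self, rate_per_second: float, burst: float, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = burst          # start full so an initial burst is allowed
        self.clock = clock           # injectable so the limiter can be tested
        self.last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend tokens if the request fits the budget."""
        now = self.clock()
        elapsed = now - self.last_refill
        self.last_refill = now
        # Refill for the time that has passed, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

In practice a service would keep one bucket per client key (IP address, API token) and assign a higher `cost` to expensive requests, which addresses the application-layer case where a few heavy queries do more damage than many cheap ones.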

Failure Modes

Denial-of-service attacks (DoS and DDoS). Deliberate flooding of a service with traffic or requests to make it unavailable to legitimate users. Volumetric attacks (raw bandwidth), protocol attacks (TCP state exhaustion), and application-layer attacks (slow HTTP, expensive queries) all fall under this umbrella.
Ransomware. Encryption of victim data with the decryption key withheld until payment. From a CIA perspective, ransomware is primarily an availability attack, although it usually involves confidentiality violations as well, because attackers typically exfiltrate data before encrypting it.
Hardware failure. Disks die, power supplies burn out, network cards fail, memory modules corrupt data. The mean time to failure for a single component is long; the mean time to the first failure among the components of a large fleet is short.
Software bugs and bad deployments. The single most common cause of outages at well-run companies. A code change passes every test and breaks production. The 2017 AWS S3 outage began with a mistyped input to an operator command that removed far more servers than intended, and it disrupted a large fraction of the public internet for about four hours.
Natural disasters and environmental events. Floods, fires, earthquakes, hurricanes, and the increasingly relevant heat waves that overwhelm data center cooling. Geographic distribution is the only effective defense.
Dependency failures. The system itself is fine, but a service it depends on (DNS, certificate authority, payment processor, identity provider) is not. The 2024 CrowdStrike outage took down an estimated 8.5 million Windows machines worldwide because of a faulty update to a single endpoint security agent.
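The fleet effect mentioned under hardware failure can be put in numbers. The sketch below assumes identical components that fail independently at a constant rate (an exponential model, which real hardware only approximates); the function names and the MTTF and fleet-size figures are hypothetical.

```python
import math


def fleet_time_to_first_failure(component_mttf_hours: float, fleet_size: int) -> float:
    """Expected hours until the first failure anywhere in the fleet.

    With independent, constant failure rates the rates add up, so the
    expected time to the first failure is the component MTTF divided by N.
    """
    if component_mttf_hours <= 0 or fleet_size < 1:
        raise ValueError("MTTF must be positive and fleet size at least 1")
    return component_mttf_hours / fleet_size


def probability_of_any_failure(component_mttf_hours: float, fleet_size: int,
                               window_hours: float) -> float:
    """Probability that at least one component fails within the window."""
    return 1.0 - math.exp(-fleet_size * window_hours / component_mttf_hours)


# Hypothetical: disks rated at 1,000,000 hours MTTF, in a fleet of 10,000.
print(fleet_time_to_first_failure(1_000_000, 10_000))   # hours between failures, on average
print(probability_of_any_failure(1_000_000, 10_000, 24))  # chance of a failure on a given day
```

Under these assumptions a component that individually lasts over a century on average produces a failure somewhere in the fleet every few days, which is why large operators treat hardware failure as routine rather than exceptional.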

Case Study: Colonial Pipeline, 2021

Colonial Pipeline operates the largest refined-petroleum pipeline in the United States, carrying roughly 100 million gallons of fuel per day from the Gulf Coast to the East Coast. In late April 2021, the DarkSide ransomware group breached Colonial's IT network through a single compromised VPN credential (the account had no multi-factor authentication enabled).

On May 6, DarkSide exfiltrated approximately 100 gigabytes of business data within a few hours and then began encrypting systems on Colonial's IT network. When Colonial's incident response team detected the intrusion early on May 7, they made a precautionary decision: they shut down the pipeline. The pipeline's operational technology had not been compromised. Colonial shut down anyway for two reasons: they could not be certain the OT network was isolated from the IT network, and the billing systems used to track which fuel went to which customer had been encrypted, so they could not operate the pipeline commercially.

The pipeline remained offline for six days. During that time:

Gas stations across the southeastern United States ran out of fuel. Panic buying compounded the actual supply disruption.
The Department of Transportation issued emergency declarations allowing alternative fuel transport.
The President invoked emergency powers to keep critical infrastructure running.
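Colonial paid a ransom of 75 bitcoin, worth approximately 4.4 million dollars at the time. The Department of Justice later seized 63.7 of those bitcoin; because the price had fallen, that was about 2.3 million dollars, or roughly half of the original dollar value.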

The technical compromise was small. The availability impact was enormous because the affected business systems were in the dependency path of physical fuel delivery. The lesson from Colonial is that availability in critical infrastructure is not just about uptime of the obvious systems. It is about understanding the full dependency graph, which can include billing systems, identity systems, monitoring systems, and any other support function the operational system cannot run without.
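Mapping that dependency graph is a plain graph traversal once the edges are written down. The sketch below uses a small hypothetical graph loosely modeled on the case study; the node names are illustrative and do not describe Colonial's real architecture.

```python
from collections import deque


def transitive_dependencies(graph: dict[str, list[str]], service: str) -> set[str]:
    """Return everything `service` needs, directly or indirectly (breadth-first search)."""
    seen: set[str] = set()
    queue = deque(graph.get(service, []))
    while queue:
        dependency = queue.popleft()
        if dependency in seen:
            continue  # already visited; also protects against cycles
        seen.add(dependency)
        queue.extend(graph.get(dependency, []))
    return seen


# Hypothetical graph: each key depends on the services listed for it.
DEPENDS_ON = {
    "fuel_delivery": ["pipeline_control", "billing"],
    "pipeline_control": ["ot_network"],
    "billing": ["it_network", "identity"],
    "identity": ["it_network"],
}

print(sorted(transitive_dependencies(DEPENDS_ON, "fuel_delivery")))
```

In this toy graph `pipeline_control` never touches `it_network`, yet `fuel_delivery` does through `billing`. That is the shape of the Colonial incident: the operational system was intact, but the service as a whole depended on something that was not.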

Pattern to remember

Single-factor authentication on a remote-access VPN remained a common configuration in 2021 and remains common in 2026. The Colonial breach happened because one credential was enough. If you can change exactly one control in an environment to most improve availability against ransomware, requiring MFA for all remote access is usually it.

The Hard Question

Availability includes planned downtime, and planned downtime is where most engineering teams quietly lose nines they thought they had budgeted. Patching schedules, certificate renewals, database migrations, hardware refreshes, and dependency upgrades all require time when the system is degraded or offline. A team that operates the system perfectly during incidents and then takes it down every Tuesday for two hours of maintenance is not running five nines, regardless of what their SLA document says. Two hours a week is about 104 hours a year, which works out to roughly 98.8% availability, short of even two nines.

The discipline that addresses this is sometimes called availability budgeting: count all downtime, planned or unplanned, against the target, and force engineering trade-offs that reduce planned downtime if you want headroom for unplanned incidents. Rolling upgrades, blue-green deployments, online schema migrations, and short-lived certificates with automated rotation are how mature teams claw back the planned-downtime budget.
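A budget of this kind is simple to track. The sketch below counts planned and unplanned downtime against one target over a 365-day year; the function names are hypothetical, and the example reuses the weekly two-hour maintenance window described above.

```python
from datetime import timedelta

YEAR = timedelta(days=365)  # assumption: a non-leap year as the accounting period


def measured_availability(planned: timedelta, unplanned: timedelta,
                          period: timedelta = YEAR) -> float:
    """Availability as a percentage, counting planned and unplanned downtime alike."""
    downtime = planned + unplanned
    if downtime > period:
        raise ValueError("downtime cannot exceed the period")
    return 100.0 * (1 - downtime / period)


def remaining_budget(target_percent: float, planned: timedelta, unplanned: timedelta,
                     period: timedelta = YEAR) -> timedelta:
    """Downtime still allowed in the period; negative means the target is already missed."""
    allowed = period * (1 - target_percent / 100)
    return allowed - (planned + unplanned)


# Two hours of maintenance every week, and no incidents at all.
weekly_maintenance = timedelta(hours=2) * 52
print(f"{measured_availability(weekly_maintenance, timedelta(0)):.2f}%")
print(remaining_budget(99.9, weekly_maintenance, timedelta(0)))
```

A negative remaining budget before any incident has happened is the signal that planned downtime, not incident response, is where the engineering effort should go.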

When you read an availability target, ask what it counts. The honest targets count everything.