System Availability & Downtime: SLA, Metrics, and How to Calculate Uptime

Published on June 3, 2026 · Niwo

What is System Availability?

System availability measures the proportion of time a service, application, or infrastructure component remains operational and accessible to its users. It is one of the most important reliability metrics in IT operations and the foundation of most Service Level Agreements (SLAs).

The formal definition is straightforward:

Availability (%) = Uptime / (Uptime + Downtime) × 100

Where:

  • Uptime = total time the system was operational during the measurement period
  • Downtime = total time the system was unavailable during the measurement period
  • The result is expressed as a percentage between 0% and 100%

Worked Example

A web application logs 43 minutes of downtime over a 30-day month:

Availability = (43,200 min − 43 min) / 43,200 min × 100
            = 43,157 / 43,200 × 100
            = 99.90%

That single calculation determines whether you meet your SLA, whether you owe your customers a service credit, and whether your team celebrates or holds a postmortem.
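
The same arithmetic can be sketched in a few lines of Python. The function name `availability_pct` and the constant are illustrative choices, not part of any library, and the month is assumed to be exactly 30 days as in the worked example.

```python
def availability_pct(total_minutes: float, downtime_minutes: float) -> float:
    """Availability as a percentage of the measurement period."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime_minutes must be between 0 and total_minutes")
    uptime = total_minutes - downtime_minutes
    return uptime / total_minutes * 100


# Assumption: a 30-day month, as in the worked example.
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200

print(f"{availability_pct(MINUTES_PER_30_DAY_MONTH, 43):.2f}%")  # 99.90%
```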

Note: System availability and downtime are two sides of the same equation. You cannot improve one without understanding the other. For a deeper look at what downtime means and what causes it, read our guide: What is Downtime? Definition, Causes & How to Minimize It.


Availability vs Uptime vs Reliability

These three terms are often used interchangeably — but they measure fundamentally different things. Understanding the distinction is critical when negotiating SLAs or designing observability dashboards.

| Concept | Definition | How It’s Measured | Example |
|---|---|---|---|
| Availability | Whether the system is accessible and functional | Percentage of time operational | 99.99% available over a month |
| Uptime | The raw duration a system is running | Time (hours, minutes) | 719 hours of operation in a 30-day month |
| Reliability | The probability that the system performs correctly and without failure over a period | MTBF (hours), error rate, success rate | MTBF of 2,190 hours between failures |

A Practical Analogy

Think of a vending machine:

  • Uptime = the machine is powered on
  • Availability = you can insert money and select a product (the machine is reachable and responsive)
  • Reliability = the machine dispenses the correct product every time, without jamming

A system can have high uptime but low availability (a server that is running but returning 503 errors), or high availability but low reliability (a database that is reachable but returning corrupt data). True operational excellence requires all three.

[Figure: pie chart, “Common Causes of Downtime”. Chart values: Human Error 45, Power Outages 43, Hardware Failure 25, Software Bugs 20, Cyberattacks 15, Network Issues 12, Other 10.]

SLA, SLO, and SLI — The Three Pillars

Site Reliability Engineering (SRE) — the discipline pioneered by Google — defines three distinct layers for managing service quality. Understanding these layers transforms availability from a passive metric into an active engineering tool.

SLI (Service Level Indicator)

An SLI is the raw measurement that tells you whether a service is meeting expectations. Common SLIs include:

  • Latency: 95th percentile response time < 200 ms
  • Throughput: Requests per second
  • Error rate: Percentage of requests returning 5xx status codes
  • Availability: Fraction of successful probes or requests

Example: “The proportion of HTTP GET requests to the API that return a 200 OK within 500 ms.”
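
As a rough illustration of how such an SLI could be computed from raw request records, here is a minimal sketch. The `Request` record, its field names, and the sample data are hypothetical; a real system would read these values from access logs or metrics.

```python
from dataclasses import dataclass


@dataclass
class Request:
    method: str
    status: int
    latency_ms: float


def good_request_ratio(requests: list[Request], max_latency_ms: float = 500.0) -> float:
    """Fraction of GET requests that returned 200 within the latency threshold."""
    gets = [r for r in requests if r.method == "GET"]
    if not gets:
        raise ValueError("no GET requests in the sample")
    good = sum(1 for r in gets if r.status == 200 and r.latency_ms <= max_latency_ms)
    return good / len(gets)


# Hypothetical sample: two good GETs, one slow, one failed, one non-GET.
sample = [
    Request("GET", 200, 120.0),  # good
    Request("GET", 200, 740.0),  # too slow
    Request("GET", 503, 35.0),   # server error
    Request("GET", 200, 310.0),  # good
    Request("POST", 201, 90.0),  # ignored: not a GET
]
print(good_request_ratio(sample))  # 0.5
```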

SLO (Service Level Objective)

An SLO is your target: the threshold your team commits to meeting internally. It should be stricter than your SLA, so that you notice a problem before the contract is breached.

Example: “99.95% of API requests will complete successfully in under 500 ms, measured over a rolling 30-day window.”

SLOs give engineering teams a clear, data-driven reliability target. They answer the question: “How reliable does this service need to be?”

SLA (Service Level Agreement)

An SLA is the contractual commitment you make to your customers. It specifies:

  • The minimum availability you guarantee (e.g., 99.9%)
  • How availability is measured (measurement window, exclusions)
  • Penalties if you miss the target (service credits, refunds)

Example: “We guarantee 99.9% monthly uptime. If we fail, you receive a 10% service credit.”

The SLI → SLO → SLA Pipeline

SLI (measurement) → SLO (target) → SLA (contract)

A healthy SRE practice sets the SLA lower than the SLO to create a safety buffer. If your SLA is 99.9%, your SLO might be 99.95%. The 0.05-percentage-point gap (21.6 minutes in a 30-day month) is a cushion: you can miss the internal target without yet breaching the customer contract.

Error Budgets: The SRE Innovation

An error budget is the amount of downtime your SLO allows over a rolling window. It converts reliability from an abstract goal into a spendable resource.

Error Budget Formula:

Error Budget = (1 − SLO) × Total Time

Worked Example:

  • SLO = 99.95% availability over 30 days
  • Total time = 43,200 minutes
  • Error budget = (1 − 0.9995) × 43,200 = 0.0005 × 43,200 = 21.6 minutes per month

This means your team can afford 21.6 minutes of downtime per month before violating the SLO. The innovation? You can spend the error budget on risky deployments, feature experiments, or maintenance — as long as you don’t exhaust it. When the budget runs low, you shift focus to reliability work.
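
A minimal sketch of the error budget formula follows, including a helper that tracks how much budget is left. Both function names are illustrative, the SLO is passed as a fraction (0.9995 rather than 99.95), and the 9 minutes of downtime already spent is a made-up figure.

```python
def error_budget_minutes(slo: float, window_minutes: float) -> float:
    """Allowed downtime for an SLO given as a fraction, e.g. 0.9995."""
    if not 0 < slo <= 1:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1 - slo) * window_minutes


def budget_remaining(slo: float, window_minutes: float, downtime_so_far: float) -> float:
    """Minutes of budget left; a negative value means the SLO is already violated."""
    return error_budget_minutes(slo, window_minutes) - downtime_so_far


WINDOW = 30 * 24 * 60  # 30-day window in minutes

print(round(error_budget_minutes(0.9995, WINDOW), 1))   # 21.6
print(round(budget_remaining(0.9995, WINDOW, 9.0), 1))  # 12.6
```

A team could gate risky deployments on `budget_remaining` staying above some threshold of its own choosing.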

🔍 Use our Downtime Calculator to compute exact error budgets for any SLO or SLA level — instantly see how much downtime your targets allow.


How to Calculate Availability

Availability calculation depends on your measurement window and whether you are measuring a single component or a composite system.

Basic Formula (Single Component)

Availability = (Total Time − Downtime) / Total Time × 100

Composite Availability (Multi-Component Systems)

When a service depends on multiple components in series (every one must be up for the service to work) and their failures are independent, the overall availability is the product of the components’ availabilities.

System Availability = A₁ × A₂ × A₃ × ... × Aₙ

Worked Example — A Three-Tier Web Application:

| Component | Individual Availability | As a Fraction |
|---|---|---|
| Web server (Nginx) | 99.99% | A₁ = 0.9999 |
| Application server | 99.95% | A₂ = 0.9995 |
| Database (PostgreSQL) | 99.99% | A₃ = 0.9999 |

System Availability = 0.9999 × 0.9995 × 0.9999
                    = 0.9993
                    = 99.93%

Notice that 99.93% is lower than any individual component. This is why distributed systems require each component to be significantly more reliable than the overall target — a principle known as availability budgeting.
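
The product rule is short enough to sketch directly. This assumes the components sit in series and fail independently; `series_availability` and the tier names are illustrative.

```python
import math


def series_availability(components: list[float]) -> float:
    """Availability of a chain where every component must be up (fractions, not percent)."""
    if not components:
        raise ValueError("at least one component is required")
    return math.prod(components)


# The three-tier example from the text.
tiers = {"web": 0.9999, "app": 0.9995, "db": 0.9999}
system = series_availability(list(tiers.values()))

print(f"{system:.2%}")               # 99.93%
print(system < min(tiers.values()))  # True: lower than the weakest tier
```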

Availability Including Planned Maintenance

Some organizations calculate availability excluding planned downtime (maintenance windows). Others include everything. Always check which convention your SLA uses:

  • Unadjusted availability: Includes all downtime (planned + unplanned)
  • Adjusted availability: Excludes pre-approved maintenance windows

The difference can be substantial. A system with 99.95% unadjusted availability might appear as 99.99% after excluding planned maintenance.
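
To make the two conventions concrete, here is a small sketch. It assumes that "adjusted" means the maintenance window is removed from both the downtime and the measurement period; some contracts only remove it from the downtime, so check the wording of yours. The downtime figures are invented to reproduce the 99.95% versus 99.99% example.

```python
def unadjusted_availability(total_min: float, unplanned_min: float, planned_min: float) -> float:
    """All downtime counts, planned or not."""
    return (total_min - unplanned_min - planned_min) / total_min * 100


def adjusted_availability(total_min: float, unplanned_min: float, planned_min: float) -> float:
    """Planned maintenance is excluded from the measurement period (assumed convention)."""
    scheduled_min = total_min - planned_min
    return (scheduled_min - unplanned_min) / scheduled_min * 100


TOTAL = 30 * 24 * 60  # 30-day month in minutes
UNPLANNED = 4.32      # hypothetical unplanned outage minutes
PLANNED = 17.28       # hypothetical pre-approved maintenance minutes

print(f"{unadjusted_availability(TOTAL, UNPLANNED, PLANNED):.2f}%")  # 99.95%
print(f"{adjusted_availability(TOTAL, UNPLANNED, PLANNED):.2f}%")    # 99.99%
```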

Availability by SLA Level

The table below shows exactly how much downtime each SLA level permits across different time windows:

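| Availability Level | Nines | Daily Downtime | Weekly Downtime | Monthly Downtime (30 days) | Yearly Downtime (365 days) |
|---|---|---|---|---|---|
| 99% | 2 nines | 14 min 24 s | 1 h 40 min 48 s | 7 h 12 min | 3.65 days |
| 99.5% | - | 7 min 12 s | 50 min 24 s | 3 h 36 min | 1.83 days |
| 99.9% | 3 nines | 1 min 26 s | 10 min 5 s | 43 min 12 s | 8.76 hours |
| 99.95% | - | 43 s | 5 min 2 s | 21 min 36 s | 4.38 hours |
| 99.99% | 4 nines | 8.6 s | 1 min 0 s | 4 min 19 s | 52 min 34 s |
| 99.999% | 5 nines | 0.86 s | 6.0 s | 26 s | 5 min 15 s |
| 99.9999% | 6 nines | 0.09 s | 0.6 s | 2.6 s | 31.5 s |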
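
Downtime allowances like these can be derived for any availability level with a short script. The sketch below assumes a 30-day month and a 365-day year; published tables that assume a different month or year length will differ by a few seconds. The helper names are illustrative.

```python
WINDOW_SECONDS = {
    "day": 86_400,
    "week": 7 * 86_400,
    "month": 30 * 86_400,   # assumption: 30-day month
    "year": 365 * 86_400,   # assumption: 365-day year
}


def allowed_downtime_seconds(availability_pct: float, window_seconds: float) -> float:
    """Downtime permitted by an availability target over one window."""
    return (1 - availability_pct / 100) * window_seconds


def humanize(seconds: float) -> str:
    """Render a duration as e.g. '43 min 12 s'; sub-minute values keep one decimal."""
    if seconds < 60:
        return f"{seconds:.1f} s"
    minutes, secs = divmod(round(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    days, hours = divmod(hours, 24)
    parts = [(days, "d"), (hours, "h"), (minutes, "min"), (secs, "s")]
    return " ".join(f"{value} {unit}" for value, unit in parts if value)


print(humanize(allowed_downtime_seconds(99.9, WINDOW_SECONDS["month"])))  # 43 min 12 s
print(humanize(allowed_downtime_seconds(99.99, WINDOW_SECONDS["year"])))  # 52 min 34 s
```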

💡 Need to convert between availability percentage and downtime minutes for a specific time window? Our Downtime Calculator handles every SLA level from 99% to 99.9999% — pick any value and see the exact downtime in seconds, minutes, or hours.


MTBF, MTTR, and MTTA Explained

Availability percentages tell you how much downtime occurred, but they don’t tell you why or how well your team responded. That’s where MTBF, MTTR, and MTTA come in.

MTBF (Mean Time Between Failures)

MTBF measures reliability — the average operational time between consecutive failures. Higher MTBF = more reliable system.

MTBF = Total Operating Time / Number of Failures

Worked Example: A load balancer runs for 8,760 hours (1 year) and experiences 6 failures:

MTBF = 8,760 / 6 = 1,460 hours ≈ 60.8 days

This means the load balancer runs for roughly 60 days on average between failures.

How MTBF relates to availability:

Availability ≈ MTBF / (MTBF + MTTR)

If MTBF = 1,460 hours and MTTR = 2 hours:

Availability ≈ 1,460 / (1,460 + 2) = 99.86%
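
The two formulas above can be sketched as follows, reusing the load balancer figures from the worked example. The function names are illustrative, and the result is a long-run approximation, not a guarantee for any single month.

```python
def mtbf_hours(operating_hours: float, failures: int) -> float:
    """Mean time between failures."""
    if failures <= 0:
        raise ValueError("failures must be a positive integer")
    return operating_hours / failures


def steady_state_availability(mtbf: float, mttr: float) -> float:
    """Long-run availability as a fraction, from MTBF and MTTR in the same unit."""
    return mtbf / (mtbf + mttr)


mtbf = mtbf_hours(8_760, 6)
print(mtbf, round(mtbf / 24, 1))                         # 1460.0 60.8
print(f"{steady_state_availability(mtbf, 2):.2%}")      # 99.86%
```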

MTTR (Mean Time to Repair/Recover)

MTTR measures recoverability — how quickly your team can restore service after a failure. Lower MTTR = faster recovery.

MTTR = Total Downtime / Number of Failures

Worked Example: A database cluster experiences 4 outages totaling 10 hours of downtime:

MTTR = 10 / 4 = 2.5 hours

MTTR is the most actionable metric in this family because it directly reflects the quality of your incident response processes — runbooks, alerting, on-call readiness, and tooling.

Breaking Down MTTR

In practice, MTTR is composed of several sub-phases:

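| Phase | Description | Typical Duration |
|---|---|---|
| MTTD (Mean Time to Detect) | Time from failure start to alert firing | 30 s – 15 min |
| MTTA (Mean Time to Acknowledge) | Time from alert to human acknowledgement | 2 – 15 min |
| MTTT (Mean Time to Troubleshoot) | Time spent diagnosing the root cause | 15 – 60 min |
| MTTF (Mean Time to Fix) | Time to implement and verify the fix | 5 – 120 min |

Note: MTTF here stands for Mean Time to Fix. Elsewhere in reliability engineering, MTTF more commonly means Mean Time To Failure.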

Total MTTR ≈ MTTD + MTTA + MTTT + MTTF
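
If your incident tracker records a timestamp for each phase boundary, the breakdown can be computed directly. The sketch below uses a hypothetical `Incident` record with five timestamps and two invented incidents; the field names are assumptions, not the schema of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime       # failure begins
    detected: datetime      # alert fires
    acknowledged: datetime  # a human acknowledges the alert
    diagnosed: datetime     # root cause identified
    resolved: datetime      # fix implemented and verified


def _mean_minutes(deltas: list[timedelta]) -> float:
    return mean(d.total_seconds() for d in deltas) / 60


def phase_means(incidents: list[Incident]) -> dict[str, float]:
    """Mean duration of each phase, in minutes, plus the end-to-end MTTR."""
    return {
        "MTTD": _mean_minutes([i.detected - i.started for i in incidents]),
        "MTTA": _mean_minutes([i.acknowledged - i.detected for i in incidents]),
        "MTTT": _mean_minutes([i.diagnosed - i.acknowledged for i in incidents]),
        "MTTF": _mean_minutes([i.resolved - i.diagnosed for i in incidents]),
        "MTTR": _mean_minutes([i.resolved - i.started for i in incidents]),
    }


def make_incident(start: datetime, *offsets_min: int) -> Incident:
    """Build an incident from a start time and four cumulative minute offsets."""
    return Incident(*(start + timedelta(minutes=m) for m in (0, *offsets_min)))


# Hypothetical incidents: offsets are minutes since the failure started.
incidents = [
    make_incident(datetime(2026, 1, 10, 12, 0), 2, 5, 35, 50),
    make_incident(datetime(2026, 2, 3, 8, 30), 4, 9, 29, 54),
]
means = phase_means(incidents)
print(means)  # {'MTTD': 3.0, 'MTTA': 4.0, 'MTTT': 25.0, 'MTTF': 20.0, 'MTTR': 52.0}
```

Because the phases are contiguous, the four phase means add up to the MTTR (3 + 4 + 25 + 20 = 52 minutes here).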

MTTA (Mean Time to Acknowledge)

MTTA measures responsiveness — the gap between an alert firing and a human acknowledging it. It is a lagging indicator for on-call discipline and alert routing quality.

Target ranges:

  • Gold: < 2 minutes (critical production systems with 24/7 on-call)
  • Silver: < 5 minutes (business-critical services with pager rotation)
  • Bronze: < 15 minutes (standard services with email+Slack alerting)

The MTBF / MTTR / Availability Triangle

These three metrics form a closed relationship:

Availability = MTBF / (MTBF + MTTR)

| MTBF | MTTR | Availability | Classification |
|---|---|---|---|
| 30 days | 4 hours | 99.45% | Good: typical internal tool |
| 60 days | 2 hours | 99.86% | Great: standard production |
| 90 days | 30 min | 99.98% | Excellent: mission-critical |
| 365 days | 15 min | 99.997% | Elite: AWS/Azure tier |

Key insight: You can achieve the same availability by having an extremely reliable system (high MTBF) OR by recovering extremely fast (low MTTR). The best teams optimize both.
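
Rearranging the formula for MTTR gives MTTR = MTBF × (1 − A) / A, which shows how much recovery time a given failure rate leaves you. The sketch below applies it to a 99.9% target; the function name and the three MTBF values are illustrative.

```python
def max_mttr_hours(mtbf_hours: float, target_availability: float) -> float:
    """Largest MTTR that still meets the target, from A = MTBF / (MTBF + MTTR)."""
    if not 0 < target_availability < 1:
        raise ValueError("target_availability must be a fraction in (0, 1)")
    return mtbf_hours * (1 - target_availability) / target_availability


TARGET = 0.999  # three nines

for mtbf in (720, 2_160, 8_760):  # 30, 90, and 365 days between failures
    mttr_min = max_mttr_hours(mtbf, TARGET) * 60
    print(f"MTBF {mtbf} h -> MTTR <= {mttr_min:.1f} min")
# MTBF 720 h -> MTTR <= 43.2 min
# MTBF 2160 h -> MTTR <= 129.7 min
# MTBF 8760 h -> MTTR <= 526.1 min
```

The same 99.9% target is reachable with monthly failures fixed in under 45 minutes, or with yearly failures that take most of a working day to fix.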


The Nines of Availability

The “nines” shorthand (99.9%, 99.99%, 99.999%) is the standard vocabulary for availability in the industry. Each additional nine represents a tenfold reduction in allowable downtime.

What Each Nine Means

| Nines Level | The Math | Downtime/Year | Real-World Equivalent |
|---|---|---|---|
| 90% (one nine) | 1 failure in 10 | 36.5 days | Unacceptable for production |
| 99% (two nines) | 1 failure in 100 | 3.65 days | Dev/staging, internal tools |
| 99.9% (three nines) | 1 failure in 1,000 | 8.76 hours | Standard SaaS, B2B production |
| 99.99% (four nines) | 1 failure in 10,000 | 52.6 min | Enterprise, financial services |
| 99.999% (five nines) | 1 failure in 100,000 | 5.26 min | Telecom, emergency services |
| 99.9999% (six nines) | 1 failure in 1,000,000 | 31.5 s | Carrier-grade, air traffic control |

When You Need Each Level

  • Three nines (99.9%) is the baseline for any production internet service. It means your customers experience about one business day of downtime per year. For most SaaS products, this is acceptable — especially if downtime occurs during off-peak hours.

  • Four nines (99.99%) is the standard for enterprise infrastructure and financial platforms. Under one hour of yearly downtime requires redundant networking, multi-AZ deployments, and automated failover. You cannot achieve this with a single server.

  • Five nines (99.999%) is mission-critical territory. Less than 6 minutes of downtime per year demands active-active multi-region architecture, real-time replication, and a full incident response team on standby 24/7.

  • Six nines (99.9999%) is carrier-grade — think telephone exchanges, 911 dispatch, and satellite communications. It requires fully redundant, geographically distributed infrastructure with zero single points of failure and automated failover measured in milliseconds.

The Cost of Nines

Each additional nine is substantially more expensive to achieve. As rough, illustrative multipliers (actual costs depend heavily on architecture):

  • 99.9% → 99.99%: ~3–5× infrastructure cost increase
  • 99.99% → 99.999%: ~5–10× infrastructure cost increase
  • 99.999% → 99.9999%: ~10–20× infrastructure cost increase

Rule of thumb: Never pay for more nines than your business requires. A startup with $1M ARR does not need five nines. A payment processor with $1B in daily volume probably does.


High Availability Strategies

High availability (HA) is the engineering discipline of designing systems that remain operational despite component failures. HA is not a single technology — it is a set of patterns applied across the entire stack.

Redundancy

Eliminate single points of failure at every layer:

  • Hardware: N+1 power supplies, RAID storage, redundant NICs
  • Compute: Multiple application server instances behind a load balancer
  • Database: Primary-replica replication with automated failover
  • Network: Dual internet uplinks from different ISPs, redundant switches

Failover Mechanisms

| Type | How It Works | Failover Time | Cost |
|---|---|---|---|
| Active-Passive | One server handles traffic; standby replica takes over on failure | 30 s – 5 min | Medium (idle standby resources) |
| Active-Active | Both servers handle traffic simultaneously; loss of one reduces capacity | < 1 s (DNS/LB) | High (full capacity at both sites) |
| Automatic | Health checks detect failure and redirect traffic without human intervention | Seconds | Medium-High (requires orchestration) |
| Manual | On-call engineer triggers failover via runbook | 5 – 30 min | Low (human-dependent) |
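
The core of automatic active-passive failover is a health-check loop with a failure threshold. The toy sketch below illustrates that logic only: `FailoverRouter` is a hypothetical class, the health check is injected as a plain function, and failing back as soon as the primary recovers is a simplification that real orchestrators usually soften with a cool-down.

```python
from typing import Callable


class FailoverRouter:
    """Route to the primary until it fails `threshold` consecutive health checks."""

    def __init__(self, primary: str, standby: str,
                 check: Callable[[str], bool], threshold: int = 3) -> None:
        self.primary = primary
        self.standby = standby
        self.check = check          # returns True when the named backend is healthy
        self.threshold = threshold
        self.consecutive_failures = 0
        self.active = primary

    def tick(self) -> str:
        """Run one health-check cycle and return the backend that should get traffic."""
        if self.check(self.primary):
            self.consecutive_failures = 0
            self.active = self.primary  # simplification: immediate fail-back
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.threshold:
                self.active = self.standby
        return self.active


# Simulated health-check results for the primary: one success, three failures, recovery.
results = iter([True, False, False, False, True])
router = FailoverRouter("primary", "standby", check=lambda _name: next(results))

timeline = [router.tick() for _ in range(5)]
print(timeline)  # ['primary', 'primary', 'primary', 'standby', 'primary']
```

The threshold trades failover time against false positives: a lower value fails over faster but reacts to transient blips.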

Load Balancing

Distribute traffic across multiple healthy instances to prevent any single server from becoming a bottleneck — or a point of failure.

  • Layer 4 (TCP/UDP): HAProxy, Nginx Stream, AWS NLB
  • Layer 7 (HTTP/HTTPS): Nginx, Traefik, AWS ALB, Cloudflare
  • Global: DNS-based (Route53, Cloudflare DNS) with health-checked failover

Multi-Region Architecture

For four nines and above, you need geographic distribution:

  • Active-Passive multi-region: Primary region handles all traffic; secondary region is on standby with replicated data. Failover is manual or semi-automated.
  • Active-Active multi-region: Both regions handle live traffic simultaneously. Requires global load balancing, data replication with conflict resolution, and careful session management.

The Three Pillars of HA Operations

  1. Monitoring: Detect failures faster than your users do
  2. Automation: Eliminate human latency from the recovery path
  3. Testing: Validate your HA design through chaos engineering and game days

Without all three, your multi-region architecture is just a more expensive way to fail.


Measuring and Monitoring Availability

You cannot improve what you do not measure. Effective availability monitoring requires the right tools and the right approach to SLI collection.

Synthetic Monitoring

Synthetic monitoring simulates user behavior with pre-defined scripts or probes:

  • UptimeRobot: Free HTTP/HTTPS monitoring with 5-minute intervals and status pages
  • Pingdom: Advanced synthetic checks with multi-step transaction scripts (login → search → checkout)
  • Checkly: Developer-friendly, code-first synthetic monitoring with Playwright-based browser checks
  • Grafana Synthetic Monitoring: Open-source probes integrated with the Grafana ecosystem
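
At its simplest, a synthetic check is an HTTP request on a timer plus a running tally of results. The sketch below uses only the Python standard library and is not how any of the tools above are implemented; `probe` and `availability_from_probes` are illustrative names, and the probe history is made up.

```python
import http.client
import urllib.request


def probe(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Covers HTTP error statuses, DNS and connection failures, and timeouts.
        return False


def availability_from_probes(results: list[bool]) -> float:
    """Percentage of probes that succeeded."""
    if not results:
        raise ValueError("no probe results")
    return sum(results) / len(results) * 100


# Hypothetical history of four probe results, one of them failed.
history = [True, True, False, True]
print(availability_from_probes(history))  # 75.0
```

In practice a scheduler would call `probe` against a health endpoint at a fixed interval and append each result to the history. Probe-based availability is only as precise as the interval: a 5-minute check can miss an outage that lasts two minutes.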

Real-User Monitoring (RUM)

RUM captures actual user interactions to measure availability from the user’s perspective:

  • Google Analytics / CrUX: Real-user performance data aggregated from Chrome users
  • Datadog RUM: Full-session replay with frontend and backend trace correlation
  • New Relic Browser: JavaScript error tracking and page load metrics

Prometheus-Based SLI Collection

In the Kubernetes and cloud-native ecosystem, Prometheus is the standard for SLI instrumentation:

# Service availability over the last 30 days
1 - (
  sum(rate(http_requests_total{status=~"5.."}[30d])) /
  sum(rate(http_requests_total[30d]))
)

This PromQL query calculates the proportion of non-5xx responses over a rolling 30-day window — a direct SLI measurement.

Monitoring Stack Recommendations

| Use Case | Recommended Tools |
|---|---|
| Simple uptime monitoring (1–10 services) | UptimeRobot, Better Uptime |
| Multi-service with alerting | Pingdom, Checkly |
| Cloud-native / Kubernetes | Prometheus + Grafana + Alertmanager |
| Full observability (APM + RUM + logs) | Datadog, New Relic, Grafana Cloud |

🔍 Use our Downtime Calculator alongside your monitoring data — convert your measured downtime minutes into an availability percentage and check it against your SLA targets instantly.


Frequently Asked Questions

What is the difference between availability and reliability?

Availability measures whether a system is accessible (can you reach it?). Reliability measures whether it behaves correctly (are the responses valid?). A system can be 99.99% available but only 95% reliable if it frequently returns wrong or corrupt data. Availability is built from up/down observations over time; reliability is built from correctness observations (good/bad responses).

How to calculate system availability?

System availability is calculated as (Total Time − Downtime) / Total Time × 100. For composite systems whose components must all be up, multiply the components’ availabilities together. Example: a web app with 43 minutes of downtime over a 30-day month (43,200 minutes) has 99.90% availability. Use our Downtime Calculator for instant conversions.

What is 99.999% availability?

99.999% availability (“five nines”) allows a maximum of 5 minutes and 15 seconds of downtime per year — approximately 26 seconds per month. It is the standard for mission-critical infrastructure like telecom switches, payment rails, and emergency dispatch systems. Achieving five nines requires active-active multi-region architecture with automated failover.

What does MTTR measure?

MTTR (Mean Time to Repair) measures the average time it takes to restore a system after a failure. It includes detection, acknowledgement, diagnosis, and repair. A lower MTTR means faster recovery. MTTR of under 1 hour is considered excellent; under 15 minutes is elite. MTTR is the most impactful metric to optimize because it directly reduces downtime duration.

How many nines does AWS have?

AWS commits to specific availability targets per service in its SLA documentation. Amazon S3 Standard is designed for 99.99% availability, but its SLA pays service credits only when monthly uptime falls below 99.9%. Amazon EC2 has a 99.99% SLA for multi-instance deployments in different Availability Zones. AWS Lambda has a 99.95% SLA. Almost no AWS service commits to 100% availability (the Amazon Route 53 DNS SLA is a notable exception), and most offer service credits if they miss their published SLA targets. SLA terms change over time and actual experience varies by region and service configuration, so check the current SLA page for each service you depend on.


Conclusion

System availability is not just a number — it is the language your infrastructure uses to tell you how healthy it is. Understanding how to calculate it, how to break it down into SLIs, SLOs, and SLAs, and how to use error budgets to drive engineering decisions transforms availability from a report card into a competitive advantage.

Key Takeaways

  • Availability = Uptime / (Uptime + Downtime) × 100 — Know the formula and when to apply it
  • Distinguish between uptime, availability, and reliability — They measure different things, and confusing them leads to bad SLA design
  • Set SLOs tighter than SLAs: the gap is your safety buffer, and the error budget (1 − SLO) is your most valuable operational resource
  • Optimize MTTR before MTBF — Fast recovery is often cheaper and more impactful than preventing every possible failure
  • Match your nines to your business — Five nines is a technical achievement; four nines is usually the right business answer
  • Measure what matters — Use synthetic monitoring, RUM, and Prometheus-based SLIs to track availability from every angle

Put It Into Practice

Ready to apply what you’ve learned?

  1. Calculate your current availability — Our Downtime Calculator converts between uptime percentages and downtime minutes for any time window, from daily to yearly
  2. Review your SLA targets — Are they realistic? Do your SLOs give you enough error budget?
  3. Read about downtime fundamentals — Our guide What is Downtime? covers definitions, causes, costs, and strategies for minimizing unplanned outages

Start measuring, start improving, and let the numbers guide your reliability journey.
