Performance Metrics


  1. Scalability is the ability of a system to grow and manage increased traffic.
  2. Growth can mean an increased volume of data or requests.
  3. The goal is to achieve this growth without a loss in performance.
  4. Bad system design can create a bottleneck on the number of users or the amount of traffic the application can handle, or can cause costs to increase exponentially just to serve a small increase in traffic.


  1. Reliability is defined in terms of the probability that a system will fail during a given period of time: the lower that probability, the more reliable the system.
  2. Reliability is slightly harder to define for software than for hardware, because software may have degrees of reliability.
  3. The overall system is reliable if it keeps working even when software or hardware components fail.
    1. That means putting systems in place, such as automated testing, to prevent bugs from being deployed to production.
    2. You also need tools that can predict and compensate for hardware failure, so that you are notified before a server fails and can preemptively take it offline and repair it before it starts mishandling requests.

A common way to measure reliability is Mean Time Between Failures (MTBF).

MTBF is calculated as: MTBF = (total elapsed time - total downtime) / number of failures

For instance, if the total elapsed time is 24 hours, the total downtime is 4 hours, and there were 4 failures, then MTBF = (24 hours - 4 hours) / 4 failures = 5 hours.
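The MTBF formula can be expressed as a small function. This is only a minimal sketch: the function name `mtbf_hours` and its parameter names are made up for illustration, and all times are assumed to be in hours.

```python
def mtbf_hours(total_elapsed_hours: float, total_downtime_hours: float, failures: int) -> float:
    """Mean Time Between Failures: operational time divided by the number of failures."""
    if failures <= 0:
        # With no recorded failures the ratio is undefined.
        raise ValueError("MTBF needs at least one failure")
    if total_downtime_hours > total_elapsed_hours:
        raise ValueError("downtime cannot exceed elapsed time")
    uptime_hours = total_elapsed_hours - total_downtime_hours
    return uptime_hours / failures


# Same numbers as the worked example: 24 hours elapsed, 4 hours down, 4 failures.
print(mtbf_hours(24, 4, 4))  # 5.0
```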


Availability is the amount of time a system is operational during a period of time. This is probably the most important metric for your users: whether your site actually works, and what percentage of the time it works.

Poorly designed software that requires downtime for updates is less available.

The metric for availability is straightforward: Availability % = (available time / total time) x 100.

For example, if your site is available for 23 hours out of a 24-hour day, the availability percentage is (23 hours / 24 hours) x 100 = 95.83%.
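The same calculation as a short sketch in code. The function name `availability_percent` is hypothetical, and both arguments are assumed to use the same time unit.

```python
def availability_percent(available_time: float, total_time: float) -> float:
    """Availability as a percentage of the total observed time."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    return available_time / total_time * 100


# Same numbers as the worked example: up for 23 of 24 hours.
print(f"{availability_percent(23, 24):.2f}%")  # 95.83%
```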

Here is a quick reference table of annual downtime for common availability percentages:

| Availability | Annual Downtime |
| --- | --- |
| 99% | 3 days, 15 hours, 40 mins |
| 99.9% | 8 hours, 46 mins |
| 99.99% | 52 mins, 36 secs |
| 99.999% | 5.26 mins |
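The downtime figures follow directly from the availability percentage. The sketch below shows one way to derive them. It assumes an average year of 365.25 days, which appears to match the rounded values in the table; the name `annual_downtime` is made up for illustration.

```python
from datetime import timedelta

# Assumption: an average year of 365.25 days (accounts for leap years).
HOURS_PER_YEAR = 365.25 * 24


def annual_downtime(availability_percent: float) -> timedelta:
    """Yearly downtime implied by a given availability percentage."""
    unavailable_fraction = 1 - availability_percent / 100
    return timedelta(hours=HOURS_PER_YEAR * unavailable_fraction)


for pct in (99, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {annual_downtime(pct)}")
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which is why every step up the table is considerably harder to reach.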

Reliability vs Availability

  1. A reliable system is always an available system.
  2. Availability can be maintained through redundancy even when the system is not reliable. An example is a microservice architecture, where you can easily launch a new replica to replace a failed one without hurting the system's availability.
  3. Reliable software can be more profitable, because providing the same service requires fewer backup resources.
  4. The requirements depend on the function of the software.

Take a plane as an example. If you need availability during routine maintenance, you can keep a fleet with a backup plane that rolls out and takes over the flight. But for the plane itself, the most important thing is reliability, because you do not want a failure while the plane is in the air.
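The point about redundancy can be illustrated numerically. The sketch below is not from the original notes. It assumes that replicas fail independently of one another (a simplification, since real failures are often correlated) and that the service counts as available while at least one replica is up. The name `combined_availability` is hypothetical.

```python
def combined_availability(replica_availability: float, replicas: int) -> float:
    """Chance that at least one of several identical replicas is up.

    Assumption: replicas fail independently of each other.
    """
    if not 0 <= replica_availability <= 1:
        raise ValueError("replica_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("need at least one replica")
    # The service is down only if every replica is down at the same time.
    all_down = (1 - replica_availability) ** replicas
    return 1 - all_down


# Replicas that are each up only 90% of the time, combined behind redundancy.
for n in (1, 2, 3):
    print(f"{n} replica(s): {combined_availability(0.9, n):.3%}")
```

Under these assumptions, individually unreliable components can still add up to a highly available service, at the cost of running the extra backup resources.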


  1. How well the system performs.
  2. Latency and throughput are often used as metrics.
    1. Latency is how long a request takes to return a response to the user.
    2. Throughput is the total amount of requests and traffic your system can handle in a given period of time.
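Both metrics can be measured with a simple timing loop. The sketch below is illustrative only: `measure` and `fake_handler` are hypothetical names, the handler stands in for real request processing, and requests are sent one at a time, so the throughput figure ignores any concurrency a real system would have.

```python
import time


def measure(handler, requests):
    """Call handler once per request and report latency and throughput."""
    latencies = []
    start = time.perf_counter()
    for request in requests:
        t0 = time.perf_counter()
        handler(request)
        latencies.append(time.perf_counter() - t0)  # latency of this request
    elapsed = time.perf_counter() - start
    if not latencies:
        raise ValueError("need at least one request")
    return {
        "avg_latency_s": sum(latencies) / len(latencies),
        "max_latency_s": max(latencies),
        # Throughput: completed requests per second of wall-clock time.
        "throughput_rps": len(latencies) / elapsed if elapsed > 0 else float("inf"),
    }


def fake_handler(request):
    """Hypothetical stand-in for real work; sleeps for about 2 ms."""
    time.sleep(0.002)


stats = measure(fake_handler, range(20))
print(stats)
```

In practice the average hides outliers, so latency is often also reported at high percentiles rather than as a single mean.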


  1. The speed and difficulty involved in maintaining the system.
  2. Observability: how hard it is to track down bugs.
  3. The difficulty of deploying updates.
  4. The aim is to abstract away the infrastructure so that product engineers don’t have to worry about it.