Website monitoring metrics explained: Uptime, TTFB, and more


Website metrics are quantifiable measurements that track your website's uptime, speed, and reliability in real time, giving your team the data to detect problems before users do.

You already know that every second of downtime costs you users, revenue, and search rankings. But what separates engineering teams that close incidents in minutes from teams that scramble in the dark for hours? It's not magic. It's knowing the right metrics, understanding precisely what each one measures, and having benchmarks that define the line between acceptable and broken.

This guide covers the core website monitoring metrics used by engineering and SRE teams, including what each metric measures, how it's calculated, the benchmarks, and where to start when the numbers go wrong.

Uptime percentage: The baseline of website reliability

Uptime percentage is the most fundamental website monitoring metric. It measures the proportion of time your website or service is reachable and responding correctly, expressed as a percentage of the total monitoring time.

Uptime % = (Total time – Downtime) / Total time × 100

A 99% uptime guarantee sounds reassuring, but over a 30-day month it permits more than 7 hours of downtime. At 99.9% (three nines), the site can be down for approximately 43 minutes per month. At 99.99% (four nines), that drops to roughly 4 minutes. For SaaS platforms, e-commerce sites, and financial services, four nines is the usual target for a reliable service.
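The arithmetic behind these figures can be sketched in a few lines of Python. This is a minimal illustration, assuming a 30-day month; the function names are hypothetical and not part of any monitoring tool's API.

```python
from datetime import timedelta


def uptime_percent(total: timedelta, downtime: timedelta) -> float:
    """Uptime % = (total time - downtime) / total time x 100."""
    return (total - downtime) / total * 100


def downtime_budget(uptime_target_percent: float, period: timedelta) -> timedelta:
    """Maximum downtime that a given uptime target permits over the period."""
    return period * (1 - uptime_target_percent / 100)


month = timedelta(days=30)  # assumption: a 30-day month

for target in (99.0, 99.9, 99.99):
    # Prints the allowed downtime for each target as hours:minutes:seconds.
    print(f"{target}% uptime allows {downtime_budget(target, month)} of downtime")
```

Running the loop shows how quickly the allowance shrinks with each additional nine, which is why the target should be chosen before an outage rather than after it.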

What's generally counted as downtime?

This depends on your monitoring configuration. Most tools consider a site "down" when it returns an HTTP error (5xx), times out, or fails to respond within a defined timeframe. Some teams also track partial downtime—where specific pages or services are unavailable even though the root domain responds.

Benchmark targets by industry:

  • E-commerce and FinTech: 99.99% or higher
  • SaaS applications: 99.9% minimum, 99.99% preferred
  • Marketing and informational sites: 99.5–99.9%

Uptime percentage alone can mask serious problems, such as a site that responds but is slow or returns errors on key pages. Teams must pair uptime tracking with error rate and response time checks to get a true picture of availability. Learn more about uptime monitoring.


TTFB: Your server's first impression

Time to first byte (TTFB) measures the time from when a client sends an HTTP request until the server delivers the first byte of response data. It's one of the most direct indicators of back-end health.

TTFB is not calculated in a single step—it's the sum of several sequential events: DNS resolution, TCP connection establishment, TLS/SSL handshake (for HTTPS), server processing time, and initial network transit. If your TTFB is poor, this breakdown tells you exactly where to look.
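To make the breakdown concrete, here is a small sketch that sums the phases and reports the slowest one. The per-phase timings are made-up sample values, and the class and field names are hypothetical; in practice the numbers would come from your monitoring tool or a timing trace.

```python
from dataclasses import dataclass, fields


@dataclass
class TtfbBreakdown:
    """Per-phase timings in milliseconds (hypothetical sample values)."""
    dns_ms: float
    tcp_connect_ms: float
    tls_handshake_ms: float
    server_processing_ms: float
    network_transit_ms: float

    def total_ms(self) -> float:
        # TTFB is the sum of the sequential phases.
        return sum(getattr(self, f.name) for f in fields(self))

    def slowest_phase(self) -> str:
        # The phase with the largest share is the first place to investigate.
        return max(fields(self), key=lambda f: getattr(self, f.name)).name


sample = TtfbBreakdown(
    dns_ms=20,
    tcp_connect_ms=35,
    tls_handshake_ms=60,
    server_processing_ms=410,
    network_transit_ms=25,
)
print(sample.total_ms(), sample.slowest_phase())
```

In this sample, server processing dominates, which would point toward queries, caching, or compute rather than DNS or the network.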

What's a good TTFB?

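Google's web performance guidance, published alongside Core Web Vitals and widely adopted as a benchmark, classifies TTFB as follows (TTFB itself is a diagnostic metric rather than a Core Web Vital):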

  • Under 800ms: Good
  • 800ms–1,800ms: Needs improvement
  • Over 1,800ms: Poor

For server-side performance, engineering teams typically aim for a back-end TTFB of under 200ms. Anything above 500ms from the server alone warrants investigation into database query optimization, caching strategy, or compute capacity.

Why does TTFB matter for SEO and UX?

TTFB is the beginning of everything else on a page. The browser can't begin parsing HTML, fetching CSS, or rendering any content until that first byte arrives. A slow TTFB delays every downstream metric, including FCP and LCP, and LCP is among the signals Google weighs in ranking.

Some of the common causes of high TTFB include:

  • Unoptimized database queries generating pages dynamically on every request.
  • No caching layer in front of the application.
  • A server located far from the user, resulting in increased network latency.
  • Insufficient compute resources causing request queuing.

Apdex score: Quantifying user satisfaction

Application Performance Index (Apdex) is an open standard that converts raw performance data into a single, interpretable user satisfaction score. Instead of wrestling with latency distributions and percentile charts, Apdex gives you one number between zero and one, where zero means no users are satisfied and one means all users are satisfied.

Apdex works by asking a simple question about every request: How did the user feel about the experience? To answer this, you first define a threshold value called T, the response time you consider acceptable for a given page or endpoint. Every request then falls into one of three zones:

  • Satisfied users got a response within T. The experience felt fast and seamless.
  • Tolerating users waited longer than T but no more than four times T. They got their answer, but noticed the delay.
  • Frustrated users either waited beyond four times T, or hit a server-side error entirely.
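The three zones feed the standard Apdex formula: the satisfied count plus half the tolerating count, divided by the total number of samples. The sketch below is one way to compute it; the function name and the (response time, error flag) sample format are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def apdex_score(samples: Iterable[Tuple[float, bool]], t: float) -> float:
    """Apdex = (satisfied + tolerating / 2) / total.

    samples: (response_time_seconds, is_server_error) pairs.
    t: the threshold T, in seconds.
    """
    satisfied = tolerating = total = 0
    for response_time, is_server_error in samples:
        total += 1
        if is_server_error:
            continue  # server-side errors are frustrated, however fast they were
        if response_time <= t:
            satisfied += 1
        elif response_time <= 4 * t:
            tolerating += 1
        # anything slower than 4T is frustrated and adds nothing to the score
    if total == 0:
        raise ValueError("at least one sample is required")
    return (satisfied + tolerating / 2) / total


# Hypothetical sample: two satisfied, two tolerating, one too slow, one error.
sample_requests = [
    (0.2, False), (0.4, False),
    (0.9, False), (1.5, False),
    (3.0, False),
    (0.1, True),
]
print(apdex_score(sample_requests, t=0.5))  # (2 + 2/2) / 6 = 0.5
```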

Reading the score

Once calculated, the score tells you where you stand:

  • 0.94 and above: Excellent. Nearly all users are having a great experience.
  • 0.85 to 0.93: Good. Most users are satisfied, with only minor friction.
  • 0.70 to 0.84: Fair. Performance issues are noticeable and worth addressing.
  • 0.50 to 0.69: Poor. A significant portion of your users are frustrated.
  • Below 0.50: Unacceptable. The majority of users are having a bad experience, and performance needs immediate attention.
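If you want to label scores automatically in a report, the bands above map to a simple lookup. This is a sketch with a hypothetical function name; it assumes that an unrounded score falling between two listed bands (for example 0.695) belongs to the lower band.

```python
from bisect import bisect_right

# Lower bounds of the Poor, Fair, Good, and Excellent bands listed above.
_LOWER_BOUNDS = [0.50, 0.70, 0.85, 0.94]
_LABELS = ["Unacceptable", "Poor", "Fair", "Good", "Excellent"]


def apdex_rating(score: float) -> str:
    """Map an Apdex score between 0 and 1 to its rating band."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("Apdex scores range from 0 to 1")
    # bisect_right counts how many lower bounds the score has reached.
    return _LABELS[bisect_right(_LOWER_BOUNDS, score)]


print(apdex_rating(0.88))
```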

Error rate: The health signal you can't ignore

Error rate measures the percentage of requests to your application that result in an error response, typically HTTP status codes 4xx and 5xx. It's one of the four golden signals of site reliability engineering, the others being latency, traffic, and saturation.

Error rate = (Error responses / Total requests) × 100

Not all errors are equal. HTTP 4xx errors (like 404 Not Found or 403 Forbidden) are client-side errors, usually caused by broken links, bad URLs, or permission problems. HTTP 5xx errors (500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable) indicate server-side failures and are the more critical category to monitor tightly.
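Because the two classes call for different responses, it helps to compute them separately as well as together. The following is a minimal sketch over a list of HTTP status codes; the function name and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def error_rates(status_codes: Iterable[int]) -> Dict[str, float]:
    """Return 4xx, 5xx, and overall error rates as percentages."""
    classes = Counter(code // 100 for code in status_codes)
    total = sum(classes.values())
    if total == 0:
        raise ValueError("at least one request is required")
    return {
        "4xx": 100 * classes[4] / total,
        "5xx": 100 * classes[5] / total,
        "overall": 100 * (classes[4] + classes[5]) / total,
    }


# Hypothetical sample: 95 successes, three 404s, one 500, and one 503.
sample_codes = [200] * 95 + [404] * 3 + [500, 503]
print(error_rates(sample_codes))
```

Splitting the rate this way keeps a burst of 404s from a bad link from being mistaken for a server-side failure, and the reverse.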

Industry benchmarks:

  • For production applications, an error rate below 1% is the standard target.
  • For high-availability APIs, below 0.1% is achievable and expected.
  • An error rate above 5% is a service-degrading event requiring immediate attention.

Why error rate needs deeper analysis

A system with an average response time of 200ms and a 10% error rate appears fast—but one in ten users is getting a broken experience. Error rate and Apdex must be read together. In fact, Apdex automatically classifies server-side errors as frustrated responses regardless of response time, which is one of its strengths.

What drives error rate spikes:

  • Unhandled exceptions in application code
  • Database connection pool exhaustion
  • Downstream API dependency failures
  • Deployment issues or configuration mismatches
  • Memory leaks causing service restarts

MTTR: How fast you fix problems

Uptime tells you how often things break. Mean time to recovery (MTTR) tells you how well-prepared you are when they do. MTTR measures the average time elapsed between the onset of a failure and the full restoration of the service, so it includes the time taken to detect the problem.

MTTR = Total downtime / Number of incidents

MTTR is not just about engineering speed—it's also a measure of your monitoring and alerting maturity. A team that takes 45 minutes to notice an outage and 15 minutes to fix it has a far worse MTTR than a team with alerts firing in under a minute. This is why mean time to detection (MTTD)—the time from failure to awareness—is tracked alongside MTTR.
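The two teams in that comparison can be worked through in code. This sketch follows the formula above, so downtime is counted from the start of the failure to full restoration (definitions vary between teams, so treat that as an assumption). The Incident record and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class Incident:
    started: datetime   # when the failure began
    detected: datetime  # when the team became aware of it
    resolved: datetime  # when service was fully restored


def _mean(deltas: Iterable[timedelta]) -> timedelta:
    items: List[timedelta] = list(deltas)
    return sum(items, timedelta()) / len(items)


def mttd(incidents: List[Incident]) -> timedelta:
    """Mean time to detection: failure start to awareness."""
    return _mean(i.detected - i.started for i in incidents)


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time to recovery: total downtime / number of incidents."""
    return _mean(i.resolved - i.started for i in incidents)


t0 = datetime(2024, 1, 1, 12, 0)
minutes = lambda n: timedelta(minutes=n)

# Slow detection: 45 minutes to notice, then 15 minutes to fix.
slow_team = [Incident(t0, t0 + minutes(45), t0 + minutes(60))]
# Fast detection: 1 minute to notice, then the same 15 minutes to fix.
fast_team = [Incident(t0, t0 + minutes(1), t0 + minutes(16))]

print(mttr(slow_team), mttr(fast_team))
```

Both teams spend the same 15 minutes on the fix; the difference in MTTR comes entirely from detection.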

Target ranges:

  • Mature SRE teams often target MTTR under five minutes for P1 incidents.
  • Most production teams operate in the 15–60 minute range.

Page load time and Core Web Vitals

Page load time is the total time from a user's request to the moment the page is fully interactive. While it sounds like a single metric, modern performance measurement breaks it into more precise measurements. Three of them (LCP, INP, and CLS) make up the Core Web Vitals that Google uses as ranking signals; FCP is a supporting diagnostic metric.

  • Largest contentful paint (LCP): Measures when the largest visible element (image or text block) finishes rendering. The LCP target is under 2.5 seconds.
  • First contentful paint (FCP): Measures the time when the first visible element appears on the screen. Strongly correlated with TTFB. The FCP target is under 1.8 seconds.
  • Interaction to next paint (INP): Measures how quickly the page responds to user interactions. Replaced first input delay (FID) as a Core Web Vital in 2024. The INP target is under 200ms.
  • Cumulative layout shift (CLS): Measures visual stability—how much page elements shift unexpectedly during load. The CLS target is below 0.1.
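A simple check against the targets listed above might look like the sketch below. The metric keys and function name are hypothetical, and treating a value exactly at the target as passing is an assumption.

```python
from typing import Dict, List

# "Good" targets from the list above: seconds, milliseconds, and a unitless score.
TARGETS: Dict[str, float] = {
    "lcp_s": 2.5,
    "fcp_s": 1.8,
    "inp_ms": 200,
    "cls": 0.1,
}


def failing_metrics(measured: Dict[str, float]) -> List[str]:
    """Return the names of measured metrics that exceed their target."""
    return sorted(
        name
        for name, value in measured.items()
        if name in TARGETS and value > TARGETS[name]
    )


# Hypothetical measurements for one page.
page = {"lcp_s": 3.1, "fcp_s": 1.2, "inp_ms": 180, "cls": 0.25}
print(failing_metrics(page))
```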

These metrics matter for both user experience and organic search rankings. Poor Core Web Vitals scores can directly affect your visibility in Google Search.

The waterfall chart from Site24x7 helps analyze the performance of individual elements on a page

SLIs, SLOs, and SLAs: The framework that ties it all together

Individual metrics become meaningful when they are embedded in a formal reliability framework.

  • A service-level indicator (SLI) is the actual measured metric. For example, "the percentage of requests with TTFB under 200ms" or "monthly uptime percentage."
  • A service-level objective (SLO) is the internal target for an SLI. For example, "99% of requests must have TTFB under 200ms" or "uptime must be 99.9% per calendar month."
  • A service-level agreement (SLA) is an external, contractual commitment to customers, typically with financial penalties for breaches. SLA targets are usually set looser than the internal SLOs to create a buffer.

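Your error budget is the amount of unreliability your SLO permits: 100% minus the SLO target. A 99.9% monthly uptime SLO, for example, leaves a budget of about 43 minutes of downtime. SRE teams use this budget to balance reliability work against feature development, while the gap between the SLO and the looser SLA acts as a safety margin before contractual obligations are violated.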
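As a rough sketch of how a team might track this, the code below assumes the common SRE convention that the error budget for a period is 100% minus the SLO target. The function names are hypothetical, and the default period is a 30-day month.

```python
def error_budget_minutes(slo_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime an uptime SLO permits over the period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     period_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_percent, period_days)
    return (budget - downtime_minutes) / budget


# A 99.9% SLO over 30 days allows about 43.2 minutes of downtime.
print(error_budget_minutes(99.9))
# After 21.6 minutes of downtime, roughly half of the budget is left.
print(budget_remaining(99.9, downtime_minutes=21.6))
```

A team that has spent most of its budget early in the period has a concrete reason to pause risky releases; a team with budget to spare can ship faster.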

Putting it together: A monitoring stack that works

No single metric tells the full story. The most effective monitoring approach uses these metrics in combination:

  • Uptime and error rate catch outages and partial failures.
  • TTFB and LCP diagnose back-end and front-end performance bottlenecks.
  • Apdex and error rate provide a unified view of user experience quality.
  • MTTR and MTTD reveal the maturity of your incident response process.
  • SLOs tied to SLAs give engineering teams a principled framework for prioritizing reliability work.

Teams must set concrete threshold values for each metric before problems arise, and then read the metrics together rather than in isolation. A high error rate alongside a CPU spike tells a different story than an error rate spike alone. A declining Apdex score with stable TTFB often points to front-end performance issues or dependency failures, not to server capacity.

That said, static thresholds only take you so far. Site24x7's Zia-based dynamic threshold profiles use AI-powered anomaly detection to adapt to your monitor's actual behavior, alerting you the moment an issue begins, rather than waiting for a hard-coded limit to be breached.

Moreover, the Site24x7 dashboard brings all of this together in one place, giving you a unified, real-time view across uptime, performance, error rates, and user experience metrics, so you can correlate signals and act fast without switching between tools.

Measure consistently, trend over time, and, critically, segment your metrics by endpoint, device type, and geography. Aggregate numbers hide the specific user populations most affected by performance issues.
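To illustrate how an aggregate can hide an affected segment, the sketch below groups a request log by region and computes the server error rate for each. The log format, the region names, and the function name are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def error_rate_by_segment(request_log: Iterable[Mapping], key: str) -> Dict[str, float]:
    """Server error rate (%) for each value of the given segment key."""
    totals: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for entry in request_log:
        segment = entry[key]
        totals[segment] += 1
        if entry["status"] >= 500:
            errors[segment] += 1
    return {segment: 100 * errors[segment] / totals[segment] for segment in totals}


# Hypothetical log: 1,000 requests, with every failure in one small region.
request_log = (
    [{"region": "us-east", "status": 200}] * 950
    + [{"region": "eu-west", "status": 200}] * 30
    + [{"region": "eu-west", "status": 503}] * 20
)

overall = 100 * sum(e["status"] >= 500 for e in request_log) / len(request_log)
print(overall)                                       # aggregate error rate
print(error_rate_by_segment(request_log, "region"))  # per-region error rates
```

The aggregate sits at 2%, while the smaller region is failing on 40% of its requests. The same grouping works for endpoint or device type by changing the key.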

Start your monitoring today to let your metrics guide you

Website monitoring metrics are not a reporting exercise; they are your operational intelligence. Start with baselines for each metric, set targets grounded in your user expectations and industry benchmarks, and build alerts that surface the right signals before your users experience the consequences. Try our website monitoring for free and see how these metrics can help you optimize your site.

Frequently asked questions

How often should I check my website monitoring metrics?

Critical metrics like uptime and error rate should be monitored continuously with real-time alerts. Performance metrics like TTFB and Apdex are best reviewed as daily trends, while SLO compliance and MTTR analysis are typically reviewed weekly or at the close of each incident.

Should I monitor my website from multiple locations?

Yes. A server may respond normally from one region while users in another geography experience high latency or timeouts. Monitoring from multiple locations helps distinguish between a global outage and a region-specific performance degradation—which changes both the diagnosis and the fix. Site24x7 offers 130+ monitoring locations across the globe.

Can good metrics coexist with a bad user experience?

Yes. Aggregate metrics can look perfectly healthy while a specific segment of your users (a particular device type, browser, or geography) is having a terrible experience. A 99.9% uptime number means nothing to the user in London hitting a timeout on every checkout. This is why segmenting metrics by endpoint, device, and location isn't a nice-to-have feature; it's where the real story lives.

How do I know which metric to prioritize when multiple signals fire at once?

Start with error rate and uptime—these indicate whether users are being served at all. Then move to TTFB to assess server health, followed by Apdex for user experience impact. MTTR becomes the focus once the issue is confirmed and your team is in resolution mode.