
When Your Monitoring Infrastructure Goes Down

Mar 19, 2026 · Monitoring

Last updated: 2026-03-19


On March 19, 2026, between 07:13 and 07:29 UTC, every user of StatusPage.me received false downtime alerts. Monitors reported DOWN. Auto-incidents were created. Notifications fired across email, Slack, Discord, Telegram, and webhooks.

No one’s actual services were down.

The problem was entirely on our end — and specifically, it exposed a class of failure that most monitoring systems are not designed to handle.

This post explains what happened, why it is a genuinely difficult engineering problem, and what we built to prevent it from happening again.


What happened

[Image: Contabo's website returning HTTP 500 during the outage, indicating a severe network-level failure.]

Our infrastructure provider, Contabo, experienced a network-level outage that cut off all outbound connections from their servers for approximately 16 minutes. Downdetector recorded a spike of user reports at exactly that time.

[Image: Downdetector spike for Contabo's network outage on March 19, 2026.]

Our monitoring scheduler — the process that runs all uptime checks — runs on a Contabo VPS.

When the outbound network went down, every monitor check failed simultaneously. Not because user services were down. Because our scheduler could not reach the internet at all.


Why the quorum system did not help

StatusPage.me uses a quorum engine to suppress false positives. The idea is straightforward: if only one monitoring region reports a site as DOWN while other regions report it as UP, the result is marked as a false positive and no alert fires.

This works well for the common case: a single flaky region, a routing hiccup, a transient DNS failure.
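The quorum rule described above can be sketched in a few lines. This is an illustrative sketch, not the production implementation; in particular, the exact vote threshold StatusPage.me uses is an assumption here (the post only specifies that a lone dissenting region is outvoted).

```python
def quorum_verdict(region_results):
    """region_results: mapping of region name -> True (UP) / False (DOWN).

    Sketch of the quorum rule: a DOWN report is outvoted as long as at
    least one region still sees the target as UP; only universal
    agreement produces an alertable DOWN.
    """
    down = [region for region, up in region_results.items() if not up]
    if not down:
        return "up"
    if len(down) < len(region_results):
        return "false_positive"  # at least one healthy region disagrees
    return "down"                # universal agreement: alert fires
```

Note which branch is reached when every region reports DOWN: the unanimous one. That detail is exactly where this event diverged from the common case.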

But this event was different.

When the VPS itself loses outbound internet access, every region fails simultaneously. There is no healthy region to disagree. The quorum system sees universal agreement that everything is down — because from the perspective of our monitoring infrastructure, everything is unreachable.

The quorum system worked exactly as designed, and still confirmed a result that was false. It had no way to distinguish “every site is actually down” from “our monitoring agents cannot reach the internet.”

This is a fundamentally different failure mode: not a target outage, but an infrastructure cascade.


What we did immediately

Once we identified the cause, we ran a cleanup script against the database:

  • All auto-created incidents from the outage window were resolved, with an explanatory note added to each
  • All affected monitor check results were flagged as false_positive, excluding them from uptime calculations — no user’s uptime percentage reflects this event
  • All affected users received an email explaining what happened and confirming their uptime data was corrected
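The core of that cleanup can be sketched as a single flagging pass over the outage window. The schema and column names below are assumptions for illustration; the actual script and database layout are not public.

```python
import sqlite3

# Assumed schema for illustration: check_results(status, checked_at, false_positive).
OUTAGE_START = "2026-03-19T07:13:00Z"
OUTAGE_END = "2026-03-19T07:29:00Z"

def flag_outage_window(conn):
    """Mark every DOWN result inside the outage window as a false
    positive, so uptime calculations exclude it."""
    conn.execute(
        """UPDATE check_results
           SET false_positive = 1
           WHERE status = 'DOWN'
             AND checked_at BETWEEN ? AND ?""",
        (OUTAGE_START, OUTAGE_END),
    )
    conn.commit()
```

Storing timestamps as ISO-8601 strings makes the `BETWEEN` comparison work lexicographically, which keeps the window filter a plain string range.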

What we built to prevent it

Cleaning up after the fact is not enough. Here is what we shipped the same day.

1. Network watchdog

The most direct fix: the scheduler now tests its own outbound internet connectivity every 15 seconds.

It sends concurrent TCP connect probes to three independent external endpoints:

  • 1.1.1.1:80 — Cloudflare
  • 8.8.8.8:53 — Google DNS
  • 9.9.9.9:80 — Quad9

These are raw TCP connections — no DNS lookup, no HTTP overhead. If two of three probes fail, the scheduler activates a monitoring freeze: incident creation and outbound notifications are suppressed immediately, before any monitor check completes its quorum cycle.

The freeze uses a 45-second rolling window, renewed every 15 seconds while the network is still down. The moment two probes succeed again, the freeze is cleared. The gap between network recovery and the first real alert firing is at most 15 seconds.

This means the window of false firing is essentially zero. The scheduler detects its own network loss before any monitor alarm reaches users.
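A minimal sketch of how such a watchdog might look, using the three endpoints and timing values from above. Function names, the probe timeout, and the freeze bookkeeping are assumptions; only the endpoints, the 2-of-3 rule, and the 15/45-second timings come from the post.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor

# The three endpoints from the post: raw TCP connects, so no DNS lookup
# is needed and no HTTP stack can get in the way.
PROBES = [("1.1.1.1", 80), ("8.8.8.8", 53), ("9.9.9.9", 80)]
FREEZE_TTL = 45.0  # rolling freeze window, renewed every 15-second cycle

def tcp_probe(addr, timeout=3.0):
    """True if a raw TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False

def network_healthy(results):
    """The 2-of-3 rule: healthy when at least two probes connected."""
    return sum(results) >= 2

def watchdog_cycle(now=None):
    """One 15-second cycle: probe all endpoints concurrently, then
    return the new freeze deadline (0.0 means no freeze)."""
    now = time.monotonic() if now is None else now
    with ThreadPoolExecutor(max_workers=len(PROBES)) as pool:
        results = list(pool.map(tcp_probe, PROBES))
    if network_healthy(results):
        return 0.0           # connectivity back: lift the freeze
    return now + FREEZE_TTL  # still down: extend the rolling window
```

Because the freeze deadline is simply re-extended on each failing cycle, a freeze that stops being renewed expires on its own within 45 seconds even if the watchdog process itself dies.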

2. Global spike detector

A second, independent layer. If 30% or more of all active monitors flip DOWN within a 3-minute rolling window — and that count exceeds 50 monitors — incident creation and notifications are suppressed for 20 minutes.

This catches the case where the network watchdog fails or the outage affects only a subset of monitors. A single site going down does not trigger it. A correlated mass failure does.

The spike detector fires after 1–3 minutes of quorum cycles completing. The network watchdog fires within 15–30 seconds. Together, they cover different points in the detection timeline.
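The spike detector's decision can be sketched as a pruned rolling window with two thresholds. The function shape and the assumption of one timestamp per flipped monitor are illustrative; the 30%, 50-monitor, and 3-minute values are from the post.

```python
from collections import deque

WINDOW_SECONDS = 180     # 3-minute rolling window
RATIO_THRESHOLD = 0.30   # 30% of all active monitors
MIN_DOWN_COUNT = 50      # absolute floor, so small fleets never trip it
# On trigger, the post says suppression lasts 20 minutes.

def spike_detected(down_events, active_monitor_count, now):
    """down_events: deque of timestamps at which monitors flipped DOWN
    (assumed one entry per monitor). Prunes entries older than the
    window, then applies both thresholds."""
    while down_events and now - down_events[0] > WINDOW_SECONDS:
        down_events.popleft()
    n = len(down_events)
    return n > MIN_DOWN_COUNT and n >= RATIO_THRESHOLD * active_monitor_count
```

Requiring both the ratio and the absolute count is what lets a single site going down pass through while a correlated mass failure does not.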

3. Admin freeze API

A manual override: a single API call activates a monitoring freeze for a specified number of minutes. Useful when a provider incident is known before the automated systems detect it, or when the thresholds above have not been crossed yet.
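For illustration, such a call might be shaped like the request below. The endpoint path, payload field, and auth header are hypothetical, not StatusPage.me's documented API.

```python
import json
import urllib.request

def build_freeze_request(minutes, api_token, base_url="https://statuspage.me/api"):
    """Build a POST that activates a monitoring freeze for `minutes`.
    Endpoint and field names are illustrative assumptions."""
    return urllib.request.Request(
        f"{base_url}/admin/freeze",
        data=json.dumps({"minutes": minutes}).encode(),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```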


The harder problem we are still solving

StatusPage.me already runs monitoring agents on independent infrastructure across multiple providers and locations. When one region has a network issue, the quorum system sees the other regions reporting UP and suppresses the false positive — exactly as designed.

That protection works when the main server stays reachable. External agents pull their assignments from it and push results back to it. As long as those connections work, a failing region gets outvoted.

The March 19 incident was a different failure mode. Contabo’s outage was severe enough that their own company website was returning HTTP 500 — a sign that the issue went well beyond outbound routing on customer VMs. The main server itself was likely unreachable during that window, which meant external agents had nothing to push results to. The protection that normally works was suspended for the duration of the outage.

The remaining architectural gap is therefore not agent distribution — it is main server availability. If the server that coordinates monitoring goes fully dark, the system cannot distinguish “our coordination layer is unreachable” from “all monitored services are down.”

The network watchdog shipped as part of this response addresses the false-alert window directly: it detects the network failure within 15–30 seconds and suppresses notifications before any quorum cycle completes. But the underlying gap — having the coordination layer survive a provider-level outage — requires a hot standby on an independent provider, which is on the roadmap.


What this means for monitoring in general

This event illustrates something worth thinking about regardless of which monitoring tool you use.

A monitoring tool that generates false alerts is not neutral. It erodes confidence in the alerts that matter. Teams that have been paged for false positives are slower to respond to real ones. That is the opposite of what monitoring is supposed to do.

Most false-positive guidance focuses on the target: multi-region checks, confirmation windows, DNS validation. Less attention is paid to the monitoring infrastructure itself as a potential source of false positives.

If your monitoring system’s outbound network fails, what happens? Does it fire alerts, stay silent, or do something else? That behavior is worth knowing before an incident, not during one.


FAQ

Why did multiple regions all fail at the same time?

All monitoring regions ran through the same VPS and shared the same outbound network path. When that path went down, every region lost connectivity simultaneously. True multi-provider redundancy — agents on separate networks from separate providers — would have prevented this.

Were any user services actually down?

No. Every service that received a false DOWN alert was operating normally throughout the outage window.

Was uptime data affected?

Uptime calculations use a false_positive flag on individual check results. All results from the outage window were flagged, so they are excluded from uptime percentages. No user’s uptime data was negatively affected by this event.

How quickly will the watchdog suppress alerts in future incidents like this?

The network watchdog runs every 15 seconds. If two of three probes fail, the freeze activates within that cycle. In a scenario like the March 19 outage, the first alert suppression would happen within 15–30 seconds of the network going down — before the quorum system has time to confirm any DOWN state.

What is the monitoring freeze?

A monitoring freeze halts incident creation and outbound notifications without stopping monitor checks. Checks continue running and results are stored, but no alerts fire. The freeze lifts automatically when the watchdog confirms connectivity is restored.


Nikola Stojković
Published Mar 19, 2026