
Building a Cloudflare Workers Usage Monitor with an Automated Kill Switch


The Why

I run several Cloudflare Workers across multiple accounts. The billing model is pay-per-use: requests and CPU time beyond the included amounts add up quickly. Abuse can turn into a nasty surprise on the next invoice. I wanted something that would catch overages early and stop traffic before costs spiral.

This post walks through what I built: a Worker that monitors usage, detects threshold breaches, and automatically disconnects Workers from the internet when limits are exceeded. It also generates a daily usage report with cost estimates.

The Problem

Cloudflare Workers billing is based on:

  • Requests (10M included per month on the Paid plan, then $0.30 per million)
  • CPU time (30M CPU milliseconds included per month, then $0.02 per million ms)
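To make the pricing concrete, here is a small TypeScript sketch that estimates the overage from account-wide monthly totals, using only the two rates listed above. The function and constant names are hypothetical. It ignores the plan's base fee and assumes the included amounts apply per account, per month.

```typescript
// Workers Paid rates as listed above (USD). Included amounts are per account, per month.
const INCLUDED_REQUESTS = 10_000_000;
const USD_PER_MILLION_REQUESTS = 0.3;
const INCLUDED_CPU_MS = 30_000_000;
const USD_PER_MILLION_CPU_MS = 0.02;

interface OverageCost {
  requestsUsd: number;
  cpuUsd: number;
  totalUsd: number;
}

// Estimates only the usage-based overage, not the plan's base fee.
function estimateWorkersOverage(requests: number, cpuMs: number): OverageCost {
  const extraRequests = Math.max(0, requests - INCLUDED_REQUESTS);
  const extraCpuMs = Math.max(0, cpuMs - INCLUDED_CPU_MS);
  const requestsUsd = (extraRequests / 1_000_000) * USD_PER_MILLION_REQUESTS;
  const cpuUsd = (extraCpuMs / 1_000_000) * USD_PER_MILLION_CPU_MS;
  return { requestsUsd, cpuUsd, totalUsd: requestsUsd + cpuUsd };
}
```

For example, a bot that pushes an account to 110M requests and 230M CPU ms in a month adds roughly 100 × $0.30 + 200 × $0.02 = $34 on top of the subscription.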

If a Worker starts looping, gets hit by a bot, or has a bug that burns CPU, usage can spike fast. There is no built-in hard cap on total monthly usage. You only find out when the bill arrives.

The Approach

I built a separate Worker that:

  1. Runs on a schedule (every 5 minutes) and fetches billing-period-to-date metrics from the Cloudflare GraphQL API
  2. Compares each Worker's usage against configurable thresholds
  3. When a Worker exceeds a threshold, it disconnects that Worker from the internet (removes routes, custom domains, workers.dev) and sends a Discord alert
  4. Runs a daily report that aggregates usage across D1, KV, R2, Queues, Durable Objects, and Workflows, estimates cost, and sends a JSON report to Discord
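Step 2 is the simplest part. A minimal sketch, assuming the GraphQL results have already been flattened into one hypothetical `WorkerUsage` row per script (the field names are mine, not Cloudflare's schema). `limitsFor` is a callback so that per-Worker overrides can be plugged in:

```typescript
// Hypothetical row shape: one entry per Worker script, billing-period-to-date.
interface WorkerUsage {
  accountId: string;
  scriptName: string;
  requests: number;
  cpuTimeMs: number;
}

interface Limits {
  requests: number;
  cpuTimeMs: number;
}

type Metric = "requests" | "cpuTimeMs";

interface Overage {
  accountId: string;
  scriptName: string;
  breached: Metric[]; // every metric that is over its limit
}

// Returns one Overage per Worker that exceeds at least one limit.
function findOverages(usage: WorkerUsage[], limitsFor: (u: WorkerUsage) => Limits): Overage[] {
  const overages: Overage[] = [];
  for (const u of usage) {
    const limits = limitsFor(u);
    const breached: Metric[] = [];
    if (u.requests > limits.requests) breached.push("requests");
    if (u.cpuTimeMs > limits.cpuTimeMs) breached.push("cpuTimeMs");
    if (breached.length > 0) {
      overages.push({ accountId: u.accountId, scriptName: u.scriptName, breached });
    }
  }
  return overages;
}
```

Grouping breaches per Worker is a design choice: if both limits trip in the same run, the Worker gets one Workflow instance instead of two.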

The "kill switch" is intentional: it stops traffic immediately. Re-enabling is manual after you investigate and fix the cause.
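Separating "what to remove" from "how to call the API" keeps the kill switch easy to test and log. The sketch below only plans the calls. The endpoint paths reflect my understanding of the Cloudflare API v4 (delete a zone route, delete a Workers custom domain, disable the workers.dev subdomain for a script) and should be checked against the API reference; the `Exposure` shape and `planDisconnect` are hypothetical.

```typescript
// Hypothetical description of everything that currently exposes a Worker.
interface Exposure {
  accountId: string;
  scriptName: string;
  routes: { zoneId: string; routeId: string }[];
  customDomainIds: string[];
}

interface ApiCall {
  method: "DELETE" | "POST";
  path: string; // relative to https://api.cloudflare.com/client/v4
  body?: { enabled: boolean };
}

// Plans the calls that cut a Worker off from public traffic.
// Endpoint paths are assumptions to verify against the Cloudflare API docs.
function planDisconnect(e: Exposure): ApiCall[] {
  const script = encodeURIComponent(e.scriptName);
  const calls: ApiCall[] = [];
  // 1. Zone routes (e.g. example.com/api/*) that point at the script.
  for (const r of e.routes) {
    calls.push({ method: "DELETE", path: `/zones/${r.zoneId}/workers/routes/${r.routeId}` });
  }
  // 2. Custom domains attached to the script.
  for (const id of e.customDomainIds) {
    calls.push({ method: "DELETE", path: `/accounts/${e.accountId}/workers/domains/${id}` });
  }
  // 3. The script's workers.dev endpoint.
  calls.push({
    method: "POST",
    path: `/accounts/${e.accountId}/workers/scripts/${script}/subdomain`,
    body: { enabled: false },
  });
  return calls;
}
```

Each planned call would then be sent with `fetch` to `https://api.cloudflare.com/client/v4` plus the path, with an `Authorization: Bearer <token>` header.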

Architecture

The system is a single Worker with two cron triggers and one Workflow:

  • Cron (every 5 min): Overage check. Fetches metrics for all accounts, compares against thresholds, checks a D1 cooldown table (so we don't re-trigger the same overage every 5 minutes), and dispatches a Workflow instance per overage.
  • Cron (8am UTC): Daily report. Runs two GraphQL queries in parallel (Worker metrics and account-level usage), aggregates them per account, estimates cost using Workers Paid pricing, saves the report to D1, and sends the JSON to Discord.
  • OverageWorkflow: One instance per overage. Removes the Worker's public entry points (zone routes, custom domains, workers.dev subdomain) via the Cloudflare API, then sends a Discord embed with the details.
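Because one Worker serves both cron triggers, the scheduled handler has to branch on which expression fired. A sketch, assuming the two triggers are configured as `*/5 * * * *` and `0 8 * * *`; `handleScheduled`, `jobForCron`, and the handler map are hypothetical names:

```typescript
type Job = "overage-check" | "daily-report";

// These strings must match the Worker's configured cron triggers exactly,
// because the runtime reports the expression that fired, not a job name.
const CRON_TO_JOB: Record<string, Job> = {
  "*/5 * * * *": "overage-check",
  "0 8 * * *": "daily-report", // 08:00 UTC; Workers cron triggers run on UTC
};

function jobForCron(cron: string): Job {
  const job = CRON_TO_JOB[cron];
  if (!job) throw new Error(`no job registered for cron "${cron}"`);
  return job;
}

// Minimal stand-in for the Workers ScheduledController, which exposes `cron`.
interface ScheduledControllerLike {
  cron: string;
}

// Body of the scheduled handler: look up the job and run it.
async function handleScheduled(
  controller: ScheduledControllerLike,
  handlers: Record<Job, () => Promise<void>>,
): Promise<void> {
  await handlers[jobForCron(controller.cron)]();
}
```

Inside the Worker, `scheduled(controller, env, ctx)` would call `ctx.waitUntil(handleScheduled(controller, { ... }))` with the real overage-check and report functions. Throwing on an unknown expression makes a typo in the trigger config fail loudly instead of silently skipping a job.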

D1 stores two things: an overage_state table for cooldown deduplication (entries expire after OVERAGE_COOLDOWN_SECONDS, one hour by default, so we don't re-fire on the same Worker within that window), and a usage_reports table for report history.
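The cooldown check can be sketched as a read followed by an upsert. The table layout and SQL below are assumptions for illustration (D1 is SQLite, so `ON CONFLICT ... DO UPDATE` works), and `D1Like` is a minimal stand-in for the D1 binding's `prepare().bind().first()` and `run()` chain:

```typescript
// Assumed cooldown table: one row per Worker, holding when its cooldown ends.
// Run once, e.g. as a migration.
const CREATE_OVERAGE_STATE = `
  CREATE TABLE IF NOT EXISTS overage_state (
    account_id  TEXT    NOT NULL,
    script_name TEXT    NOT NULL,
    expires_at  INTEGER NOT NULL, -- epoch milliseconds
    PRIMARY KEY (account_id, script_name)
  )`;

const SELECT_EXPIRY =
  "SELECT expires_at FROM overage_state WHERE account_id = ? AND script_name = ?";

const UPSERT_EXPIRY = `
  INSERT INTO overage_state (account_id, script_name, expires_at) VALUES (?, ?, ?)
  ON CONFLICT (account_id, script_name) DO UPDATE SET expires_at = excluded.expires_at`;

// Minimal structural stand-in for the D1Database binding.
interface D1Like {
  prepare(sql: string): {
    bind(...values: unknown[]): {
      first<T>(): Promise<T | null>;
      run(): Promise<unknown>;
    };
  };
}

// A Worker is cooling down while "now" is before its stored expiry.
function isCoolingDown(expiresAtMs: number | null, nowMs: number): boolean {
  return expiresAtMs !== null && nowMs < expiresAtMs;
}

// Returns true if this overage should fire now, and starts a new cooldown if so.
async function claimOverage(
  db: D1Like,
  accountId: string,
  scriptName: string,
  nowMs: number,
  cooldownSeconds: number,
): Promise<boolean> {
  const row = await db
    .prepare(SELECT_EXPIRY)
    .bind(accountId, scriptName)
    .first<{ expires_at: number }>();
  if (isCoolingDown(row ? row.expires_at : null, nowMs)) return false;
  await db
    .prepare(UPSERT_EXPIRY)
    .bind(accountId, scriptName, nowMs + cooldownSeconds * 1000)
    .run();
  return true;
}
```

Storing an expiry time rather than the last-fired time fixes the TTL when the row is written, so changing `OVERAGE_COOLDOWN_SECONDS` only affects future overages.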

[Architecture diagram]

What It Doesn't Cover

Plain and simple: this protects against external threats, not our own stupidity. The kill switch cuts off public traffic to Workers. It does not protect against:

  • Public R2 reads. If you have a public R2 bucket, anyone can read from it. Those requests bypass Workers entirely and are billed to your account. The kill switch cannot stop that.
  • Recursive Workers. A Worker that calls itself (or another Worker) in a loop isn't driven by public internet traffic. Removing routes, custom domains, and workers.dev does nothing, because internal calls don't go through them. The kill switch is useless here.
  • Runaway Durable Objects. Same story. A DO spinning out of control is internal. We don't cap or disconnect them.
  • Other products. Queues, Workflows, etc. have their own billing. The daily report tracks them, but we don't auto-cap them.

Configuration

Thresholds are configurable via environment variables (global defaults) and per-Worker overrides in an accounts config file:

  • REQUEST_THRESHOLD: default 500k requests
  • CPU_TIME_THRESHOLD_MS: default 5M ms (about 83 minutes of CPU)
  • OVERAGE_COOLDOWN_SECONDS: default 3600 (1 hour)

Each account and Worker can override these. The accounts list is hardcoded (account IDs, billing cycle day, Worker names, optional per-Worker thresholds). A single API token with access to all accounts is used.
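Resolving the effective thresholds for a Worker is a layered merge. A sketch, assuming a hypothetical config shape that mirrors the fields described above; the precedence order (Worker over account over environment variables over built-in defaults) is my reading of "each account and Worker can override these":

```typescript
interface Thresholds {
  requests: number;
  cpuTimeMs: number;
  cooldownSeconds: number;
}

// Hypothetical config shapes; field names are illustrative.
interface WorkerConfig {
  name: string;
  thresholds?: Partial<Thresholds>;
}

interface AccountConfig {
  accountId: string;
  billingCycleDay: number;
  thresholds?: Partial<Thresholds>;
  workers: WorkerConfig[];
}

// Environment variables arrive as strings (or are absent).
interface EnvVars {
  REQUEST_THRESHOLD?: string;
  CPU_TIME_THRESHOLD_MS?: string;
  OVERAGE_COOLDOWN_SECONDS?: string;
}

const DEFAULTS: Thresholds = { requests: 500_000, cpuTimeMs: 5_000_000, cooldownSeconds: 3600 };

// Parses a positive number from an env var, falling back when missing or invalid.
function positiveOr(raw: string | undefined, fallback: number): number {
  const n = Number(raw);
  return raw !== undefined && Number.isFinite(n) && n > 0 ? n : fallback;
}

// Applies overrides left to right, ignoring keys that are absent or undefined.
function applyOverrides(base: Thresholds, ...overrides: (Partial<Thresholds> | undefined)[]): Thresholds {
  const out: Thresholds = { ...base };
  for (const o of overrides) {
    if (o?.requests !== undefined) out.requests = o.requests;
    if (o?.cpuTimeMs !== undefined) out.cpuTimeMs = o.cpuTimeMs;
    if (o?.cooldownSeconds !== undefined) out.cooldownSeconds = o.cooldownSeconds;
  }
  return out;
}

// Precedence: built-in default < env var < account override < Worker override.
function resolveThresholds(env: EnvVars, account: AccountConfig, workerName: string): Thresholds {
  const fromEnv: Thresholds = {
    requests: positiveOr(env.REQUEST_THRESHOLD, DEFAULTS.requests),
    cpuTimeMs: positiveOr(env.CPU_TIME_THRESHOLD_MS, DEFAULTS.cpuTimeMs),
    cooldownSeconds: positiveOr(env.OVERAGE_COOLDOWN_SECONDS, DEFAULTS.cooldownSeconds),
  };
  const worker = account.workers.find((w) => w.name === workerName);
  return applyOverrides(fromEnv, account.thresholds, worker?.thresholds);
}
```

Falling back on unparseable values, rather than throwing, keeps a typo in an environment variable from disabling the monitor entirely.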

Daily Report

The daily report gives a snapshot of usage and estimated cost for the current billing period. It pulls:

  • Worker metrics: requests, errors, CPU time, subrequests per script
  • Account usage: D1 rows read/written and storage, KV operations and storage, R2 requests and storage, Queues operations, Durable Objects requests/duration/storage, Workflows requests/CPU/storage

Storage is shown in MB. Billing period timestamps are full ISO 8601 (midnight start, 23:59:59 end). The report applies Workers Paid pricing and breaks down overage cost by product. Output goes to D1 and to Discord as a JSON attachment.
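Computing the billing period from the configured billing cycle day is the fiddliest bit of date handling. A minimal sketch, assuming periods start at midnight UTC on the cycle day; `billingPeriod` and `bytesToMB` are hypothetical helpers, cycle days 29 to 31 are rejected rather than handled, and MB here means 10^6 bytes:

```typescript
interface BillingPeriod {
  start: string; // ISO 8601, 00:00:00 UTC on the cycle day
  end: string; // ISO 8601, 23:59:59 UTC the day before the next cycle day
}

// ISO 8601 without milliseconds, e.g. 2024-03-15T00:00:00Z.
function toIso(d: Date): string {
  return d.toISOString().replace(/\.\d{3}Z$/, "Z");
}

// Days 29-31 are rejected to sidestep short months; that is a simplification.
function billingPeriod(now: Date, cycleDay: number): BillingPeriod {
  if (!Number.isInteger(cycleDay) || cycleDay < 1 || cycleDay > 28) {
    throw new RangeError("cycleDay must be an integer from 1 to 28");
  }
  const year = now.getUTCFullYear();
  let month = now.getUTCMonth();
  // Before the cycle day we are still in the period that began last month.
  if (now.getUTCDate() < cycleDay) month -= 1;
  // Date.UTC normalizes out-of-range months (-1 is the previous December).
  const start = new Date(Date.UTC(year, month, cycleDay));
  const nextStart = new Date(Date.UTC(year, month + 1, cycleDay));
  const end = new Date(nextStart.getTime() - 1000); // one second before the next period
  return { start: toIso(start), end: toIso(end) };
}

// Storage in decimal megabytes (10^6 bytes), rounded to two decimals.
function bytesToMB(bytes: number): number {
  return Math.round((bytes / 1_000_000) * 100) / 100;
}
```

With a cycle day of 15, any moment between March 15 and April 14 maps to the period 2024-03-15T00:00:00Z to 2024-04-14T23:59:59Z.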

What I'd Do Differently

A few things I might change or add:

  • Workers Logs in the report (Cloudflare's GraphQL API doesn't expose those metrics yet, so it's a placeholder)
  • A way to re-enable Workers from the same system instead of doing it manually in the dashboard
  • A UI for the report and controls, behind Cloudflare Access

If you run multiple Cloudflare accounts and want guardrails against runaway usage, this pattern is a solid starting point. The core idea is simple: monitor, compare, and disconnect before the bill gets out of hand.