Setup guide · everything in one page

How Cloud Watchdog works and how to set it up.

A complete walkthrough — from sign-up to alerts firing in Slack — written so a non-engineer can follow it. Includes the AWS CloudFormation templates you'll need (download links below) and the exact tags your resources should carry so circuit breakers behave safely.

01

What Cloud Watchdog actually does

Cloud Watchdog is a watchdog (literally) for your AWS bill. It connects read-only into your AWS account, watches your live CloudWatch metrics every few minutes, and sends a Slack message + email the moment one of your resources starts behaving like it's about to cost you a lot of money.

Three things happen behind the scenes, all on a schedule:

1. Inventory sync (every 30 min)

Pulls a fresh list of your EC2, Lambda, ECS, RDS, NAT, EBS, EIP, ELB, and CloudWatch log groups so you always have an up-to-date map of what's actually running.

2. Metric polling (every 5 min)

For each alert rule you've enabled, fetches the latest CloudWatch sample (CPU, Invocations, NAT bytes, RDS connections, etc). Compares it to your threshold.

3. Idle scan (every 6 hours)

Finds unattached EBS volumes, unused Elastic IPs, idle NAT gateways, low-CPU EC2, orphan snapshots. Each shows the exact dollars/month you'd save by deleting.

Free tier first. All AWS API calls use CloudWatch's free tier (1M requests/month) plus Cost Explorer when you ask for spend data ($0.01/call, and even that's opt-in via the dashboard toggle). Your AWS bill won't grow just because Cloud Watchdog is watching it.

02

Free vs Starter — which one should you pick?

Free is for trying it out on a single AWS account with a handful of alert rules and the top idle resources visible. Starter adds more accounts, more rules, the full idle list, and the actually-stop-the-thing circuit breakers.

Feature | Free | Starter
Connected AWS accounts | 1 | 3
Alert rules | 5 | 25
Slack + email alerts | Yes | Yes
Idle-resource waste finder | Top 5 findings | All findings
Circuit breakers (Lambda throttle, EC2 stop, ECS scale-to-zero) | Disabled | Up to 10 rules
Cost Explorer auto-sync | Manual only | Manual or daily auto
Price | $0 / forever | $19 / month (founder lock)

You can downgrade any time — Starter features just turn back off and your rules are converted to alert-only.

03

Step-by-step setup — 10 minutes start to finish

You'll go through five short steps. No prior AWS or DevOps experience required — the trickiest bit is uploading a YAML file in the AWS Console, and we provide a one-click download for that.

  1. Sign up for the dashboard (2 min)

    Head to cloudwatchdog.online/sign-up and create your account with Google or email. You'll land on an empty workspace called "My workspace".

    No credit card needed. Free plan is active by default.

  2. Deploy the IAM role in your AWS account (5 min)

    Cloud Watchdog never stores AWS access keys. Instead, you give it a tiny IAM role inside your AWS account that it can assume short-lived. You do that with a CloudFormation template we ship for free.

    1. Pick the right template (see below): Read-only for the Free plan, Circuit-breaker for Starter.
    2. AWS Console → CloudFormation → Create stack → With new resources → upload the YAML.
    3. Stack name: CloudWatchdog (or whatever you like).
    4. When status reaches CREATE_COMPLETE, copy the RoleArn from the Outputs tab.
    5. Back in Cloud Watchdog → /onboarding/cloud → paste the ARN. We run an automatic permissions probe and tell you within ~10 seconds whether everything's wired correctly.
  3. Wire up Slack + email so alerts actually reach you (2 min)

    /settings → Notifications. Paste a Slack incoming-webhook URL (created in your Slack workspace under "Apps → Custom Integrations → Incoming Webhooks") and add an email address. We send a test ping to confirm both work before saving.

  4. Create your first alert rule (1 min)

    /alert-rules → Create rule. Pick the service (Lambda / EC2 / RDS / NAT), metric, threshold, and time window. For most rules a 10-minute window works well, because shorter windows often miss CloudWatch's ~5-minute publish cadence.

    The form enforces this for you: if you pick EC2 or RDS, the minimum window jumps to 10 min. Lambda can stay at 2 min because its metrics publish in near-real time.

  5. Tag resources, only if you want circuit breakers (2 min per resource)

    Skip this step for Free. If you're on Starter and want the system to actually stop a Lambda or EC2 instance when it goes haywire, you have to opt resources in with two tags (and the resource's env tag must be dev or staging). See the tag section below for the exact key/value pairs.

04

Connecting your AWS account

Cloud Watchdog needs to read your AWS resources to find waste and watch metrics. For the optional circuit breakers, it also needs a narrow set of write permissions (only on resources you've opted in by tag). Both come from a CloudFormation template we publish on GitHub for you to audit before deploying.

Download the right template for your plan

Read-only template

Use this for the Free plan or if you only ever want detection.

Permissions granted

  • ec2:Describe*
  • lambda:List* / Get*
  • ecs:List* / Describe*
  • rds:Describe*
  • logs:Describe* / Get*
  • cloudwatch:GetMetricStatistics
  • cloudwatch:GetMetricData
  • ce:GetCostAndUsage (for $ data, optional)

Circuit-breaker template (most popular)

Use this for Starter when you want auto-stop / auto-throttle.

Permissions granted

  • All read perms from the read-only template
  • + ec2:StopInstances
  • + lambda:PutFunctionConcurrency
  • + ecs:UpdateService
  • + application-autoscaling:RegisterScalableTarget
  • All write perms scoped via IAM Condition tags
  • Resources tagged env=prod refused at the policy layer

What's inside the YAML? A single IAM role with a trust policy that allows sts:AssumeRole from Cloud Watchdog's control-plane principal, conditioned on an ExternalId we generate per customer to prevent the "confused deputy" problem. The permission policy is the list above. No long-lived credentials are ever stored on our side. Every API call we make on your behalf is signed with short-lived (15-min) credentials from AssumeRole.
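
To make the shape of that role concrete, here is a minimal Python sketch that builds the two kinds of policy JSON described above: a trust policy gated on an ExternalId, and one write permission scoped by a resource tag. The principal ARN, the ExternalId value, and the function names are hypothetical placeholders, and the statement layout is an assumption about what such a template typically contains. The published YAML remains the source of truth.

```python
import json

# Hypothetical placeholders: the real principal ARN and ExternalId come from
# the Cloud Watchdog onboarding page, not from this sketch.
CONTROL_PLANE_PRINCIPAL = "arn:aws:iam::111111111111:role/example-control-plane"
EXTERNAL_ID = "example-external-id"


def build_trust_policy(principal_arn: str, external_id: str) -> dict:
    """Trust policy: one principal may assume the role, but only when it
    presents the expected ExternalId (the confused-deputy guard)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


def build_tag_scoped_stop_statement() -> dict:
    """Assumed shape of one write permission: stopping instances is allowed
    only when the instance carries the auto-stop opt-in tag."""
    return {
        "Effect": "Allow",
        "Action": "ec2:StopInstances",
        "Resource": "*",
        "Condition": {
            "StringEquals": {"aws:ResourceTag/cloudwatchdog:auto-stop": "true"}
        },
    }


trust_policy = build_trust_policy(CONTROL_PLANE_PRINCIPAL, EXTERNAL_ID)
print(json.dumps(trust_policy, indent=2))
```

Comparing the Condition blocks in the downloaded template against this shape is a quick way to audit it before deploying.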

Deploy the template (the screen-by-screen part)

  1. AWS Console → search CloudFormation → make sure your region is one your resources actually live in (e.g. us-east-1).
  2. Click Create stack → With new resources (standard).
  3. Pick Upload a template file, click Choose file, and select the YAML you downloaded above. Click Next.
  4. Stack name: CloudWatchdog (or CloudWatchdog-CircuitBreaker if deploying the breaker template as a second stack). Leave parameters at their defaults; they auto-fill from Cloud Watchdog's onboarding page.
  5. Two more clicks of Next (you can skip tags + advanced options).
  6. Tick the IAM capabilities checkbox at the bottom of the review page (that's AWS reminding you the stack creates a role). Click Submit.
  7. Wait ~30 seconds. Status flips from CREATE_IN_PROGRESS to CREATE_COMPLETE.
  8. Open the Outputs tab. Copy RoleArn — it looks like arn:aws:iam::123456789012:role/CloudWatchdog-Role-XYZ. Paste it into /onboarding/cloud in Cloud Watchdog and click Test & save.
You can deploy both templates side-by-side. If you start on Free / read-only and later upgrade, just deploy the circuit-breaker template as a separate stack and swap the ARN in /cloud-accounts → Edit role. Your old read-only role keeps working if you ever want to swap back.

05

Resource tags — the safety net for circuit breakers

Cloud Watchdog refuses to act on any AWS resource, even with the circuit-breaker IAM role attached, unless you've marked the resource with two opt-in tags. This is the second of three safety checks (alongside the IAM policy itself and the production-environment guard). Pure detection / read-only rules need no tags at all.

Tag | What it means
env = dev or staging | The hard guard. If a resource is tagged env=prod (or has no env tag and gets inferred as prod), circuit breakers refuse to touch it. Period. There is no way to override.
cloudwatchdog:managed = true | "Cloud Watchdog may read, probe, and plan actions against me." Without this tag, no auto-action will ever be queued, even if the rule fires. Inventory + alerts still work fine.
cloudwatchdog:auto-stop = true | "You may additionally stop / throttle / scale-me-to-zero." Separate from managed so you can put managed on everything in inventory but opt in only disposable resources for auto-stop.
Why three layers of safety? Belt + suspenders + parachute. Even if a rule's scope accidentally matches a critical resource, all three of (env-tag check, managed tag, auto-stop tag) have to align before any AWS-side change happens. We'd rather you have to add a tag than wake up to accidental production downtime.
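
The three tag checks can be sketched as a single gate function. This is an illustration only: may_auto_act and ALLOWED_ENVS are hypothetical names, and treating a missing or unrecognized env value as prod is an assumption (the text says a missing env tag "gets inferred as prod" but does not describe the inference).

```python
from __future__ import annotations

# Assumption: only these env values permit automatic actions.
ALLOWED_ENVS = {"dev", "staging"}


def may_auto_act(tags: dict[str, str]) -> tuple[bool, str]:
    """Return (allowed, reason) for a resource's tag set.

    All three checks must pass before any AWS-side change is planned.
    """
    # 1. Env guard: prod, missing, or unknown env values are refused.
    if tags.get("env") not in ALLOWED_ENVS:
        return False, "env guard: resource is prod or has no dev/staging env tag"
    # 2. Managed tag: without it, no action is queued (alerts still fire).
    if tags.get("cloudwatchdog:managed") != "true":
        return False, "missing cloudwatchdog:managed=true"
    # 3. Auto-stop tag: the second explicit opt-in.
    if tags.get("cloudwatchdog:auto-stop") != "true":
        return False, "missing cloudwatchdog:auto-stop=true"
    return True, "all three checks passed"


disposable = {
    "env": "dev",
    "cloudwatchdog:managed": "true",
    "cloudwatchdog:auto-stop": "true",
}
print(may_auto_act(disposable))
print(may_auto_act({"cloudwatchdog:managed": "true"}))
```

Ordering the env guard first means a prod resource is refused even if both opt-in tags were added by mistake.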

How to add the tags

AWS Console → EC2 (or Lambda / ECS, etc) → select your resource → Tags tab → Manage tags → Add tag. Or use the AWS CLI:

aws ec2 create-tags \
  --resources i-0abc1234567890 \
  --tags Key=env,Value=dev \
         Key=cloudwatchdog:managed,Value=true \
         Key=cloudwatchdog:auto-stop,Value=true

After tagging, click Sync inventory on /resources — the new tags land in our DB within 10 seconds, and the next alert that opens will plan an action successfully.

06

Slack & email notifications

Two channels, both equally important. Most teams set them up together — Slack for "everyone sees it instantly", email for "the on-call person in another timezone has a paper trail at 3am".

Slack

  1. In Slack: Apps → search 'Incoming WebHooks' → Add to your workspace.
  2. Pick the channel where alerts should land (e.g. #aws-alerts).
  3. Copy the webhook URL (it looks like https://hooks.slack.com/services/T00.../B00.../xxx).
  4. Cloud Watchdog → /settings → Notifications → paste the webhook → Test → Save.
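
If you want to check a webhook yourself before pasting it in, a Slack incoming webhook accepts an HTTP POST with a JSON body containing a "text" field. The sketch below uses only the Python standard library. The function names are hypothetical, and this is not the test ping Cloud Watchdog sends, just an independent way to confirm the URL works.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a plain-text message."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_message(webhook_url: str, text: str) -> int:
    """Send the message and return the HTTP status code."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Call send_slack_message with the webhook URL you copied in step 3 and any short message; the message should appear in the channel you picked.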

Email

  1. /settings → Notifications → enter the email address.
  2. Click Send test email. Cloud Watchdog fires a test ping via Resend.
  3. Check the inbox + spam folder. If it landed in spam, mark it Not Spam; domain reputation builds over time.
  4. Save. Verified addresses get a green check.

Sent only on real events. No newsletters, no "you have 0 alerts" summaries, no marketing. The only emails we send are real alerts, plus a verification ping the first time you set the channel up.

07

Creating alert rules

A rule is a sentence of the form "if <metric> for <service> goes <operator> <threshold> over <window> minutes, do <action>." Two flavors:

Usage-metric rule

Fires on a CloudWatch metric crossing a threshold.

e.g. Lambda Invocations ≥ 100 in 5 min

Cost rule

Fires when your AWS spend (in dollars) crosses a threshold.

e.g. Total spend ≥ $50 in the last 7 days
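
As a rough illustration of the rule sentence above, a usage-metric rule can be modeled as a small record plus a comparison. This is a sketch, not Cloud Watchdog's implementation: the Rule class, its field names, the operator table, and rule_fires are all hypothetical, and the observed value is assumed to be already aggregated over the rule's window.

```python
import operator
from dataclasses import dataclass

# Hypothetical operator table; the product's actual operator set may differ.
OPERATORS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
}


@dataclass(frozen=True)
class Rule:
    service: str
    metric: str
    op: str
    threshold: float
    window_minutes: int
    action: str = "alert_only"


def rule_fires(rule: Rule, observed: float) -> bool:
    """Compare a value already aggregated over the rule's window to its threshold."""
    return OPERATORS[rule.op](observed, rule.threshold)


# The usage-metric example from the text: Lambda Invocations >= 100 in 5 min.
lambda_rule = Rule("Lambda", "Invocations", ">=", 100, 5)
print(rule_fires(lambda_rule, 140))  # True: 140 invocations crosses the threshold
print(rule_fires(lambda_rule, 60))   # False: still under the threshold
```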

The metric library: 43 metrics across 7 services

When you pick a service, the metric dropdown only shows metrics that make sense for it, and each metric carries its own unit. Pick CPU, type 70, and the form labels it %. Pick NetworkOut, type 10, and the form labels it MB (and converts to bytes when it talks to CloudWatch, so you don't do the math). Coverage:

  • EC2 (8 metrics): CPU, NetworkIn/Out, NetworkPacketsIn/Out, MetadataNoToken, CPUCreditUsage, CPUCreditBalance.
  • EBS (7 metrics): queue length, R/W throughput, R/W ops, idle time, BurstBalance.
  • RDS (7 metrics): CPU, FreeableMemory, DatabaseConnections, R/W IOPS, FreeStorageSpace, ReplicaLag.
  • Lambda (6 metrics): Invocations, Errors, Duration, Throttles, ConcurrentExecutions, IteratorAge.
  • S3 (5 metrics): BucketSizeBytes, NumberOfObjects, AllRequests, 4xx, 5xx.
  • NAT (6 metrics): BytesOutToDestination, BytesInFromDestination, ActiveConnectionCount, ErrorPortAllocation, PacketsDropCount, IdleTimeoutCount.
  • ELB / ALB (4 metrics): RequestCount, TargetResponseTime, HTTP 5xx, HTTP 4xx.
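
The unit handling described above (type 10 for NetworkOut, have it sent to CloudWatch as bytes) amounts to a per-unit multiplier. A minimal sketch follows. The names are hypothetical, and whether the product treats 1 MB as 1,000,000 or 1,048,576 bytes is not stated, so the decimal value here is an assumption.

```python
# Assumption: decimal megabytes. The product may use 1_048_576 instead.
BYTES_PER_MB = 1_000_000

# Hypothetical unit table: percentages and counts pass through unchanged.
UNIT_MULTIPLIERS = {
    "%": 1,
    "count": 1,
    "MB": BYTES_PER_MB,
}


def to_cloudwatch_value(value: float, unit: str) -> float:
    """Convert a threshold typed in the form's display unit to the raw metric unit."""
    return value * UNIT_MULTIPLIERS[unit]


print(to_cloudwatch_value(70, "%"))   # CPU threshold stays 70
print(to_cloudwatch_value(10, "MB"))  # NetworkOut threshold becomes 10000000 bytes
```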

Each resource also has its own dedicated detail page (open any row on /resources) that renders every catalog metric as its own 24h graph, with a one-click "Set alert rule for this metric" CTA next to each.

Resource scope — one rule, many instances

Every Usage rule asks which resources? Two modes:

  • All current + future: the rule applies to every resource of the chosen service, including ones you create tomorrow. Best for "every Lambda must obey the runaway-invocation rule".
  • Pick specific resources: a search-box + checkbox list lets you scope to a hand-picked subset. New resources are not auto-included.

CloudWatch Alarms force a 1:1 alarm-per-resource model: 80 EC2 instances mean 80 nearly identical alarms. Here it's one rule.

Auto-suggest threshold

Click the Auto-suggest from last 7d button next to the Threshold input. Cloud Watchdog pulls every cached sample for the matched resources, computes P50 / P95 / P99, and fills in P95 × 1.2 rounded to a clean number. You can refine from there; it's a starting point, not a lock-in. Needs ≥10 samples to fire.
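
The calculation can be sketched in a few lines of Python. Two details are assumptions, since the text does not specify them: the percentile method (nearest-rank here) and what "a clean number" means (rounding to two significant figures here). All function names are hypothetical.

```python
import math

MIN_SAMPLES = 10  # from the text: needs at least 10 samples


def percentile(sorted_samples, pct):
    """Nearest-rank percentile of an ascending list (pct is an integer 1..100)."""
    rank = math.ceil(pct * len(sorted_samples) / 100)
    return sorted_samples[max(rank, 1) - 1]


def round_clean(value, sig_figs=2):
    """Round to a fixed number of significant figures (assumed meaning of 'clean')."""
    if value == 0:
        return 0
    magnitude = math.floor(math.log10(abs(value)))
    factor = 10 ** (magnitude - sig_figs + 1)
    return round(value / factor) * factor


def suggest_threshold(samples):
    """Return P95 x 1.2 rounded to a clean number, or None with too few samples."""
    if len(samples) < MIN_SAMPLES:
        return None
    ordered = sorted(samples)
    p95 = percentile(ordered, 95)
    return round_clean(p95 * 1.2)


# 100 samples with values 1..100: P95 is 95, and 95 x 1.2 = 114 rounds to 110.
print(suggest_threshold(list(range(1, 101))))
```

The 1.2 multiplier leaves headroom above normal peaks, so the rule fires on genuinely unusual values rather than on the top 5% of ordinary traffic.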

Suggested rules from your inventory

Once inventory sync is running, the /alert-rules page shows a Suggested for your inventory card with 2–4 high-value rules you probably want — based on what services you actually run. One click pre-fills the create form.

Action modes — three to pick from

Free + Starter

Alert only

Notify Slack + email. No AWS-side change ever happens. Recommended starting point.

Starter

Auto-execute after 5-min cancel window

On match, plan an action and show a 5-min Cancel button. If nobody clicks Cancel, it auto-runs. For runaway scenarios.

Starter

Manual confirm only

Plan the action, queue it forever, no auto-run. A human must click Confirm in Slack / on /alerts. Safer choice for sensitive resources.
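
The difference between the three modes comes down to what happens after an action is planned. The sketch below models that decision as a pure function. The mode strings, status strings, and next_step are hypothetical names, not the product's internal states.

```python
from datetime import datetime, timedelta

CANCEL_WINDOW = timedelta(minutes=5)  # from the text: 5-min cancel window


def next_step(mode, planned_at, now, cancelled=False, confirmed=False):
    """Decide what should happen to a planned action at time `now`."""
    if mode == "alert_only":
        return "notify_only"  # no AWS-side change ever happens
    if cancelled:
        return "cancelled"
    if mode == "auto_execute":
        # Runs on its own once the cancel window has elapsed.
        if now - planned_at >= CANCEL_WINDOW:
            return "execute"
        return "waiting_for_cancel_window"
    if mode == "manual_confirm":
        # Queued indefinitely until a human confirms.
        return "execute" if confirmed else "waiting_for_confirm"
    raise ValueError(f"unknown action mode: {mode}")


planned = datetime(2024, 1, 1, 12, 0)
print(next_step("auto_execute", planned, planned + timedelta(minutes=3)))
print(next_step("auto_execute", planned, planned + timedelta(minutes=5)))
print(next_step("manual_confirm", planned, planned + timedelta(days=2)))
```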

Window minimums per service. EC2 and RDS publish CloudWatch metrics every ~5 minutes with another ~5 min ingestion lag, so a 2-minute window almost always sees zero samples. The create-rule form enforces a 10-min minimum for EC2/RDS and 2 min for Lambda/NAT. Anything lower would mostly evaluate empty windows.
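
Expressed as code, the form's check is a lookup plus a comparison. The minimums below come from the text; the function name and the error wording are hypothetical.

```python
# Minimum rule window per service, in minutes (values from the text above).
MIN_WINDOW_MINUTES = {"EC2": 10, "RDS": 10, "Lambda": 2, "NAT": 2}


def validate_window(service: str, window_minutes: int) -> int:
    """Return the window if it meets the service minimum, else raise ValueError."""
    minimum = MIN_WINDOW_MINUTES[service]
    if window_minutes < minimum:
        raise ValueError(
            f"{service} rules need a window of at least {minimum} min, "
            f"got {window_minutes}"
        )
    return window_minutes


print(validate_window("Lambda", 2))  # accepted
print(validate_window("EC2", 10))    # accepted
```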

08

Circuit breakers — what they actually do

A circuit breaker is what we call a rule whose action mode is anything other than "Alert only". When the threshold trips, Cloud Watchdog can actively halt the resource that's bleeding money. It's gated by three safety checks on top of the IAM policy:

1. Not prod

If env=prod (or no env tag and we infer prod), the action is refused at planning time. No exceptions.

2. Managed tag

The resource must have cloudwatchdog:managed=true. Without it, no action is queued — alerts still fire.

3. Auto-stop tag

cloudwatchdog:auto-stop=true is the second explicit opt-in, distinct from managed. Lets you mark inventory broadly but only auto-stop a narrow subset.

What each action actually does

Lambda throttle

Calls lambda:PutFunctionConcurrency with 0 — new invocations get throttled to a hard stop. Existing in-flight functions complete normally.

Reversible

EC2 stop

Calls ec2:StopInstances. The instance enters 'stopping' then 'stopped'. EBS volumes survive; you pay only for the storage.

Reversible

ECS scale-to-zero

Calls ecs:UpdateService with desiredCount=0. The service drains and stops its running tasks. Task definitions + service config are preserved.

Reversible

Everything is reversible. The action_request row stores the previous state (concurrency value, desiredCount, instance state). One click on Restore in /alerts puts it back. No data is ever deleted by Cloud Watchdog.
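
The idea of storing the previous state alongside each action can be sketched as follows, using the ECS case. The ActionRequest class here is a hypothetical stand-in for the action_request row, and its field names and helper functions are assumptions, not the real schema.

```python
from dataclasses import dataclass, field


@dataclass
class ActionRequest:
    """Hypothetical stand-in for one action_request row."""
    resource_id: str
    action: str
    applied_state: dict = field(default_factory=dict)
    previous_state: dict = field(default_factory=dict)


def plan_ecs_scale_to_zero(service_id: str, current_desired_count: int) -> ActionRequest:
    """Record the current desiredCount before planning the scale to zero."""
    return ActionRequest(
        resource_id=service_id,
        action="ecs_scale_to_zero",
        applied_state={"desiredCount": 0},
        previous_state={"desiredCount": current_desired_count},
    )


def restore_parameters(request: ActionRequest) -> dict:
    """Parameters a Restore would send to put the resource back."""
    return dict(request.previous_state)


request = plan_ecs_scale_to_zero("example-service", 3)
print(request.applied_state)        # {'desiredCount': 0}
print(restore_parameters(request))  # {'desiredCount': 3}
```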

09

Frequently asked questions

Does Cloud Watchdog store my AWS access keys?

No. We use sts:AssumeRole with a per-customer ExternalId. Every API call we make to your account is signed with short-lived 15-minute credentials. There's no long-lived secret in our database, and we couldn't keep one if we tried — the trust policy in your IAM role wouldn't allow it.

What happens if I downgrade from Starter to Free?

Your circuit-breaker rules are auto-converted to alert-only (detection keeps firing, no auto-stop). Alert rules above the Free cap of 5 get disabled, newest first — the oldest 5 stay enabled. Connected cloud accounts above the Free cap of 1 stay connected but you can't add new ones until you upgrade again. Everything is reversible.
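
The rule side of a downgrade can be sketched like this. The AlertRule class and its fields are hypothetical; the cap of 5 and the "oldest 5 stay enabled" behavior come from the answer above. Counting only currently enabled rules against the cap is an assumption.

```python
from dataclasses import dataclass, replace

FREE_RULE_CAP = 5  # from the text: Free plan allows 5 alert rules


@dataclass(frozen=True)
class AlertRule:
    name: str
    created_order: int  # lower means older
    action_mode: str
    enabled: bool = True


def downgrade_to_free(rules):
    """Convert every rule to alert-only and keep only the oldest 5 enabled."""
    kept = 0
    result = []
    for rule in sorted(rules, key=lambda r: r.created_order):
        keep = rule.enabled and kept < FREE_RULE_CAP
        if keep:
            kept += 1
        result.append(replace(rule, action_mode="alert_only", enabled=keep))
    return result


rules = [
    AlertRule(f"rule-{i}", i, "auto_execute" if i % 2 else "alert_only")
    for i in range(7)
]
for rule in downgrade_to_free(rules):
    print(rule.name, rule.action_mode, rule.enabled)
```

In this run rule-0 through rule-4 stay enabled, rule-5 and rule-6 are disabled, and every rule ends up alert-only.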

Can the system stop its own host EC2?

Yes, if you've tagged the host with cloudwatchdog:auto-stop=true. Don't do that. The audit log will show what happened, and Cloud Watchdog will come back up on its own once you start the instance from the AWS console (Docker restart-policy handles the rest). Test rules on a separate disposable EC2.

How fast can detection actually be?

Roughly 10 min worst-case for usage-metric rules (CloudWatch's ~5-min publish cadence plus our 5-min poll). Lambda metrics are near-real-time. Cost rules use Cost Explorer, which is ~24 hours stale, so they're for slow-burn detection, not emergencies.

Will Cloud Watchdog make my AWS bill go UP?

Tiny amounts. The CloudWatch GetMetricStatistics calls stay inside the 1M/month free tier for normal use. Cost Explorer API is $0.01/call and is opt-in (toggle on the spend card). The AssumeRole calls themselves are free.

What regions are supported?

All AWS commercial regions. The inventory sync auto-discovers which regions your account has resources in, so you only pay metric-poll cost where you have things running.

Is there an open-source version?

Not the full product, but the CloudFormation templates are public — you can audit them before deploying. We're a single founder (Bibek Jha) building this in the open.

That's everything. Want to get started?

Free for a single AWS account + 5 alert rules, $19/month if you want more accounts, more rules, and the auto-stop circuit breakers. Cancel any time.