How Cloud Watchdog works and how to set it up
A complete walkthrough — from sign-up to alerts firing in Slack — written so a non-engineer can follow it. Includes the AWS CloudFormation templates you'll need (download links below) and the exact tags your resources should carry so circuit breakers behave safely.
01
What Cloud Watchdog actually does
Cloud Watchdog is a watchdog (literally) for your AWS bill. It connects read-only into your AWS account, watches your live CloudWatch metrics every few minutes, and sends a Slack message + email the moment one of your resources starts behaving like it's about to cost you a lot of money.
Three things happen behind the scenes, all on a schedule:
1. Inventory sync (every 30 min)
Pulls a fresh list of your EC2, Lambda, ECS, RDS, NAT, EBS, EIP, ELB, and CloudWatch log groups so you always have an up-to-date map of what's actually running.
2. Metric polling (every 5 min)
For each alert rule you've enabled, fetches the latest CloudWatch sample (CPU, Invocations, NAT bytes, RDS connections, etc). Compares it to your threshold.
3. Idle scan (every 6 hours)
Finds unattached EBS volumes, unused Elastic IPs, idle NAT gateways, low-CPU EC2, orphan snapshots. Each shows the exact dollars/month you'd save by deleting.
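As a rough illustration of the metric-polling step, the sketch below fetches the newest CloudWatch datapoint for one resource and compares it to a threshold. It assumes a boto3 CloudWatch client is passed in. The function names `latest_sample` and `breaches` are hypothetical, and this is not necessarily how Cloud Watchdog implements the poll.

```python
from datetime import datetime, timedelta, timezone


def latest_sample(cloudwatch, namespace, metric, dimensions, window_minutes=10, stat="Average"):
    """Return the newest datapoint value in the window, or None if CloudWatch has none yet."""
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric,
        Dimensions=dimensions,
        StartTime=end - timedelta(minutes=window_minutes),
        EndTime=end,
        Period=300,  # 5-minute buckets, chosen here to match the 5-minute poll
        Statistics=[stat],
    )
    datapoints = response.get("Datapoints", [])
    if not datapoints:
        return None
    # Datapoints are not guaranteed to arrive sorted, so pick the newest explicitly.
    newest = max(datapoints, key=lambda point: point["Timestamp"])
    return newest[stat]


def breaches(value, threshold):
    """True when the sample is at or above the threshold; a missing sample never breaches."""
    return value is not None and value >= threshold
```

With boto3 installed, a call could look like `latest_sample(boto3.client("cloudwatch"), "AWS/EC2", "CPUUtilization", [{"Name": "InstanceId", "Value": instance_id}])`, where `instance_id` is one of your own instances.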
02
Free vs Starter — which one should you pick?
Free is for trying it out on a single AWS account with a handful of alert rules and the top idle resources visible. Starter adds more accounts, more rules, the full idle list, and the actually-stop-the-thing circuit breakers.
| Feature | Free | Starter |
|---|---|---|
| Connected AWS accounts | 1 | 3 |
| Alert rules | 5 | 25 |
| Slack + email alerts | Yes | Yes |
| Idle-resource waste finder | Top 5 findings | All findings |
| Circuit breakers (Lambda throttle, EC2 stop, ECS scale-to-zero) | Disabled | Up to 10 rules |
| Cost Explorer auto-sync | Manual only | Manual or daily auto |
| Price | $0 / forever | $19 / month (founder lock) |
03
Step-by-step setup — 10 minutes start to finish
You'll go through five short steps. No prior AWS or DevOps experience required — the trickiest bit is uploading a YAML file in the AWS Console, and we provide a one-click download for that.
- 01
Sign up for the dashboard
2 min
Head to cloudwatchdog.online/sign-up and create your account with Google or email. You'll land on an empty workspace called "My workspace".
No credit card needed. Free plan is active by default.
- 02
Deploy the IAM role in your AWS account
5 min
Cloud Watchdog never stores AWS access keys. Instead, you create a small IAM role inside your AWS account, and Cloud Watchdog assumes it to get short-lived credentials. You create the role with a CloudFormation template we ship for free.
- Pick the right template (see below): Read-only for the Free plan, Circuit-breaker for Starter.
- AWS Console → CloudFormation → Create stack → With new resources → upload the YAML.
- Stack name: CloudWatchdog (or whatever you like).
- When status reaches CREATE_COMPLETE, copy the RoleArn from the Outputs tab.
- Back in Cloud Watchdog → /onboarding/cloud → paste the ARN. We run an automatic permissions probe and tell you within ~10 seconds whether everything's wired correctly.
- 03
Wire up Slack + email so alerts actually reach you
2 min
/settings → Notifications. Paste a Slack incoming-webhook URL (created in your Slack workspace under "Apps → Custom Integrations → Incoming Webhooks") and add an email address. We send a test ping to confirm both work before saving.
- 04
Create your first alert rule
1 min
/alert-rules → Create rule. Pick the service (Lambda / EC2 / RDS / NAT), metric, threshold, and time window. For most rules a 10-minute window works well — short windows often miss CloudWatch's ~5-minute publish cadence.
The form gates this for you — if you pick EC2 or RDS, the minimum window jumps to 10 min. Lambda can stay at 2 min because its metrics publish in near-real time.
- 05
Tag resources (only if you want circuit breakers)
2 min per resource
Skip this step for Free. If you're on Starter and want the system to actually stop a Lambda or EC2 instance when it goes haywire, you have to opt resources in with two tags. See the tag section below for the exact key/value pairs.
04
Connecting your AWS account
Cloud Watchdog needs to read your AWS resources to find waste and watch metrics. For the optional circuit breakers, it also needs a narrow set of write permissions (only on resources you've opted in by tag). Both come from a CloudFormation template we publish on GitHub for you to audit before deploying.
Download the right template for your plan
Read-only template
Use this for the Free plan or if you only ever want detection.
Permissions granted
- ec2:Describe*
- lambda:List* / Get*
- ecs:List* / Describe*
- rds:Describe*
- logs:Describe* / Get*
- cloudwatch:GetMetricStatistics
- cloudwatch:GetMetricData
- ce:GetCostAndUsage (for $ data, optional)
Circuit-breaker template
Use this for Starter when you want auto-stop / auto-throttle.
Permissions granted
- All read perms from the read-only template
- + ec2:StopInstances
- + lambda:PutFunctionConcurrency
- + ecs:UpdateService
- + application-autoscaling:RegisterScalableTarget
- All write perms scoped via IAM Condition tags
- Resources tagged env=prod refused at the policy layer
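To make "scoped via IAM Condition tags" concrete, here is a minimal sketch of what tag-conditioned statements can look like for one action, built as a Python dict and printed as JSON. It is an illustration under assumptions, not the contents of the published template: the `Sid` names and the exact combination of conditions are hypothetical, and the opt-in tag keys are the ones this guide names for circuit breakers.

```python
import json

# Hypothetical fragment of a permissions policy; the real template may differ.
statements = [
    {
        "Sid": "StopOptedInInstances",
        "Effect": "Allow",
        "Action": "ec2:StopInstances",
        "Resource": "*",
        # Only resources carrying both opt-in tags match this Allow.
        "Condition": {
            "StringEquals": {
                "aws:ResourceTag/cloudwatchdog:managed": "true",
                "aws:ResourceTag/cloudwatchdog:auto-stop": "true",
            }
        },
    },
    {
        "Sid": "NeverTouchProd",
        "Effect": "Deny",
        "Action": "ec2:StopInstances",
        "Resource": "*",
        # An explicit Deny always wins over an Allow in IAM policy evaluation.
        "Condition": {"StringEquals": {"aws:ResourceTag/env": "prod"}},
    },
]

policy = {"Version": "2012-10-17", "Statement": statements}
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

Because the Deny is evaluated regardless of any Allow, a resource tagged env=prod is refused even if it also carries both opt-in tags.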
The role's trust policy allows sts:AssumeRole from Cloud Watchdog's control-plane principal, conditioned on an ExternalId we generate per customer to prevent the "confused deputy" problem. The permission policy is the list above. No long-lived credentials are ever stored on our side. Every API call we make on your behalf is signed with short-lived (15-min) credentials from AssumeRole.
Deploy the template (the screen-by-screen part)
- AWS Console → search CloudFormation → make sure your region is one your resources actually live in (e.g. us-east-1).
- Click Create stack → With new resources (standard).
- Pick Upload a template file, click Choose file, and select the YAML you downloaded above. Click Next.
- Stack name: CloudWatchdog (or CloudWatchdog-CircuitBreaker if deploying the breaker template as a second stack). Leave parameters at their defaults; they auto-fill from Cloud Watchdog's onboarding page.
- Two more clicks of Next (you can skip tags + advanced options).
- Tick the IAM capabilities checkbox at the bottom of the review page. That's AWS reminding you the stack creates a role. Click Submit.
- Wait ~30 seconds. Status flips from CREATE_IN_PROGRESS to CREATE_COMPLETE.
- Open the Outputs tab. Copy RoleArn, which looks like arn:aws:iam::123456789012:role/CloudWatchdog-Role-XYZ. Paste it into /onboarding/cloud in Cloud Watchdog and click Test & save.
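If you want to sanity-check the value you copied before pasting it, a small format check is enough to catch a truncated or wrong ARN. The sketch below is an assumption about what a reasonable check looks like, not the probe Cloud Watchdog runs. `parse_role_arn` is a hypothetical helper, and it only covers the standard `aws` partition.

```python
import re

# IAM role ARN in the standard "aws" partition: 12-digit account ID, then role path/name.
ROLE_ARN = re.compile(r"^arn:aws:iam::(?P<account>\d{12}):role/(?P<name>[\w+=,.@\-/]+)$")


def parse_role_arn(value):
    """Return (account_id, role_name), or raise ValueError for anything that is not a role ARN."""
    match = ROLE_ARN.match(value.strip())
    if match is None:
        raise ValueError(f"not an IAM role ARN: {value!r}")
    return match.group("account"), match.group("name")
```

For the example ARN above, `parse_role_arn` returns the account ID `123456789012` and the role name `CloudWatchdog-Role-XYZ`. A user ARN or a string with a missing digit raises `ValueError`.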
06
Slack & email notifications
Two channels, both equally important. Most teams set them up together — Slack for "everyone sees it instantly", email for "the on-call person in another timezone has a paper trail at 3am".
Slack
1. In Slack: Apps → search 'Incoming WebHooks' → Add to your workspace.
2. Pick the channel where alerts should land (e.g. #aws-alerts).
3. Copy the webhook URL. It looks like https://hooks.slack.com/services/T00.../B00.../xxx.
4. Cloud Watchdog → /settings → Notifications → paste the webhook → Test → Save.
Email
1. /settings → Notifications → enter the email address.
2. Click Send test email. Cloud Watchdog fires a test ping via Resend.
3. Check the inbox + spam folder. If it landed in spam, mark it Not Spam; domain reputation builds over time.
4. Save. Verified addresses get a green check.
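If you'd like to confirm a Slack webhook works independently of Cloud Watchdog, the sketch below posts a plain text message using only the Python standard library. Slack incoming webhooks accept a JSON body with a `text` field. The helper names are hypothetical, and the message Cloud Watchdog itself sends may be formatted differently.

```python
import json
import urllib.request


def build_slack_request(webhook_url, text):
    """Build (but do not send) the POST request for a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_message(webhook_url, text, timeout=10):
    """Send the message; Slack answers HTTP 200 on success."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status == 200
```

Call `send_slack_message(your_webhook_url, "test ping")` with the URL you copied in step 3. Treat the webhook URL like a password, since anyone holding it can post to the channel.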
07
Creating alert rules
A rule is a sentence of the form "if <metric> for <service> goes <operator> <threshold> over <window> minutes, do <action>." Two flavors:
Usage-metric rule
Fires on a CloudWatch metric crossing a threshold.
e.g. Lambda Invocations ≥ 100 in 5 min
Cost rule
Fires when your AWS spend (in dollars) crosses a threshold.
e.g. Total spend ≥ $50 in the last 7 days
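The rule "sentence" above maps naturally onto a small record type. The following is a minimal sketch under assumptions: the field names, the operator set, and the `alert_only` default are hypothetical and only mirror the wording of this section.

```python
from dataclasses import dataclass
import operator

# Comparison operators a rule might offer (hypothetical set).
OPERATORS = {">=": operator.ge, ">": operator.gt, "<=": operator.le, "<": operator.lt}


@dataclass(frozen=True)
class Rule:
    service: str
    metric: str
    op: str
    threshold: float
    window_minutes: int
    action: str = "alert_only"

    def matches(self, observed: float) -> bool:
        """True when the value observed over the window crosses the threshold."""
        return OPERATORS[self.op](observed, self.threshold)

    def describe(self) -> str:
        """Render the rule in the sentence form used in this guide."""
        return (
            f"if {self.metric} for {self.service} goes {self.op} {self.threshold:g} "
            f"over {self.window_minutes} minutes, do {self.action}"
        )


# The usage-metric example from above: Lambda Invocations >= 100 in 5 min.
runaway = Rule("Lambda", "Invocations", ">=", 100, 5)
print(runaway.describe())
```

Here `runaway.matches(120)` is true and `runaway.matches(99)` is false.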
The metric library: 43 metrics across 7 services
When you pick a service, the metric dropdown only shows metrics that make sense for it, and each metric carries its own unit. Pick CPU, type 70, and the form labels it %. Pick NetworkOut, type 10, and the form labels it MB (and converts to bytes when it talks to CloudWatch, so you don't do the math). Coverage:
- EC2 (8 metrics): CPU, NetworkIn/Out, NetworkPacketsIn/Out, MetadataNoToken, CPUCreditUsage, CPUCreditBalance.
- EBS (7 metrics): queue length, R/W throughput, R/W ops, idle time, BurstBalance.
- RDS (7 metrics): CPU, FreeableMemory, DatabaseConnections, R/W IOPS, FreeStorageSpace, ReplicaLag.
- Lambda (6 metrics): Invocations, Errors, Duration, Throttles, ConcurrentExecutions, IteratorAge.
- S3 (5 metrics): BucketSizeBytes, NumberOfObjects, AllRequests, 4xx, 5xx.
- NAT (6 metrics): BytesOutToDestination, BytesInFromDestination, ActiveConnectionCount, ErrorPortAllocation, PacketsDropCount, IdleTimeoutCount.
- ELB / ALB (4 metrics): RequestCount, TargetResponseTime, HTTP 5xx, HTTP 4xx.
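The display-unit conversion mentioned before the list can be pictured as a lookup table of multipliers. This is a sketch under a stated assumption: it treats 1 MB as 1,000,000 bytes, while the product may use 1,048,576. The table and function names are hypothetical.

```python
# Display unit -> multiplier to the raw unit CloudWatch reports (hypothetical table).
# Assumes decimal megabytes (1 MB = 1,000,000 bytes); the product may use 1,048,576.
TO_CLOUDWATCH = {
    "%": 1,
    "count": 1,
    "MB": 1_000_000,
}


def to_cloudwatch_value(amount, display_unit):
    """Convert what the user typed into the raw number compared against CloudWatch data."""
    return amount * TO_CLOUDWATCH[display_unit]
```

Under that assumption, a NetworkOut threshold typed as 10 MB is compared against 10,000,000 bytes, and a CPU threshold of 70 % stays 70.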
Each resource also has its own dedicated detail page (open any row on /resources) that renders every catalog metric as its own 24h graph, with a one-click "Set alert rule for this metric" CTA next to each.
Resource scope — one rule, many instances
Every Usage rule asks which resources? Two modes:
- All current + future: the rule applies to every resource of the chosen service, including ones you create tomorrow. Best for "every Lambda must obey the runaway-invocation rule".
- Pick specific resources: a search-box + checkbox list lets you scope to a hand-picked subset. New resources are not auto-included.
CloudWatch Alarms force a 1:1 alarm-per-resource model, so 80 EC2 instances mean 80 nearly identical alarms. Here it's one rule.
Auto-suggest threshold
Click the Auto-suggest from last 7d button next to the Threshold input. Cloud Watchdog pulls every cached sample for the matched resources, computes P50 / P95 / P99, and fills in P95 × 1.2 rounded to a clean number. You can refine from there; it's a starting point, not a lock-in. It needs at least 10 samples to produce a suggestion.
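The arithmetic behind the suggestion can be sketched in a few lines. Two details here are assumptions rather than documented behaviour: the percentile uses the nearest-rank method, and "a clean number" is interpreted as rounding to two significant figures.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted list (p between 0 and 100)."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def round_clean(value, significant=2):
    """Round to a couple of significant figures, e.g. 118.8 -> 120.0."""
    if value == 0:
        return 0
    digits = significant - 1 - math.floor(math.log10(abs(value)))
    return round(value, digits)


def suggest_threshold(samples, min_samples=10, headroom=1.2):
    """Return P95 x 1.2 rounded to a clean number, or None when there are too few samples."""
    if len(samples) < min_samples:
        return None
    ordered = sorted(samples)
    return round_clean(percentile(ordered, 95) * headroom)
```

For samples of 1 through 100, P95 is 95, the headroom brings it to 114, and the rounding step yields 110. A different percentile or rounding convention would shift the result slightly, which is one more reason to treat the suggestion as a starting point.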
Suggested rules from your inventory
Once inventory sync is running, the /alert-rules page shows a Suggested for your inventory card with 2–4 high-value rules you probably want — based on what services you actually run. One click pre-fills the create form.
Action modes — three to pick from
Alert only
Notify Slack + email. No AWS-side change ever happens. Recommended starting point.
Auto-execute after 5-min cancel window
On match, plan an action and show a 5-min Cancel button. If nobody clicks Cancel, it auto-runs. For runaway scenarios.
Manual confirm only
Plan the action, queue it forever, no auto-run. A human must click Confirm in Slack / on /alerts. Safer choice for sensitive resources.
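The three action modes behave like a small state machine. The sketch below is one way to express that under assumptions: the mode names, the returned step labels, and the idea that a manually confirmed action can also be cancelled are all hypothetical.

```python
from enum import Enum


class Mode(Enum):
    ALERT_ONLY = "alert_only"
    AUTO_AFTER_CANCEL_WINDOW = "auto_execute"
    MANUAL_CONFIRM = "manual_confirm"


CANCEL_WINDOW_MINUTES = 5


def next_step(mode, minutes_since_match, cancelled=False, confirmed=False):
    """Decide what happens to a planned action: 'notify', 'wait', 'run', or 'dropped'."""
    if mode is Mode.ALERT_ONLY:
        return "notify"  # no AWS-side change, ever
    if cancelled:
        return "dropped"
    if mode is Mode.MANUAL_CONFIRM:
        # Queued indefinitely until a human confirms.
        return "run" if confirmed else "wait"
    # Auto-execute: run only once the cancel window has passed without a cancel.
    return "run" if minutes_since_match >= CANCEL_WINDOW_MINUTES else "wait"
```

An auto-execute rule three minutes after a match is still waiting, at five minutes it runs, and a manual-confirm rule keeps waiting no matter how much time passes.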
08
Circuit breakers — what they actually do
A circuit breaker is what we call a rule whose action mode is anything other than "Alert only". When the threshold trips, Cloud Watchdog can actively halt the resource that's bleeding money. It's gated by three safety checks on top of the IAM policy:
1. Not prod
If env=prod (or no env tag and we infer prod), the action is refused at planning time. No exceptions.
2. Managed tag
The resource must have cloudwatchdog:managed=true. Without it, no action is queued — alerts still fire.
3. Auto-stop tag
cloudwatchdog:auto-stop=true is the second explicit opt-in, distinct from managed. Lets you mark inventory broadly but only auto-stop a narrow subset.
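The three checks can be read as a short chain of guards evaluated in order. The sketch below assumes tags arrive as a plain dict. How Cloud Watchdog infers "prod" for untagged resources is not described here, so that inference is passed in as a hypothetical `inferred_prod` flag.

```python
def breaker_decision(tags, inferred_prod=False):
    """Return (allowed, reason) for a planned circuit-breaker action on one resource."""
    env = tags.get("env")
    # Check 1: never act on production, whether tagged or inferred.
    if env == "prod" or (env is None and inferred_prod):
        return False, "refused: production resource"
    # Check 2: the resource must be opted in as managed; alerts still fire without it.
    if tags.get("cloudwatchdog:managed") != "true":
        return False, "skipped: not managed"
    # Check 3: a second, explicit opt-in for automatic stopping.
    if tags.get("cloudwatchdog:auto-stop") != "true":
        return False, "skipped: auto-stop not enabled"
    return True, "allowed"
```

For example, a resource tagged `env=dev`, `cloudwatchdog:managed=true`, and `cloudwatchdog:auto-stop=true` passes all three checks, while the same tags with `env=prod` are refused at the first one.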
What each action actually does
Lambda throttle
Calls lambda:PutFunctionConcurrency with 0 — new invocations get throttled to a hard stop. Existing in-flight functions complete normally.
Reversible
EC2 stop
Calls ec2:StopInstances. The instance enters 'stopping' then 'stopped'. EBS volumes survive; you pay only for the storage.
Reversible
ECS scale-to-zero
Calls ecs:UpdateService with desiredCount=0. The service drains and stops its running tasks. Task definitions + service config are preserved.
Reversible
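For readers who want to see the underlying API calls, here is a minimal sketch of the three actions using boto3 client methods. The wrapper function names are hypothetical, and the clients are passed in as parameters (for example `boto3.client("lambda")`), so the functions carry no credentials of their own.

```python
def throttle_lambda(lambda_client, function_name):
    """Set reserved concurrency to 0 so new invocations are throttled."""
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=0,
    )


def stop_instance(ec2_client, instance_id):
    """Stop (not terminate) the instance; attached EBS volumes are kept."""
    ec2_client.stop_instances(InstanceIds=[instance_id])


def scale_service_to_zero(ecs_client, cluster, service):
    """Set the desired task count to 0; the service definition stays in place."""
    ecs_client.update_service(cluster=cluster, service=service, desiredCount=0)
```

Each action has a matching undo in the same APIs: `delete_function_concurrency` (or setting the previous reserved value again), `start_instances`, and `update_service` with the earlier `desiredCount`. Whether Cloud Watchdog uses exactly these calls to reverse an action is not stated in this guide.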
09
Frequently asked questions
Does Cloud Watchdog store my AWS access keys?
No. We use sts:AssumeRole with a per-customer ExternalId. Every API call we make to your account is signed with short-lived 15-minute credentials. There's no long-lived secret in our database, and we couldn't keep one if we tried — the trust policy in your IAM role wouldn't allow it.
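For the curious, a trust policy of this kind generally has the shape sketched below. The principal ARN and the ExternalId are placeholder values for illustration only; the real ones come from the template and the onboarding page, and the published template may add further conditions.

```python
import json


def trust_policy(control_plane_principal_arn, external_id):
    """Trust policy letting one principal assume the role only with the right ExternalId."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": control_plane_principal_arn},
                "Action": "sts:AssumeRole",
                # Without the matching ExternalId the AssumeRole call is denied.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


# Placeholder values for illustration only.
example = trust_policy("arn:aws:iam::111122223333:root", "example-external-id")
print(json.dumps(example, indent=2))
```

The ExternalId condition is what addresses the confused-deputy problem: another customer who somehow learned your role ARN still could not get the service to assume it, because their requests would carry a different ExternalId.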
What happens if I downgrade from Starter to Free?
Your circuit-breaker rules are auto-converted to alert-only (detection keeps firing, no auto-stop). Alert rules above the Free cap of 5 get disabled, newest first — the oldest 5 stay enabled. Connected cloud accounts above the Free cap of 1 stay connected but you can't add new ones until you upgrade again. Everything is reversible.
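The "newest first" rule for disabling alert rules can be illustrated with a short sort. This is a sketch of the stated policy only; the tuple layout and the function name are hypothetical.

```python
from datetime import date


def apply_free_cap(rules, cap=5):
    """Split (name, created_on) pairs into names that stay enabled and names that get disabled."""
    oldest_first = sorted(rules, key=lambda rule: rule[1])
    enabled = [name for name, _ in oldest_first[:cap]]
    disabled = [name for name, _ in oldest_first[cap:]]
    return enabled, disabled


# Seven rules created on consecutive days: the two newest end up disabled.
rules = [(f"rule-{day}", date(2024, 1, day)) for day in range(1, 8)]
enabled, disabled = apply_free_cap(rules)
```

With the seven example rules, `enabled` holds rule-1 through rule-5 and `disabled` holds rule-6 and rule-7.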
Can the system stop its own host EC2?
Yes, if you've tagged the host with cloudwatchdog:auto-stop=true. Don't do that. The audit log will show what happened, and the EC2 will come back up automatically once you start it from the AWS console (Docker restart-policy handles the rest). Test rules on a separate disposable EC2.
How fast can detection actually be?
Roughly 5 min worst-case for usage-metric rules (CloudWatch publish cadence + our 5-min poll). Lambda metrics are near-real-time. Cost rules use Cost Explorer which is ~24 hours stale — they're for slow-burn detection, not emergencies.
Will Cloud Watchdog make my AWS bill go UP?
Tiny amounts. The CloudWatch GetMetricStatistics calls stay inside the 1M/month free tier for normal use. Cost Explorer API is $0.01/call and is opt-in (toggle on the spend card). The AssumeRole calls themselves are free.
What regions are supported?
All AWS commercial regions. The inventory sync auto-discovers which regions your account has resources in, so you only pay metric-poll cost where you have things running.
Is there an open-source version?
Not the full product, but the CloudFormation templates are public — you can audit them before deploying. We're a single founder (Bibek Jha) building this in the open.