Calendo Computer Use Tests

Browser-agent test suites that run against production (https://calendo.dev).

📖 Hosted (rendered) version: https://calendo-cu-tests.pages.dev/ (the dashboard and every runbook rendered as browsable web pages, served from Cloudflare Pages on a separate domain from the product). It is the easiest way to review the suite or hand a runbook to a computer-use agent from any machine. Regenerate and redeploy with:

node scripts/build-cu-tests-site.mjs /tmp/cu-site && wrangler pages deploy /tmp/cu-site --project-name calendo-cu-tests --branch main

This directory is the master index and operating manual for Calendo's "Computer Use Test" suite — a battery of end-to-end checks that a computer-use agent (a browser-driving AI) executes against the live product. Each suite is a Markdown runbook in ./suites/. Each run is recorded in a copy of ./results/RESULTS-TEMPLATE.md.

1. What this is

Calendo already has three layers of automated tests, all of which run in CI before any deploy:

Layer	What it proves	Where it runs
Vitest unit tests (`tests/unit/**`)	Pure functions, route handlers, validation, AI tool wiring	In-process, mocked D1 (better-sqlite3)
Integration / smoke (`scripts/smoke-local.mjs`, `tests/schema`)	Real worker + real local D1, HTTP-level behavior	`wrangler dev` against local D1
Playwright E2E (`tests/e2e/**`)	Critical DOM journeys (sign-up, booking flow)	Headless browser against a local worker

Computer Use Tests are a different, higher tier. They do not run against a local worker or a mocked database. They drive a real browser against production with real logged-in accounts, and — most importantly — they verify external reality that no in-process test can touch:

A booking does not merely insert a Calendo D1 row; it must produce a real event on the host's real Google or Outlook calendar, and a real email in a real Gmail inbox that the agent opens and reads in the browser.
A reschedule must move that real calendar event (old slot empty, new slot populated, no duplicate).
A cancellation must delete the real calendar event and send a real cancellation email.

Why L3 (external) reality matters

Unit and integration tests can pass while the product is silently broken for users, because they mock or stub the very systems that fail in the real world: OAuth token refresh, Google freeBusy, Resend deliverability, Microsoft Graph calendarView, cron-driven sequence sends. A green CI run tells you the code's logic is right. It does not tell you that an invitee actually received a calendar invite this morning.

The Computer Use Tests close that gap. They are the only layer that answers: "If a stranger booked a meeting on calendo.dev right now, would a real event show up on the real calendar and a real email land in the real inbox?" That question is the product's entire value proposition, so it gets its own test tier.

These suites are expensive and stateful (they touch production data, real calendars, real inboxes), so they are run deliberately — before a launch, after a risky deploy, on a release cadence — not on every commit. CI stays the fast gate; Computer Use Tests are the reality gate.

2. How to run

The entry point is this README. Point a computer-use agent at it, then:

Read the global facts below (accounts, RUNID, verification tiers, parallelization, cleanup).
Read ./00-setup-preconditions.md and satisfy every precondition (sessions logged in, fixtures present, baseline captured) before touching a suite.
Pick a target — either a single suite or a wave (see §6):
- One suite: open its runbook in ./suites/, e.g. ./suites/CU-01-core-booking-lifecycle.md, and execute it top to bottom.
- A wave / the full graph: run all Wave 1 suites concurrently (Lane A/B/D), then run Wave 2 (CU-06) alone. See §6 for why the order is mandatory.
Mint a fresh RUNID (see §4) and embed it in every name/email you create during the run.
Execute the runbook, capturing evidence at each step (see §7).
Record results: copy ./results/RESULTS-TEMPLATE.md to ./results/RESULTS-<SUITE>-<RUNID>.md (e.g. RESULTS-CU-01-20260601a.md) and fill in PASS / FAIL / BLOCKED per coverage item, attaching screenshots and observed reality.
Clean up the data your run created (see §7).
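The results-recording step above can be sketched as a small helper. This is a minimal illustration, not part of the suite: the function names `results_path` and `start_results_file` are hypothetical, and only the `RESULTS-TEMPLATE.md` and `RESULTS-<SUITE>-<RUNID>.md` file names come from this README.

```python
import shutil
from pathlib import Path


def results_path(results_dir: Path, suite: str, runid: str) -> Path:
    """Build <results_dir>/RESULTS-<SUITE>-<RUNID>.md for one suite run."""
    return results_dir / f"RESULTS-{suite}-{runid}.md"


def start_results_file(results_dir: Path, suite: str, runid: str) -> Path:
    """Copy RESULTS-TEMPLATE.md to the per-run file, refusing to overwrite."""
    template = results_dir / "RESULTS-TEMPLATE.md"
    target = results_path(results_dir, suite, runid)
    if target.exists():
        # An existing file suggests the RUNID was reused, which §4 forbids.
        raise FileExistsError(f"{target} already exists; mint a fresh RUNID")
    shutil.copyfile(template, target)
    return target
```

Refusing to overwrite is a design choice made here, on the assumption that a name collision means a reused RUNID rather than an intentional re-record.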

Running one suite vs. the full graph

One suite is self-contained: it has its own preconditions, its own uniquely-named fixtures, and its own results file. This is the right mode for re-verifying a single area after a targeted fix.
The full graph is the launch/regression mode: every suite runs, parallelized by lane, with CU-06 isolated in a second wave because it rewrites global host availability and would corrupt every other suite's slot math if run concurrently. Always run the graph as Wave 1 (all A/B/D concurrently) → Wave 2 (CU-06 alone), never flat.

3. Accounts & sessions

Sessions must already be logged in. This environment stores no passwords. Every account below must already have a live, authenticated browser session before a run starts. A mid-run cold login (email/password or OAuth consent) is forbidden — if a suite hits a login wall, that portion becomes manual residue, not a FAIL. Verify all sessions in ./00-setup-preconditions.md first.

Account	Email	Role in tests	Notes
P1 — Host	`ravikantguptaofficial@gmail.com`	Primary host. Owns event types, availability, dashboard. The Google Calendar + Gmail inbox used for L3 verification.	Google sign-in + connected Google Calendar + the notification inbox the agent opens to read confirmation/reschedule/cancel emails.
P3 — Teammate	`everythingaichannelemail@gmail.com`	Second org member (team/round-robin suites); guest recipient for guest-notification email checks.	Distinct mailbox from P1 — used where a separate invitee/guest inbox is required.
Outlook	`ravikant0909@outlook.com`	Microsoft/Outlook calendar integration host.	Used only by the Outlook suite (CU-04). Real Outlook event create/delete + conflict.

Plus-alias (invitee) convention

Invitees and throwaway registrants are plus-aliases of P1's Gmail, so their mail lands in P1's inbox where the agent can read it, while remaining unique per run:

Invitee bookings: ravikantguptaofficial+inv-<RUNID>@gmail.com
Fresh-account registration (auth suite): ravikantguptaofficial+auth-<RUNID>@gmail.com
Onboarding throwaway: ravikantguptaofficial+onb-<RUNID>@gmail.com

Gmail delivers all +tag aliases to the base inbox, so the agent searches inv-<RUNID> (or the event name) in P1's Gmail to confirm L3 email delivery.
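The alias convention above can be sketched as follows. The helper names `plus_alias` and `gmail_search_term` are hypothetical; the three tags (`inv`, `auth`, `onb`) and the base address are the ones listed in this section.

```python
BASE_INBOX = "ravikantguptaofficial@gmail.com"  # P1 host, from the accounts table
KNOWN_KINDS = {"inv", "auth", "onb"}  # invitee, auth suite, onboarding throwaway


def plus_alias(kind: str, runid: str, base: str = BASE_INBOX) -> str:
    """Return <local>+<kind>-<runid>@<domain> for one of the known alias kinds."""
    if kind not in KNOWN_KINDS:
        raise ValueError(f"unknown alias kind: {kind!r}")
    local, domain = base.split("@", 1)
    return f"{local}+{kind}-{runid}@{domain}"


def gmail_search_term(kind: str, runid: str) -> str:
    """The token the agent types into Gmail search to find this run's mail."""
    return f"{kind}-{runid}"
```

For example, `plus_alias("inv", "20260601a")` yields `ravikantguptaofficial+inv-20260601a@gmail.com`, and the matching search term is `inv-20260601a`.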

Known limitation: because invitees are plus-aliases of the host's own Gmail, host and invitee mail share one inbox, so suites that use them cannot unambiguously prove that a calendar invite or RSVP reached a separate invitee. Where a genuinely distinct invitee mailbox is required, use P3 (e.g. guest-notification checks in CU-05/CU-07).

4. The RUNID convention

Every run mints one fresh, unique RUNID token (e.g. a date + short suffix like 20260601a, or a short random string). That token is embedded in everything the run creates — event-type names, schedule names, invitee names, plus-alias emails, poll/form titles, webhook labels — so that:

Created data is traceable to the run that made it.
Concurrent suites in the same wave never collide (each writer uses its own uniquely-named fixtures).
Search-based L3 verification works — the agent finds "its own" email/event by searching the RUNID.
Cleanup is unambiguous — anything carrying the RUNID is this run's to delete; leftovers carrying an old RUNID are harmless.

Never reuse a RUNID across runs. A fresh token per run is what makes parallel execution and post-run auditing safe.

5. Suite catalog (ranked)

Sorted by priority (P0 → P3), then by ID. Each ID links to its runbook in ./suites/.

ID	Title	Priority	Accounts	Lane	Exclusive	Est. time	L3 reality
CU-01	Core booking lifecycle (book → reschedule → cancel) with calendar & email reality	P0	P1 host + anon invitee	B	—	30m	GCal create/move/delete + confirm/reschedule/cancel emails
CU-02	Auth lifecycle: register, verify, login, forgot/reset, delete	P0	fresh throwaway (`+auth-<RUNID>`)	D	—	22m	verification + password-reset emails (clicked from Gmail)
CU-03	Google Calendar: conflict blocking, buffers, two-way sync	P0	P1 host + connected GCal	D	—	40m	heavy — busy conflict, event create/move/delete in GCal
CU-05	Event-type configuration → public booking-page enforcement	P0	P1 host (own event types)	B	—	85m	guest-notification email
CU-06	Availability engine: weekly rules, overrides, holidays, slot-debug	P0	P1 host	X	YES	35m	none
CU-07	Host-side booking management: book-on-behalf, no-show, notes, guests, cancel/reschedule	P1	P1 host (own event type)	B	—	40m	invitee/guest email + GCal
CU-08	AI booking chatbot (the differentiator) on the public booking page	P1	anon invitee	B	—	30m	confirmation email + GCal
CU-09	AI dashboard assistant — feature-parity actions via chat	P1	P1 host	B	—	45m	partial (verify side effects in UI)
CU-10	Landing page + marketing + static pages + mobile responsiveness	P1	none (anon, read-only)	A	—	15m	none
CU-11	Public booking page UX: timezones, calendar nav, empty-state, QR, mobile	P1	anon (host has availability)	A/B	—	22m	none (one optional booking)
CU-24	Signup calendar consent (Google + Microsoft combined consent) & onboarding entry; connect-resilience regression	P1	fresh Google/Microsoft (no Calendo acct) / P1	D	—	20m	calendar access granted at signup
CU-04	Microsoft / Outlook calendar integration	P2	Outlook `ravikant0909@outlook.com`	D	—	35m	Outlook event create/delete + conflict
CU-12	Routing forms: build → public submit → route to event type → analytics	P2	P1 host + anon	B	—	18m	none
CU-13	Meeting polls: create → share → vote → tally → pick winner	P2	P1 host + anon voters	B	—	18m	none
CU-14	Team / org scheduling: invite, roles, round-robin, collective, team page, audit log	P2	P1 owner + P3 teammate	D	—	40m	email/GCal to the ASSIGNED member
CU-15	Contacts, analytics dashboard, and CSV export	P2	P1 host	A	—	25m	none
CU-16	Settings & customization: profile, cancellation policy, blocklist, branding, BYOK, pixels	P2	P1 host	B	—	30m	none
CU-17	Slack notifications & outbound webhooks (verified via webhook.site)	P2	P1 host	B	—	30m	webhook.site receipt + Slack message
CU-18	New-user onboarding wizard (4-step) and schedule-setup prompt	P2	fresh throwaway (`+onb-<RUNID>`)	D	—	22m	none
CU-19	Embeddable booking widget (inline / popup / badge)	P3	P1 host + external test page	B	—	25m	none (optional booking email)
CU-20	Email sequences, reminders, and reconfirmation (time-gated, partial)	P3	P1 host	B	—	40m	sequence/immediate emails (reminders cron-gated)
CU-22	Chrome extension for Gmail (manual-led)	P3	P1 (manual install)	manual	—	20m	none

Totals: 22 suites · ~11.5 hours of single-threaded agent time (heavily compressed by parallelization; see §6).

6. Parallelization model

Suites are tagged by lane so they can run concurrently without corrupting each other's data or slot math.

Lane legend

Lane	Meaning	Parallel-safe?
A	Read-only. Anonymous, no writes (landing/marketing, analytics views). Run anytime, against anything.	Yes — fully
B	Host-writers using uniquely-named event types. Each suite writes only its own `<name>-<RUNID>` event types/fixtures, so concurrent B suites never touch the same rows.	Yes — via RUNID isolation
D	Account-isolated. Runs against a fresh throwaway, the P3 teammate, or the Outlook account — separate state from P1's main config.	Yes — different accounts
A/B	Mostly read-only with one optional, RUNID-scoped booking (CU-11).	Yes
X	Exclusive. Rewrites global host state. Must run alone.	No
manual	Human-led (CU-22 Chrome extension); the agent can only verify the resolved booking link.	n/a

The two waves

WAVE 1  (run concurrently)        WAVE 2  (run alone, after Wave 1 fully completes)
──────────────────────────        ──────────────────────────────────────────────
All Lane A / B / D suites:        CU-06  (Lane X — EXCLUSIVE)
  CU-01  CU-02  CU-03  CU-04
  CU-05  CU-07  CU-08  CU-09
  CU-10  CU-11  CU-12  CU-13
  CU-14  CU-15  CU-16  CU-17
  CU-18  CU-19  CU-20  CU-24
  (CU-22 runs whenever a human is available)
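The wave split follows mechanically from the lane tags, which the sketch below illustrates. Lanes and estimates are copied from the catalog in §5; the names `CATALOG`, `plan_waves`, and `wall_clock_floor` are hypothetical, and the wall-clock figure assumes every Wave 1 suite gets its own agent.

```python
from typing import Dict, List, Tuple

# suite -> (lane, estimated minutes), copied from the catalog in §5.
CATALOG: Dict[str, Tuple[str, int]] = {
    "CU-01": ("B", 30), "CU-02": ("D", 22), "CU-03": ("D", 40),
    "CU-04": ("D", 35), "CU-05": ("B", 85), "CU-06": ("X", 35),
    "CU-07": ("B", 40), "CU-08": ("B", 30), "CU-09": ("B", 45),
    "CU-10": ("A", 15), "CU-11": ("A/B", 22), "CU-12": ("B", 18),
    "CU-13": ("B", 18), "CU-14": ("D", 40), "CU-15": ("A", 25),
    "CU-16": ("B", 30), "CU-17": ("B", 30), "CU-18": ("D", 22),
    "CU-19": ("B", 25), "CU-20": ("B", 40), "CU-22": ("manual", 20),
    "CU-24": ("D", 20),
}


def plan_waves(catalog: Dict[str, Tuple[str, int]]) -> Tuple[List[str], List[str], List[str]]:
    """Split suites into (wave 1, wave 2, manual) purely by lane tag."""
    wave1: List[str] = []
    wave2: List[str] = []
    manual: List[str] = []
    for suite in sorted(catalog):
        lane, _minutes = catalog[suite]
        if lane == "X":
            wave2.append(suite)   # exclusive: runs alone, after wave 1
        elif lane == "manual":
            manual.append(suite)  # human-led, scheduled separately
        else:
            wave1.append(suite)   # A, B, D and A/B are parallel-safe
    return wave1, wave2, manual


def wall_clock_floor(catalog: Dict[str, Tuple[str, int]]) -> int:
    """Best-case minutes: the longest wave 1 suite, then wave 2 run serially."""
    wave1, wave2, _manual = plan_waves(catalog)
    longest_wave1 = max(catalog[s][1] for s in wave1)
    return longest_wave1 + sum(catalog[s][1] for s in wave2)
```

With the catalog estimates, the floor is 85 minutes (CU-05) plus 35 minutes (CU-06), i.e. 120 minutes. Real runs will take longer whenever fewer agents are available than there are Wave 1 suites.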

Why CU-06 is exclusive

CU-06 (the availability engine) rewrites the global host availability for P1 — it sets a known Mon–Fri 09:00–17:00 schedule, applies date overrides, loads a US-2026 holiday preset, and creates/deletes named schedules, then restores a baseline at the end. Every other suite that books against P1 (CU-01, CU-05, CU-07, CU-08, CU-11, and more) computes available slots from exactly that global availability. If CU-06 ran concurrently, it would silently change the slots those suites depend on mid-flight — making their bookings non-deterministic and their slot-math assertions flaky or wrong.

So CU-06 runs alone, in a second wave, after Wave 1 has fully drained, and is responsible for capturing a baseline at the start and fully restoring the global schedule at the end so the account is left clean for the next run.
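The capture-then-restore obligation can be expressed as a context manager, so the restore runs even when the suite fails partway. This is a sketch under stated assumptions: `capture` and `restore` are hypothetical callables that the runner would implement through the browser, and the baseline is assumed to be comparable with `!=`.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


@contextmanager
def restored_baseline(capture: Callable[[], T], restore: Callable[[T], None]) -> Iterator[T]:
    """Capture global availability, yield it, and always restore it afterwards."""
    baseline = capture()
    try:
        yield baseline
    finally:
        restore(baseline)
        # Re-read the state: a half-restored schedule corrupts the next run.
        if capture() != baseline:
            raise RuntimeError("availability was not restored to the baseline")
```

Usage would look like `with restored_baseline(read_schedule, write_schedule): run_cu06()`, where all three names are placeholders for whatever the runner actually provides.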

7. Conventions

Verification tiers

Every coverage item is verified to one of three depths. A suite states the tier it reaches per item.

Tier	Name	What it proves	How
L1	UI	The rendered page shows the right thing	Read the DOM / screenshot the visible state
L2	Calendo persistence	The change is stored server-side, not just in-memory	Reload the page (or re-fetch the API) and confirm it survives
L3	External reality	A real artifact exists outside Calendo	Open Gmail / Google Calendar / Outlook / webhook.site in the browser and confirm the real event/email/receipt

L3 is the whole point (see §1). An item that only reaches L1/L2 is not wrong, but it has not proven the product works for a real user — note the gap explicitly.
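A results record that tracks the tier reached makes those gaps easy to list. The sketch below is illustrative only: `Tier`, `ItemResult`, and `l3_gaps` are hypothetical names, not the schema of RESULTS-TEMPLATE.md.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Tier(IntEnum):
    L1 = 1  # UI: the rendered page shows the right thing
    L2 = 2  # Calendo persistence: the change survives a reload
    L3 = 3  # External reality: a real artifact exists outside Calendo


@dataclass
class ItemResult:
    item: str
    status: str   # "PASS", "FAIL", or "BLOCKED"
    reached: Tier


def l3_gaps(results: List[ItemResult]) -> List[str]:
    """Items that passed without reaching L3, to be noted as explicit gaps."""
    return [r.item for r in results if r.status == "PASS" and r.reached < Tier.L3]
```

Using an ordered enum keeps the comparison `reached < Tier.L3` readable and avoids string comparisons on tier labels.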

Screenshots & evidence

Capture a screenshot at every PASS/FAIL decision point, and especially for L3 (the Gmail search result, the calendar event at the booked slot, the webhook.site request body).
For L3 email, record the search query used (e.g. inv-<RUNID>) and the observed subject line.
For L3 calendar, record the date/time of the event and that the invitee appears as an attendee.
Retry transient failures before scoring FAIL. Google freeBusy/push propagation, Resend delivery, and Microsoft Graph can lag by seconds-to-minutes. A short re-check (reload, wait, re-search) is required before declaring a real failure. Anthropic 429/529 overload on AI suites is an external condition, recorded as such, not a Calendo bug.
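The re-check rule above amounts to a bounded retry loop. The sketch below assumes a hypothetical `check` callable (for example, "search Gmail for the RUNID and report whether a message was found"); the attempt count and wait time are placeholder defaults, since this README only says the lag is seconds to minutes.

```python
import time
from typing import Callable


def recheck(
    check: Callable[[], bool],
    attempts: int = 5,
    wait_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run check up to `attempts` times, waiting between tries; True on first success."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(wait_s)  # give external systems time to propagate
    return False
```

Only a `False` result after all attempts should be scored as FAIL. The `sleep` parameter exists so the loop can be exercised without real waiting.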

Cleanup discipline

Delete what you create. Each suite removes its RUNID-scoped event types, bookings, schedules, forms, polls, and webhooks at the end, wherever the UI offers a delete (exceptions are listed below). Cancelling a booking should leave no stray calendar event behind; verify it.
CU-06 must restore the global host schedule to its captured baseline. This is non-negotiable; a half-restored availability corrupts the next run.
Some artifacts have no in-UI delete (contacts derived from history, meeting polls, orgs/team event types). These are documented per suite as acceptable RUNID-scoped residue or as a human/D1 cleanup task — they are harmless to reruns precisely because of the RUNID convention.
Never leave a session in a destructive state. Throwaway accounts (CU-02, CU-18) self-delete via the Danger Zone at the end; the no-show mark in CU-07/CU-09 has no un-mark, so it must only be applied to a RUNID-scoped throwaway booking.
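Deciding what belongs to a run can be sketched as a token match on the RUNID. The helper names `carries_runid` and `split_for_cleanup` are hypothetical, and the boundary check is an added precaution (not something this README specifies) so that a token such as `20260601a` does not also match `20260601ab`.

```python
import re
from typing import Iterable, List, Tuple


def carries_runid(name: str, runid: str) -> bool:
    """True if runid appears in name as a whole token, not inside a longer one."""
    pattern = rf"(?<![0-9A-Za-z]){re.escape(runid)}(?![0-9A-Za-z])"
    return re.search(pattern, name) is not None


def split_for_cleanup(names: Iterable[str], runid: str) -> Tuple[List[str], List[str]]:
    """Partition artifact names into (this run's to delete, everything else)."""
    mine: List[str] = []
    others: List[str] = []
    for name in names:
        (mine if carries_runid(name, runid) else others).append(name)
    return mine, others
```

Anything in the second list is left alone: it either predates the run or belongs to a concurrent suite with its own RUNID.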

8. Related files

File	Purpose
`./00-setup-preconditions.md`	Read first. Sessions to verify, fixtures to seed, baseline to capture before any suite runs.
`./COVERAGE.md`	The full coverage map: what every suite verifies, plus the consolidated manual-residue and out-of-scope / TBD ledger.
`./index.html`	Browsable HTML dashboard of the suite catalog and coverage (light theme, mobile-friendly).
`./suites/`	The runbooks themselves — one Markdown file per suite (`CU-NN-*.md`).
`./results/`	Run outputs. Copy `RESULTS-TEMPLATE.md` → `RESULTS-<SUITE>-<RUNID>.md` per run.

9. Out of scope / TBD

Some product surfaces are deliberately not covered by these browser-agent suites because there is no test mode, no reachable inbox/phone, or the surface is gated to a different account. These are not silently dropped — they are catalogued in ./COVERAGE.md as explicit TBD / manual items. The headline exclusions:

Stripe payments (live mode only — no test mode), PayPal, checkout coupons, and the Pro-upgrade flow.
All Pro-gated features: SMS/Twilio reminders (also need a real phone), custom domains, remove-branding.
The admin panel (gated to a different hardcoded email).
Per-suite residue: reminder/reconfirmation time-gated cron sends, deep cross-timezone correctness, ICS file parsing, HMAC signature recomputation, and server-side audits (data deletion, encryption-at-rest, role 403 enforcement) — all itemized per suite in COVERAGE.md.

See ./COVERAGE.md for the complete, per-suite breakdown of manual residue and TBD items.