Calendo Computer Use Tests
Browser-agent test suites run against production https://calendo.dev.
Hosted (rendered) version: https://calendo-cu-tests.pages.dev/ (the dashboard and every runbook rendered as browsable web pages on Cloudflare Pages, on a separate domain from the product). This is the easiest way to review the suite or hand a runbook to a computer-use agent from any machine. Regenerate and redeploy with:
node scripts/build-cu-tests-site.mjs /tmp/cu-site && wrangler pages deploy /tmp/cu-site --project-name calendo-cu-tests --branch main
This directory is the master index and operating manual for Calendo's "Computer Use Test" suite: a battery of end-to-end checks that a computer-use agent (a browser-driving AI) executes against the live product. Each suite is a Markdown runbook in ./suites/. Each run is recorded in a copy of ./results/RESULTS-TEMPLATE.md.
1. What this is
Calendo already has three layers of automated tests, all of which run in CI before any deploy:
| Layer | What it proves | Where it runs |
|---|---|---|
| Vitest unit tests (tests/unit/**) | Pure functions, route handlers, validation, AI tool wiring | In-process, mocked D1 (better-sqlite3) |
| Integration / smoke (scripts/smoke-local.mjs, tests/schema) | Real worker + real local D1, HTTP-level behavior | wrangler dev against local D1 |
| Playwright E2E (tests/e2e/**) | Critical DOM journeys (sign-up, booking flow) | Headless browser against a local worker |
Computer Use Tests are a different, higher tier. They do not run against a local worker or a mocked database. They drive a real browser against production with real logged-in accounts and, most importantly, they verify external reality that no in-process test can touch:
- A booking does not merely insert a Calendo D1 row; it must produce a real event on the host's real Google or Outlook calendar, and a real email in a real Gmail inbox that the agent opens and reads in the browser.
- A reschedule must move that real calendar event (old slot empty, new slot populated, no duplicate).
- A cancellation must delete the real calendar event and send a real cancellation email.
Why L3 reality matters
Unit and integration tests can pass while the product is silently broken for users, because they mock or stub the very systems that fail in the real world: OAuth token refresh, Google freeBusy, Resend deliverability, Microsoft Graph calendarView, cron-driven sequence sends. A green CI run tells you the code's logic is right. It does not tell you that an invitee actually received a calendar invite this morning.
The Computer Use Tests close that gap. They are the only layer that answers: "If a stranger booked a meeting on calendo.dev right now, would a real event show up on the real calendar and a real email land in the real inbox?" That question is the product's entire value proposition, so it gets its own test tier.
These suites are expensive and stateful (they touch production data, real calendars, real inboxes), so they are run deliberately (before a launch, after a risky deploy, on a release cadence), not on every commit. CI stays the fast gate; Computer Use Tests are the reality gate.
2. How to run
The entry point is this README. Point a computer-use agent at it, then:
- Read the global facts below (accounts, RUNID, verification tiers, parallelization, cleanup).
- Read ./00-setup-preconditions.md and satisfy every precondition (sessions logged in, fixtures present, baseline captured) before touching a suite.
- Pick a target: either a single suite or a wave (see §6):
  - One suite: open its runbook in ./suites/, e.g. ./suites/CU-01-core-booking-lifecycle.md, and execute it top to bottom.
  - A wave / the full graph: run all Wave 1 suites concurrently (Lane A/B/D), then run Wave 2 (CU-06) alone. See §6 for why the order is mandatory.
- Mint a fresh RUNID (see §4) and embed it in every name/email you create during the run.
- Execute the runbook, capturing evidence at each step (see §7).
- Record results: copy ./results/RESULTS-TEMPLATE.md to ./results/RESULTS-<SUITE>-<RUNID>.md (e.g. RESULTS-CU-01-20260601a.md) and fill in PASS / FAIL / BLOCKED per coverage item, attaching screenshots and observed reality.
- Clean up the data your run created (see §7).
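The results-file naming in the "Record results" step can be sketched as a small helper. This is an illustrative sketch only: `results_path` and `start_results_file` are hypothetical names, not scripts that ship with this directory, and the default `./results` location is taken from the paths above.

```python
import shutil
from pathlib import Path


def results_path(suite: str, runid: str, results_dir: str = "./results") -> Path:
    """Build the per-run results path: RESULTS-<SUITE>-<RUNID>.md."""
    return Path(results_dir) / f"RESULTS-{suite}-{runid}.md"


def start_results_file(suite: str, runid: str, results_dir: str = "./results") -> Path:
    """Copy RESULTS-TEMPLATE.md to a fresh per-run file and return its path.

    Refuses to overwrite an existing file, since a RUNID must never be reused.
    """
    template = Path(results_dir) / "RESULTS-TEMPLATE.md"
    target = results_path(suite, runid, results_dir)
    if target.exists():
        raise FileExistsError(f"{target} already exists; mint a new RUNID")
    shutil.copyfile(template, target)
    return target


if __name__ == "__main__":
    print(results_path("CU-01", "20260601a").name)  # RESULTS-CU-01-20260601a.md
```

Refusing to overwrite is a design choice of this sketch; it turns an accidentally reused RUNID into an immediate error instead of a silently clobbered report.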
Running one suite vs. the full graph
- One suite is self-contained: it has its own preconditions, its own uniquely-named fixtures, and its own results file. This is the right mode for re-verifying a single area after a targeted fix.
- The full graph is the launch/regression mode: every suite runs, parallelized by lane, with CU-06 isolated in a second wave because it rewrites global host availability and would corrupt every other suite's slot math if run concurrently. Always run the graph as Wave 1 (all A/B/D concurrently) → Wave 2 (CU-06 alone), never flat.
3. Accounts & sessions
Sessions must already be logged in. This environment stores no passwords. Every account below must already have a live, authenticated browser session before a run starts. A mid-run cold login (email/password or OAuth consent) is forbidden: if a suite hits a login wall, that portion becomes manual residue, not a FAIL. Verify all sessions in ./00-setup-preconditions.md first.
| Label | Account | Role in tests | Notes |
|---|---|---|---|
| P1 (Host) | ravikantguptaofficial@gmail.com | Primary host. Owns event types, availability, dashboard. The Google Calendar + Gmail inbox used for L3 verification. | Google sign-in + connected Google Calendar + the notification inbox the agent opens to read confirmation/reschedule/cancel emails. |
| P3 (Teammate) | everythingaichannelemail@gmail.com | Second org member (team/round-robin suites); guest recipient for guest-notification email checks. | Distinct mailbox from P1; used where a separate invitee/guest inbox is required. |
| Outlook | ravikant0909@outlook.com | Microsoft/Outlook calendar integration host. | Used only by the Outlook suite (CU-04). Real Outlook event create/delete + conflict. |
Plus-alias (invitee) convention
Invitees and throwaway registrants are plus-aliases of P1's Gmail, so their mail lands in P1's inbox where the agent can read it, while remaining unique per run:
- Invitee bookings: ravikantguptaofficial+inv-<RUNID>@gmail.com
- Fresh-account registration (auth suite): ravikantguptaofficial+auth-<RUNID>@gmail.com
- Onboarding throwaway: ravikantguptaofficial+onb-<RUNID>@gmail.com
Gmail delivers all +tag aliases to the base inbox, so the agent searches inv-<RUNID> (or the event name) in P1's Gmail to confirm L3 email delivery.
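As a rough illustration of the convention above, the aliases and the matching Gmail search term can be derived from the base address and the RUNID. The helper names `plus_alias` and `gmail_search_term` are hypothetical; only the address patterns come from this README.

```python
P1_GMAIL = "ravikantguptaofficial@gmail.com"


def plus_alias(base_email: str, tag: str, runid: str) -> str:
    """Insert '+<tag>-<runid>' before the '@' of the base address."""
    local, domain = base_email.split("@", 1)
    return f"{local}+{tag}-{runid}@{domain}"


def gmail_search_term(tag: str, runid: str) -> str:
    """The token the agent types into P1's Gmail search box."""
    return f"{tag}-{runid}"


if __name__ == "__main__":
    runid = "20260601a"
    for tag in ("inv", "auth", "onb"):
        print(plus_alias(P1_GMAIL, tag, runid), "->", gmail_search_term(tag, runid))
```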
Known limitation: because invitees are plus-aliases of the host's own Gmail, true separate-invitee calendar-invite/RSVP delivery is ambiguous in suites that use them. Where a genuinely distinct invitee mailbox is required, use P3 (e.g. guest-notification checks in CU-05/CU-07).
4. The RUNID convention
Every run mints one fresh, unique RUNID token (e.g. a date + short suffix like 20260601a, or a short random string). That token is embedded in everything the run creates (event-type names, schedule names, invitee names, plus-alias emails, poll/form titles, webhook labels) so that:
- Created data is traceable to the run that made it.
- Concurrent suites in the same wave never collide (each writer uses its own uniquely-named fixtures).
- Search-based L3 verification works: the agent finds "its own" email/event by searching the RUNID.
- Cleanup is unambiguous: anything carrying the RUNID is this run's to delete; leftovers carrying an old RUNID are harmless.
Never reuse a RUNID across runs. A fresh token per run is what makes parallel execution and post-run auditing safe.
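One possible way to mint such a token is a date prefix plus a short random suffix. This is a sketch under assumptions: the README only requires uniqueness, so the 4-character hex suffix and the helper names `mint_runid` and `tagged` are illustrative choices, not the project's tooling.

```python
import secrets
from datetime import date
from typing import Optional


def mint_runid(today: Optional[date] = None) -> str:
    """Return YYYYMMDD plus 4 random hex characters (suffix length is an assumption)."""
    day = (today or date.today()).strftime("%Y%m%d")
    return f"{day}{secrets.token_hex(2)}"


def tagged(name: str, runid: str) -> str:
    """Embed the RUNID in a fixture name, following the <name>-<RUNID> pattern."""
    return f"{name}-{runid}"


if __name__ == "__main__":
    runid = mint_runid()
    print(runid, tagged("intro-call", runid))
```

A random suffix avoids coordinating letter suffixes (a, b, c) between agents that start runs on the same day.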
5. Suite catalog (ranked)
Sorted by priority (P0 → P3), then by ID. Each ID links to its runbook in ./suites/.
| ID | Title | Pri | Accounts | Lane | Excl. | Est. | L3 reality |
|---|---|---|---|---|---|---|---|
| CU-01 | Core booking lifecycle (book → reschedule → cancel) with calendar & email reality | P0 | P1 host + anon invitee | B | - | 30m | GCal create/move/delete + confirm/reschedule/cancel emails |
| CU-02 | Auth lifecycle: register, verify, login, forgot/reset, delete | P0 | fresh throwaway (+auth-<RUNID>) | D | - | 22m | verification + password-reset emails (clicked from Gmail) |
| CU-03 | Google Calendar: conflict blocking, buffers, two-way sync | P0 | P1 host + connected GCal | D | - | 40m | heavy: busy conflict, event create/move/delete in GCal |
| CU-05 | Event-type configuration → public booking-page enforcement | P0 | P1 host (own event types) | B | - | 85m | guest-notification email |
| CU-06 | Availability engine: weekly rules, overrides, holidays, slot-debug | P0 | P1 host | X | YES | 35m | none |
| CU-07 | Host-side booking management: book-on-behalf, no-show, notes, guests, cancel/reschedule | P1 | P1 host (own event type) | B | - | 40m | invitee/guest email + GCal |
| CU-08 | AI booking chatbot (the differentiator) on the public booking page | P1 | anon invitee | B | - | 30m | confirmation email + GCal |
| CU-09 | AI dashboard assistant: feature-parity actions via chat | P1 | P1 host | B | - | 45m | partial (verify side effects in UI) |
| CU-10 | Landing page + marketing + static pages + mobile responsiveness | P1 | none (anon, read-only) | A | - | 15m | none |
| CU-11 | Public booking page UX: timezones, calendar nav, empty-state, QR, mobile | P1 | anon (host has availability) | A/B | - | 22m | none (one optional booking) |
| CU-04 | Microsoft / Outlook calendar integration | P2 | Outlook ravikant0909@outlook.com | D | - | 35m | Outlook event create/delete + conflict |
| CU-12 | Routing forms: build → public submit → route to event type → analytics | P2 | P1 host + anon | B | - | 18m | none |
| CU-13 | Meeting polls: create → share → vote → tally → pick winner | P2 | P1 host + anon voters | B | - | 18m | none |
| CU-14 | Team / org scheduling: invite, roles, round-robin, collective, team page, audit log | P2 | P1 owner + P3 teammate | D | - | 40m | email/GCal to the ASSIGNED member |
| CU-15 | Contacts, analytics dashboard, and CSV export | P2 | P1 host | A | - | 25m | none |
| CU-16 | Settings & customization: profile, cancellation policy, blocklist, branding, BYOK, pixels | P2 | P1 host | B | - | 30m | none |
| CU-17 | Slack notifications & outbound webhooks (verified via webhook.site) | P2 | P1 host | B | - | 30m | webhook.site receipt + Slack message |
| CU-18 | New-user onboarding wizard (4-step) and schedule-setup prompt | P2 | fresh throwaway (+onb-<RUNID>) | D | - | 22m | none |
| CU-19 | Embeddable booking widget (inline / popup / badge) | P3 | P1 host + external test page | B | - | 25m | none (optional booking email) |
| CU-20 | Email sequences, reminders, and reconfirmation (time-gated, partial) | P3 | P1 host | B | - | 40m | sequence/immediate emails (reminders cron-gated) |
| CU-22 | Chrome extension for Gmail (manual-led) | P3 | P1 (manual install) | manual | - | 20m | none |
Totals: 21 suites · ~11.1 hours of single-threaded agent time (heavily compressed by parallelization; see §6).
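The totals line can be re-derived from the Est. column of the catalog. The snippet below simply transcribes those estimates (in minutes) and sums them; the dictionary name is illustrative.

```python
# Estimated minutes per suite, copied from the "Est." column of the catalog.
ESTIMATES_MIN = {
    "CU-01": 30, "CU-02": 22, "CU-03": 40, "CU-04": 35, "CU-05": 85,
    "CU-06": 35, "CU-07": 40, "CU-08": 30, "CU-09": 45, "CU-10": 15,
    "CU-11": 22, "CU-12": 18, "CU-13": 18, "CU-14": 40, "CU-15": 25,
    "CU-16": 30, "CU-17": 30, "CU-18": 22, "CU-19": 25, "CU-20": 40,
    "CU-22": 20,
}

suite_count = len(ESTIMATES_MIN)
total_minutes = sum(ESTIMATES_MIN.values())
total_hours = round(total_minutes / 60, 1)

if __name__ == "__main__":
    print(suite_count, total_minutes, total_hours)  # 21 667 11.1
```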
6. Parallelization model
Suites are tagged by lane so they can run concurrently without corrupting each other's data or slot math.
Lane legend
| Lane | Meaning | Parallel-safe? |
|---|---|---|
| A | Read-only. Anonymous, no writes (landing/marketing, analytics views). Run anytime, against anything. | Yes, fully |
| B | Host-writers using uniquely-named event types. Each suite writes only its own <name>-<RUNID> event types/fixtures, so concurrent B suites never touch the same rows. | Yes, via RUNID isolation |
| D | Account-isolated. Runs against a fresh throwaway, the P3 teammate, or the Outlook account, with separate state from P1's main config. | Yes, different accounts |
| A/B | Mostly read-only with one optional, RUNID-scoped booking (CU-11). | Yes |
| X | Exclusive. Rewrites global host state. Must run alone. | No |
| manual | Human-led (CU-22 Chrome extension); the agent can only verify the resolved booking link. | n/a |
The two waves
| Wave | When | Suites |
|---|---|---|
| Wave 1 | Run concurrently | All Lane A / B / D suites: CU-01, CU-02, CU-03, CU-04, CU-05, CU-07, CU-08, CU-09, CU-10, CU-11, CU-12, CU-13, CU-14, CU-15, CU-16, CU-17, CU-18, CU-19, CU-20 |
| Wave 2 | Run alone, after Wave 1 fully completes | CU-06 (Lane X, EXCLUSIVE) |

(CU-22 runs whenever a human is available.)
Why CU-06 is exclusive
CU-06 (the availability engine) rewrites the global host availability for P1: it sets a known Mon-Fri 09:00-17:00 schedule, applies date overrides, loads a US-2026 holiday preset, and creates/deletes named schedules, then restores a baseline at the end. Every other suite that books against P1 (CU-01, CU-05, CU-07, CU-08, CU-11, and more) computes available slots from exactly that global availability. If CU-06 ran concurrently, it would silently change the slots those suites depend on mid-flight, making their bookings non-deterministic and their slot-math assertions flaky or wrong.
So CU-06 runs alone, in a second wave, after Wave 1 has fully drained, and is responsible for capturing a baseline at the start and fully restoring the global schedule at the end so the account is left clean for the next run.
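The lane rules above amount to a simple partition of the catalog. The following sketch derives the two waves from each suite's lane tag; the `LANES` mapping is transcribed from the catalog, while `plan_waves` is a hypothetical helper rather than an existing script.

```python
# Lane tag per suite, copied from the "Lane" column of the catalog.
LANES = {
    "CU-01": "B", "CU-02": "D", "CU-03": "D", "CU-04": "D", "CU-05": "B",
    "CU-06": "X", "CU-07": "B", "CU-08": "B", "CU-09": "B", "CU-10": "A",
    "CU-11": "A/B", "CU-12": "B", "CU-13": "B", "CU-14": "D", "CU-15": "A",
    "CU-16": "B", "CU-17": "B", "CU-18": "D", "CU-19": "B", "CU-20": "B",
    "CU-22": "manual",
}

PARALLEL_SAFE = {"A", "B", "D", "A/B"}


def plan_waves(lanes):
    """Split suites into (wave 1, wave 2, manual) according to the lane legend."""
    wave1 = sorted(s for s, lane in lanes.items() if lane in PARALLEL_SAFE)
    wave2 = sorted(s for s, lane in lanes.items() if lane == "X")
    manual = sorted(s for s, lane in lanes.items() if lane == "manual")
    return wave1, wave2, manual


if __name__ == "__main__":
    wave1, wave2, manual = plan_waves(LANES)
    print("Wave 1 (concurrent):", ", ".join(wave1))
    print("Wave 2 (alone):", ", ".join(wave2))
    print("Manual:", ", ".join(manual))
```

Keeping the lane tag as the single source of truth means a newly added exclusive suite would land in Wave 2 without editing the runner logic.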
7. Conventions
Verification tiers
Every coverage item is verified to one of three depths. A suite states the tier it reaches per item.
| Tier | Name | What it proves | How |
|---|---|---|---|
| L1 | UI | The rendered page shows the right thing | Read the DOM / screenshot the visible state |
| L2 | Calendo persistence | The change is stored server-side, not just in-memory | Reload the page (or re-fetch the API) and confirm it survives |
| L3 | External reality | A real artifact exists outside Calendo | Open Gmail / Google Calendar / Outlook / webhook.site in the browser and confirm the real event/email/receipt |
L3 is the whole point (see §1). An item that only reaches L1/L2 is not wrong, but it has not proven the product works for a real user, so note the gap explicitly.
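One way to make the "note the gap explicitly" rule mechanical is to record the tier each coverage item reached and generate a note for anything below L3. This is a minimal sketch; the `Tier` enum and `gap_note` helper are hypothetical and not part of the results template.

```python
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    L1_UI = 1
    L2_PERSISTENCE = 2
    L3_EXTERNAL = 3


def gap_note(item: str, reached: Tier) -> Optional[str]:
    """Return a note for items that stopped short of L3, or None if L3 was reached."""
    if reached >= Tier.L3_EXTERNAL:
        return None
    return f"{item}: verified to L{int(reached)} only; external reality (L3) not proven"


if __name__ == "__main__":
    print(gap_note("Booking confirmation email", Tier.L2_PERSISTENCE))
    print(gap_note("Calendar event created", Tier.L3_EXTERNAL))
```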
Screenshots & evidence
- Capture a screenshot at every PASS/FAIL decision point, and especially for L3 (the Gmail search result, the calendar event at the booked slot, the webhook.site request body).
- For L3 email, record the search query used (e.g. inv-<RUNID>) and the observed subject line.
- For L3 calendar, record the date/time of the event and that the invitee appears as an attendee.
- Retry transient failures before scoring FAIL. Google freeBusy/push propagation, Resend delivery, and Microsoft Graph can lag by seconds-to-minutes. A short re-check (reload, wait, re-search) is required before declaring a real failure. Anthropic 429/529 overload on AI suites is an external condition, recorded as such, not a Calendo bug.
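The re-check rule in the last bullet can be expressed as a small polling loop. This is a sketch under assumptions: the attempt count and delay are placeholder values (the README only says lag can be seconds-to-minutes), and `check` stands in for whatever the agent does to look again, such as reloading Gmail and re-running the search.

```python
import time
from typing import Callable


def recheck(
    check: Callable[[], bool],
    attempts: int = 5,          # assumed value, tune per suite
    delay_s: float = 30.0,      # assumed value, tune per suite
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `check` up to `attempts` times, waiting `delay_s` between tries.

    Returns True as soon as the artifact is observed; False only after every
    attempt has failed, at which point the item can be scored FAIL.
    """
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False


if __name__ == "__main__":
    seen = iter([False, False, True])  # simulated: email shows up on the third look
    print(recheck(lambda: next(seen), attempts=5, delay_s=0))
```

Passing `sleep` as a parameter is only there so the loop can be exercised without real waiting.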
Cleanup discipline
- Delete what you create. Each suite removes its RUNID-scoped event types, bookings, schedules, forms, polls, webhooks at the end. Cancelling a booking should leave no stray calendar event behind; verify it.
- CU-06 must restore the global host schedule to its captured baseline. This is non-negotiable; a half-restored availability corrupts the next run.
- Some artifacts have no in-UI delete (contacts derived from history, meeting polls, orgs/team event types). These are documented per suite as acceptable RUNID-scoped residue or as a human/D1 cleanup task; they are harmless to reruns precisely because of the RUNID convention.
- Never leave a session in a destructive state. Throwaway accounts (CU-02, CU-18) self-delete via the Danger Zone at the end; the no-show mark in CU-07/CU-09 has no un-mark, so it must only be applied to a RUNID-scoped throwaway booking.
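Because every artifact carries the RUNID, deciding what a run may delete reduces to a name match. The sketch below is illustrative: `split_by_runid` is a hypothetical helper, and the boundary check is an added precaution so that a RUNID such as 20260601a does not also match 20260601ab.

```python
import re
from typing import Iterable, List, Tuple


def split_by_runid(names: Iterable[str], runid: str) -> Tuple[List[str], List[str]]:
    """Partition artifact names into (this run's, everything else).

    A name belongs to the run only if the RUNID appears as a whole token,
    i.e. not directly adjacent to another letter or digit.
    """
    pattern = re.compile(rf"(?<![A-Za-z0-9]){re.escape(runid)}(?![A-Za-z0-9])")
    mine, others = [], []
    for name in names:
        (mine if pattern.search(name) else others).append(name)
    return mine, others


if __name__ == "__main__":
    artifacts = [
        "intro-call-20260601a",
        "intro-call-20260601ab",
        "ravikantguptaofficial+inv-20260601a@gmail.com",
        "Weekly sync",
    ]
    to_delete, keep = split_by_runid(artifacts, "20260601a")
    print("delete:", to_delete)
    print("keep:", keep)
```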
8. Related files
| File | Purpose |
|---|---|
| ./00-setup-preconditions.md | Read first. Sessions to verify, fixtures to seed, baseline to capture before any suite runs. |
| ./COVERAGE.md | The full coverage map: what every suite verifies, plus the consolidated manual-residue and out-of-scope / TBD ledger. |
| ./index.html | Browsable HTML dashboard of the suite catalog and coverage (light theme, mobile-friendly). |
| ./suites/ | The runbooks themselves: one Markdown file per suite (CU-NN-*.md). |
| ./results/ | Run outputs. Copy RESULTS-TEMPLATE.md → RESULTS-<SUITE>-<RUNID>.md per run. |
9. Out of scope / TBD
Some product surfaces are deliberately not covered by these browser-agent suites because there is no test mode, no reachable inbox/phone, or the surface is gated to a different account. These are not silently dropped; they are catalogued in ./COVERAGE.md as explicit TBD / manual items. The headline exclusions:
- Stripe payments (live mode only, no test mode), PayPal, checkout coupons, and the Pro-upgrade flow.
- All Pro-gated features: SMS/Twilio reminders (also need a real phone), custom domains, remove-branding.
- The admin panel (gated to a different hardcoded email).
- Per-suite residue: reminder/reconfirmation time-gated cron sends, deep cross-timezone correctness, ICS file parsing, HMAC signature recomputation, and server-side audits (data deletion, encryption-at-rest, role 403 enforcement), all itemized per suite in COVERAGE.md.
See ./COVERAGE.md for the complete, per-suite breakdown of manual residue and TBD items.