Calendo Computer Use Tests

Browser-agent test suites run against production https://calendo.dev.

📖 Hosted (rendered) version: https://calendo-cu-tests.pages.dev/ (the dashboard and every runbook rendered as browsable web pages, served from Cloudflare Pages on a separate domain from the product). The easiest way to review the suite or hand a runbook to a computer-use agent from any machine. Regenerate + redeploy with: node scripts/build-cu-tests-site.mjs /tmp/cu-site && wrangler pages deploy /tmp/cu-site --project-name calendo-cu-tests --branch main

This directory is the master index and operating manual for Calendo's "Computer Use Test" suite: a battery of end-to-end checks that a computer-use agent (a browser-driving AI) executes against the live product. Each suite is a Markdown runbook in ./suites/. Each run is recorded in a copy of ./results/RESULTS-TEMPLATE.md.


1. What this is

Calendo already has three layers of automated tests, all of which run in CI before any deploy:

| Layer | What it proves | Where it runs |
|---|---|---|
| Vitest unit tests (tests/unit/**) | Pure functions, route handlers, validation, AI tool wiring | In-process, mocked D1 (better-sqlite3) |
| Integration / smoke (scripts/smoke-local.mjs, tests/schema) | Real worker + real local D1, HTTP-level behavior | wrangler dev against local D1 |
| Playwright E2E (tests/e2e/**) | Critical DOM journeys (sign-up, booking flow) | Headless browser against a local worker |

Computer Use Tests are a different, higher tier. They do not run against a local worker or a mocked database. They drive a real browser against production with real logged-in accounts, and, most importantly, they verify external reality that no in-process test can touch.

Why L3 reality matters

Unit and integration tests can pass while the product is silently broken for users, because they mock or stub the very systems that fail in the real world: OAuth token refresh, Google freeBusy, Resend deliverability, Microsoft Graph calendarView, cron-driven sequence sends. A green CI run tells you the code's logic is right. It does not tell you that an invitee actually received a calendar invite this morning.

The Computer Use Tests close that gap. They are the only layer that answers: "If a stranger booked a meeting on calendo.dev right now, would a real event show up on the real calendar and a real email land in the real inbox?" That question is the product's entire value proposition, so it gets its own test tier.

These suites are expensive and stateful (they touch production data, real calendars, real inboxes), so they are run deliberately (before a launch, after a risky deploy, on a release cadence), not on every commit. CI stays the fast gate; Computer Use Tests are the reality gate.


2. How to run

The entry point is this README. Point a computer-use agent at it, then:

  1. Read the global facts below (accounts, RUNID, verification tiers, parallelization, cleanup).
  2. Read ./00-setup-preconditions.md and satisfy every precondition (sessions logged in, fixtures present, baseline captured) before touching a suite.
  3. Pick a target, either a single suite or a wave (see §6):
    • One suite: open its runbook in ./suites/, e.g. ./suites/CU-01-core-booking-lifecycle.md, and execute it top to bottom.
    • A wave / the full graph: run all Wave 1 suites concurrently (Lane A/B/D), then run Wave 2 (CU-06) alone. See §6 for why the order is mandatory.
  4. Mint a fresh RUNID (see §4) and embed it in every name/email you create during the run.
  5. Execute the runbook, capturing evidence at each step (see §7).
  6. Record results: copy ./results/RESULTS-TEMPLATE.md to ./results/RESULTS-<SUITE>-<RUNID>.md (e.g. RESULTS-CU-01-20260601a.md) and fill in PASS / FAIL / BLOCKED per coverage item, attaching screenshots and observed reality.
  7. Clean up the data your run created (see §7).
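The results-file naming in step 6 can be sketched as a small helper. This is only an illustration: the function name resultsPath and both validation patterns are assumptions, not part of the repository.

```javascript
// Hypothetical helper: builds the results path described in step 6.
// The name `resultsPath` and the validation rules are illustrative assumptions.
function resultsPath(suiteId, runId) {
  // Suite IDs in the catalog look like "CU-01"; reject anything else early.
  if (!/^CU-\d{2}$/.test(suiteId)) {
    throw new Error(`Unexpected suite id: ${suiteId}`);
  }
  // Assumption: a RUNID is a short token of letters and digits (e.g. "20260601a").
  if (!/^[A-Za-z0-9]+$/.test(runId)) {
    throw new Error(`Unexpected RUNID: ${runId}`);
  }
  return `./results/RESULTS-${suiteId}-${runId}.md`;
}

console.log(resultsPath("CU-01", "20260601a"));
// prints ./results/RESULTS-CU-01-20260601a.md
```

Validating both parts before building the path keeps a typo in the suite ID or RUNID from producing a results file that later audits cannot match to a run.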

Running one suite vs. the full graph


3. Accounts & sessions

Sessions must already be logged in. This environment stores no passwords. Every account below must already have a live, authenticated browser session before a run starts. A mid-run cold login (email/password or OAuth consent) is forbidden: if a suite hits a login wall, that portion becomes manual residue, not a FAIL. Verify all sessions in ./00-setup-preconditions.md first.

| Account | Email | Role in tests | Notes |
|---|---|---|---|
| P1 (Host) | ravikantguptaofficial@gmail.com | Primary host. Owns event types, availability, dashboard. The Google Calendar + Gmail inbox used for L3 verification. | Google sign-in + connected Google Calendar + the notification inbox the agent opens to read confirmation/reschedule/cancel emails. |
| P3 (Teammate) | everythingaichannelemail@gmail.com | Second org member (team/round-robin suites); guest recipient for guest-notification email checks. | Distinct mailbox from P1, used where a separate invitee/guest inbox is required. |
| Outlook | ravikant0909@outlook.com | Microsoft/Outlook calendar integration host. | Used only by the Outlook suite (CU-04). Real Outlook event create/delete + conflict. |

Plus-alias (invitee) convention

Invitees and throwaway registrants are plus-aliases of P1's Gmail, so their mail lands in P1's inbox where the agent can read it, while remaining unique per run:

Gmail delivers all +tag aliases to the base inbox, so the agent searches inv-<RUNID> (or the event name) in P1's Gmail to confirm L3 email delivery.
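As a rough sketch of the convention, the alias and its matching Gmail search token could be derived as below. The exact alias shape is an assumption inferred from the inv-<RUNID> search token and the +auth-<RUNID> / +onb-<RUNID> tags in the suite catalog; the helper name and the example.com address are placeholders.

```javascript
// Hypothetical helper: builds a per-run plus-alias of a base mailbox.
// Assumed shape: <local>+<tag>-<RUNID>@<domain>.
function plusAlias(baseEmail, tag, runId) {
  const at = baseEmail.lastIndexOf("@");
  if (at <= 0) {
    throw new Error(`Not an email address: ${baseEmail}`);
  }
  const local = baseEmail.slice(0, at);
  const domain = baseEmail.slice(at + 1);
  return {
    email: `${local}+${tag}-${runId}@${domain}`,
    // Token to type into the base inbox's search box to find this run's mail.
    searchToken: `${tag}-${runId}`,
  };
}

const invitee = plusAlias("host@example.com", "inv", "20260601a");
console.log(invitee.email);       // prints host+inv-20260601a@example.com
console.log(invitee.searchToken); // prints inv-20260601a
```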

Known limitation: because invitees are plus-aliases of the host's own Gmail, true separate-invitee calendar-invite/RSVP delivery is ambiguous in suites that use them. Where a genuinely distinct invitee mailbox is required, use P3 (e.g. guest-notification checks in CU-05/CU-07).


4. The RUNID convention

Every run mints one fresh, unique RUNID token (e.g. a date + short suffix like 20260601a, or a short random string). That token is embedded in everything the run creates (event-type names, schedule names, invitee names, plus-alias emails, poll/form titles, webhook labels) so that:

Never reuse a RUNID across runs. A fresh token per run is what makes parallel execution and post-run auditing safe.
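One possible way to mint a date-plus-suffix RUNID and stamp it onto a fixture name is sketched here. The helper names, the use of UTC, and the "<name>-<RUNID>" join are assumptions for illustration; any scheme that guarantees a fresh token per run would serve.

```javascript
// Hypothetical helper: date + short suffix, e.g. "20260601a".
// UTC is an assumption, chosen so agents in different timezones agree on the date part.
function mintRunId(date = new Date(), suffix = "a") {
  const yyyy = date.getUTCFullYear();
  const mm = String(date.getUTCMonth() + 1).padStart(2, "0");
  const dd = String(date.getUTCDate()).padStart(2, "0");
  return `${yyyy}${mm}${dd}${suffix}`;
}

// Hypothetical helper: embeds the RUNID in a name the run creates.
function tagName(name, runId) {
  return `${name}-${runId}`;
}

const runId = mintRunId(new Date(Date.UTC(2026, 5, 1)), "a"); // month index 5 = June
console.log(runId);                        // prints 20260601a
console.log(tagName("intro-call", runId)); // prints intro-call-20260601a
```

The caller is responsible for picking a suffix that has not been used that day; the sketch does not check for collisions.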


5. Suite catalog (ranked)

Sorted by priority (P0 → P3), then by ID. Each ID links to its runbook in ./suites/.

| ID | Title | Pri | Accounts | Lane | Excl. | Est. | L3 reality |
|---|---|---|---|---|---|---|---|
| CU-01 | Core booking lifecycle (book → reschedule → cancel) with calendar & email reality | P0 | P1 host + anon invitee | B | - | 30m | GCal create/move/delete + confirm/reschedule/cancel emails |
| CU-02 | Auth lifecycle: register, verify, login, forgot/reset, delete | P0 | fresh throwaway (+auth-<RUNID>) | D | - | 22m | verification + password-reset emails (clicked from Gmail) |
| CU-03 | Google Calendar: conflict blocking, buffers, two-way sync | P0 | P1 host + connected GCal | D | - | 40m | heavy: busy conflict, event create/move/delete in GCal |
| CU-05 | Event-type configuration → public booking-page enforcement | P0 | P1 host (own event types) | B | - | 85m | guest-notification email |
| CU-06 | Availability engine: weekly rules, overrides, holidays, slot-debug | P0 | P1 host | X | YES | 35m | none |
| CU-07 | Host-side booking management: book-on-behalf, no-show, notes, guests, cancel/reschedule | P1 | P1 host (own event type) | B | - | 40m | invitee/guest email + GCal |
| CU-08 | AI booking chatbot (the differentiator) on the public booking page | P1 | anon invitee | B | - | 30m | confirmation email + GCal |
| CU-09 | AI dashboard assistant: feature-parity actions via chat | P1 | P1 host | B | - | 45m | partial (verify side effects in UI) |
| CU-10 | Landing page + marketing + static pages + mobile responsiveness | P1 | none (anon, read-only) | A | - | 15m | none |
| CU-11 | Public booking page UX: timezones, calendar nav, empty-state, QR, mobile | P1 | anon (host has availability) | A/B | - | 22m | none (one optional booking) |
| CU-04 | Microsoft / Outlook calendar integration | P2 | Outlook ravikant0909@outlook.com | D | - | 35m | Outlook event create/delete + conflict |
| CU-12 | Routing forms: build → public submit → route to event type → analytics | P2 | P1 host + anon | B | - | 18m | none |
| CU-13 | Meeting polls: create → share → vote → tally → pick winner | P2 | P1 host + anon voters | B | - | 18m | none |
| CU-14 | Team / org scheduling: invite, roles, round-robin, collective, team page, audit log | P2 | P1 owner + P3 teammate | D | - | 40m | email/GCal to the ASSIGNED member |
| CU-15 | Contacts, analytics dashboard, and CSV export | P2 | P1 host | A | - | 25m | none |
| CU-16 | Settings & customization: profile, cancellation policy, blocklist, branding, BYOK, pixels | P2 | P1 host | B | - | 30m | none |
| CU-17 | Slack notifications & outbound webhooks (verified via webhook.site) | P2 | P1 host | B | - | 30m | webhook.site receipt + Slack message |
| CU-18 | New-user onboarding wizard (4-step) and schedule-setup prompt | P2 | fresh throwaway (+onb-<RUNID>) | D | - | 22m | none |
| CU-19 | Embeddable booking widget (inline / popup / badge) | P3 | P1 host + external test page | B | - | 25m | none (optional booking email) |
| CU-20 | Email sequences, reminders, and reconfirmation (time-gated, partial) | P3 | P1 host | B | - | 40m | sequence/immediate emails (reminders cron-gated) |
| CU-22 | Chrome extension for Gmail (manual-led) | P3 | P1 (manual install) | manual | - | 20m | none |

Totals: 21 suites · ~11.1 hours of single-threaded agent time (heavily compressed by parallelization; see §6).
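The totals can be cross-checked by summing the Est. column of the catalog. The snippet below is just that arithmetic; the variable names are illustrative.

```javascript
// Per-suite estimates in minutes, copied from the Est. column of the catalog.
const estimatesMin = {
  "CU-01": 30, "CU-02": 22, "CU-03": 40, "CU-05": 85, "CU-06": 35,
  "CU-07": 40, "CU-08": 30, "CU-09": 45, "CU-10": 15, "CU-11": 22,
  "CU-04": 35, "CU-12": 18, "CU-13": 18, "CU-14": 40, "CU-15": 25,
  "CU-16": 30, "CU-17": 30, "CU-18": 22, "CU-19": 25, "CU-20": 40,
  "CU-22": 20,
};

const suiteCount = Object.keys(estimatesMin).length;
const totalMin = Object.values(estimatesMin).reduce((sum, m) => sum + m, 0);
const totalHours = (totalMin / 60).toFixed(1);

console.log(`${suiteCount} suites, ${totalMin} min, ~${totalHours} h`);
// prints 21 suites, 667 min, ~11.1 h
```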


6. Parallelization model

Suites are tagged by lane so they can run concurrently without corrupting each other's data or slot math.

Lane legend

| Lane | Meaning | Parallel-safe? |
|---|---|---|
| A | Read-only. Anonymous, no writes (landing/marketing, analytics views). Run anytime, against anything. | Yes, fully |
| B | Host-writers using uniquely-named event types. Each suite writes only its own <name>-<RUNID> event types/fixtures, so concurrent B suites never touch the same rows. | Yes, via RUNID isolation |
| D | Account-isolated. Runs against a fresh throwaway, the P3 teammate, or the Outlook account, with state separate from P1's main config. | Yes, different accounts |
| A/B | Mostly read-only with one optional, RUNID-scoped booking (CU-11). | Yes |
| X | Exclusive. Rewrites global host state. Must run alone. | No |
| manual | Human-led (CU-22 Chrome extension); the agent can only verify the resolved booking link. | n/a |

The two waves

WAVE 1  (run concurrently)        WAVE 2  (run alone, after Wave 1 fully completes)
──────────────────────────        ──────────────────────────────────────────────
All Lane A / B / D suites:        CU-06  (Lane X, EXCLUSIVE)
  CU-01  CU-02  CU-03  CU-04
  CU-05  CU-07  CU-08  CU-09
  CU-10  CU-11  CU-12  CU-13
  CU-14  CU-15  CU-16  CU-17
  CU-18  CU-19  CU-20
  (CU-22 runs whenever a human is available)

Why CU-06 is exclusive

CU-06 (the availability engine) rewrites the global host availability for P1: it sets a known Mon–Fri 09:00–17:00 schedule, applies date overrides, loads a US-2026 holiday preset, and creates/deletes named schedules, then restores a baseline at the end. Every other suite that books against P1 (CU-01, CU-05, CU-07, CU-08, CU-11, and more) computes available slots from exactly that global availability. If CU-06 ran concurrently, it would silently change the slots those suites depend on mid-flight, making their bookings non-deterministic and their slot-math assertions flaky or wrong.

So CU-06 runs alone, in a second wave, after Wave 1 has fully drained, and is responsible for capturing a baseline at the start and fully restoring the global schedule at the end so the account is left clean for the next run.
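The wave split follows mechanically from the lane tags, which a minimal sketch can make concrete. The function planWaves and the shape of the suite objects are assumptions for illustration; only a handful of catalog rows are shown.

```javascript
// Hypothetical planner: partitions suites into waves by lane tag.
// Lane X runs alone in Wave 2; "manual" waits for a human; everything else is Wave 1.
function planWaves(suites) {
  const plan = { wave1: [], wave2: [], manual: [] };
  for (const suite of suites) {
    if (suite.lane === "X") {
      plan.wave2.push(suite.id);
    } else if (suite.lane === "manual") {
      plan.manual.push(suite.id);
    } else {
      plan.wave1.push(suite.id); // lanes A, B, D and A/B are parallel-safe
    }
  }
  return plan;
}

// A few rows from the catalog, with their lane tags.
const sample = [
  { id: "CU-01", lane: "B" },
  { id: "CU-02", lane: "D" },
  { id: "CU-06", lane: "X" },
  { id: "CU-10", lane: "A" },
  { id: "CU-11", lane: "A/B" },
  { id: "CU-22", lane: "manual" },
];

console.log(planWaves(sample));
// wave1: CU-01, CU-02, CU-10, CU-11; wave2: CU-06; manual: CU-22
```

A real runner would also need to wait for every Wave 1 suite to finish before starting Wave 2; the sketch only covers the partitioning.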


7. Conventions

Verification tiers

Every coverage item is verified to one of three depths. A suite states the tier it reaches per item.

| Tier | Name | What it proves | How |
|---|---|---|---|
| L1 | UI | The rendered page shows the right thing | Read the DOM / screenshot the visible state |
| L2 | Calendo persistence | The change is stored server-side, not just in-memory | Reload the page (or re-fetch the API) and confirm it survives |
| L3 | External reality | A real artifact exists outside Calendo | Open Gmail / Google Calendar / Outlook / webhook.site in the browser and confirm the real event/email/receipt |

L3 is the whole point (see §1). An item that only reaches L1/L2 is not wrong, but it has not proven the product works for a real user, so note the gap explicitly.
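Noting the gap can be made routine when writing up results. The sketch below lists every coverage item that stopped short of L3; the item shape, the sample items, and the function name are all hypothetical and do not reflect the results template.

```javascript
// Tier ordering from the table above: a higher rank is a deeper verification.
const TIER_RANK = { L1: 1, L2: 2, L3: 3 };

// Hypothetical helper: returns a note for each item that did not reach L3.
function findL3Gaps(items) {
  const notes = [];
  for (const item of items) {
    const rank = TIER_RANK[item.reached];
    if (rank === undefined) {
      throw new Error(`Unknown tier "${item.reached}" on item "${item.id}"`);
    }
    if (rank < TIER_RANK.L3) {
      notes.push(`${item.id}: reached ${item.reached}, L3 not proven`);
    }
  }
  return notes;
}

// Illustrative items only; real coverage items live in each suite's runbook.
const sampleItems = [
  { id: "confirmation email delivered", reached: "L3" },
  { id: "booking listed on dashboard", reached: "L2" },
  { id: "success banner shown", reached: "L1" },
];

console.log(findL3Gaps(sampleItems));
// lists the L2 and L1 items, each marked "L3 not proven"
```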

Screenshots & evidence

Cleanup discipline


8. Related files

| File | Purpose |
|---|---|
| ./00-setup-preconditions.md | Read first. Sessions to verify, fixtures to seed, baseline to capture before any suite runs. |
| ./COVERAGE.md | The full coverage map: what every suite verifies, plus the consolidated manual-residue and out-of-scope / TBD ledger. |
| ./index.html | Browsable HTML dashboard of the suite catalog and coverage (light theme, mobile-friendly). |
| ./suites/ | The runbooks themselves: one Markdown file per suite (CU-NN-*.md). |
| ./results/ | Run outputs. Copy RESULTS-TEMPLATE.md → RESULTS-<SUITE>-<RUNID>.md per run. |

9. Out of scope / TBD

Some product surfaces are deliberately not covered by these browser-agent suites because there is no test mode, no reachable inbox/phone, or the surface is gated to a different account. These are not silently dropped: they are catalogued in ./COVERAGE.md as explicit TBD / manual items. The headline exclusions:

See ./COVERAGE.md for the complete, per-suite breakdown of manual residue and TBD items.