Pipeline Rehabilitation and a Bonus LLM Evaluation Tool: Rebuilding Noteefy’s Cypress CI Infrastructure
Noteefy is golf’s leading demand and revenue management platform — trusted by 800+ courses including 80 of the top 200 public courses and 9 of the top 12 multi-course operators. We inherited a fragile CI pipeline and rebuilt it from the ground up. Then we delivered a purpose-built LLM evaluation tool nobody asked for.
Golf's demand management platform — filling tee times, reducing no-shows, and powering AI-driven pro shop assistance.
Trusted by over 800 courses nationwide, Noteefy offers a product suite of Waitlist, Confirm, AI Pro Shop Assistant, and Lead Management tools that help golf course operators automatically fill cancelled tee times, reduce no-shows, and deliver a better booking experience. The AI Pro Shop Assistant is an LLM-powered conversational interface that handles golfer enquiries, booking assistance, and course information across multiple client configurations. Validating the quality and accuracy of its AI-generated responses across those configurations is a distinct quality challenge that requires purpose-built evaluation tooling.
The team was onboarded in November 2025, and the engagement focused on automation engineering: assessing and rehabilitating a fragile Cypress CI pipeline that had accumulated significant flakiness, then expanding coverage on a stable foundation.
Over 80% of the Cypress suite was flaky on arrival. We eliminated every high- and medium-priority flaky test, built a custom failed-test rerun mechanism without paying for a higher-tier plan, and delivered a full LLM evaluation platform as a bonus.
At a glance.
Audit, rehabilitate, enhance.
The engagement moved through three focused phases — a diagnostic audit that mapped every flaky test and identified root causes, a systematic rehabilitation phase that eliminated all high and medium priority flakiness, and an enhancement phase that delivered execution time improvements, coverage expansion, and the bonus Testberry LLM evaluation platform.
A fragile pipeline, no manual trigger, and 80% flakiness.
The Cypress automation suite was inherited in a fragile state. The CI pipeline was slow, unreliable, and difficult to debug — with over 80% of tests showing flakiness before a single line of new code had been written. Restoring confidence in the suite required systematic root-cause analysis, not just surface-level fixes.
Pervasive Test Flakiness
Over 80% of the 70+ Cypress test scripts showed high or medium priority flakiness on arrival. Tests failed inconsistently across runs — not due to product defects, but due to timing issues, state management problems, and selector instability — making CI results untrustworthy as a quality signal.
No On-Demand Execution
All Cypress tests ran exclusively as part of deployment-triggered GitLab pipelines. There was no way to run the suite on-demand — for debugging, for pre-merge validation, or for manual quality checks. Every run required triggering a full deployment.
No Failed-Test Rerun Mechanism
When tests failed, the only option was to re-run the entire suite from scratch. Without a Cypress Business plan, there was no built-in mechanism to rerun only failed tests — wasting CI minutes and slowing the development feedback loop significantly.
Slow CI Execution
Full suite execution took 12–15 minutes per run. At this duration, the pipeline was a bottleneck — long enough to break developer flow, slow enough to discourage frequent execution, and expensive enough in CI minutes that teams were reluctant to trigger additional runs.
From fragile to fast — and then beyond the brief.
The team systematically audited, rehabilitated, and then extended the Cypress CI pipeline — eliminating all high and medium priority flakiness, engineering custom tooling within the client's existing subscription constraints, and then delivering a purpose-built LLM evaluation platform as a bonus deliverable entirely outside the original scope.
Flaky Test Eradication
Conducted a systematic audit of all 70+ Cypress scripts, categorising every flaky test by severity and root cause. All high and medium priority flakiness was eliminated by mid-February — through timing fixes, selector stabilisation, state management improvements, and async handling corrections. The suite went from ~80% flaky to effectively zero high/medium failures.
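As a rough illustration of the audit step, the sketch below computes a per-spec failure rate from a run history and buckets each spec by severity. The thresholds, the record layout, and the bucket names are assumptions made for illustration; the source does not describe how severity was actually assigned, and Python is used here only as a neutral scripting language.

```python
from collections import defaultdict

# Hypothetical severity thresholds on the share of failing runs; not from the source.
HIGH_THRESHOLD = 0.30
MEDIUM_THRESHOLD = 0.10


def classify_flakiness(history):
    """Bucket each spec by how often it fails across repeated runs of the same code.

    `history` is an iterable of (spec_name, passed) tuples, one per run.
    Returns {spec_name: (severity, failure_rate)}.
    """
    runs = defaultdict(list)
    for spec, passed in history:
        runs[spec].append(passed)

    report = {}
    for spec, outcomes in runs.items():
        failures = outcomes.count(False)
        rate = failures / len(outcomes)
        if failures == 0:
            severity = "stable"
        elif failures == len(outcomes):
            # Fails every time: a consistent failure, not a flaky one.
            severity = "broken"
        elif rate >= HIGH_THRESHOLD:
            severity = "high"
        elif rate >= MEDIUM_THRESHOLD:
            severity = "medium"
        else:
            severity = "low"
        report[spec] = (severity, round(rate, 2))
    return report
```

Separating "broken" from "flaky" matters in an audit like this: a spec that fails on every run points to a product or test defect, while one that fails intermittently points to timing, state, or selector problems.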
On-Demand GitLab Pipeline
Built a configurable manual trigger into the GitLab CI/CD pipeline, allowing the suite to be executed on-demand without requiring a deployment. Developers and QA can now run the full suite, specific test files, or targeted subsets at any point in the development cycle.
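One way such a trigger could choose what to run is to read a pipeline variable and translate it into Cypress CLI arguments. In the sketch below, the variable name `CYPRESS_SPECS` and the helper function are hypothetical; `--spec` is Cypress's own flag for restricting a run to given files or glob patterns.

```python
def build_cypress_command(env):
    """Translate pipeline variables into a `cypress run` argument list.

    `env` is a mapping such as os.environ. CYPRESS_SPECS is a hypothetical
    variable holding comma-separated spec files or globs; empty means full suite.
    """
    command = ["npx", "cypress", "run"]
    raw = env.get("CYPRESS_SPECS", "")
    specs = [s.strip() for s in raw.split(",") if s.strip()]
    if specs:
        # Cypress accepts a comma-separated list of files or glob patterns.
        command += ["--spec", ",".join(specs)]
    return command
```

With the variable unset, the function returns the plain full-suite command, so the same job definition can serve both the manual trigger and a default run.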
Custom Failed-Test Rerun Script
Engineered a custom failed-test rerun mechanism without requiring a Cypress Business subscription. The script identifies failed test files from the previous run and reruns only those — eliminating full-suite re-execution for isolated failures and recovering CI minutes previously lost to unnecessary reruns.
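The idea can be sketched in a few lines. This is not the delivered script: it assumes the previous run left behind a JSON summary listing each spec file with its failure count (a hypothetical format; real reporter output differs), and it builds a `cypress run --spec` command covering only the failed files.

```python
import json


def failed_specs(summary_json):
    """Return spec files with at least one failure, in their original order."""
    summary = json.loads(summary_json)
    return [run["spec"] for run in summary["runs"] if run["failures"] > 0]


def rerun_command(summary_json):
    """Build the rerun command, or None when nothing failed."""
    specs = failed_specs(summary_json)
    if not specs:
        return None
    return ["npx", "cypress", "run", "--spec", ",".join(specs)]
```

In a CI job, the returned list could be handed to `subprocess.run`; returning `None` lets the job exit early when the previous run was clean.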
Execution Time Reduction
Reduced full suite runtime from 12–15 minutes to under 5 minutes — a 70%+ improvement. The reduction came from a combination of flakiness elimination (fewer retries), pipeline configuration optimisation, and smarter test execution ordering.
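The percentage can be checked with simple arithmetic. The source gives only ranges, not exact timings, so the sketch below computes the reduction at the 5-minute upper bound and the runtime that a 70% cut would imply for each starting point.

```python
def reduction_pct(before_min, after_min):
    """Percentage reduction in runtime."""
    return (before_min - after_min) / before_min * 100


def runtime_for_reduction(before_min, pct):
    """Runtime that corresponds to cutting `before_min` by `pct` percent."""
    return before_min * (1 - pct / 100)


for before in (12, 15):
    at_bound = reduction_pct(before, 5)
    needed = runtime_for_reduction(before, 70)
    print(f"{before} min to 5 min is a {at_bound:.1f}% cut; "
          f"a 70% cut means at most {needed:.1f} min")
```

A run of exactly 5 minutes would be a 58.3% cut from 12 minutes and a 66.7% cut from 15, so the 70%+ figure implies a final runtime of roughly 3.6 to 4.5 minutes or less, which sits inside "under 5 minutes".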
Pipeline Failure Monitoring & Triage
Set up a dedicated Slack channel for real-time CI failure notifications, providing the team with immediate visibility into pipeline failures without manual monitoring. Structured failure messages include test names, error context, and direct links to the GitLab job — enabling rapid triage.
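A structured message of this kind could be assembled as below. This is a sketch under assumptions: the failure record layout is hypothetical, the message uses the simple `{"text": ...}` JSON body that Slack incoming webhooks accept, and the job link is expected to come from GitLab's predefined `CI_JOB_URL` variable. Sending the payload is left out so the example involves no network calls.

```python
import json


def build_failure_message(failures, job_url):
    """Format failed tests into a Slack incoming-webhook payload (JSON string).

    `failures` is a list of dicts with hypothetical keys "test" and "error".
    `job_url` would typically be read from the CI_JOB_URL variable in GitLab CI.
    """
    lines = [f":x: Cypress pipeline failed: {len(failures)} test(s)"]
    for failure in failures:
        # Keep only the first line of the error so the message stays scannable.
        first_line = (failure["error"].splitlines() or [""])[0]
        lines.append(f"- {failure['test']}: {first_line}")
    # Slack renders <url|label> as a link in message text.
    lines.append(f"<{job_url}|Open GitLab job>")
    return json.dumps({"text": "\n".join(lines)})
```

Trimming each error to its first line keeps the channel readable; the full stack trace remains one click away through the job link.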
Bonus Delivery: Testberry — LLM Evaluation Tool
Built Testberry, a purpose-built LLM evaluation platform for validating the quality and accuracy of Noteefy's AI Pro Shop Assistant chatbot responses. Features: evaluation dashboard, run details and results views, ground truth dataset management, A/B testing across different chatbot configurations, and Slack integration for evaluation alerts — all delivered as a bonus outside the core automation scope.
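To make the A/B idea concrete, here is a deliberately simple sketch. It is not Testberry's actual method, which the source does not describe. It scores each configuration's answers against a ground truth dataset using token overlap, a stand-in metric chosen purely for illustration; the function names and data shapes are hypothetical.

```python
def token_overlap(expected, actual):
    """Share of expected-answer tokens that also appear in the actual answer."""
    expected_tokens = set(expected.lower().split())
    actual_tokens = set(actual.lower().split())
    if not expected_tokens:
        return 0.0
    return len(expected_tokens & actual_tokens) / len(expected_tokens)


def compare_configs(ground_truth, answers_by_config):
    """Average overlap score per configuration.

    `ground_truth` maps question -> expected answer;
    `answers_by_config` maps config name -> {question: chatbot answer}.
    A missing answer is scored as an empty string.
    """
    scores = {}
    for config, answers in answers_by_config.items():
        per_question = [
            token_overlap(expected, answers.get(question, ""))
            for question, expected in ground_truth.items()
        ]
        scores[config] = sum(per_question) / len(per_question) if per_question else 0.0
    return scores
```

Whatever the scoring function, holding the ground truth dataset fixed and varying only the chatbot configuration is what allows two configurations to be compared on equal terms.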
The numbers.
High/Medium Flakiness: effectively zero (from 80%)
CI Execution Time Reduced: 70%+
Full Pipeline Duration: under 5 minutes
Automation Scripts: 100+
What changed.
High/Medium Flakiness Remaining
Effectively zero, down from 80%+ on arrival: every high- and medium-priority flaky test was systematically identified and resolved.
CI Execution Time Cut
Full suite runtime fell from 12–15 minutes to under 5 minutes, a 70%+ reduction; the pipeline is now a development asset, not a bottleneck.
Full Suite Runtime
Under 5 minutes from trigger to result — fast enough to run frequently, reliable enough to trust as a quality signal.
Automation Scripts
The suite grew from 70+ to 100+ scripts during the enhancement phase, built on a stable foundation free of high- and medium-priority flakiness.
Manual Pipeline Trigger Built
The suite can now run at any point in the development cycle — without requiring a deployment trigger.
LLM Evaluation Platform Delivered
A full purpose-built evaluation tool for Noteefy's AI Pro Shop Assistant — delivered as a bonus, outside the original engagement scope.
The stack.
The Noteefy engagement is a case study in pipeline rehabilitation. By addressing flakiness at its root cause rather than patching symptoms, and by engineering custom tooling within the client's existing subscription constraints, the team transformed a slow, unreliable CI environment into a fast, trustworthy foundation for continuous quality delivery. The team then went further, delivering Testberry, a purpose-built LLM evaluation platform that brings structured, measurable quality assurance to Noteefy's AI Pro Shop Assistant and shows what it means to think beyond the brief.