Pipeline Rehabilitation and a Bonus LLM Evaluation Tool: Rebuilding Noteefy’s Cypress CI Infrastructure

Noteefy is golf’s leading demand and revenue management platform — trusted by 800+ courses including 80 of the top 200 public courses and 9 of the top 12 multi-course operators. We inherited a fragile CI pipeline and rebuilt it from the ground up. Then we delivered a purpose-built LLM evaluation tool nobody asked for.


About Noteefy

Golf's demand management platform — filling tee times, reducing no-shows, and powering AI-driven pro shop assistance.

Noteefy is golf's leading demand and revenue management platform, trusted by over 800 courses nationwide, including 80 of the top 200 public courses and 9 of the top 12 multi-course operators. Its product suite includes Waitlist, Confirm, AI Pro Shop Assistant, and Lead Management tools that help golf course operators automatically fill cancelled tee times, reduce no-shows, and deliver a better booking experience. The AI Pro Shop Assistant is a conversational interface powered by an LLM that handles golfer enquiries, booking assistance, and course information across multiple client configurations. Validating the quality and accuracy of its AI-generated responses across those configurations is a distinct quality challenge that calls for purpose-built evaluation tooling.

The engagement began in November 2025 and focused on automation engineering: assessing and rehabilitating a fragile Cypress CI pipeline that had accumulated significant flakiness, then expanding coverage on a stable foundation.

80% of their Cypress suite was flaky on arrival. We eradicated every high and medium priority failure, built a custom rerun mechanism without paying for a higher-tier plan, and delivered a full LLM evaluation platform as a bonus.

Project Profile

At a glance.

Client: Noteefy
Website: www.noteefy.com
Onboarded: November 2025
Industry: Golf Tech / SaaS
Project Type: Automation Engineering
Engagement Model: Part-time (20 hrs / week)
Automation Framework: Cypress (JavaScript)
AI Coding Tool: Cursor AI
Code Review: Greptile
CI/CD: GitLab Pipelines
Monitoring: Slack (failure notifications)

Engagement Journey

Audit, rehabilitate, enhance.

The engagement moved through three focused phases — a diagnostic audit that mapped every flaky test and identified root causes, a systematic rehabilitation phase that eliminated all high and medium priority flakiness, and an enhancement phase that delivered execution time improvements, coverage expansion, and the bonus Testberry LLM evaluation platform.

Phase: Audit
Flakiness State: 80%+ high/medium flakiness across 70+ scripts
Pipeline State: Deployment-triggered only; no manual trigger; 12–15 min runtime
Deliverables: Flakiness audit report; root cause identification; remediation priority list

Phase: Rehabilitation
Flakiness State: High and medium flakiness eliminated by mid-February
Pipeline State: On-demand manual trigger built; custom failed-test rerun script delivered
Deliverables: 0% high/medium flaky tests; Slack failure notification channel active

Phase: Enhancement
Flakiness State: ~0% flakiness maintained
Pipeline State: Under 5 min full suite runtime
Deliverables: 100+ scripts; Testberry LLM evaluation platform delivered

The Challenge

A fragile pipeline, no manual trigger, and 80% flakiness.

The Cypress automation suite was inherited in a fragile state. The CI pipeline was slow, unreliable, and difficult to debug — with over 80% of tests showing flakiness before a single line of new code had been written. Restoring confidence in the suite required systematic root-cause analysis, not just surface-level fixes.

Pervasive Test Flakiness

Over 80% of the 70+ Cypress test scripts showed high or medium priority flakiness on arrival. Tests failed inconsistently across runs — not due to product defects, but due to timing issues, state management problems, and selector instability — making CI results untrustworthy as a quality signal.

No On-Demand Execution

All Cypress tests ran exclusively as part of deployment-triggered GitLab pipelines. There was no way to run the suite on-demand — for debugging, for pre-merge validation, or for manual quality checks. Every run required triggering a full deployment.

No Failed-Test Rerun Mechanism

When tests failed, the only option was to re-run the entire suite from scratch. Without a Cypress Business plan, there was no built-in mechanism to rerun only failed tests — wasting CI minutes and slowing the development feedback loop significantly.

Slow CI Execution

Full suite execution took 12–15 minutes per run. At this duration, the pipeline was a bottleneck — long enough to break developer flow, slow enough to discourage frequent execution, and expensive enough in CI minutes that teams were reluctant to trigger additional runs.

What We Did

From fragile to fast — and then beyond the brief.

The team systematically audited, rehabilitated, and then extended the Cypress CI pipeline — eliminating all high and medium priority flakiness, engineering custom tooling within the client's existing subscription constraints, and then delivering a purpose-built LLM evaluation platform as a bonus deliverable entirely outside the original scope.

Flaky Test Eradication

Conducted a systematic audit of all 70+ Cypress scripts, categorising every flaky test by severity and root cause. All high and medium priority flakiness was eliminated by mid-February through timing fixes, selector stabilisation, state management improvements, and async handling corrections. The suite went from over 80% of scripts showing high or medium priority flakiness to effectively none.
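One way to make severity buckets like these concrete is to replay each test several times against the same commit and classify it by how often it fails. The sketch below is illustrative only: the run-history shape, the helper names (`classifyFlakiness`, `auditSuite`), the spec names, and the 20% and 5% cut-offs are all assumptions, not the criteria used in the Noteefy audit.

```javascript
// Assumed cut-offs for illustration; tune these to the team's own tolerance.
const HIGH_THRESHOLD = 0.2; // fails in 20% or more of repeated runs
const MEDIUM_THRESHOLD = 0.05; // fails in 5% or more of repeated runs

// Classifies one test from its outcomes ("pass" or "fail") across
// repeated runs of the same commit.
function classifyFlakiness(outcomes) {
  const failures = outcomes.filter((o) => o === "fail").length;
  const passes = outcomes.length - failures;
  // Always passing is stable. Always failing is a broken test or a real
  // defect, which is a different problem from flakiness.
  if (failures === 0) return "stable";
  if (passes === 0) return "broken";
  const failureRate = failures / outcomes.length;
  if (failureRate >= HIGH_THRESHOLD) return "high";
  if (failureRate >= MEDIUM_THRESHOLD) return "medium";
  return "low";
}

// Builds a report mapping each test name to its classification.
function auditSuite(history) {
  const report = {};
  for (const [name, outcomes] of Object.entries(history)) {
    report[name] = classifyFlakiness(outcomes);
  }
  return report;
}

// Hypothetical history for three spec files.
const history = {
  "waitlist/create.cy.js": ["pass", "fail", "pass", "fail", "pass"],
  "confirm/reminder.cy.js": ["pass", "pass", "pass", "pass", "pass"],
  "leads/import.cy.js": ["fail", "fail", "fail", "fail", "fail"],
};
console.log(auditSuite(history));
```

Separating "broken" from "flaky" matters for prioritisation: a test that fails every time points at a defect or a stale test, whereas an intermittent failure usually points at timing, state, or selector problems.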

On-Demand GitLab Pipeline

Built a configurable manual trigger into the GitLab CI/CD pipeline, allowing the suite to be executed on-demand without requiring a deployment. Developers and QA can now run the full suite, specific test files, or targeted subsets at any point in the development cycle.
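A manual trigger of this kind usually passes run-time variables into the job, and a small wrapper translates them into Cypress CLI arguments. The sketch below assumes two hypothetical pipeline variables, `E2E_SPEC` and `E2E_BROWSER`, and a hypothetical `buildCypressArgs` helper; only `cypress run`, `--spec`, and `--browser` are real Cypress CLI features. The variables actually used in the Noteefy pipeline are not described in this case study.

```javascript
// Translates pipeline variables into `cypress run` arguments.
// E2E_SPEC and E2E_BROWSER are assumed names, not the client's variables.
function buildCypressArgs(env) {
  const args = ["run"];
  const spec = (env.E2E_SPEC || "").trim();
  if (spec) {
    // --spec accepts a comma-separated list of files or glob patterns.
    args.push("--spec", spec);
  }
  const browser = (env.E2E_BROWSER || "").trim();
  if (browser) {
    args.push("--browser", browser);
  }
  // With no variables set, this falls back to a full-suite run.
  return args;
}

// Example: a manual run targeting one folder of specs (hypothetical path).
const args = buildCypressArgs({ E2E_SPEC: "cypress/e2e/waitlist/**/*.cy.js" });
console.log(["npx", "cypress", ...args].join(" "));
```

Returning an argument array rather than a command string lets the caller hand it to `child_process.spawn` directly, so glob patterns reach Cypress without being expanded by a shell first.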

Custom Failed-Test Rerun Script

Engineered a custom failed-test rerun mechanism without requiring a Cypress Business subscription. The script identifies failed test files from the previous run and reruns only those — eliminating full-suite re-execution for isolated failures and recovering CI minutes previously lost to unnecessary reruns.
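The case study does not describe how the rerun script is implemented. One plausible approach, sketched here under that caveat, is to read the result of the first run and collect the spec files with at least one failure. The input shape is modelled on the result object returned by Cypress's Module API (`cypress.run()`), which reports `spec.relative` and `stats.failures` per run; the `failedSpecs` and `rerunArgs` helpers and the spec names are hypothetical.

```javascript
// Returns the relative paths of spec files that had at least one failing test.
function failedSpecs(results) {
  // A run that could not start has no `runs` array, so default to empty.
  return (results.runs || [])
    .filter((run) => run.stats && run.stats.failures > 0)
    .map((run) => run.spec.relative);
}

// Builds the `cypress run` arguments for a rerun limited to failed specs,
// or returns null when nothing failed and no rerun is needed.
function rerunArgs(results) {
  const specs = failedSpecs(results);
  return specs.length > 0 ? ["run", "--spec", specs.join(",")] : null;
}

// Trimmed example of a first-run result with hypothetical spec names.
const firstRun = {
  runs: [
    { spec: { relative: "cypress/e2e/waitlist.cy.js" }, stats: { failures: 0 } },
    { spec: { relative: "cypress/e2e/confirm.cy.js" }, stats: { failures: 2 } },
    { spec: { relative: "cypress/e2e/leads.cy.js" }, stats: { failures: 1 } },
  ],
};
console.log(rerunArgs(firstRun));
```

Rerunning at spec-file granularity keeps the mechanism simple: the second job only needs a list of paths, which can be passed between CI jobs as a small artifact.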

Execution Time Reduction

Reduced full suite runtime from 12–15 minutes to under 5 minutes — a 70%+ improvement. The reduction came from a combination of flakiness elimination (fewer retries), pipeline configuration optimisation, and smarter test execution ordering.
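As a quick sanity check on the headline figure, a 70% reduction from the original 12–15 minute range corresponds to roughly 3.6 to 4.5 minutes, which is consistent with a runtime of under 5 minutes. The helper names below are illustrative.

```javascript
// Percentage reduction between two runtimes, both in minutes.
function reductionPercent(beforeMin, afterMin) {
  return ((beforeMin - afterMin) / beforeMin) * 100;
}

// Runtime in minutes that remains after a given percentage reduction.
function runtimeAfter(beforeMin, percent) {
  return beforeMin * (1 - percent / 100);
}

console.log(runtimeAfter(12, 70).toFixed(1)); // "3.6"
console.log(runtimeAfter(15, 70).toFixed(1)); // "4.5"
```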

Pipeline Failure Monitoring & Triage

Set up a dedicated Slack channel for real-time CI failure notifications, providing the team with immediate visibility into pipeline failures without manual monitoring. Structured failure messages include test names, error context, and direct links to the GitLab job — enabling rapid triage.
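A structured failure message of this kind can be assembled from the test results plus GitLab's predefined job variables and posted to a Slack incoming webhook. The sketch below is an assumption about how such a notifier might look, not the client's implementation: `buildFailureMessage`, `notifySlack`, the `failures` shape, and the message layout are hypothetical, while `CI_PIPELINE_ID` and `CI_JOB_URL` are real GitLab predefined variables and `<url|label>` is Slack's link syntax.

```javascript
// Builds a Slack incoming-webhook payload for a failed pipeline job.
// `failures` is an assumed shape: [{ test: string, error: string }].
function buildFailureMessage(failures, env) {
  const lines = failures.map((f) => `• *${f.test}*: ${f.error}`);
  return {
    text: [
      `Cypress run failed (pipeline ${env.CI_PIPELINE_ID})`,
      ...lines,
      `<${env.CI_JOB_URL}|Open GitLab job>`,
    ].join("\n"),
  };
}

// Posts the payload to a webhook URL. Requires Node 18+ for global fetch.
async function notifySlack(webhookUrl, payload) {
  const res = await fetch(webhookUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
  if (!res.ok) throw new Error(`Slack webhook returned ${res.status}`);
}

// Example with placeholder values; nothing is sent here.
const payload = buildFailureMessage(
  [{ test: "books a tee time from the waitlist", error: "Timed out retrying after 4000ms" }],
  { CI_PIPELINE_ID: "1234", CI_JOB_URL: "https://gitlab.example.com/group/project/-/jobs/1" }
);
console.log(payload.text);
```

Keeping message construction separate from the network call makes the formatting easy to unit test without touching Slack.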

Bonus Delivery: Testberry — LLM Evaluation Tool

Built Testberry, a purpose-built LLM evaluation platform for validating the quality and accuracy of Noteefy's AI Pro Shop Assistant chatbot responses. Features: evaluation dashboard, run details and results views, ground truth dataset management, A/B testing across different chatbot configurations, and Slack integration for evaluation alerts — all delivered as a bonus outside the core automation scope.
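To illustrate what ground-truth evaluation with A/B comparison involves, the sketch below scores each chatbot response against an expected answer and averages the scores per configuration. Testberry's actual scoring method is not described in this case study; token-overlap F1 is a deliberately simple stand-in chosen for illustration, and every name here (`overlapF1`, `evaluateConfig`, the dataset fields, the two stub configurations) is hypothetical.

```javascript
// Lowercases text and splits it into alphanumeric word tokens.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Token-overlap F1 between a response and its ground-truth answer (0 to 1).
function overlapF1(response, expected) {
  const resp = tokenize(response);
  const exp = tokenize(expected);
  if (resp.length === 0 || exp.length === 0) return 0;
  // Count expected tokens so repeated words are only matched once each.
  const remaining = new Map();
  for (const t of exp) remaining.set(t, (remaining.get(t) || 0) + 1);
  let common = 0;
  for (const t of resp) {
    const n = remaining.get(t) || 0;
    if (n > 0) {
      common += 1;
      remaining.set(t, n - 1);
    }
  }
  if (common === 0) return 0;
  const precision = common / resp.length;
  const recall = common / exp.length;
  return (2 * precision * recall) / (precision + recall);
}

// Mean score of one configuration over a ground-truth dataset.
// `answer` stands in for a call to that chatbot configuration.
function evaluateConfig(dataset, answer) {
  const scores = dataset.map((item) => overlapF1(answer(item.question), item.expected));
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}

// Hypothetical one-item dataset and two stub configurations.
const dataset = [{ question: "What time do you open?", expected: "We open at 7 am" }];
const configA = () => "We open at 7 am";
const configB = () => "Please call the pro shop";
console.log({ A: evaluateConfig(dataset, configA), B: evaluateConfig(dataset, configB) });
```

Lexical overlap penalises correct answers that are phrased differently, so a production evaluator would more likely rely on semantic similarity or a rubric-based judge. The overall structure stays the same either way: a dataset, a scoring function, and a per-configuration aggregate.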

Key Metrics

The numbers.

~0%: High/Medium Flakiness (from 80%)
70%: CI Execution Time Reduced
<5 min: Full Pipeline Duration
100+: Automation Scripts

Results & Impact

What changed.

~0%: High/Medium Flakiness Remaining
Down from 80%+ on arrival. Every high and medium priority flaky test was systematically identified and resolved.

70%: CI Execution Time Cut
Full suite runtime went from 12–15 minutes to under 5 minutes. The pipeline is now a development asset, not a bottleneck.

<5 min: Full Suite Runtime
Under 5 minutes from trigger to result: fast enough to run frequently, reliable enough to trust as a quality signal.

100+: Automation Scripts
The suite grew from 70+ to 100+ scripts on a stable, flakiness-free foundation during the enhancement phase.

On-Demand: Manual Pipeline Trigger Built
The suite can now run at any point in the development cycle without requiring a deployment trigger.

Testberry: LLM Evaluation Platform Delivered
A full purpose-built evaluation tool for Noteefy's AI Pro Shop Assistant, delivered as a bonus outside the original engagement scope.

Tools & Technology

The stack.

Automation: Cypress (JavaScript), 100+ scripts on a stable, flakiness-free foundation
CI/CD: GitLab Pipelines, on-demand trigger plus deployment-triggered execution
AI Coding Tool: Cursor AI for script authoring, refactoring, and fixes
Code Review: Greptile, automated code review integration
Failure Monitoring: Slack, real-time CI failure notifications with structured triage context
Custom Tooling: Failed-test rerun script (without Cypress Business plan)
Bonus: Testberry, LLM evaluation platform for AI Pro Shop Assistant validation

The Takeaway

The Noteefy engagement is a case study in pipeline rehabilitation. By addressing flakiness at its root cause — not patching symptoms — and engineering custom tooling within the client's existing subscription constraints, the team transformed a slow, unreliable CI environment into a fast, trustworthy foundation for continuous quality delivery. And then they went further: delivering Testberry, a purpose-built LLM evaluation platform for the AI Pro Shop Assistant — bringing structured, measurable quality assurance to Noteefy's AI feature surface, and demonstrating what it means to think beyond the brief.