Situation
A client operating a SaaS platform had growing concerns about platform stability.
Rapid feature development had taken priority during the initial growth stage, and a few significant outages followed. The causes varied: deployments had become difficult because of the sheer number of features, and the infrastructure suffered from faults such as database connection problems.
As new customers came on board, reliability and uptime became critical to the client's business. Customer complaints about stability were mounting and had begun to affect sales calls. The client engaged us to address these challenges and improve platform stability.
Task
Our task was to identify and implement solutions to enhance platform reliability. Specifically, the client sought improvements to its testing, alerting, and deployment infrastructure to prevent future outages and ensure a stable user experience as its customer base grew.
Action
We structured the project into three distinct workstreams and executed the following steps over six weeks:
- Pre-emptive Testing Infrastructure:
  - Implemented an automated testing suite using Playwright to simulate user interactions and validate feature functionality.
  - Created three distinct deployment environments:
    - Dev: An unstable environment for engineers to push and test new changes.
    - QA: A staging environment, replicating production at a smaller scale, where automated tests were run to ensure feature stability.
    - Prod: The production environment used by customers.
  - Automated test results were emailed to the team every 24 hours for visibility and quick action on issues.
- Recovery and Alerting Mechanisms:
  - Worked closely with the client to identify the highest-priority features by asking: “What types of issues could block users from using the platform entirely?” The answer included authentication and core feature uptime, both essential to the platform’s value proposition.
  - Established an on-call schedule and service level agreements (SLAs) for responding to issues: immediate response for Sev-1 tickets, and more relaxed timelines for Sev-2 tickets and feature requests.
  - Used PagerDuty for alerting and Grafana for dashboards of key platform health metrics, with alerts triggered by predefined thresholds. Past outages guided where we set those thresholds.
- On-Call Scheduling:
  - Designed an on-call system to ensure round-the-clock readiness.
  - Held weekly check-in meetings with the client and established a quarterly maintenance schedule to make minor adjustments and ensure continued stability.
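To illustrate how the alerting and on-call pieces above fit together, here is a minimal Python sketch. It is not the client's implementation, and it makes no PagerDuty or Grafana API calls. The metric names, threshold values, SLA windows, rotation members, and the helpers `on_call_engineer` and `evaluate` are all hypothetical, chosen only to show the flow from a breached threshold to a severity, an assignee, and a response deadline.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical response windows. The project only specified "immediate" for
# Sev-1 and "relaxed" for Sev-2, so these durations are placeholders.
RESPONSE_SLA = {
    "sev1": timedelta(minutes=15),
    "sev2": timedelta(hours=24),
}


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float    # an alert fires when the observed value exceeds this
    severity: str


# Hypothetical thresholds, standing in for values derived from past outages.
THRESHOLDS = [
    Threshold("auth_error_rate", 0.05, "sev1"),
    Threshold("db_connections_in_use_ratio", 0.90, "sev1"),
    Threshold("p95_latency_seconds", 2.0, "sev2"),
]

# Hypothetical weekly rotation, starting on Monday 2024-01-01 (UTC).
ROTATION = ["alice", "bob", "carol"]
ROTATION_START = datetime(2024, 1, 1, tzinfo=timezone.utc)


@dataclass(frozen=True)
class Alert:
    metric: str
    severity: str
    assignee: str
    respond_by: datetime


def on_call_engineer(now: datetime) -> str:
    """Return whoever is on call in the week containing `now`."""
    weeks_elapsed = (now - ROTATION_START).days // 7
    return ROTATION[weeks_elapsed % len(ROTATION)]


def evaluate(metric: str, value: float, now: datetime) -> Optional[Alert]:
    """Return an Alert if `value` breaches the metric's threshold, else None."""
    for threshold in THRESHOLDS:
        if threshold.metric == metric and value > threshold.limit:
            return Alert(
                metric=metric,
                severity=threshold.severity,
                assignee=on_call_engineer(now),
                respond_by=now + RESPONSE_SLA[threshold.severity],
            )
    return None
```

With these placeholder values, an `auth_error_rate` reading of 0.08 produces a Sev-1 alert assigned to the current on-call engineer, while a `p95_latency_seconds` reading of 1.2 produces no alert. In practice the thresholds would live in the alerting tool's own configuration; keeping them in one table, as here, simply makes it easy to review them against past incidents.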
Result
- The project was completed on time and within budget, delivering significant improvements to platform stability.
- In the several months since implementation, the client has not experienced a single outage.
- Our automated testing and alerting infrastructure demonstrated the ability to catch and mitigate the kinds of issues that had previously caused outages.
- The client continues to rely on our team for on-call support as they onboard new customers.
- The improved stability has enhanced customer satisfaction and bolstered sales efforts by addressing a critical prerequisite for platform adoption.
Tech Stack
- Testing: Playwright
- Deployments: GitHub Actions
- Alerting: PagerDuty
- Dashboarding: Grafana
Client Feedback
The client expressed confidence in the new stability measures and appreciated the proactive approach to monitoring and response. They continue to see these improvements as a cornerstone of their SaaS platform’s growth and reliability.