Summary
Following a routine AWS infrastructure upgrade at the start of last week, the application experienced progressively degraded performance. While core functionality such as reads, email processing, and standard field updates remained largely operational, customers who use custom fields saw significant issues: slow performance, frequent save failures, and inconsistent execution of on-save scripts.
The issue was ultimately traced to degraded performance on the underlying storage. Migrating the affected drive to a new host and triggering a failover restored normal performance.
Impact
The application remained partially usable:
- Read operations remained functional, but with degraded performance and intermittent timeouts for customers with a medium-to-large number of custom fields configured.
- Email sending remained functional, but with degraded performance and intermittent timeouts for customers with a medium-to-large number of custom fields configured.
- Standard field updates remained largely functional, but with degraded performance and intermittent timeouts for customers with a medium-to-large number of custom fields configured.
However:
- Custom field saves were significantly degraded
- Frequent save failures occurred for case types with custom fields configured
- On-save scripts were unreliable or unavailable
The overall user experience was poor and inconsistent, with significantly degraded performance in key workflows.
Timeline (High Level)
Resolution
- Worked jointly with AWS to investigate infrastructure-level performance
- Migrated the affected drive to a new host
- Triggered failover to restore normal performance
- Post-resolution monitoring confirmed stability, including successful email batch execution
Lessons Learned
1. Extend production freeze after infrastructure changes
- Immediate post-upgrade validation is not always sufficient
- Introduce a longer stabilization period after forced infrastructure upgrades
- Avoid releasing unrelated changes during this window
2. Escalate earlier to controlled downtime
3. Improve infrastructure resilience
- Investigation identified opportunities for hardware improvements
- Newer options are now available that were not previously viable
Next Steps
Current Status
- System performance restored
- Email processing and core workflows operating normally
- Monitoring in place to ensure continued stability