Reduced application performance and intermittent issues saving custom fields

Incident Report for AgileCase

Postmortem

Summary

Following a routine AWS infrastructure upgrade at the start of last week, the application experienced progressively degraded performance. While core functionality such as reads, email processing, and standard field updates remained largely operational, significant issues were observed for customers who make use of custom fields. These included slow performance, frequent save failures, and inconsistent execution of on-save scripts.

The issue was ultimately traced to degraded performance on the underlying storage. It was resolved by migrating the affected drive to a new host and triggering a failover, which restored normal performance.

Impact

The application remained partially usable. The following continued to work, but with degraded performance and intermittent timeouts for customers with a medium-to-large number of custom fields configured:
- Read operations
- Email sending
- Standard field updates (largely functional)
However:
- Custom field saves were significantly degraded.
- Frequent save failures occurred for case types with custom fields configured.
- On-save scripts were unreliable or unavailable.
Overall user experience was poor and inconsistent, with significantly degraded performance in key workflows.

Timeline (High Level)

Monday (AWS Upgrade)
Routine but mandatory AWS infrastructure upgrade performed.
Post-upgrade (Initial Testing)
Initial checks showed no immediate issues. System appeared stable.
Following Days (Tuesday-Friday)
Gradual degradation in performance, particularly affecting custom field operations.
During Investigation
- Several unrelated minor incidents occurred, complicating diagnosis
- Customers with different types of configuration had different experiences
- Some mitigations appeared to resolve the issue temporarily, leading to false confidence and “false starts”
- Root cause was not immediately isolated due to delayed onset and mixed signals
Resolution
In collaboration with AWS, the affected storage was migrated to a new host and failover was triggered. This resolved the performance issues instantly.

Resolution

Worked jointly with AWS to investigate infrastructure-level performance
Migrated the affected drive to a new host
Triggered failover to restore normal performance
Post-resolution monitoring confirmed stability, including successful email batch execution

Lessons Learned

1. Extend production freeze after infrastructure changes

Immediate post-upgrade validation is not always sufficient
Introduce a longer stabilization period after forced infrastructure upgrades
Avoid releasing unrelated changes during this window
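
Because the degradation in this incident built up over several days, a one-off check straight after the upgrade could not have caught it. As a rough sketch only, the snippet below compares the median save latency in a recent window against a baseline captured before the change. The function name `latency_drift`, the 1.5x threshold, and all sample values are hypothetical and are not taken from AgileCase's monitoring.

```python
from statistics import median


def latency_drift(baseline_ms, recent_ms, threshold=1.5):
    """Compare recent median latency against a pre-change baseline.

    baseline_ms: latency samples collected before the infrastructure change.
    recent_ms:   latency samples from the current observation window.
    threshold:   assumed ratio at or above which the window counts as degraded.
    Returns a (ratio, degraded) tuple.
    """
    if not baseline_ms or not recent_ms:
        raise ValueError("both sample lists must be non-empty")
    ratio = median(recent_ms) / median(baseline_ms)
    return ratio, ratio >= threshold


# Illustrative numbers only, not measurements from this incident.
baseline = [100, 110, 90, 105, 95]
daily_samples = {
    "Mon": [100, 105, 98, 102, 110],
    "Wed": [130, 140, 125, 150, 135],
    "Fri": [210, 190, 250, 205, 220],
}

for day, samples in daily_samples.items():
    ratio, degraded = latency_drift(baseline, samples)
    print(f"{day}: {ratio:.2f}x baseline, degraded={degraded}")
```

Running a comparison like this daily throughout the stabilization period, rather than once after the upgrade, is one way a slow drift could be surfaced before it reaches customers.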

2. Escalate earlier to controlled downtime

Prolonged partial degradation can carry a higher operational cost than a short, planned outage
An earlier decision to take a short, controlled downtime may:
- Accelerate resolution
- Reduce user impact overall

3. Improve infrastructure resilience

The investigation identified opportunities for hardware improvements
Newer options are now available that were not previously viable

Next Steps

Implement a post-infrastructure-change freeze period (1–2 weeks)
Introduce stricter change control during active incidents
Review and schedule targeted hardware improvements
Keep changes to the live environment to a minimum for 1–2 weeks following resolution to allow for:
- System stability
- Customer workflow normalization
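
The post-infrastructure-change freeze period listed above could be enforced mechanically in a release pipeline. The following is a minimal sketch under stated assumptions: the 14-day window takes the upper end of the 1–2 week range, the emergency-fix exception is an assumption not stated in this report, and the function names and dates are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed: upper end of the 1-2 week range mentioned in Next Steps.
FREEZE_PERIOD = timedelta(days=14)


def in_freeze(last_infra_change, now, freeze=FREEZE_PERIOD):
    """Return True while `now` falls inside the freeze window."""
    return last_infra_change <= now < last_infra_change + freeze


def may_deploy(last_infra_change, now, is_emergency_fix=False):
    """Allow a release outside the freeze window, or for an emergency fix.

    The emergency-fix exception is an assumption made for this sketch.
    """
    return is_emergency_fix or not in_freeze(last_infra_change, now)


# Hypothetical date for illustration; not the date of the actual upgrade.
upgrade = datetime(2026, 1, 5, 9, 0, tzinfo=timezone.utc)

print(may_deploy(upgrade, upgrade + timedelta(days=3)))
print(may_deploy(upgrade, upgrade + timedelta(days=15)))
print(may_deploy(upgrade, upgrade + timedelta(days=3), is_emergency_fix=True))
```

A gate like this keeps unrelated releases out of the window without relying on anyone remembering the date of the last infrastructure change.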

Current Status

System performance restored
Email processing and core workflows operating normally
Monitoring in place to ensure continued stability

Posted Apr 11, 2026 - 15:54 BST

Resolved

The application remained usable for read operations, email sending, and updates to standard fields. However, performance was significantly degraded for customers who make use of custom fields, with frequent save failures and associated on-save scripts often failing to execute or becoming unavailable.

Posted Apr 10, 2026 - 09:30 BST