Reduced application performance and intermittent issues saving custom fields

Incident Report for AgileCase

Postmortem

Summary

Following a routine AWS infrastructure upgrade at the start of last week, the application experienced progressively degraded performance. While core functionality such as reads, email processing, and standard field updates remained largely operational, significant issues were observed for customers who make use of custom fields. These included slow performance, frequent save failures, and inconsistent execution of on-save scripts.

The issue was ultimately traced to degraded performance on the underlying storage. It was resolved by migrating the affected drive to a new host and triggering a failover, which restored normal performance.

Impact

  • Application remained partially usable, but with degraded performance and intermittent timeouts for customers with a medium-to-large number of custom fields configured:

    • Read operations remained functional
    • Email sending remained functional
    • Standard field updates remained largely functional
  • However:

    • Custom field saves were significantly degraded
    • Frequent save failures occurred for case types with custom fields configured
    • On-save scripts were unreliable or unavailable
  • Overall user experience was poor and inconsistent, with significantly degraded performance in key workflows

Timeline (High Level)

  • Monday (AWS Upgrade)
    Routine but mandatory AWS infrastructure upgrade performed.
  • Post-upgrade (Initial Testing)
    Initial checks showed no immediate issues. System appeared stable.
  • Following Days (Tuesday-Friday)
    Gradual degradation in performance, particularly affecting custom field operations.
  • During Investigation

    • Several unrelated minor incidents occurred, complicating diagnosis
    • Customers with different types of configuration had different experiences
    • Some mitigations appeared to resolve the issue temporarily, leading to false confidence and “false starts”
    • Root cause was not immediately isolated due to delayed onset and mixed signals
  • Resolution
    In collaboration with AWS, the affected storage was migrated to a new host and failover was triggered. This resolved the performance issues instantly.

Resolution

  • Worked jointly with AWS to investigate infrastructure-level performance
  • Migrated the affected drive to a new host
  • Triggered failover to restore normal performance
  • Post-resolution monitoring confirmed stability, including successful email batch execution

Lessons Learned

1. Extend production freeze after infrastructure changes

  • Immediate post-upgrade validation is not always sufficient
  • Introduce a longer stabilization period after forced infrastructure upgrades
  • Avoid releasing unrelated changes during this window
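
One way to make a stabilization period measurable is to compare each day's latency against a pre-change baseline, rather than relying on a single post-upgrade check. The following is a minimal sketch under stated assumptions, not a description of the monitoring actually in place: the function name, the 1.5x threshold, and the sample figures are all hypothetical.

```python
from statistics import median

def degraded_days(baseline_ms, daily_ms, threshold=1.5):
    """Return labels of days whose median latency exceeds threshold x baseline.

    baseline_ms: latency samples (ms) collected before the change.
    daily_ms:    mapping of day label -> latency samples (ms) after the change.
    threshold:   hypothetical ratio above which a day counts as a regression.
    """
    base = median(baseline_ms)
    return [day for day, samples in daily_ms.items()
            if median(samples) > threshold * base]

# Illustrative numbers only, not measurements from this incident.
baseline = [100, 110, 95, 105]
after = {
    "Mon": [100, 108, 102],   # looks fine immediately after the change
    "Tue": [130, 140, 125],   # drifting upward, still under the threshold
    "Wed": [170, 180, 160],   # crosses the threshold
}
print(degraded_days(baseline, after))  # ['Wed']
```

A check like this would pass on the day of the change and flag the regression only later, which is the pattern a longer stabilization window is meant to catch.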

2. Escalate earlier to controlled downtime

  • Prolonged partial degradation has a higher operational cost than a short, planned outage
  • An earlier decision to take a short, controlled downtime may:

    • Accelerate resolution
    • Reduce user impact overall

3. Improve infrastructure resilience

  • Investigation identified opportunities for hardware improvements
  • Newer options are now available that were not previously viable

Next Steps

  • Implement a post-infrastructure-change freeze period (1–2 weeks)
  • Introduce stricter change control during active incidents
  • Review and schedule targeted hardware improvements
  • Maintain minimal changes to the live environment for 1–2 weeks following resolution to allow:

    • System stability
    • Customer workflow normalization
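
A freeze period of this kind can be enforced with a simple date check in a release pipeline. The sketch below is illustrative only: the function name, the 14-day default (the upper end of the 1–2 week range), and the example dates are assumptions, not details of the actual release process.

```python
from datetime import date, timedelta

def in_freeze(change_date, today, freeze_days=14):
    """Return True if `today` falls inside the freeze window.

    The window starts on change_date (inclusive) and lasts freeze_days days.
    """
    return change_date <= today < change_date + timedelta(days=freeze_days)

# Placeholder dates for illustration; not the dates of this incident.
infra_change = date(2026, 1, 5)
print(in_freeze(infra_change, date(2026, 1, 12)))  # True: day 7 of the freeze
print(in_freeze(infra_change, date(2026, 1, 19)))  # False: window has ended
```

A deployment script could call such a check and refuse non-urgent releases while it returns True.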

Current Status

  • System performance restored
  • Email processing and core workflows operating normally
  • Monitoring in place to ensure continued stability
Posted Apr 11, 2026 - 15:54 BST

Resolved

The application remained usable for read operations, email sending, and updates to standard fields. However, performance was significantly degraded for customers who make use of custom fields, with frequent save failures and associated on-save scripts often failing to execute or becoming unavailable.
Posted Apr 10, 2026 - 09:30 BST