At 2025-01-23 09:24 UTC, Zulip Cloud began to experience an intermittent outage; users may have seen “Server error” responses when attempting to access Zulip, or slower responses when their requests succeeded.
These intermittent effects were caused by Zulip Cloud running out of memory as it attempted to process and respond to a very small number of extremely memory-intensive requests. Parts of the Zulip Cloud service automatically restarted to handle this memory pressure, with varying impacts; at its worst, this caused a complete outage from 09:46 UTC to 09:55 UTC.
At 12:04 UTC, Zulip Cloud began to have difficulty keeping up with the volume of user requests and, being over capacity, began serving “Server error” responses more consistently. This may have been triggered by our attempts to diagnose the memory issues; we are still investigating the contributing causes.
This outage lasted until 12:57 UTC, when we restored service by temporarily disabling the user presence API endpoints which were contributing the most load to the system; this allowed users to load their Zulip Cloud accounts. After access patterns stabilized, we re-enabled the user presence API.
At 15:50 UTC, we identified the bot messages that had caused the original out-of-memory issues, mitigated their contribution by deleting them, and conveyed that information to the owners of the affected organization. This addressed the immediate risk of the intermittent outages that started the incident.
We apologize for any disruption caused by this incident, and are working on addressing the underlying causes, so that this doesn’t happen again. Ongoing follow-ups: