Zulip Cloud availability incident

Incident Report for Zulip Cloud

Postmortem

Starting at 2025-01-23 09:24 UTC, Zulip Cloud began to experience an intermittent outage; users may have seen “Server error” responses when attempting to access Zulip, or slower responses if they succeeded.

These intermittent effects were caused by Zulip Cloud running out of memory as it attempted to process and respond to a very small number of extremely memory-intensive requests. Parts of the Zulip Cloud service automatically restarted to handle this memory pressure, with varying impacts; at its worst, this caused a complete outage from 09:46 UTC to 09:55 UTC.

Starting at 12:04 UTC, Zulip Cloud began to have difficulty responding to the volume of users' requests, and began serving “Server error” responses more consistently, due to being over capacity. This may have been triggered by our attempts to diagnose the memory issues; we are still investigating the contributing causes.

This outage lasted until 12:57 UTC, when we restored service by temporarily disabling the user presence API endpoints which were contributing the most load to the system; this allowed users to load their Zulip Cloud accounts. After access patterns stabilized, we re-enabled the user presence API.

At 15:50 UTC, we identified the bot messages that had caused the original out-of-memory issues, mitigated their contribution by deleting them, and conveyed that information to owners of the affected organization. This addressed the immediate risk of the intermittent outages that started the incident.

We apologize for any disruption caused by this incident, and are working on addressing the underlying causes, so that this doesn’t happen again. Ongoing follow-ups:

Investigating the contributing factors which led to the over-capacity issue.
Creating a more robust solution to the memory issues which this incident uncovered.
Increasing the number of links to this status page, and broadening who can update it, so that it is updated in a more timely fashion.

Posted Jan 23, 2025 - 19:00 UTC

Resolved

This incident has been resolved.

Posted Jan 23, 2025 - 16:00 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jan 23, 2025 - 12:57 UTC

Investigating

We are currently investigating this issue.

Posted Jan 23, 2025 - 12:29 UTC

This incident affected: Zulip Cloud and Mobile Push Notification Service.