Description
We would like to share a major production outage we experienced that was caused by litestream and was very difficult to diagnose. We hope that this report can help other litestream users and be used to improve litestream in the future.
We were running litestream v0.3.13, which is the most recent release. (I see that there are some development changes that may improve the situation described here.)
The symptom of the outage was that our app could not start any database operations. The core error was `SQLITE_BUSY`, but due to the 100% failure rate we also ended up seeing lots of other high-level errors such as "Timed out waiting for acquire database connection": our 5s `busy_timeout` was quickly backing up, and we started seeing timeouts at the connection-pool level which obscured the underlying error. This was very perplexing to us because we only have one write connection in the app, so the only possible sources of other write transactions were litestream or human operators. Neither of these seemed, at first glance, like it should be holding extremely long write locks.
The TL;DR of the outage is fairly simple now that we understand it:
- Our app has many long-lived read transactions.
- Very occasionally (never until this outage) this causes the WAL to exceed `max-checkpoint-page-count`.
- Litestream triggers a `RESTART` checkpoint.
- This checkpoint blocks all writes while waiting on all readers to finish, which could take a very long time.
- Our app was unable to perform writes until the checkpoint completed. (A minimal reproduction sketch of this sequence follows the list.)
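For anyone who wants to see the failure mode for themselves, here is a minimal sketch of the priority inversion using plain SQLite, with no litestream involved. It assumes the `github.com/mattn/go-sqlite3` driver; the file name, timings, and busy timeouts are illustrative rather than our production values. The checkpointing connection stands in for litestream issuing a `RESTART` checkpoint while a long-lived read transaction is open.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	ctx := context.Background()

	// WAL mode plus a 5s busy_timeout on every pooled connection,
	// mirroring the application described above.
	db, err := sql.Open("sqlite3", "file:repro.db?_journal_mode=WAL&_busy_timeout=5000")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS t (x INTEGER)`); err != nil {
		panic(err)
	}
	if _, err := db.Exec(`INSERT INTO t (x) VALUES (1)`); err != nil {
		panic(err)
	}

	// 1. A long-lived read transaction pinned to its own connection.
	//    Its snapshot references the WAL, so a RESTART checkpoint must wait for it.
	reader, err := db.Conn(ctx)
	if err != nil {
		panic(err)
	}
	readTx, err := reader.BeginTx(ctx, nil)
	if err != nil {
		panic(err)
	}
	var n int
	if err := readTx.QueryRowContext(ctx, `SELECT count(*) FROM t`).Scan(&n); err != nil {
		panic(err)
	}

	// 2. A RESTART checkpoint, standing in for litestream once the WAL passes
	//    its threshold. It takes the writer lock and then waits for the reader.
	//    The long busy_timeout here mimics a checkpointer prepared to wait.
	go func() {
		cp, err := db.Conn(ctx)
		if err != nil {
			panic(err)
		}
		defer cp.Close()
		if _, err := cp.ExecContext(ctx, `PRAGMA busy_timeout = 60000`); err != nil {
			panic(err)
		}
		var busy, logFrames, checkpointed int
		if err := cp.QueryRowContext(ctx, `PRAGMA wal_checkpoint(RESTART)`).
			Scan(&busy, &logFrames, &checkpointed); err != nil {
			panic(err)
		}
		fmt.Printf("checkpoint finished: busy=%d log=%d checkpointed=%d\n", busy, logFrames, checkpointed)
	}()
	time.Sleep(500 * time.Millisecond) // let the checkpoint grab the writer lock

	// 3. The app's single write connection now fails with SQLITE_BUSY after its
	//    5s busy_timeout expires, even though no application writer is active.
	start := time.Now()
	_, err = db.ExecContext(ctx, `INSERT INTO t (x) VALUES (2)`)
	fmt.Printf("write attempt after %s: %v\n", time.Since(start).Round(time.Millisecond), err)

	// Once the reader finishes, the checkpoint (and then writes) can proceed.
	_ = readTx.Rollback()
	_ = reader.Close()
	time.Sleep(time.Second)
}
```

If you swap `RESTART` for `PASSIVE` in step 2, the write in step 3 goes through immediately: a passive checkpoint never takes the writer lock and never invokes the busy handler, which is why the workaround described further down is safe for our workload.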
Litestream's behaviour is not terrible in isolation. It will trigger `RESTART` checkpoints once the WAL exceeds the (undocumented?) `max-checkpoint-page-count`, and `TRUNCATE` checkpoints once the WAL exceeds the (hardcoded) `DefaultTruncatePageN`. As far as we can tell this behaviour is completely undocumented. https://litestream.io/tips/#busy-timeout suggests that litestream may execute "short write locks" but never mentions indefinitely long writer locks. `RESTART` checkpoints can also be triggered by setting `validation-interval`, which is again undocumented.
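To make the behaviour we observed concrete, here is a rough paraphrase of the threshold logic as we understand it. This is an illustrative sketch, not litestream's actual code; the identifiers and thresholds are placeholders.

```go
package main

import "fmt"

// checkpointMode is an illustrative paraphrase of the behaviour we observed.
// It is NOT litestream's actual code; names and defaults are placeholders.
// (Per the note above, a RESTART can also be issued via validation-interval.)
func checkpointMode(walPages, maxCheckpointPageN, truncatePageN int) string {
	switch {
	case walPages >= truncatePageN: // the hardcoded DefaultTruncatePageN
		return "TRUNCATE" // takes the writer lock, waits for all readers, truncates the WAL
	case walPages >= maxCheckpointPageN: // max-checkpoint-page-count
		return "RESTART" // takes the writer lock and waits for all readers
	default:
		return "PASSIVE" // copies what it can without blocking readers or writers
	}
}

func main() {
	// Placeholder thresholds purely for illustration.
	fmt.Println(checkpointMode(500, 10_000, 500_000))     // PASSIVE
	fmt.Println(checkpointMode(20_000, 10_000, 500_000))  // RESTART
	fmt.Println(checkpointMode(600_000, 10_000, 500_000)) // TRUNCATE
}
```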
We built our app with very strict rules on the write connection (as there can only be one writer at a time with SQLite in WAL mode) but were very loose with read transactions, as there can be any number of them and they don't block readers or writers. We knew that long-lived read transactions would block WAL truncation and decided that this wasn't an issue for our use case. We had an ample disk space buffer and WAL size monitoring to ensure that we didn't run out of disk space.
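As an aside, the WAL size monitoring mentioned above needs nothing clever; here is a minimal sketch of one way to do it (the path and alert threshold are illustrative, not our production setup).

```go
package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	// The WAL lives next to the database file with a "-wal" suffix.
	// Both the path and the threshold below are illustrative.
	const walPath = "/var/lib/app/app.db-wal"
	const alertBytes = 1 << 30 // alert at 1 GiB of un-checkpointed WAL

	info, err := os.Stat(walPath)
	if err != nil {
		log.Fatalf("stat WAL: %v", err)
	}
	if info.Size() > alertBytes {
		fmt.Printf("WAL is %d bytes, above the %d byte alert threshold\n", info.Size(), alertBytes)
	}
}
```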
Our plan falls apart when an unexpected `FULL` (or higher) checkpoint is triggered. This causes a "priority inversion" where our write connection is blocked on readers. These readers may be active for tens of minutes, which is a completely unacceptable amount of time for our app to be unable to write anything. (The acceptable time is in the hundreds-of-milliseconds range.)
Our current workaround is to patch litestream to only ever trigger `PASSIVE` checkpoints (as it does most of the time). We have strong evidence that, since running this patch, we have entered situations that would have triggered the outage again and recovered with no degradation. Long-term, our team has decided to move away from litestream due to the loss of trust caused by this hidden behaviour.
Suggestions:
- Document the cases where litestream may do anything beyond "short write locks", notably including any time it will execute a checkpoint of `FULL` or stronger.
- Document and make configurable `max-checkpoint-page-count` and `DefaultTruncatePageN`. Explain how these values can be set so that nothing but `PASSIVE` checkpoints is ever triggered (for example by setting them arbitrarily high, or by supporting explicit "disabled" values).
- In a future breaking release, consider turning off non-`PASSIVE` checkpoints by default. Litestream cannot know how long other types of checkpoints will hold a full-database lock, because it depends on the other clients of the database. This effectively indefinite "exclusive" lock is too risky to trigger without user opt-in. The documentation can then explain the settings available to limit WAL size and the associated cost (blocking all writes until all current transactions complete).
Another related note: while we were reading the documentation I noticed that https://litestream.io/tips/#disable-autocheckpoints-for-high-write-load-servers suggests setting `PRAGMA wal_autocheckpoint = 0`. I would exercise caution before applying this setting: if litestream is not running (and you don't have any other process that would trigger a checkpoint) it will cause unbounded WAL growth. This may make sense if you need to preserve all incremental writes, but I don't think it should be framed as the correct solution without explaining the trade-offs. For users who don't require every incremental write, a much safer recommendation would be to set this value to something comfortably above `max-checkpoint-page-count` but also comfortably within your available disk space. This ensures that even if litestream is not running, your app will eventually keep the WAL size in check on its own.
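For example, something along these lines (the driver, path, and page count are illustrative). The key points are that SQLite's automatic checkpoint is a `PASSIVE` checkpoint, so it never blocks writers the way `RESTART`/`TRUNCATE` can, and that `wal_autocheckpoint` is a per-connection setting.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	// Path and numbers are illustrative.
	db, err := sql.Open("sqlite3", "file:app.db?_journal_mode=WAL&_busy_timeout=5000")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Instead of disabling auto-checkpoints entirely, raise the threshold so it
	// only kicks in well above litestream's own checkpoint thresholds while
	// staying comfortably within the disk budget. At the default 4 KiB page
	// size, 50,000 pages is roughly 200 MiB of WAL.
	//
	// Note: SQLite's automatic checkpoint runs in PASSIVE mode, and
	// wal_autocheckpoint applies per connection, so in a pooled setup it should
	// be set on every connection (e.g. via a driver connect hook).
	if _, err := db.Exec(`PRAGMA wal_autocheckpoint = 50000`); err != nil {
		log.Fatal(err)
	}
}
```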