Open
Labels
bug, release-blocker
Description
Bug Summary
Severity: CRITICAL - Complete Data Loss
Version: v0.5.0 (pre-release, main branch)
Impact: Database restoration fails completely after a specific but common failure scenario
Error Message
decode database: decode page 17743: copy page 17750 header:
nonsequential page numbers in snapshot transaction: 17742,17750
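For triage, the size of the gap can be read straight off the error text. The helper below is a minimal sketch, assuming the trailing `PREV,NEXT` pair names the last page decoded and the next page encountered; `missing_page_range` is a hypothetical name, not part of Litestream or ltx.

```shell
#!/bin/bash
# Hypothetical triage helper (not part of Litestream or ltx).
# Assumption: the trailing "PREV,NEXT" pair in the error is the last page that
# decoded cleanly followed by the next page number the decoder encountered.
missing_page_range() {
  local msg="$1" pair prev next
  # Take the last "digits,digits" pair that ends a line of the message.
  pair=$(printf '%s\n' "$msg" | grep -oE '[0-9]+,[0-9]+$' | tail -n 1)
  [ -n "$pair" ] || return 1
  prev=${pair%,*}
  next=${pair#*,}
  # No gap to report if the pages are adjacent or out of order.
  if (( next <= prev + 1 )); then
    return 1
  fi
  echo "$((prev + 1))-$((next - 1))"
}
```

Run against the message above, this reports pages 17743 through 17749 as absent from the snapshot, which appears consistent with the `decode page 17743` prefix of the error.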
Reproduction Steps
Scenario
- Litestream is replicating a database with continuous writes
- Litestream process is killed (crash, OOM, server restart, etc.)
- Database continues receiving writes while Litestream is down
- A non-PASSIVE checkpoint occurs (FULL, RESTART, or TRUNCATE)
- Litestream is restarted
- User attempts to restore from replica
- Result: Restore fails with nonsequential page error
Automated Reproduction Script
Save this as reproduce-critical-bug.sh:
#!/bin/bash
set -e
echo "Reproducing Critical Bug: Nonsequential Page Numbers"
echo "======================================================"
# Configuration
DB="/tmp/critical-bug-test.db"
REPLICA="/tmp/critical-bug-replica"
LITESTREAM="litestream" # or ./bin/litestream if using local build
LITESTREAM_TEST="litestream-test" # from PR #748
# Clean up
rm -f "$DB"* && rm -rf "$REPLICA"
# Step 1: Create test database (50MB)
echo "[1] Creating test database..."
$LITESTREAM_TEST populate -db "$DB" -target-size 50MB -table-count 2
# Step 2: Start Litestream
echo "[2] Starting Litestream..."
$LITESTREAM replicate "$DB" "file://$REPLICA" > /tmp/litestream.log 2>&1 &
LITESTREAM_PID=$!
sleep 3
# Step 3: Start continuous writes
echo "[3] Starting continuous writes..."
$LITESTREAM_TEST load -db "$DB" -write-rate 100 -duration 2m -pattern constant &
WRITE_PID=$!
# Step 4: Let it run normally
echo "[4] Running normally for 20 seconds..."
sleep 20
# Step 5: Kill Litestream (simulate crash)
echo "[5] Killing Litestream..."
kill -9 $LITESTREAM_PID
# Step 6: Continue writes without Litestream
echo "[6] Continuing writes (Litestream is down)..."
sleep 15
# Step 7: Execute checkpoint while Litestream is down
echo "[7] Executing FULL checkpoint..."
sqlite3 "$DB" "PRAGMA wal_checkpoint(FULL);"
# Step 8: Resume Litestream
echo "[8] Resuming Litestream..."
$LITESTREAM replicate "$DB" "file://$REPLICA" >> /tmp/litestream.log 2>&1 &
NEW_LITESTREAM_PID=$!
sleep 20
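# Stop everything and wait for both processes to exit before restoring
kill $WRITE_PID $NEW_LITESTREAM_PID 2>/dev/null || true
wait $WRITE_PID $NEW_LITESTREAM_PID 2>/dev/null || true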
# Step 9: Attempt restore (THIS FAILS)
echo "[9] Attempting restore..."
rm -f /tmp/restored.db
if $LITESTREAM restore -o /tmp/restored.db "file://$REPLICA"; then
  echo "SUCCESS: Restore completed"
else
  echo "CRITICAL BUG: Restore failed!"
  echo "Database cannot be restored after checkpoint during downtime"
fi
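The script above depends on three binaries (`litestream`, `litestream-test`, and `sqlite3`), and with `set -e` a missing one aborts the run partway through with a less obvious error. A small pre-flight check could be added near the top; this is a sketch, and `require_cmds` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical pre-flight helper for the reproduction script.
# Reports every missing command on stderr and returns non-zero if any is absent.
require_cmds() {
  local cmd missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing required command: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Suggested use in the script, after the Configuration block:
#   require_cmds "$LITESTREAM" "$LITESTREAM_TEST" sqlite3 || exit 1
```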
Root Cause Analysis
When Litestream resumes after being killed and detects a checkpoint occurred:
- The verify() function in db.go:650-738 detects WAL changes
- It triggers a full snapshot with info.snapshotting = true
- writeLTXFromDB() at db.go:946-991 creates the snapshot
- The function iterates pages 1 to commit sequentially
- Problem: Some pages don't exist (weren't modified), creating gaps
- The ltx decoder expects sequential pages and fails on gaps
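The failure mode in the last two points can be illustrated with a toy check. This is only a sketch of the "every page number must follow the previous one" expectation, not the ltx decoder's actual code, and `check_sequential` is a hypothetical name.

```shell
#!/bin/bash
# Toy illustration only; this is NOT the ltx decoder.
# Takes page numbers in the order a snapshot would emit them and fails on the
# first pair that is not consecutive, mimicking the error reported above.
check_sequential() {
  local prev=0 pgno
  for pgno in "$@"; do
    if (( prev != 0 && pgno != prev + 1 )); then
      echo "nonsequential page numbers: ${prev},${pgno}"
      return 1
    fi
    prev=$pgno
  done
  echo "ok: $# pages"
}
```

Calling `check_sequential 17741 17742 17750` fails on the pair `17742,17750`, the same pair as in the restore error, while a gap-free sequence passes.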
Impact
- Data Loss: Complete inability to restore the database
- Common Scenario: Any production outage + checkpoint = unrecoverable
- User Trust: Critical failure in disaster recovery tool
Affected Code
- db.go:950-989 - writeLTXFromDB() function
- replica.go:472 - dec.DecodeDatabaseTo(f) where error occurs
- github.com/superfly/ltx - Decoder expects sequential pages
Potential Solutions
- Skip non-existent pages in writeLTXFromDB()
- Modify ltx decoder to handle gaps during restore
- Full database scan on resume after checkpoint detection
- Track checkpoint state to handle WAL changes properly
Workaround
Users must avoid non-PASSIVE checkpoints when Litestream is not running. This is difficult to guarantee in production.
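For checkpoints that are issued by scripts or cron jobs, one way to approximate this workaround is to downgrade the requested mode to PASSIVE whenever no Litestream process is found. The sketch below makes several assumptions: the process is named `litestream`, `pgrep` and `sqlite3` are available, and both function names are hypothetical. It does not cover checkpoints issued by the application itself, and there is still a race if Litestream dies between the check and the checkpoint.

```shell
#!/bin/bash
# Hypothetical workaround sketch; not a Litestream feature.

# Print the checkpoint mode to use.
#   $1 = requested mode (PASSIVE, FULL, RESTART, TRUNCATE; case-insensitive)
#   $2 = "yes" if Litestream is currently running, anything else otherwise
safe_checkpoint_mode() {
  local requested running="$2"
  requested=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
  case "$requested" in
    PASSIVE|FULL|RESTART|TRUNCATE) ;;
    *) echo "unknown checkpoint mode: $1" >&2; return 2 ;;
  esac
  if [ "$running" = "yes" ]; then
    echo "$requested"
  else
    # Litestream is down: fall back to PASSIVE, per the workaround above.
    echo "PASSIVE"
  fi
}

# Run a checkpoint on database $1 with requested mode $2, downgraded if needed.
# Assumption: the replication process is visible to pgrep as "litestream".
guarded_checkpoint() {
  local db="$1" mode="$2" running=no
  if pgrep -x litestream >/dev/null 2>&1; then
    running=yes
  fi
  mode=$(safe_checkpoint_mode "$mode" "$running") || return
  sqlite3 "$db" "PRAGMA wal_checkpoint($mode);"
}
```

For example, `guarded_checkpoint /path/to/app.db FULL` would run a FULL checkpoint only while Litestream is up, and a PASSIVE one otherwise.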
Test Requirements
- Test with the litestream-test binary from PR #748 (feat: Add litestream-test harness for comprehensive database testing)
- Or manually create database with continuous writes
- Reproducible 100% of the time with the provided script
Related Files
- Full analysis: critical-bug-analysis.md
- Test results: .local/docs/test-results-v0.5.0.md
- Gist: Available upon request