CRITICAL: Restore fails with 'nonsequential page numbers' after checkpoint during Litestream downtime #752

@corylanou

Bug Summary

Severity: CRITICAL - Complete Data Loss
Version: v0.5.0 (pre-release, main branch)
Impact: Database restoration fails completely after a specific but common failure scenario

Error Message

decode database: decode page 17743: copy page 17750 header: 
nonsequential page numbers in snapshot transaction: 17742,17750

Reproduction Steps

Scenario

  1. Litestream is replicating a database with continuous writes
  2. Litestream process is killed (crash, OOM, server restart, etc.)
  3. Database continues receiving writes while Litestream is down
  4. A non-PASSIVE checkpoint occurs (FULL, RESTART, or TRUNCATE)
  5. Litestream is restarted
  6. User attempts to restore from replica
  7. Result: Restore fails with nonsequential page error

Automated Reproduction Script

Save this as reproduce-critical-bug.sh:

#!/bin/bash
set -e

echo "Reproducing Critical Bug: Nonsequential Page Numbers"
echo "======================================================"

# Configuration
DB="/tmp/critical-bug-test.db"
REPLICA="/tmp/critical-bug-replica"
LITESTREAM="litestream"  # or ./bin/litestream if using local build
LITESTREAM_TEST="litestream-test"  # from PR #748

# Clean up
rm -f "$DB"* && rm -rf "$REPLICA"

# Step 1: Create test database (50MB)
echo "[1] Creating test database..."
$LITESTREAM_TEST populate -db "$DB" -target-size 50MB -table-count 2

# Step 2: Start Litestream
echo "[2] Starting Litestream..."
$LITESTREAM replicate "$DB" "file://$REPLICA" > /tmp/litestream.log 2>&1 &
LITESTREAM_PID=$!
sleep 3

# Step 3: Start continuous writes
echo "[3] Starting continuous writes..."
$LITESTREAM_TEST load -db "$DB" -write-rate 100 -duration 2m -pattern constant &
WRITE_PID=$!

# Step 4: Let it run normally
echo "[4] Running normally for 20 seconds..."
sleep 20

# Step 5: Kill Litestream (simulate crash)
echo "[5] Killing Litestream..."
kill -9 $LITESTREAM_PID

# Step 6: Continue writes without Litestream
echo "[6] Continuing writes (Litestream is down)..."
sleep 15

# Step 7: Execute checkpoint while Litestream is down
echo "[7] Executing FULL checkpoint..."
sqlite3 "$DB" "PRAGMA wal_checkpoint(FULL);"

# Step 8: Resume Litestream
echo "[8] Resuming Litestream..."
$LITESTREAM replicate "$DB" "file://$REPLICA" >> /tmp/litestream.log 2>&1 &
NEW_LITESTREAM_PID=$!
sleep 20

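# Stop everything and wait for the processes to exit before restoring
kill $WRITE_PID $NEW_LITESTREAM_PID 2>/dev/null || true
wait $WRITE_PID $NEW_LITESTREAM_PID 2>/dev/null || true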

# Step 9: Attempt restore (THIS FAILS)
echo "[9] Attempting restore..."
rm -f /tmp/restored.db
if $LITESTREAM restore -o /tmp/restored.db "file://$REPLICA"; then
    echo "SUCCESS: Restore completed"
else
    echo "CRITICAL BUG: Restore failed!"
    echo "Database cannot be restored after checkpoint during downtime"
fi
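
To tell this specific failure apart from unrelated restore errors (a missing replica directory, for instance), the last step of the script could capture the restore output and match it against the error text quoted above. The following is a minimal sketch: the helper name is_nonsequential_error and the log path are hypothetical, and it assumes the message wording stays as reported in this issue.

```shell
#!/bin/bash
# Hypothetical helper: succeeds (exit 0) if the text on stdin contains
# the error message reported in this issue, fails (exit 1) otherwise.
is_nonsequential_error() {
    grep -q 'nonsequential page numbers in snapshot transaction'
}

# Usage sketch (paths are placeholders):
#   if ! litestream restore -o /tmp/restored.db "file://$REPLICA" > /tmp/restore.log 2>&1; then
#       if is_nonsequential_error < /tmp/restore.log; then
#           echo "Reproduced: nonsequential page numbers"
#       else
#           echo "Restore failed for a different reason, see /tmp/restore.log"
#       fi
#   fi
```

Both stdout and stderr are captured here because the issue does not say which stream the error is written to.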

Root Cause Analysis

When Litestream resumes after being killed and detects a checkpoint occurred:

  1. The verify() function in db.go:650-738 detects WAL changes
  2. It triggers a full snapshot with info.snapshotting = true
  3. writeLTXFromDB() at db.go:946-991 creates the snapshot
  4. The function iterates pages 1 to commit sequentially
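  5. Problem: Some pages in that range don't exist (they weren't modified), leaving gaps in the snapshot's page numbers, such as the jump from 17742 to 17750 in the error above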
  6. The ltx decoder expects sequential pages and fails on gaps
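
The sequential-page rule in the last step can be illustrated with a small check that walks a list of page numbers and reports the first pair that is not consecutive. This is only a sketch of the rule as described above, not the ltx decoder's actual code. The name first_page_gap is hypothetical, and the sample input is a made-up page list that ends with the pair from the error message.

```shell
#!/bin/bash
# Hypothetical helper: reads one page number per line on stdin and prints
# "prev,next" for the first pair where next != prev + 1.
# Returns 0 if a gap was found, 1 if the sequence is contiguous.
first_page_gap() {
    local prev="" pgno
    while read -r pgno; do
        if [ -n "$prev" ] && [ "$pgno" -ne $((prev + 1)) ]; then
            echo "$prev,$pgno"
            return 0
        fi
        prev="$pgno"
    done
    return 1
}

# Example: pages 17740..17742 followed directly by 17750.
printf '%s\n' 17740 17741 17742 17750 | first_page_gap
# prints: 17742,17750
```

A snapshot whose page list passes this kind of check would not trip the decoder error shown in this issue. Whether the fix belongs on the writer or the decoder side is left to the solutions below.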

Impact

  • Data Loss: Complete inability to restore the database
  • Common Scenario: Any Litestream outage combined with a non-PASSIVE checkpoint leaves the replica unrestorable
  • User Trust: Critical failure in disaster recovery tool

Affected Code

  • db.go:950-989 - writeLTXFromDB() function
  • replica.go:472 - dec.DecodeDatabaseTo(f) where error occurs
  • github.com/superfly/ltx - Decoder expects sequential pages

Potential Solutions

  1. Skip non-existent pages in writeLTXFromDB()
  2. Modify ltx decoder to handle gaps during restore
  3. Full database scan on resume after checkpoint detection
  4. Track checkpoint state to handle WAL changes properly

Workaround

Users must avoid non-PASSIVE checkpoints when Litestream is not running. This is difficult to guarantee in production.
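
For manually triggered checkpoints (cron jobs, maintenance scripts), one way to approximate this workaround is to downgrade the requested mode to PASSIVE whenever no Litestream process is found. The sketch below rests on several assumptions: the function names are hypothetical, the process is assumed to be named exactly litestream, and pgrep and sqlite3 are assumed to be installed. It does not cover checkpoints issued by the application itself, and there is still a race between the process check and the checkpoint.

```shell
#!/bin/bash
# Hypothetical helper: prints the checkpoint mode that is safe to run.
#   $1 = requested mode (PASSIVE, FULL, RESTART, or TRUNCATE)
#   $2 = "yes" if a Litestream process is running, anything else otherwise
safe_checkpoint_mode() {
    local requested="$1" running="$2"
    if [ "$running" = "yes" ]; then
        echo "$requested"
    else
        # Litestream is down: fall back to PASSIVE.
        echo "PASSIVE"
    fi
}

# Hypothetical wrapper: looks for a process named exactly "litestream",
# then runs the checkpoint in the mode chosen above.
#   $1 = database path, $2 = requested mode
guarded_checkpoint() {
    local db="$1" requested="$2" running="no"
    if pgrep -x litestream > /dev/null; then
        running="yes"
    fi
    sqlite3 "$db" "PRAGMA wal_checkpoint($(safe_checkpoint_mode "$requested" "$running"));"
}

# Usage sketch:
#   guarded_checkpoint /tmp/critical-bug-test.db FULL
```

Note that pgrep only confirms that some Litestream process exists, not that it is replicating this particular database, so the guard reduces the risk rather than removing it.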

Test Requirements

Related Files

  • Full analysis: critical-bug-analysis.md
  • Test results: .local/docs/test-results-v0.5.0.md
  • Gist: Available upon request


Labels: bug (Something isn't working), release-blocker (Critical issue that blocks the next release)
