Skip to content

Conversation

corylanou
Copy link
Collaborator

Summary

This PR fixes a critical bug where Litestream was continuously snapshotting every second instead of respecting the configured intervals, causing excessive S3 transactions (70,000+ overnight as reported).

Fixes #689

Root Cause

When snapshot.interval was not specified in the YAML config, the parser would set it to 0 (zero duration) which overwrote the 24h default. This caused time.Until() to return a negative duration, triggering immediate and continuous snapshots.

Solution

  • Changed SnapshotConfig fields to use *time.Duration pointers to distinguish between "not specified" (nil) and "explicitly set to 0"
  • Added comprehensive validation at config parse time to reject zero/negative intervals
  • Preserved default values when YAML sections are empty or not specified
  • Implemented sentinel errors for type-safe error checking

Changes Made

  1. Configuration validation (cmd/litestream/main.go):

    • Added sentinel errors: ErrInvalidSnapshotInterval, ErrInvalidSnapshotRetention, ErrInvalidCompactionInterval, ErrInvalidSyncInterval
    • Added ConfigValidationError type with Unwrap() for errors.Is() support
    • Changed SnapshotConfig fields to use *time.Duration pointers
    • Added Validate() method to check all intervals > 0
    • Split ReadConfigFile into OpenConfigFile (io.ReadCloser) and ParseConfig (io.Reader)
    • Fixed default preservation when YAML sets fields to nil
  2. Replica configuration (cmd/litestream/replicate.go):

    • Updated to handle pointer values for snapshot intervals
    • Only overrides defaults when explicitly set (non-nil)
  3. Comprehensive tests (cmd/litestream/main_test.go):

    • Refactored complex table-driven tests to clear individual t.Run tests
    • Tests now use ParseConfig with strings.NewReader instead of temp files
    • Added validation for default values when sections are not specified
    • Tests for zero, negative, and unspecified intervals

Test Plan

✅ All existing tests pass
✅ New tests verify:

  • Zero intervals are rejected with proper errors
  • Negative intervals are rejected
  • Default values (24h) are preserved when not specified
  • Empty YAML sections don't overwrite defaults
  • Multiple replicas with different sync intervals work correctly

Verification

Tested with a config similar to the reported issue:

dbs:
  - path: /tmp/test.db
    replicas:
      - url: s3://my-bucket/db
        sync-interval: 1800s

The sync-interval is properly set to 30m and snapshot interval defaults to 24h (not continuously snapshotting).

🤖 Generated with Claude Code

This commit fixes a critical bug where Litestream was continuously
snapshotting every second instead of respecting the configured sync-interval.

The root cause was that when snapshot.interval was not specified in the YAML
config, the parser would set it to 0 (zero duration) which overwrote the
24h default. This caused time.Until() to return a negative duration,
triggering immediate and continuous snapshots.

Key changes:
- Changed SnapshotConfig fields to use *time.Duration pointers to distinguish
  between "not specified" (nil) and "explicitly set to 0"
- Added comprehensive validation at config parse time to reject zero/negative
  intervals
- Implemented sentinel errors (ErrInvalidSnapshotInterval, etc.) for type-safe
  error checking instead of string matching
- Preserved default values when YAML sections are empty or not specified
- Refactored ReadConfigFile to separate file I/O from parsing (OpenConfigFile
  + ParseConfig)
- Simplified overly complex table-driven tests into clearer individual t.Run
  tests

The fix ensures that:
1. Zero intervals are rejected at config parse time with clear errors
2. Default intervals (24h for snapshots) are preserved when not specified
3. Empty YAML sections don't overwrite defaults with nil values

This resolves the issue where users were seeing 70,000+ S3 transactions
overnight due to continuous snapshotting.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@benbjohnson benbjohnson self-requested a review August 11, 2025 19:16
@corylanou corylanou merged commit 4bbd699 into main Aug 11, 2025
16 checks passed
@corylanou corylanou deleted the fix-continuous-snapshotting branch August 11, 2025 19:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sha-8fa09d7a6ea15c66309b7d799518ffbaa8d1ee69 -> DockerHub Version - Not abiding by Sync-Interval
2 participants