Skip to content

Conversation

aaronsteers
Copy link
Contributor

@aaronsteers aaronsteers commented Aug 1, 2025

fix: Allow composite primary keys in PyAirbyte

Summary

Fixed a bug in PyAirbyte's primary key validation that incorrectly rejected composite primary keys like ['id', 'category'] with the error "Nested primary keys are not supported. Each PK column should have exactly one node."

The validation logic in catalog_providers.py was checking if each primary key had exactly one column (len(pk_nodes) != 1), which blocked composite keys. The intent was to prevent nested field references like ['data.id'], not composite keys.

Changes made:

  • Modified get_primary_keys() validation in airbyte/shared/catalog_providers.py
  • Changed from checking column count to checking for dot notation (nested field indicators)
  • Composite keys like ['id', 'category'] now work correctly
  • Nested keys like ['data.id'] are still properly rejected

Review & Testing Checklist for Human

⚠️ Risk Level: Medium - Core validation logic change affecting all PyAirbyte users

  • Verify validation logic is correct: Confirm that any("." in node for node in pk_nodes) properly distinguishes between composite keys (['id', 'category']) and nested keys (['data.id'])
  • Test with real connectors: Try setting composite primary keys on actual source connectors to ensure no regressions in production workflows
  • Edge case testing: Test with keys containing special characters, numbers, or unusual naming patterns to ensure robustness
  • Integration test coverage: Check if integration tests need updates to cover composite primary key scenarios

Recommended test plan:

  1. Run the reproduction script with MotherDuck API key to verify the original issue is resolved
  2. Test composite primary keys with 2-3 different source connectors
  3. Verify single-column primary keys still work normally
  4. Test that truly nested keys (with dots) are still rejected appropriately

Diagram

%%{ init : { "theme" : "default" }}%%
graph TB
    repro["reproduce_null_pk_issue.py<br/>(test script)"]:::context
    source_base["airbyte/sources/base.py<br/>set_primary_keys()"]:::context
    catalog_providers["airbyte/shared/catalog_providers.py<br/>get_primary_keys()"]:::major-edit
    test_overrides["tests/unit_tests/sources/<br/>test_source_key_overrides.py"]:::context
    
    repro -->|"calls set_primary_keys(['id', 'category'])"| source_base
    source_base -->|"stores as _primary_key_overrides"| catalog_providers
    catalog_providers -->|"validates primary key format"| test_overrides
    
    catalog_providers -->|"OLD: len(pk_nodes) != 1<br/>NEW: any('.' in node)"| catalog_providers
    
    subgraph Legend
        L1[Major Edit]:::major-edit
        L3[Context/No Edit]:::context
    end

classDef major-edit fill:#90EE90
classDef context fill:#FFFFFF
Loading

Notes

This fix resolves the issue reported by AJ Steers where composite primary keys were incorrectly blocked by PyAirbyte's validation. The change is minimal (2 lines) but affects a core validation function used throughout the system.

Key validation change:

  • Before: if len(pk_nodes) != 1: (rejected any multi-column key)
  • After: if any("." in node for node in pk_nodes): (only rejects nested field references)

All existing unit tests pass, including the specific composite primary key test that was previously failing. The reproduction script now successfully processes composite primary keys with both DuckDB and Snowflake destinations.

Session details: Requested by AJ Steers (@aaronsteers) in Devin session: https://app.devin.ai/sessions/bc5912d0a95e499dab8e86e329f1face

Summary by CodeRabbit

  • Bug Fixes
    • Improved validation for primary keys to ensure only top-level fields are accepted, providing clearer error messages when nested fields are used.

- Fixed validation logic in catalog_providers.py that incorrectly rejected composite primary keys
- Changed validation from checking column count to checking for nested field references (dots)
- Composite keys like ['id', 'category'] now work correctly
- Nested keys like ['data.id'] are still properly rejected
- All existing unit tests pass

Fixes issue where set_primary_keys(stream=['col1', 'col2']) would fail with
'Nested primary keys are not supported' error despite composite keys being supported.

Co-Authored-By: AJ Steers <[email protected]>
@Copilot Copilot AI review requested due to automatic review settings August 1, 2025 17:59
Copy link
Contributor

Original prompt from AJ Steers
Received message in Slack channel #ask-devin-ai:

@Devin - A user is reporting that the MotherDuck destination is not able to handle nulls in primary keys. I wasn't aware this was supported but apparently it works for the Snowflake destination. I want you to create a PyAirbyte script to repro the issue and examine the differences of handling for the two destinations. You will need to use the source-file connector and point it at custom built JSON source data files with a composite two-column primary key. You may need to use PyAirbyte's override_primary_keys feature to set the primary keys for the streams.

After running first test (with no nulls):
1. Does one or both destinations declare a primary key constraint on the destination tables?
Then run again using a file that does contain nulls in a few records.
2. Does the Snowflake destination correctly write the records or does it write an error message within the row metadata? (Or not write at all?)
3. Can you confirm the motherduck destination fails to insert the records with null values in their primary key?

Copy link
Contributor

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

Copy link

github-actions bot commented Aug 1, 2025

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This PyAirbyte Version

You can test this version of PyAirbyte using the following:

# Run PyAirbyte CLI from this branch:
uvx --from 'git+https://github.com/airbytehq/PyAirbyte.git@devin/1754070845-fix-composite-primary-keys' pyairbyte --help

# Install PyAirbyte from this branch for development:
pip install 'git+https://github.com/airbytehq/PyAirbyte.git@devin/1754070845-fix-composite-primary-keys'

Helpful Resources

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /fix-pr - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test-pr - Runs tests with the updated PyAirbyte

Community Support

Questions? Join the #pyairbyte channel in our Slack workspace.

📝 Edit this welcome message.

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a bug in PyAirbyte where composite primary keys (multiple columns like ['id', 'category']) were incorrectly rejected due to overly restrictive validation logic.

  • Fixed validation in get_primary_keys() to allow composite primary keys while still preventing nested field references
  • Changed validation from checking column count to checking for dot notation in field names
  • Updated error message to clarify that only nested fields (not composite keys) are unsupported


for pk_nodes in normalized_pks:
if len(pk_nodes) != 1:
if any("." in node for node in pk_nodes):
Copy link
Preview

Copilot AI Aug 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validation logic may incorrectly reject valid field names that contain dots but are not nested references. Consider using a more robust check that distinguishes between legitimate field names with dots and actual nested field paths.

Suggested change
if any("." in node for node in pk_nodes):
if len(pk_nodes) > 1:

Copilot uses AI. Check for mistakes.

@aaronsteers aaronsteers changed the title fix: Allow composite primary keys in PyAirbyte fix: Allow composite primary key overrides in PyAirbyte Aug 1, 2025
@aaronsteers aaronsteers marked this pull request as draft August 1, 2025 18:00
Copy link
Contributor

coderabbitai bot commented Aug 1, 2025

📝 Walkthrough

Walkthrough

The validation logic in the get_primary_keys method within airbyte/shared/catalog_providers.py was updated. Instead of checking the number of nodes in primary key columns, the method now raises an error if any node string contains a dot, ensuring primary keys only reference top-level fields. The error message was revised for clarity.

Changes

Cohort / File(s) Change Summary
Primary Key Validation Logic
airbyte/shared/catalog_providers.py
Updated validation in get_primary_keys to raise an error for primary key columns referencing nested fields (containing a dot), and revised the error message.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Possibly related PRs

Note

⚡️ Unit Test Generation is now available in beta!

Learn more here, or try it out under "Finishing Touches" below.


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6189e9c and 048eb21.

📒 Files selected for processing (1)
  • airbyte/shared/catalog_providers.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: yohannj
PR: airbytehq/PyAirbyte#716
File: airbyte/logs.py:384-402
Timestamp: 2025-07-11T19:53:44.427Z
Learning: In the PyAirbyte project, when reviewing PRs, maintain clear separation of concerns. Don't suggest changes that are outside the scope of the PR's main objective, even if they would improve consistency or fix other issues. This helps with reviewing changes and potential reverts.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Pytest (All, Python 3.11, Windows)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Windows)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (No Creds)
  • GitHub Check: Pytest (Fast)
🔇 Additional comments (1)
airbyte/shared/catalog_providers.py (1)

158-168: Great fix! This correctly addresses the composite primary key issue, wdyt?

The change from checking node count to checking for dot notation is exactly right. Now composite keys like ['id', 'category'] will pass validation while nested keys like ['data.id'] are still properly rejected. The updated error message is also much clearer about what's actually being validated.

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch devin/1754070845-fix-composite-primary-keys

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

github-actions bot commented Aug 1, 2025

PyTest Results (Fast Tests Only, No Creds)

301 tests  ±0   301 ✅ ±0   4m 16s ⏱️ +2s
  1 suites ±0     0 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit 048eb21. ± Comparison against base commit 6189e9c.

Copy link

github-actions bot commented Aug 1, 2025

PyTest Results (Full)

364 tests  ±0   350 ✅ ±0   22m 19s ⏱️ + 1m 6s
  1 suites ±0    14 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit 048eb21. ± Comparison against base commit 6189e9c.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant