
Conversation

tolik0
Contributor

@tolik0 tolik0 commented Aug 27, 2025

Summary by CodeRabbit

  • Refactor
    • Optimized Excel record parsing to stream rows incrementally, reducing memory usage and improving performance on large files.
    • Maintains the same output format and row order to ensure compatibility with existing integrations.
    • Enhances stability during long-running or large dataset syncs by avoiding large in-memory conversions.

@Copilot Copilot AI review requested due to automatic review settings August 27, 2025 14:35
@tolik0 tolik0 self-assigned this Aug 27, 2025
@github-actions github-actions bot added the bug Something isn't working label Aug 27, 2025

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@tolik0/file-based/fix-memory-issue-in-excel-parser#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch tolik0/file-based/fix-memory-issue-in-excel-parser

Helpful Resources

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /poe build - Regenerates git-committed build artifacts, such as the Pydantic models generated from the manifest JSON schema in YAML
  • /poe <command> - Runs any poe command in the CDK environment


Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR fixes a memory issue in the ExcelParser by changing how DataFrame records are converted to JSON. Instead of converting the entire DataFrame to JSON at once and then yielding from it, the code now iterates through individual rows and converts each row to JSON separately.

  • Replace bulk DataFrame JSON conversion with row-by-row processing
  • Maintain the same date formatting behavior (ISO format with microseconds)
  • Preserve the orjson parsing for each individual row
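For illustration, here is a minimal sketch of the per-row pattern the PR adopts (simplified: the real parse_records also handles spreadsheet loading and RecordParseError, and the helper name below is made up for the example):

import orjson
import pandas as pd

def iter_excel_rows(df: pd.DataFrame):
    # Yield one dict per row instead of materializing the whole DataFrame as a single JSON array
    for _, row in df.iterrows():
        # Each row (a pandas Series) is serialized and parsed on its own, keeping peak memory low
        yield orjson.loads(row.to_json(date_format="iso", date_unit="us"))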


Comment on lines +121 to +123
for index, row in df.iterrows():
    # Convert each row (as a Series) to a JSON string
    yield orjson.loads(row.to_json(date_format="iso", date_unit="us"))

Copilot AI Aug 27, 2025


Using df.iterrows() is inefficient for large DataFrames as it returns copies of data and has significant overhead. Consider using df.to_dict('records') with manual datetime conversion, or df.itertuples() for better performance while maintaining the memory benefits.

Suggested change
-for index, row in df.iterrows():
-    # Convert each row (as a Series) to a JSON string
-    yield orjson.loads(row.to_json(date_format="iso", date_unit="us"))
+# Efficiently convert the DataFrame to a list of records with proper datetime serialization
+records = orjson.loads(df.to_json(orient="records", date_format="iso", date_unit="us"))
+for record in records:
+    yield record

Copilot uses AI. Check for mistakes.
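As a rough illustration of the overhead Copilot is referring to, a hypothetical micro-benchmark (not part of this PR) comparing iterrows with itertuples could look like this:

import time
import numpy as np
import pandas as pd

# Synthetic 100k-row frame, used only for timing
df = pd.DataFrame(np.random.rand(100_000, 10), columns=[f"c{i}" for i in range(10)])

start = time.perf_counter()
rows_a = [row.to_dict() for _, row in df.iterrows()]  # builds a Series object per row
print("iterrows:  ", round(time.perf_counter() - start, 2), "s")

start = time.perf_counter()
rows_b = [dict(zip(df.columns, row)) for row in df.itertuples(index=False, name=None)]
print("itertuples:", round(time.perf_counter() - start, 2), "s")

On typical data, itertuples comes out considerably faster because it avoids constructing a Series per row; either way, memory stays bounded as long as records are yielded rather than collected into a list.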


Contributor

coderabbitai bot commented Aug 27, 2025

📝 Walkthrough


ExcelParser.parse_records now yields records by iterating DataFrame rows and converting each row to JSON individually, replacing the previous approach that converted the entire DataFrame to a JSON array before yielding records. Error handling behavior is unchanged.

Changes

Cohort / File(s): Excel per-row streaming parse (airbyte_cdk/sources/file_based/file_types/excel_parser.py)
Summary of changes: Switched from bulk DataFrame JSON conversion (df.to_json(..., orient="records") then orjson.loads(...)) to per-row iteration (df.iterrows(), then row.to_json(...) and orjson.loads(...)), yielding one dict per row; exceptions still raise RecordParseError.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant ExcelParser
    participant PandasDF as Pandas DataFrame
    participant JSON as orjson

    Caller->>ExcelParser: parse_records(file)
    ExcelParser->>PandasDF: load Excel into DataFrame
    loop For each row (streamed)
        ExcelParser->>PandasDF: iterrows() next()
        ExcelParser->>JSON: row.to_json(date_format="iso", date_unit="us")
        JSON-->>ExcelParser: dict (parsed row)
        ExcelParser-->>Caller: yield dict
    end
    Note over ExcelParser: On exception: raise RecordParseError

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
airbyte_cdk/sources/file_based/file_types/excel_parser.py (2)

121-123: Cut per-row JSON roundtrips; pre-format datetimes once and iterate fast (avoid iterrows + Series.to_json).

The current approach pays a JSON encode/decode cost per row and constructs many small strings. Would you consider pre-formatting datetime-like columns in a single vectorized pass and yielding Python dicts via itertuples for better CPU performance while keeping memory low, wdyt?

-                for index, row in df.iterrows():
-                    # Convert each row (as a Series) to a JSON string
-                    yield orjson.loads(row.to_json(date_format="iso", date_unit="us"))
+                # Pre-format datetime-like columns once to ISO-8601 with microseconds
+                dt_cols = df.select_dtypes(include=["datetime64[ns]", "datetimetz"]).columns
+                if len(dt_cols) > 0:
+                    df[dt_cols] = df[dt_cols].applymap(lambda x: x.isoformat() if pd.notna(x) else None)
+
+                # Fast row iteration without per-row JSON ser/de
+                for row in df.itertuples(index=False, name=None):
+                    yield dict(zip(df.columns, row))

Notes:

  • This may return numpy scalar types for numbers/bools instead of pure Python scalars (whereas orjson.loads returns built-ins). If downstream requires built-ins, we can add a lightweight cast (e.g., convert np.generic via .item()) — happy to draft that if needed, wdyt?
  • If you prefer to keep the current behavior, consider batching (e.g., chunk the DataFrame) to reduce Python overhead further.
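For reference, a lightweight cast along the lines of that note might look like the following (a hypothetical helper, not part of the PR or of CodeRabbit's diff; NaN/NaT normalization is omitted for brevity):

import numpy as np
import pandas as pd

def to_builtin(value):
    # Convert numpy scalars (np.int64, np.float64, np.bool_, ...) into plain Python types
    if isinstance(value, np.generic):
        return value.item()
    return value

def rows_as_dicts(df: pd.DataFrame):
    # Yield plain-Python dicts per row, mirroring what orjson.loads would have returned
    for row in df.itertuples(index=False, name=None):
        yield {column: to_builtin(value) for column, value in zip(df.columns, row)}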

121-121: Minor: underscore unused variable.

If you keep iterrows, can we use “_” for the unused index to satisfy linters, wdyt?

-                for index, row in df.iterrows():
+                for _, row in df.iterrows():
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between e80c173 and 7424cc6.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/file_based/file_types/excel_parser.py (1 hunks)
🔇 Additional comments (1)
airbyte_cdk/sources/file_based/file_types/excel_parser.py (1)

121-123: Good call on eliminating the bulk DataFrame→JSON string.

This should significantly reduce peak memory. Nice improvement!


github-actions bot commented Aug 27, 2025

PyTest Results (Fast)

3 764 tests  ±0   3 752 ✅ ±0   6m 48s ⏱️ +5s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 3c9cc6f. ± Comparison against base commit e5a1fc2.

♻️ This comment has been updated with latest results.


github-actions bot commented Aug 27, 2025

PyTest Results (Full)

3 767 tests  ±0   3 755 ✅ ±0   11m 27s ⏱️ +2s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 3c9cc6f. ± Comparison against base commit e5a1fc2.

♻️ This comment has been updated with latest results.

@tolik0 tolik0 requested review from maxi297 and aaronsteers August 27, 2025 15:31
Contributor

@maxi297 maxi297 left a comment


The change makes sense as we won't pull everything into memory just to yield it afterwards. However, I have one small comment regarding Copilot's suggestion.

Comment on lines +121 to +123
for index, row in df.iterrows():
    # Convert each row (as a Series) to a JSON string
    yield orjson.loads(row.to_json(date_format="iso", date_unit="us"))
Contributor


Based on this copilot recommendation, should we use itertuples instead of iterrows?
