fix(excel-parser): Fix memory issue in ExcelParser
#729
base: main
Conversation
👋 Greetings, Airbyte Team Member! Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

```shell
# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@tolik0/file-based/fix-memory-issue-in-excel-parser#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch tolik0/file-based/fix-memory-issue-in-excel-parser
```

Helpful Resources

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:
Pull Request Overview
This PR fixes a memory issue in the ExcelParser by changing how DataFrame records are converted to JSON. Instead of converting the entire DataFrame to JSON at once and then yielding from it, the code now iterates through individual rows and converts each row to JSON separately.
- Replace bulk DataFrame JSON conversion with row-by-row processing
- Maintain the same date formatting behavior (ISO format with microseconds)
- Preserve the orjson parsing for each individual row
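For contrast, the replaced bulk conversion described above would have looked roughly like the following (a sketch reconstructed from this description, not the exact original code; the wrapper name `_bulk_yield` is purely illustrative and `df` is assumed to be the parsed DataFrame):

```python
from typing import Any, Dict, Iterable

import orjson
import pandas as pd


def _bulk_yield(df: pd.DataFrame) -> Iterable[Dict[str, Any]]:
    # Serialize the whole DataFrame to a single JSON array in memory, then parse
    # it and yield from the result; peak memory scales with the full sheet size.
    for record in orjson.loads(df.to_json(orient="records", date_format="iso", date_unit="us")):
        yield record
```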
```python
for index, row in df.iterrows():
    # Convert each row (as a Series) to a JSON string
    yield orjson.loads(row.to_json(date_format="iso", date_unit="us"))
```
Using `df.iterrows()` is inefficient for large DataFrames, as it returns copies of the data and has significant overhead. Consider using `df.to_dict('records')` with manual datetime conversion, or `df.itertuples()`, for better performance while maintaining the memory benefits.
Suggested change:

```diff
-for index, row in df.iterrows():
-    # Convert each row (as a Series) to a JSON string
-    yield orjson.loads(row.to_json(date_format="iso", date_unit="us"))
+# Efficiently convert the DataFrame to a list of records with proper datetime serialization
+records = orjson.loads(df.to_json(orient="records", date_format="iso", date_unit="us"))
+for record in records:
+    yield record
```
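The `df.to_dict('records')` variant mentioned in the comment is not shown in the suggestion above; a rough sketch of what it could look like is below (the helper name `_yield_records` and the datetime handling are illustrative assumptions, and note that `to_dict` still materializes the full list of records in memory):

```python
from typing import Any, Dict, Iterable

import pandas as pd


def _yield_records(df: pd.DataFrame) -> Iterable[Dict[str, Any]]:
    # Manually convert datetime-like columns to ISO-8601 strings first
    datetime_columns = df.select_dtypes(include=["datetime64[ns]", "datetimetz"]).columns
    for column in datetime_columns:
        df[column] = df[column].map(lambda value: value.isoformat() if pd.notna(value) else None)

    # Yield plain dicts; this builds the full list of records before iterating
    for record in df.to_dict(orient="records"):
        yield record
```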
Based on this Copilot recommendation, should we use `itertuples` instead of `iterrows`?
📝 Walkthrough

ExcelParser.parse_records now yields records by iterating DataFrame rows and converting each row to JSON individually, replacing the previous approach that converted the entire DataFrame to a JSON array before yielding records. Error handling behavior is unchanged.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant ExcelParser
    participant PandasDF as Pandas DataFrame
    participant JSON as orjson

    Caller->>ExcelParser: parse_records(file)
    ExcelParser->>PandasDF: load Excel into DataFrame
    loop For each row (streamed)
        ExcelParser->>PandasDF: iterrows() next()
        ExcelParser->>JSON: row.to_json(date_format="iso", date_unit="us")
        JSON-->>ExcelParser: dict (parsed row)
        ExcelParser-->>Caller: yield dict
    end
    Note over ExcelParser: On exception: raise RecordParseError
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 0
🧹 Nitpick comments (2)
airbyte_cdk/sources/file_based/file_types/excel_parser.py (2)
121-123: Cut per-row JSON roundtrips; pre-format datetimes once and iterate fast (avoid iterrows + Series.to_json).

The current approach pays a JSON encode/decode cost per row and constructs many small strings. Would you consider pre-formatting datetime-like columns in a vectorized pass and yielding Python dicts via itertuples, for better CPU performance while keeping memory low, wdyt?
```diff
-for index, row in df.iterrows():
-    # Convert each row (as a Series) to a JSON string
-    yield orjson.loads(row.to_json(date_format="iso", date_unit="us"))
+# Pre-format datetime-like columns once to ISO-8601 with microseconds
+dt_cols = df.select_dtypes(include=["datetime64[ns]", "datetimetz"]).columns
+if len(dt_cols) > 0:
+    df[dt_cols] = df[dt_cols].applymap(lambda x: x.isoformat() if pd.notna(x) else None)
+
+# Fast row iteration without per-row JSON ser/de
+for row in df.itertuples(index=False, name=None):
+    yield dict(zip(df.columns, row))
```

Notes:
- This may return numpy scalar types for numbers/bools instead of pure Python scalars (whereas orjson.loads returns built-ins). If downstream requires built-ins, we can add a lightweight cast (e.g., convert np.generic via .item()) — happy to draft that if needed, wdyt?
- If you prefer to keep the current behavior, consider batching (e.g., chunk the DataFrame) to reduce Python overhead further.
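The lightweight cast mentioned in the first note could look roughly like this (a sketch; the helper name `_to_builtin` is illustrative):

```python
import numpy as np


def _to_builtin(value):
    # np.generic covers numpy scalar types (np.int64, np.float64, np.bool_, ...);
    # .item() converts them to the corresponding built-in Python scalar.
    if isinstance(value, np.generic):
        return value.item()
    return value


# Applied per yielded row, e.g. inside the itertuples loop:
# yield {column: _to_builtin(value) for column, value in zip(df.columns, row)}
```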
121-121: Minor: underscore the unused variable.

If you keep iterrows, can we use "_" for the unused index to satisfy linters, wdyt?

```diff
-for index, row in df.iterrows():
+for _, row in df.iterrows():
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)

- airbyte_cdk/sources/file_based/file_types/excel_parser.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: Check: source-shopify
- GitHub Check: Check: destination-motherduck
- GitHub Check: Check: source-hardcoded-records
- GitHub Check: Check: source-intercom
- GitHub Check: Check: source-pokeapi
- GitHub Check: Manifest Server Docker Image Build
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: SDM Docker Image Build
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
🔇 Additional comments (1)
airbyte_cdk/sources/file_based/file_types/excel_parser.py (1)
121-123: Good call on eliminating the bulk DataFrame→JSON string.

This should significantly reduce peak memory. Nice improvement!
The change makes sense, as we won't pull everything into memory before yielding it. However, I have one small comment regarding Copilot's suggestion.
```python
for index, row in df.iterrows():
    # Convert each row (as a Series) to a JSON string
    yield orjson.loads(row.to_json(date_format="iso", date_unit="us"))
```
Based on this Copilot recommendation, should we use `itertuples` instead of `iterrows`?
Summary by CodeRabbit