fix(excel-parser): Fix memory issue in ExcelParser
#729
base: main
Conversation
👋 Greetings, Airbyte Team Member! Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

```shell
# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@tolik0/file-based/fix-memory-issue-in-excel-parser#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch tolik0/file-based/fix-memory-issue-in-excel-parser
```

Helpful Resources

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:
Pull Request Overview
This PR fixes a memory issue in the ExcelParser by changing how DataFrame records are converted to JSON. Instead of converting the entire DataFrame to JSON at once and then yielding from it, the code now iterates through individual rows and converts each row to JSON separately.
- Replace bulk DataFrame JSON conversion with row-by-row processing
- Maintain the same date formatting behavior (ISO format with microseconds)
- Preserve the orjson parsing for each individual row
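For contrast, the replaced bulk conversion described above would have looked roughly like the following (a sketch reconstructed from this description, not the exact original code; the wrapper name `_bulk_yield` is purely illustrative and `df` is assumed to be the parsed DataFrame):

```python
from typing import Any, Dict, Iterable

import orjson
import pandas as pd


def _bulk_yield(df: pd.DataFrame) -> Iterable[Dict[str, Any]]:
    # Serialize the whole DataFrame to a single JSON array in memory, then parse
    # it and yield from the result; peak memory scales with the full sheet size.
    for record in orjson.loads(df.to_json(orient="records", date_format="iso", date_unit="us")):
        yield record
```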
```python
for index, row in df.iterrows():
    # Convert each row (as a Series) to a JSON string
    yield orjson.loads(row.to_json(date_format="iso", date_unit="us"))
```
Using `df.iterrows()` is inefficient for large DataFrames, as it returns copies of the data and has significant overhead. Consider using `df.to_dict('records')` with manual datetime conversion, or `df.itertuples()`, for better performance while maintaining the memory benefits.
Suggested change:

```diff
-for index, row in df.iterrows():
-    # Convert each row (as a Series) to a JSON string
-    yield orjson.loads(row.to_json(date_format="iso", date_unit="us"))
+# Efficiently convert the DataFrame to a list of records with proper datetime serialization
+records = orjson.loads(df.to_json(orient="records", date_format="iso", date_unit="us"))
+for record in records:
+    yield record
```
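The `df.to_dict('records')` variant mentioned in the comment is not shown in the suggestion above; a rough sketch of what it could look like is below (the helper name `_yield_records` and the datetime handling are illustrative assumptions, and note that `to_dict` still materializes the full list of records in memory):

```python
from typing import Any, Dict, Iterable

import pandas as pd


def _yield_records(df: pd.DataFrame) -> Iterable[Dict[str, Any]]:
    # Manually convert datetime-like columns to ISO-8601 strings first
    datetime_columns = df.select_dtypes(include=["datetime64[ns]", "datetimetz"]).columns
    for column in datetime_columns:
        df[column] = df[column].map(lambda value: value.isoformat() if pd.notna(value) else None)

    # Yield plain dicts; this builds the full list of records before iterating
    for record in df.to_dict(orient="records"):
        yield record
```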
Based on this Copilot recommendation, should we use `itertuples` instead of `iterrows`?
📝 Walkthrough

ExcelParser.parse_records now yields records by iterating DataFrame rows and converting each row to JSON individually, replacing the previous approach that converted the entire DataFrame to a JSON array before yielding records. Error handling behavior is unchanged.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant ExcelParser
    participant PandasDF as Pandas DataFrame
    participant JSON as orjson

    Caller->>ExcelParser: parse_records(file)
    ExcelParser->>PandasDF: load Excel into DataFrame
    loop For each row (streamed)
        ExcelParser->>PandasDF: iterrows() next()
        ExcelParser->>JSON: row.to_json(date_format="iso", date_unit="us")
        JSON-->>ExcelParser: dict (parsed row)
        ExcelParser-->>Caller: yield dict
    end
    Note over ExcelParser: On exception: raise RecordParseError
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 0
🧹 Nitpick comments (2)
airbyte_cdk/sources/file_based/file_types/excel_parser.py (2)
121-123: Cut per-row JSON roundtrips; pre-format datetimes once and iterate fast (avoid iterrows + Series.to_json).

The current approach pays a JSON encode/decode cost per row and constructs many small strings. Would you consider pre-formatting datetime-like columns in a vectorized pass and yielding Python dicts via itertuples, for better CPU performance while keeping memory low, wdyt?
```diff
-for index, row in df.iterrows():
-    # Convert each row (as a Series) to a JSON string
-    yield orjson.loads(row.to_json(date_format="iso", date_unit="us"))
+# Pre-format datetime-like columns once to ISO-8601 with microseconds
+dt_cols = df.select_dtypes(include=["datetime64[ns]", "datetimetz"]).columns
+if len(dt_cols) > 0:
+    df[dt_cols] = df[dt_cols].applymap(lambda x: x.isoformat() if pd.notna(x) else None)
+
+# Fast row iteration without per-row JSON ser/de
+for row in df.itertuples(index=False, name=None):
+    yield dict(zip(df.columns, row))
```

Notes:
- This may return numpy scalar types for numbers/bools instead of pure Python scalars (whereas orjson.loads returns built-ins). If downstream requires built-ins, we can add a lightweight cast (e.g., convert np.generic via .item()) — happy to draft that if needed, wdyt?
- If you prefer to keep the current behavior, consider batching (e.g., chunk the DataFrame) to reduce Python overhead further.
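The lightweight cast mentioned in the first note could look roughly like this (a sketch; the helper name `_to_builtin` is illustrative):

```python
import numpy as np


def _to_builtin(value):
    # np.generic covers numpy scalar types (np.int64, np.float64, np.bool_, ...);
    # .item() converts them to the corresponding built-in Python scalar.
    if isinstance(value, np.generic):
        return value.item()
    return value


# Applied per yielded row, e.g. inside the itertuples loop:
# yield {column: _to_builtin(value) for column, value in zip(df.columns, row)}
```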
121-121: Minor: underscore the unused variable.

If you keep iterrows, can we use "_" for the unused index to satisfy linters, wdyt?

```diff
-for index, row in df.iterrows():
+for _, row in df.iterrows():
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)

- airbyte_cdk/sources/file_based/file_types/excel_parser.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: Check: source-shopify
- GitHub Check: Check: destination-motherduck
- GitHub Check: Check: source-hardcoded-records
- GitHub Check: Check: source-intercom
- GitHub Check: Check: source-pokeapi
- GitHub Check: Manifest Server Docker Image Build
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: SDM Docker Image Build
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
🔇 Additional comments (1)
airbyte_cdk/sources/file_based/file_types/excel_parser.py (1)
121-123: Good call on eliminating the bulk DataFrame→JSON string.

This should significantly reduce peak memory. Nice improvement!
The change makes sense, as we won't pull everything into memory before yielding it. However, I have one small comment regarding Copilot's suggestion.
```python
for index, row in df.iterrows():
    # Convert each row (as a Series) to a JSON string
    yield orjson.loads(row.to_json(date_format="iso", date_unit="us"))
```
Based on this Copilot recommendation, should we use `itertuples` instead of `iterrows`?
Summary by CodeRabbit