⚡️ Speed up method `_ColumnNamesDataset._get_bq_schema_field_names_recursively` by 71% #25
📄 71% (0.71x) speedup for `_ColumnNamesDataset._get_bq_schema_field_names_recursively` in `google/cloud/aiplatform/datasets/column_names_dataset.py`
⏱️ Runtime: 1.76 milliseconds → 1.03 milliseconds (best of 460 runs)
📝 Explanation and details
The optimized version achieves a 71% speedup by eliminating expensive set comprehensions and reducing memory allocations in the recursive function.
Key optimizations:
- Eliminated nested set comprehension: The original code created a set comprehension with recursive calls inside (`{nested_field_name for field in schema_field.fields for nested_field_name in ...}`), which is memory-intensive and requires multiple temporary set constructions.
- Direct leaf node detection: Changed from `len(ancestor_names) == 0` to `not schema_field.fields`, avoiding the need to build the entire `ancestor_names` set just to check if it's empty.
- Single set accumulation: Instead of creating multiple intermediate sets, the optimized version uses one `result` set that accumulates values through simple `add()` operations, which is much more efficient than set comprehensions.
- Reduced string operations: By storing `name_prefix = schema_field.name` once and reusing it, the code avoids repeated attribute lookups during string formatting.

Performance characteristics from tests:
The optimization is particularly effective for BigQuery schema processing with many nested fields, where the original approach's memory allocation overhead becomes a significant bottleneck.
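The two approaches described above can be compared side by side. The following is a minimal sketch reconstructed from the explanation, not the exact code in this PR: `Field` is a hypothetical stand-in for a BigQuery schema field (only `name` and `fields` are used), and the function names `field_names_comprehension` and `field_names_accumulating` are made up for illustration.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field with optional nested fields."""

    name: str
    fields: Tuple["Field", ...] = ()


def field_names_comprehension(schema_field: Field) -> Set[str]:
    """Pattern described as the original: build the nested set first, then test it."""
    ancestor_names = {
        nested_field_name
        for field in schema_field.fields
        for nested_field_name in field_names_comprehension(field)
    }
    # A field with no nested names is a leaf.
    if len(ancestor_names) == 0:
        return {schema_field.name}
    return {f"{schema_field.name}.{name}" for name in ancestor_names}


def field_names_accumulating(schema_field: Field) -> Set[str]:
    """Pattern described as the optimization: early leaf check, one result set."""
    # Detect a leaf directly instead of building a set to check its length.
    if not schema_field.fields:
        return {schema_field.name}
    # Read the attribute once and reuse it while formatting.
    name_prefix = schema_field.name
    result: Set[str] = set()
    for field in schema_field.fields:
        for nested_field_name in field_names_accumulating(field):
            result.add(f"{name_prefix}.{nested_field_name}")
    return result


# Small nested schema: address -> (city, geo -> (lat, lon))
schema = Field(
    "address",
    (Field("city"), Field("geo", (Field("lat"), Field("lon")))),
)
names = field_names_accumulating(schema)
```

Both functions return only leaf names, each qualified by its dotted path from the root, so they should agree on any input. The timings reported above come from the PR's own benchmark and are not reproduced by this sketch.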
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, run `git checkout codeflash/optimize-_ColumnNamesDataset._get_bq_schema_field_names_recursively-mgje3iiu` and push.