@codeflash-ai codeflash-ai bot commented Oct 9, 2025

📄 71% (0.71x) speedup for _ColumnNamesDataset._get_bq_schema_field_names_recursively in google/cloud/aiplatform/datasets/column_names_dataset.py

⏱️ Runtime: 1.76 milliseconds → 1.03 milliseconds (best of 460 runs)

📝 Explanation and details

The optimized version achieves a 71% speedup by eliminating expensive set comprehensions and reducing memory allocations in the recursive function.

Key optimizations:

  1. Eliminated nested set comprehension: The original code created a set comprehension with recursive calls inside ({nested_field_name for field in schema_field.fields for nested_field_name in ...}), which is memory-intensive and requires multiple temporary set constructions.

  2. Direct leaf node detection: Changed from len(ancestor_names) == 0 to not schema_field.fields, avoiding the need to build the entire ancestor_names set just to check if it's empty.

  3. Single set accumulation: Instead of building the set of descendant names and then a second, prefixed copy of it at every level, the optimized version fills one result set through simple add() calls.

  4. Reduced attribute lookups: By storing name_prefix = schema_field.name once and reusing it, the code avoids looking up the attribute again for every formatted name.
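
The four points above can be sketched side by side. This is a reconstruction, not the actual diff: the "original" body follows the patched helper in the generated tests below, the "optimized" body is inferred from the description, and `SchemaField`, `field_names_original`, and `field_names_optimized` are hypothetical names used only for illustration.

```python
from typing import List, Optional, Set


class SchemaField:
    """Hypothetical stand-in for bigquery.SchemaField: a name plus nested fields."""

    def __init__(self, name: str, fields: Optional[List["SchemaField"]] = None):
        self.name = name
        self.fields = fields or []


def field_names_original(schema_field: SchemaField) -> Set[str]:
    # Builds the set of all descendant names first, even for a leaf...
    ancestor_names = {
        nested_field_name
        for field in schema_field.fields
        for nested_field_name in field_names_original(field)
    }
    if len(ancestor_names) == 0:
        return {schema_field.name}
    # ...then builds a second set with the prefix applied.
    return {f"{schema_field.name}.{name}" for name in ancestor_names}


def field_names_optimized(schema_field: SchemaField) -> Set[str]:
    # Leaf check happens before any set is built.
    if not schema_field.fields:
        return {schema_field.name}
    name_prefix = schema_field.name  # looked up once, reused below
    result: Set[str] = set()
    for field in schema_field.fields:
        for nested_field_name in field_names_optimized(field):
            result.add(f"{name_prefix}.{nested_field_name}")
    return result


schema = SchemaField(
    "root",
    [SchemaField("leaf"), SchemaField("nested", [SchemaField("deep")])],
)
print(sorted(field_names_optimized(schema)))  # ['root.leaf', 'root.nested.deep']
```

Both versions return only fully qualified leaf paths; intermediate nodes such as `root.nested` never appear in the result.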

Performance characteristics from tests:

  • Leaf nodes: 87-147% faster in the generated tests (simple cases benefit most from avoiding unnecessary set operations)
  • Nested structures: 50-75% faster (benefits from reduced recursion overhead and memory allocations)
  • Large flat structures: 70-75% faster (accumulation approach scales better than set comprehensions)
  • Deep nesting: 70-75% faster (reduced memory pressure and function call overhead)

The optimization is particularly effective for BigQuery schema processing with many nested fields, where the original approach's memory allocation overhead becomes a significant bottleneck.
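
For readers who want to see the direction of the difference locally, here is a rough timing sketch using the standard-library `timeit` module. It assumes the same reconstructed function bodies as described in the explanation (the names `SchemaField`, `field_names_original`, and `field_names_optimized` are hypothetical), and absolute numbers will vary by machine and Python version, so it is not expected to reproduce the figures reported here.

```python
import timeit


class SchemaField:
    """Hypothetical stand-in for bigquery.SchemaField."""

    def __init__(self, name, fields=None):
        self.name = name
        self.fields = fields or []


def field_names_original(schema_field):
    # Reconstructed from the patched helper in the generated tests.
    ancestor_names = {
        nested
        for field in schema_field.fields
        for nested in field_names_original(field)
    }
    if len(ancestor_names) == 0:
        return {schema_field.name}
    return {f"{schema_field.name}.{name}" for name in ancestor_names}


def field_names_optimized(schema_field):
    # Inferred from the PR description; the actual diff is not shown.
    if not schema_field.fields:
        return {schema_field.name}
    name_prefix = schema_field.name
    result = set()
    for field in schema_field.fields:
        for nested in field_names_optimized(field):
            result.add(f"{name_prefix}.{nested}")
    return result


# Wide, flat schema similar to the 1000-leaf generated test.
wide = SchemaField("root", [SchemaField(f"leaf{i}") for i in range(1000)])
assert field_names_original(wide) == field_names_optimized(wide)

calls = 20
timings = {}
for label, fn in (
    ("original", field_names_original),
    ("optimized", field_names_optimized),
):
    # Best of 3 repeats, averaged over `calls` invocations each.
    best = min(timeit.repeat(lambda: fn(wide), number=calls, repeat=3)) / calls
    timings[label] = best
    print(f"{label}: {best * 1e6:.1f} µs per call")
```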

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 38 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
from typing import Set

# imports
import pytest  # used for our unit tests
from aiplatform.datasets.column_names_dataset import _ColumnNamesDataset

# function to test
# -*- coding: utf-8 -*-

# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Minimal SchemaField stub for testing (since we can't import google.cloud.bigquery in tests)
class SchemaField:
    def __init__(self, name, fields=None):
        self.name = name
        self.fields = fields or []

# ------------------------
# Basic Test Cases
# ------------------------

def test_single_leaf_field():
    # Single field, no children
    field = SchemaField("col1")
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(field); result = codeflash_output # 894ns -> 446ns (100% faster)

def test_two_leaf_fields():
    # Parent with two leaf children
    field = SchemaField("parent", [
        SchemaField("child1"),
        SchemaField("child2"),
    ])
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(field); result = codeflash_output # 3.13μs -> 1.79μs (74.7% faster)

def test_nested_fields():
    # Parent -> child -> grandchild (all leaf at bottom)
    field = SchemaField("root", [
        SchemaField("branch", [
            SchemaField("leaf1"),
            SchemaField("leaf2"),
        ])
    ])
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(field); result = codeflash_output # 3.84μs -> 2.34μs (63.9% faster)

def test_mixed_leaf_and_nested_fields():
    # Parent with one leaf and one nested child
    field = SchemaField("top", [
        SchemaField("leaf"),
        SchemaField("nested", [
            SchemaField("deep_leaf"),
        ])
    ])
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(field); result = codeflash_output # 3.55μs -> 2.22μs (60.2% faster)

# ------------------------
# Edge Test Cases
# ------------------------

def test_empty_field_name():
    # Field name is empty string
    field = SchemaField("")
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(field); result = codeflash_output # 895ns -> 435ns (106% faster)

def test_empty_fields_list():
    # Field with empty fields list (should be treated as leaf)
    field = SchemaField("empty", [])
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(field); result = codeflash_output # 840ns -> 381ns (120% faster)

def test_deeply_nested_single_chain():
    # A chain: a.b.c.d.e
    field = SchemaField("a", [
        SchemaField("b", [
            SchemaField("c", [
                SchemaField("d", [
                    SchemaField("e")
                ])
            ])
        ])
    ])
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(field); result = codeflash_output # 4.49μs -> 2.56μs (75.5% faster)

def test_field_with_duplicate_names():
    # Sibling fields with the same name (the result is a set, so both collapse to "dup.same")
    field = SchemaField("dup", [
        SchemaField("same"),
        SchemaField("same"),
    ])
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(field); result = codeflash_output # 2.60μs -> 1.73μs (50.8% faster)

def test_field_with_non_str_name():
    # Field name is not a string (a leaf's name is returned unchanged, so the output holds the int)
    field = SchemaField(123)
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(field); result = codeflash_output # 878ns -> 458ns (91.7% faster)

def test_field_with_none_name():
    # Field name is None
    field = SchemaField(None)
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(field); result = codeflash_output # 858ns -> 459ns (86.9% faster)

def test_field_with_none_fields():
    # Field with fields=None (should be treated as leaf)
    field = SchemaField("nonefields", None)
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(field); result = codeflash_output # 818ns -> 413ns (98.1% faster)

def test_field_with_empty_string_and_nested():
    # Empty string name with nested child
    field = SchemaField("", [
        SchemaField("child")
    ])
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(field); result = codeflash_output # 2.24μs -> 1.36μs (64.9% faster)

def test_field_with_special_characters():
    # Field names with dots and special chars
    field = SchemaField("a.b", [
        SchemaField("c-d", [
            SchemaField("e_f"),
        ])
    ])
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(field); result = codeflash_output # 3.04μs -> 1.83μs (66.0% faster)

# ------------------------
# Large Scale Test Cases
# ------------------------

def test_large_number_of_sibling_fields():
    # Parent with many leaf children
    num_children = 500
    field = SchemaField("parent", [
        SchemaField(f"child{i}") for i in range(num_children)
    ])
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(field); result = codeflash_output # 185μs -> 105μs (75.9% faster)
    expected = {f"parent.child{i}" for i in range(num_children)}


def test_large_tree_structure():
    # Root with 10 children, each with 10 grandchildren (total 100 leaf nodes)
    root = SchemaField("root", [
        SchemaField(f"child{i}", [
            SchemaField(f"grandchild{j}") for j in range(10)
        ]) for i in range(10)
    ])
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(root); result = codeflash_output # 57.9μs -> 36.4μs (59.2% faster)
    expected = {f"root.child{i}.grandchild{j}" for i in range(10) for j in range(10)}

def test_wide_and_deep_tree():
    # Root with 5 children, each with 5 children, each with 5 leaf nodes (total 125 leaf nodes)
    root = SchemaField("root", [
        SchemaField(f"child{i}", [
            SchemaField(f"grandchild{j}", [
                SchemaField(f"leaf{k}") for k in range(5)
            ]) for j in range(5)
        ]) for i in range(5)
    ])
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(root); result = codeflash_output # 94.4μs -> 60.0μs (57.5% faster)
    expected = {
        f"root.child{i}.grandchild{j}.leaf{k}"
        for i in range(5) for j in range(5) for k in range(5)
    }

# ------------------------
# Mutation-sensitive tests
# ------------------------

def test_mutation_missing_dot():
    # If dots are missing, test will fail
    field = SchemaField("a", [
        SchemaField("b", [
            SchemaField("c"),
        ])
    ])
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(field); result = codeflash_output # 2.99μs -> 1.71μs (75.5% faster)

def test_mutation_including_non_leaf_nodes():
    # Non-leaf nodes should NOT be included
    field = SchemaField("parent", [
        SchemaField("child", [
            SchemaField("leaf")
        ])
    ])
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(field); result = codeflash_output # 2.88μs -> 1.73μs (66.5% faster)

def test_mutation_leaf_with_empty_fields():
    # Leaf node with empty fields list should be included
    field = SchemaField("leaf", [])
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(field); result = codeflash_output # 854ns -> 346ns (147% faster)

def test_mutation_leaf_with_none_fields():
    # Leaf node with None fields should be included
    field = SchemaField("leaf", None)
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(field); result = codeflash_output # 859ns -> 405ns (112% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from typing import Set

# imports
import pytest  # used for our unit tests
from aiplatform.datasets.column_names_dataset import _ColumnNamesDataset


# Simulate bigquery.SchemaField for testing purposes
class MockSchemaField:
    def __init__(self, name, fields=None):
        self.name = name
        self.fields = fields if fields is not None else []

# ----------------------
# Basic Test Cases
# ----------------------

def test_single_leaf_field():
    # Single field, no nesting
    field = MockSchemaField("foo")
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(field); result = codeflash_output # 885ns -> 422ns (110% faster)

def test_multiple_leaf_fields():
    # Multiple fields, no nesting
    field1 = MockSchemaField("foo")
    field2 = MockSchemaField("bar")
    parent = MockSchemaField("root", fields=[field1, field2])
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(parent); result = codeflash_output # 2.85μs -> 1.73μs (64.1% faster)

def test_simple_nested_fields():
    # Nested fields: root -> child -> leaf
    leaf = MockSchemaField("leaf")
    child = MockSchemaField("child", fields=[leaf])
    root = MockSchemaField("root", fields=[child])
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(root); result = codeflash_output # 2.77μs -> 1.85μs (49.4% faster)

def test_two_level_nesting_multiple_leaves():
    # root -> child1 (leaf1, leaf2), child2 (leaf3)
    leaf1 = MockSchemaField("leaf1")
    leaf2 = MockSchemaField("leaf2")
    leaf3 = MockSchemaField("leaf3")
    child1 = MockSchemaField("child1", fields=[leaf1, leaf2])
    child2 = MockSchemaField("child2", fields=[leaf3])
    root = MockSchemaField("root", fields=[child1, child2])
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(root); result = codeflash_output # 4.74μs -> 3.10μs (53.0% faster)
    expected = {
        "root.child1.leaf1",
        "root.child1.leaf2",
        "root.child2.leaf3",
    }

# ----------------------
# Edge Test Cases
# ----------------------

def test_empty_fields_list():
    # Field with empty fields list should be treated as leaf node
    field = MockSchemaField("empty", fields=[])
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(field); result = codeflash_output # 852ns -> 430ns (98.1% faster)

def test_field_with_empty_name():
    # Field with empty string name
    field = MockSchemaField("")
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(field); result = codeflash_output # 834ns -> 398ns (110% faster)

def test_field_with_none_name():
    # Field with None as name
    field = MockSchemaField(None)
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(field); result = codeflash_output # 912ns -> 457ns (99.6% faster)

def test_deeply_nested_fields():
    # 5 levels deep nesting
    leaf = MockSchemaField("leaf")
    node4 = MockSchemaField("node4", fields=[leaf])
    node3 = MockSchemaField("node3", fields=[node4])
    node2 = MockSchemaField("node2", fields=[node3])
    node1 = MockSchemaField("node1", fields=[node2])
    root = MockSchemaField("root", fields=[node1])
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(root); result = codeflash_output # 5.07μs -> 2.94μs (72.9% faster)

def test_field_with_duplicate_names():
    # Sibling fields with the same name share the path "parent.dup", so the set holds one entry
    leaf1 = MockSchemaField("dup")
    leaf2 = MockSchemaField("dup")
    parent = MockSchemaField("parent", fields=[leaf1, leaf2])
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(parent); result = codeflash_output # 2.57μs -> 1.78μs (44.4% faster)

def test_field_with_no_fields_attribute():
    # Simulate missing 'fields' attribute (should treat as leaf)
    class BrokenField:
        def __init__(self, name):
            self.name = name
    field = BrokenField("foo")
    # Patch the function to handle missing 'fields' gracefully
    def patched_get_bq_schema_field_names_recursively(schema_field):
        fields = getattr(schema_field, "fields", [])
        ancestor_names = {
            nested_field_name
            for field in fields
            for nested_field_name in patched_get_bq_schema_field_names_recursively(field)
        }
        if len(ancestor_names) == 0:
            return {getattr(schema_field, "name", None)}
        else:
            return {f"{getattr(schema_field, 'name', None)}.{name}" for name in ancestor_names}
    result = patched_get_bq_schema_field_names_recursively(field)

def test_field_with_non_list_fields():
    # Simulate 'fields' attribute as None (should treat as leaf)
    field = MockSchemaField("foo", fields=None)
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(field); result = codeflash_output # 896ns -> 414ns (116% faster)

# ----------------------
# Large Scale Test Cases
# ----------------------

def test_large_number_of_leaf_fields():
    # root with 1000 leaf children
    leaves = [MockSchemaField(f"leaf{i}") for i in range(1000)]
    root = MockSchemaField("root", fields=leaves)
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(root); result = codeflash_output # 349μs -> 202μs (72.3% faster)
    expected = {f"root.leaf{i}" for i in range(1000)}

def test_large_nested_structure():
    # root -> node1 -> node2 -> ... -> nodeN -> leaf
    N = 50  # Deep nesting, but not exceeding 1000
    leaf = MockSchemaField("leaf")
    node = leaf
    for i in range(N, 0, -1):
        node = MockSchemaField(f"node{i}", fields=[node])
    root = MockSchemaField("root", fields=[node])
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(root); result = codeflash_output # 33.7μs -> 19.8μs (70.1% faster)
    expected_name = "root." + ".".join([f"node{i}" for i in range(1, N+1)]) + ".leaf"

def test_large_wide_and_deep_structure():
    # root with 10 children, each with 10 children, each with 10 leaf nodes (10*10*10 = 1000 leaves)
    leaves = []
    for i in range(10):
        for j in range(10):
            leaf_nodes = [MockSchemaField(f"leaf{i}_{j}_{k}") for k in range(10)]
            child = MockSchemaField(f"child{i}_{j}", fields=leaf_nodes)
            leaves.append(child)
    root = MockSchemaField("root", fields=leaves)
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(root); result = codeflash_output # 566μs -> 325μs (73.8% faster)
    expected = set()
    for i in range(10):
        for j in range(10):
            for k in range(10):
                expected.add(f"root.child{i}_{j}.leaf{i}_{j}_{k}")

def test_performance_on_large_flat_structure():
    # root with 999 leaf children (near the limit)
    leaves = [MockSchemaField(f"leaf{i}") for i in range(999)]
    root = MockSchemaField("root", fields=leaves)
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(root); result = codeflash_output # 336μs -> 195μs (71.8% faster)
    expected = {f"root.leaf{i}" for i in range(999)}

# ----------------------
# Additional Edge Cases
# ----------------------

def test_field_with_non_string_names():
    # Field names as integers
    leaf1 = MockSchemaField(123)
    leaf2 = MockSchemaField(456)
    root = MockSchemaField("root", fields=[leaf1, leaf2])
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(root); result = codeflash_output # 3.25μs -> 1.91μs (70.4% faster)

def test_field_with_special_characters_in_name():
    # Field names with special characters
    leaf1 = MockSchemaField("foo.bar")
    leaf2 = MockSchemaField("baz-qux")
    root = MockSchemaField("root", fields=[leaf1, leaf2])
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(root); result = codeflash_output # 2.82μs -> 1.74μs (62.7% faster)

def test_field_with_unicode_names():
    # Field names with unicode characters
    leaf1 = MockSchemaField("naïve")
    leaf2 = MockSchemaField("résumé")
    root = MockSchemaField("root", fields=[leaf1, leaf2])
    codeflash_output = _ColumnNamesDataset._get_bq_schema_field_names_recursively(root); result = codeflash_output # 3.09μs -> 2.01μs (53.7% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
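
The generated tests above compute `result` and, in places, `expected`, but the listing shows no explicit assertions, since equivalence is checked through `codeflash_output`. As a minimal sketch of what explicit checks could look like, the following runs a few of the same cases against a reference implementation taken from the patched helper above. `get_field_names` is a hypothetical name standing in for `_ColumnNamesDataset._get_bq_schema_field_names_recursively`.

```python
class MockSchemaField:
    """Simulates bigquery.SchemaField for testing purposes."""

    def __init__(self, name, fields=None):
        self.name = name
        self.fields = fields if fields is not None else []


def get_field_names(schema_field):
    # Reference logic, mirroring the patched helper in the generated tests.
    ancestor_names = {
        nested
        for field in schema_field.fields
        for nested in get_field_names(field)
    }
    if len(ancestor_names) == 0:
        return {schema_field.name}
    return {f"{schema_field.name}.{name}" for name in ancestor_names}


def test_two_level_nesting_multiple_leaves():
    child1 = MockSchemaField(
        "child1", fields=[MockSchemaField("leaf1"), MockSchemaField("leaf2")]
    )
    child2 = MockSchemaField("child2", fields=[MockSchemaField("leaf3")])
    root = MockSchemaField("root", fields=[child1, child2])
    assert get_field_names(root) == {
        "root.child1.leaf1",
        "root.child1.leaf2",
        "root.child2.leaf3",
    }


def test_duplicate_sibling_names_collapse():
    # Two siblings named "dup" produce the same path, so the set has one entry.
    parent = MockSchemaField(
        "parent", fields=[MockSchemaField("dup"), MockSchemaField("dup")]
    )
    assert get_field_names(parent) == {"parent.dup"}


def test_non_leaf_nodes_excluded():
    # Only the full leaf path is returned, never "parent" or "parent.child".
    root = MockSchemaField(
        "parent", fields=[MockSchemaField("child", fields=[MockSchemaField("leaf")])]
    )
    assert get_field_names(root) == {"parent.child.leaf"}


# Run directly when not using pytest.
for test in (
    test_two_level_nesting_multiple_leaves,
    test_duplicate_sibling_names_collapse,
    test_non_leaf_nodes_excluded,
):
    test()
```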

To edit these changes git checkout codeflash/optimize-_ColumnNamesDataset._get_bq_schema_field_names_recursively-mgje3iiu and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 9, 2025 12:24
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Oct 9, 2025