@codeflash-ai codeflash-ai bot commented Oct 10, 2025

📄 86% (0.86x) speedup for get_source_bucket in google/cloud/aiplatform/tensorboard/uploader_utils.py

⏱️ Runtime: 13.9 microseconds → 7.48 microseconds (best of 10 runs)
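For readers checking the percentages, the sketch below reproduces the headline figure from the two runtimes. The formula `before / after - 1` is an assumption that happens to match the numbers on this page, not a statement of how Codeflash computes them; the helper names are hypothetical.

```python
def speedup_percent(before: float, after: float) -> float:
    """Percent speedup, assuming speedup = before / after - 1."""
    return (before / after - 1.0) * 100.0


def time_saved_percent(before: float, after: float) -> float:
    """Percent of the original runtime that was removed."""
    return (1.0 - after / before) * 100.0


# Headline runtimes from this PR, in microseconds.
print(round(speedup_percent(13.9, 7.48)))     # 86
print(round(time_saved_percent(13.9, 7.48)))  # 46

# One of the per-call timings from the generated tests, in nanoseconds.
print(round(speedup_percent(775, 334)))       # 132
```

Under this convention an 86% speedup corresponds to roughly 46% less wall-clock time, so the two kinds of percentage should not be compared directly.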

📝 Explanation and details

The optimized code achieves an 86% speedup through two key optimizations:

1. Pre-compiled Regex Pattern
The original code passes the pattern string r"gs:\/\/(.*?)(?=\/|$)" to re.match() on every function call, so each call goes through the re module's pattern handling (a cache lookup at best, a full compile on a cache miss). The optimized version pre-compiles this pattern as a module-level constant _GS_BUCKET_REGEX and calls .match() directly on the compiled pattern. This removes that per-call overhead, cutting the regex matching time from 9478ns to 2723ns per hit (about 71% less time).
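A minimal sketch of the pre-compilation change described above. The pattern and the constant name `_GS_BUCKET_REGEX` come from this description; the two helper functions are hypothetical and return only the bucket name as a string, not a `storage.Bucket`, so the example runs without Google Cloud dependencies.

```python
import re
from typing import Optional

# Compiled once at import time and reused by every call.
_GS_BUCKET_REGEX = re.compile(r"gs:\/\/(.*?)(?=\/|$)")


def bucket_name_inline(logdir: str) -> Optional[str]:
    """Hypothetical 'before' shape: the pattern string is handed to re.match each call."""
    m = re.match(r"gs:\/\/(.*?)(?=\/|$)", logdir)
    return m[1] if m else None


def bucket_name_precompiled(logdir: str) -> Optional[str]:
    """Hypothetical 'after' shape: match directly on the compiled pattern."""
    m = _GS_BUCKET_REGEX.match(logdir)
    return m[1] if m else None


for path in ("gs://my-bucket/logs", "gs://my-bucket", "s3://my-bucket/logs", ""):
    # Both helpers should agree on every input.
    print(repr(path), bucket_name_inline(path), bucket_name_precompiled(path))
```

Both variants behave identically; only the per-call lookup of the pattern differs.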

2. Singleton Storage Client
The original code creates a new storage.Client() instance every time a valid GS path is encountered, which involves expensive authentication and initialization overhead (98.9% of total runtime). The optimized version implements a function-level singleton pattern using hasattr() to check if _storage_client exists, creating it only once and reusing it for subsequent calls. This dramatically reduces the per-call overhead for storage client creation.
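The function-level singleton described above could look roughly like the sketch below. `FakeStorageClient` is a hypothetical stand-in for `storage.Client` so the example runs without credentials, and `get_client` is an illustrative helper name; neither is taken from the actual diff.

```python
class FakeStorageClient:
    """Hypothetical stand-in for google.cloud.storage.Client."""

    instances_created = 0

    def __init__(self):
        # Count constructions so the reuse is observable.
        FakeStorageClient.instances_created += 1

    def bucket(self, name):
        return ("bucket", name)


def get_client():
    # Cache the client as an attribute on the function object itself.
    if not hasattr(get_client, "_storage_client"):
        get_client._storage_client = FakeStorageClient()
    return get_client._storage_client


first = get_client()
second = get_client()
print(first is second)                      # True
print(FakeStorageClient.instances_created)  # 1
```

Two trade-offs are worth keeping in mind when reviewing this pattern: the `hasattr()` check is not guarded by a lock, so concurrent first calls could each construct a client, and a cached client keeps whatever project and credentials it was created with for the life of the process.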

Performance Benefits by Test Case:

  • Non-GS paths (most common): 79-132% faster due to regex pre-compilation since no storage client is created
  • Empty strings: 79-89% faster from regex optimization alone
  • Invalid URI patterns: 74-104% faster from avoiding regex recompilation

The optimizations are particularly effective for workloads with repeated calls to get_source_bucket(), as the regex compilation and storage client initialization costs are amortized across multiple invocations rather than paid on every call.
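To see the amortization effect on the regex side locally, a rough `timeit` comparison such as the one below can be used. It is only a sketch: the sample paths are borrowed from the generated tests, the function names are hypothetical, and absolute timings will differ from the figures reported here depending on machine and Python version.

```python
import re
import timeit

PATTERN = r"gs:\/\/(.*?)(?=\/|$)"
COMPILED = re.compile(PATTERN)
PATHS = ["/local/path/to/logs", "s3://my-bucket/logs", "gs://my-bucket/logs", ""]


def _name(match):
    # Extract the bucket name, or None when the path is not a gs:// URI.
    return match[1] if match else None


def run_inline():
    # Pattern string handed to re.match on every call.
    return [_name(re.match(PATTERN, p)) for p in PATHS]


def run_precompiled():
    # Compiled pattern reused across calls.
    return [_name(COMPILED.match(p)) for p in PATHS]


inline_seconds = timeit.timeit(run_inline, number=10_000)
precompiled_seconds = timeit.timeit(run_precompiled, number=10_000)
print(f"inline:      {inline_seconds:.4f}s")
print(f"precompiled: {precompiled_seconds:.4f}s")
```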

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 13 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 80.0% |
🌀 Generated Regression Tests and Runtime
import re
from typing import Optional

# imports
import pytest  # used for our unit tests
from aiplatform.tensorboard.uploader_utils import get_source_bucket

# --- MOCKING google.cloud.storage for testing without external dependencies ---

# Minimal mock Bucket and Client classes to simulate google.cloud.storage
class MockBucket:
    def __init__(self, name):
        self.name = name

class MockClient:
    def bucket(self, name):
        return MockBucket(name)

# Patch point for our tests
class storage:
    Client = MockClient

# unit tests

# 1. Basic Test Cases

def test_non_gs_path_returns_none():
    # Non-gs paths should not resolve to a bucket
    codeflash_output = get_source_bucket("/local/path/to/logs") # 2.45μs -> 1.33μs (84.7% faster)
    assert codeflash_output is None
    codeflash_output = get_source_bucket("s3://my-bucket/logs") # 775ns -> 334ns (132% faster)
    assert codeflash_output is None
    codeflash_output = get_source_bucket("http://example.com/logs") # 606ns -> 295ns (105% faster)
    assert codeflash_output is None

def test_empty_string_returns_none():
    # An empty string should not resolve to a bucket
    codeflash_output = get_source_bucket("") # 1.64μs -> 917ns (79.3% faster)
    assert codeflash_output is None

#------------------------------------------------
import re
# Patch the storage module in this test context
import sys
import types
from typing import Optional

# imports
import pytest  # used for our unit tests
from aiplatform.tensorboard.uploader_utils import get_source_bucket
from google.cloud import storage

# function to test
# -*- coding: utf-8 -*-

# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#


# --- Minimal stub for google.cloud.storage for testing ---
class DummyBucket:
    def __init__(self, name):
        self.name = name

class DummyClient:
    def bucket(self, name):
        return DummyBucket(name)

storage = types.SimpleNamespace(Client=DummyClient)

# unit tests

# 1. Basic Test Cases

def test_non_gs_uri_returns_none():
    # Non-gs URIs should not resolve to a bucket
    codeflash_output = get_source_bucket("s3://bucket/path") # 2.40μs -> 1.32μs (80.8% faster)
    assert codeflash_output is None
    codeflash_output = get_source_bucket("http://bucket/path") # 744ns -> 427ns (74.2% faster)
    assert codeflash_output is None
    codeflash_output = get_source_bucket("file:///tmp/logs") # 644ns -> 318ns (103% faster)
    assert codeflash_output is None
    codeflash_output = get_source_bucket("/local/path/to/logs") # 554ns -> 272ns (104% faster)
    assert codeflash_output is None

def test_empty_string_returns_none():
    # An empty string should not resolve to a bucket
    codeflash_output = get_source_bucket("") # 1.67μs -> 882ns (89.5% faster)
    assert codeflash_output is None

def test_gs_uri_with_portion_that_looks_like_gs():
    # A path that contains 'gs://' somewhere other than the start is not a gs URI
    codeflash_output = get_source_bucket("some/path/gs://bucket") # 2.46μs -> 1.38μs (78.2% faster)
    assert codeflash_output is None

def test_bulk_invalid_uris():
    # Run a batch of malformed URIs through the function (no assertions are made).
    # Note: the pattern r"gs:\/\/(.*?)(?=\/|$)" only rejects the entries that do not
    # start with "gs://"; the others still match, some with an empty bucket name.
    invalid_uris = [
        "gs:/bucket",         # missing one slash
        "gs//bucket",         # missing colon
        "gs:////bucket",      # too many slashes
        "gcs://bucket",       # wrong scheme
        "gs://",              # no bucket
        "gs:///",             # no bucket, just slash
        "gs://?foo=bar",      # no bucket, just query
        "gs://#fragment",     # no bucket, just fragment
        "notags://bucket",    # wrong scheme
    ]
    for uri in invalid_uris:
        codeflash_output = get_source_bucket(uri)

To edit these changes, run `git checkout codeflash/optimize-get_source_bucket-mgkjphxe`, commit your edits, and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 10, 2025 07:49
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Oct 10, 2025