Skip to content

Conversation

Mantisus
Copy link
Collaborator

Description

  • Add support for non-GET requests for PlaywrightCrawler using Route

Issues

@Mantisus Mantisus requested a review from Copilot May 21, 2025 09:57
@Mantisus Mantisus self-assigned this May 21, 2025
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Adds support for issuing non-GET HTTP requests (including custom methods, headers, and payloads) in the PlaywrightCrawler by leveraging Playwright’s routing API, and updates the existing unit test to cover GET, POST, and POST-with-payload scenarios.

  • Introduces a private _prepare_request_handler to configure method, headers, and payload via route.continue_.
  • Emits a warning for performance implications when using non-GET methods.
  • Refactors test_basic_request to parametrize HTTP methods and payloads.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
tests/unit/crawlers/_playwright/test_playwright_crawler.py Refactored test_basic_request into a parametrized test covering GET, POST, and POST-with-payload.
src/crawlee/crawlers/_playwright/_playwright_crawler.py Added _prepare_request_handler, warning logic in _navigate, and route registration for custom requests.
Comments suppressed due to low confidence (3)

tests/unit/crawlers/_playwright/test_playwright_crawler.py:51

  • Consider adding a test case that passes custom headers (e.g. headers={'X-Test': 'value'}) to verify non-GET requests include expected HTTP headers.
async def test_basic_request(method: HttpMethod, path: str, payload: HttpPayload, server_url: URL) -> None:

src/crawlee/crawlers/_playwright/_playwright_crawler.py:230

  • Passing raw bytes as post_data may cause a type error in Playwright; consider decoding payload to str (e.g. payload.decode()) or otherwise converting it to the format expected by route.continue_.
await route.continue_(method=method, headers=dict(headers) if headers else None, post_data=payload)

src/crawlee/crawlers/_playwright/_playwright_crawler.py:230

  • Using dict(headers) may not correctly serialize the HttpHeaders model; consider calling headers.model_dump() or .dict() to produce a proper dict[str, str].
await route.continue_(method=method, headers=dict(headers) if headers else None, post_data=payload)

@Mantisus Mantisus requested review from janbuchar and vdusek May 21, 2025 10:04
@Mantisus Mantisus changed the title fix: add support non-GET request for PlaywrightCrawler fix: add support non-GET requests for PlaywrightCrawler May 21, 2025
Copy link
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM

Copy link
Collaborator

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, thanks! I onlyf found some naming and documentation issues.

await hook(pre_navigation_context)
yield pre_navigation_context

def _prepare_request_handler(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

request_handler is already taken, let's come up with a better name. How does _prepare_non_get_request_interceptor sound? Not great, huh... but it's a start, I guess.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps just _prepare_request_interceptor? The method is not public, maybe we can name it shorter.

headers: HttpHeaders | dict[str, str] | None = None,
payload: HttpPayload | None = None,
) -> Callable:
"""Create a route handler for Playwright to support non-GET methods with custom parameters.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd mention that the purpose of the interception is to add custom headers and payload to the request.

@Mantisus Mantisus requested a review from janbuchar May 26, 2025 14:03
@janbuchar janbuchar merged commit dbb9f44 into apify:master May 26, 2025
43 of 44 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PlaywrightCrawler silently switches POST requests to GET
3 participants