Conversation

Collaborator

@Pijukatel Pijukatel commented Apr 17, 2025

Description

Adds retry logic for unprocessed requests in `add_requests_batched`.
The retry recursively calls `_process_batch`, which first operates on the full request batch and then on batches of unprocessed requests until the retry limit is reached or all requests are processed. Each retry is performed after a delay that increases linearly with the attempt number.

Unprocessed requests are not counted in `request_queue.get_total_count`.
Adds a test.
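
The retry scheme described above can be sketched as follows. This is a minimal standalone illustration; `process_batch` and `submit` here are hypothetical stand-ins, not the actual crawlee signatures:

```python
import asyncio
from collections.abc import Awaitable, Callable, Sequence


async def process_batch(
    batch: Sequence[str],
    submit: Callable[[Sequence[str]], Awaitable[list[str]]],
    base_retry_wait: float = 1.0,
    attempt: int = 1,
    max_attempts: int = 5,
) -> list[str]:
    """Submit a batch; recursively retry whatever came back unprocessed,
    waiting base_retry_wait * attempt (linear backoff) between attempts."""
    unprocessed = await submit(batch)
    if unprocessed and attempt < max_attempts:
        await asyncio.sleep(base_retry_wait * attempt)
        return await process_batch(
            unprocessed, submit, base_retry_wait, attempt + 1, max_attempts
        )
    return list(unprocessed)
```

Any requests still returned after `max_attempts` are the ones a caller would then exclude from totals such as `get_total_count`.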

Issues

TODO: finish tests, formatting, and mypy checks
@github-actions github-actions bot added this to the 112nd sprint - Tooling team milestone Apr 17, 2025
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programmatically for some analytics. labels Apr 17, 2025
@Pijukatel Pijukatel marked this pull request as ready for review April 22, 2025 10:55
@Pijukatel Pijukatel requested a review from Copilot April 22, 2025 10:55
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds retry logic for unprocessed requests in the add_requests_batched function and introduces a unit test to validate the behavior.

  • Implements automatic retry for unprocessed requests with a maximum of 5 attempts.
  • Adds a new test (test_add_batched_requests_with_retry) to simulate unprocessed requests and verify that the total count is correctly updated.
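
A test along those lines might be sketched as follows. This is a hypothetical illustration using a mocked storage client; the names and response shape are assumptions, not the actual crawlee internals:

```python
import asyncio
from unittest.mock import AsyncMock


def test_retry_drains_unprocessed_requests():
    # Mocked storage client: the first batch_add_requests call reports one
    # request as unprocessed, and the retry reports none.
    client = AsyncMock()
    client.batch_add_requests.side_effect = [
        {"unprocessedRequests": ["https://example.com/b"]},
        {"unprocessedRequests": []},
    ]

    async def add_requests_batched(requests):
        # Simplified retry loop: resubmit until nothing is left unprocessed.
        while requests:
            response = await client.batch_add_requests(requests)
            requests = response["unprocessedRequests"]

    asyncio.run(
        add_requests_batched(["https://example.com/a", "https://example.com/b"])
    )
    assert client.batch_add_requests.await_count == 2
```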

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `tests/unit/storages/test_request_queue.py` | Added test to verify retry behavior and request count adjustments |
| `src/crawlee/storages/_request_queue.py` | Updated `add_requests_batched` to use a new `_process_batch` method with retry logic |
Comments suppressed due to low confidence (1)

tests/unit/storages/test_request_queue.py:309

  • The call to `service_locator.get_storage_client()` is not assigned to any variable or used in any manner. Consider removing it if it is unnecessary.

```python
service_locator.get_storage_client()
```

@Pijukatel Pijukatel requested review from vdusek and janbuchar April 22, 2025 11:16

```python
async def _process_batch(self, batch: Sequence[Request], base_retry_wait: timedelta, attempt: int = 1) -> None:
    max_attempts = 5
    response = await self._resource_client.batch_add_requests(batch)
```
Collaborator

While you're at it, could you please make sure that ApifyStorageClient in the SDK does not do any retries of its own?

Of course, this is assuming we find that undesirable, but I firmly believe that the retry logic should be concentrated in a single place.

Collaborator Author

I do not think it does. The retries that currently happen are in apify-client-python: first in the HTTP clients and then in the RequestQueueClients. Those retries are at a different level; both deal with HTTP response status codes, and neither retries based on the parsed content of a successful API call, which is what this PR adds.
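
To illustrate the distinction between the two levels, here is a rough sketch; the function names and the `unprocessedRequests` response shape are hypothetical, not the apify-client-python API:

```python
import asyncio


async def call_with_http_retry(do_call, max_attempts=3):
    """HTTP-level retry: repeat only when the request itself fails
    (e.g. a connection error), as the HTTP client layer does."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await do_call()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            await asyncio.sleep(0.1 * attempt)


async def add_requests_with_content_retry(requests, api_add, max_attempts=5):
    """Content-level retry (what this PR adds): the HTTP call succeeded,
    but the parsed body lists unprocessed requests; resubmit only those."""
    for attempt in range(1, max_attempts + 1):
        response = await call_with_http_retry(lambda: api_add(requests))
        requests = response["unprocessedRequests"]
        if not requests:
            break
        await asyncio.sleep(0.1 * attempt)
    return requests
```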

Collaborator

So ApifyStorageClient does its own retries via apify-client-python. Because the apify-client implementation also does its own kind of batching, the situation gets extra blurry. It doesn't make sense to change that in this PR, but I think we should make an issue to resolve this some day. If you agree, that is.

Collaborator

@vdusek vdusek left a comment

LGTM

@Pijukatel Pijukatel merged commit 7851175 into master Apr 23, 2025
23 checks passed
@Pijukatel Pijukatel deleted the handle-unprocessed-requests branch April 23, 2025 12:18
Mantisus pushed a commit to Mantisus/crawlee-python that referenced this pull request Apr 24, 2025
### Description
Adds retry logic for unprocessed requests in `add_requests_batched`.
The retry recursively calls `_process_batch`, which first operates on the full request batch and then on batches of unprocessed requests until the retry limit is reached or all requests are processed. Each retry is performed after a delay that increases linearly with the attempt number.

Unprocessed requests are not counted in `request_queue.get_total_count`.
Adds a test.

### Issues

- Closes: [Handle unprocessed requests in batch_add_requests](apify/apify-sdk-python#456)