Skip to content

Conversation

Mantisus
Copy link
Collaborator

@Mantisus Mantisus commented Sep 1, 2025

Description

  • Do not add a request that is already in progress for MemoryRequestQueueClient.

Issues

@Mantisus Mantisus requested a review from vdusek September 1, 2025 21:33
@Mantisus Mantisus self-assigned this Sep 1, 2025
@Mantisus Mantisus changed the title fix: Do not add a request that is already in progress to MemoryRequestQueueClient. fix: Do not add a request that is already in progress to MemoryRequestQueueClient Sep 1, 2025
@vdusek
Copy link
Collaborator

vdusek commented Sep 2, 2025

Old version:

┌───────────────────────────────┬────────────────┐
│ requests_finished             │ 9024           │
│ requests_failed               │ 0              │
│ retry_histogram               │ [9024]         │
│ request_avg_failed_duration   │ None           │
│ request_avg_finished_duration │ 482.4ms        │
│ requests_finished_per_minute  │ 541            │
│ requests_failed_per_minute    │ 0              │
│ request_total_duration        │ 1h 12min 33.1s │
│ requests_total                │ 9024           │
│ crawler_runtime               │ 16min 40.5s    │
└───────────────────────────────┴────────────────┘

New version:

┌───────────────────────────────┬─────────────┐
│ requests_finished             │ 4512        │
│ requests_failed               │ 0           │
│ retry_histogram               │ [4512]      │
│ request_avg_failed_duration   │ None        │
│ request_avg_finished_duration │ 444.8ms     │
│ requests_finished_per_minute  │ 579         │
│ requests_failed_per_minute    │ 0           │
│ request_total_duration        │ 33min 26.9s │
│ requests_total                │ 4512        │
│ crawler_runtime               │ 7min 47.7s  │
└───────────────────────────────┴─────────────┘

With the new version, we only process about half as many requests.

However, based on the SDK's deduplication, there should only be 2362 unique pages, so there still seems to be an issue.

cc @Pijukatel

@Mantisus
Copy link
Collaborator Author

Mantisus commented Sep 2, 2025

However, based on the SDK's deduplication, there should only be 2362 unique pages, so there still seems to be an issue.

2363 if the start URL is https://crawlee.dev/
4512 if the start URL is http://crawlee.dev/

This is a non-deduplication error.

In the JS version, similarly

@vdusek
Copy link
Collaborator

vdusek commented Sep 2, 2025

Oh, thanks for the explanation

@vdusek vdusek requested a review from Pijukatel September 2, 2025 11:47
Copy link
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, let's wait for @Pijukatel as well, but I believe this is fine

@vdusek vdusek merged commit 3af326c into apify:master Sep 3, 2025
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MemoryStorageClient issue with deduplication
3 participants