Conversation

Pijukatel
Collaborator

Description

This adds the possibility to stop the crawler from a user-defined handler function.
Updated docs and added an example.
Added a test.
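
For illustration, stopping the crawl from a handler could look roughly like this (a minimal sketch; the crawler class, import path, and the reason argument of stop() are assumptions and may differ from the final API):

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')
        # Stop the whole crawl from inside a user-defined handler,
        # e.g. once some condition on the current page is met.
        if 'stop-marker' in context.request.url:
            crawler.stop(reason=f'Stop called on {context.request.url}')

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())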

Issues

@github-actions github-actions bot added this to the 104th sprint - Tooling team milestone Dec 11, 2024
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. labels Dec 11, 2024
@Pijukatel Pijukatel added enhancement New feature or request. and removed tested Temporary label used only programatically for some analytics. labels Dec 11, 2024
@Pijukatel Pijukatel marked this pull request as ready for review December 11, 2024 13:50
@B4nan
Member

B4nan commented Dec 12, 2024

I thought we would use the same name as in the JS version; was there some discussion around this?

@Pijukatel
Collaborator Author

Pijukatel commented Dec 12, 2024

I thought we would use the same name as in the JS version; was there some discussion around this?

Well, this is my proposal. The reasoning is as follows:

I looked at the JS version and it has .teardown(). To me, calling crawler.stop() rather than crawler.teardown() is definitely better.
From the user's perspective, I imagine crawler.teardown() as internal machinery that I do not care about, something that crawler.stop() may call at some point, but that seems like an internal detail, not a user-exposed function to call in normal circumstances.

Also, crawler.stop() does not really do any teardown. It just "sets conditions" that will make the crawler stop on its own, calling whatever teardowns and cleanups it normally calls. (I did not introduce any new mechanism; I just abstracted how it was already used in the case of _max_requests_count_exceeded and made it available to the user.)

@B4nan
Member

B4nan commented Dec 12, 2024

So we don't wait for the in-progress tasks in this new stop method? Maybe we should have both then; I think we need a teardown method that cleans things up properly (waits for the started tasks with a timeout).

@Pijukatel
Collaborator Author

So we don't wait for the in-progress tasks in this new stop method? Maybe we should have both then; I think we need a teardown method that cleans things up properly (waits for the started tasks with a timeout).

Stop does the following:

* It stops processing of any **new** requests, even if there are requests left in whatever RequestProvider we use, by forcing __is_task_ready_function to return False.

* It forces __is_finished_function to return True.
  Those two actions make the AutoscaledPool shut down in the normal fashion, with all its usual teardown steps.

One example consequence of the "normal teardown" of AutoscaledPool is that it will wait for any still-running tasks: https://github.com/apify/crawlee-python/blob/master/src/crawlee/_autoscaling/autoscaled_pool.py#L262

I don't really see a use case for the teardown method being used externally. You can understand the stop method as "stop the crawler now and do whatever teardowns you normally do when the crawler finishes."

Maybe one difference to highlight:
In the JS version, teardown calls autoscaledPool.abort().
In Python, stop does not call autoscaled_pool.abort(); instead it "starves the autoscaled_pool into stopping itself" by denying it new tasks and letting it know its job is finished.
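
A rough sketch of that mechanism (heavily condensed; __is_task_ready_function, __is_finished_function and the _unexpected_stop flag are the names used in this PR, everything else, including the exact stop() signature, is simplified):

import logging


class BasicCrawler:
    def __init__(self) -> None:
        self._logger = logging.getLogger(__name__)
        self._unexpected_stop = False

    def stop(self, reason: str = 'Stop was called externally.') -> None:
        # Only record the intent to stop; the AutoscaledPool is never aborted directly.
        self._logger.info(f'Crawler.stop() was called with following reason: {reason}.')
        self._unexpected_stop = True

    async def __is_task_ready_function(self) -> bool:
        # Once a stop was requested, no new tasks are handed out,
        # even if the RequestProvider still has requests queued.
        if self._unexpected_stop:
            return False
        ...  # otherwise defer to the request provider as before

    async def __is_finished_function(self) -> bool:
        # Reporting "finished" lets the AutoscaledPool run its normal shutdown,
        # which waits for tasks that are already in progress.
        if self._unexpected_stop:
            return True
        ...  # otherwise defer to the request provider as before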

@github-actions github-actions bot added the tested Temporary label used only programatically for some analytics. label Dec 13, 2024
@Pijukatel
Collaborator Author

I added one more test that makes sure that ongoing requests are finished. In that test, concurrency is set to 2 and it visits 2 pages. One page triggers stop() immediately and the other triggers it after a short sleep. This creates a situation where the second request is still being processed after the first stop() was called. Running the test with DEBUG-level logs for autoscaled_pool demonstrates what is going on under the hood and how the autoscaled_pool's "natural teardown" waits for the ongoing requests to finish.

[crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
[crawlee._autoscaling.autoscaled_pool] DEBUG Scheduling a new task
[crawlee._autoscaling.autoscaled_pool] DEBUG Scheduling a new task
[crawlee._autoscaling.autoscaled_pool] DEBUG Not scheduling new tasks - already running at desired concurrency
[crawlee.basic_crawler._basic_crawler] INFO  Crawler.stop() was called with following reason: Stop called on  https://httpbin.org/1.
[crawlee._autoscaling.autoscaled_pool] DEBUG Worker task finished
[crawlee.basic_crawler._basic_crawler] INFO  The crawler will finish any remaining ongoing requests and shut down.
[crawlee._autoscaling.autoscaled_pool] DEBUG `is_finished_function` reports that we are finished
[crawlee._autoscaling.autoscaled_pool] DEBUG Terminating - waiting for tasks to complete
[crawlee.basic_crawler._basic_crawler] INFO  Crawler.stop() was called with following reason: Stop called on  https://httpbin.org/2.
[crawlee._autoscaling.autoscaled_pool] DEBUG Worker task finished
[crawlee._autoscaling.autoscaled_pool] DEBUG Worker tasks finished
[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
[crawlee._autoscaling.autoscaled_pool] DEBUG Pool cleanup finished
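
Such a test could look roughly like this (a simplified sketch rather than the exact test added in this PR; import paths, the ConcurrencySettings usage, and the URLs are illustrative assumptions):

import asyncio

from crawlee import ConcurrencySettings
from crawlee.basic_crawler import BasicCrawler, BasicCrawlingContext


async def test_stop_waits_for_ongoing_requests() -> None:
    processed: list[str] = []
    crawler = BasicCrawler(
        concurrency_settings=ConcurrencySettings(min_concurrency=2, max_concurrency=2),
    )

    @crawler.router.default_handler
    async def handler(context: BasicCrawlingContext) -> None:
        if context.request.url.endswith('/2'):
            # Keep the second request in flight while the first one calls stop().
            await asyncio.sleep(0.1)
        processed.append(context.request.url)
        crawler.stop(reason=f'Stop called on {context.request.url}')

    await crawler.run(['https://httpbin.org/1', 'https://httpbin.org/2'])

    # Both handlers ran to completion, even though stop() was called during the first one.
    assert sorted(processed) == ['https://httpbin.org/1', 'https://httpbin.org/2']


if __name__ == '__main__':
    asyncio.run(test_stop_waits_for_ongoing_requests())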

@Pijukatel Pijukatel requested a review from vdusek December 17, 2024 07:23
Collaborator

@vdusek vdusek left a comment

LGTM

@Pijukatel Pijukatel merged commit 6d01af4 into master Dec 18, 2024
23 checks passed
@Pijukatel Pijukatel deleted the stop-crawler branch December 18, 2024 13:22
     await store.set_value(key, value.content, value.content_type)

     async def __is_finished_function(self) -> bool:
+        self._stop_if_max_requests_count_exceeded()

Hi, I understand the logic, but I don't like the names:
_is_finished calls _stop_if -> IMO, going by the names of the methods alone, a "property getter" is stopping the crawler.

Collaborator Author

This might be an ugly name, but it is as explicit as it can get. It is an internal name, so it is no big deal to change. Do you have a preferred naming?

@MatousMarik MatousMarik Dec 18, 2024

Sorry, my idea is a bit different, but you don't have to consider it at all.

I will try to explain it:
What I don't like is the logic: previously, if you wanted to know whether you were stopped, you checked all the relevant flags and properties, mainly _max_requests_count_exceeded.
Now you are adding a new flag, _unexpected_stop. So why not just check as before plus _unexpected_stop? Why add a call named _stop_if_something before each of those checks, which does the same thing as reading the property _max_requests_count_exceeded, and then check only for the "unexpected" flag?
As I see it, you had one flag and wanted to add a second one that is different but is used in the same decision. So instead of checking them both, you decided to rename flag 1 and also set it whenever you would set flag 2.

I would say you did something like this:

i_am_dirty is flag 1

do_i_want_to_take_a_shower = decision: 
1. return i_am_dirty

now adding flag 2 = i_am_hot
the process would be =>
rename flag 1 to i_want_to_take_a_shower # that is always set when i_am_hot would be set, and

do_i_want_to_take_a_shower would change to:
1. if i_am_dirty_check: i_want_to_take_a_shower = true
2. return i_want_to_take_a_shower # this makes sense, because we're setting it also whenever we would set i_am_hot

Proposal:

  • keep original property _max_requests_count_exceeded
  • and also check for _unexpected_stop
  • if you don't want to check both of them at every place where you decide whether to stop (currently you are kind of doing that by 1. calling _stop_if..., 2. checking _unexpected_stop), you can create a new single point of truth like _should_stop_flag, a property that checks both _max_requests_count_exceeded and _unexpected_stop, and possibly others in the future

In the example:

do_i_want_to_take_a_shower:
1. return i_am_dirty or i_am_hot

Sorry for this useless comment [it might not even fit in the nitpick category]...
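
If it helps, the proposal could boil down to something like this (a sketch only; _should_stop is a hypothetical name for the _should_stop_flag mentioned above, while _max_requests_count_exceeded and _unexpected_stop are the existing ones):

class BasicCrawler:
    @property
    def _should_stop(self) -> bool:
        # Single point of truth combining all stop conditions; nothing with a
        # getter-like name mutates state, it only reads the individual flags.
        return self._max_requests_count_exceeded or self._unexpected_stop

    async def __is_finished_function(self) -> bool:
        if self._should_stop:
            return True
        ...  # otherwise defer to the request provider as before

    async def __is_task_ready_function(self) -> bool:
        if self._should_stop:
            return False
        ...  # otherwise defer to the request provider as before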


     async def __is_task_ready_function(self) -> bool:
-        if self._max_requests_count_exceeded:
+        self._stop_if_max_requests_count_exceeded()

Same as above

Labels
enhancement New feature or request. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.
Development

Successfully merging this pull request may close these issues.

Implement a way to stop crawler from the user function
4 participants