feat: Add stop method to BasicCrawler #807
Conversation
Add test. TODO: Document this and create example.
I thought we would use the same name as in the JS version. Was there some discussion around this?
Well, this is my proposal. The reasoning is as follows: I looked at the JS version and it has .teardown(). To me, calling crawler.stop() is definitely better than crawler.teardown(). Also, crawler.stop() does not really do any teardown. It just "sets conditions" that make the crawler stop on its own, calling whatever teardowns and cleanups it normally calls. (I did not introduce any new mechanism; I just abstracted how it was already used in the case of _max_requests_count_exceeded and made it available to the user.)
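For illustration, here is a minimal sketch of how stop() could be called from inside a request handler. The import paths, the router decorator, and the URLs are assumptions based on crawlee-python and may differ between versions:

import asyncio

from crawlee.crawlers import BasicCrawler, BasicCrawlingContext


async def main() -> None:
    crawler = BasicCrawler()

    @crawler.router.default_handler
    async def handler(context: BasicCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')
        # Illustrative condition: ask the crawler to stop once some page is seen.
        # stop() only "sets conditions"; the crawler then finishes on its own,
        # running its usual teardown and cleanup.
        if context.request.url.endswith('/last'):
            crawler.stop()

    await crawler.run(['https://example.com/', 'https://example.com/last'])


asyncio.run(main())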
So we don't wait for the in-progress tasks in this new stop() method?
Stop does the following:
One example consequence of the "normal teardown" of AutoscaledPool is that it will wait for any still-running tasks. https://github.com/apify/crawlee-python/blob/master/src/crawlee/_autoscaling/autoscaled_pool.py#L262 I don't really see a use case for the teardown method being used externally. You can understand the stop method as "stop the crawler now and do whatever teardowns you normally do when the crawler finishes."
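As a rough sketch of the mechanism described here (simplified stand-in code, not the actual crawlee source; all names are illustrative): stop() merely flips a flag that the pool's is-finished check reads on its next iteration, and the pool's own teardown then waits for tasks already in flight.

import asyncio


class PoolSketch:
    def __init__(self) -> None:
        self._stop_requested = False
        self._in_progress: set[asyncio.Task] = set()

    def stop(self) -> None:
        # No teardown here; just "set conditions" for a natural finish.
        self._stop_requested = True

    async def _is_finished(self) -> bool:
        # Polled by the pool's main loop; returning True ends the run.
        return self._stop_requested

    async def _teardown(self) -> None:
        # Mirrors the linked AutoscaledPool behavior: wait for running tasks.
        await asyncio.gather(*self._in_progress, return_exceptions=True)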
Maybe one difference to highlight:
Improve logs.
I added one more test that makes sure ongoing requests are finished. In that test, concurrency is set to 2 and the crawler visits 2 pages. One page triggers stop() immediately and the other triggers it after a short sleep. This creates a situation where the second request is still being processed after the first stop() was called. Running the test with the DEBUG log level for autoscaled_pool demonstrates what is going on under the hood and how the autoscaled_pool's "natural teardown" waits for the ongoing requests to finish.
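A hypothetical reconstruction of such a test might look like this (names, import paths, and the ConcurrencySettings usage are assumptions, not the PR's actual code):

import asyncio

from crawlee import ConcurrencySettings
from crawlee.crawlers import BasicCrawler, BasicCrawlingContext


async def test_stop_waits_for_ongoing_requests() -> None:
    finished: list[str] = []
    crawler = BasicCrawler(
        concurrency_settings=ConcurrencySettings(desired_concurrency=2),
    )

    @crawler.router.default_handler
    async def handler(context: BasicCrawlingContext) -> None:
        if 'slow' in context.request.url:
            # Keep this request in flight past the first stop() call.
            await asyncio.sleep(0.1)
        crawler.stop()
        finished.append(context.request.url)

    await crawler.run(['https://example.com/fast', 'https://example.com/slow'])

    # Both handlers ran to completion even though stop() was called early.
    assert len(finished) == 2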
Co-authored-by: Vlada Dusek <[email protected]>
LGTM
await store.set_value(key, value.content, value.content_type)

async def __is_finished_function(self) -> bool:
    self._stop_if_max_requests_count_exceeded()
Hi, I understand the logic, but I don't like the names:
_is_finished calls _stop_if..., so going by the method names, a "property getter" is stopping the crawler.
This might be an ugly name, but it is as explicit as it can get. It is an internal name, so it is no big deal to change. Do you have a preferred naming?
Sorry, my idea is a bit different, but you don't have to consider it at all.
I will try to explain it:
I don't like the logic: previously, if you wanted to know whether you were stopped, you checked all the relevant flags and properties, mainly _max_requests_count_exceeded.
Now you are adding a new flag, _unexpected_stop. So why don't you just check as before, plus _unexpected_stop? Why do you have to add a call named _stop_if_something before each of those checks, when it does the same thing as accessing the property _max_requests_count_exceeded and then checking only for the "unexpected" stop?
As I see it, you had 1 flag and wanted to add a second one, which is different but is used in the same decision. So instead of checking them both, you decided to rename flag 1 and also set it whenever you would set flag 2.
I would say you did something like this:
i_am_dirty is flag 1
do_i_want_to_take_a_shower = decision:
1. return i_am_dirty
Now adding flag 2 = i_am_hot, the process would be:
rename flag 1 to i_want_to_take_a_shower  # it is always set whenever i_am_hot would be, and
do_i_want_to_take_a_shower changes to:
1. if i_am_dirty_check: i_want_to_take_a_shower = true
2. return i_want_to_take_a_shower  # this makes sense, because we're also setting it whenever we would set i_am_hot
Proposal:
- keep original property _max_requests_count_exceeded
- and also check for _unexpected_stop
- if you don't want to check both of them everywhere you decide whether to stop (currently you are kind of doing that by 1. calling _stop_if..., 2. checking _unexpected_stop), you can create a new single point of truth like _should_stop_flag: a property that checks both _max_requests_count_exceeded and _unexpected_stop, and possibly other conditions in the future
In the example:
do_i_want_to_take_a_shower:
1. return i_am_dirty or i_am_hot
Sorry for this useless [it might not even fit in the nitpick category]...
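A minimal Python sketch of this proposal (all names illustrative, not the PR's code):

class CrawlerSketch:
    def __init__(self) -> None:
        self._unexpected_stop = False  # flag 2, set by stop()
        self._requests_handled = 0
        self._max_requests = 100

    @property
    def _max_requests_count_exceeded(self) -> bool:
        # Flag 1 stays a plain derived property, as before.
        return self._requests_handled >= self._max_requests

    @property
    def _should_stop(self) -> bool:
        # Single point of truth: OR the flags; more conditions can be added later.
        return self._max_requests_count_exceeded or self._unexpected_stop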
async def __is_task_ready_function(self) -> bool:
    if self._max_requests_count_exceeded:
        self._stop_if_max_requests_count_exceeded()
Same as above
Description
This adds the possibility to stop the crawler from a user-defined handler function.
Updated docs, added example.
Added test.
Issues