Conversation

Pijukatel
Collaborator

Description

This adds the possibility to stop the crawler from a user-defined handler function.
Updated docs and added an example.
Added a test.
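
For illustration, stopping the crawl from a handler could look roughly like this (a minimal sketch; the crawler class, import path, and the reason argument of stop() are assumptions and may differ from the final API):

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')
        # Stop the whole crawl from inside a user-defined handler,
        # e.g. once some condition on the current page is met.
        if 'stop-marker' in context.request.url:
            crawler.stop(reason=f'Stop called on {context.request.url}')

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())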

Issues

@github-actions github-actions bot added this to the 104th sprint - Tooling team milestone Dec 11, 2024
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. labels Dec 11, 2024
@Pijukatel Pijukatel added enhancement New feature or request. and removed tested Temporary label used only programatically for some analytics. labels Dec 11, 2024
@Pijukatel Pijukatel marked this pull request as ready for review December 11, 2024 13:50
@B4nan
Member

B4nan commented Dec 12, 2024

I thought we would use the same name as in the JS version; was there some discussion around this?

@Pijukatel
Collaborator Author

Pijukatel commented Dec 12, 2024

I thought we would use the same name as in the JS version; was there some discussion around this?

Well, this is my proposal. The reasoning is as follows:

I looked at the JS version and it has .teardown(). To me, calling crawler.stop() rather than crawler.teardown() is definitely better.
From the user's perspective, I imagine crawler.teardown() as internal machinery that I do not care about, something that crawler.stop() may call at some point, but that seems like an internal detail, not a user-exposed function to call in normal circumstances.

Also, crawler.stop() does not really do any teardown. It just "sets conditions" that will make the crawler stop on its own, calling whatever teardowns and cleanups it normally calls. (I did not introduce any new mechanism; I just abstracted how it was already used in the case of _max_requests_count_exceeded and made it available to the user.)

@B4nan
Member

B4nan commented Dec 12, 2024

So we don't wait for the in-progress tasks in this new stop method? Maybe we should have both then; I think we need a teardown method that cleans things up properly (waits for the started tasks with a timeout).

@Pijukatel
Collaborator Author

So we don't wait for the in-progress tasks in this new stop method? Maybe we should have both then; I think we need a teardown method that cleans things up properly (waits for the started tasks with a timeout).

Stop does the following:

* It stops processing of any **new** requests, even if there are requests left in whatever RequestProvider we use, by forcing __is_task_ready_function to return False.

* It forces __is_finished_function to return True.
  Those two actions make the AutoscaledPool shut down in the normal fashion, with all its usual teardown steps.

One example consequence of the "normal teardown" of AutoscaledPool is that it will wait for any still-running tasks: https://github.com/apify/crawlee-python/blob/master/src/crawlee/_autoscaling/autoscaled_pool.py#L262

I don't really see a use case for the teardown method being used externally. You can understand the stop method as "stop the crawler now and do whatever teardowns you normally do when the crawler finishes."

Maybe one difference to highlight:
In the JS version, teardown calls autoscaledPool.abort().
In Python, stop does not call autoscaled_pool.abort(); instead it "starves the autoscaled_pool into stopping itself" by denying it new tasks and letting it know its job is finished.
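
A rough sketch of that mechanism (heavily condensed; __is_task_ready_function, __is_finished_function and the _unexpected_stop flag are the names used in this PR, everything else, including the exact stop() signature, is simplified):

import logging


class BasicCrawler:
    def __init__(self) -> None:
        self._logger = logging.getLogger(__name__)
        self._unexpected_stop = False

    def stop(self, reason: str = 'Stop was called externally.') -> None:
        # Only record the intent to stop; the AutoscaledPool is never aborted directly.
        self._logger.info(f'Crawler.stop() was called with following reason: {reason}.')
        self._unexpected_stop = True

    async def __is_task_ready_function(self) -> bool:
        # Once a stop was requested, no new tasks are handed out,
        # even if the RequestProvider still has requests queued.
        if self._unexpected_stop:
            return False
        ...  # otherwise defer to the request provider as before

    async def __is_finished_function(self) -> bool:
        # Reporting "finished" lets the AutoscaledPool run its normal shutdown,
        # which waits for tasks that are already in progress.
        if self._unexpected_stop:
            return True
        ...  # otherwise defer to the request provider as before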

@github-actions github-actions bot added the tested Temporary label used only programatically for some analytics. label Dec 13, 2024
@Pijukatel
Collaborator Author

I added one more test that makes sure that ongoing requests are finished. In that test, concurrency is set to 2 and it visits 2 pages. One page triggers stop() immediately and the other triggers it after a short sleep. This creates a situation where the second request is still being processed after the first stop() was called. Running the test with DEBUG-level logs for autoscaled_pool demonstrates what is going on under the hood and how the autoscaled_pool's "natural teardown" waits for the ongoing requests to finish.

[crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
[crawlee._autoscaling.autoscaled_pool] DEBUG Scheduling a new task
[crawlee._autoscaling.autoscaled_pool] DEBUG Scheduling a new task
[crawlee._autoscaling.autoscaled_pool] DEBUG Not scheduling new tasks - already running at desired concurrency
[crawlee.basic_crawler._basic_crawler] INFO  Crawler.stop() was called with following reason: Stop called on  https://httpbin.org/1.
[crawlee._autoscaling.autoscaled_pool] DEBUG Worker task finished
[crawlee.basic_crawler._basic_crawler] INFO  The crawler will finish any remaining ongoing requests and shut down.
[crawlee._autoscaling.autoscaled_pool] DEBUG `is_finished_function` reports that we are finished
[crawlee._autoscaling.autoscaled_pool] DEBUG Terminating - waiting for tasks to complete
[crawlee.basic_crawler._basic_crawler] INFO  Crawler.stop() was called with following reason: Stop called on  https://httpbin.org/2.
[crawlee._autoscaling.autoscaled_pool] DEBUG Worker task finished
[crawlee._autoscaling.autoscaled_pool] DEBUG Worker tasks finished
[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
[crawlee._autoscaling.autoscaled_pool] DEBUG Pool cleanup finished
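
Such a test could look roughly like this (a simplified sketch rather than the exact test added in this PR; import paths, the ConcurrencySettings usage, and the URLs are illustrative assumptions):

import asyncio

from crawlee import ConcurrencySettings
from crawlee.basic_crawler import BasicCrawler, BasicCrawlingContext


async def test_stop_waits_for_ongoing_requests() -> None:
    processed: list[str] = []
    crawler = BasicCrawler(
        concurrency_settings=ConcurrencySettings(min_concurrency=2, max_concurrency=2),
    )

    @crawler.router.default_handler
    async def handler(context: BasicCrawlingContext) -> None:
        if context.request.url.endswith('/2'):
            # Keep the second request in flight while the first one calls stop().
            await asyncio.sleep(0.1)
        processed.append(context.request.url)
        crawler.stop(reason=f'Stop called on {context.request.url}')

    await crawler.run(['https://httpbin.org/1', 'https://httpbin.org/2'])

    # Both handlers ran to completion, even though stop() was called during the first one.
    assert sorted(processed) == ['https://httpbin.org/1', 'https://httpbin.org/2']


if __name__ == '__main__':
    asyncio.run(test_stop_waits_for_ongoing_requests())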

@Pijukatel Pijukatel requested a review from vdusek December 17, 2024 07:23
Collaborator

@vdusek vdusek left a comment

LGTM

@Pijukatel Pijukatel merged commit 6d01af4 into master Dec 18, 2024
23 checks passed
@Pijukatel Pijukatel deleted the stop-crawler branch December 18, 2024 13:22
     await store.set_value(key, value.content, value.content_type)

     async def __is_finished_function(self) -> bool:
+        self._stop_if_max_requests_count_exceeded()

Hi, I understand the logic, but I don't like the names:
_is_finished calls _stop_if -> IMO, going by the names of the methods alone, a "property getter" is stopping the crawler.

Collaborator Author

This might be an ugly name, but it is as explicit as it can get. It is an internal name, so it is no big deal to change. Do you have a preferred naming?

@MatousMarik MatousMarik Dec 18, 2024

Sorry, my idea is a bit different, but you don't have to consider it at all.

I will try to explain it:
What I don't like is the logic: previously, if you wanted to know whether you were stopped, you checked all the relevant flags and properties, mainly _max_requests_count_exceeded.
Now you are adding a new flag, _unexpected_stop. So why not just check as before plus _unexpected_stop? Why add a call named _stop_if_something before each of those checks, which does the same thing as reading the property _max_requests_count_exceeded, and then check only for the "unexpected" flag?
As I see it, you had one flag and wanted to add a second one that is different but is used in the same decision. So instead of checking them both, you decided to rename flag 1 and also set it whenever you would set flag 2.

I would say you did something like this:

i_am_dirty is flag 1

do_i_want_to_take_a_shower = decision: 
1. return i_am_dirty

now adding flag 2 = i_am_hot
the process would be =>
rename flag 1 to i_want_to_take_a_shower # that is always set when i_am_hot would be set, and

do_i_want_to_take_a_shower would change to:
1. if i_am_dirty_check: i_want_to_take_a_shower = true
2. return i_want_to_take_a_shower # this makes sense, because we're setting it also whenever we would set i_am_hot

Proposal:

  • keep original property _max_requests_count_exceeded
  • and also check for _unexpected_stop
  • if you don't want to check both of them at every place where you decide whether to stop (currently you are kind of doing that by 1. calling _stop_if..., 2. checking _unexpected_stop), you can create a new single point of truth like _should_stop_flag, a property that checks both _max_requests_count_exceeded and _unexpected_stop, and possibly others in the future

In the example:

do_i_want_to_take_a_shower:
1. return i_am_dirty or i_am_hot

Sorry for this useless comment [it might not even fit in the nitpick category]...
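
If it helps, the proposal could boil down to something like this (a sketch only; _should_stop is a hypothetical name for the _should_stop_flag mentioned above, while _max_requests_count_exceeded and _unexpected_stop are the existing ones):

class BasicCrawler:
    @property
    def _should_stop(self) -> bool:
        # Single point of truth combining all stop conditions; nothing with a
        # getter-like name mutates state, it only reads the individual flags.
        return self._max_requests_count_exceeded or self._unexpected_stop

    async def __is_finished_function(self) -> bool:
        if self._should_stop:
            return True
        ...  # otherwise defer to the request provider as before

    async def __is_task_ready_function(self) -> bool:
        if self._should_stop:
            return False
        ...  # otherwise defer to the request provider as before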


     async def __is_task_ready_function(self) -> bool:
-        if self._max_requests_count_exceeded:
+        self._stop_if_max_requests_count_exceeded()

Same as above

Labels
enhancement New feature or request. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.
Development

Successfully merging this pull request may close these issues.

Implement a way to stop crawler from the user function
4 participants