feat: Add stop method to BasicCrawler #807
Merged
7 commits
253041f Add possibility to stop crawler. (Pijukatel)
c0d3090 Add docs and example how to use crawler.stop() (Pijukatel)
b364e82 Add extra test for shutdown with ongoing requests. (Pijukatel)
2078ead Apply suggestions from code review (Pijukatel)
6a99031 Implicit type (Pijukatel)
da6ef6e Remove self._unexpected_stop_reason (Pijukatel)
ab46744 Update example title - review comment (Pijukatel)
beautifulsoup_crawler_stop.py (new file, +39 lines):

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Create an instance of the BeautifulSoupCrawler class, a crawler that automatically
    # loads the URLs and parses their HTML using the BeautifulSoup library.
    crawler = BeautifulSoupCrawler()

    # Define the default request handler, which will be called for every request.
    # The handler receives a context parameter, providing various properties and
    # helper methods. Here are a few key ones we use for demonstration:
    # - request: an instance of the Request class containing details such as the URL
    #   being crawled and the HTTP method used.
    # - soup: the BeautifulSoup object containing the parsed HTML of the response.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Create custom condition to stop crawler once it finds what it is looking for.
        if 'crawlee' in context.request.url:
            crawler.stop(reason='Manual stop of crawler after finding `crawlee` in the url.')

        # Extract data from the page.
        data = {
            'url': context.request.url,
        }

        # Push the extracted data to the default dataset. In local configuration,
        # the data will be stored as JSON files in ./storage/datasets/default.
        await context.push_data(data)

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
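The same pattern can gate `stop` on any condition the handler can observe. Below is a hedged variation that stops after a fixed number of items has been pushed; the `items_collected` counter, the threshold of 10, and the `enqueue_links()` call are illustrative additions, not part of this PR:

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()
    items_collected = 0  # illustrative counter, not part of the PR

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        nonlocal items_collected

        # Push the extracted data first, then decide whether to stop.
        await context.push_data({'url': context.request.url})
        items_collected += 1

        # Stop once enough items were collected; per the PR's semantics,
        # requests already in flight are still allowed to finish.
        if items_collected >= 10:
            crawler.stop(reason='Collected the requested number of items.')

        # Keep discovering new pages until stop() is called.
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())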
New documentation page (+15 lines):

---
id: crawler-stop
title: Stopping a Crawler with stop method
---

import ApiLink from '@site/src/components/ApiLink';
import CodeBlock from '@theme/CodeBlock';

import BeautifulSoupExample from '!!raw-loader!./code/beautifulsoup_crawler_stop.py';

This example demonstrates how to use the `stop` method of <ApiLink to="class/BasicCrawler">`BasicCrawler`</ApiLink> to stop a crawler once it finds what it is looking for. The method is available on all crawlers that inherit from <ApiLink to="class/BasicCrawler">`BasicCrawler`</ApiLink>; the example below shows it on <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>. Simply call `crawler.stop()` to stop the crawler: no new requests will be crawled, while requests that are already being processed concurrently are allowed to finish. The `stop` method takes an optional `reason` argument, a string that will be used in logs; it can improve log readability, especially if you have multiple conditions that trigger `stop`.

<CodeBlock className="language-python">
  {BeautifulSoupExample}
</CodeBlock>
Changes to `BasicCrawler` (reconstructed diff):

@@ -303,6 +303,8 @@ def __init__(
         self._failed = False
         self._abort_on_error = abort_on_error

+        self._unexpected_stop = False
+
     @property
     def log(self) -> logging.Logger:
         """The logger used by the crawler."""

@@ -328,13 +330,26 @@ def statistics(self) -> Statistics[StatisticsState]:
         """Statistics about the current (or last) crawler run."""
         return self._statistics

-    @property
-    def _max_requests_count_exceeded(self) -> bool:
-        """Whether the maximum number of requests to crawl has been reached."""
+    def stop(self, reason: str = 'Stop was called externally.') -> None:
+        """Set flag to stop crawler.
+
+        This stops current crawler run regardless of whether all requests were finished.
+
+        Args:
+            reason: Reason for stopping that will be used in logs.
+        """
+        self._logger.info(f'Crawler.stop() was called with following reason: {reason}.')
+        self._unexpected_stop = True
+
+    def _stop_if_max_requests_count_exceeded(self) -> None:
+        """Call `stop` when the maximum number of requests to crawl has been reached."""
         if self._max_requests_per_crawl is None:
-            return False
+            return

-        return self._statistics.state.requests_finished >= self._max_requests_per_crawl
+        if self._statistics.state.requests_finished >= self._max_requests_per_crawl:
+            self.stop(
+                reason=f'The crawler has reached its limit of {self._max_requests_per_crawl} requests per crawl. '
+            )

     async def _get_session(self) -> Session | None:
         """If session pool is being used, try to take a session from it."""

@@ -912,27 +927,25 @@ async def _commit_request_handler_result(
         await store.set_value(key, value.content, value.content_type)

     async def __is_finished_function(self) -> bool:
+        self._stop_if_max_requests_count_exceeded()
+        if self._unexpected_stop:
+            self._logger.info('The crawler will finish any remaining ongoing requests and shut down.')
+            return True
+
         request_provider = await self.get_request_provider()
         is_finished = await request_provider.is_finished()

-        if self._max_requests_count_exceeded:
-            self._logger.info(
-                f'The crawler has reached its limit of {self._max_requests_per_crawl} requests per crawl. '
-                f'All ongoing requests have now completed. Total requests processed: '
-                f'{self._statistics.state.requests_finished}. The crawler will now shut down.'
-            )
-            return True
-
         if self._abort_on_error and self._failed:
             return True

         return is_finished

     async def __is_task_ready_function(self) -> bool:
-        if self._max_requests_count_exceeded:
+        self._stop_if_max_requests_count_exceeded()
+        if self._unexpected_stop:
             self._logger.info(
-                f'The crawler has reached its limit of {self._max_requests_per_crawl} requests per crawl. '
-                f'The crawler will soon shut down. Ongoing requests will be allowed to complete.'
+                'No new requests are allowed because crawler `stop` method was called. '
+                'Ongoing requests will be allowed to complete.'
            )
             return False

(Inline review comment on the `self._stop_if_max_requests_count_exceeded()` call in `__is_task_ready_function`: "Same as above".)
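To make the control flow concrete, here is a minimal, self-contained sketch of the cooperative-stop pattern the diff implements. It is a toy, not crawlee's actual scheduling loop: the worker loop consults an is-task-ready style check before taking new work, so setting the flag prevents new requests while the in-flight one completes:

import asyncio


class MiniCrawler:
    """Toy model of the stop-flag pattern; not crawlee's BasicCrawler."""

    def __init__(self, urls: list[str]) -> None:
        self._queue = list(urls)
        self._unexpected_stop = False

    def stop(self, reason: str = 'Stop was called externally.') -> None:
        print(f'stop() called: {reason}')
        self._unexpected_stop = True

    def _is_task_ready(self) -> bool:
        # Mirrors __is_task_ready_function: no new tasks once stop() was called.
        return bool(self._queue) and not self._unexpected_stop

    async def run(self) -> None:
        while self._is_task_ready():
            url = self._queue.pop(0)
            # The in-flight request is awaited to completion even if stop()
            # fires while it runs, matching the PR's semantics.
            await self._handle(url)

    async def _handle(self, url: str) -> None:
        await asyncio.sleep(0)  # stand-in for a real fetch
        if 'crawlee' in url:
            self.stop(reason='Found target.')


asyncio.run(MiniCrawler(['https://example.com', 'https://crawlee.dev', 'https://apify.com']).run())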
Conversations
Hi, I understand the logic, but I don't like the names:
`__is_finished_function` calls `_stop_if_max_requests_count_exceeded` -> IMO, going by the method names, a "property getter" ends up stopping the crawler.
This might be an ugly name, but it is as explicit as it can get. It is an internal name, so it is no big deal to change. Do you have a preferred naming?
Sorry, my idea is a bit different, but you don't have to consider it at all.
I will try to explain it:
I don't like the logic that previously, if you wanted to know whether you were stopped, you checked all the relevant flags and properties, mainly `_max_requests_count_exceeded`.
Now you are adding a new flag, `_unexpected_stop`. So why not just check as before, plus `_unexpected_stop`? Why add a call named `_stop_if_something` before each of those checks, which does the same thing as accessing the property `_max_requests_count_exceeded`, and then check only for the "unexpected" flag?
As I see it, you had one flag and wanted to add a second one that is different but is used in the same decision. So instead of checking them both, you decided to rename flag 1 and also set it whenever you would set flag 2.
I would say you did something like this:
`i_am_dirty` is flag 1
now adding flag 2 = `i_am_hot`
the process would be =>
rename flag 1 to `i_want_to_take_a_shower` # that is always set when `i_am_hot` would be, and also when `i_am_dirty` would be
Proposal:
In the example:
Sorry for this useless [might not even fit in nitpick category]...
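For clarity, a hedged toy sketch of the alternative the reviewer describes: keep `_max_requests_count_exceeded` as a side-effect-free property and combine it with `_unexpected_stop` only at the decision point, instead of having one check set the other's flag. `ToyCrawler` and its fields are illustrative, not crawlee's actual `BasicCrawler`:

from __future__ import annotations


class ToyCrawler:
    """Toy model of the reviewer's two-flag proposal; not crawlee's BasicCrawler."""

    def __init__(self, max_requests: int | None = None) -> None:
        self._max_requests_per_crawl = max_requests
        self._requests_finished = 0
        self._unexpected_stop = False  # flag 2, set by stop()

    def stop(self) -> None:
        self._unexpected_stop = True

    @property
    def _max_requests_count_exceeded(self) -> bool:
        # Flag 1 stays a pure property, as it was before the PR.
        if self._max_requests_per_crawl is None:
            return False
        return self._requests_finished >= self._max_requests_per_crawl

    def _is_finished(self) -> bool:
        # Both conditions are checked here, rather than folding the
        # max-requests condition into _unexpected_stop via a _stop_if_... call.
        return self._unexpected_stop or self._max_requests_count_exceeded


crawler = ToyCrawler(max_requests=2)
crawler._requests_finished = 2
assert crawler._is_finished()  # limit reached, no stop() call needed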