
Commit 4172652

janbuchar and vdusek authored
feat!: Implement RequestManagerTandem, remove add_request from RequestList, accept any iterable in RequestList constructor (#777)
> Tandem, or in tandem, is an arrangement in which two or more animals, machines, or people are lined up one behind another, all facing in the same direction.[[1]](https://en.wikipedia.org/wiki/Tandem#cite_note-OED-1) Tandem can also be used more generally to refer to any group of persons or objects working together, not necessarily in line.[[1]](https://en.wikipedia.org/wiki/Tandem#cite_note-OED-1) (https://en.wikipedia.org/wiki/Tandem)

- Inspired by https://github.com/apify/crawlee/blob/4c95847d5cedd6514620ccab31d5b242ba76de80/packages/basic-crawler/src/internals/basic-crawler.ts#L1154-L1177 and related code in the same class
- In my opinion, it implements the feature more cleanly and without polluting `BasicCrawler` (...any further)
- The motivation for the feature is twofold:
    1. Apify Actor development - it is common that an Actor receives a `requestListSources` input from the user, which may be pretty complex (regexp-based extraction from remote URL lists) and which is usually parsed using `apify.RequestList.open`. At the same time, the Actor wants to use the built-in `RequestQueue`.
    2. Sitemap parsing (#248) - similar to 1, but not coupled to the Apify platform: we want to read URLs from a sitemap in the background, but the URLs should go through the standard request queue.

## Breaking changes

- `RequestList` no longer supports `.drop()`, `.reclaim_request()`, `.add_request()` and `.add_requests_batched()`; `RequestManagerTandem` with a `RequestQueue` should be used for this use case, and `await list.to_tandem()` can be used as a shortcut
- The `RequestProvider` interface has been renamed to `RequestManager` and moved to the `crawlee.request_loaders` package
- `RequestList` has been moved to the `crawlee.request_loaders` package
- The `BasicCrawler.get_request_provider` method has been renamed to `BasicCrawler.get_request_manager` and no longer accepts the `id` and `name` arguments
- The `request_provider` parameter of `BasicCrawler.__init__` has been renamed to `request_manager`

## TODO

- [x] new tests
- [x] fix existing tests

---------

Co-authored-by: Vlada Dusek <[email protected]>
1 parent 3dc1c7d · commit 4172652

24 files changed: +773 −496
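Taken together, the headline changes look like this in practice. The following is a minimal editorial sketch based on the diffs below, with placeholder URLs: the `RequestList` constructor now accepts any iterable (here a generator), and a list is attached to a crawler by joining it with the default request queue via `to_tandem()`.

```python
import asyncio

from crawlee.http_crawler import HttpCrawler
from crawlee.request_loaders import RequestList


async def main() -> None:
    # The RequestList constructor now accepts any iterable of requests,
    # e.g. a generator, not just a list (the URLs are placeholders).
    request_list = RequestList(
        f'https://crawlee.dev/python/page/{n}' for n in range(10)
    )

    # RequestList is now a read-only loader; to feed it to a crawler,
    # join it with the default request queue via the to_tandem() shortcut.
    crawler = HttpCrawler(request_manager=await request_list.to_tandem())

    await crawler.run()


asyncio.run(main())
```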
Lines changed: 1 addition & 12 deletions
```diff
@@ -1,6 +1,6 @@
 import asyncio
 
-from crawlee.storages import RequestList
+from crawlee.request_loaders import RequestList
 
 
 async def main() -> None:
@@ -11,24 +11,13 @@ async def main() -> None:
         requests=['https://apify.com/', 'https://crawlee.dev/', 'https://crawlee.dev/python/'],
     )
 
-    # You can interact with the request list in the same way as with the request queue.
-    await request_list.add_requests_batched(
-        [
-            'https://crawlee.dev/python/docs/quick-start',
-            'https://crawlee.dev/python/api',
-        ]
-    )
-
     # Fetch and process requests from the queue.
     while request := await request_list.fetch_next_request():
         # Do something with it...
 
         # And mark it as handled.
         await request_list.mark_request_as_handled(request)
 
-    # Remove the request queue.
-    await request_list.drop()
-
 
 if __name__ == '__main__':
     asyncio.run(main())
```
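Since `RequestList` no longer supports `add_requests_batched()` or `.drop()`, requests added at runtime belong in a `RequestQueue`; the two sources can then feed one consumer through a tandem. A sketch of that pattern, with placeholder URLs and assuming `RequestManagerTandem` exposes the same `fetch_next_request`/`mark_request_as_handled` interface as the storages it wraps:

```python
import asyncio

from crawlee.request_loaders import RequestList, RequestManagerTandem
from crawlee.storages import RequestQueue


async def main() -> None:
    # The static part of the workload stays in the read-only list.
    request_list = RequestList(['https://apify.com/', 'https://crawlee.dev/'])

    # Anything added (or dropped) at runtime goes through the queue.
    request_queue = await RequestQueue.open()
    await request_queue.add_request('https://crawlee.dev/python/docs/quick-start')

    # The tandem serves both sources through one RequestManager interface
    # (assumption: the tandem forwards fetch/mark calls like its members do).
    request_manager = RequestManagerTandem(request_list, request_queue)

    while request := await request_manager.fetch_next_request():
        # Do something with the request...
        await request_manager.mark_request_as_handled(request)


asyncio.run(main())
```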

docs/guides/code/request_storage/rl_with_crawler_example.py

Lines changed: 6 additions & 4 deletions
```diff
@@ -1,7 +1,7 @@
 import asyncio
 
 from crawlee.http_crawler import HttpCrawler, HttpCrawlingContext
-from crawlee.storages import RequestList
+from crawlee.request_loaders import RequestList
 
 
 async def main() -> None:
@@ -12,9 +12,11 @@ async def main() -> None:
         requests=['https://apify.com/', 'https://crawlee.dev/'],
     )
 
-    # Create a new crawler (it can be any subclass of BasicCrawler) and pass the request
-    # list as request provider to it. It will be managed by the crawler.
-    crawler = HttpCrawler(request_provider=request_list)
+    # Join the request list into a tandem with the default request queue
+    request_manager = await request_list.to_tandem()
+
+    # Create a new crawler (it can be any subclass of BasicCrawler) and pass the request manager tandem
+    crawler = HttpCrawler(request_manager=request_manager)
 
     # Define the default request handler, which will be called for every request.
     @crawler.router.default_handler
```
docs/guides/code/request_storage/rq_with_crawler_explicit_example.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -14,7 +14,7 @@ async def main() -> None:
 
     # Create a new crawler (it can be any subclass of BasicCrawler) and pass the request
     # list as request provider to it. It will be managed by the crawler.
-    crawler = HttpCrawler(request_provider=request_queue)
+    crawler = HttpCrawler(request_manager=request_queue)
 
     # Define the default request handler, which will be called for every request.
     @crawler.router.default_handler
```
docs/guides/code/request_storage/tandem_example.py

Lines changed: 23 additions & 0 deletions

```diff
@@ -0,0 +1,23 @@
+import asyncio
+
+from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext
+from crawlee.request_loaders import RequestList
+
+
+async def main() -> None:
+    # Create a static request list
+    request_list = RequestList(['https://crawlee.dev', 'https://apify.com'])
+
+    crawler = ParselCrawler(
+        # Requests from the list will be processed first, but only after they are enqueued in the default request queue
+        request_manager=await request_list.to_tandem(),
+    )
+
+    @crawler.router.default_handler
+    async def handler(context: ParselCrawlingContext) -> None:
+        await context.enqueue_links()  # New links will be enqueued directly to the queue
+
+    await crawler.run()
+
+
+asyncio.run(main())
```
docs/guides/code/request_storage/tandem_example_explicit.py

Lines changed: 27 additions & 0 deletions

```diff
@@ -0,0 +1,27 @@
+import asyncio
+
+from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext
+from crawlee.request_loaders import RequestList, RequestManagerTandem
+from crawlee.storages import RequestQueue
+
+
+async def main() -> None:
+    # Create a static request list
+    request_list = RequestList(['https://crawlee.dev', 'https://apify.com'])
+
+    # Open the default request queue
+    request_queue = await RequestQueue.open()
+
+    crawler = ParselCrawler(
+        # Requests from the list will be processed first, but only after they are enqueued in the default request queue
+        request_manager=RequestManagerTandem(request_list, request_queue),
+    )
+
+    @crawler.router.default_handler
+    async def handler(context: ParselCrawlingContext) -> None:
+        await context.enqueue_links()  # New links will be enqueued directly to the queue
+
+    await crawler.run()
+
+
+asyncio.run(main())
```

docs/guides/request_storage.mdx

Lines changed: 25 additions & 1 deletion
````diff
@@ -18,12 +18,14 @@ import RsHelperAddRequestsExample from '!!raw-loader!./code/request_storage/help
 import RsHelperEnqueueLinksExample from '!!raw-loader!./code/request_storage/helper_enqueue_links_example.py';
 import RsDoNotPurgeExample from '!!raw-loader!./code/request_storage/do_not_purge_example.py';
 import RsPurgeExplicitlyExample from '!!raw-loader!./code/request_storage/purge_explicitly_example.py';
+import TandemExample from '!!raw-loader!./code/request_storage/tandem_example.py';
+import ExplicitTandemExample from '!!raw-loader!./code/request_storage/tandem_example_explicit.py';
 
 This guide explains the different types of request storage available in Crawlee, how to store the requests that your crawler will process, and which storage type to choose based on your needs.
 
 ## Introduction
 
-All request storage types in Crawlee implement the same interface - <ApiLink to="class/RequestProvider">`RequestProvider`</ApiLink>. This unified interface allows them to be used in a consistent manner, regardless of the storage backend. The request providers are managed by storage clients - subclasses of <ApiLink to="class/BaseStorageClient">`BaseStorageClient`</ApiLink>. For instance, <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink> stores data in memory while it can also offload them to the local directory. Data are stored in the following directory structure:
+All request storage types in Crawlee implement the same interface - <ApiLink to="class/RequestManager">`RequestManager`</ApiLink>. This unified interface allows them to be used in a consistent manner, regardless of the storage backend. The request providers are managed by storage clients - subclasses of <ApiLink to="class/BaseStorageClient">`BaseStorageClient`</ApiLink>. For instance, <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink> stores data in memory while it can also offload them to the local directory. Data are stored in the following directory structure:
 
 ```text
 {CRAWLEE_STORAGE_DIR}/{request_provider}/{QUEUE_ID}/
@@ -95,6 +97,28 @@ TODO: write this section, once https://github.com/apify/crawlee-python/issues/99
 
 */}
 
+## Processing requests from multiple sources
+
+In some cases, you might need to combine requests from multiple sources, most frequently from a static list of URLs (such as <ApiLink to="class/RequestList">`RequestList`</ApiLink>) and a <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>, where the queue takes care of persistence and retrying failed requests.
+
+This use case is supported via the <ApiLink to="class/RequestManagerTandem">`RequestManagerTandem`</ApiLink> class. You may also use the <ApiLink to="class/RequestLoader#to_tandem">`RequestLoader.to_tandem`</ApiLink> method as a shortcut.
+
+<CodeBlock className="language-python">
+    {TandemExample}
+</CodeBlock>
+<Tabs groupId="request_manager_tandem">
+    <TabItem value="request_manager_tandem_helper" label="Using to_tandem helper" default>
+        <CodeBlock className="language-python">
+            {TandemExample}
+        </CodeBlock>
+    </TabItem>
+    <TabItem value="request_manager_tandem_explicit" label="Explicitly using RequestManagerTandem">
+        <CodeBlock className="language-python">
+            {ExplicitTandemExample}
+        </CodeBlock>
+    </TabItem>
+</Tabs>
+
 ## Request-related helpers
 
 We offer several helper functions to simplify interactions with request storages:
````
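For intuition, here is a rough behavioral sketch of what `RequestManagerTandem` automates - not the actual implementation: the read-only loader is drained into the queue, and everything is then consumed from the queue, which provides persistence and retries. URLs are placeholders.

```python
import asyncio

from crawlee.request_loaders import RequestList
from crawlee.storages import RequestQueue


async def main() -> None:
    request_list = RequestList(['https://crawlee.dev', 'https://apify.com'])
    request_queue = await RequestQueue.open()

    # Drain the loader into the queue (RequestManagerTandem automates this).
    while request := await request_list.fetch_next_request():
        await request_queue.add_request(request)
        await request_list.mark_request_as_handled(request)

    # From here on, all requests - including newly enqueued links - flow
    # through the queue alone.


asyncio.run(main())
```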

docs/introduction/code/02_bs.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -12,7 +12,7 @@ async def main() -> None:
     # And then you add one or more requests to it.
     await rq.add_request('https://crawlee.dev')
 
-    crawler = BeautifulSoupCrawler(request_provider=rq)
+    crawler = BeautifulSoupCrawler(request_manager=rq)
 
     # Define a request handler and attach it to the crawler using the decorator.
     @crawler.router.default_handler
```

docs/upgrading/upgrading_to_v0x.md

Lines changed: 11 additions & 0 deletions
```diff
@@ -26,6 +26,17 @@ This section summarizes the breaking changes between v0.4.x and v0.5.0.
 
 - Removed properties `json_` and `order_no`.
 
+### Request storages and loaders
+
+- The `request_provider` parameter of `BasicCrawler.__init__` has been renamed to `request_manager`
+- The `BasicCrawler.get_request_provider` method has been renamed to `BasicCrawler.get_request_manager` and it does not accept the `id` and `name` arguments anymore
+    - If using a specific request queue is desired, pass it as the `request_manager` on `BasicCrawler` creation
+- The `RequestProvider` interface has been renamed to `RequestManager` and moved to the `crawlee.request_loaders` package
+- `RequestList` has been moved to the `crawlee.request_loaders` package
+- `RequestList` does not support `.drop()`, `.reclaim_request()`, `.add_request()` and `add_requests_batched()` anymore
+    - It implements the new `RequestLoader` interface instead of `RequestManager`
+- `RequestManagerTandem` with a `RequestQueue` should be used to enable passing a `RequestList` (or any other `RequestLoader` implementation) as a `request_manager`, `await list.to_tandem()` can be used as a shortcut
+
 ### PlaywrightCrawler
 
 - The `PlaywrightPreNavigationContext` was renamed to `PlaywrightPreNavCrawlingContext`.
```
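To make the `get_request_provider` change concrete, here is a hypothetical migration of a call site - assuming the renamed `get_request_manager` keeps an async signature; the queue name is a placeholder:

```python
import asyncio

from crawlee.http_crawler import HttpCrawler
from crawlee.storages import RequestQueue


async def main() -> None:
    # v0.4.x: a specific queue could be obtained via
    #   await crawler.get_request_provider(id=..., name=...)
    # v0.5.0: pass the desired queue at creation instead.
    queue = await RequestQueue.open(name='my-queue')  # placeholder name
    crawler = HttpCrawler(request_manager=queue)

    # The renamed accessor returns whatever manager the crawler was given.
    manager = await crawler.get_request_manager()
    await manager.add_request('https://crawlee.dev/')

    await crawler.run()


asyncio.run(main())
```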
