feat!: Implement RequestManagerTandem, remove add_request from RequestList, accept any iterable in RequestList constructor (#777)
> Tandem, or in tandem, is an arrangement in which two or more animals, machines, or people are lined up one behind another, all facing in the same direction.[[1]](https://en.wikipedia.org/wiki/Tandem#cite_note-OED-1) Tandem can also be used more generally to refer to any group of persons or objects working together, not necessarily in line.[[1]](https://en.wikipedia.org/wiki/Tandem#cite_note-OED-1)
>
> — [Wikipedia: Tandem](https://en.wikipedia.org/wiki/Tandem)
- Inspired by
https://github.com/apify/crawlee/blob/4c95847d5cedd6514620ccab31d5b242ba76de80/packages/basic-crawler/src/internals/basic-crawler.ts#L1154-L1177
and related code in the same class
- In my opinion, it implements the feature more cleanly and without
polluting `BasicCrawler` (...any further)
- The motivation for the feature is twofold:
1. Apify Actor development - it is common that an Actor receives a
`requestListSources` input from the user, which may be pretty complex
(regexp-based extraction from remote URL lists), and which is usually
parsed using `apify.RequestList.open`. At the same time, the Actor wants
to use the built-in `RequestQueue`.
2. Sitemap parsing (#248) - similar to 1, but not coupled to the Apify
platform - we want to read URLs from a sitemap in the background, but
the URLs should go through the standard request queue.
## Breaking changes
- `RequestList` does not support `.drop()`, `.reclaim_request()`,
`.add_request()` and `add_requests_batched()` anymore
- `RequestManagerTandem` with a `RequestQueue` should be used for this
use case, `await list.to_tandem()` can be used as a shortcut
- The `RequestProvider` interface has been renamed to `RequestManager`
and moved to the `crawlee.request_loaders` package
- `RequestList` has been moved to the `crawlee.request_loaders` package
- The `BasicCrawler.get_request_provider` method has been renamed to
`BasicCrawler.get_request_manager` and it does not accept the `id` and
`name` arguments anymore
- The `request_provider` parameter of `BasicCrawler.__init__` has been
renamed to `request_manager`
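To illustrate the "accept any iterable in `RequestList` constructor" part of this change, here is a minimal, hypothetical sketch in plain Python — not crawlee's actual implementation; the class name `SimpleRequestList` and the method `fetch_next` are invented for illustration. The point is that a read-only loader can lazily consume any iterable of URLs:

```python
from collections.abc import Iterable, Iterator
from typing import Optional


class SimpleRequestList:
    """Illustrative stand-in for a read-only request list.

    Accepts any iterable of URLs (list, tuple, generator, ...) and
    consumes it lazily, so large or streamed sources need not be
    materialized up front.
    """

    def __init__(self, sources: Iterable[str]) -> None:
        self._iterator: Iterator[str] = iter(sources)
        self._exhausted = False

    def fetch_next(self) -> Optional[str]:
        """Return the next URL, or None once the source is exhausted."""
        try:
            return next(self._iterator)
        except StopIteration:
            self._exhausted = True
            return None

    @property
    def is_finished(self) -> bool:
        return self._exhausted


# Works with a plain list...
from_list = SimpleRequestList(['https://example.com/a', 'https://example.com/b'])
# ...and equally with a generator that yields URLs on demand.
from_gen = SimpleRequestList(f'https://example.com/page/{i}' for i in range(3))

print(from_list.fetch_next())                 # https://example.com/a
print(list(iter(from_gen.fetch_next, None)))  # the three generated URLs
```

Because the iterable is consumed lazily, a sitemap parser or a remote URL-list fetcher can hand over a generator and let the loader pull URLs on demand.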
## TODO
- [x] new tests
- [x] fix existing tests
---------
Co-authored-by: Vlada Dusek <[email protected]>
This guide explains the different types of request storage available in Crawlee, how to store the requests that your crawler will process, and which storage type to choose based on your needs.
## Introduction
All request storage types in Crawlee implement the same interface - <ApiLink to="class/RequestManager">`RequestManager`</ApiLink>. This unified interface allows them to be used in a consistent manner, regardless of the storage backend. The request providers are managed by storage clients - subclasses of <ApiLink to="class/BaseStorageClient">`BaseStorageClient`</ApiLink>. For instance, <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink> stores data in memory while it can also offload them to the local directory. Data are stored in the following directory structure:
## Processing requests from multiple sources
In some cases, you might need to combine requests from multiple sources, most frequently from a static list of URLs (such as <ApiLink to="class/RequestList">`RequestList`</ApiLink>) and a <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>, where the queue takes care of persistence and retrying failed requests.

This use case is supported via the <ApiLink to="class/RequestManagerTandem">`RequestManagerTandem`</ApiLink> class. You may also use the <ApiLink to="class/RequestLoader#to_tandem">`RequestLoader.to_tandem`</ApiLink> method as a shortcut.
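The mechanism can be sketched without crawlee at all. The following is a simplified, hypothetical model of the tandem idea — the class `TandemSketch` and its methods are invented for illustration, not crawlee's actual code. A finite, read-only loader is drained into a queue by a background task; writes such as `add_request` go to the queue side only; and failed requests are reclaimed by the queue for retry:

```python
import asyncio
from collections import deque


class TandemSketch:
    """Toy model of the loader + queue tandem (not crawlee's actual code)."""

    def __init__(self, loader_urls, max_retries=1):
        self._loader = deque(loader_urls)  # stands in for a read-only RequestList
        self._queue = asyncio.Queue()      # stands in for a RequestQueue
        self._retries: dict[str, int] = {}
        self._max_retries = max_retries
        self.handled: list[str] = []

    async def add_request(self, url: str) -> None:
        # Writes only ever touch the queue side; the loader stays read-only.
        await self._queue.put(url)

    async def _transfer(self) -> None:
        # Background task: feed the loader's URLs into the queue.
        while self._loader:
            await self._queue.put(self._loader.popleft())

    async def run(self, handler) -> None:
        await asyncio.create_task(self._transfer())
        while not self._queue.empty():
            url = await self._queue.get()
            try:
                await handler(url)
                self.handled.append(url)
            except Exception:
                # Reclaim the failed request until retries are exhausted.
                attempts = self._retries.get(url, 0)
                if attempts < self._max_retries:
                    self._retries[url] = attempts + 1
                    await self._queue.put(url)


async def main() -> list[str]:
    tandem = TandemSketch(['https://example.com/a', 'https://example.com/b'])
    await tandem.add_request('https://example.com/extra')  # runtime addition

    failed_once = set()

    async def handler(url: str) -> None:
        # Simulate a transient failure on the first attempt at /b.
        if url.endswith('/b') and url not in failed_once:
            failed_once.add(url)
            raise RuntimeError('transient failure')

    await tandem.run(handler)
    return tandem.handled


handled = asyncio.run(main())
print(handled)
```

This mirrors the division of labor described above: the static source is consumed in the background, while the queue side handles runtime additions and retries of failed requests.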
**docs/upgrading/upgrading_to_v0x.md** (11 lines added)
This section summarizes the breaking changes between v0.4.x and v0.5.0.
- Removed properties `json_` and `order_no`.

### Request storages and loaders

- The `request_provider` parameter of `BasicCrawler.__init__` has been renamed to `request_manager`
- The `BasicCrawler.get_request_provider` method has been renamed to `BasicCrawler.get_request_manager` and it does not accept the `id` and `name` arguments anymore
  - If using a specific request queue is desired, pass it as the `request_manager` on `BasicCrawler` creation
- The `RequestProvider` interface has been renamed to `RequestManager` and moved to the `crawlee.request_loaders` package
- `RequestList` has been moved to the `crawlee.request_loaders` package
- `RequestList` does not support `.drop()`, `.reclaim_request()`, `.add_request()` and `add_requests_batched()` anymore
  - It implements the new `RequestLoader` interface instead of `RequestManager`
  - `RequestManagerTandem` with a `RequestQueue` should be used to enable passing a `RequestList` (or any other `RequestLoader` implementation) as a `request_manager`; `await list.to_tandem()` can be used as a shortcut
### PlaywrightCrawler

- The `PlaywrightPreNavigationContext` was renamed to `PlaywrightPreNavCrawlingContext`.