refactor!: Refactor storage creation and caching, configuration and services (#1386)
### Description
This is a collection of closely related changes that are hard to
separate from one another. The main purpose is to enable flexible
storage use across the code base without unexpected limitations and
limit unexpected side effects in global services.
#### Top-level changes:
- There can be multiple crawlers with different storage clients,
configurations, or event managers. (Previously, this raised a
`ServiceConflictError`.)
- `StorageInstanceManager` allows similar but distinct storage
instances to be used at the same time. (Previously, a similar storage
instance could be incorrectly retrieved from the cache instead of a new
storage instance being created.)
- Differently configured storages can be used at the same time, even
storages that use the same `StorageClient` and differ only in their
`Configuration`.
- `Crawler` can no longer cause side effects in the global
service_locator (apart from adding new instances to
`StorageInstanceManager`).
- The global `service_locator` can be used at the same time as local
instances of `ServiceLocator`. (For example, each crawler has its own
`ServiceLocator` instance, which does not interfere with the global
`service_locator`.)
- Services in `ServiceLocator` can be set only once; any attempt to
reset them raises an error. Using the services without setting them
first is still possible: the `ServiceLocator` then falls back to
implicit defaults and logs warnings, since implicit services can lead to
hard-to-predict code. The preferred way is to set services explicitly,
either manually or through helper code, for example through
`Actor`. [See related
PR](apify/apify-sdk-python#576)
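
The coexistence of the global locator and per-crawler locators described above can be sketched as follows. This is a hypothetical, simplified model (the `Locator` and `Crawler` classes here are illustrative, not Crawlee's actual implementation): a crawler reads the global service as an implicit default but never mutates it.

```python
# Hypothetical sketch of global/local service-locator coexistence.
# Not Crawlee's actual code; names are illustrative.

class Locator:
    def __init__(self, storage_client=None):
        self.storage_client = storage_client

# One global locator, analogous to crawlee.service_locator.
global_locator = Locator(storage_client="global-client")

class Crawler:
    def __init__(self, storage_client=None):
        # Use the explicitly passed service, or fall back to the global
        # one as an implicit default -- without mutating the global locator.
        self.services = Locator(storage_client or global_locator.storage_client)

c1 = Crawler(storage_client="client-1")
c2 = Crawler()
assert c1.services.storage_client == "client-1"
assert c2.services.storage_client == "global-client"
assert global_locator.storage_client == "global-client"  # global state unchanged
```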
#### Implementation notes:
- Storage caching now supports all relevant ways to distinguish storage
instances. Apart from generic parameters like `name`, `id`,
`storage_type`, and `storage_client_type`, there is also an
`additional_cache_key`. A `StorageClient` can use it to define its own
way of distinguishing two similar but different instances. For example,
`FileSystemStorageClient` depends on `Configuration.storage_dir`, so
`storage_dir` is included in its custom cache key, while
`MemoryStorageClient` omits it because `storage_dir` is not relevant to
it. See the example:
(This `additional_cache_key` could possibly be used for caching of NDU
in #1401.)
```python
from crawlee.configuration import Configuration
from crawlee.storage_clients import FileSystemStorageClient, MemoryStorageClient
from crawlee.storages import Dataset

storage_client = FileSystemStorageClient()
d1 = await Dataset.open(storage_client=storage_client, configuration=Configuration(storage_dir="path1"))
d2 = await Dataset.open(storage_client=storage_client, configuration=Configuration(storage_dir="path2"))
d3 = await Dataset.open(storage_client=storage_client, configuration=Configuration(storage_dir="path1"))

assert d2 is not d1
assert d3 is d1

storage_client_2 = MemoryStorageClient()
d4 = await Dataset.open(storage_client=storage_client_2, configuration=Configuration(storage_dir="path1"))
d5 = await Dataset.open(storage_client=storage_client_2, configuration=Configuration(storage_dir="path2"))

assert d4 is d5
```
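
The caching behavior above can be modeled with a small self-contained sketch. Everything here is hypothetical (`open_dataset`, the `additional_cache_key` method shape, and the dict cache are illustrative, not `StorageInstanceManager`'s real logic): the cache key combines the client type and name with whatever extra key the client contributes.

```python
# Hypothetical model of additional_cache_key-based instance caching.
# Names and structure are illustrative, not Crawlee's actual code.

class MemoryClient:
    def additional_cache_key(self, config):
        # Memory storage ignores storage_dir, so it contributes nothing.
        return None

class FileSystemClient:
    def additional_cache_key(self, config):
        # File-system storage depends on where data lives on disk.
        return config["storage_dir"]

class Dataset:
    pass

_cache = {}

def open_dataset(client, config, name="default"):
    # The key includes the client-specific additional cache key, so two
    # clients of the same type but different storage_dir get distinct entries.
    key = (type(client).__name__, name, client.additional_cache_key(config))
    if key not in _cache:
        _cache[key] = Dataset()
    return _cache[key]

fs = FileSystemClient()
d1 = open_dataset(fs, {"storage_dir": "path1"})
d2 = open_dataset(fs, {"storage_dir": "path2"})
d3 = open_dataset(fs, {"storage_dir": "path1"})

mem = MemoryClient()
d4 = open_dataset(mem, {"storage_dir": "path1"})
d5 = open_dataset(mem, {"storage_dir": "path2"})

assert d2 is not d1 and d3 is d1  # storage_dir is part of the key
assert d4 is d5                   # storage_dir ignored for memory storage
```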
- Each crawler creates its own instance of `ServiceLocator`. It uses
either the services passed explicitly to the crawler's init
(configuration, storage client, event manager) or the services from the
global `service_locator` as implicit defaults. This allows multiple
differently configured crawlers to work in the same code. For example:
```python
from crawlee.configuration import Configuration
from crawlee.crawlers import BasicCrawler
from crawlee.events import LocalEventManager
from crawlee.storage_clients import MemoryStorageClient

custom_configuration_1 = Configuration()
custom_event_manager_1 = LocalEventManager.from_config(custom_configuration_1)
custom_storage_client_1 = MemoryStorageClient()

custom_configuration_2 = Configuration()
custom_event_manager_2 = LocalEventManager.from_config(custom_configuration_2)
custom_storage_client_2 = MemoryStorageClient()
crawler_1 = BasicCrawler(
configuration=custom_configuration_1,
event_manager=custom_event_manager_1,
storage_client=custom_storage_client_1,
)
crawler_2 = BasicCrawler(
configuration=custom_configuration_2,
event_manager=custom_event_manager_2,
storage_client=custom_storage_client_2,
)
# use crawlers without runtime crash...
```
- `ServiceLocator` is now considerably stricter about setting services.
Previously, it allowed changing services until some service had its
`_was_retrieved` flag set to `True`, after which it threw a runtime
error. This led to hard-to-predict code, as the global `service_locator`
could be changed as a side effect from many places. Now the services in
`ServiceLocator` can be set only once, and the side effects of
attempting to change them are limited as much as possible. Such attempts
are also accompanied by warning messages to draw attention to code that
could cause a `RuntimeError`.
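
A minimal sketch of this set-once behavior, assuming a simplified single-service locator (the class below is illustrative; Crawlee's real `ServiceLocator` manages several services and differs in detail):

```python
# Hypothetical sketch of "set once" service semantics with an implicit
# default fallback. Not Crawlee's actual ServiceLocator implementation.
import warnings

class ServiceLocatorSketch:
    def __init__(self):
        self._storage_client = None

    def set_storage_client(self, client):
        # Services can be set only once; any later attempt raises.
        if self._storage_client is not None:
            raise RuntimeError("storage client is already set")
        self._storage_client = client

    def get_storage_client(self):
        # Using an unset service falls back to an implicit default,
        # with a warning, since implicit services are hard to predict.
        if self._storage_client is None:
            warnings.warn("Using implicit default storage client; prefer setting it explicitly.")
            self._storage_client = object()
        return self._storage_client

locator = ServiceLocatorSketch()
first = locator.get_storage_client()  # implicit default, emits a warning
assert locator.get_storage_client() is first  # stable afterwards

try:
    locator.set_storage_client(object())  # resetting is forbidden
except RuntimeError as exc:
    print(exc)
```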
### Issues
Closes: #1379
Connected to:
- #1354 (through necessary changes in `StorageInstanceManager`)
- apify/apify-sdk-python#513 (through
necessary changes in `StorageInstanceManager` and storage
clients/configuration-related changes in `service_locator`)
### Testing
- New unit tests were added.
- Tested on the `Apify` platform together with SDK changes in [related
PR](apify/apify-sdk-python#576)
---------
Co-authored-by: Vlada Dusek <[email protected]>
service_locator.set_storage_client(MemoryStorageClient())  # Raises an error
```

### BasicCrawler has its own instance of ServiceLocator to track its own services

Explicitly passed services to the crawler can be different from the global ones accessible in `crawlee.service_locator`. `BasicCrawler` no longer causes the global services in `service_locator` to be set to the crawler's explicitly passed services.

**Before (v0.6):**

```python
from crawlee import service_locator
from crawlee.crawlers import BasicCrawler
from crawlee.storage_clients import MemoryStorageClient
```