Commit 04649bd

Pijukatel and vdusek authored
refactor!: Refactor storage creation and caching, configuration and services (#1386)
### Description

This is a collection of closely related changes that are hard to separate from one another. The main purpose is to enable flexible storage use across the code base without unexpected limitations, and to limit unexpected side effects in global services.

#### Top-level changes:

- There can be multiple crawlers with different storage clients, configurations, or event managers. (Previously, this would cause `ServiceConflictError`.)
- `StorageInstanceManager` allows similar but different storage instances to be used at the same time. (Previously, a similar storage instance could be incorrectly retrieved instead of a new storage instance being created.)
- Differently configured storages can be used at the same time, even storages that use the same `StorageClient` and differ only in their `Configuration`.
- `Crawler` can no longer cause side effects in the global `service_locator` (apart from adding new instances to `StorageInstanceManager`).
- The global `service_locator` can be used at the same time as local instances of `ServiceLocator` (for example, each crawler has its own `ServiceLocator` instance, which does not interfere with the global `service_locator`).
- Services in `ServiceLocator` can be set only once. Any attempt to reset them will throw an error. Using the services without setting them is still possible: the `ServiceLocator` then sets them to an implicit default and logs a warning, because implicit services can lead to hard-to-predict code. The preferred way is to set services explicitly, either manually or through helper code such as `Actor`. [See related PR](apify/apify-sdk-python#576)

#### Implementation notes:

- Storage caching now supports all relevant ways to distinguish storage instances. Apart from generic parameters like `name`, `id`, `storage_type`, and `storage_client_type`, there is also an `additional_cache_key`. The `StorageClient` can use it to define its own way of distinguishing between two similar but different instances. For example, `FileSystemStorageClient` depends on `Configuration.storage_dir`, so `storage_dir` is included in the custom cache key for `FileSystemStorageClient`. This is not the case for `MemoryStorageClient`, for which `storage_dir` is not relevant. See the example. (This `additional_cache_key` could possibly be used for caching of NDU in #1401.)

  ```python
  storage_client = FileSystemStorageClient()
  d1 = await Dataset.open(storage_client=storage_client, configuration=Configuration(storage_dir="path1"))
  d2 = await Dataset.open(storage_client=storage_client, configuration=Configuration(storage_dir="path2"))
  d3 = await Dataset.open(storage_client=storage_client, configuration=Configuration(storage_dir="path1"))
  assert d2 is not d1
  assert d3 is d1

  storage_client_2 = MemoryStorageClient()
  d4 = await Dataset.open(storage_client=storage_client_2, configuration=Configuration(storage_dir="path1"))
  d5 = await Dataset.open(storage_client=storage_client_2, configuration=Configuration(storage_dir="path2"))
  assert d4 is d5
  ```

- Each crawler will create its own instance of `ServiceLocator`. It will use either the services explicitly passed to the crawler init (configuration, storage client, event manager) or the services from the global `service_locator` as implicit defaults. This allows multiple differently configured crawlers to work in the same code. For example:

  ```python
  custom_configuration_1 = Configuration()
  custom_event_manager_1 = LocalEventManager.from_config(custom_configuration_1)
  custom_storage_client_1 = MemoryStorageClient()

  custom_configuration_2 = Configuration()
  custom_event_manager_2 = LocalEventManager.from_config(custom_configuration_2)
  custom_storage_client_2 = MemoryStorageClient()

  crawler_1 = BasicCrawler(
      configuration=custom_configuration_1,
      event_manager=custom_event_manager_1,
      storage_client=custom_storage_client_1,
  )

  crawler_2 = BasicCrawler(
      configuration=custom_configuration_2,
      event_manager=custom_event_manager_2,
      storage_client=custom_storage_client_2,
  )

  # use crawlers without runtime crash...
  ```

- `ServiceLocator` is now much stricter when it comes to setting the services. Previously, it allowed changing services until some service had its `_was_retrieved` flag set to `True`; after that it would throw a runtime error. This led to hard-to-predict code, as the global `service_locator` could be changed as a side effect from many places. Now the services in `ServiceLocator` can be set only once, and the side effects of attempting to change the services are limited as much as possible. Such side effects are also accompanied by warning messages to draw attention to code that could cause a `RuntimeError`.

### Issues

Closes: #1379

Connected to:

- #1354 (through necessary changes in `StorageInstanceManager`)
- apify/apify-sdk-python#513 (through necessary changes in `StorageInstanceManager` and storage clients/configuration related changes in `service_locator`)

### Testing

- New unit tests were added.
- Tested on the `Apify` platform together with SDK changes in [related PR](apify/apify-sdk-python#576)

---------

Co-authored-by: Vlada Dusek <[email protected]>
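The `additional_cache_key` idea from the implementation notes can be illustrated with a small standalone sketch. It does not import crawlee: `ToyConfig`, `ToyFileSystemClient`, `ToyMemoryClient`, `ToyInstanceCache` and the `additional_cache_key` method are hypothetical names chosen for illustration, and the real `StorageInstanceManager` may be organized differently.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Hashable


@dataclass(frozen=True)
class ToyConfig:
    """Hypothetical stand-in for a configuration object."""

    storage_dir: str = './storage'


class ToyFileSystemClient:
    """Hypothetical client whose storages live under `storage_dir`."""

    def additional_cache_key(self, config: ToyConfig) -> Hashable:
        # The directory changes where data is stored, so it is part of the key.
        return config.storage_dir


class ToyMemoryClient:
    """Hypothetical client that ignores `storage_dir` entirely."""

    def additional_cache_key(self, config: ToyConfig) -> Hashable:
        # Nothing in the configuration distinguishes in-memory storages here.
        return ''


class ToyInstanceCache:
    """Keeps one cached object per (storage type, name, client type, extra key)."""

    def __init__(self) -> None:
        self._instances: dict[tuple, object] = {}

    def open(self, storage_type: str, name: str, client, config: ToyConfig) -> object:
        key = (storage_type, name, type(client), client.additional_cache_key(config))
        if key not in self._instances:
            self._instances[key] = object()  # placeholder for a real storage
        return self._instances[key]


cache = ToyInstanceCache()

fs_client = ToyFileSystemClient()
d1 = cache.open('Dataset', 'default', fs_client, ToyConfig(storage_dir='path1'))
d2 = cache.open('Dataset', 'default', fs_client, ToyConfig(storage_dir='path2'))
d3 = cache.open('Dataset', 'default', fs_client, ToyConfig(storage_dir='path1'))

memory_client = ToyMemoryClient()
d4 = cache.open('Dataset', 'default', memory_client, ToyConfig(storage_dir='path1'))
d5 = cache.open('Dataset', 'default', memory_client, ToyConfig(storage_dir='path2'))
```

Under these assumptions the sketch mirrors the identities asserted in the commit message: the file-system client yields separate instances per `storage_dir`, while the in-memory client reuses one.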
1 parent 2217894 commit 04649bd

29 files changed, +767 -457 lines changed

docs/guides/code_examples/service_locator/service_storage_configuration.py

Lines changed: 11 additions & 3 deletions

@@ -1,7 +1,9 @@
 import asyncio
 from datetime import timedelta

+from crawlee import service_locator
 from crawlee.configuration import Configuration
+from crawlee.storage_clients import MemoryStorageClient
 from crawlee.storages import Dataset


@@ -11,10 +13,16 @@ async def main() -> None:
         headless=False,
         persist_state_interval=timedelta(seconds=30),
     )
+    # Set the custom configuration as the global default configuration.
+    service_locator.set_configuration(configuration)

-    # Pass the configuration to the dataset (or other storage) when opening it.
-    dataset = await Dataset.open(
-        configuration=configuration,
+    # Use the global defaults when creating the dataset (or other storage).
+    dataset_1 = await Dataset.open()
+
+    # Or set explicitly specific configuration if
+    # you do not want to rely on global defaults.
+    dataset_2 = await Dataset.open(
+        storage_client=MemoryStorageClient(), configuration=configuration
     )


docs/upgrading/upgrading_to_v1.md

Lines changed: 91 additions & 0 deletions

@@ -189,6 +189,97 @@ The interface for custom storage clients has been simplified:
 - Collection storage clients have been removed.
 - The number of methods that have to be implemented has been reduced.

+## ServiceLocator changes
+
+### ServiceLocator is stricter with registering services
+You can register the services just once, and you can no longer override already registered services.
+
+**Before (v0.6):**
+```python
+from crawlee import service_locator
+from crawlee.storage_clients import MemoryStorageClient
+
+service_locator.set_storage_client(MemoryStorageClient())
+service_locator.set_storage_client(MemoryStorageClient())
+```
+**Now (v1.0):**
+
+```python
+from crawlee import service_locator
+from crawlee.storage_clients import MemoryStorageClient
+
+service_locator.set_storage_client(MemoryStorageClient())
+service_locator.set_storage_client(MemoryStorageClient())  # Raises an error
+```
+
+### BasicCrawler has its own instance of ServiceLocator to track its own services
+Services explicitly passed to the crawler can be different from the global ones accessible in `crawlee.service_locator`. `BasicCrawler` no longer causes the global services in `service_locator` to be set to the crawler's explicitly passed services.
+
+**Before (v0.6):**
+```python
+from crawlee import service_locator
+from crawlee.crawlers import BasicCrawler
+from crawlee.storage_clients import MemoryStorageClient
+from crawlee.storages import Dataset
+
+
+async def main() -> None:
+    custom_storage_client = MemoryStorageClient()
+    crawler = BasicCrawler(storage_client=custom_storage_client)
+
+    assert service_locator.get_storage_client() is custom_storage_client
+    assert await crawler.get_dataset() is await Dataset.open()
+```
+**Now (v1.0):**
+
+```python
+from crawlee import service_locator
+from crawlee.crawlers import BasicCrawler
+from crawlee.storage_clients import MemoryStorageClient
+from crawlee.storages import Dataset
+
+
+async def main() -> None:
+    custom_storage_client = MemoryStorageClient()
+    crawler = BasicCrawler(storage_client=custom_storage_client)
+
+    assert service_locator.get_storage_client() is not custom_storage_client
+    assert await crawler.get_dataset() is not await Dataset.open()
+```
+
+This allows two crawlers with different services to be used at the same time.
+
+**Now (v1.0):**
+
+```python
+from crawlee.crawlers import BasicCrawler
+from crawlee.storage_clients import MemoryStorageClient, FileSystemStorageClient
+from crawlee.configuration import Configuration
+from crawlee.events import LocalEventManager
+
+custom_configuration_1 = Configuration()
+custom_event_manager_1 = LocalEventManager.from_config(custom_configuration_1)
+custom_storage_client_1 = MemoryStorageClient()
+
+custom_configuration_2 = Configuration()
+custom_event_manager_2 = LocalEventManager.from_config(custom_configuration_2)
+custom_storage_client_2 = FileSystemStorageClient()
+
+crawler_1 = BasicCrawler(
+    configuration=custom_configuration_1,
+    event_manager=custom_event_manager_1,
+    storage_client=custom_storage_client_1,
+)
+
+crawler_2 = BasicCrawler(
+    configuration=custom_configuration_2,
+    event_manager=custom_event_manager_2,
+    storage_client=custom_storage_client_2,
+)
+
+# use crawlers without runtime crash...
+```
+
 ## Other smaller updates

 There are more smaller updates.
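
The separation between a crawler's own services and the global ones, as described in the upgrade notes above, can be sketched without crawlee. This is a simplified illustration under stated assumptions: `ToyLocator`, `ToyCrawler` and `global_locator` are hypothetical names, only one service is modeled, and the real `BasicCrawler` wiring may differ.

```python
from __future__ import annotations


class ToyLocator:
    """Hypothetical, simplified locator that holds a single service."""

    def __init__(self, storage_client: object | None = None) -> None:
        self._storage_client = storage_client

    def get_storage_client(self) -> object:
        if self._storage_client is None:
            # Implicit default, created lazily on first use.
            self._storage_client = object()
        return self._storage_client


# Stand-in for the module-level global locator.
global_locator = ToyLocator()


class ToyCrawler:
    """Hypothetical crawler that owns a local locator."""

    def __init__(self, storage_client: object | None = None) -> None:
        if storage_client is None:
            # Nothing passed explicitly: borrow the global default.
            storage_client = global_locator.get_storage_client()
        # The local locator is filled in; the global one is never written to.
        self.services = ToyLocator(storage_client)


custom_client = object()
crawler_1 = ToyCrawler(storage_client=custom_client)
crawler_2 = ToyCrawler()
```

In this sketch `crawler_1` keeps its custom client, `crawler_2` shares the global default, and constructing either crawler never replaces what the global locator holds.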

pyproject.toml

Lines changed: 1 addition & 1 deletion

@@ -40,7 +40,7 @@ dependencies = [
     "protego>=0.5.0",
     "psutil>=6.0.0",
     "pydantic-settings>=2.2.0,!=2.7.0,!=2.7.1,!=2.8.0",
-    "pydantic>=2.8.0,!=2.10.0,!=2.10.1,!=2.10.2",
+    "pydantic>=2.11.0",
     "pyee>=9.0.0",
     "tldextract>=5.1.0",
     "typing-extensions>=4.1.0",

src/crawlee/_autoscaling/snapshotter.py

Lines changed: 1 addition & 1 deletion

@@ -113,7 +113,7 @@ def from_config(cls, config: Configuration | None = None) -> Snapshotter:
         Args:
             config: The `Configuration` instance. Uses the global (default) one if not provided.
         """
-        config = service_locator.get_configuration()
+        config = config or service_locator.get_configuration()

         # Compute the maximum memory size based on the provided configuration. If `memory_mbytes` is provided,
         # it uses that value. Otherwise, it calculates the `max_memory_size` as a proportion of the system's
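
The one-line change in `from_config` is easier to see in isolation. The following standalone sketch contrasts the two behaviors; `ToyConfig`, `get_global_config`, `from_config_before` and `from_config_after` are hypothetical names, and the returned value only stands in for whatever the real `Snapshotter` derives from the configuration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ToyConfig:
    """Hypothetical stand-in for a configuration object."""

    memory_mbytes: int = 1024


_GLOBAL_CONFIG = ToyConfig()


def get_global_config() -> ToyConfig:
    """Stand-in for fetching the global configuration."""
    return _GLOBAL_CONFIG


def from_config_before(config: ToyConfig | None = None) -> int:
    # Old behavior: the argument is overwritten unconditionally,
    # so a configuration passed by the caller is silently ignored.
    config = get_global_config()
    return config.memory_mbytes


def from_config_after(config: ToyConfig | None = None) -> int:
    # New behavior: fall back to the global configuration only if none was passed.
    config = config or get_global_config()
    return config.memory_mbytes


custom = ToyConfig(memory_mbytes=256)
ignored = from_config_before(custom)
respected = from_config_after(custom)
fallback = from_config_after()
```

With these toy values, the old variant reads the global setting even though `custom` was passed, while the new variant uses `custom` and still falls back to the global configuration when called without arguments.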

src/crawlee/_service_locator.py

Lines changed: 44 additions & 24 deletions

@@ -11,6 +11,10 @@
 if TYPE_CHECKING:
     from crawlee.storages._storage_instance_manager import StorageInstanceManager

+from logging import getLogger
+
+logger = getLogger(__name__)
+

 @docs_group('Configuration')
 class ServiceLocator:
@@ -19,23 +23,24 @@ class ServiceLocator:
     All services are initialized to its default value lazily.
     """

-    def __init__(self) -> None:
-        self._configuration: Configuration | None = None
-        self._event_manager: EventManager | None = None
-        self._storage_client: StorageClient | None = None
-        self._storage_instance_manager: StorageInstanceManager | None = None
+    global_storage_instance_manager: StorageInstanceManager | None = None

-        # Flags to check if the services were already set.
-        self._configuration_was_retrieved = False
-        self._event_manager_was_retrieved = False
-        self._storage_client_was_retrieved = False
+    def __init__(
+        self,
+        configuration: Configuration | None = None,
+        event_manager: EventManager | None = None,
+        storage_client: StorageClient | None = None,
+    ) -> None:
+        self._configuration = configuration
+        self._event_manager = event_manager
+        self._storage_client = storage_client

     def get_configuration(self) -> Configuration:
         """Get the configuration."""
         if self._configuration is None:
+            logger.warning('No configuration set, implicitly creating and using default Configuration.')
             self._configuration = Configuration()

-        self._configuration_was_retrieved = True
         return self._configuration

     def set_configuration(self, configuration: Configuration) -> None:
@@ -47,21 +52,25 @@ def set_configuration(self, configuration: Configuration) -> None:
         Raises:
             ServiceConflictError: If the configuration has already been retrieved before.
         """
-        if self._configuration_was_retrieved:
+        if self._configuration is configuration:
+            # Same instance, no need to do anything
+            return
+        if self._configuration:
             raise ServiceConflictError(Configuration, configuration, self._configuration)

         self._configuration = configuration

     def get_event_manager(self) -> EventManager:
         """Get the event manager."""
         if self._event_manager is None:
-            self._event_manager = (
-                LocalEventManager().from_config(config=self._configuration)
-                if self._configuration
-                else LocalEventManager.from_config()
-            )
+            logger.warning('No event manager set, implicitly creating and using default LocalEventManager.')
+            if self._configuration is None:
+                logger.warning(
+                    'Implicit creation of event manager will implicitly set configuration as side effect. '
+                    'It is advised to explicitly first set the configuration instead.'
+                )
+            self._event_manager = LocalEventManager().from_config(config=self._configuration)

-        self._event_manager_was_retrieved = True
         return self._event_manager

     def set_event_manager(self, event_manager: EventManager) -> None:
@@ -73,17 +82,25 @@ def set_event_manager(self, event_manager: EventManager) -> None:
         Raises:
             ServiceConflictError: If the event manager has already been retrieved before.
         """
-        if self._event_manager_was_retrieved:
+        if self._event_manager is event_manager:
+            # Same instance, no need to do anything
+            return
+        if self._event_manager:
             raise ServiceConflictError(EventManager, event_manager, self._event_manager)

         self._event_manager = event_manager

     def get_storage_client(self) -> StorageClient:
         """Get the storage client."""
         if self._storage_client is None:
+            logger.warning('No storage client set, implicitly creating and using default FileSystemStorageClient.')
+            if self._configuration is None:
+                logger.warning(
+                    'Implicit creation of storage client will implicitly set configuration as side effect. '
+                    'It is advised to explicitly first set the configuration instead.'
+                )
             self._storage_client = FileSystemStorageClient()

-        self._storage_client_was_retrieved = True
         return self._storage_client

     def set_storage_client(self, storage_client: StorageClient) -> None:
@@ -95,21 +112,24 @@ def set_storage_client(self, storage_client: StorageClient) -> None:
         Raises:
             ServiceConflictError: If the storage client has already been retrieved before.
         """
-        if self._storage_client_was_retrieved:
+        if self._storage_client is storage_client:
+            # Same instance, no need to do anything
+            return
+        if self._storage_client:
             raise ServiceConflictError(StorageClient, storage_client, self._storage_client)

         self._storage_client = storage_client

     @property
     def storage_instance_manager(self) -> StorageInstanceManager:
-        """Get the storage instance manager."""
-        if self._storage_instance_manager is None:
+        """Get the storage instance manager. It is a global manager shared by all instances of ServiceLocator."""
+        if ServiceLocator.global_storage_instance_manager is None:
             # Import here to avoid circular imports.
             from crawlee.storages._storage_instance_manager import StorageInstanceManager  # noqa: PLC0415

-            self._storage_instance_manager = StorageInstanceManager()
+            ServiceLocator.global_storage_instance_manager = StorageInstanceManager()

-        return self._storage_instance_manager
+        return ServiceLocator.global_storage_instance_manager


 service_locator = ServiceLocator()
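
The set-once rule in this file can be exercised with a reduced, crawlee-free model. This is a sketch under stated assumptions, not the library code: `ToyServiceLocator` and `ToyConflictError` are hypothetical names, a plain `dict` stands in for `Configuration`, and the check uses `is not None` where the diff above uses truthiness.

```python
from __future__ import annotations

import logging

logger = logging.getLogger(__name__)


class ToyConflictError(Exception):
    """Hypothetical stand-in for `ServiceConflictError`."""


class ToyServiceLocator:
    """Reduced model of a locator with a single set-once service."""

    def __init__(self, configuration: dict | None = None) -> None:
        self._configuration = configuration

    def get_configuration(self) -> dict:
        if self._configuration is None:
            # Reading before setting pins an implicit default and warns about it.
            logger.warning('No configuration set, implicitly creating a default one.')
            self._configuration = {'storage_dir': './storage'}
        return self._configuration

    def set_configuration(self, configuration: dict) -> None:
        if self._configuration is configuration:
            return  # Same instance again: nothing to do.
        if self._configuration is not None:
            raise ToyConflictError('Configuration is already set.')
        self._configuration = configuration


# Explicit set, then a repeated set with the same object, then a conflicting one.
locator = ToyServiceLocator()
first = {'storage_dir': 'path1'}
locator.set_configuration(first)
locator.set_configuration(first)
try:
    locator.set_configuration({'storage_dir': 'path2'})
    conflict_raised = False
except ToyConflictError:
    conflict_raised = True

# Reading first creates the implicit default, so a later set conflicts as well.
lazy_locator = ToyServiceLocator()
lazy_locator.get_configuration()
try:
    lazy_locator.set_configuration({'storage_dir': 'path3'})
    late_conflict_raised = False
except ToyConflictError:
    late_conflict_raised = True
```

The second half shows why the warnings matter: once a getter has created an implicit default, any later explicit setter call conflicts with it, so setting services before first use avoids the error.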

src/crawlee/configuration.py

Lines changed: 1 addition & 1 deletion

@@ -28,7 +28,7 @@ class Configuration(BaseSettings):
     Settings can also be configured via environment variables, prefixed with `CRAWLEE_`.
     """

-    model_config = SettingsConfigDict(populate_by_name=True)
+    model_config = SettingsConfigDict(validate_by_name=True, validate_by_alias=True)

     internal_timeout: Annotated[timedelta | None, Field(alias='crawlee_internal_timeout')] = None
     """Timeout for the internal asynchronous operations."""
