Storage clients provide a unified interface for interacting with <ApiLink to="class/Dataset">`Dataset`</ApiLink>, <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>, and <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>, regardless of the underlying implementation. They handle operations like creating, reading, updating, and deleting storage instances, as well as managing data persistence and cleanup. This abstraction makes it easy to switch between different environments, such as local development and cloud production setups.
@@ -23,6 +26,7 @@ Crawlee provides three main storage client implementations:

- <ApiLink to="class/FileSystemStorageClient">`FileSystemStorageClient`</ApiLink> - Provides persistent file system storage with in-memory caching.
- <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink> - Stores data in memory with no persistence.
- <ApiLink to="class/SqlStorageClient">`SqlStorageClient`</ApiLink> - Provides persistent storage using a SQL database ([SQLite](https://sqlite.org/) or [PostgreSQL](https://www.postgresql.org/)). Requires installing the extra dependency: `crawlee[sql_sqlite]` for SQLite or `crawlee[sql_postgres]` for PostgreSQL.
- [`ApifyStorageClient`](https://docs.apify.com/sdk/python/reference/class/ApifyStorageClient) - Manages storage on the [Apify platform](https://apify.com), implemented in the [Apify SDK](https://github.com/apify/apify-sdk-python).

```mermaid
@@ -50,6 +54,8 @@ class FileSystemStorageClient

class MemoryStorageClient

class SqlStorageClient

class ApifyStorageClient

%% ========================
@@ -58,6 +64,7 @@ class ApifyStorageClient

StorageClient --|> FileSystemStorageClient
StorageClient --|> MemoryStorageClient
StorageClient --|> SqlStorageClient
StorageClient --|> ApifyStorageClient
```
@@ -125,6 +132,187 @@ The `MemoryStorageClient` does not persist data between runs. All data is lost w
{MemoryStorageClientBasicExample}
</RunnableCodeBlock>

### SQL storage client

:::warning Experimental feature
The `SqlStorageClient` is experimental. Its API and behavior may change in future releases.
:::

The <ApiLink to="class/SqlStorageClient">`SqlStorageClient`</ApiLink> provides persistent storage using a SQL database (SQLite by default, or PostgreSQL). It supports all Crawlee storage types and enables concurrent access from multiple independent clients or processes.

:::note dependencies
The <ApiLink to="class/SqlStorageClient">`SqlStorageClient`</ApiLink> is not included in the core Crawlee package.
To use it, you need to install Crawlee with the appropriate extra dependency:

- For SQLite support, run:
  <code>pip install 'crawlee[sql_sqlite]'</code>
- For PostgreSQL support, run:
  <code>pip install 'crawlee[sql_postgres]'</code>
:::

By default, <ApiLink to="class/SqlStorageClient">`SqlStorageClient`</ApiLink> uses SQLite.
To use PostgreSQL instead, provide a PostgreSQL connection string via the `connection_string` parameter. No other code changes are needed; the same client works for both databases.

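
The switch between the two databases can be sketched roughly as follows. This is a minimal illustration, not the library's prescribed setup: the helper names `pick_connection_string` and `open_client` and the `DATABASE_URL` environment variable are hypothetical, chosen only for this example. The `connection_string` parameter is the one named in this guide. The Crawlee import is deferred so the module still loads when the SQL extras are not installed.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from typing import Any


def pick_connection_string(env: Mapping[str, str] | None = None) -> str | None:
    """Return a connection string if one is configured, else None.

    `DATABASE_URL` is a hypothetical variable name used only in this sketch.
    Returning None means "let the client use its SQLite default".
    """
    env = os.environ if env is None else env
    url = env.get('DATABASE_URL', '').strip()
    return url or None


def open_client(connection_string: str | None = None) -> Any:
    """Build a SqlStorageClient for SQLite (default) or PostgreSQL."""
    # Lazy import: requires 'crawlee[sql_sqlite]' or 'crawlee[sql_postgres]'.
    from crawlee.storage_clients import SqlStorageClient

    if connection_string is None:
        # No connection string: the client falls back to SQLite.
        return SqlStorageClient()
    # Same client class, different backend selected by the URL alone.
    return SqlStorageClient(connection_string=connection_string)
```

With a value such as `postgresql+asyncpg://user:pass@host/db` in the (hypothetical) `DATABASE_URL` variable, `open_client(pick_connection_string())` would target PostgreSQL; with the variable unset, the same call would use SQLite.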
Configuration options for the <ApiLink to="class/SqlStorageClient">`SqlStorageClient`</ApiLink> can be set through environment variables or the <ApiLink to="class/Configuration">`Configuration`</ApiLink> class:

- **`storage_dir`** (env: `CRAWLEE_STORAGE_DIR`, default: `'./storage'`) - The root directory where the default SQLite database will be created if no connection string is provided.
- **`purge_on_start`** (env: `CRAWLEE_PURGE_ON_START`, default: `True`) - Whether to purge default storages on start.

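
To make the two options above concrete, the following sketch resolves them from an environment mapping using the documented defaults, then builds a `Configuration` from the result. The helper names and the set of accepted truthy spellings are assumptions for illustration; Crawlee's own parsing of `CRAWLEE_PURGE_ON_START` may differ.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from typing import Any


def resolve_storage_settings(env: Mapping[str, str] | None = None) -> dict[str, Any]:
    """Resolve the two documented options, falling back to their defaults."""
    env = os.environ if env is None else env
    storage_dir = env.get('CRAWLEE_STORAGE_DIR', './storage')
    raw_purge = env.get('CRAWLEE_PURGE_ON_START', 'true')
    # Assumption: these spellings count as "true" in this sketch only.
    purge_on_start = raw_purge.strip().lower() in {'1', 'true', 'yes', 'on'}
    return {'storage_dir': storage_dir, 'purge_on_start': purge_on_start}


def build_configuration(settings: dict[str, Any]) -> Any:
    """Create a Crawlee Configuration from the resolved settings."""
    # Lazy import so the sketch loads without Crawlee installed.
    from crawlee.configuration import Configuration

    return Configuration(**settings)
```

Setting the values explicitly in code, as `build_configuration` does, is an alternative to exporting the environment variables before the run.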
Further options for the <ApiLink to="class/SqlStorageClient">`SqlStorageClient`</ApiLink> are set via constructor arguments:

- **`connection_string`** (default: a SQLite database in the <ApiLink to="class/Configuration">`Configuration`</ApiLink> storage directory) - SQLAlchemy connection string, e.g. `sqlite+aiosqlite:///my.db` or `postgresql+asyncpg://user:pass@host/db`.

For advanced scenarios, you can configure <ApiLink to="class/SqlStorageClient">`SqlStorageClient`</ApiLink> with a custom SQLAlchemy engine and additional options via the <ApiLink to="class/Configuration">`Configuration`</ApiLink> class. This is useful, for example, when connecting to an external PostgreSQL database or customizing connection pooling.

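
A custom engine setup might look roughly like the sketch below. It assumes that `SqlStorageClient` accepts an `engine` keyword argument holding a SQLAlchemy async engine; that parameter name is an assumption here and should be checked against the API reference. The pool values are illustrative, not recommendations, and the helper name `build_client` is hypothetical.

```python
from __future__ import annotations

from typing import Any

# Illustrative connection-pool settings for a PostgreSQL engine.
POOL_OPTIONS: dict[str, Any] = {
    'pool_size': 5,         # connections kept open in the pool
    'max_overflow': 10,     # extra connections allowed under load
    'pool_recycle': 3600,   # recycle connections older than one hour
    'pool_pre_ping': True,  # check a connection is alive before using it
}


def build_client(url: str) -> Any:
    """Create a storage client backed by a custom SQLAlchemy async engine."""
    # Lazy imports: need SQLAlchemy and the Crawlee SQL extras installed.
    from sqlalchemy.ext.asyncio import create_async_engine

    from crawlee.storage_clients import SqlStorageClient

    engine = create_async_engine(url, **POOL_OPTIONS)
    # Assumption: the client takes a ready-made engine via `engine=`.
    return SqlStorageClient(engine=engine)
```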
A storage client consists of two parts: the storage client factory and individual storage type clients. The <ApiLink to="class/StorageClient">`StorageClient`</ApiLink> acts as a factory that creates specific clients (<ApiLink to="class/DatasetClient">`DatasetClient`</ApiLink>, <ApiLink to="class/KeyValueStoreClient">`KeyValueStoreClient`</ApiLink>, <ApiLink to="class/RequestQueueClient">`RequestQueueClient`</ApiLink>) where the actual storage logic is implemented.
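
The factory split described above can be illustrated with a simplified, self-contained sketch. The classes below are hypothetical stand-ins (note the `Sketch` suffix), not Crawlee's real interfaces, which define more methods and different signatures. The point is only the shape: a factory object hands out per-storage clients, and those clients hold the storage logic.

```python
from __future__ import annotations

import asyncio
from abc import ABC, abstractmethod


class DatasetClientSketch:
    """Per-storage client: the actual storage logic lives here (in memory)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._items: list[dict] = []

    async def push_data(self, item: dict) -> None:
        self._items.append(item)

    async def get_data(self) -> list[dict]:
        # Return a copy so callers cannot mutate internal state.
        return list(self._items)


class StorageClientSketch(ABC):
    """Factory: knows how to create clients, stores nothing itself."""

    @abstractmethod
    async def create_dataset_client(self, name: str) -> DatasetClientSketch:
        """Create a client for the dataset with the given name."""


class InMemoryStorageClientSketch(StorageClientSketch):
    """One concrete backend; another could return SQL-backed clients."""

    async def create_dataset_client(self, name: str) -> DatasetClientSketch:
        return DatasetClientSketch(name)


async def demo() -> list[dict]:
    factory = InMemoryStorageClientSketch()
    dataset = await factory.create_dataset_client('default')
    await dataset.push_data({'url': 'https://example.com'})
    return await dataset.get_data()
```

Swapping the backend then means swapping only the factory, while code that uses the returned clients stays the same.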