---
title: n8n - Website Content Crawler Actor integration
description: Learn about the Website Content Crawler module.
sidebar_label: Website Content Crawler
sidebar_position: 6
slug: /integrations/n8n/website-content-crawler
toc_max_heading_level: 4
---
## Website Content Crawler by Apify
Website Content Crawler from [Apify](https://apify.com/apify/website-content-crawler) lets you extract text content from websites to feed AI models, LLM applications, vector databases, or Retrieval Augmented Generation (RAG) pipelines. It supports rich formatting using Markdown, cleans the HTML of irrelevant elements, downloads linked files, and integrates with AI ecosystems like LangChain, LlamaIndex, and other LLM frameworks.
To use this module, you need an [API token](https://docs.apify.com/platform/integrations/api#api-token). You can find your token in the [Apify Console](https://console.apify.com/) under **Settings > Integrations**. After connecting, you can automate content extraction at scale and incorporate the results into your AI workflows.
## Install the Website Content Crawler by Apify Node (n8n Cloud)
For n8n Cloud users, installation is even simpler and doesn't require manual package entry. Just search and add the node from the canvas.
1. Go to the **Canvas** and open the **nodes panel**.
1. Search for **Website Content Crawler by Apify** in the community node registry.
1. Click **Install node** to add the Apify node to your instance.

:::note Verified community nodes visibility
On n8n Cloud, instance owners can toggle visibility of verified community nodes in the Cloud Admin Panel. Ensure this setting is enabled to install the Website Content Crawler by Apify node.
:::
## Connect Website Content Crawler by Apify (self-hosted)
1. Create an account at [Apify](https://console.apify.com/). You can sign up using your email, Gmail, or GitHub account.
1. Select **Connect my account** and authorize with your Apify account.
1. n8n automatically retrieves and stores the OAuth2 tokens.

:::note

For simplicity on n8n Cloud, use the API key method if you prefer manual control.

:::

With authentication set up, you can now create workflows that incorporate the Apify node.
## Website Content Crawler by Apify module
This module provides complete control over the content extraction process, allowing you to fine-tune every aspect of the crawling and transformation pipeline. It is ideal for complex websites, JavaScript-heavy applications, or cases where you need precise control over content extraction.
#### Key features
- _Multiple Crawler Options_: Choose between headless browsers (Playwright) or faster HTTP clients (Cheerio)
- _Custom Content Selection_: Specify exactly which elements to keep or remove
- _Advanced Navigation Control_: Set crawling depth, scope, and URL patterns
- _Dynamic Content Handling_: Wait for JavaScript-rendered content to load
- _Interactive Element Support_: Click expandable sections to reveal hidden content
- _Multiple Output Formats_: Save content as Markdown, HTML, or plain text
- _Proxy Configuration_: Use proxies to handle geo-restrictions or avoid IP blocks
- _Content Transformation Options_: Multiple algorithms for optimal content extraction

#### How it works
This module provides granular control over the entire crawling process. For _Crawler selection_, you can choose from Playwright (Firefox/Chrome) or Cheerio, depending on the complexity of the target website. _URL management_ allows you to define the crawling scope with include and exclude URL patterns. You can also exercise precise _DOM manipulation_ by controlling which HTML elements to keep or remove. To ensure the best results, you can apply specialized algorithms for _Content transformation_ and select from various _Output formatting_ options for better AI model compatibility.
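
To make this concrete, here is a minimal sketch of an Actor input that combines crawler selection, URL scoping, and element selection. The field names mirror the Actor's input schema at the time of writing and the URLs are placeholders, so treat this as illustrative rather than definitive:

```json title="Example input: crawler, scope, and element selection (illustrative)"
{
  "startUrls": [{ "url": "https://docs.example.com/" }],
  "crawlerType": "playwright:firefox",
  "includeUrlGlobs": [{ "glob": "https://docs.example.com/guides/**" }],
  "excludeUrlGlobs": [{ "glob": "https://docs.example.com/changelog/**" }],
  "keepElementsCssSelector": "main article",
  "removeElementsCssSelector": "nav, header, footer"
}
```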
#### Output data
For each crawled web page, you'll receive:

- _Markdown formatting_: Structured content with headers, lists, links, and other formatting preserved
- _Crawl information_: Loaded URL, referrer URL, timestamp, HTTP status
- _Optional file downloads_: PDFs, DOCs, and other linked documents
- _Multiple format options_: Content in Markdown, HTML, or plain text
- _Debug information_: Detailed extraction diagnostics and snapshots
- _HTML transformations_: Results from different content extraction algorithms
- _File storage options_: Flexible storage for HTML, screenshots, or downloaded files

```json title="Sample output (shortened)"
{
  ...
}
```
You can access any of the thousands of scrapers on Apify Store by using the [general Apify app](https://n8n.io/integrations/apify).
#### Configuration options
You can select the _Crawler type_ by choosing the rendering engine (browser or HTTP client) and the _Content extraction algorithm_ from multiple HTML transformers. _Element selectors_ allow you to specify which elements to keep, remove, or click, while _URL patterns_ let you define inclusion and exclusion rules with glob syntax. You can also set _Crawling parameters_ like concurrency, depth, timeouts, and retries. For robust crawling, you can adjust the _Proxy configuration_ and select from various _Output options_ for content formats and storage.
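
As a rough sketch, several of these options might combine into a single input like the one below. Again, the field names follow the Actor's input schema at the time of writing and may change, so verify them against the schema in Apify Console before relying on them:

```json title="Example input: crawling parameters and output options (illustrative)"
{
  "crawlerType": "cheerio",
  "maxCrawlDepth": 3,
  "maxCrawlPages": 100,
  "maxConcurrency": 10,
  "requestTimeoutSecs": 60,
  "maxRequestRetries": 3,
  "proxyConfiguration": { "useApifyProxy": true },
  "htmlTransformer": "readableText",
  "saveMarkdown": true,
  "saveHtml": false
}
```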
## Usage as an AI Agent Tool

You can set up the Website Content Crawler by Apify node as a tool for your AI Agents.

In the Website Content Crawler module, you can set the **Start URLs** to be filled in dynamically by your AI Agent. This allows the Agent to decide which pages to scrape.
Two key parameters to configure for optimized AI Agent usage are **Max crawling depth** and **Max pages**. Remember that the scraping results are passed into the AI Agent’s context, so using smaller values helps stay within context limits.
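
For example, assuming the node exposes these Actor input fields, an Agent-driven configuration might look like the sketch below. The `$fromAI` expression is n8n's mechanism for letting the Agent fill in a tool parameter at run time; the parameter name and the limits shown are illustrative:

```json title="Example AI Agent tool input (illustrative)"
{
  "startUrls": "{{ $fromAI('startUrls', 'List of URLs to crawl', 'json') }}",
  "maxCrawlDepth": 1,
  "maxCrawlPages": 5
}
```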