Commit 4cb4646

feat: update docs according to n8n WCC changes
1 parent 7a9ce7e commit 4cb4646

5 files changed: +32 -49 lines changed


sources/platform/integrations/workflows-and-notifications/n8n/ai-crawling.md renamed to sources/platform/integrations/workflows-and-notifications/n8n/website-content-crawler.md

Lines changed: 32 additions & 49 deletions
@@ -1,15 +1,15 @@
 ---
-title: n8n - AI crawling Actor integration
-description: Learn about AI Crawling scraper modules.
-sidebar_label: AI Crawling
+title: n8n - Website Content Crawler Actor integration
+description: Learn about Website Content Crawler module.
+sidebar_label: Website Content Crawler
 sidebar_position: 6
-slug: /integrations/n8n/ai-crawling
+slug: /integrations/n8n/website-content-crawler
 toc_max_heading_level: 4
 ---

-## Apify Scraper for AI Crawling
+## Website Content Crawler By Apify

-Apify Scraper for AI Crawling from [Apify](https://apify.com/apify/website-content-crawler) lets you extract text content from websites to feed AI models, LLM applications, vector databases, or Retrieval Augmented Generation (RAG) pipelines. It supports rich formatting using Markdown, cleans the HTML of irrelevant elements, downloads linked files, and integrates with AI ecosystems like LangChain, LlamaIndex, and other LLM frameworks.
+Website Content Crawler from [Apify](https://apify.com/apify/website-content-crawler) lets you extract text content from websites to feed AI models, LLM applications, vector databases, or Retrieval Augmented Generation (RAG) pipelines. It supports rich formatting using Markdown, cleans the HTML of irrelevant elements, downloads linked files, and integrates with AI ecosystems like LangChain, LlamaIndex, and other LLM frameworks.

 To use these modules, you need an [API token](https://docs.apify.com/platform/integrations/api#api-token). You can find your token in the [Apify Console](https://console.apify.com/) under **Settings > Integrations**. After connecting, you can automate content extraction at scale and incorporate the results into your AI workflows.

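To sanity-check the token outside of n8n, a minimal sketch using the `apify-client` npm package looks roughly like this (the input fields `startUrls` and `maxCrawlPages` are illustrative; check the Actor's input schema for the authoritative names):

```typescript
import { ApifyClient } from 'apify-client';

// Token from Apify Console under Settings > Integrations.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

async function crawlDocs(): Promise<void> {
  // Run Website Content Crawler and wait for it to finish.
  const run = await client.actor('apify/website-content-crawler').call({
    startUrls: [{ url: 'https://docs.apify.com/platform' }], // illustrative input
    maxCrawlPages: 10,
  });

  // Crawled pages are stored in the run's default dataset.
  const { items } = await client.dataset(run.defaultDatasetId).listItems();
  console.log(`Crawled ${items.length} pages`);
}

crawlDocs().catch(console.error);
```
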
@@ -33,21 +33,23 @@ If you're running a self-hosted n8n instance, you can install the Apify communit

 ![Apify Install Node](images/install.png)

-## Install the Apify Scraper for AI Crawling Node (n8n Cloud)
+## Install the Website Content Crawler by Apify Node (n8n Cloud)

 For n8n Cloud users, installation is even simpler and doesn't require manual package entry. Just search and add the node from the canvas.

 1. Go to the **Canvas** and open the **nodes panel**
-1. Search for **Apify Scraper for AI Crawling** in the community node registry
+1. Search for **Website Content Crawler by Apify** in the community node registry
 1. Click **Install node** to add the Apify node to your instance

+![Website Content Crawler by Apify on n8n](images/operations.png)
+
 :::note Verified community nodes visibility

-On n8n Cloud, instance owners can toggle visibility of verified community nodes in the Cloud Admin Panel. Ensure this setting is enabled to install the Apify Scraper for AI Crawling node.
+On n8n Cloud, instance owners can toggle visibility of verified community nodes in the Cloud Admin Panel. Ensure this setting is enabled to install the Website Content Crawler by Apify node.

 :::

-## Connect Apify Scraper for AI Crawling (self-hosted)
+## Connect Website Content Crawler by Apify (self-hosted)

 1. Create an account at [Apify](https://console.apify.com/). You can sign up using your email, Gmail, or GitHub account.

@@ -74,7 +76,7 @@ On n8n Cloud, instance owners can toggle visibility of verified community nodes
 1. Select **Connect my account** and authorize with your Apify account.
 1. n8n automatically retrieves and stores the OAuth2 tokens.

-![Apify Auth](../../images/n8n-oauth.png)
+![Apify Auth](images/credentials.png)

 :::note

@@ -84,19 +86,24 @@ For simplicity on n8n Cloud, use the API key method if you prefer manual control

 With authentication set up, you can now create workflows that incorporate the Apify node.

-## Apify Scraper for AI Crawling modules
+## Website Content Crawler by Apify module

-After connecting the app, you can use one of the two modules as native scrapers to extract website content.
+This module provides complete control over the content extraction process, allowing you to fine-tune every aspect of the crawling and transformation pipeline. This module is ideal for complex websites, JavaScript-heavy applications, or when you need precise control over content extraction.

-### Standard Settings module
+#### Key features

-The Standard Settings module lets you quickly extract content from websites using optimized default settings. This module is ideal for extracting content from blogs, documentation, and knowledge bases to feed into AI models.
+- _Multiple Crawler Options_: Choose between headless browsers (Playwright) or faster HTTP clients (Cheerio)
+- _Custom Content Selection_: Specify exactly which elements to keep or remove
+- _Advanced Navigation Control_: Set crawling depth, scope, and URL patterns
+- _Dynamic Content Handling_: Wait for JavaScript-rendered content to load
+- _Interactive Element Support_: Click expandable sections to reveal hidden content
+- _Multiple Output Formats_: Save content as Markdown, HTML, or plain text
+- _Proxy Configuration_: Use proxies to handle geo-restrictions or avoid IP blocks
+- _Content Transformation Options_: Multiple algorithms for optimal content extraction

 #### How it works

-The crawler starts with one or more URLs. It then crawls these initial URLs and discovers links to other pages on the same site, which it adds to a queue. The crawler will recursively follow these links as long as they are under the same path as the start URL. You can customize this behavior by defining specific URL patterns for inclusion or exclusion. To ensure efficiency, the crawler automatically skips any duplicate pages it encounters. A variety of settings are available to fine-tune the crawling process, including the crawler type, the maximum number of pages to crawl, the crawl depth, and concurrency.
-
-Once a page is loaded, the Actor processes its HTML to extract high-quality content. It can be configured to wait for dynamic content to load and can scroll the page to trigger the loading of additional content. To access information hidden in interactive sections, the crawler can be set up to expand clickable elements. It also cleans the HTML by removing irrelevant DOM nodes, such as navigation bars, headers, and footers, and can be configured to keep only the content that matches specific CSS selectors. The crawler also handles cookie warnings automatically and transforms the page to extract the main content.
+The Advanced Settings module provides granular control over the entire crawling process. For _Crawler selection_, you can choose from Playwright (Firefox/Chrome) or Cheerio, depending on the complexity of the target website. _URL management_ allows you to define the crawling scope with include and exclude URL patterns. You can also exercise precise _DOM manipulation_ by controlling which HTML elements to keep or remove. To ensure the best results, you can apply specialized algorithms for _Content transformation_ and select from various _Output formatting_ options for better AI model compatibility.

 #### Output data

@@ -107,6 +114,10 @@ For each crawled web page, you'll receive:
 - _Markdown formatting_: Structured content with headers, lists, links, and other formatting preserved
 - _Crawl information_: Loaded URL, referrer URL, timestamp, HTTP status
 - _Optional file downloads_: PDFs, DOCs, and other linked documents
+- _Multiple format options_: Content in Markdown, HTML, or plain text
+- _Debug information_: Detailed extraction diagnostics and snapshots
+- _HTML transformations_: Results from different content extraction algorithms
+- _File storage options_: Flexible storage for HTML, screenshots, or downloaded files

 ```json title="Sample output (shortened)"
 {
@@ -129,39 +140,11 @@ For each crawled web page, you'll receive:
 }
 ```

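If the crawled pages feed a RAG pipeline, the dataset items can be mapped into documents in an n8n Code node or any TypeScript step. A minimal sketch, assuming the typical `url`/`text`/`markdown` fields (verify the names against your actual run output):

```typescript
// Illustrative post-processing of Website Content Crawler results.
// The field names below are assumptions; check them against your dataset items.
interface CrawledPage {
  url: string;
  text?: string;
  markdown?: string;
}

function toRagDocuments(items: CrawledPage[]): { id: string; content: string }[] {
  return items
    .map((item) => ({
      id: item.url,
      // Prefer Markdown when available, fall back to plain text.
      content: item.markdown ?? item.text ?? '',
    }))
    .filter((doc) => doc.content.length > 0);
}
```
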
-### Advanced Settings module
-
-The Advanced Settings module provides complete control over the content extraction process, allowing you to fine-tune every aspect of the crawling and transformation pipeline. This module is ideal for complex websites, JavaScript-heavy applications, or when you need precise control over content extraction.
-
-#### Key features
-
-- _Multiple Crawler Options_: Choose between headless browsers (Playwright) or faster HTTP clients (Cheerio)
-- _Custom Content Selection_: Specify exactly which elements to keep or remove
-- _Advanced Navigation Control_: Set crawling depth, scope, and URL patterns
-- _Dynamic Content Handling_: Wait for JavaScript-rendered content to load
-- _Interactive Element Support_: Click expandable sections to reveal hidden content
-- _Multiple Output Formats_: Save content as Markdown, HTML, or plain text
-- _Proxy Configuration_: Use proxies to handle geo-restrictions or avoid IP blocks
-- _Content Transformation Options_: Multiple algorithms for optimal content extraction
-
-#### How it works
-
-The Advanced Settings module provides granular control over the entire crawling process. For _Crawler selection_, you can choose from Playwright (Firefox/Chrome) or Cheerio, depending on the complexity of the target website. _URL management_ allows you to define the crawling scope with include and exclude URL patterns. You can also exercise precise _DOM manipulation_ by controlling which HTML elements to keep or remove. To ensure the best results, you can apply specialized algorithms for _Content transformation_ and select from various _Output formatting_ options for better AI model compatibility.
+You can access any of thousands of our scrapers on Apify Store by using the [general Apify app](https://n8n.io/integrations/apify).

 #### Configuration options

-Advanced Settings offers a wide range of configuration options. You can select the _Crawler type_ by choosing the rendering engine (browser or HTTP client) and the _Content extraction algorithm_ from multiple HTML transformers. _Element selectors_ allow you to specify which elements to keep, remove, or click, while _URL patterns_ let you define inclusion and exclusion rules with glob syntax. You can also set _Crawling parameters_ like concurrency, depth, timeouts, and retries. For robust crawling, you can configure _Proxy configuration_ settings and select from various _Output options_ for content formats and storage.
-
-#### Output data
-
-In addition to the standard output fields, this module provides:
-
-- _Multiple format options_: Content in Markdown, HTML, or plain text
-- _Debug information_: Detailed extraction diagnostics and snapshots
-- _HTML transformations_: Results from different content extraction algorithms
-- _File storage options_: Flexible storage for HTML, screenshots, or downloaded files
-
-You can access any of thousands of our scrapers on Apify Store by using the [general Apify app](https://n8n.io/integrations/apify).
+You can select the _Crawler type_ by choosing the rendering engine (browser or HTTP client) and the _Content extraction algorithm_ from multiple HTML transformers. _Element selectors_ allow you to specify which elements to keep, remove, or click, while _URL patterns_ let you define inclusion and exclusion rules with glob syntax. You can also set _Crawling parameters_ like concurrency, depth, timeouts, and retries. For robust crawling, you can configure _Proxy configuration_ settings and select from various _Output options_ for content formats and storage.

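As a rough illustration of how these options fit together, a hypothetical Actor input might look like the sketch below; the field names (`crawlerType`, `includeUrlGlobs`, `removeElementsCssSelector`, `htmlTransformer`, and so on) are indicative only and should be checked against the Actor's input schema in Apify Console:

```typescript
// Hypothetical Website Content Crawler input; verify every field name
// against the Actor's input schema before using it.
const crawlerInput = {
  startUrls: [{ url: 'https://docs.apify.com/platform' }],
  crawlerType: 'playwright:firefox', // or 'cheerio' for static pages
  includeUrlGlobs: [{ glob: 'https://docs.apify.com/platform/**' }],
  excludeUrlGlobs: [{ glob: '**/changelog/**' }],
  maxCrawlDepth: 3,
  maxCrawlPages: 100,
  maxConcurrency: 10,
  removeElementsCssSelector: 'nav, header, footer',
  htmlTransformer: 'readableText',
  saveMarkdown: true,
  proxyConfiguration: { useApifyProxy: true },
};
```
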
 ## Usage as an AI Agent Tool

@@ -173,7 +156,7 @@ You can setup Apify's Scraper for AI Crawling node as a tool for your AI Agents.

 In the Website Content Crawler module you can set the **Start URLs** to be filled in by your AI Agent dynamically. This allows the Agent to decide on which pages to scrape off the internet.

-We recommend using the Advanced Settings module with your AI Agent. Two key parameters to set are **Max crawling depth** and **Max pages**. Remember that the scraping results are passed into the AI Agent’s context, so using smaller values helps stay within context limits.
+Two key parameters to configure for optimized AI Agent usage are **Max crawling depth** and **Max pages**. Remember that the scraping results are passed into the AI Agent’s context, so using smaller values helps stay within context limits.

 ![Config Apify](./images/config.png)

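Beyond lowering **Max crawling depth** and **Max pages**, you may also want to cap how much crawled text is handed back to the Agent. An illustrative guardrail (the 4-characters-per-token ratio is a rough heuristic, not an exact tokenizer):

```typescript
// Illustrative guardrail: trim combined crawler output to a token budget
// before it enters the AI Agent's context window.
function capForContext(texts: string[], maxTokens = 8000): string {
  const budgetChars = maxTokens * 4; // rough chars-per-token heuristic
  let combined = '';
  for (const text of texts) {
    if (combined.length + text.length > budgetChars) {
      combined += text.slice(0, budgetChars - combined.length);
      break;
    }
    combined += text + '\n\n';
  }
  return combined;
}
```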