Automating web scraping with n8n: real-time data collection
This n8n workflow automates web data scraping using Bright Data and Google Gemini. In a context where collecting accurate, up-to-date information is critical for businesses, it retrieves data efficiently and in a structured form. Use cases include competitor monitoring, market analysis, and database enrichment.
Step 1: the workflow is triggered manually via a 'Manual Trigger' node.
Step 2: an AI agent processes the request and orchestrates the scraping tools.
Step 3: 'MCP Client' nodes interact with Bright Data to list the available tools and configure the scraper.
Step 4: the scraped data is sent to a webhook to be processed and formatted as Markdown or HTML.
Step 5: finally, the results are written to disk for later use.
Business benefits include a significant reduction in the time spent collecting data, improved data accuracy, and a greater ability to react quickly to market changes. Key tags: automation, scraping, Google Gemini.
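At the heart of the workflow, both the agent branch and the standalone branch call Bright Data's scrape_as_markdown MCP tool. A minimal sketch of that tool call follows; in the workflow JSON below, toolParameters is stored as a string expression ("={{ $json.url }}"), shown here already resolved to the sample URL from the 'Set the URLs' node for readability:

{
  "toolName": "scrape_as_markdown",
  "operation": "executeTool",
  "toolParameters": {
    "url": "https://about.google/"
  }
}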
n8n workflow overview
Diagram of the nodes and connections of this n8n workflow, generated from the n8n JSON.
n8n workflow node details
{
"id": "U6cY7PPR0vaRl1I0",
"meta": {
"instanceId": "885b4fb4a6a9c2cb5621429a7b972df0d05bb724c20ac7dac7171b62f1c7ef40",
"templateCredsSetupCompleted": true
},
"name": "Scrape Web Data with Bright Data, Google Gemini and MCP Automated AI Agent",
"tags": [
{
"id": "ZOwtAMLepQaGW76t",
"name": "Building Blocks",
"createdAt": "2025-04-13T15:23:40.462Z",
"updatedAt": "2025-04-13T15:23:40.462Z"
},
{
"id": "ddPkw7Hg5dZhQu2w",
"name": "AI",
"createdAt": "2025-04-13T05:38:08.053Z",
"updatedAt": "2025-04-13T05:38:08.053Z"
}
],
"nodes": [
{
"id": "0c747f5b-ef72-4e00-a028-4a08461dae28",
"name": "AI Agent",
"type": "@n8n/n8n-nodes-langchain.agent",
"notes": "Bright Data Web Scraping Agent",
"position": [
-140,
60
],
"parameters": {
"text": "=Scrape the web data as per the provided URL: {{ $json.url }} using the format as {{ $json.format }}",
"options": {
"systemMessage": "=You are a helpful assistant."
},
"promptType": "define"
},
"notesInFlow": true,
"typeVersion": 1.8
},
{
"id": "16c7cd90-39da-47a4-8020-a0aa8f87275a",
"name": "When clicking ‘Test workflow’",
"type": "n8n-nodes-base.manualTrigger",
"position": [
-640,
-300
],
"parameters": {},
"typeVersion": 1
},
{
"id": "7b544505-6b6b-4500-a2ad-f7cf62f98c13",
"name": "MCP Client list all tools for Bright Data",
"type": "n8n-nodes-mcp.mcpClient",
"position": [
-380,
-300
],
"parameters": {},
"credentials": {
"mcpClientApi": {
"id": "JtatFSfA2kkwctYa",
"name": "MCP Client (STDIO) account"
}
},
"typeVersion": 1
},
{
"id": "9f5bf319-9414-4974-bad2-ea24f09ae351",
"name": "Sticky Note1",
"type": "n8n-nodes-base.stickyNote",
"position": [
-220,
-400
],
"parameters": {
"color": 3,
"width": 440,
"height": 320,
"content": "## Bright Data Web Scraper"
},
"typeVersion": 1
},
{
"id": "222e81cf-878e-42de-a325-6b6659145f98",
"name": "MCP Client List all tools",
"type": "n8n-nodes-mcp.mcpClientTool",
"position": [
440,
420
],
"parameters": {},
"credentials": {
"mcpClientApi": {
"id": "JtatFSfA2kkwctYa",
"name": "MCP Client (STDIO) account"
}
},
"typeVersion": 1
},
{
"id": "a10dfd4f-cadf-4952-8449-1865406358d4",
"name": "MCP Client Bright Data Web Scraper",
"type": "n8n-nodes-mcp.mcpClient",
"notes": "Scrape a single webpage URL with advanced options for content extraction and get back the results in MarkDown language.",
"position": [
60,
-300
],
"parameters": {
"toolName": "=scrape_as_markdown",
"operation": "executeTool",
"toolParameters": "={\n \"url\": \"{{ $json.url }}\"\n} "
},
"credentials": {
"mcpClientApi": {
"id": "JtatFSfA2kkwctYa",
"name": "MCP Client (STDIO) account"
}
},
"notesInFlow": true,
"typeVersion": 1
},
{
"id": "0acbd4ff-ce4a-4ff2-b213-2d80dd91e302",
"name": "Webhook for web scraper",
"type": "n8n-nodes-base.httpRequest",
"position": [
280,
-300
],
"parameters": {
"url": "=https://webhook.site/daf9d591-a130-4010-b1d3-0c66f8fcf467",
"options": {},
"sendBody": true,
"bodyParameters": {
"parameters": [
{
"name": "response",
"value": "={{ $json.result.content[0].text }}"
}
]
}
},
"typeVersion": 4.2
},
{
"id": "009ac29f-8cad-4b58-9ca4-e75470a52dcc",
"name": "Set the URLs",
"type": "n8n-nodes-base.set",
"position": [
-160,
-300
],
"parameters": {
"options": {},
"assignments": {
"assignments": [
{
"id": "214e61a0-3587-453f-baf5-eac013990857",
"name": "url",
"type": "string",
"value": "https://about.google/"
},
{
"id": "45014942-0a2e-4f46-b395-f82f97bfa93e",
"name": "webhook_url",
"type": "string",
"value": "https://webhook.site/ce41e056-c097-48c8-a096-9b876d3abbf7"
}
]
}
},
"typeVersion": 3.4
},
{
"id": "104706dd-ae58-47fd-8fea-cefa986ae40c",
"name": "MCP Client to Scrape as Markdown",
"type": "n8n-nodes-mcp.mcpClientTool",
"notes": "Scrape a single webpage URL with advanced options for content extraction and get back the results in MarkDown language.",
"position": [
-60,
400
],
"parameters": {
"toolName": "scrape_as_markdown",
"operation": "executeTool",
"toolParameters": "={\n \"url\": \"{{ $json.url }}\"\n} ",
"descriptionType": "manual",
"toolDescription": "Scrape a single webpage URL with advanced options for content extraction and get back the results in MarkDown language."
},
"credentials": {
"mcpClientApi": {
"id": "JtatFSfA2kkwctYa",
"name": "MCP Client (STDIO) account"
}
},
"notesInFlow": false,
"typeVersion": 1
},
{
"id": "c03c655e-45c8-4278-a01f-ba48282459c5",
"name": "MCP Client to Scrape as HTML",
"type": "n8n-nodes-mcp.mcpClientTool",
"notes": "Scrape a single webpage URL with advanced options for content extraction and get back the results in HTML.",
"position": [
200,
400
],
"parameters": {
"toolName": "scrape_as_html",
"operation": "executeTool",
"toolParameters": "{\n \"url\": \"{{ $json.url }}\"\n} ",
"descriptionType": "manual",
"toolDescription": "Scrape a single webpage URL with advanced options for content extraction and get back the results in HTML."
},
"credentials": {
"mcpClientApi": {
"id": "JtatFSfA2kkwctYa",
"name": "MCP Client (STDIO) account"
}
},
"notesInFlow": true,
"typeVersion": 1
},
{
"id": "587300ff-44e5-4ff2-9dee-a4f8720ca26b",
"name": "Google Gemini Chat Model for AI Agent",
"type": "@n8n/n8n-nodes-langchain.lmChatGoogleGemini",
"position": [
-520,
400
],
"parameters": {
"options": {},
"modelName": "models/gemini-2.0-flash-exp"
},
"credentials": {
"googlePalmApi": {
"id": "YeO7dHZnuGBVQKVZ",
"name": "Google Gemini(PaLM) Api account"
}
},
"typeVersion": 1
},
{
"id": "38ba13a1-f8f7-48ce-a05d-2c2526de606d",
"name": "Sticky Note",
"type": "n8n-nodes-base.stickyNote",
"position": [
-140,
340
],
"parameters": {
"color": 4,
"width": 480,
"height": 260,
"content": "## Bright Data Web Scraper Tools"
},
"typeVersion": 1
},
{
"id": "e7c8d333-e256-4944-a584-575162072ca4",
"name": "Simple Memory",
"type": "@n8n/n8n-nodes-langchain.memoryBufferWindow",
"position": [
-280,
400
],
"parameters": {
"sessionKey": "=Perform the web scraping for the below URL\n\n{{ $json.url }}",
"sessionIdType": "customKey",
"contextWindowLength": 10
},
"typeVersion": 1.3
},
{
"id": "5e89519e-8ee5-4c3b-807d-21cef6e36c32",
"name": "Webhook for Web Scraper AI Agent",
"type": "n8n-nodes-base.httpRequest",
"position": [
260,
120
],
"parameters": {
"url": "={{ $('Set the URL with the Webhook URL and data format').item.json.webhook_url }}",
"options": {},
"sendBody": true,
"bodyParameters": {
"parameters": [
{
"name": "response",
"value": "={{ $json.output }}"
}
]
}
},
"typeVersion": 4.2
},
{
"id": "2de093f4-15e5-4710-83d9-e6d9ed852873",
"name": "Set the URL with the Webhook URL and data format",
"type": "n8n-nodes-base.set",
"position": [
-400,
60
],
"parameters": {
"options": {},
"assignments": {
"assignments": [
{
"id": "214e61a0-3587-453f-baf5-eac013990857",
"name": "url",
"type": "string",
"value": "https://about.google/"
},
{
"id": "45014942-0a2e-4f46-b395-f82f97bfa93e",
"name": "webhook_url",
"type": "string",
"value": "https://webhook.site/daf9d591-a130-4010-b1d3-0c66f8fcf467"
},
{
"id": "7f6c03f6-9fa3-45f9-bf81-243b7106bdac",
"name": "format",
"type": "string",
"value": "scrape_as_markdown"
}
]
}
},
"typeVersion": 3.4
},
{
"id": "04a5b11f-1990-440e-be23-2fbcb985dd4a",
"name": "Create a binary data",
"type": "n8n-nodes-base.function",
"position": [
260,
-80
],
"parameters": {
"functionCode": "items[0].binary = {\n data: {\n data: new Buffer(JSON.stringify(items[0].json, null, 2)).toString('base64')\n }\n};\nreturn items;"
},
"typeVersion": 1
},
{
"id": "0765ffcd-4746-45f7-add8-f82d66709321",
"name": "Write the scraped content to disk",
"type": "n8n-nodes-base.readWriteFile",
"position": [
460,
-80
],
"parameters": {
"options": {},
"fileName": "d:\\Scraped-Content.json",
"operation": "write"
},
"typeVersion": 1
},
{
"id": "16cd9b48-765d-4f0d-8e78-6c41cfc50b03",
"name": "Sticky Note2",
"type": "n8n-nodes-base.stickyNote",
"position": [
-220,
-540
],
"parameters": {
"width": 440,
"height": 120,
"content": "## Disclaimer\nThis template is only available on n8n self-hosted as it's making use of the community node for MCP Client."
},
"typeVersion": 1
},
{
"id": "ff4b9e9c-a0f0-4cdb-ad53-55dc412794fc",
"name": "Sticky Note3",
"type": "n8n-nodes-base.stickyNote",
"position": [
-1080,
-80
],
"parameters": {
"color": 5,
"width": 480,
"height": 380,
"content": "## Note\nThe AI agent utilizes Bright Data's MCP tools to perform web scraping based on user requests. It intelligently selects the most suitable web scraping tool to fulfill the user's query.\n\nOnce the web scraping is complete, the AI agent's response is:\n\n1. Used to trigger a webhook call.\n\n2. Persisted to disk for future reference.\n\nGoogle Gemini is employed by the AI agent to understand and interpret user queries. Based on this interpretation, the agent initiates a call to the appropriate MCP client to perform the required web scraping task.\n\nSource - https://github.com/luminati-io/brightdata-mcp"
},
"typeVersion": 1
}
],
"active": false,
"pinData": {},
"settings": {
"executionOrder": "v1"
},
"versionId": "ccf9f9af-82d5-4751-b06d-497c043c85fc",
"connections": {
"AI Agent": {
"main": [
[
{
"node": "Webhook for Web Scraper AI Agent",
"type": "main",
"index": 0
},
{
"node": "Create a binary data",
"type": "main",
"index": 0
}
]
]
},
"Set the URLs": {
"main": [
[
{
"node": "MCP Client Bright Data Web Scraper",
"type": "main",
"index": 0
}
]
]
},
"Simple Memory": {
"ai_memory": [
[
{
"node": "AI Agent",
"type": "ai_memory",
"index": 0
}
]
]
},
"Create a binary data": {
"main": [
[
{
"node": "Write the scraped content to disk",
"type": "main",
"index": 0
}
]
]
},
"MCP Client List all tools": {
"ai_tool": [
[
{
"node": "AI Agent",
"type": "ai_tool",
"index": 0
}
]
]
},
"MCP Client to Scrape as HTML": {
"ai_tool": [
[
{
"node": "AI Agent",
"type": "ai_tool",
"index": 0
}
]
]
},
"MCP Client to Scrape as Markdown": {
"ai_tool": [
[
{
"node": "AI Agent",
"type": "ai_tool",
"index": 0
}
]
]
},
"Webhook for Web Scraper AI Agent": {
"main": [
[]
]
},
"When clicking ‘Test workflow’": {
"main": [
[
{
"node": "MCP Client list all tools for Bright Data",
"type": "main",
"index": 0
},
{
"node": "Set the URL with the Webhook URL and data format",
"type": "main",
"index": 0
}
]
]
},
"MCP Client Bright Data Web Scraper": {
"main": [
[
{
"node": "Webhook for web scraper",
"type": "main",
"index": 0
}
]
]
},
"Google Gemini Chat Model for AI Agent": {
"ai_languageModel": [
[
{
"node": "AI Agent",
"type": "ai_languageModel",
"index": 0
}
]
]
},
"MCP Client list all tools for Bright Data": {
"main": [
[
{
"node": "Set the URLs",
"type": "main",
"index": 0
}
]
]
},
"Set the URL with the Webhook URL and data format": {
"main": [
[
{
"node": "AI Agent",
"type": "main",
"index": 0
}
]
]
}
}
}
Who is this workflow for?
This workflow is aimed at medium to large businesses, marketing and data analysis teams, and professionals who want to automate the collection of information from the web. An intermediate technical level is recommended for setting up and customizing the workflow.
Problem solved
This workflow addresses the problem of manual data collection, which is often time-consuming and error-prone. By automating the process, users can eliminate the frustration of hunting for information, reduce the risk of inaccuracies, and get concrete results quickly. It also frees up time to focus on higher-value tasks.
Workflow steps
Step 1: the workflow starts with a manual trigger.
Step 2: an AI agent processes the request.
Step 3: the 'MCP Client' node lists the tools available for Bright Data.
Step 4: the URLs to scrape are configured via the 'Set the URLs' node.
Step 5: the data is retrieved by the 'MCP Client Bright Data Web Scraper' node.
Step 6: the results are formatted as Markdown or HTML.
Step 7: finally, the data is written to disk for future use. A sketch of the item driving these steps follows.
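For orientation, the item assembled by the 'Set the URL with the Webhook URL and data format' node and handed to the AI agent looks like this (values copied from the workflow JSON above; replace them with your own targets):

{
  "url": "https://about.google/",
  "webhook_url": "https://webhook.site/daf9d591-a130-4010-b1d3-0c66f8fcf467",
  "format": "scrape_as_markdown"
}

The agent prompt interpolates these fields via {{ $json.url }} and {{ $json.format }}, and the downstream HTTP Request node posts the agent output to webhook_url.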
n8n workflow customization guide
To customize this workflow, start by adjusting the parameters of the 'Set the URLs' node to point at the sites you want to scrape. Also adapt the 'MCP Client' node configurations to match the scraping tools you need. Consider securing the webhook with authentication tokens, as sketched below. You can also integrate other n8n nodes to enrich the flow, such as email notifications or writes to databases.
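As a sketch of the webhook-hardening suggestion, assuming the receiving endpoint checks a bearer token, the 'Webhook for Web Scraper AI Agent' HTTP Request node can send an Authorization header via its header options (the same shape as the bodyParameters already in the workflow). "Bearer your-token-here" is a placeholder to replace with your own secret:

{
  "parameters": {
    "url": "={{ $('Set the URL with the Webhook URL and data format').item.json.webhook_url }}",
    "sendHeaders": true,
    "headerParameters": {
      "parameters": [
        {
          "name": "Authorization",
          "value": "Bearer your-token-here"
        }
      ]
    },
    "sendBody": true,
    "bodyParameters": {
      "parameters": [
        {
          "name": "response",
          "value": "={{ $json.output }}"
        }
      ]
    }
  }
}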