Transform Websites into a Conversational Knowledge Base with OpenAI RAG & Supabase

Built by franck fambou
Created on June 08, 2026

Description

Overview

This automation workflow combines deep web scraping with Retrieval-Augmented Generation (RAG) to turn a website into a queryable knowledge base. It recursively crawls the target website, extracts its content, and indexes that content in a vector database so that an AI chat model can answer questions about it.

How the system works

Intelligent Web Scraping and RAG Pipeline

Recursive Web Scraper - Automatically crawls every accessible page of a target website
Data Extraction - Collects text, metadata, emails, links, and PDF documents
Supabase Integration - Stores content in PostgreSQL tables for scalability
RAG Vectorization - Generates embeddings and stores them for semantic search
AI Query Layer - Connects embeddings to an AI chat engine with citations
Error Handling - Automatically retries failed URL requests
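As a rough illustration of the data extraction step, the sketch below pulls visible text, links, PDF links, and email addresses out of a single HTML page using only the Python standard library. The function name `extract_page_data` and the returned keys are hypothetical and not taken from the workflow, and the email pattern is deliberately simple.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

# Simplified email pattern; an assumption for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class _LinkAndTextParser(HTMLParser):
    """Collects href values and visible text from an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract_page_data(base_url, html):
    """Return text, absolute links, PDF links, and emails found in one page."""
    parser = _LinkAndTextParser()
    parser.feed(html)
    links, emails = [], set()
    for href in parser.hrefs:
        if href.startswith("mailto:"):
            emails.add(href[len("mailto:"):].split("?")[0])
            continue
        # Resolve relative links and drop #fragments so duplicates collapse.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")) and absolute not in links:
            links.append(absolute)
    text = " ".join(parser.text_parts)
    emails.update(EMAIL_RE.findall(text))
    return {
        "text": text,
        "links": links,
        "pdf_links": [u for u in links if u.lower().split("?")[0].endswith(".pdf")],
        "emails": sorted(emails),
    }
```

In the workflow itself this logic would sit behind the HTTP Request and Code nodes; the sketch only shows the kind of parsing involved.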

Setup Instructions

Estimated setup time: 30-45 minutes

Prerequisites

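Self-hosted n8n instance (the template lists v0.200.0 or higher; the @n8n/n8n-nodes-langchain nodes it uses are only available in n8n 1.x releases)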
Supabase account and project (PostgreSQL enabled)
OpenAI API key for embeddings (the workflow uses the Embeddings OpenAI node), plus an OpenAI, Gemini, or Claude API key for chat
Optional: External vector database (Pinecone, Qdrant)

Detailed configuration steps

Step 1: Supabase configuration

Project creation: New Supabase project with PostgreSQL enabled
Generating credentials: API keys (anon key and service_role key) and connection string
Security configuration: RLS policies according to your access requirements

Step 2: Connect Supabase to n8n

Configure Supabase node: Add credentials to n8n Credentials
Test connection: Verify with a simple query
Configure PostgreSQL: Direct connection for advanced operations

Step 3: Preparing the database

Main tables:
pages: URLs, content, metadata, scraping statuses
documents: Extracted and processed PDF files
embeddings: Vectors for semantic search
links: Link graph for navigation

Management functions: Scripts to reactivate failed URLs and manage retries
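The reactivation logic mentioned above might look like the following sketch. In Supabase it would more likely be an SQL function or an UPDATE statement; the Python version only models the rule. The field names `status` and `retry_count` and the status values `pending` and `failed` are assumptions, not the template's actual schema.

```python
def reactivate_failed(pages, max_retries=3):
    """Reset failed pages to 'pending' while they have retries left.

    `pages` is a list of dicts with 'url', 'status', and 'retry_count' keys
    (hypothetical column names). Rows are updated in place and the list of
    reactivated URLs is returned.
    """
    reactivated = []
    for page in pages:
        if page["status"] == "failed" and page["retry_count"] < max_retries:
            page["status"] = "pending"
            page["retry_count"] += 1
            reactivated.append(page["url"])
    return reactivated
```

Capping the retry count keeps permanently broken URLs from being queued forever.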

Step 4: Configuring automation

Recursive scraper: Starting URL, crawling depth, CSS selectors
HTTP extraction: User-Agent, headers, timeouts, and retry policies
Supabase storage: Batch insertion, data validation, duplicate management
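To make the starting URL and crawling depth settings concrete, here is a minimal sketch of a depth-limited, breadth-first crawl that stays on the starting host and skips duplicates. The `crawl` function, its parameters, and the injected `fetch_links` callable are hypothetical; in the workflow the fetching is done by the HTTP Request node.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(start_url, fetch_links, max_depth=2, max_pages=100):
    """Breadth-first crawl limited to the start URL's host.

    fetch_links(url) must return the absolute links found on that page; it is
    injected so the traversal can be exercised without network access.
    Returns a dict mapping each discovered URL to its depth.
    """
    host = urlparse(start_url).netloc
    depths = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        depth = depths[url]
        if depth >= max_depth:
            continue  # pages at the depth limit are recorded but not expanded
        for link in fetch_links(url):
            if len(depths) >= max_pages:
                return depths
            if link in depths or urlparse(link).netloc != host:
                continue  # skip duplicates and external domains
            depths[link] = depth + 1
            queue.append(link)
    return depths
```

A breadth-first order is used here so that pages closest to the starting URL are discovered first when `max_pages` cuts the crawl short.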

Step 5: Error handling and re-executions

Failure monitoring: Automatic detection of failed URLs
Manual triggers: Selective re-execution by domain or date
Recovery sub-workflows: Retry logic with exponential backoff
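Retry logic with exponential backoff can be sketched as below. The function name, the default delays, and the 10% jitter are illustrative assumptions; the template does not state its actual retry parameters.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=4, base_delay=1.0,
                       max_delay=30.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    The wait after failed attempt n (1-based) is base_delay * 2**(n - 1),
    capped at max_delay, plus up to 10% random jitter. The last failure is
    re-raised to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            sleep(delay + random.uniform(0, delay * 0.1))
```

With the defaults, a URL that keeps failing is attempted four times with waits of roughly 1, 2, and 4 seconds in between. The `sleep` parameter exists only so the waits can be replaced in tests.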

Step 6: RAG processing

Embedding generation: Text-embedding models with intelligent chunking
Vector storage: Supabase pgvector or external database
Conversational engine: Connection to chat models with source citations
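The chunking step can be pictured with the sketch below, which splits text into overlapping windows and prefers to end each chunk at a space. It is a simplified stand-in, not the algorithm of the Recursive Character Text Splitter node, and the `chunk_size` and `overlap` defaults are assumptions.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share about `overlap` characters. A chunk ends at the
    last space inside its window when one exists, so chunk ends fall on word
    boundaries where possible (the overlapping start may still fall mid-word).
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            space = text.rfind(" ", start, end)
            if space > start + overlap:
                end = space  # back up to a word boundary
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end == len(text):
            break
        start = end - overlap  # step back to create the overlap
    return chunks
```

The overlap keeps a sentence that straddles a boundary retrievable from either neighbouring chunk, at the cost of embedding some text twice.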

Data structure

Main Supabase tables
| Table | Content | Usage |
|-------|---------|-------|
| pages | URLs, HTML content, metadata | Main storage for scraped content |
| documents | PDF files, extracted text | Downloaded and processed documents |
| embeddings | Vectors, text chunks | Semantic search and RAG |
| links | Link graph, navigation | Relationships between pages |
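One way to picture how the four tables relate is the sketch below, which models each table as a Python dataclass. Every field name here is a hypothetical guess based on the table descriptions; the template's actual column names are not given in this description.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Page:
    id: int
    url: str
    html: str = ""
    metadata: dict = field(default_factory=dict)
    status: str = "pending"  # assumed scraping status value


@dataclass
class Document:
    id: int
    page_id: int  # page on which the PDF link was found
    file_url: str
    extracted_text: str = ""


@dataclass
class EmbeddingChunk:
    id: int
    chunk_text: str
    vector: list  # floats produced by the embedding model
    page_id: Optional[int] = None
    document_id: Optional[int] = None


@dataclass
class Link:
    source_page_id: int
    target_url: str


def outgoing_links(page, links):
    """Return the target URLs recorded for one page in the link graph."""
    return [link.target_url for link in links if link.source_page_id == page.id]
```

In this sketch each embedding chunk points back to either a page or a document, which is what would allow an answer to cite its source.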

Use cases

Business and enterprise
Competitive intelligence with conversational querying
Market research from complex web domains
Compliance monitoring and regulatory watch

Research and academia
Literature extraction with semantic search
Building datasets from fragmented sources

Legal and technical
Scraping legal repositories with intelligent queries
Technical documentation transformed into a conversational assistant

Key features

Advanced scraping
Recursive crawling with automatic link discovery
Multi-format extraction (HTML, PDF, emails)
Intelligent error handling and retry

Intelligent RAG
Contextual embeddings for semantic search
Multi-document queries with citations
Intuitive conversational interface
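Semantic search with citations can be illustrated in memory as follows. In the workflow the similarity search is delegated to the Supabase Vector Store node, so this is only a sketch of the idea; the chunk keys (`text`, `url`, `vector`) and the prompt wording are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vector, chunks, k=3):
    """Return the k chunks most similar to the query vector, best first."""
    ranked = sorted(
        chunks,
        key=lambda c: cosine_similarity(query_vector, c["vector"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, retrieved):
    """Number each retrieved chunk so the chat model can cite it as [n]."""
    sources = "\n".join(
        f"[{i}] {c['url']}\n{c['text']}" for i, c in enumerate(retrieved, start=1)
    )
    return (
        "Answer using only the sources below and cite them as [n].\n\n"
        f"{sources}\n\nQuestion: {question}"
    )
```

Numbering the sources in the prompt is one common way to get citations back: the model's `[n]` markers can then be mapped to the stored URLs.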

Performance and scalability
Processing of thousands of pages per execution
Embedding cache for fast responses
Scalable architecture with Supabase
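An embedding cache can be as simple as the sketch below, which keys stored vectors by a hash of the chunk text so that unchanged content is not embedded twice. The class name and the in-memory dictionary are assumptions; the template does not say where its cache lives.

```python
import hashlib


class EmbeddingCache:
    """Caches embeddings by the SHA-256 hash of the chunk text."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # callable: str -> list of floats
        self._store = {}
        self.misses = 0  # number of times embed_fn was actually called

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]
```

Because the key depends only on the text, re-crawling a page whose content has not changed costs no additional embedding calls.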

Technical Architecture

Main flow: Target URL → Recursive scraping → Content extraction → Supabase storage → Vectorization → Conversational interface

Supported types: HTML pages, PDF documents, metadata, links, emails

Performance specifications

Capacity: 10,000+ pages per run
Response time: < 5 seconds for RAG queries
Accuracy: >90% relevance for specific domains
Scalability: Distributed architecture via Supabase

Advanced configuration

Customization
Crawling depth and scope controls
Domain and content type filters
Chunking settings to optimize RAG
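The depth, domain, and content type controls could be combined into a single scope check, as in this sketch. The function name, the default depth, and the list of blocked extensions are hypothetical examples rather than the template's settings.

```python
from urllib.parse import urlparse


def in_scope(url, depth, allowed_domains, max_depth=3,
             blocked_extensions=(".jpg", ".png", ".zip")):
    """Decide whether a discovered URL should be crawled.

    A URL is in scope when it uses http(s), is within max_depth, belongs to
    one of allowed_domains (or a subdomain of one), and its path does not end
    with a blocked file extension.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    if depth > max_depth:
        return False
    host = parsed.hostname or ""
    if not any(host == d or host.endswith("." + d) for d in allowed_domains):
        return False
    return not parsed.path.lower().endswith(tuple(blocked_extensions))
```

Matching on `"." + d` rather than a bare suffix avoids treating a look-alike host such as `evil-example.com` as part of `example.com`.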

Monitoring
Real-time monitoring in Supabase
Cost and performance metrics
Detailed conversation logs

Nodes Used (7)

Code: n8n-nodes-base.code
Default Data Loader: @n8n/n8n-nodes-langchain.documentDefaultDataLoader
Embeddings OpenAI: @n8n/n8n-nodes-langchain.embeddingsOpenAi
HTTP Request: n8n-nodes-base.httpRequest
Recursive Character Text Splitter: @n8n/n8n-nodes-langchain.textSplitterRecursiveCharacterTextSplitter
Supabase: n8n-nodes-base.supabase
Supabase Vector Store: @n8n/n8n-nodes-langchain.vectorStoreSupabase