Scrape and ingest web content into Supabase pgvector with Firecrawl

1 views

Built by

Firecrawl

Created on June 08, 2026

Description

What this does

Receives a URL via webhook, uses Firecrawl to scrape the page into clean markdown, and stores it as vector embeddings in Supabase pgvector. A visual, self-hosted ingestion pipeline for RAG knowledge bases. Adding a new source is as simple as sending a URL.

The second part of the workflow exposes a chat interface where an AI Agent queries the stored knowledge base to answer questions, with Cohere reranking for better retrieval quality.

How it works

Part 1: Ingestion Pipeline

Webhook receives a POST request with a url field
Verify URL validates and normalizes the domain
Supabase checks if the URL was already ingested (deduplication)
If the URL already exists, ingestion is skipped; otherwise it continues
Firecrawl fetches the page and converts it to clean markdown
OpenAI generates vector embeddings from the scraped content
Default Data Loader attaches the source URL as metadata
Supabase Vector Store inserts the content and embeddings into pgvector
Respond to Webhook confirms how many items were added

Part 2: RAG Chat Agent

Chat trigger receives a user question
AI Agent (OpenRouter) queries the Supabase vector store filtered by URL
Cohere Reranker improves retrieval quality before the agent responds
Agent answers based solely on the ingested knowledge base

Requirements

Firecrawl API key
OpenAI API key (for embeddings)
OpenRouter API key (for the chat agent)
Cohere API key (for reranking)
Supabase project with pgvector enabled

Setup

Create a Supabase project and run the following SQL in the SQL editor:

-- Enable the pgvector extension
create extension vector
with
schema extensions;

-- Create a table to store documents
create table documents (
id bigserial primary key,
content text,
metadata jsonb,
embedding extensions.vector(1536)
);

-- Create a function to search for documents
create function match_documents (
query_embedding extensions.vector(1536),
match_count int default null,
filter jsonb default '{}'
) returns table (
id bigint,
content text,
metadata jsonb,
similarity float
)
language plpgsql
as $$
#variable_conflict use_column
begin
return query
select
id,
content,
metadata,
1 - (documents.embedding <=> query_embedding) as similarity
from documents
where metadata @> filter
order by documents.embedding <=> query_embedding
limit match_count;
end;
$$;

Add your Firecrawl API key as a credential in n8n
Add your OpenAI API key as a credential (for embeddings)
Add your OpenRouter API key as a credential (for the chat agent)
Add your Cohere API key as a credential (for reranking)
Activate the workflow

How to use

Send a POST request to the webhook URL:

curl -X POST https://your-n8n-instance/webhook/your-id \
-H "Content-Type: application/json" \
-d '{"url": "https://firecrawl.dev/docs"}'

Then open the chat interface in n8n to ask questions about the ingested content.

Nodes Used (9)

AI Agent

@n8n/n8n-nodes-langchain.agent

Code

n8n-nodes-base.code

Default Data Loader

@n8n/n8n-nodes-langchain.documentDefaultDataLoader

Embeddings OpenAI

@n8n/n8n-nodes-langchain.embeddingsOpenAi

OpenRouter Chat Model

@n8n/n8n-nodes-langchain.lmChatOpenRouter

Reranker Cohere

@n8n/n8n-nodes-langchain.rerankerCohere

Simple Memory

@n8n/n8n-nodes-langchain.memoryBufferWindow

Supabase

n8n-nodes-base.supabase

Supabase Vector Store

@n8n/n8n-nodes-langchain.vectorStoreSupabase

Scrape and ingest web content into Supabase pgvector with Firecrawl

Description

Nodes Used (9)

Select Nodes to Filter