AI Assistant Agent: RAG Input Flow¶
This workflow serves as the primary knowledge ingestion pipeline for an AI assistant system, automatically processing documents from Google Drive, extracting and chunking content, generating embeddings, and storing them in a vector database to power intelligent document retrieval and question-answering capabilities.
Purpose¶
This workflow enables organizations to build and maintain a searchable knowledge base by automatically processing documents stored in Google Drive. The system transforms various document formats (DOCX, PDF, Google Docs, JSON) into searchable vector embeddings that can be queried by AI assistants to provide accurate, context-aware responses based on organizational knowledge.
How It Works¶
- Trigger Activation: The workflow starts via manual trigger, webhook call, or scheduled execution
- Configuration Setup: Loads Google Drive folder ID and admin notification settings
- Folder Discovery: Scans the configured Google Drive folder for subfolders containing documents
- Human Approval: Requests admin approval via Telegram for each folder to be processed
- Document Retrieval: Downloads files from approved folders, supporting multiple formats
- Content Extraction: Extracts text content based on file type (Google Docs API for .docx, PDF extraction, etc.)
- Text Processing: Chunks documents into manageable segments with configurable overlap
- Metadata Extraction: Uses AI to extract themes and keywords from document content
- Vector Generation: Creates embeddings using OpenAI's embedding model
- Database Storage: Stores vectors and metadata in Supabase vector database
- Cleanup Operations: Optionally deletes old document versions with human approval
- Notification: Sends completion status via Telegram
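The core ingestion steps above (extract, chunk, embed, upsert) can be sketched as follows. This is an illustrative stand-in for the n8n nodes, not the workflow's actual code: `embed` and `upsert` are hypothetical callables representing the Embeddings OpenAI and Supabase Vector Store nodes, and the 3000/200 chunk parameters come from the Known Limitations section.

```python
def chunk_text(text, size=3000, overlap=200):
    """Split text into overlapping chunks (the Code/Chunking node)."""
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks

def ingest_document(text, file_id, embed, upsert):
    """Chunk a document, embed each chunk, and store vector plus metadata."""
    for i, chunk in enumerate(chunk_text(text)):
        vector = embed(chunk)  # stand-in for the Embeddings OpenAI node
        upsert({               # stand-in for the Supabase Vector Store node
            "file_id": file_id,
            "chunk": i,
            "content": chunk,
            "embedding": vector,
        })
```

Consecutive chunks share a 200-character tail/head, so a sentence cut at a chunk boundary still appears whole in at least one chunk.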
Workflow Diagram¶
```mermaid
graph TD
A[Manual Trigger/Webhook] --> B[Configuration]
B --> C[List Subfolders]
C --> D[Split Subfolders]
D --> E[Loop Over Items1]
E --> F[Choose Folder]
F --> G[If2 - Approval Check]
G -->|Approved| H[List Files]
G -->|Declined| E
H --> I[Mapping]
I --> J[Collection Name]
I --> K[File Id List]
I --> L[Wait for delete flow]
K --> M[Merge1]
J --> M
M --> N[Confirm Delete Vectors]
N --> O[If - Delete Approval]
O -->|Approved| P[Delete Old Documents]
O -->|Declined| Q[Send Declined Message]
P --> R[Start Upsert]
Q --> R
R --> L
L --> S[Loop Over Items]
S --> T[Download File From Google Drive]
T --> U[Switch - File Type]
U -->|DOCX/MD| V[Google Docs]
U -->|PDF| W[Extract from File]
U -->|JSON_FIN| X[Extract from JSON]
U -->|Invalid| Y[Send Invalid Filetype Message]
V --> Z[Edit Fields]
W --> Z
Z --> AA[Split Out1]
AA --> BB[Chunking]
BB --> CC[Extract Meta Data]
CC --> DD[3.5-turbo]
DD --> EE[Merge]
EE --> FF[Data Loader]
FF --> GG[Supabase Vector Store]
GG --> HH[Wait]
HH --> S
X --> II[Split Out Codes]
X --> JJ[Split Out Categories]
X --> KK[Intents]
II --> LL[Upsert Codes]
JJ --> MM[Upsert Categories]
KK --> NN[Split out intents]
NN --> OO[Upsert Intents]
LL --> PP[Merge2]
MM --> PP
OO --> PP
PP --> S
Y --> S
S -->|Complete| QQ[Send Completed Message]
```
Triggers¶
- Manual Trigger: Click "Test workflow" button for manual execution
- Webhook: HTTP POST to the `/webhook/upsert` endpoint
- Schedule Trigger: Daily execution at 12:00 PM (currently disabled)
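A minimal sketch of an external trigger call, using only the standard library. The host name and the payload body are assumptions; the source only documents the `/webhook/upsert` path.

```python
import json
import urllib.request

def build_trigger_request(base_url="https://n8n.example.com"):
    """Build a POST request to the workflow's webhook trigger.

    base_url is a placeholder; the payload fields are illustrative only.
    """
    payload = json.dumps({"source": "external"}).encode()
    return urllib.request.Request(
        url=f"{base_url}/webhook/upsert",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To actually fire the trigger (requires a reachable n8n instance):
# urllib.request.urlopen(build_trigger_request())
```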
Nodes Used¶
| Node Type | Purpose |
|---|---|
| Manual Trigger | Starts workflow for testing |
| Webhook | Accepts external trigger requests |
| Set (Configuration) | Stores Google Drive folder ID and admin chat ID |
| Google Drive | Lists folders/files and downloads documents |
| Split Out/Split In Batches | Processes multiple items iteratively |
| Switch | Routes files based on type (DOCX, PDF, JSON, etc.) |
| Google Docs | Extracts content from Google Docs |
| Extract from File | Processes PDF and other file formats |
| Code (Chunking) | Splits documents into overlapping text chunks |
| Information Extractor | Uses AI to extract metadata and keywords |
| OpenAI Chat Model | Powers metadata extraction |
| Embeddings OpenAI | Generates vector embeddings |
| Supabase Vector Store | Stores documents in vector database |
| Postgres | Manages budget codes and categories |
| Telegram | Sends notifications and approval requests |
| If/Merge | Controls workflow logic and data combination |
| Wait | Pauses execution between batch processing |
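The Switch node's routing can be pictured as a function of the file name. This is an assumed reading of the diagram's DOCX/MD, PDF, JSON_FIN, and Invalid branches; in particular, the `_fin.json` suffix convention is a guess at what distinguishes the JSON_FIN route.

```python
def route_file(name):
    """Illustrative mirror of the Switch - File Type node."""
    ext = name.rsplit(".", 1)[-1].lower()
    if ext in ("docx", "md"):
        return "google_docs"          # DOCX/MD -> Google Docs node
    if ext == "pdf":
        return "extract_from_file"    # PDF -> Extract from File node
    if name.lower().endswith("_fin.json"):
        return "extract_json"         # JSON_FIN -> Extract from JSON node
    return "invalid"                  # -> Send Invalid Filetype Message
```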
External Services & Credentials Required¶
- Google Drive OAuth2: Access to Google Drive folders and files
- Google Docs OAuth2: Read Google Docs content
- OpenAI API: Generate embeddings and power AI extraction
- Supabase: Vector database storage
- PostgreSQL: Structured data storage for budget codes
- Telegram Bot: User notifications and approvals
Environment Variables¶
Configuration is handled through the "Configuration" node with hardcoded values:
- folder_id: Google Drive folder ID (currently: "1sfTnMGube-MTyEbchWLQE_Cn-oKTU2G8")
- admin_chat_id: Telegram chat ID for notifications (currently: "5207485332")
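Since the values are hardcoded in the Configuration node, one way to avoid manual edits is to read them from environment variables instead. This is a suggested pattern, not part of the workflow; the variable names `DRIVE_FOLDER_ID` and `ADMIN_CHAT_ID` are assumptions.

```python
import os

def load_config():
    """Read the Configuration node's settings from the environment.

    Empty-string defaults force explicit configuration rather than
    shipping real IDs in the workflow JSON.
    """
    return {
        "folder_id": os.environ.get("DRIVE_FOLDER_ID", ""),
        "admin_chat_id": os.environ.get("ADMIN_CHAT_ID", ""),
    }
```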
Data Flow¶
Input:
- Google Drive folder containing documents (DOCX, PDF, Google Docs, JSON)
- Webhook requests or manual triggers

Processing:
- Document content extraction and text chunking
- AI-powered metadata extraction (themes, keywords)
- Vector embedding generation
- Structured data parsing for financial codes
Output:
- Vector embeddings stored in Supabase documents table
- Budget codes/categories in PostgreSQL tables
- Telegram notifications with processing status
- Metadata including file IDs, themes, keywords, and chunk information
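The metadata attached to each stored chunk can be pictured as a flat record. Only the file ID, themes, keywords, and chunk information are stated in the source; the exact field names below are assumptions.

```python
def make_metadata(file_id, chunk_index, themes, keywords):
    """Illustrative shape of the metadata stored alongside each vector."""
    return {
        "file_id": file_id,        # Google Drive file ID
        "chunk_index": chunk_index,
        "themes": themes,          # extracted by the Information Extractor node
        "keywords": keywords,
    }
```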
Error Handling¶
- File Download Errors: Continues processing other files if individual downloads fail
- Invalid File Types: Sends notification and skips unsupported files
- Processing Failures: Uses "Continue on Error" for batch operations
- Human Approval: Requires explicit confirmation for destructive operations (deletions)
- Timeout Protection: 15-minute limit on approval requests
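The "Continue on Error" behaviour amounts to: a failed download or parse skips that file and the batch carries on. A minimal sketch, where `notify` is a hypothetical stand-in for the Telegram notification node:

```python
def process_batch(files, process_one, notify):
    """Process each file; on failure, record it, notify, and continue."""
    results, failed = [], []
    for f in files:
        try:
            results.append(process_one(f))
        except Exception as exc:
            failed.append(f)
            notify(f"Skipping {f}: {exc}")  # Telegram node stand-in
    return results, failed
```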
Known Limitations¶
- Hardcoded configuration values require manual updates
- Limited to specific Google Drive folder structure
- Requires human approval for each folder processing
- PowerPoint files need conversion to PDF (currently disabled)
- No automatic retry mechanism for failed operations
- Chunk size fixed at 3000 tokens with 200-token overlap
Related Workflows¶
This workflow likely connects to:
- AI Assistant query/response workflows that use the generated embeddings
- Document management workflows for content updates
- Budget management systems that consume the financial codes data
Setup Instructions¶
1. Import Workflow: Copy the workflow JSON into your n8n instance
2. Configure Credentials:
   - Set up Google Drive OAuth2 connection
   - Configure Google Docs OAuth2 access
   - Add OpenAI API key
   - Set up Supabase connection with vector database
   - Configure PostgreSQL connection
   - Create Telegram bot and get API credentials
3. Update Configuration Node:
   - Replace `folder_id` with your Google Drive folder ID
   - Update `admin_chat_id` with your Telegram chat ID
4. Database Setup:
   - Ensure Supabase has a `documents` table with vector support
   - Create PostgreSQL tables: `budget_codes`, `budget_categories`, `budget_intents`
5. Test Execution:
   - Start with manual trigger to verify all connections
   - Test with a small folder containing sample documents
   - Verify vector storage and metadata extraction
6. Production Deployment:
   - Enable webhook trigger for external integrations
   - Configure schedule trigger if needed
   - Set up monitoring for failed executions