AI Assistant Agent: RAG Input Flow

This workflow serves as the primary knowledge ingestion pipeline for an AI assistant system, automatically processing documents from Google Drive, extracting and chunking content, generating embeddings, and storing them in a vector database to power intelligent document retrieval and question-answering capabilities.

Purpose

This workflow enables organizations to build and maintain a searchable knowledge base by automatically processing documents stored in Google Drive. The system transforms various document formats (DOCX, PDF, Google Docs, JSON) into searchable vector embeddings that can be queried by AI assistants to provide accurate, context-aware responses based on organizational knowledge.

How It Works

  1. Trigger Activation: The workflow starts via manual trigger, webhook call, or scheduled execution
  2. Configuration Setup: Loads Google Drive folder ID and admin notification settings
  3. Folder Discovery: Scans the configured Google Drive folder for subfolders containing documents
  4. Human Approval: Requests admin approval via Telegram for each folder to be processed
  5. Document Retrieval: Downloads files from approved folders, supporting multiple formats
  6. Content Extraction: Extracts text content based on file type (Google Docs API for .docx, PDF extraction, etc.)
  7. Text Processing: Chunks documents into manageable segments with configurable overlap
  8. Metadata Extraction: Uses AI to extract themes and keywords from document content
  9. Vector Generation: Creates embeddings using OpenAI's embedding model
  10. Database Storage: Stores vectors and metadata in Supabase vector database
  11. Cleanup Operations: Optionally deletes old document versions with human approval
  12. Notification: Sends completion status via Telegram
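The chunking step (7) can be sketched as follows. This is a minimal illustration, not the workflow's actual Code-node implementation: it uses the 3000/200 chunk-size and overlap values documented under Known Limitations, and approximates tokens with whitespace-separated words, which is an assumption.

```python
def chunk_text(text, chunk_size=3000, overlap=200):
    """Split text into overlapping chunks.

    chunk_size and overlap mirror the values documented under Known
    Limitations. Tokens are approximated here by whitespace-separated
    words; the workflow's Code node may tokenize differently.
    """
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap  # advance by chunk size minus overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk reached the end of the document
    return chunks
```

Each chunk shares its last 200 "tokens" with the start of the next chunk, so sentences straddling a chunk boundary still appear intact in at least one chunk.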

Workflow Diagram

graph TD
    A[Manual Trigger/Webhook] --> B[Configuration]
    B --> C[List Subfolders]
    C --> D[Split Subfolders]
    D --> E[Loop Over Items1]
    E --> F[Choose Folder]
    F --> G[If2 - Approval Check]
    G -->|Approved| H[List Files]
    G -->|Declined| E
    H --> I[Mapping]
    I --> J[Collection Name]
    I --> K[File Id List]
    I --> L[Wait for delete flow]
    K --> M[Merge1]
    J --> M
    M --> N[Confirm Delete Vectors]
    N --> O[If - Delete Approval]
    O -->|Approved| P[Delete Old Documents]
    O -->|Declined| Q[Send Declined Message]
    P --> R[Start Upsert]
    Q --> R
    R --> L
    L --> S[Loop Over Items]
    S --> T[Download File From Google Drive]
    T --> U[Switch - File Type]
    U -->|DOCX/MD| V[Google Docs]
    U -->|PDF| W[Extract from File]
    U -->|JSON_FIN| X[Extract from JSON]
    U -->|Invalid| Y[Send Invalid Filetype Message]
    V --> Z[Edit Fields]
    W --> Z
    Z --> AA[Split Out1]
    AA --> BB[Chunking]
    BB --> CC[Extract Meta Data]
    CC --> DD[3.5-turbo]
    DD --> EE[Merge]
    EE --> FF[Data Loader]
    FF --> GG[Supabase Vector Store]
    GG --> HH[Wait]
    HH --> S
    X --> II[Split Out Codes]
    X --> JJ[Split Out Categories]
    X --> KK[Intents]
    II --> LL[Upsert Codes]
    JJ --> MM[Upsert Categories]
    KK --> NN[Split out intents]
    NN --> OO[Upsert Intents]
    LL --> PP[Merge2]
    MM --> PP
    OO --> PP
    PP --> S
    Y --> S
    S -->|Complete| QQ[Send Completed Message]

Triggers

  • Manual Trigger: Click "Test workflow" button for manual execution
  • Webhook: HTTP POST to /webhook/upsert endpoint
  • Schedule Trigger: Daily execution at 12:00 PM (currently disabled)
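An external system can fire the webhook trigger with a plain HTTP POST. The sketch below builds such a request with the standard library; the host is a placeholder and the payload fields are assumptions, since the webhook's expected body is not documented.

```python
import json
import urllib.request

# Hypothetical payload; the webhook's expected body is not documented,
# so this field name is an assumption.
payload = {"folder_id": "YOUR_FOLDER_ID"}

req = urllib.request.Request(
    url="https://your-n8n-host/webhook/upsert",  # host is a placeholder
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment to actually fire the trigger
```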

Nodes Used

  • Manual Trigger: Starts the workflow for testing
  • Webhook: Accepts external trigger requests
  • Set (Configuration): Stores the Google Drive folder ID and admin chat ID
  • Google Drive: Lists folders/files and downloads documents
  • Split Out/Split In Batches: Processes multiple items iteratively
  • Switch: Routes files based on type (DOCX, PDF, JSON, etc.)
  • Google Docs: Extracts content from Google Docs
  • Extract from File: Processes PDF and other file formats
  • Code (Chunking): Splits documents into overlapping text chunks
  • Information Extractor: Uses AI to extract metadata and keywords
  • OpenAI Chat Model: Powers metadata extraction
  • Embeddings OpenAI: Generates vector embeddings
  • Supabase Vector Store: Stores documents in the vector database
  • Postgres: Manages budget codes and categories
  • Telegram: Sends notifications and approval requests
  • If/Merge: Controls workflow logic and data combination
  • Wait: Pauses execution between batch processing

External Services & Credentials Required

  • Google Drive OAuth2: Access to Google Drive folders and files
  • Google Docs OAuth2: Read Google Docs content
  • OpenAI API: Generate embeddings and power AI extraction
  • Supabase: Vector database storage
  • PostgreSQL: Structured data storage for budget codes
  • Telegram Bot: User notifications and approvals

Environment Variables

Configuration is handled through the "Configuration" node with hardcoded values:

  • folder_id: Google Drive folder ID (currently: "1sfTnMGube-MTyEbchWLQE_Cn-oKTU2G8")
  • admin_chat_id: Telegram chat ID for notifications (currently: "5207485332")

Data Flow

Input:

  • Google Drive folder containing documents (DOCX, PDF, Google Docs, JSON)
  • Webhook requests or manual triggers

Processing:

  • Document content extraction and text chunking
  • AI-powered metadata extraction (themes, keywords)
  • Vector embedding generation
  • Structured data parsing for financial codes

Output:

  • Vector embeddings stored in the Supabase documents table
  • Budget codes/categories in PostgreSQL tables
  • Telegram notifications with processing status
  • Metadata including file IDs, themes, keywords, and chunk information
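The outputs above imply a per-chunk record shape along these lines. Field names here are illustrative assumptions, not the workflow's exact schema:

```python
# Illustrative shape of one stored chunk record; field names are
# assumptions based on the outputs listed above, not the exact schema.
record = {
    "content": "chunk text goes here",
    "embedding": [0.012, -0.034],  # truncated; OpenAI embeddings are typically 1536-dimensional
    "metadata": {
        "file_id": "google-drive-file-id",
        "theme": "budgeting",
        "keywords": ["budget", "codes"],
        "chunk_index": 0,
    },
}
```

Keeping the file ID in the metadata is what makes the later "delete old document versions" step possible: stale chunks can be located by their source file.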

Error Handling

  • File Download Errors: Continues processing other files if individual downloads fail
  • Invalid File Types: Sends notification and skips unsupported files
  • Processing Failures: Uses "Continue on Error" for batch operations
  • Human Approval: Requires explicit confirmation for destructive operations (deletions)
  • Timeout Protection: 15-minute limit on approval requests
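The 15-minute approval window behaves like a poll-with-deadline loop. This is a sketch of the pattern, not n8n's internal implementation; `poll` stands in for checking the Telegram conversation:

```python
import time

def wait_for_approval(poll, timeout_s=15 * 60, interval_s=5):
    """Poll an approval source until it returns a decision or the
    timeout window elapses.

    `poll` is a stand-in for checking the Telegram conversation and
    should return "approved", "declined", or None (no answer yet).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        decision = poll()
        if decision in ("approved", "declined"):
            return decision
        time.sleep(interval_s)
    return "timeout"  # treat silence as a non-approval
```

Treating a timeout as a non-approval keeps destructive operations (vector deletions) safe by default.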

Known Limitations

  • Hardcoded configuration values require manual updates
  • Limited to specific Google Drive folder structure
  • Requires human approval for each folder before processing
  • PowerPoint files need conversion to PDF (currently disabled)
  • No automatic retry mechanism for failed operations
  • Chunk size fixed at 3000 tokens with 200-token overlap

Integration Points

This workflow likely connects to:

  • AI Assistant query/response workflows that use the generated embeddings
  • Document management workflows for content updates
  • Budget management systems that consume the financial codes data

Setup Instructions

  1. Import Workflow: Copy the workflow JSON into your n8n instance

  2. Configure Credentials:

    • Set up Google Drive OAuth2 connection
    • Configure Google Docs OAuth2 access
    • Add OpenAI API key
    • Set up Supabase connection with vector database
    • Configure PostgreSQL connection
    • Create Telegram bot and get API credentials
  3. Update Configuration Node:

    • Replace folder_id with your Google Drive folder ID
    • Update admin_chat_id with your Telegram chat ID
  4. Database Setup:

    • Ensure Supabase has documents table with vector support
    • Create PostgreSQL tables: budget_codes, budget_categories, budget_intents
  5. Test Execution:

    • Start with manual trigger to verify all connections
    • Test with a small folder containing sample documents
    • Verify vector storage and metadata extraction
  6. Production Deployment:

    • Enable webhook trigger for external integrations
    • Configure schedule trigger if needed
    • Set up monitoring for failed executions
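For step 4 (Database Setup), the statements below sketch one plausible schema. Only the table names (documents, budget_codes, budget_categories, budget_intents) come from this document; the column definitions are assumptions, and Supabase vector search normally relies on the pgvector extension:

```python
# Schema sketch for the Database Setup step. Table names come from this
# document; columns are assumptions. The 1536 dimension matches OpenAI's
# common embedding size; adjust it to your embedding model.
DDL = [
    "CREATE EXTENSION IF NOT EXISTS vector;",
    """CREATE TABLE IF NOT EXISTS documents (
        id BIGSERIAL PRIMARY KEY,
        content TEXT,
        metadata JSONB,
        embedding VECTOR(1536)
    );""",
    "CREATE TABLE IF NOT EXISTS budget_codes (code TEXT PRIMARY KEY, label TEXT);",
    "CREATE TABLE IF NOT EXISTS budget_categories (id BIGSERIAL PRIMARY KEY, name TEXT);",
    "CREATE TABLE IF NOT EXISTS budget_intents (id BIGSERIAL PRIMARY KEY, intent TEXT);",
]
# Run each statement through your Postgres client, e.g.:
# for stmt in DDL:
#     cursor.execute(stmt)
```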