Vector Collections

Each folder inside .gbkb is automatically turned into a vector collection, a searchable index that the LLM can use during conversations.

How Collections Work

Each folder in .gbkb is processed automatically:

  1. Documents are scanned (PDF, DOCX, TXT, HTML, MD)
  2. Text is extracted from every file
  3. The text is split into chunks
  4. Chunks are converted to vector embeddings using the BGE model (replaceable)
  5. The collection is made available for semantic search
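
The first step, discovering collections and their documents, can be sketched in a few lines of Python. This is only an illustration of the folder-to-collection mapping, not the server's actual scanner: the function name `discover_collections` is hypothetical, and looking only at top-level files in each folder is an assumption.

```python
from pathlib import Path

# Extensions named in step 1 above; the real scanner may accept others.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt", ".html", ".md"}


def discover_collections(gbkb_root):
    """Map each immediate subfolder of a .gbkb directory to its indexable files."""
    collections = {}
    for folder in sorted(Path(gbkb_root).iterdir()):
        if not folder.is_dir():
            continue  # loose files next to the folders belong to no collection
        collections[folder.name] = sorted(
            f.name
            for f in folder.iterdir()
            if f.is_file() and f.suffix.lower() in SUPPORTED_EXTENSIONS
        )
    return collections
```

Calling `discover_collections("botname.gbkb")` on the layout shown in the next section would return one entry per folder, keyed by the collection name.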

Folder Structure

botname.gbkb/
├── policies/        # Becomes "policies" collection
├── procedures/      # Becomes "procedures" collection
└── faqs/            # Becomes "faqs" collection

Using Collections

Simply activate a collection with USE KB:

USE KB "policies"
' The LLM now has access to all documents in the policies folder
' No need to explicitly search - happens automatically during responses

Multiple Collections

Activate several collections when a conversation spans more than one topic:

USE KB "policies"
USE KB "procedures"
USE KB "faqs"
' All three collections are now active
' The LLM searches across all of them when generating responses

Automatic Document Indexing

Documents are indexed automatically when:

  • Files are added to .gbkb folders
  • USE KB is called for the first time
  • The system detects new or modified files
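
Detecting new or modified files is commonly done by comparing content hashes against the hashes recorded at the last indexing run. The sketch below assumes SHA-256 and an in-memory record of previous digests; the function names and the storage format are hypothetical, not taken from the server.

```python
import hashlib


def file_digest(data: bytes) -> str:
    """Return a hex digest that changes whenever the file content changes."""
    return hashlib.sha256(data).hexdigest()


def files_to_reindex(current_files, known_digests):
    """List paths that are new or whose content differs from the last run.

    current_files maps path -> file bytes.
    known_digests maps path -> digest stored when the file was last indexed.
    """
    changed = [
        path
        for path, data in current_files.items()
        if known_digests.get(path) != file_digest(data)
    ]
    return sorted(changed)
```

Unchanged files produce the same digest and are skipped, so only new or edited documents go through extraction and embedding again.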

Indexing Flow

     .gbkb/policies/
           │
           ├── vacation.pdf
           ├── handbook.docx
           └── rules.txt
                │
                ▼
    ┌─────────────────────┐
    │   File Detection    │
    │  • New files found  │
    │  • Hash comparison  │
    └──────────┬──────────┘
               │
               ▼
    ┌─────────────────────┐
    │   Text Extraction   │
    │  • PDF → Text       │
    │  • DOCX → Text      │
    │  • HTML → Text      │
    └──────────┬──────────┘
               │
               ▼
    ┌─────────────────────┐
    │     Chunking        │
    │  • Split by size    │
    │  • Overlap windows  │
    │  • ~500 tokens each │
    └──────────┬──────────┘
               │
               ▼
    ┌─────────────────────┐
    │  Generate Embeddings│
    │  • BGE Model        │
    │  • 384 dimensions   │
    │  • Batch processing │
    └──────────┬──────────┘
               │
               ▼
    ┌─────────────────────┐
    │  Store in Vector DB │
    │  • Vector storage   │
    │  • Metadata tags    │
    │  • Collection index │
    └──────────┬──────────┘
               │
               ▼
    ┌─────────────────────┐
    │    Ready to Use     │
    │  USE KB "policies"  │
    └─────────────────────┘
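
The chunking stage in the flow above (fixed size with overlapping windows) can be illustrated as follows. This is a minimal sketch, not the server's chunker: it treats a pre-split list of tokens as input, and the 50-token overlap is an assumed value, since only the ~500-token chunk size is stated above.

```python
def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be in [0, size)")
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which helps retrieval quality at the cost of some duplicated storage.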

Website Indexing

To keep web content updated, schedule regular crawls:

' In update-content.bas
SET SCHEDULE "0 3 * * *"  ' Run daily at 3 AM
ADD WEBSITE "https://example.com/docs"
' Website content is crawled and added to the collection
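
The schedule string in this example follows the usual five-field cron layout (minute, hour, day of month, month, day of week). As an illustration of how "0 3 * * *" resolves to 3 AM every day, here is a deliberately small matcher; it supports only `*` and plain integers, and the name `cron_matches` is hypothetical. Whether SET SCHEDULE accepts the full cron syntax is not covered here.

```python
from datetime import datetime


def cron_matches(expr, when):
    """Return True if `when` satisfies a five-field cron expression.

    Only "*" and single integers are handled in this sketch.
    """
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("expected five space-separated fields")
    # Cron counts Sunday as 0; isoweekday() returns 7 for Sunday.
    actual = (when.minute, when.hour, when.day, when.month, when.isoweekday() % 7)
    return all(field == "*" or int(field) == value
               for field, value in zip(fields, actual))
```

For instance, `cron_matches("0 3 * * *", datetime(2024, 1, 15, 3, 0))` is true, while the same date at 03:01 is not.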

How Search Works

When USE KB is active:

  1. User asks a question
  2. System automatically searches relevant collections
  3. Finds semantically similar content
  4. Injects relevant chunks into LLM context
  5. LLM generates response using the knowledge

Important: Search happens automatically - you don't need to call any search function. Just activate the KB with USE KB and ask questions naturally.
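
Although no search call is needed in a dialog, it can help to see what the automatic step amounts to. The sketch below ranks chunks by cosine similarity and then keeps the best matches that fit a token budget, using the limits named in the pipeline diagram in this chapter (top 5 matches, 2000 tokens). It is an illustration only: the function names are hypothetical, word count stands in for a real tokenizer, and skipping an oversized chunk rather than stopping is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_context(query_vec, chunks, top_k=5, max_tokens=2000):
    """Rank (text, vector) chunks against the query and keep what fits the budget."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), text) for text, vec in chunks),
        key=lambda pair: pair[0],
        reverse=True,
    )[:top_k]
    selected, used = [], 0
    for score, text in scored:
        cost = len(text.split())  # word count as a rough stand-in for tokens
        if used + cost > max_tokens:
            continue  # assumption: skip a chunk that does not fit, try the next one
        selected.append((text, score))
        used += cost
    return selected
```

The returned list is ordered by relevance, so the most similar chunk is placed first in the context passed to the LLM.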

Search Pipeline

"What's the vacation policy?"
            │
            ▼
    ┌────────────────┐
    │  Query Embed   │
    │  BGE Model     │
    └───────┬────────┘
            │ [0.2, -0.5, 0.8, ...]
            ▼
    ┌────────────────┐
    │  Vector Search │
    │   Vector DB    │
    │  Cosine Sim    │
    └───────┬────────┘
            │
            ├─► Match 1: "Vacation days: 15 annually" (0.92)
            ├─► Match 2: "PTO policy applies to..." (0.87)
            └─► Match 3: "Time off requests via..." (0.83)
                        │
                        ▼
            ┌──────────────────────┐
            │   Context Building   │
            │  • Top 5 matches     │
            │  • 2000 tokens max   │
            │  • Relevance sorted  │
            └──────────┬───────────┘
                       │
                       ▼
            ┌──────────────────────┐
            │      LLM Call        │
            │  Context + Question  │
            │  "Based on docs..."  │
            └──────────┬───────────┘
                       │
                       ▼
                 "You get 15 vacation days per year..."

Embeddings Configuration

The system uses BGE embeddings by default. The relevant config.csv entries are:

embedding-url,http://localhost:8082
embedding-model,../../../../data/llm/bge-small-en-v1.5-f32.gguf

You can replace BGE with any compatible embedding model by changing the embedding-model path in config.csv.
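
For tooling that needs to read these settings, the key,value layout shown above is easy to parse. The following is a minimal sketch that assumes every non-empty row is a key followed by a value; the helper name `load_config` is hypothetical, and any other conventions the real config.csv may have are not handled.

```python
import csv
import io


def load_config(text):
    """Parse key,value rows into a dict, ignoring blank or incomplete rows."""
    config = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 2:
            continue  # blank line or a row without a value
        config[row[0].strip()] = row[1].strip()
    return config
```

With the two lines above as input, `load_config(...)["embedding-url"]` yields `http://localhost:8082`.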

Collection Management

  • USE KB "name" - Activates a collection for the session
  • CLEAR KB - Removes all active collections
  • CLEAR KB "name" - Removes a specific collection
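
The session semantics of these three commands can be modeled with a small class. This is a toy model for reasoning about behavior, not the server's implementation: the class `KbSession` and its methods are hypothetical, and treating a repeated USE KB or a CLEAR KB of an inactive name as a no-op is an assumption.

```python
class KbSession:
    """Tracks which collections are active for one conversation session."""

    def __init__(self):
        self.active = []  # activation order is preserved

    def use_kb(self, name):
        """Model of USE KB "name": activate a collection once."""
        if name not in self.active:
            self.active.append(name)

    def clear_kb(self, name=None):
        """Model of CLEAR KB (all collections) and CLEAR KB "name" (one)."""
        if name is None:
            self.active.clear()
        elif name in self.active:
            self.active.remove(name)
```

Because only the names in `active` are searched, clearing a collection that is no longer relevant narrows retrieval for the rest of the session.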

Best Practices

  1. Organize by topic - One folder per subject area
  2. Name clearly - Use descriptive folder names
  3. Update regularly - Schedule website crawls if using web content
  4. Keep files current - System auto-indexes changes
  5. Don't overload - Use only necessary collections per session

Example: Customer Support Bot

support.gbkb/
├── products/        # Product documentation
├── policies/        # Company policies
├── troubleshooting/ # Common issues and solutions
└── contact/         # Contact information

In your dialog:

' Activate the collections needed for product support
USE KB "products"
USE KB "troubleshooting"
' The bot can now answer product questions and troubleshoot issues

Performance Notes

  • Collections are cached for fast access
  • Only active collections consume memory
  • Embeddings are generated once and reused
  • Changes trigger automatic re-indexing
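
The "generated once and reused" behavior can be pictured as a cache keyed by chunk content. The sketch below is an assumption about how such reuse might work, not a description of the server's cache: `EmbeddingCache` is a hypothetical name, and `embed_fn` stands for whatever function calls the embedding model.

```python
import hashlib


class EmbeddingCache:
    """Computes an embedding once per distinct chunk text and reuses it afterwards."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # callable: str -> list of floats
        self._store = {}
        self.misses = 0  # number of times the model was actually called

    def get(self, chunk_text):
        key = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = self._embed_fn(chunk_text)
            self.misses += 1
        return self._store[key]
```

Keying on a content hash means an edited chunk gets a new embedding automatically, while unchanged chunks never hit the model again.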

No manual configuration is needed: organize your documents in folders and activate them with USE KB.