Rodrigo Rodriguez (Pragmatismo) e06cd4d646 - From 1 to 4 validated.

2025-11-23 17:02:22 -03:00

Vector Collections

Each folder inside a .gbkb package is automatically turned into a vector collection: a searchable index that the LLM can draw on during conversations.

How Collections Work

Each .gbkb folder is processed automatically in five steps:

The folder is scanned for documents (PDF, DOCX, TXT, HTML, MD)
Text is extracted from every file
The text is split into chunks for processing
Each chunk is converted to a vector embedding using the BGE model (replaceable)
The embeddings are made available for semantic search

Folder Structure

botname.gbkb/
├── policies/        # Becomes "policies" collection
├── procedures/      # Becomes "procedures" collection
└── faqs/            # Becomes "faqs" collection
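As a rough illustration of the folder-to-collection mapping, the sketch below lists the immediate subfolders of a package directory and treats each folder name as a collection name. The function `collection_names` is hypothetical and not part of the platform; it only mirrors the rule described above.

```python
from pathlib import Path


def collection_names(kb_root: Path) -> list[str]:
    """Return one collection name per immediate subfolder of a .gbkb package.

    Hypothetical helper: loose files directly under the package root are
    ignored here, which is an assumption rather than documented behavior.
    """
    return sorted(p.name for p in kb_root.iterdir() if p.is_dir())
```

For the tree above, `collection_names(Path("botname.gbkb"))` would return the three folder names in alphabetical order.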

Using Collections

Simply activate a collection with USE KB:

USE KB "policies"
' The LLM now has access to all documents in the policies folder
' No explicit search call is needed; retrieval happens automatically during responses

Multiple Collections

Load multiple collections for comprehensive knowledge:

USE KB "policies"
USE KB "procedures" 
USE KB "faqs"
' All three collections are now active
' LLM searches across all when generating responses

Automatic Document Indexing

Documents are indexed automatically when:

Files are added to .gbkb folders
USE KB is called for the first time
The system detects new or modified files
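One plausible way to detect new or modified files is to compare content hashes against the values recorded at the previous indexing run. The sketch below assumes SHA-256 and an in-memory dictionary of known digests; both are illustrative choices, and the names `file_digest` and `files_to_index` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 of the file contents as a hex string."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def files_to_index(folder: Path, known: dict[str, str]) -> list[str]:
    """Return names of files that are new or whose content hash changed.

    `known` maps file name -> digest from the last indexing run. This store
    is an assumption; the real system's bookkeeping is not described here.
    """
    pending = []
    for path in sorted(folder.iterdir()):
        if path.is_file() and known.get(path.name) != file_digest(path):
            pending.append(path.name)
    return pending
```

A file that is absent from `known` has no recorded digest, so it is reported as pending in the same way as a modified file.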

Indexing Flow

     .gbkb/policies/
           │
           ├── vacation.pdf
           ├── handbook.docx
           └── rules.txt
                │
                ▼
    ┌─────────────────────┐
    │   File Detection    │
    │  • New files found  │
    │  • Hash comparison  │
    └──────────┬──────────┘
               │
               ▼
    ┌─────────────────────┐
    │   Text Extraction   │
    │  • PDF → Text       │
    │  • DOCX → Text      │
    │  • HTML → Text      │
    └──────────┬──────────┘
               │
               ▼
    ┌─────────────────────┐
    │     Chunking        │
    │  • Split by size    │
    │  • Overlap windows  │
    │  • ~500 tokens each │
    └──────────┬──────────┘
               │
               ▼
    ┌─────────────────────┐
    │  Generate Embeddings│
    │  • BGE Model        │
    │  • 384 dimensions   │
    │  • Batch processing │
    └──────────┬──────────┘
               │
               ▼
    ┌─────────────────────┐
    │  Store in Vector DB │
    │  • Vector storage   │
    │  • Metadata tags    │
    │  • Collection index │
    └──────────┬──────────┘
               │
               ▼
    ┌─────────────────────┐
    │    Ready to Use     │
    │  USE KB "policies"  │
    └─────────────────────┘
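The chunking stage can be sketched as a sliding window over a token list. The 500-token window matches the figure in the diagram; the 50-token overlap and the use of pre-split tokens are assumptions made for illustration, and `chunk_tokens` is a hypothetical name.

```python
def chunk_tokens(tokens: list[str], size: int = 500, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens.

    The overlap value is an assumed default, not a documented setting.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks
```

With these defaults a 1,200-token document yields three chunks, and the last 50 tokens of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still seen whole in at least one chunk.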

Website Indexing

To keep web content updated, schedule regular crawls:

' In update-content.bas
SET SCHEDULE "0 3 * * *"  ' Run daily at 3 AM
ADD WEBSITE "https://example.com/docs"
' Website content is crawled and added to the collection
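The schedule string in the script above has the shape of a standard five-field cron expression. Assuming SET SCHEDULE follows that convention, the sketch below labels each field; `describe_cron` is a hypothetical helper and does not validate field values.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expr: str) -> dict[str, str]:
    """Label the five whitespace-separated fields of a cron expression."""
    parts = expr.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))
```

For `"0 3 * * *"` the minute is `0`, the hour is `3`, and the remaining fields are wildcards, which matches the "daily at 3 AM" comment.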

How Search Works

When USE KB is active:

User asks a question
System automatically searches relevant collections
Finds semantically similar content
Injects relevant chunks into LLM context
LLM generates response using the knowledge

Important: Search happens automatically - you don't need to call any search function. Just activate the KB with USE KB and ask questions naturally.
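To make "semantically similar" concrete, the sketch below ranks stored chunks by cosine similarity to a query vector. It is a toy, pure-Python illustration: the function names are hypothetical, the vectors are supplied by the caller rather than produced by an embedding model, and a real vector database would use an index instead of a linear scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_chunks(query_vec, chunks, k=5):
    """Return the top-k (score, text) pairs, best match first.

    `chunks` is a list of (text, vector) pairs (an assumed layout).
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Scores close to 1.0 mean the chunk points in nearly the same direction as the query in embedding space, regardless of vector length.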

Search Pipeline

"What's the vacation policy?"
            │
            ▼
    ┌────────────────┐
    │  Query Embed   │
    │  BGE Model     │
    └───────┬────────┘
            │ [0.2, -0.5, 0.8, ...]
            ▼
    ┌────────────────┐
    │  Vector Search │
    │   Vector DB    │
    │  Cosine Sim    │
    └───────┬────────┘
            │
            ├─► Match 1: "Vacation days: 15 annually" (0.92)
            ├─► Match 2: "PTO policy applies to..." (0.87)
            └─► Match 3: "Time off requests via..." (0.83)
                        │
                        ▼
            ┌──────────────────────┐
            │   Context Building   │
            │  • Top 5 matches     │
            │  • 2000 tokens max   │
            │  • Relevance sorted  │
            └──────────┬───────────┘
                       │
                       ▼
            ┌──────────────────────┐
            │      LLM Call        │
            │  Context + Question  │
            │  "Based on docs..."  │
            └──────────┬───────────┘
                       │
                       ▼
                 "You get 15 vacation days per year..."

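The context-building stage of the search pipeline can be sketched as follows. The limits of five matches and 2,000 tokens come from the diagram; counting tokens by whitespace and skipping (rather than truncating) a chunk that would exceed the budget are assumptions, and `build_context` is a hypothetical name.

```python
def build_context(matches, max_matches=5, max_tokens=2000):
    """Join the best-scoring chunks into one context string.

    `matches` is a list of (score, text) pairs. Chunks are taken in order
    of descending score until the match limit or token budget is reached.
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)[:max_matches]
    selected = []
    used = 0
    for _score, text in ranked:
        cost = len(text.split())  # crude whitespace token count (assumption)
        if used + cost > max_tokens:
            continue  # skip chunks that would exceed the budget
        selected.append(text)
        used += cost
    return "\n\n".join(selected)
```

The resulting string would then be placed ahead of the user's question in the prompt sent to the LLM.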
Embeddings Configuration

The system uses BGE embeddings by default:

embedding-url,http://localhost:8082
embedding-model,../../../../data/llm/bge-small-en-v1.5-f32.gguf

You can replace BGE with any compatible embedding model by changing the model path in config.csv.
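For readers who want to inspect these settings programmatically, the sketch below reads name,value rows like the two shown above into a dictionary. It assumes config.csv is a plain two-column CSV; `load_config` is a hypothetical helper, not a platform API.

```python
import csv
import io


def load_config(text: str) -> dict[str, str]:
    """Parse name,value rows into a dict, skipping blank or short rows."""
    settings = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) >= 2:
            settings[row[0].strip()] = row[1].strip()
    return settings
```

Swapping the embedding model then amounts to changing the value stored under `embedding-model`.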

Collection Management

USE KB "name" - Activates a collection for the session
CLEAR KB - Removes all active collections
CLEAR KB "name" - Removes a specific collection
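The semantics of these three commands can be modeled as operations on a per-session list of active collections. This is a toy model for reasoning about behavior, not the platform's implementation; the class `KbSession` is hypothetical, and ignoring a repeated USE KB for the same name is an assumption.

```python
class KbSession:
    """Toy model of the active-collection state of one session."""

    def __init__(self):
        self.active = []

    def use_kb(self, name):
        # Models: USE KB "name"
        if name not in self.active:
            self.active.append(name)

    def clear_kb(self, name=None):
        # Models: CLEAR KB (no argument) and CLEAR KB "name"
        if name is None:
            self.active.clear()
        elif name in self.active:
            self.active.remove(name)
```

After `use_kb("policies")`, `use_kb("faqs")` and `clear_kb("policies")`, only `faqs` remains active; a bare `clear_kb()` empties the list.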

Best Practices

Organize by topic - One folder per subject area
Name clearly - Use descriptive folder names
Update regularly - Schedule website crawls if using web content
Keep files current - System auto-indexes changes
Don't overload - Use only necessary collections per session

Example: Customer Support Bot

support.gbkb/
├── products/        # Product documentation
├── policies/        # Company policies
├── troubleshooting/ # Common issues and solutions
└── contact/         # Contact information

In your dialog:

' Activate only the collections this dialog needs
USE KB "products"
USE KB "troubleshooting"
' Bot can now answer product questions and solve issues

Performance Notes

Collections are cached for fast access
Only active collections consume memory
Embeddings are generated once and reused
Changes trigger automatic re-indexing
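The "generated once and reused" behavior can be pictured as a cache keyed by a hash of the chunk text. The sketch below is an assumption about how such reuse might work, not a description of the actual system; `EmbeddingCache` and the `embed_fn` callback are hypothetical.

```python
import hashlib


class EmbeddingCache:
    """Cache embeddings by content hash so each distinct chunk is embedded once."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # caller-supplied function: str -> vector
        self._store = {}
        self.misses = 0  # number of times embed_fn was actually called

    def get(self, chunk: str):
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(chunk)
        return self._store[key]
```

Because the key depends only on content, an edited chunk produces a new key and is embedded again, while unchanged chunks are served from the cache.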

No manual configuration is needed: organize your documents in folders and activate them with USE KB.
