# Vector Collections

A **vector collection** is automatically generated from each folder in `.gbkb`. Each folder becomes a searchable collection that the LLM can use during conversations.

## How Collections Work

Each folder inside `.gbkb` is processed automatically:

1. The folder is scanned for documents (PDF, DOCX, TXT, HTML, MD)
2. Text is extracted from all files
3. The text is split into chunks for processing
4. Chunks are converted to vector embeddings using a BGE model (replaceable)
5. The collection is made available for semantic search

## Folder Structure

```
botname.gbkb/
├── policies/      # Becomes "policies" collection
├── procedures/    # Becomes "procedures" collection
└── faqs/          # Becomes "faqs" collection
```

## Using Collections

Simply activate a collection with `USE KB`:

```basic
USE KB "policies"
' The LLM now has access to all documents in the policies folder
' No need to explicitly search - happens automatically during responses
```

## Multiple Collections

Load multiple collections for comprehensive knowledge:

```basic
USE KB "policies"
USE KB "procedures"
USE KB "faqs"
' All three collections are now active
' LLM searches across all when generating responses
```

## Automatic Document Indexing

Documents are indexed automatically when:

- Files are added to `.gbkb` folders
- `USE KB` is called for the first time
- The system detects new or modified files

### Indexing Flow

```
.gbkb/policies/
│
├── vacation.pdf
├── handbook.docx
└── rules.txt
           │
           ▼
┌─────────────────────┐
│  File Detection     │
│  • New files found  │
│  • Hash comparison  │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Text Extraction    │
│  • PDF → Text       │
│  • DOCX → Text      │
│  • HTML → Text      │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Chunking           │
│  • Split by size    │
│  • Overlap windows  │
│  • ~500 tokens each │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Generate Embeddings│
│  • BGE Model        │
│  • 384 dimensions   │
│  • Batch processing │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Store in Vector DB │
│  • Vector storage   │
│  • Metadata tags    │
│  • Collection index │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Ready to Use       │
│  USE KB "policies"  │
└─────────────────────┘
```

## Website Indexing

To keep web content updated, schedule regular crawls:

```basic
' In update-content.bas
SET SCHEDULE "0 3 * * *"  ' Run daily at 3 AM
ADD WEBSITE "https://example.com/docs"
' Website content is crawled and added to the collection
```

## How Search Works

When `USE KB` is active:

1. User asks a question
2. System automatically searches relevant collections
3. Finds semantically similar content
4. Injects relevant chunks into LLM context
5. LLM generates response using the knowledge

**Important**: Search happens automatically - you don't need to call any search function. Just activate the KB with `USE KB` and ask questions naturally.

### Search Pipeline

```
"What's the vacation policy?"
        │
        ▼
┌────────────────┐
│  Query Embed   │
│  BGE Model     │
└───────┬────────┘
        │ [0.2, -0.5, 0.8, ...]
        ▼
┌────────────────┐
│  Vector Search │
│  Vector DB     │
│  Cosine Sim    │
└───────┬────────┘
        │
        ├─► Match 1: "Vacation days: 15 annually" (0.92)
        ├─► Match 2: "PTO policy applies to..." (0.87)
        └─► Match 3: "Time off requests via..." (0.83)
           │
           ▼
┌──────────────────────┐
│  Context Building    │
│  • Top 5 matches     │
│  • 2000 tokens max   │
│  • Relevance sorted  │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  LLM Call            │
│  Context + Question  │
│  "Based on docs..."  │
└──────────┬───────────┘
           │
           ▼
"You get 15 vacation days per year..."
```

## Embeddings Configuration

The system uses BGE embeddings by default:

```csv
embedding-url,http://localhost:8082
embedding-model,../../../../data/llm/bge-small-en-v1.5-f32.gguf
```

You can replace BGE with any compatible embedding model by changing the model path in config.csv.

## Collection Management

- `USE KB "name"` - Activates a collection for the session
- `CLEAR KB` - Removes all active collections
- `CLEAR KB "name"` - Removes a specific collection

## Best Practices

1. **Organize by topic** - One folder per subject area
2. **Name clearly** - Use descriptive folder names
3. **Update regularly** - Schedule website crawls if using web content
4. **Keep files current** - System auto-indexes changes
5. **Don't overload** - Use only necessary collections per session

## Example: Customer Support Bot

```
support.gbkb/
├── products/          # Product documentation
├── policies/          # Company policies
├── troubleshooting/   # Common issues and solutions
└── contact/           # Contact information
```

In your dialog:

```basic
' Activate the product and troubleshooting knowledge
USE KB "products"
USE KB "troubleshooting"
' Bot can now answer product questions and solve issues
```

## Performance Notes

- Collections are cached for fast access
- Only active collections consume memory
- Embeddings are generated once and reused
- Changes trigger automatic re-indexing

No manual configuration needed - just organize your documents in folders and use `USE KB` to activate them!
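
## Appendix: Sketch of Chunking and Search

For readers who want a mental model of what happens behind `USE KB`, the chunking and search stages can be sketched in a few lines of Python. This is an illustration only, not the system's actual implementation: the function names (`chunk_words`, `toy_embed`, `cosine`, `search`) are hypothetical, words stand in for tokens, the 50-word overlap is an assumed value, and a simple word-count vector replaces the BGE model so the example runs without any dependencies. The chunk size, top-5 limit, and 2000-unit budget are taken from the diagrams in this page.

```python
import math
from collections import Counter


def chunk_words(text, size=500, overlap=50):
    """Split text into overlapping windows of up to `size` words.

    Words approximate tokens here; the overlap value is an assumption.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text):
    """Stand-in for a real embedding model: a sparse word-count vector."""
    cleaned = (w.strip(".,?!:;\"'").lower() for w in text.split())
    return Counter(w for w in cleaned if w)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, chunks, top_k=5, max_words=2000):
    """Return (score, chunk) pairs, most relevant first, within a word budget."""
    q = toy_embed(query)
    scored = sorted(
        ((cosine(q, toy_embed(c)), c) for c in chunks),
        key=lambda pair: pair[0],
        reverse=True,
    )
    context, used = [], 0
    for score, chunk in scored[:top_k]:
        if score <= 0:
            break  # remaining chunks share no words with the query
        n = len(chunk.split())
        if used + n > max_words:
            break  # adding this chunk would exceed the context budget
        context.append((score, chunk))
        used += n
    return context


if __name__ == "__main__":
    docs = [
        "Vacation days: 15 annually for full-time staff",
        "Expense reports are due monthly",
        "Time off requests go through the HR portal",
    ]
    for score, chunk in search("How many vacation days do I get?", docs):
        print(round(score, 2), chunk)
```

With a real embedding model, semantically related chunks (such as the time-off entry) would also score above zero; the word-count stand-in only rewards exact word overlap, which is why it is useful for showing the flow but not for judging retrieval quality.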