Document Indexing

Documents in .gbkb folders are indexed automatically. No manual configuration is required.

Automatic Triggers

Indexing occurs when:

Document → Extract Text → Chunk → Embed → Store in Qdrant

Stage	Description
Extract	Pull text from PDF, DOCX, HTML, MD, TXT, CSV
Chunk	Split into segments of ~500 tokens, with a 50-token overlap between consecutive segments
Embed	Generate embedding vectors using the BGE model
Store	Save to Qdrant with metadata
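
The Chunk stage can be pictured as a sliding window over the token sequence. The following is only a minimal sketch of that idea, not the indexer's actual code: the function name `chunk_tokens` is hypothetical, and whitespace-separated words stand in for real tokenizer output.

```python
def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # each window starts this many tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Assumption: whitespace splitting approximates tokenization for illustration only.
words = ("lorem ipsum " * 600).split()  # 1200 pseudo-tokens
chunks = chunk_tokens(words)
print(len(chunks), [len(c) for c in chunks])
```

With the values from the table, each window starts 450 tokens after the previous one, so the last 50 tokens of one chunk reappear at the start of the next. The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.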

Format	Notes
PDF	Full-text extraction; OCR for scanned documents
DOCX	Microsoft Word documents
TXT/MD	Plain text and Markdown
HTML	Web pages (text only)
CSV/JSON	Structured data

Schedule regular crawls for web content:

SET SCHEDULE "0 2 * * *"  ' Daily at 2 AM
USE WEBSITE "https://docs.example.com"
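
The schedule string above looks like a standard five-field cron expression, which matches the "Daily at 2 AM" comment. Assuming that is the format `SET SCHEDULE` expects, the sketch below labels each field. The helper `describe_cron` is hypothetical and only illustrates how to read the expression.

```python
# Field order of a standard five-field cron expression.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expr):
    """Map each whitespace-separated cron field to its conventional name."""
    parts = expr.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


fields = describe_cron("0 2 * * *")
for name, value in fields.items():
    print(f"{name:>13}: {value}")
```

Read this way, minute `0` and hour `2` with wildcards in the remaining fields mean the crawl runs once a day at 02:00.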

In config.csv:

name,value
embedding-url,http://localhost:8082
embedding-model,../../../../data/llm/bge-small-en-v1.5-f32.gguf
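
To check these settings outside the bot, the `name,value` layout can be read with any CSV parser. The sketch below is an illustration under assumptions, not how the platform itself loads the file: `load_config` is a hypothetical helper, and the sample text simply repeats the two rows shown above.

```python
import csv
import io

# Sample content copied from the config.csv rows above.
CONFIG_TEXT = """name,value
embedding-url,http://localhost:8082
embedding-model,../../../../data/llm/bge-small-en-v1.5-f32.gguf
"""


def load_config(text):
    """Parse name,value rows into a dictionary keyed by setting name."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["name"]: row["value"] for row in reader}


config = load_config(CONFIG_TEXT)
print(config["embedding-url"])
print(config["embedding-model"])
```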

USE KB "documentation"
' All documents now searchable
' LLM uses this knowledge automatically

Issue	Solution
Documents not found	Check that the file is in a `.gbkb` folder and verify that `USE KB` was called
Slow indexing	Large PDFs take time; consider splitting documents
Outdated content	Set up scheduled crawls for web content