# Document Indexing

Document indexing in BotServer is automatic. When documents are added to `.gbkb` folders, they are processed and made searchable without any manual configuration.

## Automatic Indexing

The system automatically indexes documents when:
- Files are added to any `.gbkb` folder
- `USE KB` is called for a collection
- Files are modified or updated
- `USE WEBSITE` registers websites for crawling (preprocessing) and associates them with sessions (runtime)

## How Indexing Works

1. **Document Detection** - System scans `.gbkb` folders for files
2. **Text Extraction** - Content extracted from PDF, DOCX, HTML, MD, TXT
3. **Chunking** - Text split into manageable segments
4. **Embedding Generation** - Chunks converted to vectors using BGE model
5. **Storage** - Vectors stored for semantic search

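The chunking step can be illustrated with a short Python sketch. This is not BotServer's implementation: the word-based splitting, the `max_words` and `overlap` parameters, and the function name `chunk_text` are all assumptions chosen for illustration, since the chunk size and strategy are not documented here.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word-based chunks that share `overlap` words.

    Hypothetical helper: the real chunk size and strategy may differ.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Overlapping chunks are a common way to avoid cutting a sentence's context exactly at a chunk boundary; each chunk would then be passed to the embedding model.
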
## Supported File Types

- **PDF** - Full text extraction
- **DOCX** - Microsoft Word documents
- **TXT** - Plain text files
- **HTML** - Web pages (text only)
- **MD** - Markdown documents
- **CSV** - Structured data

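To check ahead of time whether a file would qualify for indexing, a small Python helper along these lines may be useful. It is only a sketch: `is_indexable` is a hypothetical name, and treating extensions as case-insensitive is an assumption rather than documented behavior.

```python
from pathlib import PurePosixPath

# Extensions taken from the supported file types listed above.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt", ".html", ".md", ".csv"}


def is_indexable(path: str) -> bool:
    """Return True if the file sits under a .gbkb folder and has a supported type."""
    p = PurePosixPath(path)
    # Any parent directory ending in .gbkb counts as a knowledge-base folder.
    in_kb_folder = any(part.endswith(".gbkb") for part in p.parts[:-1])
    # Assumption: extension matching ignores case.
    return in_kb_folder and p.suffix.lower() in SUPPORTED_EXTENSIONS
```

For example, `is_indexable("company.gbkb/products/manual-v1.pdf")` is true, while a `.png` in the same folder or a `.md` file outside any `.gbkb` folder is not.
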
## Website Indexing

To keep web content fresh, schedule regular crawls:

```basic
' In update-docs.bas
SET SCHEDULE "0 2 * * *"  ' Run daily at 2 AM

USE WEBSITE "https://docs.example.com"
' Website is registered for crawling during preprocessing
' At runtime, it associates the crawled content with the session
```

### Scheduling Options

```basic
SET SCHEDULE "0 * * * *"     ' Every hour
SET SCHEDULE "*/30 * * * *"  ' Every 30 minutes
SET SCHEDULE "0 0 * * 0"     ' Weekly on Sunday
SET SCHEDULE "0 0 1 * *"     ' Monthly on the 1st
```

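The schedule strings above look like standard five-field cron expressions (minute, hour, day of month, month, day of week). Assuming that is the format `SET SCHEDULE` expects, the following Python sketch shows how each field is read. The function names are hypothetical, and only `*`, `*/n`, and single numbers are handled.

```python
from datetime import datetime

CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def parse_cron(expression: str) -> dict[str, str]:
    """Label the five space-separated fields of a cron expression."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


def field_matches(field: str, value: int) -> bool:
    """Match '*', '*/n', or a single number (no lists or ranges).

    '*/n' is treated as value % n == 0, which is correct for the
    zero-based minute and hour fields used in the examples.
    """
    if field == "*":
        return True
    if field.startswith("*/"):
        return value % int(field[2:]) == 0
    return int(field) == value


def is_due(expression: str, when: datetime) -> bool:
    """Return True if `when` satisfies every field of the expression.

    Simplification: all fields are ANDed. Standard cron ORs day-of-month
    and day-of-week when both are restricted; this sketch does not.
    """
    fields = parse_cron(expression)
    actual = {
        "minute": when.minute,
        "hour": when.hour,
        "day_of_month": when.day,
        "month": when.month,
        "day_of_week": when.isoweekday() % 7,  # cron counts Sunday as 0
    }
    return all(field_matches(fields[name], actual[name]) for name in CRON_FIELDS)
```

Under these assumptions, `is_due("0 2 * * *", datetime(2024, 1, 8, 2, 0))` is true, and the same expression at 2:30 is false.
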
## Real-Time Updates

Documents are re-indexed automatically when:
- File content changes
- New files appear in folders
- Files are deleted (removed from index)

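One way to picture how these three cases can be told apart is to compare content hashes between two scans of a folder. The Python sketch below assumes this hash-comparison approach purely for illustration; how BotServer actually detects changes is not described here, and `snapshot` and `diff_snapshots` are hypothetical names.

```python
import hashlib
from pathlib import Path


def snapshot(folder: Path) -> dict[str, str]:
    """Map each file's relative path to a SHA-256 digest of its content."""
    return {
        p.relative_to(folder).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def diff_snapshots(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify files as added, changed, or deleted between two snapshots."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
        "deleted": sorted(old.keys() - new.keys()),
    }
```

Files in `added` and `changed` would be (re)indexed, and files in `deleted` would have their entries removed from the index.
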
## Using Indexed Content

Once indexed, content is automatically available:

```basic
USE KB "documentation"
' All documents in the documentation folder are now searchable
' The LLM will use this knowledge when answering questions
```

You don't need to explicitly search - the system does it automatically when generating responses.

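For readers curious what an automatic semantic search involves, here is a minimal Python sketch. It assumes retrieval ranks stored chunk vectors by cosine similarity to the question's vector, which is a common approach but not confirmed for BotServer. The tiny two-dimensional vectors in the usage note are toy values, not real BGE embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Return the k chunks whose vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]
```

With an index such as `[("refund policy", [1.0, 0.0]), ("shipping times", [0.0, 1.0]), ("returns", [0.9, 0.1])]`, a query vector of `[1.0, 0.0]` ranks "refund policy" first and "returns" second. The selected chunks would then be supplied to the LLM as context.
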
## Configuration

Indexing uses settings from `config.csv`:

```csv
embedding-url,http://localhost:8082
embedding-model,../../../../data/llm/bge-small-en-v1.5-f32.gguf
```

The BGE embedding model can be replaced with any compatible model.

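To inspect these settings from a script, the file can be read as simple key/value rows. The Python sketch below assumes each row is `key,value` with no header row, matching the excerpt above; a real `config.csv` may contain other rows or columns, and `load_config` is a hypothetical helper.

```python
import csv
import io


def load_config(text: str) -> dict[str, str]:
    """Parse key,value rows into a dict, skipping rows with fewer than two cells."""
    settings: dict[str, str] = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 2:
            continue  # blank or malformed line
        settings[row[0].strip()] = row[1].strip()
    return settings
```

For the excerpt above, `load_config(text)["embedding-url"]` yields `http://localhost:8082`.
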
## Performance Optimization

The system optimizes indexing by:
- Processing only changed files
- Caching embeddings
- Parallel processing when possible
- Incremental updates

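As an illustration of embedding caching, the Python sketch below keys cached vectors by a hash of the chunk text, so an unchanged chunk is never embedded twice. The class name, the hashing scheme, and the in-memory store are assumptions for illustration only; BotServer's actual cache is not described here.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Cache embeddings by the SHA-256 of the chunk text (hypothetical design)."""

    def __init__(self, embed: Callable[[str], list[float]]):
        self._embed = embed  # the expensive call, e.g. a request to the embedding server
        self._store: dict[str, list[float]] = {}
        self.misses = 0  # number of times the embed function was actually called

    def get(self, chunk: str) -> list[float]:
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed(chunk)
        return self._store[key]
```

Requesting the same chunk twice calls the embedding function once, which is what makes re-indexing a lightly edited document cheap: only the chunks whose text changed miss the cache.
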
## Example: Knowledge Base Maintenance

Structure your knowledge base:
```
company.gbkb/
├── products/
│   ├── manual-v1.pdf
│   └── specs.docx
├── policies/
│   ├── hr-policy.pdf
│   └── it-policy.md
└── news/
    └── updates.html
```

Schedule regular web updates:
```basic
' In maintenance.bas
SET SCHEDULE "0 1 * * *"

' Register websites for crawling
USE WEBSITE "https://company.com/news"
USE WEBSITE "https://company.com/products"
' Websites are crawled by background service
```

## Best Practices

1. **Organize documents** by topic in separate folders
2. **Schedule updates** for web content
3. **Keep files updated** - system handles re-indexing
4. **Monitor folder sizes** - very large collections may impact performance
5. **Use clear naming** - helps with organization

## Troubleshooting

### Documents Not Appearing
- Check that the file is in a `.gbkb` folder
- Verify that the file type is supported
- Ensure `USE KB` was called for that collection

### Slow Indexing
- Large PDFs may take time to process
- Consider splitting very large documents
- Check available system resources

### Outdated Content
- Set up scheduled crawls for web content
- Ensure files are being updated
- Check that re-indexing is triggered

Remember: Indexing is automatic - just add documents to folders and use `USE KB` to activate them!