generalbots/botbook/src/10-configuration-deployment/multimodal.md
Rodrigo Rodriguez (Pragmatismo) 037db5c381 feat: Major workspace reorganization and documentation update
2026-04-19 08:14:25 -03:00


# Multimodal Configuration
General Bots integrates with botmodels—a Python service for multimodal AI tasks—to enable image generation, video creation, audio synthesis, and vision capabilities directly from BASIC scripts.
<img src="../assets/gb-decorative-header.svg" alt="General Bots" style="max-height: 100px; width: 100%; object-fit: contain;">
## Architecture
```
┌─────────────┐     HTTPS     ┌─────────────┐
│  botserver  │ ────────────▶ │  botmodels  │
│   (Rust)    │               │  (Python)   │
└─────────────┘               └─────────────┘
       │                             │
       │ BASIC Keywords              │ AI Models
       │ - IMAGE                     │ - Stable Diffusion
       │ - VIDEO                     │ - Zeroscope
       │ - AUDIO                     │ - TTS/Whisper
       │ - SEE                       │ - BLIP2
```
When a BASIC script calls a multimodal keyword, botserver forwards the request to botmodels, which runs the appropriate AI model and returns the generated content.
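The forwarding step can be pictured as a lookup from keyword to endpoint. The Python sketch below is illustrative only: the pairing of keywords with the endpoints listed under BotModels API Endpoints is an assumption inferred from their names, the extension list is hypothetical, and `endpoint_for` is a made-up helper rather than botserver's actual (Rust) routing code.

```python
from pathlib import PurePosixPath

# Assumed pairing of BASIC keywords with botmodels endpoints (inferred from
# the endpoint names; not confirmed by the botserver source).
KEYWORD_ENDPOINTS = {
    "IMAGE": "/api/image/generate",
    "VIDEO": "/api/video/generate",
    "AUDIO": "/api/speech/generate",
}

# Hypothetical set of extensions treated as video input for SEE.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".webm"}


def endpoint_for(keyword: str, argument: str = "") -> str:
    """Return the endpoint a keyword would plausibly be forwarded to."""
    keyword = keyword.upper()
    if keyword == "SEE":
        # SEE accepts images and videos; pick the endpoint by file extension.
        suffix = PurePosixPath(argument).suffix.lower()
        if suffix in VIDEO_EXTENSIONS:
            return "/api/vision/describe_video"
        return "/api/vision/describe"
    try:
        return KEYWORD_ENDPOINTS[keyword]
    except KeyError:
        raise ValueError(f"not a multimodal keyword: {keyword}") from None


print(endpoint_for("IMAGE"))
print(endpoint_for("SEE", "/path/to/video.mp4"))
```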
## Configuration
Add these settings to your bot's `config.csv` file to enable multimodal capabilities.
### BotModels Service
| Key | Default | Description |
|-----|---------|-------------|
| `botmodels-enabled` | `false` | Enable botmodels integration |
| `botmodels-host` | `0.0.0.0` | Host address for botmodels service |
| `botmodels-port` | `8085` | Port for botmodels service |
| `botmodels-api-key` | — | API key for authentication |
| `botmodels-https` | `false` | Use HTTPS for connection |
### Image Generation
| Key | Default | Description |
|-----|---------|-------------|
| `image-generator-model` | — | Path to image generation model |
| `image-generator-steps` | `4` | Inference steps (more = higher quality, slower) |
| `image-generator-width` | `512` | Output image width in pixels |
| `image-generator-height` | `512` | Output image height in pixels |
| `image-generator-gpu-layers` | `20` | Layers to offload to GPU |
| `image-generator-batch-size` | `1` | Batch size for generation |
### Video Generation
| Key | Default | Description |
|-----|---------|-------------|
| `video-generator-model` | — | Path to video generation model |
| `video-generator-frames` | `24` | Number of frames to generate |
| `video-generator-fps` | `8` | Output frames per second |
| `video-generator-width` | `320` | Output video width in pixels |
| `video-generator-height` | `576` | Output video height in pixels |
| `video-generator-gpu-layers` | `15` | Layers to offload to GPU |
| `video-generator-batch-size` | `1` | Batch size for generation |
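
Frame count and frame rate together determine clip length: assuming every generated frame is written out at the configured rate, the duration is frames divided by fps. A small Python sketch of that arithmetic using the defaults from the table (`clip_seconds` is a hypothetical helper name):

```python
def clip_seconds(frames: int, fps: int) -> float:
    """Length in seconds of a clip with `frames` frames played at `fps`."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frames / fps


# Defaults from the table: 24 frames at 8 fps.
print(clip_seconds(24, 8))  # 3.0
```

With the defaults, a generated clip lasts about three seconds. Raising `video-generator-frames` lengthens the clip, while raising `video-generator-fps` alone shortens it.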
## Example Configuration
```csv
key,value
botmodels-enabled,true
botmodels-host,0.0.0.0
botmodels-port,8085
botmodels-api-key,your-secret-key
botmodels-https,false
image-generator-model,../../../../data/diffusion/sd_turbo_f16.gguf
image-generator-steps,4
image-generator-width,512
image-generator-height,512
image-generator-gpu-layers,20
video-generator-model,../../../../data/diffusion/zeroscope_v2_576w
video-generator-frames,24
video-generator-fps,8
```
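
To show how these rows combine into a service address, here is a minimal Python sketch that parses the `key,value` format and builds a base URL. It assumes the file carries the `key,value` header shown above; `load_config` and `base_url` are hypothetical helpers, not part of botserver. The sample uses `localhost` as the host, matching the `curl` check under Troubleshooting, since `0.0.0.0` is normally a bind address rather than one a client connects to.

```python
import csv
import io

# Defaults taken from the BotModels Service table.
DEFAULTS = {
    "botmodels-enabled": "false",
    "botmodels-host": "0.0.0.0",
    "botmodels-port": "8085",
    "botmodels-https": "false",
}


def load_config(text: str) -> dict:
    """Parse key,value rows into a dict, filling in documented defaults."""
    settings = dict(DEFAULTS)
    for row in csv.DictReader(io.StringIO(text)):
        settings[row["key"].strip()] = row["value"].strip()
    return settings


def base_url(settings: dict) -> str:
    """Build the botmodels base URL from the host, port, and HTTPS flag."""
    scheme = "https" if settings["botmodels-https"].lower() == "true" else "http"
    return f"{scheme}://{settings['botmodels-host']}:{settings['botmodels-port']}"


example = """key,value
botmodels-enabled,true
botmodels-host,localhost
botmodels-port,8085
botmodels-https,false
"""

print(base_url(load_config(example)))  # http://localhost:8085
```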
## BASIC Keywords
Once configured, these keywords become available in your scripts.
### IMAGE
Generate an image from a text prompt:
```basic
file = IMAGE "a sunset over mountains with purple clouds"
SEND FILE TO user, file
```
The keyword returns a path to the generated image file.
### VIDEO
Generate a video from a text prompt:
```basic
file = VIDEO "a rocket launching into space"
SEND FILE TO user, file
```
Video generation is more resource-intensive than image generation. Expect longer processing times.
### AUDIO
Generate speech audio from text:
```basic
file = AUDIO "Hello, welcome to our service!"
SEND FILE TO user, file
```
### SEE
Analyze an image or video and get a description:
```basic
' Describe an image
caption = SEE "/path/to/image.jpg"
TALK caption
' Describe a video
description = SEE "/path/to/video.mp4"
TALK description
```
The SEE keyword uses vision models to understand visual content and return natural language descriptions.
## Starting BotModels
Before using multimodal features, start the botmodels service:
```bash
cd botmodels
python -m uvicorn src.main:app --host 0.0.0.0 --port 8085
```
For production with HTTPS:
```bash
python -m uvicorn src.main:app \
  --host 0.0.0.0 \
  --port 8085 \
  --ssl-keyfile key.pem \
  --ssl-certfile cert.pem
```
## BotModels API Endpoints
The botmodels service exposes these REST endpoints:
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/image/generate` | POST | Generate image from prompt |
| `/api/video/generate` | POST | Generate video from prompt |
| `/api/speech/generate` | POST | Generate speech from text |
| `/api/speech/totext` | POST | Transcribe audio to text |
| `/api/vision/describe` | POST | Describe an image |
| `/api/vision/describe_video` | POST | Describe a video |
| `/api/vision/vqa` | POST | Visual question answering |
| `/api/health` | GET | Health check |
All endpoints except `/api/health` require the `X-API-Key` header for authentication.
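
As an illustration of the header requirement, the following Python sketch prepares an authenticated POST using only the standard library. The JSON body with a `prompt` field is an assumed shape, not a documented schema, and `build_request` is a hypothetical helper; check the botmodels source for the real request format.

```python
import json
import urllib.request


def build_request(base_url: str, endpoint: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) an authenticated POST to a botmodels endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-API-Key": api_key,  # required on everything except /api/health
        },
        method="POST",
    )


# The {"prompt": ...} body is an assumed shape, not a documented schema.
req = build_request(
    "http://localhost:8085",
    "/api/image/generate",
    "your-secret-key",
    {"prompt": "a sunset over mountains with purple clouds"},
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req, timeout=60)
```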
## Model Paths
Configure model paths relative to the botmodels service directory. Typical layout:
```
data/
├── diffusion/
│   ├── sd_turbo_f16.gguf      # Stable Diffusion
│   └── zeroscope_v2_576w/     # Zeroscope video
├── tts/
│   └── model.onnx             # Text-to-speech
├── whisper/
│   └── model.bin              # Speech-to-text
└── vision/
    └── blip2/                 # Vision model
```
## GPU Acceleration
Both image and video generation benefit significantly from GPU acceleration. Configure GPU layers based on your hardware:
| GPU VRAM | Recommended GPU Layers |
|----------|----------------------|
| 4GB | 8-12 |
| 8GB | 15-20 |
| 12GB+ | 25-35 |
Reduce the GPU layer count if you encounter out-of-memory errors.
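
The table can be read as a simple lookup. The Python sketch below encodes it, with one assumption: because the table lists only 4GB, 8GB, and 12GB+, each row is treated as a lower bound, and anything under 4GB is rejected. `recommended_gpu_layers` is a hypothetical helper.

```python
def recommended_gpu_layers(vram_gb: float) -> tuple[int, int]:
    """Return the (low, high) layer range from the table for a VRAM size.

    The table lists only 4GB, 8GB, and 12GB+; treating each row as a lower
    bound (and rejecting anything under 4GB) is this sketch's assumption.
    """
    if vram_gb >= 12:
        return (25, 35)
    if vram_gb >= 8:
        return (15, 20)
    if vram_gb >= 4:
        return (8, 12)
    raise ValueError("no recommendation below 4GB of VRAM")


low, high = recommended_gpu_layers(8)
print(f"image-generator-gpu-layers,{high}")  # image-generator-gpu-layers,20
```

Starting at the top of the range and stepping down toward the low end on out-of-memory errors is one reasonable way to apply it.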
## Troubleshooting
**"BotModels is not enabled"**
Set `botmodels-enabled` to `true` in your `config.csv` (the row `botmodels-enabled,true`).
**Connection refused**
Verify that the botmodels service is running and check the `botmodels-host` and `botmodels-port` settings. Test connectivity:
```bash
curl http://localhost:8085/api/health
```
**Authentication failed**
Ensure `botmodels-api-key` in config.csv matches the `API_KEY` environment variable in botmodels.
**Model not found**
Verify model paths are correct and models are downloaded to the expected locations.
**Out of memory**
Reduce the relevant `*-gpu-layers` or `*-batch-size` setting, for example `video-generator-gpu-layers`. Video generation is particularly memory-intensive.
## Security Considerations
**Use HTTPS in production.** Set `botmodels-https` to `true` and configure SSL certificates on the botmodels service.
**Use strong API keys.** Generate cryptographically random keys for the `botmodels-api-key` setting.
**Restrict network access.** Limit botmodels service access to trusted hosts only.
**Consider GPU isolation.** Run botmodels on a dedicated GPU server if it would otherwise share GPU resources with other services.
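
For the API key recommendation above, Python's standard `secrets` module is one way to produce a cryptographically random value. This is a minimal sketch; `generate_api_key` is a hypothetical helper and the 32-byte length is an arbitrary choice, not a botmodels requirement.

```python
import secrets


def generate_api_key(num_bytes: int = 32) -> str:
    """Return a URL-safe random key drawn from the OS's secure random source."""
    return secrets.token_urlsafe(num_bytes)


key = generate_api_key()
print(f"botmodels-api-key,{key}")
```

The URL-safe alphabet contains no commas or quotes, so the key can be pasted into `config.csv` without escaping. The same value must also be supplied to botmodels through its `API_KEY` environment variable.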
## Performance Tips
**Image generation** runs fastest with SD Turbo models and 4-8 inference steps. More steps improve quality but increase generation time linearly.
**Video generation** is the most resource-intensive operation. Keep frame counts low (24-48) for reasonable response times.
**Batch processing** improves throughput when generating multiple items. Increase `image-generator-batch-size` or `video-generator-batch-size` if you have sufficient GPU memory.
**Caching** avoids regenerating identical content. If multiple users request similar content, consider storing and reusing results.
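
One way to approach caching is to key stored results on a hash of the keyword and a normalized prompt. The Python sketch below shows the idea with an in-memory dict and a stand-in generator; `cache_key`, `cached_generate`, and the normalization rule (lowercasing and collapsing whitespace) are all assumptions for illustration, not something botserver or botmodels is known to provide.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_key(kind: str, prompt: str) -> str:
    """Stable key for a request: hash of the keyword and normalized prompt."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(f"{kind}:{normalized}".encode("utf-8")).hexdigest()


def cached_generate(kind: str, prompt: str, cache: dict, generate: Callable[[str], Path]) -> Path:
    """Return a cached result when one exists, otherwise generate and store it."""
    key = cache_key(kind, prompt)
    if key not in cache:
        cache[key] = generate(prompt)
    return cache[key]


# Stand-in generator that records how often it is called.
calls = []


def fake_generate(prompt: str) -> Path:
    calls.append(prompt)
    return Path(f"/tmp/generated-{len(calls)}.png")


cache: dict = {}
first = cached_generate("IMAGE", "A sunset over mountains", cache, fake_generate)
second = cached_generate("IMAGE", "a  sunset over MOUNTAINS", cache, fake_generate)
print(first == second, len(calls))  # True 1
```

Exact-match keys like this only catch repeated prompts. Whether near-duplicates should share a result is a product decision, since generation is typically not deterministic.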
## See Also
- [LLM Configuration](./llm-config.md) - Language model settings
- [Bot Parameters](./parameters.md) - All configuration options
- [IMAGE Keyword](../04-basic-scripting/keywords.md) - Image generation reference
- [SEE Keyword](../04-basic-scripting/keywords.md) - Vision capabilities