LLM Configuration

Configuration for Language Model integration in BotServer, supporting both local GGUF models and external API services.

Local Model Configuration

BotServer is designed to work with local GGUF models by default:

llm-key,none
llm-url,http://localhost:8081
llm-model,../../../../data/llm/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_M.gguf

Model Path

The llm-model parameter accepts:

  • Relative paths: ../../../../data/llm/model.gguf
  • Absolute paths: /opt/models/model.gguf
  • Model names: a hosted model identifier such as gpt-4, when using an external API

Supported Model Formats

  • GGUF: the quantized model file format used for CPU/GPU inference
  • Q3_K_M, Q4_K_M, Q5_K_M: GGUF quantization levels (lower numbers mean smaller files and lower quality)
  • F16, F32: full-precision (unquantized) variants
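As a rough guide, a model's file size (and the RAM needed to hold its weights) is about parameter count × bits per weight ÷ 8. The sketch below uses approximate bits-per-weight figures for the common levels; exact values vary by model, so treat the numbers as estimates only:

```python
# Approximate average bits per weight for common GGUF quantization
# levels -- rough figures, not exact per-model values.
BITS_PER_WEIGHT = {
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "F16": 16.0,
    "F32": 32.0,
}

def estimate_gib(params_billions: float, quant: str) -> float:
    """Estimate model file size in GiB for a given quantization level."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * 1e9 * bits / 8 / 2**30
```

For example, a 7B model at Q4_K_M works out to roughly 4 GiB for the weights alone; the KV cache for the context window comes on top of that.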

LLM Server Configuration

Running Embedded Server

BotServer can run its own LLM server:

llm-server,true
llm-server-path,botserver-stack/bin/llm/build/bin
llm-server-host,0.0.0.0
llm-server-port,8081

Server Performance Parameters

llm-server-gpu-layers,0
llm-server-ctx-size,4096
llm-server-n-predict,1024
llm-server-parallel,6
llm-server-cont-batching,true

  • llm-server-gpu-layers: layers to offload to the GPU (0 = CPU only; higher = more GPU)
  • llm-server-ctx-size: context window size (more context = more memory)
  • llm-server-n-predict: maximum tokens to generate (limits response length)
  • llm-server-parallel: number of concurrent requests (higher = more throughput)
  • llm-server-cont-batching: continuous batching (improves multi-user performance)
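The llm-server-path above points at an llm/build/bin directory, which suggests a llama.cpp-style server. Assuming that (the flag mapping here is an assumption, not documented BotServer behavior), the parameters correspond roughly to llama-server command-line flags, as this sketch illustrates:

```python
# Hypothetical mapping from BotServer config keys to llama.cpp
# `llama-server` flags -- verify the flag names against your build.
def build_server_argv(cfg: dict) -> list[str]:
    """Assemble a llama-server command line from a config dict."""
    argv = [
        "llama-server",
        "-m", cfg["llm-model"],
        "--host", cfg.get("llm-server-host", "0.0.0.0"),
        "--port", cfg.get("llm-server-port", "8081"),
        "--n-gpu-layers", cfg.get("llm-server-gpu-layers", "0"),
        "--ctx-size", cfg.get("llm-server-ctx-size", "4096"),
        "--n-predict", cfg.get("llm-server-n-predict", "1024"),
        "--parallel", cfg.get("llm-server-parallel", "6"),
    ]
    # Boolean-style settings become bare flags.
    if cfg.get("llm-server-cont-batching") == "true":
        argv.append("--cont-batching")
    if cfg.get("llm-server-mlock") == "true":
        argv.append("--mlock")
    if cfg.get("llm-server-no-mmap") == "true":
        argv.append("--no-mmap")
    return argv
```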

Memory Management

llm-server-mlock,false
llm-server-no-mmap,false

  • mlock: locks the model in RAM, preventing it from being swapped out
  • no-mmap: disables memory mapping, loading the entire model into RAM up front

Cache Configuration

Basic Cache Settings

llm-cache,false
llm-cache-ttl,3600

Caching reduces repeated LLM calls for identical inputs.

Semantic Cache

llm-cache-semantic,true
llm-cache-threshold,0.95

Semantic caching matches similar (not just identical) queries:

  • threshold: 0.95 = 95% similarity required
  • Lower threshold = more cache hits but less accuracy
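Under the hood, a semantic cache embeds each query as a vector and reuses a stored response when cosine similarity clears the threshold. A toy sketch of the lookup (real systems use an embedding model and a vector index; the plain lists here are stand-ins):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def lookup(query_vec: list[float], cache: list, threshold: float = 0.95):
    """Return the cached response most similar to query_vec,
    provided that similarity meets the threshold; else None."""
    best, best_sim = None, 0.0
    for vec, response in cache:
        sim = cosine(query_vec, vec)
        if sim >= threshold and sim > best_sim:
            best, best_sim = response, sim
    return best
```

Lowering the threshold makes near-miss queries hit the cache, which is exactly the accuracy trade-off described above.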

External API Configuration

Groq and OpenAI-Compatible APIs

For cloud inference, Groq provides a low-latency OpenAI-compatible API:

llm-key,gsk-your-groq-api-key
llm-url,https://api.groq.com/openai/v1
llm-model,mixtral-8x7b-32768
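These three settings map directly onto an OpenAI-compatible chat-completions request: llm-url is the base URL, llm-key becomes the bearer token, and llm-model the model field. The sketch below only assembles the request (no network call); the /chat/completions path is the OpenAI API convention:

```python
def build_chat_request(base_url: str, api_key: str, model: str,
                       user_message: str) -> tuple[str, dict, dict]:
    """Assemble endpoint, headers, and JSON body for an
    OpenAI-compatible chat-completions call."""
    endpoint = base_url.rstrip("/") + "/chat/completions"
    headers = {"Content-Type": "application/json"}
    # llm-key,none (the local default) means no auth header is sent.
    if api_key and api_key != "none":
        headers["Authorization"] = f"Bearer {api_key}"
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return endpoint, headers, body
```

POST the body to the endpoint with those headers using any HTTP client; the same function works unchanged against a local server at http://localhost:8081.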

Local API Servers

llm-key,none
llm-url,http://localhost:8081
llm-model,local-model-name

Configuration Examples

Minimal Local Setup

name,value
llm-url,http://localhost:8081
llm-model,../../../../data/llm/model.gguf

High-Performance Local

name,value
llm-server,true
llm-server-gpu-layers,32
llm-server-ctx-size,8192
llm-server-parallel,8
llm-server-cont-batching,true
llm-cache,true
llm-cache-semantic,true

Low-Resource Setup

name,value
llm-server-ctx-size,2048
llm-server-n-predict,512
llm-server-parallel,2
llm-cache,false
llm-server-mlock,false

External API

name,value
llm-key,sk-...
llm-url,https://api.anthropic.com
llm-model,claude-3
llm-cache,true
llm-cache-ttl,7200

Performance Tuning

For Responsiveness

  • Decrease llm-server-ctx-size
  • Decrease llm-server-n-predict
  • Enable llm-cache
  • Enable llm-cache-semantic

For Quality

  • Increase llm-server-ctx-size
  • Increase llm-server-n-predict
  • Use higher-precision weights (Q5_K_M, or unquantized F16)
  • Disable semantic cache or increase threshold

For Multiple Users

  • Enable llm-server-cont-batching
  • Increase llm-server-parallel
  • Enable caching
  • Consider GPU offloading

Model Selection Guidelines

Small Models (1-3B parameters)

  • Fast responses
  • Low memory usage
  • Good for simple tasks
  • Example: DeepSeek-R1-Distill-Qwen-1.5B

Medium Models (7-13B parameters)

  • Balanced performance
  • Moderate memory usage
  • Good general purpose
  • Example: Llama-2-7B, Mistral-7B

Large Models (30B+ parameters)

  • Best quality
  • High memory requirements
  • Complex reasoning
  • Example: Llama-2-70B, Mixtral-8x7B

Troubleshooting

Model Won't Load

  • Check file path exists
  • Verify sufficient RAM
  • Ensure compatible GGUF version
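GGUF files begin with the four-byte magic GGUF, so a bad path or a non-GGUF file can be diagnosed before attempting a load. A small diagnostic sketch (not part of BotServer):

```python
import os

def check_gguf(path: str) -> str:
    """Return a short diagnosis for a GGUF model path."""
    if not os.path.isfile(path):
        return "missing: file not found at " + path
    with open(path, "rb") as f:
        magic = f.read(4)
    if magic != b"GGUF":
        return "invalid: file does not start with the GGUF magic bytes"
    return "ok"
```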

Slow Responses

  • Reduce context size
  • Enable caching
  • Use GPU offloading
  • Choose smaller model

Out of Memory

  • Reduce llm-server-ctx-size
  • Reduce llm-server-parallel
  • Use a more aggressively quantized model (Q3 instead of Q5)
  • Disable llm-server-mlock

Connection Refused

  • Verify llm-server is set to true
  • Check that the port is not already in use
  • Ensure the firewall allows the connection
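A plain TCP connect test separates "the server never started" from problems further up the stack (an illustrative sketch):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If port_open("localhost", 8081) is False, the LLM server is not listening; if it is True but requests still fail, look at the URL path, model name, or API key instead.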

Best Practices

  1. Start Small: Begin with small models and scale up
  2. Use Caching: Enable for production deployments
  3. Monitor Memory: Watch RAM usage during operation
  4. Test Thoroughly: Verify responses before production
  5. Document Models: Keep notes on model performance
  6. Version Control: Track config.csv changes