Thinking signals ({"type":"thinking"} and {"type":"thinking_clear"})
were leaking into the final HTML response when they appeared in the
middle or end of chunks, concatenated with regular content.
The previous check only looked at the start of chunks with
chunk.trim().starts_with('{'), which missed embedded signals.
Solution:
- Use regex to find ALL thinking signal JSON objects anywhere in the chunk
- Send each thinking signal separately to the frontend
- Remove thinking signals from the chunk before content processing
- Skip to the next iteration if the chunk contained only thinking signals
This prevents thinking signals from appearing in the final HTML output
and ensures they're properly handled by the frontend thinking indicator.
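The extraction step above can be sketched as follows. This is a minimal, std-only illustration assuming the two signals arrive as the literal JSON objects `{"type":"thinking"}` and `{"type":"thinking_clear"}`; the actual fix uses a regex, but plain substring search behaves the same for fixed literals:

```rust
// The two fixed signal literals we need to pull out of a chunk.
const SIGNALS: [&str; 2] = [r#"{"type":"thinking_clear"}"#, r#"{"type":"thinking"}"#];

/// Returns the signals found (in order of appearance) and the chunk with
/// every signal stripped out, wherever it was embedded.
fn split_thinking_signals(chunk: &str) -> (Vec<&'static str>, String) {
    let mut rest = chunk.to_string();
    let mut found = Vec::new();
    loop {
        // Find the earliest remaining signal occurrence, if any.
        let hit = SIGNALS
            .iter()
            .filter_map(|s| rest.find(s).map(|i| (i, *s)))
            .min_by_key(|(i, _)| *i);
        match hit {
            Some((i, s)) => {
                found.push(s);
                rest.replace_range(i..i + s.len(), "");
            }
            None => break,
        }
    }
    (found, rest)
}

fn main() {
    // Signal embedded mid-chunk, concatenated with regular content:
    // the old starts_with('{') check would have missed this.
    let chunk = r#"Hello {"type":"thinking"}world"#;
    let (signals, content) = split_thinking_signals(chunk);
    assert_eq!(signals, vec![r#"{"type":"thinking"}"#]);
    assert_eq!(content, "Hello world");
    // A chunk that is only a signal yields empty content, so the caller
    // can skip to the next iteration.
    let (_, only) = split_thinking_signals(r#"{"type":"thinking_clear"}"#);
    assert!(only.trim().is_empty());
    println!("ok");
}
```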
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add Debug derive for DriveFileRepository
- Clone etag/last_modified for upsert to avoid move errors
- Fix max fail_count query to handle nullable integer
- Remove excessive trace/debug logging in hot loops
- Fix broadcast_theme_change lock contention by cloning channels before iterating
- Increase default sleep interval from 10s to 30s
- Remove [MODULE] prefixes from log messages
- Fix PDF re-download bug by using only last_modified (not ETag) for change detection
- Re-enable DriveMonitor in bootstrap (was disabled for testing)
- Kimi factory: add max_tokens=16384, temperature=1.0, top_p=1.0,
and chat_template_kwargs.thinking=true for kimi models
- Add chunk count traces in stream_response so we see LLM progress
immediately in logs: 'LLM chunk #N received (len=X)'
- Keep generic stream parser clean — model-specific logic lives in
the request builder (Kimi factory pattern)
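The Kimi factory's parameters end up in the request body roughly like this (a sketch: the model name and `stream` field are illustrative placeholders; only the four tuned fields come from the change above):

```json
{
  "model": "kimi-example-model",
  "max_tokens": 16384,
  "temperature": 1.0,
  "top_p": 1.0,
  "chat_template_kwargs": { "thinking": true },
  "stream": true
}
```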
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
1. CI: restart system container instead of just systemctl restart botserver
— ensures full env reload, Vault re-auth, DriveMonitor fresh state
2. Health endpoint: add 'commit' field with short git SHA
— build.rs passes BOTSERVER_COMMIT from CI via rustc-env
- Both /health and /api/health now report the running commit
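The commit-embedding mechanism can be sketched as follows; `running_commit` is a hypothetical helper name, and `option_env!` is used here so local builds without the CI variable still compile:

```rust
// In build.rs the CI-provided SHA is forwarded to the compiler, e.g.:
//   println!("cargo:rustc-env=BOTSERVER_COMMIT={sha}");
// The binary then embeds it at compile time.
fn running_commit() -> &'static str {
    option_env!("BOTSERVER_COMMIT").unwrap_or("unknown")
}

fn main() {
    // In the real handlers this value becomes the "commit" field of the
    // /health and /api/health JSON responses.
    println!("{{\"status\":\"ok\",\"commit\":\"{}\"}}", running_commit());
}
```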
3. WebSocket recv_task: spawn stream_response in separate tokio task
- prevents one hung LLM from freezing all message processing
- each WebSocket connection can now handle multiple messages
concurrently regardless of LLM latency
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
DriveMonitor polling may be consuming resources and interfering with
LLM response delivery. Disabling to isolate the chat pipeline.
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Previously the recv_task awaited stream_response() directly, which
froze the entire WebSocket message receiver while the LLM ran (30s+).
This meant a second user message couldn't be processed until the
first LLM call finished, head-of-line blocking that locked the session.
Now stream_response runs in its own tokio::spawn, keeping recv_task
free to handle new messages immediately. Also fixed borrow/lifetime
issue by cloning the response channel sender out of the lock scope.
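The shape of the fix, illustrated with std threads and channels (the real code uses tokio::spawn and async channels; `stream_response` here is a stand-in that just sleeps):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Stand-in for the slow LLM call the real recv_task used to await inline.
fn stream_response(msg: String, out: mpsc::Sender<String>) {
    thread::sleep(Duration::from_millis(200)); // simulated LLM latency
    let _ = out.send(format!("reply to {msg}"));
}

fn main() {
    let (msg_tx, msg_rx) = mpsc::channel::<String>();
    let (resp_tx, resp_rx) = mpsc::channel::<String>();

    // Receiver loop: hand each message to its own worker instead of
    // running the slow call inline, so the next message is picked up
    // immediately (tokio::spawn plays this role in the real code).
    let recv_task = thread::spawn(move || {
        for msg in msg_rx {
            // Clone the response sender out before handing it to the
            // worker, mirroring the borrow/lifetime fix above.
            let out = resp_tx.clone();
            thread::spawn(move || stream_response(msg, out));
        }
    });

    msg_tx.send("first".into()).unwrap();
    msg_tx.send("second".into()).unwrap();
    drop(msg_tx);

    // Both replies arrive because the two calls overlap instead of
    // serializing behind one another.
    let mut replies = vec![resp_rx.recv().unwrap(), resp_rx.recv().unwrap()];
    replies.sort();
    assert_eq!(replies, vec!["reply to first", "reply to second"]);
    recv_task.join().unwrap();
    println!("ok");
}
```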
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Without connect_timeout, reqwest can hang for the full 60s timeout
when the remote server is unreachable (DNS, TCP connect, etc.).
Now fails in 5s max for connection issues, 30s for full request.
This means one user's LLM failure no longer blocks new users for
a full minute — the channel closes quickly and the WebSocket is freed.
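In reqwest terms this is `Client::builder().connect_timeout(...).timeout(...)`. The fast-failure behavior itself can be demonstrated with only the standard library; the address below is a non-routable TEST-NET-1 address chosen so the TCP connect cannot complete:

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::{Duration, Instant};

fn main() {
    // 192.0.2.0/24 is reserved for documentation, so this connect will
    // either time out at 300ms or fail immediately, never hang.
    let addr: SocketAddr = "192.0.2.1:81".parse().unwrap();
    let start = Instant::now();
    let result = TcpStream::connect_timeout(&addr, Duration::from_millis(300));
    assert!(result.is_err()); // fails fast instead of hanging
    assert!(start.elapsed() < Duration::from_secs(5));
    println!("failed in {:?}", start.elapsed());
}
```

The same bound applies to the reqwest client: the connect phase is capped separately from the total request, so an unreachable host is reported in seconds rather than eating the whole request timeout.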
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
reqwest::Client::new() has no timeout — when external APIs (NVIDIA,
Groq, etc.) hang or throttle, the request blocks forever, freezing the
entire response pipeline for the user.
Also add std::time::Duration import to llm/mod.rs.
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
The previous fix used Handle::current().block_on() which deadlocks when
the Rhai engine runs on a Tokio worker thread — it blocks the very
thread the async task needs to make progress.
New approach: spawn a dedicated background thread with its own
single-threaded Tokio runtime, communicate via mpsc channel with a
45s timeout. This completely isolates the LLM runtime from the
caller's runtime, eliminating any possibility of thread starvation
or nested-runtime deadlock.
Also remove unused 'trace' import from llm/mod.rs.
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
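The isolation pattern above can be sketched with std threads and a channel; `call_llm_isolated` is a hypothetical name, and the placeholder closure stands in for the single-threaded Tokio runtime driving the real LLM call:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// The blocking caller (standing in for a Rhai built-in) never runs async
// work on its own thread. A dedicated thread owns the work and answers
// over a channel, bounded by a timeout.
fn call_llm_isolated(prompt: String, timeout: Duration) -> Result<String, String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Real code: tokio::runtime::Builder::new_current_thread()
        //     .enable_all().build()?.block_on(llm_call(prompt))
        let reply = format!("echo: {prompt}"); // placeholder for the LLM call
        let _ = tx.send(reply);
    });
    // The caller blocks only on the channel, never on a nested runtime,
    // so no Tokio worker thread can end up waiting on itself.
    rx.recv_timeout(timeout).map_err(|e| e.to_string())
}

fn main() {
    // 45s mirrors the timeout described above.
    let reply = call_llm_isolated("hi".into(), Duration::from_secs(45)).unwrap();
    assert_eq!(reply, "echo: hi");
    println!("{reply}");
}
```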