Document Processing API
botserver provides RESTful endpoints for processing, extracting, and analyzing various document formats including PDFs, Office documents, and images.
Overview
The Document Processing API enables:
- Text extraction from documents
- OCR for scanned documents
- Metadata extraction
- Document conversion
- Content analysis and summarization
Base URL
http://localhost:9000/api/v1/documents
Authentication
All Document Processing API requests require authentication:
Authorization: Bearer <token>
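As a minimal sketch, the header can be attached with the Python standard library alone. The helper name `build_request` and the token value are illustrative and not part of botserver; the request is only constructed here, not sent.

```python
import urllib.request

BASE_URL = "http://localhost:9000/api/v1/documents"


def build_request(path: str, token: str, method: str = "GET") -> urllib.request.Request:
    """Return an unsent request for BASE_URL + path carrying the bearer token."""
    return urllib.request.Request(
        BASE_URL + path,
        method=method,
        headers={"Authorization": f"Bearer {token}"},
    )


# Hypothetical token; substitute a real one before sending with urlopen().
req = build_request("/process/prc_xyz789/status", token="token123")
print(req.get_method(), req.full_url)
```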
Endpoints
Upload Document
POST /upload
Upload a document for processing.
Request:
- Method: POST
- Content-Type: multipart/form-data
Form Data:
- file - The document file
- process_options - JSON string of processing options
Example Request:
curl -X POST \
-H "Authorization: Bearer token123" \
-F "file=@document.pdf" \
-F 'process_options={"extract_text":true,"extract_metadata":true}' \
http://localhost:9000/api/v1/documents/upload
Response:
{
"document_id": "doc_abc123",
"filename": "document.pdf",
"size_bytes": 2048576,
"mime_type": "application/pdf",
"status": "processing",
"uploaded_at": "2024-01-15T10:00:00Z"
}
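Because `process_options` is a JSON string inside a form field, it may be safer to serialize it than to write it by hand. The sketch below assumes only the two flags shown in the example request; `encode_process_options` is a hypothetical helper.

```python
import json


def encode_process_options(**options: bool) -> str:
    """Serialize processing flags into the JSON string used by the form field."""
    return json.dumps(options, separators=(",", ":"))


# Form fields to send alongside the uploaded file.
form_fields = {
    "process_options": encode_process_options(extract_text=True, extract_metadata=True)
}
print(form_fields["process_options"])
```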
Process Document
POST /process
Process an already uploaded document.
Request Body:
{
"document_id": "doc_abc123",
"operations": [
"extract_text",
"extract_metadata",
"generate_summary",
"extract_entities"
],
"options": {
"language": "en",
"ocr_enabled": true,
"chunk_size": 1000
}
}
Response:
{
"document_id": "doc_abc123",
"process_id": "prc_xyz789",
"status": "processing",
"estimated_completion": "2024-01-15T10:02:00Z"
}
Get Processing Status
GET /process/{process_id}/status
Check the status of document processing.
Response:
{
"process_id": "prc_xyz789",
"document_id": "doc_abc123",
"status": "completed",
"progress": 100,
"completed_at": "2024-01-15T10:01:30Z",
"results_available": true
}
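A client will typically poll this endpoint until processing finishes. The sketch below assumes that any status other than "processing" is terminal; only "processing" and "completed" appear on this page, so other values are a guess. `fetch_status` is a caller-supplied function that performs the GET and returns the parsed JSON.

```python
import time
from typing import Callable


def wait_for_completion(
    fetch_status: Callable[[str], dict],
    process_id: str,
    interval: float = 2.0,
    timeout: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the status leaves "processing" or the timeout elapses."""
    waited = 0.0
    while True:
        status = fetch_status(process_id)
        if status["status"] != "processing":
            return status  # assumed terminal, e.g. "completed"
        if waited >= timeout:
            raise TimeoutError(f"{process_id} still processing after {waited:.0f}s")
        sleep(interval)
        waited += interval
```

Injecting `sleep` keeps the function testable without real delays. For long-running jobs, the webhook mechanism described later avoids polling altogether.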
Get Extracted Text
GET /documents/{document_id}/text
Retrieve extracted text from a processed document.
Query Parameters:
- page - Specific page number (optional)
- format - Output format: plain, markdown, html
Response:
{
"document_id": "doc_abc123",
"text": "This is the extracted text from the document...",
"pages": 10,
"word_count": 5420,
"language": "en"
}
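The query parameters can be encoded as in the following sketch. The URL layout is an assumption taken from this page's Python integration example, and `text_url` is a hypothetical helper; the server's default for `format` is not documented here, so the sketch always sends it explicitly.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "http://localhost:9000/api/v1/documents"
ALLOWED_FORMATS = ("plain", "markdown", "html")


def text_url(document_id: str, page: Optional[int] = None, fmt: str = "plain") -> str:
    """Build the extracted-text URL with its optional query parameters."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {ALLOWED_FORMATS}, got {fmt!r}")
    params = {"format": fmt}
    if page is not None:
        params["page"] = page
    return f"{BASE_URL}/{document_id}/text?{urlencode(params)}"


print(text_url("doc_abc123", page=3, fmt="markdown"))
```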
Get Document Metadata
GET /documents/{document_id}/metadata
Retrieve metadata from a document.
Response:
{
"document_id": "doc_abc123",
"metadata": {
"title": "Annual Report 2024",
"author": "John Doe",
"created_date": "2024-01-10T08:00:00Z",
"modified_date": "2024-01-14T16:30:00Z",
"pages": 10,
"producer": "Microsoft Word",
"keywords": ["annual", "report", "finance"],
"custom_properties": {
"department": "Finance",
"confidentiality": "Internal"
}
}
}
Generate Summary
POST /documents/{document_id}/summarize
Generate an AI summary of the document.
Request Body:
{
"type": "abstractive",
"length": "medium",
"focus_areas": ["key_points", "conclusions"],
"language": "en"
}
Response:
{
"document_id": "doc_abc123",
"summary": "This document discusses the annual financial performance...",
"key_points": [
"Revenue increased by 15%",
"New market expansion successful",
"Operating costs reduced"
],
"summary_length": 250
}
Extract Entities
POST /documents/{document_id}/entities
Extract named entities from the document.
Request Body:
{
"entity_types": ["person", "organization", "location", "date", "money"],
"confidence_threshold": 0.7
}
Response:
{
"document_id": "doc_abc123",
"entities": [
{
"text": "John Smith",
"type": "person",
"confidence": 0.95,
"occurrences": 5
},
{
"text": "New York",
"type": "location",
"confidence": 0.88,
"occurrences": 3
},
{
"text": "$1.5 million",
"type": "money",
"confidence": 0.92,
"occurrences": 2
}
]
}
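On the client side, the entity list is often easier to work with when grouped by type. The sketch below reuses the sample response above; `entities_by_type` is a hypothetical helper, and its `min_confidence` filter is applied locally, independent of the server-side `confidence_threshold`.

```python
from collections import defaultdict

response = {
    "document_id": "doc_abc123",
    "entities": [
        {"text": "John Smith", "type": "person", "confidence": 0.95, "occurrences": 5},
        {"text": "New York", "type": "location", "confidence": 0.88, "occurrences": 3},
        {"text": "$1.5 million", "type": "money", "confidence": 0.92, "occurrences": 2},
    ],
}


def entities_by_type(payload: dict, min_confidence: float = 0.0) -> dict:
    """Group entity texts by type, keeping those at or above min_confidence."""
    grouped = defaultdict(list)
    for entity in payload["entities"]:
        if entity["confidence"] >= min_confidence:
            grouped[entity["type"]].append(entity["text"])
    return dict(grouped)


print(entities_by_type(response, min_confidence=0.9))
# {'person': ['John Smith'], 'money': ['$1.5 million']}
```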
Convert Document
POST /documents/{document_id}/convert
Convert document to another format.
Request Body:
{
"target_format": "pdf",
"options": {
"compress": true,
"quality": "high",
"page_size": "A4"
}
}
Response:
{
"document_id": "doc_abc123",
"converted_id": "doc_def456",
"original_format": "docx",
"target_format": "pdf",
"download_url": "/api/v1/documents/doc_def456/download"
}
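The `download_url` in the response is a server-relative path, so a client needs to resolve it against the host before fetching. A minimal sketch, assuming the local base URL used throughout this page (`absolute_download_url` is a hypothetical helper):

```python
from urllib.parse import urljoin

BASE_URL = "http://localhost:9000/api/v1/documents"


def absolute_download_url(download_url: str, base_url: str = BASE_URL) -> str:
    """Resolve a server-relative download path against the API host."""
    return urljoin(base_url, download_url)


print(absolute_download_url("/api/v1/documents/doc_def456/download"))
# http://localhost:9000/api/v1/documents/doc_def456/download
```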
Search Within Document
POST /documents/{document_id}/search
Search for text within a document.
Request Body:
{
"query": "revenue growth",
"case_sensitive": false,
"whole_words": false,
"regex": false
}
Response:
{
"document_id": "doc_abc123",
"matches": [
{
"page": 3,
"line": 15,
"context": "...the company achieved significant revenue growth in Q4...",
"position": 1247
},
{
"page": 7,
"line": 8,
"context": "...projecting continued revenue growth for next year...",
"position": 3892
}
],
"total_matches": 2
}
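The three boolean flags can be thought of as modifiers on a pattern match. The sketch below shows one plausible local interpretation using Python's `re` module; it is an assumption about the semantics, not a description of how botserver implements search.

```python
import re


def compile_query(
    query: str,
    case_sensitive: bool = False,
    whole_words: bool = False,
    regex: bool = False,
) -> "re.Pattern[str]":
    """Translate the search flags into a compiled regular expression."""
    # Treat the query literally unless regex mode is requested.
    pattern = query if regex else re.escape(query)
    if whole_words:
        pattern = rf"\b(?:{pattern})\b"
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.compile(pattern, flags)


text = "...the company achieved significant revenue growth in Q4..."
matcher = compile_query("Revenue Growth")
print([m.group() for m in matcher.finditer(text)])
```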
Split Document
POST /documents/{document_id}/split
Split a document into multiple parts.
Request Body:
{
"method": "by_pages",
"pages_per_split": 5
}
Response:
{
"document_id": "doc_abc123",
"parts": [
{
"part_id": "part_001",
"pages": "1-5",
"download_url": "/api/v1/documents/part_001/download"
},
{
"part_id": "part_002",
"pages": "6-10",
"download_url": "/api/v1/documents/part_002/download"
}
],
"total_parts": 2
}
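To anticipate how many parts a `by_pages` split will produce, the page ranges can be computed locally. This sketch reproduces the "1-5", "6-10" ranges from the sample response; how the server labels a shorter final part is an assumption.

```python
from typing import List


def page_ranges(total_pages: int, pages_per_split: int) -> List[str]:
    """Return 1-based inclusive page ranges for a by_pages split."""
    if pages_per_split < 1:
        raise ValueError("pages_per_split must be at least 1")
    return [
        f"{start}-{min(start + pages_per_split - 1, total_pages)}"
        for start in range(1, total_pages + 1, pages_per_split)
    ]


print(page_ranges(10, 5))  # ['1-5', '6-10']
print(page_ranges(12, 5))  # ['1-5', '6-10', '11-12']
```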
Merge Documents
POST /documents/merge
Merge multiple documents into one.
Request Body:
{
"document_ids": ["doc_abc123", "doc_def456", "doc_ghi789"],
"output_format": "pdf",
"preserve_metadata": true
}
Response:
{
"merged_document_id": "doc_merged_xyz",
"source_count": 3,
"total_pages": 30,
"download_url": "/api/v1/documents/doc_merged_xyz/download"
}
Supported Formats
Input Formats
- Documents: PDF, DOCX, DOC, ODT, RTF, TXT
- Spreadsheets: XLSX, XLS, ODS, CSV
- Presentations: PPTX, PPT, ODP
- Images: PNG, JPG, JPEG, GIF, BMP, TIFF
- Web: HTML, XML, Markdown
Output Formats
- Plain Text
- Markdown
- HTML
- JSON
- CSV (for tabular data)
Processing Options
OCR Options
{
"ocr_enabled": true,
"ocr_language": "eng",
"ocr_engine": "tesseract",
"preprocessing": {
"deskew": true,
"remove_noise": true,
"enhance_contrast": true
}
}
Text Extraction Options
{
"preserve_formatting": false,
"extract_tables": true,
"extract_images": false,
"chunk_text": true,
"chunk_size": 1000,
"chunk_overlap": 100
}
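To illustrate how `chunk_size` and `chunk_overlap` interact, the sketch below splits text into overlapping windows. It assumes both values are measured in characters; this page does not say whether the server counts characters, words, or tokens.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> List[str]:
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so with the defaults above consecutive chunks share 100 characters.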
Summary Options
{
"summary_type": "extractive",
"summary_length": "medium",
"bullet_points": true,
"include_keywords": true,
"max_sentences": 5
}
Batch Processing
Submit Batch
POST /batch/process
Process multiple documents in batch.
Request Body:
{
"documents": [
{
"document_id": "doc_001",
"operations": ["extract_text", "summarize"]
},
{
"document_id": "doc_002",
"operations": ["extract_entities"]
}
],
"notify_on_completion": true,
"webhook_url": "https://example.com/webhook"
}
Get Batch Status
GET /batch/{batch_id}/status
Check batch processing status.
Response:
{
"batch_id": "batch_abc123",
"total_documents": 10,
"processed": 7,
"failed": 1,
"pending": 2,
"completion_percentage": 70
}
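The counters in the sample response are consistent with `completion_percentage` being `processed / total_documents`, with failed documents counted separately. The sketch below derives a few client-side indicators on that assumption; `summarize_batch` and its output keys are hypothetical.

```python
def summarize_batch(status: dict) -> dict:
    """Derive simple progress indicators from a batch status payload."""
    total = status["total_documents"]
    finished = status["processed"] + status["failed"]
    return {
        "done": status["pending"] == 0,
        "completion_percentage": round(100 * status["processed"] / total) if total else 0,
        "failure_rate": status["failed"] / finished if finished else 0.0,
    }


sample = {
    "batch_id": "batch_abc123",
    "total_documents": 10,
    "processed": 7,
    "failed": 1,
    "pending": 2,
    "completion_percentage": 70,
}
print(summarize_batch(sample))
# {'done': False, 'completion_percentage': 70, 'failure_rate': 0.125}
```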
Error Responses
400 Bad Request
{
"error": "unsupported_format",
"message": "File format .xyz is not supported",
"supported_formats": ["pdf", "docx", "txt"]
}
413 Payload Too Large
{
"error": "file_too_large",
"message": "File size exceeds maximum limit",
"max_size_bytes": 52428800,
"provided_size_bytes": 104857600
}
422 Unprocessable Entity
{
"error": "corrupted_file",
"message": "The document appears to be corrupted and cannot be processed"
}
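Since every error body shares the `error` and `message` fields, a client can map them onto a single exception type. A minimal sketch, assuming the three payload shapes above; `DocumentApiError` and `raise_for_api_error` are hypothetical names.

```python
class DocumentApiError(Exception):
    """Raised when the API returns an error payload."""

    def __init__(self, status_code: int, error: str, message: str, details: dict):
        super().__init__(f"{status_code} {error}: {message}")
        self.status_code = status_code
        self.error = error
        self.details = details


def raise_for_api_error(status_code: int, body: dict) -> None:
    """Raise DocumentApiError for 4xx/5xx responses; do nothing otherwise."""
    if status_code < 400:
        return
    # Keep extra fields such as max_size_bytes available to the caller.
    details = {k: v for k, v in body.items() if k not in ("error", "message")}
    raise DocumentApiError(
        status_code, body.get("error", "unknown"), body.get("message", ""), details
    )


try:
    raise_for_api_error(413, {
        "error": "file_too_large",
        "message": "File size exceeds maximum limit",
        "max_size_bytes": 52428800,
        "provided_size_bytes": 104857600,
    })
except DocumentApiError as exc:
    print(exc, exc.details["max_size_bytes"])
```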
Webhooks
Configure webhooks to receive processing notifications. A notification payload looks like this:
{
"event": "document.processed",
"document_id": "doc_abc123",
"status": "completed",
"results": {
"text_extracted": true,
"summary_generated": true,
"entities_extracted": true
}
}
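A receiver might parse the payload as in this sketch. It assumes the body arrives as raw JSON bytes and handles only the `document.processed` event shown above; `handle_webhook` is a hypothetical function, and any signature or authenticity check is left out because this page does not describe one.

```python
import json
from typing import List


def handle_webhook(raw_body: bytes) -> List[str]:
    """Return the names of results that are ready for a completed document."""
    event = json.loads(raw_body)
    if event.get("event") != "document.processed":
        return []  # ignore event types this sketch does not know about
    if event.get("status") != "completed":
        return []
    return sorted(name for name, ok in event.get("results", {}).items() if ok)


payload = json.dumps({
    "event": "document.processed",
    "document_id": "doc_abc123",
    "status": "completed",
    "results": {
        "text_extracted": True,
        "summary_generated": True,
        "entities_extracted": True,
    },
}).encode()
print(handle_webhook(payload))
# ['entities_extracted', 'summary_generated', 'text_extracted']
```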
Rate Limits
| Operation | Limit | Scope |
|---|---|---|
| Upload Document | 50/hour | Per user |
| Process Document | 100/hour | Per user |
| Generate Summary | 20/hour | Per user |
| Batch Processing | 5/hour | Per user |
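Clients can stay under these limits by throttling locally. The sketch below uses a sliding one-hour window; whether the server counts a fixed or a sliding window is not stated here, so treat it as a conservative approximation. `HourlyBudget` and the keys of `HOURLY_LIMITS` are hypothetical names.

```python
import time
from collections import deque
from typing import Callable

# Per-user hourly limits from the table above.
HOURLY_LIMITS = {"upload": 50, "process": 100, "summarize": 20, "batch": 5}


class HourlyBudget:
    """Client-side sliding-window counter for one operation."""

    def __init__(
        self,
        limit: int,
        window_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.sent = deque()  # timestamps of requests inside the window

    def try_acquire(self) -> bool:
        """Record a request and return True if the budget allows it."""
        now = self.clock()
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) >= self.limit:
            return False
        self.sent.append(now)
        return True


batch_budget = HourlyBudget(HOURLY_LIMITS["batch"])
print(batch_budget.try_acquire())  # True
```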
Best Practices
- Preprocess Documents: Clean scanned documents before OCR
- Use Appropriate Formats: Choose the right output format for your use case
- Batch Similar Documents: Process similar documents together for efficiency
- Handle Large Files: Use chunking for large documents
- Cache Results: Store processed results to avoid reprocessing
- Monitor Processing: Use webhooks for long-running operations
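The caching advice can be sketched as an in-memory cache keyed by a hash of the file content plus the requested operations. `ResultCache` is a hypothetical client-side helper, and `compute` stands in for whatever function actually calls the API.

```python
import hashlib
from typing import Callable, Iterable


class ResultCache:
    """Remember processing results so identical files are not resubmitted."""

    def __init__(self) -> None:
        self._results = {}

    @staticmethod
    def key(content: bytes, operations: Iterable[str]) -> tuple:
        """Identify a job by content hash and the sorted set of operations."""
        return hashlib.sha256(content).hexdigest(), tuple(sorted(operations))

    def get_or_compute(
        self, content: bytes, operations: Iterable[str], compute: Callable[[], dict]
    ) -> dict:
        """Return the cached result, calling compute() only on a miss."""
        k = self.key(content, operations)
        if k not in self._results:
            self._results[k] = compute()
        return self._results[k]
```

A persistent store would be needed for the cache to survive restarts; the keying scheme stays the same.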
Integration Examples
Python Example
import requests

# Upload and process document
with open('document.pdf', 'rb') as f:
    response = requests.post(
        'http://localhost:9000/api/v1/documents/upload',
        headers={'Authorization': 'Bearer token123'},
        files={'file': f},
        data={'process_options': '{"extract_text": true}'}
    )
response.raise_for_status()
document_id = response.json()['document_id']

# Get extracted text. The upload response reports status "processing",
# so the text may not be available until processing has completed.
text_response = requests.get(
    f'http://localhost:9000/api/v1/documents/{document_id}/text',
    headers={'Authorization': 'Bearer token123'}
)
text_response.raise_for_status()
print(text_response.json()['text'])
Related APIs
- Storage API - Document storage
- ML API - Advanced text analysis
- Knowledge Base API - Document indexing