
Web Automation

The web automation module enables BotServer to interact with websites, extract content, and perform automated browser tasks.

Overview

Web automation features allow bots to:

  • Crawl and index website content
  • Extract structured data from web pages
  • Automate form submissions
  • Capture screenshots
  • Monitor website changes
  • Perform headless browser operations

Configuration

Enable web automation in config.csv:

webAutomationEnabled,true
browserTimeout,30000
maxCrawlDepth,3
userAgent,BotServer/1.0

Features

Website Crawling

The ADD_WEBSITE keyword triggers web crawling:

ADD_WEBSITE "https://example.com"

This will:

  1. Launch a headless browser
  2. Navigate to the URL
  3. Extract text content
  4. Follow internal links (respecting robots.txt)
  5. Index the content in the vector database
  6. Make the content searchable via the FIND keyword

Content Extraction

Extract specific data from web pages:

url = "https://news.example.com"
content = GET url
headlines = EXTRACT_CSS content, "h2.headline"

Form Automation

Submit forms programmatically:

NAVIGATE "https://example.com/contact"
FILL_FIELD "name", customer_name
FILL_FIELD "email", customer_email
FILL_FIELD "message", inquiry_text
CLICK_BUTTON "submit"
result = GET_PAGE_TEXT()

Screenshot Capture

Capture a screenshot of the rendered page:

NAVIGATE "https://example.com/dashboard"
screenshot = CAPTURE_SCREENSHOT()
SAVE_FILE screenshot, "dashboard.png"

Change Monitoring

Monitor websites for updates:

SET_MONITOR "https://example.com/status", "hourly"
ON "website_changed" DO
    changes = GET_CHANGES()
    SEND_MAIL admin_email, "Website Updated", changes
END ON

Crawler Configuration

Crawl Rules

Control crawler behavior:

Setting | Description | Default
maxDepth | Maximum crawl depth | 3
maxPages | Maximum pages to crawl | 100
crawlDelay | Delay between requests (ms) | 1000
respectRobots | Honor robots.txt | true
followRedirects | Follow HTTP redirects | true
includeImages | Extract image URLs | false
includePDFs | Process PDF links | true
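
Assuming these settings accept the same config.csv format shown in the Configuration section (the key names here are taken from the table above; note the earlier example spells the depth setting maxCrawlDepth, so verify which spelling your installation expects), overriding the defaults might look like:

maxDepth,5
maxPages,200
crawlDelay,2000
respectRobots,true
includeImages,true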

Selector Strategies

Extract content using CSS selectors:

' Extract specific elements
titles = EXTRACT_CSS page, "h1, h2, h3"
paragraphs = EXTRACT_CSS page, "p"
links = EXTRACT_CSS page, "a[href]"
images = EXTRACT_CSS page, "img[src]"

Or XPath expressions:

' XPath extraction
prices = EXTRACT_XPATH page, "//span[@class='price']"
reviews = EXTRACT_XPATH page, "//div[@class='review-text']"

Browser Automation

Navigation

Control browser navigation:

NAVIGATE "https://example.com"
WAIT_FOR_ELEMENT "#content"
SCROLL_TO_BOTTOM()
BACK()
FORWARD()
REFRESH()

Interaction

Interact with page elements:

CLICK "#login-button"
TYPE "#username", user_credentials
SELECT "#country", "USA"
CHECK "#agree-terms"
UPLOAD_FILE "#document", "report.pdf"

Waiting Strategies

Wait for specific conditions:

WAIT_FOR_ELEMENT "#results"
WAIT_FOR_TEXT "Loading complete"
WAIT_FOR_URL "success"
WAIT_SECONDS 3

Data Processing

Structured Data Extraction

Extract structured data from pages:

products = EXTRACT_TABLE "#product-list"
FOR EACH product IN products
    SAVE_TO_DB product.name, product.price, product.stock
NEXT

Content Cleaning

Clean extracted content:

raw_text = GET_PAGE_TEXT()
clean_text = REMOVE_HTML(raw_text)
clean_text = REMOVE_SCRIPTS(clean_text)
clean_text = NORMALIZE_WHITESPACE(clean_text)

Performance Optimization

Caching

Cache crawled content:

IF NOT IN_CACHE(url) THEN
    content = CRAWL_URL(url)
    CACHE_SET(url, content, "1 hour")
ELSE
    content = CACHE_GET(url)
END IF

Parallel Processing

Process multiple URLs concurrently:

urls = ["url1", "url2", "url3"]
results = PARALLEL_CRAWL(urls, max_workers=5)

Security Considerations

Authentication

Handle authenticated sessions:

LOGIN "https://example.com/login", username, password
cookie = GET_COOKIE("session")
' Use cookie for subsequent requests
NAVIGATE "https://example.com/dashboard"

Rate Limiting

Respect rate limits:

CONFIGURE_CRAWLER(
    rate_limit = 10,  ' requests per second
    user_agent = "BotServer/1.0",
    timeout = 30000
)

Content Filtering

Filter inappropriate content:

content = CRAWL_URL(url)
IF CONTAINS_INAPPROPRIATE(content) THEN
    LOG_WARNING "Inappropriate content detected"
    SKIP_URL(url)
END IF

Error Handling

Handle common web automation errors:

TRY
    content = CRAWL_URL(url)
CATCH "timeout"
    LOG "Page load timeout: " + url
    RETRY_WITH_DELAY(5000)
CATCH "404"
    LOG "Page not found: " + url
    MARK_AS_BROKEN(url)
CATCH "blocked"
    LOG "Access blocked, might need CAPTCHA"
    USE_PROXY()
END TRY

Integration with Knowledge Base

Automatically index crawled content:

ADD_WEBSITE "https://docs.example.com"
' Content is automatically indexed

' Later, search the indexed content
answer = FIND "installation guide"
TALK answer

Monitoring and Logging

Track automation activities:

START_MONITORING()
result = CRAWL_URL(url)
metrics = GET_METRICS()
LOG "Pages crawled: " + metrics.page_count
LOG "Time taken: " + metrics.duration
LOG "Data extracted: " + metrics.data_size

Best Practices

  1. Respect robots.txt: Always honor website crawling rules
  2. Use appropriate delays: Don't overwhelm servers
  3. Handle errors gracefully: Implement retry logic
  4. Cache when possible: Reduce redundant requests
  5. Monitor performance: Track crawling metrics
  6. Secure credentials: Never hardcode passwords
  7. Test selectors: Verify CSS/XPath selectors work
  8. Clean data: Remove unnecessary HTML/scripts
  9. Set timeouts: Prevent infinite waiting
  10. Log activities: Maintain audit trail
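
Several of these practices (delays, retry logic, and caching) compose naturally. This is a sketch, not a prescribed pattern, using only keywords that appear earlier in this chapter:

IF NOT IN_CACHE(url) THEN
    TRY
        content = CRAWL_URL(url)
        CACHE_SET(url, content, "1 hour")
    CATCH "timeout"
        LOG "Page load timeout, retrying: " + url
        RETRY_WITH_DELAY(5000)
    END TRY
    WAIT_SECONDS 1    ' crawl delay between requests
ELSE
    content = CACHE_GET(url)
END IF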

Limitations

  • JavaScript-heavy sites may require additional configuration
  • CAPTCHA-protected sites need manual intervention
  • Some sites block automated access
  • Large-scale crawling requires distributed setup
  • Dynamic content may need special handling

Troubleshooting

Page Not Loading

  • Check network connectivity
  • Verify URL is accessible
  • Increase timeout values
  • Check for JavaScript requirements
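
Assuming CONFIGURE_CRAWLER and the waiting keywords shown earlier apply here, a larger timeout plus an explicit wait is a reasonable first attempt for slow or JavaScript-rendered pages:

CONFIGURE_CRAWLER(
    timeout = 60000    ' double the 30000 ms default
)
NAVIGATE url
WAIT_FOR_ELEMENT "#content"
content = GET_PAGE_TEXT()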

Content Not Found

  • Verify CSS selectors are correct
  • Check if content is dynamically loaded
  • Wait for elements to appear
  • Use browser developer tools to test

Access Denied

  • Check user agent settings
  • Verify authentication credentials
  • Respect rate limits
  • Consider using proxies

Implementation

The web automation module is located in src/web_automation/ and uses:

  • Headless browser engine for rendering
  • HTML parsing for content extraction
  • Request throttling for rate limiting
  • Vector database for content indexing