
Web Automation

The web automation module enables BotServer to interact with websites, extract content, and perform automated browser tasks.

Overview

Web automation features allow bots to:

  • Crawl and index website content
  • Extract structured data from web pages
  • Automate form submissions
  • Capture screenshots
  • Monitor website changes
  • Perform headless browser operations

Configuration

Enable web automation in config.csv:

webAutomationEnabled,true
browserTimeout,30000
maxCrawlDepth,3
userAgent,BotServer/1.0

Features

Website Crawling

The ADD_WEBSITE keyword triggers web crawling:

ADD_WEBSITE "https://example.com"

This will:

  1. Launch a headless browser
  2. Navigate to the URL
  3. Extract text content
  4. Follow internal links (respecting robots.txt)
  5. Index the content in the vector database
  6. Make the content searchable via the FIND keyword

Content Extraction

Extract specific data from web pages:

url = "https://news.example.com"
content = GET url
headlines = EXTRACT_CSS content, "h2.headline"

Form Automation

Submit forms programmatically:

NAVIGATE "https://example.com/contact"
FILL_FIELD "name", customer_name
FILL_FIELD "email", customer_email
FILL_FIELD "message", inquiry_text
CLICK_BUTTON "submit"
result = GET_PAGE_TEXT()

Screenshot Capture

Capture a screenshot of the rendered page:

NAVIGATE "https://example.com/dashboard"
screenshot = CAPTURE_SCREENSHOT()
SAVE_FILE screenshot, "dashboard.png"

Change Monitoring

Monitor websites for updates:

SET_MONITOR "https://example.com/status", "hourly"
ON "website_changed" DO
    changes = GET_CHANGES()
    SEND_MAIL admin_email, "Website Updated", changes
END ON

Crawler Configuration

Crawl Rules

Control crawler behavior:

Setting | Description | Default
maxDepth | Maximum crawl depth | 3
maxPages | Maximum pages to crawl | 100
crawlDelay | Delay between requests (ms) | 1000
respectRobots | Honor robots.txt | true
followRedirects | Follow HTTP redirects | true
includeImages | Extract image URLs | false
includePDFs | Process PDF links | true
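
Assuming these settings accept the same config.csv format shown in the Configuration section (the key names here are taken from the table above; note the earlier example spells the depth setting maxCrawlDepth, so verify which spelling your installation expects), overriding the defaults might look like:

maxDepth,5
maxPages,200
crawlDelay,2000
respectRobots,true
includeImages,true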

Selector Strategies

Extract content using CSS selectors:

' Extract specific elements
titles = EXTRACT_CSS page, "h1, h2, h3"
paragraphs = EXTRACT_CSS page, "p"
links = EXTRACT_CSS page, "a[href]"
images = EXTRACT_CSS page, "img[src]"

Or XPath expressions:

' XPath extraction
prices = EXTRACT_XPATH page, "//span[@class='price']"
reviews = EXTRACT_XPATH page, "//div[@class='review-text']"

Browser Automation

Navigation

Control browser navigation:

NAVIGATE "https://example.com"
WAIT_FOR_ELEMENT "#content"
SCROLL_TO_BOTTOM()
BACK()
FORWARD()
REFRESH()

Interaction

Interact with page elements:

CLICK "#login-button"
TYPE "#username", user_credentials
SELECT "#country", "USA"
CHECK "#agree-terms"
UPLOAD_FILE "#document", "report.pdf"

Waiting Strategies

Wait for specific conditions:

WAIT_FOR_ELEMENT "#results"
WAIT_FOR_TEXT "Loading complete"
WAIT_FOR_URL "success"
WAIT_SECONDS 3

Data Processing

Structured Data Extraction

Extract structured data from pages:

products = EXTRACT_TABLE "#product-list"
FOR EACH product IN products
    SAVE_TO_DB product.name, product.price, product.stock
NEXT

Content Cleaning

Clean extracted content:

raw_text = GET_PAGE_TEXT()
clean_text = REMOVE_HTML(raw_text)
clean_text = REMOVE_SCRIPTS(clean_text)
clean_text = NORMALIZE_WHITESPACE(clean_text)

Performance Optimization

Caching

Cache crawled content:

IF NOT IN_CACHE(url) THEN
    content = CRAWL_URL(url)
    CACHE_SET(url, content, "1 hour")
ELSE
    content = CACHE_GET(url)
END IF

Parallel Processing

Process multiple URLs concurrently:

urls = ["url1", "url2", "url3"]
results = PARALLEL_CRAWL(urls, max_workers=5)

Security Considerations

Authentication

Handle authenticated sessions:

LOGIN "https://example.com/login", username, password
cookie = GET_COOKIE("session")
' Use cookie for subsequent requests
NAVIGATE "https://example.com/dashboard"

Rate Limiting

Respect rate limits:

CONFIGURE_CRAWLER(
    rate_limit = 10,  ' requests per second
    user_agent = "BotServer/1.0",
    timeout = 30000
)

Content Filtering

Filter inappropriate content:

content = CRAWL_URL(url)
IF CONTAINS_INAPPROPRIATE(content) THEN
    LOG_WARNING "Inappropriate content detected"
    SKIP_URL(url)
END IF

Error Handling

Handle common web automation errors:

TRY
    content = CRAWL_URL(url)
CATCH "timeout"
    LOG "Page load timeout: " + url
    RETRY_WITH_DELAY(5000)
CATCH "404"
    LOG "Page not found: " + url
    MARK_AS_BROKEN(url)
CATCH "blocked"
    LOG "Access blocked, might need CAPTCHA"
    USE_PROXY()
END TRY

Integration with Knowledge Base

Automatically index crawled content:

ADD_WEBSITE "https://docs.example.com"
' Content is automatically indexed

' Later, search the indexed content
answer = FIND "installation guide"
TALK answer

Monitoring and Logging

Track automation activities:

START_MONITORING()
result = CRAWL_URL(url)
metrics = GET_METRICS()
LOG "Pages crawled: " + metrics.page_count
LOG "Time taken: " + metrics.duration
LOG "Data extracted: " + metrics.data_size

Best Practices

  1. Respect robots.txt: Always honor website crawling rules
  2. Use appropriate delays: Don't overwhelm servers
  3. Handle errors gracefully: Implement retry logic
  4. Cache when possible: Reduce redundant requests
  5. Monitor performance: Track crawling metrics
  6. Secure credentials: Never hardcode passwords
  7. Test selectors: Verify CSS/XPath selectors work
  8. Clean data: Remove unnecessary HTML/scripts
  9. Set timeouts: Prevent infinite waiting
  10. Log activities: Maintain audit trail
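
Several of these practices (delays, retry logic, and caching) compose naturally. This is a sketch, not a prescribed pattern, using only keywords that appear earlier in this chapter:

IF NOT IN_CACHE(url) THEN
    TRY
        content = CRAWL_URL(url)
        CACHE_SET(url, content, "1 hour")
    CATCH "timeout"
        LOG "Page load timeout, retrying: " + url
        RETRY_WITH_DELAY(5000)
    END TRY
    WAIT_SECONDS 1    ' crawl delay between requests
ELSE
    content = CACHE_GET(url)
END IF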

Limitations

  • JavaScript-heavy sites may require additional configuration
  • CAPTCHA-protected sites need manual intervention
  • Some sites block automated access
  • Large-scale crawling requires distributed setup
  • Dynamic content may need special handling

Troubleshooting

Page Not Loading

  • Check network connectivity
  • Verify URL is accessible
  • Increase timeout values
  • Check for JavaScript requirements
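
Assuming CONFIGURE_CRAWLER and the waiting keywords shown earlier apply here, a larger timeout plus an explicit wait is a reasonable first attempt for slow or JavaScript-rendered pages:

CONFIGURE_CRAWLER(
    timeout = 60000    ' double the 30000 ms default
)
NAVIGATE url
WAIT_FOR_ELEMENT "#content"
content = GET_PAGE_TEXT()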

Content Not Found

  • Verify CSS selectors are correct
  • Check if content is dynamically loaded
  • Wait for elements to appear
  • Use browser developer tools to test

Access Denied

  • Check user agent settings
  • Verify authentication credentials
  • Respect rate limits
  • Consider using proxies

Implementation

The web automation module is located in src/web_automation/ and uses:

  • Headless browser engine for rendering
  • HTML parsing for content extraction
  • Request throttling for rate limiting
  • Vector database for content indexing