Web Crawl
The Web Crawl component allows you to extract content from multiple web pages simultaneously. You can create dynamic URL lists by combining multiple text inputs with a template, similar to the Text Aggregator component.
Credit Cost
The cost depends on the content of the crawled pages. For reference, crawling the introduction page costs 5 credits.
Usage
The Web Crawl component has multiple input handles that accept text data, and a single output handle that produces the crawled content in markdown format. You can connect any number of text variables to the input handles and use them in your URLs template using the {{variable}} syntax.
Variable Handling
Variables must be explicitly referenced in the URLs template to be used. Simply connecting a variable to an input handle is not enough; you must use the {{variable}} syntax in the template to include its value. Any connected variable that is not referenced in the URLs template is ignored.
If a referenced variable contains empty data, that variable will be replaced with an empty string in the URLs.
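The substitution rules above can be sketched with a small helper (a hypothetical stand-in for the component's own templating, assuming simple {{name}} placeholders with no nesting or filters):

```python
import re

def render(template: str, variables: dict) -> str:
    """Replace each {{name}} placeholder with its value.

    Variables that are unconnected or empty become an empty string,
    and connected variables not referenced in the template are ignored.
    """
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda m: str(variables.get(m.group(1), "")),
        template,
    )
```

For example, `render("https://{{domain}}/intro", {"domain": "docs.example.com", "unused": "x"})` yields `https://docs.example.com/intro`; the connected-but-unreferenced `unused` variable has no effect, and a missing or empty `domain` would simply leave a gap in the URL.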
Properties
URLs
- Type: text
- Description: A template that must evaluate to a valid JSON array of URLs. Use {{variable}} syntax to reference input variables.
- Default: Empty template
Output Format
The component outputs the content of all crawled pages in markdown format. The crawled HTML is processed to:
- Convert HTML to markdown
- Preserve text formatting
- Include headers and lists
- Maintain links
- Remove unnecessary styling
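The exact conversion pipeline is internal to the component, but the general idea can be illustrated with a stdlib-only sketch that handles headings, links, and list items (an illustrative converter, not the component's actual implementation):

```python
from html.parser import HTMLParser

class TinyMarkdown(HTMLParser):
    """Illustrative HTML-to-markdown converter (headings, links, lists)."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            # <h2> becomes "## ", etc.
            self.parts.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag == "a":
            self.href = dict(attrs).get("href", "")
            self.parts.append("[")
        elif tag == "p":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "a":
            self.parts.append(f"]({self.href})")
            self.href = None

    def handle_data(self, data):
        self.parts.append(data)

def to_markdown(html: str) -> str:
    parser = TinyMarkdown()
    parser.feed(html)
    return "".join(parser.parts).strip()
```

A real converter also handles emphasis, tables, nested lists, and styling removal, but the principle is the same: structural tags map to markdown syntax while presentational markup is dropped.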
Examples
For input variables:
- domain = "docs.example.com"
- product = "widget"
URLs Template:
[
"https://{{domain}}/{{product}}/overview",
"https://{{domain}}/{{product}}/features"
]
This will crawl both URLs and return their content in markdown format.
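Putting the pieces together, the template above can be rendered and checked against the "valid JSON array of strings" requirement. The helper below is a hypothetical reimplementation of that validation step, not the component's own code:

```python
import json
import re

TEMPLATE = """[
  "https://{{domain}}/{{product}}/overview",
  "https://{{domain}}/{{product}}/features"
]"""

def render_urls(template: str, variables: dict) -> list:
    """Substitute {{name}} placeholders, then parse as a JSON array of URL strings."""
    rendered = re.sub(
        r"\{\{(\w+)\}\}",
        lambda m: str(variables.get(m.group(1), "")),
        template,
    )
    urls = json.loads(rendered)
    if not isinstance(urls, list) or not all(isinstance(u, str) for u in urls):
        raise ValueError("URLs template must evaluate to a JSON array of strings")
    return urls
```

With `domain = "docs.example.com"` and `product = "widget"`, this produces the two fully resolved URLs shown in the example; a template that does not parse as a JSON array of strings fails before any crawling starts.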
Important Notes
- The URLs template must evaluate to a valid JSON array of strings
- All URLs must be valid and accessible
- Some websites may block or rate-limit crawling
- The component respects robots.txt rules
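The robots.txt behavior can be reproduced in client code with Python's standard library, which is useful for predicting whether a URL will be crawlable (a sketch of the rule, not the component's internal implementation):

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if the given robots.txt rules permit crawling the URL."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)
```

In practice you would fetch the live rules with `RobotFileParser.set_url(...)` followed by `read()`; parsing a string here keeps the example self-contained.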