The Web component is an operator component that allows users to scrape websites. It can carry out the following tasks:
#Release Stage
Alpha
#Configuration
The component configuration is defined and maintained here.
#Supported Tasks
#Crawl Website
Crawl the website contents and manipulate html with jquery command. The sequence of jquery commands will be executed in the order of only-main-content, remove-tags, and only-include-tags.
Input | ID | Type | Description |
---|---|---|---|
Task ID (required) | task | string | TASK_CRAWL_WEBSITE |
Query (required) | target-url | string | The root URL to scrape. All links on this page will be scraped, and all links on those pages, and so on. |
Allowed Domains | allowed-domains | array[string] | A list of domains that are allowed to be scraped. If empty, all domains are allowed. |
Max Number of Pages (required) | max-k | integer | The max number of pages to return. If the number is set to 0, all pages will be returned. If the number is set to a positive integer, at most max k pages will be returned. |
Include Link Text | include-link-text | boolean | Indicate whether to scrape the link and include the text of the link associated with this page in the 'link-text' field |
Include Link HTML | include-link-html | boolean | Indicate whether to scrape the link and include the raw HTML of the link associated with this page in the 'link-html' field |
Only Main Content | only-main-content | boolean | Only return the main content of the page excluding header, nav, footer. |
Remove Tags | remove-tags | array[string] | A list of tags, classes, and ids to remove from the output. If empty, no tags will be removed. Example: 'script, .ad, #footer' |
Only Include Tags | only-include-tags | array[string] | A list of tags, classes, and ids to include in the output. If empty, all tags will be included. Example: 'script, .ad, #footer' |
Timeout | timeout | integer | The time to wait for the page to load in milliseconds. Min 0, Max 60000. |
Output | ID | Type | Description |
---|---|---|---|
Pages | pages | array[object] | The scraped webpages |
#Scrape Sitemap
Scrape the sitemap information
Input | ID | Type | Description |
---|---|---|---|
Task ID (required) | task | string | TASK_SCRAPE_SITEMAP |
Sitemap URL (required) | url | string | The URL of the sitemap to scrape |
Output | ID | Type | Description |
---|---|---|---|
List | list | array | The list of information in a sitemap |
#Scrape Webpage
Scrape the webpage contents and manipulate html with jquery command. The sequence of jquery commands will be executed in the order of only-main-content, remove-tags, and only-include-tags.
Input | ID | Type | Description |
---|---|---|---|
Task ID (required) | task | string | TASK_SCRAPE_WEBPAGE |
URL (required) | url | string | The URL to be scrape the webpage contents |
Include HTML | include-html | boolean | Indicate whether to include the raw HTML of the webpage in the output |
Only Main Content | only-main-content | boolean | Only return the main content of the page excluding header, nav, footer. |
Remove Tags | remove-tags | array[string] | A list of tags, classes, and ids to remove from the output. If empty, no tags will be removed. Example: 'script, .ad, #footer' |
Only Include Tags | only-include-tags | array[string] | A list of tags, classes, and ids to include in the output. If empty, all tags will be included. Example: 'script, .ad, #footer' |
Timeout | timeout | integer | The time to wait for the page to load in milliseconds. Min 0, Max 60000. |
Output | ID | Type | Description |
---|---|---|---|
Content | content | string | The scraped plain content without html tags of the webpage |
Markdown | markdown | string | The scraped markdown of the webpage |
HTML (optional) | html | string | The scraped html of the webpage |
Metadata (optional) | metadata | object | The metadata of the webpage |
Links on Page (optional) | links-on-page | array[string] | The list of links on the webpage |
#Example Recipes
Recipe for the Web scraper pipeline.
version: v1betacomponent: scraper: type: web task: TASK_SCRAPE_WEBPAGE input: include-html: false only-main-content: true url: ${variable.url}variable: url: title: url description: The url of the web you want to scrape instill-format: stringoutput: markdown: title: markdown value: ${scraper.output.markdown}