Web

The Web component is an operator component that allows users to scrape websites. It can carry out the following tasks: Crawl Website, Scrape Sitemap, and Scrape Webpage.

#Release Stage

Alpha

#Configuration

The component configuration is defined and maintained here.

#Supported Tasks

#Crawl Website

Crawl a website's contents and manipulate the HTML with jQuery commands. The jQuery commands are executed in this order: only-main-content, then remove-tags, then only-include-tags.

| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_CRAWL_WEBSITE` |
| Query (required) | `target-url` | string | The root URL to scrape. All links on this page will be scraped, and all links on those pages, and so on. |
| Allowed Domains | `allowed-domains` | array[string] | A list of domains that are allowed to be scraped. If empty, all domains are allowed. |
| Max Number of Pages (required) | `max-k` | integer | The max number of pages to return. If the number is set to 0, all pages will be returned. If the number is set to a positive integer, at most max k pages will be returned. |
| Include Link Text | `include-link-text` | boolean | Indicate whether to scrape the link and include the text of the link associated with this page in the 'link-text' field |
| Include Link HTML | `include-link-html` | boolean | Indicate whether to scrape the link and include the raw HTML of the link associated with this page in the 'link-html' field |
| Only Main Content | `only-main-content` | boolean | Only return the main content of the page excluding header, nav, footer. |
| Remove Tags | `remove-tags` | array[string] | A list of tags, classes, and ids to remove from the output. If empty, no tags will be removed. Example: 'script, .ad, #footer' |
| Only Include Tags | `only-include-tags` | array[string] | A list of tags, classes, and ids to include in the output. If empty, all tags will be included. Example: 'script, .ad, #footer' |
| Timeout | `timeout` | integer | The time to wait for the page to load in milliseconds. Min 0, Max 60000. |

| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Pages | `pages` | array[object] | The scraped webpages |
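
As a hedged illustration of how these inputs fit together, the sketch below adapts the recipe format used elsewhere on this page to `TASK_CRAWL_WEBSITE`. The component name `crawler`, the variable name `url`, the domain `example.com`, and all chosen values are assumptions for illustration only; the field IDs are the ones documented for this task.

```yaml
version: v1beta
component:
  crawler:                # hypothetical component name
    type: web
    task: TASK_CRAWL_WEBSITE
    input:
      target-url: ${variable.url}
      allowed-domains:
        - example.com     # placeholder domain; an empty list allows all domains
      max-k: 10           # return at most 10 pages; 0 returns all pages
      include-link-text: true
      only-main-content: true
      remove-tags:
        - script
        - .ad
        - "#footer"       # quoted so YAML does not treat '#' as a comment
      timeout: 10000      # milliseconds, within the documented 0 to 60000 range
variable:
  url:
    title: url
    description: The root URL to crawl
    instill-format: string
output:
  pages:
    title: pages
    value: ${crawler.output.pages}
```

Setting a small `max-k` together with `allowed-domains` is one way to keep a crawl bounded, since the crawler otherwise follows links recursively from the root URL.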

#Scrape Sitemap

Scrape the information listed in a sitemap.

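| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_SCRAPE_SITEMAP` |
| Sitemap URL (required) | `url` | string | The URL of the sitemap to scrape |

| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| List | `list` | array | The list of information in a sitemap |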
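
A minimal sketch of a sitemap recipe, modeled on the recipe format used elsewhere on this page. The component name `sitemap` and the variable name `url` are hypothetical; only the `task`, `url`, and `list` IDs come from the documentation for this task.

```yaml
version: v1beta
component:
  sitemap:                # hypothetical component name
    type: web
    task: TASK_SCRAPE_SITEMAP
    input:
      url: ${variable.url}
variable:
  url:
    title: url
    description: The URL of the sitemap to scrape
    instill-format: string
output:
  list:
    title: list
    value: ${sitemap.output.list}
```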

#Scrape Webpage

Scrape a webpage's contents and manipulate the HTML with jQuery commands. The jQuery commands are executed in this order: only-main-content, then remove-tags, then only-include-tags.

| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_SCRAPE_WEBPAGE` |
| URL (required) | `url` | string | The URL of the webpage to scrape |
| Include HTML | `include-html` | boolean | Indicate whether to include the raw HTML of the webpage in the output |
| Only Main Content | `only-main-content` | boolean | Only return the main content of the page excluding header, nav, footer. |
| Remove Tags | `remove-tags` | array[string] | A list of tags, classes, and ids to remove from the output. If empty, no tags will be removed. Example: 'script, .ad, #footer' |
| Only Include Tags | `only-include-tags` | array[string] | A list of tags, classes, and ids to include in the output. If empty, all tags will be included. Example: 'script, .ad, #footer' |
| Timeout | `timeout` | integer | The time to wait for the page to load in milliseconds. Min 0, Max 60000. |

| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Content | `content` | string | The scraped plain content without html tags of the webpage |
| Markdown | `markdown` | string | The scraped markdown of the webpage |
| HTML (optional) | `html` | string | The scraped html of the webpage |
| Metadata (optional) | `metadata` | object | The metadata of the webpage |
| Links on Page (optional) | `links-on-page` | array[string] | The list of links on the webpage |
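
To make the documented execution order more concrete, here is a sketch of a component fragment that sets all three filters at once. The component name `scraper-filtered` and the selectors `article` and `.content` are assumptions chosen for illustration. Going by the stated order, header, nav, and footer would be dropped first, then the `remove-tags` selectors, and `only-include-tags` would then be applied to whatever remains.

```yaml
component:
  scraper-filtered:         # hypothetical component name
    type: web
    task: TASK_SCRAPE_WEBPAGE
    input:
      url: ${variable.url}
      include-html: true    # also return the filtered HTML in the 'html' output
      only-main-content: true   # step 1: exclude header, nav, footer
      remove-tags:              # step 2: remove these tags, classes, and ids
        - script
        - .ad
        - "#footer"         # quoted so YAML does not treat '#' as a comment
      only-include-tags:        # step 3: keep only these selectors
        - article           # hypothetical selector
        - .content          # hypothetical selector
      timeout: 5000         # milliseconds, within the documented 0 to 60000 range
```

Because each filter works on the output of the previous one, a selector listed in both `remove-tags` and `only-include-tags` would presumably already be gone by the time the include step runs.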

#Example Recipes

Recipe for the Web scraper pipeline.


```yaml
version: v1beta
component:
  scraper:
    type: web
    task: TASK_SCRAPE_WEBPAGE
    input:
      include-html: false
      only-main-content: true
      url: ${variable.url}
variable:
  url:
    title: url
    description: The url of the web you want to scrape
    instill-format: string
output:
  markdown:
    title: markdown
    value: ${scraper.output.markdown}
```