# Web

The Web component is an operator component that allows users to scrape websites. It can carry out the following tasks:

  • Crawl Site
  • Scrape Pages
  • Scrape Sitemap

## Release Stage

Alpha

## Configuration

The component definition and tasks are defined in the `definition.json` and `tasks.json` files, respectively.

## Supported Tasks

### Crawl Site

This task systematically navigates a website, starting from a designated page (typically the homepage), and follows internal links to discover and retrieve page titles and URLs. The process is limited to 120 seconds and collects only the links and titles of the pages it visits; it does not extract the content of the pages themselves. If you need to collect specific content from individual pages, use the Scrape Pages task instead.

| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_CRAWL_SITE` |
| URL (required) | `url` | string | The root URL to crawl. All links on this page will be followed, then all links on those pages, and so on. |
| Allowed Domains | `allowed-domains` | array[string] | A list of domains that are allowed to be crawled. If empty, all domains are allowed. |
| Max Number of Pages (required) | `max-k` | integer | Sets a limit on the number of pages to fetch. If set to 0, all available pages will be fetched within the 120-second time limit. If set to a positive number, the fetch will return up to that many pages, but no more. |
| Timeout | `timeout` | integer | The time to wait for a page to load, in milliseconds. Min 0, max 60000. Note that the timeout applies to each page rather than to the whole crawl task. |
| Max Depth | `max-depth` | integer | Specifies how deep the crawler will navigate from the root URL. If set to 1, the crawler will only scrape the root URL and will not follow any links to other pages. If set to 0, the crawler will scrape all reachable pages until the total number of scraped pages reaches `max-k`. If both `max-k` and `max-depth` are defined, the crawler prioritizes `max-k` when determining how many pages to scrape. |
| Filter | `filter` | object | Regular-expression-based URL filtering. See the Filter object below for the fields and their evaluation order. |
Input Objects in Crawl Site

Filter

Filtering based on regular expressions. When both `include-pattern` and `exclude-pattern` are empty, all URLs are crawled. `exclude-pattern` is evaluated first: when it is not empty, any URL that matches it is skipped. `include-pattern` is evaluated second: when it is not empty, only URLs that match it are crawled.

| Field | Field ID | Type | Note |
| :--- | :--- | :--- | :--- |
| Exclude Pattern | `exclude-pattern` | string | When the URL matches this pattern, the URL will not be crawled. |
| Include Pattern | `include-pattern` | string | When the URL matches this pattern, the URL will be crawled. |
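
For example, here is a minimal sketch of a filter that skips document downloads while limiting the crawl to blog, news, and article paths. The patterns are illustrative, reusing entries from the Regex Match Sample below; they are written here as plain YAML scalars, so the backslashes are not doubled as they are in the samples.

```yaml
filter:
  # Evaluated first: any URL matching this pattern is skipped.
  exclude-pattern: .*\.(pdf|doc|docx)$
  # Evaluated second: of the remaining URLs, only those matching this pattern are crawled.
  include-pattern: .*(blog|news|article).*
```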

| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Pages | `pages` | array[object] | The links and titles of the webpages crawled by the crawler. |
Output Objects in Crawl Site

Pages

| Field | Field ID | Type | Note |
| :--- | :--- | :--- | :--- |
| Link | `link` | string | The full URL to which the webpage link points, e.g., `http://www.example.com/foo/bar`. |
| Title | `title` | string | The title of the webpage link, in plain text. |
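
Given the fields above, a crawl of `http://www.example.com` might produce output shaped like the following sketch (the page titles are hypothetical):

```yaml
pages:
  - link: http://www.example.com/foo/bar
    title: Foo Bar
  - link: http://www.example.com/about
    title: About Us
```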

#### Regex Match Sample

  • Match example1.com, example2.com, but not examplea.com


    example[\\d]\\.com

  • Match exampleA.com, exampleB.com, exampleC.com, but not examplea.com


    example[ABC]\\.com

  • Match all subdomains of example.com (blog.example.com, dev.example.com, etc.)


    .*\\.example\\.com

  • Match URLs with specific file extensions (pdf, doc, docx)


    .*\\.(pdf|doc|docx)$

  • Match URLs containing specific keywords in the path


    .*(blog|news|article).*

  • Match URLs with specific port numbers (8080 or 8443)


    .*:(8080|8443)($|/.*)

  • Match secure URLs (https only)


    ^https://.*

  • Match specific country top-level domains


    .*\\.(uk|fr|de)$

  • Match URLs without query parameters


    ^[^?]*$

  • Match URLs with specific query parameters


    .*[?&]id=[0-9]+.*

### Scrape Pages

This task extracts specific data from targeted webpages by parsing their HTML structure. Unlike crawling, which navigates across multiple pages, scraping retrieves content only from the specified pages. After scraping, the data can be filtered further using jQuery selectors, which are applied in a fixed order: only-main-content first, then remove-tags, then only-include-tags. Refer to the jQuery Syntax Examples for more details on how to filter and manipulate the data. To prevent a single URL failure from affecting all requests, no error is returned when an individual URL fails; instead, all content that is successfully scraped is returned.

| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_SCRAPE_PAGES` |
| URLs (required) | `urls` | array[string] | The URLs whose webpage contents will be scraped. |
| Scrape Method (required) | `scrape-method` | string | The method used for web scraping. Options: `http` for standard HTTP-based scraping, and `chrome-simulator` for scraping through a simulated Chrome browser environment. |
| Include HTML | `include-html` | boolean | Whether to include the raw HTML of the webpage in the output. Set this to true to include the raw HTML. |
| Only Main Content | `only-main-content` | boolean | Only return the main content of the page, excluding the content of the `header`, `nav`, and `footer` tags. |
| Remove Tags | `remove-tags` | array[string] | A list of tags, classes, and IDs to remove from the output, expressed as jQuery selectors. If empty, no tags will be removed. Example: `'script, .ad, #footer'`. See the jQuery Syntax Examples. |
| Only Include Tags | `only-include-tags` | array[string] | A list of tags, classes, and IDs to include in the output, expressed as jQuery selectors. If empty, all tags will be included. Example: `'script, .ad, #footer'`. See the jQuery Syntax Examples. |
| Timeout | `timeout` | integer | The time to wait for a page to load, in milliseconds. Min 0, max 60000. A short timeout may prevent the page from fully loading, while a long timeout can significantly increase the time the task takes to complete. This setting applies only to the Chrome simulator. |

| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Pages | `pages` | array[object] | A list of page objects that have been scraped. |
Output Objects in Scrape Pages

Pages

| Field | Field ID | Type | Note |
| :--- | :--- | :--- | :--- |
| Content | `content` | string | The scraped plain-text content of the webpage, without HTML tags. |
| HTML | `html` | string | The scraped HTML of the webpage. |
| Links on Page | `links-on-page` | array | The list of links on the webpage. |
| Markdown | `markdown` | string | The scraped Markdown of the webpage. |
| Metadata | `metadata` | object | The metadata of the webpage. |

Metadata

| Field | Field ID | Type | Note |
| :--- | :--- | :--- | :--- |
| Description | `description` | string | The description of the webpage. |
| Source URL | `source-url` | string | The source URL of the webpage. |
| Title | `title` | string | The title of the webpage. |

jQuery Syntax Examples

  • Element Selector: Targets all instances of a specific HTML tag. Example: `div` extracts all `<div>` elements.

  • ID Selector: Targets a single element by its unique id attribute. Example: `#header` extracts the element with the id of header.

  • Class Selector: Targets all elements with a specific class name. Example: `.button` extracts all elements with the class button.

  • Attribute Selector: Targets elements based on the presence or value of an attribute. Example: `[type="text"]` extracts all elements with a type attribute equal to text.

  • Descendant Selector: Targets elements that are nested within other elements. Example: `div p` extracts all `<p>` elements that are inside `<div>` elements.
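
Putting these selectors to work, here is a minimal sketch of a Scrape Pages component, modeled on the Example Recipes at the end of this page. The selector values and the `urls` pipeline variable are illustrative.

```yaml
scraper:
  type: web
  input:
    urls: ${variable.urls}    # assumes an array[string] pipeline variable
    scrape-method: http
    include-html: false
    only-main-content: true   # applied first: drops header, nav, footer content
    remove-tags:              # applied second: selectors to strip from the output
      - script
      - .ad
      - "#footer"             # quoted so YAML does not treat # as a comment
    only-include-tags:        # applied last: keep only matching elements
      - article
    timeout: 0
  task: TASK_SCRAPE_PAGES
```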

#### About Dynamic Content

TASK_SCRAPE_PAGES supports fetching dynamic content from web pages by simulating user behaviours, such as scrolling down. The initial implementation includes the following capabilities:

Scrolling:

  • Mimics user scrolling down the page to load additional content dynamically.

Future enhancements will include additional user interactions, such as:

  • Clicking: Simulate mouse clicks on specified elements.
  • Taking Screenshots: Capture screenshots of the current view.
  • Keyboard Actions: Simulate key presses and other keyboard interactions.

TASK_SCRAPE_PAGES aims to provide a robust framework for interacting with web pages and extracting dynamic content effectively.
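
For pages that load content while scrolling, a sketch of the same task using the Chrome simulator might look like this (the URL and timeout values are illustrative):

```yaml
scraper:
  type: web
  input:
    urls:
      - https://www.example.com/feed  # hypothetical page with dynamically loaded content
    scrape-method: chrome-simulator   # simulated browser; supports the scrolling behaviour above
    include-html: false
    only-main-content: true
    timeout: 10000                    # per-page wait in ms; honoured only by chrome-simulator
  task: TASK_SCRAPE_PAGES
```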

### Scrape Sitemap

This task extracts data directly from a website’s sitemap. A sitemap is typically an XML file that lists all URLs and other relevant metadata, providing a structured overview of the site’s pages. This method efficiently gathers key information from the sitemap without navigating through the site’s internal pages.

| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_SCRAPE_SITEMAP` |
| URL (required) | `url` | string | The URL of the sitemap to scrape. |

| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| List | `list` | array | The list of information in the sitemap. |
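
A minimal sketch of a Scrape Sitemap component, assuming the sitemap sits at the conventional `/sitemap.xml` path:

```yaml
sitemap-scraper:
  type: web
  input:
    url: https://www.example.com/sitemap.xml  # hypothetical sitemap URL
  task: TASK_SCRAPE_SITEMAP
```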

## Example Recipes
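
The recipe below crawls a site from a user-supplied URL, extracts the `link` field of every crawled page with a jq filter, and then scrapes the content of each of those links: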


```yaml
version: v1beta
variable:
  url:
    title: URL
    format: string
component:
  crawler:
    type: web
    input:
      url: ${variable.url}
      allowed-domains:
      max-k: 30
      timeout: 1000
      max-depth: 0
    condition:
    task: TASK_CRAWL_SITE
  json-filter:
    type: json
    input:
      json-value: ${crawler.output.pages}
      jq-filter: .[] | ."link"
    condition:
    task: TASK_JQ
  scraper:
    type: web
    input:
      urls: ${json-filter.output.results}
      scrape-method: http
      include-html: false
      only-main-content: true
      remove-tags:
      only-include-tags:
      timeout: 0
    condition:
    task: TASK_SCRAPE_PAGES
output:
  pages:
    title: Pages
    value: ${crawler.output.pages}
  links:
    title: Links
    value: ${json-filter.output.results}
  scraper-pages:
    title: Scraper Pages
    value: ${scraper.output.pages}
```