The Web component is an operator component that allows users to scrape websites. It can carry out the following tasks:

- Crawl Site
- Scrape Pages
- Scrape Sitemap
## Release Stage

Alpha
## Configuration

The component definition and tasks are defined in the `definition.json` and `tasks.json` files, respectively.
## Supported Tasks

### Crawl Site

This task systematically navigates a website, starting from a designated page (typically the homepage), and follows internal links to discover and retrieve page titles and URLs. The crawl is limited to 120 seconds and collects only the links and titles of the pages it visits; it does not extract the content of the pages themselves. If you need to collect specific content from individual pages, use the Scrape Pages task instead.
Input | ID | Type | Description |
---|---|---|---|
Task ID (required) | task | string | TASK_CRAWL_SITE |
URL (required) | url | string | The root URL to crawl. All links on this page will be crawled, then all links on those pages, and so on. |
Allowed Domains | allowed-domains | array[string] | A list of domains that are allowed to be crawled. If empty, all domains are allowed. |
Max Number of Pages (required) | max-k | integer | Max-K sets a limit on the number of pages to fetch. If Max-K is set to 0, all available pages will be fetched within the 120-second time limit. If Max-K is a positive number, the fetch will return up to that many pages, but no more. |
Timeout | timeout | integer | The time to wait for a page to load, in milliseconds. Min 0, max 60000. Note that this timeout applies to each page rather than to the whole crawl task. |
Max Depth | max-depth | integer | Max Depth specifies how deep the crawler will navigate from the root URL. If max depth is set to 1, the crawler will only scrape the root URL and will not follow any links to other pages. If max depth is set to 0, the crawler will scrape all reachable pages until the total number of scraped pages reaches max-k. If both max-k and max depth are defined, the crawler will prioritize the max-k setting when determining how many pages to scrape. |
Filter | filter | object | Filtering based on regular expressions. When both include-pattern and exclude-pattern are empty, all URLs are crawled. exclude-pattern is evaluated first: when it is non-empty, URLs that match it are skipped. Then, when include-pattern is non-empty, only URLs that match it are crawled. |
Input Objects in Crawl Site

Filter

Filtering based on regular expressions. When both include-pattern and exclude-pattern are empty, all URLs are crawled. exclude-pattern is evaluated first: when it is non-empty, URLs that match it are skipped. Then, when include-pattern is non-empty, only URLs that match it are crawled.

Field | Field ID | Type | Note |
---|---|---|---|
Exclude Pattern | exclude-pattern | string | When the URL matches this pattern, the URL will not be crawled. |
Include Pattern | include-pattern | string | When the URL matches this pattern, the URL will be crawled. |
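Putting these fields together, a crawl-site component in a recipe might look like the following sketch (the `crawler` component name, URL, and all values are illustrative; single-quoted YAML keeps regex backslashes literal):

```yaml
component:
  crawler:
    type: web
    input:
      url: https://example.com
      allowed-domains:
        - example.com                            # restrict the crawl to one domain
      max-k: 50                                  # stop after 50 pages
      max-depth: 0                               # no depth limit; bounded by max-k and the 120s budget
      timeout: 5000                              # per-page load timeout in milliseconds
      filter:
        exclude-pattern: '.*\.(pdf|doc|docx)$'   # evaluated first: skip document links
        include-pattern: '^https://.*'           # then keep only https URLs
    task: TASK_CRAWL_SITE
```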
Output | ID | Type | Description |
---|---|---|---|
Pages | pages | array[object] | The link and title of webpages crawled by the crawler. |
Output Objects in Crawl Site
Pages
Field | Field ID | Type | Note |
---|---|---|---|
Link | link | string | The full URL to which the webpage link is pointing, e.g., http://www.example.com/foo/bar. |
Title | title | string | The title of a webpage link in plain text. |
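Downstream components can reference this result as `${crawler.output.pages}` (as in the example recipe at the end of this page). A hypothetical output value:

```yaml
pages:
  - link: http://www.example.com/foo/bar
    title: Foo Bar
  - link: http://www.example.com/about
    title: About Us
```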
### Regex Match Sample

- Match example1.com, example2.com, but not examplea.com: `example[\\d]\\.com`
- Match exampleA.com, exampleB.com, exampleC.com, but not examplea.com: `example[ABC]\\.com`
- Match all subdomains of example.com (blog.example.com, dev.example.com, etc.): `.*\\.example\\.com`
- Match URLs with specific file extensions (pdf, doc, docx): `.*\\.(pdf|doc|docx)$`
- Match URLs containing specific keywords in the path: `.*(blog|news|article).*`
- Match URLs with specific port numbers (8080 or 8443): `.*:(8080|8443)($|/.*)`
- Match secure URLs (https only): `^https://.*`
- Match specific country top-level domains: `.*\\.(uk|fr|de)$`
- Match URLs without query parameters: `^[^?]*$`
- Match URLs with specific query parameters: `.*[?&]id=[0-9]+.*`
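These patterns drop straight into the Crawl Site filter object. As a sketch, the following hypothetical filter keeps only blog, news, or article URLs and skips PDF links (note that in single-quoted YAML a single backslash suffices, whereas the JSON-escaped samples above double it):

```yaml
filter:
  exclude-pattern: '.*\.pdf$'                 # evaluated first: drop PDF links
  include-pattern: '.*(blog|news|article).*'  # then keep only keyword-bearing URLs
```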
### Scrape Pages

This task extracts specific data from targeted webpages by parsing their HTML structure. Unlike crawling, which navigates across multiple pages, scraping retrieves content only from the specified pages. After scraping, the data can be further processed with jQuery filters, which are applied in a fixed order: `only-main-content`, then `remove-tags`, then `only-include-tags`. Refer to the jQuery Syntax Examples below for more details on how to filter and manipulate the data. To prevent a single URL failure from affecting all requests, the task does not return an error when an individual URL fails; instead, it returns all contents that were successfully scraped.
Input | ID | Type | Description |
---|---|---|---|
Task ID (required) | task | string | TASK_SCRAPE_PAGES |
URLs (required) | urls | array[string] | The URLs whose webpage contents will be scraped. |
Scrape Method (required) | scrape-method | string | Defines the method used for web scraping. Available options include 'http' for standard HTTP-based scraping and 'chrome-simulator' for scraping through a simulated Chrome browser environment. Enum values: http, chrome-simulator. |
Include HTML | include-html | boolean | Indicates whether to include the raw HTML of the webpage in the output. If you want to include the raw HTML, set this to true. |
Only Main Content | only-main-content | boolean | Only return the main content of the page by excluding the content of the header, nav, and footer tags. |
Remove Tags | remove-tags | array[string] | A list of tags, classes, and IDs to remove from the output. You can use jQuery selectors to remove data. If empty, no tags will be removed. Example: 'script, .ad, #footer'. Please check the jQuery Syntax Examples. |
Only Include Tags | only-include-tags | array[string] | A list of tags, classes, and IDs to include in the output. You can use jQuery selectors to include data. If empty, all tags will be included. Example: 'script, .ad, #footer'. Please check the jQuery Syntax Examples. |
Timeout | timeout | integer | The time to wait for a page to load, in milliseconds. The minimum value is 0, and the maximum value is 60,000. Note that a short timeout may prevent the page from fully loading, while a long timeout can significantly increase the time the task takes to complete. This timeout setting applies only to the Chrome simulator. |
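As a sketch, a minimal scrape-pages component using the HTTP method might be configured like this (the `scraper` component name and URLs are illustrative):

```yaml
component:
  scraper:
    type: web
    input:
      urls:
        - https://example.com/blog/post-1
        - https://example.com/blog/post-2
      scrape-method: http        # static pages; no browser simulation needed
      include-html: false
      only-main-content: true    # drop header, nav, and footer content
      remove-tags:
      only-include-tags:
      timeout: 0                 # honoured only by the chrome-simulator method
    task: TASK_SCRAPE_PAGES
```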
Output | ID | Type | Description |
---|---|---|---|
Pages | pages | array[object] | A list of page objects that have been scraped. |
Output Objects in Scrape Pages
Pages
Field | Field ID | Type | Note |
---|---|---|---|
Content | content | string | The scraped plain-text content of the webpage, without HTML tags. |
HTML | html | string | The scraped HTML of the webpage. |
Links on Page | links-on-page | array | The list of links on the webpage. |
Markdown | markdown | string | The scraped markdown of the webpage. |
Metadata | metadata | object | The metadata of the webpage. |
Metadata
Field | Field ID | Type | Note |
---|---|---|---|
Description | description | string | The description of the webpage. |
Source URL | source-url | string | The source URL of the webpage. |
Title | title | string | The title of the webpage. |
jQuery Syntax Examples

- Element Selector: targets all instances of a specific HTML tag. Example: `div` extracts all `<div>` elements.
- ID Selector: targets a single element by its unique `id` attribute. Example: `#header` extracts the element with the `id` of `header`.
- Class Selector: targets all elements with a specific class name. Example: `.button` extracts all elements with the class `button`.
- Attribute Selector: targets elements based on the presence or value of an attribute. Example: `[type="text"]` extracts all elements with a `type` attribute equal to `text`.
- Descendant Selector: targets elements that are nested within other elements. Example: `div p` extracts all `<p>` elements that are inside `<div>` elements.
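These selectors can be combined in `remove-tags` and `only-include-tags`. A hypothetical configuration that strips scripts, ads, and the footer while keeping only paragraphs inside articles:

```yaml
input:
  remove-tags:
    - script        # element selector: drop all <script> tags
    - .ad           # class selector: drop elements with class "ad"
    - '#footer'     # ID selector: drop the element with id "footer" (quoted so YAML does not treat # as a comment)
  only-include-tags:
    - article p     # descendant selector: keep only <p> elements inside <article>
```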
### About Dynamic Content

`TASK_SCRAPE_PAGES` supports fetching dynamic content from web pages by simulating user behaviours, such as scrolling down. The initial implementation includes the following capabilities:

- Scrolling: mimics user scrolling down the page to load additional content dynamically.

Future enhancements will include additional user interactions, such as:

- Clicking: simulate mouse clicks on specified elements.
- Taking Screenshots: capture screenshots of the current view.
- Keyboard Actions: simulate key presses and other keyboard interactions.

`TASK_SCRAPE_PAGES` aims to provide a robust framework for interacting with web pages and extracting dynamic content effectively.
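To scrape such dynamic pages, select the simulated browser and give the page time to render. A minimal sketch of the relevant input fields (the URL is illustrative):

```yaml
input:
  urls:
    - https://example.com/infinite-scroll   # hypothetical page that loads content on scroll
  scrape-method: chrome-simulator           # renders the page and simulates scrolling
  timeout: 10000                            # allow up to 10s for dynamic content to load
```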
### Scrape Sitemap
This task extracts data directly from a website’s sitemap. A sitemap is typically an XML file that lists all URLs and other relevant metadata, providing a structured overview of the site’s pages. This method efficiently gathers key information from the sitemap without navigating through the site’s internal pages.
Input | ID | Type | Description |
---|---|---|---|
Task ID (required) | task | string | TASK_SCRAPE_SITEMAP |
URL (required) | url | string | The URL of the sitemap to scrape. |
Output | ID | Type | Description |
---|---|---|---|
List | list | array | The list of information in a sitemap. |
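A minimal component entry for this task might look like the following (the `sitemap-scraper` component name and URL are illustrative):

```yaml
component:
  sitemap-scraper:
    type: web
    input:
      url: https://example.com/sitemap.xml
    task: TASK_SCRAPE_SITEMAP
```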
## Example Recipes

```yaml
version: v1beta

variable:
  url:
    title: URL
    format: string

component:
  crawler:
    type: web
    input:
      url: ${variable.url}
      allowed-domains:
      max-k: 30
      timeout: 1000
      max-depth: 0
    condition:
    task: TASK_CRAWL_SITE

  json-filter:
    type: json
    input:
      json-value: ${crawler.output.pages}
      jq-filter: .[] | ."link"
    condition:
    task: TASK_JQ

  scraper:
    type: web
    input:
      urls: ${json-filter.output.results}
      scrape-method: http
      include-html: false
      only-main-content: true
      remove-tags:
      only-include-tags:
      timeout: 0
    condition:
    task: TASK_SCRAPE_PAGES

output:
  pages:
    title: Pages
    value: ${crawler.output.pages}
  links:
    title: Links
    value: ${json-filter.output.results}
  scraper-pages:
    title: Scraper Pages
    value: ${scraper.output.pages}
```