The Text Extraction Operator enables users to extract text from URLs, local file paths and base64 encoded file contents. It supports formats like - doc, docx, odt, pages, pdf, pptx, rtf, xml, html, txt.
#Release Stage
Alpha
#Operator Configuration
The operator configuration is used for setting up the input data and parameters of this component. The configuration is configured in pipeline recipe, please refer to pipeline for more details.
Field | Type | Note |
---|---|---|
task* | string | TASK_EXTRACT_FROM_PATH or TASK_EXTRACT_FROM_FILE |
path* | string | Path of the content. Could be a local file path or web URL. Required when task is TASK_EXTRACT_FROM_PATH |
file_contents* | string | Base64 encoded file. Required when task is TASK_EXTRACT_FROM_FILE |
content_type* | string | Mime type of the file. Required when task is TASK_EXTRACT_FROM_FILE. eg. application/pdf, text/html etc |
#No-code Setup
#Low-code Setup
This is a sample configuration
in the pipeline recipe.
{ "configuration": { "task": "TASK_EXTRACT_FROM_PATH", "path": "{ start.path }" }}
When you send the request, you can use this request format
curl --location 'http://localhost:8080/vdp/v1alpha/users/<user-id>/pipelines/<pipeline-id>/trigger' \--header 'Content-Type: application/json' \--header 'Authorization: Bearer <api_token>' \--data '{ "inputs": [ { "task": "TASK_EXTRACT_FROM_PATH", "path": "https://instill.tech" } ]}'
For other operations, please refer to the VDP Protobufs.