Text Extraction

The Text Extraction Operator enables users to extract text from URLs, local file paths and base64 encoded file contents. It supports formats like - doc, docx, odt, pages, pdf, pptx, rtf, xml, html, txt.

#Release Stage

Alpha

#Operator Configuration

The operator configuration is used for setting up the input data and parameters of this component. The configuration is configured in pipeline recipe, please refer to pipeline for more details.

FieldTypeNote
task*stringTASK_EXTRACT_FROM_PATH or TASK_EXTRACT_FROM_FILE
path*stringPath of the content. Could be a local file path or web URL. Required when task is TASK_EXTRACT_FROM_PATH
file_contents*stringBase64 encoded file. Required when task is TASK_EXTRACT_FROM_FILE
content_type*stringMime type of the file. Required when task is TASK_EXTRACT_FROM_FILE. eg. application/pdf, text/html etc

#No-code Setup

#Low-code Setup

This is a sample configuration in the pipeline recipe.


{
"configuration": {
"task": "TASK_EXTRACT_FROM_PATH",
"path": "{ start.path }"
}
}

When you send the request, you can use this request format


curl --location 'http://localhost:8080/vdp/v1alpha/users/<user-id>/pipelines/<pipeline-id>/trigger' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <api_token>' \
--data '{
"inputs": [
{
"task": "TASK_EXTRACT_FROM_PATH",
"path": "https://instill.tech"
}
]
}'

For other operations, please refer to the VDP Protobufs.

Last updated: 1/2/2024, 4:47:25 PM