Text

The Text component is an operator component that lets users extract and manipulate text from different sources. It supports the following tasks: Convert To Text and Chunk Text.

#Release Stage

Alpha

#Configuration

The component configuration is defined and maintained here.

#Supported Tasks

#Convert To Text

Convert a document to plain text.

| Input | ID | Type | Description |
| --- | --- | --- | --- |
| Task ID (required) | task | string | TASK_CONVERT_TO_TEXT |
| Document (required) | doc | string | Base64-encoded document (PDF, DOC, DOCX, XML, HTML, RTF, etc.) to be converted to plain text |

| Output | ID | Type | Description |
| --- | --- | --- | --- |
| Body | body | string | Plain text converted from the document |
| Meta | meta | object | Metadata extracted from the document |
| MSecs | msecs | number | Time taken to convert the document |
| Error | error | string | Error message, if any, from the conversion process |
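
As an illustration of the Base64 requirement on the document input, here is a minimal sketch of preparing a Convert To Text input. The field names `task` and `doc` come from the ID column above; treating them as top-level keys of a single JSON object is an assumption, and `build_convert_input` is a hypothetical helper, not part of the component.

```python
import base64
import json


def build_convert_input(document_bytes: bytes) -> dict:
    """Hypothetical helper: assemble an input object for TASK_CONVERT_TO_TEXT."""
    return {
        "task": "TASK_CONVERT_TO_TEXT",
        # The document must be passed as a Base64 string, not raw bytes.
        "doc": base64.b64encode(document_bytes).decode("ascii"),
    }


# A tiny in-memory HTML document stands in for a real file here.
payload = build_convert_input(b"<html><body>Hello</body></html>")
print(json.dumps(payload, indent=2))
```

For a file on disk, the same helper could be fed the result of reading the file in binary mode. How the object is submitted to the component is not covered here.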

#Chunk Text

Split text into chunks using one of several strategies.

| Input | ID | Type | Description |
| --- | --- | --- | --- |
| Task ID (required) | task | string | TASK_CHUNK_TEXT |
| Text (required) | text | string | Text to be chunked |
| Strategy (required) | strategy | object | Chunking strategy |

| Output | ID | Type | Description |
| --- | --- | --- | --- |
| Token Count (optional) | token-count | integer | Total count of tokens in the input text |
| Text Chunks | text-chunks | array[object] | Text chunks after splitting |
| Number of Text Chunks | chunk-num | integer | Total number of output text chunks |

#Chunking Strategy

There are three strategies available for chunking text in the Text component:

    1. Token
    2. Recursive
    3. Markdown

#Token

Language models have a token limit that the input must not exceed. When splitting text into chunks, it is therefore a good idea to count tokens rather than characters. Tokenizers differ, so count tokens with the same tokenizer that the target language model uses.

| Parameter | Type | Description |
| --- | --- | --- |
| chunk-size | integer | Specifies the maximum size of each chunk in terms of the number of tokens |
| chunk-overlap | integer | Determines the number of tokens that overlap between consecutive chunks |
| model-name | string | The name of the model used for tokenization |
| allowed-special | array of strings | A list of special tokens that are allowed within chunks |
| disallowed-special | array of strings | A list of special tokens that should not appear within chunks |
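
To make the interaction between chunk-size and chunk-overlap concrete, the following is a minimal sketch of sliding-window chunking over a token list. It is not the component's implementation: whitespace splitting stands in for a real model tokenizer, and `chunk_by_tokens` is a hypothetical name.

```python
def chunk_by_tokens(tokens: list, chunk_size: int, chunk_overlap: int) -> list:
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


# Assumption: whitespace-separated words act as "tokens" for illustration only.
tokens = "the quick brown fox jumps over the lazy dog".split()
chunks = chunk_by_tokens(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

With a chunk size of 4 and an overlap of 1, each window starts 3 tokens after the previous one, so the last token of one chunk reappears as the first token of the next.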

#Recursive

This is the recommended text splitter for generic text. It is parameterized by a list of separators and tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of keeping paragraphs (and then sentences, and then words) together for as long as possible, since those tend to be the most strongly semantically related pieces of text.

| Parameter | Type | Description |
| --- | --- | --- |
| chunk-size | integer | Specifies the maximum size of each chunk in terms of the number of tokens |
| chunk-overlap | integer | Determines the number of tokens that overlap between consecutive chunks |
| model-name | string | The name of the model used for tokenization |
| separators | array of strings | A list of strings representing the separators used to split the text |
| keep-separator | boolean | A flag indicating whether to keep the separator characters at the beginning or end of chunks |
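
The idea of trying separators in order can be sketched as follows. This is a simplified illustration, not the component's code: it measures chunk size in characters rather than tokens, ignores chunk-overlap and keep-separator, and skips empty pieces. `recursive_split` is a hypothetical name.

```python
def recursive_split(text: str, chunk_size: int,
                    separators=("\n\n", "\n", " ", "")) -> list:
    """Split text on the first separator; recurse with the remaining
    separators on any piece that is still longer than chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(text) <= chunk_size:
        return [text] if text else []

    # Last resort (empty separator or none left): cut at fixed widths.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue  # consecutive separators produce empty pieces
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: fall back to finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb\n\nccc ddd eee"
print(recursive_split(sample, chunk_size=7))
```

In the sample, the first paragraph fits in one chunk as is, while the second is too long and is split again on spaces, which is why words are only separated when paragraphs and lines cannot be kept whole.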

#Markdown

This text splitter is designed specifically for Markdown-formatted text.

| Parameter | Type | Description |
| --- | --- | --- |
| chunk-size | integer | Specifies the maximum size of each chunk in terms of the number of tokens |
| chunk-overlap | integer | Determines the number of tokens that overlap between consecutive chunks |
| model-name | string | The name of the model used for tokenization |
| code-blocks | boolean | A flag indicating whether code blocks should be treated as a single unit |
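
One way to picture what treating code blocks as a single unit means is the sketch below, which splits Markdown at headings but never inside a fenced code block. It is an assumption about the general technique, not the component's actual behavior: it ignores chunk-size and chunk-overlap, and `split_markdown_sections` is a hypothetical name.

```python
# Built programmatically so this example can itself live inside a fenced block.
FENCE = "`" * 3


def split_markdown_sections(markdown: str) -> list:
    """Start a new section at each heading line, except for lines that
    look like headings but sit inside a fenced code block."""
    sections, current, in_code = [], [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_code = not in_code  # entering or leaving a code block
        elif line.startswith("#") and not in_code and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections


doc = "\n".join([
    "# Title",
    "Intro paragraph.",
    FENCE + "python",
    "# a comment, not a heading",
    FENCE,
    "## Usage",
    "Run it.",
])
sections = split_markdown_sections(doc)
print(len(sections))
```

The comment line inside the fenced block starts with a hash character but does not trigger a split, so the code block stays in the same section as the text that introduces it.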

#Text Chunks in Output

| Parameter | Type | Description |
| --- | --- | --- |
| text | string | The text chunk |
| start-position | integer | The starting position of the text chunk in the original text |
| end-position | integer | The ending position of the text chunk in the original text |
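
The position fields make it possible to map a chunk back to the original text. The sketch below assumes that positions are 0-based character offsets and that end-position is exclusive; the source does not state either convention, so verify them against real output before relying on this. The records shown are hand-written for illustration, and `slice_chunk` is a hypothetical helper.

```python
def slice_chunk(original: str, chunk: dict) -> str:
    """Recover a chunk's text from its position fields.

    Assumption: 0-based character offsets, end-position exclusive.
    """
    return original[chunk["start-position"]:chunk["end-position"]]


original = "alpha beta gamma"
# Hand-written records standing in for entries of text-chunks.
records = [
    {"start-position": 0, "end-position": 10},
    {"start-position": 6, "end-position": 16},
]
recovered = [slice_chunk(original, record) for record in records]
print(recovered)
```

If the recovered slices are consistently one character short compared with the returned chunk text, the end position is likely inclusive and the slice should extend by one.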

Last updated: 7/16/2024, 8:11:38 AM