The Text component is an operator component that lets users extract and manipulate text from different sources. It supports the following task: Chunk Text.
# Release Stage
Alpha
# Configuration
The component definition and tasks are defined in the `definition.json` and `tasks.json` files, respectively.
# Supported Tasks
## Chunk Text
Chunk text with different strategies.
Input | ID | Type | Description |
---|---|---|---|
Task ID (required) | task | string | TASK_CHUNK_TEXT |
Text (required) | text | string | Text to be chunked. |
Strategy (required) | strategy | object | Chunking strategy. |
Input Objects in Chunk Text
Strategy
Chunking strategy.
Field | Field ID | Type | Note |
---|---|---|---|
Setting | setting | object | Chunk Setting. |
The `setting` object must fulfill one of the following schemas:
Token
Language models have a token limit that must not be exceeded, so it is a good idea to count tokens when splitting text into chunks. Many tokenizers exist, and they produce different counts for the same text. Count tokens with the same tokenizer that the target language model uses.
Field | Field ID | Type | Note |
---|---|---|---|
Allowed Special Tokens | allowed-special | array | A list of special tokens that are allowed within chunks. |
Chunk Method | chunk-method | string | Must be "Token" |
Chunk Overlap | chunk-overlap | integer | Determines the number of tokens that overlap between consecutive chunks. |
Chunk Size | chunk-size | integer | Specifies the maximum size of each chunk in terms of the number of tokens. |
Disallowed Special Tokens | disallowed-special | array | A list of special tokens that should not appear within chunks. |
Model | model-name | string | The name of the model used for tokenization. Must be one of the supported enum values. |
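The interaction between `chunk-size` and `chunk-overlap` can be sketched as a sliding window over tokens. This is a minimal illustration, not the component's implementation: `whitespace_tokenize` and `chunk_by_tokens` are hypothetical names, and whitespace splitting stands in for the real tokenizer selected by `model-name`.

```python
from typing import Callable, List


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): splits on whitespace.

    A real pipeline would use the tokenizer that matches `model-name`.
    """
    return text.split()


def chunk_by_tokens(
    text: str,
    chunk_size: int,
    chunk_overlap: int,
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
) -> List[List[str]]:
    """Split text into windows of at most `chunk_size` tokens.

    Consecutive windows share `chunk_overlap` tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    tokens = tokenize(text)
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


chunks = chunk_by_tokens("a b c d e f g", chunk_size=4, chunk_overlap=1)
print(chunks)
```

With a chunk size of 4 and an overlap of 1, each window advances by 3 tokens, so the last token of one chunk is repeated as the first token of the next.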
Recursive
This text splitter is the recommended one for generic text. It is parameterized by a list of separators and tries to split on them in order until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`. This keeps paragraphs (and then sentences, and then words) together for as long as possible, since those tend to be the most strongly semantically related pieces of text.
Field | Field ID | Type | Note |
---|---|---|---|
Chunk Method | chunk-method | string | Must be "Recursive" |
Chunk Overlap | chunk-overlap | integer | Determines the number of tokens that overlap between consecutive chunks. |
Chunk Size | chunk-size | integer | Specifies the maximum size of each chunk in terms of the number of tokens. |
Keep Separator | keep-separator | boolean | A flag indicating whether to keep the separator characters at the beginning or end of chunks. |
Model | model-name | string | The name of the model used for tokenization. Must be one of the supported enum values. |
Separators | separators | array | A list of strings representing the separators used to split the text. |
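The recursive strategy described above can be sketched in a few lines. This is an illustrative approximation under stated assumptions, not the component's code: `recursive_split` is a hypothetical helper, chunk size is measured in characters rather than tokens, chunk overlap is omitted, and the separator list `["\n\n", "\n", " ", ""]` is used as the default.

```python
from typing import List, Sequence

DEFAULT_SEPARATORS = ("\n\n", "\n", " ", "")


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = DEFAULT_SEPARATORS,
) -> List[str]:
    """Split on the coarsest separator first, recursing into oversized pieces.

    Assumption: size is counted in characters, not tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(text) <= chunk_size:
        return [text] if text else []

    # Use the first separator that occurs in the text ("" always matches).
    for index, sep in enumerate(separators):
        if sep == "" or sep in text:
            break
    else:
        sep = ""  # no separator applies

    if sep == "":
        # Nothing left to split on: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    remaining = separators[index + 1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging neighbouring pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
            current = ""
    if current:
        chunks.append(current)
    return chunks


result = recursive_split("alpha beta\n\ngamma delta epsilon", chunk_size=12)
print(result)
```

The first paragraph fits in one chunk, while the second is too long and is split again on spaces, which is how paragraph boundaries are preferred over word boundaries.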
Markdown
This text splitter is specially designed for Markdown format.
Field | Field ID | Type | Note |
---|---|---|---|
Chunk Method | chunk-method | string | Must be "Markdown" |
Chunk Overlap | chunk-overlap | integer | Determines the number of tokens that overlap between consecutive chunks. |
Chunk Size | chunk-size | integer | Specifies the maximum size of each chunk in terms of the number of tokens. |
Code Blocks | code-blocks | boolean | A flag indicating whether code blocks should be treated as a single unit. |
Model | model-name | string | The name of the model used for tokenization. Must be one of the supported enum values. |
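To show how the input fields nest, the sketch below assembles a `TASK_CHUNK_TEXT` input using only the field IDs from the tables above. `build_chunk_text_input` is a hypothetical helper, the numeric values are illustrative, and the `model-name` value is a placeholder because the supported values are not listed here. Which setting fields are optional is not stated in this document, so the example simply fills in all fields of the Recursive schema.

```python
import json
from typing import Any, Dict

ALLOWED_METHODS = {"Token", "Recursive", "Markdown"}


def build_chunk_text_input(text: str, setting: Dict[str, Any]) -> Dict[str, Any]:
    """Assemble a TASK_CHUNK_TEXT input from the documented field IDs."""
    method = setting.get("chunk-method")
    if method not in ALLOWED_METHODS:
        raise ValueError(f"chunk-method must be one of {sorted(ALLOWED_METHODS)}")
    return {
        "task": "TASK_CHUNK_TEXT",
        "text": text,
        "strategy": {"setting": setting},
    }


recursive_setting = {
    "chunk-method": "Recursive",
    "chunk-size": 512,             # illustrative value
    "chunk-overlap": 64,           # illustrative value
    "keep-separator": False,       # illustrative value
    "separators": ["\n\n", "\n"],  # illustrative value
    "model-name": "<model-name>",  # placeholder, not a real enum value
}

payload = build_chunk_text_input(
    "First paragraph.\n\nSecond paragraph.", recursive_setting
)
print(json.dumps(payload, indent=2))
```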
Output | ID | Type | Description |
---|---|---|---|
Token Count | token-count | integer | Total count of tokens in the original input text. |
Text Chunks | text-chunks | array[object] | Text chunks after splitting. |
Number of Text Chunks | chunk-num | integer | Total number of output text chunks. |
Token Count Chunks | chunks-token-count | integer | Total count of tokens in the output text chunks. |
Output Objects in Chunk Text
Text Chunks
Field | Field ID | Type | Note |
---|---|---|---|
End Position | end-position | integer | The ending position of the chunk in the original text. |
Start Position | start-position | integer | The starting position of the chunk in the original text. |
Text | text | string | Text chunk after splitting. |
Token Count | token-count | integer | Count of tokens in a chunk. |
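The relationship between the output fields can be illustrated by rebuilding them from a list of chunks. This is a sketch under explicit assumptions, not the component's behavior: `build_output` and `count_tokens` are hypothetical, positions are taken to be zero-based character offsets with an exclusive end, and whitespace-separated words stand in for model tokens.

```python
from typing import Any, Dict, List


def count_tokens(text: str) -> int:
    """Stand-in token counter (assumption): whitespace-separated words."""
    return len(text.split())


def build_output(original: str, chunks: List[str]) -> Dict[str, Any]:
    """Rebuild the documented output fields from the original text and chunks."""
    text_chunks = []
    search_from = 0
    for chunk in chunks:
        start = original.find(chunk, search_from)
        if start == -1:
            raise ValueError("chunk not found in original text")
        text_chunks.append({
            "text": chunk,
            "start-position": start,
            "end-position": start + len(chunk),  # assumption: exclusive end
            "token-count": count_tokens(chunk),
        })
        # Overlapping chunks may begin before the previous chunk ends,
        # so the search only advances past the previous start.
        search_from = start + 1
    return {
        "token-count": count_tokens(original),
        "text-chunks": text_chunks,
        "chunk-num": len(text_chunks),
        "chunks-token-count": sum(c["token-count"] for c in text_chunks),
    }


original = "one two three four five"
output = build_output(original, ["one two three", "three four five"])
print(output)
```

Because the two chunks overlap on one word, `chunks-token-count` (6) is larger than `token-count` (5) in this example. The same effect is expected whenever `chunk-overlap` is greater than zero.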