Text

The Text component is an operator component that allows users to extract and manipulate text from different sources. It can carry out the following tasks:

Chunk Text

#Release Stage

Alpha

#Configuration

The component definition and tasks are defined in the definition.json and tasks.json files respectively.

#Supported Tasks

#Chunk Text

Chunk text with different strategies

Input	ID	Type	Description
Task ID (required)	`task`	string	`TASK_CHUNK_TEXT`
Text (required)	`text`	string	Text to be chunked
Strategy (required)	`strategy`	object	Chunking strategy

Input Objects in Chunk Text

Strategy

Chunking strategy

Field	Field ID	Type	Note
Setting	`setting`	object	Chunk Setting

The setting Object

Setting

The `setting` object must match one of the following schemas, selected by its `chunk-method` value:

`Token`

Language models have a token limit that their input must not exceed, so it is a good idea to count tokens when splitting text into chunks. Many tokenizers exist, and they count differently: count tokens with the same tokenizer that the target language model uses.

Field	Field ID	Type	Note
Allowed Special Tokens	`allowed-special`	array	A list of special tokens that are allowed within chunks.
Chunk Method	`chunk-method`	string	Must be `"Token"`
Chunk Overlap	`chunk-overlap`	integer	Determines the number of tokens that overlap between consecutive chunks
Chunk Size	`chunk-size`	integer	Specifies the maximum size of each chunk in terms of the number of tokens
Disallowed Special Tokens	`disallowed-special`	array	A list of special tokens that should not appear within chunks.
Model	`model-name`	string	The name of the model used for tokenization. Enum values `gpt-4` `gpt-3.5-turbo` `text-davinci-003` `text-davinci-002` `text-davinci-001` `text-curie-001` `text-babbage-001` `text-ada-001` `davinci` `curie` `babbage` `ada` `code-davinci-002` `code-davinci-001` `code-cushman-002` `code-cushman-001` `davinci-codex` `cushman-codex` `text-davinci-edit-001` `code-davinci-edit-001` `text-embedding-ada-002` `text-similarity-davinci-001` `text-similarity-curie-001` `text-similarity-babbage-001` `text-similarity-ada-001` `text-search-davinci-doc-001` `text-search-curie-doc-001` `text-search-babbage-doc-001` `text-search-ada-doc-001` `code-search-babbage-code-001` `code-search-ada-code-001` `gpt2`
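
To make the interaction between `chunk-size` and `chunk-overlap` concrete, here is a minimal Python sketch of a sliding token window. It is an illustration under stated assumptions, not the component's implementation: `chunk_tokens` is a hypothetical helper, and a plain whitespace split stands in for the model tokenizer selected by `model-name`.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Assumption: whitespace splitting stands in for a real model tokenizer.
tokens = "a b c d e f g h i j".split()
chunks = chunk_tokens(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

With a chunk size of 4 and an overlap of 1, each window starts 3 tokens after the previous one, so the last token of one chunk reappears as the first token of the next.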

`Recursive`

This text splitter is the recommended one for generic text. It is parameterized by a list of separators and tries to split on them in order until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`. This has the effect of keeping paragraphs (and then sentences, and then words) together for as long as possible, since those tend to be the most strongly semantically related pieces of text.

Field	Field ID	Type	Note
Chunk Method	`chunk-method`	string	Must be `"Recursive"`
Chunk Overlap	`chunk-overlap`	integer	Determines the number of tokens that overlap between consecutive chunks
Chunk Size	`chunk-size`	integer	Specifies the maximum size of each chunk in terms of the number of tokens
Keep Separator	`keep-separator`	boolean	A flag indicating whether to keep the separator characters at the beginning or end of chunks
Model	`model-name`	string	The name of the model used for tokenization. Enum values `gpt-4` `gpt-3.5-turbo` `text-davinci-003` `text-davinci-002` `text-davinci-001` `text-curie-001` `text-babbage-001` `text-ada-001` `davinci` `curie` `babbage` `ada` `code-davinci-002` `code-davinci-001` `code-cushman-002` `code-cushman-001` `davinci-codex` `cushman-codex` `text-davinci-edit-001` `code-davinci-edit-001` `text-embedding-ada-002` `text-similarity-davinci-001` `text-similarity-curie-001` `text-similarity-babbage-001` `text-similarity-ada-001` `text-search-davinci-doc-001` `text-search-curie-doc-001` `text-search-babbage-doc-001` `text-search-ada-doc-001` `code-search-babbage-code-001` `code-search-ada-code-001` `gpt2`
Separators	`separators`	array	A list of strings representing the separators used to split the text.
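
The "split on separators in order" idea can be sketched in a few lines of Python. This is a simplified illustration, not the component's code: `recursive_split` is a hypothetical helper, chunk size is measured in characters rather than tokens, separators are dropped rather than kept, and no overlap is applied.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split on the first separator, merging neighbouring pieces while they fit;
    recurse with the remaining separators on any piece that is still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-width cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    # An empty separator means "split into individual characters".
    pieces = list(text) if sep == "" else text.split(sep)

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits into the chunk being built
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is too long on its own: try the finer separators.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


text = "Alpha beta.\n\nGamma delta epsilon.\n\nZeta"
result = recursive_split(text, ["\n\n", "\n", " ", ""], chunk_size=12)
print(result)
```

In this sample, the first and last paragraphs fit within 12 characters and stay whole, while the middle paragraph is too long and is split further on spaces.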

`Markdown`

This text splitter is designed specifically for Markdown-formatted text.

Field	Field ID	Type	Note
Chunk Method	`chunk-method`	string	Must be `"Markdown"`
Chunk Overlap	`chunk-overlap`	integer	Determines the number of tokens that overlap between consecutive chunks
Chunk Size	`chunk-size`	integer	Specifies the maximum size of each chunk in terms of the number of tokens
Code Blocks	`code-blocks`	boolean	A flag indicating whether code blocks should be treated as a single unit
Model	`model-name`	string	The name of the model used for tokenization. Enum values `gpt-4` `gpt-3.5-turbo` `text-davinci-003` `text-davinci-002` `text-davinci-001` `text-curie-001` `text-babbage-001` `text-ada-001` `davinci` `curie` `babbage` `ada` `code-davinci-002` `code-davinci-001` `code-cushman-002` `code-cushman-001` `davinci-codex` `cushman-codex` `text-davinci-edit-001` `code-davinci-edit-001` `text-embedding-ada-002` `text-similarity-davinci-001` `text-similarity-curie-001` `text-similarity-babbage-001` `text-similarity-ada-001` `text-search-davinci-doc-001` `text-search-curie-doc-001` `text-search-babbage-doc-001` `text-search-ada-doc-001` `code-search-babbage-code-001` `code-search-ada-code-001` `gpt2`
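
Putting the input tables together, the following Python sketch assembles a Chunk Text input with a `Markdown` setting. The field IDs come from the tables above; everything else is an assumption made for illustration: `build_chunk_text_input` is a hypothetical helper, the numeric values are arbitrary, and placing `task` next to `text` and `strategy` simply mirrors the input table rather than a confirmed request format.

```python
import json
from typing import Any, Dict

# The three chunk-method values listed in the schemas above.
CHUNK_METHODS = {"Token", "Recursive", "Markdown"}


def build_chunk_text_input(text: str, setting: Dict[str, Any]) -> Dict[str, Any]:
    """Assemble the input fields for TASK_CHUNK_TEXT (hypothetical helper)."""
    method = setting.get("chunk-method")
    if method not in CHUNK_METHODS:
        raise ValueError(
            f"chunk-method must be one of {sorted(CHUNK_METHODS)}, got {method!r}"
        )
    return {
        "task": "TASK_CHUNK_TEXT",
        "text": text,
        "strategy": {"setting": setting},
    }


payload = build_chunk_text_input(
    "# Title\n\nSome body text.",
    {
        "chunk-method": "Markdown",
        "chunk-size": 512,     # illustrative value
        "chunk-overlap": 64,   # illustrative value
        "code-blocks": True,
        "model-name": "gpt-3.5-turbo",
    },
)
print(json.dumps(payload, indent=2))
```

Checking `chunk-method` up front catches a misspelled strategy name before the input is sent anywhere.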

Output	ID	Type	Description
Token Count	`token-count`	integer	Total count of tokens in the original input text
Text Chunks	`text-chunks`	array[object]	Text chunks after splitting
Number of Text Chunks	`chunk-num`	integer	Total number of output text chunks
Token Count Chunks	`chunks-token-count`	integer	Total count of tokens in the output text chunks

Output Objects in Chunk Text

Text Chunks

Field	Field ID	Type	Note
End Position	`end-position`	integer	The ending position of the chunk in the original text
Start Position	`start-position`	integer	The starting position of the chunk in the original text
Text	`text`	string	Text chunk after splitting
Token Count	`token-count`	integer	Count of tokens in a chunk
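
As a sketch of how the output might be consumed, the Python snippet below maps the output fields onto a small dataclass and checks that `chunk-num` agrees with the length of `text-chunks`. The `TextChunk` class and `parse_chunks` function are hypothetical, and the sample dictionary is hand-written with placeholder values shaped like the tables above; it is not real component output.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class TextChunk:
    text: str
    start_position: int
    end_position: int
    token_count: int


def parse_chunks(output: Dict[str, Any]) -> List[TextChunk]:
    """Convert the `text-chunks` array into TextChunk objects (hypothetical helper)."""
    chunks = [
        TextChunk(
            text=item["text"],
            start_position=item["start-position"],
            end_position=item["end-position"],
            token_count=item["token-count"],
        )
        for item in output["text-chunks"]
    ]
    # chunk-num is documented as the total number of output chunks.
    if output["chunk-num"] != len(chunks):
        raise ValueError("chunk-num does not match the number of text-chunks")
    return chunks


# Placeholder sample: the values are arbitrary and only illustrate the shape.
sample_output = {
    "token-count": 4,
    "chunk-num": 2,
    "chunks-token-count": 5,
    "text-chunks": [
        {"text": "first chunk", "start-position": 0, "end-position": 10, "token-count": 2},
        {"text": "second chunk", "start-position": 6, "end-position": 17, "token-count": 3},
    ],
}

parsed = parse_chunks(sample_output)
for chunk in parsed:
    print(chunk.start_position, chunk.end_position, chunk.token_count, chunk.text)
```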