Text

The Text component is an operator component that extracts and manipulates text from different sources. It can carry out the following tasks:

  • Chunk Text

#Release Stage

Alpha

#Configuration

The component definition and tasks are defined in the definition.json and tasks.json files respectively.

#Supported Tasks

#Chunk Text

Splits text into chunks using one of several strategies.

| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_CHUNK_TEXT` |
| Text (required) | `text` | string | Text to be chunked. |
| Strategy (required) | `strategy` | object | Chunking strategy. |
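
To make the nesting of `strategy` and `setting` concrete, here is a minimal sketch of an input for this task, written as a Python dictionary. The field IDs come from the tables in this document; the concrete values and the `check_input` helper are hypothetical and only illustrate the shape of the data, not how the component validates it.

```python
import json

# Illustrative input for the Chunk Text task. The values (text, sizes,
# model) are made up; only the field IDs are taken from the tables.
chunk_text_input = {
    "task": "TASK_CHUNK_TEXT",
    "text": "Language models have a token limit. Split long text into chunks.",
    "strategy": {
        "setting": {
            "chunk-method": "Token",
            "chunk-size": 512,
            "chunk-overlap": 50,
            "model-name": "gpt-3.5-turbo",
        }
    },
}

REQUIRED_INPUT_FIELDS = ("task", "text", "strategy")
CHUNK_METHODS = ("Token", "Recursive", "Markdown")


def check_input(payload: dict) -> list[str]:
    """Return a list of problems found in the payload (empty if none).

    Hypothetical helper for illustration, not part of the component.
    """
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_INPUT_FIELDS
        if name not in payload
    ]
    setting = payload.get("strategy", {}).get("setting", {})
    if setting.get("chunk-method") not in CHUNK_METHODS:
        problems.append("chunk-method must be one of " + ", ".join(CHUNK_METHODS))
    return problems


print(json.dumps(chunk_text_input, indent=2))
print(check_input(chunk_text_input))
```

The `chunk-method` value selects which of the setting schemas applies, so the remaining keys under `setting` depend on it.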
Input Objects in Chunk Text

Strategy

Chunking strategy.

| Field | Field ID | Type | Note |
| :--- | :--- | :--- | :--- |
| Setting | `setting` | object | Chunk Setting. |
The setting Object

Setting

setting must fulfill one of the following schemas:

Token

Language models have a token limit, which your input should not exceed. When you split text into chunks, it is therefore a good idea to count the number of tokens. Many tokenizers exist, so count tokens with the same tokenizer that the language model uses.

| Field | Field ID | Type | Note |
| :--- | :--- | :--- | :--- |
| Allowed Special Tokens | `allowed-special` | array | A list of special tokens that are allowed within chunks. |
| Chunk Method | `chunk-method` | string | Must be "Token" |
| Chunk Overlap | `chunk-overlap` | integer | Determines the number of tokens that overlap between consecutive chunks. |
| Chunk Size | `chunk-size` | integer | Specifies the maximum size of each chunk in terms of the number of tokens. |
| Disallowed Special Tokens | `disallowed-special` | array | A list of special tokens that should not appear within chunks. |
| Model | `model-name` | string | The name of the model used for tokenization. |

Enum values for `model-name`:
  • gpt-4
  • gpt-3.5-turbo
  • text-davinci-003
  • text-davinci-002
  • text-davinci-001
  • text-curie-001
  • text-babbage-001
  • text-ada-001
  • davinci
  • curie
  • babbage
  • ada
  • code-davinci-002
  • code-davinci-001
  • code-cushman-002
  • code-cushman-001
  • davinci-codex
  • cushman-codex
  • text-davinci-edit-001
  • code-davinci-edit-001
  • text-embedding-ada-002
  • text-similarity-davinci-001
  • text-similarity-curie-001
  • text-similarity-babbage-001
  • text-similarity-ada-001
  • text-search-davinci-doc-001
  • text-search-curie-doc-001
  • text-search-babbage-doc-001
  • text-search-ada-doc-001
  • code-search-babbage-code-001
  • code-search-ada-code-001
  • gpt2
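
As a rough illustration of how `chunk-size` and `chunk-overlap` interact in the Token setting, the sketch below slides a window over a list of tokens. It is not the component's implementation: `chunk_by_tokens` is a hypothetical helper, and whitespace splitting stands in for the model tokenizer that `model-name` would select.

```python
def chunk_by_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Assumption: whitespace splitting is only a stand-in tokenizer. Counts
# from a real model tokenizer will differ.
tokens = "a b c d e f g h i j".split()
chunks = chunk_by_tokens(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(len(chunk), chunk)
```

With 10 tokens, a chunk size of 4, and an overlap of 1, the window advances 3 tokens at a time, so each chunk starts with the last token of the previous one.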
Recursive

This text splitter is the recommended one for generic text. It is parameterized by a list of characters and tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of keeping paragraphs (and then sentences, and then words) together as long as possible, as those are generally the most strongly semantically related pieces of text.

| Field | Field ID | Type | Note |
| :--- | :--- | :--- | :--- |
| Chunk Method | `chunk-method` | string | Must be "Recursive" |
| Chunk Overlap | `chunk-overlap` | integer | Determines the number of tokens that overlap between consecutive chunks. |
| Chunk Size | `chunk-size` | integer | Specifies the maximum size of each chunk in terms of the number of tokens. |
| Keep Separator | `keep-separator` | boolean | A flag indicating whether to keep the separator characters at the beginning or end of chunks. |
| Model | `model-name` | string | The name of the model used for tokenization. |
| Separators | `separators` | array | A list of strings representing the separators used to split the text. |

Enum values for `model-name`:
  • gpt-4
  • gpt-3.5-turbo
  • text-davinci-003
  • text-davinci-002
  • text-davinci-001
  • text-curie-001
  • text-babbage-001
  • text-ada-001
  • davinci
  • curie
  • babbage
  • ada
  • code-davinci-002
  • code-davinci-001
  • code-cushman-002
  • code-cushman-001
  • davinci-codex
  • cushman-codex
  • text-davinci-edit-001
  • code-davinci-edit-001
  • text-embedding-ada-002
  • text-similarity-davinci-001
  • text-similarity-curie-001
  • text-similarity-babbage-001
  • text-similarity-ada-001
  • text-search-davinci-doc-001
  • text-search-curie-doc-001
  • text-search-babbage-doc-001
  • text-search-ada-doc-001
  • code-search-babbage-code-001
  • code-search-ada-code-001
  • gpt2
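
The "split on separators in order" idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the component's algorithm: `recursive_split` is a hypothetical helper, size is measured in characters rather than tokens, there is no overlap, and separators are dropped where a split happens (roughly what turning `keep-separator` off would suggest).

```python
def recursive_split(text, separators, chunk_size):
    """Split text on the first separator, recursing with the remaining
    separators on any piece that is still longer than chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: fall back to fixed-size slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    # An empty separator means "split into single characters".
    pieces = list(text) if sep == "" else [p for p in text.split(sep) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too large: try the finer separators.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


text = (
    "First paragraph here.\n\n"
    "Second paragraph is a bit longer than the first.\n\n"
    "Third."
)
result = recursive_split(text, ["\n\n", "\n", " ", ""], chunk_size=30)
for chunk in result:
    print(repr(chunk))
```

The first and third paragraphs fit within 30 characters and stay whole. Only the second paragraph is broken up, and it is split at spaces because it contains no line breaks.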
Markdown

This text splitter is specially designed for Markdown format.

| Field | Field ID | Type | Note |
| :--- | :--- | :--- | :--- |
| Chunk Method | `chunk-method` | string | Must be "Markdown" |
| Chunk Overlap | `chunk-overlap` | integer | Determines the number of tokens that overlap between consecutive chunks. |
| Chunk Size | `chunk-size` | integer | Specifies the maximum size of each chunk in terms of the number of tokens. |
| Code Blocks | `code-blocks` | boolean | A flag indicating whether code blocks should be treated as a single unit. |
| Model | `model-name` | string | The name of the model used for tokenization. |

Enum values for `model-name`:
  • gpt-4
  • gpt-3.5-turbo
  • text-davinci-003
  • text-davinci-002
  • text-davinci-001
  • text-curie-001
  • text-babbage-001
  • text-ada-001
  • davinci
  • curie
  • babbage
  • ada
  • code-davinci-002
  • code-davinci-001
  • code-cushman-002
  • code-cushman-001
  • davinci-codex
  • cushman-codex
  • text-davinci-edit-001
  • code-davinci-edit-001
  • text-embedding-ada-002
  • text-similarity-davinci-001
  • text-similarity-curie-001
  • text-similarity-babbage-001
  • text-similarity-ada-001
  • text-search-davinci-doc-001
  • text-search-curie-doc-001
  • text-search-babbage-doc-001
  • text-search-ada-doc-001
  • code-search-babbage-code-001
  • code-search-ada-code-001
  • gpt2
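
To illustrate what treating code blocks as a single unit can mean for a Markdown-aware splitter, the sketch below cuts a document at heading lines but never inside a fenced code block. It is an assumption-laden illustration only: `split_markdown_sections` is a hypothetical helper, it ignores `chunk-size` and `chunk-overlap`, and the component's actual splitting rules may differ.

```python
FENCE = "`" * 3  # a Markdown code fence, built here to keep this example readable


def split_markdown_sections(markdown: str) -> list[str]:
    """Split Markdown into sections that each start at a heading line.

    Lines inside fenced code blocks are never treated as headings, so a
    code block always stays within one section.
    """
    sections, current, in_fence = [], [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a code block
        is_heading = not in_fence and line.startswith("#")
        if is_heading and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]


doc = "\n".join([
    "# Title",
    "Intro text.",
    "## Usage",
    FENCE + "python",
    "# a comment, not a heading",
    "print('hi')",
    FENCE,
    "## Notes",
    "Done.",
])
sections = split_markdown_sections(doc)
for section in sections:
    print(section)
    print("-----")
```

The `# a comment` line sits inside a fence, so it does not start a new section and the code block stays intact.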
| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Token Count | `token-count` | integer | Total count of tokens in the original input text. |
| Text Chunks | `text-chunks` | array[object] | Text chunks after splitting. |
| Number of Text Chunks | `chunk-num` | integer | Total number of output text chunks. |
| Token Count Chunks | `chunks-token-count` | integer | Total count of tokens in the output text chunks. |
Output Objects in Chunk Text

Text Chunks

| Field | Field ID | Type | Note |
| :--- | :--- | :--- | :--- |
| End Position | `end-position` | integer | The ending position of the chunk in the original text. |
| Start Position | `start-position` | integer | The starting position of the chunk in the original text. |
| Text | `text` | string | Text chunk after splitting. |
| Token Count | `token-count` | integer | Count of tokens in a chunk. |
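
The output fields relate to each other in ways that are easy to check when consuming the result. The sketch below uses a made-up output and a hypothetical `check_output` helper. It assumes that `chunks-token-count` is the sum of the per-chunk `token-count` values, which the descriptions above suggest but do not state outright, and the positions shown are illustrative values only.

```python
# Made-up example output. Field IDs come from the tables in this document;
# the values are illustrative and not produced by the component.
example_output = {
    "token-count": 10,
    "chunk-num": 3,
    "chunks-token-count": 12,
    "text-chunks": [
        {"text": "a b c d", "start-position": 0, "end-position": 6, "token-count": 4},
        {"text": "d e f g", "start-position": 6, "end-position": 12, "token-count": 4},
        {"text": "g h i j", "start-position": 12, "end-position": 18, "token-count": 4},
    ],
}


def check_output(output: dict) -> list[str]:
    """Return a list of inconsistencies found in the output (empty if none).

    Hypothetical helper for illustration, not part of the component.
    """
    problems = []
    chunks = output["text-chunks"]
    if output["chunk-num"] != len(chunks):
        problems.append("chunk-num does not match the number of text chunks")
    # Assumption: chunks-token-count is the sum of per-chunk token counts.
    if output["chunks-token-count"] != sum(c["token-count"] for c in chunks):
        problems.append("chunks-token-count does not match the per-chunk counts")
    for index, chunk in enumerate(chunks):
        if chunk["start-position"] > chunk["end-position"]:
            problems.append(f"chunk {index}: start-position is after end-position")
    return problems


print(check_output(example_output))
```

In this example `chunks-token-count` (12) is larger than `token-count` (10), which is what one would expect when chunks overlap, since overlapping tokens are counted once per chunk.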