Text

The Text component is an operator component that lets users extract and manipulate text from different sources. It can carry out the tasks listed under Supported Tasks.

#Release Stage

Alpha

#Configuration

The component definition and tasks are defined in the definition.json and tasks.json files respectively.

#Supported Tasks

#Chunk Text

Chunk text with different strategies

| Input | ID | Type | Description |
| --- | --- | --- | --- |
| Task ID (required) | task | string | TASK_CHUNK_TEXT |
| Text (required) | text | string | Text to be chunked |
| Strategy (required) | strategy | object | Chunking strategy |
Input Objects in Chunk Text

Strategy

Chunking strategy

| Field | Field ID | Type | Note |
| --- | --- | --- | --- |
| Setting | setting | object | Chunk Setting |
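
To show how the input, strategy, and setting objects nest, here is a minimal sketch of a Chunk Text input written as a Python dictionary. The field IDs come from the tables in this section; the concrete values (sample text, sizes, model) are illustrative assumptions, and how this input is wired into a pipeline is not covered here.

```python
import json

# Hypothetical Chunk Text input. Field IDs follow the tables in this
# document; all values below are made up for illustration.
chunk_text_input = {
    "task": "TASK_CHUNK_TEXT",
    "text": "The quick brown fox jumps over the lazy dog.",
    "strategy": {
        "setting": {
            "chunk-method": "Token",
            "chunk-size": 512,
            "chunk-overlap": 64,
            "model-name": "gpt-3.5-turbo",
        }
    },
}

# Serialize to JSON to inspect the nested structure.
print(json.dumps(chunk_text_input, indent=2))
```

The `setting` object is the only part that changes between chunking methods; `task`, `text`, and `strategy` stay the same.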
The setting Object

Setting

setting must fulfill one of the following schemas:

Token

Language models have a token limit that must not be exceeded, so it is a good idea to count tokens when splitting text into chunks. Many tokenizers exist; count tokens with the same tokenizer that the target language model uses.

| Field | Field ID | Type | Note |
| --- | --- | --- | --- |
| Allowed Special Tokens | allowed-special | array | A list of special tokens that are allowed within chunks. |
| Chunk Method | chunk-method | string | Must be "Token" |
| Chunk Overlap | chunk-overlap | integer | Determines the number of tokens that overlap between consecutive chunks |
| Chunk Size | chunk-size | integer | Specifies the maximum size of each chunk in terms of the number of tokens |
| Disallowed Special Tokens | disallowed-special | array | A list of special tokens that should not appear within chunks. |
| Model | model-name | string | The name of the model used for tokenization. |

Enum values for model-name:
  • gpt-4
  • gpt-3.5-turbo
  • text-davinci-003
  • text-davinci-002
  • text-davinci-001
  • text-curie-001
  • text-babbage-001
  • text-ada-001
  • davinci
  • curie
  • babbage
  • ada
  • code-davinci-002
  • code-davinci-001
  • code-cushman-002
  • code-cushman-001
  • davinci-codex
  • cushman-codex
  • text-davinci-edit-001
  • code-davinci-edit-001
  • text-embedding-ada-002
  • text-similarity-davinci-001
  • text-similarity-curie-001
  • text-similarity-babbage-001
  • text-similarity-ada-001
  • text-search-davinci-doc-001
  • text-search-curie-doc-001
  • text-search-babbage-doc-001
  • text-search-ada-doc-001
  • code-search-babbage-code-001
  • code-search-ada-code-001
  • gpt2
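
To make the chunk-size and chunk-overlap fields of the Token schema concrete, the following is a minimal sketch of sliding-window chunking over a token list. It is not the component's implementation: the function name `chunk_tokens` is hypothetical, and whitespace splitting stands in for the model-specific tokenizer selected by model-name.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Stand-in tokenizer (assumption): whitespace splitting. A real run would
# use the tokenizer that matches the selected model.
tokens = "the quick brown fox jumps over the lazy dog".split()
chunks = chunk_tokens(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

With a chunk size of 4 and an overlap of 1, each window starts 3 tokens after the previous one, so the last token of one chunk is repeated as the first token of the next.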
Recursive

This text splitter is the recommended one for generic text. It is parameterized by a list of characters and tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This keeps paragraphs (then sentences, then words) together as long as possible, since those tend to be the most strongly semantically related pieces of text.
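
The idea of trying separators in order can be sketched as follows. This is a simplified illustration under stated assumptions, not the component's code: the function name `recursive_split` is hypothetical, length is measured in characters rather than tokens, and chunk overlap and the keep-separator flag are left out.

```python
def recursive_split(text, separators, chunk_size):
    """Split text so every piece has at most chunk_size characters,
    trying each separator in order and recursing into oversized pieces."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    if sep == "":
        pieces = list(text)  # empty separator means split into characters
    elif sep in text:
        pieces = text.split(sep)
    else:
        return recursive_split(text, rest, chunk_size)  # try the next separator

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: split it with finer separators.
            chunks.extend(recursive_split(piece, rest, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph here.\n\nSecond one is a bit longer than the first."
pieces = recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=30)
for piece in pieces:
    print(repr(piece))
```

In this example the first paragraph fits within 30 characters and stays whole, while the second paragraph is too long and is split again at word boundaries.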

| Field | Field ID | Type | Note |
| --- | --- | --- | --- |
| Chunk Method | chunk-method | string | Must be "Recursive" |
| Chunk Overlap | chunk-overlap | integer | Determines the number of tokens that overlap between consecutive chunks |
| Chunk Size | chunk-size | integer | Specifies the maximum size of each chunk in terms of the number of tokens |
| Keep Separator | keep-separator | boolean | A flag indicating whether to keep the separator characters at the beginning or end of chunks |
| Model | model-name | string | The name of the model used for tokenization. |
| Separators | separators | array | A list of strings representing the separators used to split the text. |

Enum values for model-name:
  • gpt-4
  • gpt-3.5-turbo
  • text-davinci-003
  • text-davinci-002
  • text-davinci-001
  • text-curie-001
  • text-babbage-001
  • text-ada-001
  • davinci
  • curie
  • babbage
  • ada
  • code-davinci-002
  • code-davinci-001
  • code-cushman-002
  • code-cushman-001
  • davinci-codex
  • cushman-codex
  • text-davinci-edit-001
  • code-davinci-edit-001
  • text-embedding-ada-002
  • text-similarity-davinci-001
  • text-similarity-curie-001
  • text-similarity-babbage-001
  • text-similarity-ada-001
  • text-search-davinci-doc-001
  • text-search-curie-doc-001
  • text-search-babbage-doc-001
  • text-search-ada-doc-001
  • code-search-babbage-code-001
  • code-search-ada-code-001
  • gpt2
Markdown

This text splitter is specially designed for Markdown format.

| Field | Field ID | Type | Note |
| --- | --- | --- | --- |
| Chunk Method | chunk-method | string | Must be "Markdown" |
| Chunk Overlap | chunk-overlap | integer | Determines the number of tokens that overlap between consecutive chunks |
| Chunk Size | chunk-size | integer | Specifies the maximum size of each chunk in terms of the number of tokens |
| Code Blocks | code-blocks | boolean | A flag indicating whether code blocks should be treated as a single unit |
| Model | model-name | string | The name of the model used for tokenization. |

Enum values for model-name:
  • gpt-4
  • gpt-3.5-turbo
  • text-davinci-003
  • text-davinci-002
  • text-davinci-001
  • text-curie-001
  • text-babbage-001
  • text-ada-001
  • davinci
  • curie
  • babbage
  • ada
  • code-davinci-002
  • code-davinci-001
  • code-cushman-002
  • code-cushman-001
  • davinci-codex
  • cushman-codex
  • text-davinci-edit-001
  • code-davinci-edit-001
  • text-embedding-ada-002
  • text-similarity-davinci-001
  • text-similarity-curie-001
  • text-similarity-babbage-001
  • text-similarity-ada-001
  • text-search-davinci-doc-001
  • text-search-curie-doc-001
  • text-search-babbage-doc-001
  • text-search-ada-doc-001
  • code-search-babbage-code-001
  • code-search-ada-code-001
  • gpt2
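
As a rough illustration of what treating code blocks as a single unit can mean for the Markdown method, the sketch below splits a document at heading lines while never breaking inside a fenced code block. It is an assumption about the general approach, not the component's actual algorithm; the function name `split_markdown_sections` is hypothetical, and chunk size and overlap are ignored.

```python
FENCE = "`" * 3  # built programmatically so this example nests cleanly


def split_markdown_sections(markdown):
    """Split Markdown into sections that each start at a heading line.
    Lines inside fenced code blocks are never treated as headings."""
    sections, current, in_fence = [], [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # toggle on opening and closing fences
        is_heading = not in_fence and line.startswith("#")
        if is_heading and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections


doc = "\n".join([
    "# Title",
    "Intro text.",
    "## Usage",
    FENCE + "python",
    "# a comment, not a heading",
    "print('hi')",
    FENCE,
    "## Notes",
    "Done.",
])
sections = split_markdown_sections(doc)
print(len(sections))
```

The `#` comment inside the fenced block stays in the "Usage" section instead of starting a new one, which is the behavior a single-unit treatment of code blocks is meant to preserve.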
| Output | ID | Type | Description |
| --- | --- | --- | --- |
| Token Count | token-count | integer | Total count of tokens in the original input text |
| Text Chunks | text-chunks | array[object] | Text chunks after splitting |
| Number of Text Chunks | chunk-num | integer | Total number of output text chunks |
| Token Count Chunks | chunks-token-count | integer | Total count of tokens in the output text chunks |
Output Objects in Chunk Text

Text Chunks

| Field | Field ID | Type | Note |
| --- | --- | --- | --- |
| End Position | end-position | integer | The ending position of the chunk in the original text |
| Start Position | start-position | integer | The starting position of the chunk in the original text |
| Text | text | string | Text chunk after splitting |
| Token Count | token-count | integer | Count of tokens in a chunk |
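
The relationship between the output fields can be checked with a small consistency sketch like the one below. The helper `check_chunk_output` and the sample output are hypothetical. The tables do not state whether positions are character offsets or whether the end position is inclusive, so the sketch assumes character offsets and exposes the end convention as a parameter. It also assumes that chunks-token-count is the sum of the per-chunk token-count values.

```python
def check_chunk_output(text, output, end_inclusive=False):
    """Return a list of inconsistencies found in a Chunk Text output.

    Assumptions: positions are 0-based character offsets, and
    chunks-token-count is the sum of the per-chunk token counts.
    """
    problems = []
    chunks = output["text-chunks"]
    if output["chunk-num"] != len(chunks):
        problems.append("chunk-num does not match the number of text-chunks")
    if output["chunks-token-count"] != sum(c["token-count"] for c in chunks):
        problems.append("chunks-token-count does not match the per-chunk sum")
    for i, c in enumerate(chunks):
        end = c["end-position"] + (1 if end_inclusive else 0)
        if text[c["start-position"]:end] != c["text"]:
            problems.append(f"chunk {i} text does not match its positions")
    return problems


# Made-up output for illustration; token counts use whitespace tokens.
text = "alpha beta gamma delta"
output = {
    "token-count": 4,
    "text-chunks": [
        {"text": "alpha beta", "start-position": 0,
         "end-position": 10, "token-count": 2},
        {"text": "gamma delta", "start-position": 11,
         "end-position": 22, "token-count": 2},
    ],
    "chunk-num": 2,
    "chunks-token-count": 4,
}
print(check_chunk_output(text, output))
```

When chunk overlap is non-zero, chunks-token-count can exceed token-count because overlapping tokens are counted once per chunk that contains them.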