Introduction | Documentation

#About

💧 Instill VDP (Versatile Data Pipeline) is a powerful tool designed to build end-to-end unstructured data pipelines. It leverages the capabilities of 3rd-party data, AI, and applications functionalities and seamlessly connect with ⚗️ Instill Model and 💾 Instill Artifact via ⚙️ Instill Component components.

#Pipeline Recipe

A pipeline is defined by a recipe, which is essentially a JSON object composed of a run method, variable, output and multiple components:

Here is a sample recipe where the input is a prompt, and the output is a text.

{
  "on": {},
  "variable": {
    "prompt": {
      "title": "input",
      "instillFormat": "string"
    }
  },
  "output": {
    "text": {
      "title": "text",
      "value": "${op-0.output.texts[0]}"
    }
  },
  "component": {
    "op-0": {
      "type": "openai",
      "task": "TASK_TEXT_GENERATION",
      "input": {
        "model": "gpt-3.5-turbo",
        "n": 2,
        "prompt": "${variable.prompt}",
        "response-format": {
          "type": "text"
        },
        "system-message": "You are a helpful assistant.",
        "temperature": 1,
        "top-p": 1
      },
      "setup": {
        "api-key": "${secret.my-openai-key}"
      }
    }
  }
}

Within the recipe, we have four essential fields, on, variable, output and component.

#On

Indicates how the pipeline is run.

#Run on Request

VDP supports running a pipeline via HTTP and gRPC protocols in SYNC or ASYNC mode. The user doesn't need to set up the configuration in the on field. Every pipeline can run if the user provides variable in the request.

SYNC Mode: Responds to a request synchronously, sending back the result to the user immediately after the data is processed. This mode is suitable for real-time inference scenarios where low latency is crucial.
ASYNC Mode: Performs asynchronous workload. The user runs the pipeline with an asynchronous request and only receives an acknowledgment response. This mode is ideal for use cases that require long-running workloads.

Please refer to the Run Pipeline page for more details.

#Run on Scheduler (Coming Soon)

A pipeline run by a scheduler performs scheduled workloads to regularly run the pipeline.

#Run on Event (Coming Soon)

A pipeline run by a Pub-Sub or Message Queue service.

#Run on Application (Coming Soon)

A pipeline run by an application event.

#Variable

Set up the variables of the pipeline so every component can reference the data from the variables:

title: The title of the variable, which can be displayed on the Console.
description: A description of the variable, which can be displayed on the Console.
instillFormat: The instill format of the variable. Users can use this to control the data format. Please refer to Instill Format for more details.

#Output

Set up ths output of the pipeline:

title: The title of the output, which can be displayed on the Console.
description: A description of the output, which can be displayed on the Console.
value: The value of the output, which can be referenced from any component data.

#Component

Consists of multiple components, which can be a generic, AI, data, application, or operator component.

Each component must have a unique component ID, e.g., openai-0. This component ID should follow the RFC1034 rule, which consists of alphabets, numbers, and hyphens.
For a AI, data, application or operator, we need to set up these fields:
- type: Indicates which type of component it is.
- task: Indicates which task of the component. The task will be listed in the component definition tasks field.
- input: Sets up the input data. The schema is described in the spec.componentSpecification field.
- setup: Defines how the component is setup, e.g. configure an api-key for the connection. The schema is described in the spec.componentSpecification field.
- condition: (optional) Sets up the condition for whether this component will be executed or not.
For a generic.iterator, we need to set up these fields:
- input: The array input.
- outputElements: Sets up the output of this iterator.
- component: An array of components that are executed inside the iterator.

#Data Flow

#Reference from Data

Within pipeline components, we use a reference syntax to set up the data flow, i.e., the input field of each component:

A Reference employs special syntax enclosed in single curly brackets. For example, you can
1. Reference data from the pipeline variable: ${variable.KEY}, you can add any fields you want in the recipe.variable.
2. Reference data from the component: ${comp-id.input.KEY} or ${comp-id.output.KEY}, the available data fields are described in the componentSpecification of each component.
It functions as a variable reference, copying the value from an upstream component to the data input while preserving the original data type.

When run by request, we also need to set up the response data of the request. You can add any fields you want in the recipe.output.

If you don't want to use data reference, you can also set up a constant value in the recipe.

When utilizing batch running in 💧 Instill VDP, where each component processes an array of inputs, the rendered data input for this component, with batch 2 as an example, appears as follows:

{
  "inputs": [
    {
      "prompt": "What is Instill AI building?"
    },
    {
      "prompt": "What is Instill Core?"
    }
  ]
}

Please refer to the API document for instructions on how to run a pipeline.

#Reference from Secrets

Besides reference data from the pipeline input or component settings, we can also reference secrets from the secret management system. The value of these secret fields will be kept out of the recipe. So when the pipeline is shared or published, the keys will always be protected. Examples: "api-key": "${secrets.my-openai-key}". Please refer to Secret Management for more details.

#Control Flow

Within pipeline components, control flow is facilitated through setting the condition in each component, which serves as the means to specify the condition determining whether this component will be executed or not.

Example Configuration

Let's begin with an example.

{
  "condition": "${variable.a-condition-str} == \"TARGET_CONDITION_STR\""
}

This condition field allows us to define the condition setting. 💧 Instill VDP will interpret and use this configuration to decide whether this component will be executed or not. We have two types of conditions: "condition on value" and "condition on status".

#Condition on Value

We support the following syntax for the condition:

Logic
- &&
- ||
Comparison
- <
- >
- <=
- >=
- ==
- !=
Not
- !
Parentheses
- ()

Examples:

Here are some examples for the condition field:

Condition on string value

{ "condition": "${variable.a-condition-str} == \"TARGET_CONDITION_STR\"" }
Condition on number value

{ "condition": "${variable.a-condition-num} > 1" }
Condition on boolean value

{ "condition": "${variable.a-condition-bool}" }
Complex condition

{ "condition": "(${variable.a-condition-bool} && ${comp-a.output.x} == 1) || ${comp-b.output.y} < 1" }
Always false

{ "condition": "false" }

#Condition on Status

Besides Condition on Value, we also support Condition on Status. We provide a boolean value called status.completed, which allows you to condition on whether a component's execution is completed or not.

Examples:

Here are some examples for the condition field:

The component will be executed after component comp-a is completed; you can use this to control the execution order of components.

{ "condition": "${comp-a.status.completed}" }
You can also combine it with the value condition.

{ "condition": "${variable.a-condition-bool} || ${comp-a.status.completed}" }