AI task

VDP aims to streamline the end-to-end flow of unstructured data. Its transform component can flexibly import AI models to process unstructured data for a specific task in Vision, Language and more.

Intrigued? Refer to Prepare Models to learn about how to prepare your models for AI tasks supported by VDP.

#Standardise AI tasks

In a data pipeline, the model is the core component designed to solve a specific AI task. By standardising the data format of model outputs into AI tasks,

  • the model in a pipeline is modularised: you can freely switch between different models in a pipeline as long as each model is designed for the same task;
  • VDP produces a stream of data from models in a standard format for use in a data integration or ETL pipeline.

At the moment, VDP defines the data interface for popular tasks:

  • Image classification - classify images into predefined categories
  • Object detection - detect and localise multiple objects in images
  • Keypoint detection - detect and localise multiple keypoints of objects in images
  • OCR (Optical Character Recognition) - detect and recognise text in images
  • Instance segmentation - detect, localise and delineate multiple objects in images
  • The list is growing ... 🌱

The above tasks focus on analysing and understanding the content of data in the same way a human does. The goal is to make a computer or device describe the data as completely and accurately as possible. These primitive tasks are the foundation for building many real-world industrial AI applications. Each task is described in depth in its own section below.

If you'd like support for a new task, you can create a topic in Discussions, or request it in the #vdp channel on Discord.

#How to standardise

#Standardise via Protocol Buffers

Currently, the model output is converted to a standard format based on the AI task outputs maintained in Protobuf.

#Standardise via VDP Protocol

The VDP Protocol describes the data schema of AI task output in order to standardise an ETL pipeline for unstructured data. The model component of a pipeline passes the data it produces to the destination component as serialized JSON messages for inter-process communication.


"$schema": http://json-schema.org/draft-07/schema#
"$id": https://github.com/instill-ai/vdp/blob/main/protocol/vdp_protocol.yaml
title: VDP Protocol
type: object
description: VDP Protocol structs
additionalProperties: true
anyOf:
  - required:
      - classification
  - required:
      - detection
  - required:
      - keypoint
  - required:
      - ocr
  - required:
      - instance_segmentation
  - required:
      - unspecified
properties:
  classification:
    description: "Classify into pre-defined categories"
    "$ref": "#/definitions/Classification"
  detection:
    description: "Detect and localise multiple objects"
    "$ref": "#/definitions/Detection"
  keypoint:
    description: "Detect and localise keypoints of multiple objects"
    "$ref": "#/definitions/Keypoint"
  ocr:
    description: "Detect, localise and recognise texts"
    "$ref": "#/definitions/Ocr"
  instance_segmentation:
    description: "Detect, localise and delineate multiple objects"
    "$ref": "#/definitions/InstanceSegmentation"
  unspecified:
    description: "Unspecified task with output in the free form"
    "$ref": "#/definitions/Unspecified"

To be more specific, the above protocol defines the AI task output for one input image in a batch produced by the corresponding model instance.
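
The top-level `anyOf` rule can be checked by hand when a JSON Schema library is not available. The following is a minimal sketch in Python, not VDP's own validation code: the helper name `detect_task` is hypothetical, and it only looks at which task key a message carries, not at the nested definitions.

```python
import json

# Task keys listed under `anyOf` in the VDP Protocol above.
TASK_KEYS = (
    "classification",
    "detection",
    "keypoint",
    "ocr",
    "instance_segmentation",
    "unspecified",
)


def detect_task(message: str) -> str:
    """Return the first known task key found in a serialized output message.

    Hypothetical helper: it mirrors only the top-level `anyOf`/`required`
    rule and does not validate the nested task definitions.
    """
    payload = json.loads(message)
    if not isinstance(payload, dict):
        raise ValueError("an output message must be a JSON object")
    for key in TASK_KEYS:
        if key in payload:
            return key
    raise ValueError("message contains none of the known task keys")


example = '{"classification": {"category": "golden retriever", "score": 0.98}}'
print(detect_task(example))  # classification
```

A destination component could use a check like this to route each message to task-specific handling before reading its fields.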

The protocol is still under development. Stay tuned on how the protocol will evolve.

#Image classification

Image classification is a Vision task to assign a single pre-defined category label to an entire input image. Generally, an image classification model takes an image as the input, and outputs a prediction about what category this image belongs to and a confidence score (usually between 0 and 1) representing the likelihood that the prediction is correct.

Image classification task

{
  "classification": {
    "category": "golden retriever",
    "score": 0.98
  }
}
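
A consumer of this output often wants to ignore low-confidence predictions. Here is a small sketch of that idea; the function name `accept_prediction` and the `min_score` parameter are hypothetical and not part of VDP.

```python
import json


def accept_prediction(message: str, min_score: float = 0.5):
    """Return the predicted category, or None if its score is below `min_score`.

    `accept_prediction` and `min_score` are hypothetical names for this sketch.
    """
    result = json.loads(message)["classification"]
    if result["score"] >= min_score:
        return result["category"]
    return None


message = '{"classification": {"category": "golden retriever", "score": 0.98}}'
print(accept_prediction(message))        # golden retriever
print(accept_prediction(message, 0.99))  # None
```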

#Object detection

Object detection is a Vision task to localise multiple objects of pre-defined categories in an input image. Generally, an object detection model receives an image as the input, and outputs bounding boxes with category labels and confidence scores on detected objects.

Object detection task

{
  "detection": {
    "objects": [
      {
        "category": "dog",
        "score": 0.97,
        "bounding_box": {
          "top": 102,
          "left": 324,
          "width": 208,
          "height": 405
        }
      },
      ...
    ]
  }
}
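
Many drawing and evaluation tools expect corner coordinates rather than `top`/`left`/`width`/`height`. The sketch below converts between the two. It assumes that `top` and `left` give the pixel position of the box's top-left corner, measured from the top-left of the image; this page does not state that explicitly. The function name `to_corners` is hypothetical.

```python
def to_corners(bounding_box: dict) -> tuple:
    """Convert {top, left, width, height} to (x_min, y_min, x_max, y_max).

    Assumes `left`/`top` locate the box's top-left corner in pixels.
    """
    x_min = bounding_box["left"]
    y_min = bounding_box["top"]
    x_max = x_min + bounding_box["width"]
    y_max = y_min + bounding_box["height"]
    return (x_min, y_min, x_max, y_max)


box = {"top": 102, "left": 324, "width": 208, "height": 405}
print(to_corners(box))  # (324, 102, 532, 507)
```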

#Keypoint detection

Keypoint detection is a Vision task to localise multiple objects by identifying their pre-defined keypoints, for example, the keypoints of a human body: nose, eyes, ears, shoulders, elbows, wrists, hips, knees and ankles. Normally, a keypoint detection model takes an image as the input, and outputs the coordinates and visibility of keypoints with bounding boxes and confidence scores on detected objects.

Keypoint detection task

{
  "keypoint": {
    "objects": [
      {
        "keypoints": [
          {
            "v": 0.53722847,
            "x": 542.82764,
            "y": 86.63817
          },
          {
            "v": 0.634061,
            "x": 553.0073,
            "y": 79.440636
          },
          ...
        ],
        "score": 0.94,
        "bounding_box": {
          "top": 86,
          "left": 185,
          "width": 571,
          "height": 203
        }
      },
      ...
    ]
  }
}
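
One common way to use this output is to keep only the keypoints that are likely to be visible. The sketch below assumes that `v` is a visibility score where a higher value means more visible; the function name `visible_keypoints` and the threshold are hypothetical.

```python
def visible_keypoints(obj: dict, min_visibility: float = 0.6) -> list:
    """Return (x, y) pairs for keypoints whose `v` reaches `min_visibility`.

    Assumes `v` is a visibility score where higher means more visible.
    """
    return [
        (kp["x"], kp["y"])
        for kp in obj["keypoints"]
        if kp["v"] >= min_visibility
    ]


# One detected object, using the two keypoints shown in the example above.
person = {
    "keypoints": [
        {"v": 0.53722847, "x": 542.82764, "y": 86.63817},
        {"v": 0.634061, "x": 553.0073, "y": 79.440636},
    ],
    "score": 0.94,
    "bounding_box": {"top": 86, "left": 185, "width": 571, "height": 203},
}
print(visible_keypoints(person))  # [(553.0073, 79.440636)]
```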

#Optical Character Recognition (OCR)

OCR is a Vision task to localise and recognise text in an input image. The task can be done in two steps by multiple models: a text detection model detects bounding boxes containing text, and a text recognition model converts the typed or handwritten text within each bounding box into machine-readable text. Alternatively, there are deep learning models that can accomplish the task in a single step.

OCR task

{
  "ocr": {
    "objects": [
      {
        "text": "ENDS",
        "score": 0.99,
        "bounding_box": {
          "top": 298,
          "left": 279,
          "width": 134,
          "height": 59
        }
      },
      {
        "text": "PAVEMENT",
        "score": 0.99,
        "bounding_box": {
          "top": 228,
          "left": 216,
          "width": 255,
          "height": 65
        }
      }
    ]
  }
}
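
The detected text objects do not necessarily arrive in reading order: in the example above, "ENDS" is listed before "PAVEMENT" although it sits lower in the image. The sketch below applies a naive heuristic that sorts by `top`, then by `left`. It is only an illustration (it will not handle multi-column layouts or words on the same line with slightly different `top` values), and the function name `reading_order` is hypothetical.

```python
def reading_order(objects: list) -> str:
    """Join detected texts top-to-bottom, then left-to-right (naive heuristic)."""
    ordered = sorted(
        objects,
        key=lambda o: (o["bounding_box"]["top"], o["bounding_box"]["left"]),
    )
    return " ".join(o["text"] for o in ordered)


objects = [
    {"text": "ENDS", "score": 0.99,
     "bounding_box": {"top": 298, "left": 279, "width": 134, "height": 59}},
    {"text": "PAVEMENT", "score": 0.99,
     "bounding_box": {"top": 228, "left": 216, "width": 255, "height": 65}},
]
print(reading_order(objects))  # PAVEMENT ENDS
```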

#Instance Segmentation

Instance segmentation is a Vision task to detect and delineate multiple objects of pre-defined categories in an input image. Normally, the task takes an image as the input, and outputs uncompressed run-length encoding (RLE) representations (a variable-length comma-delimited string), with bounding boxes, category labels and confidence scores on detected objects.

Instance segmentation task

Run-length encoding (RLE) is an efficient way to store binary masks. It is commonly used to encode the location of foreground objects in segmentation. We adopt the uncompressed RLE definition used in the COCO dataset. It divides a binary mask (which must be in column-major order) into a series of piecewise constant regions and, for each piece, simply stores the length of that piece.

Examples of encoding masks into RLEs and decoding masks encoded via RLEs

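The above image shows examples of encoding masks into RLEs and decoding masks encoded via RLEs. Note that the counts at odd positions in an RLE (the 1st, 3rd, 5th, ...) are always the numbers of zeros, so an RLE for a mask that starts with a one begins with a count of 0.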
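
The encoding and decoding described above can be sketched in plain Python as follows. This is an illustration of uncompressed RLE in column-major order, not VDP's implementation: the function names are hypothetical, masks are represented as lists of rows, and the decoder takes the mask's height and width as arguments because the comma-delimited counts alone do not carry them.

```python
def rle_encode(mask: list) -> str:
    """Encode a binary mask (list of rows) as an uncompressed RLE string.

    Pixels are read in column-major order. The first count is always the
    number of leading zeros (0 if the mask starts with a one).
    """
    height, width = len(mask), len(mask[0])
    flat = [mask[r][c] for c in range(width) for r in range(height)]
    counts, previous, run = [], 0, 0
    for pixel in flat:
        if pixel != previous:
            counts.append(run)
            previous, run = pixel, 0
        run += 1
    counts.append(run)
    return ",".join(str(count) for count in counts)


def rle_decode(rle: str, height: int, width: int) -> list:
    """Decode an uncompressed RLE string back into a list of rows."""
    flat, value = [], 0
    for count in (int(c) for c in rle.split(",")):
        flat.extend([value] * count)
        value = 1 - value  # runs alternate between zeros and ones
    if len(flat) != height * width:
        raise ValueError("RLE counts do not match the mask size")
    # In column-major order, pixel (r, c) sits at index c * height + r.
    return [[flat[c * height + r] for c in range(width)] for r in range(height)]


mask = [
    [0, 0, 1],
    [0, 1, 1],
    [0, 1, 0],
]
encoded = rle_encode(mask)
print(encoded)  # 4,4,1
print(rle_decode(encoded, 3, 3) == mask)  # True
```

Reading the mask column by column gives `0,0,0,0,1,1,1,1,0`, that is four zeros, four ones and one zero, hence `4,4,1`.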


{
  "instance_segmentation": {
    "objects": [
      {
        "rle": "2918,12,382,33,...",
        "score": 0.99,
        "bounding_box": {
          "top": 95,
          "left": 320,
          "width": 215,
          "height": 406
        },
        "category": "dog"
      },
      {
        "rle": "34,18,230,18,...",
        "score": 0.97,
        "bounding_box": {
          "top": 194,
          "left": 130,
          "width": 197,
          "height": 248
        },
        "category": "dog"
      }
    ]
  }
}

#What if my task is not standardised by VDP yet?

VDP is very flexible and allows you to import models even if your task is not standardised yet or the output of the model can't be converted to the format of supported AI tasks. The model will be classified as an Unspecified task. When you send an image to the model as the input, VDP will

  • check the config.pbtxt model configuration file to extract the names, datatypes and shapes of the model outputs,
  • and wrap this information along with the raw model output in a standard format.
Unspecified task

{
  "unspecified": {
    "raw_outputs": [
      {
        "data": [0.85, 0.1, 0.05],
        "data_type": "FP32",
        "name": "output_scores",
        "shape": [3]
      },
      {
        "data": ["dog", "cat", "rabbit"],
        "data_type": "BYTES",
        "name": "output_labels",
        "shape": [3]
      }
    ]
  }
}
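
Because an Unspecified output is free-form, the consumer has to interpret it. The sketch below indexes the raw outputs by name and then pairs labels with scores. That pairing is an assumption about this particular example model (its output names `output_scores` and `output_labels` come from the sample above), and the function name `outputs_by_name` is hypothetical.

```python
def outputs_by_name(message: dict) -> dict:
    """Map each raw output's `name` to its `data`."""
    return {
        output["name"]: output["data"]
        for output in message["unspecified"]["raw_outputs"]
    }


message = {
    "unspecified": {
        "raw_outputs": [
            {"data": [0.85, 0.1, 0.05], "data_type": "FP32",
             "name": "output_scores", "shape": [3]},
            {"data": ["dog", "cat", "rabbit"], "data_type": "BYTES",
             "name": "output_labels", "shape": [3]},
        ]
    }
}

outputs = outputs_by_name(message)
# Assumption for this example model: labels and scores are aligned by index.
ranked = sorted(
    zip(outputs["output_labels"], outputs["output_scores"]),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked[0])  # ('dog', 0.85)
```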

Last updated: 1/17/2023, 12:41:24 AM