When you create a model in ⚗️ Instill Model, it's necessary to define the standardized AI Task that the model falls under.
In a data pipeline, a model is a critical component designed to tackle a specific AI Task. Standardizing the data format of model outputs per AI Task makes models modular: you can swap in models from different sources as an AI component in a 💧 Instill VDP pipeline, as long as they're designed for the same AI Task. ⚗️ Instill Model also adheres to the standard format of AI Tasks for data integration in 💧 Instill VDP pipelines.
Currently, ⚗️ Instill Model outlines the data interface for popular tasks:
Image Classification
: Categorizing images into predefined classes
Object Detection
: Identifying and localizing multiple objects in images
Keypoint Detection
: Identifying and localizing multiple keypoints of objects in images
OCR (Optical Character Recognition)
: Identifying and recognizing text in images
Instance Segmentation
: Identifying, localizing, and outlining multiple objects in images
Semantic Segmentation
: Categorizing image pixels into predefined classes
Text to Image
: Generating images from input text prompts
Image to Image
: Generating images from input image prompts
Text Generation
: Generating text from input text prompts
Text Generation Chat
: Generating chat-style text from input text prompts
Visual Question Answering
: Generating chat-style text from input text and image prompts
- The list is expanding ... 🌱
The tasks listed above concentrate on analyzing and understanding the content of unstructured data in a manner similar to human cognition. The objective is to enable a computer/device to provide a description for the data that is as comprehensive and accurate as possible. These primitive tasks form the basis for building numerous real-world industrial AI applications. Each task is elaborated in the respective section below.
#Image Classification
Image Classification is a Vision task to assign a single pre-defined category label to an entire input image. Generally, an Image Classification model takes an image as the input, and outputs a prediction about what category this image belongs to and a confidence score (usually between 0 and 1) representing the likelihood that the prediction is correct.
{ "task": "TASK_CLASSIFICATION", "task_outputs": [ { "classification": { "category": "golden retriever", "score": 0.98 } } ]}
Available models
Model | Sources | Framework | CPU | GPU |
---|---|---|---|---|
MobileNet v2 | GitHub, GitHub-DVC | ONNX | ✅ | ✅ |
Vision Transformer (ViT) | Hugging Face | ONNX | ✅ | ❌ |
#Object Detection
Object Detection is a Vision task to localize multiple objects of pre-defined categories in an input image. Generally, an Object Detection model receives an image as the input and outputs bounding boxes with category labels and confidence scores for the detected objects.
{ "task": "TASK_DETECTION", "task_outputs": [ { "detection": { "objects": [ { "category": "dog", "score": 0.97, "bounding_box": { "top": 102, "left": 324, "width": 208, "height": 405 } }, ... ] } } ]}
Available models
Model | Sources | Framework | CPU | GPU |
---|---|---|---|---|
YOLOv4 | GitHub-DVC | ONNX | ✅ | ✅ |
YOLOv7 | GitHub-DVC | ONNX | ✅ | ✅ |
#Keypoint Detection
Keypoint Detection is a Vision task to localize multiple objects by identifying their pre-defined keypoints, for example, the keypoints of a human body: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, and ankles. Normally, a Keypoint Detection model takes an image as the input and outputs the coordinates and visibility of keypoints, with bounding boxes and confidence scores for the detected objects.
{ "task": "TASK_KEYPOINT", "task_outputs": [ { "keypoint": { "objects": [ { "keypoints": [ { "v": 0.53722847, "x": 542.82764, "y": 86.63817 }, { "v": 0.634061, "x": 553.0073, "y": 79.440636 }, ... ], "score": 0.94, "bounding_box": { "top": 86, "left": 185, "width": 571, "height": 203 } }, ... ] } } ]}
Available models
Model | Sources | Framework | CPU | GPU |
---|---|---|---|---|
YOLOv7 W6 Pose | GitHub-DVC | ONNX | ✅ | ✅ |
#Optical Character Recognition (OCR)
OCR is a Vision task to localize and recognize text in an input image. The task can be performed in two steps by multiple models: a text detection model detects bounding boxes containing text, and a text recognition model processes the typed or handwritten text within each bounding box into machine-readable text. Alternatively, some deep learning models can accomplish the task in a single step.
{ "task": "TASK_OCR", "task_outputs": [ { "ocr": { "objects": [ { "text": "ENDS", "score": 0.99, "bounding_box": { "top": 298, "left": 279, "width": 134, "height": 59 } }, { "text": "PAVEMENT", "score": 0.99, "bounding_box": { "top": 228, "left": 216, "width": 255, "height": 65 } } ] } } ]}
Available models
Model | Sources | Framework | CPU | GPU |
---|---|---|---|---|
PSNet + EasyOCR | GitHub-DVC | ONNX | ✅ | ✅ |
#Instance Segmentation
Instance Segmentation is a Vision task to detect and delineate multiple objects of pre-defined categories in an input image. Normally, the task takes an image as the input and outputs uncompressed run-length encoding (RLE) representations (variable-length comma-delimited strings), along with bounding boxes, category labels, and confidence scores for the detected objects.
Run-length encoding (RLE) is an efficient form to store binary masks. It is commonly used to encode the location of foreground objects in segmentation. We adopt the uncompressed RLE definition used in the COCO dataset. It divides a binary mask (which must be in column-major order) into a series of piecewise constant regions and, for each piece, simply stores the length of that piece.
The above image shows examples of encoding masks into RLEs and decoding masks encoded via RLEs. Note that the counts at odd positions in an RLE (the first, third, fifth, ...) always record runs of zeros.
Functions to encode masks into RLEs and decode masks encoded via RLEs can be written in a few lines.
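For reference, here is a minimal sketch in Python with NumPy, following the uncompressed, column-major COCO convention described above (the names `mask_to_rle` and `rle_to_mask` are illustrative):

```python
import numpy as np

def mask_to_rle(mask: np.ndarray) -> str:
    # Flatten in column-major (Fortran) order; counts alternate between
    # runs of zeros and ones, always starting with zeros.
    pixels = mask.flatten(order="F")
    counts = []
    prev, run = 0, 0
    for p in pixels:
        if p == prev:
            run += 1
        else:
            counts.append(run)
            prev, run = p, 1
    counts.append(run)
    return ",".join(str(c) for c in counts)

def rle_to_mask(rle: str, height: int, width: int) -> np.ndarray:
    # Replay the counts, alternating zeros and ones, then reshape
    # back into a column-major 2-D mask.
    counts = [int(c) for c in rle.split(",")]
    pixels = np.zeros(height * width, dtype=np.uint8)
    pos, val = 0, 0
    for count in counts:
        pixels[pos:pos + count] = val
        pos += count
        val = 1 - val
    return pixels.reshape((height, width), order="F")

# Round-trip example: one zero followed by three ones in column-major order.
mask = np.array([[0, 1], [1, 1]], dtype=np.uint8)
rle = mask_to_rle(mask)  # "1,3"
assert np.array_equal(rle_to_mask(rle, 2, 2), mask)
```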
{ "task": "TASK_INSTANCE_SEGMENTATION", "task_outputs": [ { "instance_segmentation": { "objects": [ { "rle": "2918,12,382,33,...", "score": 0.99, "bounding_box": { "top": 95, "left": 320, "width": 215, "height": 406 }, "category": "dog" }, { "rle": "34,18,230,18,...", "score": 0.97, "bounding_box": { "top": 194, "left": 130, "width": 197, "height": 248 }, "category": "dog" } ] } } ]}
Available models
Model | Sources | Framework | CPU | GPU |
---|---|---|---|---|
Mask RCNN | GitHub-DVC | PyTorch | ✅ | ✅ |
#Semantic Segmentation
Semantic Segmentation is a Vision task of assigning a class label to every pixel in the image. Normally, the task takes an image as the input and outputs segmentation mask (RLE) representations (variable-length comma-delimited strings) for each group of pixels, together with the category of each group.
{ "task": "TASK_SEMANTIC_SEGMENTATION", "task_outputs": [ { "semantic_segmentation": { "stuffs": [ { "rle": "2918,12,382,33,...", "category": "person" }, { "rle": "34,18,230,18,...", "category": "sky" }, ... ] } } ]}
Available models
Model | Sources | Framework | CPU | GPU |
---|---|---|---|---|
Lite R-ASPP based on MobileNetV3 | GitHub-DVC | ONNX | ✅ | ✅ |
#Text to Image
Text to Image is a Generative AI Task to generate images from text inputs. Generally, the task takes descriptive text prompts as the input, and outputs generated images in Base64 format based on the text prompts.
{ "task": "TASK_TEXT_TO_IMAGE", "task_outputs": [ { "text_to_image": { "images": ["/9j/4AAQSkZJRgABAQAAAQABAAD/..."] } } ]}
In the above example, the generated images are a list of Base64-encoded strings. To obtain the images, decode the Base64 strings as in the snippet below.
```python
import base64

# Decode the first image result
base64_image = out['text_to_image']['images'][0]
image = base64.b64decode(base64_image)

# Save the decoded image
filename = 'text_to_image.jpg'
with open(filename, 'wb') as f:
    f.write(image)
```
Available models
Model | Sources | Framework | CPU | GPU |
---|---|---|---|---|
Stable Diffusion | GitHub-DVC, Local-CPU, Local-GPU | ONNX | ✅ | ✅ |
Stable Diffusion XL | GitHub-DVC | PyTorch | ❌ | ✅ |
#Text Generation
Text Generation is a Generative AI Task to generate new text from text inputs. Generally, the task takes incomplete text prompts as the input, and produces new text based on the prompts. The task can fill in incomplete sentences or even generate full stories given the first words.
{ "task": "TASK_TEXT_GENERATION", "task_outputs": [ { "text_generation": { "text": "The winds of change are blowing strong, bring new beginnings, righting wrongs. The world around us is constantly turning, and with each sunrise, our spirits are yearning." } } ]}
Available models
Model | Sources | Framework | CPU | GPU |
---|---|---|---|---|
Llama2 | GitHub-DVC | Transformer | ❌ | ✅ |
Code Llama | GitHub-DVC | Transformer | ❌ | ✅ |
Llama3-instruct | GitHub-DVC | Transformer | ❌ | ✅ |
Depending on your internet speed, importing LLM models will take a while.
Some models support only GPU deployment. By default, ⚗️ Instill Model can access all your GPUs.
#Text Generation Chat
Text Generation Chat is a Generative AI Task to generate new text from text inputs in a chat style. Generally, the task takes a series of conversation turns as the input and produces a new response to the prompts. The task can hold a conversation and even answer questions based on the previous context.
{ "task": "TASK_TEXT_GENERATION_CHAT", "task_outputs": [ { "text_generation_chat": { "text": "What a delicate situation!\n\nI must advise that it's generally not a good idea to..." } } ]}
Available models
Model | Sources | Framework | CPU | GPU |
---|---|---|---|---|
Llama2 Chat | GitHub-DVC | Transformer | ❌ | ✅ |
MosaicML MPT | GitHub-DVC | Transformer | ❌ | ✅ |
Mistral | GitHub-DVC | Transformer | ❌ | ✅ |
Zephyr-7b | GitHub-DVC | Transformer | ✅ | ✅ |
Depending on your internet speed, importing LLM models will take a while.
Some models support only GPU deployment. By default, ⚗️ Instill Model can access all your GPUs.
#Visual Question Answering
Visual Question Answering (VQA) is a task in computer vision that involves answering questions about an image. The goal of VQA is to teach machines to understand the content of an image and answer questions about it in natural language.
{ "task": "TASK_VISUAL_QUESTION_ANSWERING", "task_outputs": [ { "visual_question_answering": { "text": "The image appears to show a close-up view of a plant's leaf or a similar plant part." } } ]}
Available models
Model | Sources | Framework | CPU | GPU |
---|---|---|---|---|
Llava-1-6 | GitHub-DVC | Transformer | ❌ | ✅ |
Depending on your internet speed, importing LLM models will take a while.
Some models support only GPU deployment. By default, ⚗️ Instill Model can access all your GPUs.
#Unspecified Task
⚗️ Instill Model is very flexible and accepts models even if their task is not standardized yet or their output can't be converted to the format of a supported AI Task. Such a model is classified under the Unspecified task.
{ "unspecified": { "raw_outputs": [ { "data": [0.85, 0.1, 0.05], "data_type": "FP32", "name": "output_scores", "shape": [3] }, { "data": ["dog", "cat", "rabbit"], "data_type": "BYTES", "name": "output_labels", "shape": [3] } ] }}
#Suggest a New Task
Currently, the model output is converted to the standard format based on the AI Task outputs maintained in Protobuf.
If you'd like support for a new task, you can create an issue or request it in the #give-feedback channel on Discord.