Introducing VDP: open-source unstructured data ETL

Versatile Data Pipeline (VDP) is the single point of unstructured data integration, where users can sync unstructured data from anywhere into centralised warehouses or applications, just like how the modern data stack handles structured data.

Published by Xiaofei Du on 8/23/2022

#What is VDP

A few months ago, we introduced unstructured data ETL, the missing piece in the modern data stack.

When people say they are data-driven, they usually mean they are driven by structured data. Although an estimated 80% of the world's data is unstructured, such data is harder to analyse, and few companies have the know-how or the resources to deal with it. To tackle this problem, we have built Versatile Data Pipeline (VDP), a general, modularised ETL infrastructure for unstructured data.

VDP streamlines the end-to-end unstructured data processing pipeline:

  • Extract unstructured data from pre-built data sources such as cloud/on-prem storage, or IoT devices
  • Transform it into analysable or meaningful data representations by AI models imported from various ML platforms
  • Load the transformed data into warehouses, applications, or other destinations
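
As a rough illustration of the three steps above, here is a minimal Python sketch. It does not use VDP's actual API: every name in it (`Record`, `extract`, `transform`, `load`, `toy_model`, `run_pipeline`) is hypothetical, an in-memory dict stands in for a data source, a list stands in for a warehouse, and a trivial function stands in for an AI model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


@dataclass
class Record:
    """Structured output derived from one unstructured input (hypothetical schema)."""
    source_id: str
    label: str
    score: float


def extract(items: Dict[str, bytes]) -> Iterator[Tuple[str, bytes]]:
    """Extract: yield raw payloads from a source (a dict stands in for storage)."""
    yield from items.items()


def transform(
    raw: Iterable[Tuple[str, bytes]],
    model: Callable[[bytes], Tuple[str, float]],
) -> Iterator[Record]:
    """Transform: run a model over each payload and emit structured records."""
    for source_id, payload in raw:
        label, score = model(payload)
        yield Record(source_id, label, score)


def load(records: Iterable[Record], destination: List[Record]) -> int:
    """Load: append records to a destination (a list stands in for a warehouse)."""
    count = 0
    for record in records:
        destination.append(record)
        count += 1
    return count


def toy_model(payload: bytes) -> Tuple[str, float]:
    """Placeholder for an AI model: labels a payload by its size only."""
    return ("large", 0.9) if len(payload) > 4 else ("small", 0.6)


def run_pipeline(
    items: Dict[str, bytes],
    model: Callable[[bytes], Tuple[str, float]],
    destination: List[Record],
) -> int:
    """Chain the three steps; returns the number of records loaded."""
    return load(transform(extract(items), model), destination)


warehouse: List[Record] = []
loaded = run_pipeline({"a.jpg": b"123456", "b.jpg": b"12"}, toy_model, warehouse)
```

Because each step only depends on the shape of the data it receives, the source, the model, and the destination can each be swapped independently, which is the modularity the list above describes.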

We believe VDP is the future of unstructured data ETL, where developers won't need to build their own data connectors, high-maintenance model serving platforms, or ETL pipeline automation tools.

Our mission is to make VDP the single point of unstructured data integration, so users can sync unstructured data from anywhere into centralised warehouses or applications and focus on gaining insights across all data sources, just like how the modern data stack handles structured data. Check out highlights and core concepts if you want to learn more about how VDP works.

To benefit a broader community, we have released VDP under the open-source Apache License 2.0. Check it out here. We've made it easy to get started with VDP on your local machine, and Kubernetes support is coming soon. Click here to get started.

If you want to chat about VDP or share your use cases, come and hang out with us in our Discord community.

#What is VDP not?

Many brilliant MLOps platforms/tools providing AI solutions have emerged in the last few years. Most of the tools are built from a model-centric perspective and fall into the following categories:

  • General ML platforms for model training, experiment tracking, model deployment, etc.
  • Platforms that serve a specific vertical, such as e-commerce or manufacturing.
  • Platforms that focus on a single component of MLOps, such as data labelling, dataset preparation, and model serving.

VDP, by contrast, is built from a data-driven perspective. Although the AI model is the most critical component in an unstructured data ETL pipeline, the ultimate goal of VDP is to streamline the end-to-end unstructured data flow, with a transform component that can flexibly import AI models from different sources. For more details, see the FAQ page.
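
One way to picture a transform component that imports models from different sources is a shared interface with one adapter per source. The sketch below is only an assumption about how such a design could look, not VDP's implementation: `Model`, `LocalModel`, `RemoteModel`, `MODEL_SOURCES`, and `transform_with` are all hypothetical names, and no real model or network call is involved.

```python
from typing import Callable, Dict, Protocol


class Model(Protocol):
    """Common interface the transform step depends on (hypothetical)."""

    def predict(self, payload: bytes) -> str: ...


class LocalModel:
    """Stands in for a model loaded from local files."""

    def predict(self, payload: bytes) -> str:
        return f"local:{len(payload)}"


class RemoteModel:
    """Stands in for a model hosted on an external ML platform (no network call)."""

    def __init__(self, name: str) -> None:
        self.name = name

    def predict(self, payload: bytes) -> str:
        return f"{self.name}:{len(payload)}"


# Registry mapping a source name to a factory that builds a model from that source.
MODEL_SOURCES: Dict[str, Callable[[], Model]] = {
    "local": LocalModel,
    "remote": lambda: RemoteModel("remote-demo"),
}


def transform_with(source: str, payload: bytes) -> str:
    """Run the payload through whichever model the named source provides."""
    model = MODEL_SOURCES[source]()
    return model.predict(payload)
```

The pipeline code calls only `predict`, so adding a new model source means registering one more factory rather than changing the data flow.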

#Open-source and cloud versions

VDP is in alpha and under active, heavy development. Check out our open roadmap. If you have any questions or feature requests, open a topic in VDP Discussions or hop into our Discord to get help from an active and friendly community!

Our team is working hard to build a fully managed cloud product for VDP, which aims to offer:

  • Painless setup
  • Maintenance-free infrastructure
  • Start for free, pay as you grow

Interested in trying it out? Join the waitlist today and we'll keep you posted on the progress!

Last updated: 12/23/2022, 7:00:08 AM