💧 Visual Data Preparation (VDP) is an open-source tool to seamlessly integrate Vision AI with the modern data stack

Instill AI - Visual Data Preparation Made for All

Website | Community | Blog

Get Early Access


Visual Data Preparation (VDP) is an open-source tool to streamline the end-to-end visual data processing pipeline:

  1. Ingest unstructured visual data from data sources such as data lakes or IoT devices;
  2. Transform visual data into meaningful structured data representations with Vision AI models;
  3. Load the structured data into warehouses, applications, or other destinations.

The goal of VDP is to seamlessly bring Vision AI into the modern data stack with a standardised framework. Check out our blog post Missing piece in modern data stack: visual data preparation to learn how this tool streamlines unstructured visual data processing across different stakeholders.

Code in the main branch tracks work in progress towards the next release and may not work as expected. If you are looking for a stable alpha version, please use the latest release.

How VDP works

The core concept of VDP is the pipeline. A pipeline is an end-to-end workflow that automates a sequence of tasks to process visual data. Each pipeline consists of three ordered components:

  1. data source: where the pipeline starts. It connects the source of the image and video data to be processed.
  2. model: a deployed Vision AI model that processes the ingested visual data and generates structured outputs.
  3. data destination: where the structured outputs are sent.

Depending on the mode of a pipeline, it ingests and processes the visual data and sends the outputs to the destination whenever a trigger event occurs.

We use data connector as a general term for both data sources and data destinations. Please find the supported data connectors here.
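
To make the pipeline concept concrete before the quick start, here is a minimal conceptual sketch in Go of how the three components fit together. The type and field names are illustrative assumptions only; the actual resource definitions live in the VDP protobufs.

package main

import "fmt"

// Pipeline mirrors the three ordered components of a VDP pipeline.
// These types are for illustration; they are not the real VDP API.
type Pipeline struct {
    Name        string
    Source      string // data source connector that ingests the visual data
    Model       string // deployed Vision AI model that generates structured outputs
    Destination string // data destination connector that receives the structured outputs
}

func main() {
    // A hypothetical object detection pipeline wired from source to destination.
    p := Pipeline{
        Name:        "hello-pipeline",
        Source:      "data-lake",
        Model:       "yolov4",
        Destination: "data-warehouse",
    }
    fmt.Printf("pipeline %q: %s -> %s -> %s\n", p.Name, p.Source, p.Model, p.Destination)
}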

Quick start

Download and run VDP locally

Execute the following commands to start pre-built images with all the dependencies:

$ git clone https://github.com/instill-ai/vdp.git && cd vdp

# Build the instill/vdp:dev local development image
$ make dev

# Launch all services.
$ make all

⚠️ Downloading the Triton server image will take a while, but it should be just a one-time effort.

Run the samples to trigger an object detection pipeline

We provide sample code showing how to build and trigger an object detection pipeline. Run it against the local VDP:

$ cd examples-go

# Download a YOLOv4 ONNX model for the object detection task (GPU not required)
$ curl -o yolov4-onnx-cpu.zip https://artifacts.instill.tech/vdp/sample-models/yolov4-onnx-cpu.zip

# [optional] Download a test image or use your own images
$ curl -o dog.jpg https://artifacts.instill.tech/dog.jpg

# Deploy the model
$ go run deploy-model/main.go --model-path yolov4-onnx-cpu.zip --model-name yolov4

# Test the model
$ go run test-model/main.go --model-name yolov4 --test-image dog.jpg

# Create an object detection pipeline
$ go run create-pipeline/main.go --pipeline-name hello-pipeline --model-name yolov4

# Trigger the pipeline by using the same test image
$ go run trigger-pipeline/main.go --pipeline-name hello-pipeline --test-image dog.jpg

Create a pipeline with your own models

Please follow the guideline "Prepare your own model to deploy on VDP". Based on the above sample code, you can deploy a prepared model and create your own pipeline.

Clean up

To clean up all running services:

$ make prune

Documentation

The gRPC protocols in protobufs provide the single source of truth for the VDP APIs. To view the generated OpenAPI spec at http://localhost:3000, run:

$ make doc

Community support

For general help using VDP, you can use one of these channels:

  • GitHub (bug reports, feature requests, project discussions and contributions)
  • Discord (live discussion with the community and the Instill AI Team)

License

See the LICENSE file for licensing information.

Owner
Instill AI
Visual data preparation made for all - empowering the modern data stack and tapping the value of unstructured visual data with our open-source community.
Comments
  • any default value for .env?

    Are there any default or example values for .env?

    If they are missing, warning messages like the following are displayed:

    WARNING: The TRITONCONDAENV_IMAGE_TAG variable is not set. Defaulting to a blank string.
    WARNING: The TRITONSERVER_IMAGE_TAG variable is not set. Defaulting to a blank string.
    WARNING: The REDIS_IMAGE_TAG variable is not set. Defaulting to a blank string.
    

    ERROR: build path vdp/dev/pipeline-backend either does not exist, is not accessible, or is not a valid URL.

    This does not help with the quick start.

  • docs(pipeline): add SYNC and ASYNC diagram and section for connectors

    Because

    • we need a document that helps users learn how the pipeline works

    This commit

    • add SYNC and ASYNC diagram and section for connectors
    • close #52
  • Wrong model type

    The model type is mistakenly set as tensorrt. https://github.com/instill-ai/vdp/blob/814ea12e94b7ebb4656c7208f5c160c646f8523a/examples-go/deploy-model/main.go#L65

  • feat: add console e2e test into vdp

    Because

    • The console integration test is ready

    This commit

    • add console e2e test into vdp
    • set ITMODE=true when running the integration test. When the ITMODE flag is enabled, the integration test uses dummy models instead of pulling from GitHub, HuggingFace, or ArtiVC, to reduce the impact of the internet connection.

    co-author: @Phelan164

  • test: update integration test

    Because

    • we need integration tests to make sure the features work correctly

    This commit

    • add integration test for model-backend and update for pipeline-backend
    • closes #67
    • closes #69

    Limitation: error codes are not covered in this PR. They will be added while working on this ticket

    Note: the docker image for model-backend is dev, which should be updated when there is a new release

  • chore: add pipeline-backend integration test and refactor example and docker-compose

    Because

    • we need an integration test for our CI
    • there is a wrong dependency configuration in docker-compose
    • the example should be refactored to be more self-contained

    This commit

    • add a condition to wait for pg_sql to be ready
    • add a pipeline integration test with k6
    • add an argument when creating a pipeline and correct README.md
  • [doc]: There is no make build command

    Issue

    • According to the quick start of the repo, if you want to develop locally you could do the following:
    $ git clone https://github.com/instill-ai/vdp.git && cd vdp
    
    # Build instill/vdp:dev local development image
    $ make build
    
    # Launch all services.
    $ make all
    

    But actually, there is no make build command in the Makefile

    https://github.com/instill-ai/vdp/blob/main/Makefile#L15

    Only make dev and make all exist.

    Solution

    Please update the README quick start guideline.

  • [release] v0.1.4-alpha

    Protobufs

    • [x] https://github.com/instill-ai/protobufs/issues/17
    • [x] https://github.com/instill-ai/protobufs/issues/33

    model-backend

    • [x] https://github.com/instill-ai/model-backend/issues/44
    • [x] https://github.com/instill-ai/model-backend/issues/33
    • [x] https://github.com/instill-ai/model-backend/issues/45
    • [x] https://github.com/instill-ai/model-backend/issues/30

    pipeline-backend

    • [x] https://github.com/instill-ai/pipeline-backend/issues/32
    • [x] https://github.com/instill-ai/pipeline-backend/issues/33

    vdp

    • [x] Use the latest docker images in vdp
  • Quick start sample code bug

    Deploy the model

    If I run the following script from the quick start:

    # Deploy the model
    go run deploy-model/main.go --model-path yolov4-onnx-cpu.zip --model-name yolov4
    

    I get a model with model name yolov4 with 1 version.

    2022/02/21 00:16:13 model has been created, the response is: id:1 name:"yolov4" full_Name:"local-user/yolov4" cv_task:DETECTION versions:{version:1 model_id:1 description:"YoloV4 for object detection" created_at:{seconds:1645402547 nanos:80792000} updated_at:{seconds:1645402547 nanos:80807000}}

    The response is missing the status of version 1 of the model.

    If I run the above script a second time, I get a model named yolov4 with 2 versions.

    2022/02/21 00:20:07 model has been created, the response is: id:1 name:"yolov4" full_Name:"local-user/yolov4" cv_task:DETECTION versions:{version:1 model_id:1 description:"YoloV4 for object detection" created_at:{seconds:1645402547 nanos:80792000} updated_at:{seconds:1645402625 nanos:977017000} status:ONLINE} versions:{version:2 model_id:1 description:"YoloV4 for object detection" created_at:{seconds:1645402779 nanos:961661000} updated_at:{seconds:1645402779 nanos:961692000}}

    The response only includes the status of version 1, but not the status of version 2.

    Test the model

    # Test the model
    go run test-model/main.go --model-name yolov4 --test-image dog.jpg --model-version 2
    

    I get the following response:

    2022/02/21 00:24:15 error when triggering predict: rpc error: code = Code(400) desc = {"status":400,"title":"PredictModel","detail":"Model is offline"}
    

    Note: shouldn't we use status code 422 instead of 400 for the above scenario?

    But when I GET /models/yolov4:

    {
        "id": 1,
        "name": "yolov4",
        "full_Name": "local-user/yolov4",
        "cv_task": "DETECTION",
        "versions": [
            {
                "version": 1,
                "model_id": 1,
                "description": "YoloV4 for object detection",
                "created_at": "2022-02-21T00:15:47.080792Z",
                "updated_at": "2022-02-21T00:20:09.486272Z",
                "status": "ONLINE"
            },
            {
                "version": 2,
                "model_id": 1,
                "description": "YoloV4 for object detection",
                "created_at": "2022-02-21T00:19:39.961661Z",
                "updated_at": "2022-02-21T00:20:09.486272Z",
                "status": "ONLINE"
            }
        ]
    }
    

    The response shows both model versions are online.

  • docs: refactor doc structure

    Because

    • The root README.md is too long and needs to be restructured

    This commit

    • replace the trigger mechanism of a data source with the pipeline mode concept and add the docs in docs/pipeline-mode.md
    • move prepare-you-own-model doc to docs/model.md
    • add check-yaml in pre-commit
    • update redoc service name to redoc_openapi

    The pipeline mode is determined by the combination of data source and destination. It describes how an end-to-end pipeline processes its workload.
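
    Since the pipeline mode is derived from the two connectors, here is a minimal Go sketch of how such a lookup could work. The package name, Mode values, and the choice of which connectors count as synchronous are assumptions for illustration, not the actual VDP implementation.

    package pipeline

    // Mode describes how an end-to-end pipeline processes its workload.
    type Mode string

    const (
        ModeSync  Mode = "SYNC"
        ModeAsync Mode = "ASYNC"
    )

    // isSyncConnector reports whether a connector responds synchronously.
    // Treating "http" and "grpc" as the synchronous connectors is an assumption
    // made for illustration, not the documented connector list.
    func isSyncConnector(connector string) bool {
        return connector == "http" || connector == "grpc"
    }

    // ResolveMode derives the pipeline mode from the combination of the data
    // source and data destination, as described above.
    func ResolveMode(source, destination string) Mode {
        if isSyncConnector(source) && isSyncConnector(destination) {
            return ModeSync
        }
        return ModeAsync
    }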

  • chore(main): release 0.3.0-alpha

    :robot: I have created a release beep boop

    Product Updates

    Announcement 📣

    • VDP (originally, Visual Data Preparation) is officially renamed to Versatile Data Pipeline.

    We have realised that, as a general ETL infrastructure, VDP is in fact capable of processing all kinds of unstructured data. We should not limit its usage to only visual data but extend it to more general, versatile data. In addition, the term Data Preparation has been misleading: users often think it only has to do with data labelling, cleaning, or wrangling. In our vision, while VDP should involve data preparation in its MLOps practice, it should not be conceptually confined to data preparation alone. VDP does more than that and focuses on the overall effectiveness of unstructured data ETL with a data-centric paradigm. The end form of a VDP infrastructure is nothing but a data pipeline. The term Data Pipeline captures the core concept of VDP more precisely, hence the rename to Versatile Data Pipeline.

    Features ✨

    VDP (0.3.0-alpha)

    Features

    Bug Fixes

    • fix wrong triton environment when deploying HuggingFace models (#150) (b2fda36)
    • use COCO RLE format for instance segmentation (4d10e46)
    • update model output protocol (e6ea88d)

    Pipeline-backend (0.9.3-alpha)

    Bug Fixes

    • fix pipeline trigger model hanging (https://github.com/instill-ai/pipeline-backend/issues/80) (7ba58e5)

    Connector-backend (0.7.2-alpha)

    Bug Fixes

    • fix connector empty description update (0bc3086)

    Model-backend (0.10.0-alpha)

    Features

    • support instance segmentation task (https://github.com/instill-ai/model-backend/issues/183) (d28cfdc)
    • support async deploy and undeploy model instance (https://github.com/instill-ai/model-backend/issues/192) (ed36dc7)
    • support semantic segmentation (https://github.com/instill-ai/model-backend/issues/203) (f22262c)

    Bug Fixes

    • allow updating empty description for a model (https://github.com/instill-ai/model-backend/issues/177) (100ec84)
    • HuggingFace batching bug in preprocess model (b1582e8)
    • model instance state update to unspecified state (https://github.com/instill-ai/model-backend/issues/206) (14c87d5)
    • panic error with nil object (https://github.com/instill-ai/model-backend/issues/208) (a342113)

    Console

    Features

    • extend the time span of our user cookie (https://github.com/instill-ai/console/issues/289) (76a6f99)
    • finish integration test and make it stable (https://github.com/instill-ai/console/issues/281) (3fd8d21)
    • replace prism.js with code-hike (https://github.com/instill-ai/console/issues/292) (cb61708)
    • unify the gap between elements in every table (https://github.com/instill-ai/console/issues/291) (e743820)
    • update console request URL according to new protobuf (https://github.com/instill-ai/console/issues/287) (fa7ecc3)
    • add hg model id field at model_instance page (https://github.com/instill-ai/console/issues/300) (31a6eab)
    • cleanup connector after test (https://github.com/instill-ai/console/issues/295) (f9c8e4c)
    • disable html report (https://github.com/instill-ai/console/issues/297) (689f50d)
    • enhance the warning of the resource id field (https://github.com/instill-ai/console/issues/303) (6c4aa4f)
    • make playwright output dot on CI (https://github.com/instill-ai/console/issues/293) (e5c2958)
    • support model-backend async long run operation (https://github.com/instill-ai/console/issues/309) (f795ce8)
    • update e2e test (https://github.com/instill-ai/console/issues/313) (88bf0cd)
    • update how we test model detail page (https://github.com/instill-ai/console/issues/310) (04c83a1)
    • wipe out all data after test (https://github.com/instill-ai/console/issues/296) (e4085dd)

    Bug Fixes

    • fix pipeline e2e not stable (https://github.com/instill-ai/console/issues/285) (a26e599)
    • fix set-cookie api route issue due to wrong domain name (https://github.com/instill-ai/console/issues/284) (c3efcdd)

    This PR was generated with Release Please. See documentation.

  • chore(main): release 0.3.1-alpha

  • Support async model inference

    At the moment, a pipeline in async mode still relies on sync model inference and only writes to the destination asynchronously. We should make the model-backend trigger endpoints use a Temporal workflow. The pipeline-backend should also implement a Temporal workflow to request async inference from the model-backend.
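
    A minimal sketch of what such a workflow could look like with the Temporal Go SDK is below; the package, payload types, and activity name are assumptions for illustration, not the actual model-backend code.

    package inference

    import (
        "time"

        "go.temporal.io/sdk/workflow"
    )

    // TriggerRequest and TriggerResult are placeholder payload types for this sketch.
    type TriggerRequest struct {
        ModelName string
        ImageURL  string
    }

    type TriggerResult struct {
        Outputs []byte
    }

    // TriggerModelWorkflow wraps model inference in a Temporal workflow so the
    // pipeline-backend can start it asynchronously and await or poll the result.
    func TriggerModelWorkflow(ctx workflow.Context, req TriggerRequest) (TriggerResult, error) {
        ao := workflow.ActivityOptions{
            // Inference on large inputs can be slow, so give the activity a generous timeout.
            StartToCloseTimeout: 10 * time.Minute,
        }
        ctx = workflow.WithActivityOptions(ctx, ao)

        // "InferActivity" is a placeholder activity name; it would call the deployed
        // model (e.g. via the Triton server) and return the structured outputs.
        var res TriggerResult
        err := workflow.ExecuteActivity(ctx, "InferActivity", req).Get(ctx, &res)
        return res, err
    }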
