💧 Visual Data Preparation (VDP) is an open-source tool to seamlessly integrate Vision AI with the modern data stack

Instill AI - Visual Data Preparation Made for All

Website | Community | Blog

Get Early Access


Visual Data Preparation (VDP) is an open-source tool to streamline the end-to-end visual data processing pipeline:

  1. Ingest unstructured visual data from data sources such as data lakes or IoT devices;
  2. Transform visual data into meaningful structured data representations with Vision AI models;
  3. Load the structured data into warehouses, applications, or other destinations.

The goal of VDP is to seamlessly bring Vision AI into the modern data stack with a standardised framework. Check out our blog post Missing piece in modern data stack: visual data preparation to learn how this tool streamlines unstructured visual data processing across different stakeholders.

Code in the main branch tracks work in progress towards the next release and may not work as expected. If you are looking for a stable alpha version, please use the latest release.

How VDP works

The core concept of VDP is the pipeline. A pipeline is an end-to-end workflow that automates a sequence of tasks to process visual data. Each pipeline consists of three ordered components:

  1. data source: where the pipeline starts. It connects the source of the image and video data to be processed.
  2. model: a deployed Vision AI model that processes the ingested visual data and generates structured outputs.
  3. data destination: where the structured outputs are sent.

Depending on the mode of a pipeline, it ingests and processes the visual data and sends the outputs to the destination whenever a trigger event occurs.

We use data connector as a general term for both data sources and data destinations. Please find the supported data connectors here.
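
To make the pipeline concept concrete before the quick start, here is a minimal conceptual sketch in Go of how the three components fit together. The type and field names are illustrative assumptions only; the actual resource definitions live in the VDP protobufs.

package main

import "fmt"

// Pipeline mirrors the three ordered components of a VDP pipeline.
// These types are for illustration; they are not the real VDP API.
type Pipeline struct {
    Name        string
    Source      string // data source connector that ingests the visual data
    Model       string // deployed Vision AI model that generates structured outputs
    Destination string // data destination connector that receives the structured outputs
}

func main() {
    // A hypothetical object detection pipeline wired from source to destination.
    p := Pipeline{
        Name:        "hello-pipeline",
        Source:      "data-lake",
        Model:       "yolov4",
        Destination: "data-warehouse",
    }
    fmt.Printf("pipeline %q: %s -> %s -> %s\n", p.Name, p.Source, p.Model, p.Destination)
}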

Quick start

Download and run VDP locally

Execute the following commands to start pre-built images with all the dependencies:

$ git clone https://github.com/instill-ai/vdp.git && cd vdp

# Build the instill/vdp:dev local development image
$ make dev

# Launch all services.
$ make all

⚠️ Downloading the Triton server image will take a while, but it should be just a one-time effort.

Run the samples to trigger an object detection pipeline

We provide sample code showing how to build and trigger an object detection pipeline. Run it against the local VDP:

$ cd examples-go

# Download a YOLOv4 ONNX model for the object detection task (GPU not required)
$ curl -o yolov4-onnx-cpu.zip https://artifacts.instill.tech/vdp/sample-models/yolov4-onnx-cpu.zip

# [optional] Download a test image or use your own images
$ curl -o dog.jpg https://artifacts.instill.tech/dog.jpg

# Deploy the model
$ go run deploy-model/main.go --model-path yolov4-onnx-cpu.zip --model-name yolov4

# Test the model
$ go run test-model/main.go --model-name yolov4 --test-image dog.jpg

# Create an object detection pipeline
$ go run create-pipeline/main.go --pipeline-name hello-pipeline --model-name yolov4

# Trigger the pipeline by using the same test image
$ go run trigger-pipeline/main.go --pipeline-name hello-pipeline --test-image dog.jpg

Create a pipeline with your own models

Please follow the guideline "Prepare your own model to deploy on VDP". Based on the above sample code, you can deploy a prepared model and create your own pipeline.

Clean up

To clean up all running services:

$ make prune

Documentation

The gRPC protocols in protobufs provide the single source of truth for the VDP APIs. To view the generated OpenAPI spec at http://localhost:3000, run:

$ make doc

Community support

For general help using VDP, you can use one of these channels:

  • GitHub (bug reports, feature requests, project discussions and contributions)
  • Discord (live discussion with the community and the Instill AI Team)

License

See the LICENSE file for licensing information.

Owner
Instill AI
Visual data preparation made for all - empowering the modern data stack and tapping the value of unstructured visual data with our open-source community.
Comments
  • any default value for .env?

    Are there any default or example values for .env?

    If they are missing, warning messages like the following are displayed:

    WARNING: The TRITONCONDAENV_IMAGE_TAG variable is not set. Defaulting to a blank string.
    WARNING: The TRITONSERVER_IMAGE_TAG variable is not set. Defaulting to a blank string.
    WARNING: The REDIS_IMAGE_TAG variable is not set. Defaulting to a blank string.
    

    ERROR: build path vdp/dev/pipeline-backend either does not exist, is not accessible, or is not a valid URL.

    This does not help with the quick start.

  • docs(pipeline): add SYNC and ASYNC diagram and section for connectors

    Because

    • we need a document that helps users learn how the pipeline works

    This commit

    • add SYNC and ASYNC diagram and section for connectors
    • close #52
  • Wrong model type

    The model type is mistakenly set as tensorrt. https://github.com/instill-ai/vdp/blob/814ea12e94b7ebb4656c7208f5c160c646f8523a/examples-go/deploy-model/main.go#L65

  • feat: add console e2e test into vdp

    Because

    • The console integration test is ready

    This commit

    • add console e2e test into vdp
    • set ITMODE=true when running the integration test. When the ITMODE flag is enabled, the integration test uses dummy models instead of pulling from GitHub, HuggingFace, or ArtiVC, to reduce the impact of the internet connection.

    co-author: @Phelan164

  • test: update integration test

    Because

    • we need integration tests to make sure the features work correctly

    This commit

    • add integration test for model-backend and update for pipeline-backend
    • closes #67
    • closes #69

    Limitation: error codes are not covered in this PR. They will be added while working on this ticket

    Note: the docker image for model-backend is dev, which should be updated when there is a new release

  • chore: add pipeline-backend integration test and refactor example and docker-compose

    Because

    • we need an integration test for our CI
    • there is a wrong dependency configuration in docker-compose
    • the example should be refactored to be more self-contained

    This commit

    • add a condition to wait for pg_sql to be ready
    • add a pipeline integration test with k6
    • add an argument when creating a pipeline and correct README.md
  • [doc]: There is no make build command

    Issue

    • According to the quick start of the repo, if you want to develop locally you could do the following:
    $ git clone https://github.com/instill-ai/vdp.git && cd vdp
    
    # Build instill/vdp:dev local development image
    $ make build
    
    # Launch all services.
    $ make all
    

    But actually, there is no make build command in the Makefile

    https://github.com/instill-ai/vdp/blob/main/Makefile#L15

    Only make dev and make all exist.

    Solution

    Please update the README quick start guideline.

  • [release] v0.1.4-alpha

    Protobufs

    • [x] https://github.com/instill-ai/protobufs/issues/17
    • [x] https://github.com/instill-ai/protobufs/issues/33

    model-backend

    • [x] https://github.com/instill-ai/model-backend/issues/44
    • [x] https://github.com/instill-ai/model-backend/issues/33
    • [x] https://github.com/instill-ai/model-backend/issues/45
    • [x] https://github.com/instill-ai/model-backend/issues/30

    pipeline-backend

    • [x] https://github.com/instill-ai/pipeline-backend/issues/32
    • [x] https://github.com/instill-ai/pipeline-backend/issues/33

    vdp

    • [x] Use the latest docker images in vdp
  • Quick start sample code bug

    Deploy the model

    If I run the following script from the quick start:

    # Deploy the model
    go run deploy-model/main.go --model-path yolov4-onnx-cpu.zip --model-name yolov4
    

    I get a model with model name yolov4 with 1 version.

    2022/02/21 00:16:13 model has been created, the response is: id:1 name:"yolov4" full_Name:"local-user/yolov4" cv_task:DETECTION versions:{version:1 model_id:1 description:"YoloV4 for object detection" created_at:{seconds:1645402547 nanos:80792000} updated_at:{seconds:1645402547 nanos:80807000}}

    The response is missing the status of version 1 of the model.

    If I run the above script a second time, I get a model named yolov4 with 2 versions.

    2022/02/21 00:20:07 model has been created, the response is: id:1 name:"yolov4" full_Name:"local-user/yolov4" cv_task:DETECTION versions:{version:1 model_id:1 description:"YoloV4 for object detection" created_at:{seconds:1645402547 nanos:80792000} updated_at:{seconds:1645402625 nanos:977017000} status:ONLINE} versions:{version:2 model_id:1 description:"YoloV4 for object detection" created_at:{seconds:1645402779 nanos:961661000} updated_at:{seconds:1645402779 nanos:961692000}}

    The response only includes the status of version 1, but not the status of version 2.

    Test the model

    # Test the model
    go run test-model/main.go --model-name yolov4 --test-image dog.jpg --model-version 2
    

    I get the following response:

    2022/02/21 00:24:15 error when triggering predict: rpc error: code = Code(400) desc = {"status":400,"title":"PredictModel","detail":"Model is offline"}
    

    Note: shouldn't we use status code 422 instead of 400 for the above scenario?

    But when I GET /models/yolov4:

    {
        "id": 1,
        "name": "yolov4",
        "full_Name": "local-user/yolov4",
        "cv_task": "DETECTION",
        "versions": [
            {
                "version": 1,
                "model_id": 1,
                "description": "YoloV4 for object detection",
                "created_at": "2022-02-21T00:15:47.080792Z",
                "updated_at": "2022-02-21T00:20:09.486272Z",
                "status": "ONLINE"
            },
            {
                "version": 2,
                "model_id": 1,
                "description": "YoloV4 for object detection",
                "created_at": "2022-02-21T00:19:39.961661Z",
                "updated_at": "2022-02-21T00:20:09.486272Z",
                "status": "ONLINE"
            }
        ]
    }
    

    The response shows both model versions are online.

  • docs: refactor doc structure

    Because

    • The root README.md is too long and needs to be restructured

    This commit

    • replace the trigger mechanism of a data source with the pipeline mode concept and add the docs in docs/pipeline-mode.md
    • move prepare-you-own-model doc to docs/model.md
    • add check-yaml in pre-commit
    • update redoc service name to redoc_openapi

    The pipeline mode is determined by the combination of data source and destination. It describes how an end-to-end pipeline processes its workload.
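
    Since the pipeline mode is derived from the two connectors, here is a minimal Go sketch of how such a lookup could work. The package name, Mode values, and the choice of which connectors count as synchronous are assumptions for illustration, not the actual VDP implementation.

    package pipeline

    // Mode describes how an end-to-end pipeline processes its workload.
    type Mode string

    const (
        ModeSync  Mode = "SYNC"
        ModeAsync Mode = "ASYNC"
    )

    // isSyncConnector reports whether a connector responds synchronously.
    // Treating "http" and "grpc" as the synchronous connectors is an assumption
    // made for illustration, not the documented connector list.
    func isSyncConnector(connector string) bool {
        return connector == "http" || connector == "grpc"
    }

    // ResolveMode derives the pipeline mode from the combination of the data
    // source and data destination, as described above.
    func ResolveMode(source, destination string) Mode {
        if isSyncConnector(source) && isSyncConnector(destination) {
            return ModeSync
        }
        return ModeAsync
    }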

  • chore(main): release 0.3.0-alpha

    :robot: I have created a release beep boop

    Product Updates

    Announcement 📣

    • VDP (originally, Visual Data Preparation) is officially renamed to Versatile Data Pipeline.

    We have realised that, as a general ETL infrastructure, VDP is in fact capable of processing all kinds of unstructured data. We should not limit its usage to only visual data but extend it to more general, versatile data. In addition, the term Data Preparation has been misleading: users often think it only has to do with data labelling, cleaning, or wrangling. In our vision, while VDP should involve data preparation in its MLOps practice, it should not be conceptually confined to data preparation alone. VDP does more than that and focuses on the overall effectiveness of unstructured data ETL with a data-centric paradigm. The end form of a VDP infrastructure is nothing but a data pipeline. The term Data Pipeline captures the core concept of VDP more precisely, hence the rename to Versatile Data Pipeline.

    Features ✨

    VDP (0.3.0-alpha)

    Features

    Bug Fixes

    • fix wrong triton environment when deploying HuggingFace models (#150) (b2fda36)
    • use COCO RLE format for instance segmentation (4d10e46)
    • update model output protocol (e6ea88d)

    Pipeline-backend (0.9.3-alpha)

    Bug Fixes

    • fix pipeline trigger model hanging (https://github.com/instill-ai/pipeline-backend/issues/80) (7ba58e5)

    Connector-backend (0.7.2-alpha)

    Bug Fixes

    • fix connector empty description update (0bc3086)

    Model-backend (0.10.0-alpha)

    Features

    • support instance segmentation task (https://github.com/instill-ai/model-backend/issues/183) (d28cfdc)
    • support async deploy and undeploy model instance (https://github.com/instill-ai/model-backend/issues/192) (ed36dc7)
    • support semantic segmentation (https://github.com/instill-ai/model-backend/issues/203) (f22262c)

    Bug Fixes

    • allow updating empty description for a model (https://github.com/instill-ai/model-backend/issues/177) (100ec84)
    • HuggingFace batching bug in preprocess model (b1582e8)
    • model instance state update to unspecified state (https://github.com/instill-ai/model-backend/issues/206) (14c87d5)
    • panic error with nil object (https://github.com/instill-ai/model-backend/issues/208) (a342113)

    Console

    Features

    • extend the time span of our user cookie (https://github.com/instill-ai/console/issues/289) (76a6f99)
    • finish integration test and make it stable (https://github.com/instill-ai/console/issues/281) (3fd8d21)
    • replace prism.js with code-hike (https://github.com/instill-ai/console/issues/292) (cb61708)
    • unify the gap between elements in every table (https://github.com/instill-ai/console/issues/291) (e743820)
    • update console request URL according to new protobuf (https://github.com/instill-ai/console/issues/287) (fa7ecc3)
    • add hg model id field at model_instance page (https://github.com/instill-ai/console/issues/300) (31a6eab)
    • cleanup connector after test (https://github.com/instill-ai/console/issues/295) (f9c8e4c)
    • disable html report (https://github.com/instill-ai/console/issues/297) (689f50d)
    • enhance the warning of the resource id field (https://github.com/instill-ai/console/issues/303) (6c4aa4f)
    • make playwright output dot on CI (https://github.com/instill-ai/console/issues/293) (e5c2958)
    • support model-backend async long run operation (https://github.com/instill-ai/console/issues/309) (f795ce8)
    • update e2e test (https://github.com/instill-ai/console/issues/313) (88bf0cd)
    • update how we test model detail page (https://github.com/instill-ai/console/issues/310) (04c83a1)
    • wipe out all data after test (https://github.com/instill-ai/console/issues/296) (e4085dd)

    Bug Fixes

    • fix pipeline e2e not stable (https://github.com/instill-ai/console/issues/285) (a26e599)
    • fix set-cookie api route issue due to wrong domain name (https://github.com/instill-ai/console/issues/284) (c3efcdd)

    This PR was generated with Release Please. See documentation.

  • chore(main): release 0.3.1-alpha

  • Support async model inference

    At the moment, a pipeline in async mode still relies on sync model inference and only writes to the destination asynchronously. We should make the model-backend trigger endpoints use a Temporal workflow. The pipeline-backend should also implement a Temporal workflow to request async inference from the model-backend.
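
    A minimal sketch of what such a workflow could look like with the Temporal Go SDK is below; the package, payload types, and activity name are assumptions for illustration, not the actual model-backend code.

    package inference

    import (
        "time"

        "go.temporal.io/sdk/workflow"
    )

    // TriggerRequest and TriggerResult are placeholder payload types for this sketch.
    type TriggerRequest struct {
        ModelName string
        ImageURL  string
    }

    type TriggerResult struct {
        Outputs []byte
    }

    // TriggerModelWorkflow wraps model inference in a Temporal workflow so the
    // pipeline-backend can start it asynchronously and await or poll the result.
    func TriggerModelWorkflow(ctx workflow.Context, req TriggerRequest) (TriggerResult, error) {
        ao := workflow.ActivityOptions{
            // Inference on large inputs can be slow, so give the activity a generous timeout.
            StartToCloseTimeout: 10 * time.Minute,
        }
        ctx = workflow.WithActivityOptions(ctx, ao)

        // "InferActivity" is a placeholder activity name; it would call the deployed
        // model (e.g. via the Triton server) and return the structured outputs.
        var res TriggerResult
        err := workflow.ExecuteActivity(ctx, "InferActivity", req).Get(ctx, &res)
        return res, err
    }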
