Sharing intermediate data between systems in modern big data and AI workflows can be challenging, often causing significant bottlenecks in such jobs. Let's consider the following fraud detection pipeline:
From the pipeline, we observed:
Users usually prefer to program with dedicated computing systems for different tasks in the same applications, such as SQL and Python.
Integrating a new computing system into production environments demands high technical effort to align with existing production environments in terms of I/O, failover, etc.
Data could be polymorphic. Non-relational data, such as tensors, dataframes (in Pandas) and graphs/networks (in GraphScope) are becoming increasingly prevalent. Tables and SQL may not be the best way to store, exchange, or process them.
Transforming the data back and forth between different systems as "tables" could result in a significant overhead.
Saving/loading the data to/from the external storage requires numerous memory copies and incurs high IO costs.
Vineyard (v6d) is an in-memory immutable data manager that offers out-of-the-box high-level abstraction and zero-copy sharing for distributed data in big data tasks, such as graph analytics (e.g., GraphScope), numerical computing (e.g., Mars), and machine learning.
Vineyard shares immutable data across different systems using shared memory without extra overheads, eliminating the overhead of serialization/deserialization and IO when exchanging immutable data between systems.
Vineyard defines a metadata-payload separated data model to capture the payload commonalities and method commonalities between sharable objects in different programming languages and different computing systems in a unified way.
The :ref:`VCDL` (Vineyard Component Description Language) is specifically designed to annotate sharable members and methods, enabling automatic generation of boilerplate code for minimal integration effort.
In many big data analytical tasks, a substantial portion of the workload consists of boilerplate routines that are unrelated to the core computation. These routines include various IO adapters, data partition strategies, and migration jobs. Due to different data structure abstractions across systems, these routines are often not easily reusable, leading to increased complexity and redundancy.
Vineyard provides common manipulation routines for immutable data as drivers, which extend the capabilities of data structures by registering appropriate drivers. This enables out-of-the-box reuse of boilerplate components across diverse computation jobs.
Vineyard provides efficient distributed data sharing in cloud-native environments by embracing cloud-native big data processing. Kubernetes helps Vineyard leverage the scale-in/out and scheduling abilities of Kubernetes.
.. panels:: :header: text-center :container: container-lg pb-4 :column: col-lg-4 col-md-4 col-sm-4 col-xs-12 p-2 :body: text-center .. link-button:: # :type: url :text: Object manager :classes: btn-block stretched-link Put and get arbitrary objects using Vineyard, in a zero-copy way! --- .. link-button:: # :type: url :text: Cross-system sharing :classes: btn-block stretched-link Share large objects across computing systems. --- .. link-button:: # :type: url :text: Data orchestration :classes: btn-block stretched-link Vineyard coordinates the flow of objects and jobs on Kubernetes based on data-aware scheduling.
.. panels:: :header: text-center :column: col-lg-12 p-2 .. link-button:: notes/getting-started :type: ref :text: User Guides :classes: btn-block stretched-link ^^^^^^^^^^^^ Get started with Vineyard. --- .. link-button:: notes/cloud-native/deploy-kubernetes :type: ref :text: Deploy on Kubernetes :classes: btn-block stretched-link ^^^^^^^^^^^^ Deploy Vineyard on Kubernetes and accelerate big-data analytical workflows on cloud-native infrastructures. --- .. link-button:: tutorials/tutorials :type: ref :text: Tutorials :classes: btn-block stretched-link ^^^^^^^^^^^^ Explore use cases and tutorials where Vineyard can bring added value. --- .. link-button:: notes/developers :type: ref :text: Getting Involved :classes: btn-block stretched-link ^^^^^^^^^^^^ Get involved and become part of the Vineyard community. --- .. link-button:: notes/developers/faq :type: ref :text: FAQ :classes: btn-block stretched-link ^^^^^^^^^^^^ Frequently asked questions and discussions during the adoption of Vineyard.
- Wenyuan Yu, Tao He, Lei Wang, Ke Meng, Ye Cao, Diwen Zhu, Sanhong Li, Jingren Zhou. Vineyard: Optimizing Data Sharing in Data-Intensive Analytics. ACM SIG Conference on Management of Data (SIGMOD), industry, 2023. .
Vineyard is a CNCF sandbox project and is made successful by its community.
.. toctree:: :maxdepth: 1 :caption: User Guides :hidden: notes/getting-started.rst notes/architecture.rst notes/key-concepts.rst
.. toctree:: :maxdepth: 1 :caption: Cloud-Native :hidden: notes/cloud-native/deploy-kubernetes.rst notes/cloud-native/vineyard-operator.rst Command-line tool <notes/cloud-native/vineyardctl.md>
.. toctree:: :maxdepth: 1 :caption: Tutorials :hidden: tutorials/data-processing.rst tutorials/kubernetes.rst tutorials/extending.rst
.. toctree:: :maxdepth: 1 :caption: Integration :hidden: notes/integration-bigdata.rst notes/integration-orchestration.rst
.. toctree:: :maxdepth: 2 :caption: API Reference :hidden: notes/references.rst
.. toctree:: :maxdepth: 1 :caption: Developer Guides :hidden: notes/developers.rst notes/developers/faq.rst